Speed optimizations?
I saw it mentioned on GW2 that server lag was mainly due to CPU load rather than too small a network pipe, so I spent some time looking through the server code for places to speed it up. Without a profiler it's really hard to tell where the problem is, so I decided to simply poke around. My understanding is that there's a thread for each client connected to a zone server, and one or more threads that handle communication to and from the socket port.
I could be way off base, but one thing that really bothered me was all the memory allocations and deallocations. I had a real hard time finding a way to avoid them, but one possibility might be in the EQNetworkPacket class. It seems that the data area is allocated to match the number of bytes to be sent, but I'm wondering if it might be better to define it as a static array of 512 bytes (the maximum) instead. That sacrifices some memory, but the data area would be allocated along with the class, and maybe it would save some time. Another possibility (though you probably won't like this one) would be to make APPLAYER a struct instead of a class and preallocate the data area at the same time as the struct. You'd have to change all of its methods to procedures, but it would knock out half of the memory allocations and deallocations... WC |
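A rough sketch of the fixed-buffer idea from the post above. This is not the real EQNetworkPacket class - the name, fields, and 512-byte limit are taken from the post, everything else is illustrative - but it shows the trade: the buffer lives inside the object, so there is no separate heap allocation per packet, at the cost of always reserving the maximum size.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical packet class with an embedded fixed-size buffer instead of a
// per-packet new[]/delete[] of a buffer sized to the payload.
class FixedPacket {
public:
    static const size_t kMaxSize = 512;   // protocol maximum from the post

    // Copies the payload into the embedded buffer; no heap allocation.
    bool SetData(const uint8_t* src, size_t len) {
        if (len > kMaxSize)
            return false;                 // caller must fragment oversized payloads
        std::memcpy(data_, src, len);
        size_ = len;
        return true;
    }

    size_t size() const { return size_; }
    const uint8_t* data() const { return data_; }

private:
    uint8_t data_[kMaxSize];              // allocated along with the object
    size_t size_ = 0;
};
```

The downside is that every packet object occupies 512 bytes even for tiny payloads, which matters if many packets are queued at once.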
Optimizing stuff like this is fun. If I were a betting man, I'd put money on the line-of-sight calculations as the guilty party for large CPU loads. That area could probably use optimization, but ultimately there comes a point where the code can't be optimized further and the load should be handled by > 1 server.
Maybe the people who run guildwars can set up a profiler of some sort to determine what causes the load. The quickest and easiest profiling is to see which thread is getting the most attention. There's only a limited number of threads in the app, and all of the line-of-sight calcs are done on one thread. For APPLAYER in particular, my guess would be to make the object smarter about bearing the cost of its memory allocation. Most instances of the object end up being copied several times before making it to the network pipe. Doodman had a good idea along these lines a while back, maybe he could share it here :) |
Another thing I ran across was code somewhere that did a linear search through some sort of door list to find one with a certain ID. A binary search would be far more efficient (O(log n) vs. O(n)).
WC |
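For illustration, a minimal sketch of the binary-search suggestion, assuming the door list were a vector kept sorted by ID (the `Door` struct and field names here are made up; the real list may look quite different - and, as pointed out further down, it is not currently sorted at all):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical door record; only the id field matters for the lookup.
struct Door {
    int id;
};

// Binary search over a vector sorted by id: O(log n) comparisons instead of
// the O(n) linear walk. Returns nullptr if no door has the given id.
const Door* FindDoor(const std::vector<Door>& doors, int id) {
    auto it = std::lower_bound(doors.begin(), doors.end(), id,
        [](const Door& d, int key) { return d.id < key; });
    if (it != doors.end() && it->id == id)
        return &*it;
    return nullptr;
}
```

The catch is that the list must be kept sorted on every insert, so this only pays off if lookups vastly outnumber additions.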
We reduced a lot of the search problems when we split the list up into the multiple components... Every search through the entity list for ANYTHING used to run through the entire entity list, so it was insane..
I haven't done any profiling recently, but I'm sure the CheckCloseArgo function and its descendant function calls are still at the top of the list. I put a very low-cost distance-calc filter at the top of that function that cuts out 95% of the extra calls, but it could probably still use some work. More efficient structures would be nice in some places too, rather than a list. We talked about red-black BSTs at one time, which would make searches much more efficient.. |
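The low-cost distance filter mentioned above is likely something along these lines (a guess at the technique, not the actual CheckCloseArgo code): compare squared distance against a squared radius, so the common "too far away" case never pays for a sqrt().

```cpp
#include <cassert>

// Cheap pre-filter: squared-distance comparison avoids sqrt() entirely.
// 2D for brevity; the same idea extends to 3D with one more term.
inline bool WithinRadius2D(float x1, float y1, float x2, float y2,
                           float radius) {
    float dx = x1 - x2;
    float dy = y1 - y2;
    return dx * dx + dy * dy <= radius * radius;
}
```

Anything that fails this test can be skipped before the expensive aggro logic runs, which is how a one-line check can cut out the bulk of the calls.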
You can't do a binary search on an unsorted list or a non-ordinal list. Ours is currently the former. You also run into the problem of what to sort by. Searching by name is certainly the most expensive, so we could sort by name and speed that up the most. However, if we look things up by EntID or ID more often, it wouldn't help in the end. It's a sticky situation. APPLAYER is already smart about its memory and allocates it itself. You cannot easily get around the news and deletes of APPLAYERs; even if you statically allocate them, there will still be news and deletes in basically the same places (think C99 and compliant compilers).
|
Here is a piece of code that might help.
Basically, it pre-computes the neighborhood of every mob in the zone every x seconds. This is quite fast, and --linear--. In Entity::AggroCheckClose, instead of walking all of mob_list and checking for neighbors, it uses that cache. Below are the two snippets. I used DoorProcess to update the neighborhood, as it is called in most zones (if not all) whatever the players in it. Thus, every 5 secs, an update of the terrain (a static singleton of the Terrain class) is performed. After it has been called, terrain hopefully has a list (hashtable) of all mobs within a given range, for each mob. I know this is not the right place to do it, but for the moment I didn't want to get too intrusive and add pieces of code in net.cpp or wherever. Code:
--- ../../cvs/eqemu/eqemu/Source/zone/entity.cpp 2004-02-02 13:55:07.000000000 +0000
When checking for an AggroRadius match, get the vector of mobs that were matched as neighbors, and use this list instead of the full mob_list to check for aggro. Code:
--- ../../cvs/eqemu/eqemu/Source/zone/MobAI.cpp 2004-02-02 13:54:59.000000000 +0000
http://perso.wanadoo.fr/afou/khalzed...ighbors.tar.gz
You need to add -DKD and -DKDSTAT to your COPTS in the zone makefile; tune verbosity by editing the VERBOSE define in Terrain.cpp. Typical output: Code:
... Code:
In EntityList::DoorProcess, time is 5.0.
Drawbacks:
* You must select a big-enough value for the cache range checking. On a 'typical' DB: Code:
mysql> select aggroradius, count(id) from npc_types group by aggroradius;
* Use a better timer. Once every second should be fine.
* Nothing ensures that the first call to the cache is valid, so the first checks may abort. It does not seem to have any impact, as processing goes on normally after the cache has been set.
* The code was written from scratch, or translated from Java. It has memory leaks (at the very least the old hashtable is not freed before an update, but most probably more). However, the interfaces are small, so one could easily substitute the Hashtable and Vector classes for more reliable ones. As it stands, the code gets SIGSEGV after a few minutes. =(
* Some comments are out of date and irrelevant :shock:
Besides these (little) unpleasant effects, the results are good. Very good. |
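Since the tarball link is dead-truncated, here is a self-contained sketch of the caching idea the post describes, with illustrative names (the real code hangs off Terrain/DoorProcess and uses its own Hashtable/Vector classes): every few seconds, bucket all mobs into a coarse grid whose cell size matches the cache range; then a mob's neighbor candidates are just the mobs in its own and adjacent cells, instead of the whole mob_list.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <utility>
#include <vector>

struct Mob { int id; float x, y; };

// Periodically rebuilt grid of mobs; lookups return only nearby candidates.
class NeighborCache {
public:
    explicit NeighborCache(float cellSize) : cell_(cellSize) {}

    // Call every few seconds (the post uses 5 s via DoorProcess).
    // One linear pass over all mobs.
    void Rebuild(const std::vector<Mob>& mobs) {
        grid_.clear();
        for (const Mob& m : mobs)
            grid_[CellOf(m.x, m.y)].push_back(m);
    }

    // Candidate neighbors: mobs in the 3x3 block of cells around (x, y).
    // Callers still apply the exact radius test to these few candidates.
    std::vector<Mob> Near(float x, float y) const {
        std::vector<Mob> out;
        std::pair<int, int> c = CellOf(x, y);
        for (int dx = -1; dx <= 1; ++dx)
            for (int dy = -1; dy <= 1; ++dy) {
                auto it = grid_.find({c.first + dx, c.second + dy});
                if (it != grid_.end())
                    out.insert(out.end(), it->second.begin(), it->second.end());
            }
        return out;
    }

private:
    std::pair<int, int> CellOf(float x, float y) const {
        return { (int)std::floor(x / cell_), (int)std::floor(y / cell_) };
    }

    float cell_;
    std::map<std::pair<int, int>, std::vector<Mob>> grid_;
};
```

The cell size must be at least the largest aggro radius in the DB (the "big-enough value" drawback above), otherwise neighbors in the next-but-one cell are missed.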
Worked on it a bit more. The link from the previous post is updated.
Some memory leaks are fixed, and a few changes made to the support code. Most zones work fine (as far as tested; only qeynos crashes, after 4-5 mins, always on a pet...). I'd be interested in some feedback, if any. |
I have never run EQEMU under a profiler. That being said, I still thought I'd mention a couple of things. Ancillary data structures of pointers to entities can be used to satisfy the requirement of multiple sort criteria. Adding non-standard containers (with the possible exception of a hashtable) to the code is probably not wise - especially if they are not proven or have unusual semantics. The caching suggestion is interesting and I'd love to see it under a profiler. I'm wondering if it would lead to a natural extension using some sort of bounding volumes. Once you delineate the caching "terrain" for a given entity, it seems like a natural extension to use that as a bounding volume to simplify in/out calculations. If one could assume square bounding boxes, the range calculations would be so fast as to probably eliminate the need for caching (although it would open up some interesting new strategies - like sneaking up on a mob's diagonal).
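A quick sketch of why the square-bounding-box test is so cheap (names illustrative): the in-range check is just two subtracts, two fabs() calls, and two comparisons - no multiply and no sqrt - which is also why a mob's "diagonal" is a blind spot relative to a circular radius.

```cpp
#include <cassert>
#include <cmath>

// Point-in-square test for an axis-aligned square of half-side `halfSide`
// centered at (cx, cy). No multiplication, no square root.
inline bool InSquare(float cx, float cy, float halfSide, float px, float py) {
    return std::fabs(px - cx) <= halfSide &&
           std::fabs(py - cy) <= halfSide;
}
```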
|
As for the profiler, here is an excerpt of profiler output (gprof):
Code:
Flat profile: http://perso.wanadoo.fr/afou/khalzed...qzone.gmon.out http://perso.wanadoo.fr/afou/khalzed...dzone.gmon.out or all in a pack: http://perso.wanadoo.fr/afou/khalzed...e-files.tar.gz These are short runs, with one player if any. If you don't bother to read them: caching as it is now cuts CPU load by a rough 50%, depending on the zone (AICheckCloseArrgo cumulatively accounts for 60-70% of the non-cached version, divided by 10 or so in the cached version). I'm not sure about the oddity of a hash or vector class instead of inline manipulation. I don't know about bounding-box extensions. It's like making a recursive function non-recursive, or vice versa: it's always possible, but... I'll work on it afterwards. |
Quote:
Maybe I was unnecessarily confusing by relating this to your plans. It just seemed to me that you'd already have to deal with bounding volumes in order to generate your caches. EDIT: ack! what am I thinking? More like 3 multiplies, 3 subtracts, an add, and a comparison (for 2D, and the adds/subtracts are the expensive ops for floats). |
Quote:
In the Terrain code, it is an exact-distance calc. However, there is little overhead, because: - mobs in the corner (pass the square test, fail the exact test) are not checked against again later in AICheckCloseArrgo, so time is saved in the end - even though caching uses CPU (about 3% in very heavy zones), it wipes out a lot of tests afterwards. |
Quote:
|
The reason for the CPU usage problem is more so the memory leaks: when RAM usage gets to the point where the machine starts using the hard drive as RAM (pagefile), that's when the 'shit hits the fan'. Other than that, it's more the fact that I am currently running 15 STATIC zones (not bootable) or more per computer (this changes tonight, as I'm setting up the new computers). And the computers are outdated (P3 733, Athlon 600); the 'good' computers are XP 1700 and XP 2000. If you put enough on them, they obviously can't handle it. But compared to what EQEMu was a while ago, we are doing much better IMO.
|
Quote:
I can run up to 10 zones like kaladima, kaladimb, nexus - i.e., up to 50 NPCs per zone - averaging 10M memory and 1 to 5% CPU (that's what top says). But when it comes to more populated zones, e.g. around 150 mobs, it goes up to 40-50% CPU, even more with pets spawning. I could not run any zone with more than 200 mobs before this cache, due to CPU at first glance. The load, as far as AICheckCloseArrgo is concerned, is N*N; thus for 150 mobs it accounts for 70% of the zone load. Anyway, with the cache, I could run 500+ mobs in a zone on that very machine ;) Memory then is about 40-50M, growing up slowly. |
Even so, the idea of bounding boxes is growing on me. It is looking like a really elegant solution to the line of sight problem. In the simplest case, you attach a vector of rectangles to each mob. One rectangle would be the mob's aggro area and the others would be facets cut away by occluding intercepts like walls. Determining line-of-sight then becomes a simple matter of taking the difference of our rectangles and testing for inclusion of a point. Extend to 3d as necessary. The math is simple and the bounding areas/volumes can be precomputed and stored in the db since mobs always spawn in the same location AFAIK. Contrast this to the existing method (or at least the method I saw suggested by Wiz), which says that if you want to find out if you can see from a to b you should take delta-sized steps in the direction of b and check at each step to see if you're standing on a wall (I doubt there is an acceptable delta in terms of both performance and wall-thickness).
The prime need for line of sight calculations that I'm aware of right now stems from the fact that radial distances are the only way to specify aggro distance. I.e... in a dungeon, you can't specify that a monster in a corner should only attack critters in the same room. Even without using multiple rectangles per mob, moving to a rectangular bounding area would allow more intelligent specification of aggro ranges (IMHO). |
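A hedged sketch of the rectangle-difference idea from the post above (all names illustrative; the real boxes would be precomputed per spawn point and stored in the DB as suggested): a mob's effective aggro area is one base rectangle minus a set of cut-away rectangles for occluded facets, and a target is aggroable iff it lies in the base box and in none of the cutouts.

```cpp
#include <cassert>
#include <vector>

// Axis-aligned rectangle with x1 < x2 and y1 < y2. Extend to 3D as needed.
struct Rect { float x1, y1, x2, y2; };

inline bool Contains(const Rect& r, float x, float y) {
    return x >= r.x1 && x <= r.x2 && y >= r.y1 && y <= r.y2;
}

// In the base aggro box, and in none of the occluded cutouts.
bool InAggroArea(const Rect& base, const std::vector<Rect>& cutouts,
                 float x, float y) {
    if (!Contains(base, x, y))
        return false;
    for (const Rect& c : cutouts)
        if (Contains(c, x, y))
            return false;
    return true;
}
```

Compare this with the step-along-the-ray method: instead of N delta-sized wall checks per query, it is a handful of comparisons against precomputed boxes.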
This sounds interesting.
Quote:
Generally speaking, this is a kind of BSP you refer to, clipped by the Aggro/Assist/whatever radius. Such BSPs exist in the client's zone files. It seems also that the map package in zone (will|might|could|ought to) be used to describe, and thus maybe partition, space. It uses *.map files, but I spent very little time looking at what it does, so maybe I'm mistaken. BTW, I found something out about caching: you must keep it up to date ;). I came across the issue of tracing despawns, and, yes, after a despawn, using the reference from the cache doesn't end up too well... Still some work to do on that topic. |
Quote:
I just downloaded and examined some of your code, and it is the funniest thing I've seen in a while. The "hashtable" in particular is the funniest data structure I've ever seen. You state in the code that it should have O(n) inserts and O(log n) searches; that's the first clue that something is seriously wrong. A proper hash table should have O(1) inserts, O(1) deletes, and O(1) lookups (assuming good hash functions). What you have is really a sorted vector. The really funny part, though, is when you realize that it isn't even a good sorted vector: you walk the vector to insert in sorted order instead of using the binary search function you've already created. The search function is also recursive, which is a little ugly. And all of these numbers are biased by using comparisons as the atomic operation; if you look at overall efficiency, the story gets considerably sadder. A good knowledge of data structures is the most dramatic difference between a good programmer and a poor one. In my experience, 80% of all programming tasks are solved by setting up appropriate data structures and then manipulating them (classes are data structures in this view). I commend you for trying to write your own, since that can be an excellent way to learn, but I still felt compelled to rip on you a bit for such a funny implementation. |
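For reference, the fix being suggested in the post above, sketched with the standard library (an int key stands in for whatever the real structure stores): a sorted vector should reuse its binary search to find the insertion point, rather than walking the whole vector.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Insert into a sorted vector: std::lower_bound finds the insertion point in
// O(log n) comparisons. (The element shift is still O(n), which is why this
// is a sorted vector, not a real hash table.)
void SortedInsert(std::vector<int>& v, int key) {
    v.insert(std::lower_bound(v.begin(), v.end(), key), key);
}

// Lookup reuses the same ordering: O(log n) comparisons.
bool SortedContains(const std::vector<int>& v, int key) {
    return std::binary_search(v.begin(), v.end(), key);
}
```

A true hash table (O(1) expected insert/delete/lookup) is a different structure entirely; a sorted vector like this is still a perfectly reasonable choice when lookups dominate and the container is rebuilt rarely.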
I won't take offense. Still, it saves over half the CPU....
The structures were built from scratch, and what is called a hashtable is really a keyed set - think of searching a 'flat' balanced binary tree. Whatever the names or code involved, setting up proper data structures felt like a waste of time to me at that point. I would encourage anyone willing to provide new structures to the project. It could even have used linked lists, as this -crappy- piece of code's efficiency does not come from its data structures. Well, good and efficient DS would help. |