For the sake of troubleshooting, can you guys try this build? No guarantees, but it will tell us something if the symptoms go away.
https://ci.appveyor.com/project/KimL...2988/artifacts |
Quote:
When I tried it the first time I got the files you put on this thread and also tried the latest unstable I pulled from there. I'll load up those new source files you posted in a few here, stress test, and then share our experiences on here. |
The experiment with the binaries you posted above failed at 18 players with a few pets out. We spiraled out of control with lag in the Nexus.
|
Sounds good, thanks for trying that
Good news: our core configurations are easy to tweak. Bad news: we don't know which settings to tweak yet. Good news again: we are working on integrating stats tooling into the server code so that you guys can give us dumps of what is occurring in the zones at the time, letting us see exactly what is happening at the network layer and give a proper prognosis. We are also working on a way to connect hundreds of headless clients to a server to simulate a stress condition, so we can debug this same issue without relying on someone with a player base and players logging in to replicate it; we can then replicate it on our own. That being said, the resend issue discussed earlier in this thread was a real issue, and we addressed the aggressive resend behavior over weeks' worth of testing on a few different servers. It seems that this network issue is common to Windows environments on a 2-2.6GHz Xeon chip whenever there are 20+ clients in the zone. We are committed to getting it resolved; you shouldn't need to switch your OS and hardware to get around this issue, so please be patient and we will get it sorted. |
I can confirm that I 100% agree with Akkadius. I attempted to upgrade our hardware and, in my folly, I was wrong: superior hardware is not going to fix this issue. I changed from a 2.0GHz Xeon processor to an Intel i7-6700 with 32GB of RAM, etc., and experienced the same issues. So at the very least we have eliminated the hardware theory.
|
We just tried running the installer on a fresh setup, and started getting lag once we passed 23 in zone.
|
Quote:
Move all of the files from the /nav directory into the new subdirectory, then run the server again. Lag goes away for me. NOTE: I'm working with highly customized server code and don't have the latest update. As a second test, I turned off .mmf file loading and left the .nav files in the /nav directory. Either solution worked for me. |
I will try this on Alternate Everquest and report back if it works for us
|
MMF loading has been disabled for some time.
The new mapping system has not been applied to mmf loads and, tbh, I am unsure of the behavior if you try to use them. Worst case scenario, you would either fall back to standard map loading or see zone crashes. The pre-built binaries do not include mmf loads as an enabled option. |
I'm waiting until Akkadius posts an official fix before I touch the maps and navmesh files. That sounds like a band-aid fix which could lead to zone crashes or NPCs screwing up, not a full repair of the problem.
|
Possible help
This may not solve your problem, but it may help (especially if you use InnoDB as your storage engine).
I took a look at how the Save method in client.cpp works and saw some issues in the way locking happens and transactions are used (or not used). From what I can tell, between the lock in front of the DB connection and the lack of bulking statements into a transaction, this could cause some serious issues for people who are not running their database on a very good SSD with a quick network connection to it. You can easily get 'fsync' choked. The symptoms are low I/O and CPU usage while everything starts lagging out, waiting for locks to be released so fsync can happen on the database for transaction purposes. I've created a pull request: https://github.com/EQEmu/Server/pull/827
**WARNING** My testing has been minimal, so use at your own risk, but I am encouraged by the results (over 2x faster with zero load on the system).
**Note** You will also need to add two indexes on tables; it's in the pull request (see the sketch after this post). It's important, as we are doing table scans during the save without them.
Hope this can help some of you. Right now it's a stop gap, and hopefully I can come up with a better solution in the near future. I would like feedback on whether it helps resolve the lag issues with pets out.
This is an example of bulking up the transactions for a simple save:
START TRANSACTION;
REPLACE INTO `character_currency` (id, platinum, gold, silver, copper, platinum_bank, gold_bank, silver_bank, copper_bank, platinum_cursor, gold_cursor, silver_cursor, copper_cursor, radiant_crystals, career_radiant_crystals, ebon_crystals, career_ebon_crystals) VALUES (685273, 184, 153, 149, 101, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
REPLACE INTO `character_bind` (id, zone_id, instance_id, x, y, z, heading, slot) VALUES (685273, 189, 0, 18.000000, -147.000000, 20.000000, 64.000000, 0);
REPLACE INTO `character_bind` (id, zone_id, instance_id, x, y, z, heading, slot) VALUES (685273, 41, 0, -980.000000, 148.000000, -38.000000, 64.000000, 1);
REPLACE INTO `character_bind` (id, zone_id, instance_id, x, y, z, heading, slot) VALUES (685273, 41, 0, -980.000000, 148.000000, -38.000000, 64.000000, 2);
REPLACE INTO `character_bind` (id, zone_id, instance_id, x, y, z, heading, slot) VALUES (685273, 41, 0, -980.000000, 148.000000, -38.000000, 64.000000, 3);
REPLACE INTO `character_bind` (id, zone_id, instance_id, x, y, z, heading, slot) VALUES (685273, 41, 0, -980.000000, 148.000000, -38.000000, 64.000000, 4);
DELETE FROM `character_buffs` WHERE `character_id` = '685273';
DELETE FROM `character_pet_buffs` WHERE `char_id` = 685273;
DELETE FROM `character_pet_inventory` WHERE `char_id` = 685273;
INSERT INTO `character_pet_info` (`char_id`, `pet`, `petname`, `petpower`, `spell_id`, `hp`, `mana`, `size`) VALUES (685273, 0, 'Labann000', 0, 632, 3150, 0, 5.000000) ON DUPLICATE KEY UPDATE `petname` = 'Labann000', `petpower` = 0, `spell_id` = 632, `hp` = 3150, `mana` = 0, `size` = 5.000000;
DELETE FROM `character_tribute` WHERE `id` = 685273;
REPLACE INTO character_activities (charid, taskid, activityid, donecount, completed) VALUES (685273, 22, 1, 0, 0), (685273, 22, 2, 0, 0), (685273, 22, 3, 0, 0), (685273, 22, 4, 0, 0), (685273, 22, 5, 0, 0);
REPLACE INTO character_activities (charid, taskid, activityid, donecount, completed) VALUES (685273, 23, 0, 0, 0);
REPLACE INTO character_activities (charid, taskid, activityid, donecount, completed) VALUES (685273, 138, 0, 0, 0);
REPLACE INTO `character_data` (id, account_id, `name`, last_name, gender, race, class, `level`, deity, birthday, last_login, time_played, pvp_status, level2, anon, gm, intoxication, hair_color, beard_color, eye_color_1, eye_color_2, hair_style, beard, ability_time_seconds, ability_number, ability_time_minutes, ability_time_hours, title, suffix, exp, points, mana, cur_hp, str, sta, cha, dex, `int`, agi, wis, face, y, x, z, heading, pvp2, pvp_type, autosplit_enabled, zone_change_count, drakkin_heritage, drakkin_tattoo, drakkin_details, toxicity, hunger_level, thirst_level, ability_up, zone_id, zone_instance, leadership_exp_on, ldon_points_guk, ldon_points_mir, ldon_points_mmc, ldon_points_ruj, ldon_points_tak, ldon_points_available, tribute_time_remaining, show_helm, career_tribute_points, tribute_points, tribute_active, endurance, group_leadership_exp, raid_leadership_exp, group_leadership_points, raid_leadership_points, air_remaining, pvp_kills, pvp_deaths, pvp_current_points, pvp_career_points, pvp_best_kill_streak, pvp_worst_death_streak, pvp_current_kill_streak, aa_points_spent, aa_exp, aa_points, group_auto_consent, raid_auto_consent, guild_auto_consent, RestTimer, e_aa_effects, e_percent_to_aa, e_expended_aa_spent, e_last_invsnapshot, mailkey) VALUES (685273,90536,'Rekka','',0,6,13,50,396,1550636815,1552018689,22507,0,70,0,1,0,17,255,4,4,2,255,0,0,0,0,'','',164708608,345,2299,1589,60,80,60,75,134,90,83,3,-1831.625000,-225.750000,3.127999,37.500000,0,0,0,0,0,0,0,0,4480,4480,0,22,0,0,0,0,0,0,0,0,4294967295,0,0,0,0,1291,0,0,0,0,60,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,'BF01A8C0586571A2');
COMMIT;
If this doesn't help, I'm sorry if I have muddied the waters a bit. |
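The exact index definitions live in the pull request itself; purely as an illustration of the kind of index the note above is talking about, they would presumably cover the `char_id` column that the pet-table DELETEs filter on, something along these lines (hypothetical names; check PR 827 for the real definitions):
Code:
-- Hypothetical sketch only; the authoritative definitions are in the pull request.
-- Without an index on char_id, the DELETEs in the save above scan the whole table.
ALTER TABLE `character_pet_buffs` ADD INDEX `idx_char_id` (`char_id`);
ALTER TABLE `character_pet_inventory` ADD INDEX `idx_char_id` (`char_id`);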
Is there data that makes you think the DB is acting as a bottleneck?
While I don't want to undermine the helpful suggestions, I did try a lot of DB performance tuning and switching storage engines. I also set up fairly robust DB monitoring to make sure it wasn't DB performance. When I was having massive lag spikes, there was virtually no load, locks, waits, or anything else on the DB. The DB was on-box and on SSD. I didn't save the reporting, but nothing about it in my case pointed to DB performance issues. As others said, a more methodical approach is probably in order to make sure the situation doesn't get more complex. |
From what I can tell from just looking at the code, the locking isn't at the database layer per se; it's at the MySQL connection per zone in zonedb.cpp. There is only one connection per zone, at least from what I see.
Fsync is a delay or latency issue on the DB when dealing with transactions. Every single query in Save is a transaction (all 13+ of them). You can have a system that can do 500 transactions per second but can do 100,000 inserts per second if you bulk up statements. A small latency can have a massive impact when locks are involved. If there is latency at the DB, it queues up on the zone depending on how many players are in the zone; the fewer people in the zone, the less this impacts them. Lowering the latency by limiting the fsyncs from transaction calls can ease the pressure on the connection lock, which prevents stalling of the character saves. Or that is at least the idea. **Note** this lock blocks all queries in the zone, not just those during saves. Also note that removing the table scans on the pet tables (by adding indexes) helps lower the latency of the call as well.
It can be easy to confuse work with latency/locks. You can have a slow system doing no work. Honestly, I would like to do more work on this and know it's a stop gap, but I figured doing 13x fewer transactions per save was a win when someone in this thread noted how commenting out some of the saves improved their latency. (Mine improved 2-3x.) I know it can be many issues and I may be barking up the wrong tree, but this is simply another option that does show a very clear improvement in performance around the zone locks.
Side note: a single MySQL connection for a process is generally a less than ideal situation. It is too much of a choke point for network I/O. Locks should be held for nano/microseconds, not milliseconds. Possibly make separate connections for reads and writes, depending on how the threading is set up in the zone process. (Note: I have not really looked at the threading model of the zone yet, so this may be moot and may be my misunderstanding.)
Note: on a phone, so sorry for the formatting/bad grammar. Very small window :) |
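To put rough numbers on the fsync point above: if each commit costs, say, 5 ms of fsync latency (an illustrative figure, not a measurement), a save that issues 13 separate transactions holds the zone's single connection lock for roughly 13 x 5 = 65 ms, while the same work bulked into one transaction pays that cost once, about 5 ms. For anyone who wants to check whether their own database is actually fsync-bound before changing anything, these are stock MySQL/InnoDB status queries (nothing EQEmu-specific):
Code:
-- 1 = flush and fsync the log on every commit (most durable, most fsync pressure)
SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';
-- Running totals of fsync calls; sample twice a few seconds apart to get a rate
SHOW GLOBAL STATUS LIKE 'Innodb_data_fsyncs';
SHOW GLOBAL STATUS LIKE 'Innodb_os_log_fsyncs';
-- Commit counter, to compare the commit rate against the fsync rate
SHOW GLOBAL STATUS LIKE 'Com_commit';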
Quote:
There are quite a few things I want to highlight about this, though, and contrast with what the real problems are here.
On the performance standpoint: EQEmu is not database heavy at all. Once upon a time we did all kinds of stupid things, but hundreds of hours have been poured into reducing our performance bottlenecks across the board in CPU, I/O and even network. To illustrate, here is PEQ: http://peq.akkadius.com:19999/#menu_...late;help=true With PEQ at 800-1000 toons daily, we barely break 100 IOPS, with barely 1MB/s in writes and minimal spikes here and there. That is virtually nothing.
Also, with your benchmark (I assume this was you on the PR), the current stock code on some middle-of-the-line server hardware produces the following timings HTML Code:
[Debug] ZoneDatabase::SaveCharacterData 1, done... Took 0.000107 seconds
(These timings will depend entirely on your hardware and MySQL server configuration, of course.) These operations also happen so infrequently that it is not going to matter.
There are many, many factors that play into the overall performance of a server, and since the server is essentially an infinite loop, anything within that loop can influence the amount of time that the CPU is not spent idling (network, I/O, overly CPU-intensive operations, etc.). Hardware is an influence, your MySQL server configuration is an influence, and most of all software is an influence. EQEmu used to be way, way more resource intensive, and we've come a long way to where that is not even an issue anymore. We have one outstanding bug that is isolated to the networking layer and that made its way through because we never saw it on PEQ during our normal QA routines.
We are currently working on code to measure application-layer network stats so folks can give us metric dumps we can use to produce a proper fix. We've seen what happens during the network anomaly under a CPU profile, and there's not much a profile alone is going to show beyond where the code is spending most of its time. We folks at EQEmu definitely have jobs that have been a deterrent from resolving said bug, but we will have it on lockdown soon enough, as we know exactly what we need to do; the only thing in our way is time as a resource.
We are not grasping at straws to fix this, folks, so please just be patient, as this is just not a quick fix with our schedules.
|
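To make the "too infrequent to matter" point above concrete with purely illustrative numbers (the save interval here is an assumption, not a measured EQEmu value): if 1000 characters each saved about once a minute, that would be roughly 17 saves per second, and at 0.000107 seconds per SaveCharacterData call the database would spend on the order of 17 x 0.0001 ≈ 0.002 seconds of work per second on these saves, which is negligible unless something else is serializing behind a slow commit.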
Update to this: KLS and I have been working on our stats internals so we can start dumping out data.
You can see a preview here of what KLS has working from the stats internals: https://media.discordapp.net/attachm...4/EQ000016.png I have built out an API which will pipe these stats to a web front-end to display graphs and metrics of all types, so we can perform some analysis against affected servers and zones. From there we should make some good progress, and we've seen a handful of things from it already. We will update when we have more. |