View Full Version : lag problems
flbren
02-01-2019, 09:08 AM
What would cause lag on my server once 25-ish toons enter a zone? The latency climbs steadily to 15,000+ ms. With a couple of toons in the zone I have no issues. For example, I can have 24 toons loaded in the Guild Lobby and lag out at anywhere from 500 ms to 15,000 ms, but have 3 toons in PoK at 60 ms. The lag doesn't seem to affect other zones.
Nightrider84
02-01-2019, 11:37 AM
Now to clarify: is this 25 toons in one zone while 3 toons are in a separate zone with no issues, all at the same time? Or do you mean that if you have more than X players at any time, the server lags everywhere?
flbren
02-01-2019, 05:47 PM
The lag is contained to the zone with the 25 toons; other zones with 3 or 4 players run fine. When I first log in and load the 25-ish toons into the zone, my ms is fine, but after 20-ish toons it starts rising to 200, 300, 1000, etc.
Nightrider84
02-03-2019, 12:25 AM
Well, that could be a few things. It could be router related; sometimes older routers can't handle the required lanes of traffic from multiple locations. A quick test for this would be to have around 30 toons logged in, but put them in about 10 different zones and see if the problem shows up again. If the problem doesn't show up, then it might be something on the server itself that's causing the issue. But that is a weird issue to have, honestly. Also, are these 25 separate people or are they all bots, etc.?
Scorpious2k
02-03-2019, 12:53 PM
You might also want to check CPU usage, RAM usage (and swap space), and especially bandwidth. Sometimes ISPs advertise xx bps but are really talking about downlink (incoming), while uplink (outgoing) is much, MUCH lower, which would limit packets going to clients.
flbren
02-03-2019, 03:16 PM
25 separate accounts that I loaded. From what I can tell, everything looked OK on the host.
Nightrider84
02-03-2019, 10:37 PM
Well, test the 25 accounts at once in different zones and see if you get that same latency spike. If you don't, then have them zone into the same zone and see if the problem arises.
ptarp
02-09-2019, 03:38 PM
I'm having this same problem. Running the code on Windows 10, there is no problem; running on any Windows Server version, the problem is exactly the same.
Running on Windows Server, if you distribute the toons across different zones, most times the lag will go away. Put them all in the same zone and ping and packet loss increase until it's unplayable. EQ reports up to 5,000 ms pings on a server within my local network. Server CPU stays around 6%, memory around 3%, and network around 250 kbps sent on a gigabit connection.
If you do have toons in other zones, they aren't affected until that zone is loaded up with toons also. With 22 or so in the lobby and 6 in WoS, the 6 in WoS can keep playing while the ones in the Guild Lobby all lag out.
Running on top of Windows 10 on the same machine, with a copy of the same code and a copy of the same database, the problem goes away. Log in the same toons, put them all in the same zone, and it stays playable.
Akkadius
02-11-2019, 03:13 PM
Just for visibility, guys: this is a known issue that we have been working on as of late.
I'll keep you posted when we have an update; I've not had time for my emulator backlog in the past few weeks.
Tom Cross
02-11-2019, 09:08 PM
Does anyone know if this is happening on Linux boxes too?
Akkadius
02-12-2019, 09:14 PM
Yes, it happens on Linux too. It's slightly less prevalent because of Linux's more efficient networking, but don't bother making a switching decision on that alone.
We have a solution; we've been carefully testing it over the past few weeks because it's a complex problem.
I'll update when we have it in. We need to run it on PEQ for a while now that we've had it running at smaller scale.
Tom Cross
02-13-2019, 03:45 PM
So we get the whole community onto PEQ and we all have a party in the Nexus for a good test run?
Akkadius
02-13-2019, 10:38 PM
Haha, not quite necessary. We have between 500 and 1,000 toons on in a given day, so we should be able to give it a good run.
ptarp
02-15-2019, 05:48 PM
I've tinkered with this a bit. I've gotten ping times to drop down to 350 with 47 in zone by dropping some of the extra database calls that aren't needed in the Client::Save(uint8 iCommitNow) function.
Binds are a big one, especially now with 5 binds. They don't need to be saved on every pass through Client::Save, because they are already saved every time they change.
Pets are another problem. On every pass through the loop, pets shouldn't be saved for every toon; even warriors were getting hit by the database.SavePetInfo(this); call. I moved that up 4 lines, just above the } else {.
I have also added a boolean to mine to prevent saving tribute if it hasn't changed, though that call to database.SaveCharacterTribute(this->CharacterID(), &m_pp); could probably be moved right into the Client::Handle_OP_TributeUpdate function.
Still weeding some of it out.
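For illustration, here is a minimal sketch of the kind of guard being described: only write data that actually changed during a periodic save. The class members and Save* helpers below are hypothetical stand-ins, not the emulator's actual code.

#include <cstdio>

class Client {
public:
    void Save();
    void MarkPetChanged()     { pet_dirty = true; }
    void MarkTributeChanged() { tribute_dirty = true; }

private:
    bool pet_dirty     = false;
    bool tribute_dirty = false;

    // Stubs standing in for the real database calls.
    void SaveCharacterData() { std::puts("save character_data"); }
    void SavePetInfo()       { std::puts("save pet info"); }
    void SaveTribute()       { std::puts("save tribute"); }
};

void Client::Save()
{
    SaveCharacterData();      // the core character row is still written every pass

    // Binds are written whenever they change, so they are not re-saved here at all.

    if (pet_dirty) {          // petless classes never pay for a pet save
        SavePetInfo();
        pet_dirty = false;
    }
    if (tribute_dirty) {      // tribute is only saved after it actually changed
        SaveTribute();
        tribute_dirty = false;
    }
}

int main()
{
    Client c;
    c.MarkPetChanged();
    c.Save();   // writes character data and pet info, skips tribute
    c.Save();   // second pass only writes character data
}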
Akkadius
02-15-2019, 06:18 PM
These are all separate from the issue reported in the thread.
There is a network problem that occurs when enough clients are in a zone and one or many of them have not-so-great connections: our server network code will resend very aggressively, which causes a server with a lower CPU frequency, such as a 2.3 GHz Xeon, to essentially choke itself on the rapid burst of resend packets.
The resend packets choke the server primarily through compressing packets and sending them over the line at a rapid rate; the CPU maxes out for that thread and can take several minutes to recover.
Correcting the resend logic is a complex issue because we use dynamic algorithms for resend rate and packet recovery that have to take a lot of different factors into account.
We have some ideal settings figured out and are looking to get them pushed to mainline as soon as we feel confident that they don't produce any more regressions.
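For readers unfamiliar with what a "dynamic resend rate" involves, the sketch below shows the standard shape of such a calculation (an RFC 6298-style retransmission timeout with exponential backoff). It is a generic illustration only, not the emulator's actual algorithm or tuning.

#include <algorithm>
#include <cmath>
#include <cstdio>

struct ResendTimer {
    double srtt   = 0.0;   // smoothed round-trip time (ms)
    double rttvar = 0.0;   // round-trip time variance (ms)
    double rto    = 500.0; // current resend timeout (ms)
    bool   primed = false;

    // Feed in each measured round trip; the timeout adapts to the connection.
    void OnRttSample(double rtt_ms) {
        if (!primed) {
            srtt   = rtt_ms;
            rttvar = rtt_ms / 2.0;
            primed = true;
        } else {
            rttvar = 0.75 * rttvar + 0.25 * std::abs(srtt - rtt_ms);
            srtt   = 0.875 * srtt + 0.125 * rtt_ms;
        }
        rto = std::clamp(srtt + 4.0 * rttvar, 100.0, 10000.0);
    }

    // When a resend itself is lost, back off instead of hammering the link.
    void OnResendTimeout() { rto = std::min(rto * 2.0, 10000.0); }
};

int main() {
    ResendTimer t;
    const double samples[] = {60.0, 80.0, 250.0, 900.0}; // a connection getting worse
    for (double sample : samples) {
        t.OnRttSample(sample);
        std::printf("rtt=%6.1f ms -> resend timeout=%7.1f ms\n", sample, t.rto);
    }
    t.OnResendTimeout();
    std::printf("after a lost resend -> resend timeout=%7.1f ms\n", t.rto);
}

The point of backing off is that a client on a bad connection slows the resend rate down rather than triggering ever more traffic.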
ptarp
02-16-2019, 12:16 AM
Yes, this is different. The server is running on an i5, and CPU stays low, around 6%. The issue for me seems to be that entity_list.Process() is taking too long: by the time you get through the whole list of clients, the first one is starving for packets. Each client added increases the ms reading reported by EQ (F11 to show it in the top-left corner).
This same thing will affect all operating systems. Windows 10 seems better than Server, but is still not working well.
I enabled MySQL logging to disk. Logging went to a secondary SSD, with the MySQL data files on drive C: and the server folder on D:; that's how I saw how many times per second MySQL was being accessed. Any single call may not take that much time, but all together it's a DoS bomb for the hard drive, even an SSD like mine. Turn it on and look at a zone with over 24 or 25 in it and you'll see what I mean. Look at the times for the first client going through Client::Save and compare them to the last.
Since I'm logging to a separate hard drive, performance doesn't change when I turn logging on/off.
I recommend you think about dealing with this before you worry about resend logic.
Correct the issues I'm talking about, and your resend issues may go away.
Hope this helps.
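One way to check this kind of theory is to time the per-client work directly and see whether the later clients in the list really wait longer as the zone fills up. A minimal, generic timing harness might look like the following; the hooks named in the comments are placeholders, not the emulator's API.

#include <chrono>
#include <cstdio>

// Times an arbitrary callable and returns the elapsed milliseconds.
template <typename Fn>
double TimedCallMs(Fn&& fn) {
    auto start = std::chrono::steady_clock::now();
    fn();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}

int main() {
    // Stand-in for entity_list.Process() walking every client in the zone.
    for (int client = 0; client < 30; ++client) {
        double ms = TimedCallMs([&] {
            // client->Process() / client->Save() would go here
        });
        if (ms > 5.0) // flag anything slow enough to starve the rest of the loop
            std::printf("client %d took %.2f ms\n", client, ms);
    }
}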
Akkadius
02-16-2019, 01:27 PM
Again, these are completely unrelated.
Just because you saw a bunch of disk activity and a bunch of queries in a file doesn't mean that it's the reason for the lag. If you have an improperly tuned MySQL server, along with something enabled that is pegging your MySQL server, that is another matter, and I'm happy to help diagnose those with you.
I want you to contrast all of what you observed with PEQ's disk activity:
http://peq.akkadius.com:19999/#menu_disk_submenu_sda;theme=slate;help=true
PEQ has over 800 players right now and stays at maybe around 1 MB/s in writes, if that, with occasional bursts; IO operations stay at a very, very low level even for 800 players.
Client::Save is a very light operation; there are maybe a handful of INSERTs or REPLACE INTOs, all of which are sub-10ms inserts. We could use fewer Client::Save calls in general, but it really isn't the problem here.
You don't need to turn on the MySQL general log when you can see exactly what a zone process is doing by enabling MySQL query logging at the process level. Even if you pipe the general log to another drive, it is still overhead to the MySQL process.
https://github.com/EQEmu/Server/wiki/Logging-System#gm-say
In the `logsys_categories` table you can shut off any category you are piping to file
Back to the Network Issue
We know exactly what's going on with the network issue because we've taken CPU snapshot profiles during the problem. It's just not a quick "fix", and we typically choose to go through a very careful staged approach before reintroducing this into mainline because of the complex factors involved.
The reason we've seen this far less on PEQ is that PEQ has an overclocked 5 GHz processor, DDR4 memory and NVMe datacenter SSDs. When a zone process goes into resend-storm logic, it can keep up with the very aggressive resends just long enough for the client to either disconnect from its own terrible connection or recover on its own.
There is still a breaking point with our hardware, however; it just takes a lot more to get there. If we had over 100 toons in a zone on PEQ and something produced enough resend traffic (like raid combat spam during a burn), it would trip the same inflection point that most folks are seeing on their Windows nodes at 20-40 people in a zone with roughly 2.6 GHz processors and whatever else they're running on their boxes. Even with over 100 toons it is still very rare to see, simply because of the very tight hardware that is being utilized.
Regardless, you shouldn't need the above specs to run a server; that is not the point at all. The point is that we hadn't run into this issue up until now because most of our code QA goes through PEQ and our hardware has been masking the problem. Before we released the netcode overhaul to mainline we went through several iterations of issues and actually improved our overall netcode utilization massively, which I am still super stoked about to this day. We just have this one issue plaguing people, and we will have it resolved soon, so stay tuned for updates.
Drakiyth
02-16-2019, 09:30 PM
Akkadius,
I just want to say that the Varlyndria players and myself really appreciate everything you and the main EQ Devs are doing to fix this lag issue. I could only imagine the frustration it could bring. One thing I have done for my hub zone is create public instances that players can travel to. This helps free up congestion if lag starts occurring in the non-instanced zone. I encourage any server owner to do the same while this issue remains.
Here is to a quick recovery so we can all once again enjoy a solid amount of players in the same zone with no issues.
eldarian
02-25-2019, 07:53 PM
Has there been any new progress on this issue? It's very frustrating that a commonly used hosting processor is causing this much turmoil.
Akkadius
02-25-2019, 08:33 PM
The update is that we've had it on PEQ and we're making additional tweaks that go live tomorrow. This takes time to test until we feel it's ready to go back into mainline.
eldarian
02-25-2019, 08:35 PM
I know AEQ would be very happy to be your test server for this fix; the community has communicated as much to me. Feel free to reach out to me and we can do whatever we need to test it in operation.
Akkadius
03-02-2019, 06:50 PM
We pushed changes last night that have been tested on PEQ for over a week with 800+ toons with no issues. Also tested on Legacy of Norrath before they shut down
https://ci.appveyor.com/api/projects/KimLS/server/artifacts/eqemu-x86-no-bots.zip
Give that a whirl
Drakiyth
03-02-2019, 09:48 PM
I plan to add this tomorrow morning to Varlyndria. We all thank you for this fix.
eldarian
03-03-2019, 11:16 AM
Let me know if this fix worked for you in any areas
Drakiyth
03-03-2019, 05:08 PM
I added the source code from Akkadius' link above to Varlyndria early this morning and then did a stress test with the server that didn't go so well. I even tried pulling the latest unstable source into the folder. The stress test in Nexus started bugging out with 18+ players when the spike came back. It does appear to be better than it was before, but not what I was expecting. Varlyndria is currently on an AWS T3 Large Windows system. It has held over 118 clients online plus pets just fine, as long as they are spread across different zones/instances of the high-traffic hubs with under 11 or so in each (on average). Now it seems like at 16-18 or so the lag comes back full force and spikes the zone out badly -- eventually crashing it, or forcing me to shut it down.
I've heard from a source on my Discord that developers using Linux are having more luck with it. I've been running Windows since I started with EQEmu, and I've never seen an issue like this before, aside from not having enough connection speed to handle the player population.
At this point, I am hesitant/undecided about whether a stronger setup than the T3 Large would produce better results with the change.
Any professional advice that can be given on the situation would be helpful.
ptarp
03-04-2019, 09:38 AM
There just appears to be something in the Windows build that's making it "hiccup". I'm wondering if I have to switch to Linux.
Maze_EQ
03-04-2019, 11:02 AM
The new build worked on our dev server with 80 clients in the same zone.
Our dev environment previously couldn't handle 20+.
ptarp
03-04-2019, 12:36 PM
You're running Windows?
Akkadius
03-04-2019, 02:44 PM
So - we could be dealing with a few factors here. While the resend issue was a very valid issue that we took care of, I have a hunch that something else is at play in the Windows realm here.
I have another question for you guys: where have you been getting your binaries?
Have you been compiling them yourselves? In the past few months we switched our main source of Windows binary updates from our CI system, and I just want to rule out a bad or poorly performing library or compilation setting.
At the end of the day, Windows or Linux, you should be able to run on either; we'll get it figured out.
Maze_EQ
03-04-2019, 04:01 PM
I built these myself.
I'll see if I can repro with your installer.
Akkadius
03-04-2019, 07:56 PM
For the sake of troubleshooting, can you guys try this build? No guarantees, but it will tell us something if the symptoms go away.
https://ci.appveyor.com/project/KimLS/server/builds/21452988/artifacts
Drakiyth
03-04-2019, 09:27 PM
Lately I've been getting the updated source and binaries from the installer prompt in the main server folder: eqemu_server.pl
When I tried it the first time, I got the files you put on this thread and also tried the latest unstable build I pulled from there.
I'll load up those new source files you posted in a few minutes here, stress test, and then share our experiences on here.
Drakiyth
03-04-2019, 11:59 PM
The experiment with the binaries you posted above failed at 18 players with a few pets out. We spiraled out of control with lag in the Nexus.
Akkadius
03-05-2019, 03:05 PM
Sounds good, thanks for trying that
Good News
Our core configurations are easy to tweak
Bad News
We don't know what settings to tweak yet
Good News Again
We are working on integrating stats tooling into the server code so that you guys can give us dumps of what is occurring in your zones at the time of the problem. That way we can see exactly what is happening at the network layer and give a proper prognosis.
We are also working on a way to connect hundreds of headless clients to a server to simulate a stress condition, so we can debug this same issue without relying on someone with a player base and players logging in; we can then replicate it on our own.
That being said
The resend issue discussed earlier in this thread was a real issue, and we addressed the aggressive resend behavior over weeks of testing on a few different servers.
It seems that this network issue is common to Windows environments on 2-2.6 GHz Xeon chips whenever there are 20+ clients in the zone.
We are committed to getting it resolved. You shouldn't need to switch your OS and hardware to get around this issue, so please be patient and we will get it sorted.
eldarian
03-05-2019, 11:10 PM
I can confirm that I 100% agree with Akkadius. I attempted to upgrade our hardware and, in my folly, I was wrong: superior hardware is not going to fix this issue. I changed from a 2.0 GHz Xeon to an Intel i7-6700 with 32 GB of RAM, etc., and experienced the same issues. So at the very least we have eliminated the hardware theory.
blooberry_eq99
03-05-2019, 11:10 PM
We just tried running the installer on a fresh setup, and started getting lag once we passed 23 in zone.
ptarp
03-07-2019, 10:51 PM
As another test: use the same binaries and everything else, but in the /Maps/nav directory create a subdirectory, something like /Maps/nav/removed.
Move all of the files from the /nav directory into the new subdirectory, then run the server again. The lag goes away for me.
NOTE: I'm working with highly customized server code and don't have the latest update.
As a second test, I turned off .mmf file loading and left the .nav files in the /nav directory. Either solution worked for me.
eldarian
03-07-2019, 11:02 PM
I will try this on Alternate Everquest and report back if it works for us.
Uleat
03-07-2019, 11:39 PM
MMF loading has been disabled for some time.
The new mapping system has not been applied to mmf loads and, tbh, I am unsure of the behavior of trying to use them.
Worst case scenario, you would either fall back to standard map loading or get zone crashes.
The pre-built binaries do not include mmf loads as an enabled option.
Drakiyth
03-08-2019, 12:51 AM
I'm waiting until Akkadius posts an official fix before I touch the maps and navmesh files. That sounds like a band-aid which could lead to zone crashes or NPCs screwing up, not a full-on repair of the problem.
Rekka
03-08-2019, 10:57 AM
This may not solve your problem, but it may help (especially if you use InnoDB as your storage engine).
I took a look at how the Save method in client.cpp works and saw some issues in the way locking happens and transactions are used (or not used).
From what I can tell, between the lock in front of the DB connection and the lack of bulking statements into a transaction, this could cause some serious issues for people who are not using a very good SSD for their database and a quick network connection to their database of choice. You can easily get 'fsync'-choked. The symptoms are low IO/CPU usage while everything starts lagging out, waiting for the locks to be released so fsync can happen on the database for transaction purposes.
I've created a pull request.
https://github.com/EQEmu/Server/pull/827
**WARNING** My testing has been minimal, so use at your own risk, but I am encouraged by the results (over 2x faster with zero load on the system).
**Note** You will also need to add two indexes on tables; it's in the pull request. This is important, as we are doing table scans during the save without them.
Hope this can help some of you. Right now it's a stop-gap, and hopefully I can come up with a better solution in the near future. I would like feedback on whether it helps resolve the lag issues with pets out.
This is an example of bulking the transactions of a simple save:
START TRANSACTION;
REPLACE INTO `character_currency` (id, platinum, gold, silver, copper,platinum_bank, gold_bank, silver_bank, copper_bank,platinum_cursor, gold_cursor, silver_cursor, copper_cursor, radiant_crystals, career_radiant_crystals, ebon_crystals, career_ebon_crystals)VALUES (685273, 184, 153, 149, 101, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
REPLACE INTO `character_bind` (id, zone_id, instance_id, x, y, z, heading, slot) VALUES (685273, 189, 0, 18.000000, -147.000000, 20.000000, 64.000000, 0);
REPLACE INTO `character_bind` (id, zone_id, instance_id, x, y, z, heading, slot) VALUES (685273, 41, 0, -980.000000, 148.000000, -38.000000, 64.000000, 1);
REPLACE INTO `character_bind` (id, zone_id, instance_id, x, y, z, heading, slot) VALUES (685273, 41, 0, -980.000000, 148.000000, -38.000000, 64.000000, 2);
REPLACE INTO `character_bind` (id, zone_id, instance_id, x, y, z, heading, slot) VALUES (685273, 41, 0, -980.000000, 148.000000, -38.000000, 64.000000, 3);
REPLACE INTO `character_bind` (id, zone_id, instance_id, x, y, z, heading, slot) VALUES (685273, 41, 0, -980.000000, 148.000000, -38.000000, 64.000000, 4);
DELETE FROM `character_buffs` WHERE `character_id` = '685273';
DELETE FROM `character_pet_buffs` WHERE `char_id` = 685273;
DELETE FROM `character_pet_inventory` WHERE `char_id` = 685273;
INSERT INTO `character_pet_info` (`char_id`, `pet`, `petname`, `petpower`, `spell_id`, `hp`, `mana`, `size`) VALUES (685273, 0, 'Labann000', 0, 632, 3150, 0, 5.000000) ON DUPLICATE KEY UPDATE `petname` = 'Labann000', `petpower` = 0, `spell_id` = 632, `hp` = 3150, `mana` = 0, `size` = 5.000000;
DELETE FROM `character_tribute` WHERE `id` = 685273;
REPLACE INTO character_activities (charid, taskid, activityid, donecount, completed) VALUES (685273, 22, 1, 0, 0), (685273, 22, 2, 0, 0), (685273, 22, 3, 0, 0), (685273, 22, 4, 0, 0), (685273, 22, 5, 0, 0);
REPLACE INTO character_activities (charid, taskid, activityid, donecount, completed) VALUES (685273, 23, 0, 0, 0);
REPLACE INTO character_activities (charid, taskid, activityid, donecount, completed) VALUES (685273, 138, 0, 0, 0);
REPLACE INTO `character_data` ( id,account_id,`name`, last_name, gender, race, class, `level`, deity,birthday,last_login,time_played,pvp_status,level2, anon, gm, intoxication,hair_color,beard_color,eye_color_1,eye_color_2,hair_style,beard,ability_time_seconds,ability_number,ability_time_minutes,ability_time_hours, title,suffix, exp, points, mana, cur_hp, str, sta, cha, dex, `int`,agi, wis, face, y, x, z, heading, pvp2, pvp_type,autosplit_enabled, zone_change_count, drakkin_heritage, drakkin_tattoo,drakkin_details, toxicity,hunger_level,thirst_level,ability_up,zone_id, zone_instance,leadership_exp_on, ldon_points_guk, ldon_points_mir, ldon_points_mmc, ldon_points_ruj, ldon_points_tak, ldon_points_available,tribute_time_remaining, show_helm, career_tribute_points,tribute_points,tribute_active,endurance, group_leadership_exp,raid_leadership_exp, group_leadership_points, raid_leadership_points, air_remaining,pvp_kills, pvp_deaths,pvp_current_points, pvp_career_points, pvp_best_kill_streak,pvp_worst_death_streak, pvp_current_kill_streak, aa_points_spent, aa_exp, aa_points, group_auto_consent, raid_auto_consent, guild_auto_consent, RestTimer, e_aa_effects, e_percent_to_aa, e_expended_aa_spent, e_last_invsnapshot, mailkey ) VALUES (685273,90536,'Rekka','',0,6,13,50,396,1550636815,1552018689,22507,0,70,0,1,0,17,255,4,4,2,255,0,0,0,0,'','',164708608,345,2299,1589,60,80,60,75,134,90,83,3,-1831.625000,-225.750000,3.127999,37.500000,0,0,0,0,0,0,0,0,4480,4480,0,22,0,0,0,0,0,0,0,0,4294967295,0,0,0,0,1291,0,0,0,0,60,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,'BF01A8C0586571A2');
COMMIT;
If this doesn't help, I'm sorry if I have muddied the waters a bit.
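To make the idea concrete, here is a minimal sketch of what bulking a character save into a single transaction looks like from the C++ side, so the database only has to fsync once per save instead of once per statement. The Database/RunQuery interface below is a stand-in for illustration, not the emulator's actual zonedb API.

#include <cstdio>
#include <string>
#include <vector>

struct Database {
    // Stand-in for the real query interface; here it just echoes the SQL.
    bool RunQuery(const std::string& sql) {
        std::printf("SQL> %s\n", sql.c_str());
        return true;
    }
};

// Wraps all of a character's save statements in one transaction.
bool SaveCharacterBulk(Database& db, const std::vector<std::string>& statements) {
    if (!db.RunQuery("START TRANSACTION"))
        return false;
    for (const auto& sql : statements) {
        if (!db.RunQuery(sql)) {
            db.RunQuery("ROLLBACK");   // keep the save atomic on failure
            return false;
        }
    }
    return db.RunQuery("COMMIT");      // a single fsync for the whole save
}

int main() {
    Database db;
    SaveCharacterBulk(db, {
        "REPLACE INTO character_currency (...) VALUES (...)",
        "REPLACE INTO character_bind (...) VALUES (...)",
        "DELETE FROM character_buffs WHERE character_id = 685273",
    });
}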
Straps
03-08-2019, 11:22 AM
Is there data that makes you think the DB is acting as a bottleneck?
While I don't want to undermine the helpful suggestions, I did try a lot of DB performance tuning and changed storage engines to tweak things. I also set up fairly robust DB monitoring to make sure it wasn't DB performance.
When I was having massive lag spikes, there was virtually no load, no locks, no waits, nothing really on the DB. The DB was on-box and on an SSD. I didn't save the reporting, but nothing about it in my case pointed to DB performance issues.
As others said, a more methodical approach is probably in order to make sure the situation doesn't get more complex.
Rekka
03-08-2019, 12:48 PM
From what I can tell from just looking at the code, the locking isn't at the database layer per se; it's on the per-zone MySQL connection in zonedb.cpp. There is only one connection per zone, at least from what I see.
Fsync is a delay/latency issue on the DB when dealing with transactions. Every single query in Save is its own transaction (all 13+ of them). A system might only manage 500 transactions per second yet do 100,000 inserts per second if you bulk up the statements.
A small latency can have a massive impact when locks are involved.
If there is latency at the DB, it queues up on the zone, depending on how many players are in each zone. The fewer people in the zone, the less this impacts them.
Lowering the latency by limiting fsync to a single transaction commit can ease the pressure on the connection lock, which prevents the character saves from stalling. Or that is at least the idea.
**Note** This lock blocks all queries in the zone, not just those during saves.
Also note that removing the table scans on the pet tables (by adding indexes) helps lower the latency of the call as well.
It can be easy to confuse work with latency/locks. You can have a slow system doing no work.
Honestly, I would like to do more work on this and I know it's a stop-gap, but I figured doing 13x fewer transactions per save was a win, given that someone in this thread noted that commenting out some of the saves improved their latency (mine improved 2-3x).
I know it could be many issues and I may be barking up the wrong tree, but this is simply another option, and it does show a very clear improvement in performance around the zone locks.
Side note: a single MySQL connection for a process is generally a less-than-ideal situation; it is too much of a choke point for network IO. Locks should be held for nano/microseconds, not milliseconds. You could possibly make separate connections for reads and writes, depending on how the threading is set up in the zone process. (Note: I have not really looked at the threading model of the zone yet, so this may be moot and may be my misunderstanding.)
Note: I'm on a phone, so sorry for the formatting/bad grammar. Very small window :)
Akkadius
03-08-2019, 06:00 PM
Rekka, thank you so much for spending the time, performing analysis, and trying to help; we always appreciate folks who take the initiative to contribute to the project.
There are quite a few things I want to highlight here, though, to contrast them with what the real problems are.
On the Performance Standpoint
EQEmu is not database heavy at all. There was once upon a time when we did all kinds of stupid things, but hundreds of hours have been poured into reducing our performance bottlenecks across the board in the departments of CPU, I/O and even network. To illustrate, here is PEQ: http://peq.akkadius.com:19999/#menu_disk_submenu_sda;theme=slate;help=true
With PEQ at 800-1,000 toons daily, we barely break 100 IOPS, with barely 1 MB/s in writes and minimal spikes here and there. That is virtually nothing.
Also, with your benchmark (I assume that was you on the PR), the current stock code on some middle-of-the-line server hardware produces the following timings:
[Debug] ZoneDatabase::SaveCharacterData 1, done... Took 0.000107 seconds
[Debug] ZoneDatabase::SaveCharacterData 1, done... Took 0.000091 second
This is a sub-1ms operation that is not going to hurt even when it is synchronous.
(These timings will depend entirely on your hardware and MySQL server configuration, of course.)
These operations also happen infrequently enough that they are not going to matter.
There are many, many factors that play into the overall performance of a server, and since the server is essentially an infinite loop, anything within that loop can influence the amount of time the CPU is not spent idling (network, I/O, overly CPU-intensive operations, etc.). Hardware is an influence, your MySQL server configuration is an influence, and most of all software is an influence.
EQEmu used to be way, way more resource intensive, and we've come a long way, to the point where that is not even an issue anymore. We have one outstanding bug that is isolated to the networking layer; it made its way through because we never saw it on PEQ during our normal QA routines.
We are currently working on code to measure application-layer network stats so folks can give us metric dumps, which will let us provide a proper fix. We've seen what happens during the network anomaly in a CPU profile, and there's not much it is going to show on its own beyond where the time is being spent.
We folks at EQEmu definitely have jobs that have been a deterrent from resolving said bug, but we will have it on lockdown soon enough, as we know exactly what we need to do; the only thing in our way is time as a resource.
We are not grasping at straws to fix this, folks, so please just be patient; this is just not a quick fix with our schedules.
As for the earlier suggestion of pulling the nav mesh files: removing nav may "help" lag because you have fewer position updates being sent around from mobs not pathing (or pathing less frequently), along with fewer CPU-intensive path calculations. Again, I will defer to my statements above; please be patient and we'll have a fix for folks when we have the time.
Akkadius
03-11-2019, 03:02 AM
An update to this: KLS and I have been working on our stats internals so we can start dumping out data.
You can see a preview here of what KLS has working from the stats internals:
https://media.discordapp.net/attachments/213405724682878976/554554251616190474/EQ000016.png
I have built out an API which will pipe these stats to a web front-end to display graphs and metrics of all types, so we can perform analysis against affected servers and zones. From there we should make some good progress; we've already seen a handful of things from it.
We will update when we have more
Drakiyth
03-11-2019, 04:27 PM
With these new changes, what time frame are we looking at for a Windows fix in the source?
ptarp
03-31-2019, 12:07 AM
Not sure if this is still being looked at or whether anything's been done, but I thought I'd mention this because it might bring some things to light.
I took code that was based around 2010 or so (note that was before the _character table changed to character_data). I added custom changes from that to newer code I grabbed from your Git around 8-2017 to use on our server. I didn't notice the difference right away because most of my testing involved a few toons here and there.
Since I noticed the problem with logging in more than 24 toons, I adjusted things in my own code and started updating it by incorporating each change from your Git. It is now up to date with 9-2018, at which point I noticed one change that really helped, but it was buried in a merge.
I thought that was all there was to it. But since then, I logged 48 toons into one zone on my server.
All seemed fine and there was no lag -- until I buffed all my toons. That added to the lag because of the extra work in the database.SaveBuffs(this); call.
Now I have begun comparing the code differences between the old version (before the database split) and the new one.
In the old version, the only database call in the Client::Save(uint8 iCommitNow) function went through the DBAsyncWork class unless iCommitNow was > 0. That class actually added its own thread to work through a queue of database queries without slowing down the main thread.
I don't see that thread anywhere in the new code. Maybe I'm mistaken, but I think it might help.
Don't know for sure, but just a thought.
Akkadius
03-31-2019, 12:17 AM
A lot has been done; if you read the thread, you'll see I have been giving updates on exactly what we've been working on. For the umpteenth time, this is not a database problem.
We built metrics into the server, and we built a web admin panel and a comms API to the server so we can visually see the problem:
https://media.discordapp.net/attachments/212684816104030209/558777400553504768/unknown.png?width=1080&height=247
https://media.discordapp.net/attachments/212684816104030209/558778605052887050/unknown.png?width=1080&height=876
Below is a visual of the exact problem that we are running into. This is the cascading resend problem that chokes the main thread of the application if the processor can't keep up; the core/process collapses upon itself.
We had 60 toons come in (which isn't a problem at all for the hardware we run on) and they all ran macros that generated an excessive amount of spam. It all runs fine until the process gets behind on resends, then its ability to keep up with the resends cascades because of all of the packet loss.
https://media.discordapp.net/attachments/212684816104030209/558780953561006081/unknown.png?width=1080&height=538
https://media.discordapp.net/attachments/212684816104030209/558779655805468673/unknown.png?width=1080&height=726
Here is the point where resends get so bad that the server can no longer send keepalives to the clients; the clients disconnect, the process eventually catches up again, and everything flatlines.
https://media.discordapp.net/attachments/212684816104030209/558785876386381824/unknown.png?width=1080&height=798
TL;DR: the server keeps up just fine until the process buckles.
The reason for this is that the packet communications happen on the main thread, which hadn't been a problem until we discovered this scenario in recent months.
We are working on removing the client communications from the main thread so that we don't run into this buckling problem from back-pressure. Our networking should not be occurring on the main thread regardless, and getting two threads to hand off networking responsibilities over an internal queue isn't the most trivial of processes either, so it's taking us some time.
We also can't measure changes without having taken the time, as we have, to build out metrics that show the problem and confirm that it has actually been resolved.
Also, for context and clarity, the network code was completely overhauled at the time of the changelog below. While we've ironed out most things, this is our last outstanding issue, and it hasn't been an easy one from a time and resource perspective because it has been incredibly elusive, harder to reproduce, and until now we had no way to measure or capture the problem.
== 4/16/2017 ==
KLS: Merge eqstream branch
- UDP client stack completely rewritten should both have better throughput and recover better (peq has had far fewer reports of desyncs).
- TCP Server to Server connection stack completely rewritten.
- Server connections reconnect much more reliably and quickly now.
- Now supports optional packet encryption via libsodium (https://download.libsodium.org/doc/).
- Protocol behind the tcp connections has changed (see breaking changes section).
- API significantly changed and should be easier to write new servers or handlers for.
- Telnet console connection has been separated out from the current port (see breaking changes section).
- Because of changes to the TCP stack, lsreconnect and echo have been disabled.
- The server tic rate has been changed to be approx 30 fps from 500+ fps.
- Changed how missiles and movement were calculated slightly to account for this (Missiles in particular are not perfect but close enough).
- Breaking changes:
- Users who use the cmake install feature should be aware that the install directory is now %cmake_install_dir%/bin instead of just %cmake_install_dir%/
- To support new features such as encryption the underlying protocol had to change... however some servers such as the public login server will be slow to change so we've included a compatibility layer for legacy login connections:
- You should add <legacy>1</legacy> to the login section of your configuration file when connecting to a server that is using the old protocol.
- The central eqemu login server uses the old protocol and probably will for the forseeable future so if your server is connecting to it be sure to add that tag to your configuration file in that section.
- Telnet no longer uses the same port as the Server to Server connection and because of this the tcp tag no longer has any effect on telnet connections.
- To enable telnet you need to add a telnet tag in the world section of configuration such as:
<telnet ip="0.0.0.0" port="9001" enabled="true"/>
Also, a friendly reminder to keep in mind that we have life obligations, so we don't have most weekdays to keep hammering at this problem.
We'll let you know when we have the code fixes in place
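For readers following along, the general shape of the change being described (moving client packet sends off the main game loop onto a dedicated network thread via an internal queue) looks roughly like the producer/consumer sketch below. This is a generic illustration under that assumption, not the emulator's actual implementation.

#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

// A simple thread-safe queue: the game loop pushes outgoing packets,
// a dedicated network thread pops and sends them.
class SendQueue {
public:
    void Push(std::string packet) {
        {
            std::lock_guard<std::mutex> lk(m_);
            q_.push(std::move(packet));
        }
        cv_.notify_one();
    }
    // Returns false once the queue is drained and shut down.
    bool Pop(std::string& out) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !q_.empty() || done_; });
        if (q_.empty())
            return false;
        out = std::move(q_.front());
        q_.pop();
        return true;
    }
    void Shutdown() {
        {
            std::lock_guard<std::mutex> lk(m_);
            done_ = true;
        }
        cv_.notify_all();
    }
private:
    std::queue<std::string> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
};

int main() {
    SendQueue queue;
    std::thread network([&] {     // "network thread": compress/send/resend happens here
        std::string packet;
        while (queue.Pop(packet))
            std::printf("sent %s\n", packet.c_str());
    });

    for (int i = 0; i < 5; ++i)   // "main game loop": only enqueues, never blocks on the wire
        queue.Push("packet " + std::to_string(i));

    queue.Shutdown();
    network.join();
}

The idea is that a burst of resends backs up in the queue and costs the network thread time, while the game loop keeps ticking.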
eldarian
04-03-2019, 06:54 PM
Thank you kindly for the very detailed update. I believe this was a side effect of the Thanos snap.
eldarian
04-15-2019, 07:39 PM
I for one am very pleased with the current build that was released to me (shared with the Varlyndria server). Akkadius, KLS, and anyone else involved whom I may not know of worked very hard, and they kept their promise to find a fix and delivered. I tested this with 24 clients in zone. Keep in mind your ms bar may say one thing, but in reality you can cast spells with normal refresh time. My server is a 2-box server, but I permitted 3 boxes for the purpose of our testing. Once more, thank you for taking this problem seriously and investing the time and resources to see it fixed.
Akkadius
04-15-2019, 07:54 PM
Keep in mind this is not merged to mainline, but we currently have a general fix in a working branch; we have a handful of things we need to take care of before merging to mainline.
If you're interested in the build for your server, download it at the following link and report back:
https://www.dropbox.com/s/2s2mput1q4lfiwl/win-binaries.zip?dl=0
Also, keep in mind you will need to run this update manually: https://github.com/EQEmu/Server/blob/feature/eqemu-api-data-service-netstats/utils/sql/git/required/2019_03_13_update_range.sql
Drakiyth
04-15-2019, 09:30 PM
Stellar work all around. I can't wait to fire this up on Varlyndria for everybody in tomorrow's update. Thank you very much for all you guys do to keep this place the best emulator project in the world.
almightie
05-24-2019, 04:04 PM
Hey guys, just checking in to see if there is any update on this issue and whether the fix will be pushed out.
Thank you
Akkadius
05-24-2019, 04:14 PM
Fix is mainline and on master
Huppy
05-26-2019, 04:17 PM
I meant to ask you, Akka: that fix, was it the "compression level" update that was committed? The only reason I ask is that one of my "toy boxes" is sticking with slightly older code, and I'm picking away at manually applying feasible updates when I can get away with it. :)
peterigz
12-06-2019, 03:25 PM
Hey Folks,
I posted this in Discord, but things can drift up on there, so I'm just posting here as well.
We've just completed a server update and merged in the latest changes from the eqemu master branch. Everything is great, with the exception that we seem to have now run into the dreaded Windows server lag bug as reported here: http://www.eqemulator.org/forums/showthread.php?t=42311&page=4 According to that thread it was fixed, but it seems it's still happening on our server for some reason. The symptoms are exactly the same: a resend cascade leading to big lag spikes (going by netstats).
We have Windows Server 2019, two 2.4 GHz Xeon processors, 32 GB of RAM and more bandwidth than you can shake a stick at. Any ideas as to settings to tweak or anything else are highly welcome! Meanwhile, I'll see if I can utilise those metrics from the thread to get more insights.
Uleat
12-06-2019, 04:58 PM
I haven't been on discord yet..
..but, I would start with ensuring that you have the correct zlib dll.
If it's not from 2019, I wouldn't trust it to be correct.
Make sure that you're using the one acquired from the eqemu_server.pl download option.
The new vcpkg method seems to install a zlib dll from 2018 into the build directory that seems to cause this issue.
If you do a select all -> copy -> paste from build to server install, and this dll is present in build, it will overwrite your current server copy.
As well, any older dependency-related copy will do it too.
The issue is related to build flags (mostly) and forces the encryption to operate in single-thread mode.
The copy obtained through the eqemu_server download is known to be correctly flagged (as of my commit to that repo).
peterigz
12-07-2019, 12:04 PM
Thanks again Uleat.
So after a bit of experimenting, here are some notes:
I've been compiling with the Build Zlib flag set in CMake, so I didn't actually need or use a zlib1.dll in the folder. I'm not sure what that means -- shouldn't it compile zlib in multithreaded mode in that case, or is it not configured for that in CMake?
I can untick Build Zlib, in which case I do need the zlib1.dll in the folder to run it; however, I've been compiling in 64-bit mode, so the x86 zlib1.dll from the installer doesn't work with it. I can just compile an x86 build instead and use that zlib1.dll, which is what I'll try next -- at least then I'll know for sure it is using the right zlib.