EQEmulator Forums - lag problems

EQEmulator Forums (https://www.eqemulator.org/forums/index.php)

- Support::Windows Servers (https://www.eqemulator.org/forums/forumdisplay.php?f=587)

- - lag problems (https://www.eqemulator.org/forums/showthread.php?t=42311)

Quote:

Originally Posted by Akkadius (Post 261898)

Update to this, KLS and I have been working on our stats internals so we can start dumping out data

You can see a preview of this here of what KLS has working from stats internals

https://media.discordapp.net/attachm...4/EQ000016.png

I have built out a API which will pipe these stats to a web front-end to display graphs and metrics of all types here so we can perform some analysis against affected servers and zones. From there we should make some good progress and we've seen a handful of things from it already

We will update when we have more

With these new changes, what time-frame are we looking at for a windows fix in the source?

Not sure if this is still being looked at or anything's been done, but I thought I'd mention this because it might bring some things to light.

I took code that was based around 2010 or so. (Note that was before the _character table changed to character_data). I added custom changes from that to code to newer code I grabbed from your GIT at around 8-2017 to use on our server. I didn't notice the difference right away because most of my testing involved a few toons here and there.

Since I noticed the problem with logging in more than 24 toons, I adjusted things in my own code, and started updating my code by incorporating each change to your GIT.. It is now up to date with 9-2018, at which time I noticed one change that really helped, but it was buried in a merge.

I thought that was all there was to it.. But since then, I logged 48 toons in one zone on my server..
All seemed fine and there was no lag.. Until I buffed all my toons. This added to the lag because of added work for database.SaveBuffs(this); function call.

Now, I have begun comparing the code difference between the old version (before the database split) and the new.

In the old version.. The only database call in Client::Save(uint8 iCommitNow) function was using the DBAsyncWork class unless iCommitNow was > 0. That particular class actually added it's own thread to work on a queue of database queries without slowing down the main thred.

I don't see that thread anywhere in the new code. Maybe I'm mistaken, but I think it would maybe help.
Don't know for sure, but just a thought.

A lot has been being done, if you had read the thread I have been giving updates on exactly what we've been working on. For the umpteenth time, this is not a database problem

We built metrics into the server, we built a web admin panel and a comms API to the server so we can visually see the problem

https://media.discordapp.net/attachm...080&height=247

https://media.discordapp.net/attachm...080&height=876

Below is the visual of the exact problem that we are running into, this is the cascading resend problem that chokes the main thread of the application if the processor can't keep up, the core/process collapses upon itself

We had 60 toons come in (Which isn't a problem at all for the hardware we run on) and they all ran macros that generated an excess amount of spam. It all runs fine until the process gets behind on resends, then cascades in its ability to keep up with the resends because of all of the packet loss

https://media.discordapp.net/attachm...080&height=538

https://media.discordapp.net/attachm...080&height=726

Here is when resends get to the point where the server can no longer send keepalives to the clients, the clients disconnect and then the process eventually catches up again and everything flatlines

https://media.discordapp.net/attachm...080&height=798

TLDR; the server keeps up just fine until the process buckles

The reason for this is that the packet communications happen in the main thread, which hasn't been a problem until we discovered this scenario in recent months

We are working on removing the client communications from the main thread so that we don't run into thus buckling problem from back-pressure. Our networking should not be occurring on the main thread regardless and getting two threads to communicate networking responsibilities over an internal queue isn't the most trivial of processes either so its taking us some time

We also can't measure changes without having taken the time that we have to build out metrics to show the problem and know that it actually has been resolved

Also, for context and clarity, the network code was completely overhaul at this time. While we've ironed out most things, this is our last outstanding issue and it hasn't been an easy one from a time and resource perspective because it has been incredibly elusive, harder to reproduce and didn't have any way to measure or capture the problem

Code:

== 4/16/2017 ==

KLS: Merge eqstream branch

        - UDP client stack completely rewritten should both have better throughput and recover better (peq has had far fewer reports of desyncs).

        - TCP Server to Server connection stack completely rewritten.

                - Server connections reconnect much more reliably and quickly now.

                - Now supports optional packet encryption via libsodium (https://download.libsodium.org/doc/).

                - Protocol behind the tcp connections has changed (see breaking changes section).

                - API significantly changed and should be easier to write new servers or handlers for.

                - Telnet console connection has been separated out from the current port (see breaking changes section).

                        - Because of changes to the TCP stack, lsreconnect and echo have been disabled.

        - The server tic rate has been changed to be approx 30 fps from 500+ fps.

                - Changed how missiles and movement were calculated slightly to account for this (Missiles in particular are not perfect but close enough).

        

        - Breaking changes:

                - Users who use the cmake install feature should be aware that the install directory is now %cmake_install_dir%/bin instead of just %cmake_install_dir%/

                - To support new features such as encryption the underlying protocol had to change... however some servers such as the public login server will be slow to change so we've included a compatibility layer for legacy login connections:

                        - You should add <legacy>1</legacy> to the login section of your configuration file when connecting to a server that is using the old protocol.  

                        - The central eqemu login server uses the old protocol and probably will for the forseeable future so if your server is connecting to it be sure to add that tag to your configuration file in that section.

                        - Telnet no longer uses the same port as the Server to Server connection and because of this the tcp tag no longer has any effect on telnet connections.

                                - To enable telnet you need to add a telnet tag in the world section of configuration such as:

                                        <telnet ip="0.0.0.0" port="9001" enabled="true"/>

Also a friendly reminder to keep in mind we have life obligations so we don't have most weekdays to keep hammering at this problem

We'll let you know when we have the code fixes in place

Thank you Kindly for the very detailed Update, I believe this was a Side effect to Thanos Snap

I for one am Very pleased with the current build that was released to me *shared with Varlydra server* Akkaidus and KLS and anyone else I may not know who was involved work very hard and they kept their promise to find a fix and they delivered. Tested this with 24 clients in zone. Keep in mind your MS bar may say one thing but in reality you can cast spells with normal fresh time. My server is a 2 box server but permitted 3 for the purpose of our testing. Once more thank you for taking this problem serious and investing time and resources to see it fixed.

Keep in mind this is not merged mainline, but we have a general fix in a working branch currently, we have a handful of things we need to take care of before merging mainline

If you're interested in the build for your server, download it at the following and report back

https://www.dropbox.com/s/2s2mput1q4...aries.zip?dl=0

Also, keep in mind you will need to run this update manually: https://github.com/EQEmu/Server/blob...date_range.sql

Quote:

Originally Posted by Akkadius (Post 262153)

Stellar work all around. I can't wait to fire this up on Varlyndria for everybody in tomorrow's update. Thank you very much for all you guys do to keep this place the best emulator project in the world.

Hey guys just checking in to see if there is any update to this issue and if the fix will be pushed out.

Thank you

Quote:

Originally Posted by almightie (Post 262597)

Hey guys just checking in to see if there is any update to this issue and if the fix will be pushed out.

Thank you

Fix is mainline and on master

I meant to ask you Akka, that fix, was that the "compression level" update that was commited ? The only reason I ask, one of my "toy boxes", is sticking to slightly older code, but picking away at manually applying feasible updates, when I can get away with it. :)

Hey Folks,

Posted this in discord but things can drift up on there so just posting here as well.

We've just completed a server update and merged in the latest changes in the eqemu master branch. Everything is great with the exception that we seem to have now run into the dreaded windows server lag bug as report here: http://www.eqemulator.org/forums/sho...t=42311&page=4 According to that thread it was fixed, but it seems it's happening to our server still for some reason. Symptoms are exactly the same, a resend cascade leading to big lag spikes (going by netstats).

We have windows server 2019, 2 zeon 2.4 processors, 32gig ram and more bandwidth then you can shake a stick at. Any ideas as to settings to tweak or other highly welcome! Meanwhile I'll see if I can utilise those metrics in the thread to get more insights.

I haven't been on discord yet..

..but, I would start with ensuring that you have the correct zlib dll.

If it's not from 2019, I wouldn't trust it to be correct.

Make sure that you're using the one acquired from the eqemu_server.pl download option.

The new vcpkg method seems to install a zlib dll from 2018 into the build directory that seems to cause this issue.

If you do a select all -> copy -> paste from build to server install, and this dll is present in build, it will overwrite your current server copy.

As well, any older dependency-related copy will do it too.

The issue is related to build flags (mostly) and forces the encryption to operate in single-thread mode.

The copy obtained through the eqemu_server download is known to be correctly flagged (as of the my commit to that repo.)

Thanks again Uleat.

So after a bit of experimenting here's some notes:

I've been compiling with the Build Zlib flag set in cmake, so I didn't actually need or use a zlib1.dll in the folder - not sure what that means, shouldn't it compile with zlib with multithreaded mode in that case or is it not configured for that in cmake?

I can untick the build with zlib, in which case I do need the zlib1.dll in the folder to run it, however I've been compiling in 64bit mode so the x86 zlib1.dll from the installer doesn't work with it. However I can just compile a x86 version instead and use the zlib1.dll which is what I'll try next, at least then I know for sure is is using the right zlib then.