  #1  
Old 03-08-2019, 12:48 PM
Rekka
Fire Beetle
 
Join Date: Jan 2019
Location: North Carolina
Posts: 2

From what I can tell just from looking at the code, the locking isn't at the database layer per se; it's at the MySQL connection per zone. In zonedb.cpp there is only one connection per zone, at least from what I see.

Fsync is a delay/latency issue on the DB when dealing with transactions. Every single query in a save is its own transaction (all 13+ of them). You can have a system that can only do 500 transactions per second but can do 100,000 inserts per second if you bulk up statements.
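To make the idea concrete, here is a minimal sketch (not the actual EQEmu code; the query list and helper function are illustrative) of wrapping all of a save's statements in one explicit transaction, so the commit-time fsync is paid once per save instead of once per query:

Code:
// Sketch only: batch N save statements into one transaction so InnoDB's
// commit-time flush happens once per save instead of once per query.
// Table names and the helper are placeholders, not the real EQEmu schema.
#include <mysql.h>
#include <string>
#include <vector>

bool SaveCharacterBatched(MYSQL *conn, const std::vector<std::string> &save_queries)
{
    // One transaction around the whole save: one commit, one log flush.
    if (mysql_query(conn, "START TRANSACTION") != 0)
        return false;

    for (const auto &q : save_queries) {
        if (mysql_query(conn, q.c_str()) != 0) {
            mysql_query(conn, "ROLLBACK"); // abandon the partial save
            return false;
        }
    }

    return mysql_query(conn, "COMMIT") == 0;
}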

A small latency can have a massive impact when locks are involved.

If there is latency at the DB, it queues up on the zone, and the impact depends on how many players are in each zone: the fewer people in the zone, the less it affects them.

Lowering the latency by limiting the fsync on the transaction call can ease the pressure on the lock on the connection, which keeps character saves from stalling. Or that is at least the idea.
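If "limiting the fsync" means InnoDB's commit flush behaviour, the usual knob is innodb_flush_log_at_trx_commit; here is a minimal sketch (this may or may not be what the PR actually changes, and it is a durability trade-off, not a free win):

Code:
// Sketch: relax InnoDB's per-commit flush so COMMIT no longer waits on an
// fsync. Value 2 writes the log at each commit but flushes to disk roughly
// once per second, so an OS crash can lose up to that last second of saves.
#include <mysql.h>

bool RelaxCommitFlush(MYSQL *conn)
{
    return mysql_query(conn, "SET GLOBAL innodb_flush_log_at_trx_commit = 2") == 0;
}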

**Note** this lock blocks all queries in the zone, not just during saves.

Also note that removing the table scans on the pet tables (by adding indexes) helps lower the latency of the call as well.
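Purely as an illustration (the table and column names below are placeholders, not necessarily the real schema), the index addition would look something like this:

Code:
// Sketch: add an index so per-character pet lookups stop scanning the whole
// table. "character_pet_info" and "char_id" are illustrative names.
#include <mysql.h>

bool AddPetIndex(MYSQL *conn)
{
    return mysql_query(conn,
        "ALTER TABLE character_pet_info ADD INDEX idx_char_id (char_id)") == 0;
}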

It can be easy to confuse work with latency/locks. You can have a slow system doing no work.

Honestly I would like to do more work on this and I know it's a stopgap, but I figured doing 13x fewer transactions per save was a win, especially since someone in this thread commented that some of the save changes improved their latency. (Mine improved 2-3x.)

I know it could be any of several issues and I may be barking up the wrong tree, but this is simply another option that does show a very clear improvement in performance around the zone locks.

Side note: a single MySQL connection for a process is generally a less-than-ideal situation. It is too much of a choke point for network I/O. Locks should be held for nano/microseconds, not milliseconds. Possibly make separate connections for reads/writes, depending on how the threading is set up on the zone process. (Note: I have not really looked at the threading model of the zone yet, so this may be moot and may be my misunderstanding.)
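As a rough sketch of that read/write split idea (the struct and connection details are made up for the example, and error handling is omitted):

Code:
// Sketch only: keep a second MySQL connection dedicated to writes so a slow
// save cannot stall reads issued by the rest of the zone loop.
#include <mysql.h>

struct ZoneDbConnections {
    MYSQL *reads  = nullptr; // lookups the zone loop blocks on
    MYSQL *writes = nullptr; // saves and other fire-and-forget updates

    bool Connect(const char *host, const char *user, const char *pass, const char *db) {
        reads  = mysql_init(nullptr);
        writes = mysql_init(nullptr);
        return mysql_real_connect(reads,  host, user, pass, db, 0, nullptr, 0) != nullptr &&
               mysql_real_connect(writes, host, user, pass, db, 0, nullptr, 0) != nullptr;
    }
};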

Note: I'm on a phone, so sorry for the formatting/bad grammar. Very small window.
  #2  
Old 03-08-2019, 06:00 PM
Akkadius
Administrator
 
Join Date: Feb 2009
Location: MN
Posts: 2,072

Quote:
Originally Posted by Rekka View Post
…

Rekka, thank you so much for spending the time to perform some analysis and trying to help; we always appreciate folks who take the initiative to contribute to the project.

There are quite a few things I want to highlight about this, though, to contrast with what the real problems are here.

From a Performance Standpoint

EQEmu is not database heavy at all. Once upon a time we did all kinds of stupid things, but hundreds of hours have been poured into reducing our performance bottlenecks across the board in CPU, I/O, and even network. To illustrate, here is PEQ: http://peq.akkadius.com:19999/#menu_...late;help=true

With PEQ at 800-1,000 toons on the daily, we barely break 100 IOPS, with barely 1MB/s in writes and minimal spikes here and there. That is virtually nothing.

Also, with your benchmark (I assume this was you on the PR), the current stock code on some middle-of-the-line server hardware produces the following timings:

Code:
[Debug] ZoneDatabase::SaveCharacterData 1, done... Took 0.000107 seconds
[Debug] ZoneDatabase::SaveCharacterData 1, done... Took 0.000091 second
This is a sub-1ms operation that is not going to hurt even when it is synchronous.

(These timings will depend entirely on your hardware and MySQL server configuration, of course.)
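If anyone wants to reproduce that kind of timing on their own hardware, here is a minimal sketch of the same measurement using std::chrono; it is not the exact logging code in the source, just the same idea:

Code:
#include <chrono>
#include <cstdio>

// Time an arbitrary piece of work the way the debug line above does: capture
// a steady clock before and after, then print the delta in seconds.
template <typename Fn>
void TimeOperation(const char *label, Fn &&work)
{
    auto start = std::chrono::steady_clock::now();
    work();
    std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    std::printf("[Debug] %s, done... Took %f seconds\n", label, elapsed.count());
}

// Usage (SaveCharacterData stands in for whatever call you want to measure):
// TimeOperation("ZoneDatabase::SaveCharacterData 1", [&] { /* save call here */ });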

These operations also happen infrequently enough that it is not going to matter.

There are many, many factors that play into the overall performance of a server, and since the server is essentially an infinite loop, anything within that loop can influence the amount of time the CPU is not spent idling (network, I/O, overly CPU-intensive operations, etc.). Hardware is an influence, your MySQL server configuration is an influence, and most of all the software is an influence.
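To put that in code terms, here is a stripped-down sketch of a fixed-tick loop (the step names are placeholders, not the real zone functions); any call that blocks inside the loop pushes the whole tick past its budget and everything else in the zone waits:

Code:
#include <chrono>
#include <thread>

// Sketch of the "essentially an infinite loop" point: everything the zone
// does per tick (network, AI, database, etc.) shares one time budget.
void ZoneMainLoopSketch()
{
    using clock = std::chrono::steady_clock;
    const auto tick_budget = std::chrono::milliseconds(33); // ~30 ticks/sec

    while (true) {
        auto tick_start = clock::now();

        // ProcessNetwork();   // placeholder steps, not the real function names
        // ProcessAI();
        // ProcessSaves();     // a synchronous 50ms save here stalls all of the above

        auto elapsed = clock::now() - tick_start;
        if (elapsed < tick_budget)
            std::this_thread::sleep_for(tick_budget - elapsed);
        // else: the loop is already behind and work starts to back up
    }
}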

EQEmu used to be way, way more resource intensive, and we've come a long way to where that is not even an issue anymore. We have one outstanding bug isolated to the networking layer that made its way through because we never saw it on PEQ during our normal QA routines.

We are currently working on code to measure application-layer network stats so folks can dump metrics for us and we can put together a proper fix. We've seen what happens during the network anomaly in a CPU profile, and a profile alone isn't going to show much beyond where the process is spending most of its time.

We folks at EQEmu definitely have jobs that have been a deterrent to resolving said bug, but we will have it on lockdown soon enough, as we know exactly what we need to do; the only thing in our way is time as a resource.

We are not grasping at straws to fix this, folks, so please just be patient, as this is just not a quick fix with our schedules.

Quote:
Originally Posted by ptarp View Post
As another test: use the same binaries and everything else, but in the /Maps/nav directory create a subdirectory, something like /Maps/nav/removed.

Move all of the files from the /nav directory into the new subdirectory, then run the server again. The lag goes away for me.

NOTE: I'm working with highly customized server code and don't have the latest update.

As a second test, I turned off .mmf file loading and left .nav files in the /nav directory. Either solution worked for me.
Nav may "help" lag because you have less position updates being sent around from mobs not pathing or pathing less frequently along with less CPU intensive path calculations, again I will defer to my statements above that folks just be patient and we'll have a fix for folks when we have the time
  #3  
Old 03-11-2019, 03:02 AM
Akkadius
Administrator
 
Join Date: Feb 2009
Location: MN
Posts: 2,072

Update to this: KLS and I have been working on our stats internals so we can start dumping out data.

You can see a preview here of what KLS has working from the stats internals:



I have built out an API which will pipe these stats to a web front-end to display graphs and metrics of all types, so we can perform some analysis against affected servers and zones. From there we should make some good progress; we've seen a handful of things from it already.

We will update when we have more
  #4  
Old 03-11-2019, 04:27 PM
Drakiyth
Dragon
 
Join Date: Apr 2012
Posts: 545

Quote:
Originally Posted by Akkadius View Post
…


With these new changes, what time frame are we looking at for a Windows fix in the source?
  #5  
Old 03-31-2019, 12:07 AM
ptarp
Fire Beetle
 
Join Date: Jan 2010
Location: Idaho
Posts: 27

Not sure if this is still being looked at or if anything's been done, but I thought I'd mention this because it might bring some things to light.

I took code that was based around 2010 or so. (Note that was before the _character table changed to character_data.) I added custom changes from that code to newer code I grabbed from your Git around 8-2017 to use on our server. I didn't notice the difference right away because most of my testing involved a few toons here and there.

Since I noticed the problem with logging in more than 24 toons, I adjusted things in my own code and started updating it by incorporating each change from your Git. It is now up to date with 9-2018, at which point I noticed one change that really helped, but it was buried in a merge.

I thought that was all there was to it. But since then, I logged 48 toons in one zone on my server.
All seemed fine and there was no lag, until I buffed all my toons. That added to the lag because of the added work from the database.SaveBuffs(this) call.

Now I have begun comparing the code differences between the old version (before the database split) and the new one.

In the old version, the only database call in the Client::Save(uint8 iCommitNow) function went through the DBAsyncWork class unless iCommitNow was > 0. That class actually added its own thread to work on a queue of database queries without slowing down the main thread.

I don't see that thread anywhere in the new code. Maybe I'm mistaken, but I think it might help.
Don't know for sure, but just a thought.
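For what it's worth, the general shape of that kind of worker, as I understand it, is a queue of query strings drained by a dedicated thread; this is only a minimal sketch, not the old DBAsyncWork code itself:

Code:
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

// Minimal sketch of an asynchronous DB write queue in the spirit of the old
// DBAsyncWork class: the zone thread enqueues query strings and returns
// immediately; a worker thread drains the queue and runs them.
class AsyncDbQueue {
public:
    explicit AsyncDbQueue(std::function<void(const std::string &)> run_query)
        : run_query_(std::move(run_query)), worker_([this] { Run(); }) {}

    ~AsyncDbQueue() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stop_ = true;
        }
        cv_.notify_one();
        worker_.join();
    }

    void Enqueue(std::string query) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(query));
        }
        cv_.notify_one();
    }

private:
    void Run() {
        std::unique_lock<std::mutex> lock(mutex_);
        while (!stop_ || !queue_.empty()) {
            cv_.wait(lock, [this] { return stop_ || !queue_.empty(); });
            while (!queue_.empty()) {
                std::string q = std::move(queue_.front());
                queue_.pop();
                lock.unlock();
                run_query_(q); // the actual MySQL call happens off the main thread
                lock.lock();
            }
        }
    }

    std::function<void(const std::string &)> run_query_;
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<std::string> queue_;
    bool stop_ = false;
    std::thread worker_; // declared last so it starts after the members above
};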
  #6  
Old 03-31-2019, 12:17 AM
Akkadius
Administrator
 
Join Date: Feb 2009
Location: MN
Posts: 2,072

A lot has been done; if you read the thread, I have been giving updates on exactly what we've been working on. For the umpteenth time, this is not a database problem.

We built metrics into the server, we built a web admin panel, and we built a comms API to the server so we can visually see the problem.





Below is a visual of the exact problem that we are running into. This is the cascading resend problem that chokes the main thread of the application: if the processor can't keep up, the core/process collapses upon itself.

We had 60 toons come in (which isn't a problem at all for the hardware we run on) and they all ran macros that generated an excessive amount of spam. It all runs fine until the process gets behind on resends, then it cascades in its ability to keep up with the resends because of all of the packet loss.





Here is the point where resends get so bad that the server can no longer send keepalives to the clients; the clients disconnect, the process eventually catches up again, and everything flatlines.



TL;DR: the server keeps up just fine until the process buckles.

The reason for this is that the packet communications happen in the main thread, which hadn't been a problem until we discovered this scenario in recent months.

We are working on removing the client communications from the main thread so that we don't run into this buckling problem from back-pressure. Our networking should not be occurring on the main thread regardless, and getting two threads to communicate networking responsibilities over an internal queue isn't the most trivial of processes either, so it's taking us some time.
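In rough terms (this is a sketch of the pattern, not our actual networking code), the hand-off is a pair of locked queues: the main thread only swaps packets in and out of them, and the network thread owns the socket and the resends:

Code:
#include <mutex>
#include <queue>
#include <vector>

// Sketch of moving client comms off the main thread: the two threads never
// touch each other's state directly, they only exchange packets via queues.
struct Packet { std::vector<char> data; };

class PacketBridge {
public:
    // Main thread: queue a packet for the network thread to send.
    void QueueOutbound(Packet p) {
        std::lock_guard<std::mutex> lock(out_mutex_);
        outbound_.push(std::move(p));
    }

    // Network thread: take everything queued for sending in one swap,
    // so the main thread is never blocked for longer than the swap itself.
    std::queue<Packet> DrainOutbound() {
        std::lock_guard<std::mutex> lock(out_mutex_);
        std::queue<Packet> drained;
        drained.swap(outbound_);
        return drained;
    }

    // Network thread: queue a packet received from a client.
    void QueueInbound(Packet p) {
        std::lock_guard<std::mutex> lock(in_mutex_);
        inbound_.push(std::move(p));
    }

    // Main thread: drain inbound packets once per tick.
    std::queue<Packet> DrainInbound() {
        std::lock_guard<std::mutex> lock(in_mutex_);
        std::queue<Packet> drained;
        drained.swap(inbound_);
        return drained;
    }

private:
    std::mutex out_mutex_, in_mutex_;
    std::queue<Packet> outbound_, inbound_;
};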

We also can't measure changes without having taken the time that we have to build out metrics that show the problem and confirm it has actually been resolved.

Also, for context and clarity, the network code was completely overhauled at this time. While we've ironed out most things, this is our last outstanding issue, and it hasn't been an easy one from a time and resource perspective because it has been incredibly elusive, harder to reproduce, and we didn't have any way to measure or capture the problem.

Code:
== 4/16/2017 ==
KLS: Merge eqstream branch
	- UDP client stack completely rewritten should both have better throughput and recover better (peq has had far fewer reports of desyncs).
	- TCP Server to Server connection stack completely rewritten.
		- Server connections reconnect much more reliably and quickly now.
		- Now supports optional packet encryption via libsodium (https://download.libsodium.org/doc/).
		- Protocol behind the tcp connections has changed (see breaking changes section).
		- API significantly changed and should be easier to write new servers or handlers for.
		- Telnet console connection has been separated out from the current port (see breaking changes section).
			- Because of changes to the TCP stack, lsreconnect and echo have been disabled.
	- The server tic rate has been changed to be approx 30 fps from 500+ fps.
		- Changed how missiles and movement were calculated slightly to account for this (Missiles in particular are not perfect but close enough).
	
	- Breaking changes:
		- Users who use the cmake install feature should be aware that the install directory is now %cmake_install_dir%/bin instead of just %cmake_install_dir%/
		- To support new features such as encryption the underlying protocol had to change... however some servers such as the public login server will be slow to change so we've included a compatibility layer for legacy login connections:
			- You should add <legacy>1</legacy> to the login section of your configuration file when connecting to a server that is using the old protocol.  
			- The central eqemu login server uses the old protocol and probably will for the forseeable future so if your server is connecting to it be sure to add that tag to your configuration file in that section.
			- Telnet no longer uses the same port as the Server to Server connection and because of this the tcp tag no longer has any effect on telnet connections.
				- To enable telnet you need to add a telnet tag in the world section of configuration such as:
					<telnet ip="0.0.0.0" port="9001" enabled="true"/>
Also, a friendly reminder to keep in mind that we have life obligations, so we don't have most weekdays to keep hammering at this problem.

We'll let you know when we have the code fixes in place

Last edited by Akkadius; 03-31-2019 at 12:30 AM..