Thread: lag problems
  #48  
03-31-2019, 12:17 AM
Akkadius
Administrator
 
Join Date: Feb 2009
Location: MN
Posts: 2,071

A lot has been done. If you had read the thread, you would have seen that I have been giving updates on exactly what we've been working on. For the umpteenth time, this is not a database problem

We built metrics into the server, plus a web admin panel and a comms API to the server, so we can see the problem visually





Below is the visual of the exact problem we are running into: the cascading resend problem. If the processor can't keep up, resends choke the main thread of the application and the core/process collapses upon itself

We had 60 toons come in (which isn't a problem at all for the hardware we run on), and they all ran macros that generated an excessive amount of spam. It all runs fine until the process gets behind on resends; from there it falls further and further behind, because all of the packet loss generates still more resends





Here is where resends get to the point that the server can no longer send keepalives to the clients. The clients disconnect, the process eventually catches up again, and everything flatlines



TL;DR: the server keeps up just fine until the process buckles
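
To make the tipping point concrete, here is a toy model of the behavior described above. It is only a sketch, not the server's actual networking code: the function name `simulate_backlog`, the per-tick capacity, and the idea that each backlogged packet triggers a fixed percentage of extra resends are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model (hypothetical, not EQEmu code): one thread can handle at most
// `capacity` packets per tick. Anything it cannot handle stays in a backlog,
// and every backlogged packet causes `resend_percent`% extra resend traffic
// on the next tick. A one-off `burst` arrives on the first tick only.
// Returns the backlog size after each tick.
std::vector<std::uint64_t> simulate_backlog(std::uint64_t steady_load,
                                            std::uint64_t burst,
                                            std::uint64_t capacity,
                                            std::uint64_t resend_percent,
                                            int ticks) {
    assert(capacity > 0);
    assert(ticks >= 0);

    std::vector<std::uint64_t> history;
    history.reserve(static_cast<std::size_t>(ticks));

    std::uint64_t backlog = 0;
    for (int tick = 0; tick < ticks; ++tick) {
        // Resends generated by packets that were not handled in time.
        const std::uint64_t resends = backlog * resend_percent / 100;

        std::uint64_t offered = steady_load + backlog + resends;
        if (tick == 0) {
            offered += burst;  // the spam burst lands on the first tick
        }

        const std::uint64_t processed = std::min(offered, capacity);
        backlog = offered - processed;
        history.push_back(backlog);
    }
    return history;
}
```

In this model, with a steady load of 80 against a capacity of 100 and a 50% resend rate, a burst of 50 drains back to zero within a few ticks, while a burst of 70 pushes the backlog past the point where resends alone exceed the spare capacity, so it keeps growing every tick. That is the "fine until it buckles" shape, under these made-up numbers.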

The reason for this is that packet communications happen in the main thread, which hadn't been a problem until we discovered this scenario in recent months

We are working on removing the client communications from the main thread so that we don't run into this buckling problem from back-pressure. Our networking should not be occurring on the main thread regardless, and getting two threads to communicate networking responsibilities over an internal queue isn't the most trivial of processes either, so it's taking us some time
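
For anyone curious what "two threads talking over an internal queue" looks like in general, here is a minimal sketch. It is not the code we are writing for the server; `PacketQueue`, `demo_round_trip`, and the use of plain strings as stand-in packets are hypothetical names chosen for the example, and it assumes a C++17 compiler.

```cpp
#include <condition_variable>
#include <deque>
#include <iterator>
#include <mutex>
#include <optional>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical thread-safe FIFO shared between a game loop and a network thread.
template <typename T>
class PacketQueue {
public:
    // Producer side: enqueue one item and wake a waiting consumer.
    void push(T item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push_back(std::move(item));
        }
        ready_.notify_one();
    }

    // Non-blocking: take everything queued so far (e.g. once per game tick).
    std::vector<T> drain() {
        std::lock_guard<std::mutex> lock(mutex_);
        std::vector<T> out(std::make_move_iterator(items_.begin()),
                           std::make_move_iterator(items_.end()));
        items_.clear();
        return out;
    }

    // Blocking: wait for an item. Returns nullopt only once the queue has
    // been closed AND fully emptied, so queued items are never dropped.
    std::optional<T> wait_pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return closed_ || !items_.empty(); });
        if (items_.empty()) {
            return std::nullopt;
        }
        T item = std::move(items_.front());
        items_.pop_front();
        return item;
    }

    // Signal that no more items will be pushed; wakes all waiters.
    void close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        ready_.notify_all();
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::deque<T> items_;
    bool closed_ = false;
};

// Demo: the caller stands in for the main thread and only enqueues;
// a separate "network" thread does the (pretend) sending.
std::vector<std::string> demo_round_trip(const std::vector<std::string>& packets) {
    PacketQueue<std::string> outbound;
    std::vector<std::string> sent;  // written only by the network thread until join()

    std::thread network([&] {
        while (auto packet = outbound.wait_pop()) {
            sent.push_back("sent:" + *packet);
        }
    });

    for (const auto& packet : packets) {
        outbound.push(packet);
    }
    outbound.close();
    network.join();
    return sent;
}
```

The point of the design is that the main thread never waits on the network: it pushes and moves on, and if sending falls behind, the backlog builds up on the network thread's side instead of stalling the game loop. The hard part in a real server is everything this sketch leaves out, such as per-connection state, resend timers, and bounding the queue.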

We also couldn't measure changes without first taking the time to build out metrics that show the problem and let us confirm it has actually been resolved

Also, for context and clarity, the network code was completely overhauled at this time (see the changelog below). While we've ironed out most things, this is our last outstanding issue, and it hasn't been an easy one from a time and resource perspective: it has been incredibly elusive and hard to reproduce, and we didn't have any way to measure or capture the problem

Code:
== 4/16/2017 ==
KLS: Merge eqstream branch
	- UDP client stack completely rewritten should both have better throughput and recover better (peq has had far fewer reports of desyncs).
	- TCP Server to Server connection stack completely rewritten.
		- Server connections reconnect much more reliably and quickly now.
		- Now supports optional packet encryption via libsodium (https://download.libsodium.org/doc/).
		- Protocol behind the tcp connections has changed (see breaking changes section).
		- API significantly changed and should be easier to write new servers or handlers for.
		- Telnet console connection has been separated out from the current port (see breaking changes section).
			- Because of changes to the TCP stack, lsreconnect and echo have been disabled.
	- The server tic rate has been changed to be approx 30 fps from 500+ fps.
		- Changed how missiles and movement were calculated slightly to account for this (Missiles in particular are not perfect but close enough).
	
	- Breaking changes:
		- Users who use the cmake install feature should be aware that the install directory is now %cmake_install_dir%/bin instead of just %cmake_install_dir%/
		- To support new features such as encryption the underlying protocol had to change... however some servers such as the public login server will be slow to change so we've included a compatibility layer for legacy login connections:
			- You should add <legacy>1</legacy> to the login section of your configuration file when connecting to a server that is using the old protocol.  
			- The central eqemu login server uses the old protocol and probably will for the foreseeable future so if your server is connecting to it be sure to add that tag to your configuration file in that section.
			- Telnet no longer uses the same port as the Server to Server connection and because of this the tcp tag no longer has any effect on telnet connections.
				- To enable telnet you need to add a telnet tag in the world section of configuration such as:
					<telnet ip="0.0.0.0" port="9001" enabled="true"/>
Also, a friendly reminder: we have life obligations, so most weekdays we can't keep hammering at this problem

We'll let you know when we have the code fixes in place

Last edited by Akkadius; 03-31-2019 at 12:30 AM..