EQEmulator Forums


11-03-2016 10:49 AM

Deadlock in TCPConnection::ClearBuffers from FinishDisconnect
 
After a code inspection I am pretty confident my world server hit this issue, and it seems the current eqemu code is still susceptible to it.

https://github.com/EQEmu/Server/blob...ction.cpp#L293 -> https://github.com/EQEmu/Server/blob...ction.cpp#L310 -> https://github.com/EQEmu/Server/blob...ction.cpp#L504

We double-lock MState, which is not possible, so it becomes a deadspin/deadlock, whatever you want to call it.

Need to remove the lock of MState in ClearBuffers: https://github.com/EQEmu/Server/blob...ction.cpp#L504

So far no issues in my testing, but it's hard to test all disconnect paths manually, so I'll have to let it run.
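
To illustrate the kind of change described above, here is a minimal sketch of one common way to avoid taking the same lock twice on one call path. It is not the actual TCPConnection code: the class `Connection`, the member `state_mutex_`, and the helper `ClearBuffersLocked` are hypothetical names, and `std::mutex` stands in for whatever mutex type MState really is. The idea is that the public function takes the lock and delegates to a private helper that assumes the lock is already held, so a caller that already owns the lock can use the helper directly.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a connection class; not the real TCPConnection.
class Connection {
public:
    void Queue(int value) {
        std::lock_guard<std::mutex> lock(state_mutex_);
        buffers_.push_back(value);
    }

    // Public entry point: takes the lock, then delegates to the helper.
    void ClearBuffers() {
        std::lock_guard<std::mutex> lock(state_mutex_);
        ClearBuffersLocked();
    }

    // Already holds the lock for the state change, so it calls the
    // *Locked helper rather than the public ClearBuffers(), which
    // would try to lock state_mutex_ a second time.
    void FinishDisconnect() {
        std::lock_guard<std::mutex> lock(state_mutex_);
        connected_ = false;
        ClearBuffersLocked();
    }

    bool connected() const {
        std::lock_guard<std::mutex> lock(state_mutex_);
        return connected_;
    }

    std::size_t pending() const {
        std::lock_guard<std::mutex> lock(state_mutex_);
        return buffers_.size();
    }

private:
    // Precondition: the caller holds state_mutex_.
    void ClearBuffersLocked() { buffers_.clear(); }

    mutable std::mutex state_mutex_;
    bool connected_ = true;
    std::vector<int> buffers_;
};

// Usage: queue some data, disconnect, and confirm the buffers were cleared.
inline bool DisconnectDemo() {
    Connection c;
    c.Queue(1);
    c.Queue(2);
    c.FinishDisconnect();
    return !c.connected() && c.pending() == 0;
}
```

Whether locking twice actually hangs depends on the mutex type: a plain non-recursive mutex must not be locked again by its owner, while a recursive one allows it. Splitting the function this way sidesteps the question either way.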

11-07-2016 09:23 AM

Just to update: it seems I was wrong about this being the source of the deadlock. I still don't think it's a good idea for us to lock a mutex we have already locked; it seems like bad design, and this might not be the only place it happens.

In any case, I reviewed net.cpp and found some older code that was delaying how long it takes until the reconnect happens, which I removed (so now it relies solely on the 10 second timer instead of something like 120+ seconds).
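
As a rough illustration of relying on a single fixed retry interval, here is a small sketch of a timer-gated reconnect check. It is not the code in net.cpp: `RetryTimer` and `ShouldReconnect` are hypothetical names, and the 10 second interval would be supplied by the caller.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical fixed-interval gate; not the timer class used by the server.
class RetryTimer {
public:
    using Clock = std::chrono::steady_clock;

    explicit RetryTimer(std::chrono::seconds interval) : interval_(interval) {}

    // Fires on the first call, then at most once per interval after that.
    bool Check(Clock::time_point now) {
        if (started_ && now - last_ < interval_) {
            return false;
        }
        last_ = now;
        started_ = true;
        return true;
    }

private:
    std::chrono::seconds interval_;
    Clock::time_point last_{};
    bool started_ = false;
};

// Hypothetical main-loop check: only attempt a reconnect when a login
// server is disconnected and the retry interval has elapsed.
inline bool ShouldReconnect(bool all_connected, RetryTimer& timer,
                            RetryTimer::Clock::time_point now) {
    return !all_connected && timer.Check(now);
}
```

With a single gate like this, the time between attempts is bounded by the interval alone, with no extra delay stacked on top of it.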

What I did see is that last night our first reconnect attempt to the eqemu LS failed; typically I don't even see a reconnect attempt, just the ending thread error. I also added a log message at the AutoInitLoginServer thread creation in net.cpp to track this:


20366 [11.06. - 22:37:22] [COMMON__THREADS] Ending TCPConnectionLoop with thread ID -54917376
20366 [11.06. - 22:37:24] [WORLD__INIT_ERR] Not all login servers are connected, calling AutoInitLoginServer.
20366 [11.06. - 22:37:24] [WORLD__LS] Connecting to login server: login.eqemulator.net:5998
20366 [11.06. - 22:37:24] [COMMON__THREADS] Starting TCPConnectionLoop with thread ID -546068736
20366 [11.06. - 22:37:34] [WORLD__INIT_ERR] Not all login servers are connected, calling AutoInitLoginServer.
20366 [11.06. - 22:37:34] [WORLD__LS] Connecting to login server: login.eqemulator.net:5998
20366 [11.06. - 22:37:34] [WORLD__LS] Connected to Loginserver: login.eqemulator.net:5998

Will continue monitoring to see if any issues happen again, but maybe this shorter retry is helping the situation.

