For the sake of troubleshooting, can you guys try this build? No guarantees, but it will tell us something if the symptoms go away.
https://ci.appveyor.com/project/KimL...2988/artifacts |
Quote:
When I tried it the first time I got the files you put on this thread and also tried the latest unstable I pulled from there. I'll load up those new source files you posted in a few here, stress test, and then share our experiences on here. |
The experiment with the binaries you posted above failed at 18 players with a few pets out. We spiraled out of control with lag in the Nexus.
|
Sounds good, thanks for trying that
Good news: our core configurations are easy to tweak. Bad news: we don't know which settings to tweak yet. Good news again: we are working on integrating stats tooling into the server code so that you guys can give us dumps of what is occurring in the zones at the time, letting us see exactly what is happening at the network layer and give a proper prognosis. We are also working on a way to connect hundreds of headless clients to a server to simulate a stress condition, so we can debug this same issue without relying on someone with a player base and players logging in to replicate it; we can then replicate it on our own. That being said, the resend issue discussed earlier in this thread was a real issue, and we addressed the aggressive resend behavior over weeks' worth of testing on a few different servers. It seems that this network issue is common to Windows environments on a 2-2.6GHz Xeon chip whenever there are 20+ clients in the zone. We are committed to getting it resolved; you shouldn't need to switch your OS and hardware to get around this issue, so please be patient and we will get it sorted. |
I can confirm that I 100% agree with Akkadius. I attempted to upgrade our hardware and, in my folly, I was wrong: superior hardware is not going to fix this issue. I changed from a 2.0GHz Xeon processor to an Intel i7-6700 with 32GB of RAM, etc., and experienced the same issues. So at the very least we have eliminated the hardware theory.
|
We just tried running the installer on a fresh setup, and started getting lag once we passed 23 in zone.
|
Quote:
Move all of the files from the /nav directory into the new subdirectory, then run the server again. Lag goes away for me. NOTE: I'm working with highly customized server code and don't have the latest update. As a second test, I turned off .mmf file loading and left the .nav files in the /nav directory. Either solution worked for me. |
I will try this on Alternate Everquest and report back if it works for us
|
MMF loading has been disabled for some time.
The new mapping system has not been applied to mmf loads and, tbh, I am unsure of the behavior if you try to use them. Worst case scenario, you would either fall back to standard map loading or see zone crashes. The pre-built binaries do not include mmf loads as an enabled option. |
I'm waiting until Akkadius posts an official fix before I touch the maps and navmesh files. That sounds like a band-aid fix which could lead to zone crashes or NPCs screwing up, not a full repair of the problem.
|
Possible help
This may not solve your problem, but it may help (especially if you use InnoDB as your storage engine).
I took a look at how the Save method in client.cpp works and saw some issues in the way locking happens and transactions are used (or not used). From what I can tell, between the lock in front of the DB connection and the lack of bulking statements into a transaction, this could cause some serious issues for people who are not running their database on a very good SSD with a quick network connection to it. You can easily get 'fsync' choked. The symptoms are low I/O and CPU usage while everything starts lagging out, waiting for locks to be released so fsync can happen on the database for transaction purposes. I've created a pull request: https://github.com/EQEmu/Server/pull/827
**WARNING** My testing has been minimal, so use at your own risk, but I am encouraged by the results (over 2x faster with zero load on the system).
**Note** You will also need to add two indexes on tables; it's in the pull request (see the sketch after this post). It's important, as we are doing table scans during the save without them.
Hope this can help some of you. Right now it's a stop gap, and hopefully I can come up with a better solution in the near future. I would like feedback on whether it helps resolve the lag issues with pets out.
This is an example of bulking up the transactions for a simple save:
START TRANSACTION;
REPLACE INTO `character_currency` (id, platinum, gold, silver, copper, platinum_bank, gold_bank, silver_bank, copper_bank, platinum_cursor, gold_cursor, silver_cursor, copper_cursor, radiant_crystals, career_radiant_crystals, ebon_crystals, career_ebon_crystals) VALUES (685273, 184, 153, 149, 101, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);
REPLACE INTO `character_bind` (id, zone_id, instance_id, x, y, z, heading, slot) VALUES (685273, 189, 0, 18.000000, -147.000000, 20.000000, 64.000000, 0);
REPLACE INTO `character_bind` (id, zone_id, instance_id, x, y, z, heading, slot) VALUES (685273, 41, 0, -980.000000, 148.000000, -38.000000, 64.000000, 1);
REPLACE INTO `character_bind` (id, zone_id, instance_id, x, y, z, heading, slot) VALUES (685273, 41, 0, -980.000000, 148.000000, -38.000000, 64.000000, 2);
REPLACE INTO `character_bind` (id, zone_id, instance_id, x, y, z, heading, slot) VALUES (685273, 41, 0, -980.000000, 148.000000, -38.000000, 64.000000, 3);
REPLACE INTO `character_bind` (id, zone_id, instance_id, x, y, z, heading, slot) VALUES (685273, 41, 0, -980.000000, 148.000000, -38.000000, 64.000000, 4);
DELETE FROM `character_buffs` WHERE `character_id` = '685273';
DELETE FROM `character_pet_buffs` WHERE `char_id` = 685273;
DELETE FROM `character_pet_inventory` WHERE `char_id` = 685273;
INSERT INTO `character_pet_info` (`char_id`, `pet`, `petname`, `petpower`, `spell_id`, `hp`, `mana`, `size`) VALUES (685273, 0, 'Labann000', 0, 632, 3150, 0, 5.000000) ON DUPLICATE KEY UPDATE `petname` = 'Labann000', `petpower` = 0, `spell_id` = 632, `hp` = 3150, `mana` = 0, `size` = 5.000000;
DELETE FROM `character_tribute` WHERE `id` = 685273;
REPLACE INTO character_activities (charid, taskid, activityid, donecount, completed) VALUES (685273, 22, 1, 0, 0), (685273, 22, 2, 0, 0), (685273, 22, 3, 0, 0), (685273, 22, 4, 0, 0), (685273, 22, 5, 0, 0);
REPLACE INTO character_activities (charid, taskid, activityid, donecount, completed) VALUES (685273, 23, 0, 0, 0);
REPLACE INTO character_activities (charid, taskid, activityid, donecount, completed) VALUES (685273, 138, 0, 0, 0);
REPLACE INTO `character_data` (id, account_id, `name`, last_name, gender, race, class, `level`, deity, birthday, last_login, time_played, pvp_status, level2, anon, gm, intoxication, hair_color, beard_color, eye_color_1, eye_color_2, hair_style, beard, ability_time_seconds, ability_number, ability_time_minutes, ability_time_hours, title, suffix, exp, points, mana, cur_hp, str, sta, cha, dex, `int`, agi, wis, face, y, x, z, heading, pvp2, pvp_type, autosplit_enabled, zone_change_count, drakkin_heritage, drakkin_tattoo, drakkin_details, toxicity, hunger_level, thirst_level, ability_up, zone_id, zone_instance, leadership_exp_on, ldon_points_guk, ldon_points_mir, ldon_points_mmc, ldon_points_ruj, ldon_points_tak, ldon_points_available, tribute_time_remaining, show_helm, career_tribute_points, tribute_points, tribute_active, endurance, group_leadership_exp, raid_leadership_exp, group_leadership_points, raid_leadership_points, air_remaining, pvp_kills, pvp_deaths, pvp_current_points, pvp_career_points, pvp_best_kill_streak, pvp_worst_death_streak, pvp_current_kill_streak, aa_points_spent, aa_exp, aa_points, group_auto_consent, raid_auto_consent, guild_auto_consent, RestTimer, e_aa_effects, e_percent_to_aa, e_expended_aa_spent, e_last_invsnapshot, mailkey) VALUES (685273,90536,'Rekka','',0,6,13,50,396,1550636815,1552018689,22507,0,70,0,1,0,17,255,4,4,2,255,0,0,0,0,'','',164708608,345,2299,1589,60,80,60,75,134,90,83,3,-1831.625000,-225.750000,3.127999,37.500000,0,0,0,0,0,0,0,0,4480,4480,0,22,0,0,0,0,0,0,0,0,4294967295,0,0,0,0,1291,0,0,0,0,60,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,'BF01A8C0586571A2');
COMMIT;
If this doesn't help, I'm sorry if I have muddied the waters a bit. |
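The exact index definitions live in the pull request itself; purely as an illustration of the kind of index the note above is talking about, they would presumably cover the `char_id` column that the pet-table DELETEs filter on, something along these lines (hypothetical names; check PR 827 for the real definitions):
Code:
-- Hypothetical sketch only; the authoritative definitions are in the pull request.
-- Without an index on char_id, the DELETEs in the save above scan the whole table.
ALTER TABLE `character_pet_buffs` ADD INDEX `idx_char_id` (`char_id`);
ALTER TABLE `character_pet_inventory` ADD INDEX `idx_char_id` (`char_id`);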
Is there data that makes you think the DB is acting as a bottleneck?
While I don't want to undermine the helpful suggestions, I did try a lot of DB performance tuning and switching storage engines. I also set up fairly robust DB monitoring to make sure it wasn't DB performance. When I was having massive lag spikes, there was virtually no load, locks, waits, or anything else on the DB. The DB was on-box and on SSD. I didn't save the reporting, but nothing about it in my case pointed to DB performance issues. As others said, a more methodical approach is probably in order to make sure the situation doesn't get more complex. |
From what I can tell from just looking at the code, the locking isn't at the database layer per se; it's at the MySQL connection per zone in zonedb.cpp. There is only one connection per zone, at least from what I see.
Fsync is a delay or latency issue on the DB when dealing with transactions. Every single query in Save is a transaction (all 13+ of them). You can have a system that can do 500 transactions per second but can do 100,000 inserts per second if you bulk up statements. A small latency can have a massive impact when locks are involved. If there is latency at the DB, it queues up on the zone depending on how many players are in the zone; the fewer people in the zone, the less this impacts them. Lowering the latency by limiting the fsyncs from transaction calls can ease the pressure on the connection lock, which prevents stalling of the character saves. Or that is at least the idea. **Note** this lock blocks all queries in the zone, not just those during saves. Also note that removing the table scans on the pet tables (by adding indexes) helps lower the latency of the call as well.
It can be easy to confuse work with latency/locks. You can have a slow system doing no work. Honestly, I would like to do more work on this and know it's a stop gap, but I figured doing 13x fewer transactions per save was a win when someone in this thread noted how commenting out some of the saves improved their latency. (Mine improved 2-3x.) I know it can be many issues and I may be barking up the wrong tree, but this is simply another option that does show a very clear improvement in performance around the zone locks.
Side note: a single MySQL connection for a process is generally a less than ideal situation. It is too much of a choke point for network I/O. Locks should be held for nano/microseconds, not milliseconds. Possibly make separate connections for reads and writes, depending on how the threading is set up in the zone process. (Note: I have not really looked at the threading model of the zone yet, so this may be moot and may be my misunderstanding.)
Note: on a phone, so sorry for the formatting/bad grammar. Very small window :) |
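To put rough numbers on the fsync point above: if each commit costs, say, 5 ms of fsync latency (an illustrative figure, not a measurement), a save that issues 13 separate transactions holds the zone's single connection lock for roughly 13 x 5 = 65 ms, while the same work bulked into one transaction pays that cost once, about 5 ms. For anyone who wants to check whether their own database is actually fsync-bound before changing anything, these are stock MySQL/InnoDB status queries (nothing EQEmu-specific):
Code:
-- 1 = flush and fsync the log on every commit (most durable, most fsync pressure)
SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';
-- Running totals of fsync calls; sample twice a few seconds apart to get a rate
SHOW GLOBAL STATUS LIKE 'Innodb_data_fsyncs';
SHOW GLOBAL STATUS LIKE 'Innodb_os_log_fsyncs';
-- Commit counter, to compare the commit rate against the fsync rate
SHOW GLOBAL STATUS LIKE 'Com_commit';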
Quote:
There are quite a few things I want to highlight about this, though, and contrast with what the real problems are here.
On the performance standpoint: EQEmu is not database heavy at all. Once upon a time we did all kinds of stupid things, but hundreds of hours have been poured into reducing our performance bottlenecks across the board in CPU, I/O and even network. To illustrate, here is PEQ: http://peq.akkadius.com:19999/#menu_...late;help=true With PEQ at 800-1000 toons daily, we barely break 100 IOPS, with barely 1MB/s in writes and minimal spikes here and there. That is virtually nothing.
Also, with your benchmark (I assume this was you on the PR), the current stock code on some middle-of-the-line server hardware produces the following timings HTML Code:
[Debug] ZoneDatabase::SaveCharacterData 1, done... Took 0.000107 seconds
(These timings will depend entirely on your hardware and MySQL server configuration, of course.) These operations also happen so infrequently that it is not going to matter.
There are many, many factors that play into the overall performance of a server, and since the server is essentially an infinite loop, anything within that loop can influence the amount of time that the CPU is not spent idling (network, I/O, overly CPU-intensive operations, etc.). Hardware is an influence, your MySQL server configuration is an influence, and most of all software is an influence. EQEmu used to be way, way more resource intensive, and we've come a long way to where that is not even an issue anymore. We have one outstanding bug that is isolated to the networking layer and that made its way through because we never saw it on PEQ during our normal QA routines.
We are currently working on code to measure application-layer network stats so folks can give us metric dumps we can use to produce a proper fix. We've seen what happens during the network anomaly under a CPU profile, and there's not much a profile alone is going to show beyond where the code is spending most of its time. We folks at EQEmu definitely have jobs that have been a deterrent from resolving said bug, but we will have it on lockdown soon enough, as we know exactly what we need to do; the only thing in our way is time as a resource.
We are not grasping at straws to fix this, folks, so please just be patient, as this is just not a quick fix with our schedules.
|
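To make the "too infrequent to matter" point above concrete with purely illustrative numbers (the save interval here is an assumption, not a measured EQEmu value): if 1000 characters each saved about once a minute, that would be roughly 17 saves per second, and at 0.000107 seconds per SaveCharacterData call the database would spend on the order of 17 x 0.0001 ≈ 0.002 seconds of work per second on these saves, which is negligible unless something else is serializing behind a slow commit.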
Update to this: KLS and I have been working on our stats internals so we can start dumping out data.
You can see a preview here of what KLS has working from the stats internals: https://media.discordapp.net/attachm...4/EQ000016.png I have built out an API which will pipe these stats to a web front-end to display graphs and metrics of all types, so we can perform some analysis against affected servers and zones. From there we should make some good progress, and we've seen a handful of things from it already. We will update when we have more. |