Terminology I'll be using
Tickrate - Server FPS, i.e. how many times per second the server processes new information.
Packet loss - When I say this, I mean the arithmetic mean of packet loss across all players on the server. Packet loss happens when information is sent to or from the server but doesn't reach its destination as it should.
Issue at hand
Every couple of seconds there's a 0.5-1 second lag in chat, and vehicles go out of sync; this is most noticeable when switching lanes, making turns, or accelerating/stopping.
When the issues first began, we started monitoring the server's performance for anything unusual. It was obvious this wasn't a regular old attack, because the flow of packets to and from the server seemed normal, and we have pretty sophisticated methods of catching packets that don't belong (I won't go into detail here for obvious reasons). There were two things we noticed right away:
- Increased packet loss - we're normally used to 0.1%-0.2% average packet loss per player on the server, unless there's a heavy attack going on. We noticed an increase to at least 0.3% at all times, sometimes reaching as high as 0.8%.
- Drops in Tickrate - the other obvious thing was a dip in server FPS every so often, about every couple of seconds. This was fine at lower player counts, but critical during peak. Once we reached 350-400 players, the server FPS would drop as low as 1-10 once every few seconds.
The lag began, conveniently, around the time we released the 5.6.005 update. Initially I thought we must've messed something up in the release, so the first thing we did was go over the changes made. This was a challenge, as they spanned three months of work, but as much as I tried, I couldn't find anything that'd make sense, given what I was looking for.
The next logical step was to revert to the previous revision anyway, just to rule it out. We reverted, and there were signs of the lag as early as 300 players online, plus the packet loss was still as high as ever, so we moved back to the new revision within a few hours.
Hardware Fault & Upgrade
Another possible cause of lag appearing out of the blue is faulty hardware. Back then we were running on what is, by today's industry standards, an old machine: a fourth-generation i7 CPU and a high-speed SSD. To rule out any possibility of the issue being in the hardware or the datacenter's routing, we opted for an upgrade. We're now running on a seventh-generation i7-7700K and NVMe drives, which are even faster than the already high-performing SSDs, and we even changed datacenters. The issue persisted, however, and despite getting a much-needed upgrade for our flagship, hardware wasn't the problem.
Still convinced the issue must be the throttling of the server's FPS, we ran zeex's performance profiler a few times. Here's the result after our first run:
What does that tell us? Well, the first thing that popped into my mind was "who the hell calls that much!". It turns out that our Phone timer, which handles the calculations for each player's signal (and then decides what to do, e.g. hang up their call if they've no signal), was very inefficient. Each iteration of the player loop (one iteration per player online) contained roughly seventy inner iterations, each of which performed a proximity check against a radio tower. For those of you who didn't understand a word I just said: with 500 players on, that's 35,000 distance checks every time the loop runs. That's a lot. And proximity checks are very demanding computations.
Further investigation showed that the loop is performed exactly every three seconds. Ho! The spike happens every few, say, three seconds! That must be it, right? Holy shit I saved the day yet again.
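To give you an idea of what that looks like, here's a simplified Pawn sketch of the pattern. The names (PhoneSignalTimer, MAX_TOWERS, g_TowerPos) and the 500.0 range are made up for illustration; this isn't a copy-paste from our gamemode:

```pawn
// Simplified illustration of the pattern described above - not our actual code.
#include <a_samp>

#define MAX_TOWERS 70

new Float:g_TowerPos[MAX_TOWERS][3]; // x, y, z of each radio tower

public OnGameModeInit()
{
    // Repeats every 3000 ms - matching the spike that shows up every ~3 seconds.
    SetTimer("PhoneSignalTimer", 3000, true);
    return 1;
}

forward PhoneSignalTimer();
public PhoneSignalTimer()
{
    for (new playerid = 0; playerid < MAX_PLAYERS; playerid++)
    {
        if (!IsPlayerConnected(playerid)) continue;

        new bool:hasSignal = false;
        // ~70 proximity checks per player: with 500 players online,
        // that's 35,000 distance checks every time the timer fires.
        for (new tower = 0; tower < MAX_TOWERS; tower++)
        {
            if (IsPlayerInRangeOfPoint(playerid, 500.0,
                g_TowerPos[tower][0], g_TowerPos[tower][1], g_TowerPos[tower][2]))
            {
                hasSignal = true;
                break;
            }
        }

        if (!hasSignal)
        {
            // no signal: e.g. hang up the player's active call here
            printf("player %d lost phone signal", playerid);
        }
    }
    return 1;
}
```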
We put a patch in the same day and, well, the spike that happens every couple of seconds got milder, but it was still there. We were still seeing drops down to 10-30 server FPS every now and then, which could definitely cause the issues we're having. A second run of the performance profiler confirmed the phone issue was gone, but there were some new obvious suspects.
OnPlayerUpdate is a callback that runs every time there's anything at all new about your client (when you walk, move your camera, etc.). I found an old feature still residing there, which looped over every player online once every thirty OnPlayerUpdate hits. This would explain why the issue gets more severe with more players on: more players means both more OnPlayerUpdate calls and a longer loop inside each one, so the cost grows roughly with the square of the player count. Anyway, like I said: old feature, so I just removed it.
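For reference, the shape of it was roughly this (the counter name is made up, and the body of the old feature is omitted since it's gone now anyway):

```pawn
// Rough illustration of the removed pattern - not the original code.
#include <a_samp>

new g_UpdateCount[MAX_PLAYERS];

public OnPlayerUpdate(playerid)
{
    // OnPlayerUpdate fires many times per second for every player.
    if (++g_UpdateCount[playerid] < 30)
        return 1;
    g_UpdateCount[playerid] = 0;

    // Every 30th update, loop over everyone online. More players means both
    // more OnPlayerUpdate calls and a longer loop inside each one, so the
    // total cost grows roughly with the square of the player count.
    for (new i = 0; i < MAX_PLAYERS; i++)
    {
        if (!IsPlayerConnected(i)) continue;
        // ... whatever the old feature did for this pair of players ...
    }
    return 1;
}
```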
There wasn't anything suspicious really, except OnPlayerPropertyTickCount. That's a property-system callback executed every second, which primarily handles displaying checkpoints to players near houses. Although some optimizations for it were in place, at peak (let's say 500 players) it could yield about a million iterations a second (500 players x 2000 properties), and, you guessed it, a lot of them contain proximity checks (player to house). Okay, that's really gotta be it. I mean it's just so, so obvious this is it, and I've saved th--
Yeah well, this is where it gets interesting. I did manage to stabilize the tickrate at 200-300 at peak, with mild drops to 80-120 every now and then. While that could still use some work, it's far better than what it used to be. However, the lag did not care. We ran another round of profiling, which showed there wasn't a single function left unnecessarily hogging server resources.
Back to square... 1?
I guess. At this point I question whether my first instinct, chasing server performance, was correct. Even with the server FPS drops gone, both the packet loss and the lag are still present, so the network side looks like the culprit after all. Originally I'd assumed that if the network were the issue, it'd behave the same regardless of how many players are online. But perhaps as the player count grows, the volume of packets grows with it, our protection gets more sensitive and begins filtering out packets it shouldn't.
Since then, we've tried many things, including but not limited to: server config tweaking with different stream rates, update rates and other settings, to no avail. We tried disabling or tweaking our DDoS protection and firewalls, both hardware and software; no significant difference there either. Among other things, we also upgraded all of our software and optimized our MySQL tables.
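For context, these are the kinds of rate settings I mean by stream rates and update rates. The values below are just illustrative defaults for a stock SA-MP-style server.cfg, not our production config:

```
stream_rate 1000
stream_distance 300.0
onfoot_rate 40
incar_rate 40
weapon_rate 40
```

Roughly speaking, stream_rate and stream_distance control how often and how far the server streams entities to each client, while the onfoot/incar/weapon rates cap how frequently clients send their position and weapon updates. We tried various combinations of these without any measurable improvement.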
Next in store
There are still a couple more options we can try. We'll be looking into changing our database engine and doing further optimizations in areas we think could be responsible. I'll keep this post updated.
Some of you expressed interest in helping us diagnose the issue, but given the circumstances and how unique the issue is, that'd probably cause more harm and headache than good. We do appreciate everyone who's been patient while we work on resolving this; we'd been hard at work trying to fix it before the anniversary even started. I know very well how frustrating it is, and none of us will rest until it's fixed. I also wanted to show you a side of development different from the one you're used to seeing, so I hope the read was at least a bit interesting.