Author Topic: Big-ish server, server software looks to be the cause, please help

BloodyIron

With the whole pandemic going on and Avorion 1.0 releasing, we put together an event where we wiped our Galaxy and tried to get more gamers engaged on our server ( https://www.lanified.com/News/2020/Join-Our-Galaxy-Adventures ). We have been successful at getting more gamers to play regularly, but we have had performance issues with the Avorion dedicated server software simply not scaling well. We cannot identify a hardware or resource bottleneck in our environment, so we are confident (though we would love to be proven wrong) that the issue lies with inefficiencies in the game server itself.

Before we start uploading server logs and the like, I want to describe the resources we've allocated and what we've tried so far.

Since we launched the new Galaxy we've had a steady stream of gamers, with anywhere from 4 to 13 connected simultaneously (the limit is currently set to 50) and people joining from around the globe. We're in Canada.

Pretty much the whole time the server has been up since we launched this Friday, players keep seeing "Server load is over 100%" in the top left, and the in-game server frame graphs (what I assume they are) are continually yellow or red, rarely green. When we see lots of red, hit registration goes out the window and other wonkiness ensues, until it clears up for whatever reason.

We can't yet find a pattern in what players are doing when this happens. I don't believe anyone has fighters yet, and there most certainly aren't any wars going on.

As for the resourcing... the Avorion dedicated server runs in an Ubuntu 16.04 VM with regular updates. We may upgrade to 18.04 in the near future, but not right now, as the server is in regular use.

The VM resides on a striped-mirror array of SSD storage running ZFS, which is very fast. When we look at the ZFS IO stats we do not see any concerning level of IOPS or throughput that could be a bottleneck, so we are very confident this is not a storage issue.
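For reference, the kind of check we've been running is nothing fancy, roughly the following (the pool name here is just a placeholder for ours):

zpool iostat -v tank 5    # per-vdev IOPS and throughput, refreshed every 5 seconds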

We have, however, scaled up the CPU, RAM, and worker counts for the server in several steps to try to address the issue, and while that has helped, it has not fully eliminated the problem.

First step:
CPU: 6x cores
RAM: 4GB

Second step:
CPU: 12x cores
RAM: 8GB

Third step:
CPU: 16x cores
RAM: 24GB

As we upped the CPU we upped the workers (current values are listed below), and the earlier steps did help, but we feel we have plateaued on the CPU/worker side at this point. The RAM increases were certainly necessary, as we are now using 12GB of our 24GB. We can add more RAM if need be, but I see no reason to do that.
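That usage figure is just from a plain memory check inside the VM, along the lines of:

free -h    # reports roughly 12G used out of the 24G total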

Here are the current worker parameters:
workerThreads=32
generatorThreads=16
scriptBackgroundThreads=16

We tried 16/8/8 when we had 16x CPU cores, but we have not seen a difference between 16/8/8 and 32/16/16, so for now I am leaving it at 32/16/16, since I have to restart the server each time to change these values (and I don't want to scare away or frustrate our gamers).

I have also had profiling enabled multiple times and looked at the worker map HTML for the 32/16/16 configuration, generating that map while a "red storm" (red graphs) was happening, and nothing stuck out as "we need more workers" or otherwise problematic.

Additionally, when profiling is off, the server console frequently prints "Server Frame took over 1 second" and then shows what looks like a frame timing chart.

The VM runs on Proxmox VE, which is effectively Linux KVM at the core of the hypervisor, and a very fast, solid hypervisor at that. The VM currently sits on a host that is a Dell R720 with 2x Intel Xeon E5-2650 ("v0") CPUs. The Linux OS reports a load average of 6/6/6, so while we have lots of worker threads, I do not see any sort of thread queueing being an issue here.
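The load-average check itself is just the standard one:

uptime    # load average sits around 6.00, 6.00, 6.00 on this 16-vCPU VM

With 16 vCPUs allocated, a sustained load average around 6 means there are consistently fewer runnable threads than cores, i.e. the CPUs are not saturated.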

As for upload bandwidth, the connection is 15 Mbps and the Avorion server uses about 4-5 Mbps, and nothing else is pushing us close to our upload limit. The gateway is a very reliable and fast pfSense box, and its CPU is not getting pinned, so it is not a routing performance issue (I see these issues on LAN too, btw).
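For anyone who wants to sanity-check their own uplink the same way, a quick live view of throughput on the server (the interface name here is a placeholder) is something like:

iftop -n -i eth0    # live per-connection bandwidth, no DNS lookups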

So, at this point, I am very confident this is a software efficiency issue.

I can provide server logs privately to the developers and other debug info where possible. I'm not comfortable posting that publicly.

If there are any areas I may have overlooked, I'm all ears, because I've run out of online resources that I could dig up to help me with this problem. I read through all of the release notes for the last 2 years and could not find anything to help with this case.

We really want to be able to scale this up to 20-30+ gamers, but based on what I'm seeing, the experience is going to suffer more and more as we scale up.

Please help!



BloodyIron

Also, the server load looks like this a lot: https://i.imgur.com/1dUy9ln.png

At the time of that screenshot, we had only 8 gamers on the server.



BloodyIron

Is this the wrong section here? Should I be posting elsewhere? The performance really is a big issue here. :(



FuryoftheStars

It’s possible no one knows enough about this yet.  Sorry. :(



BloodyIron

Quote from: FuryoftheStars
It’s possible no one knows enough about this yet. Sorry. :(

Well, I certainly can understand that. I just want to make sure the right eyes are seeing this, so we can help get it addressed through bug reports, debug filings, etc. :P

So if this is the right forum, great. If it isn't, where should I post instead? Hmmmm.



BloodyIron

So, curious development: I increased "aliveSectorsPerPlayer" to 12 (was 7), decreased "workerThreads" to 16 (was 32) and "generatorThreads" to 8 (was 16), and left "scriptBackgroundThreads" at 16. The server now seems to be running better.
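For clarity, the relevant settings now read:

aliveSectorsPerPlayer=12
workerThreads=16
generatorThreads=8
scriptBackgroundThreads=16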

Right now only two players are online, so we'll have to see how it holds up when more players jump back on, but the "server load is over 100%" alert is not currently appearing for players (again, with just 2 online).

I still think this game needs actual programmatic tuning, but I figured I'd share this possible improvement.



BloodyIron

The server load didn't seem to increase until a particular player logged on, destroyed 5-6 ships, and started salvaging them. So that seems like a probable cause of the drastic server load increase, and an area where efficiency improvements would probably help a lot.

Not 100% sure, but IMO worth looking into, devs. ;P