Roy · Posted January 17, 2021

Hi everyone,

I just wanted to provide another update on the recent performance issues we've been suffering from. This is more of a follow-up to this update, but I wanted to make it a separate thread since we will be performing maintenance this upcoming Tuesday (January 19th).

Machines Overloaded + New Machine Issues

Our game server machines have been overloaded recently, especially GS16, which hosts our Rust servers specifically. We purchased another machine with the Intel i9-10900K and tried moving some servers to it for load-balancing last week. However, we ran into even more complicated issues regarding RPF policies (filters put in place to prevent spoofed traffic from originating inside our hosting provider's network). We ran into these issues because, with our current setup, we are technically "spoofing" as our Anycast network when sending traffic from our game server machines directly (we aren't announcing the Anycast IP ranges via BGP on the game server machines themselves, and there's no real way to do that right now that doesn't conflict with our current Anycast setup). Resolving the issue is very tricky, and it has become a huge pain to deal with short of purchasing a switch for GFL (read below). Unfortunately, we had to revert the move last week after finding it too complicated to get things working on the new machine.

At first, we were hoping to set up a BGP session on our game server machines and announce the specific Anycast IPs allocated to each game server machine without traffic from the Internet also routing to them. This would let us send traffic out as the Anycast IPs since they're being announced, so RPF would succeed. Unfortunately, there's no way to prevent the Internet from routing to the game server machines in that case, and it wouldn't work anyway because we have an IPIP setup, meaning the game server machines expect incoming packets to arrive in IPIP format. If we went with a solution like this, we'd literally need something to encapsulate each incoming packet into IPIP, and even then we'd need to keep track of the internal IPs, which would just be a mess (more overhead from encapsulating incoming packets, packets growing by 20 bytes each, and a solution to map internal IPs properly; there's a rough sketch of that encapsulation a bit further down). This isn't ideal in my opinion.

A solution our hosting provider (GSK) offered was purchasing a switch for GFL; we would then plug all of our game server machines into this switch along with our Dallas POP (which is hosted with GSK). Since the Dallas POP already announces our Anycast range and would be plugged into the same switch as our game server machines, this would allow us to send traffic out as our Anycast network directly without RPF policies dropping it, because our Anycast IP ranges are announced on that switch by the Dallas POP. I believe this is the best option for us, since we wouldn't have to worry about being unable to spoof as our Anycast network from our game server machines. I've decided to accept this solution.

With that being said, 30 or so seconds of downtime will be required for each game server machine and our Dallas POP. That downtime would be network-related, and this assumes the move goes smoothly. I'd like to move our least popular machine to the switch first and ensure everything is working properly. Afterwards, we will move the others.
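As a quick aside on the IPIP overhead mentioned above: here's a minimal sketch, purely for illustration and not our production code, of what wrapping each incoming packet in an outer IPv4 header would involve. The function name and buffer handling are hypothetical; the point is simply that every packet grows by the 20-byte outer header and something on our side would have to fill that header in.

```c
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/ip.h>   /* struct iphdr: 20 bytes without options */

/* Hypothetical helper: prepend an outer IPv4 header (protocol 4, IP-in-IP)
 * to an existing IP packet. Every packet grows by sizeof(struct iphdr),
 * i.e. 20 bytes, which is the overhead discussed above. */
static int ipip_encap(const uint8_t *inner, size_t inner_len,
                      uint32_t outer_src, uint32_t outer_dst,
                      uint8_t *out, size_t out_cap)
{
    struct iphdr outer;

    if (inner_len + sizeof(outer) > out_cap)
        return -1;

    memset(&outer, 0, sizeof(outer));
    outer.version  = 4;
    outer.ihl      = 5;                          /* 5 * 4 = 20-byte header */
    outer.tot_len  = htons((uint16_t)(sizeof(outer) + inner_len));
    outer.ttl      = 64;
    outer.protocol = IPPROTO_IPIP;               /* tells the receiver to decapsulate */
    outer.saddr    = outer_src;                  /* network byte order */
    outer.daddr    = outer_dst;
    /* Header checksum omitted for brevity; a real encapsulator must set it. */

    memcpy(out, &outer, sizeof(outer));
    memcpy(out + sizeof(outer), inner, inner_len);
    return (int)(sizeof(outer) + inner_len);
}
```

On top of this per-packet work, we'd still need a way to map each internal IP correctly, which is why I don't consider this approach worth it compared to the switch.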
We're likely going to move the new game server machine and the Dallas POP first to ensure things are running okay on them and that we're able to send traffic out as the Anycast IPs that aren't associated with existing game servers (IPs tied to existing game servers would fail anyway, since that traffic is still directed towards a different switch). For the game server machines in production use, I'm planning for the maintenance to start after my current job finishes on Tuesday (January 19th), which is 4 PM CST. Therefore, the timeframe will most likely be 4 - 7 PM CST on Tuesday. There's a high chance we'll finish sooner, but I wanted to provide a longer window just in case. We'll likely try getting the Dallas POP and the new game server machine moved earlier on Tuesday if possible, just so I can test them while on breaks or at lunch at my current job.

GS16 Heating Issues + Poor Performance

Last night, I was informed that many of our modified Rust servers (running additional addons and whatnot) were experiencing poor performance, along with Vanilla from time to time. I started looking into this and immediately saw an issue: the processor was downclocking from 4.9 GHz to ~4.4 - 4.6 GHz. This was a huge issue. The CPU temperatures reported via the sensors command seemed mostly okay (~85 - 90°C), but our hosting provider confirmed the downclocking was due to thermal throttling, so those readings were most likely inaccurate. (I've included a small sketch at the end of this section of checking the clock and temperature directly from sysfs.)

I thought I had stress-tested this machine specifically before putting it into production use. However, after thinking about it, I believe I only stress-tested the GS14 and GS15 machines (GS15 having the same specs as GS16). The GS15 machine is performing fine, and we ran additional stress testing on it (without impacting performance) to ensure it can take the same amount of load as GS16 without thermal throttling. I believe we were trying to get GS16 up as fast as possible because of the performance issues we were having on our old machines, but I should have still stress-tested this machine before production use, and I apologize for that.

Our hosting provider believes reapplying the thermal paste on the CPU will resolve this issue, and I agree. Therefore, at some point (unknown when), we will most likely take the machine out and reapply the thermal paste. This would result in 30 minutes or so of downtime, assuming everything goes smoothly. I'm not sure yet whether we'll also stress-test the machine after it comes back up to confirm it can handle high loads without thermal throttling; that would unfortunately require more downtime, unless we're willing to let server performance be heavily impacted during the test. I still have no ETA, but since the switch maintenance on Tuesday should allow us to move servers to the new machine without issues, I don't believe this is urgent. Moving servers to the new machine will lower the load on GS16, which will result in cooler CPU temperatures and the CPU running at 4.9 GHz on all cores. Once I have an update on the GS16 maintenance specifically, I'll let everybody know.

In the meantime, I believe we've set our Rust 10x server's tick rate to 20 instead of 30 to see if that helps with the performance issues at higher player counts for now. A lower tick rate should result in less CPU consumption from the server itself, so it should improve performance at higher player counts even while the CPU is throttled to 4.4 - 4.6 GHz.
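As an aside on spotting the throttling: the sensors output looked okay, but the clock told the real story. Below is a minimal sketch of reading the current CPU clock and a thermal zone straight from sysfs on Linux. The paths are standard, but which thermal zone maps to the CPU package varies per machine, so treat this as an illustration rather than our actual monitoring setup.

```c
#include <stdio.h>

/* Minimal sketch: read CPU 0's current clock and the first thermal zone
 * from sysfs. A clock that sits well below the expected all-core boost
 * (4.9 GHz here) while the machine is under load is a strong hint that
 * the CPU is throttling, even if reported temperatures look acceptable. */
int main(void)
{
    long khz = 0, millideg = 0;
    FILE *f;

    f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq", "r");
    if (f) { fscanf(f, "%ld", &khz); fclose(f); }

    f = fopen("/sys/class/thermal/thermal_zone0/temp", "r");
    if (f) { fscanf(f, "%ld", &millideg); fclose(f); }

    printf("cpu0: %.2f GHz, thermal_zone0: %.1f C\n",
           khz / 1e6, millideg / 1000.0);
    return 0;
}
```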
The drawback to the lower tick rate is that the server itself won't be as smooth as before (since it's calculating fewer ticks per second), but hopefully it isn't really noticeable. This is only a temporary measure anyway, unless nobody notices a difference running at 20 instead of 30 ticks per second.

Conclusion

I understand the frustration recently in regards to poor performance. We saw a huge spike in players recently, as stated in the other update post, and we did try moving servers to a new machine last week but ran into complicated issues. The truth is, we have a very unique setup since we own the network itself, and while this comes with many pros, it also makes things a lot more complicated (e.g. having to find ways to obey RPF policies and whatnot). Thankfully, we'll be doing things a lot differently with the new packet processing/forwarding/filtering software we're developing for the Anycast network, but that's still in development and quite complicated, as you can see in this post. The new software will mean we don't have to obey RPF policies at all, since we won't be sending traffic out as our Anycast network directly.

Overall, we'll be performing maintenance on our game server machines and Dallas POP on Tuesday, January 19th (2021). This will result in ~30 seconds of network downtime for each game server machine and the Dallas POP, assuming all goes well. We're going to move our Dallas POP and the new machine to our new switch first to make sure things are working. Once things are confirmed working with the new machine and POP, we will move the rest of our game server machines (GS14, GS15, and GS16) to the switch from 4 - 7 PM CST (after my current job ends for the day). Afterwards, @Aurora will be able to offload servers to the new machine without running into issues, assuming everything works correctly. This alone should resolve GS16's thermal throttling issues mentioned above, since it will lower the CPU load and temperatures, and the CPU will be clocked at 4.9 GHz on all cores again.

We'll also want to schedule a time at some point to take the GS16 machine out and reapply the thermal paste. This would likely result in at least 30 minutes of downtime, but it isn't urgent right now since we're going to be moving servers after the Tuesday maintenance, which will bring the machine's load down and let the CPU clock at 4.9 GHz like normal. More updates on that will be announced at a later time.

Server performance should improve after the machines are load-balanced. If servers are still experiencing performance issues after that, and it is confirmed both that the CPU is clocked at 4.9 GHz on all cores and that it isn't a network-related issue, the server itself is most likely unoptimized. At that point, the server's managers need to look through its addons to see if anything is consuming too many CPU cycles, change the server's tick rate, adjust the player count, and so on.

I also want to state that this is in no way our hosting provider's (GSK's) fault. It's really good that they have RPF policies (not enough hosting providers do this in my opinion); it's just unfortunate that our setup is unique and currently relies on spoofing traffic out as our Anycast network. I really appreciate GSK's help with this as well, since a lot of hosting providers wouldn't allow you to spoof traffic at all.

If you have any questions, please let me know, and once again, I apologize for the inconvenience recently.
I hope this post clears things up and allows people to see what issues we're running into while trying to resolve everything. I'm trying to be as transparent as possible in regards to these issues and our solutions to them. Thank you for your time.
Roy · Posted January 20, 2021

The switch move is completed. We did have more downtime than expected, and I apologize for that, but the move occurred earlier in the morning when not as many players were on our servers. Please see the Discord announcement below for more information.

Quote

All servers should be back up. I had to restart the IPIP Direct program I made (which sends traffic back to the client directly) because it appears the switch move changed the gateway's MAC address; the program retrieves that MAC address once at startup (via https://github.com/gamemann/IPIPDirect-TC/blob/master/src/IPIPDirect_loader.c#L46). This was something I wasn't expecting, and it did result in more downtime until I woke up and was able to restart the program. I apologize for the inconvenience. I've confirmed I'm able to connect to the servers, and a VM I have in NJ (which routes through the NYC POP) can retrieve A2S_RULES responses without any issues. If you're still experiencing issues, please let me know. Thank you.
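A note on the stale MAC issue for anyone curious: when a program resolves the gateway's MAC only once at startup, any later change to the gateway (a switch move, a failover, new hardware) leaves it sending frames to the old address until it's restarted. Below is a rough, hypothetical sketch of looking a gateway's MAC up from the kernel's ARP table; it is not the actual IPIP Direct code (the real lookup is in the linked source), but re-running something like this periodically, or reacting to netlink neighbor events, is one way to avoid needing a manual restart.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch: find a gateway's MAC address in /proc/net/arp.
 * Not the actual IPIP Direct lookup; it just illustrates the idea of
 * re-resolving the MAC instead of caching it once at startup. */
static int gateway_mac(const char *gw_ip, char *mac_out, size_t cap)
{
    FILE *f = fopen("/proc/net/arp", "r");
    char line[256];
    int found = 0;

    if (!f)
        return -1;

    fgets(line, sizeof(line), f); /* skip the header row */
    while (fgets(line, sizeof(line), f)) {
        char ip[64], hw_type[16], flags[16], mac[64], mask[16], dev[32];
        if (sscanf(line, "%63s %15s %15s %63s %15s %31s",
                   ip, hw_type, flags, mac, mask, dev) == 6 &&
            strcmp(ip, gw_ip) == 0) {
            snprintf(mac_out, cap, "%s", mac);
            found = 1;
            break;
        }
    }
    fclose(f);
    return found ? 0 : -1;
}

int main(void)
{
    char mac[64];

    /* "10.0.0.1" is a placeholder gateway IP, not our real one. */
    if (gateway_mac("10.0.0.1", mac, sizeof(mac)) == 0)
        printf("gateway MAC: %s\n", mac);
    else
        printf("gateway not found in ARP table\n");
    return 0;
}
```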