Roy · Posted April 3, 2020

Hey everyone! I just wanted to provide an update/announcement on the recent performance issues on our game servers running under our Anycast network (92.119.148.0/24). I want to announce that I created a fix for issue #1 (read below).

POP Server Overloading From Game Server Outbound Traffic (Issue #1)

For months, we've been suffering from performance issues on our NYC servers. One server where this was very noticeable was Rust Modded. For more information on the issue itself, you may read the end of this post here. Recently, I've been working hard to create a fix for this issue. I made many attempts to resolve it using standard Linux functionality such as IPTables. You can read more about that here. Unfortunately, these attempts didn't work because SRCDS doesn't support binding to multiple interfaces (e.g. one for receiving and the other for sending).

After learning more about the TC hook, which allows a program to modify ingress and egress traffic, I decided to try creating a TC BPF program that uses the TC egress hook. I wanted the program to attach to the TC egress filter (outgoing packets) and check for IPIP packets. Once an IPIP packet is found, it makes sure the outer IP header's source address matches the interface's IP address. This tells us it's definitely an outgoing IPIP packet. From here, it makes a copy of the outer IP header's destination address, which would be the forwarding server's IP (in our case, the game server/Anycast IP), and then strips the outer IP header. Afterwards, it replaces the inner IP header's source address with that copied address and recalculates the layer 3 (IP) and layer 4 (transport protocol such as UDP, TCP, and so on) checksums.

While this was fairly difficult due to the complexity of TC + BPF, I learned a lot and got through it. After multiple headaches, I'm happy to announce I created a program that should resolve this specific performance issue!
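To make the steps above concrete, here is a minimal userspace sketch in Python. It is not the TC BPF program itself (that one is written in C and runs in the kernel); it only models the same packet transformation on a raw byte buffer. The function names, the example addresses, and the choice to always recompute the UDP checksum are assumptions made for illustration.

```python
import socket
import struct

IPPROTO_IPIP, IPPROTO_TCP, IPPROTO_UDP = 4, 6, 17
# Offset of the checksum field inside the layer 4 header.
L4_CSUM_OFFSET = {IPPROTO_UDP: 6, IPPROTO_TCP: 16}


def checksum(data: bytes) -> int:
    """Internet checksum (RFC 1071): one's-complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def ipv4_header(src: str, dst: str, proto: int, payload_len: int) -> bytes:
    """Build a 20-byte IPv4 header (no options) with a valid checksum."""
    hdr = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20 + payload_len, 0, 0, 64,
                      proto, 0, socket.inet_aton(src), socket.inet_aton(dst))
    return hdr[:10] + struct.pack("!H", checksum(hdr)) + hdr[12:]


def decap_and_rewrite(pkt: bytes, iface_ip: str) -> bytes:
    """Strip the outer IPIP header and rewrite the inner source address.

    Packets that are not outgoing IPIP packets are returned unchanged.
    """
    if len(pkt) < 20 or pkt[0] >> 4 != 4 or pkt[9] != IPPROTO_IPIP:
        return pkt
    outer_ihl = (pkt[0] & 0x0F) * 4
    if len(pkt) < outer_ihl + 20:
        return pkt
    # The outer source must be this interface's address (outgoing packet).
    if pkt[12:16] != socket.inet_aton(iface_ip):
        return pkt

    new_src = pkt[16:20]                 # copy of the outer destination address
    inner = bytearray(pkt[outer_ihl:])   # drop the outer IP header
    ihl = (inner[0] & 0x0F) * 4
    inner[12:16] = new_src               # replace the inner source address

    # Recalculate the layer 3 (IP header) checksum.
    inner[10:12] = b"\x00\x00"
    struct.pack_into("!H", inner, 10, checksum(bytes(inner[:ihl])))

    # Recalculate the layer 4 checksum, which covers a pseudo-header
    # containing the (now changed) source address.
    proto = inner[9]
    total_len = struct.unpack_from("!H", inner, 2)[0]
    l4 = inner[ihl:total_len]            # slicing copies, so this is scratch space
    off = L4_CSUM_OFFSET.get(proto)
    if off is not None and len(l4) >= off + 2:
        l4[off:off + 2] = b"\x00\x00"
        pseudo = bytes(inner[12:20]) + struct.pack("!BBH", 0, proto, len(l4))
        csum = checksum(pseudo + bytes(l4))
        if proto == IPPROTO_UDP and csum == 0:
            csum = 0xFFFF                # 0 means "no checksum" in UDP
        struct.pack_into("!H", inner, ihl + off, csum)
    return bytes(inner)


if __name__ == "__main__":
    # Hypothetical addresses: 10.0.0.2 is the machine's interface,
    # 92.119.148.10 stands in for a game server/Anycast IP.
    udp = struct.pack("!HHHH", 27015, 50000, 8 + 4, 0) + b"ping"
    inner = ipv4_header("10.0.0.5", "203.0.113.7", IPPROTO_UDP, len(udp)) + udp
    outer = ipv4_header("10.0.0.2", "92.119.148.10", IPPROTO_IPIP, len(inner)) + inner
    out = decap_and_rewrite(outer, "10.0.0.2")
    print(socket.inet_ntoa(out[12:16]), "->", socket.inet_ntoa(out[16:20]))
```

A full recomputation keeps the sketch easy to follow. A kernel program would more likely update the checksums incrementally, since only the source address changes.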
I made a TC egress program that does the above and released the source code here. This program will improve performance on our game servers. It also has the following advantages for our setup:

- Less load on POP servers that reside in the same location as the game servers, since outbound game server traffic won't need to go through the POP server closest to the game server machines.
- Lower bandwidth overage fees (this will probably save us $300 to $400 per month, which is a huge amount!).
- We won't need a beefy POP in the locations our game servers reside in.
- We may use hosting providers that aren't in the same location as the POP servers.
- Less latency, since the traffic won't have to flow through the closest POP server. The outbound traffic will be sent back to the client directly.
- We can scale our POP coverage more efficiently since we won't have to worry about game server outbound traffic.
- Eliminates one single point of failure.

There really aren't any cons to this program as long as it performs well, which is to be expected. The TC hook is fairly fast since it's implemented early on in the Linux networking path. Egress traffic should be fine as well. I'm going to be making some optimizations to the program soon, which should help further.

I've tested this on a new machine I set up under our Anycast network and everything is working fine, including TCP traffic. Our hosting provider in Dallas removed the ACL rules that were preventing us from spoofing the inner IP header's source address as the Anycast/game server IP. Therefore, everything is working there. I still need to get the status from our hosting provider in New York City on allowing us to spoof as our Anycast network.

As of right now, other than the new machine I set up, none of our game servers are running this program. In order to get the program working on our main machines, I will need to upgrade the kernel.
Newer kernels add a mode named BPF_ADJ_ROOM_MAC to the BPF helper bpf_skb_adjust_room(), which the program relies on to remove the outer IP header. Upgrading the kernel will require a reboot, assuming everything goes well (which it should). Therefore, I'll be scheduling maintenance at some point to do this. Maintenance windows will be announced publicly and will probably take place late at night.

In the meantime, I'm going to continue testing and fixing up some small things with the program. Once I feel we can start applying it to production machines (other than the new machine I set up, which we'll slowly be moving a couple of servers to), I'll let everyone know.

Other Strange Routing Issues (Issue #2)

We've had a few complaints from users experiencing high latency and packet loss on our game servers under the network. After inspecting some MTRs and traceroutes to our Anycast network that users supplied, it appears NTT (our direct peer) has issues in some locations. I'd assume this is due to the big surge in traffic from the Coronavirus lockdown. I've read that Telia (another peer) started processing 3 to 4 times more traffic after the lockdown began. I've also seen a lot of issues at my job relating to hosting providers and Microsoft. It's crazy!

With that said, it seems some people are routing to our Seattle POP, and the route the Seattle POP takes to our game server machines appears to get messed up, resulting in high latency. I'm not sure what is causing this, but I may try removing NTT as a direct peer temporarily to see if anything improves. I also plan to open a ticket with our POP hosting provider to see if they're able to assist with this (not holding my breath on that one since our POP hosting provider hasn't been very helpful in the past).

Some MTRs showed packet loss and high latency beginning before hitting our direct peers (NTT and GTT).
We may not have much control over these types of issues, and the users may have to contact their ISP. We may be able to email the specific AS's NOC to see if they're able to assist and look into it, though.

To conclude, the TC program I made will result in higher performance on our Anycast network and should resolve the performance issues with our NYC servers. It also comes with many other benefits and fits our network well. We have also been seeing some strange routing issues that I'm looking into.

If you have any questions, please feel free to ask!

Thank you for reading.
Roy · Posted April 4, 2020

Just an update: I set up a test CS:S server on the new machine under our Anycast network. The server got to 32/32 last night (a completely stock server) and the TC BPF program was performing very well. All packets were being sent back to the client directly and I saw no performance issues on the game server. There was no packet loss and latency was good. We will be moving some servers from GS10 to this machine today since GS10 is most likely suffering from hardware failures.

Everything is looking good 😄

Thank you!
_Rocket_ · Posted April 4, 2020

First of all, you don't get enough appreciation for the amount of painful work this stuff takes. And the fact that you even spent the time to make this post is really big to me, and I admire it. Anytime I deal with hours of headache trying to fix a problem, the first thing I want to do is spend 2 days playing Tetris. So really, thank you for the hard work miboi.

This networking stuff is interesting but boi is it complicated. I think I'm better suited as a software/game dev hahaha
VilhjalmrF · Posted April 4, 2020 (edited April 4, 2020)

In reply to Roy:

👌 So then what will happen with GS10? Just no longer use this machine since we have our new one (feels like a dumb question to ask since GS10 is kinda shitty, just toss it in the trash)?
Roy · Posted April 4, 2020

In reply to _Rocket_:

Thank you! That means a lot! While it does take a lot of work and I did go through a lot of headaches at first since I'm fairly new to BPF and TC, this type of programming is definitely something I have a strong interest in! Now that I have a better feel for BPF, I think I'll be able to make better TC and XDP programs in the future. I've also been communicating over the BPF + XDP mailing lists and I've been learning a lot from the people there.

I want to try to help @Dreae as much as I can with the new packet processing software since it seems very interesting. Micro-optimization will be important with these programs and that's something I'm working on (you can see a lot of the optimizations I made to my TC program in the latest commits, such as the use of the likely() and unlikely() macros and specific integer sizes). We're still planning how we're going to do everything with the new packet processing software (e.g. are we going to use plain NAT or keep using IPIP packets?), but once it's made, our Anycast network will be A LOT more efficient and we'll be able to scale our POP coverage easily since we plan to make a backbone that controls each POP server.

Finding new hosting providers and so on is another big project I'll need to do afterwards.

Thank you!
Roy · Posted April 4, 2020

In reply to VilhjalmrF:

The CEO of Nexril (James) is going to be performing more tests on it once we move our servers off of it. The new machine has a weaker processor than GS10, but it's going to be acting as a temporary machine regardless. Once we can get the TC BPF program working on the NYC machines and the hosting provider removes the ACLs that are preventing us from spoofing traffic as our Anycast network, we'll probably just move all the servers to that machine once the performance issues are confirmed to be fixed. We have plenty of room on the NYC machines themselves.

Also, the server IPs won't change; that's the power of owning a /24 IPv4 block and an Anycast network.

Thanks!
VilhjalmrF · Posted April 4, 2020

In reply to Roy:

Beautiful.
Roy · Posted April 5, 2020

We've moved the servers from GS10 to the new machine and the TC program is performing very well (thank you @Xy for assisting with the server moves)! The bandwidth usage on the Dallas POP has already gone down since outbound traffic isn't being sent through the POP server anymore (excluding A2S_INFO responses).

With that said, it turns out our NYC hosting provider has removed the ACL rules that prevented us from sending traffic as our Anycast network. I tested this with a C program I made that spoofs the source address and I was able to send out as our Anycast network successfully. Therefore, the only thing we need to do is upgrade the kernel and run the TC program. I will probably plan maintenance for this in the next couple of days after talking to @dagreek.

As for our Dallas machines, we will be doing the same thing. I will be looking to do this in the next few days.

Thank you!
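The source address test mentioned above was done with a C program that is not shown in the thread. As a rough illustration of the same idea, here is a small Python sketch that builds a UDP packet with an arbitrary source address and sends it through a raw socket. All names, ports, and addresses are hypothetical. Sending requires root (CAP_NET_RAW) on Linux and should only be done with addresses you are authorized to announce.

```python
import socket
import struct


def checksum(data: bytes) -> int:
    """Internet checksum (RFC 1071): one's-complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_udp_packet(src: str, dst: str, sport: int, dport: int,
                     payload: bytes) -> bytes:
    """Build a complete IPv4 + UDP packet whose source address is `src`."""
    src_b, dst_b = socket.inet_aton(src), socket.inet_aton(dst)
    udp_len = 8 + len(payload)

    # UDP header with a zero checksum first, then fill in the real one.
    udp = struct.pack("!HHHH", sport, dport, udp_len, 0) + payload
    pseudo = src_b + dst_b + struct.pack("!BBH", 0, socket.IPPROTO_UDP, udp_len)
    udp_csum = checksum(pseudo + udp) or 0xFFFF  # 0 means "no checksum" in UDP
    udp = udp[:6] + struct.pack("!H", udp_csum) + udp[8:]

    # 20-byte IPv4 header without options.
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20 + udp_len, 0, 0, 64,
                     socket.IPPROTO_UDP, 0, src_b, dst_b)
    ip = ip[:10] + struct.pack("!H", checksum(ip)) + ip[12:]
    return ip + udp


def send_raw(packet: bytes, dst: str) -> None:
    """Send a prebuilt IPv4 packet as-is. Needs root (CAP_NET_RAW) on Linux."""
    # With IPPROTO_RAW the caller supplies the IP header itself.
    with socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_RAW) as s:
        s.sendto(packet, (dst, 0))


if __name__ == "__main__":
    # Hypothetical values: 92.119.148.10 stands in for an Anycast address,
    # 203.0.113.7 for a host you control that is capturing traffic.
    pkt = build_udp_packet("92.119.148.10", "203.0.113.7", 27015, 50000, b"test")
    send_raw(pkt, "203.0.113.7")
```

If the packet shows up in a capture on the receiving host with the Anycast source address intact, the provider is not filtering it on the way out.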
Roy · Posted April 5, 2020

Our NYC machines are now running the program successfully. I haven't seen any packet loss since enabling it!

Unfortunately, upgrading the last two machines in Dallas will be painful. Therefore, I am going to measure performance on our NYC machines and if things go well, we can start moving servers from those Dallas machines to our NYC machines since we have a lot of room in NYC.

Thank you.
Roy · Posted April 15, 2020

I upgraded GS08 and GS09 today and started running the TC BPF program. Everything is running smoothly so far.

Thank you!
TokenDude · Posted April 15, 2020

Nice!