Roy 10,830 Posted April 26, 2021

Hi everyone,

Around two to three hours ago, we started seeing many networking issues, and players slowly began timing out, along with the services our servers connect to.

When I first started investigating, it appeared that completely random packet types were being dropped somewhere. When I say random, I mean completely random. For example, I was able to connect to our game servers fine, but my A2S_PLAYER challenge/requests were being dropped somewhere, since that query never got a response. However, the same query worked on some of our other servers in Dallas, TX. I also performed packet captures for the failing services our game servers connect to, and we were seeing packets in both directions. This further confirmed that random packet types were being dropped somewhere.

I then started talking to our hosting provider in Dallas, TX, because I wasn't seeing any issues with our NYC servers. From what I saw, certain packets sent directly from our Anycast ranges on our machines (92.119.148.0/24 and 185.240.217.0/24) weren't making it back from our hosting provider. We worked together on this for thirty minutes to an hour and then discovered that the BGP session on our GSK POP server wasn't active. This session must be active for us to send traffic out as our Anycast ranges due to uRPF policies (basically, our GSK POP server and game server machines plug into the same switch, and since the POP server announces the IP ranges via BGP, our game server machines are able to send traffic out directly).

Now, an hour or so before all of this happened, I fixed an issue with our Squad server, and this was related to the Anycast network as well. At first, I suspected the changes I made were somehow breaking things (even though they had nothing to do with the type of random packets being dropped).
However, I enabled debugging within the XDP/BPF program, which prints to the Linux kernel trace pipe when a packet is dropped (along with the IP address as an unsigned 32-bit integer in network byte order, the port, etc.). The packets that weren't making it back weren't being filtered by the XDP/BPF program. Therefore, I thought things breaking soon afterwards was just a coincidence. I also tried reverting to the previous versions on the POP servers I was routing through to see if anything changed. Unfortunately, I wasn't routing through the GSK POP server, so this didn't help me find the issue any quicker.

Anyway, the packet processing/filtering software running on our POP servers is managed in a private GitHub repository, which is how I update everything. The BGP session broke because I forgot to whitelist the BGP neighbor IP address from GSK. The whitelist code was never pushed to the GitHub repository, so when I reset the repo and updated it with the Squad fix, that code was lost. I didn't push this code because the repository is shared with 8 to 9 other server owners I've tried helping mitigate (D)DoS attacks for by giving them the filters I made for GFL. I felt pushing code for the neighbor address like that wouldn't be secure. We had to do the same thing on our Vultr POP servers, but they use a link-local IP address (to save IP addresses, I guess; I don't know), so pushing the Vultr neighbor whitelist code to the GitHub repository wasn't a big deal in my opinion.

After I whitelisted the GSK neighbor address on the GSK POP server, the BGP session became active again. When the BGP session goes down, all traffic sent out as our IP ranges is supposed to be dropped (I know this is also a single point of failure, but I'm working on a solution for that as well where we'd have fallback BGP sessions).
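To make the whitelist problem above concrete, here is a simplified model in Python. It is not the actual XDP/BPF code (which is C and isn't shown in this post), only a sketch under stated assumptions: the neighbor address `192.0.2.1` is a placeholder from the documentation range, the function names and the debug line format are made up for illustration, and the integer conversion assumes a little-endian host such as x86-64.

```python
import socket
from typing import Optional, Set, Tuple

# Placeholder neighbor address (RFC 5737 documentation range).
# The real GSK neighbor address is intentionally not part of this sketch.
NEIGHBOR_WHITELIST = {"192.0.2.1"}


def ip_to_printed_u32(ip: str) -> int:
    """Integer a little-endian host would print for a network-byte-order IPv4 address."""
    return int.from_bytes(socket.inet_aton(ip), "little")


def printed_u32_to_ip(value: int) -> str:
    """Turn that printed integer back into dotted-quad form."""
    return socket.inet_ntoa(value.to_bytes(4, "little"))


def verdict(
    src_ip: str,
    matches_game_rule: bool,
    whitelist: Set[str] = NEIGHBOR_WHITELIST,
) -> Tuple[str, Optional[str]]:
    """Simplified filter model: pass whitelisted neighbors and game traffic, drop the rest.

    Returns (action, debug_line). The debug line only imitates the idea of
    logging the dropped address as an unsigned 32-bit integer.
    """
    if src_ip in whitelist or matches_game_rule:
        return "PASS", None
    return "DROP", f"dropped src={ip_to_printed_u32(src_ip)}"


if __name__ == "__main__":
    # With the whitelist in place, packets from the BGP neighbor pass.
    print(verdict("192.0.2.1", matches_game_rule=False))
    # With the whitelist code lost, the same packets are dropped.
    action, line = verdict("192.0.2.1", matches_game_rule=False, whitelist=set())
    print(action, line, "->", printed_u32_to_ip(int(line.split("=")[1])))
```

In this model, losing the whitelist entry means the neighbor's packets fall through to the drop path, which is the kind of failure that would take a BGP session down while game traffic keeps matching its own rules.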
The reason not all traffic was dropped is that, months ago, we had a lot of issues sending traffic out as our Anycast ranges because we weren't plugged into the same switch as our GSK POP server (some of you probably remember), and GSK tried manually accepting certain traffic from us. Therefore, some traffic was still allowed through even while the BGP session was down. Honestly, if all traffic had been cut off, I probably would have found the issue quicker, since we could have easily narrowed it down. What made it hard was that 10 to 20% of the traffic was coming through and random packets were being dropped on all connections.

Unfortunately, we don't have a dev environment for our Anycast network besides a single forwarding server running on my home server for testing out filters, etc. We wouldn't have been able to test something like this. A proper environment is something I will be considering in the future, but it'd cost us a bit of money, which is why I haven't set one up yet.

In the end, this was a major mistake on my part and I do apologize (don't blame anybody besides me for this). I hope the above explained my thought process in detail. I'm going to look into having fallback BGP sessions to eliminate this single point of failure.

All servers have been up for around an hour or so now.

Thank you for your time and understanding.
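For anyone curious about the A2S_PLAYER challenge/request exchange mentioned in the post, here is a minimal sketch, assuming the publicly documented Source server query format: the client first sends a request with a challenge of 0xFFFFFFFF, the server answers with a challenge number, and the client repeats the request with that number. The function names are hypothetical and the code only builds and parses packets; it does no network I/O.

```python
import struct

HEADER = b"\xff\xff\xff\xff"  # prefix of a single (non-split) query packet
A2S_PLAYER = 0x55             # 'U': player list request
S2C_CHALLENGE = 0x41          # 'A': challenge response from the server
NO_CHALLENGE = 0xFFFFFFFF     # sent first to ask the server for a challenge


def build_player_request(challenge: int = NO_CHALLENGE) -> bytes:
    """Build an A2S_PLAYER request carrying the given challenge number."""
    return HEADER + struct.pack("<BI", A2S_PLAYER, challenge)


def parse_challenge(response: bytes) -> int:
    """Extract the challenge number from the server's challenge response."""
    if len(response) < 9 or response[:4] != HEADER or response[4] != S2C_CHALLENGE:
        raise ValueError("not a challenge response")
    (challenge,) = struct.unpack_from("<I", response, 5)
    return challenge


if __name__ == "__main__":
    first = build_player_request()
    print(first.hex())
    # Example challenge reply with made-up challenge bytes.
    reply = HEADER + bytes([S2C_CHALLENGE, 0x4B, 0xA1, 0xD5, 0x22])
    second = build_player_request(parse_challenge(reply))
    print(second.hex())
```

Both packets would normally be sent over UDP to the game server's query port. If either one, or the reply to it, is dropped along the path, the query never completes, which matches the symptom described in the post.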
Felisae 4 Posted April 26, 2021 Edited April 26, 2021 by Felisae

Thank you for pulling your hair out trying to fix this! I was confused watching some of the servers' player counts going up and down, but I couldn't ever seem to connect. Good to know it wasn't something on my end!
DrakoHD 222 Posted April 26, 2021

TLDR roy was playing with squad needed a file got a file with wrong thing and servers used wrong file, and file caused packets to drop and caused server population to die

also roy dont trip it was an accident shit happens at least its all fixed now
Akris 806 Posted April 26, 2021

I mean you did kinda create/ help create everything GFL is today so you get one free pass

[Former] PVK Manager Anarchy Admin
Alexis 573 Posted April 26, 2021

Does that mean my weird issue was fixed? @Roy
Ozkody or Killes 63 Posted April 26, 2021

Does that mean I can still play as an Aussie????

Former CWRP Admin - / Currently CWRP - Main - Gmod - Rust Discord Mod
zurenea 69 Posted April 26, 2021
TheJitFace 688 Posted April 26, 2021

lol jitticus
Roy 10,830 Posted April 26, 2021

7 hours ago, Alexis said:
Does that mean my weird issue was fixed? @Roy

Unfortunately, the issue you are experiencing isn't related, and I don't believe it's anything on the Anycast network's end either. It appears to be something related to your network, but it's an issue I haven't seen before. I'll try looking into it more when I have the time, but I will be busy the next few days or so.

4 hours ago, Ozkody or Killes said:
Does that mean I can still play as an Aussie????

If you can beat me in CoD 4, then yes
Pachimo 2,727 Posted April 26, 2021

So dumb omg

Former | too many things to fit in a signature
Ozkody or Killes 63 Posted April 26, 2021

1 hour ago, Roy said:
Unfortunately, the issue you are experiencing isn't related and I don't believe it's anything on the Anycast network's end as well. It appears to be something related to your network, but that's an issue I haven't seen before. I'll try looking more into it when I have the time, but will be busy the next few days or so. If you can beat me in CoD 4, then yes

you are my Bitch and will stay my Bitch in any game we play

Former CWRP Admin - / Currently CWRP - Main - Gmod - Rust Discord Mod
VilhjalmrF 2,094 Posted April 26, 2021

14 minutes ago, Pachimo said:
So dumb omg

shutup

Average HL2RP Enjoyer.
Paul. 44 Posted April 26, 2021

Thanks for fix
calebevb 103 Posted April 26, 2021

No, Roy you did this all wrong you have to blame it on someone else like uhh Alexis or something
Roy 10,830 Posted April 27, 2021

On 4/26/2021 at 10:19 AM, Pachimo said:
So dumb omg

That's rude

On 4/26/2021 at 10:28 AM, Ozkody or Killes said:
you are my Bitch and will stay my Bitch in any game we play

17 hours ago, calebevb said:
No, Roy you did this all wrong you have to blame it on someone else like uhh Alexis or something

If it's not my fault, blame @Alexis!!
Alexis 573 Posted April 27, 2021

1 minute ago, Roy said:
If it's not my fault, blame @Alexis!!