[12-14-20] Server Downtime

~~Roy~~ · December 15, 2020

Hi everybody,

Around two hours ago, a majority of our servers experienced a full outage, specifically any servers on our GSK machines.

This was due to stricter RPF filtering that GSK started enforcing earlier tonight. This basically meant that traffic sent from our GSK machines with a source address in our Anycast IP range was being dropped. As a result, our GSK POP wasn't able to send IPIP traffic to our game server machines and vice versa, which broke A2S_INFO responses, and players routing through our GSK POP weren't able to send packets to any of our servers.
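To make the RPF part a bit more concrete, here is a minimal sketch of how a strict reverse-path check behaves. It is not GSK's actual configuration: the prefixes are documentation ranges standing in for the real ones, the interface names are made up, and the assumption that the route back to the Anycast range points at a different port than the machine's own is only for illustration.

```python
import ipaddress

# Hypothetical view of the upstream router: prefix -> interface the router
# would use to reach that prefix. Documentation prefixes stand in for the
# real ranges, and the interface names are invented for this sketch.
ROUTES = {
    ipaddress.ip_network("203.0.113.0/24"): "pop-uplink",     # stand-in for the Anycast range
    ipaddress.ip_network("198.51.100.0/24"): "machine-port",  # stand-in for a machine's own range
}


def best_route_interface(source_ip):
    """Return the interface of the longest-prefix match for source_ip, or None."""
    addr = ipaddress.ip_address(source_ip)
    matches = [net for net in ROUTES if addr in net]
    if not matches:
        return None
    return ROUTES[max(matches, key=lambda net: net.prefixlen)]


def strict_rpf_accepts(source_ip, arrival_interface):
    """Strict RPF: accept only if the route back to the source uses the arrival interface."""
    return best_route_interface(source_ip) == arrival_interface


# A packet using the machine's own address passes the check.
print(strict_rpf_accepts("198.51.100.10", "machine-port"))
# A packet using an Anycast source address arrives on the same port,
# but the route back to that address points elsewhere, so it is dropped.
print(strict_rpf_accepts("203.0.113.10", "machine-port"))
```

Under these assumptions, the second packet is the kind of traffic that started getting dropped once the stricter filtering was enforced.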

This change was brought to my attention a couple of weeks ago, but preparing for it was more complex since, to my understanding, it involved additional BGP sessions. I wasn't aware they were making the change today. As a temporary solution, they've lifted the restrictions for our specific Anycast IPs on our GSK machines and also allowed IPIP traffic to certain destinations from the GSK POP. The permanent solution will be setting up the needed BGP sessions so we can comply with the RPF filters without issues. However, this will likely take more time and will be more complex.

I'm going to talk to GSK to make sure communication about changes like this is better in the future. We weren't aware of the specific date, and the only notice we had was that it would probably happen in a couple of weeks, without BGP session details, etc.

I do apologize for the inconvenience.

If you have any questions, please feel free to reply.

~~Roy~~ · December 15, 2020

There were also on and off network issues last night due to GSK. An announcement will be made by GSK soon and I will link it here when done.

Everything should be good to go now, though.

mastergodown · December 15, 2020

Thanks chief.

~~Roy~~ · December 16, 2020

More information on the network outages from the other night (after the initial outage) may be found here.

~~Roy~~ · December 18, 2020

Update

I spoke to Renual from GSK a couple of days ago and addressed the recent concerns. My biggest concern was the RPF filters being put in place without any real warning or any way to prevent the effects (the outage) from occurring. GSK was very understanding of this and mentioned they were initially planning to wait on it. However, due to some things occurring that could have led to something damaging, they needed to implement the RPF filters sooner.

With that being said, the issue related to the core switch running out of TCAM space should be resolved. This issue heavily impacted our servers as well, especially the 24 hours after the initial outage.

In addition to the above, I suggested a status page for GSK and GSK soon afterwards made one here! I also suggested posting announcements on GSK's website and any social media they have when an outage is occurring. This would help a lot with keeping clients informed.

Recent Issues From Europe/Asia Players

Recently, I've received reports of players from Europe/Asia timing out on our servers. These players reported being able to hear other players in-game while being frozen in place. We had similar symptoms months ago, when I was having trouble reliably reading eBPF map values within my modified version of Compressor. That version implements filters to help prevent malicious traffic ((D)DoS attack traffic, etc.) from being forwarded from our POP servers to our game server machines. I don't believe this is an issue with the filters this time, though, because they previously ran for months straight without issues and some symptoms weren't the same (e.g. players this time stated they couldn't reconnect after timing out, which means they were still timing out even when reinitiating the handshake sequence).

I suspect players in-game could hear other users' voices because they were still receiving UDP packets directly from GSK's machines, since the server thought the client was still connected. This tells me GSK is able to send traffic back to the client without any issues and no RPF filters are being tripped here. It seems like traffic from the POP server to the GSK machine is being dropped at some point. I've made Renual aware of this and he has stated he'll need more information. Therefore, I'm working to collect information from the affected users with the help of our Server Managers and Division Leaders.
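The reasoning above can be sketched as a tiny model of the two paths. This is only an illustration of the suspicion, not a diagnosis: the hop names are simplified labels, and the dropped link is the assumed failure point.

```python
def deliver(path, dropped_links):
    """Return True if a packet survives every hop-to-hop link on the path."""
    links = zip(path, path[1:])
    return not any(link in dropped_links for link in links)


# Simplified hop labels (not real hostnames).
INBOUND = ["client", "pop", "game-server"]  # client -> POP -> (IPIP) -> game server
OUTBOUND = ["game-server", "client"]        # replies go straight back to the client

# Assumed failure: only the POP -> game server leg is dropping packets.
dropped = {("pop", "game-server")}

print("client -> server delivered:", deliver(INBOUND, dropped))
print("server -> client delivered:", deliver(OUTBOUND, dropped))
```

With only the POP to game server leg failing, the client keeps receiving packets such as voice data, while nothing the client sends (movement, reconnect attempts) reaches the server, which lines up with the reported symptoms.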

If you run into this issue, please do the following while the issue is ongoing:

Run an MTR or traceroute to the server's IP address (preferably an MTR, but that isn't required in this case). You can read my guide on how to do this here. Record the results by either taking a screenshot or copying and pasting them.
Try reconnecting to the game server. Please note whether you're able to reconnect or not. Try it multiple times as well.
Keep track of how long you aren't able to connect for.
Try connecting to one of our servers on our GS3900x machine and note the results (e.g. 92.119.148.80:27015 for GMod Hide & Seek, 92.119.148.87:27015 for CS:S Dust2, or 92.119.148.52:27015 for TF2 MGEMod). If you're able to connect to one of those servers, please try reconnecting to the server which was having issues and confirm you still aren't able to connect.
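For anyone comfortable with Python, here is an optional sketch that can supplement the steps above. It sends a single A2S_INFO query over UDP and reports whether a reply came back and how long it took. It doesn't replace the MTR or an actual connection attempt, and the function names and the 3 second timeout are just choices made for this example.

```python
import socket
import time

# Standard Source engine A2S_INFO request payload.
A2S_INFO_REQUEST = b"\xff\xff\xff\xffTSource Engine Query\x00"


def classify_reply(data):
    """Label a raw reply by the byte that follows the 0xFFFFFFFF header."""
    if len(data) < 5 or data[:4] != b"\xff\xff\xff\xff":
        return "unknown"
    kind = data[4:5]
    if kind == b"I":
        return "info"       # normal A2S_INFO response
    if kind == b"A":
        return "challenge"  # server wants the request resent with a challenge
    return "unknown"


def probe(host, port, timeout=3.0):
    """Send one A2S_INFO query and return (result, seconds_elapsed)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        try:
            sock.sendto(A2S_INFO_REQUEST, (host, port))
            data, _ = sock.recvfrom(4096)
            kind = classify_reply(data)
            if kind == "challenge":
                # Resend the request with the 4-byte challenge appended.
                sock.sendto(A2S_INFO_REQUEST + data[5:9], (host, port))
                data, _ = sock.recvfrom(4096)
                kind = classify_reply(data)
        except socket.timeout:
            return "timeout", time.monotonic() - start
        except OSError:
            return "error", time.monotonic() - start
        return kind, time.monotonic() - start
```

As a usage example, `probe("92.119.148.80", 27015)` queries the GMod Hide & Seek server listed above. Running it against both the affected server and one of the GS3900x servers while the issue is ongoing, and noting which ones return "timeout", gives a rough record to include alongside your MTR results.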

Recently, we've also seen the (D)DoS mitigation analyzer being more sensitive than normal on GSK's end. I've addressed this issue and GSK has resolved it. It's possible this played a part in the above as well, but we'll see.

I understand the recent events have been stressful and our stability hasn't been great. I do want to apologize for this and I believe stability will improve in the future. Up until the outage the other day, I don't believe GSK was directly the cause of downtime. Our Anycast network is pretty complicated and it has a lot of moving parts. There's still a lot of work that needs to be done to make everything user-friendly and fully stable/consistent. Unfortunately, a majority of that work falls on me, and I'm currently in a stressful situation life-wise and have been for a long time now, which is why I haven't been able to make as much progress with the network as I initially expected and would have liked.

I wanted to note that because I've noticed people becoming fed up and wanting to move hosting providers due to a history of issues (I understand the fed up part, but these issues weren't due to GSK directly until the recent outage). All hosting providers have downtime associated with them (you guys should see Microsoft...). With that said, there isn't any other provider that has even come close to offering what GSK has offered us, and we have so much flexibility with GSK on an infrastructure front.

I also want to note to our internal users/staff that a lot of the issues we're facing aren't related to GSK directly, but instead related to how the Anycast network works, IPIP tunnel bugs within Linux/network namespaces, our control panel, and more.

Thank you for understanding.
