
Roy

[12-14-20] Server Downtime


There were also on-and-off network issues last night due to GSK. GSK will be making an announcement soon, and I will link it here once it's posted.

 

Everything should be good to go now, though.



Update

I spoke to Renual from GSK a couple of days ago and raised the recent concerns. My biggest concern was the RPF (reverse-path filtering) filters that were put in place without any real warning and without a way for us to prevent the effects (the outage) from occurring. GSK was very understanding of this and mentioned they had initially planned to wait. However, due to events that could have led to something damaging, they needed to implement the RPF filters sooner.

 

With that said, the issue with the core switch running out of TCAM space should also be resolved. That issue heavily impacted our servers as well, especially during the 24 hours after the initial outage.

 

In addition to the above, I suggested a status page for GSK, and they set one up shortly afterwards here! I also suggested posting announcements to GSK's website and any social media they have when an outage is occurring, which would help a lot with keeping clients informed.

 

Recent Issues From Players in Europe/Asia

Recently, I've received reports of players from Europe/Asia timing out on our servers. These players reported being able to hear other players in-game, but everyone appeared frozen in place. We saw similar symptoms months ago, back when I was running into issues reliably reading eBPF map values within my modified version of Compressor, which implements filters on our POP servers to prevent malicious traffic ((D)DoS attack traffic, etc.) from being forwarded to our game server machines. I don't believe the filters are the issue this time, though: they had been running for months straight without problems, and some of the symptoms aren't the same (e.g. players this time stated they couldn't reconnect after being timed out, meaning they were still timing out even when reinitiating the handshake sequence).
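
For context, here's a minimal sketch of what reading a value from a pinned eBPF map looks like from user space with libbpf. The pin path and key/value layout below are hypothetical placeholders for illustration, not Compressor's actual names:

```c
// Minimal sketch: read a per-IP counter from a pinned eBPF map via libbpf.
// The pin path and key/value types are hypothetical, not Compressor's real layout.
#include <stdio.h>
#include <stdint.h>
#include <arpa/inet.h>
#include <bpf/bpf.h>

int main(void)
{
    // Open the map through its bpffs pin (hypothetical path).
    int map_fd = bpf_obj_get("/sys/fs/bpf/compressor/ip_stats");
    if (map_fd < 0) {
        perror("bpf_obj_get");
        return 1;
    }

    // Key: an IPv4 address in network byte order (assumed layout).
    uint32_t key = inet_addr("92.119.148.80");
    uint64_t value = 0;

    // Look up the current counter for that address.
    if (bpf_map_lookup_elem(map_fd, &key, &value) != 0) {
        perror("bpf_map_lookup_elem");
        return 1;
    }

    printf("packets seen: %llu\n", (unsigned long long)value);
    return 0;
}
```

When the user-space side of a filter keys decisions off values like these, unreliable reads can end up dropping legitimate traffic, which is the rough shape of what we saw back then.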

 

I suspect players in-game could hear other users' voices because they were still receiving UDP packets directly from GSK's machines; the server thought the client was still connected. This tells me GSK is able to send traffic back to the client without any issues, and no RPF filters are being tripped on that path. Instead, it looks like traffic from the POP server to the GSK machine is being dropped at some point. I've made Renual aware of this, and he has stated he'll need more information. Therefore, I'm working to collect information from the affected users with the help of our Server Managers and Division Leaders.
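
To help narrow down which direction is failing, a simple one-way UDP test between a POP server and a GSK-side machine can confirm whether packets make it across at all. The sketch below is a generic sender/listener pair, not part of our tooling, and the port is just a placeholder:

```c
// Generic one-way UDP path test (not part of our tooling).
// Run "./udptest listen 9000" on the receiving machine (e.g. the GSK-side box)
// and "./udptest send <dest-ip> 9000" on the sending machine (e.g. a POP server).
// If the sender reports success but the listener never prints anything,
// traffic in that direction is being dropped somewhere along the path.
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int main(int argc, char **argv)
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s listen <port> | %s send <ip> <port>\n",
                argv[0], argv[0]);
        return 1;
    }

    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0) { perror("socket"); return 1; }

    if (strcmp(argv[1], "listen") == 0) {
        struct sockaddr_in addr = {0};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = INADDR_ANY;
        addr.sin_port = htons((uint16_t)atoi(argv[2]));
        if (bind(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            perror("bind");
            return 1;
        }

        char buf[512];
        for (;;) {
            struct sockaddr_in src;
            socklen_t len = sizeof(src);
            ssize_t n = recvfrom(sock, buf, sizeof(buf) - 1, 0,
                                 (struct sockaddr *)&src, &len);
            if (n < 0) { perror("recvfrom"); return 1; }
            buf[n] = '\0';
            printf("got %zd bytes from %s: %s\n", n, inet_ntoa(src.sin_addr), buf);
        }
    }

    if (argc < 4) {
        fprintf(stderr, "usage: %s send <ip> <port>\n", argv[0]);
        return 1;
    }

    struct sockaddr_in dst = {0};
    dst.sin_family = AF_INET;
    dst.sin_addr.s_addr = inet_addr(argv[2]);
    dst.sin_port = htons((uint16_t)atoi(argv[3]));

    // Send a handful of probes one second apart.
    const char *msg = "pop-to-gsk path test";
    for (int i = 0; i < 10; i++) {
        if (sendto(sock, msg, strlen(msg), 0,
                   (struct sockaddr *)&dst, sizeof(dst)) < 0)
            perror("sendto");
        else
            printf("sent probe %d\n", i + 1);
        sleep(1);
    }
    return 0;
}
```

Running the same test in the reverse direction tells you whether the path is only broken one way, which is what the voice symptom suggests.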

 

If you run into this issue, please do the following while the issue is ongoing:

 

  1. Run an MTR or traceroute to the server's IP address (preferably an MTR, but that isn't required in this case). You can read my guide on how to do this here. Record the results and either take a screenshot or copy and paste them.
  2. Try reconnecting to the game server multiple times and note whether you're able to reconnect.
  3. Keep track of how long you're unable to connect.
  4. Try connecting to one of our servers on our GS3900x machine and note the results (e.g. 92.119.148.80:27015 for GMod Hide & Seek, 92.119.148.87:27015 for CS:S Dust2, or 92.119.148.52:27015 for TF2 MGEMod). If you're able to connect to one of those servers, please try reconnecting to the server that was having issues and confirm you still aren't able to connect. If you'd rather not launch each game, the sketch after this list shows a quick way to check whether one of those servers is responding.
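
For anyone more technically inclined, here's a rough sketch of checking whether one of those servers is responding without launching the game, using the standard Source engine A2S_INFO query. This is purely an optional reachability check; the server/port below is just the CS:S Dust2 example from the list:

```c
// Rough reachability check for a Source engine game server using the
// standard A2S_INFO query. Any reply (full info or a challenge) means the
// server's UDP port is reachable from your connection.
#include <stdio.h>
#include <arpa/inet.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/time.h>

int main(void)
{
    // Example server from the list above (CS:S Dust2).
    const char *ip = "92.119.148.87";
    const int port = 27015;

    // A2S_INFO request: 0xFFFFFFFF header, 'T', then the query string.
    // sizeof(query) includes the terminating '\0' the protocol expects.
    const unsigned char query[] = "\xff\xff\xff\xffTSource Engine Query";

    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0) { perror("socket"); return 1; }

    // Give up after 3 seconds so a dropped reply doesn't hang forever.
    struct timeval tv = { .tv_sec = 3, .tv_usec = 0 };
    setsockopt(sock, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

    struct sockaddr_in dst = {0};
    dst.sin_family = AF_INET;
    dst.sin_addr.s_addr = inet_addr(ip);
    dst.sin_port = htons(port);

    if (sendto(sock, query, sizeof(query), 0,
               (struct sockaddr *)&dst, sizeof(dst)) < 0) {
        perror("sendto");
        return 1;
    }

    unsigned char reply[1400];
    ssize_t n = recvfrom(sock, reply, sizeof(reply), 0, NULL, NULL);
    if (n < 0)
        printf("no reply from %s:%d (timed out or dropped)\n", ip, port);
    else
        printf("got %zd byte reply from %s:%d -- server reachable\n", n, ip, port);
    return 0;
}
```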

 

Recently, we've also seen the (D)DoS mitigation analyzer on GSK's end being more sensitive than normal. I've raised this with GSK and they have resolved it. It's possible this played a part in the above as well, but we'll see.

 

I understand the recent events have been stressful and our stability hasn't been great. I want to apologize for this, and I believe stability will improve going forward. Up until the outage the other day, I don't believe GSK was directly the cause of any downtime. Our Anycast network is fairly complicated and has a lot of moving parts, and there's still a lot of work needed to make everything user-friendly and fully stable/consistent. Unfortunately, the majority of that work falls on me, and I've been in a stressful situation life-wise for a long time now, which is why I haven't been able to make as much progress with the network as I had initially expected and would have liked.

 

I wanted to note that because I've noticed people becoming fed up and wanting to move hosting providers due to a history of issues (I understand being fed up, but these issues weren't due to GSK directly until the recent outage). All hosting providers have downtime associated with them (you should see Microsoft...). With that said, no other provider has even come close to offering what GSK has offered us, and we have so much flexibility with GSK on the infrastructure front.

 

I also want to note for our internal users/staff that many of the issues we're facing aren't related to GSK directly, but rather to how the Anycast network works, IPIP tunnel bugs within Linux/network namespaces, our control panel, and more.

 

Thank you for understanding.
