
Roy

Two Network Incidents From October 12th


Hey everyone,

 

I'm writing this thread for documentation purposes, and I'm a bit late to it (sorry, I've been all over the place recently, lol).

 

On October 12th, we experienced two network-related issues (one somewhat related to the other).

 

Redis Server Outage

Around 9:42 PM CST, Vultr (one of our hosting providers) had an outage in their Atlanta location:

 

[Attached image: 3431-10-14-2020-SDbWt0s7.png]

 

Our Redis server for A2S_INFO caching is hosted in this DC, and I wasn't able to route to it (there appeared to be a BGP issue with one of their upstreams, since an MTR to the server's IP didn't make it past my ISP). This temporarily broke A2S_INFO responses until Compressor stopped using the Redis server, which I believe took around 10-15 minutes to kick in. During this time, players couldn't see the servers in the server browser. However, since we don't have any actual POPs in Vultr's Atlanta location, everybody should still have been able to route to our game servers (you'd have needed to connect directly via console to play, though).
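For anyone curious how that fallback behaves, the general pattern is a cache lookup with a direct-query fallback: serve A2S_INFO responses from Redis when it's reachable, and query the game server directly when it isn't. Here's a minimal Python sketch of that pattern; the host name, key format, and timeouts are made up for illustration, and Compressor itself handles this at the packet level rather than like this:

```python
import socket
import redis

# A2S_INFO query payload (newer servers may reply with a challenge
# first; that handshake is omitted here for brevity).
A2S_INFO_REQUEST = b"\xff\xff\xff\xffTSource Engine Query\x00"

# Hypothetical Redis host; short timeouts so an unreachable Redis
# (like during the Atlanta outage) degrades quickly instead of hanging.
cache = redis.Redis(host="redis.example.net",
                    socket_timeout=2, socket_connect_timeout=2)

def get_a2s_info(server_ip: str, server_port: int) -> bytes:
    key = f"a2s_info:{server_ip}:{server_port}"

    # Fast path: serve the cached response from Redis.
    try:
        cached = cache.get(key)
        if cached is not None:
            return cached
    except redis.exceptions.RedisError:
        pass  # Redis unreachable: fall through to the game server

    # Fallback: query the game server directly over UDP.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(2)
        sock.sendto(A2S_INFO_REQUEST, (server_ip, server_port))
        response, _ = sock.recvfrom(4096)

    # Best-effort cache refresh; ignore failures if Redis is still down.
    try:
        cache.set(key, response, ex=5)
    except redis.exceptions.RedisError:
        pass

    return response
```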

 

This lasted around an hour.

 

Miami POP Outage

Our Miami POP experienced an outage at around 10:45 PM CST.

 

When the Redis server came back online, I restarted Compressor on all of our POP servers so they could re-establish a valid connection to the Redis server for A2S_INFO caching. However, Compressor failed to restart on our Miami POP. Since our IPv4 block was still being announced via BGP (from BIRD), any clients routing through this POP lost access to our game servers.

 

Compressor failed to start because there wasn't enough memory available on the VM. Some of our POPs still run with only 1 GB of RAM, and while adding swap has helped with this, they occasionally still fail to restart Compressor due to running out of memory. A simple full VM restart resolves the issue, and that's what I did.
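For reference, adding swap on a VM like this is just a few commands (the 1 GB size and /swapfile path are examples, not necessarily what we use on each POP):

```sh
# Create and enable a 1 GB swap file (size/path are illustrative).
fallocate -l 1G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile

# To persist across reboots, add this line to /etc/fstab:
# /swapfile none swap sw 0 0
```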

 

This issue will be fully resolved once we finish expanding our Anycast infrastructure and all of our POPs run with more than 1 GB of RAM. With that said, I believe I can add a dependency to BIRD's systemd configuration so that when Compressor's service goes down, BIRD goes down with it and stops announcing our IPv4 block via BGP. This won't fix the underlying issue, but at least if Compressor is offline, BIRD will stop and clients will stop routing to the affected POP.
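If I go that route, it would look something like the following systemd drop-in (the unit names bird.service and compressor.service are assumptions; ours may differ). BindsTo= stops BIRD whenever Compressor's unit stops or fails, and After= makes BIRD wait for Compressor on startup:

```ini
# /etc/systemd/system/bird.service.d/compressor-dep.conf
# Unit names are assumed; adjust to match the actual services.
[Unit]
# Stop BIRD (and thus the BGP announcement) whenever Compressor stops.
BindsTo=compressor.service
# Only start BIRD after Compressor is up.
After=compressor.service
```

After dropping that in, a systemctl daemon-reload would apply it.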

 

@Alexis This was the issue we talked about two days ago :)

 

If you have any questions, please let me know.

 

Thank you!

Thanks for the explanation.  ❤️
