
Roy

Anycast Network Issues


Hey everyone,

 

Tonight, our Anycast network experienced a major issue affecting our servers located in Dallas, TX under Nexril. A couple of hours ago, I started announcing a new POP server in Dallas, TX, which is part of the major Anycast expansion that I'll be making a thread about tomorrow or whenever this issue is fixed.

 

What started occurring was that none of the game servers under Nexril were responding to A2S_INFO queries. This resulted in the servers being dropped from the server browser. At first, I thought this was due to the AF_XDP part of Compressor breaking on the new POP server (which is a common reason why A2S_INFO stops working, and we've run into this in the past). However, after disabling the POP server and confirming the BGP session was inactive, the issue continued to occur. I confirmed these machines were routing to one of our Vultr POPs and not the new one. Therefore, I was honestly stumped on what the issue could be.
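For anyone who wants to check whether a server is answering A2S_INFO, here is a minimal Python sketch of the query. The request bytes and the `I`/`A` reply types follow the public Source server query format; the function names, timeout, and buffer size are my own assumptions for illustration and are not part of Compressor or any of our tooling.

```python
import socket

A2S_HEADER = b"\xff\xff\xff\xff"
# A2S_INFO request: header, 'T', then the fixed "Source Engine Query" string.
A2S_INFO_REQUEST = A2S_HEADER + b"T" + b"Source Engine Query\x00"


def build_a2s_info_request(challenge: bytes = b"") -> bytes:
    """Return an A2S_INFO request, optionally with a 4-byte challenge appended."""
    if challenge and len(challenge) != 4:
        raise ValueError("challenge must be exactly 4 bytes")
    return A2S_INFO_REQUEST + challenge


def classify_a2s_reply(data: bytes) -> str:
    """Classify a reply as 'info', 'challenge', or 'unknown'."""
    if len(data) < 5 or data[:4] != A2S_HEADER:
        return "unknown"
    kind = data[4:5]
    if kind == b"I":  # info response
        return "info"
    if kind == b"A":  # challenge response, retry with the challenge appended
        return "challenge"
    return "unknown"


def server_answers_a2s_info(host: str, port: int, timeout: float = 2.0) -> bool:
    """Send A2S_INFO over UDP and report whether an info reply came back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_a2s_info_request(), (host, port))
            data, _ = sock.recvfrom(4096)
            if classify_a2s_reply(data) == "challenge":
                sock.sendto(build_a2s_info_request(data[5:9]), (host, port))
                data, _ = sock.recvfrom(4096)
        except (OSError, ValueError):
            # Timeouts, unreachable hosts, and malformed challenges all count as "no answer".
            return False
    return classify_a2s_reply(data) == "info"
```

Calling `server_answers_a2s_info(host, port)` against a game server's address returning `False` would match the symptom described above: the server is running, but the browser never gets an info reply.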

 

After troubleshooting, I discovered the servers on the Nexril machines went down completely when I disabled my IPIPDirect program. This meant none of the game server packets being sent back from the Nexril machines to the closest POP server (our Vultr Dallas POP) were making it back. This would also explain why A2S_INFO broke initially, because my IPIPDirect program is configured to still send A2S_INFO responses back through the POP, solely for caching purposes (by this line).

 

This had me confused because an MTR to the Anycast network from the Nexril machines was receiving ICMP replies fine. With that said, I tried disabling BGP on the Vultr Dallas POP so the machines routed to the Chicago POP. However, the same issue occurred. This is when I needed to dig even deeper and unfortunately, since XDP (which Compressor, our packet processing software, uses) isn't debug-friendly, I had to stop Compressor on our Dallas POP multiple times after announcing it to the network again (so the machines routed to this POP again, etc.).

 

When I did this, I used my Packet Flooding tool here to continuously send UDP and TCP packets every second from the Nexril machine back to the Anycast network (the closest POP, which is Dallas, TX). When Compressor was disabled and I had a TCPDump running, it WAS able to see the standard UDP/TCP packets. However, when I disabled the IPIPDirect program on the Nexril machine and Compressor on the Dallas POP, then ran a TCPDump on the Dallas POP for any packet coming from the game server machine, it ONLY saw those standard TCP/UDP packets. It DID NOT see the IPIP packets being sent back. I also ran a TCPDump on the game server machine to catch all the IPIP packets it was trying to send back to the POP, and there were many of them. None of them showed up in the TCPDump on the Dallas POP server.
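To make it clearer what separates the IPIP packets from the standard UDP/TCP ones, here is a rough Python sketch of IPv4-in-IPv4 encapsulation. An IPIP packet is an ordinary IPv4 packet wrapped in a second IPv4 header whose protocol field is 4, which is also what a capture filter such as `ip proto 4` matches. The addresses in the usage example are documentation placeholders and the helper names are hypothetical; this is not how Compressor or IPIPDirect build packets.

```python
import socket
import struct

IPPROTO_IPIP = 4  # IANA protocol number for IPv4-in-IPv4 encapsulation
IPV4_HEADER_FORMAT = "!BBHHHBBH4s4s"  # 20-byte header without options


def ipv4_checksum(header: bytes) -> int:
    """Return the 16-bit one's-complement checksum of an IPv4 header."""
    if len(header) % 2:
        header += b"\x00"
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_ipv4_header(src: str, dst: str, proto: int, payload_len: int, ttl: int = 64) -> bytes:
    """Build a 20-byte IPv4 header (no options) with a valid checksum."""
    fields = [
        (4 << 4) | 5,          # version 4, header length 5 words
        0,                     # type of service
        20 + payload_len,      # total length
        0,                     # identification
        0,                     # flags and fragment offset
        ttl,
        proto,
        0,                     # checksum placeholder
        socket.inet_aton(src),
        socket.inet_aton(dst),
    ]
    fields[7] = ipv4_checksum(struct.pack(IPV4_HEADER_FORMAT, *fields))
    return struct.pack(IPV4_HEADER_FORMAT, *fields)


def encapsulate_ipip(inner_packet: bytes, tunnel_src: str, tunnel_dst: str) -> bytes:
    """Wrap a complete IPv4 packet in an outer IPv4 header with protocol 4."""
    outer = build_ipv4_header(tunnel_src, tunnel_dst, IPPROTO_IPIP, len(inner_packet))
    return outer + inner_packet


def is_ipip(packet: bytes) -> bool:
    """Return True if a raw IPv4 packet carries another IPv4 packet (protocol 4)."""
    return len(packet) >= 20 and packet[0] >> 4 == 4 and packet[9] == IPPROTO_IPIP


# Usage with placeholder addresses: an empty UDP-typed inner packet inside a tunnel packet.
inner = build_ipv4_header("198.51.100.7", "203.0.113.9", socket.IPPROTO_UDP, 0)
tunneled = encapsulate_ipip(inner, "192.0.2.20", "192.0.2.1")
```

A hop that filters on the outer protocol field would pass `inner` on its own but drop `tunneled`, which is consistent with the captures described above.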

 

It appears one of the routers/hops between the Nexril machines and the Dallas POP is dropping/filtering our IPIP packets. I honestly haven't seen such a strange issue before on our Anycast network.

 

I've opened up a ticket with Nexril and also recorded a video of all my troubleshooting showing there's clearly an issue with IPIP packets. I'm waiting for a response from them.

 

In the meantime, to get the servers up and running, I'm running the IPIPDirect program with A2S_INFO responses included, so it sends those out directly instead of through the POP. A2S_INFO caching will be disabled due to this, but once Nexril and I are able to figure out the issue, I'll be rebuilding the IPIPDirect program to send A2S_INFO responses back through the POP again for caching purposes.
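The change boils down to one decision per outgoing packet. Here is a simplified Python illustration of that decision only; the real IPIPDirect program does not work this way internally, and the function names and the `direct_a2s_info` flag are hypothetical stand-ins for the build option described above.

```python
# An A2S_INFO response starts with the 0xFFFFFFFF header followed by 'I' (0x49).
A2S_INFO_RESPONSE_PREFIX = b"\xff\xff\xff\xffI"


def is_a2s_info_response(udp_payload: bytes) -> bool:
    """Return True if a UDP payload looks like an A2S_INFO response."""
    return udp_payload.startswith(A2S_INFO_RESPONSE_PREFIX)


def choose_return_path(udp_payload: bytes, direct_a2s_info: bool) -> str:
    """Pick 'pop' (back through the IPIP tunnel) or 'direct' (straight to the client).

    direct_a2s_info=False models the normal setup: A2S_INFO responses go through
    the POP so they can be cached. direct_a2s_info=True models the temporary
    workaround: everything, including A2S_INFO, is sent out directly.
    """
    if is_a2s_info_response(udp_payload) and not direct_a2s_info:
        return "pop"
    return "direct"
```

With `direct_a2s_info=True`, nothing depends on IPIP packets reaching the POP, which is why the servers show up in the browser again at the cost of caching.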

 

I do want to apologize for the inconvenience and downtime associated with our Dallas servers, along with players routing through our Dallas POP, since I needed to take that down at times for debugging purposes.

 

Once I have an update, I will let you know.

 

Thank you.



Also, to add, I don't believe this issue is related to the new hosting provider directly. However, it seems the route updates triggered when bringing up the BGP session may have messed something up on Nexril's side.

 

Thanks.



Update

I've implemented IPIP support into my Packet Flooding tool here. I've concluded this is affecting IPIP traffic on all machines from Nexril (I thought initially it was only affecting 3/4 of our machines).

 

I am sending a consistent 11 Mbps with the above tool so our hosting provider can track down these packets more easily.

 

I'm waiting for an update on their side.

 

Thank you.



Another Update

This issue was resolved! It appears our new POP hosting provider and Nexril both use Hivelocity. Nexril and I are assuming there was some sort of internal conflict when announcing the new POP yesterday. Once Nexril made it so traffic back to the Anycast network routes through Cogent instead of Hivelocity, the POPs started seeing IPIP traffic again from the game server machines.

 

Honestly, this is one of the strangest issues I've seen so far, but I'm glad we were able to troubleshoot and resolve it. I've recompiled the IPIPDirect program on all machines so that A2S_INFO responses are excluded again and go back through the POP for caching purposes. There don't appear to be any issues from this at the moment.

 

Thanks!
