
Roy

[11-16-19] Network Outage


Hello,

 

Our network (92.119.148.0/24) went down at around 7:00 PM CST and came back up at 7:15 PM CST. There was definitely a routing loop at the NYC PoP, so I tried turning that PoP off, but it didn't change anything (I didn't expect it to, since a majority of the routes were having issues). A majority of the routes going to the network, including those from our physical servers, appeared to be stuck in a routing loop. Some appeared to route correctly, as shown here:

 

1394-11-16-2019-PSldAT90.png

 

However, since the physical servers weren't routing to the network, all servers went offline for everyone regardless.

 

I have a ticket open with our PoP hosting provider at the moment for investigation. With that said, once I come back from break, I'll be looking into other hosting providers to use, which should prevent this from happening again since we'll have providers to fall back on. In fact, NTT (a Tier 1 ISP) reached out to me asking if I wanted to buy direct transit, and I'm going to get back to them (this is pretty cool considering they stated they looked at my threads about the network and so on). However, I'm going to guess that's not going to happen considering how expensive it most likely is (I'm going to get a price just because I'm curious, though).

 

Once I have an update, I'll post in this thread.

 

Thanks.



12 hours ago, Shuruia said:

Keep in mind that you are supposed to be on a break. It sucks that you always have to be on standby for these somewhat frequent network issues.

I do agree, but I honestly was expecting something like this to occur. There's really nobody with access other than @Dreae who is familiar with our networking setup, since it's complicated to most people. Thankfully, our network highly interests me and I do enjoy working on it. But it often feels like a full-time job with the amount of work I put into it.

 

As for the issues, this is the first time I've seen a flat-out outage from our PoP hosting provider (it was definitely an infinite routing loop as shown below). I also noticed that in the screenshot I provided in my original post, the only routes working were from Vultr themselves (our PoP hosting provider). Therefore, I believe all external traffic was cut off. I also doubt the two peers we use (NTT and GTT) experienced an outage at the same time, LOL. So yeah, it's more than likely something on Vultr's side, and they STILL haven't replied to my ticket (they rarely do, especially on the weekends)...

 

JsYcTvHj3G.png
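
As a rough illustration of what "routing loop" means in a traceroute (a minimal sketch, not the tooling used for the screenshots in this thread): the same hop addresses keep repeating until the TTL runs out. The function name `find_routing_loop` and the addresses below are hypothetical; the addresses come from the reserved documentation ranges, not from the actual outage.

```python
def find_routing_loop(hops):
    """Return the repeating cycle of hop addresses, or None if no loop is seen.

    `hops` is the ordered list of responding hop IPs from a traceroute.
    A value of None marks a hop that timed out and is skipped.
    """
    last_seen = {}
    for index, hop in enumerate(hops):
        if hop is None:
            continue
        if hop in last_seen:
            # The same router answered twice, so packets are cycling
            # between the hops from its first appearance up to here.
            return hops[last_seen[hop]:index]
        last_seen[hop] = index
    return None


# Made-up trace: the packet bounces between two routers after the first hop.
trace = ["192.0.2.1", "198.51.100.7", "203.0.113.4",
         "198.51.100.7", "203.0.113.4"]
print(find_routing_loop(trace))
```

A trace that reaches its destination without revisiting any hop returns `None`. This only looks at one trace from one vantage point, so it can't say where the bad route came from.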

 

To be honest, most outages have come from our physical hosting provider. Thankfully, nothing like that has happened in a few months, and they've put measures in place to ensure they don't happen again. There have also been a couple of big issues with the network since I acquired the new ASN (example), but I have since figured them out and resolved them. The issues were due to things I wasn't prepared for or expecting (it was my fault). This entire project is a learning experience for me, and that's why I'm so happy I get to work on it. I've learned A TON in the last few months or so from it :) I still have so much to learn, but I'm hoping that in the next few years I can become very skilled on the networking side of things, since it's something I'm aiming for as a career along with something I really do love/enjoy.

 

On the bright side, with the new machines from @dagreek along with the fact I can and will make the network more redundant, I don't believe we'll experience as many issues as we have been. The ASN was the last big piece we needed for the network, and we got it. It's going to be stressful finding new PoP hosting providers since we have very specific needs, and I've been emailing hosting providers like crazy trying to find good ones for a decent price.

 

As for the break, I actually somehow didn't work on anything GFL-related for four days. That's probably the longest I've gone without working on GFL (that's a huge accomplishment for me, lmao). Unfortunately, I've been dealing with other stressful things, so sadly it didn't feel like much of a break :( I'm planning to complete some last projects (including the Anycast layout) and then take a big step back from GFL (basically only being an on-call person for the Anycast network). I've written 500+ lines of code for a new project I will be proposing soon, which will be one of the last for me while I'm fully active, I think.

 

Thanks!



Here is information I've received from Vultr regarding the previous outage:

 

1422-11-19-2019-AalwhvOB.png

 

This appears to be more complicated than I initially thought: it turns out a Tier 1 ISP may have accidentally announced our prefix incorrectly. I'm not really sure who would have done this, but I don't think this will happen again, since it would take a very specific mistake.
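
For anyone wondering how a bad announcement like that could be spotted, here is a minimal sketch using Python's standard `ipaddress` module. It assumes you already have a list of (prefix, origin ASN) pairs from some route collector or looking glass. The details of what was actually announced aren't known from this thread, so `OUR_ASN` and the sample announcements are hypothetical placeholders (ASNs from the reserved documentation range), and `suspicious_announcements` is a made-up helper name.

```python
import ipaddress

OUR_PREFIX = ipaddress.ip_network("92.119.148.0/24")
OUR_ASN = 64500  # hypothetical placeholder, NOT the network's real ASN


def suspicious_announcements(announcements):
    """Return announcements that match or sit inside our prefix
    but originate from an ASN other than ours."""
    flagged = []
    for prefix_text, origin_asn in announcements:
        prefix = ipaddress.ip_network(prefix_text)
        if prefix.version != OUR_PREFIX.version:
            continue  # subnet_of() raises TypeError on mixed IPv4/IPv6
        # An equal or more-specific prefix from another origin would
        # pull our traffic toward that origin.
        if prefix.subnet_of(OUR_PREFIX) and origin_asn != OUR_ASN:
            flagged.append((prefix_text, origin_asn))
    return flagged


# Made-up sample data for illustration only.
sample = [
    ("92.119.148.0/24", 64500),   # our own announcement
    ("92.119.148.0/25", 64511),   # more specific, different origin
    ("92.119.0.0/16", 64511),     # less specific, ignored here
    ("198.51.100.0/24", 64511),   # unrelated prefix
]
print(suspicious_announcements(sample))
```

This only catches origin mismatches. A prefix announced with the right origin but a wrong path would need AS-path inspection, which this sketch doesn't attempt.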

 

Thanks.
