Posted October 14, 2020 by Roy

Hello everybody,

Around 1:50 PM CST today, our GS14 machine, which is hosted with GSK, went down. I was at a GP appointment at the time, so @Aurora looked into the issue. Unfortunately, we didn't have KVM access, so we asked GSK to reboot the machine, which brought it back online. However, as soon as we tried launching a game server, it went down again.

When I got home, GSK was in the process of attaching a KVM to the machine. At 4:51 PM CST, the KVM was attached and I was able to connect. However, I received a call and had to step away for another ten to twenty minutes, so the investigation over the KVM didn't start until around 5:00 to 5:10 PM CST.

Last night, we updated our control panel, which included a daemon update that was applied to all of our game server machines. For some reason, the new configuration file on GS14 set our control panel daemon to use the host machine's network instead of the separate bridge it creates internally (172.18.0.0/16). In effect, this exposes the host machine's interfaces inside the game servers' Docker containers and network namespaces. Without our Anycast setup, this probably would have worked fine, although exposing every interface on the machine to each container defeats the purpose of isolating each container's network and is less secure.

The machine was going down because we use Docker Gen to deploy an IPIP tunnel for our Anycast network. Compressor forwards traffic from our POPs to our game servers as IPIP-encapsulated packets, so we need to set up an IPIP tunnel endpoint in each container with the remote host and internal IP. With our control panel running in "host" mode, Docker Gen executed these commands on the host itself instead of inside each game server's network namespace.
Two of the commands Docker Gen ran deleted the old default gateway and replaced it with a new one pointing to the IPIP tunnel. Since these were run on the host, they removed the machine's main gateway (which, of course, went from the NIC to GSK's router) and replaced it with the IPIP tunnel (which wasn't even configured at that point). As a result, the machine lost network connectivity, which can't be restored until you either restart the machine or attach a KVM, log in that way, remove the bad default gateway, and add the correct default gateway (which is what we did).

To resolve the issue, we first modified the control panel daemon's config so that it sets up its own bridge, which is linked to the veth pairs our control panel adds when spinning up a game server inside a Docker container. Afterwards, restart the daemon. If the machine is still offline, check the default route with the ip route command. If the default route is set to the IPIP tunnel Docker Gen was trying to set up, delete it with ip route delete default and add the original default route with ip route add default dev <Main Interface> via <Gateway IP>.

After this, we were able to spin up servers, but they weren't getting network connectivity. This was due to my IPIPDirect program. When it first starts up (on machine boot), it's supposed to look up the default gateway's MAC address. I believe it started up while the default gateway was set to the IPIP tunnel (so the MAC address was probably 00:00:00:00:00:00 or something virtualized/fake). As a result, outbound game server traffic wasn't being routed out properly. Simply restarting the application with systemctl restart IPIPDirect resolved the issue, since it was then able to save the correct destination MAC address (the default gateway's), and all game servers were online again.

What's not yet clear is why the machine didn't go down right after the panel update last night.
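As a side note on the recovery steps above, here is a minimal Python sketch of the default-route check. It only parses text in the format that ip route prints and builds the two commands mentioned above; it doesn't run anything on the machine. The interface names (eno1, ipip0) and the gateway address (203.0.113.1) are made-up placeholders, not GS14's real values, and the helper names are hypothetical rather than part of our tooling.

```python
from typing import List, Optional


def parse_default_route(ip_route_output: str) -> Optional[dict]:
    """Return {'dev': ..., 'via': ...} for the first default route, or None."""
    for line in ip_route_output.splitlines():
        fields = line.split()
        if not fields or fields[0] != "default":
            continue
        route = {"dev": None, "via": None}
        for key in ("dev", "via"):
            if key in fields:
                idx = fields.index(key)
                if idx + 1 < len(fields):
                    route[key] = fields[idx + 1]
        return route
    return None


def recovery_commands(ip_route_output: str, main_iface: str, gateway_ip: str) -> List[str]:
    """Build the commands needed to restore the default route, if any."""
    route = parse_default_route(ip_route_output)
    # Nothing to do if the default route already uses the real NIC and gateway.
    if route is not None and route["dev"] == main_iface and route["via"] == gateway_ip:
        return []
    cmds = []
    if route is not None:
        # A default route exists but points somewhere else (e.g. the IPIP tunnel).
        cmds.append("ip route delete default")
    cmds.append(f"ip route add default dev {main_iface} via {gateway_ip}")
    return cmds


if __name__ == "__main__":
    # Hypothetical output resembling the broken state: default route on a tunnel device.
    broken = "default dev ipip0 scope link\n"
    for cmd in recovery_commands(broken, "eno1", "203.0.113.1"):
        print(cmd)
```

With the hypothetical broken output above, this prints the delete command followed by the add command; with a healthy default route it prints nothing.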
My suspicion is that when we updated the panel, it initially kept using the internal network even though the config file already specified the host network. Then, at some point today, the panel restarted and picked up that setting, and that's when everything went down. @Aurora and I will most likely be digging through the logs to see if we can find anything that would have caused this.

With that said, we're going to look into purchasing our own KVMs for GS14 and for our future game server machines with GSK. That way we'll have dedicated KVMs we can access at any time and won't have to rely on GSK manually attaching one (their KVM can be in use by other clients, which is why it took longer than usual this time).

I understand this post is more technical, but I figured I'd share this information for those interested. Who knows, if I'm not around, maybe it'll help others like @Aurora know what to look at if this happens again.

I also wanted to say thank you to @Aurora for looking into this and contacting GSK while I was at my GP appointment. This would have taken longer if she hadn't started looking into it.

I apologize for the inconvenience and thank you for your patience. If you have any questions, please feel free to reply!