Roy

Banned
  • Posts

    2,853
  • Joined

  • Last visited

  • Days Won

    383

Posts posted by Roy

  1. Which platform(s) are you appealing?: All game servers

    Discord/TS3 username (Add your tag and ID if possible):

    cdeacon#6401

     

    Why were you banned?: @Pyros

     

    Why should you be unbanned?

    Pyros is an abusing bitch and I deserve to be unbanned. This is unacceptable!!! I don't even know why I was banned...?

  2. Hey everyone,

     

    This thread is being made for documentation purposes.

     

    At 10:16 AM CST today, our Dallas POP with GSK lost network connectivity. Since the POP was still announcing our IP range at the time, players were briefly disconnected until the announcement stopped. For the next thirty or so minutes, the POP announced the IP range on and off, which caused connectivity issues for users routing through our Dallas POP. This happened because the machine was still online and BIRD was still running, which kept our range announced (the BGP session only needs to reach the neighbor address), even though external traffic into the POP failed.
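For context, a BGP daemon like BIRD keeps a session (and therefore the announcement) up as long as it can reach the configured neighbor address, even if forwarding for everything else is broken. A minimal BIRD 2 sketch of such a session (the ASNs, neighbor address, and prefix are placeholders, not our actual configuration):

```
# Hypothetical BIRD 2 session: the range stays announced as long as the
# neighbor address is reachable, regardless of whether external traffic
# into the POP actually works.
protocol bgp upstream {
    local as 65001;                 # our ASN (placeholder)
    neighbor 203.0.113.1 as 65000;  # upstream neighbor (placeholder)
    ipv4 {
        export where net = 198.51.100.0/24;  # announced range (placeholder)
    };
}
```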

     

    Players routing to a POP not directly affected by this incident may not have been able to see our game servers in the server browser during this time, but should have been able to connect to them directly. This is because our game servers in Texas route through the Dallas POP, which is used to cache A2S_INFO responses. Since this POP server went down, A2S_INFO responses never made it back to the Redis server, so players never received them (and the servers wouldn't show up in the server browser). With Bifrost, we'll have a better method of handling A2S_INFO responses/caching, which should prevent this issue if the main POP goes down again.
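For those curious what the POPs are actually caching: A2S_INFO is a small UDP query defined by Valve's server-query protocol, recognizable by its fixed payload. Below is a minimal userspace sketch in C of that check (illustrative only, not the code running on our POPs):

```c
#include <stdint.h>
#include <string.h>

/* An A2S_INFO query is a single UDP payload: a 0xFFFFFFFF header, the
 * 'T' type byte, and the NUL-terminated string "Source Engine Query".
 * A POP that caches responses can answer these directly from its cache
 * instead of forwarding them to the game server. */
static int is_a2s_info_query(const uint8_t *payload, size_t len)
{
    static const uint8_t expected[] =
        "\xff\xff\xff\xff" "TSource Engine Query"; /* includes trailing NUL */
    if (len < sizeof(expected))
        return 0;
    return memcmp(payload, expected, sizeof(expected)) == 0;
}
```

If the check matches, the cached response is returned; any other packet is forwarded as usual.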

     

    The Dallas POP's issue was corrected 45 minutes or so after the initial incident. However, our GS14 machine (also with GSK) went down soon afterwards, resulting in downtime for the game servers on that machine. It turns out it ran into similar network issues as the POP server.

     

    GSK (Gameserverkings) released an official statement here stating what caused the outage.

     

    I apologize for the inconvenience and thank you for understanding. I don't believe this'll happen again and I do appreciate the transparency from GSK on this situation.

     

    If you have any questions or concerns, please feel free to reply.

     

    Thank you.

  3. 44 minutes ago, JGuary551 said:

    @Roy Spit some facts on this kiddo real quick xD

    Assuming this isn't a troll post (which I'm assuming/hoping it is), there usually isn't much point in answering people who assume you run on cloud servers without any evidence, while ruling out other possibilities like routing issues between them and the game servers network-wise, or just poor optimization on the server (or map) itself (which wouldn't be related to the server's hardware). But eh, fuck it, I'm out of it at the moment.

     

    Anyways, feel free to read this thread ->
     

     

    To answer your assumption, we don't run on cloud servers. Every dedicated machine we have right now supports more than 4.0 GHz turbo on all cores. Now, depending on the server, there could be optimization issues with the addons/plugins run or the map, or the machine itself could potentially be overloaded (which will be resolved as soon as we fully move to the new setup proposed in the above thread, but this is less likely since we've moved a lot of servers from overloaded machines to GS14 IIRC). This means it's unlikely to be the hardware. Apart from that, our new setup includes one machine with the Intel i9-9900K and two others with the Intel i9-10900K (probably among the most powerful CPUs for single-threaded performance, which generally matters the most for the type of game servers we run). The Intel i9-9900K on GS14 has been clocking steadily at 4.7 GHz with anything above 20% overall load on the machine, and I've been monitoring this constantly over the last week. I've also performed a stress test on the machine before moving it to production, which held a consistent 4.7 GHz on all cores, and the results can be found here -> 

     

     

    @Hayze If you're not trolling with this thread then provide the game server you're referring to and also provide evidence that we run on cloud servers.

     

    BUT IMMA ASSUME THIS IS BAIT OR A TROLL POST AND IM OUT OF IT RIGHT NOW. So :thumbsup: on that!

     

    I'm going to go pass out now and ❤️  you all.

  4. For reference, this is what I sent the Garry's Mod developer yesterday morning:

     

    Quote

    Hello,


    I made a support ticket regarding this case. However, one of the support agents told me to reach out to you directly.


    One of our servers was blacklisted with the IP '92.119.148.34:27015'.


    I just wanted to know the reasoning for this. My first thought was that it was due to caching the A2S_INFO response on our Anycast network. However, it was my understanding that Facepunch was allowing this after an outrage late last year, which resulted in servers that cache the A2S_INFO response on multiple POP servers being unbanned.


    I do understand why this wouldn't be allowed, but to my understanding Facepunch wasn't taking any further action towards this (I heard a new server browser was being made as well to mitigate the effects of A2S_INFO caching). If Facepunch is going to start taking action against servers doing this again, will this apply to all servers in Garry's Mod?


    Thank you for your time.

     

  5. Update

    I received the following from the Garry's Mod developer:

     

    Quote

    Looks like it was blacklisted by mistake, should be resolved, sorry about that.

     

    The server shouldn't be on the blacklist anymore. I appreciate their fairly quick response on this as well (two days is honestly somewhat quick given how many other tickets/emails they probably get)!

  6. I just want to clarify the situation a bit more, since I'm sure some of you are wondering why we were blacklisted. We're not exactly sure what the reason is, since Facepunch hasn't replied to my ticket yet. However, we suspect it's due to the A2S_INFO caching we have enabled for Breach and the majority of our other servers at the moment.

     

    If you're wondering what A2S_INFO caching is, what it does to our servers, and why we have it on to begin with, I'd recommend checking out this post along with reading this Google Doc. You can also read this post for more information on the technical benefits of caching the A2S_INFO response on our POPs. I want to make it quite clear that this decision was made entirely by me when deploying the network, and I take full responsibility for it. It was obviously quite a hard decision to make, and you can read my thought process in the Google Doc linked earlier. However, this is the final decision I went with.

     

    Servers using a specific hosting provider in Garry's Mod have been doing this for three to four years now, including a lot of the top servers in GMod (now and back then). Late last year, Facepunch finally started cracking down on servers using A2S_INFO caching and blacklisted our servers first. We complied with them and disabled the caching at that time. Eventually, though, they blacklisted the servers that had been doing this for years. Since this included a lot of the top servers, a big outrage occurred from server owners and players. This resulted in them being unbanned, and no further action was taken against servers using A2S_INFO caching until (most likely) today. From what I was told, Facepunch apparently said they weren't banning for this. Therefore, we re-enabled the caching.

     

    The most frustrating part about this is how inconsistent Facepunch has been. I reported these servers a couple of years ago, before our Anycast network was even a thing. All of this is heavily documented here, and Facepunch themselves said they couldn't do anything about it because it was a Valve/Steam issue (Valve didn't want to either, as documented in the project I linked above). However, late last year Facepunch started cracking down on servers, which is fine. I completely understand why they would, and also why players would hate/report servers doing this (you can direct all the hate towards me if you want). Though, of course, when they finally took action against the servers that had been doing this for years, they backed out and unbanned everybody after complaints.

     

    All I really want is consistency from Facepunch over this. As I said though, it isn't confirmed if A2S_INFO caching is the cause. However, I'm pretty sure it is.

  7. I don't believe GFL had a Scoutknivez server in CS:GO and we had a non-populated one in CS:S back in 2011/2012 (far too long ago haha).

     

    Though, an SK server would be cool and I was thinking about it the other day! I won't be making that decision, though; I'd recommend speaking to the CS:GO DL (@FrenZy) if you want to suggest it.

  8. Do you know if this has always been a problem or did you start noticing it happening recently?

     

    I believe we've had this same issue with SourceMod timers in the past and it's usually due to the timers themselves not being properly optimized. Not sure if it's the same case, though.

     

    Other than that, I doubt it's anything network-related since I bumped the TCP rate limit up by 4x recently (and hadn't heard of any issues beforehand either). So unless the route from our web server to the closest POP is getting screwed (which I just checked, and it's fine), I don't believe it's network-related (if it were, all of our game servers would be seeing issues as well).

     

    Anyways, I'll let the Server Manager(s) and other TAs look into this. Hopefully the above information is helpful to them.

  9. 9 hours ago, JGuary551 said:

    An Info section wouldn’t be to bad 🤔 maybe like once you become a member it’ll take you to an info section that’ll tell you a bunch of things like, what servers we have what community based stuff we have like event team shit like that could be nice. Tbh there probably is something like this but idk

    This is a good idea. I believe @Leks mentioned this idea and said we could send this info via a DM when somebody registers on the website. We used to do this back on the old forums (2014 - 2016): once somebody registered, it would send them a DM with useful links/info along with ways to help GFL (at the time, we listed things like spreading the word and registering and applying for our group/clan on GameTracker, among others).

     

    I think this'll be neat once we get the Social Media platform going for GFL again and then we can share all of this information via a DM.

     

    Edit

    Also found this thread on the old forums.

  10. Another Update

    I just wanted to provide another update. I still need to talk to @Dreae about this to make sure he agrees.

     

    As stated in my last post, I believe it's best to use XDP to drop malicious traffic instead of the DPDK for Bifrost. The reason is that using the DPDK would be a lot more complicated, and honestly, we don't need the 5 - 10% performance increase if we're planning to use BGP Flowspec to push policies to our upstreams and have them mitigate certain types of attacks. XDP is still a great solution for (D)DoS mitigation regardless, and since our POPs will have either one or ten Gbps NIC speeds max, there is no need to use the DPDK over XDP in this case unless we already had programming experience with the DPDK, of course. I still plan on learning the DPDK regardless, because I'd like to use it for some personal projects in the future, and perhaps I could make a Bifrost module that utilizes it.

     

    Now, I did rethink the design of the packet flow with Bifrost because I wasn't sure how we'd take a whitelist approach if we used XDP to block unneeded/malicious traffic.

     

    I believe it's best if we use XDP only to drop malicious traffic that we detect. Therefore, we won't be dropping unneeded traffic with it. Instead, unneeded traffic will go through the standard Linux network stack and either be discarded or sent to the appropriate service. This shouldn't impact performance much at all, since unneeded traffic won't have a high volume.

     

    We will include a whitelisting option for our forwarding module. What this means is we'll have TC BPF programs (both ingress and egress) that decide whether packets should be forwarded via NFTables/IPTables by marking the packet. Incoming traffic to our Anycast IPs will go through these TC programs, and if a packet matches the initial handshake or its source IP was already validated, the program will mark the packet and send it on its way down the network stack. NFTables and IPTables will only forward packets that have, for example, a mark of 0x7. Now obviously, you won't need to take a whitelisting approach if you don't want to; we're planning to make everything configurable.
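The marking decision above can be sketched in plain C like so (this is only the decision logic; the real thing would be a TC BPF program with a BPF map, and 0x7 is just the example mark from above):

```c
#include <stdbool.h>
#include <stdint.h>

#define FWD_MARK 0x7u  /* example mark that NFTables/IPTables would match */

/* Stand-in for a BPF map lookup of already-validated source IPs. */
static bool ip_validated(uint32_t saddr, const uint32_t *whitelist, int n)
{
    for (int i = 0; i < n; i++)
        if (whitelist[i] == saddr)
            return true;
    return false;
}

/* The decision a TC ingress program would make: mark the packet for
 * forwarding only if it's part of the initial handshake or its source
 * IP has already been validated; otherwise leave it unmarked so the
 * firewall won't forward it. */
static uint32_t decide_mark(bool is_handshake, uint32_t saddr,
                            const uint32_t *whitelist, int n)
{
    return (is_handshake || ip_validated(saddr, whitelist, n)) ? FWD_MARK : 0;
}
```

On the firewall side, the matching rule would be something along the lines of `meta mark 0x7 accept` in the NFTables forward chain (exact rule hypothetical).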

     

    Now, the packet filtering aspect of Bifrost will be a module of its own. I'm still unsure what we want to use to inspect incoming packets, but I believe we could just use NFTables or IPTables for this, as long as they allow inspecting every part of the packet and expose an API of some sort in C so we can match against our filtering rules. With that said, any packets matching filters that have a block time set should be added to the XDP blacklist map for a period of time so they can be dropped by the XDP program. The only thing we need to make sure of is that we're not copying packets to user space for inspection, because that would obviously cause a big performance degradation. However, NFTables and IPTables are zero-copy to my understanding, since they sit before user space (they utilize Netfilter). One thing to note is that this module is stateless, meaning we're just matching each incoming packet against a set of filters and aren't keeping track of connections.
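The block-time idea can be sketched as a userspace simulation (the real version would be a BPF hash map that the XDP program consults; all names here are made up):

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_ENTRIES 64

/* Simulated XDP blacklist map: the filtering module adds a source IP
 * with an expiry time, and the XDP program drops its packets until the
 * block time passes. */
struct bl_entry { uint32_t saddr; uint64_t expires_ns; };

static struct bl_entry blacklist[MAX_ENTRIES];
static int bl_count;

static void bl_add(uint32_t saddr, uint64_t now_ns, uint64_t block_ns)
{
    if (bl_count < MAX_ENTRIES)
        blacklist[bl_count++] = (struct bl_entry){ saddr, now_ns + block_ns };
}

/* Mirrors the XDP decision: drop while the entry is unexpired, pass
 * otherwise. */
static bool should_drop(uint32_t saddr, uint64_t now_ns)
{
    for (int i = 0; i < bl_count; i++)
        if (blacklist[i].saddr == saddr && now_ns < blacklist[i].expires_ns)
            return true;
    return false;
}
```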

     

    With the above said, I ALSO want to make a module in the future that is more stateful for filtering packets. This would be a separate module because it won't have individual filters. Instead, it will try to detect patterns in connections/packets (UDP, TCP, and any other protocol), and then you can set what to do with the packets (block them via XDP with a block time, push BGP Flowspec policies, drop the packets individually, or just alert the admins). We would probably utilize TC for packet inspection here since we'd like it to be as fast as possible. However, I'll need to see if BPF allows payload matching with TC (I know the verifier doesn't like it with XDP, but I've noticed TC programs being a bit less strict with the BPF verifier for some reason). Otherwise, I suppose we could use IPTables or NFTables, but I don't think that'd give us full control over packet inspection.

     

    Our POPs will basically be responsible for absorbing attacks that either have changing characteristics with each packet (e.g. random source IP, protocol, packet length, TTL, and so on; basically, packets we can't find any patterns for) or that we drop via a payload match, since BGP Flowspec doesn't support payload matching right now (it DOES appear to be in progress, though, according to here). Any other attacks should be detected, and a BGP Flowspec policy should be pushed to drop the traffic at the upstream so the POP isn't responsible for it. I do believe our future POPs should have a 10 Gbps NIC and link just so we can support dropping up to 10 Gbps on each POP.
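For reference, announcing a Flowspec rule from BIRD 2 looks roughly like the sketch below (address and port are placeholders; in a real setup, the drop/rate-limit action is signalled to the upstream via an extended community rather than in the route itself):

```
# Hypothetical BIRD 2 flowspec announcement describing a UDP flood
# toward one of our addresses so the upstream can act on it.
protocol static flowspec_rules {
    flow4;
    route flow4 {
        dst 198.51.100.10/32;  # attacked address (placeholder)
        proto = 17;            # UDP
        dport = 27015;         # targeted port (placeholder)
    };
}
```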

     

    Anyways, other than that, the caching aspect of Bifrost will also be a module. When Bifrost boots up on the servers, they will use an encrypted TCP connection to communicate with the centralized server and retrieve settings and other information such as filters, forwarding rules, etc. I think we should build our own encryption layer for this using Libsodium and Erlang Crypto, which @Dreae and I already have experience with (Dreae a lot more than me, but I did make this haha).

     

    Apart from that, the other neat thing with GSK (our primary POP hosting provider going forward) is that, to my understanding, they allow us to purchase hardware and put it in front of our servers. Therefore, if we ever had the money and wanted to, we could purchase a router or switch whose NICs have built-in ASICs to process specific types of packets/packet flows via TCAM, which is A LOT faster since it puts no load on the CPU. GSK's upstream/network also has switches and routers with ASICs built into the NIC. I know they support layer 3/4 filtering via ASIC, which we'll be utilizing via BGP Flowspec policies. But I do not know if they have a router with specific ASICs built in to handle payload matching (there are some out there, but I'm not sure if GSK has any).

     

    Things are looking good though and I'm confident this'll be a much cleaner system than what I initially had planned (which was honestly delaying a lot of the development on Bifrost since my previous plan was messy).

     

    Thank you!

  11. And finally, I also made a CS:S test server and spawned in 63 bots with a DM plugin enabled:

     

     

    It worked out quite well. At some points there were small FPS dips, but I believe this was due to collisions (block was enabled for players/bots) and defuse kits being left behind by the CTs (when they collide, it can impact server performance).

     

    The CPU cores were running at around 4.7 - 4.9 GHz at this time!

  12. For those curious, I've performed another stress test using the Ubuntu package on our GS14 machine for 15+ minutes. GS14, as explained in the original post, includes the Intel i9-9900K. Here are the results!

     

    100% Load

    3187-08-23-2020-x7tuSdCL.png

     

    3188-08-23-2020-LuTO4O8h.png

     

    3189-08-23-2020-MqjsX66q.png

     

    We saw ~4.7 GHz consistently for 15+ minutes at 100% load on all cores :) This would be the biggest test and it stayed around 90C. The dip you see on the CPU frequency graph was one core going from 4.7 GHz to 4.69996 GHz :omegalul:

     

    Additionally, since the temps stayed around 90C for 15+ minutes, it will more than likely not get any hotter. We did run it for a while last night just to be sure, and it stayed around these temperatures.

     

    Command executed to initialize stress test:

     

    stress --cpu 16

     

    50% Load

    3190-08-23-2020-OLdc9f51.png

     

    3191-08-23-2020-wI1HqjeG.png

     

    3192-08-23-2020-L8OV8j00.png

     

    Command executed to initialize stress test:

     

    stress --cpu 8

     

    Note - I believe the reason the temperatures stayed high is that we were technically still putting load on all cores. When executing the above command, I believe CPUs 1 - 8 are the physical cores themselves and 9 - 16 are the hyper-threads. Regardless, it still performed great, as expected.

     

    I'm actually now really curious how the Intel i9-10900K will perform at 100% load on all cores with our current liquid cooler. It will be interesting to see, but keep in mind that our machines will likely only see 50 - 60% load (during peak) assuming we don't overload them (which is the plan, of course).

     

    Thanks!

  13. Hey everyone,

     

    I just wanted to provide a brief update on deploying filters to the rest of our POPs. For those that do not know, I was trying to deploy a version of Compressor I made that includes filters that drop non-legitimate traffic. For more information, I'd suggest reading this post (under "Temporary Solution"). With that said, this is all supposed to be temporary until Bifrost is made by Dreae and me.

     

    Anyways, as of right now, we have the new filters deployed to all of our Asia and Europe POPs. They have been running stable on these POPs, and I did initially try deploying the filters to our GSK POP as well. However, it ran into some sort of strange bug where Compressor would block my home IP until Compressor was restarted (this wasn't due to rate limiting either). This hasn't happened since I deployed my "V1" filters, which don't include in-depth filtering (e.g. handshake validation, etc). Therefore, I'm assuming something specific is happening between my newest filters and the hardware on that machine, since the newest filters are running stable on our Europe/Asia POPs. I was planning to debug this as well, but debugging in XDP is a pain and I just didn't have enough time at that moment.

     

    With that said, I overlooked something when beginning to deploy the new filters to our NA POPs. I need to implement functionality so that when a player's IP is validated via the handshake process to the game server, the IP is sent to a global list that each POP pulls from and whitelists as well. Without this, if a player connects to our game server after validating through one POP, but their route changes mid-game to another POP server running the new filters, they will time out because they aren't in that POP's handshake-validated eBPF map. This means they need to reconnect each time to get validated on that specific POP again, which is obviously a pain in some cases.
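The fix described above boils down to a two-level lookup, sketched here in C (illustrative only; the real maps are eBPF maps, and the global list would be synced from a central store):

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_IPS 64

/* A POP checks its own handshake-validated map first, then a global
 * list that every POP pulls, so a player whose route shifts to a
 * different POP mid-session stays whitelisted there. */
struct ip_set { uint32_t ips[MAX_IPS]; int n; };

static bool set_has(const struct ip_set *s, uint32_t ip)
{
    for (int i = 0; i < s->n; i++)
        if (s->ips[i] == ip)
            return true;
    return false;
}

static bool is_whitelisted(const struct ip_set *pop_local,
                           const struct ip_set *global, uint32_t ip)
{
    return set_has(pop_local, ip) || set_has(global, ip);
}
```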

     

    I was planning to implement this functionality into Compressor. However, this is starting to feel like I'm putting in A LOT of work for a temporary solution (the private GitHub repo that includes these filters has 150 commits from me alone, implementing functionality just for this specific branch). Therefore, I halted that plan for the time being until I decided whether it was worth it or whether we should just continue on with Bifrost instead. I've decided it'd be best to halt the temporary solution for now and continue planning out Bifrost.

     

    With that said, I was recently notified by the GSK owner, Renual, about something really neat that he is able to do. When GSK's network detects an attack on our IP range, he can send policies to GTT (one of their carriers, and our primary peer with Vultr other than NTT) that force the targeted IP to be offloaded via GTT. This makes all traffic go through a scrubbing device at GTT and, to my understanding, also allows GSK's network to mitigate the attack. Since we primarily use GTT with our other hosting provider, traffic going through GTT to our other POPs also passes through this scrubbing device and GSK's network. This essentially allows GSK to mitigate malicious traffic on our network that would normally go through POPs we don't have with GSK, which is pretty neat in my opinion.

     

    All of our POPs in the US support GTT right now. So one thing I'd really like to do as a temporary solution is disable NTT entirely on our NA POPs. This way, all traffic to our NA POPs is going through GTT, which will allow GSK to mitigate attacks and ingest all the malicious traffic sent to our NA POPs via GTT offload.

     

    Sadly, GTT isn't available for some POPs in Europe/Asia. However, since these are running my new filters, they shouldn't be forwarding the traffic to the game server machines anyways. So the worst that can happen is that specific POP gets overloaded, but it won't impact anybody else routing to different POPs assuming the attacker isn't smart enough to bypass the new filters I have deployed.

     

    I will likely be disabling NTT as a peer for our NA POPs in the next couple of weeks. Otherwise, I can just prepend 1 - 3 hops towards NTT, which will make a lot less traffic go through NTT. I will also replace the filters on our NA POPs with our V1 filters, which will include whitelisting Rust+ entirely. Therefore, players who can't use Rust+ on our Rust servers at the moment (since they're routing through a POP with my strict "V1" filters currently) will be able to use Rust+ :)

     

    As mentioned, this is just a temporary solution since ideally, as an end-goal, I'd like each POP to be able to mitigate as much malicious traffic as possible (that's the entire point of using Anycast for (D)DoS protection, after all). Once we start expanding our Anycast infrastructure, I plan on introducing a blend of carriers including GTT, NTT, Telia, Comcast, and more. I'll definitely have to do BGP tuning for this, but this would result in better routing on the network if done correctly (players would see less latency to our network if their routes are optimized).

     

    Also, as a side note, I made a script that notifies our Staff Discord when an attack is detected by GSK. Feel free to check the script out here. This script is temporary until I figure out how to log these offloads with BIRD (sadly, when logging everything right now, it doesn't include the offloads; I think this is because the route isn't imported correctly through BIRD). Once I figure out how to log it with BIRD, I'll just make a Go program or something that watches the BIRD log file, detects entries for when GSK offloads one of our IPs to GTT, and sends a message to the Discord webhook. We've been attacked twice already:

     

    3180-08-23-2020-YvHmG886.png

     

    Thanks!

  14. Update

    I just wanted to provide an update on this. I was told that we should get the same performance using the DPDK KNI library as using the DPDK without it when processing packets within the DPDK application (e.g. packets we truly care about). However, packets that we decide to send down the network stack (e.g. packets we don't care to filter/forward) will suffer a slight performance impact, which makes sense because we're intercepting them with the DPDK application and then sending them on their way through the network stack.

     

    This is more or less what I was hoping for if we used the DPDK KNI library. Truthfully, the traffic we send down the network stack will more than likely be traffic to the POP itself (SSH, package updates, BGP, and so on). A slight performance degradation on this specific traffic wouldn't mean much, since we don't truly care to filter it as of right now. We will still drop any unneeded/malicious traffic using the DPDK application itself.

     

    I do plan on making benchmarks against XDP and DPDK (with and without the KNI library) in the future. Hopefully this'll confirm everything :)

     

    Honestly, I think we're still going to use XDP to drop packets. I say this because XDP will be able to drop 10 Gbps with ease on our POP servers, which is the max NIC speed we'll go with anyways. It'd only make sense to write a DPDK application if we had to drop more than 10 Gbps and the network equipment supported ASICs (hardware integrated into the NIC that allows for faster packet processing).

  15. 10 minutes ago, williampickthall12 said:

    How long would that have taken to write??? also good upgrades!

    Two hours or so haha. I'm used to it by now though. You should have seen some of my update posts I made years ago. Some of them would extend to 20+ pages and I had them usually posted on a ~two-week basis xd. This is one I found off my list that was 17 pages alone lol.
