
Roy

Prop Hunt HTTP Request Issues (E.g. User Management Breaking)


Hey everyone,

 

I am creating this thread for documentation purposes and to be transparent about an issue GMod Prop Hunt has been facing for roughly the last month. I also want to apologize for the delay in resolving it. The issue appears to be very complex and, strangely, occurs at random. I've also been really busy recently, so focusing on one server's issue can be overwhelming at times.

 

The Issue

  • There are extended periods of time where HTTP/HTTPS requests will fail with the message "Unsuccessful" which breaks addons like User Management and Discord Integration (since GWSockets on 64-bit is broken on this server for some reason).
  • This results in players not receiving their correct GFL rank (only Member, Supporter, and VIP) and also the Discord Integration addon not working (this is one of the primary servers that gets people into our official Discord server and one of the main reasons we're at 13K+ members now instead of 2 - 3K).

 

This is a search on the console.log file for the message "unsuccessful" in Visual Studio Code:

 

[Image: Visual Studio Code search results for "unsuccessful" in console.log]

 

The highlights within the scrollbar indicate where these messages are within the log file and as you can see, there are huge gaps where everything will be fine. However, there are periods (some lasting very long) where it'll be repeatedly failing to send the HTTP/HTTPS requests.

 

When Did This Issue Start Occurring?

  • I'm not fully sure when it started occurring.
  • My suspicion is it started occurring after we moved the server to GMod 64-bit.
  • This issue was occurring before we moved the server to GS15 (GSK).

 

What We Know So Far

  • This issue only seems to be occurring for GMod Prop Hunt consistently. I've heard reports of this happening from other servers in Garry's Mod a while back as well, but those were never fully investigated and I haven't heard of reports of this issue happening consistently on other servers like it does for Prop Hunt.
  • This issue is most likely not network-related. If it were, we would likely see it on other servers too, including servers outside of GMod (and I haven't seen any such reports). On top of that, I remember running a script I made to send HTTP/HTTPS requests while the issue was ongoing, and the requests failed immediately with "unsuccessful". I'd imagine there's some sort of timeout set (probably 5 - 10 seconds or so), so if this were network-related I'd expect the requests to hang until that timeout instead of returning "unsuccessful" right away.
  • This seems to be something with GMod 64-bit, but nothing is fully confirmed.
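The "immediate failure versus timeout" reasoning above can also be checked from outside the game server with cURL. The following is only a rough sketch: the helper names classify_probe and probe are made up for illustration, and the "less than half the timeout counts as immediate" threshold is an arbitrary assumption. It relies on curl exiting with code 28 when the -m limit is reached.

```shell
#!/usr/bin/env bash
# classify_probe EXIT_CODE ELAPSED_SECONDS TIMEOUT_SECONDS
# Hypothetical helper: labels one curl attempt from its exit code and duration.
classify_probe() {
    local exit_code=$1 elapsed=$2 timeout=$3
    if [ "$exit_code" -eq 0 ]; then
        echo "ok"
    elif [ "$exit_code" -eq 28 ]; then
        # curl exits with 28 when the -m/--max-time limit is reached.
        echo "timeout"
    elif awk -v e="$elapsed" -v t="$timeout" 'BEGIN { exit !(e < t / 2) }'; then
        # Assumption: failing in under half the timeout counts as "immediate".
        echo "immediate-failure"
    else
        echo "slow-failure"
    fi
}

# probe URL [TIMEOUT]: runs curl once and prints the classification.
probe() {
    local url=$1 timeout=${2:-10} elapsed exit_code
    elapsed=$(curl -s -o /dev/null -m "$timeout" -w '%{time_total}' "$url")
    exit_code=$?
    classify_probe "$exit_code" "${elapsed:-0}" "$timeout"
}
```

A run of "immediate-failure" results during an outage would match what the Lua test showed, while "timeout" results would point back towards the network. Note that this exercises cURL, not GMod's own HTTP client, so it can only rule things out on the network side.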

 

What We Need To Find Out

  • Both the Discord Integration and User Management addons connect to a REST API via HTTP that goes through CloudFlare. Therefore, we need to find out whether HTTP requests to any host fail while the issue is occurring, or only the requests these two addons send to our API.
  • If there's a way to reproduce this issue, it would make debugging it 1000x easier. Going through the logs, I unfortunately haven't seen any sort of pattern when these happen.
  • If none of the easy potential solutions work, I'd like to perform a TCPDump scanning for two types of packets (one incoming and one outgoing) to see if we're receiving/sending the HTTP/HTTPS requests/responses.
  • The first filter would be for incoming traffic and will have to scan for IPIP packets coming in from Prop Hunt's Anycast IP that carry TCP with source port 80 or 443 (HTTP/HTTPS). Scanning for incoming packets will be harder since TCPDump doesn't natively support filtering on the inner IP header's information. So, for example, we'll need to do some math and make sure we match 2 bytes when filtering by the inner packet's source port (since the source port is 16 bits long in the TCP header).
  • The second filter would be simple. Since my IPIP Direct program is enabled, we can simply scan for outgoing packets sourced from Prop Hunt's IP address (92.119.148.4) using TCP (with destination port 80 or 443 for HTTP/HTTPS).

 

I've made a Lua script to test HTTP/HTTPS requests to any host here:

 

if SERVER then
    -- Sends a message to the calling player, or to the server console when there is no valid player.
    local function reply(ply, msg)
        if IsValid(ply) then
            ply:ChatPrint(msg)
        else
            print(msg)
        end
    end

    concommand.Add("httptest", function (ply, cmd, args)
        -- Only superadmins (or the server console) may run this.
        if IsValid(ply) and not ply:IsSuperAdmin() then
            ply:ChatPrint("You cannot use this function :P.")

            return
        end

        -- Console command arguments arrive as strings, so convert the amount to a number.
        local url = args[1]
        local amount = tonumber(args[2])
        local token = args[3]

        -- Only these three arguments are used by the script.
        if not url or not amount or not token then
            reply(ply, "Usage: httptest <url> <amount> <token>")

            return
        end

        for i = 1, amount do
            http.Fetch(url,
                -- Success callback: report the status code and response body.
                function (body, length, headers, code)
                    reply(ply, "Got code => " .. code .. " with => " .. body)
                end,

                -- Failure callback: report the error message (e.g. "unsuccessful").
                function (message)
                    reply(ply, "HTTP Fetch Failed => " .. message)
                end,

                {
                    ["Authorization"] = token
                }
            )
        end
    end)
end

I executed this script with the amount set to 100 - 200, requesting the GFL API (going through CloudFlare) multiple times to see if any rate limiting was being applied. These all succeeded, which shows we were not being rate-limited at that point. I suppose it's still possible CloudFlare rate-limits us while the issue is occurring, but we won't know unless we execute the above Lua script's command (httptest) against a non-CloudFlare proxied host while the issue is ongoing.
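The same kind of burst test can be approximated from a shell with cURL, which makes it easy to point at a non-CloudFlare host. This is a minimal sketch with made-up helper names (burst, tally_codes) and a placeholder address from the documentation range. Unlike the Lua command, which fires its requests without waiting for earlier ones to finish, this loop sends them one after another, so it is a gentler test. A pile of 429 (Too Many Requests) responses would suggest rate limiting, while 000 means cURL got no HTTP response at all.

```shell
#!/usr/bin/env bash
# tally_codes: reads one HTTP status code per line on stdin
# and prints "<code> <count>" sorted by code.
tally_codes() {
    sort | uniq -c | awk '{ print $2, $1 }'
}

# burst URL COUNT: sends COUNT sequential requests and prints each status code.
burst() {
    local url=$1 count=$2 i
    for ((i = 0; i < count; i++)); do
        curl -s -o /dev/null -m 10 -w '%{http_code}\n' "$url"
    done
}

# Example (hypothetical, non-proxied host):
#   burst "http://203.0.113.10/" 100 | tally_codes
```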

 

As for the TCPDump commands, scanning for outgoing packets is simple and can be used with something like this:

 

tcpdump -i any 'src host 92.119.148.4 and tcp and (port 80 or port 443)' -nne

 

Scanning the inner IP header for incoming IPIP packets for HTTP/HTTPS requests is a bit more difficult. So here we have the TCP header:

 

[Image: TCP header diagram]

 

The source port takes up the first 16 bits of the first 32-bit word within the header (that's what they call it for some reason in guides I've read). This makes things simpler since we won't have to do much math to reach into the TCP header. What I typically do is start from the first (outer) IP header and then take into account the length of both the outer and inner IP headers (each is normally 20 bytes, so 20*2 = 40). Since we only want to match the first 2 bytes (16 bits) of the TCP header, we can do something like this:

 

tcpdump -i any 'src host 92.119.148.4 and (ip[40:2]=80 or ip[40:2]=443)' -nne
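The offset math can be written out as a small helper so it is easy to adjust if either IP header carries options and is longer than 20 bytes. This is just a sketch: inner_sport_filter is a made-up name, the src host clause is copied from the command above (whether the outer header really carries that address depends on how the tunnel is set up), and the extra "ip proto 4" check, which limits matches to IP-in-IP packets, is my own addition.

```shell
#!/usr/bin/env bash
# inner_sport_filter HOST OUTER_IHL INNER_IHL PORT...
# Prints a pcap filter matching IPIP packets from HOST whose inner TCP
# source port is one of PORT...
inner_sport_filter() {
    local host=$1 outer_ihl=$2 inner_ihl=$3
    shift 3
    # The TCP source port is the first 2 bytes after both IP headers.
    local offset=$((outer_ihl + inner_ihl))
    local ports="" port
    for port in "$@"; do
        ports+="${ports:+ or }ip[${offset}:2]=${port}"
    done
    # "ip proto 4" (IP-in-IP) is an addition, not part of the original filter.
    echo "src host ${host} and ip proto 4 and (${ports})"
}

inner_sport_filter 92.119.148.4 20 20 80 443
# prints: src host 92.119.148.4 and ip proto 4 and (ip[40:2]=80 or ip[40:2]=443)
```

The printed string can be passed to tcpdump as the filter expression, e.g. tcpdump -i any "$(inner_sport_filter 92.119.148.4 20 20 80 443)" -nne.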

 

Now let's put both filters together so we won't have to run two separate TCPDumps and log it into a PCAP file:

 

tcpdump -i any '(src host 92.119.148.4 and tcp and (port 80 or port 443)) or (src host 92.119.148.4 and (ip[40:2]=80 or ip[40:2]=443))' -nne -w mypcktcap.pcap

 

We can probably run this in a screen session as well so it runs in the background for a few days or so. It'll probably capture a lot of packets, so the file may be a bit large, but we can inspect it using Wireshark and look at the timestamps. We'll also need to make sure timestamps are enabled within the logs on Prop Hunt, which appear to be disabled at the moment.
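One way to keep a multi-day capture manageable is to let tcpdump rotate its output file on a timer and start it inside a detached screen session. The sketch below only prints the command instead of running it. The session name (phcap), the output prefix, and the one-hour rotation are all assumptions for illustration. It uses tcpdump's -G option together with a strftime pattern in the -w file name, and screen's -dmS flags to start detached under a session name.

```shell
#!/usr/bin/env bash
# capture_command FILTER OUT_PREFIX ROTATE_SECONDS
# Prints (does not run) a detached screen command wrapping tcpdump with
# time-based file rotation.
capture_command() {
    local filter=$1 prefix=$2 rotate=$3
    # -G rotates the output file every ROTATE_SECONDS; the strftime pattern
    # in the -w file name gives each file its own timestamp.
    printf "screen -dmS phcap tcpdump -i any -nn -G %s -w '%s-%%Y%%m%%d-%%H%%M%%S.pcap' '%s'\n" \
        "$rotate" "$prefix" "$filter"
}

capture_command 'src host 92.119.148.4 and tcp and (port 80 or port 443)' /tmp/phcap 3600
```

Hourly files also make it quicker to open just the capture that covers a period where the log shows failures.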

 

Possible Solutions

  • Validate the server via SteamCMD (simply set the SERVERUPDATE variable within the panel to 2 and restart the server which will launch SteamCMD using the validate flag). It's possible this may resolve it if it wasn't attempted considering this seems to be some sort of strange issue on the server itself.

 

Possible Work-Arounds

  • Reprogram User Management to use GWSockets. To my understanding this would also improve performance: instead of establishing a separate TCP connection (and 3-way handshake) for every HTTP/HTTPS request, there would be one persistent TCP connection via sockets, which also gives us other functionality such as keep-alive and more.
  • Fix Discord Integration breaking with GWSockets on GMod 64-bit. As of right now, only the Discord Integration addon is broken when using GWSockets after upgrading to GMod 64-bit which is very strange. It also fails with "Network is unreachable" which makes no sense and it seems like it's not seeing the correct default route (the IPIP tunnel) within the Docker container/network namespace.

 

That's all for now! If anybody has any ideas, please let me know!

 

Once I have an update, I will reply to this thread.



So I forgot about something in the original post somehow. I could also try either:

 

  1. Using cURL inside the network namespace the Docker container resides in to request URLs and see if they fail.
  2. Accessing the Docker container's shell and executing the cURL command within the container.

 

If I do this while the server is experiencing the issue and all cURL requests succeed, it strongly points to an issue with the game server itself, because cURL requests within the container take the same path as requests from the server (the default route goes out the IPIP tunnel from the Anycast network). Executing it within the network namespace would be easier and more convenient because I could set up a cron job to do the same.

 

Something like the following would help with the Docker container execution:

 

docker exec -it <container UID> bash
# This will put me into the Docker's Bash shell and I can then do.
curl <opts> <url>

 

For the network namespace, I could do:

 

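# Inspect the Docker container and get its main ID.
docker inspect <container UID>
# I'll be able to get the main container ID from here and execute the below.
# Note: ip netns only sees namespaces listed under /var/run/netns, so the container's namespace may need to be linked there first.
ip netns exec <id> curl <opts> <url>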

 

A neat Bash script that should work on the host machine would be:

 

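#!/bin/bash
# Replace <ID>, <opts>, and <url> with real values.
id=<ID>
date=$(date)
# Run cURL inside the container's network namespace and capture its output.
curl=$(ip netns exec "$id" curl <opts> <url>)

# Append a timestamped result to the log.
echo "[$date] $curl" >> /path/to/file.log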

 

Note - I believe the Docker container ID can change after restart. So I may have to parse the output from docker container inspect ... and then perform regex to get the container's ID.
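A possibly simpler alternative to regex parsing: docker inspect accepts a container name as well as an ID and can print a single field through its --format option, so the current ID can be looked up by name each time the script runs. A minimal sketch, assuming the container has a stable name (the name prophunt below is made up):

```shell
#!/usr/bin/env bash
# resolve_container_id NAME
# Prints the full ID of the container called NAME, or fails if it does not exist.
resolve_container_id() {
    local name=$1 id
    # --format extracts just the Id field from the inspect output.
    id=$(docker inspect --format '{{.Id}}' "$name" 2>/dev/null) || return 1
    [ -n "$id" ] || return 1
    echo "$id"
}

# Example (hypothetical container name):
#   id=$(resolve_container_id prophunt) && docker exec "$id" curl <opts> <url>
```

Since docker exec also accepts names, the lookup may not even be needed when the command is run through docker exec rather than a network namespace.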

 

Then a cron job could be:

 

# Execute cron every minute and should be within root's cron jobs.
* * * * * /path/to/bash/file.sh > /dev/null 2>&1

 

Having these cron jobs and also the packet captures running will help a TON with debugging the issue. I will be setting this up in a bit.



Update

I have a cron job running every minute now with the following (host names, IPs, paths, and IDs are of course stripped for security):

 

#!/bin/bash
id="xxxx-xxxxx-xxxxx"
date=$(date)

# -o /dev/null discards the body, -w prints only the status code, -m 10 caps each request at 10 seconds.
curl=$(docker exec "$id" curl -o /dev/null -w '%{http_code}' -m 10 -s https://myapi.com/endpoint)
curl2=$(docker exec "$id" curl -o /dev/null -w '%{http_code}' -m 10 -s https://google.com)
curl3=$(docker exec "$id" curl -o /dev/null -w '%{http_code}' -m 10 -s http://xxx.xxx.xxx.xxx)

# One line per target: our API, Google, and the static VM.
echo "[$date][API] $curl" >> /home/xxx/curlph.log
echo "[$date][Goo] $curl2" >> /home/xxx/curlph.log
echo "[$date][Sta] $curl3" >> /home/xxx/curlph.log

 

I wasn't able to use network namespaces because the loopback interface is enabled within our Docker containers (this is needed for DNS in our case, but it stops DNS from working when executing within the network namespace while it is up). Instead, I was able to just use the docker exec <id> <cmd> command without -it (interactive terminal, I believe). The script sends three HTTP requests via cURL: one to our API (the one that keeps failing with User Management when the issue is going on), one to Google, and one to the IP of a VM where I run a simple NGINX server. Our API uses CloudFlare, so this will allow us to see if CF is possibly failing when the issue starts occurring again (if CF times out, but Google and the other IP don't, it's something with CF). This will also tell us if it could be something DNS-related (e.g. if CF and Google time out, but the VM IP is fine, this is likely related to DNS).
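The decision logic in the paragraph above can be written down as a small function over the three status codes. This is a sketch rather than part of the running script: diagnose and its labels are made up, the "no-connectivity" case is my own addition, and it relies on cURL printing 000 for %{http_code} when it received no HTTP response (for example after a timeout). A non-000 code can of course still be an error response, which this does not look at.

```shell
#!/usr/bin/env bash
# diagnose API_CODE GOOGLE_CODE STATIC_CODE
# Maps the three probe results onto the cases described above.
# curl reports 000 when it got no HTTP response.
diagnose() {
    local api=$1 goo=$2 sta=$3
    if [ "$api" != 000 ] && [ "$goo" != 000 ] && [ "$sta" != 000 ]; then
        echo "all-reachable"
    elif [ "$api" = 000 ] && [ "$goo" != 000 ] && [ "$sta" != 000 ]; then
        # Only the CloudFlare-proxied API failed.
        echo "cloudflare-or-api"
    elif [ "$api" = 000 ] && [ "$goo" = 000 ] && [ "$sta" != 000 ]; then
        # Both hostname-based targets failed, the raw IP did not.
        echo "dns-suspected"
    elif [ "$api" = 000 ] && [ "$goo" = 000 ] && [ "$sta" = 000 ]; then
        # Assumption: everything failing suggests a general connectivity problem.
        echo "no-connectivity"
    else
        echo "inconclusive"
    fi
}
```

Calling it as diagnose "$curl" "$curl2" "$curl3" at the end of the monitoring script would let the log carry the verdict alongside the raw codes.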

 

I'm still waiting for the issue to occur again. If the issue occurs and we see no timeouts in the log from my monitoring script, this points to the issue being on the game server itself.
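Once the issue shows up again, the monitoring log can be checked for failed probes with a simple grep. A minimal sketch, assuming the log keeps the "[date][label] code" line format written by the script above. failed_probes is a made-up name, and 000 is what cURL prints when it received no HTTP response.

```shell
#!/usr/bin/env bash
# failed_probes LOGFILE
# Prints every log line whose status code is 000, i.e. curl received
# no HTTP response for that probe.
failed_probes() {
    grep -E '\]\[(API|Goo|Sta)\] 000$' "$1"
}

# Example:
#   failed_probes /home/xxx/curlph.log
```

If this prints nothing for a period where console.log is full of "Unsuccessful" messages, that supports the game server itself being at fault.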

 

Thank you!
