How to DoS a Django application with bad GET requests.

Like any complex production service, Thread experiences outages and issues in production that affect our users from time to time. In order to improve our service quality we run post-mortems of these incidents to identify ways to fix the root causes and improve service reliability moving forward. This post summarises one such analysis.

Django deployments geared for production usually consist of a WSGI server such as Gunicorn running your app, sitting behind a reverse proxy such as NGINX. Reverse proxies tend to scale better and are less susceptible to common Denial of Service (DoS) attacks, which is one of the main reasons not to expose your WSGI server directly.

Because WSGI servers like Gunicorn aren't built to protect against Denial of Service attacks, it's easy to trigger DoS conditions against a Django application with even slightly malformed requests, for example by sending a request with a body smaller than the advertised Content-Length. If the view tries to read the body, it will hang waiting for the missing bytes and block the current worker. This issue has more details. With NGINX, the proxy_request_buffering setting is what protects us: the proxy reads the whole request body before passing the request on to the backend.
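To make the failure concrete, here is a minimal sketch using plain Python sockets rather than Gunicorn itself. The helper names `read_body` and `demo`, the `/graphql` path and the `example.test` host are illustrative assumptions. It sends a request that advertises more body bytes than it delivers and shows the reader stalling until a timeout fires.

```python
import socket


def read_body(conn: socket.socket, timeout: float = 0.5) -> str:
    """Read one HTTP request from conn and report whether the advertised body arrived."""
    conn.settimeout(timeout)
    data = b""
    # Read until the blank line that ends the header block.
    while b"\r\n\r\n" not in data:
        chunk = conn.recv(4096)
        if not chunk:
            return "closed"
        data += chunk
    head, _, body = data.partition(b"\r\n\r\n")

    # Find the advertised body length (0 if the header is absent).
    length = 0
    for line in head.split(b"\r\n")[1:]:
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.strip())

    try:
        # Keep reading until we have as many bytes as the header promised.
        while len(body) < length:
            chunk = conn.recv(length - len(body))
            if not chunk:
                return "closed"
            body += chunk
    except socket.timeout:
        # The promised bytes never arrived; without a timeout we would wait forever.
        return "stalled"
    return "complete"


def demo(body: bytes, advertised: int) -> str:
    """Send a GET whose Content-Length is `advertised` but whose real body is `body`."""
    client, server = socket.socketpair()
    try:
        request = (
            b"GET /graphql HTTP/1.1\r\n"
            b"Host: example.test\r\n"
            b"Content-Length: " + str(advertised).encode() + b"\r\n\r\n" + body
        )
        client.sendall(request)
        return read_body(server)
    finally:
        client.close()
        server.close()
```

Calling `demo(b"abcd", 10)` reports a stalled read, whereas `demo(b"abcd", 4)` completes. A reader with no timeout would simply keep waiting, which is the position a blocked worker finds itself in.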

This specific situation should not have been a problem for us given that we run behind a properly configured NGINX instance, so we were pretty surprised when we tracked down a drop in Gunicorn capacity to workers stuck on a read() call, waiting for a GET request's body to become available [1].

We traced the original request coming into our infrastructure, and at the point of hitting NGINX this request was correct (i.e. its Content-Length header matched its body). This pointed to the real cause here: we'd recently introduced a Node.js based server to handle server-side rendering of our React frontend. The architecture is as follows:

Requests from Node.js to Gunicorn do not go through NGINX for two main reasons:

  1. This could introduce a routing loop if NGINX decides to redirect the request back to a Node server.
  2. It ensures minimal latency between the two backend components by keeping them local to a single host.

The requests causing our Gunicorn workers to lock up were coming from the Node.js server: it stripped the bodies from GET requests but left the headers describing those bodies in place, corrupting the requests along the way and creating what was essentially a self-induced Denial of Service. We strip the bodies because the Fetch API does not allow sending a GET request with a body [2]. As we never send GET requests with bodies ourselves, we missed stripping the relevant headers, and this was never caught in normal operations (in fact it took some malicious requests [1] to trigger this failure mode).

The fix for this was simple:
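In essence, whenever the body is dropped from a proxied request, the headers that describe that body have to be dropped with it. What follows is a minimal sketch of that logic, written in Python for consistency with the Django side. The real change lives in our Node.js server, and the function name `strip_body` and the exact header list are illustrative assumptions.

```python
from typing import Dict, Optional, Tuple

# Headers that tell the receiver how much body to expect (assumed list).
# Forwarding them without the body makes the receiver wait for bytes that never come.
BODY_FRAMING_HEADERS = {"content-length", "transfer-encoding"}

# Methods for which the proxy layer drops the body before forwarding.
BODYLESS_METHODS = {"GET", "HEAD"}


def strip_body(
    method: str, headers: Dict[str, str], body: Optional[bytes]
) -> Tuple[Dict[str, str], Optional[bytes]]:
    """Return the (headers, body) pair that is safe to forward upstream."""
    if method.upper() not in BODYLESS_METHODS:
        # Other methods keep their body and headers untouched.
        return dict(headers), body

    # Drop the body AND every header that advertises it (case-insensitive match).
    kept = {
        name: value
        for name, value in headers.items()
        if name.lower() not in BODY_FRAMING_HEADERS
    }
    return kept, None
```

For example, `strip_body("GET", {"Host": "example.test", "Content-Length": "42"}, b"...")` returns only the `Host` header and no body, so the upstream server has nothing left to wait for.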

[1] Specifically, we were receiving malformed GET requests sent to our GraphQL endpoint. Whoever sent them had reproduced the request we make from our frontend, replaced some variables with SQL injection payloads and then sent it as a GET. The library we use attempts to read the body of the request regardless of the HTTP method. As we send all our GraphQL requests over POST, and GraphQL over GET usually uses query parameters, we had not encountered this particular failure mode before.

[2] Notably, this is valid HTTP/1.1 according to RFC 7231, although the semantics are undefined. This is not valid in HTTP/2.