How we handle deploys and failover without disrupting user experience

At Mixpanel, we believe giving our customers a smooth, seamless experience when they are analyzing data is critically important. When something happens on the backend, we want the user experience to be disrupted as little as possible. We’ve gone to great lengths to learn new ways to maintain this level of quality, and today I want to share some of the techniques we’re employing.

During deploys

Mixpanel runs Django behind nginx using FastCGI. Some time ago, our deploys consisted of updating the code on our application servers, then simply restarting the Django process. This would result in a few of our rubber chicken error pages when nginx failed to connect to the upstream Django app servers during the restart. I did some Googling and was unable to find any content solving this problem conclusively for us, so here’s what we ended up doing.

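The fundamental concept is very simple. Suppose that currently, the upstream Django server is running on port 8000. I added an upstream block named app to the nginx configuration that lists the server on port 8000 as live and a second server on port 8001 marked as down.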

So now, when we fastcgi_pass to app, all the requests get sent to our Django server running on port 8000. When we deploy, we get the most up-to-date code and start up a new Django server on port 8001. Then we rewrite the upstream app block to mark 8000 as down instead of 8001, and we perform an nginx reload. The nginx reload starts up new worker processes running the new configuration, and when the old worker processes finish their existing requests, they get shut down gracefully, resulting in no downtime.
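The rewrite-and-reload step lends itself to scripting. Here is a minimal Python sketch of how it could look, under some assumptions: the helper names `render_upstream` and `switch_live_port`, the `127.0.0.1` host, and the config path passed in are all hypothetical. The only nginx feature it relies on is `nginx -s reload`, which asks the running master process to reload its configuration.

```python
import subprocess


def render_upstream(ports, live_port, host="127.0.0.1", name="app"):
    """Return an nginx upstream block with every port except live_port marked down."""
    if live_port not in ports:
        raise ValueError(f"live port {live_port} is not one of {ports}")
    lines = [f"upstream {name} {{"]
    for port in ports:
        flag = "" if port == live_port else " down"
        lines.append(f"    server {host}:{port}{flag};")
    lines.append("}")
    return "\n".join(lines) + "\n"


def switch_live_port(conf_path, ports, live_port):
    """Rewrite the upstream file, then ask nginx to reload its configuration."""
    with open(conf_path, "w") as conf:
        conf.write(render_upstream(ports, live_port))
    # Assumes the nginx binary is on PATH and we are allowed to signal it.
    subprocess.run(["nginx", "-s", "reload"], check=True)
```

Calling `render_upstream((8000, 8001), live_port=8001)` produces a block in which port 8000 carries the `down` flag and port 8001 does not. A deploy would start the new Django server on 8001 first and only then call `switch_live_port`.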

Another option to consider is using the backup directive instead of using down. This causes nginx to automatically fail over to the servers marked with backup when connections to the other servers in the block fail. You’re then able to seamlessly deploy by first restarting the backup server, and then the live one. The advantage here is that there’s no configuration file rewriting required, nor any restarting of nginx. Unfortunately, some of our legitimate requests take longer than a second to resolve, which resulted in false positives: nginx treated the original server as down when it was merely busy.
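For comparison, here is a small Python sketch of what the backup variant might look like. The function names, the `127.0.0.1` host, and the `restart` callback are assumptions for illustration; `backup` itself is a standard parameter of the nginx `server` directive inside an upstream block.

```python
def render_backup_upstream(primary_port, backup_port, host="127.0.0.1", name="app"):
    """Return an upstream block whose second server is only used on failover."""
    return (
        f"upstream {name} {{\n"
        f"    server {host}:{primary_port};\n"
        f"    server {host}:{backup_port} backup;\n"
        f"}}\n"
    )


def deploy_with_backup(primary_port, backup_port, restart):
    """Restart the backup server first, then the live one.

    `restart` is a hypothetical callable that restarts the app server on the
    given port and returns only once it is accepting connections again.
    """
    for port in (backup_port, primary_port):
        restart(port)
```

Because the configuration never changes between deploys, the block only needs to be written once; the deploy itself is just the two restarts in order.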

Spawning is yet another option. Spawning can run your Django server, monkeypatched with eventlet to provide asynchronous IO. Furthermore, it has graceful code reloading. Whenever it detects that any of your application’s Python files have changed, it starts up new processes using the updated files and gracefully switches all request handling to the new processes. Unfortunately, this solution didn’t work out for us, as somewhere within our large Django application, we had some long blocking code. This prevented eventlet from switching to another execution context, resulting in timeouts. Nevertheless, this would still be the best option if you can make sure that your WSGI application doesn’t have any blocking code.

During data store failures

At Mixpanel, we employ a custom-built data store we call “arb” to perform the vast majority of queries that our customers run on data. These machines are fully redundant and are queried through HTTP requests using httplib2. When a machine fails for any reason, we want to be able to seamlessly detect the failure and redirect all requests to the corresponding replica machine. Doing this properly required some modification of the HTTPConnection class.

The main problem was that httplib2 only supported a single socket timeout parameter, used for both sending and receiving through the underlying socket. However, we wanted the initial connection to fail very quickly, but still have a long receive timeout, since a query over large amounts of data could legitimately take a long time. Luckily, httplib2 requests allow for passing in a custom connection type, as long as it implements the methods of httplib.HTTPConnection. Armed with this knowledge, we created our own subclass of HTTPConnection that had a custom connect method. Prior to making the connection, we used settimeout on the socket object to lower the timeout to a short 1 second. If the connection was successful, we reverted the timeout back to the original setting.
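A minimal sketch of such a connection class follows. It is written against Python 3’s `http.client` (the successor to `httplib`) rather than against httplib2 itself, and the names `FastFailHTTPConnection`, `CONNECT_TIMEOUT`, and `read_timeout` are hypothetical. It is also a slight variation on the description above: instead of calling settimeout before connecting, it hands the short timeout to the connection attempt and restores the long one afterwards.

```python
import http.client

CONNECT_TIMEOUT = 1.0  # seconds allowed for the initial TCP connection


class ConnectTimeoutException(Exception):
    """Raised when the initial connection fails or times out."""


class FastFailHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection with a short connect timeout and a long read timeout."""

    def __init__(self, host, port=None, read_timeout=60.0, **kwargs):
        # The base class uses self.timeout while establishing the connection.
        super().__init__(host, port, timeout=CONNECT_TIMEOUT, **kwargs)
        self.read_timeout = read_timeout

    def connect(self):
        try:
            super().connect()
        except OSError as exc:  # covers refused connections and socket timeouts
            raise ConnectTimeoutException(f"{self.host}:{self.port}") from exc
        # Connected: switch to the long timeout for sending and receiving.
        self.sock.settimeout(self.read_timeout)
```

With this in place, a dead machine surfaces as a `ConnectTimeoutException` after about a second, while a slow but healthy query can still run for as long as `read_timeout` allows.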

This way, if we get a socket.error exception on the connection, a custom ConnectTimeoutException gets raised and the machine being connected to is properly marked as down. One small drawback is that the failing request takes an additional second, but this only needs to happen a small number of times before all future requests see the machine marked as down. For the requests that time out while connecting, we simply handle the ConnectTimeoutException and retry the query on the replica machine.
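The mark-as-down and retry logic could be sketched roughly as follows. Everything here is illustrative: the `ReplicaPair` class, the `run_query` callable, and in particular the `retry_after` window after which a machine is tried again are assumptions, since how a machine gets marked as up again is not covered here.

```python
import time


class ConnectTimeoutException(Exception):
    """Raised when the initial connection to a machine fails or times out."""


class ReplicaPair:
    """Send queries to a primary machine, falling back to its replica."""

    def __init__(self, primary, replica, retry_after=30.0, clock=time.monotonic):
        self.hosts = [primary, replica]
        self.retry_after = retry_after  # hypothetical: seconds a host stays marked down
        self.clock = clock
        self.down_until = {}

    def is_down(self, host):
        return self.clock() < self.down_until.get(host, 0.0)

    def mark_down(self, host):
        self.down_until[host] = self.clock() + self.retry_after

    def query(self, run_query):
        """Call run_query(host) on the first host that is not marked down."""
        last_error = None
        for host in self.hosts:
            if self.is_down(host):
                continue  # skip without paying the connect timeout again
            try:
                return run_query(host)
            except ConnectTimeoutException as exc:
                self.mark_down(host)
                last_error = exc
        raise ConnectTimeoutException("no reachable machine") from last_error
```

Only the first request after a failure pays the extra second; later calls to `query` skip the machine marked as down and go straight to the replica.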

The takeaway here is to take advantage of the ability to change the socket timeout to check for an unresponsive machine. Often with systems that work with large volumes of data, long timeouts are required for database queries. But this is only necessary for established connections. When the connection is initially created, failing fast results in a better user experience, avoiding long delays when a machine goes down.

5 thoughts on “How we handle deploys and failover without disrupting user experience”

  1. Another option: use the http-check option to remove server from a pool

    backend cluster
    # This should 404 when the server is being deployed to
    option httpchk GET /http-check

    http-check disable-on-404
    server zaphod
    server trillian
    server arthur

  2. I forgot to add, the nice thing about this is that you can make /http-check (or whatever) 404 before you actually take the server down, so haproxy has time to run a check and mark it as down before it disappears.

  3. I assume you have a load balancer in front of your nginx instances? Either some commercial offering, haproxy, or even nginx itself.

    If so, why not gracefully remove the node from rotation, deploy code, test/warm, put back into rotation. This is pretty much the standard way of doing things, and allows for the same result.


    – remove node from rotation at LB
    – check no more active connections ( pretty simple in most lb techs, and even a sensible time out works which i assume you already have )
    – deploy new code, symlink new code to running location and restart services.
    – warm/test as needed
    – put back into rotation
    – check for errors

    While this may seem like a lot of steps, it can be easily automated, and you do get truly seamless deploys and rollbacks.

    Hope it’s useful 🙂
