We went down, so we wrote a better pure python memcache client

Memcache is great. Here at Mixpanel, we use it in a lot of places, mostly to cache MySQL queries but also for other data stores. We also use kestrel, a queue server that speaks the memcache protocol.

Because we use eventlet, we need a pure python memcache client so that eventlet can patch the socket operations to be non-blocking. The de facto standard for this is python-memcached, which we used until recently.

In which things fail

When customers send data to our /track API endpoint, it hits a load balancer which forwards it to an API server. The API server sticks the data on a queue and we process it “later” (usually in a couple of milliseconds). Our kestrel queues implement the same interface as memcached, so we simply .set() to enqueue and .get() to dequeue.

On April 16, 2012, our /track API endpoint went down. Part of the reason is that one of our kestrel servers went down, but that shouldn’t have been a problem – that’s why we have redundancy and failover. Unfortunately, failover only works when failure is detected. This is where we had problems.

It turns out that socket operations that fail in python-memcached don’t return an error or raise any exceptions. Calling .get() on a dead server returns None, the same thing that is returned when there’s no value. As such, our API servers had no way to detect the failure, so they never marked the kestrel server as down and kept trying to enqueue events to it.
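To make the failure-detection point concrete, here is a small self-contained sketch. FakeQueue and enqueue_with_failover are hypothetical names invented for this illustration and are not taken from any real library. The sketch only shows that failover becomes straightforward once a failed .set() raises instead of returning quietly.

```python
class FakeQueue:
    """Hypothetical stand-in for a kestrel client; not a real library class."""

    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive
        self.items = []

    def set(self, key, value):
        if not self.alive:
            # A client that surfaces failures raises here instead of
            # returning as if nothing had happened.
            raise OSError("%s is unreachable" % self.name)
        self.items.append((key, value))


def enqueue_with_failover(queues, down, key, value):
    """Enqueue on the first healthy queue, recording failed servers in `down`."""
    for queue in queues:
        if queue.name in down:
            continue  # already known to be down, do not retry it
        try:
            queue.set(key, value)
            return queue.name
        except OSError:
            # The failure is visible, so later calls can skip this server.
            down.add(queue.name)
    raise RuntimeError("no healthy queue servers left")
```

With a client that swallows the error, the except branch never runs and the dead server is retried on every request.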

With python-memcached, it’s difficult to control how long the client should wait before timing out. When our API servers tried to .set() onto the down queue, it would take around 30 seconds before failing. By then, the load balancers had already given up and our poor API server would be marked down through no fault of its own.

These two issues combined caused us to lose most of the data sent during the 85 minutes it took to fix things. Data loss is a worst-case scenario for us, so we took this incident very seriously.

If you’re like us and memcache is a critical part of your infrastructure, you need to know when your servers are down – your client needs to throw an exception on failure. But even if you’re just using memcached for caching, it’s important that you have control over your timeouts – you don’t want your caches to become a performance bottleneck when things go south.

Enter memcache_client

So, we set out to write a memcache client that suited our particular needs. We gave each instance of the client a configurable connect_timeout and timeout. It’s not quite a drop-in replacement; most of our code didn’t have to be updated at all, but there were some changes that needed to be made.
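To illustrate what the two settings control, here is a minimal sketch using only the standard library’s socket module. It is not the memcache_client implementation. The function name request_with_timeouts and its default values are assumptions made for this example. The idea is that connect_timeout bounds connection setup, while timeout bounds each later send or receive.

```python
import socket


def request_with_timeouts(host, port, payload, connect_timeout=1.0, timeout=0.5):
    """Send payload and read one reply, bounding connect and I/O time separately."""
    # create_connection gives up on a connection attempt after
    # connect_timeout seconds and raises an exception.
    sock = socket.create_connection((host, port), timeout=connect_timeout)
    try:
        # From here on, a send or recv that blocks for longer than
        # `timeout` seconds raises socket.timeout.
        sock.settimeout(timeout)
        sock.sendall(payload)
        return sock.recv(4096)
    finally:
        sock.close()
```

Keeping the two values separate lets a caller allow a slightly slower connect while still failing fast on a server that accepts connections but never answers.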

The main differences are:

  • Our client only supports the set, get, delete, and stats commands.
    We don’t support incr, decr, append, prepend, replace, or cas.
  • We only connect to a single memcached.
    We don’t support autosharding over slabs.
  • TCP sockets only.
    No support for UDP or UNIX sockets.
  • Bytestring keys and values only.
    No pickling, casting to str, or auto-encoding of unicode.
  • And, of course, timeouts.
    Timeouts get raised on gets and sets, so you’ll need to catch those if you want the old behavior.
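If some call sites still want a failure to look like a cache miss, a small wrapper can restore that. This is a sketch that assumes the client lets socket.timeout and other socket errors propagate; check the library’s documentation for the exceptions it actually raises. get_or_none is a hypothetical helper, not part of memcache_client.

```python
import socket


def get_or_none(client, key):
    """Mimic the old behavior: treat a timeout or socket error as a miss."""
    try:
        return client.get(key)
    except (socket.timeout, socket.error):
        # Deliberately swallow the failure. Only do this for pure caching,
        # where recomputing the value is an acceptable fallback.
        return None
```

Making this opt-in keeps the default loud, so queue-style callers still find out when a server is down.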

An interlude about types

You may be wondering why we got rid of the automatic pickling of keys and values. Our library doesn’t support some operations like incr and we also dropped support for sharding, but that was because we didn’t need them. Enforcing a “bytestrings in, bytestrings out” rule, however, was a conscious decision.

First of all, we noticed that, within our own codebase, there were different conventions for using the client. Some call sites would cast everything to strs themselves, some would rely on the autopickling, and some were just blindly using it and it happened to work – most of the time.

We think it’s important to know exactly what data you’re putting in and what data you’re getting back. Our more disciplined approach means you don’t get subtle errors when some unicode gets automatically encoded. We enforce the calling convention with code in our library that looks like this:
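The library’s own snippet is not reproduced in this text. As a rough illustration only, a check of this kind could be sketched as follows. ValidationError and validate_bytestring are hypothetical names, and the sketch is written for Python 3, where bytes plays the role that str played as the bytestring type in Python 2.

```python
class ValidationError(TypeError):
    """Hypothetical exception name used only in this sketch."""


def validate_bytestring(name, value):
    """Return value unchanged if it is a bytestring, otherwise raise."""
    if not isinstance(value, bytes):
        # No pickling, casting, or encoding: reject anything else outright.
        raise ValidationError(
            "%s must be a bytestring, got %s" % (name, type(value).__name__)
        )
    return value
```

Rejecting unicode text and other objects at the boundary forces each call site to pick an encoding or serialization explicitly.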

The code is robust about things like non-fatal network errors but it is intentionally brittle about how you interact with our API. As a side effect, the keys and values you store will be accessible from other clients, Python or otherwise.

Where we stand now

memcache_client has been a critical part of our infrastructure for about two months now. It’s running on over 200 machines, speeding up queries on both our external API and internally. In addition, every event we’ve received – approximately 10 billion during this time period – has successfully passed through our memcache client to make it onto the kestrel queues.

Now that it’s been battle-tested, we’re ready to open-source it. If you have a python application and you use memcached, we hope you’ll check it out. It was a bit of work to integrate, but the integration revealed our inconsistent handling of types and the resulting code is both aware of and resilient to hardware and network failures.

Check out the code at https://github.com/mixpanel/memcache_client
And the documentation at http://mixpanel.github.com/memcache_client/

As always, patches welcome (also, a big thanks to the people who’ve submitted patches to our other projects).

8 thoughts on “We went down, so we wrote a better pure python memcache client”

  1. You might consider changing the isinstance check to check against basestring instead of just str. Doing so means you can handle unicode strings:
    >>> isinstance(u"test", str)
    False
    >>> isinstance(u"test", basestring)
    True

  2. Thanks for working on this, though it would have been nice if you could have changed the python-memcache module so that it worked for this use case.

    But, you are right that python-memcache is meant to talk to memcache, not to talk to memcache-compatible protocols. The base module was written by the memcache authors, and it follows a use-case of:

    1) Look up data in the cache.

    2) If there is a value present, use it.

    3) If there is no value present, calculate the value and store it in the memcache.

    They decided that if the module fails due to a connection or server failure, then instead of raising an exception it returns a value indicating that the value should be recalculated and added to the cache.

    For most of the uses *WITH MEMCACHE*, that is probably reasonable. However, for non-cache uses, that’s probably not what you want.


  3. python-memcached shouldn’t ever have been used with Kestrel. First, failures are swallowed up, and second, picking a server in a Kestrel cluster is random, not based on the key.

    Using modulo sharding or even consistent hashing would dedicate one node in the cluster to a single queue. If you have one “photos” queue to process incoming photos, for instance, all the work would go to a single node.

    I had to fork the python-memcached client in order to work with Kestrel (https://github.com/ericmoritz/python-kestrel) and it was a bit of a pain and I felt really dirty duplicating all that code. This project is an excellent low-level package for building memcached-speaking clients that behave appropriately for memcached or Kestrel. Thanks for the work!
