(This is part II of a two part series of posts, you can find part I here)
One of the most powerful parts of the Mixpanel query language is the
any operator, which allows you to select events or profiles based on the value of any element in a list. The
any operator is just a bit more magical than the other operators in our query language, both in its power and in its implementation.
We’ve already written about building the Mixpanel expression language – the language we built inside of the Mixpanel data store to allow you to query and select data for reports. The model we built in the last post can do a lot of work, but parsing and interpreting the
any query takes the language to another level, both metaphorically and syntactically.
Like the basic expression language post, we’ll be using Python and JSON to talk about procedures and data, but won’t assume you’re a serious Pythonista. It will also be worth taking another look at the simple expression language post, since this post elaborates that model.
A few weeks ago we started noticing a dramatic change in the pattern of network traffic hitting our tracking API servers in Washington DC. From a fairly stable daily pattern, we started seeing spikes of 300-400 Mbps, but our rate of legitimate traffic (events and people updates) was unchanged.
Suddenly our network traffic started spiking like crazy.
Pinning down the source of this spurious traffic was a top priority, as some of these spikes were triggering our upstream routers into a DDos mitigation mode, where traffic was being throttled.
On Monday we shipped distinct_id aliasing, a service that makes it possible for our customers to link multiple unique identifiers to the same person. It’s running smoothly now, but we ran into some interesting performance problems during development. I’ve been fairly liberal with my keywords; hopefully this will show up in Google if you encounter the same problem.
The operation we’re doing is conceptually simple: for each event we receive, we make a single MySQL SELECT query to see if the distinct_id is an alias for another ID. If it is, we replace it. This means we get the benefits of multiple IDs without having to change our sharding scheme or moving data between machines.
A single SELECT would not normally be a big deal – but we’re doing a lot more of them than most people. Combined, our customers have many millions of end users, and they send Mixpanel events whenever those users do stuff. We did a little back-of-the-envelope math and determined that we would have to handle at least 50,000 queries per second right out of the gate.
At Mixpanel, we believe giving our customers a smooth, seamless experience when they are analyzing data is critically important. When something happens on the backend, we want the user experience to be disrupted as little as possible. We’ve gone to great lengths to learn new ways for maintaining this level of quality, and today I want to share some of the techniques were employing.
Mixpanel.com runs Django behind nginx using FastCGI. Some time ago, our deploys consisted of updating the code on our application servers, then simply restarting the Django process. This would result in a few of our rubber chicken error pages when nginx failed to connect to the upstream Django app servers during the restart. I did some Googling and was unable to find any content solving this problem conclusively for us, so here’s what we ended up doing.
Memcache is great. Here at Mixpanel, we use it in a lot of places, mostly to cache MySQL queries but also for other data stores. We also use kestrel, a queue server that speaks the memcache protocol.
Because we use eventlet, we need a pure python memcache client so that eventlet can patch the socket operations to be non-blocking. The de-facto standard for this is python-memcached, which we used until recently.
This post is a follow up to Why we moved off the cloud.
As a company, we want to do reliable backups on the cheap. By “cheap” I mean in terms of cost and, more importantly, in terms of developer’s time and attention. In this article, I’ll discuss how we’ve been able to accomplish this and the factors that we consider important.
Backups are an insurance policy. Like conventional insurance policies (e.g. renter’s), you want piece of mind that your stuff is covered if disaster strikes, while paying the best price you can from the available options.
Backups are similar. Both your team and your customers can rest a bit more easily knowing that you have your data elsewhere in case of unforeseen events. But on the flip side, backups cost money and time that could be better applied to improving your product — delivering more features, making it faster, etc. This is good motivation for keeping the cost low while still being reliable.
At Mixpanel, we process billions of API transactions each month and that number can sometimes increase rapidly just in the course of a day. It’s not uncommon for us to see 100 req/s spikes when new customers decide to integrate. Thinking of ways to distribute data intelligently is pivotal in our ability to remain real-time.
I am going to discuss several techniques that allow people to horizontally distribute data. We have conducted interviews (by the way, we’re hiring engineers) with people in the past that make poor decisions in partitioning (e.g. partitioning by the first letter in a user’s name) and I think we can spread some knowledge around. Hopefully, you’ll learn something new.