Live Downsizing Google Cloud Persistent Disks for Fun and Profit

Background

At Mixpanel, we heavily utilize Google Cloud Platform (GCP)’s SSD provisioned persistent disk (PD-SSD) to store the event data that underlies our service. We chose it to support our low-latency analytics database, which serves customer queries.

We found one major problem with PD-SSD: the cost. PD-SSD costs about 5x more per GB stored than Google Cloud Storage (GCS). Our infra team designed Mixpanel’s database, Arb, to utilize both PD-SSD and GCS: it stores recent, frequently changing data on PD-SSD for low-latency reads and much older, immutable data in GCS for lower cost and acceptable latency.

As all engineers do, we first engineered for correctness/performance and later optimized for cost. PD-SSD initially held a large proportion of data for performance reasons, accounting for a significant portion of our infrastructure cost. However, given that the vast majority of our data is immutable, we could shift the balance toward GCS to drastically reduce our storage cost.

While read latency was initially a concern (average latency in GCS is worse than in our PD-based solution), we realized that in a system like Arb, where a single query touches hundreds of disks, p99 latency matters more than the average for end-user latency. We confirmed that while GCS was slower on average, its p99 was consistently equal to or better than PD-SSD at our throughput.

After the latency experiment, we shifted data older than 3 months to GCS, leaving our PD-SSD utilization incredibly low. To realize the cost savings, we needed to downsize the disks. Unfortunately, GCP does not support downsizing a PD. This blog post describes how we built our own solution to do just that.

Requirements

  • Live: The most important requirement is not impacting our customers’ experience. PD-SSD downsizing should not cause data delays or query downtime.
  • Idempotent: Repeat downsizing of the same disk should be a no-op.
  • Automated: Downsizing should be triggered by a single command.
  • Revertible: Design should allow rollback to the prior disk in case of error.

Primitives we have

  • Multiple Zones: Arb uses multiple GCP zones for redundancy, with each zone holding an independent copy of the data and its own compute nodes. This design comes with several benefits: it distributes load during peak times, keeps the query and storage logic relatively simple to maintain and improve, and allows deployments and canaries with no downtime. We leveraged this for downsizing.
  • Kafka: Kafka is a key component of our ingestion pipeline. Our ingestion servers push events to Kafka, which are then consumed by Storage Servers and persisted to PD-SSDs. In our configuration, Kafka retains data for 7 days, so even if a Storage Server reverts to a state from six days ago, it can catch up to the most recent data without any data loss.
  • GCP PD Snapshot: GCP provides a feature called Snapshot, which is what it sounds like: it creates a snapshot of a PD. Moreover, GCP supports creating a PD-SSD from an existing snapshot, which essentially clones a PD-SSD to a new PD-SSD (of the original size) with the same data (see the sketch after this list). GCP intelligently creates snapshots incrementally based on previous snapshots, so they are storage and time efficient.
  • Kubernetes Jobs: A Kubernetes job creates one or multiple pods and tracks the status of the pods. If a specified number of pods are completed successfully, the job is completed.
  • GCP node pool: A node pool is a specified number of nodes with the same configuration. The configuration includes the number of CPUs and GPUs, size of memory and storage, etc.
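To make the snapshot primitive concrete, here is a minimal sketch of cloning a disk by driving the gcloud CLI from Python. The disk, snapshot, zone, and project names are placeholders, and the exact flags we used in production may differ.

```python
import subprocess

def clone_disk(disk: str, zone: str, project: str) -> str:
    """Snapshot a PD-SSD and create a new disk from that snapshot (a clone)."""
    snapshot = f"{disk}-clone-snap"
    clone = f"{disk}-clone"

    # 1. Take an (incremental) snapshot of the source disk.
    subprocess.run([
        "gcloud", "compute", "disks", "snapshot", disk,
        f"--zone={zone}", f"--project={project}",
        f"--snapshot-names={snapshot}",
    ], check=True)

    # 2. Create a new PD-SSD of the original size from that snapshot.
    subprocess.run([
        "gcloud", "compute", "disks", "create", clone,
        f"--zone={zone}", f"--project={project}",
        "--type=pd-ssd",
        f"--source-snapshot={snapshot}",
    ], check=True)
    return clone
```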

Our solution

Here is our solution with the given primitives.

Step 1: Create a cloned disk using a snapshot

The first step is cloning a disk, using a snapshot, into a separate pod called Downsizer. The Downsizer pod is created on a separate node with ample resources. This guarantees the downsizing process has no impact on live Storage Servers and allocates enough I/O throughput to perform the downsizing as quickly as possible. There is a short period (up to 15 min) while the snapshot is taken during which reads/writes to this disk are blocked. After the snapshot is created, the Storage Server quickly catches up on its backlog and starts serving traffic normally. Creating the cloned PD-SSD is performed independently in the Downsizer pod. If something goes wrong, the process is fully revertible using the snapshot.

The Downsizer pod is created by a Kubernetes Job. Each PD-SSD downsizing job is named with the PD-SSD’s unique ID, and the job creates a Downsizer pod with the same name. This Kubernetes configuration provides idempotency, since a Downsizer job/pod for the same disk cannot be created more than once.
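As a rough illustration of that idempotency, here is a sketch using the official Kubernetes Python client: the Job name is derived from the disk’s ID, and a second attempt to create the same Job is rejected with a 409 Conflict, which can be treated as "already running or done". The image name, namespace, and node-pool label value are placeholders, not our actual configuration.

```python
from kubernetes import client, config
from kubernetes.client.rest import ApiException

def launch_downsizer_job(disk_id: str, namespace: str = "downsizer"):
    """Create a Downsizer Job named after the disk; a duplicate is a no-op."""
    config.load_kube_config()
    batch = client.BatchV1Api()

    job_name = f"downsizer-{disk_id}"
    job = client.V1Job(
        metadata=client.V1ObjectMeta(name=job_name),
        spec=client.V1JobSpec(
            backoff_limit=0,
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    # Pin the pod to the dedicated Downsizer node pool.
                    node_selector={"cloud.google.com/gke-nodepool": "downsizer"},
                    containers=[client.V1Container(
                        name="downsizer",
                        image="downsizer:latest",  # placeholder image
                        args=["--disk-id", disk_id],
                    )],
                ),
            ),
        ),
    )
    try:
        batch.create_namespaced_job(namespace=namespace, body=job)
    except ApiException as e:
        if e.status == 409:  # a Job with this name already exists: no-op
            return
        raise
```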

This pod gets assigned to one of the nodes in the Downsizer node pool, with 16 CPUs per pod. This guarantees enough PD-SSD throughput to finish the downsizing job in a timely manner, as disk I/O throughput is roughly proportional to the number of CPUs on an instance. The node pool is created specifically for the downsizing process, allocated with enough CPU, memory, and the desired number of nodes.
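For completeness, a node pool like the one described could be created with a gcloud call along these lines; the cluster name, node count, and machine type are illustrative, not our exact settings.

```python
import subprocess

def create_downsizer_node_pool(cluster: str, zone: str, num_nodes: int):
    """Create a dedicated node pool with 16-vCPU machines for Downsizer pods."""
    subprocess.run([
        "gcloud", "container", "node-pools", "create", "downsizer",
        f"--cluster={cluster}", f"--zone={zone}",
        "--machine-type=n1-standard-16",  # 16 vCPUs for high PD throughput
        f"--num-nodes={num_nodes}",
    ], check=True)
```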

Step 2: Create downsized disk and copy data from cloned disk

We successfully created a clone of the original disk, so it’s time to create a smaller PD-SSD and start copying the data. Because of the amount of data and the throughput limits, this process may take up to 10 hours. It runs entirely in the independent Downsizer pod while the Storage Server serves traffic as usual, so there is no downtime.

The size of the smaller PD-SSD is determined by the disk usage of the cloned disk and a target utilization factor. We used rsync to perform the data copy/synchronization: it is reliable, fast, and comes with a built-in transfer checksum. Note that we did not use the --checksum flag, which performs an additional checksum over existing files and slows down the whole process significantly. After the rsync finished, we unmounted the smaller PD-SSD to guarantee that buffers were flushed.
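A rough sketch of the copy step, assuming both disks are already formatted and mounted inside the Downsizer pod; the mount points, the exact rsync flags, and the 80% target utilization are illustrative rather than our production values.

```python
import subprocess

def copy_disk(src_mount: str, dst_mount: str):
    """Copy data from the cloned disk to the smaller disk, then flush buffers."""
    # -a preserves permissions/ownership/timestamps; --delete keeps the
    # destination an exact mirror, which also makes a second pass correct.
    # We deliberately do NOT pass --checksum: rsync's default size+mtime
    # comparison plus its in-flight checksum is enough and far faster.
    subprocess.run(
        ["rsync", "-a", "--delete", f"{src_mount}/", f"{dst_mount}/"],
        check=True,
    )
    # Unmount the destination so all written data is flushed to the PD.
    subprocess.run(["umount", dst_mount], check=True)

def target_size_gb(used_gb: float, utilization: float = 0.8) -> int:
    """Size the new disk so current usage sits at the target utilization."""
    return int(used_gb / utilization) + 1
```

The same copy_disk call can be reused for the second pass described in Step 3, which is fast because only files changed since the first pass are transferred.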

Step 3: Delete the cloned disk and repeat steps 1 and 2

During step 2, the Storage Server is serving data as usual, which means data ingested during the rsync is still being added to the original disk. That added data becomes a backlog the downsized disk has to catch up on.

When there is a backlog for a disk, Mixpanel’s database ensures that reads for data on that disk go to the other zone. Once the backlog drops below a watermark, the disk becomes readable again. Thus a higher backlog leads to longer query downtime for a disk in one zone.

To minimize the backlog, we decided to perform step 1 again and run a second rsync to the disk created in step 2. The advantage comes from the fact that we have already copied most of the data to the downsized disk, and rsync compares file metadata and copies only the difference. Transferring data from one PD-SSD to another on a dedicated node with ample resources is much faster than going through the ingestion pipeline via a Storage Server, which may share its I/O throughput across multiple PD-SSDs. This second rsync pass takes only about 10 to 20 minutes, and the additional time is well worth it.

Step 4: Create snapshot from the downsized disk and restore the original disk

Mixpanel infra engineers have long leveraged GCP snapshots as periodic backups. We also take a snapshot before performing any disk-related engineering work, in case something goes wrong. We had already implemented the ability to restore a Storage Server disk from a snapshot, and we reuse it for Downsizer.

After the second rsync pass finishes, Downsizer takes a snapshot of the downsized disk. This downsized snapshot contains the complete data and the disk size information. When the restore is triggered, the Storage Server stops reads and writes to the original disk, deletes it, and creates a smaller disk from the downsized snapshot.
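A simplified sketch of the restore, again driving gcloud from Python; stop_disk_io and resume_storage_server are hypothetical stand-ins for internal Storage Server calls that are not shown here.

```python
import subprocess

def restore_from_downsized_snapshot(orig_disk: str, snapshot: str,
                                    zone: str, project: str):
    """Replace the original disk with a smaller one built from the snapshot."""
    stop_disk_io(orig_disk)  # hypothetical internal call: block reads/writes

    # Delete the original, oversized PD-SSD.
    subprocess.run([
        "gcloud", "compute", "disks", "delete", orig_disk,
        f"--zone={zone}", f"--project={project}", "--quiet",
    ], check=True)

    # Recreate it (same name) from the downsized snapshot; without an explicit
    # --size, the new disk takes the smaller size recorded in the snapshot.
    subprocess.run([
        "gcloud", "compute", "disks", "create", orig_disk,
        f"--zone={zone}", f"--project={project}",
        "--type=pd-ssd", f"--source-snapshot={snapshot}",
    ], check=True)

    resume_storage_server(orig_disk)  # hypothetical internal call: remount, resume writes
```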

Step 5: Wait for backlog to catch up and enable Read

Now we have the downsized disk mounted to a Storage Server, but we are not quite done yet. Even after the second rsync, there is still a 20–30 minute gap between the time the second snapshot is taken and the time the restore from the downsized snapshot finishes. If we served queries from this disk right away, we might return stale data, which we need to avoid. Downsizer is responsible for blocking reads to the disk, monitoring the disk’s backlog, and enabling reads only when the disk has caught up to the most recent data. This step usually takes less than 10 minutes.
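The final wait can be expressed as a simple polling loop; get_backlog_seconds and enable_reads are hypothetical wrappers around internal Storage Server APIs, and the thresholds are illustrative.

```python
import time

def wait_and_enable_reads(disk_id: str, max_lag_seconds: int = 60,
                          poll_interval: int = 30):
    """Keep reads blocked until the disk has caught up on its ingestion backlog."""
    while True:
        lag = get_backlog_seconds(disk_id)  # hypothetical internal API
        if lag <= max_lag_seconds:
            break
        time.sleep(poll_interval)
    enable_reads(disk_id)  # hypothetical internal API
```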

Operationalizing it

Automation

The previous steps show the process of downsizing one disk, but we have hundreds of disks. It is not scalable to run Downsizer manually, one at a time. We need a tool to automate downsizing for many disks.

  • One zone at a time: From the downsizing description above, some steps require unavoidable downtime for a disk: taking snapshots, restoring from a snapshot, and waiting for the backlog. If we restrict the downsizing process to a single zone, the other zone serves traffic normally, so Downsizer causes no downtime.
  • 10% of the disks in one zone at a time: Conservatively, stopping about 10% of the disks in one zone has shown minimal impact on our customers’ experience.

Based on this, we built a tool that performs downsizing for one zone:

  1. Get a list of all disks in one zone.
  2. Check the number of Downsizer Kubernetes jobs that are completed and running.
  3. If the number of currently running jobs is lower than 10% of the total number of disks, create more jobs to reach 10%.
  4. Repeat steps 2 and 3 until all of the disks are finished.

This approach is even more conservative than it sounds: it is highly unlikely that the individual disks’ downtime windows all overlap at the same time and reach 10% of the disks in a zone.
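A condensed sketch of that per-zone loop, counting active Downsizer Jobs with the Kubernetes client and topping them up to the 10% ceiling; launch_downsizer_job is the idempotent helper sketched in Step 1, and list_zone_disks is a placeholder for the disk inventory lookup.

```python
import time
from kubernetes import client, config

def downsize_zone(zone: str, namespace: str = "downsizer", fraction: float = 0.10):
    """Keep roughly 10% of a zone's disks downsizing until all have been submitted."""
    config.load_kube_config()
    batch = client.BatchV1Api()

    disks = list_zone_disks(zone)          # placeholder: all PD-SSDs in the zone
    limit = max(1, int(len(disks) * fraction))
    pending = list(disks)

    while pending:
        jobs = batch.list_namespaced_job(namespace=namespace).items
        running = sum(1 for j in jobs if j.status.active)
        # Top up to the concurrency ceiling; duplicate creations are no-ops.
        for _ in range(limit - running):
            if not pending:
                break
            launch_downsizer_job(pending.pop(0), namespace=namespace)
        time.sleep(60)
```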

Coordinating with others

The last problem left to solve was how to execute Downsizer with minimal impact on the productivity of other engineers. At Mixpanel, a deployment and/or canary happens almost every day, and running Downsizer means blocking concurrent Query Server and Storage Server deployments. We use a lightweight deployment lock system that lets engineers take “locks” on resources, preventing concurrent deploys.

We initially estimated ten hours per disk for downsizing, and we have about 300 disks per zone. Running 10% of the disks at a time (about 30 concurrent jobs), that is roughly 10 waves of 10 hours each, or about four full days per zone. Even considering the weekends, we would need to block deployments for two work days, and that was not acceptable.

We could have just run it during weekends, but that was also not acceptable at the beginning because we didn’t know if downsizing would reveal unforeseen side effects, and we wanted to observe and catch them early before downsizing all of the disks.

The solution was rather simple: acquire the deployment lock at the end of the work day and start Downsizer; stop Downsizer around midnight, and release the lock in the morning after all running Downsizer jobs have finished. Of course, communicating with other engineers (especially those holding the pager) was a must. In this way, we started off with a smaller number of disks and observed any changes in production behavior. We found that it takes about four hours on average to downsize a disk, because we have less data per disk than we had originally planned for. After a couple of nights of downsizing, we confirmed that there were no side effects, so we kicked off full downsizing of one whole zone over the weekend. Knowing that it would take less than two full days and cause no side effects, we confidently ran the other zone over a weekend several weeks later.

Results

After the whole process was finished, we had reduced Arb’s PD-SSD usage from 1.05 PB to 390 TB. That’s about a $112k-per-month reduction before accounting for the increased GCS cost. GCS costs increased by approximately $20k per month, including storage and operations. The overall cost reduction is therefore about $90k per month, or roughly $1M per year!
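As a rough back-of-envelope check, the numbers line up with list prices at the time (about $0.17 per GB-month for SSD persistent disks, which is an assumption here rather than a figure from our bill):

```python
pd_ssd_price_per_gb_month = 0.17            # assumed list price at the time
freed_tb = 1050 - 390                       # 1.05 PB down to 390 TB
pd_savings = freed_tb * 1000 * pd_ssd_price_per_gb_month  # ~= $112k / month
net_savings = pd_savings - 20_000           # minus ~$20k/month of extra GCS cost
print(round(pd_savings), round(net_savings), round(net_savings * 12))
# ~= 112200, 92200, 1106400 -> about $90k/month, roughly $1M/year
```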

Takeaways

There are several things that we learned from this project:

  • Kubernetes and GCP: Idempotency could have been tricky to achieve. With Kubernetes Jobs, however, we just needed to configure the jobs correctly. Snapshots and node pools were very handy as well. Understanding the provided primitives is key to efficient tool building.
  • rsync and buffer flush: rsync is a powerful and efficient utility. We tested parallelizing rsync using parallel and found no measurable difference. The second-pass sync would have been hard without rsync. Unmounting the disk was how we guaranteed the buffer flush.
  • 2nd pass sync for minimal backlog: When we first implemented Downsizer without the second-pass sync, we set the backlog catch-up timeout to several hours, proportional to the time it took to rsync data from the cloned disk to the downsized disk. Depending on a Storage Server’s available write throughput, catching up could take the entire time allotted, and reads were blocked for that whole period. The second-pass sync solved this problem and minimized query downtime.
  • Back-of-envelope calculation: We planned out each step with back-of-envelope calculations first, then implemented and executed. After execution, we checked whether the results matched our expectations; if they didn’t, we investigated and figured out why. This way, we predicted the ops impact accurately and were able to execute the downsizing process with minimal pain.
  • Communication: We looped in service owners at the design and planning stage. We learned about all the available in-house utilities we could use, and the service owners understood what we were planning to do and gave us valuable advice. This communication continued throughout the project and was the key factor in finishing it efficiently, with minimal trial-and-error and minimal side effects.

If you enjoyed this article and are interested in working on high-performance distributed systems at Mixpanel, we are hiring!

Building a (not so simple) expression language part II: Scope

(This is part II of a two-part series of posts; you can find part I here)

One of the most powerful parts of the Mixpanel query language is the any operator, which allows you to select events or profiles based on the value of any element in a list. The any operator is just a bit more magical than the other operators in our query language, both in its power and in its implementation.

We’ve already written about building the Mixpanel expression language – the language we built inside of the Mixpanel data store to allow you to query and select data for reports. The model we built in the last post can do a lot of work, but parsing and interpreting the any query takes the language to another level, both metaphorically and syntactically.

Like the basic expression language post, we’ll be using Python and JSON to talk about procedures and data, but won’t assume you’re a serious Pythonista. It will also be worth taking another look at the simple expression language post, since this post elaborates that model.

Continue reading

Diagnosing networking issues in the Linux Kernel

A few weeks ago we started noticing a dramatic change in the pattern of network traffic hitting our tracking API servers in Washington DC. From a fairly stable daily pattern, we started seeing spikes of 300-400 Mbps, but our rate of legitimate traffic (events and people updates) was unchanged.

Suddenly our network traffic started spiking like crazy.

Pinning down the source of this spurious traffic was a top priority, as some of these spikes were triggering our upstream routers into a DDoS mitigation mode, where traffic was being throttled.

Continue reading

Debugging MySQL performance at scale

On Monday we shipped distinct_id aliasing, a service that makes it possible for our customers to link multiple unique identifiers to the same person. It’s running smoothly now, but we ran into some interesting performance problems during development. I’ve been fairly liberal with my keywords; hopefully this will show up in Google if you encounter the same problem.

The operation we’re doing is conceptually simple: for each event we receive, we make a single MySQL SELECT query to see if the distinct_id is an alias for another ID. If it is, we replace it. This means we get the benefits of multiple IDs without having to change our sharding scheme or move data between machines.

A single SELECT would not normally be a big deal – but we’re doing a lot more of them than most people. Combined, our customers have many millions of end users, and they send Mixpanel events whenever those users do stuff. We did a little back-of-the-envelope math and determined that we would have to handle at least 50,000 queries per second right out of the gate.
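Conceptually, the per-event lookup is just a single parameterized query; this sketch uses a generic DB-API cursor and a hypothetical alias table and schema rather than Mixpanel’s actual one.

```python
def resolve_distinct_id(cursor, distinct_id: str) -> str:
    """Return the canonical ID if distinct_id is an alias, else the ID itself."""
    cursor.execute(
        "SELECT canonical_id FROM aliases WHERE alias = %s",  # hypothetical schema
        (distinct_id,),
    )
    row = cursor.fetchone()
    return row[0] if row else distinct_id
```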
Continue reading

How we handle deploys and failover without disrupting user experience

At Mixpanel, we believe giving our customers a smooth, seamless experience when they are analyzing data is critically important. When something happens on the backend, we want the user experience to be disrupted as little as possible. We’ve gone to great lengths to learn new ways of maintaining this level of quality, and today I want to share some of the techniques we’re employing.

During deploys

Mixpanel.com runs Django behind nginx using FastCGI. Some time ago, our deploys consisted of updating the code on our application servers, then simply restarting the Django process. This would result in a few of our rubber chicken error pages when nginx failed to connect to the upstream Django app servers during the restart. I did some Googling and was unable to find any content solving this problem conclusively for us, so here’s what we ended up doing.
Continue reading

We went down, so we wrote a better pure python memcache client

Memcache is great. Here at Mixpanel, we use it in a lot of places, mostly to cache MySQL queries but also for other data stores. We also use kestrel, a queue server that speaks the memcache protocol.

Because we use eventlet, we need a pure python memcache client so that eventlet can patch the socket operations to be non-blocking. The de-facto standard for this is python-memcached, which we used until recently.
Continue reading

How to do cheap backups

This post is a follow up to Why we moved off the cloud.

As a company, we want to do reliable backups on the cheap. By “cheap” I mean in terms of cost and, more importantly, in terms of developers’ time and attention. In this article, I’ll discuss how we’ve been able to accomplish this and the factors that we consider important.

Backups are an insurance policy. Like conventional insurance policies (e.g. renter’s), you want peace of mind that your stuff is covered if disaster strikes, while paying the best price you can from the available options.

Backups are similar. Both your team and your customers can rest a bit more easily knowing that you have your data elsewhere in case of unforeseen events. But on the flip side, backups cost money and time that could be better applied to improving your product — delivering more features, making it faster, etc. This is good motivation for keeping the cost low while still being reliable.

Continue reading

Sharding techniques

At Mixpanel, we process billions of API transactions each month and that number can sometimes increase rapidly just in the course of a day. It’s not uncommon for us to see 100 req/s spikes when new customers decide to integrate. Thinking of ways to distribute data intelligently is pivotal in our ability to remain real-time.

I am going to discuss several techniques that allow people to horizontally distribute data. We have conducted interviews (by the way, we’re hiring engineers) with candidates who made poor partitioning decisions (e.g. partitioning by the first letter of a user’s name), and I think we can spread some knowledge around. Hopefully, you’ll learn something new.

Continue reading