At Mixpanel, we heavily utilize Google Cloud Platform(GCP)’s SSD provisioned persistent disk (PD-SSD) to store the event data that underlies our service. We chose it to support our low-latency analytics database, which serves customer queries.
We found one major problem with PD-SSD: the cost. PD-SSD is 5x the cost per GB stored than Google Cloud Storage (GCS). Our infra team designed Mixpanel’s database, Arb, utilizing both PD-SSD and GCS; it stores recent, frequently changing data in PD-SSD for low-latency reads and much older, immutable data in GCS for a lower cost and acceptable latency.
As all engineers do, we first engineered for correctness/performance and later optimized for cost. PD-SSD initially held a large proportion of data for performance reasons, accounting for significant portion of our infrastructure cost. However, given that the vast majority of our data is immutable, we could shift the balance to GCS to drastically reduce our storage cost.
While read latency was initially a concern (average latency in GCS is worse than in our PD based solution), we realized that in a system like Arb, where a single query touches hundreds of disks, p99 matters more than average for end-user latency. We confirmed that while GCS was slower on average, its p99 was consistently equal to or better than PD-SSD at our throughput.
After the latency experiment, we shifted data older than 3 months to GCS, leaving our PD-SSD utilization was incredibly low. To realize the cost savings, we needed to downsize the disks. Unfortunately, GCP does not support downsizing a PD. This blog post describes how we built our own solution to do just that.
- Live: The most important requirement is not impacting our customer’s experience. PD-SSD downsizing should not cause data delay / query downtime.
- Idempotent: Repeat downsizing of the same disk should be a no-op.
- Automated: Downsizing should be triggered by a single command.
- Revertible: Design should allow rollback to the prior disk in case of error.
Primitives we have
- Multiple Zones: Arb uses multiple GCP zones for redundancy with each zone holding an independent copy of data and its own compute nodes. This design comes with several benefits: distributes load during peak time, relatively simple query and storage logic to maintain and improve, and no downtime during deployment or canary. We leveraged this for downsizing.
- Kafka: Kafka is a key component of our Ingestion pipeline. Our ingestion servers push events to Kafka, which then get consumed by Storage Servers to be persisted to PD-SSDs. In our configuration, Kafka maintains data for 7 days, so that even if a Storage Server reverts to six days ago, it can catch up to the most recent data without any data loss.
- GCP PD Snapshot: GCP provides a feature called Snapshot, which is what it sounds like: it creates a snapshot of a PD. Moreover, GCP supports PD-SSD creation based on an existing snapshot. It essentially clones a PD-SSD to a new PD-SSD (of the original size) with the same data. GCP intelligently creates snapshots incrementally based on previous snapshots, so it is storage and time efficient.
- Kubernetes Jobs: A Kubernetes job creates one or multiple pods and tracks the status of the pods. If a specified number of pods are completed successfully, the job is completed.
- GCP node pool: A node pool is a specified number of nodes with the same configuration. The configuration includes the number of CPUs and GPUs, size of memory and storage, etc.
Here is our solution with the given primitives.
Step 1: Create a cloned disk using snapshot
The first step is cloning a disk under a different pod called Downsizer using snapshot. The Downsizer pod is created in a separate node with ample resources. This guarantees the downsizing process has no impact on live Storage Servers, and allocates enough I/O throughput to perform downsizing as quickly as possible. There is a short period where reads/writes are blocked on this disk (up to 15 min) while taking a snapshot. After the snapshot is created, the Storage Server will catch up on backlog quickly and start serving traffic normally. Creating the cloned PD-SSD is performed independently in the Downsizer pod. If something goes wrong, the process is fully revertible using the snapshot.
The Downsizer pod is created by Kubernetes Job. Each PD-SSD downsizing job is named with the PD-SSD’s unique id, and the job creates a Downsizer pod with the same name. This Kubernetes configuration provides idempotency as a Downsizer job/pod for the same disk cannot be created multiple times.
This pod gets assigned to one of the nodes in the Downsizer node pool, with 16 CPUs per pod. This guarantees high PD-SSD performance in finishing the downsizing job in a timely manner, as the disk I/O throughput is roughly proportional to the number of CPUs of an instance. The node pool is specifically created for the downsizing process, allocated with enough CPU, memory, and the specified number of nodes.
Step 2: Create downsized disk and copy data from cloned disk
We successfully created a clone of the original disk, so it’s time to create a smaller PD-SSD and start copying the data. Because of the amount of the data and the throughput limitation, this process may take up to 10 hours. This process is performed in an independent Downsizer pod while Storage Server is serving traffic as usual, so there is no downtime.
The smaller PD-SSD size is determined by the disk usage of the cloned disk, and utilization factor. We used rsync to perform data copy/synchronization. rsync is reliable, fast and comes with built-in transmission checksum. Note that we did not use
--checksum flag, which performs an additional checksum on the existing files and slows down the whole process significantly. After the rsync is finished, we unmounted the smaller PD-SSD to guarantee buffer flush.
Step 3: Delete cloned disk and repeat step 1 and 2
During step 2, the Storage Server is serving data as usual which means there shall be data added to the original disk that was ingested during the rsync process. The added data will turn into a backlog for the disk to catch up.
When there is a backlog for a disk, Mixpanel’s database ensures that reads for data on that disk goes to the other zone. Once the backlog drops below a watermark, that disk becomes readable again. Thus a higher backlog leads to longer query downtime for a disk in one zone.
To minimize the backlog, we decided to perform step 1 and and run an second rsync to the disk created in Step 2. The advantage comes from the fact that we already copied most of the data in the downsized disk. rsync compares the metadata of all files and copies only the difference. The data transfer rate of one PD-SSD to another in a dedicated node with ample resources is much faster than going through the ingestion pipeline, including Storage Server, which may share the I/O throughput with multiple PD-SSDs. This second rsync process takes only about 10 to 20 mins, and this additional time is well worth it.
Step 4: Create snapshot from the downsized disk and restore the original disk
Mixpanel infra engineers have been leveraging GCP’s snapshot as a periodic backup. Also, we take snapshot before performing any disk-related engineering work in case something goes wrong. We implemented and utilized the capability to restore a Storage Server disk from a snapshot, and we are going to use this for Downsizer.
After the second pass of rsync is finished, Downsizer takes a snapshot of the downsized disk. This downsized snapshot contains complete data and disk size information. By triggering restore from the snapshot, Storage Server stops the read and write to the original disk, deletes the disk, and creates a smaller disk with the downsized snapshot.
Step 5: Wait for backlog to catch up and enable Read
Now we have the downsized disk mounted to a Storage Server, but we are not quite done yet. Even after the second rsync, there is still a 20 ~ 30 mins gap between the time that the second snapshot is taken, to the time that the restore from the downsized snapshot is finished. If we serve this disk to query data right away, we may serve stale data and we need to avoid this. Downsizer is responsible for blocking reads to the disk, monitoring the amount of backlog for the disk, and enabling read only when the disk catches up to the most recent data. This step usually takes less than 10 mins.
The previous steps show the process of downsizing one disk, but we have hundreds of disks. It is not scalable to run Downsizer manually, one at a time. We need a tool to automate downsizing for many disks.
- One zone at a time: From the downsizing description above, there are some steps that require unavoidable downtime: taking snapshots, restoring from a snapshot, and waiting for backlog. If we restrict the downsizing process in only one zone, then the other zone would serve the traffic normally, so no downtime would caused by Downsizer.
- 10% of the disks in one zone at a time: Conservatively, stopping about 10% of the disks in one zone has shown minimal impact on our customer’s experience.
Based on this, we built a tool that performs downsizing for one zone:
- Get a list of all disks in one zone
- Check the number of Downsizer Kubernetes jobs that are completed and running.
- If the number of currently running jobs is lower than 10% of the total number of disks, create more jobs to meet 10%
- Repeat Step 2 and 3 until all of the disks are finished
This approach is even more conservative as it is highly unlikely to overlap the disk downtime all at the same time to reach 10% of the disks in a zone.
Coordinating with others
The last problem left to solve was how to execute Downsizer with a minimal impact on productivity of other engineers. At Mixpanel, deployment and/or canary happens almost every day, and running Downsizer means blocking concurrent Query Server and Storage Server deployments. We use a lightweight deployment lock system to allow engineers to take “locks” on resources, preventing concurrent deploys.
We initially estimated ten hours per disk for downsizing, and we have about 300 disks per zone. Running 10% of the number of disks at a time, it would take about four full days to complete. Even considering the weekends, we needed to block deployments for two work days, and that was not acceptable.
We could have just run it during weekends, but that was also not acceptable at the beginning because we didn’t know if downsizing would reveal unforeseen side effects, and we wanted to observe and catch them early before downsizing all of the disks.
The solution was rather simple: acquire deployment lock at the end of work day and start Downsizer; stop Downsizer around midnight, and release lock in the morning after all running Downsizer jobs are finished. Of course, communicating with other engineers (especially those holding the pager) was a must. In this way, we started off with a smaller number of disks and observed any changes in production behavior. We found that it takes about four hours on average to downsize a disk, because we have less data than what we had originally planned for. After a couple nights of downsizing, we confirmed that there were no side effects observed, so we kicked off full downsizing for one whole zone during the weekend. Now that we knew it takes less than two full days to downsize with no side effects, we confidently ran the other zone during the weekend, several weeks later.
After the whole process was finished, we reduced Arb PD-SSD usage from 1.05PB to 390TB. That’s about $112k per month cost reduction minus the increased GCS cost. The GCS cost increased approximately about $20k including storage and operations cost. Thus the overall cost reduction is about $90k per month, $1M per year!
There are several things that we learned from this project:
- Kubernetes and GCP: Idempotency could have been something tricky to achieve. With Kubernetes jobs, however, we just needed to configure the jobs correctly. Snapshots and node pools were very handy as well. Understanding the provided primitives is key for efficient tool building.
- rsync and buffer flush: rsync is a powerful and efficient utility. We tested parallelizing rsync by using parallel, and we found no difference in return. The second pass sync would have been hard without rsync. Unmounting the disk was the method that we used to guarantee the buffer flush.
- 2nd pass sync for minimal backlog: When we first implemented without the second pass sync, we set a timeout for backlog catch up time to several hours, proportional to the time that it took to rsync data from a cloned disk to a downsized disk. Depending on a Storage Server’s available write throughput, it may take up the entire time allotted, and during that time, read is blocked. This problem is solved by the second pass sync and minimized query downtime.
- Back-of-envelope calculation: We planned out each step by back-of-envelope calculation first, and then performed implementation and execution. After execution, we checked if it matched our expectations. If it didn’t match we investigated and figured out why there were discrepancies. This way, we predicted ops impact accurately and was able to execute the downsizing process with minimal pain.
- Communication: We looped in service owners at the design and planning stage. We learned all the available in-house utilities that we could use, and the service owners understood what we were planning to do, giving us valuable advice. This communication continued throughout the project and was the key factor of finishing it efficiently with minimal trial-and-error and side effects.
Special thanks to Yu Chen who gave his blood, sweat, and tears with me to this project.
If you enjoyed this article and are interested in working on high-performance distributed systems at Mixpanel, we are hiring!