Safely Rewriting Mixpanel’s Highest Throughput Service in Golang

It’s always important to use the right tool. I grew up working on small outboard engines with my father and learned this lesson the hard way. I only had to strip a bolt once to realize that using the almost-right tool could result in a mistake that took hours to correct. At Mixpanel, we try to use the right tools to allow us to move quickly, and with confidence. One great example of this is how we used Diffy to check against live production traffic when migrating our ingestion API from Python to Golang.

A few years ago, we decided that Golang was the best tradeoff of readability and performance for our critical infrastructure services. In order to realize the gains and maintain a homogenous code base, this meant we had to migrate the existing performance-critical Python services. Since then, almost all of our existing critical infrastructure has been migrated to Golang, with the glaring exception of our data collection API. The main reason for this, honestly, is summed up by the cliché adage, “if it ain’t broke, don’t fix it”. The data collection API, where all incoming data is first processed, is one of the most critical services at Mixpanel. Downtime can result in lost data, increased latency can ruin user experiences, and incorrectness can cause permanent data corruption. The risk of migrating to brand-new code, in a completely different language, outweighed the potential benefits.

In January, we finally decided that in order to meet our 2019 goals, the data collection API needed to be migrated to Golang. First, let me describe how our existing API was setup and why. Figure 1 shows the environment the Python API was originally built for. At the time, Mixpanel was running our code directly on real, 8-core machines (not VMs), and since the Python API was designed to only use a single processor, we ran eight instances of the API serving eight consecutive ports. In front of the group of API machines was a handful of nginx instances which accepted the HTTP traffic and load-balanced it to the API servers. After processing the request, the data was added to a series of kestrel queues, and a simple error code returned to the users. 

Figure 1: Original API Setup

In 2018, we migrated from using real machines to using a kubernetes deployment hosted on Google Cloud Platform. We made several setup changes while keeping the core Python API code untouched. Figure 2 shows the end result of that migration. A request was directed, via the Google Load Balancer and kubernetes to a kubernetes pod, where an Envoy container load-balanced between eight Python API containers. The Python API containers then submitted the data to Google Pubsub queue via a pubsub sidecar container that had a kestrel interface. From the perspective of the API, nothing had changed, it still received events, processed them, and submitted to a “kestrel” queue, which was really just a proxy for submitting to Google Pubsub via Golang.

Figure 2: API Setup (January 2019)

At this point, the next logical step was to update the actual API. The Python API wasn’t built with the current GCP environment in mind, and it took several hacks to get it working. The only hesitation was that the API’s functionality was very complicated and nuanced. We needed a way, in addition to unit tests, to guarantee correctness for all the edge cases we see in the wild. Mixpanel processes over a trillion data points a month, and at this scale, exceptionally rare edge cases are seen every single day. We needed to test directly with a large amount of live traffic, without affecting production. 

To enable testing against live traffic, we created a dedicated setup. The setup was a separate kubernetes pod running in the same namespace and cluster as the API deployments. The pod ran an open source API correctness tool, Diffy, along with copies of the old and new API services. Diffy is a service that accepts HTTP requests, and forwards them to two copies of an existing HTTP service and one copy of a candidate HTTP service. It compares the responses, and then shows the differences via a web page. Using the two copies of the existing service, it categorizes expected differences, such as timestamps and random values, as noise, so that unexpected differences are easier to discern. We chose Diffy because we were able to filter out random noise, the comparison code could handle deeply nested lists and maps in JSON, it was able to handle our scale, and it was super easy to deploy within our existing infrastructure.  With this setup, the comparison code was nicely encapsulated, and easily portable to all of our API clusters. 

There were two additional changes that needed to be made. First, we had to modify our services to get the best analysis with Diffy. Typically, the services submit data to a queue and then return just an error code to the client. Instead of this, the modified services returned the data and error code. This was to ensure the processed data was correct, and fully utilize Diffy’s correctness checking. Second, we needed to direct live shadow traffic to the comparison pod, which we did using the existing Envoy containers. Using Envoy, this requires two simple steps. First, we added the Diffy service as another type of cluster, which can be seen below: 

Second, we added a request_mirror_policy to our regular route. From Envoy’s documentation, a request_mirror_policy is implemented as “fire and forget”, which is exactly what we needed for this scenario. Further, Envoy allows you to limit the amount of traffic sent to the shadow cluster, using a runtime value. Since we only wanted to run one instance of the comparison pod, we opted to limit this quite a bit, typically only directing between 0.5% and 2.0% of traffic. The change to add the request_mirror_policy is below:

The final Diffy pod setup, as described above, can be seen in Figure 3. 

Figure 3: Diffy Setup

Diffy was an invaluable tool during this migration. We were able to compare the processed data from the existing API and the new API, patch the code, and quickly deploy again to verify the fix. At least a dozen bugs were found and fixed in quite a short time frame. We eventually were able to run at least ten million live samples at a time through Diffy without seeing any unexpected differences. Checking the new service with this many live samples made us especially confidence in its correctness. 

Though this setup helped us avoid many critical bugs, it wasn’t perfect. In particular, we needed to change how the APIs handled the processed data in order to return it in the response. For the new APIs, we did this using a specific error type. This generally worked well, but resulted in one bug since Diffy was unable to test a particular non-error code path. Also, we weren’t able to do any type of load-testing using this setup. Still, despite these issues, this setup sped up the migration process and greatly reduced the number of bugs. 

The final setup of our data collection API can be seen in Figure 4. You can see that only the multiple API containers and pubsub sidecar have changed, otherwise the surrounding services are identical. One huge improvement is we only need to run a single API container per pod, since the new Golang code uses multiple processors. Now, we have a clean slate to begin further optimizations on our data collection API. We’ll definitely continue using Diffy to ensure correctness and allow our team to move fast on changes.

Figure 4: New Data Collection API

At Mixpanel, we strive for a balance between reliability and innovation. Our customers rely on our services to make critical business decisions, and any downtime can impact their ability to make these decisions quickly and correctly. However, we are constantly trying to improve the tools we offer to better suit customers’ present and future needs. Using tools like Diffy, we can achieve both goals, and do so confidently. 

Update (July 26, 2019):

There was a comment about the gains we saw from this migration. I checked the data, and from a rough calculation we saw about 40% decrease in the amount of CPU resources used. See Figure 5 for a chart of CPU usage over time.

Figure 5: The orange line represents the old API’s total CPU usage, in cores, and the blue line represents the new API’s total CPU usage.

Update #2 (July 26, 2019):

There was another comment about latency. Overall, we saw latency stabilize for both avg and max p99. Max p99 latency also decreased a bit. See Figures 6 and 7 for reference.

Figure 6: Avg p99 latency
Figure 7: Max p99 latency

Evan Noon

Evan is Senior Software Engineer working on Mixpanel’s Infrastructure Engineering Team. When not working, he enjoys sailing and kiteboarding, as he never actually got proficient at fixing boat motors.