How Mixpanel Built a “Fast Lane” for Our Modern Data Stack

Illirik Smirnov
Mixpanel Engineering
16 min read · Dec 13, 2023


Introducing YCat: A Simple, Near-Real-Time Architecture for Event-Driven Orchestration

Intro

I work on internal data infrastructure at Mixpanel, and our team has experienced the benefits of the modern data stack firsthand. Building and maintaining a source of truth using ETL and in-warehouse transformations, which is then synced to our internal tools and other destinations using Reverse ETL, is head and shoulders above the Wild West of point-to-point integrations with ad-hoc transformations. However, just like any other paradigm, there are gaps — puzzles that need to be solved within the paradigm. Previously, I wrote about the gaps in monitoring the costs of pay-as-you-go data warehouses like BigQuery. Today, I’ll write about the issues we ran into while decommissioning an email marketing tool, and how we solved them by building a simple event-driven orchestration system.

A photograph taken at a London Underground station, showing the edge of the train platform where a train’s door is open. Behind the tactile platform-edge surface, the tiles read “Mind The Gap”, referring to the gap between the platform edge and the train.
Data stacks and underground trains aren’t that different!

Context: Email Marketing Automation

For our email marketing platform, we’ve long used Adobe Marketo; not because we’re fans, but because we chose Marketo long ago and it just stuck around. The purpose of an email marketing automation platform is to enable targeted emails to your customers or prospects using the data you have about them. For example, when we plan an in-person event, we want to invite our users and customers who live nearby; we don’t want to invite users who live in e.g. Pittsburgh to an event in London (it’s a long flight)!

To enable this kind of targeting, the tool needs to have access to information about the users and customers. This can go well beyond simple demographic information. For example, if we want to send an email announcing our new Redshift connector, it likely doesn’t make sense to send that email to users who have already connected Mixpanel to a different data warehouse. To target that email, we would need product usage information — specifically, which users have integrated which connectors.

Marketo, like most other email marketing automation tools, maintains this information in a database within the system; like most Marketo users, we had configured Marketo to access this data through a Salesforce sync. At the time, we weren’t on the modern data stack and didn’t have a data warehouse, so our CRM was our best source for the targeting information Marketo needed. As I like to say, “If you don’t have a data warehouse, your CRM is your data warehouse.”

Like most legacy enterprise software, Marketo is endlessly configurable, with workflows and smart lists that allow you to build any sort of automation you can imagine. Since it syncs state both to and from Salesforce, you can use these automations to do much more than target emails. Users can build automations within Marketo that validate and transform the targeting information; it can essentially become a no-code automation tool.

Over time, these automations accumulate. Attributes are added for each new targeting factor. Old attributes are not deleted — nobody owns the system holistically, and the employees who configured the old automations have left, so it’s seen as safer to just add more and more stuff to Marketo, which leads to more and more state in Salesforce via the bi-directional sync. Eventually, you have over 600 fields on your Salesforce lead object, and it becomes very difficult to make any additional changes, barring a comprehensive project to migrate away from Marketo.

A seal depicting an octopus sitting on the Earth, perched atop the North Pole. Its tentacles are spread across the earth. The text of the seal reads “Nothing Is Beyond Our Reach”. This image is an edit of a mission patch produced by an American intelligence service, with the name of the mission replaced by the Marketo logo.

Reverse ETL latency and time-sensitive workflows

When we kicked off our comprehensive project to migrate away from Marketo, the first step was to interview our marketing team and deeply understand their use cases. They were stoked to get a clean source of truth from our data warehouse — our new system would contain only the folks that actually expressed interest in receiving these emails, with the same trustworthy demographic and product usage information that we’re using throughout the rest of our stack. Sending out invitations for events and product updates would be a breeze!

However, there were some use cases where the latencies of ETL, dbt (our in-warehouse transformation tool), and reverse ETL added up to a real stumbling block. Suppose a user fills out our contact sales form on our website. If we processed this form fill like any other interaction with our website, the data flow looks something like this:

  • The contents of the form fill are stored in our application database when the API call to submit the form is made.
  • The table containing the form fill is ETLed into our data warehouse in a sync that runs every fifteen minutes.
  • The source table is transformed using dbt models into a destination table. We have two “tiers” of tables; some are built only once per day, while tables we want to build more frequently are built every two hours.
  • The transformed records are synced to Salesforce (so the lead can be routed to a rep), as well as customer.io and other necessary destinations.
Flow required to process an event using a standard MDS stack — those latencies add up!

The biggest delay here is the in-warehouse transformation, which does have some room for optimization. We could build the tables more frequently, but that can get pricey. We could ensure that the entire lineage of our destination table consists of views, but many models are expensive to compute. Running a query against a deep lineage of views comes with its own performance problems, especially when you’re paying for queries as you go (the pricing model that better fits the spiky load of a regularly scheduled dbt run).

This latency becomes more problematic for multi-step workflows. Suppose our field marketing team is running an event. First, an invite list is put together and an invitation email is sent to the users who might be interested. When users register for the event, we need to send them a timely confirmation email. The event could fill up, so we might move the event to a space-available waitlist. Users might register just before an event happens; perhaps a conflicting meeting was cancelled at the last minute. All of these state changes need to be visible in near-real-time to the attendees (via email confirmations and waitlist notifications), the field marketing team, and sales reps (they need to know if one of their customers is attending the event). Even if in-warehouse transformation were instant (e.g. the “views all the way down” approach), the ETL and reverse ETL delays are simply too long — if someone registers just a few minutes before an event, they still expect their name on the list at the door.

An illustration of six turtles standing on each others’ backs. Each turtle is smaller than the turtle it is standing on. The illustration has been edited to label each turtle with dbt configuration code to materialize a model as a view.
Views all the way down can help ameliorate the latency of in-warehouse transformations, but ETL and reverse ETL delays can’t be helped.

Near-real-time problems and solutions

The consequences were clear — we needed a “fast lane” for processing time-sensitive events, the nearer to real-time the better. There are a whole host of ways to get data from one system to another, each with pros and cons to consider. Generally, there are three main approaches you can take here:

  1. Build a classic real-time data pipeline on a streaming architecture — spin up a Kafka broker, dust off the Spark book on your shelf, and do some good old-fashioned stream processing. This is all well and good, but it introduces a lot of complexity and infrastructure to manage. Since we are a modern data stack team, we don’t have any pipelines like this — it’s a lot of complexity to build out, within an already massive project, just to process a relatively small number of events.
  2. Connect all of these internal systems (Salesforce, email marketing automation, event registration) directly using point-to-point integrations. Between native integrations, webhooks, and the host of rETL-like features popping up in so many internal tools, it can be surprisingly easy to implement all of these data flows by configuring syncs in the right way. However, not all of the tools in our stack integrate with each other. The only common denominator is Salesforce, and having our CRM manage more state is the exact opposite of our vision for our overall data architecture. Of course, sending all data through a bloated CRM was one of the factors that got us in this mess in the first place!
  3. Use a no-code or low-code automation tool, like Zapier, Parabola, Boom, Tray, Integrately, Workato, IFTTT, or one of the other thousands of alternatives. They’re so plentiful and omnipresent because they can be fantastic tools for solving this class of problems, especially for non-technical teams… when used properly and responsibly. Things can quickly get out of hand, and no-code means no version control (although there are startups out there tackling that problem). Given that we’re an engineering team, with backgrounds in data, full-stack, and backend engineering, the benefits of code (version control, unit tests, robust change management, etc.) were too much for us to pass up.

We needed a healthy middle ground — an architecture that allows us to own and manage these integrations using code, without the heft of a classic real-time data pipeline or the limitations of direct integrations. So, after we mapped out all of our workflows and use cases, we took the best parts of a classic real-time data pipeline and combined them with the best parts of direct integrations and no-code automation. We call this system YCat, for “yes-code automation tool” — a play on “no-code automation tool” — and I’ll go over the high-level architecture in this article.

Before I go into our architecture, I want to be clear that you can — and should! — tweak this approach to meet your needs, as well as your team’s unique context. Maybe your stack looks different, or you use more direct integrations, or you ingest data from an existing store rather than webhooks. Feel free to experiment and improvise!

Architecture Overview

An architectural diagram representing the YCat system, in the form of a flow chart. The alt-text character limit on Medium does not allow a full description of the flow chart, but the accompanying plaintext describes everything in the diagram in detail.
High-level architectural diagram of our YCat system

At a high level, the architecture is very similar to a traditional data pipeline, but simplified to the bare minimum necessary for processing events. We want to sync as much of the state as possible through our modern data stack, so user profiles, enrichment information, and the like don’t need to be processed or synced through this system — they can continue to be reverse ETLed in. This system will only process the latency-sensitive events that our modern data stack implementation struggles with.

By restricting our system to only process events, it can process those events as idempotent transactions; we map each event type (e.g. event registration, email sent, form filled out) to a callback handler that makes the appropriate API calls to the appropriate services in the appropriate order. This approach means we don’t need a datastore to maintain state within the system, which reduces complexity and eliminates state mismatches between YCat’s datastore and our data warehouse.
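To make that concrete, here is a minimal sketch of what such an event-type-to-handler mapping could look like in Python; the names and the `HandlerSpec` shape are illustrative, not our exact internal code.

```python
# Illustrative sketch of an event-type-to-callback registry; all names are hypothetical.
from dataclasses import dataclass
from typing import Callable

@dataclass
class HandlerSpec:
    topic: str                         # Pub/Sub topic the worker subscribes to
    schema: type                       # expected message schema (e.g. a pydantic model)
    callback: Callable[[dict], None]   # idempotent handler for a single event

def handle_form_fill(event: dict) -> None:
    """Make the Salesforce / email-tool API calls for a form fill, idempotently."""
    ...

HANDLERS: dict[str, HandlerSpec] = {
    "form_fill": HandlerSpec(topic="form-fills", schema=dict, callback=handle_form_fill),
    # "event_registration": HandlerSpec(...), "email_sent": HandlerSpec(...), etc.
}
```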

Events are ingested by a lightweight API service; we used FastAPI, which made this a breeze. The API service ingests events via webhook from the service each event originates from. The ingestion service should be kept as lightweight as possible and be hosted in some way that allows it to autoscale (so traffic spikes can be handled). It’s probably best to piggyback on whatever infrastructure you already have; otherwise, use a serverless solution like Google Cloud Functions or a lightweight service runner like Cloud Run. We chose the former and built it out as a Kubernetes workload on a GKE Autopilot cluster, which made it super easy to spin up and gave us a ton of monitoring and metrics out of the box.

The ingestion service validates the messages against the appropriate schemas, serializes them, and pushes the events as messages into a message broker, just like most classic data pipelines. We used Pub/Sub here, with one topic for each type of message. This allows us to easily monitor how many messages are in each topic, track latencies and SLAs with per-event granularity, and gracefully handle massive traffic spikes (like sending a product newsletter to hundreds of thousands of users).
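As a rough sketch (not our production code), the ingestion path can be as small as a single FastAPI endpoint that validates the payload with pydantic and publishes it to the topic; the project, topic, and schema names here are assumptions.

```python
# Minimal ingestion-service sketch: validate, serialize, publish. Names are illustrative.
import json

from fastapi import FastAPI
from google.cloud import pubsub_v1
from pydantic import BaseModel

app = FastAPI()
publisher = pubsub_v1.PublisherClient()
FORM_FILL_TOPIC = publisher.topic_path("my-gcp-project", "form-fills")  # hypothetical names

class FormFill(BaseModel):
    email: str
    name: str
    company: str | None = None

@app.post("/webhooks/form-fill", status_code=201)
def ingest_form_fill(payload: FormFill):
    # FastAPI/pydantic reject payloads that don't match the schema before this runs.
    data = json.dumps(payload.model_dump()).encode("utf-8")
    publisher.publish(FORM_FILL_TOPIC, data=data)  # returns a future; fire-and-forget here
    return {"status": "queued"}
```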

The worker pool reads messages from the broker using a streaming pull subscription. Here, Pub/Sub’s support for exactly-once delivery is super valuable — we need to process our events idempotently (because we don’t have a memory store), and ensuring that two workers won’t be processing the same message at the same time reduces the surface area where race conditions could bubble up. We write a callback for each message type, which makes the appropriate API calls and implements other business logic, and package it in a specification that includes the schema for the message and the topic the worker should subscribe to. These callbacks can be unit tested like any other code, with ample logging and Prometheus metrics for performance and reliability monitoring. We also configured pages on these metrics, alerting us when the shape of our webhooks changes upstream, or when our downstream API calls time out or fail.
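A worker along these lines might look like the following sketch, using the Pub/Sub streaming pull client; the subscription name and the `handle_form_fill` callback body are placeholders.

```python
# Worker sketch: streaming pull, deserialize, dispatch, ack/nack. Names are placeholders.
import json

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
SUBSCRIPTION = subscriber.subscription_path("my-gcp-project", "form-fills-worker")

def handle_form_fill(event: dict) -> None:
    """Idempotent business logic for a form fill (API calls, logging, metrics)."""
    ...

def on_message(message: pubsub_v1.subscriber.message.Message) -> None:
    try:
        event = json.loads(message.data.decode("utf-8"))
        handle_form_fill(event)
        message.ack()    # success: the event is fully processed
    except Exception:
        # In a real worker: log and increment a Prometheus error counter here.
        message.nack()   # redeliver later; dead-letter after repeated failures

streaming_pull_future = subscriber.subscribe(SUBSCRIPTION, callback=on_message)
with subscriber:
    streaming_pull_future.result()  # block forever, processing messages as they arrive
```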

When a callback processes a message properly, it acknowledges the message, and the event is fully processed. If an error is encountered, the message is nacked (negative acknowledgement) so it can be redelivered later. If the message fails to process repeatedly, the message is redirected to a dead letter queue and we get paged; we do this after five delivery attempts all fail. That way, after we fix the issue, we can easily replay the message so that it’s processed properly.

It might sound like a lot up front, but it’s dead simple to work with and has been super reliable for us. Gone are the days of debugging point-to-point integrations and regexing page names out of failure notification emails — our ops lead rotations are much more efficient and issues are easily caught before bad state causes further downstream problems. Often, issues with direct integrations and no-code automations are brought to the attention of their owners only once they’ve had a significant downstream impact, such as an event attendee arriving at the venue only to find they’re not on the list, or a support ticket from a prospective customer asking why they haven’t heard from the sales team after requesting a demo. Building out event handlers as unit-tested and monitored callbacks, which can be configured to retry individual API calls or fall back on simpler behavior, is fundamentally more reliable than a no-code automation tool or point-to-point integration — you’re not constrained by how the tool handles errors or service degradations.

Example: Processing a Form Fill

Let’s walk through this architecture using a concrete example: processing a form fill. The event enters YCat through an API call. In this case, the form’s Submit button calls our app’s API, which records the form fill in its database and calls the ingestion service (via an asynchronous Celery task to enable retries). The ingestion service serializes the form data and publishes it as a Pub/Sub message to the topic for form fills. (It will also validate the form data against the expected schema, but this is less relevant in this example; the form data will have been validated by the form already.)
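For the hand-off from the app to the ingestion service, a retrying Celery task is enough. This is a hedged sketch with a hypothetical broker URL and ingestion endpoint, not our actual task.

```python
# Sketch of the app-side hand-off: an async Celery task that retries on failure.
# The broker URL and ingestion endpoint are hypothetical.
import requests
from celery import Celery

celery_app = Celery("app", broker="redis://localhost:6379/0")

@celery_app.task(autoretry_for=(requests.RequestException,), retry_backoff=True, max_retries=5)
def forward_form_fill(form_data: dict) -> None:
    resp = requests.post(
        "https://ycat-ingest.internal/webhooks/form-fill",  # hypothetical internal URL
        json=form_data,
        timeout=10,
    )
    resp.raise_for_status()  # a non-2xx response raises, triggering a retry with backoff
```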

The message is pulled from the topic by one of our workers. The worker then deserializes the message and calls the appropriate callback for the message. The callback makes the appropriate API calls for a form fill; in this case, it will:

  • Check if there is a Lead or Contact in Salesforce with the email address used to submit the form. If there isn’t one, we create one, using the information (like name and company name) provided in the form.
  • Check if the Lead or Contact in Salesforce (whether it already exists or we just created it) is a member of the appropriate Salesforce campaign. If they aren’t, add them to the campaign.
  • Check if there is a person in our marketing email system with this email address. If there isn’t, create one.
  • Send an event to the marketing email system representing the form fill. This event includes a deduplication key that is deterministically generated using the form, so if the event is sent twice, it will be deduplicated by the destination system.

Notably, this callback is idempotent, because each state change in the destination systems is only performed if the system doesn’t have the desired end state already. This is especially important for Salesforce, as there are many ways we can create Leads and Contacts there, and we want to avoid race conditions. (We avoid race conditions within YCat by using exactly-once delivery in pub/sub, which prevents a message from being processed by two workers at the same time.)
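In code, the callback’s “check, then change only if needed” structure might look roughly like this; `sf` and `email_tool` stand in for thin client wrappers around the Salesforce and email-tool APIs, and all field names are purely illustrative.

```python
# Sketch of an idempotent form-fill callback. `sf` and `email_tool` are hypothetical clients.
import hashlib

def handle_form_fill(event: dict, sf, email_tool) -> None:
    email = event["email"]

    # 1. Ensure a Lead/Contact exists in Salesforce (create only if missing).
    person = sf.find_lead_or_contact(email)
    if person is None:
        person = sf.create_lead(email=email, name=event.get("name"), company=event.get("company"))

    # 2. Ensure campaign membership (add only if not already a member).
    if not sf.is_campaign_member(person, event["campaign_id"]):
        sf.add_to_campaign(person, event["campaign_id"])

    # 3. Ensure the person exists in the email marketing tool.
    if email_tool.get_person(email) is None:
        email_tool.create_person(email=email, name=event.get("name"))

    # 4. Send the form-fill event with a deterministic deduplication key, so a
    #    retried message is deduplicated by the destination system.
    dedup_key = hashlib.sha256(
        f"{email}:{event['form_id']}:{event['submitted_at']}".encode()
    ).hexdigest()
    email_tool.track_event("form_filled", email=email, properties=event, dedup_key=dedup_key)
```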

Suppose one of these API calls fails — the worker nacks the message, putting it back on the queue to be retried later; if enough retries fail, the message is sent to a dead-letter queue and we are paged. On the other hand, if all the API calls succeed, the message is acknowledged and the form fill is officially complete.

Four things we learned while building the system

Your system, of course, might look a little different than ours, and you’re almost certainly going to be integrating with different services. We did learn some things while building and implementing this system that I’d like to share, which should also shed light on the problems and solutions you might run into along the way.

If throughput is a problem, consider using bulk APIs, especially for Salesforce.

Often, you’ll be able to horizontally scale your workers to get more throughput. However, if the API you’re interacting with doesn’t let you horizontally scale enough, pulling batches of messages and passing them to a single callback can really help with throughput. This can happen when your API has strict rate limits or locks resources during a transaction.

For example, when you add members to a campaign in Salesforce, that API call locks not only the CampaignMember object you’re creating or updating, but also the Campaign itself. This means that we can’t horizontally scale to handle bursty “sent email” traffic. By pulling and processing batches of hundreds of messages in one callback, we can use bulk API calls to set up the proper Salesforce state with much higher throughput.

If you build out bulk support, your callback will need to keep track of which messages were successfully processed and ack or nack the messages as appropriate, usually at the end of the callback. You’ll also need to tweak your flow control settings, to ensure that your worker is able to pull messages efficiently. Unit testing this behavior is super important — a single-message callback failing in an unexpected manner will cause one message to get retried, while a bulk callback can cause hundreds or thousands of messages to be unnecessarily retried.
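A batch-mode worker can use Pub/Sub’s synchronous pull and then ack or nack each message after the bulk call. In this sketch, `bulk_upsert_campaign_members` is a hypothetical wrapper around the Salesforce Bulk API, and the subscription name is made up.

```python
# Batch worker sketch: pull a batch, make one bulk call, then ack/nack per message.
import json

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
SUBSCRIPTION = subscriber.subscription_path("my-gcp-project", "email-sent-worker")  # hypothetical

def bulk_upsert_campaign_members(events: list[dict]) -> list[bool]:
    """Hypothetical Salesforce Bulk API wrapper; returns per-event success flags."""
    ...

def process_batch() -> None:
    response = subscriber.pull(request={"subscription": SUBSCRIPTION, "max_messages": 500})
    if not response.received_messages:
        return

    events = [json.loads(m.message.data) for m in response.received_messages]
    results = bulk_upsert_campaign_members(events)

    ok_ids = [m.ack_id for m, ok in zip(response.received_messages, results) if ok]
    bad_ids = [m.ack_id for m, ok in zip(response.received_messages, results) if not ok]

    if ok_ids:
        subscriber.acknowledge(request={"subscription": SUBSCRIPTION, "ack_ids": ok_ids})
    if bad_ids:
        # Setting the ack deadline to 0 nacks the failed messages so they are redelivered.
        subscriber.modify_ack_deadline(
            request={"subscription": SUBSCRIPTION, "ack_ids": bad_ids, "ack_deadline_seconds": 0}
        )
```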

Retries, acknowledgement deadlines, and dead-lettering in Pub/Sub

You’ll need to tweak your acknowledgement deadlines so that your callbacks have enough time to process their messages. The default is 10 seconds for Pub/Sub, which can quickly be eaten up by slow API calls. The Salesforce Bulk API is especially slow — our bulk “email sent” handler has a deadline of ten minutes because it needs to make up to four Salesforce Bulk API calls.

If your “messages in dead letter” page is too noisy, and re-processing the messages clears up your issue every time, consider increasing the number of retries. The default and minimum number of delivery attempts is 5, but it can be configured up to 100. Exponential backoff can also be configured to wait a bit between retries, which is very helpful when integrating with flaky APIs.
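These knobs all live on the subscription. Here is a sketch of how they could be set with the Pub/Sub admin client; the project, topic, and subscription names (and the specific numbers) are illustrative, not our actual configuration.

```python
# Sketch: ack deadline, dead-lettering, and exponential backoff on one subscription.
from google.cloud import pubsub_v1
from google.protobuf import duration_pb2

PROJECT = "my-gcp-project"  # hypothetical
subscriber = pubsub_v1.SubscriberClient()

subscriber.create_subscription(
    request={
        "name": f"projects/{PROJECT}/subscriptions/email-sent-worker",
        "topic": f"projects/{PROJECT}/topics/email-sent",
        "ack_deadline_seconds": 600,  # slow bulk handlers get up to ten minutes
        "dead_letter_policy": pubsub_v1.types.DeadLetterPolicy(
            dead_letter_topic=f"projects/{PROJECT}/topics/ycat-dead-letter",
            max_delivery_attempts=20,  # raised from the default/minimum of 5
        ),
        "retry_policy": pubsub_v1.types.RetryPolicy(
            minimum_backoff=duration_pb2.Duration(seconds=10),
            maximum_backoff=duration_pb2.Duration(seconds=600),
        ),
    }
)
```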

Unit and integration testing for reliability

One of the most critical benefits of this architecture is that it’s easy to write unit and integration tests — so do! Each topic should have tests, not only for the happy path, but for the many types of edge cases you can run into. For example, if a specific API call fails, does the state left over make sense? Does retrying the message with a working API result in the same state as a message processed properly the first time? Be sure to test what happens when a message is processed twice (e.g. it got retried because it hit the acknowledgement deadline) — is idempotence maintained?

Rather than mocking individual API call signatures and their responses, build a fake (or mock — testing nomenclature can be messy!) client for the services you integrate with, which maintains the state of the system it’s impersonating. For example, when you create a Lead in your mock Salesforce client, actually store the lead data in the mock client class, and write your mock GET and PATCH endpoints to work on this state as well. This helps make your tests easier to write and less brittle; it also makes it easy to mock the state in the external systems that your callbacks are interfacing with.
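For instance, a stateful fake can be as simple as a class holding dicts and sets; everything here is illustrative and trimmed down to the calls used in the form-fill example.

```python
# Sketch of a stateful fake Salesforce client for tests; all names are illustrative.
class FakeSalesforce:
    def __init__(self) -> None:
        self.leads: dict[str, dict] = {}                      # keyed by email
        self.campaign_members: set[tuple[str, str]] = set()   # (email, campaign_id)

    def find_lead_or_contact(self, email: str) -> dict | None:
        return self.leads.get(email)

    def create_lead(self, email: str, **fields) -> dict:
        self.leads[email] = {"email": email, **fields}
        return self.leads[email]

    def is_campaign_member(self, person: dict, campaign_id: str) -> bool:
        return (person["email"], campaign_id) in self.campaign_members

    def add_to_campaign(self, person: dict, campaign_id: str) -> None:
        self.campaign_members.add((person["email"], campaign_id))


def test_lead_created_only_if_missing():
    sf = FakeSalesforce()
    for _ in range(2):  # simulate the same message being processed twice
        if sf.find_lead_or_contact("a@example.com") is None:
            sf.create_lead("a@example.com", name="Ada")
    assert len(sf.leads) == 1  # idempotence: no duplicate leads
```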

Security and ingestion-time validation

You definitely need some sort of load balancer that’s able to ensure the requests to your ingestion API service are legit. Typically, the webhooks your upstream systems send will be signed, or come from a specific IP address; check the documentation of your upstream systems to see how they validate their webhooks. We use Cloud Armor to block requests from the general Internet, which prevents bad state from entering our systems.

Carefully consider what your ingestion layer should do when it gets malformed data. Usually, a webhook sender that gets a response outside the 2xx range will retry; this can spam your ingestion layer if you’re not careful. Our approach when we get a webhook we don’t want to process — for whatever reason — is to return a 201 for the request, but not put the message on the topic. If your ingestion service is getting requests that can’t be retried for whatever reason, consider logging bad messages and paging on high error rates, which will help avoid potentially unrecoverable data loss.
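One way to implement that accept-but-drop behavior in FastAPI is to override the request-validation handler so malformed webhooks get a 2xx back (and are never published) while still being logged for alerting; this is a sketch under those assumptions, not our exact handler.

```python
# Sketch: accept-but-drop malformed webhooks so upstream senders don't retry-spam us.
import logging

from fastapi import FastAPI, Request
from fastapi.exceptions import RequestValidationError
from fastapi.responses import JSONResponse

app = FastAPI()
log = logging.getLogger("ycat.ingest")

@app.exception_handler(RequestValidationError)
async def drop_malformed(request: Request, exc: RequestValidationError):
    # Nothing is published to Pub/Sub; log (and bump an error-rate metric) so we can page on spikes.
    log.warning("dropping malformed webhook on %s: %s", request.url.path, exc.errors())
    return JSONResponse(status_code=201, content={"status": "ignored"})
```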

Wrap-up

A lightweight, event-driven orchestration system was just the ticket for bridging the latency gap inherent in the ETL/dbt/rETL stack we use for everything that doesn’t need to be processed in near real-time. Classic data pipelines, no-code automation tools, and direct integrations can each help solve these problems, but by combining the best parts of all three, we were able to build a lightweight, reliable system and nail the timeline on our Marketo deprecation project. Consider building a similar system if you’re running into similar problems!

By the way, if you love building reliable and performant systems for processing data — Mixpanel engineering is hiring!

This article was not written by, or with the assistance of, AI tools, and contains no text generated by large language models.
