Live Downsizing Google Cloud Persistent Disks for Fun and Profit

Background

At Mixpanel, we heavily utilize Google Cloud Platform (GCP)’s SSD provisioned persistent disks (PD-SSD) to store the event data that underlies our service. We chose PD-SSD to support our low-latency analytics database, which serves customer queries.

We found one major problem with PD-SSD: the cost. PD-SSD costs roughly 5x as much per GB stored as Google Cloud Storage (GCS). Our infra team designed Mixpanel’s database, Arb, to use both PD-SSD and GCS: it stores recent, frequently changing data on PD-SSD for low-latency reads, and much older, immutable data in GCS for lower cost and acceptable latency.

As all engineers do, we first engineered for correctness and performance and later optimized for cost. PD-SSD initially held a large proportion of our data for performance reasons, accounting for a significant portion of our infrastructure cost. However, given that the vast majority of our data is immutable, we could shift the balance toward GCS to drastically reduce our storage cost.

While read latency was initially a concern (average latency in GCS is worse than in our PD-based solution), we realized that in a system like Arb, where a single query touches hundreds of disks, p99 latency matters more than the average for end-user latency. We confirmed that while GCS was slower on average, its p99 was consistently equal to or better than PD-SSD at our throughput.

After the latency experiment, we shifted data older than 3 months to GCS, leaving our PD-SSD utilization incredibly low. To realize the cost savings, we needed to downsize the disks. Unfortunately, GCP does not support downsizing a PD. This blog post describes how we built our own solution to do just that.

Requirements

  • Live: The most important requirement is not impacting our customers’ experience. PD-SSD downsizing should not cause data delays or query downtime.
  • Idempotent: Repeat downsizing of the same disk should be a no-op.
  • Automated: Downsizing should be triggered by a single command.
  • Revertible: Design should allow rollback to the prior disk in case of error.

Primitives we have

  • Multiple Zones: Arb uses multiple GCP zones for redundancy, with each zone holding an independent copy of the data and its own compute nodes. This design comes with several benefits: it distributes load during peak times, keeps the query and storage logic relatively simple to maintain and improve, and avoids downtime during deployments or canaries. We leveraged this for downsizing.
  • Kafka: Kafka is a key component of our ingestion pipeline. Our ingestion servers push events to Kafka, which are then consumed by Storage Servers and persisted to PD-SSDs. In our configuration, Kafka retains data for 7 days, so even if a Storage Server reverts to a state from six days ago, it can catch up to the most recent data without any data loss.
  • GCP PD Snapshot: GCP provides a feature called Snapshot, which is what it sounds like: it creates a snapshot of a PD. Moreover, GCP supports creating a PD-SSD from an existing snapshot, which essentially clones a PD-SSD to a new PD-SSD (of the original size) with the same data. GCP creates snapshots incrementally based on previous snapshots, so they are storage- and time-efficient.
  • Kubernetes Jobs: A Kubernetes Job creates one or more pods and tracks their status. Once a specified number of pods complete successfully, the Job is considered complete.
  • GCP node pool: A node pool is a group of nodes that share the same configuration: the number of CPUs and GPUs, the amount of memory and storage, and so on.

Our solution

Here is our solution with the given primitives.

Step 1: Create a cloned disk using a snapshot

The first step is cloning a disk, using a snapshot, under a separate pod called Downsizer. The Downsizer pod is created on a separate node with ample resources. This guarantees that the downsizing process has no impact on live Storage Servers, and allocates enough I/O throughput to perform downsizing as quickly as possible. There is a short period (up to 15 min) where reads and writes to the disk are blocked while the snapshot is taken. After the snapshot is created, the Storage Server quickly catches up on its backlog and resumes serving traffic normally. Creating the cloned PD-SSD happens independently in the Downsizer pod, and if something goes wrong, the process is fully revertible using the snapshot.
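Concretely, the clone is just a snapshot followed by a create-from-snapshot. Here is a minimal sketch of that sequence in Python, shelling out to the gcloud CLI; the disk, snapshot, and project names are hypothetical, and attaching/mounting the clone on the Downsizer node is omitted:

```python
import subprocess

def gcloud(*args):
    """Thin wrapper around the gcloud CLI that fails loudly on errors."""
    subprocess.run(["gcloud", "compute", *args], check=True)

def clone_disk(disk: str, zone: str, project: str) -> str:
    snapshot = f"{disk}-downsize-src"   # hypothetical naming scheme
    clone = f"{disk}-clone"

    # Snapshot the live disk. The Storage Server pauses reads/writes to this
    # disk for the duration of the snapshot (up to ~15 minutes in our case).
    gcloud("disks", "snapshot", disk,
           "--zone", zone, "--project", project,
           "--snapshot-names", snapshot)

    # Create a same-sized clone from the snapshot. From here on, the Downsizer
    # pod works only against the clone, so the live disk is unaffected.
    gcloud("disks", "create", clone,
           "--zone", zone, "--project", project,
           "--source-snapshot", snapshot,
           "--type", "pd-ssd")
    return clone
```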

The Downsizer pod is created by a Kubernetes Job. Each PD-SSD downsizing Job is named with the PD-SSD’s unique id, and the Job creates a Downsizer pod with the same name. This Kubernetes configuration provides idempotency, since a Downsizer Job/pod for the same disk cannot be created more than once.
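The idempotency falls out of the Job name. A hedged sketch with the official Kubernetes Python client (the image, namespace, and naming convention here are assumptions, not our exact setup):

```python
from kubernetes import client, config
from kubernetes.client.rest import ApiException

def create_downsizer_job(disk_id: str, namespace: str = "downsizer") -> None:
    """Create a Downsizer Job named after the disk; repeats are no-ops."""
    config.load_kube_config()  # use load_incluster_config() inside the cluster
    batch = client.BatchV1Api()

    job = client.V1Job(
        metadata=client.V1ObjectMeta(name=f"downsizer-{disk_id}"),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[client.V1Container(
                        name="downsizer",
                        image="gcr.io/example/downsizer:latest",  # hypothetical image
                        args=["--disk-id", disk_id],
                    )],
                ),
            ),
        ),
    )
    try:
        batch.create_namespaced_job(namespace, job)
    except ApiException as e:
        if e.status == 409:  # a Job with this name already exists: no-op
            return
        raise
```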

This pod gets assigned to one of the nodes in the Downsizer node pool, with 16 CPUs per pod. This guarantees high PD-SSD performance and lets the downsizing job finish in a timely manner, since disk I/O throughput is roughly proportional to the number of CPUs on an instance. The node pool is created specifically for the downsizing process, provisioned with enough CPU, memory, and the desired number of nodes.

Step 2: Create downsized disk and copy data from cloned disk

We successfully created a clone of the original disk, so it’s time to create a smaller PD-SSD and start copying the data. Because of the amount of data and the throughput limits, this process may take up to 10 hours. It is performed in the independent Downsizer pod while the Storage Server serves traffic as usual, so there is no downtime.

The size of the smaller PD-SSD is determined by the disk usage of the cloned disk plus a utilization factor. We used rsync to perform the data copy and synchronization: rsync is reliable, fast, and checksums data in transit. Note that we did not use the --checksum flag, which performs an additional checksum over existing files and slows the whole process down significantly. After the rsync finished, we unmounted the smaller PD-SSD to guarantee that buffers were flushed.
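A rough sketch of this step, assuming the smaller PD-SSD has already been created (sized from the clone’s usage), formatted, and mounted; the mount points, device path, and 20% headroom factor are illustrative assumptions:

```python
import math
import shutil
import subprocess

def target_size_gb(clone_mount: str, headroom: float = 1.2) -> int:
    """Pick the new disk size from the clone's actual usage plus headroom."""
    used_bytes = shutil.disk_usage(clone_mount).used
    return math.ceil(used_bytes * headroom / 10**9)

def copy_to_smaller_disk(clone_mount: str, small_mount: str, small_device: str) -> None:
    # Archive mode preserves permissions and timestamps; --delete keeps the
    # destination exact on repeated runs. We deliberately skip --checksum:
    # rsync already checksums data in transit, and re-hashing every existing
    # file would slow the whole copy down significantly.
    subprocess.run(
        ["rsync", "-a", "--delete", f"{clone_mount}/", f"{small_mount}/"],
        check=True,
    )
    # Unmounting the smaller disk guarantees buffered writes are flushed.
    subprocess.run(["umount", small_device], check=True)
```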

Step 3: Delete the cloned disk and repeat steps 1 and 2

During step 2, the Storage Server keeps serving data as usual, which means data ingested during the rsync is still being added to the original disk. That added data becomes a backlog the downsized disk needs to catch up on.

When there is a backlog for a disk, Mixpanel’s database ensures that reads for data on that disk go to the other zone. Once the backlog drops below a watermark, the disk becomes readable again. Thus a higher backlog leads to longer query downtime for a disk in one zone.

To minimize the backlog, we decided to perform step 1 again and run a second rsync to the disk created in step 2. The advantage comes from the fact that we have already copied most of the data to the downsized disk: rsync compares the metadata of all files and copies only the difference. Transferring data from one PD-SSD to another on a dedicated node with ample resources is much faster than going through the ingestion pipeline and the Storage Server, which may share its I/O throughput across multiple PD-SSDs. This second rsync pass takes only about 10 to 20 minutes, and the additional time is well worth it.

Step 4: Create snapshot from the downsized disk and restore the original disk

Mixpanel infra engineers have long leveraged GCP snapshots as periodic backups. We also take a snapshot before performing any disk-related engineering work, in case something goes wrong. We had already implemented the capability to restore a Storage Server disk from a snapshot, and we reuse it for Downsizer.

After the second rsync pass finishes, Downsizer takes a snapshot of the downsized disk. This downsized snapshot contains the complete data and the disk size information. When the restore from the snapshot is triggered, the Storage Server stops reads and writes to the original disk, deletes it, and creates a smaller disk from the downsized snapshot.
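The restore sequence, sketched with the same kind of gcloud wrapper as in Step 1 (names are hypothetical; detaching the old disk, attaching and mounting the new one, and the Storage Server’s own stop/start of reads and writes are elided):

```python
import subprocess

def gcloud(*args):
    subprocess.run(["gcloud", "compute", *args], check=True)

def restore_from_downsized(original_disk: str, downsized_disk: str,
                           zone: str, project: str) -> None:
    final_snapshot = f"{original_disk}-downsized"   # hypothetical name

    # 1. Snapshot the downsized disk; the snapshot carries both the complete
    #    data and the new, smaller disk size.
    gcloud("disks", "snapshot", downsized_disk,
           "--zone", zone, "--project", project,
           "--snapshot-names", final_snapshot)

    # 2. With reads and writes stopped, delete the original oversized disk.
    gcloud("disks", "delete", original_disk,
           "--zone", zone, "--project", project, "--quiet")

    # 3. Recreate the disk under its original name from the downsized snapshot.
    gcloud("disks", "create", original_disk,
           "--zone", zone, "--project", project,
           "--source-snapshot", final_snapshot,
           "--type", "pd-ssd")
```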

Step 5: Wait for the backlog to catch up and enable reads

Now we have the downsized disk mounted on a Storage Server, but we are not quite done yet. Even after the second rsync, there is still a 20-30 minute gap between the time the second snapshot is taken and the time the restore from the downsized snapshot finishes. If we served queries from this disk right away, we might return stale data, which we need to avoid. Downsizer is responsible for blocking reads to the disk, monitoring the size of its backlog, and enabling reads only when the disk has caught up to the most recent data. This step usually takes less than 10 minutes.
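The gate itself is a simple polling loop. In this sketch, get_backlog_seconds and set_reads_enabled stand in for our internal Storage Server control APIs, and the watermark and poll interval are arbitrary examples:

```python
import time

BACKLOG_WATERMARK_SECONDS = 60   # example watermark, not our production value
POLL_INTERVAL_SECONDS = 30

def enable_reads_when_caught_up(disk_id, get_backlog_seconds, set_reads_enabled):
    """Keep reads blocked until the disk's backlog drops below the watermark."""
    set_reads_enabled(disk_id, False)
    while get_backlog_seconds(disk_id) >= BACKLOG_WATERMARK_SECONDS:
        time.sleep(POLL_INTERVAL_SECONDS)
    set_reads_enabled(disk_id, True)
```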

Operationalizing it

Automation

The previous steps show the process of downsizing one disk, but we have hundreds of disks. It is not scalable to run Downsizer manually, one at a time. We need a tool to automate downsizing for many disks.

  • One zone at a time: From the downsizing description above, some steps require unavoidable downtime: taking snapshots, restoring from a snapshot, and waiting for the backlog. If we restrict the downsizing process to one zone at a time, the other zone serves traffic normally, so Downsizer causes no downtime.
  • 10% of the disks in one zone at a time: Conservatively, stopping about 10% of the disks in one zone has shown minimal impact on our customers’ experience.

Based on this, we built a tool that performs downsizing for one zone:

  1. Get a list of all disks in one zone
  2. Check the number of Downsizer Kubernetes jobs that are completed and running.
  3. If the number of currently running jobs is lower than 10% of the total number of disks, create more jobs to meet 10%
  4. Repeat steps 2 and 3 until all of the disks are finished

This approach is even more conservative than it sounds: it is highly unlikely that the downtime windows of the running jobs all overlap, so well under 10% of the disks in a zone are actually unavailable at any given moment.
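A condensed sketch of that loop, again with the Kubernetes Python client; list_disks_in_zone stands in for an internal helper, create_downsizer_job is the Job-creation sketch from Step 1, and the namespace and poll interval are assumptions:

```python
import time
from kubernetes import client, config

MAX_CONCURRENT_FRACTION = 0.10
POLL_INTERVAL_SECONDS = 300

def downsize_zone(zone, list_disks_in_zone, create_downsizer_job,
                  namespace="downsizer"):
    config.load_kube_config()
    batch = client.BatchV1Api()

    pending = list(list_disks_in_zone(zone))          # 1. all disks in the zone
    budget = max(1, int(len(pending) * MAX_CONCURRENT_FRACTION))

    while pending:
        jobs = batch.list_namespaced_job(namespace).items   # 2. count running jobs
        running = sum(1 for j in jobs if (j.status.active or 0) > 0)

        while running < budget and pending:           # 3. top up to the 10% budget
            create_downsizer_job(pending.pop(0))
            running += 1

        time.sleep(POLL_INTERVAL_SECONDS)             # 4. repeat until every disk
                                                      #    has a job (final wait for
                                                      #    completion omitted)
```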

Coordinating with others

The last problem left to solve was how to run Downsizer with minimal impact on the productivity of other engineers. At Mixpanel, a deployment and/or canary happens almost every day, and running Downsizer means blocking concurrent Query Server and Storage Server deployments. We use a lightweight deployment lock system that lets engineers take “locks” on resources, preventing concurrent deploys.

We initially estimated ten hours per disk for downsizing, and we have about 300 disks per zone. Running 10% of the disks at a time, it would take about four full days to complete. Even counting the weekend, we would need to block deployments for two work days, and that was not acceptable.

We could have just run it over weekends, but that was not acceptable at the beginning either: we didn’t know whether downsizing would reveal unforeseen side effects, and we wanted to observe and catch them early, before downsizing all of the disks.

The solution was rather simple: acquire the deployment lock at the end of the work day and start Downsizer; stop Downsizer around midnight; and release the lock in the morning after all running Downsizer jobs have finished. Of course, communicating with other engineers (especially those holding the pager) was a must. In this way, we started off with a small number of disks and watched for any changes in production behavior. We found that it takes about four hours on average to downsize a disk, because we had less data per disk than we originally planned for. After a couple of nights of downsizing with no side effects observed, we kicked off full downsizing for one whole zone over a weekend. Once we knew it took less than two full days with no side effects, we confidently ran the other zone over a weekend several weeks later.

Results

After the whole process was finished, we had reduced Arb’s PD-SSD usage from 1.05PB to 390TB. That’s about a $112k-per-month cost reduction before accounting for the increased GCS cost. GCS costs increased by roughly $20k per month, including storage and operations. The overall cost reduction is therefore about $90k per month, or roughly $1M per year!

Takeaways

There are several things that we learned from this project:

  • Kubernetes and GCP: Idempotency could have been tricky to achieve. With Kubernetes Jobs, however, we just needed to configure the Jobs correctly. Snapshots and node pools were very handy as well. Understanding the primitives a platform provides is key to building tools efficiently.
  • rsync and buffer flush: rsync is a powerful and efficient utility. We tested parallelizing rsync with parallel and found no improvement. The second-pass sync would have been hard without rsync. Unmounting the disk was how we guaranteed that buffers were flushed.
  • 2nd pass sync for minimal backlog: When we first implemented Downsizer without the second-pass sync, we set the timeout for backlog catch-up to several hours, proportional to the time it took to rsync data from the cloned disk to the downsized disk. Depending on a Storage Server’s available write throughput, catching up could take the entire time allotted, and during that time reads are blocked. The second-pass sync solved this problem and minimized query downtime.
  • Back-of-envelope calculation: We planned out each step with back-of-envelope calculations first, then implemented and executed. After execution, we checked whether the results matched our expectations; when they didn’t, we investigated the discrepancies. This way, we predicted the ops impact accurately and were able to execute the downsizing process with minimal pain.
  • Communication: We looped in service owners at the design and planning stage. We learned about all the available in-house utilities we could use, and the service owners understood what we were planning to do and gave us valuable advice. This communication continued throughout the project and was a key factor in finishing it efficiently, with minimal trial-and-error and side effects.

Special thanks to Yu Chen who gave his blood, sweat, and tears with me to this project.

If you enjoyed this article and are interested in working on high-performance distributed systems at Mixpanel, we are hiring!

Making Web Components Work

or: How We Learned to Stop Worrying and Love the DOM

 

Clean, attractive user interfaces and effective user experience have always been pillars of Mixpanel’s products. Over the years, as our data visualization UIs have introduced richer interactions and more advanced capabilities, a central concern of ours has been managing ever-increasing front-end complexity, driving us to build and experiment with approaches that simplify development and enable more powerful results. While the front-end world at large has gone through waves of framework churn and the accompanying fatigue of “Rewriting Your Front End Every Six Weeks”, this burst of ecosystem activity has also produced some great ideas and productivity gains. A recurring theme which has emerged and guided Mixpanel’s UI work is the strength of the “component” concept. Many of the successful JavaScript frameworks and libraries of recent years – React, Angular, Polymer, Vue, etc. – organize code and conceptual models to reflect the tree hierarchy of the rendered DOM, in such a way that complex UIs emerge from the composition of smaller elements which can render themselves and act semi-independently.

Developing quietly for years in the background of the JS wars, the set of Web Components standards has always promised something that no 3rd-party framework can offer: a suite of native in-browser technologies for creating and managing encapsulated UI components, leveraging well-known existing DOM/HTML APIs and open standards. Back in 2015, our front-end team started exploring the possibilities of Web Components – specifically Custom Elements and Shadow DOM – for building new features and gradually unifying our suite of legacy UIs. Since then, this has grown into our standard toolset for building UIs, both for greenfield projects and for introducing incremental updates to older features: the basis of new products like Insights, JQL Console, and Signal, as well as our expanding standardized component library. Using Web Components as a cornerstone of complex productionized UIs, however, has required development of tooling and responses to issues and gaps in the basic technologies: standardizing the rendering cycle, composing and communicating between components effectively, understanding which features can be polyfilled reliably on older browsers, running component code in server-side environments, etc. The following discussion aims to describe our choices and approaches, particularly the features of our little open-source library Panel which marries Web Components technologies to a state-based Virtual DOM renderer, effectively extending the basic standard to facilitate composing full, powerful UIs easily.

What if your app were just a DOM element?

The fundamental unit of Web Components is the good old HTMLElement, which your code extends by implementing methods to run when lifecycle events occur: an instance of your custom element is created, it is added to the DOM, its HTML attributes change, etc. We will explore the power of this approach in the following discussion with the help of a small interactive demo, the “Panel Farm” running below:

The demo is also available at https://mixpanel.github.io/panel-farm/ (with code at https://github.com/mixpanel/panel-farm). This toy project includes building blocks of more advanced usage: component nesting and intercommunication, build system, client-side routing, shadow DOM, animations, etc. Check out the demo and try inspecting the DOM with your browser’s developer tools. You’ll notice some HTML elements with custom tag names:

The <panel-farm> element at the top level is not just a rendered result of running the app code; it is the app, accessible in the JavaScript DOM API as an HTMLElement with all the methods and accessors available to normal DOM elements, as well as some new methods. Try calling document.querySelector(`panel-farm`).update({welcomeText: `meow!`}) in the JS console and watch the DOM update automatically on the Welcome page. Via the standard built-in browser dev tools, you can inspect the current app state, find HTML elements it’s rendered, enumerate its DOM children or its subcomponents, and perform live manipulations. Modern browser tools offer powerful debugging environments for Web Components, by virtue of their nature as HTML elements.

(NB: For an even more seamless in-browser development and debugging experience, the Panel State Chrome extension by Noj Vek adds a dev tools tab to the Elements explorer for inspection and manipulation of state entries.)

Custom Elements of various other kinds can already be found “in the wild,” whether for example in GitHub’s subtle <time-ago> component that displays relative times (in use on github.com since at least 2014, as seen in this interview), or in the more recent 2017 rewrite of YouTube’s UI (based on Google’s Polymer framework, as noted in their blog post on the launch).

Still, despite some good company in using Web Components, our choice in 2015 of embracing the standard was admittedly unusual, betting on an under-development built-in browser technology as opposed to simply picking up one of the more ready-made popular JS libraries like React or Angular (although back when we were exploring these options, the front-end dev world was much less crystallized into these few options, and the now-popular Vue had nowhere near its current traction). It was clear at the time that the component-based approaches of all these libraries offered a great central concept for hierarchical UI code, and the popularization of “Virtual DOM” and DOM-diffing provided well-supported practical implementations of powerfully simple rendering APIs. Less widely-used and experimental frameworks, such as Mercury, Cycle, and Ractive, demonstrated that there was space for further exploration into “reactive” DOM templating (where the UI updates automatically to reflect the current state of a data store). Adopting a similar Virtual-DOM/state-based approach allowed us, with quite minimal code, to standardize our workflows for view templating, DOM update management, animation, component composition, and data flow management (in particular, making it easy to nest and communicate between components without a rat’s nest of event listeners); in other words, to give Web Components just the boost they need to work well for advanced UI development.

How it works

The Panel library is available under the open-source MIT license, with source code available at https://github.com/mixpanel/panel and package installation via NPM at https://www.npmjs.com/package/panel. API documentation lives at http://mixpanel.github.io/panel/. The description from the repo’s Readme offers a good distillation of the project’s goals and approach:

Panel makes Web Components suitable for constructing full web UIs, not just low-level building blocks. It does so by providing an easy-to-use state management and rendering layer built on Virtual DOM (the basis of the core rendering technology of React). Through use of the Snabbdom Virtual DOM library and first-class support for multiple templating formats, Panel offers simple yet powerful APIs for rendering, animation, styling, and DOM lifecycle.

The basic usage is straightforward and familiar from numerous Virtual DOM UI tools. A component is a JS object which renders part of the UI, maintaining an internal state object which is fed to the view template; calls to the component’s update() method apply changes to the state and trigger a re-render of any parts of the DOM which change as a result. Component lifecycle, on the other hand (element creation, DOM entry/exit, etc), is managed directly through the Custom Elements API (hooks such as connectedCallback() and attributeChangedCallback()). Probably the most important aspect of the API design is the decision to maintain the “vanilla” Web Components APIs as far as possible, rather than wrapping them in higher-level abstractions. Developers using Panel can rely on quality external references such as MDN’s web docs and Eric Bidelman’s excellent overviews (e.g., “Shadow DOM v1”) to understand standard patterns and usage; and this knowledge is transferable to other environments that use Web Components.

To call Panel a “framework” would be a stretch – it’s really more of a minimal glue layer between the Web Components API and the Virtual DOM rendering engine provided by Snabbdom, with just enough built-ins to address the pain points that we’ve confronted in our production apps. The core library code runs to a few hundred lines, much of which is comments and documentation for public methods. Apart from the Component/View layer which translates state into rendered DOM, a simple built-in Router handles syncing the URL/History API and the app’s state. The intention was to keep the library code lightweight and easily understood, without sacrificing the power of the core reactive rendering flow.

There is no baked-in model layer or data-/state-management framework. External libraries such as Redux and RxJS can plug in seamlessly to the view layer offered by Panel, and an optional Panel “State Controller” offers a lightweight mechanism for managing state separately from Component internals without bringing in further dependencies. Anything which can send state updates by calling update() with a JS state object will work with Panel (see the example at https://github.com/mixpanel/panel/tree/master/examples/redux). Similarly, a more traditional MVC Model layer such as Backbone.Model can work, by sending Component updates in response to model events, e.g.,  myModel.on(`change`, () => myApp.update({field: `new content`})). In Mixpanel’s newer apps, depending on complexity, we tend to avoid event-flow and model libraries, finding a sufficient solution in Plain Old JavaScript Objects representing state, supplemented occasionally with ES2015 Classes for more involved model-layer code.

The following brief case studies introduce some of the other significant features of Panel and Web Components as tools for flexible, full-featured front-end development.

Your widget is an app, your app is a widget

There is no formal distinction between a simple component and an “application.” In the Panel Farm app, the <animal-badge>, which displays a picture of a cute animal in a circle frame, is a completely standalone component. It has an HTML attribute animal which determines which picture it shows, and it can be embedded anywhere simply by inserting it into the DOM.

<animal-badge animal="husky"> “Woof!” (This is a running version of the <animal-badge> element embedded in the page. Try inspecting it with browser dev tools and changing its animal attribute to “doge” or “llama” or…)

The  <panel-farm> “application” is composed of various such components and standard DOM elements, but conceptually it too is still just a Component, with nested child Components. Its main DOM template looks something like this (in Pug/Jade notation; see below on templating):

In the example above, since the  <animal-badge> element is a standalone Custom Element, its implementation doesn’t matter to the main app. It could be a Panel component, it could be a vanilla Web Component, or any other type of custom HTML element; it is simply inserted into the DOM and acts independently of the  <panel-farm> instance. The insertion of  <view-welcome> and  <view-farm> via the  child() method, however, explicitly links these elements to the <panel-farm> instance:

<panel-farm> and  <view-welcome> and  <view-farm> literally share a single state object. A call to  update() on any of these elements will result in all of them being updated if necessary. The various  <animal-badge>s, on the other hand, are Panel components which could maintain their own internal state and do not have access to the state of  <panel-farm>. This flexibility allows powerful combinations of self-similar Panel components, which can act in concert via the straightforward shared state mechanism, while still facilitating integration with 3rd-party components through their public APIs such as DOM events and HTML attribute listeners. In practice, state-sharing is useful for subdividing applications into linked components where updates to the central store cascade automatically (no need for swarms of event listeners and data flow logic), whereas standalone components work well for reusable UI building blocks with clear, limited APIs (and there are other options available to limit the state shared between linked components). This is how independent components from Mixpanel’s UI toolkit such as  <mp-dropdown> and  <mp-toggle> are gradually becoming integrated into parts of our front end written 5 years ago as well as last week.

Imperative and/or declarative

As Web Components, Panel components and apps can easily offer both declarative and imperative APIs. For instance, to mirror the type of imperative API favored by jQuery plugins, the <animal-badge> component could offer a public method that changes the picture it displays:

In this case, calling  setAnimal(`raccoon`) on an instance would render the template with updated state. The declarative alternative used in the Panel Farm code has the component read from its HTML attribute animal and update itself whenever its value changes, using the Custom Elements observedAttributes and attributeChangedCallback:

The declarative option is particularly suited to using components within Virtual DOM environments, where declaring the expected state of the DOM is the natural mechanism, rather than calling methods to manipulate the DOM imperatively.

Templates and functions

The <panel-farm> top-level template example in a previous section uses the dedicated templating language Pug (formerly Jade):

This is the notation we use in Mixpanel’s apps for convenience, but it is largely syntactic sugar for the construction of template functions. The same template can be expressed as a pure inline JS function:

This takes in the component’s state object as input and returns as output a Virtual DOM tree (constructed using the dialect of Hyperscript notation used by Snabbdom). For the conversion from Jade to JS, we use the virtual-jade library and simply import runnable template functions:

But at the end of the day, any format which can convert to (Snabbdom-compatible) Hyperscript can work seamlessly here, including Facebook’s famously divisive JSX format (see the example in the Panel repo):

Light and shadow

The question of component styling and CSS scoping has received two recent innovative responses, in the divergent approaches favored by Web Components (the Shadow DOM spec) and by Virtual DOM-based systems (inline styling via “CSS in JS”). Panel apps can benefit from both approaches – even mixing if necessary – facilitating the appropriate method for different contexts and workflows.

A Shadow DOM approach allows you to retain the power of traditional CSS with respect to cascading styles, inheritance, and notation, while keeping styles isolated to your component tree:

In this usage, the styling of elements within a component is managed largely in the “traditional” CSS manner, through the presence or absence of CSS classes and other selectors (and classes can be manipulated deftly through the object notation common to Jade and Snabbdom, e.g.,  {cool: true} to add or maintain the class cool on an element).

It is possible, however, to let the Virtual DOM renderer manage style properties itself, bypassing traditional stylesheets altogether, as the Panel Farm app does at one spot in the main template by setting a style object:

To see the effect of managing style this way, try running  document.querySelector(`panel-farm`).update({backgroundAnimalStyle: {top: `3px`, left: `10px`}}) in the JS console and watch the doge move to the other side of the viewport.

Both systems provide methods of scoping style rules to individual components without the problems of global selectors, and in Panel apps they can live side-by-side as necessary – the fine-grained declarative control of CSS-in-JS complementing the traditional cascading rulesets of Shadow DOM stylesheets. In practice, at Mixpanel we use CSS-in-JS techniques sparingly (for the exceptional cases which require true dynamic calculation in JS), sticking mostly to traditional global stylesheets for full application context (compiled from Stylus to CSS), and Shadow DOM scoped CSS (again compiled from Stylus) for generic UI components used across the product (with some caveats discussed below).

Bump and slide

Highly declarative UI models have always had some difficulty with animation: it’s easy to declare “this is what the DOM should look like right now,” but more difficult to notate transitions between different states cleanly. CSS transitions provide a relatively straightforward model for some situations and can be coupled to selector changes easily, e.g., “elements with class animal-badge have opacity: 1 by default, but when they have the class inout (entering or exiting) they have opacity: 0 and opacities transition between each other for 250ms.” These transitions work well with Virtual DOM systems, which can manage class and style changes seamlessly, but we run into trouble when trying to animate the main lifecycle events, elements being newly created or deleted. For these cases, some of the solutions suggested for Virtual DOM libraries can be pretty heavyweight and domain-specific (see for instance the discussion in https://github.com/Matt-Esch/virtual-dom/issues/112). It is largely due to Snabbdom’s simple, pragmatic support for element lifecycle hooks that we use it as the rendering engine for Panel, together with a simple class module extension that adds support for manipulating classes when adding and removing elements. These basic tools, for instance, allow the <view-farm> template to animate the removal and addition of <animal-badge>s by applying the inout class only when an element is transitioning in or out of the DOM:

Although complex animations that require JS calculations and multiple stages still need state management based on their specific context, the basic cases of managing transitions/animations on entry/exit and class changes represent the vast majority of situations we need for our UIs. Being able to produce these in a simple declarative fashion is a win.

It’s not all roses

Of course, there are still plenty of bumps and warts in the Panel/Web Components environment, and open questions which we continue to explore and debate.

The browser compatibility story is delicate

Although it seems like every year someone predicts that this will be the year Web Components go big (“#shadowdom2016”, alas…), and the promise of a natively-supported, cross-browser componentization standard is an attractive prospect, the real world isn’t quite there yet. At the time of writing Chrome, Opera, and Safari have released native implementations of Custom Elements and Shadow DOM, with Firefox working on v1 API implementations (as of May 2018 Shadow DOM has been enabled in the Firefox Nightly build, and according to docs on MDN, both Custom Elements and Shadow DOM are “expected to ship in Firefox in 2018“); of the major browsers only Edge has not yet begun implementation work, and Shadow DOM and Custom Elements remain its most requested features (with “High” and “Medium” roadmap priority, respectively). So in order to work with the current versions of Firefox and Edge, we need to ship polyfills along with our production code. The suite of webcomponents.js polyfills from Google’s Polymer team is a marvelous piece of work and a wonderful gift to the open-source world – without the polyfills, using Web Components in customer-facing production environments would be a total non-starter – but there are many edge cases around DOM manipulation and it is impossible to replicate the behavior of native implementations exactly, particularly the style encapsulation of Shadow DOM. There were enough limitations/performance issues of the old Shadow DOM v0 polyfill and the newer ShadyCSS that we have needed to stick to scoping Shadow DOM CSS with specific classes until all our supported environments have Shadow DOM implementations; the Stylus prefix-classes built-in eases the pain considerably, but it is still a far cry from the real encapsulation of native Shadow DOM.

Custom Elements are global

Once you register an element definition with customElements.define(`my-widget`, myWidgetClass), every <my-widget> that pops up in your HTML uses the code that you initially passed. For most environments and workflows this is fine, but it does prevent multiple versions of a component from appearing in the same page with the same tag name. This limitation has affected us in cases where multiple scripts on the same page wanted to register the same components, but at the end of the day these are edge cases and it’s an ill-advised approach. Questions about how to package and export components remain unresolved, for instance whether a module should just export a component definition Class, or whether it makes sense for the module also to add the component to the global customElements registry.

Testing

Testing can require some involved infrastructure, because of the tight integration of components with browser APIs. The wct (Web Component Tester) tool, again from the Polymer team, provides a great solution for browser tests, integrating seamlessly with Sauce Labs to facilitate cross-browser testing in CI environments. Individual functions can be extracted from components for quicker/simpler unit tests; we do a fair amount of this with Mocha in a Node.js environment. But creating fast, simple, entirely deterministic tests for the behavioral logic of components – how components and apps transition between different states – has no one simple solution. State logic can be extracted to a StateController or Redux at the expense of extra layers of abstraction; Panel also provides a server-side environment which can load components and run their code without the overhead of loading a browser. The balance of different styles of tests and an agreed overall philosophy of UI testing are issues which we’re still pinning down.

At the end of the day, despite the problematic aspects, it’s become abundantly clear over several years of building on Web Components at Mixpanel that they are absolutely viable for real-world, productionized front-end work. Once Firefox and Edge finish their implementations of v1 Custom Elements and Shadow DOM, we’ll have a truly cross-browser, native, powerful API supercharging the DOM for the needs of modern web applications. Being able to work with the DOM API directly and browsers’ built-in development tools comes with distinct advantages, and helps replace the cognitive load of framework specifics with standardized techniques and tooling (HTML element attributes/properties, encapsulated styling via CSS, etc.). The occasionally-advanced idea that Web Components can spell the end of JS frameworks may be rather exaggerated – complex applications need much more management than just component encapsulation and lifecycle, and we built Panel to fill in some of the missing pieces of the Web Components environment around rendering, communication, and state management – but they do represent an important step forward for dynamic web UIs. Easy interoperability between disparate frameworks, a standardized API for componentization, simpler and more lightweight client-side code: these developments are not to be taken lightly, as elements of the frenetic JS library world begin to migrate to the more stable, long-term view from the browser-dev side. It’s early days yet, but Web Components open an exciting avenue forward for browser UI development, and it feels great to take steps toward that brighter future.

Building a (not so simple) expression language part II: Scope

(This is part II of a two part series of posts, you can find part I here)

One of the most powerful parts of the Mixpanel query language is the any operator, which allows you to select events or profiles based on the value of any element in a list. The any operator is just a bit more magical than the other operators in our query language, both in its power and in its implementation.

We’ve already written about building the Mixpanel expression language – the language we built inside of the Mixpanel data store to allow you to query and select data for reports. The model we built in the last post can do a lot of work, but parsing and interpreting the any  query takes the language to another level, both metaphorically and syntactically.

Like the basic expression language post, we’ll be using Python and JSON to talk about procedures and data, but won’t assume you’re a serious Pythonista. It will also be worth taking another look at the simple expression language post, since this post elaborates that model.

Continue reading

Straightening our Backbone: A lesson in event-driven UI development

Mixpanel’s web UI is built out of small pieces. Our Unix-inspired development philosophy favors the integration of lightweight, independent apps and components instead of the monolithic mega-app approach still common in web development. Explicit rather than implicit, direct rather than abstract, simple rather than magical: with these in-house programming ideals, it’s little surprise that we continue to build Single-Page Applications (SPAs) with Backbone.js, the no-nonsense progenitor of many heavier, more opinionated frameworks of recent years.

On an architectural level, the choice to use Backbone encourages classic Model-View designs in which control flow and communication between UI components is channeled through events, without the more opaque declarative abstraction layers of frameworks such as Angular. Backbone’s greatest strengths, however – its simplicity and flexibility – are a double-edged sword: without dictating One True Way to architect an application, the library leaves developers to find their own path. Common patterns and best practices, such as wiring up Views to listen for change events on their Models and re-render themselves, remain closer to suggestions than standard practices, and Backbone apps can descend into anarchy when they grow in scope without careful design decisions.

Continue reading

Diagnosing networking issues in the Linux Kernel

A few weeks ago we started noticing a dramatic change in the pattern of network traffic hitting our tracking API servers in Washington DC. From a fairly stable daily pattern, we started seeing spikes of 300-400 Mbps, but our rate of legitimate traffic (events and people updates) was unchanged.

Suddenly our network traffic started spiking like crazy.

Pinning down the source of this spurious traffic was a top priority, as some of these spikes were triggering our upstream routers into a DDoS mitigation mode, where traffic was being throttled.

Continue reading

Building data products that people will actually use

This was originally posted on High Scalability.

Building data products is not easy.

Many people are uncomfortable with numbers, and even more don’t really understand statistics. It’s very, very easy to overwhelm people with numbers, charts, and tables – and yet numbers are more important than ever. The trend toward running companies in a data-driven way is only growing…which means more programmers will be spending time building data products. These might be internal reporting tools (like the dashboards that your CEO will use to run the company) or, like Mixpanel, you might be building external-facing data analysis products for your customers.

Either way, the question is: how do you build usable interfaces to data that still give deep insights?

We’ve spent the last 6 years at Mixpanel working on this problem. In that time, we’ve come up with a few simple rules that apply to almost everyone:

  1. Help your users understand and trust the data they are looking at
  2. Strike the right balance between ease and power
  3. Support rapid iteration & quick feedback loops

Continue reading

Feb 2015 Mixpanel C++ meetup: Fun with Lambdas (Effective Modern C++ chapter 6)

We’ve been hosting a series of monthly meetups on C++ programming topics. The theme of the series is a chapter-by-chapter reading of Scott Meyers’ new book, “Effective Modern C++”.

The meetings so far have been

  1. December: Arthur O’Dwyer on “C++11’s New Pointer Types” (EMC++ chapter 4)
  2. January: Jon Kalb on “Rvalue References, Move Semantics, and Perfect Forwarding” (EMC++ chapter 5)
  3. February: Sumant Tambe on “Fun with Lambdas” (EMC++ chapter 6)

Next up, we’ll be continuing chapter 6 with a presentation on “Generic Lambdas from Scratch”. Come by the office and check it out!

Building a simple expression language

(This is part one of a two part series, you can find part II here)

The Mixpanel reporting API is built around a custom expression language that customers (and our main reporting application) can use to slice and dice their data. The expression language is a simple tool that allows you to ask powerful and complex questions and quickly get the answers you need.

The actual Mixpanel expression engine is part of a complex, heavily optimized C program, but the core principles are simple. I’d like to build a model of how the expression engine works, in part to illustrate how simple those core principles are, and in part to use for exploring how some of the optimizations work.

This post will use a lot of Python to express common ideas about data and programs. Familiarity with Python should not be required to enjoy and learn from the text, but familiarity with a programming language that has string-keyed hash tables, maps, or dictionaries, or familiarity with the JSON data model will help a lot.
Continue reading

Queuing and Batching on the Client and the Server

We recommend setting up work queues and batching messages to our customers as an approach for scaling server-side Mixpanel implementations upward, but we use the same approach under the hood in our Android client library to scale downward to fit the constraints (battery power and CPU) of a mobile phone.

The basic technique, where work to be done is discovered in one part of your application and then stored to be executed in another, is simple but broadly useful, both for scaling up in your big server farm and for scaling down on your customers’ smartphones.

Continue reading

Debugging MySQL performance at scale

On Monday we shipped distinct_id aliasing, a service that makes it possible for our customers to link multiple unique identifiers to the same person. It’s running smoothly now, but we ran into some interesting performance problems during development. I’ve been fairly liberal with my keywords; hopefully this will show up in Google if you encounter the same problem.

The operation we’re doing is conceptually simple: for each event we receive, we make a single MySQL SELECT query to see if the distinct_id is an alias for another ID. If it is, we replace it. This means we get the benefits of multiple IDs without having to change our sharding scheme or move data between machines.

A single SELECT would not normally be a big deal – but we’re doing a lot more of them than most people. Combined, our customers have many millions of end users, and they send Mixpanel events whenever those users do stuff. We did a little back-of-the-envelope math and determined that we would have to handle at least 50,000 queries per second right out of the gate.
Continue reading