Database File Format Optimization: Per Column Dictionary

John Mikhail
Mixpanel Engineering
4 min readOct 3, 2023

--

At Mixpanel, our proprietary multi-tenant columnar store database, known as ARB, serves as the backbone of our analytics platform. Following the ingestion of customer events, these data sets undergo a compaction process to create columnar files, which are then uploaded to Google Cloud Storage (GCS). In this blog post, we will delve into the recent updates we’ve implemented in this columnar file format. These updates enhance our query servers’ ability to accommodate more ARB files in their local SSDs, all without actually reducing file sizes. In some instances, these changes may even result in larger files.

Background

Before we delve into the details of our format update, it’s essential to introduce two common columnar database practices we employ in ARB.

1. Strings Dictionary: Consider a string column with low cardinality, such as the iPhone Model from our trends dashboard. Repeatedly storing strings like “iPhone 14 max” in the column’s data can be highly inefficient. Instead, we optimize this by compiling frequently recurring strings into a dictionary within the file’s header. This allows us to utilize dictionary indices within the column’s data, reducing redundancy.

2. Selective Column Retrieval: When a query involves only specific columns, our query server employs a selective download approach. It starts by fetching the file header, which contains crucial information such as the strings dictionary, column offsets, and other essential file metadata. Subsequently, using these offsets, the server requests only specific data ranges from the remainder of the file corresponding to the required columns. This process results in a sparse file stored on local SSDs. These SSDs serve as a caching layer between the query server and GCS, minimizing or even eliminating file download times for future queries.

To illustrate, let’s consider an example: a file with a header size of 5% and five columns of equally distributed sizes (although this is rarely the case in practice due to factors like repeat encoding). When downloading only a single column, just 24% of the file needs to be retrieved.

Format Change

In our format change, our primary objective was to reduce the size of sparse files on disk, thereby increasing the quantity of files we can store on local SSDs and improving the file cache hit ratio. The concept was straightforward: we recognized that the strings dictionary occupied a significant portion of the header space and a relatively non-negligible portion of the overall file (~6%). We pondered the possibility of redistributing this dictionary across the columns.

To ensure that this alteration would yield benefits without introducing regressions, we needed to validate our theory that there was minimal overlap between the strings used in different columns. Otherwise, this change might result in significantly larger file sizes as the same strings would exist in multiple dictionaries instead of one. The validation process was simple; we introduced a new metric comparing the number of distinct strings per file to the sum of distinct strings across all columns per file and compared the two. As expected, given the nature of our customers’ data, the numbers were nearly identical, with the latter being only slightly larger at times.

Building on the previous example, the updated file structure after moving the global strings dictionary from the file header to the columns headers now appears as follows: each of the five columns now occupies 19.9% of the file space, including an additional 0.9% that previously belonged to the header. When downloading only a single column, only 20.4% of the file is required. Compared to the old format, this version results in sparse files that are 15% smaller.

Bonus Performance gain

The adoption of significantly smaller dictionaries per column translates into improved CPU cache performance, as it ensures that all the necessary strings reside contiguously in memory. Previously, with one global dictionary, the strings for a single column could have so many interleaving strings that are not of interest to the query. It’s worth noting that we directly access the on-disk dictionary using mmap, without undergoing any transformation into an alternative in-memory structure.

Results

Since deploying the new format four months ago, we observed a notable decrease in our average file loaded percentage, which declined from 26% to 21%. Effectively allowing us to store 15% more files in our SSD cache. We anticipate that this figure will continue to get better. This is because we opted not to retroactively convert the old files to the new format, and a substantial portion of the files stored in the SSD cache pertain to data that is older than six months. (Our customers frequently query data spanning from recent months to several years in the past.)

In conclusion, our recent format enhancements represent a significant step forward in optimizing the performance and efficiency of Mixpanel’s ARB database. These improvements not only reduce storage overhead but also enhance query server performance, leading to a more seamless and responsive user experience. We are committed to continually refining our technology to better serve our customers’ evolving data analytics needs.

--

--