How we cut BigQuery costs 80% by hunting down costly queries

Illirik Smirnov
Mixpanel Engineering
13 min read · Jul 12, 2023


Over the past few years, we at Mixpanel have been working to modernize and streamline our internal data infrastructure. No, not our customer data infrastructure — we’ve been modernizing and streamlining that for over a decade now — but our internal data infra, which supports every facet of our business except helping our customers answer their questions. We’ve adopted the modern data stack, with our tools connecting to and running queries within BigQuery.

As we’ve ramped up our data infrastructure — gathering more data, building more models, and integrating more services with our data warehouse — our BigQuery usage ramped up as well. Our internal data spend is still a small fraction of our overall GCP bill, but when we broke down our spend by project, we were surprised that our relatively small team was incurring five figures a month in charges, under a line item simply titled Analysis.

Unfortunately, the GCP billing console doesn’t provide cost breakdowns at a finer granularity than SKU. You have to analyze those costs yourself — so we did. With just a few days of work, we cut our spend ~80% by:

  • Using BigQuery’s INFORMATION_SCHEMA.JOBS view to identify the service accounts and users running the most intensive (and thus most expensive) queries
  • Building interactive reporting using the queries against INFORMATION_SCHEMA.JOBS to monitor our ongoing spend
  • Identifying the most costly queries and optimizing them to save bytes processed, and thus money
  • Setting up ongoing alerting, so that if our query volume (and costs) spike, we’ll be notified immediately

Read along if you’d like your BigQuery daily cost graph to look like ours!

Quick notes on BigQuery pricing and our stack

Currently, we use BigQuery’s on-demand pricing model, with charges scaling linearly with the number of billable bytes processed, which is recorded in INFORMATION_SCHEMA.JOBS for each query. If you’re on capacity pricing, the same view and approach can be used to identify the users and queries that use the most slots; I’ll discuss how to monitor slot usage below. For projects using autoscaling (BigQuery Editions), reducing slot consumption translates into savings, since the autoscaler can spend more time at lower slot counts. If your project reserves slot capacity instead, you may need to manually reduce your commitment to see the savings on your bill.

We’ve set up Census, dbt, Fivetran, and our other tools with individual service accounts. The approach I’ll describe works best if you have a separate service account for each integration; if integrations share a service account or execute queries directly as a user, it will be harder to spot which integration is running the most intensive jobs.

Identifying intensive queries using INFORMATION_SCHEMA.JOBS

Most databases expose information about tables, columns, and other internals in a set of read-only information schema views. BigQuery also exposes the jobs that were run in your project in these views, which we’ll use to determine the costliest jobs and the users that run them. (To do this, you’ll need the bigquery.jobs.listAll IAM permission.)

Retrieving recent successful jobs is straightforward; this query will get all the billing-related information — as well as info on what kind of job was run — for all jobs from the past week. Note that INFORMATION_SCHEMA is per-region, so you must specify the region you want data for; you can’t query multiple regions together.

SELECT
  -- These columns have information that's useful for identifying the function and/or source of costly queries
  cache_hit,
  creation_time,
  end_time,
  job_id,
  job_type,
  parent_job_id,
  query,
  referenced_tables,
  state,
  statement_type,
  total_bytes_billed,
  total_bytes_processed,
  total_slot_ms,
  user_email
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS
WHERE
  creation_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY) AND CURRENT_TIMESTAMP()
  AND job_type = "QUERY"
  AND statement_type != "SCRIPT"
  AND cache_hit != true

Besides the filter on creation_time, we need to add three additional filters:

  • job_type = 'QUERY' filters out extract and load jobs;
  • statement_type != 'SCRIPT' prevents double-counting of multi-statement jobs, which appear as multiple rows: each child query has parent_job_id populated, referencing a parent job with statement_type 'SCRIPT'. If we don't exclude the parent jobs, every byte billed and slot millisecond spent on a child query is counted once on the child and again on the parent;
  • cache_hit != true excludes queries that were answered from the cache.

We can order these results by the cost field we care about (total_bytes_billed for on-demand or total_slot_ms for reserved pricing) to find the most expensive individual queries, but aggregating the data allows us to build a dashboard that we can use to monitor spend and observe trends over time.
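To grab the most expensive individual queries, a sketch like the following works (it reuses the filters from the query above and orders by total_bytes_billed; swap in total_slot_ms if you're on capacity pricing — the column choices here are just one reasonable selection):

-- Ten most expensive individual queries over the past week, by bytes billed
SELECT
  user_email,
  job_id,
  LEFT(query, 200) AS query_preview,  -- the first 200 characters are usually enough to recognize the query
  total_bytes_billed / POW(2, 40) AS tib_billed,
  total_slot_ms
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS
WHERE
  creation_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY) AND CURRENT_TIMESTAMP()
  AND job_type = "QUERY"
  AND statement_type != "SCRIPT"
  AND cache_hit != true
ORDER BY
  total_bytes_billed DESC
LIMIT 10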

For example, to identify the hourly costs incurred by each user over the past week, we could use a query like:

SELECT
  user_email,
  TIMESTAMP_TRUNC(end_time, HOUR) AS query_time_hour,
  SUM(total_bytes_billed) / POW(2, 40) AS tib_billed
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS
WHERE
  end_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY) AND CURRENT_TIMESTAMP()
  AND job_type = "QUERY"
  AND statement_type != "SCRIPT"
  AND cache_hit != true
GROUP BY 1, 2
ORDER BY 2 DESC, 3 DESC

To go from terabytes billed to dollars spent, multiply by the price you pay per TiB — list price is $5 as of writing, but it can be lower if you’ve negotiated a better rate. The first TiB across all users per month is free. This charge shows up as “Analysis” in the GCP billing console.

If you’re on capacity pricing, I would recommend using average slot utilization as your metric. To do this, you’ll need to aggregate on a much smaller time scale — e.g. per minute — and then divide total_slot_ms by 60,000 to estimate the average number of slots used in that minute. This doesn’t exactly match the cost behavior of autoscaling (BigQuery will sometimes provision more slots than a query actually needs), but it’s a good approximation of what your costs will be. The actual autoscaling behavior can be seen in your capacity monitoring dashboards.
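Here's a rough sketch of that aggregation. Note that it attributes all of a job's slot time to the minute in which the job ended, so it's only an approximation for jobs that run longer than a minute:

-- Approximate average slot usage per user, per minute:
-- total_slot_ms / 60,000 gives the average number of slots held over that minute
SELECT
  user_email,
  TIMESTAMP_TRUNC(end_time, MINUTE) AS query_time_minute,
  SUM(total_slot_ms) / 60000 AS avg_slots_used
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS
WHERE
  end_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY) AND CURRENT_TIMESTAMP()
  AND job_type = "QUERY"
  AND statement_type != "SCRIPT"
  AND cache_hit != true
GROUP BY 1, 2
ORDER BY 2 DESC, 3 DESC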

Building an interactive cost dashboard

Now that we have our time series data, we can build a dashboard to visualize and monitor our costs. There are plenty of ways to do this, from a simple Google Connected Sheet with a chart to a full dashboard in your BI tool. We use Connected Sheets and Looker heavily at Mixpanel, but for a quick interactive dashboard, Hex lets us stand up an interactive app quickly and iterate on the features we need.

Let’s build an MVP that:

  • Shows us the hourly costs incurred by each user
  • Tells us the users with the most bytes billed over the last week

Hourly costs incurred by each user

We start with a SQL cell, connected to the BQ project that we would like to monitor, and populate it with the query we wrote above.

User emails and costs incurred have been redacted to protect the guilty parties 😈

If we graph these results directly in a line chart, though, we wind up with gaps in some of our lines.

Yikes; this graph is difficult to parse.

If a user didn’t run any queries in a given time slice, there won’t be a row for that user and time in our result set. To fix this, we can join on a datetime spine, which will fill in the gaps with zero-valued rows.

The resulting lines are also very spiky, as many users don’t run a query every hour, or may just have one or two queries scheduled to run at particular times each day. We could fix this by aggregating over a longer period (e.g. a day), but now that the gaps are filled in, a rolling average is easy to implement with a window function that looks back over the previous rows.

Here’s an expanded version of the query, closer to what we use in production. It adds the datetime spine, calculates cost in dollars, aggregates slot time and raw query count, and implements a 24-hour rolling average. This produces a much nicer-looking line chart, in my opinion!

DECLARE bytes_per_tib INT64 DEFAULT 1024 * 1024 * 1024 * 1024;
DECLARE cost_per_tib INT64 DEFAULT 5;
DECLARE lookback_period_hrs INT64 DEFAULT 24 * 7;

WITH

time_spine AS (
  SELECT
    *
  FROM
    UNNEST(
      GENERATE_TIMESTAMP_ARRAY(
        TIMESTAMP_SUB(
          TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), HOUR),
          INTERVAL lookback_period_hrs HOUR
        ),
        TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), HOUR),
        INTERVAL 1 HOUR
      )
    ) AS query_time_hour
),

user_emails AS (
  SELECT DISTINCT user_email FROM `region-us`.INFORMATION_SCHEMA.JOBS
),

time_spine_with_user_emails AS (
  SELECT
    user_email,
    query_time_hour
  FROM
    user_emails
  CROSS JOIN
    time_spine
),

bq_usage_by_hour AS (
  SELECT
    user_email,
    TIMESTAMP_TRUNC(end_time, HOUR) AS query_time_hour,
    SUM(total_bytes_billed) / bytes_per_tib AS tib_billed,
    COUNT(*) AS queries_run,
    SUM(total_slot_ms) / 1000 AS total_slot_seconds
  FROM
    `region-us`.INFORMATION_SCHEMA.JOBS
  WHERE
    end_time BETWEEN TIMESTAMP_SUB(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), HOUR), INTERVAL lookback_period_hrs HOUR) AND CURRENT_TIMESTAMP()
    AND job_type = "QUERY"
    AND statement_type != "SCRIPT"
    AND cache_hit != true
  GROUP BY 1, 2
),

bq_usage_with_timespine AS (
  SELECT
    COALESCE(bq_usage_by_hour.user_email, time_spine_with_user_emails.user_email) AS user_email,
    COALESCE(bq_usage_by_hour.query_time_hour, time_spine_with_user_emails.query_time_hour) AS query_time_hour,
    IFNULL(tib_billed, 0) AS tib_billed,
    IFNULL(tib_billed, 0) * cost_per_tib AS dollars_cost,
    IFNULL(queries_run, 0) AS queries_run,
    IFNULL(total_slot_seconds, 0) AS total_slot_seconds
  FROM
    time_spine_with_user_emails
  LEFT JOIN
    bq_usage_by_hour
  USING
    (user_email, query_time_hour)
)

SELECT
  *,
  AVG(tib_billed) OVER (
    PARTITION BY user_email
    ORDER BY query_time_hour
    ROWS BETWEEN 23 PRECEDING AND CURRENT ROW
  ) AS rolling_avg_tib_billed,
  AVG(dollars_cost) OVER (
    PARTITION BY user_email
    ORDER BY query_time_hour
    ROWS BETWEEN 23 PRECEDING AND CURRENT ROW
  ) AS rolling_avg_dollars_cost
FROM
  bq_usage_with_timespine

We can still identify spikes in usage easily with this graph — the pink user here ran a bunch of costly queries on July 1 — but we can now more easily parse the usage of the brown and red users, which run costly queries on a regular schedule.

Ranking users by total usage over the past week

Hex makes it easy to reuse the results from our previous query as a source for another SQL query, without having to materialize the results or create a view in our warehouse. By selecting Dataframes as our source, we can query our previous result sets using SQL.

Here’s a simple query that ranks users based on their total usage over the past 7 days:

WITH
total_tib_billed_last_7d AS (
  SELECT
    user_email,
    SUM(tib_billed) AS total_tib_billed
  FROM
    costs_data
  GROUP BY 1
)

SELECT
  user_email,
  total_tib_billed,
  RANK() OVER (ORDER BY total_tib_billed DESC) AS usage_rank
FROM
  total_tib_billed_last_7d
ORDER BY 3

We can visualize this directly as a chart, or join it back onto our previous result set to filter the graph down to just our top users.
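For instance, here's a sketch of that join-back in another Hex SQL cell, assuming the ranked results above are exposed as a dataframe named user_usage_ranks (that name is hypothetical):

-- Keep only the hourly rows for the five heaviest users so the chart stays readable
SELECT
  costs_data.*
FROM
  costs_data
JOIN
  user_usage_ranks
USING (user_email)
WHERE
  usage_rank <= 5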

Root-causing high-cost queries

Now that we have a high-level view of our spend, it’s time to dive in and identify tweaks we can make — to the queries themselves or their timing patterns — to reduce costs. This is less straightforward than building a dashboard or aggregating costs per-user; we need to look at the queries themselves to root cause the costs.

The general approach here is:

  • Determine whether the problem is an inefficient query, a query that runs too frequently, or both. The SQL text is exposed in the query column, so if the same query is run each time, grouping by query can tell you how often it runs. If your queries are parametrized or compiled, query_info.query_hashes.normalized_literals can be useful; it contains a hash of the query that ignores comments, parameters, UDFs, and literals (see the sketch after this list).
  • Optimize the query or reduce its frequency to reduce the overall bytes billed or slot time used.
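Here's a sketch of that grouping by normalized query hash, which rolls up runs of the same query shape even when their literals differ (same filters as before):

-- Roll up spend by normalized query hash to find the query shapes that cost the most
SELECT
  query_info.query_hashes.normalized_literals AS query_hash,
  ANY_VALUE(LEFT(query, 200)) AS example_query,
  COUNT(*) AS times_run,
  SUM(total_bytes_billed) / POW(2, 40) AS tib_billed
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS
WHERE
  creation_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY) AND CURRENT_TIMESTAMP()
  AND job_type = "QUERY"
  AND statement_type != "SCRIPT"
  AND cache_hit != true
GROUP BY 1
ORDER BY tib_billed DESC
LIMIT 20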

In our case, Census accounted for the vast majority of our spend, so we were able to focus our search. To determine the highest cost queries, we looked at the queries Census ran during the previous day and grouped them by the tables referenced in each query. This allowed us to easily determine the most problematic source tables; we then reduced the frequency of those syncs, some of which were running every 15 minutes, to a more reasonable hourly or daily frequency (as appropriate). However, when the costly queries are coming from inside the house, you may need to tweak your queries themselves to realize cost savings.

If you’re also a Census user, you can use this query to get the tables that Census scans the most; by excluding internal configuration tables, the source tables are readily identifiable. If you have configured your syncs to use BigQuery models as sources, you can even search your syncs for the offending table’s name! The general approach of grouping by referenced_tables is useful whenever you can easily determine the query given the user and the tables referenced.

On the other hand, if you flip the WHERE condition, you can see the sync log tables that are queried the most. The data inside the SNAPSHOT_SYNC_CONFIGURATION tables has a schema similar to the data being synced, which can point you towards the expensive sync.

WITH
census_queries_june_3 AS (
  SELECT
    -- Convert to a string to allow GROUPing BY
    TO_JSON_STRING(referenced_tables) AS referenced_tables_str,
    total_bytes_billed
  FROM
    `region-us`.INFORMATION_SCHEMA.JOBS
  WHERE
    TIMESTAMP_TRUNC(end_time, DAY) = "2023-06-03"
    -- not the actual integration user
    AND user_email = "census_user@serviceaccount.google.com"
    AND cache_hit = false
    AND job_type = "QUERY"
    AND statement_type != "SCRIPT"
)

SELECT
  referenced_tables_str,
  COUNT(*) AS count_queries,
  SUM(total_bytes_billed) / (1024 * 1024 * 1024) AS gb_billed
FROM
  census_queries_june_3
WHERE
  -- Census stores information about the state of your syncs here.
  -- Excluding these tables shows us only the tables it's syncing from.
  referenced_tables_str NOT LIKE "%SYNC_CONFIGURATION%"
GROUP BY
  referenced_tables_str
ORDER BY
  3 DESC

Optimizing cost per query

However, if the query’s frequency can’t be reduced, you’ll have to optimize the query itself to reduce spend. There’s no one-size-fits-all fix that will make every query performant, but some low-lift approaches include:

Avoiding SELECT *

BigQuery, like most cloud data warehouses, is a column store, so if you don’t need every column, selecting only the columns you need prevents the rest from being scanned, reducing the number of bytes billed. Even if the query has a LIMIT clause, the full table will still be scanned!
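For example, on a hypothetical analytics.events table, the fix is just to name the columns the downstream step actually reads:

-- Scans (and bills for) every column in the table, even with a LIMIT:
-- SELECT * FROM analytics.events LIMIT 100;

-- Scans only the two columns we actually need:
SELECT
  event_name,
  event_timestamp
FROM
  analytics.events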

Materializing your query results in stages

If your query is large and consists of many subqueries, writing each stage to a destination table effectively caches the results of each stage, reducing the amount of data that is read. The additional costs of storing the intermediate results will almost always be much lower than the cost of repeatedly running the entire query.
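As a sketch of what that can look like (table and column names here are hypothetical), an expensive aggregation gets written out once and reused by downstream queries:

-- Materialize an expensive intermediate aggregation once...
CREATE OR REPLACE TABLE analytics_staging.daily_event_counts
PARTITION BY event_date
AS
SELECT
  DATE(event_timestamp) AS event_date,
  event_name,
  COUNT(*) AS event_count
FROM
  analytics.events
GROUP BY 1, 2;

-- ...then downstream queries read the much smaller staged table
-- instead of re-scanning the raw source.
SELECT
  event_date,
  SUM(event_count) AS total_events
FROM
  analytics_staging.daily_event_counts
GROUP BY 1;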

Partitioning and clustering your tables

These are two excellent techniques for reducing the amount of data scanned by a query. Partitioning splits your table into segments based on the value of a column, while clustering sorts the data on one or more columns before storing it in blocks.

Partitioning on a date column is great for tables that span a long period of time. For example, if you have a purchases table, partitioning it on the date of the purchase will prevent old rows from being scanned when your revenue report only cares about the last year's purchases.

To further improve performance, you can cluster your partitioned tables on certain columns. For example, if that revenue report aggregates sales by region and store, clustering on those columns will ensure that purchases from the same region and store will be stored in the same blocks within each partition.
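As a sketch of the DDL (with hypothetical table and column names), a purchases table set up this way for that revenue report might look like:

CREATE TABLE sales.purchases
-- Each day's purchases land in their own partition, so a report over the last
-- year never touches older partitions
PARTITION BY DATE(purchased_at)
-- Within each partition, rows from the same region and store are stored
-- together in the same blocks
CLUSTER BY region, store_id
AS
SELECT * FROM sales.purchases_raw;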

The BigQuery docs include excellent introductions to partitioning and clustering.

Incrementally materializing dbt models

If your culprit is expensive dbt models, moving them to incremental materialization can reduce the amount of data scanned on each run. For example, a model that parses server logs doesn’t really need to do a full table scan each run; it can be optimized to scan only the logs added to the source since the last run, as sketched below. You’ll need to partition your table on the incremental key; otherwise, the model will still require a full table scan.
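For illustration, here's a minimal sketch of an incremental model along those lines; the source, table, and column names are hypothetical, and the default merge strategy is assumed:

-- models/server_logs_parsed.sql
{{
  config(
    materialized = "incremental",
    -- Partition the model on the incremental key, per the note above
    partition_by = {"field": "log_date", "data_type": "date"}
  )
}}

SELECT
  DATE(log_timestamp) AS log_date,
  request_path,
  status_code,
  response_time_ms
FROM
  {{ source("server_logs", "raw") }}

{% if is_incremental() %}
WHERE
  -- On incremental runs, only look at log rows newer than what's already in the model
  DATE(log_timestamp) > (SELECT MAX(log_date) FROM {{ this }})
{% endif %}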

Note that there are multiple incremental materialization strategies for BigQuery, so some experimentation may be required; there is a great discussion benchmarking the performance of these strategies here.

Setting up ongoing alerting

Finally, we would like to get notified if a particular user is acting up, or if our costs suddenly (or gradually) spike. There are a few different ways to do this, depending on how your alerting stack works. Ours is primarily driven by dbt tests; when they fail, we get paged, and the pages resolve when the test starts passing again.

To implement a test that enforces a daily spend limit, we can write a dbt model like this:

-- Return all hourly spend rows from the last 24h
-- if total spend over that period exceeded the configured limit

{% set spend_last_24h_query %}
SELECT
  SUM(dollars_billed)
FROM
  -- Model that aggregates BQ spend per hour
  {{ ref("bq_spend_per_hour") }}
WHERE
  query_hour >= TIMESTAMP_SUB(
    TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), HOUR),
    INTERVAL 24 HOUR
  )
{% endset %}

{% set results = run_query(spend_last_24h_query) %}

{% if execute %}
  -- Return the first value from the first column
  {% set spend_last_24h = results.columns[0][0] %}
{% else %}
  {% set spend_last_24h = 0 %}
{% endif %}

SELECT
  *
FROM
  {{ ref("bq_spend_per_hour") }}
WHERE
  query_hour >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
  -- spend_limit_24h is assumed to be defined in dbt_project.yml vars
  AND {{ spend_last_24h }} > {{ var("spend_limit_24h") }}

This will either have no rows (if our spending is under the limit), or it will include all of the aggregated data over the last 24 hours. If rows are present in this table, the test will fail, and the responder will have all the information they need to root-cause the issue. When spend drops below our limit, the incident will then automatically resolve.

Wrap-up

It can be challenging to get — and keep — a handle on your BigQuery bill, but these techniques can help your team save money and avoid being surprised by the bill at the end of the month. Besides, who doesn’t want to send out a message announcing cloud cost savings?

By the way, if you love building systems that answer questions — whether they’re being asked by your colleagues or by your customers — Mixpanel engineering is hiring!

This article was not written by, or with the assistance of, AI tools, and contains no text generated by large language models.
