How We Migrated Our Data Warehouse from Snowflake to DuckDB

5/30/2024

Steven Wang

Everyone loves Snowflake. The ease of use when it comes to elastic scaling, data backups, and table optimizations amongst other features make Snowflake a great choice when setting up a data stack. Given the proper budget and data resources, Snowflake is undoubtedly the leading data warehouse candidate. But has your company been considering alternatives to Snowflake, whether it be for cost, vendor lock-in concerns, or other reasons?

We’ve done this. Here at Definite we’ve been building an alternative approach that uses an open source database, DuckDB, and a self-hosted infrastructure. With this setup we were able to migrate all our analytical data and queries from Snowflake to DuckDB. Everything from internal team dashboards to ad-hoc queries (eg Stripe MRR metrics, Hubspot deal progress, etc) now runs on our self-hosted DuckDB database.

TLDR: it required a good bit of work but we saw a radical reduction in cost (>70%).

What Is DuckDB?

DuckDB is a columnar database management system built for analytical workloads. It’s fast (see below), open source and has a very active community. The Github repo has 17.5k stars on GitHub with 34 authors have pushing 575 commits to main in the past month. Some features of DuckDB:

  • It’s fast, consistently ranking near the top of performance benchmarks for analytical databases (see here and here)
  • It’s lightweight and can efficiently process large datasets even on a local machine. Set up is as simple as *pip install duckdb* and pointing it to your files
  • Uses a SQL dialect closely matching Postgres SQL, meaning most queries from other databases are directly translatable to Duckdb SQL
  • Accepts many file formats including csv, parquet, Excel, and json. This allows Duckdb to query data from wherever your data already is such as a local directory or cloud storage like S3. The added benefit is if a decision is made to move to another database, your data isn’t locked in to DuckDB

This article has a good overview of DuckDB and where it fits in your data stack.

A (Very) High Level Overview of How We Implemented DuckDB As A Data Warehouse

Unlike Snowflake, which supports multiple concurrent processes, DuckDB is designed for single-user workloads and becomes locked and un-queryable during data writes or table edits. To work around this issue, we set up two DuckDB instances, a write-to database and a read-only database:

  • Write-to Database: DuckDB installed on a VM with a persistent disk attached to it. An example of this is **Google Compute Engine with a persistent disk attached to it. All write operations such as INSERT, CREATE, UPDATE, etc are actioned on this database first.
  • Read-only Database: Duckdb also installed on a VM but has read access to replica database files stored in cheap blob storage such as Google Cloud Storage (GCS). After edits to the write-to DuckDB instance are finished, the new/changed tables are synced to GCS which the read-only DuckDB has access to. When a user wants to query the database (SELECT...), the query is routed to the read-only DuckDB which in turn fetches the data from GCS and returns it.

The benefit of having write-to and read-only DuckDB instances is that the data warehouse is still queryable while edits are happening in the background. This could be a long running ETL job such as loading Stripe transaction data or a user initiated UPDATE statement; we want to ensure that the data is still readable even while these are running.

Why not also use GCS as the file store for the write-to database given how much cheaper it is relative to using a persistent disk? Currently at the time of writing, DuckDB only supports read-only mode for blob storage such as GCS or S3.

Though the set up outlined above requires some engineering effort, it plays a crucial role in enabling data warehouse-like functionality required for analytics and BI use cases, and we are constantly working to optimize this piece of the infrastructure.

Cost

To run the DuckDB setup described above we need two Google Compute Engine instances, one persistent disk, and access to GCS. To estimate the daily cost of running all this we will make the following configuration and storage assumptions:

  • Size of data: 1TB
  • Compute Engine vCPUs: 16
  • Compute Engine RAM: 64GB

Since most data warehouses are not in use 24 hours a day, we will estimate the cost under different usage scenarios. In typical BI use cases involving dashboard refreshes, data ingestion, and ad-hoc queries, compute may only be needed during peak business hours.

We also compare the cost of different Snowflake warehouse sizes (compute) running under the different usage scenarios:

Untitled

As seen in the table on the right above, at 12 hours of usage per day DuckDB on GCP is ~55% cheaper than the smallest Snowflake warehouse. If we compare it to the next smallest Snowflake warehouse, Small, we can see it is ~77% cheaper and savings increase as we go up the Snowflake warehouse sizes.

The same table as before but in chart form:

Untitled

Of course there are a lot of assumptions in this estimate that could vary; GKE specs can be scaled up or down, Snowflake users may utilize multiple warehouses of different sizes, total data storage size could be larger or smaller, etc, but directionally self hosting DuckDB should yield cost savings relative to Snowflake (see bottom for GCP and Snowflake pricing pages). As mentioned in the beginning, anecdotally at Definite we’ve seen >70% savings by switching to DuckDB.

Interested In Trying DuckDB?

If you want to give DuckDB a try, here is an excellent guide on how to get it running on your local machine. If you want a more cloud data warehouse-like experience and have the resources to set up your own infrastructure, the DIY approach we outlined in this post is an option. Alternatively, managed solutions like Motherduck or Definite can be a pain-free way to quickly spin up your own DuckDB powered data warehouse.

With Definite, not only do you get a managed DuckDB warehouse, we’ve also added hundreds of integrations to a wide range of other products (Salesforce, Stripe, Postgres, etc) so you can seamlessly move data in and out of DuckDB. Additionally, we also built a BI layer and AI-enabled natural language query layer on top of DuckDB, allowing anyone on your team to derive data insights. Definite’s goal is to be your data-stack in a box with ETL, data warehousing, and analytics all deployed with a few clicks. If you would like to learn more, feel free to reach out!

Resource Pricing References:

Google Cloud Compute

Google Cloud Persistent Disk

Google Cloud Storage

Snowflake Credit and Storage Pricing

Snowflake Compute Warehouse Credit Usage