
DuckDB and DuckLake: Why We Bet the Company on the Duck Stack

Mike Ritchie


In May 2024, we migrated our entire analytics infrastructure from Snowflake to DuckDB. Not a side project. Not a proof of concept. The production system that powers every customer dashboard, every AI query, every data pipeline at Definite.

A year and a half later: it was the best infrastructure decision we ever made. With DuckLake approaching 1.0, the economics are about to get even better.

This is the business case for the duck stack. Not a tutorial (we have plenty of those). This is about money, architecture, and why the conventional wisdom about data warehouses is about to get a serious rewrite.

A Brief History of the Duck

DuckDB started as a research project at CWI in the Netherlands, the same lab that produced MonetDB and helped shape columnar database design. The premise was simple: what if you could have SQLite-level simplicity but for analytical workloads? An embeddable, in-process OLAP database that just works.

A lot of people wanted exactly that.

DuckDB hit 30,000 GitHub stars in June 2025. It jumped from 1.4% to 3.3% adoption in the Stack Overflow Developer Survey, landing at #4 among databases. On DB-Engines, it climbed from #81 to #51 in a single year, a 50.7% growth rate. In October 2025, DuckDB's in-memory variant hit #1 on ClickBench, the popular analytical database benchmark.

What happened? Three things converged:

  1. The cloud data warehouse bill shock. Companies that signed up for Snowflake or Databricks in 2020-2022 started getting their renewal quotes. The numbers were not fun. Mid-sized companies routinely spend $2,000 to $10,000 per month on Snowflake. Enterprises can easily hit $15,000 to $50,000+. Meanwhile, most of their actual query volume could run on a single machine.

  2. The "good enough" moment. DuckDB's performance crossed a threshold where it wasn't just fast for an embedded database; it was fast, period. Sub-second on queries that took Snowflake seconds. On ClickBench, it competes head-to-head with ClickHouse on single-node workloads. For the vast majority of analytical workloads (anything under a few hundred GB), DuckDB is simply faster than the cloud warehouses.

  3. The open source momentum. MotherDuck raised $100M at a $400M valuation to build cloud infrastructure around DuckDB. Fivetran embedded DuckDB to power transformations. dbt, Meltano, and dozens of other tools added native DuckDB support. The ecosystem crossed the credibility threshold.

Why We Chose DuckDB Over Everything Else

When we started building Definite, we had a decision to make. Our product is an all-in-one analytics platform: connectors, warehouse, BI, semantic layer, and an AI assistant. The query engine is the beating heart of the whole thing.

We evaluated Postgres, ClickHouse, Snowflake, and DuckDB. Here's how the decision went:

Postgres was the default. Every startup uses Postgres. But Postgres is a row-oriented OLTP database being asked to do OLAP work. It can handle it for small datasets, but analytical query performance degrades rapidly as data grows. We needed something purpose-built for analytics.

Snowflake was the enterprise choice. Great product. But we'd be building an analytics platform on top of another company's analytics platform. That's a dependency we couldn't afford, both financially and architecturally. Snowflake's pricing model (credits for compute, per-TB for storage, with charges for data transfer, Snowpipe, and a dozen other things) would have made our unit economics impossible at the price points we wanted to hit.

ClickHouse was the performance choice. Incredible for high-throughput, write-heavy workloads. But ClickHouse is built to scale out as a distributed cluster. That's operational overhead we didn't want, and it's designed for scenarios (millions of rows ingested per second, real-time dashboards on petabyte-scale data) that don't match most of our customers' workloads.

DuckDB was the pragmatic choice. Embeddable, meaning we could ship it as part of our infrastructure without an external dependency. Blazing fast on analytical workloads. Postgres-compatible SQL dialect, so our customers' existing queries mostly just worked. And the cost structure was radically simpler: instead of paying per-credit or per-query, we pay for compute (a VM) and storage (object storage). That's it.
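
To make "embeddable" concrete, here's a minimal sketch of what that looks like from application code (the file name and columns are made up): the engine runs inside your own process, with no server to stand up, and queries ordinary Parquet files with Postgres-flavored SQL.

```python
# Minimal embedded DuckDB usage; "events.parquet" and its columns are
# placeholders for illustration.
import duckdb

con = duckdb.connect()  # in-memory database, running inside this process

rows = con.execute("""
    SELECT customer_id, SUM(amount) AS total
    FROM read_parquet('events.parquet')
    WHERE event_date >= DATE '2025-01-01'
    GROUP BY customer_id
    ORDER BY total DESC
    LIMIT 10
""").fetchall()
```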

The tradeoff was real. DuckDB didn't support concurrent writers. It was single-process by design. We'd have to build infrastructure around those limitations. But the upside (performance, cost, control) was too compelling to ignore.

How We Run DuckDB in Production (It's Not a Toy)

DuckDB's reputation as a "local analytics tool" undersells it. We run DuckDB in production on Kubernetes, serving hundreds of customers, processing millions of queries, with sub-second response times.

Here's the architecture:

Compute layer. DuckDB runs inside containers on GKE (Google Kubernetes Engine). Each customer's analytical workloads run against DuckDB instances that can scale horizontally. We wrote a Rust-based DuckDB server that handles connection pooling, query routing, and lifecycle management.

Storage layer. All data lives in Parquet files on Google Cloud Storage. This is the key insight: by separating compute from storage (sound familiar?), we get the same elastic scaling that Snowflake advertises, but without the Snowflake price tag. DuckDB reads Parquet files directly from GCS, and its query optimizer is smart enough to push down predicates and only scan the columns it needs.
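
Here's a hedged sketch of that storage pattern, assuming GCS HMAC credentials and a bucket path of your own. The httpfs extension handles the gs:// reads, and because Parquet is columnar with per-row-group statistics, DuckDB fetches only the columns and row groups a query actually touches.

```python
# Sketch: DuckDB querying Parquet directly on GCS. Bucket path and HMAC
# credentials below are placeholders.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# Authenticate to GCS with HMAC keys via DuckDB's secrets manager.
con.execute("""
    CREATE SECRET gcs_creds (
        TYPE gcs,
        KEY_ID 'YOUR_HMAC_KEY_ID',
        SECRET 'YOUR_HMAC_SECRET'
    )
""")

# Only the referenced columns are read, and the date filter prunes row
# groups using Parquet statistics instead of downloading whole files.
con.execute("""
    SELECT country, COUNT(*) AS orders
    FROM read_parquet('gs://your-bucket/warehouse/orders/*.parquet')
    WHERE order_date >= DATE '2025-01-01'
    GROUP BY country
""").fetchall()
```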

Concurrency model. DuckDB's single-writer limitation was the biggest engineering challenge. We solved it initially with a write-instance/read-instance pattern (detailed in our migration post). Then we built a Flight server for streaming workloads (Duck Takes Flight). And now, DuckLake solves the concurrency problem entirely by moving the coordination layer to PostgreSQL.
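
For concreteness, this is roughly what the DuckLake arrangement looks like. The Postgres connection string, bucket path, and table name are placeholders, and GCS credentials are assumed to be configured as in the previous sketch; the point is that the catalog, and therefore write coordination, lives in Postgres while the data stays as Parquet in object storage.

```python
# Sketch: attaching a DuckLake catalog backed by PostgreSQL, with data
# files on GCS. Connection details and names are placeholders.
import duckdb

con = duckdb.connect()
for ext in ("ducklake", "postgres", "httpfs"):
    con.execute(f"INSTALL {ext}")
    con.execute(f"LOAD {ext}")

# Metadata (snapshots, schemas, file lists) goes to Postgres; data files
# are written as Parquet under DATA_PATH.
con.execute("""
    ATTACH 'ducklake:postgres:dbname=lake_catalog host=YOUR_PG_HOST user=lake'
    AS lake (DATA_PATH 'gs://your-bucket/lake/')
""")

# From here it behaves like a normal database, and concurrent writers are
# coordinated through the Postgres catalog rather than a single process.
con.execute("CREATE TABLE lake.events AS SELECT 42 AS id, now() AS seen_at")
print(con.execute("SELECT COUNT(*) FROM lake.events").fetchone())
```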

Ingestion. We built target-ducklake, a Meltano-compatible target that loads data from any Singer tap directly into DuckLake. This connects our 200+ data source connectors to the DuckDB query engine without any intermediate staging.

The result: a fully managed analytics warehouse that costs a fraction of Snowflake, with query performance that's often faster. Our Platform tier starts at $250/month. That includes the warehouse, connectors, BI, semantic layer, and AI. Try getting that from Snowflake.

The Performance Story: Real Numbers

Benchmarks can be gamed. Every vendor picks the one that makes them look good. So here's what we actually see in production, not a synthetic benchmark.

Typical dashboard query (aggregation over 10M rows): DuckDB returns in 200-400ms. The same query on Snowflake (X-Small warehouse) takes 2-5 seconds, mostly because of the cold-start overhead. Snowflake has a queuing and compilation step before execution even begins.

Ad-hoc exploration (filter + group by on 50M rows): DuckDB returns in under 1 second. This is the kind of query an analyst runs dozens of times in a session, tweaking filters, changing group-by columns. Sub-second latency changes how people interact with data. There's no "submit query and wait" workflow. It's instant.

Complex joins across multiple tables: This is where DuckDB's vectorized execution engine really shines. Multi-table joins that would take 10-30 seconds on Snowflake (because each table scan involves network roundtrips to S3) complete in 1-3 seconds on DuckDB reading from GCS, because DuckDB's query optimizer batches reads and processes data in-memory.

Where DuckDB loses: If you have genuinely petabyte-scale data and need to scan terabytes for a single query, a distributed system like Snowflake or ClickHouse will be faster. But most companies don't have that problem. The median analytical dataset is measured in gigabytes, not terabytes. And for those workloads, DuckDB is simply faster because there's no network overhead, no credit-based queuing, and no cluster coordination.

DuckLake: The Business Case

In May 2025, the DuckDB team released DuckLake. It changed our architecture for the better.

DuckLake is a lakehouse format. If you've been following the data world, you've heard of Apache Iceberg and Delta Lake. They all solve the same problem: how do you get ACID transactions, time travel, and schema evolution on top of Parquet files in object storage?

DuckLake's insight is that the metadata management problem is actually a solved problem. We already have technology that handles ACID transactions, concurrent access, and consistent reads: relational databases. So instead of inventing a new file-based metadata format (like Iceberg's manifest files or Delta Lake's transaction logs), DuckLake stores all metadata in PostgreSQL (or MySQL, or SQLite).

This sounds too simple. That's the point.

What This Means for Your Wallet

The lakehouse model (store data as Parquet in cheap object storage, query it with a fast engine) is the most cost-effective way to run analytics. Here's the math:

Storage costs:

  • Snowflake: ~$40/TB/month (active storage)
  • GCS/S3 (Parquet files): ~$20/TB/month (standard), ~$4-10/TB/month (nearline/coldline)
  • With Parquet compression (typically 5-10x), the effective cost per TB of logical data drops to roughly $2-4/month on standard storage, and lower still on nearline/coldline (worked out in the quick sketch after this list)
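
Worked out with the list prices above (illustrative figures, not a quote):

```python
# Back-of-the-envelope storage math using the figures above.
gcs_standard = 20.0     # $/TB/month for GCS standard storage
compression = (5, 10)   # typical Parquet compression ratios

effective = [gcs_standard / c for c in compression]
print(effective)        # [4.0, 2.0] -> roughly $2-4 per logical TB/month

snowflake_storage = 40.0  # $/TB/month active storage
print([snowflake_storage / e for e in effective])  # ~10-20x cheaper per logical TB
```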

Compute costs:

  • Snowflake X-Small warehouse: ~$2 per credit, and the bill runs up fast. A moderately active team can easily burn through $1,000-3,000/month in credits.
  • DuckDB on a c2-standard-16 (16 vCPU, 64GB RAM): ~$500/month on GCP. Always on. No per-query charges.

The hidden costs that disappear:

  • No data transfer charges between storage and compute (DuckDB reads from GCS natively)
  • No Snowpipe charges for data ingestion
  • No credit burns from warehouse auto-resume
  • No "oops someone left a Large warehouse running over the weekend" incidents
  • No per-seat licensing for the query engine itself

For a mid-market company running 1TB of analytical data with a team of 10-20 analysts, we typically see total cost comparisons like:

| Component  | Snowflake + Looker       | Definite (DuckDB/DuckLake) |
|------------|--------------------------|----------------------------|
| Compute    | $2,000-5,000/mo          | Included                   |
| Storage    | $40-80/mo                | $20-40/mo                  |
| BI tool    | $1,000-3,000/mo          | Included                   |
| Connectors | $500-2,000/mo (Fivetran) | Included                   |
| Total      | $3,500-10,000/mo         | $250-500/mo                |

That's not a 20% savings. That's an order of magnitude.

ACID Without the Complexity

Before DuckLake, running a lakehouse meant dealing with Iceberg or Delta Lake. Both are powerful, but both come with significant operational complexity:

Apache Iceberg uses a file-based metadata system. Every commit creates new metadata files. Manifest files point to data files. Manifest lists point to manifest files. A snapshot points to a manifest list. To read a table, the query engine has to traverse this tree of files. It works, but it's complex, and when things go wrong (orphaned files, metadata corruption, manifest list growing too large), debugging requires deep knowledge of the format internals.

Delta Lake uses a JSON-based transaction log in the _delta_log directory. Every operation appends a JSON file. Periodically, these get compacted into checkpoint files. It's simpler than Iceberg but still requires careful management, and it's tightly coupled to the Spark ecosystem (despite improvements in recent years).

DuckLake stores metadata in a PostgreSQL table. That's it. Want to see all snapshots? Run a SQL query. Want to debug a failed transaction? Check the catalog database. Want to add an index on a metadata column? Go ahead, it's just Postgres. Every DBA, every backend engineer, every ops person already knows how to manage this.
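
As a hedged illustration (catalog attached as `lake`, as in the earlier sketch; exact function names are worth double-checking against the DuckLake docs for your version), listing snapshots and time-traveling are just queries:

```python
# Sketch: inspecting DuckLake metadata and time-traveling. Connection
# details, names, and the snapshot version are placeholders.
import duckdb

con = duckdb.connect()
for ext in ("ducklake", "postgres", "httpfs"):
    con.execute(f"INSTALL {ext}")
    con.execute(f"LOAD {ext}")
con.execute("""
    ATTACH 'ducklake:postgres:dbname=lake_catalog host=YOUR_PG_HOST user=lake'
    AS lake (DATA_PATH 'gs://your-bucket/lake/')
""")

# Every commit is a row in the catalog; list them like any other table.
print(con.execute("SELECT * FROM ducklake_snapshots('lake')").fetchall())

# Time travel: read a table as it existed at an earlier snapshot.
print(con.execute("SELECT COUNT(*) FROM lake.events AT (VERSION => 1)").fetchone())
```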

The simplicity has real business value. Less operational complexity means fewer on-call incidents. Faster debugging means less engineering time wasted. And battle-tested PostgreSQL for metadata means you get decades of reliability engineering for free.

DuckLake vs. Iceberg vs. Delta Lake: When Each Makes Sense

DuckLake isn't the right choice for everyone.

Choose Apache Iceberg when:

  • You have petabyte-scale data and need distributed query engines (Spark, Trino, Flink) to scan it
  • You need engine-agnostic access (multiple teams using different tools all reading the same tables)
  • You're in a large enterprise with existing Iceberg infrastructure and governance tooling
  • You need fine-grained partition evolution on massive tables

Choose Delta Lake when:

  • You're already deep in the Databricks ecosystem
  • You need tight integration with Spark streaming for real-time workloads
  • Your data team primarily uses notebooks and Spark SQL

Choose DuckLake when:

  • Your data is in the gigabytes-to-low-terabytes range (which covers most companies)
  • You want lakehouse benefits (ACID, time travel, Parquet storage) without distributed systems complexity
  • You value operational simplicity and want your team to move fast
  • You're cost-conscious and don't want to pay for distributed infrastructure you don't need
  • You want to use DuckDB as your query engine (because it's faster for your workload size)

Most data infrastructure discussions miss this: the vast majority of companies are not Netflix, Uber, or Airbnb. They don't have petabytes of data. They have 50GB to 2TB of analytical data, a handful of data sources, and a team of 5-20 people who need answers from that data. For these companies (which is most companies), DuckLake + DuckDB is not a compromise. It's the optimal architecture.

The "No Cluster" Advantage

Every CTO has been conditioned to think they need a distributed data warehouse. Most don't.

A single modern machine with 16 vCPUs and 64GB of RAM, running DuckDB, can process analytical queries on hundreds of gigabytes of data in sub-second time. DuckDB's vectorized execution engine, columnar processing, and intelligent caching mean that a single node can do the work that would require a multi-node Snowflake warehouse or ClickHouse cluster.

What does "no cluster" mean in practice?

  • No cluster management. No nodes to scale up and down. No rebalancing. No split brain scenarios. No distributed consensus. Your infrastructure team (or lack thereof) will thank you.
  • No cold starts. Snowflake's auto-suspend/auto-resume saves money but introduces latency. The first query after a warehouse resumes takes 5-15 seconds while it warms up. DuckDB is always ready because it's just a process.
  • Predictable costs. A VM costs what it costs. There are no surprises. No "we accidentally ran a query that scanned the whole warehouse and burned $500 in credits."
  • Simpler debugging. When something goes wrong in a distributed system, you're looking at logs across multiple nodes, network partitions, and coordination failures. With DuckDB, you have one process, one log, one place to look.

The serverless DuckDB model (a single DuckDB process reading from object storage) gives you 90% of the benefits of a cloud data warehouse at 10% of the cost. The remaining 10% (infinite horizontal scaling, multi-petabyte scans) is something most teams will never need.

What's Coming: DuckLake 1.0 and Why It Matters

DuckLake is currently at version 0.3, with 1.0 targeted for early 2026. The roadmap includes features that will make it even more compelling for production use:

  • Data inlining with PostgreSQL as a catalog, meaning small tables can be stored directly in the catalog database rather than as separate Parquet files. This eliminates the overhead of tiny files for reference tables, lookup tables, and configuration data.
  • Inline deletes and updates, making CDC (Change Data Capture) workflows more efficient by avoiding full file rewrites for small mutations.
  • Incremental compaction for large volumes of small files, solving the "small file problem" that plagues every lakehouse format.
  • Variant types for semi-structured data, making it easier to work with JSON, logs, and event data.
  • Iceberg interoperability (already available in 0.3), meaning you can read Iceberg tables from DuckLake and vice versa. This is critical for migration paths: you don't have to choose DuckLake on day one. You can start reading your existing Iceberg tables and migrate incrementally.

The broader DuckDB ecosystem is also maturing rapidly. DuckDB 1.4 LTS (released October 2025) added AES-256 encryption, the MERGE statement, Iceberg writes, and significant performance improvements including a rewritten sort engine and enhanced query optimizer. The LTS designation means enterprises can adopt it with confidence in long-term stability.
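
As one example of why the MERGE statement matters for lakehouse workloads, here's a hedged sketch of an upsert (table and column names are made up; assumes DuckDB 1.4 or later):

```python
# Sketch: upsert with the MERGE statement added in DuckDB 1.4.
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance DECIMAL(10, 2))")
con.execute("INSERT INTO accounts VALUES (1, 100.00), (2, 250.00)")

# Update matching rows, insert the rest, in one statement.
con.execute("""
    MERGE INTO accounts AS t
    USING (VALUES (2, 300.00), (3, 50.00)) AS s(id, balance)
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET balance = s.balance
    WHEN NOT MATCHED THEN INSERT (id, balance) VALUES (s.id, s.balance)
""")

print(con.execute("SELECT * FROM accounts ORDER BY id").fetchall())
# id 1 untouched, id 2 updated to 300.00, id 3 inserted
```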

The Bet That Paid Off

When we migrated from Snowflake to DuckDB in 2024, some people thought we were crazy. "You're building an analytics company on an embedded database?"

Yes. And here's what happened:

  • Cost reduction: >70%. Infrastructure costs dropped dramatically. That savings goes directly to customers. It's why we can offer an all-in-one analytics platform starting at $250/month when competing products charge $250/month for just the BI layer.
  • Performance: 5-10x faster on typical analytical queries. Not because DuckDB is magic, but because we eliminated network hops, warehouse cold starts, and credit-based queuing. The fastest query is the one that doesn't have to wait.
  • Engineering velocity: 2x. When your query engine is an embeddable library rather than an external service, everything gets simpler. Testing is simpler (spin up an in-memory DuckDB for unit tests; see the sketch after this list). Development is simpler (every engineer has the full stack on their laptop). Debugging is simpler (one process, one log).
  • Customer experience: transformed. When dashboards load in under a second instead of 5-10 seconds, people use analytics differently. They explore more. They ask more questions. They actually use the data instead of waiting for it.
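
The testing point deserves a concrete example. A minimal sketch (the query under test is made up): the entire "warehouse" in a unit test is an in-memory DuckDB created inside the test, so there's nothing external to provision or mock.

```python
# Sketch: unit-testing analytics logic against an in-memory DuckDB.
import datetime
import duckdb

def daily_revenue(con):
    # The function under test: illustrative only.
    return con.execute("""
        SELECT order_date, SUM(amount) AS revenue
        FROM orders
        GROUP BY order_date
        ORDER BY order_date
    """).fetchall()

def test_daily_revenue():
    con = duckdb.connect(":memory:")  # fresh warehouse per test, no infrastructure
    con.execute("CREATE TABLE orders (order_date DATE, amount DOUBLE)")
    con.execute("""
        INSERT INTO orders VALUES
            (DATE '2025-01-01', 10.0),
            (DATE '2025-01-01', 5.0),
            (DATE '2025-01-02', 20.0)
    """)
    assert daily_revenue(con) == [
        (datetime.date(2025, 1, 1), 15.0),
        (datetime.date(2025, 1, 2), 20.0),
    ]

test_daily_revenue()
```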

The Bottom Line

The data warehouse market is in the middle of a correction. For a decade, the narrative was: put everything in the cloud warehouse, pay by the query, and don't worry about cost because data is your most valuable asset.

That narrative worked when datasets were small and cloud warehouse margins were fat. But datasets grew, bills ballooned, and teams started asking: do we really need to pay Snowflake $5,000/month to run dashboard queries on 200GB of data?

The answer is no. You don't.

DuckDB gives you a query engine that's faster than Snowflake for most workloads, at a fraction of the cost. DuckLake gives you lakehouse features (ACID transactions, time travel, schema evolution) with the operational simplicity of a SQL database. Together, they're a different approach to analytics infrastructure: optimized for the workloads most companies actually have, not the workloads that look good in a vendor pitch deck.

At Definite, we've built our entire platform on this stack. We're a production DuckLake user. We've battle-tested it with real customers, real data, and real workloads. And we're convinced this is the future of analytics for the vast majority of companies.

If you're paying too much for analytics infrastructure that's too slow, give us a look. We built exactly the thing you need.


Try Definite free and go from raw data to live dashboards in under 30 minutes.

Mike Ritchie is the CEO and Founder of Definite, an all-in-one analytics platform powered by DuckDB and DuckLake. You can reach him on Twitter @thisritchie or schedule a call.
