
Building a DuckLake on GCP: AlloyDB + GCS + Cloud Run


I spent an afternoon standing up a DuckLake on GCP: AlloyDB as the catalog, GCS for parquet, Cloud Run for compute. Then I loaded TPC-H SF=100 (867M rows), ran the 22 canonical queries, and compared the timings against published Snowflake numbers.

The stack

Three pieces:

| Layer | Service | Role |
|---|---|---|
| Catalog | AlloyDB (2 vCPU, zonal) | Postgres DB holding DuckLake metadata |
| Storage | Cloud Storage bucket | Parquet files, via DuckDB's httpfs extension with GCS HMAC keys |
| Compute | Cloud Run | Python + FastAPI + DuckDB with the ducklake, postgres, httpfs, and tpch extensions |

The Cloud Run container does one thing at startup:

```sql
INSTALL ducklake; INSTALL postgres; INSTALL httpfs;
LOAD ducklake; LOAD postgres; LOAD httpfs;

CREATE SECRET gcs_hmac (TYPE gcs, KEY_ID '...', SECRET '...');

ATTACH 'ducklake:postgres:dbname=ducklake_catalog host=... port=6432 user=postgres password=...' AS lake
  (DATA_PATH 'gs://your-bucket/');

USE lake;
```

Cloud Run reaches AlloyDB's private IP via Direct VPC egress (no Serverless VPC Connector required anymore). GCS goes over the internet through the S3-compat layer. Idle cost: near zero with min-instances=0. Active cost at 8 vCPU / 32 GiB: about $1/hour.
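Wired into the service, that startup block is just a handful of statements run once on a single DuckDB connection. A minimal sketch of how I'd assemble them (function name, env vars, and arguments are my own placeholders; in practice the HMAC pair should come from Secret Manager, not raw env vars):

```python
import os

def bootstrap_statements(bucket: str, pg_dsn: str) -> list[str]:
    """Build the SQL DuckDB runs once at container startup.

    Sketch only: bucket and DSN are placeholders, and the HMAC key pair
    is read from (hypothetical) env vars with '...' as a fallback.
    """
    key_id = os.environ.get("GCS_HMAC_KEY_ID", "...")
    secret = os.environ.get("GCS_HMAC_SECRET", "...")
    # Extensions first: each must be installed and loaded before ATTACH.
    stmts = [f"INSTALL {ext}; LOAD {ext};" for ext in ("ducklake", "postgres", "httpfs")]
    # GCS credentials for httpfs, then the DuckLake attach pointing at AlloyDB.
    stmts.append(f"CREATE SECRET gcs_hmac (TYPE gcs, KEY_ID '{key_id}', SECRET '{secret}');")
    stmts.append(f"ATTACH 'ducklake:postgres:{pg_dsn}' AS lake (DATA_PATH 'gs://{bucket}/');")
    stmts.append("USE lake;")
    return stmts
```

Each string then gets executed in order on the long-lived connection the FastAPI handlers share.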

Three gotchas

1. AlloyDB Managed Connection Pooling: nested flags map

Docs say default max_client_conn is 5,000, configurable up to 262,042. True, but the payload shape cost me an hour. Every example I tried (including the gcloud alpha --connection-pooling-max-client-connections=5000 flag) produced something like {"connectionPoolConfig": {"maxClientConn": "5000"}} and got silently ignored.

The correct v1 REST shape:

```json
{
  "connectionPoolConfig": {
    "enabled": true,
    "flags": {
      "pool_mode": "transaction",
      "max_pool_size": "50",
      "max_client_connections": "5000"
    }
  }
}
```

Keys inside flags are snake_case strings. Values are strings, even numeric ones. pool_mode is lowercase. Until I got this right, the pooler was stuck at PgBouncer's default 100 concurrent clients, even though the instance reported connectionPoolConfig.enabled=true.

After the fix, a single Cloud Run container opened 1,000+ concurrent clients through the pooler with backend Postgres connections holding around 100. 10x client-to-backend reduction.
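If you're scripting the update rather than hand-writing JSON, a small builder makes the shape rules hard to get wrong. A sketch (the helper name is mine, and I'm omitting the endpoint/updateMask plumbing of the actual PATCH call):

```python
import json

def pooling_patch_body(max_clients: int = 5000, pool_size: int = 50) -> str:
    """Build the instance PATCH body in the shape the v1 API accepts:
    pooler knobs nested under `flags`, keys snake_case, and every value
    a string -- even the numeric ones."""
    body = {
        "connectionPoolConfig": {
            "enabled": True,
            "flags": {
                "pool_mode": "transaction",              # lowercase
                "max_pool_size": str(pool_size),         # string, not int
                "max_client_connections": str(max_clients),
            },
        }
    }
    return json.dumps(body)
```

The `str()` coercion is the whole point: an integer here is exactly the kind of payload that gets silently ignored.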

2. DuckLake 1.0 data inlining

Small inserts don't immediately land in parquet. DuckLake 1.0 inlines them into the catalog Postgres and flushes later when they hit a size threshold, or when you call ducklake_flush_inlined_data('lake').

The first time I ran a tiny CREATE TABLE + INSERT and looked at GCS, the bucket was empty; the rows were sitting in AlloyDB. It works as designed once you know about it, and it dodges a known pain point: piles of tiny parquet files in object storage.

3. dbgen(sf, step, children) is 0-indexed

Chunked the SF=100 load into 10 slices of SF=10 so the generator didn't have to hold 100 GB in RAM at once. Natural loop: for step in range(1, 11):. Every table came out 10% short. nation had 22 rows instead of 25, region had 4 instead of 5. That gave it away.

CALL dbgen(sf=100, step=N, children=K) wants N in 0..K-1, not 1..K. Step 10 with children=10 is out of range and silently produces nothing. range(CHUNKS) fixes it.
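The corrected loop, sketched as a statement generator (names are mine; each statement would be executed sequentially on the DuckDB connection with the tpch extension loaded):

```python
CHUNKS = 10

def dbgen_calls(sf: int = 100, chunks: int = CHUNKS) -> list[str]:
    """Chunked TPC-H load: step is 0-indexed, so valid values are
    0..children-1. range(chunks) is correct; range(1, chunks + 1)
    drops slice 0 and asks for an out-of-range step at the end,
    which silently generates nothing."""
    return [
        f"CALL dbgen(sf={sf}, children={chunks}, step={step});"
        for step in range(chunks)  # 0..chunks-1, NOT 1..chunks
    ]
```

A quick sanity check after loading: nation should have exactly 25 rows and region 5, at any scale factor — the first numbers to look at if a chunked load went wrong.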

TPC-H SF=100 results

Data loaded via a Cloud Run Job (8 vCPU, 32 GiB, 30 minutes). Queries run sequentially through the public Cloud Run service, timed client-side, against a fresh revision for the cold case.

| Engine | SF=100 total (s) | Note |
|---|---|---|
| Snowflake X-Small (native, warm) | ~102 | Published |
| BigQuery native, first run | ~136 | Published |
| DuckDB local parquet, 4 vCPU / 32 GB | 166 | r6id.xlarge, published |
| DuckDB local parquet, 8 vCPU / 64 GB | ~240 | Published |
| Our stack, 8 vCPU / 32 GiB, warm best | ~410 | |
| DuckDB local parquet, 2 vCPU / 16 GB | 571 | r6id.large, published |
| Databricks 2x-Small (Delta) | ~700 | Published |
| Our stack, warm median | 912 | |
| Our stack, cold first run | ~1,800 | |

About 4x slower than Snowflake X-Small on native format. Beats Databricks 2x-Small on Delta. Same ballpark as local-disk DuckDB at similar core counts. Against Snowflake reading external Iceberg parquet (historically 2-3x slower than their native format), the gap closes to roughly 1.5x.

Cold runs pay the biggest tax: with remote parquet on GCS, the first access pays for DuckLake metadata fetches from AlloyDB and per-file opens against the bucket. Warm runs reuse DuckDB's file cache and hash tables, which is where most of the speedup lives.

What moved the needle

In order of impact:

  1. Cloud Run RAM 16 → 32 GiB: 4-8x faster on multi-way joins (Q3, Q4, Q13). At 16 GiB those queries spilled to disk and warm runs didn't help because DuckDB had to rehydrate hash tables every time. At 32 GiB, everything fit. Biggest single-variable win.
  2. Managed Connection Pooling configured correctly: 100 → 1,000+ concurrent clients. Doesn't speed up TPC-H (single-query workload), but it's the difference between this stack scaling past a handful of users and not.
  3. File compaction (CALL ducklake_merge_adjacent_files(...)): 21x faster on pure-scan queries like Q6, marginal on join-heavy ones. The 10-chunk load left lineitem with 51 parquet files. Compaction plus ducklake_cleanup_old_files cut total files from 122 to 65.

Permissions: DuckLake's biggest gap

DuckLake doesn't ship with access control. By design: enforcement lives in the two systems it leans on (the catalog database and the object store).

On the catalog side, AlloyDB gives you standard Postgres 15 GRANT/REVOKE and ENABLE ROW LEVEL SECURITY. The official DuckLake role pattern:

```sql
-- reader: can ATTACH and SELECT
CREATE USER ducklake_reader WITH PASSWORD '...';
GRANT USAGE ON SCHEMA public TO ducklake_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO ducklake_reader;

-- writer: can also INSERT/UPDATE/DELETE
CREATE USER ducklake_writer WITH PASSWORD '...';
GRANT USAGE ON SCHEMA public TO ducklake_writer;
GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO ducklake_writer;
```

Postgres RLS on catalog tables like ducklake_table and ducklake_column can hide metadata based on the connecting user:

```sql
ALTER TABLE ducklake_table ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_tables ON ducklake_table
  FOR SELECT USING (schema_name = current_setting('app.tenant'));
```

The catch: the catalog only tells DuckDB which parquet files exist. DuckDB reads those files directly from GCS with its own HMAC secret. If that secret leaks, every policy on the catalog is bypassable. Catalog RLS hides metadata. It doesn't enforce data access.

What actually contains a tenant:

  1. Per-tenant schema or catalog DB. Each tenant gets its own connection string and its own ATTACH alias. Simplest model. Scales to a few dozen tenants cleanly.
  2. Per-tenant GCS prefix plus scoped IAM. Separate HMAC keys per tenant, bucket policies that only allow reads under gs://.../tenant=X/. Combined with the catalog split, that's a real boundary.
  3. Proxy the queries. Clients never hold the catalog credential or the HMAC. An API in front authenticates the user, rewrites the SQL (WHERE tenant_id = ?), and returns results. This is what we run at Definite.
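The rewrite step in option 3 can be sketched very simply, under two big assumptions: every exposed table or view carries a tenant_id column, and the incoming SQL has already been validated as a single read-only SELECT (a real proxy needs a parser or an allowlist for that). The function name is mine:

```python
def scope_query(user_sql: str, tenant_id: str) -> tuple[str, list[str]]:
    """Wrap an already-validated, read-only user query so every row is
    filtered by tenant. The tenant value travels as a bound parameter,
    never string-interpolated, so the client can't inject around it."""
    inner = user_sql.strip().rstrip(";")
    wrapped = f"SELECT * FROM ({inner}) AS q WHERE q.tenant_id = ?"
    return wrapped, [tenant_id]
```

The proxy then runs the wrapped query with its own catalog credential and HMAC secret; the client never sees either.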

Row-level filtering is table stakes for multi-tenant analytics. DuckLake leaves it to the layers underneath, which means you have to design access control in from day one.

Where it nets out

You're not going to beat Snowflake's native columnar format on cold first-run queries. That format is the moat.

You can get within 2-4x of Snowflake on the same public benchmark using only managed GCP services, on a box that idles at $0, for 5-10% of the monthly cost.

The lakehouse pattern is simple: columnar storage on object store, fast single-node engine, metadata in boring Postgres. DuckLake is the cleanest implementation I've seen. It works on AWS, Azure, or GCP. It works on commodity hardware. DuckLake 1.0 shipped with a stable on-disk format.

If you're running a 5-figure Snowflake bill and most of your queries would fit in 32 GiB of RAM, this shape is worth a Saturday.

Your answer engine is one afternoon away.

Book a 30-minute call. We'll build your first dashboard on the call — or you can stop paying us.