June 2, 2025 · 10 minute read

DuckLake: This is your Data Lake on ACID

Mike Ritchie

How to build an easy (but not petabyte-scale) data lake with DuckLake, DuckDB's open-source Lakehouse extension

Introduction

In modern data architectures, we often face a dilemma: do we choose the flexibility and cost-effectiveness of data lakes, or the reliability, simplicity and performance of data warehouses? What if you could have both?

Enter DuckLake - an open-source Lakehouse extension from DuckDB that brings ACID transactions, time travel, and schema evolution to an easy-to-set-up data lake. In this post, I'll show you how we built a real-time event processing pipeline that demonstrates DuckLake's capabilities.

The Challenge

Imagine you're building an analytics platform that needs to:

  • Ingest millions of events per day from your application
  • Provide real-time analytics with sub-second query performance
  • Allow historical analysis and debugging with time travel
  • Maintain data consistency with ACID guarantees
  • Keep costs low with efficient storage

Enter DuckLake

DuckLake solves these challenges by combining:

  • Parquet files for efficient, open storage
  • PostgreSQL for ACID-compliant metadata management
  • DuckDB for blazing-fast analytical queries
  • Git-like versioning for time travel and auditing

Let's build a production-ready event pipeline to see it in action. All the code for this tutorial is available on GitHub.
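
Concretely, wiring these pieces together is a single ATTACH: DuckDB loads the ducklake extension, points at a PostgreSQL database for catalog metadata, and writes Parquet files to an object-store path. Here's a minimal sketch from Python - the connection string, bucket path, and table schema are placeholders rather than the tutorial's exact setup:

import duckdb

con = duckdb.connect()
con.execute("INSTALL ducklake; LOAD ducklake;")

# Metadata lives in Postgres; data files land in the bucket as Parquet
con.execute("""
    ATTACH 'ducklake:postgres:dbname=ducklake_catalog host=localhost' AS lake
        (DATA_PATH 'gs://your-bucket/ducklake/')
""")

# Tables in the attached catalog behave like ordinary SQL tables, with ACID semantics
con.execute("""
    CREATE TABLE IF NOT EXISTS lake.events (
        event_id UUID,
        event_type VARCHAR,
        event_timestamp TIMESTAMPTZ,
        details JSON
    )
""")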

Architecture Overview

Our pipeline consists of three main components:

  • An event generator that writes synthetic application events to a source PostgreSQL table
  • An incremental ETL job built with dlt that stages new events in a local DuckDB file
  • A DuckLake catalog that stores the events as Parquet files with PostgreSQL-backed metadata

Step 1: Generating Some Data

First, we need a stream of realistic events. Our insert_event_data_loop.py script simulates a production application generating various event types:

import random
import uuid

EVENT_TYPES = {
    "user_login": {
        "weight": 35,  # Most common event
        "details_generator": lambda: {
            "user_id": f"user_{random.randint(1000, 9999)}",
            "ip_address": f"{random.randint(1, 255)}.{random.randint(0, 255)}.{random.randint(0, 255)}.{random.randint(0, 255)}",
            "session_id": str(uuid.uuid4()),
            "login_method": random.choice(["password", "oauth_google", "oauth_github", "sso"])
        }
    },
    "api_request": {
        "weight": 25,
        "details_generator": lambda: {
            "endpoint": random.choice(["/api/v1/users", "/api/v1/products", "/api/v1/orders"]),
            "method": random.choice(["GET", "POST", "PUT", "DELETE"]),
            "status_code": random.choices([200, 201, 400, 404, 500], weights=[50, 10, 10, 15, 5])[0],
            "response_time_ms": random.randint(10, 2000)
        }
    },
    # ... more event types
}

The script runs continuously, inserting batches of events with realistic distributions:

# Generate 1000 events every 5 seconds
uv run insert_event_data_loop.py --continuous --rows 1000 --delay 5
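
Under the hood, each batch is just weighted sampling over EVENT_TYPES plus a bulk insert into the source Postgres table. Roughly like this - a sketch rather than the script's exact code, with ducklake_src.events matching the source table the pipeline reads from below:

import json
import random
import uuid
from datetime import datetime, timezone

def generate_batch(n: int) -> list[tuple]:
    """Pick event types by weight and build one row per event."""
    names = list(EVENT_TYPES)
    weights = [EVENT_TYPES[name]["weight"] for name in names]
    rows = []
    for _ in range(n):
        event_type = random.choices(names, weights=weights)[0]
        details = EVENT_TYPES[event_type]["details_generator"]()
        rows.append((str(uuid.uuid4()), event_type,
                     datetime.now(timezone.utc), json.dumps(details)))
    return rows

# e.g. with psycopg2:
# cur.executemany(
#     "INSERT INTO ducklake_src.events (event_id, event_type, event_timestamp, details) "
#     "VALUES (%s, %s, %s, %s)",
#     generate_batch(1000))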

Step 2: Incremental ETL with State Management

We built a pipeline in events_pipeline.py, which uses dlt (Data Load Tool) for incremental extraction:

import dlt
from dlt.sources.sql_database import sql_table
from datetime import datetime, timezone

def extract_events_incremental(src_pg_conn: str, staging_db_file: str) -> dict:
    """Extract the events table incrementally using dlt."""
    # Configure incremental loading on event_timestamp
    events_table = sql_table(
        credentials=src_pg_conn,
        schema="ducklake_src",
        table="events",
        incremental=dlt.sources.incremental(
            "event_timestamp",
            initial_value=datetime(2020, 1, 1, tzinfo=timezone.utc)
        )
    )

    # Create pipeline with state persistence
    pipeline = dlt.pipeline(
        pipeline_name="events_incremental",
        destination=dlt.destinations.duckdb(staging_db_file),
        dataset_name="staging"
    )

    # Run the pipeline - only new records are extracted!
    load_info = pipeline.run(events_table, write_disposition="append")
    return load_info

The pipeline automatically:

  • Tracks state between runs
  • Extracts only new records based on event_timestamp
  • Handles failures with automatic retries
  • Provides metrics on rows processed
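
The incremental cursor is easy to see in action: run the extract twice, and the second run pulls only events newer than the stored high-water mark. A hypothetical driver (connection strings and file names are placeholders):

SRC_PG = "postgresql://user:pass@localhost:5432/app"   # source application database
STAGING_DB = "staging_events.duckdb"

# First run backfills everything since the configured initial_value
print(extract_events_incremental(SRC_PG, STAGING_DB))

# Second run consults dlt's stored state and extracts only newer event_timestamp values
print(extract_events_incremental(SRC_PG, STAGING_DB))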

Step 3: Promoting to DuckLake

Once staged, we promote events to DuckLake with full ACID guarantees:

def promote_events_to_lake(staging_db_file: str, catalog_conn: str, data_path: str) -> int:
    """Promote staged events to DuckLake."""
    conn = _connect(staging_db_file)

    # Attach the DuckLake catalog (aliased as `lake` here)
    attach_duck_lake(conn, catalog_conn, data_path)

    # Insert new events with deduplication
    result = conn.execute("""
        INSERT INTO lake.events
        SELECT s.* FROM staging.events s
        WHERE NOT EXISTS (
            SELECT 1 FROM lake.events l
            WHERE l.event_id = s.event_id
        )
    """)

    # The transaction commits automatically and DuckLake records a new snapshot version
    return result.fetchone()[0]  # number of rows inserted
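
Putting the two steps together, a scheduled run boils down to extract-then-promote. A hypothetical driver (the connection strings and paths are placeholders):

def run_events_pipeline():
    src_pg = "postgresql://user:pass@localhost:5432/app"           # source application database
    catalog_pg = "postgresql://user:pass@localhost:5432/ducklake"  # DuckLake metadata catalog
    staging_db = "staging_events.duckdb"
    data_path = "gs://your-bucket/ducklake/"

    extract_events_incremental(src_pg, staging_db)
    inserted = promote_events_to_lake(staging_db, catalog_pg, data_path)
    print(f"Promoted {inserted} new events to DuckLake")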

Step 4: Querying with Time Travel

Now the magic happens. With DuckLake, we can query our data at any point in time:

-- Current data
SELECT event_type, COUNT(*) as count
FROM lake.events
WHERE event_timestamp > NOW() - INTERVAL '1 hour'
GROUP BY event_type;

-- Historical data from yesterday
SELECT * FROM lake.events 
AT (TIMESTAMP => NOW() - INTERVAL '1 day');

-- Compare versions to see what changed
FROM lake.table_changes('events', 100, 150)
WHERE change_type = 'insert' 
  AND details->>'severity' = 'critical';

-- Investigate a specific user's journey
SELECT * FROM lake.events AT (VERSION => 42)
WHERE details->>'user_id' = 'user_1234'
ORDER BY event_timestamp;
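
The version numbers passed to AT (VERSION => ...) are DuckLake snapshot ids, one per committed change. Assuming the catalog-scoped snapshots() listing (check the DuckLake docs for your version), you can look them up from Python before pinning a query; the connection string and bucket path are placeholders:

import duckdb

con = duckdb.connect()
con.execute("INSTALL ducklake; LOAD ducklake;")
con.execute("""
    ATTACH 'ducklake:postgres:dbname=ducklake_catalog host=localhost' AS lake
        (DATA_PATH 'gs://your-bucket/ducklake/')
""")

# One snapshot per commit, with its id and timestamp
print(con.sql("FROM lake.snapshots()"))

# Pin an analysis to a specific snapshot id found above
rows = con.sql("SELECT * FROM lake.events AT (VERSION => 42) LIMIT 10").fetchall()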


Performance and Storage Optimization

DuckLake automatically optimizes storage through:

1. Automatic File Compaction

Small files are merged into larger ones for better performance:

SELECT * FROM ducklake_merge_adjacent_files('lake');

2. Column Statistics

Min/max values and null counts enable efficient query pruning:

-- DuckLake automatically maintains these statistics
SELECT * FROM lake.ducklake_table_column_stats;

3. Partitioning

Organize data by commonly filtered columns:

ALTER TABLE lake.events SET PARTITIONED BY (event_date);

Real-World Use Cases

This architecture is perfect for:

1. Event Sourcing

  • Immutable event log with full history
  • Replay events to rebuild state
  • Debug production issues with time travel

2. Change Data Capture (CDC)

  • Track all changes with table_changes() function
  • Build incremental data pipelines
  • Audit trail with who changed what and when

3. Real-Time Analytics

  • Sub-second queries on recent data
  • Historical comparisons and trends
  • No need for separate OLAP systems

4. Data Science Workflows

  • Experiment on historical snapshots
  • Reproduce results with version pinning
  • Schema evolution for new features

Production Considerations

Monitoring and Maintenance

import schedule  # call schedule.run_pending() in a long-running loop to execute these

# Schedule regular maintenance
schedule.every(1).hours.do(run_events_pipeline)
schedule.every(1).days.do(expire_old_snapshots)
schedule.every(1).weeks.do(compact_small_files)
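
The expire_old_snapshots and compact_small_files helpers can be thin wrappers around DuckLake's maintenance calls. A sketch, assuming the ducklake_expire_snapshots and ducklake_merge_adjacent_files functions described in the DuckLake docs (verify names and arguments against your version; the connection string and bucket path are placeholders):

import duckdb

def _lake_connection() -> duckdb.DuckDBPyConnection:
    con = duckdb.connect()
    con.execute("LOAD ducklake;")
    con.execute("ATTACH 'ducklake:postgres:dbname=ducklake_catalog host=localhost' AS lake "
                "(DATA_PATH 'gs://your-bucket/ducklake/')")
    return con

def expire_old_snapshots():
    # Drop snapshot history older than a week so unreferenced Parquet files can be cleaned up
    _lake_connection().execute(
        "CALL ducklake_expire_snapshots('lake', older_than => now() - INTERVAL '7 days')")

def compact_small_files():
    # Merge small adjacent Parquet files into larger ones
    _lake_connection().execute("CALL ducklake_merge_adjacent_files('lake')")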

Cost Optimization

  • Storage: Parquet's columnar compression typically shrinks event data by 10x or more versus raw JSON
  • Compute: Query only what you need with column pruning
  • Network: Keep data close to compute (same region)

Scaling Strategies

  1. Horizontal: Partition by time or business dimension
  2. Vertical: Use column statistics for selective reads
  3. Caching: Leverage DuckDB's intelligent caching

Conclusion

DuckLake bridges the gap between data lakes and data warehouses, giving you:

  • Flexibility of open formats (Parquet)
  • Reliability of ACID transactions
  • Performance of columnar analytics
  • Simplicity of SQL interfaces

Our event processing pipeline demonstrates how easy it is to build production-ready data infrastructure without vendor lock-in or operational complexity.

Getting Started

Ready to build your own Lakehouse? Here's how:

# Clone the example code
git clone https://github.com/definite-app/ducklake-speedrun
cd ducklake-speedrun

# Install dependencies with uv
uv pip install -r requirements.txt

# Set up your environment
cp .env.example .env
# Edit .env with your connection details

# Start generating events
uv run insert_event_data_loop.py --continuous --rows 1000 --delay 5

# In another terminal, run the pipeline
uv run events_pipeline.py --run --data-path gs://your-bucket/ducklake/

# Query your data with DuckDB
duckdb
> ATTACH 'ducklake:postgres://...' AS lake (DATA_PATH 'gs://your-bucket/ducklake/');
> SELECT * FROM lake.events LIMIT 10;

What's Next?

We're building a data platform at Definite with DuckDB at its core. If you're excited about DuckLake and the possibilities of modern data infrastructure, you might be interested in what we're building. Check out our platform to see how we're making data analytics more accessible.

Resources

Have questions? Reach out on Twitter @thisritchie.
