Query any ducking thing: postgres, parquet, json, csv, and Google Sheets

4/5/2024

Mike Ritchie

Yeah, that's a lot of things to query. I've never needed to query all five of these sources at once, but I did need 3. I needed to join our Posthog logs (we export these to AWS as JSONL files) to our Supabase data (we use Supabase as a Postgres database). I also wanted to join in some data from a Google Sheet.

What's DuckDB?

DuckDB is a versatile, high-performance, in-memory analytical database that supports querying a wide range of data sources and formats. It's open-source and we use it as the data warehouse at Definite. A huge bonus with DuckDB is it's integrations and file support. It seamlessly interacts with various data sources, including PostgreSQL, Parquet, JSON, CSV, and even Google Sheets. Here's the query:

ATTACH 'postgres://your_username_password_host:5432/postgres' AS pgdb (TYPE postgres, READ_ONLY);
with interests as (
    select *
    from
    read_parquet('https://storage.googleapis.com/duck-demo-data/user_interests.parquet')
), user_preferences as (
    select *
    from read_json_auto(
        'https://storage.googleapis.com/duck-demo-data/user_preferences.jsonl',
        format = 'newline_delimited',
        records = true
        )
), user_details as (
    SELECT * FROM pgdb.user_details
), users as (
    select *
    from read_csv_auto('https://storage.googleapis.com/duck-demo-data/users.csv')
), one_more_thing as (
    SELECT *
    FROM read_csv_auto(
        'https://docs.google.com/spreadsheets/export?format=csv&id=1O-sbeSxCpzhzZj5iTRnOplZX-dIAiQJAeIO0mlh2kSU',
        normalize_names=True
        )
)
select
users.user_id,
users.name,
interests.interest,
user_preferences.theme,
user_preferences.language,
user_details.hobby,
one_more_thing.one_more_thing
from users left join
interests on users.user_id = interests.user_id left join
user_preferences on users.user_id = user_preferences.user_id left join
user_details on users.user_id = user_details.user_id left join
one_more_thing on users.user_id = one_more_thing.user_id

If you run this query on the DuckDB CLI, everything but Postgres will work because the GCS and Google Sheets URLs are public. Here's a version without Postgres:

with interests as (
    select *
    from
    read_parquet('https://storage.googleapis.com/duck-demo-data/user_interests.parquet')
), user_preferences as (
    select *
    from read_json_auto(
        'https://storage.googleapis.com/duck-demo-data/user_preferences.jsonl',
        format = 'newline_delimited',
        records = true
        )
), users as (
    select *
    from read_csv_auto('https://storage.googleapis.com/duck-demo-data/users.csv')
), one_more_thing as (
    SELECT *
    FROM read_csv_auto(
        'https://docs.google.com/spreadsheets/export?format=csv&id=1O-sbeSxCpzhzZj5iTRnOplZX-dIAiQJAeIO0mlh2kSU',
        normalize_names=True
        )
)
select
users.user_id,
users.name,
interests.interest,
user_preferences.theme,
user_preferences.language,
one_more_thing.one_more_thing
from users left join
interests on users.user_id = interests.user_id left join
user_preferences on users.user_id = user_preferences.user_id left join
one_more_thing on users.user_id = one_more_thing.user_id