The 2026 Math on AI Agents in Snowflake — and Why a Budget Cap Won't Fix It


Snowflake is the right warehouse for analyst dashboards and the wrong place to let AI agents run wild.

Same compute. Same SKU. Wildly different economics. Per Anthropic's published engineering telemetry, agents emit roughly 4× the tokens of a chat interaction — and every retry is a new billable Cortex Analyst message at ~$0.13 a pop, plus a Cortex Search hit, plus a warehouse spin-up. Analyst dashboards (modeled, parameterized, cached) were never going to break your bill. Agents will. The leading indicator is already public: in April 2026, Uber exhausted its full-year 2026 AI budget in four months once Claude Code adoption ran ahead of forecast (as reported by The Information). That was engineer laptops, not data agents on a warehouse — but the cost mechanic rhymes.

If you're the CTO at a bank already burning a few million a year on Snowflake, the day someone wires Cortex Agents — or Claude, or Cursor, or your homegrown "ask the data" Slack bot — into that warehouse and turns it loose is the day your forecast breaks. Twenty thousand exploratory queries before lunch. The forecast you signed off on six weeks ago is no longer recognizable. You're not getting fired this quarter, but you're not getting the budget you wanted next year, either.

The fix isn't a budget setting or a smaller warehouse size. It's the realization that analyst dashboards and AI agents are fundamentally different workloads on the same SKU. Keep Snowflake for the work it's actually good at. Put agents somewhere else.

Here's the 2026 math, the architecture, and the decision framework for where each agent workload should actually run.

In one screen

  • Analyst dashboards and AI agents emit fundamentally different query patterns. Per-credit warehouse SKUs price one well and the other badly.
  • By Anthropic's published telemetry, agents emit ~4× the tokens of a chat interaction; multi-agent systems ~15× (source). That amplification falls directly onto warehouse spin-ups and AI Credit consumption.
  • The architectural fix: a thin, hot compute layer (DuckDB on Duck Lake) that runs alongside Snowflake for agent workloads, not instead of it. The warehouse stays for analysts.
  • 2026 worked example below: 1,000 user questions/day at a typical agent loop = ~$18.5K/month in Cortex Analyst + Cortex Search alone, before warehouse compute.
  • Decision framework: customer-facing real-time → warehouse. Internal exploratory agent → complementary layer. Scheduled batch → either, with caveats.
  • Hard objections — governance, freshness, "what happens when the agent is wrong" — are first-class concerns of this architecture, not appendix material.

Two workloads on one warehouse

The structural argument has to land before any cost number, because the cost numbers don't make sense until the workload mismatch does.

What an analyst query looks like

An analyst opens a dashboard. The dashboard fires three to ten queries — modeled, parameterized, predictable. The analyst reads the result, asks a follow-up, fires three to ten more. Maybe twenty to fifty dashboard loads per business day per analyst. Maybe a few ad-hoc explorations.

The query shape is steady. The queries are written by humans, modeled in dbt, materialized where they need to be, cached where they should be. The cost per dashboard is something you can model and budget for. Snowflake's per-credit pricing was designed for this. A warehouse spins up, scans columnar storage, hands back an answer, idles down. Per-second billing, one-minute minimum spin-up. It's a clean fit.

What an agent task looks like

An agent gets a question. It needs to know the schema. It runs describe, select * limit 100, maybe a couple of show columns. It picks an approach. It writes a query. The query fails. It rewrites the query. It tries again. It composes a second query against a different table. It joins the two. It retries on a stale timestamp filter. It returns an answer. Then the next user question hits, and it does it all again.

That fan-out is the structural problem. Per Anthropic's engineering blog, agents use roughly four times the tokens of a chat interaction. Multi-agent orchestrations — agents spawning sub-agents, parallel synthesis, the kind of architecture used in deep-research systems — push that to roughly fifteen times. Most internal data agents land closer to the 4× figure, but every retry and every tool call compounds. Each token isn't just an API charge: in a Snowflake-anchored architecture, it's a billable Cortex Analyst message, often a Cortex Search hit, frequently a warehouse spin-up. Most of those queries are exploratory waste. The agent doesn't know which queries were useful until after it sees the answers.
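
Here is that loop as a minimal sketch in code. The helpers are hypothetical stand-ins, not real Snowflake APIs; the point is the billing mechanics: every exploration probe is a warehouse query, and every answer attempt is a billable message.

```python
MESSAGE_COST = 0.13  # Cortex Analyst: 67 AI Credits / 1,000 messages x $2.00/credit

def run_warehouse_sql(sql: str) -> list:
    """Hypothetical stand-in: fires a warehouse query (spin-up + scan)."""
    return []

def cortex_analyst(question: str) -> str:
    """Hypothetical stand-in: one billable Cortex Analyst message per call."""
    return "draft answer"

def looks_right(answer: str) -> bool:
    """Hypothetical stand-in for the agent's self-check."""
    return False

def answer_question(question: str, max_attempts: int = 5) -> tuple[str, float]:
    spend = 0.0
    # Exploration phase: each probe is a separate warehouse query.
    run_warehouse_sql("DESCRIBE TABLE sales.pipeline")
    run_warehouse_sql("SELECT * FROM sales.pipeline LIMIT 100")
    answer = ""
    for _ in range(max_attempts):
        answer = cortex_analyst(question)  # bills whether or not it's useful
        spend += MESSAGE_COST
        if looks_right(answer):
            break
        question += " (last attempt failed; refine the approach)"
    return answer, spend  # worst case: 5 messages = $0.65 before any compute
```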

| Dimension | Analyst dashboard | AI agent task |
| --- | --- | --- |
| Queries per task | 3–10, modeled | Many — schema-inspect, retry, exploration |
| Token amplification (vs. chat) | 1× baseline | 4× single-agent, 15× multi-agent (Anthropic, 2025) |
| Query shape | Modeled, parameterized | Exploratory, schema-inspection-heavy |
| Retry behavior | Rare | Frequent — each retry is a new billable Cortex Analyst message |
| Cost predictability | High (modeled, cached) | Low (depends on the path the agent takes) |
| Freshness need | Daily–hourly OK | Often hours–days OK; sometimes live |
| Concurrency | Bounded by team size | Unbounded — autonomous, async, runs while you sleep |

The two columns aren't just different sizes of the same workload. They're different shapes. A SKU built around the left column will price the right column badly.

The 2026 Snowflake math nobody wants to do

A quick disclaimer: the same structural argument applies to BigQuery and Databricks for similar reasons. Different SKU shapes, same workload-mismatch logic. I'm anchoring on Snowflake because it's the most common warehouse in this conversation, the pricing is unusually well-documented, and Snowflake itself has been doing the most public work on agent-in-warehouse architecture (Cortex Analyst, Cortex Search, Cortex Agents, Cortex Code, Snowflake Intelligence). If your warehouse is BigQuery, mentally substitute "per-query bytes scanned" for "credits" and the rest of this section reads the same.

AI Credits at $2 — predictable price, unpredictable usage

Snowflake's 2026 AI Credits SKU is flat $2.00 per AI Credit globally, $2.20 regional (Service Consumption Table effective 2026-05-05, Table 2(b)). That's a separate currency from regular Snowflake Credits, which range from $2.00 (Standard, US East) up to $9.30 (VPS Switzerland) for warehouse compute. Don't confuse them.

The marketing was "predictable AI pricing." The pricing is predictable. The usage is not. AI Credits are a flat sticker. The bill is sticker × consumption. Agents emit a lot of consumption.

The good news: Snowflake's April 2026 update broke AI_SERVICES out as its own service type in METERING_HISTORY, with dedicated CORTEX_ANALYST_USAGE_HISTORY, CORTEX_FUNCTIONS_USAGE_HISTORY, and CORTEX_SEARCH_DAILY_USAGE_HISTORY views. You can attribute AI spend properly now. The bad news: you only see the spike after the bill is racked up. Attribution is a forensic tool, not a budget tool.
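
If you want that forensic view today, the shortest path is a query against METERING_HISTORY, sketched here via the Python connector. Column names follow the standard ACCOUNT_USAGE shape; verify them against your account before relying on this:

```python
import snowflake.connector

ATTRIBUTION_SQL = """
    SELECT DATE_TRUNC('day', start_time) AS day,
           name                          AS ai_service,
           SUM(credits_used)             AS ai_credits
    FROM snowflake.account_usage.metering_history
    WHERE service_type = 'AI_SERVICES'
      AND start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
    GROUP BY 1, 2
    ORDER BY 1, 3 DESC
"""

conn = snowflake.connector.connect(
    account="your_account", user="your_user", password="your_password",
)
for day, service, credits in conn.cursor().execute(ATTRIBUTION_SQL):
    # AI Credits are a flat $2.00, so dollars are a straight multiply.
    print(day, service, f"{credits:,.1f} credits (~${credits * 2:,.2f})")
```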

Cortex Analyst — every retry is a new billable message

Cortex Analyst bills 67 AI Credits per 1,000 messages, billed proportionally per message (CCT Table 6(g)). At $2 per AI Credit, that's about $0.13 per Cortex Analyst message.

Every agent retry counts as a new billable message. An agent loop that does five Cortex Analyst calls per user task — one initial answer attempt, four refinements as it learns the schema and tries again — costs about $0.65 per user task in Cortex Analyst alone, before any warehouse compute, before any Cortex Search hit.

That's not bad if you have a hundred user tasks a day. It is a problem at a thousand user tasks a day. It's an emergency at ten thousand.

Cortex Search — the idle tax that compounds with every domain

This is the one that surprises people.

Cortex Search Serving bills 6.3 AI Credits per GB per month of indexed data (primary source). At $2 per AI Credit, that's about $12.60 per GB per month. A Cortex Search service indexing a typical 70 GB costs roughly $880 per month idle — even if no agent ever queries it.

Agents don't share search services across domains. Each agentic use case typically gets its own indexed corpus — sales, product, ops, support, legal. Five domains at 70 GB each is roughly $4,400 per month in idle serving compute — paying for searches nobody ran, before a single query lands. (Cortex Search also bills EMBED_TEXT tokens on every row insert or update, so any agent driving operational writes compounds the embedding cost on every refresh.)

A worked example: 1,000 questions a day, typical agent loop

Set up a realistic case. An internal "ask the data" agent answering 1,000 user questions per day. Three domain-specific Cortex Search services backing it (sales, product, ops), each indexing roughly 70 GB. The variable that matters most is messages per question — and that's a real distribution, not a single number. A well-tuned setup with a clean semantic model averages 2 messages per question. A typical setup averages 3–4. A bad case (no semantic layer, the agent retrying through schema chatter) hits 5+.

At the typical case (4 messages per question):

  • Cortex Analyst: 1,000 questions × 4 messages × $0.13 = $520 per day = ~$15,800 per month
  • Cortex Search idle compute: 3 domains × $880 = $2,640 per month, before a single query runs
  • Cortex Search query compute and embedding: additional, scales with traffic
  • Warehouse spin-ups for the agent's exploratory queries: the describe, select * limit 100, the schema introspection — additional credits at the regular Snowflake Credit rate, with a one-minute minimum spin per warehouse (CCT Table 1)

That puts the typical case at ~$18,500 per month minimum for this single agent. At the bad case (5+ messages per question and no semantic layer guiding retries), Cortex Analyst alone passes $20K/month and the total clears $25K. At the well-tuned end (2 messages, modeled metrics), Cortex Analyst lands closer to $8K/month — which is reasonable, but it's reasonable because someone built the semantic layer that the agent is leaning on, and most teams haven't.
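
Here is that floor as a script you can rerun with your own volumes. Rates are the 2026 CCT figures cited above (slightly less rounded than the prose, hence the small differences); the volumes are the knobs:

```python
AI_CREDIT_USD       = 2.00
ANALYST_MSG_USD     = 67 / 1_000 * AI_CREDIT_USD   # ~$0.134 per message
SEARCH_GB_MONTH_USD = 6.3 * AI_CREDIT_USD          # ~$12.60 per GB-month
DAYS_PER_MONTH      = 30

def monthly_floor(questions_per_day: int, msgs_per_question: float,
                  search_domains: int, gb_per_domain: float) -> float:
    """Cortex Analyst + Cortex Search idle serving, before warehouse compute."""
    analyst = (questions_per_day * msgs_per_question
               * ANALYST_MSG_USD * DAYS_PER_MONTH)
    search_idle = search_domains * gb_per_domain * SEARCH_GB_MONTH_USD
    return analyst + search_idle

for label, msgs in [("well-tuned", 2), ("typical", 4), ("bad case", 5)]:
    cost = monthly_floor(1_000, msgs, search_domains=3, gb_per_domain=70)
    print(f"{label:>10}: ~${cost:,.0f}/month")
# well-tuned: ~$10,686/month   typical: ~$18,726/month   bad case: ~$22,746/month
```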

That's the steady-state math. For market context: Sema4.ai's Team Edition — a vendor purpose-built for running agents inside Snowflake — charges $15 per agent per day, plus Snowflake infrastructure costs (SPCS and Cortex). The fact that a Snowflake-native agent vendor explicitly bills the warehouse layer separately is a useful signal about how the cost stack behaves: even when someone is selling you "agents inside Snowflake," they aren't pretending the architecture is cost-flat.

[Chart: three architectures compared, per 1,000 agent tasks per day — Snowflake-only typical case at ~$18.5K/month (4 messages per question); Snowflake-only well-tuned at ~$10.5K/month (2 messages per question with a strong semantic model); Snowflake plus a complementary compute layer at ~$5K/month (1 final answer hits Snowflake, 3 retries on hot DuckDB).]

The same agent on a complementary architecture — keeping Snowflake as the system of record but running the exploratory retries on a hot DuckDB-backed compute layer — moves the message-amplification cost off the per-credit SKU. If 3 of the 4 messages move to the complementary layer, Cortex Analyst spend drops 75%. The final canonical answer can hit Snowflake (or a cached deterministic version of it). The retries, schema inspections, and self-corrections happen at near-zero per-query cost on a beefy single-node machine.
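
A minimal sketch of the savings from that split, with constants restated so it stands alone:

```python
AI_CREDIT_USD   = 2.00
ANALYST_MSG_USD = 67 / 1_000 * AI_CREDIT_USD   # ~$0.134 per message
QUESTIONS_PER_DAY, DAYS = 1_000, 30

msgs_total, msgs_on_snowflake = 4, 1   # 3 retries move to the hot layer
before = QUESTIONS_PER_DAY * msgs_total        * ANALYST_MSG_USD * DAYS
after  = QUESTIONS_PER_DAY * msgs_on_snowflake * ANALYST_MSG_USD * DAYS
print(f"Cortex Analyst: ${before:,.0f} -> ${after:,.0f}/month "
      f"({1 - after / before:.0%} drop)")   # $16,080 -> $4,020 (75% drop)
```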

Where agent queries should actually run

The architecture isn't complicated. The framing for it is what matters.

Snowflake stays. Add a hot layer next to it.

[Diagram: analysts hit Snowflake directly for modeled queries; AI agents go through a governed semantic layer to a DuckDB compute layer over a Duck Lake catalog, with segments synced from Snowflake on a schedule and a fallback path to Cortex for live or high-stakes answers.]

Snowflake stays as the system of record and the analyst layer. Nothing changes about how your dashboards work, how your dbt models build, how your finance team runs the close. The warehouse is still the warehouse.

Next to it, you put a thin, hot compute layer: DuckDB running over a Duck Lake catalog. Data flows into it on a schedule for the segments your agents need (or pulled on demand for specific use cases). Agents query the compute layer for the exploratory and retry-heavy work. They fall through to Snowflake for live data, customer-facing answers, or anything where staleness or governance demands it.
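
A sketch of the sync job, under stated assumptions: the Snowflake Python connector for reads, DuckDB's DuckLake extension for the catalog (check the ATTACH syntax against your DuckLake version), and illustrative table names, credentials, and paths:

```python
import duckdb
import snowflake.connector

# The explicit allowlist: only these segments ever reach the agent layer.
AGENT_ALLOWLIST = ["analytics.sales.pipeline", "analytics.product.events"]

sf = snowflake.connector.connect(
    account="your_account", user="your_user", password="your_password",
)

lake = duckdb.connect()
lake.execute("INSTALL ducklake")
lake.execute("LOAD ducklake")
# DuckLake catalog in Postgres, data files in object storage (illustrative).
lake.execute(
    "ATTACH 'ducklake:postgres:dbname=catalog' AS agent_lake "
    "(DATA_PATH 's3://agent-lake/data')"
)

for table in AGENT_ALLOWLIST:
    df = sf.cursor().execute(f"SELECT * FROM {table}").fetch_pandas_all()
    local_name = table.replace(".", "_")
    lake.register(f"{local_name}_src", df)  # DuckDB reads the frame in place
    lake.execute(f"CREATE OR REPLACE TABLE agent_lake.{local_name} "
                 f"AS SELECT * FROM {local_name}_src")
# Schedule: every 15 minutes for operational segments, nightly for aggregates.
```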

DuckDB earned the right to be in this conversation. As of October 2025 it ranks #1 on ClickBench, the canonical analytical-database benchmark, with 25M+ monthly PyPI downloads and production deployments at 20+ Fortune-100 companies (source). On a beefy single-node machine, its median hot-run query time on that benchmark is about 50 milliseconds. Agent workloads aren't ClickBench, but the order-of-magnitude latency advantage holds for the schema-inspect, sample, retry pattern.

This isn't "DuckDB on a laptop." This is a managed compute layer that lives close to the agent runtime. The agent has a fast SQL surface; you have an architecture you can explain to your CFO; the warehouse is unchanged.

The semantic layer is the precondition, not a nice-to-have

The architecture above doesn't work without one piece almost nobody talks about: a governed semantic layer sitting between the agent and the data. Without it, every retry burns credits re-deriving the same SQL from scratch — the agent learns your schema once per conversation and then forgets. The semantic layer is how an agent knows ARR means a particular metric definition, joined a particular way, sliced by a particular set of dimensions, before it writes any query at all. That's the thing that turns the worked example's bad case (5+ messages per question) into the well-tuned case (2 messages). Without the semantic layer the math doesn't work on the complementary layer either — you've just moved the retries to cheaper compute. With it, the agent stops retrying as much, because it isn't guessing as much.
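
Concretely, a governed definition looks something like the sketch below (the schema is hypothetical; the shape is what matters). The metric's SQL, joins, and allowed dimensions are decided once by humans, and compilation is deterministic, so the agent never re-derives them on a retry:

```python
ARR = {
    "name": "arr",
    "description": "Annual recurring revenue, end-of-month snapshot",
    "sql": "SUM(subscriptions.mrr) * 12",
    "joins": ["subscriptions JOIN customers USING (customer_id)"],
    "dimensions": ["customers.segment", "customers.region", "month"],
    "filters": ["subscriptions.status = 'active'"],
}

def compile_metric(metric: dict, dims: list[str]) -> str:
    """Deterministically compile a metric request into SQL (no LLM involved)."""
    disallowed = set(dims) - set(metric["dimensions"])
    if disallowed:
        raise ValueError(f"dimension(s) not governed for {metric['name']}: {disallowed}")
    return (f"SELECT {', '.join(dims)}, {metric['sql']} AS {metric['name']} "
            f"FROM {metric['joins'][0]} "
            f"WHERE {' AND '.join(metric['filters'])} "
            f"GROUP BY {', '.join(dims)}")

print(compile_metric(ARR, ["customers.segment", "month"]))
```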

A decision tree: which workload goes where?

[Decision tree: customer-facing real-time goes to Snowflake live; high-stakes (board, investor, compliance) answers go to Snowflake with deterministic parameterized queries; scheduled batch can go either way (the complementary layer is cheaper unless model logic runs in Snowflake); everything else — internal exploratory agents, Slack bots, retry-heavy work — goes to the complementary compute layer.]

The tree is short. Hand it to your head of data (a minimal routing function in code follows the list):

  • Customer-facing, real-time agent answer (e.g., a customer service agent quoting account balance): Snowflake live, or wherever your live OLTP data is. Don't cache. Don't tolerate staleness here.
  • Internal exploratory agent (Slack bot, "ask the data" tools, weekly business reviews drafted by an agent): complementary compute layer. This is where the cost amplification lives. Move it.
  • Scheduled batch (nightly summaries, weekly retention reports, monthly board pack drafts): either, but the complementary layer is cheaper unless the model logic itself runs in Snowflake.
  • High-stakes (investor-facing reports, board questions, compliance answers): warehouse plus deterministic fallback. The agent doesn't write fresh SQL for these. It selects from a canonical, parameterized set of vetted queries. If an agent has to be wrong, it cannot be wrong here.
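
The same tree as a routing function, sketched with illustrative field names:

```python
from dataclasses import dataclass
from enum import Enum, auto

class Route(Enum):
    WAREHOUSE_LIVE = auto()           # Snowflake, no caching, no staleness
    WAREHOUSE_DETERMINISTIC = auto()  # canonical, parameterized queries only
    HOT_LAYER = auto()                # complementary DuckDB compute

@dataclass
class Workload:
    customer_facing: bool = False
    needs_live_data: bool = False
    high_stakes: bool = False          # board, investor, compliance
    scheduled_batch: bool = False
    model_runs_in_snowflake: bool = False

def route(w: Workload) -> Route:
    if w.customer_facing and w.needs_live_data:
        return Route.WAREHOUSE_LIVE
    if w.high_stakes:
        return Route.WAREHOUSE_DETERMINISTIC
    if w.scheduled_batch and w.model_runs_in_snowflake:
        return Route.WAREHOUSE_LIVE    # keep the query next to the model
    return Route.HOT_LAYER             # exploratory, retry-heavy default

assert route(Workload(high_stakes=True)) is Route.WAREHOUSE_DETERMINISTIC
assert route(Workload()) is Route.HOT_LAYER   # the Slack bot lands here
```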

This isn't the only way to slice it. But it's the slicing that maps directly to the cost mechanics: the workloads routed to the warehouse are the ones where Snowflake's per-credit SKU prices the work fairly. The workloads routed to the complementary layer — exploratory, retry-heavy, autonomously executed — are where the SKU breaks.

The objections that matter more than cost

The reason agent architecture is hard isn't the bill. It's everything else.

Governance — what data goes down, who sees it, what gets audited

Let's clear the worst misreading first. Local in this architecture does not mean laptop. It means a managed compute layer that lives close to the agent runtime — a real platform component, with deployment, ops, and access control, not a dev environment running on someone's MacBook.

What lives in the layer: the segments of data that agents are allowed to read. Not your full Snowflake. The named control concept is an explicit allowlist of schemas and tables synced to the agent layer, mapped 1:1 to warehouse access controls and audited on a defined cadence. If your auditors want to point at the thing that bounds agent access, the allowlist is the thing.

What enforces access at query time: the semantic layer's policy model, inheriting whatever access controls your warehouse already has. If a sales rep can't see other reps' pipelines in Snowflake, the agent operating on behalf of that sales rep can't see them in the compute layer either. Row-level access doesn't go away because the data lives somewhere else; the agent passes a user identity, the policy model resolves it, the layer returns the same restricted view the warehouse would have.
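
Mechanically, that resolution looks something like this sketch. The policy table is illustrative, and production code would bind parameters rather than format strings:

```python
# table -> row filter mirroring the warehouse's row access policy
POLICIES = {
    "sales_pipeline": "owner_rep_id = '{user_id}' OR '{role}' = 'sales_leader'",
}

def apply_policy(table: str, base_sql: str, user_id: str, role: str) -> str:
    """Wrap the agent's query in the same row filter the warehouse enforces."""
    row_filter = POLICIES[table].format(user_id=user_id, role=role)
    return f"SELECT * FROM ({base_sql}) AS q WHERE {row_filter}"

sql = apply_policy("sales_pipeline", "SELECT * FROM sales_pipeline",
                   user_id="rep_42", role="account_exec")
# The agent sees only rep_42's rows: the view the warehouse would return.
```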

What gets audited: every agent query, attributed to the agent identity (and through it, the user identity that called the agent). The audit log is a system component, not an afterthought. If anything, it's easier to audit a compute layer purpose-built for agents than to disentangle agent traffic from analyst traffic in your warehouse logs.

The honest gap: per Deloitte's State of AI in the Enterprise 2026 report (3,235 business and IT leaders surveyed across 24 countries), only about 21% of organizations have a mature model for agent governance. This architecture has to help with that, not add a new sprawl problem. The way it helps is by making the agent's data perimeter explicit — a defined set of segments, a defined set of policies, a defined access path — instead of letting agents roam free across an entire warehouse.

Freshness — when agents read stale data, who's accountable

The objection: "what happens when the agent gives a wrong answer because it read stale data?"

The answer: you make freshness a tool capability, not invisible.

Default sync intervals are explicit per segment. Most operational data the agent reads — sales pipeline, product analytics, support history — lives in the compute layer at 15-minute freshness, which is appropriate for the questions agents are typically asked. Analytical data such as historical aggregates and ML feature stores refreshes nightly. For anything customer-facing or time-sensitive, the agent falls through to live warehouse queries. The agent has both modes available and chooses based on the question.
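
Sketched as a tool, with hypothetical executors for the two paths:

```python
from datetime import timedelta

def run_on_hot_layer(sql: str):
    """Hypothetical executor for the DuckDB layer."""

def run_on_snowflake(sql: str):
    """Hypothetical executor for live warehouse queries."""

# Each synced segment declares its freshness window up front.
SEGMENTS = {
    "sales_pipeline":  timedelta(minutes=15),  # operational: 15-minute sync
    "historical_aggs": timedelta(days=1),      # analytical: nightly
}

def query(segment: str, sql: str, max_staleness: timedelta):
    """Route by freshness; the agent sees which mode answered it."""
    window = SEGMENTS.get(segment)
    if window is None or window > max_staleness:
        return run_on_snowflake(sql), "live"
    return run_on_hot_layer(sql), f"cached (synced every {window})"

# A time-sensitive question falls through to the warehouse:
_, mode = query("sales_pipeline", "SELECT ...", max_staleness=timedelta(minutes=1))
print(mode)  # live
```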

This is more thinking than a CFO usually wants to do. It's also less thinking than the alternative, which is to set freshness to "always live" and pay for it on every retry.

When the agent is wrong — eval, citations, deterministic fallback

This is the objection that doesn't get said out loud. The CTO who's seriously evaluating an agent architecture isn't most afraid of cost. They're afraid of being the executive who shipped a confidently wrong answer to the board.

Three architectural moves help:

Citations. A governed semantic layer means the agent returns metric definitions and source provenance with the answer. Not just "ARR is $42M" but "ARR is $42M, defined as [definition], computed from [tables], as of [timestamp]." Wrong answers are caught earlier when the basis for them is visible.

Deterministic fallback. For high-stakes answers, the agent doesn't write fresh SQL. It selects from a canonical, parameterized set of vetted queries. The agent's job becomes question routing and parameter selection, not query authoring. You lose flexibility. You gain auditability.
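
A sketch of what selecting instead of authoring means in code (catalog and names are illustrative):

```python
# Vetted, parameterized catalog, written and reviewed by humans.
CANONICAL = {
    "arr_as_of": (
        "SELECT arr FROM finance.arr_snapshots WHERE snapshot_date = %s",
        ["snapshot_date"],
    ),
    "nrr_by_quarter": (
        "SELECT nrr FROM finance.nrr_by_quarter WHERE quarter = %s",
        ["quarter"],
    ),
}

def high_stakes_answer(question_key: str, params: dict):
    sql, required = CANONICAL[question_key]   # unknown question -> KeyError, refuse
    args = [params[p] for p in required]      # missing parameter -> refuse
    return run_vetted(sql, args)              # executes with bound parameters

def run_vetted(sql: str, args: list):
    """Hypothetical executor; in practice, the warehouse with full audit logging."""
```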

Eval harness. Every prompt that ships gets a regression test against known-good answers before it goes live. The eval harness is part of the agent platform, not someone's spare-time project.
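
The harness can be as small as a pytest file replaying golden questions, assuming a run_agent entry point for your agent:

```python
from my_agent import run_agent   # hypothetical module exposing your agent

GOLDEN = [
    ("What was Q3 ARR?", "$42M"),
    ("Net revenue retention last quarter?", "118%"),
]

def test_known_good_answers():
    for question, expected in GOLDEN:
        answer = run_agent(question)
        assert expected in answer, f"drift on {question!r}: got {answer!r}"
```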

None of these are unique to a complementary architecture. They're all easier to implement when the agent has a fast SQL surface that doesn't bill you per attempt. Eval costs are part of why teams skip eval. Cheap retries make eval affordable.

What this isn't

Honest tradeoffs:

  1. This isn't real-time. If you need genuine sub-second freshness — say, fraud detection on a payment stream — ClickHouse-class systems still beat both warehouses and lakehouse compute. The complementary architecture targets analytical workloads, not stream processing.
  2. "Local compute" doesn't mean a laptop. It means a managed compute layer with real ops. Treat it as a system component.
  3. This isn't replace-your-warehouse. Snowflake stays. Your dashboards stay. Your dbt project stays. The migration cost of doing the wrong thing here is the warehouse migration nobody needs.
  4. This isn't free. The complementary layer has its own TCO — compute, storage, ops. The win is cost per agent task, not absolute cost. If your agent volume is twenty tasks a day, the math doesn't justify the architecture. If it's ten thousand, it pays for itself before lunch.
  5. This isn't a 2-hour migration. Building a hot compute layer well takes real engineering. Productized platforms shorten the build, but they don't eliminate the choices: which segments mirror down, on what schedule, with what governance, with what eval harness.

FAQ

How do I cap Snowflake costs for AI agents?

You can set warehouse spending limits and resource monitors (you should), but those are circuit breakers, not architecture. They cut the agent off when it overruns; they don't change the underlying mechanic that makes agents expensive on a per-credit SKU. The architectural cap is to put the exploratory and retry-heavy traffic on a different compute layer where retries don't bill per attempt. If you want a more general guide to optimizing what you're already spending, we have a separate post on reducing your Snowflake costs.
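
For completeness, the circuit breaker itself is standard resource-monitor DDL, shown here via the Python connector with an illustrative warehouse name:

```python
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account", user="your_user", password="your_password",
)
# Warn at 80% of the quota, hard-suspend at 100%.
conn.cursor().execute("""
    CREATE RESOURCE MONITOR agent_breaker
      WITH CREDIT_QUOTA = 500
           FREQUENCY = MONTHLY
           START_TIMESTAMP = IMMEDIATELY
      TRIGGERS ON 80 PERCENT DO NOTIFY
               ON 100 PERCENT DO SUSPEND
""")
conn.cursor().execute(
    "ALTER WAREHOUSE agent_wh SET RESOURCE_MONITOR = agent_breaker"
)
```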

Doesn't Snowflake have a setting for this?

Smaller warehouse sizes help with throughput cost on individual queries — switching from a Medium to an X-Small warehouse cuts spin-up cost. They don't change Cortex Analyst's per-message pricing, Cortex Search's idle tax, or the multiplicative effect of agent retries. Different lever.

What is Duck Lake — is it a product or a pattern?

Both. Duck Lake is an open-source lakehouse format from DuckDB Labs that uses a database (Postgres, Snowflake, MySQL, or SQLite) as the catalog instead of a file-based metadata layer like Iceberg. The format is open. Implementations of it are shipping from multiple vendors, including Definite. We have a deeper primer here, a worked deployment example on GCP, and an operator's take on Duck Lake vs Iceberg after a year in production — including when Iceberg is still the right call.

If I move agent workloads to a complementary compute layer, what do I lose for governance?

If the layer is built right, nothing structurally. The semantic layer's policies enforce row-level access. Audit logs attribute every query to an agent identity. PII handling, encryption, SOC 2 compliance — same architectural moves apply, just to a different system. What you do lose: the comfort of "everything's in Snowflake," which is psychological more than architectural. Trade it for explicit data perimeters around your agents.

How do I keep agents from giving wrong answers on stale data?

Make freshness explicit per segment. Make the agent aware of the freshness window — it's a tool capability, not invisible. For customer-facing or high-stakes answers, fall through to live warehouse queries or to deterministic, parameterized canonical queries. Build an eval harness that runs against known-good answers before shipping prompts.

Is this just shifting the cost problem somewhere else?

The complementary layer has costs — compute, storage, ops. The win is the cost shape, not the absolute number. Snowflake's per-credit pricing scales linearly with agent retries because every retry is a new billable message and potentially a new warehouse spin. A hot compute layer scales with peak working-set size, not retry count. Once it's provisioned, retries don't move the needle. At low agent volume, the complementary architecture costs more (you're paying for ops on a system you barely use). At high agent volume, it's an order of magnitude cheaper per task.

How much is Snowflake AI?

Snowflake's AI Credits are a flat $2.00 per credit globally, $2.20 regional, separate from regular Snowflake Credits for warehouse compute. Cortex Analyst bills 67 AI Credits per 1,000 messages ($0.13/message). Cortex Search Serving bills 6.3 AI Credits per GB per month of indexed data ($12.60/GB/month). Cortex Functions and Cortex Code bill per million tokens, model-dependent (full pricing table, Tables 2(b), 6(e), 6(g)).

How expensive is Snowflake?

Regular Snowflake Credits range from $2.00 (Standard edition, US East) to $9.30 (VPS Switzerland) per credit, depending on edition and region. AI Credits are a separate, flat $2.00/$2.20 SKU. Storage is billed separately at standard cloud rates. The total bill depends on warehouse size, query volume, and AI service consumption — which is exactly the point of this post: AI service consumption is what's about to dominate, and it's priced for a different workload than the one agents impose.

What to do tomorrow

Five steps for the head of data:

  1. Audit which agent workloads (or planned agent workloads) currently hit Snowflake. Cortex Analyst, Cortex Agents, internal "ask the data" Slack bots, MCP servers wired into Claude Desktop or Cursor. Get the list.
  2. Categorize each by workload type: customer-facing real-time, internal exploratory, scheduled batch, high-stakes. Use the decision tree above.
  3. For each, ask the workload-fit question: does this need to live in the warehouse for governance, freshness, or model-execution reasons, or is it cheaper on a complementary compute layer?
  4. For workloads that don't need the warehouse, design the data path: which segments mirror to the compute layer, on what schedule, with what governance translation, with what fallback path back to Snowflake.
  5. Set up agent telemetry early. Measure cost per agent task before the bill teaches you. Logfire, OpenTelemetry, or whatever your platform supports — get the per-task cost visible (a minimal sketch follows this list).
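
A minimal version of step 5, usable before you commit to a telemetry vendor; the unit that matters is dollars per agent task:

```python
from collections import defaultdict

# task_id -> accumulated billable events for that agent task
LEDGER: dict = defaultdict(lambda: {"messages": 0, "usd": 0.0, "seconds": 0.0})

def record(task_id: str, messages: int, usd: float, seconds: float) -> None:
    """Call this from every tool invocation the agent makes."""
    row = LEDGER[task_id]
    row["messages"] += messages
    row["usd"] += usd
    row["seconds"] += seconds

record("task-001", messages=4, usd=0.52, seconds=38.0)   # a typical task
print(LEDGER["task-001"])  # {'messages': 4, 'usd': 0.52, 'seconds': 38.0}
```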

The first three steps are a one-week diagnostic. The last two are a quarter of engineering, or a productized platform. Either way, the worst move is to keep stacking agents on a SKU built for analyst dashboards and hope the bill works out.

How Definite implements this pattern

We built Definite on this architecture because we needed it ourselves. The platform runs DuckDB over a Duck Lake catalog by default, with a governed semantic layer and an MCP server for agent integration. Snowflake is a supported source system — agents query the lake, fall through to Snowflake when the use case demands it. We wrote about why we bet the company on the duck stack in February, and about running data agent telemetry on the same architecture in production. Build it yourself if you have the engineering capacity and a year. Use a productized version of it (ours, MotherDuck, Dremio, or someone else's) if you want it running by next quarter.

Snowflake is not the wrong warehouse. Agents are not the wrong technology. Running them together, on the same compute, billed by the same SKU, is the wrong architecture. Fix the architecture. The bill follows.

Your answer engine is one afternoon away.

Book a 30-minute call. We'll build your first dashboard on the call — or you can stop paying us.