August 27, 2025 · 10 minute read

Getting Reliable LLM-to-SQL Results with CubeJS and Constrained Decoding

Mike Ritchie

TL;DR
Pointing an LLM directly at raw warehouse tables generates fast but unreliable results. The fix lies in combining governance via CubeJS, syntax-aware SQL generation, execute-and-verify loops, and production safeguards to deliver accurate, auditable answers.


1) Bound the system through roles, masking, timeouts, and tagging

Reliability begins with instituting robust guardrails—every query, including those produced by an LLM, must be constrained to prevent data leaks or runaway compute costs:

  • Read-only access with granular controls
    CubeJS lets you enforce row-level and column-level restrictions based on user context (e.g., via JWT claims), automatically applied to every query.

  • Mask sensitive data at query time
    Masking dynamically obscures sensitive values without altering the stored data. For example, you can conditionally show or hide columns like names or emails depending on user roles by embedding logic in your CubeJS model (e.g., a masking macro in your model files) so that unauthorized users see a placeholder instead of the real value.

  • Set execution limits such as default row caps and timeouts to prevent resource exhaustion—CubeJS supports default limits to mitigate data scraping or denial-of-service risks.

  • Tag every query with metadata (user ID, session ID, task) for auditing and cost tracking—helpful for monitoring and debugging.
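The tagging step can be sketched in a few lines. The wrapper below is an illustrative sketch, not Definite's implementation: the metadata keys are assumptions, and an in-memory SQLite database stands in for the warehouse. It prepends audit metadata as a SQL comment so it shows up in warehouse query logs:

```python
import json
import sqlite3

def tag_query(sql: str, *, user_id: str, session_id: str, task: str) -> str:
    """Prepend audit metadata as a SQL comment for monitoring and cost tracking."""
    meta = json.dumps({"user_id": user_id, "session_id": session_id, "task": task})
    return f"/* llm-sql {meta} */\n{sql}"

# Demo against an in-memory SQLite database standing in for the warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
tagged = tag_query("SELECT COUNT(*) FROM orders",
                   user_id="u42", session_id="s1", task="daily-report")
print(tagged.splitlines()[0])                  # the audit comment
print(conn.execute(tagged).fetchone()[0])      # → 0
```

Because the tag travels inside the SQL text itself, it survives into the warehouse's own query history, so cost attribution works even without extra infrastructure.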


2) Use CubeJS as your definitions contract

Rather than exposing raw tables, define your metrics, dimensions, and joins within CubeJS. CubeJS handles modeling, caching, access rules, and API delivery (REST, GraphQL, SQL), ensuring consistent definitions across all consumers—even LLMs.

Routing LLM-generated SQL through CubeJS means governance is enforced automatically: the model never queries raw tables directly, only the governed definitions.
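As an illustration, a minimal CubeJS data model in YAML might look like the following (the cube, table, and field names here are invented for the example):

```yaml
cubes:
  - name: orders
    sql_table: public.orders

    measures:
      - name: count
        type: count
      - name: total_amount
        sql: amount
        type: sum

    dimensions:
      - name: status
        sql: status
        type: string
      - name: created_at
        sql: created_at
        type: time
```

Every consumer—dashboards, notebooks, or an LLM—queries `orders.total_amount` through this contract, so "revenue" means the same thing everywhere.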


3) Constrain LLM outputs to valid SQL

LLMs may hallucinate or produce invalid SQL. To improve reliability:

  • Apply grammar-aware prompting or constraints during generation so models only emit syntax-compliant SQL.
  • Use a validation layer—parse the generated SQL against your schema, lint for style, and reject queries with syntax errors or unauthorized columns.

This ensures that only valid and authorized SQL ever reaches execution.
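A minimal validation layer can lean on the database's own parser. The sketch below is illustrative—the schema and keyword rules are assumptions, and a real deployment would validate against the CubeJS model rather than a local SQLite copy—but it shows the shape: reject non-read-only statements, then reject anything that fails to parse or bind against the schema.

```python
import sqlite3

# SQLite stand-in for the governed schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT, amount REAL)")

FORBIDDEN = ("insert", "update", "delete", "drop", "alter", "create")

def validate(sql: str) -> bool:
    """Return True only for read-only SQL that parses against the schema."""
    if any(word in sql.lower().split() for word in FORBIDDEN):
        return False                      # not read-only
    try:
        conn.execute("EXPLAIN " + sql)    # parses and binds, but does not run
        return True
    except sqlite3.Error:
        return False                      # syntax error or unknown column/table

print(validate("SELECT status, SUM(amount) FROM orders GROUP BY status"))  # True
print(validate("SELECT nonexistent_col FROM orders"))                      # False
print(validate("DROP TABLE orders"))                                       # False
```

Using `EXPLAIN` means the check catches hallucinated columns and tables, not just malformed syntax, without ever executing the query.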


4) Execute and verify multiple SQL candidates

Even valid SQL can be semantically incorrect. A generate-and-test loop helps:

  • Sample several SQL variants from the LLM.
  • Statically validate them (parsing, linting, schema checks).
  • Execute each in a sandbox with enforced limits and apply simple invariants (e.g., non-zero row counts, aggregate bounds).
  • Choose the variant that both runs successfully and meets correctness criteria.

This reduces “correct-looking but wrong” outputs.
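The loop above can be sketched as follows. The candidates are hand-written stand-ins for LLM samples, and the invariants are deliberately simple examples—real checks would be domain-specific:

```python
import sqlite3

# Sandbox with a small fixture standing in for the warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 25.0)])

candidates = [
    "SELECT SUM(amount) FROM orderz",          # invalid: typo in table name
    "SELECT SUM(amount) FROM orders WHERE 0",  # runs, but violates an invariant
    "SELECT SUM(amount) FROM orders",          # runs and passes the checks
]

def passes_invariants(rows) -> bool:
    # Example invariants: at least one row, and a non-null aggregate.
    return bool(rows) and rows[0][0] is not None

chosen = None
for sql in candidates:
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        continue                  # execution failure: discard the candidate
    if passes_invariants(rows):
        chosen = sql
        break

print(chosen)  # → SELECT SUM(amount) FROM orders
```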


5) Request transparency and user confirmation

Let users see what’s being run:

  • Present the inferred intent (e.g., time granularity or filter interpretation) and show the generated SQL.
  • Prompt for confirmation when ambiguity exists—such as fiscal vs. calendar periods.
  • Surface the origin of metrics or dimensions (e.g., who defined them in CubeJS) for traceability.

This builds user trust and accountability.
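A confirmation step can be as small as the hypothetical helper below—the intent structure, field names, and wording are all assumptions for illustration:

```python
def confirm_query(intent: dict, sql: str, approve) -> bool:
    """Show the inferred intent and the SQL; run only on explicit approval."""
    summary = "\n".join(f"  {k}: {v}" for k, v in intent.items())
    prompt = f"Interpreted your question as:\n{summary}\n\nSQL to run:\n  {sql}\n"
    return approve(prompt)

intent = {"metric": "revenue", "period": "fiscal Q1 (not calendar)", "grain": "monthly"}
sql = "SELECT month, SUM(amount) FROM orders GROUP BY month"
# `approve` would normally prompt the user; here it auto-approves for the demo.
ok = confirm_query(intent, sql, approve=lambda prompt: True)
print(ok)  # True
```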


6) Embed evaluation to avoid regressions

Treat LLM-to-SQL systems like code:

  • Maintain a gold question suite—real NL questions mapped to expected SQL or results.
  • Run this evaluation suite in CI whenever models, prompts, or CubeJS configurations change.
  • Track metrics like execution success rate and result-matching accuracy, following patterns such as the OpenAI Cookbook's SQL evaluation example (cookbook.openai.com).

This prevents drift and maintains system reliability over time.
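A gold suite can start as a plain dictionary run against a fixture database in CI. The questions, SQL, and expected results below are invented for illustration:

```python
import sqlite3

# Fixture database with known contents, so expected results are deterministic.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, status TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "paid", 10.0), (2, "paid", 25.0), (3, "refunded", 5.0)])

# question → (SQL under test, expected scalar result)
gold_suite = {
    "total paid revenue":
        ("SELECT SUM(amount) FROM orders WHERE status = 'paid'", 35.0),
    "number of refunded orders":
        ("SELECT COUNT(*) FROM orders WHERE status = 'refunded'", 1),
}

executed, matched = 0, 0
for question, (sql, expected) in gold_suite.items():
    try:
        result = conn.execute(sql).fetchone()[0]
        executed += 1
        matched += (result == expected)
    except sqlite3.Error:
        pass  # execution failure counts against the success rate

print(f"execution success: {executed}/{len(gold_suite)}, "
      f"accuracy: {matched}/{len(gold_suite)}")
```

In CI, the generated SQL would come from the live model and prompts, and a drop in either metric fails the build.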


7) Production hygiene: cost, auditability, and guardrails

  • Tag and isolate LLM workloads for cost and performance insight.
  • Use separate compute resources with quotas and monitoring to avoid interference and runaway costs.
  • Enforce timeouts and row limits by default.
  • Lint generated SQL (e.g., with SQLFluff) to maintain style consistency.
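Two of these defaults—row caps and timeouts—fit in a small execution wrapper. The sketch below uses SQLite's progress handler as a stand-in for warehouse-level statement timeouts; the limit values are illustrative:

```python
import sqlite3
import time

MAX_ROWS = 1000          # illustrative default row cap
TIMEOUT_SECONDS = 5.0    # illustrative wall-clock budget

def run_guarded(conn: sqlite3.Connection, sql: str):
    """Run a query with a hard row cap and a wall-clock timeout."""
    deadline = time.monotonic() + TIMEOUT_SECONDS
    # A nonzero return from the handler aborts the running statement.
    conn.set_progress_handler(
        lambda: 1 if time.monotonic() > deadline else 0, 10_000)
    try:
        cursor = conn.execute(sql)
        return cursor.fetchmany(MAX_ROWS)   # never pull more than the cap
    finally:
        conn.set_progress_handler(None, 0)  # remove the handler

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER)")
conn.executemany("INSERT INTO events VALUES (?)", [(i,) for i in range(5000)])
rows = run_guarded(conn, "SELECT id FROM events")
print(len(rows))  # capped at 1000
```

The key design point is that the limits live in the execution path, not in the prompt, so no LLM output can bypass them.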

Academic perspectives on improving reliability

Recent academic work offers structured approaches for reliable Text-to-SQL systems:

  • TrustSQL presents a benchmark and penalty-based scoring—encouraging models to abstain rather than err when unsure (arxiv.org).
  • Reliable Text-to-SQL (RTS) introduces adaptive abstention and human-in-the-loop correction to reduce schema misinterpretation (arxiv.org).
  • CHESS uses a multi-agent pipeline (schema filter, candidate generator, unit-test agent) to enhance reliability and reduce LLM calls (arxiv.org).

These ideas inspire the architecture outlined here.


Reference Architecture You Can Build This Week

  1. Define models in CubeJS (metrics, joins, access rules).
  2. Constrain SQL generation with grammar guidance and prompt rules.
  3. Validate generated SQL before execution.
  4. Execute in sandbox with limits and apply sanity checks.
  5. Expose SQL and intent to users and require confirmation.
  6. Evaluate via golden-question suite in CI.

Why Definite built it this way

At Definite, we rely on CubeJS to centralize definitions and enforce access and masking. We constrain LLM generation, validate before execution, and always surface the SQL and intent to the user. This provides speed and flexibility without sacrificing trust, auditability, or cost control.


Notable Citations

  • CubeJS documentation: row-level and column-level access control, masking, and security context.
  • CubeJS documentation: default query limits and safeguards.
  • Academic references:
    • TrustSQL benchmark for reliability (arxiv.org).
    • Reliable Text-to-SQL (RTS) with adaptive abstention (arxiv.org).
    • CHESS multi-agent SQL generation architecture (arxiv.org).
  • Evaluation patterns: OpenAI Cookbook SQL evaluation example (cookbook.openai.com).

Want this workflow up and running in an afternoon instead of months? Reach out to Definite—we'd be glad to help you launch reliable, auditable, cost-controlled LLM-to-SQL systems.
