The Self-Hostable Data Stack with a Private Cloud AI Data Agent

A bank in Frankfurt wants an AI analyst. Their rule: no customer data or metadata leaves their AWS account. To get a data agent, they need a self-hostable data platform with private cloud inference baked in.
The incumbents fail this test. Snowflake runs in Snowflake-operated accounts (same region, different account, different operator), so "nothing leaves our tenant" fails on day one. Databricks is closer, with the data plane in the customer's own AWS account, but the control plane that runs the AI is always Databricks-operated and cannot be disconnected.
A self-hostable data stack means every layer runs inside your own boundary, cloud or on-prem, including the AI analyst. Almost nothing on the market does this. Teams assume they have to pick one: data residency or intelligence. You shouldn't have to!
What "self-hostable" actually means: three layers and who runs them
A data platform is three layers, and "self-hostable" is just how many run inside your boundary.
1. Storage. Where your tables physically sit: object store, files, bytes on disk.
2. Compute. The engine that scans those bytes and answers a query.
3. Control plane plus AI brain. The web app, catalog, scheduler, metadata, and the AI analyst that reads your data and writes the SQL.
Layer 3 decides whether your AI analyst is actually private, because that layer sees every table, column name, and value. By that count, Snowflake runs zero layers in your tenant. Databricks runs storage and classic compute (layers 1 and 2) but never the control plane (layer 3). Definite runs all three.
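The layer count is easy to make concrete. Here is a minimal sketch, assuming a hypothetical `Platform` record whose three flags simply restate the claims in this section; it is a scoring aid for your own vendor review, not anyone's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Platform:
    """Hypothetical record: which of the three layers run inside your boundary."""
    name: str
    storage_in_tenant: bool
    compute_in_tenant: bool
    control_plane_in_tenant: bool

    def layers_in_tenant(self) -> int:
        # Count how many of the three layers run in your own tenant.
        return sum(
            [self.storage_in_tenant, self.compute_in_tenant, self.control_plane_in_tenant]
        )

    def ai_analyst_is_private(self) -> bool:
        # Layer 3 hosts the AI analyst, so it alone decides this answer.
        return self.control_plane_in_tenant


# Flags restate the article's claims (Databricks on classic compute).
PLATFORMS = [
    Platform("Snowflake", False, False, False),
    Platform("Databricks", True, True, False),
    Platform("Definite", True, True, True),
]

for p in PLATFORMS:
    print(f"{p.name}: {p.layers_in_tenant()}/3 layers, private AI: {p.ai_analyst_is_private()}")
```

Swap in the vendors on your own shortlist and the same two questions fall out: how many layers, and is layer 3 one of them.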
Why Snowflake can't be a self-hosted data stack
Snowflake is 100% SaaS by design. Compute, storage, tables, metadata, control plane: all of it runs in Snowflake-operated accounts. You point a client at it and pay for credits.
Some features sound like self-host but are not. Virtual Private Snowflake (VPS) is a dedicated, isolated instance, but still Snowflake-managed on Snowflake infrastructure: premium single-tenancy, not your tenant. PrivateLink gives a private network path, nothing in your account. Cortex LLM functions run in Snowflake's governance boundary in-region, a real control, but still Snowflake's account.
The one thing Snowflake brands BYOC is Openflow, and it is ingestion only: Apache NiFi runtimes in your VPC to move data. The warehouse those pipes feed stays Snowflake-hosted. A bare-metal, on-prem, or air-gapped deployment does not exist. More in the full Definite vs Snowflake breakdown.
Databricks keeps your data in your cloud, but runs the AI from theirs
Databricks is the harder comparison, so I will steelman it. The classic compute plane really does run in your own cloud VPC. Managed tables sit in your own S3, ADLS, or GCS. That is a strong, true residency claim, and it is why regulated teams land there.
What does not stay in your cloud is the control plane: web UI, notebooks, scheduler, workspace metadata, and Unity Catalog are all Databricks-operated and not self-hostable. Even on classic compute, cluster launch, lifecycle, and Spark driver coordination run through it, and no supported mode works without a connection to Databricks. The "air-gapped" setups people cite are private networking (PrivateLink, PSC, no public egress), not an air gap; orchestration still needs a live link. No bare-metal, on-prem, or cable-unplugged mode exists.
The AI is the same story. The model can run on your compute, and you can route to your own model endpoint. But the serving endpoint is created, authenticated, governed, and routed through that Databricks-operated control plane. That is a hard dependency you cannot disconnect from. That missing third layer is the dealbreaker for teams that need on-prem AI analytics with a disconnect option. More in how Definite compares to Databricks on self-hosting.
Self-hosted data stack head-to-head: Snowflake vs Databricks vs Definite
| Layer | Snowflake | Databricks | Definite |
|---|---|---|---|
| Storage (your tables) | Snowflake accounts | Your cloud (S3/ADLS/GCS) | Your object store |
| Compute (query engine) | Snowflake accounts | Your VPC (classic) / Databricks accounts (serverless), orchestrated by their control plane | Your environment |
| Control plane (UI, catalog, scheduler) | Snowflake SaaS | Databricks SaaS, always | Your environment |
| AI analyst | Snowflake SaaS | Model can run on your compute, but serving is brokered by Databricks SaaS | Your environment, your model endpoint |
| True on-prem / bare metal | No | No | Yes |
| Air-gapped / fully disconnected | No | No | Yes |
| Scale ceiling | Very high (distributed) | Very high (distributed, petabyte) | Petabyte storage; 100+ TB on a tuned large node with partition pruning |
| Native connectors / ingestion | Thin native; rely on third-party ETL | Thin native; rely on third-party ETL | Native, built in |
| Maturity / track record | Largest, most mature | Very large, mature | Younger, fewer years in market |
The incumbents win on raw distributed scale and a decade of maturity, partnerships, and support. Snowflake and Databricks ship thin native connector catalogs and lean on third-party ETL like Fivetran to load data; Definite builds ingestion in.
Where Definite wins is the bottom rows: true on-prem, a real disconnect mode, and an AI analyst inside your walls on your own models.
A private cloud AI data agent that runs in your tenant
Definite is a true BYOC data platform: the whole stack lands in your cloud or on-prem, not a slice.
- Control plane: the React frontend and Rust/Python API run inside your cluster.
- Lakehouse: a DuckDB and DuckLake lakehouse that keeps your data as Parquet on your own object store.
- Connectors: ingestion runs in your environment, credentials in your own secrets backend.
- Fi, the AI analyst: runs inside your tenant, calling your own model endpoint.
It deploys via Helm into your Kubernetes cluster (in your cloud, or on bare metal on-prem) as a single-tenant install. The piece that breaks the old tradeoff is BYO model, in two tiers.
Tier 1, private cloud residency. Everything inside your own cloud account: storage in your S3, compute on your nodes, Fi answering through your own managed model endpoint (Amazon Bedrock on AWS, Azure OpenAI on Azure, Vertex AI on GCP). Schema, columns, values, and prompts stay in your cloud boundary, not Definite's. The caveat: a managed model service like Bedrock is run by the cloud provider, so this is "in your cloud boundary," not "your own silicon." It beats Databricks because the orchestration lives in your tenant too, with nothing brokered through a vendor control plane. This is the mode the Frankfurt bank would run day to day.
Tier 2, true air-gap. Everything on your own GPUs with a self-hosted open-weights model (Qwen, DeepSeek, Mistral, or similar), zero required egress. Fi works on smaller models by using tools custom built for analytics. Definite ships with a built-in semantic layer and an ontology of your business, so it uses your defined metrics instead of guessing at raw columns.
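One way to keep the two tiers honest is to check the configured model endpoint against the tier before Fi ever calls it. The sketch below is illustrative only: `endpoint_allowed`, the tier names, and the hostnames are assumptions of mine, not Definite configuration keys. The `.svc.cluster.local` suffix is standard Kubernetes in-cluster DNS.

```python
from urllib.parse import urlparse

# Kubernetes in-cluster service DNS suffix; hosts under it never leave the cluster.
INTERNAL_SUFFIXES = (".svc.cluster.local",)


def endpoint_allowed(tier: str, url: str, managed_hosts: frozenset = frozenset()) -> bool:
    """Return True if the model endpoint is acceptable for the given tier.

    tier: "air_gap" (self-hosted model only) or "private_cloud"
          (in-cluster, or a managed endpoint the operator explicitly lists).
    """
    host = (urlparse(url).hostname or "").lower()
    in_cluster = host.endswith(INTERNAL_SUFFIXES)
    if tier == "air_gap":
        return in_cluster
    if tier == "private_cloud":
        return in_cluster or host in managed_hosts
    raise ValueError(f"unknown tier: {tier!r}")


# Hypothetical endpoints for the Frankfurt bank example.
bedrock = "https://bedrock-runtime.eu-central-1.amazonaws.com"
local_model = "http://vllm.models.svc.cluster.local:8000/v1"
approved = frozenset({"bedrock-runtime.eu-central-1.amazonaws.com"})

print(endpoint_allowed("private_cloud", bedrock, approved))  # managed endpoint, tier 1
print(endpoint_allowed("air_gap", local_model))              # self-hosted model, tier 2
print(endpoint_allowed("air_gap", bedrock, approved))        # rejected in tier 2
```

A hostname check is a guardrail, not a network control. The real enforcement for tier 2 is that no route out exists at all.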
What a security review will ask, with the short answers:
- Telemetry / phone-home: zero egress on a fully air-gapped install; control plane, Fi orchestration, and model calls all terminate inside your tenant.
- Licensing: offline license file, validated locally, no callback.
- Updates: signed offline artifact bundles; CVE fixes and advisories reach you through that channel plus support, not a public GitHub issue.
- Audit: query- and prompt-level logs, so you can prove what Fi sent the model and that no PII crossed a boundary.
- Secrets and keys: credentials in your secrets backend, at-rest encryption in your object store, Definite never holds a key.
- Vendor access: no standing access; break-glass only with your explicit, logged grant.
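For the audit answer in particular, reviewers usually want logs that are tamper-evident, not just present. A minimal sketch of one common technique, a SHA-256 hash chain over log entries, is below. This is my own illustration of the idea, not a description of how Definite stores its logs, and the function and field names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry
FIELDS = ("actor", "kind", "payload", "prev")


def _digest(body: dict) -> str:
    # Canonical JSON (sorted keys) so the same body always hashes the same way.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(log: list, actor: str, kind: str, payload: str) -> dict:
    """Append a query or prompt record that commits to the previous entry's hash."""
    prev = log[-1]["hash"] if log else GENESIS
    body = {"actor": actor, "kind": kind, "payload": payload, "prev": prev}
    entry = {**body, "hash": _digest(body)}
    log.append(entry)
    return entry


def verify(log: list) -> bool:
    """Recompute every hash and link; any edited or removed entry breaks the chain."""
    prev = GENESIS
    for entry in log:
        body = {k: entry[k] for k in FIELDS}
        if entry["prev"] != prev or _digest(body) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True


audit_log: list = []
append_entry(audit_log, "fi", "prompt", "Summarize revenue by region for Q1")
append_entry(audit_log, "fi", "query", "SELECT region, SUM(revenue) FROM sales GROUP BY region")
print(verify(audit_log))
```

Editing any stored prompt after the fact makes `verify` return `False`, which is the property an auditor is really asking about.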
A correctness caveat: "writes SQL" is a capability, and wrong answers are what get an AI analyst banned in finance. Fi runs against the semantic layer, and you decide whether it executes autonomously or needs approval. Audit its work; it is not an oracle.
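The autonomous-versus-approval choice amounts to a small policy gate in front of execution. Here is a rough sketch, assuming a hypothetical `decide` function and mode names of my own; it shows the shape of the decision, not Fi's actual implementation. The keyword check is deliberately crude, and a production gate should parse the statement properly.

```python
from enum import Enum


class Mode(Enum):
    AUTONOMOUS = "autonomous"  # read-only SQL runs without a human in the loop
    APPROVAL = "approval"      # every statement waits for a reviewer


# Crude read-only test on the first keyword. Some dialects allow writes
# inside a WITH statement, so treat this as a sketch, not a guarantee.
READ_ONLY_KEYWORDS = ("select", "with")


def decide(sql: str, mode: Mode) -> str:
    """Return "execute", "hold_for_approval", or "reject" for generated SQL."""
    words = sql.split()
    first = words[0].lower() if words else ""
    if first not in READ_ONLY_KEYWORDS:
        return "reject"  # never auto-run writes, DDL, or empty output
    if mode is Mode.APPROVAL:
        return "hold_for_approval"
    return "execute"


print(decide("SELECT region, SUM(revenue) FROM sales GROUP BY region", Mode.AUTONOMOUS))
print(decide("SELECT * FROM customers", Mode.APPROVAL))
print(decide("DROP TABLE customers", Mode.AUTONOMOUS))
```

Most regulated teams I would expect to start in approval mode and loosen it per metric once the audit trail earns trust.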
The engine is DuckDB and DuckLake. Interactive queries typically return in under a second, with no cluster to run. Scale is not the ceiling people assume. Storage runs to petabytes: your data sits as Parquet in your object store, and the catalog lives in a SQL database. Each query reads only the partitions it needs, not the whole dataset. On a big node with enough memory and cores, DuckDB handles 100+ TB datasets, and intermediate results larger than RAM spill to disk. The real limit is the shape of the work, not its size. A single node fits filtered analytics and agent queries. Petabyte-wide, shuffle-heavy joins still belong on Spark.
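Partition pruning is the reason dataset size and query cost come apart. The sketch below shows the idea in plain Python, assuming each Parquet file carries min and max values for a date column; it is an illustration of the technique, not DuckDB's or DuckLake's actual code, and the paths and sizes are made up.

```python
import calendar
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ParquetFile:
    """Hypothetical catalog entry: one file plus min/max stats for its date column."""
    path: str
    min_day: date
    max_day: date
    size_bytes: int


def prune(files: list, start: date, end: date) -> list:
    # Keep only files whose [min_day, max_day] range overlaps the query's date filter.
    return [f for f in files if f.max_day >= start and f.min_day <= end]


# Twelve monthly partitions for 2024, 10 GB each (made-up numbers).
GB = 10**9
files = [
    ParquetFile(
        path=f"s3://your-bucket/events/month={m:02d}/part-0.parquet",
        min_day=date(2024, m, 1),
        max_day=date(2024, m, calendar.monthrange(2024, m)[1]),
        size_bytes=10 * GB,
    )
    for m in range(1, 13)
]

# A query filtered to March 2024 touches a single partition.
needed = prune(files, date(2024, 3, 1), date(2024, 3, 31))
scanned = sum(f.size_bytes for f in needed)
total = sum(f.size_bytes for f in files)
print(f"files read: {len(needed)} of {len(files)}, bytes: {scanned / total:.1%} of dataset")
```

The same logic is why a filtered agent query over a very large table can fit on one node, while a join that has to touch every partition cannot be pruned and is the shuffle-heavy case that still belongs on Spark.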
Can't I build a self-hosted lakehouse with AI myself?
Fair. This is the one real alternative, so I will not wave it away. You can go fully self-hosted today with open source: Iceberg or DuckLake for the table format, a query engine, a catalog, a BI tool, your own self-hosted LLM. You own every layer, air-gapped if you want, and the data is genuinely yours.
Here is the bill. You are now the integrator and operator, owning the upgrade path for every component plus every seam where two of them disagree after a version bump. And you still have no integrated AI analyst and no polished UI a non-engineer will open. Definite is the third door: a self-hosted lakehouse with the AI already inside it, on your model, supported.
The fair counter is vendor maturity. You are putting a regulated stack on a younger single-tenant product, so ask up front who is behind it, what the support model is for an air-gapped buyer, and whether source availability or escrow covers you if the vendor disappears. We answer those in writing.
Own your analytics platform and keep the AI
"Data residency or the AI assistant, pick one" was never a real constraint. It was a billing decision: those vendors run the AI in their cloud because that is where they meter you. To own your data analytics end to end, the analyst has to live where the data lives: on your model, behind your firewall, working even with the cable unplugged. Not a SaaS chatbot pinky-swearing it deleted your prompt.
FAQ
What is a self-hostable data stack? A platform where storage, compute, and the control plane (including the AI analyst) all run inside your own environment, cloud or on-prem, instead of a vendor's SaaS. It is measured by how many of those layers you actually run.
Can Snowflake or Databricks run fully on-prem or air-gapped? No. Snowflake is 100% SaaS in Snowflake-operated accounts. Databricks keeps your data in your cloud, but its control plane and model serving always run in Databricks SaaS and require connectivity. Neither offers a bare-metal, on-prem, or disconnected mode.
What is a private cloud AI data agent? An AI analyst that reads your schema and writes SQL entirely inside your own tenant, calling your own model endpoint, so your data and prompts never leave your boundary.
Can I build a self-hosted lakehouse with AI using open source? Yes, with Iceberg or DuckLake plus your own LLM, but you become the integrator and operator of four or five projects, with no integrated AI analyst or polished UI.
If you are in finance, healthcare, public sector, or defense, and your compliance team has veto power, grab 30 minutes and I'll show you both tiers running: the private-cloud version on your own cloud model (Amazon Bedrock, Azure OpenAI, or Vertex AI), and the fully air-gapped version on a self-hosted model.