Inside Eon's AI-Ready Lake: How an Open Iceberg Lake Describes, Joins, and Serves Itself to AI Agents

Iceberg gives an AI agent typed columns, snapshots, and engine-agnostic reads. It does not tell the agent what a column means, which tables describe the same business entity, how two sources join when they share no key, or which of several near-identical tables is current. Eon builds each of those layers automatically as data lands in open Iceberg, then serves the result to agents over MCP and A2A.

What sits between Iceberg and an AI agent

Three things sit between an Iceberg lake and a working AI agent: the meaning of each column, the joins across sources, and the resolution of duplicate or stale tables, none of which the table format itself provides. A lakehouse decision usually ends at the table format. Iceberg gives you ACID commits, schema and partition evolution, snapshots, and the ability to point any Iceberg-compatible engine at the same files. None of that is semantics. A catalog adds structural metadata (names, types, lineage), but structure is not meaning. An AI agent needs semantic definitions of what a column actually holds: verbal descriptions rather than data types and column names.

Three gaps remain once the tables exist:

Meaning. Column names and types do not say what a column holds. AI agents need verbal descriptions of the data, beyond its schema.
Joinability. Business questions span sources, and the relevant columns rarely share a key or a name. A domain value in a production database is not the customer_name in a CRM, even when both point at the same customer.
Ambiguity. Real lakes carry duplicate and near-duplicate tables with drifting definitions; an AI agent that "selects one that matches and moves on" produces inconsistent or wrong answers.

The conventional fix is a hand-built semantic layer (dbt Semantic Layer, Cube, AtScale) curated per use case, alongside manually declared joins. Eon's framing is that these layers are maintained by hand and must be re-curated as sources and use cases evolve, which is the gap it sets out to close by generating the layer from the capture path. Join discovery in particular is a studied problem: systems like JOSIE frame it as overlap set-similarity search over column values, and Freyja ranks candidate joins by shared values rather than names, replacing the LSH-approximated set-containment of earlier methods with succinct data profiles to suppress false positives. In both systems, joins are found in the data, not the metadata. Eon's contribution is to run these layers from the capture path and assemble them into a single queryable structure it calls the context graph: a graph whose nodes are columns, tables, and the business entities they cluster into, and whose edges are the value-based join relations and natural-language descriptions derived on ingest. Every layer above the substrate reads from and writes back to this one structure.
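
To make the shape of that structure concrete, here is a minimal sketch of a context graph as plain Python data classes. It is an illustration under assumptions, not Eon's implementation: the class names, field names, the example tables, and the 0.9 confidence value are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnNode:
    table: str
    name: str
    description: str = ""  # natural-language description derived on ingest


@dataclass
class JoinEdge:
    left: tuple[str, str]   # (table, column)
    right: tuple[str, str]  # (table, column)
    confidence: float       # assumption: 1.0 for declared keys, lower for inferred
    declared: bool


@dataclass
class ContextGraph:
    columns: dict[tuple[str, str], ColumnNode] = field(default_factory=dict)
    entities: dict[str, set[str]] = field(default_factory=dict)  # entity -> tables
    joins: list[JoinEdge] = field(default_factory=list)

    def add_column(self, table, name, description=""):
        self.columns[(table, name)] = ColumnNode(table, name, description)

    def add_join(self, left, right, confidence, declared=False):
        self.joins.append(JoinEdge(left, right, confidence, declared))

    def assign_entity(self, entity, table):
        self.entities.setdefault(entity, set()).add(table)

    def joins_for(self, table):
        """All join edges that touch the given table."""
        return [e for e in self.joins if table in (e.left[0], e.right[0])]


# Hypothetical example data, loosely following the article's CRM/production case.
graph = ContextGraph()
graph.add_column("prod.usage", "cloud_account_id", "Cloud account that generated the usage")
graph.add_column("crm.accounts", "marketplace_id", "Cloud marketplace identifier for the customer")
graph.add_join(("prod.usage", "cloud_account_id"), ("crm.accounts", "marketplace_id"), confidence=0.9)
graph.assign_entity("customer", "prod.usage")
graph.assign_entity("customer", "crm.accounts")
```

Keeping descriptions, entity membership, and join edges in one structure is what lets later layers answer "what joins to this table?" and "what does this column mean?" from the same lookup.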

System model: capture into open Iceberg

Eon maps a multi-cloud estate (AWS, Azure, GCP) from a single IAM role: no appliance, no compute in the data path, no gateways, no network access into production systems. Data teams define a contextual rule ("all production relational databases with financial information") and Eon ingests every matching source; freshness and retention are configuration, not pipeline engineering. Agentless capture across the cloud estate is one path in, not the only one: Eon also ingests streaming sources as a first-class path, for data that arrives continuously and at high throughput, and can pull from backups and archives into the same lake. Whichever path it takes, the data lands in the same open Iceberg substrate and flows through the same context layers.

Ingested data lands directly in open Apache Iceberg, the open table-format standard for large analytic tables. The design is standard Iceberg on the outside, Eon's patented storage engine underneath: an open format any engine can read, over a proprietary storage implementation built for warehouse-grade economics without lock-in. The open surface is what matters for consumers: because the tables are standard Iceberg, engines like Databricks, Snowflake, BigQuery, Trino, and Athena can read them through their existing Iceberg integrations, and the context layers publish into existing catalogs.

Why the table format matters for AI/ML reads specifically:

Schema evolution is metadata-only. Iceberg tracks columns by unique ID, so add/drop/rename/widen never rewrites data files or resurrects old data. Ingested production schemas drift constantly; this absorbs the drift without breaking downstream embeddings or training inputs.
Hidden partitioning yields planned scans. Iceberg derives partition values from a column transform and prunes non-matching files, so a RAG or training job gets a bounded, predictable scan instead of full-table churn, and consumers don't hand-write partition filters.
Partition evolution is also metadata-only. Old data keeps its old spec; both layouts coexist via split planning, so layout can be tuned to a workload without migrating data.
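
The hidden-partitioning point can be illustrated with a toy model. The sketch below is not the Iceberg or PyIceberg API; DataFile, day_transform, and plan_scan are hypothetical names. It only shows the idea that the partition value is derived from a column by a transform, recorded per file, and used to prune files before any data is read.

```python
from dataclasses import dataclass
from datetime import date, datetime


def day_transform(ts: datetime) -> date:
    """Partition transform: derive the partition value from the source column."""
    return ts.date()


@dataclass
class DataFile:
    path: str
    partition_day: date  # recorded in table metadata when the file is written
    row_count: int


def plan_scan(files, ts_from: datetime, ts_to: datetime):
    """Keep only files whose partition value can contain rows in the time range."""
    lo, hi = day_transform(ts_from), day_transform(ts_to)
    return [f for f in files if lo <= f.partition_day <= hi]


# Hypothetical file listing for an events table partitioned by day(ts).
files = [
    DataFile("events/a.parquet", date(2024, 5, 1), 1000),
    DataFile("events/b.parquet", date(2024, 5, 2), 1200),
    DataFile("events/c.parquet", date(2024, 5, 3), 900),
]

# The consumer filters on the timestamp column; it never names the partition.
selected = plan_scan(files, datetime(2024, 5, 2, 8, 0), datetime(2024, 5, 2, 18, 0))
```

The consumer's filter is written against the timestamp itself, so a RAG or training job gets the bounded scan without knowing how the table is laid out.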

Everything above is the substrate. The remaining layers are produced on top of it, automatically, as data is captured.

Self-describing data: classification at ingest

On ingestion Eon runs an AI agent that generates a contextual natural-language description for every column in every table, then a clustering step groups tables into logical entities that represent business context. Both the descriptions and the entity clusters land in the context graph, which becomes the queryable catalog feeding context to LLMs through AI agents.

Metadata cannot do this part. A type and a name (acct_id BIGINT) tell an AI agent nothing about whether the column is a billing account, a cloud account, or a login. A verbal description grounds the column so retrieval can match intent instead of string-matching headers. At the scale automatic ingest creates (thousands of databases pulled by one rule), manual cataloging cannot keep pace, which is why the description step runs inline with capture rather than as a later annotation pass.

The classification path is already productized in a narrower form: Eon auto-classifies data classes (PII, financial, PHI) at the resource level with an API to override or revert. The override path means the generated layer is reviewable, not infallible.

Join discovery: relations from value overlap

Join discovery is the layer hardest to derive automatically. Customer adoption metrics live in a production database; the customer's name and account details live in a CRM. Answering "which accounts are expanding usage" requires joining them, but the columns that connect the two share neither a name nor, often, a key. AI agents that fall back on semantic similarity fail here: a domain column in production simply does not resemble the CRM's customer_name, so name-and-metadata matching has weak success rates on exactly the joins that matter.

Eon attacks this with two complementary signals, and it derives them from the backup itself with no running database to query.

Declared keys, taken as truth. Eon reads the foreign keys the source database already knows about and enters them into the relation map at full confidence. Declared keys are facts, not guesses.

Relationships inferred from the data. The harder and more valuable case is the relationship no one ever declared: the cross-database join, the denormalized copy, the constraint that was never written down. Here Eon looks at the values rather than the names. While capture runs, Eon takes a lightweight statistical fingerprint of each column without scanning the full dataset. Later, outside the capture path so nothing slows down, it compares those fingerprints to find columns whose values genuinely overlap across otherwise unconnected sources. Two columns that turn out to hold the same underlying values are flagged as joinable and confidence-scored, even when their names, types, and keys all differ.
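
The article does not say what the fingerprint is. As one plausible stand-in, the sketch below uses a bottom-k (k-minimum-values) sketch: keep the k smallest hashes of a column's values, then estimate value overlap between two columns from their sketches alone. The function names, the value of k, and the sample data are assumptions; the input can be a sample of the column rather than the full dataset.

```python
import hashlib


def _hash64(value) -> int:
    """Stable 64-bit hash of a value's string form (so 1 and "1" collide on purpose)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def fingerprint(values, k=64):
    """Bottom-k sketch: the k smallest hashes among the distinct values seen."""
    return sorted({_hash64(v) for v in values})[:k]


def estimate_jaccard(fp_a, fp_b, k=64):
    """Estimate Jaccard similarity of two columns from their sketches only."""
    a, b = set(fp_a), set(fp_b)
    union_bottom = sorted(a | b)[:k]  # bottom-k of the union of both columns
    if not union_bottom:
        return 0.0
    shared = sum(1 for h in union_bottom if h in a and h in b)
    return shared / len(union_bottom)


# Hypothetical columns: different names, overlapping values.
usage_cloud_account_id = [f"acct-{i}" for i in range(40)]
crm_marketplace_id = [f"acct-{i}" for i in range(20, 60)]
order_ids = [f"order-{i}" for i in range(40)]

overlap = estimate_jaccard(fingerprint(usage_cloud_account_id), fingerprint(crm_marketplace_id))
no_overlap = estimate_jaccard(fingerprint(usage_cloud_account_id), fingerprint(order_ids))
```

Because the comparison touches only the sketches, it can run after capture without re-reading the source columns, which matches the split the paragraph describes.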

Filtering comes first, so the joins Eon proposes are the ones that mean something: low-variety categorical fields, flags, and timestamps that technically "match" everything are excluded, and Eon compares only type-compatible columns. Declared keys are never filtered away. Finally, Eon clusters the surviving relationships into entities, so when many columns across many tables all carry the same identifier they collapse into a single group and the system can reason about join paths, not just isolated pairs. The whole map (declared and inferred) feeds the context graph automatically.
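
A minimal sketch of the filter-then-cluster step, under stated assumptions: the distinct-value cutoff, the excluded types, and every name below are hypothetical, and union-find is one common way to collapse pairwise relations into groups, not necessarily what Eon uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnProfile:
    table: str
    name: str
    dtype: str           # e.g. "string", "int", "timestamp", "bool"
    distinct_count: int


MIN_DISTINCT = 50                        # assumed cutoff for "low-variety"
EXCLUDED_TYPES = {"timestamp", "bool"}   # assumed noise types


def is_candidate(col):
    return col.dtype not in EXCLUDED_TYPES and col.distinct_count >= MIN_DISTINCT


def comparable(a, b):
    """Only type-compatible, non-noise columns from different tables are compared."""
    return is_candidate(a) and is_candidate(b) and a.dtype == b.dtype and a.table != b.table


def cluster(relations):
    """Union-find over (column, column) pairs; returns groups of connected columns."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in relations:
        parent[find(a)] = find(b)
    groups = {}
    for col in list(parent):
        groups.setdefault(find(col), set()).add(col)
    return list(groups.values())


usage = ColumnProfile("prod.usage", "cloud_account_id", "string", 4200)
crm = ColumnProfile("crm.accounts", "marketplace_id", "string", 3900)
billing = ColumnProfile("billing.invoices", "account_ref", "string", 4100)
flag = ColumnProfile("crm.accounts", "is_active", "bool", 2)

declared = [(billing, usage)]                    # declared keys bypass the filter
overlapping_pairs = [(usage, crm), (usage, flag)]
inferred = [(a, b) for a, b in overlapping_pairs if comparable(a, b)]
entities = cluster(declared + inferred)
```

Here the three identifier columns collapse into one group while the boolean flag is dropped before scoring, so a join path from billing to CRM exists even though no single pair connects them directly.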

Eon documents one worked example: the anchor that linked production adoption metrics to CRM customer records was the cloud-account-ID, correlated against the cloud marketplace identifier listed in the CRM exports. Once that link is established by value overlap, rows from the two systems join even though no one ever declared a foreign key.

Figure: The cloud-account-ID anchor. Rows from two systems join on overlapping values, with no declared foreign key on either side.

The value-overlap half of this is the approach the data-lake join-discovery literature converged on: joinable columns are the ones whose value sets overlap, a signal invisible to names and types. The same literature documents the weakness: pure overlap over-fires when unrelated columns happen to share a value range, like sequential IDs. Eon's advantage is that value overlap never stands alone: declared keys anchor the map where they exist, noise columns are filtered out before anything is scored, and surviving candidates are reconciled against the context graph's descriptions, entity clusters, and freshness before they reach an agent. Overlap proposes a join; the rest of the system decides whether it's real.

Ambiguity, similarity, and freshness

Duplicate and near-duplicate tables are normal: the same column names mean different things across databases, and legacy tables hold stale data. An AI agent cannot tell them apart, and its default is to pick one and move on, so the same question asked twice can return two answers, and a user usually never learns that a source was chosen from several.

Eon uses the context graph and the joinability map to find tables that contain similar data, and it monitors data freshness as an indicator it passes to the AI agent. When similarity crosses a threshold, the AI agent is handed the differences, including how current each candidate is, to surface to the user or look up in the organization's knowledge bases, rather than guess. The reconciliation reuses the same two structures the earlier layers already build, so no new index is required.
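
The hand-off can be sketched as a small function that either resolves to one table or returns a disambiguation payload. Everything here is an assumption for illustration: the threshold value, the payload shape, and the similarity_to_top score, which is taken as already computed upstream from the context graph and joinability map.

```python
from dataclasses import dataclass
from datetime import datetime

SIMILARITY_THRESHOLD = 0.8  # assumed value; the article does not state one


@dataclass
class Candidate:
    table: str
    similarity_to_top: float  # assumed to be computed upstream
    last_updated: datetime    # freshness signal
    description: str


def resolve(top, others, now):
    """Return one table, or the differences between near-duplicate candidates."""
    rivals = [c for c in others if c.similarity_to_top >= SIMILARITY_THRESHOLD]
    if not rivals:
        return {"status": "resolved", "table": top.table}
    return {
        "status": "ambiguous",
        "candidates": [
            {
                "table": c.table,
                "age_days": (now - c.last_updated).days,
                "description": c.description,
            }
            for c in [top] + rivals
        ],
    }


# Hypothetical candidates for a question about customers.
now = datetime(2024, 6, 1)
top = Candidate("analytics.customers", 1.0, datetime(2024, 5, 31), "Current customer master")
legacy = Candidate("legacy.customers_v1", 0.93, datetime(2023, 4, 28), "Customer master before the CRM migration")
orders = Candidate("prod.orders", 0.12, datetime(2024, 5, 31), "Order line items")

result = resolve(top, [legacy, orders], now)
```

The agent receives the candidates with their ages and descriptions and can surface the choice to the user instead of silently picking one.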

Serving the context: retrieval and acceleration

Eon serves the context graph to agents three ways, because a context graph is only useful if an AI agent can hit it at query time.

Semantic search and prompt-to-table embeddings. Even when a matching view or table exists, the AI agent has to find it. When matching isn't straightforward, Eon embeds the prompt and the destination table/view in a vector store and retrieves over them; RAG works well for this routing problem.
Auto-managed SQL views. On ingest Eon generates SQL views to capture complex queries, and when an AI agent iterates to build a new query, the resulting SQL is saved as a reusable view so similar questions reuse it instead of re-deriving it. Views are kept virtual when they're cheap to resolve, and materialized when complexity or runtime warrants it, so the agent pays the multi-step derivation once and subsequent reads hit precomputed results.
Native MCP/A2A serving. Eon exposes the retrieval surface directly from the substrate over MCP (Model Context Protocol) and A2A (Agent-to-Agent): smart Iceberg layouts, cached scans, and vectorized compute in place, with planned scans for RAG and no separate serving tier. In shipped form, the Eon AI Agent discovers relevant tables across environments, runs cross-resource and cross-snapshot analysis with automatic join detection, generates SQL, and is callable from Claude Code, Codex, and Gemini Enterprise through the Eon MCP server and A2A.
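
The prompt-to-table routing step can be sketched with a toy retrieval loop. This is a stand-in, not Eon's retrieval stack: token counts replace a real embedding model, a dictionary replaces the vector store, and the catalog entries are invented for the example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: token counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical catalog: table or view name -> generated description.
catalog = {
    "views.account_usage_growth": "monthly usage growth per customer account joined to crm account name",
    "prod.invoices": "billing invoices with amount currency and due date",
    "crm.contacts": "contact people with email and phone for each crm account",
}
index = {name: embed(desc) for name, desc in catalog.items()}


def route(prompt, top_k=1):
    """Rank catalog entries by similarity to the prompt and return the best names."""
    query = embed(prompt)
    ranked = sorted(index, key=lambda name: cosine(query, index[name]), reverse=True)
    return ranked[:top_k]


best = route("which accounts are expanding usage")
```

With a real embedding model the match would not depend on shared tokens, which is the point of embedding the generated descriptions rather than string-matching headers.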

Personalization: a hierarchical knowledge graph

"Active customer" means one thing to a CS engineer and another to finance. Eon stores these subjective definitions as knowledge served to LLMs through AI agents, in a hierarchical knowledge graph (company, then team, then person) layered on top of the context graph so a query resolves against the requester's definitions. The definitions are configuration-driven, not a self-correcting loop: the same prompt returns the answer that matches what the requester's role means by the term.

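One way to picture resolution against a company, team, and person hierarchy is a lookup that falls back from the most specific scope to the most general. The sketch below assumes that precedence (person overrides team, team overrides company), which the article implies but does not spell out; the scopes, names, and definitions are hypothetical.

```python
from typing import Optional

# Hypothetical configuration: (scope, name) -> term definitions.
definitions = {
    ("company", "acme"): {"active customer": "has a signed contract"},
    ("team", "finance"): {"active customer": "was invoiced in the last 90 days"},
    ("team", "cs"): {"active customer": "logged in during the last 30 days"},
    ("person", "dana"): {},
}


def resolve_definition(term: str, person: str, team: str, company: str) -> Optional[str]:
    """Return the most specific definition of a term for this requester."""
    for scope in (("person", person), ("team", team), ("company", company)):
        value = definitions.get(scope, {}).get(term)
        if value is not None:
            return value
    return None


finance_view = resolve_definition("active customer", "dana", "finance", "acme")
cs_view = resolve_definition("active customer", "lee", "cs", "acme")
fallback = resolve_definition("active customer", "sam", "sales", "acme")
```

Because the definitions are plain configuration, the same prompt from finance and from CS resolves to different filters deterministically, with the company-level definition as the fallback.
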
Trade-offs and limitations

Joinability mixes fact and inference. Declared foreign keys enter as ground truth; inferred relationships are probabilistic and can over-fire when unrelated columns share a value range. Filtering, plus reconciliation against context and freshness, mitigates this but doesn't eliminate false positives.
Generated layers need governance. Column descriptions, entity clusters, and saved views are model-derived. The classification override API signals that human-in-the-loop is expected; descriptions and views warrant the same review.
The layers are continuous compute. Describing and profiling every column across thousands of databases, then monitoring freshness, is ongoing work, not a one-time index build.

How the layers compose

Each layer reuses the one beneath it, from capture up through classification, the join map, reconciliation, serving, and the knowledge graph. The result is one self-describing substrate, and what an engineering team plugs into is open Iceberg tables and an MCP/A2A endpoint: no separate semantic-layer build, no separate serving tier.

Figure: Every layer reads from and writes back to the context graph; the agent only ever sees Iceberg tables and one endpoint.

If you own an Iceberg or lakehouse decision and are working out what your tables actually need before agents can use them, that architecture is the right thing to walk through end to end. We're happy to do that against your own estate: a technical session or a scoped pilot on your sources.

FAQ

What does it mean for a data lake to be "AI-ready"?

An AI-ready data lake adds the layers an AI agent needs above the table format: natural-language descriptions of every column, clustered business entities, a cross-source join map, and resolution of duplicate or stale tables. An open table format alone gives an AI agent typed columns and engine-agnostic reads, but not the meaning, relationships, or freshness signals it needs to answer reliably.

How does Eon detect joins across data sources that share no key?

Eon combines two signals. It reads the foreign keys the source database already declares, and, for the relationships no one declared, it analyzes column values rather than names. While capture runs, Eon takes a lightweight statistical fingerprint of each column, then flags pairs whose values overlap across otherwise unconnected sources, excluding low-variety columns and comparing only type-compatible ones so the joins are meaningful. In Eon's worked example, a cloud-account-ID correlated against the cloud marketplace identifier in CRM exports anchored a join between production adoption metrics and CRM customer records, with no declared foreign key.

Does building an AI-ready Iceberg lake create vendor lock-in?

Eon writes data into standard, open Apache Iceberg, so engines like Databricks, Snowflake, BigQuery, Trino, and Athena can read the same tables through their existing Iceberg integrations, and the context layers publish into existing catalogs. A patented storage engine sits beneath the open format for warehouse-grade economics, but the surface consumers depend on is standard Iceberg, so the lake stays readable by any engine without lock-in.

How do AI agents query Eon's context graph?

Eon exposes its retrieval surface natively over the Model Context Protocol (MCP) and Agent-to-Agent (A2A) directly from the substrate, with no separate serving tier. In shipped form, the Eon AI Agent discovers relevant tables, runs cross-resource and cross-snapshot analysis with automatic join detection, generates SQL, and is callable from Claude Code, Codex, and Gemini Enterprise through the Eon MCP server and A2A.

How does Eon classify data automatically on ingest?

On ingestion Eon runs an AI agent that generates a contextual natural-language description for every column in every table, then clusters tables into business entities stored in a context graph. Eon also auto-classifies data classes such as PII, financial, and PHI at the resource level, with an API to override or revert the classification.