
Inside Eon's Vault: A Source-Aware Storage That Drives Radical Efficiency

Why Eon's vault stores a fraction of what native snapshots store, mechanism by mechanism.

Written by Moshe Shelly
Updated on: Jun 4, 2026

Quick Summary

  • Eon's vault deduplicates fingerprinted blocks across snapshots and resources, so every capture after the first is incremental.
  • Per-source pipelines remove each source's specific waste: databases drop derived data and store logical rows as Parquet; block volumes compress plaintext instead of ciphertext.
  • Packing, lock-aware retention, and encryption ordered after dedup keep PUT fees and Object Lock overhead from eating the savings.
  • On a representative estate, the model lands at roughly a third of the cost of a snapshot-wrapper vault, with identical restore SLAs and immutability guarantees.

Eon's vault stores less because it understands what it's storing. Each source type carries different waste: derived data in database snapshots, ciphertext that won't compress on block volumes, full copies posing as retention points. Eon removes that waste per source, deduplicates fingerprinted blocks across the estate, then packs and locks the result so request fees and immutability don't claw the savings back. On a representative estate, the result is roughly a third of the cost of a snapshot-wrapper vault.

The structural fact that drives Eon's design is that the waste is different for every source type. A database snapshot is expensive for reasons that have nothing to do with why a block-volume backup is expensive. So a storage layer that wants to be meaningfully smaller can't apply one trick everywhere; it has to understand what each source actually is. Eon's vault was built from the storage layer up to do that.

Rather than catalog every source type Eon supports, this piece explains the foundation, walks through the two examples that best show what "source-aware" buys you, and then covers the cross-cutting layers (encryption ordering, packing, and immutability) that those reductions feed into. Where we cite numbers, we mark which are Eon's own measurements and which are independently verifiable, because a cost figure without a mechanism is marketing, and the mechanism is what you came here for.

How to read the numbers

Savings track the structure of the data: workloads with high cross-snapshot redundancy and removable derived data compound hard; low-redundancy, already-compressed workloads see less. The headline figures below are Eon's measurements on workloads where the relevant mechanism has room to work. Read them as the upper end of a range to validate against your own estate, not as a guarantee.

The foundation: fingerprint-indexed, forever-incremental block storage

Underneath every pipeline is one idea. Data is broken into blocks, each block is identified by a cryptographic-strength fingerprint of its contents, and a block whose fingerprint has already been seen is stored as a reference rather than written again. The technique is fingerprint-based block deduplication, and it gives two properties for free: every capture after the first is incremental, and deduplication spans both snapshots and resources. A block written once for one workload satisfies every other workload in the same vault that contains it.
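
The idea above can be sketched in a few lines. This is a minimal illustration, not Eon's implementation: the 4 KiB block size, the choice of SHA-256, and the `BlockStore` class are all assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size for this sketch


class BlockStore:
    """Toy content-addressed store: one physical copy per distinct block."""

    def __init__(self):
        self.blocks = {}  # fingerprint -> block bytes (stored once)

    def capture(self, data: bytes) -> list[str]:
        """Store a capture; return its manifest (ordered list of fingerprints)."""
        manifest = []
        for offset in range(0, len(data), BLOCK_SIZE):
            block = data[offset:offset + BLOCK_SIZE]
            fp = hashlib.sha256(block).hexdigest()
            # A fingerprint seen before is recorded as a reference only.
            self.blocks.setdefault(fp, block)
            manifest.append(fp)
        return manifest

    def restore(self, manifest: list[str]) -> bytes:
        return b"".join(self.blocks[fp] for fp in manifest)


store = BlockStore()
volume_a = b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE + b"C" * BLOCK_SIZE
snap1 = store.capture(volume_a)
stored_after_first = len(store.blocks)

# Second snapshot of the same resource: only the middle block changed.
volume_a2 = b"A" * BLOCK_SIZE + b"X" * BLOCK_SIZE + b"C" * BLOCK_SIZE
snap2 = store.capture(volume_a2)

# A different resource that happens to contain two of the same blocks.
volume_b = b"C" * BLOCK_SIZE + b"A" * BLOCK_SIZE
snap3 = store.capture(volume_b)
```

Eight logical blocks were captured across three snapshots and two resources, but only four are physically stored: the second snapshot added one block and the second resource added none.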

Two positions are worth stating, because they're where dedup systems usually succeed or fail.

The hard part is the index, not the hashing. The real engineering cost of a fingerprint-indexed store is the size and lookup cost of the index across an estate with billions of blocks. Eon treats that index as an optimization, not a system of record: a missed lookup costs one re-stored block on the next backup, never correctness. The decision lets the index run on cheaper, less durable infrastructure than the data tier, and it's the difference between a dedup pool that stays affordable at scale and one that doesn't.
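
To make "optimization, not system of record" concrete, here is a hedged sketch. It assumes manifests record where a block lives, so the restore path never consults the index; that layout and the `Vault` class are illustrative assumptions, not a description of Eon's internals.

```python
import hashlib


class Vault:
    def __init__(self):
        self.log = []      # durable data tier: append-only list of blocks
        self.index = {}    # cheap, lossy hint: fingerprint -> position in log

    def put(self, block: bytes) -> int:
        fp = hashlib.sha256(block).digest()
        pos = self.index.get(fp)
        if pos is None:              # index miss (new block OR lost entry)
            self.log.append(block)   # worst case: one redundant copy
            pos = len(self.log) - 1
            self.index[fp] = pos
        return pos                   # manifests store positions

    def get(self, pos: int) -> bytes:
        return self.log[pos]         # restore path never consults the index


vault = Vault()
first = [vault.put(b) for b in (b"alpha", b"beta")]
vault.index.clear()                  # simulate losing the whole index
second = [vault.put(b) for b in (b"alpha", b"beta")]
```

After the simulated index loss, the next backup stores two redundant blocks, yet both manifests still restore correctly. The failure mode is wasted space, never wrong data.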

Chunking is a real trade-off in both directions. Fixed-size chunking is fast to compute, but it misses duplicates the moment data shifts: insert one byte near the start of a file and every downstream boundary moves. Content-defined chunking (CDC) uses a rolling hash to find natural boundaries, catching shifted duplicates at the cost of lower chunking throughput. Either way you still pay the index cost above (fixed-size doesn't escape it, it only produces more uniform keys), so the choice is about hit rate rather than about dodging the index. Neither approach is universally correct, so Eon picks the grain per source type rather than globally.
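
The one-byte-insert case is easy to reproduce. The sketch below is illustrative only: the window size, the mask, and the use of CRC32 recomputed at every position (instead of a true O(1) rolling hash with min/max chunk sizes) are simplifications chosen for brevity, not Eon's chunker.

```python
import hashlib
import random
import zlib

WINDOW = 16      # bytes of context that decide a boundary (assumed)
MASK = 0xFF      # cut when the low 8 bits are zero -> ~256-byte average chunks
FIXED = 256      # fixed chunk size chosen to match that average


def fixed_chunks(data: bytes) -> list[bytes]:
    return [data[i:i + FIXED] for i in range(0, len(data), FIXED)]


def cdc_chunks(data: bytes) -> list[bytes]:
    """Cut wherever a hash of the preceding WINDOW bytes matches a pattern."""
    chunks, start = [], 0
    for i in range(WINDOW, len(data)):
        if zlib.crc32(data[i - WINDOW:i]) & MASK == 0:
            chunks.append(data[start:i])
            start = i
    chunks.append(data[start:])
    return chunks


def fingerprints(chunks: list[bytes]) -> set[bytes]:
    return {hashlib.sha256(c).digest() for c in chunks}


rng = random.Random(42)
original = bytes(rng.randrange(256) for _ in range(20_000))
shifted = b"\x00" + original          # insert one byte at the very start

fixed_orig = fingerprints(fixed_chunks(original))
fixed_shared = len(fixed_orig & fingerprints(fixed_chunks(shifted)))

cdc_orig = fingerprints(cdc_chunks(original))
cdc_shared = len(cdc_orig & fingerprints(cdc_chunks(shifted)))
```

With fixed-size chunks every boundary moves and no chunk is shared. With content-defined boundaries, only the chunk containing the inserted byte changes, because each later boundary depends solely on the bytes immediately before it.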

(On the inevitable question of hash collisions: with a strong cryptographic digest, the odds that two distinct blocks collide are orders of magnitude below the rate of latent bit errors that already threaten data on any disk. The backup therefore inherits the same integrity floor as the hardware it runs on, which is the well-documented justification for compare-by-hash wherever it's used in production.)
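
For a sense of scale, the standard birthday bound gives a worked figure. The digest width (256 bits) and the block count are assumptions picked for illustration; the article does not state which digest Eon uses. The digest is modeled as uniformly random.

```latex
% n distinct blocks, b-bit digest, union bound over all pairs:
P(\text{collision}) \;\le\; \binom{n}{2}\, 2^{-b} \;<\; \frac{n^{2}}{2^{\,b+1}}

% Example: b = 256, n = 2^{40} blocks (about 4 PiB of 4 KiB blocks)
P \;<\; \frac{2^{80}}{2^{257}} \;=\; 2^{-177} \;\approx\; 5 \times 10^{-54}
```

Unrecoverable read-error rates on drive datasheets are commonly quoted on the order of one per 10^14 to 10^15 bits read, so the collision term sits dozens of orders of magnitude below the hardware's own error floor.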

Source-awareness in practice

Example 1 - Databases: don't back up the disk image

A native RDS snapshot captures the database's underlying block volume, so the captured artifact is the engine's physical on-disk image: live tables, indexes, freed-but-unreclaimed pages, padding, internal overhead. Indexes alone are typically 10–30% of an OLTP database: pure derived data, re-stored on every snapshot.

Eon doesn't ship the on-disk image at all. The pipeline restores a temporary instance in an isolated scanning account, reads every table over streaming SQL, and writes columnar Parquet directly into the vault. Three things follow:

  • Derived data disappears. Only logical row data plus the DDL to rebuild every index is stored; the indexes themselves are recomputed in minutes after a restore. Parquet's columnar layout then compresses the live rows a further ~2–4×, because adjacent values in the same column are far more redundant than adjacent bytes on a disk page. First full capture: typically 50–70% smaller than the disk-image snapshot.
  • Incrementals are row-aware and vacuum-immune. Rows stream through a content-defined chunker that splits them into logical blocks aligned to row boundaries. A 100-byte row update touches only the blocks containing its bytes, versus a disk snapshot that re-bills 512 KiB of "change" plus another 512 KiB per index page the row sits in. And because Eon fingerprints live row content, background reclamation (vacuum/autovacuum and equivalents) is invisible: it shuffles dead tuples around the disk image without changing live data, so a disk-snapshot incremental sees huge change where Eon sees none. Typical OLTP incrementals run ~70% smaller.
  • No 35-day cliff. Native automated backups go full-copy beyond ~35 days, and on every cross-account or cross-region copy. Eon's Parquet chain re-uses every row block already present in the destination, collapsing long-retention and multi-region workloads from ~12–52× the dataset to roughly 1.1×.
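
The vacuum point in the second bullet can be shown with a toy model. Everything here is a simplification for illustration: the 16-byte slots, the 4-slot pages, and per-row fingerprints (in place of row-aligned content-defined blocks) are assumptions, not the engine's or Eon's actual layout.

```python
import hashlib

SLOT = 16            # toy fixed-width tuple slot, in bytes
SLOTS_PER_PAGE = 4   # toy 64-byte page; real engines use multi-KiB pages
DEAD = b"<dead tuple>"
FREE = b""


def page_fingerprints(slots: list[bytes]) -> list[str]:
    """Physical view: hash each page of the on-disk image."""
    image = b"".join(s.ljust(SLOT, b"\x00") for s in slots)
    page = SLOT * SLOTS_PER_PAGE
    return [hashlib.sha256(image[i:i + page]).hexdigest()
            for i in range(0, len(image), page)]


def row_fingerprints(slots: list[bytes]) -> set[str]:
    """Logical view: hash each live row, wherever it sits on disk."""
    return {hashlib.sha256(s).hexdigest() for s in slots if s not in (DEAD, FREE)}


before_vacuum = [b"id=1,alice", DEAD, b"id=2,bob", DEAD,
                 b"id=3,carol", DEAD, b"id=4,dave", DEAD]
after_vacuum = [b"id=1,alice", b"id=2,bob", FREE, FREE,
                b"id=3,carol", b"id=4,dave", FREE, FREE]

changed_pages = sum(a != b for a, b in zip(page_fingerprints(before_vacuum),
                                           page_fingerprints(after_vacuum)))
changed_rows = row_fingerprints(before_vacuum) ^ row_fingerprints(after_vacuum)
```

Both physical pages changed, so a disk-image incremental would re-ship them, while the set of live-row fingerprints is identical and a row-level incremental ships nothing.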

The same artifact is also a query-ready dataset.

Example 2 - Block volumes: read the plaintext, not the ciphertext

For EBS the source is a raw device, and the most instructive saving is a single counter-intuitive interaction. AWS does compress EBS snapshot data, but only on unencrypted volumes. Encryption-on is the current default for new EBS volumes, so virtually every production snapshot is ciphertext, which is statistically random and incompressible by any general-purpose codec. The advertised "compressed snapshot" benefit doesn't apply to most real fleets.

The EBS quirk is a specific case of a general rule worth internalizing: deduplication can survive encryption, but compression cannot. Encrypt before you compress and there's nothing left to squeeze. Eon's pipeline reads from the mounted, decrypted volume in the scanning account, so what it compresses is plaintext OS, application, and database content. Standard zstd typically recovers about 2× that the native path leaves on the table. On top of that, Eon tracks change at the 4 KiB filesystem page instead of native 512 KiB (a 128× finer grain on the small scattered writes that dominate real workloads) and deduplicates changed pages against the prior chain before upload. Stacked, these reach up to ~70% smaller incrementals, rising toward ~90% once a periodic global pass folds in cross-resource redundancy (the same OS image in every VM, the same libraries in every container).
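
The compress-before-encrypt rule is easy to verify. In this sketch, `zlib` from the standard library stands in for zstd, and random bytes stand in for an encrypted volume (good ciphertext is indistinguishable from random data); both substitutions are assumptions made so the example needs no third-party packages.

```python
import os
import zlib

# Stand-in for typical volume content: repetitive text (logs, configs, code).
plaintext = b"2026-06-04T12:00:00Z INFO request served status=200 path=/health\n" * 4096

# Stand-in for an encrypted volume: random bytes of the same length.
ciphertext_like = os.urandom(len(plaintext))

plain_ratio = len(plaintext) / len(zlib.compress(plaintext, 6))
cipher_ratio = len(ciphertext_like) / len(zlib.compress(ciphertext_like, 6))

# Change-tracking grain: native 512 KiB blocks versus 4 KiB pages.
grain_ratio = (512 * 1024) // (4 * 1024)
```

The plaintext shrinks dramatically (far more than real mixed volume content would, since this sample is deliberately repetitive), while the ciphertext stand-in does not shrink at all. `grain_ratio` reproduces the 128× figure quoted above.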

The same logic extends outward

The two examples above are the clearest illustrations of a method applied wherever it fits. DynamoDB, which has no native chained-incremental format (every retention point is a full table copy), gets an initial Parquet full plus item-level change capture that's coalesced at the destination vault, turning a ~19× one-year footprint into under 2×. S3, where churn rather than blocks dominates, gets partition-scoped variable-size dedup so an appended log object uploads only its new bytes, with no source-bucket versioning required. Each source carries a different multiplier, and the principle holds: understand the source, then remove its specific waste.

Encryption ordered to compose with deduplication

The naïve way to encrypt captures, a fresh per-object key, defeats dedup outright: identical plaintext becomes distinct ciphertext, every dedup hit collapses into a unique object, and the design forces the customer to choose between key control and cost.

Eon orders the operations instead. Deduplication (and compression) run first, on the plaintext block stream, so identical blocks resolve to a single copy and plaintext actually compresses. Only then are the deduplicated blocks packed into a storage object and the object encrypted, one layer up. The encryption boundary never sees the identical-plaintext/different-ciphertext problem, so dedup ratios survive end to end. The ordering is exactly why the block-volume pipeline's decrypt-then-compress step works. Encrypting at the object boundary also makes each object independently destroyable (delete its key, the data is gone: standard crypto-shredding). Key custody is configurable: an Eon-managed key by default, or a customer-managed KMS key for AWS/GCP vaults via standard envelope encryption.
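
A short sketch shows why the order matters. The `toy_encrypt` function below is a hypothetical XOR keystream built from SHAKE-256, used only so the example runs on the standard library; it is not secure and is not what a real vault would use (an authenticated cipher under envelope encryption would be the usual choice).

```python
import hashlib
import os
import zlib


def toy_encrypt(key: bytes, data: bytes) -> bytes:
    """XOR with a SHAKE-256 keystream. Illustration only, NOT secure crypto."""
    stream = hashlib.shake_256(key).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, stream))


def fingerprint(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()


# Three captures that each contain the same 4 KiB plaintext block.
block = (b"shared library page\n" * 205)[:4096]
captures = [block, block, block]

# Wrong order: encrypt each capture under a fresh key, then try to dedup.
encrypt_first = {fingerprint(toy_encrypt(os.urandom(32), b)) for b in captures}

# Eon's order as described above: dedup on plaintext, compress, pack,
# then encrypt the packed object once, one layer up.
unique = {fingerprint(b): b for b in captures}
packed = b"".join(zlib.compress(b) for b in unique.values())
object_key = os.urandom(32)
sealed = toy_encrypt(object_key, packed)
```

Encrypting first yields three distinct fingerprints for one logical block; deduplicating first yields one block, compressed and sealed in a single object. Discarding `object_key` makes `sealed` unrecoverable, which is the crypto-shredding property.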

The PUT tax nobody prices

Object storage charges per request as well as per byte, and the per-request fee is identical whether the object is 128 KiB or 128 MiB. Colder tiers have a cheaper per-byte rate but a more expensive PUT, so the smaller the object, the more the request fee dominates. Take 1 GiB written as 128 KiB objects (a typical log/telemetry stream): that is 8,192 PUTs. The table below expresses the resulting PUT bill as the number of months of storage for that same GiB it would buy, at AWS list pricing:

The PUT tax, by storage tier

Storage tier | PUT $/GiB (unpacked) | Storage $/GiB-month | PUT tax in months of storage
S3 Standard | $0.041 | $0.0230 | ~1.8 months
S3 Infrequent Access | $0.082 | $0.0125 | ~6.6 months
Glacier Instant Retrieval | $0.164 | $0.0040 | ~41 months
Glacier Deep Archive | $0.410 | $0.0010 | ~400 months (~33 yr)

1 GiB written as 128 KiB objects = 8,192 PUTs, at AWS list pricing. Independently verifiable.
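
The table's arithmetic can be reproduced directly. The per-1,000-request PUT prices below are back-derived from the table's "PUT $/GiB" column (for example, $0.041 / 8.192 ≈ $0.005); treat them as assumptions and check current list pricing for your region.

```python
OBJECT_KIB = 128
PUTS_PER_GIB = (1024 * 1024) // OBJECT_KIB   # 8,192 objects in 1 GiB

# tier -> (PUT price per 1,000 requests, storage price per GiB-month), USD.
TIERS = {
    "S3 Standard": (0.005, 0.0230),
    "S3 Infrequent Access": (0.010, 0.0125),
    "Glacier Instant Retrieval": (0.020, 0.0040),
    "Glacier Deep Archive": (0.050, 0.0010),
}


def put_tax(tier: str) -> tuple[float, float]:
    """Return (PUT cost for 1 GiB of small objects, months of storage it buys)."""
    put_per_1000, storage = TIERS[tier]
    put_cost = PUTS_PER_GIB / 1000 * put_per_1000
    return put_cost, put_cost / storage


for name in TIERS:
    cost, months = put_tax(name)
    print(f"{name:<26} ${cost:.3f}  {months:6.1f} months")
```

The Deep Archive row works out to about 410 months, which the table rounds to ~400.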

On Deep Archive, the PUT bill for 1 GiB of small objects buys ~33 years of storing that gigabyte. Eon never uploads a small object as its own PUT: small objects and 4 KiB blocks accumulate and flush as a single PUT once a blob fills an adaptive target size, chosen from the access profile, storage class, and data lifecycle. Each blob is split into ~1 MiB zstd compression units read back via HTTP range requests, so random access by fingerprint stays an O(1) seek. On the coldest tier that $0.41 of PUTs drops to a fraction of a cent; even on Standard, packed PUTs run 100×+ cheaper.
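
Packing can be sketched as a buffer that flushes once it reaches a target size, with an index that maps each fingerprint to a byte range. The `Packer` class, its fixed 64 KiB target, and the in-memory "blobs" are illustrative assumptions; the sketch omits the adaptive target size and the ~1 MiB compression units described above.

```python
import hashlib


class Packer:
    """Accumulate small blocks and flush one blob (one PUT) per target size."""

    def __init__(self, target_size: int):
        self.target_size = target_size
        self.buffer = bytearray()
        self.pending = {}   # fingerprint -> (offset, length) in the open blob
        self.blobs = []     # flushed blobs; each flush models one PUT
        self.index = {}     # fingerprint -> (blob_id, offset, length)

    def add(self, block: bytes) -> None:
        fp = hashlib.sha256(block).hexdigest()
        if fp in self.index or fp in self.pending:
            return          # duplicate: already stored or queued
        self.pending[fp] = (len(self.buffer), len(block))
        self.buffer += block
        if len(self.buffer) >= self.target_size:
            self.flush()

    def flush(self) -> None:
        if not self.buffer:
            return
        blob_id = len(self.blobs)
        self.blobs.append(bytes(self.buffer))
        for fp, (offset, length) in self.pending.items():
            self.index[fp] = (blob_id, offset, length)
        self.buffer.clear()
        self.pending.clear()

    def read(self, fp: str) -> bytes:
        """Models an HTTP range request: fetch only [offset, offset+length)."""
        blob_id, offset, length = self.index[fp]
        return self.blobs[blob_id][offset:offset + length]


packer = Packer(target_size=64 * 1024)
blocks = [i.to_bytes(4, "big") * 1024 for i in range(256)]   # 256 x 4 KiB
for b in blocks:
    packer.add(b)
packer.flush()
```

Here 256 small blocks become 16 PUTs instead of 256, and any block is still retrievable with one index lookup plus one ranged read.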

(One honest caveat on the codec: zstd's value here is that across a band of its levels it holds a useful ratio while decompressing fast and at roughly stable speed, so raising the ratio in that band cuts footprint without proportionally taxing restore latency. The behavior is a property of a chosen operating band rather than a universal law about the codec.)

Object Lock without doubling the vault

Object Lock (WORM immutability) is a compliance requirement for many customers and a ransomware baseline for almost all. In most vaults it's the single largest line item, routinely doubling footprint. The reason is the collision between locking and packing: retention is set per object, so a locked packed blob is indivisible until it expires, even after only some of its contents are still referenced. The common industry compromise (a 30-day rolling re-lock that rewrites a fresh full copy every 30 days) keeps PUT costs sane but doubles the locked footprint during each window.

Eon's lock-aware packing groups blocks with similar predicted lifetimes into the same blob (so few stragglers survive when a lock elapses), sets each blob's extension from the actual residual lifetime of its contents, and rewrites a blob's live portion only when carrying its dead bytes costs more than rewriting. Neither expensive extreme (daily extension or monthly full re-lock) happens at scale. A second move helps further: the locked anchor never has to serve a fast restore (Object Lock is consulted only on the rare ransomware path), so the working copy stays compactable on the high tier while a separate immutable anchor lives on a much cheaper lock tier (Glacier IR or Deep Archive). Net Object Lock overhead lands around 10% over the deduplicated dataset, versus the 50–100% that fixed-epoch implementations carry. On a 1 PiB protected dataset, the gap is $25K versus $50K per month on identical storage.
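
Two of those decisions, grouping by predicted lifetime and the carry-versus-rewrite test, reduce to a few lines. The 30-day bucket, the cost model, and every name below are hypothetical; the article does not publish Eon's actual policy or parameters.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Block:
    fp: str
    size_gib: float
    expires_day: int      # predicted day the last reference goes away


def group_by_lifetime(blocks: list[Block], bucket_days: int = 30) -> dict[int, list[Block]]:
    """Pack blocks with similar predicted lifetimes into the same blob."""
    blobs = defaultdict(list)
    for b in blocks:
        blobs[b.expires_day // bucket_days].append(b)
    return dict(blobs)


def should_rewrite(dead_gib: float, extension_months: float,
                   storage_per_gib_month: float, rewrite_cost: float) -> bool:
    """Rewrite only when carrying dead bytes through the extension costs more."""
    carry_cost = dead_gib * storage_per_gib_month * extension_months
    return carry_cost > rewrite_cost


groups = group_by_lifetime([
    Block("a", 1.0, 10), Block("b", 1.0, 25),
    Block("c", 1.0, 40), Block("d", 1.0, 400),
])

# Mostly-dead blob with a long extension ahead: rewriting is cheaper.
rewrite_big = should_rewrite(0.5, 12, 0.004, 0.01)
# Nearly-live blob with a short extension: carry the dead bytes.
rewrite_small = should_rewrite(0.05, 1, 0.004, 0.01)
```

Blocks `a` and `b` share a blob because they are predicted to expire in the same window, so the blob can be released whole. The rewrite test fires only for the blob whose dead bytes would cost more to keep locked than to rewrite.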

Why dedup and Parquet don't conflict: the Live Data Lake

It's fair to ask how a block-level deduplicated store can also be a columnar Parquet dataset; the two seem to want opposite things. The resolution is that Eon applies them to different source classes. For relational and document sources, the live row/item stream is what gets both content-defined chunked and materialized as Parquet: dedup grain aligned to row boundaries, the same rows written into Parquet row-groups, so the output is simultaneously a deduplicated incremental chain and an open table. For raw block and object sources, dedup operates on opaque blocks, with no claim they're columnar.

Where the data is relational, the payoff is that the backup is the analytics dataset. Parquet under Apache Iceberg gives every modern engine - Snowflake, Databricks, Spark, BigQuery, Trino, Athena - concurrent SQL semantics, schema evolution, and time-travel over the same files, with no rehydration and no proprietary export tool. Iceberg's structure (snapshot → manifest list → manifests → data files) is, not coincidentally, how a versioned capture already thinks about point-in-time recovery.

Putting it together

The mechanisms reinforce one another, and because they all run at the source before any byte ships, the same multipliers that shrink stored bytes also shrink cross-region and cross-cloud transfer. A representative estate (500 TiB S3, 400 TiB EBS, 100 TiB RDS, 20 TiB DynamoDB, one-year retention with cross-region DR):

Representative estate, one-year retention with cross-region DR

Metric | Other-vendor vault | Eon vault
Total stored | ~2.22 PiB | ~0.83 PiB
Monthly storage (Standard-IA + Object Lock) | ~$37,000 | ~$12,000
Cross-region DR seed (one-time) | ~$47,000 | ~$18,000
Recurring monthly transfer | ~$14,000 | ~$5,300

500 TiB S3, 400 TiB EBS, 100 TiB RDS, 20 TiB DynamoDB. Illustrative model rather than a benchmark; results move with the structure of your data.

The result is roughly a third of the cost for identical protected data, restore SLAs, and immutability guarantees, because storage and transfer multipliers stack rather than compete.
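
The "roughly a third" claim can be checked against the table's own figures. The numbers below are copied from the table above; nothing is measured or added here, and the variable names are just for the example.

```python
# (other-vendor vault, Eon vault) pairs copied from the table above.
ESTATE = {
    "total_stored_pib": (2.22, 0.83),
    "monthly_storage_usd": (37_000, 12_000),
    "dr_seed_usd": (47_000, 18_000),
    "monthly_transfer_usd": (14_000, 5_300),
}

ratios = {name: eon / other for name, (other, eon) in ESTATE.items()}

# Recurring monthly bill: storage plus transfer.
other_monthly = 37_000 + 14_000
eon_monthly = 12_000 + 5_300
monthly_ratio = eon_monthly / other_monthly

# Protected data in the model: 500 + 400 + 100 + 20 TiB.
protected_pib = (500 + 400 + 100 + 20) / 1024
```

Every row lands between 0.32 and 0.39 of the other-vendor figure, and the combined recurring bill comes to about 0.34. The Eon total stored (~0.83 PiB) is also smaller than the ~1 PiB of protected data, while the other-vendor total is more than twice it.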

Inputs that shift these totals include a higher change rate, a different source mix, and already-compressed S3 content.

Honest limitations

Content-defined chunking is a moving target: a better chunker could shift the trade-off the current design is tuned for. The global dedup gain is deferred under strict Object Lock: duplicates are reclaimed only as lock windows roll over, so cross-resource savings materialize over a few cycles rather than instantly. Sources that hold already-compressed user content, such as S3, compress only ~30%, so the win there is churn-and-versioning elimination rather than compression. And customer-managed-key ergonomics (rotation, revocation, multi-region failover) remain operationally sharp industry-wide, which is why Eon-managed keys stay the recommended default.

None of these limitations changes the core argument: a block-deduplicated, source-aware, compressed, customer-encrypted vault, with Parquet under an open table format wherever the source is relational, is a structurally cheaper and more useful place for captured data to live than a proprietary blob. The cost effect tracks the structure of the data, and we've tried to show the mechanism behind every number.

If you want to walk this against your own estate (restore patterns on cold tiers, your KMS topology, or how the Parquet/Iceberg surface lands in your existing catalog), Eon engineering will work through it with you.

Moshe Shelly

Principal Tech Product Marketing Manager at Eon