Modern Software Tools, Explained
Storage Engines

The Write-Heavy Problem - LSM Trees

Once a key-value store gets serious write traffic, the bottleneck is often no longer hashing or indexing. It is the physical act of rewriting scattered pages on disk. LSM trees change the game by making the write path almost entirely sequential, then paying the cleanup bill later through compaction.

Sequential write path · MemTable and SSTables · Bloom filters as a common file feature · LevelDB, RocksDB, Cassandra
§01

Why in-place updates hurt

A B-tree is excellent when the storage medium tolerates random page updates cheaply. But a busy write workload turns every tiny logical change into a small storm of physical work: append to the write-ahead log, rewrite the leaf page, maybe rewrite an internal page, maybe split, and eventually flush all of that through the filesystem and device.

The pain is not that B-trees are bad. The pain is that updating in place creates write amplification. A 40-byte value update can force kilobytes of page traffic, plus random I/O that is much harder for storage hardware to batch efficiently than a single append stream.

The core LSM argument is physical, not abstract: if random page rewrites are your bottleneck, stop rewriting pages on every write.
Same logical writes, different physical behavior

B-tree style update

Rewrite pages now

Each overwrite reaches into existing structure. The fast path is still several writes, and unlucky writes trigger page splits.

LSM style update

Append now, sort later

Durability comes from an append-only log, while the in-memory table absorbs updates until it is flushed as a new sorted run.
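The contrast can be sketched with a toy cost model. The numbers below are illustrative, not measurements from any real engine, and the function names are invented for this sketch:

```javascript
// Toy cost model contrasting the two write paths.
const PAGE_SIZE = 4096;

// B-tree style: every logical update rewrites at least the leaf page,
// and an occasional unlucky update triggers a split (extra page writes).
function btreeWriteCost(updates, splitEvery = 50) {
  let pagesWritten = 0;
  for (let i = 1; i <= updates; i++) {
    pagesWritten += 1;                           // rewrite the leaf page
    if (i % splitEvery === 0) pagesWritten += 2; // split: new leaf + parent update
  }
  return pagesWritten * PAGE_SIZE;               // bytes of page traffic
}

// LSM style: every logical update is one small sequential append.
function lsmWriteCost(updates, entrySize = 64) {
  return updates * entrySize;                    // bytes appended to the log
}

// 1,000 small logical updates turn into very different physical work:
console.log(btreeWriteCost(1000)); // 4259840 bytes of page-granularity rewrites
console.log(lsmWriteCost(1000));   // 64000 bytes of sequential appends
```

The model deliberately ignores caching and group commit; the point is only the page-granularity multiplier on the in-place path versus the entry-granularity append on the log path.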

[Interactive counters: logical writes, B-tree page rewrites, B-tree splits, LSM sequential appends]

[Interactive demo: recent B-tree touches vs. recent LSM touches]

random_vs_sequential.js - immediate write-path work
§02

The LSM insight

The log-structured merge tree flips the write path on purpose. New writes go to two places: an append-only log for durability, and an in-memory sorted structure usually called the MemTable. When the MemTable fills, it is frozen and flushed out as a new immutable sorted file: an SSTable.

That makes writes cheap up front, but reads now have to consult multiple places: the active MemTable, maybe a recently frozen MemTable, and several SSTables on disk. The system accepts that temporary disorder because compaction will later merge those files into larger, cleaner levels.

An LSM tree is really a promise: "I will not overwrite old data in place. I will accumulate sorted runs, then reconcile them in the background."
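The MemTable-to-SSTable handoff can be sketched in a few lines. The class and threshold below are assumptions made for this sketch, and a real engine would also append each write to the WAL before acknowledging it:

```javascript
// Minimal MemTable flush sketch. A Map absorbs writes (and overwrites)
// in memory; when it hits a size limit, it is frozen into a sorted,
// immutable run — the shape of an SSTable.
class MemTable {
  constructor(limit = 3) {
    this.limit = limit;
    this.entries = new Map();
  }
  put(key, value) {
    this.entries.set(key, value);          // overwrite is free in memory
    return this.entries.size >= this.limit; // signal: time to flush
  }
  // Freeze into an immutable sorted run.
  toSSTable() {
    return [...this.entries.entries()].sort(([a], [b]) =>
      a < b ? -1 : a > b ? 1 : 0);
  }
}

const sstables = [];       // newest first
let active = new MemTable();
for (const [k, v] of [["c", 1], ["a", 2], ["a", 9], ["b", 3]]) {
  if (active.put(k, v)) {  // MemTable full: flush and start fresh
    sstables.unshift(active.toSSTable());
    active = new MemTable();
  }
}
console.log(sstables[0]);
// [["a", 9], ["b", 3], ["c", 1]] — sorted, the overwrite of "a" absorbed in memory
```

Note that the overwrite of `"a"` never reaches disk twice: the MemTable coalesces it before the flush, which is one reason the write path stays cheap.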
MemTable flush pipeline

[Interactive demo: active MemTable feeding SSTables on disk (newest first); counters for WAL appends, MemTable entries, flushes, SSTables]
§03

Compaction as a cascade

Immutable SSTables make the write path fast, but they create a new problem: multiple versions of the same key accumulate across files. Deletes become tombstones. Reads may have to search several runs. So the system keeps merging files down the tree.

This is why compaction is not an incidental optimization. It is the second half of the data structure. The write path borrows time from the future, and compaction is how that debt gets paid.

The LSM trade is explicit: lower write latency and higher write throughput now, background rewrite work later.
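The heart of that background work is a merge over sorted runs where newer values win and tombstones are reclaimed once they reach the bottom level. A sketch, with `null` standing in for a tombstone and all names invented for illustration:

```javascript
// Compacting merge: several sorted runs (newest first) collapse into one.
// Newer values win on key collision; tombstones are dropped only when
// compacting into the bottom level, where no older version can survive.
const TOMBSTONE = null;

function compact(runs, isBottomLevel = true) {
  const merged = new Map();
  // Walk oldest run first so newer runs overwrite on collision.
  for (const run of [...runs].reverse()) {
    for (const [key, value] of run) merged.set(key, value);
  }
  return [...merged.entries()]
    .filter(([, v]) => !(isBottomLevel && v === TOMBSTONE)) // reclaim deletes
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
}

const newest = [["a", "a2"], ["d", TOMBSTONE]]; // "d" was deleted recently
const older  = [["a", "a1"], ["b", "b1"], ["d", "d1"]];
console.log(compact([newest, older]));
// [["a","a2"], ["b","b1"]] — one version per key, tombstone reclaimed
```

Keeping tombstones in non-bottom compactions matters: dropping `d`'s tombstone too early would let the stale `d1` in an older run become visible again.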
Compaction cascade

[Interactive demo: counters for L0 files, L1 files, L2 files, and files checked for a key]
§04

Bloom filters as a common SSTable feature

Point lookups are where a naive LSM can feel painful. If the key is absent, the system may have to ask file after file, "Do you have this key?" and pay disk work for repeated "no" answers. A Bloom filter changes the economics of that question.

Many LSM systems attach a compact probabilistic summary to each SSTable or data block. On lookup, the filter can say either "definitely not here" or "maybe here." That means many negative lookups can skip opening the data file entirely. The only tolerated mistake is a false positive, never a false negative.

Bloom filters are a very common SSTable feature, but not a universal law of the data structure. LevelDB, for example, exposes them through an optional filter policy. This explainer treats them as first-class because they matter so much for point reads.
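The "definitely not / maybe" contract is easy to see in a toy filter. The bit count, hash count, and FNV-style hash below are illustrative choices for this sketch, not parameters from any real engine:

```javascript
// Toy Bloom filter: k seeded hashes set k bits per key. A lookup that
// finds any unset bit is a definite miss; all-set bits mean "maybe".
class BloomFilter {
  constructor(bits = 64, hashes = 3) {
    this.bits = new Array(bits).fill(false);
    this.hashes = hashes;
  }
  // Simple seeded string hash (FNV-style); good enough for a demo.
  hash(key, seed) {
    let h = 2166136261 ^ seed;
    for (const ch of key) h = Math.imul(h ^ ch.charCodeAt(0), 16777619);
    return Math.abs(h) % this.bits.length;
  }
  add(key) {
    for (let s = 0; s < this.hashes; s++) this.bits[this.hash(key, s)] = true;
  }
  mightContain(key) {
    for (let s = 0; s < this.hashes; s++) {
      if (!this.bits[this.hash(key, s)]) return false; // definitely not here
    }
    return true; // maybe here: false positives possible, false negatives never
  }
}

const filter = new BloomFilter();
for (const key of ["apple", "banana"]) filter.add(key);
console.log(filter.mightContain("apple"));  // true — inserted keys always pass
console.log(filter.mightContain("durian")); // usually false: skip the file entirely
```

Every `false` answer is a data file the read path never has to open, which is why even a few bits per key can transform point-lookup cost for absent keys.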
One query across several SSTables

[Interactive demo: counters for Bloom rejections, disk probes, actual hits, false positives]
bloom_guard.js - Bloom filters ruling SSTables out
§05

A simplified LSM store in JavaScript

The implementation below is deliberately small, but it includes the essential pieces: a MemTable, immutable SSTables, tombstones for deletes, per-SSTable Bloom filters, point lookups from newest to oldest, and a compacting merge that rewrites several files into one cleaner run.

This is not a production storage engine. It is the minimum shape that still feels like an LSM tree instead of just a sorted map with extra steps.

lsm_store.js - simplified MemTable + SSTables + compaction
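A sketch in that spirit, combining the pieces from the earlier sections. Every name, limit, and simplification here is an assumption made for illustration (a Map as the MemTable, arrays as SSTables, `null` as the tombstone, a toy Bloom filter), not any engine's actual API:

```javascript
const TOMBSTONE = null;

// Toy per-SSTable Bloom filter helpers (FNV-style seeded hash).
function hash(key, seed) {
  let h = 2166136261 ^ seed;
  for (const ch of key) h = Math.imul(h ^ ch.charCodeAt(0), 16777619);
  return Math.abs(h);
}
function bloomAdd(bits, key) {
  for (let s = 0; s < 3; s++) bits[hash(key, s) % bits.length] = true;
}
function bloomMaybe(bits, key) {
  for (let s = 0; s < 3; s++) if (!bits[hash(key, s) % bits.length]) return false;
  return true;
}

class LSMStore {
  constructor(memLimit = 3) {
    this.memLimit = memLimit;
    this.memtable = new Map(); // absorbs writes and overwrites
    this.sstables = [];        // newest first: { bloom, entries (sorted) }
  }
  put(key, value) {
    this.memtable.set(key, value);
    if (this.memtable.size >= this.memLimit) this.flush();
  }
  del(key) { this.put(key, TOMBSTONE); } // a delete is a write of a tombstone
  flush() { // freeze the MemTable into an immutable sorted run + Bloom filter
    const entries = [...this.memtable.entries()].sort(([a], [b]) => (a < b ? -1 : 1));
    const bloom = new Array(64).fill(false);
    for (const [k] of entries) bloomAdd(bloom, k);
    this.sstables.unshift({ bloom, entries });
    this.memtable = new Map();
  }
  get(key) { // newest data first: MemTable, then SSTables newest to oldest
    if (this.memtable.has(key)) {
      const v = this.memtable.get(key);
      return v === TOMBSTONE ? undefined : v;
    }
    for (const table of this.sstables) {
      if (!bloomMaybe(table.bloom, key)) continue; // Bloom says: skip this file
      const found = table.entries.find(([k]) => k === key);
      if (found) return found[1] === TOMBSTONE ? undefined : found[1];
    }
    return undefined;
  }
  compact() { // merge every SSTable into one clean bottom run
    const merged = new Map();
    for (const table of [...this.sstables].reverse())
      for (const [k, v] of table.entries) merged.set(k, v); // newer runs win
    const entries = [...merged.entries()]
      .filter(([, v]) => v !== TOMBSTONE) // reclaim tombstones at the bottom
      .sort(([a], [b]) => (a < b ? -1 : 1));
    const bloom = new Array(64).fill(false);
    for (const [k] of entries) bloomAdd(bloom, k);
    this.sstables = [{ bloom, entries }];
  }
}

const store = new LSMStore();
store.put("a", 1); store.put("b", 2); store.put("c", 3); // triggers a flush
store.put("a", 10); store.del("b"); store.put("d", 4);   // second flush
store.compact();
console.log(store.get("a")); // 10 — newest version wins
console.log(store.get("b")); // undefined — tombstone reclaimed by compaction
```

The lookup order (MemTable, then SSTables newest to oldest) is what makes "never overwrite in place" safe: the first version found is always the most recent one.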