Once a key-value store gets serious write traffic, the bottleneck is often no longer hashing or indexing. It is the physical act of rewriting scattered pages on disk. LSM trees change the game by making the write path almost entirely sequential, then paying the cleanup bill later through compaction.
A B-tree is excellent when the storage medium tolerates random page updates cheaply. But a busy write workload turns every tiny logical change into a small storm of physical work: append to the write-ahead log, rewrite the leaf page, maybe rewrite an internal page, maybe split, and eventually flush all of that through the filesystem and device.
The pain is not that B-trees are bad. The pain is that updating in place creates write amplification. A 40-byte value update can force kilobytes of page traffic, plus random I/O that is much harder for storage hardware to batch efficiently than a single append stream.
Each overwrite reaches into existing structure. The fast path is still several writes, and unlucky writes trigger page splits.
Durability comes from an append-only log, while the in-memory table absorbs updates until it is flushed as a new sorted run.
The log-structured merge tree flips the write path on purpose. New writes go to two places: an append-only log for durability, and an in-memory sorted structure usually called the MemTable. When the MemTable fills, it is frozen and flushed out as a new immutable sorted file: an SSTable.
That means writes are cheap right now, but reads now have to consult multiple places: the MemTable, maybe a recently frozen MemTable, and several SSTables on disk. The system accepts that temporary disorder because compaction will later merge those files into larger, cleaner levels.
Immutable SSTables make the write path fast, but they create a new problem: multiple versions of the same key accumulate across files. Deletes become tombstones. Reads may have to search several runs. So the system keeps merging files down the tree.
This is why compaction is not an incidental optimization. It is the second half of the data structure. The write path borrows time from the future, and compaction is how that debt gets paid.
When compaction merges files, it walks their entries in sorted-key order and resolves conflicts by recency: if the same key appears in several input files, the newest version wins and older versions are discarded. So compaction is not just concatenation. It is a cleanup pass that collapses several historical fragments into one cleaner view of the keyspace.
Deletion works the same way. Because old SSTables are immutable, the system cannot erase a key in place. Instead it writes a tombstone, which is a special record meaning “this key was deleted.” Reads treat a tombstone as stronger than any older value they may still find deeper in the tree. Later, once compaction has merged through the relevant older files and knows there is no older live version left to hide, it can drop both the tombstone and the obsolete value entirely.
How to read this: Write batch adds one new immutable SSTable to L0. Those files can overlap in key range, so point reads may have to check several of them. Compact merges the files in the first crowded level into one larger file in the next level down, keeping only the newest value per key. Files highlighted below are the ones whose key range still covers the selected lookup key.
Point lookups are where a naive LSM can feel painful. If the key is absent, the system may have to ask file after file, "Do you have this key?" and pay disk work for repeated "no" answers. A Bloom filter changes the economics of that question.
Many LSM systems attach a compact probabilistic summary to each SSTable or data block. On lookup, the filter can say either definitely not here or maybe here. That means many negative lookups can skip opening the data file entirely. The only tolerated mistake is a false positive, never a false negative.
The implementation below is deliberately small, but it includes the essential pieces: a MemTable, immutable SSTables, tombstones for deletes, per-SSTable Bloom filters, point lookups from newest to oldest, and a compacting merge that rewrites several files into one cleaner run.