Modern Software Tools, Explained
Storage Engines

The Write-Heavy Problem - LSM Trees

Once a key-value store gets serious write traffic, the bottleneck is often no longer hashing or indexing. It is the physical act of rewriting scattered pages on disk. LSM trees change the game by making the write path almost entirely sequential, then paying the cleanup bill later through compaction.

Sequential write path · MemTable and SSTables · Bloom filters as a common file feature · LevelDB, RocksDB, Cassandra
§01

Why in-place updates hurt

A B-tree is excellent when the storage medium tolerates random page updates cheaply. But a busy write workload turns every tiny logical change into a small storm of physical work: append to the write-ahead log, rewrite the leaf page, maybe rewrite an internal page, maybe split, and eventually flush all of that through the filesystem and device.

The pain is not that B-trees are bad. The pain is that updating in place creates write amplification. A 40-byte value update can force kilobytes of page traffic, plus random I/O that is much harder for storage hardware to batch efficiently than a single append stream.

The core LSM argument is physical, not abstract: if random page rewrites are your bottleneck, stop rewriting pages on every write.
Same logical writes, different physical behavior

B-tree style update

Rewrite pages now

Each overwrite reaches into existing structure. The fast path is still several writes, and unlucky writes trigger page splits.

LSM style update

Append now, sort later

Durability comes from an append-only log, while the in-memory table absorbs updates until it is flushed as a new sorted run.
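The contrast can be sketched with a toy cost model. The numbers below are illustrative, not measurements from any real engine, and the function names are invented for this sketch:

```javascript
// Toy cost model contrasting the two write paths.
const PAGE_SIZE = 4096;

// B-tree style: every logical update rewrites at least the leaf page,
// and an occasional unlucky update triggers a split (extra page writes).
function btreeWriteCost(updates, splitEvery = 50) {
  let pagesWritten = 0;
  for (let i = 1; i <= updates; i++) {
    pagesWritten += 1;                           // rewrite the leaf page
    if (i % splitEvery === 0) pagesWritten += 2; // split: new leaf + parent update
  }
  return pagesWritten * PAGE_SIZE;               // bytes of page traffic
}

// LSM style: every logical update is one small sequential append.
function lsmWriteCost(updates, entrySize = 64) {
  return updates * entrySize;                    // bytes appended to the log
}

// 1,000 small logical updates turn into very different physical work:
console.log(btreeWriteCost(1000)); // 4259840 bytes of page-granularity rewrites
console.log(lsmWriteCost(1000));   // 64000 bytes of sequential appends
```

The model deliberately ignores caching and group commit; the point is only the page-granularity multiplier on the in-place path versus the entry-granularity append on the log path.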

[Interactive counters: logical writes, B-tree page rewrites, B-tree splits, LSM sequential appends]

[Interactive demo: recent B-tree touches vs. recent LSM touches]

random_vs_sequential.js - immediate write-path work
§02

The LSM insight

The log-structured merge tree flips the write path on purpose. New writes go to two places: an append-only log for durability, and an in-memory sorted structure usually called the MemTable. When the MemTable fills, it is frozen and flushed out as a new immutable sorted file: an SSTable.

That makes writes cheap up front, but reads now have to consult multiple places: the active MemTable, maybe a recently frozen MemTable, and several SSTables on disk. The system accepts that temporary disorder because compaction will later merge those files into larger, cleaner levels.

An LSM tree is really a promise: "I will not overwrite old data in place. I will accumulate sorted runs, then reconcile them in the background."
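The MemTable-to-SSTable handoff can be sketched in a few lines. The class and threshold below are assumptions made for this sketch, and a real engine would also append each write to the WAL before acknowledging it:

```javascript
// Minimal MemTable flush sketch. A Map absorbs writes (and overwrites)
// in memory; when it hits a size limit, it is frozen into a sorted,
// immutable run — the shape of an SSTable.
class MemTable {
  constructor(limit = 3) {
    this.limit = limit;
    this.entries = new Map();
  }
  put(key, value) {
    this.entries.set(key, value);          // overwrite is free in memory
    return this.entries.size >= this.limit; // signal: time to flush
  }
  // Freeze into an immutable sorted run.
  toSSTable() {
    return [...this.entries.entries()].sort(([a], [b]) =>
      a < b ? -1 : a > b ? 1 : 0);
  }
}

const sstables = [];       // newest first
let active = new MemTable();
for (const [k, v] of [["c", 1], ["a", 2], ["a", 9], ["b", 3]]) {
  if (active.put(k, v)) {  // MemTable full: flush and start fresh
    sstables.unshift(active.toSSTable());
    active = new MemTable();
  }
}
console.log(sstables[0]);
// [["a", 9], ["b", 3], ["c", 1]] — sorted, the overwrite of "a" absorbed in memory
```

Note that the overwrite of `"a"` never reaches disk twice: the MemTable coalesces it before the flush, which is one reason the write path stays cheap.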
MemTable flush pipeline

[Interactive demo: active MemTable feeding SSTables on disk (newest first); counters for WAL appends, MemTable entries, flushes, SSTables]
§03

Compaction as a cascade

Immutable SSTables make the write path fast, but they create a new problem: multiple versions of the same key accumulate across files. Deletes become tombstones. Reads may have to search several runs. So the system keeps merging files down the tree.

This is why compaction is not an incidental optimization. It is the second half of the data structure. The write path borrows time from the future, and compaction is how that debt gets paid.

The LSM trade is explicit: lower write latency and higher write throughput now, background rewrite work later.
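The heart of that background work is a merge over sorted runs where newer values win and tombstones are reclaimed once they reach the bottom level. A sketch, with `null` standing in for a tombstone and all names invented for illustration:

```javascript
// Compacting merge: several sorted runs (newest first) collapse into one.
// Newer values win on key collision; tombstones are dropped only when
// compacting into the bottom level, where no older version can survive.
const TOMBSTONE = null;

function compact(runs, isBottomLevel = true) {
  const merged = new Map();
  // Walk oldest run first so newer runs overwrite on collision.
  for (const run of [...runs].reverse()) {
    for (const [key, value] of run) merged.set(key, value);
  }
  return [...merged.entries()]
    .filter(([, v]) => !(isBottomLevel && v === TOMBSTONE)) // reclaim deletes
    .sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
}

const newest = [["a", "a2"], ["d", TOMBSTONE]]; // "d" was deleted recently
const older  = [["a", "a1"], ["b", "b1"], ["d", "d1"]];
console.log(compact([newest, older]));
// [["a","a2"], ["b","b1"]] — one version per key, tombstone reclaimed
```

Keeping tombstones in non-bottom compactions matters: dropping `d`'s tombstone too early would let the stale `d1` in an older run become visible again.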
Compaction cascade

[Interactive demo: counters for L0 files, L1 files, L2 files, and files checked for a key]
§04

Bloom filters as a common SSTable feature

Point lookups are where a naive LSM can feel painful. If the key is absent, the system may have to ask file after file, "Do you have this key?" and pay disk work for repeated "no" answers. A Bloom filter changes the economics of that question.

Many LSM systems attach a compact probabilistic summary to each SSTable or data block. On lookup, the filter can say either "definitely not here" or "maybe here." That means many negative lookups can skip opening the data file entirely. The only tolerated mistake is a false positive, never a false negative.

Bloom filters are a very common SSTable feature, but not a universal law of the data structure. LevelDB, for example, exposes them through an optional filter policy. This explainer treats them as first-class because they matter so much for point reads.
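The "definitely not / maybe" contract is easy to see in a toy filter. The bit count, hash count, and FNV-style hash below are illustrative choices for this sketch, not parameters from any real engine:

```javascript
// Toy Bloom filter: k seeded hashes set k bits per key. A lookup that
// finds any unset bit is a definite miss; all-set bits mean "maybe".
class BloomFilter {
  constructor(bits = 64, hashes = 3) {
    this.bits = new Array(bits).fill(false);
    this.hashes = hashes;
  }
  // Simple seeded string hash (FNV-style); good enough for a demo.
  hash(key, seed) {
    let h = 2166136261 ^ seed;
    for (const ch of key) h = Math.imul(h ^ ch.charCodeAt(0), 16777619);
    return Math.abs(h) % this.bits.length;
  }
  add(key) {
    for (let s = 0; s < this.hashes; s++) this.bits[this.hash(key, s)] = true;
  }
  mightContain(key) {
    for (let s = 0; s < this.hashes; s++) {
      if (!this.bits[this.hash(key, s)]) return false; // definitely not here
    }
    return true; // maybe here: false positives possible, false negatives never
  }
}

const filter = new BloomFilter();
for (const key of ["apple", "banana"]) filter.add(key);
console.log(filter.mightContain("apple"));  // true — inserted keys always pass
console.log(filter.mightContain("durian")); // usually false: skip the file entirely
```

Every `false` answer is a data file the read path never has to open, which is why even a few bits per key can transform point-lookup cost for absent keys.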
One query across several SSTables

[Interactive demo: counters for Bloom rejections, disk probes, actual hits, false positives]
bloom_guard.js - Bloom filters ruling SSTables out
§05

A simplified LSM store in JavaScript

The implementation below is deliberately small, but it includes the essential pieces: a MemTable, immutable SSTables, tombstones for deletes, per-SSTable Bloom filters, point lookups from newest to oldest, and a compacting merge that rewrites several files into one cleaner run.

This is not a production storage engine. It is the minimum shape that still feels like an LSM tree instead of just a sorted map with extra steps.

lsm_store.js - simplified MemTable + SSTables + compaction
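A sketch in that spirit, combining the pieces from the earlier sections. Every name, limit, and simplification here is an assumption made for illustration (a Map as the MemTable, arrays as SSTables, `null` as the tombstone, a toy Bloom filter), not any engine's actual API:

```javascript
const TOMBSTONE = null;

// Toy per-SSTable Bloom filter helpers (FNV-style seeded hash).
function hash(key, seed) {
  let h = 2166136261 ^ seed;
  for (const ch of key) h = Math.imul(h ^ ch.charCodeAt(0), 16777619);
  return Math.abs(h);
}
function bloomAdd(bits, key) {
  for (let s = 0; s < 3; s++) bits[hash(key, s) % bits.length] = true;
}
function bloomMaybe(bits, key) {
  for (let s = 0; s < 3; s++) if (!bits[hash(key, s) % bits.length]) return false;
  return true;
}

class LSMStore {
  constructor(memLimit = 3) {
    this.memLimit = memLimit;
    this.memtable = new Map(); // absorbs writes and overwrites
    this.sstables = [];        // newest first: { bloom, entries (sorted) }
  }
  put(key, value) {
    this.memtable.set(key, value);
    if (this.memtable.size >= this.memLimit) this.flush();
  }
  del(key) { this.put(key, TOMBSTONE); } // a delete is a write of a tombstone
  flush() { // freeze the MemTable into an immutable sorted run + Bloom filter
    const entries = [...this.memtable.entries()].sort(([a], [b]) => (a < b ? -1 : 1));
    const bloom = new Array(64).fill(false);
    for (const [k] of entries) bloomAdd(bloom, k);
    this.sstables.unshift({ bloom, entries });
    this.memtable = new Map();
  }
  get(key) { // newest data first: MemTable, then SSTables newest to oldest
    if (this.memtable.has(key)) {
      const v = this.memtable.get(key);
      return v === TOMBSTONE ? undefined : v;
    }
    for (const table of this.sstables) {
      if (!bloomMaybe(table.bloom, key)) continue; // Bloom says: skip this file
      const found = table.entries.find(([k]) => k === key);
      if (found) return found[1] === TOMBSTONE ? undefined : found[1];
    }
    return undefined;
  }
  compact() { // merge every SSTable into one clean bottom run
    const merged = new Map();
    for (const table of [...this.sstables].reverse())
      for (const [k, v] of table.entries) merged.set(k, v); // newer runs win
    const entries = [...merged.entries()]
      .filter(([, v]) => v !== TOMBSTONE) // reclaim tombstones at the bottom
      .sort(([a], [b]) => (a < b ? -1 : 1));
    const bloom = new Array(64).fill(false);
    for (const [k] of entries) bloomAdd(bloom, k);
    this.sstables = [{ bloom, entries }];
  }
}

const store = new LSMStore();
store.put("a", 1); store.put("b", 2); store.put("c", 3); // triggers a flush
store.put("a", 10); store.del("b"); store.put("d", 4);   // second flush
store.compact();
console.log(store.get("a")); // 10 — newest version wins
console.log(store.get("b")); // undefined — tombstone reclaimed by compaction
```

The lookup order (MemTable, then SSTables newest to oldest) is what makes "never overwrite in place" safe: the first version found is always the most recent one.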