Modern Software Tools, Explained Back to Explainers
Disk Algorithms

Sorting Bigger Than RAM - External Mergesort

If the file is 500GB and memory is 8GB, the main algorithmic question is no longer comparison count. It is I/O shape. External mergesort wins because it touches disk in long sequential passes, first creating sorted runs that fit in memory and then merging those runs with a heap.

Run generation in RAM-sized chunks K-way merge with a heap Sequential I/O over seeks Simulated BinaryFile with seek
§01

Why quicksort fails on disk

Quicksort assumes cheap random access. Its partition pass reaches back and forth across the array, which is fine when the array lives in RAM. But on disk, that pattern becomes a seek-heavy disaster. What looks like a few arithmetic operations in memory turns into repeated repositioning of the read head.

External mergesort changes the goal. Instead of trying to sort the whole file in place, it makes a small number of long sequential passes. That single design choice is why a 500GB file is a practical job instead of an absurd one.

For disk-resident data, the right question is not “what has the nicest in-memory constants?” It is “what minimises seeks and maximises sequential throughput?”
Cost model

Random-access mindset

Seek all over the file

Quicksort’s partitioning pattern is a cache story in RAM and a seek story on disk.

External mergesort mindset

Sequential passes

Read a chunk, sort in memory, write a run. Then merge runs in order using buffered reads.

File size
500GB
RAM
8GB
Initial runs
63
Right pattern
sequential
§02

Run generation

The first phase is simple: read as many records as fit in memory, sort that chunk in RAM, and write the result out as a sorted run. Repeat until the whole file has been turned into runs.

This phase is where your in-memory sorter still matters. If the chunk fits in RAM, use the fast in-memory algorithm there. External mergesort is not replacing introsort inside RAM; it is orchestrating introsort across many disk-sized chunks.

Run generation pipeline
Records processed
0
Runs generated
0
Chunk size
0
Pass type
sequential read/write
run_generation.js - sorting memory-sized chunks into runs

      
§03

K-way merge

Once you have sorted runs, the merge phase is conceptually straightforward: keep one current record from each run, repeatedly emit the smallest one, and then replace it with the next record from the same run. The right data structure for that is a min-heap of size k.

Each output record costs one heap pop and, unless a run is exhausted, one heap push. That gives O(n log k) merge work, where k is the number of runs being merged in the current pass.

Three-way merge

Heap front

The heap stores the smallest current record from each live run.

Output

Every pop emits the next global minimum across all runs.

Heap size
0
Output records
0
Run readers
0
Comparison pattern
log k
k_way_merge.js - merging sorted runs with a min-heap

      
§04

I/O patterns

External sorting lives or dies by buffering and access order. Reads should be sequential within each run. Writes should be sequential into the output file. Random seeks should happen only when opening a run, switching buffers, or repositioning within a simulated file abstraction.

This is also why multi-pass merging exists. If you cannot afford one input buffer per run, do not force it. Merge a manageable number of runs into larger runs, then merge those larger runs in the next pass.

The unit of thought is the same as in B-trees: a page or buffer is valuable because one I/O should move a useful amount of ordered information.
§05

A simulated external sort in JavaScript

The implementation below uses a browser-side BinaryFile-style class with seek, readInt32, and writeInt32. It is not real disk I/O, but it preserves the important part for an explainer: a file cursor, run files, and a merge phase that treats files as streams rather than plain arrays.

This is the right compromise for the page: faithful file semantics and cursor-based access, without pretending the browser is a storage engine or a full buffered I/O stack.
external_sort.js - simulated BinaryFile + run generation + k-way merge