Disk Algorithms

Sorting Bigger Than RAM - External Mergesort

If the file is 500GB and memory is 8GB, the main algorithmic question is no longer comparison count. It is I/O shape. External mergesort wins because it touches disk in long sequential passes, first creating sorted runs that fit in memory and then merging those runs with a heap.

Run generation in RAM-sized chunks K-way merge with a heap Sequential I/O over seeks Simulated BinaryFile with seek

§01

Why quicksort fails on disk

Quicksort assumes cheap random access. Its partition pass reaches back and forth across the array, which is fine when the array lives in RAM. But on disk, that pattern becomes a seek-heavy disaster. What looks like a few arithmetic operations in memory turns into repeated repositioning of the read head.

External mergesort changes the goal. Instead of trying to sort the whole file in place, it makes a small number of long sequential passes. That single design choice is why a 500GB file is a practical job instead of an absurd one.

For disk-resident data, the right question is not “what has the nicest in-memory constants?” It is “what minimises seeks and maximises sequential throughput?”

Cost model

Random-access mindset

Seek all over the file

Quicksort’s partitioning pattern is a cache story in RAM and a seek story on disk.

External mergesort mindset

Sequential passes

Read a chunk, sort in memory, write a run. Then merge runs in order using buffered reads.

File size

500GB

RAM

8GB

Initial runs

Right pattern

sequential

§02

Run generation

The first phase is simple: read as many records as fit in memory, sort that chunk in RAM, and write the result out as a sorted run. Repeat until the whole file has been turned into runs.

This phase is where your in-memory sorter still matters. If the chunk fits in RAM, use the fast in-memory algorithm there. External mergesort is not replacing introsort inside RAM; it is orchestrating introsort across many disk-sized chunks.

Run generation pipeline

1. Read input chunk→

2. Sort in RAM→

3. Write sorted run

Records processed

Runs generated

Chunk size

Pass type

sequential read/write

run_generation.js - sorting memory-sized chunks into runs

§03

K-way merge

Once you have sorted runs, the merge phase is conceptually straightforward: keep one current record from each run, repeatedly emit the smallest one, and then replace it with the next record from the same run. The right data structure for that is a min-heap of size k.

Each output record costs one heap pop and, unless a run is exhausted, one heap push. That gives O(n log k) merge work, where k is the number of runs being merged in the current pass.

Three-way merge

Run readers→

Each run is still a full sorted file. The highlighted chip marks the record from that run currently represented in the heap.

Heap backing array→

The heap holds only one live candidate per run, stored in its usual array representation.

Output file

Each step appends one new global minimum to the merged output stream.

Heap size

Output records

Run readers

Comparison pattern

log k

k_way_merge.js - merging sorted runs with a min-heap

§04

I/O patterns

External sorting lives or dies by buffering and access order. Reads should be sequential within each run. Writes should be sequential into the output file. Random seeks should happen only when opening a run, switching buffers, or repositioning within a simulated file abstraction.

This is also why multi-pass merging exists. If you cannot afford one input buffer per run, do not force it. Merge a manageable number of runs into larger runs, then merge those larger runs in the next pass.

The unit of thought is the same as in B-trees: a page or buffer is valuable because one I/O should move a useful amount of ordered information.

§05

A simulated external sort in JavaScript

The implementation below uses a browser-side BinaryFile-style class with seek, readInt32, and writeInt32. It is not real disk I/O, but it preserves the important part for an explainer: a file cursor, run files, and a merge phase that treats files as streams rather than plain arrays.

This is the right compromise for the page: faithful file semantics and cursor-based access, without pretending the browser is a storage engine or a full buffered I/O stack.

external_sort.js - simulated BinaryFile + run generation + k-way merge

§06

The B-tree connection

External mergesort and B-trees are solving neighboring problems with the same physical intuition. Both assume that disk I/O dominates CPU time, both organize work around page-sized transfers, and both try to get as much value as possible out of each seek.

The difference is the access pattern. A B-tree is optimized for random lookup: a few page reads to find one key. External mergesort is optimized for sequential processing: a small number of full passes to reorder an entire file. Same page-size thinking, different job.