Modern Software Tools, Explained
Interactive explainer

The Compression Game — Huffman and LZ

How do you make a file smaller without losing anything? The answer is to exploit structure twice: first by spotting repeated substrings, then by assigning short codes to the symbols that show up most often. This page walks from Shannon's entropy bound to Huffman coding, LZ77, and the way gzip combines both ideas.

Lossless compression · Shannon entropy · DEFLATE = LZ77 + Huffman · Runnable JavaScript
§01

The problem: same file, fewer bits

Lossless compression has a narrow brief: the output must decode to the exact original bytes. No approximations, no dropped detail, no “close enough.” The only move available is to describe the data more cleverly than the raw byte stream does.

That only works when the file has structure: repeated words, predictable symbols, recurring phrases, or common local patterns. If the bytes already look random, there is not much to save.

Compression is not magic. It cannot squeeze every file. It only wins when the data contains redundancy that the compressor can model.
Three files, three very different compression stories [interactive demo: reports each sample's original size, unique symbols, entropy in bits/symbol, and Shannon floor in bytes, plus how much of the 8-bit raw-byte budget the entropy actually uses]
§02

Entropy: Shannon's floor

Claude Shannon's key insight was that a source emitting very predictable symbols should require fewer bits per symbol than a source emitting unpredictable ones. If one character appears half the time and the others are rare, you should not spend the same number of bits on all of them.

Entropy is the average surprise of the source. Low entropy means the next symbol is usually easy to guess; high entropy means you are closer to randomness. In a long enough stream, entropy is the lower bound on the average number of bits per symbol that any lossless compressor can achieve, provided it models the source correctly.

No heavy math needed here: think of entropy as the average number of yes/no questions required to identify the next symbol.
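As a concrete check on that intuition, the entropy of a short string can be computed directly from its symbol counts. The sample string here is just an illustration:

```javascript
// Per-symbol Shannon entropy of a string: the average "surprise",
// H = -Σ p · log2(p) over the symbol probabilities.
function entropyOf(text) {
  const counts = new Map();
  for (const ch of text) counts.set(ch, (counts.get(ch) ?? 0) + 1);
  let h = 0;
  for (const n of counts.values()) {
    const p = n / text.length;
    h -= p * Math.log2(p);
  }
  return h;
}

// "AAAB": A appears 3/4 of the time, B 1/4 of the time.
// H = 0.75·log2(4/3) + 0.25·log2(4) ≈ 0.811 bits/symbol —
// far below the 8 bits/symbol that raw bytes spend.
console.log(entropyOf("AAAB").toFixed(3)); // "0.811"
```

A string of one repeated symbol has entropy 0 (no questions needed), and four equally likely symbols have entropy exactly 2 bits: two yes/no questions per symbol.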
Where the bits go [interactive chart]
§03

Huffman coding

Huffman coding turns a frequency table into a prefix code: common symbols get short bit strings, rare symbols get longer ones, and no code is a prefix of another. That last property guarantees the bitstream can be decoded unambiguously from left to right.

The construction is greedy and elegant. Repeatedly take the two least frequent symbols, combine them into a parent, and continue until one tree remains. The path to each leaf becomes that symbol's code.

Huffman coding only exploits symbol frequency. If the file contains repeated phrases like compression compression compression, Huffman alone still sees one symbol at a time.
Codes derived from the current sample [interactive demo: a symbol / count / share / code / bits table, with totals for fixed-width ASCII, the entropy floor, and the Huffman output, plus the average code length in bits/symbol]
§04

LZ77 and the sliding window

LZ77 tackles a different kind of redundancy. Instead of asking which symbols are common, it asks whether the next few characters already appeared nearby. If they did, emit a back-reference such as “copy 6 bytes from 11 bytes ago” instead of repeating the literal text.

That is why LZ77 works so well on logs, markup, source code, and repeated words: local history is a strong predictor of what comes next. In practice, encoders use a sliding window of recent bytes and greedily choose a long enough match when one exists.

Real LZ77-style formats also allow overlapping copies. That is why runs like AAAAAAAAAA can collapse into a short literal prefix plus a back-reference whose length exceeds its distance.
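The overlap trick is easiest to see on the decode side: copies resolve one byte at a time, so a back-reference can read bytes it has just written. The token shapes below are an assumed toy format, not DEFLATE's:

```javascript
// Decode a stream of LZ77-style tokens: either a literal character
// or a back-reference { dist, len }. Copying byte by byte means a
// reference with len > dist legally overlaps its own output.
function decodeTokens(tokens) {
  let out = "";
  for (const t of tokens) {
    if (t.lit !== undefined) { out += t.lit; continue; }
    const start = out.length - t.dist;
    for (let i = 0; i < t.len; i++) out += out[start + i]; // may read bytes just written
  }
  return out;
}

// One literal 'A' plus "copy 9 bytes from 1 byte ago" expands a
// 10-byte run from 2 tokens:
console.log(decodeTokens([{ lit: "A" }, { dist: 1, len: 9 }])); // "AAAAAAAAAA"
```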
Step through an LZ77-style parse [interactive demo: shows the sliding window, the lookahead buffer, and the token emitted at each step]
§05

How gzip combines both

What most programmers casually call “gzip compression” is usually a gzip file wrapper around a DEFLATE stream. DEFLATE combines the two ideas above: first it finds repeated substrings with an LZ77-style parser, then it Huffman-codes the resulting literals, lengths, and distances.

That division of labor matters. LZ77 removes repeated phrases and local structure. Huffman then removes the remaining symbol-frequency waste from the token stream. One spots patterns across bytes; the other assigns shorter codes to frequent tokens.

The pipeline below is illustrative, not a byte-exact gzip size calculator. Real DEFLATE adds block structure, separate literal/length and distance alphabets, and some extra bits for certain lengths and distances.
A toy DEFLATE-style pipeline for the current sample [interactive demo: original bytes in, LZ77 token count broken down into matches vs. literals, toy Huffman output in bits with the average bits per emitted symbol, and a takeaway summary]

§06

Implementation in JavaScript

The snippets below are deliberately small, but they are not pseudocode. The Huffman example really encodes and decodes a string and emits the codebook needed to decode it later. The LZ77 sketch really emits tokens, allows overlapping back-references, and reconstructs the original text.

entropy.js — estimate Shannon entropy and the theoretical floor
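A minimal version consistent with that description might look like the following; the sample string is an illustrative choice:

```javascript
// entropy.js — Shannon entropy of a character stream and the
// theoretical floor it implies for any lossless compressor.
function analyze(text) {
  const counts = new Map();
  for (const ch of text) counts.set(ch, (counts.get(ch) ?? 0) + 1);

  let bitsPerSymbol = 0;
  for (const n of counts.values()) {
    const p = n / text.length;
    bitsPerSymbol -= p * Math.log2(p); // H = -Σ p · log2(p)
  }

  return {
    bitsPerSymbol,
    uniqueSymbols: counts.size,
    rawBytes: text.length, // 8 bits per symbol in the raw encoding
    floorBytes: Math.ceil((bitsPerSymbol * text.length) / 8), // Shannon floor
  };
}

const stats = analyze("abracadabra abracadabra");
console.log(stats.bitsPerSymbol.toFixed(3), "bits/symbol");
console.log("floor:", stats.floorBytes, "of", stats.rawBytes, "bytes");
```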

      
huffman.js — Huffman encoder / decoder with serialized codebook
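One compact way to write it, matching the description above (the sample string is illustrative):

```javascript
// huffman.js — Huffman encoder/decoder. encode() returns the bit
// string plus the codebook needed to decode it later; decode()
// matches bits against the inverted codebook, which the prefix
// property makes unambiguous.
function buildCodes(text) {
  const counts = new Map();
  for (const ch of text) counts.set(ch, (counts.get(ch) ?? 0) + 1);
  let nodes = [...counts].map(([sym, w]) => ({ sym, w }));
  while (nodes.length > 1) {
    nodes.sort((a, b) => a.w - b.w);   // merge the two rarest nodes
    const [l, r] = nodes.splice(0, 2);
    nodes.push({ w: l.w + r.w, l, r });
  }
  const codes = {};
  (function walk(node, prefix) {
    if (node.sym !== undefined) { codes[node.sym] = prefix || "0"; return; }
    walk(node.l, prefix + "0");
    walk(node.r, prefix + "1");
  })(nodes[0], "");
  return codes;
}

function encode(text) {
  const codebook = buildCodes(text);
  let bits = "";
  for (const ch of text) bits += codebook[ch];
  return { bits, codebook }; // the codebook must travel with the bits
}

function decode({ bits, codebook }) {
  const inverse = Object.fromEntries(
    Object.entries(codebook).map(([sym, code]) => [code, sym]));
  let out = "", buf = "";
  for (const bit of bits) {
    buf += bit;
    if (buf in inverse) { out += inverse[buf]; buf = ""; }
  }
  return out;
}

const packed = encode("compression compression compression");
console.log(packed.bits.length, "bits vs", 8 * 35, "raw bits");
console.log(decode(packed)); // prints the original string
```

Note that a real format would serialize the codebook compactly (DEFLATE ships only the code lengths); carrying a plain object keeps the sketch readable.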

      
lz77_sketch.js — greedy LZ77-style encoder / decoder with overlapping copies
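A greedy parser along those lines follows; the WINDOW and MIN_MATCH constants are illustrative choices, not DEFLATE's parameters:

```javascript
// lz77_sketch.js — greedy LZ77-style parse over a sliding window.
// Tokens are either { lit } or { dist, len } back-references; copies
// resolve byte by byte, so overlapping references (len > dist) work.
const WINDOW = 255;   // how far back a match may reach
const MIN_MATCH = 3;  // shorter matches cost more than literals

function compress(text) {
  const tokens = [];
  let pos = 0;
  while (pos < text.length) {
    let best = { len: 0, dist: 0 };
    const start = Math.max(0, pos - WINDOW);
    for (let cand = start; cand < pos; cand++) {
      let len = 0;
      // Matching may run past `pos`: that is what allows overlap.
      while (pos + len < text.length && text[cand + len] === text[pos + len]) len++;
      if (len > best.len) best = { len, dist: pos - cand };
    }
    if (best.len >= MIN_MATCH) {
      tokens.push({ dist: best.dist, len: best.len });
      pos += best.len;
    } else {
      tokens.push({ lit: text[pos++] });
    }
  }
  return tokens;
}

function decompress(tokens) {
  let out = "";
  for (const t of tokens) {
    if (t.lit !== undefined) { out += t.lit; continue; }
    const from = out.length - t.dist;
    for (let i = 0; i < t.len; i++) out += out[from + i];
  }
  return out;
}

const sample = "compression compression compression";
const tokens = compress(sample);
console.log(tokens.length, "tokens for", sample.length, "bytes"); // 13 tokens for 35 bytes
console.log(decompress(tokens) === sample); // true
```

The second and third "compression " collapse into a single back-reference: 12 literals for the first word, then one token meaning "copy 23 bytes from 12 bytes ago", which overlaps its own output.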