How do you make a file smaller without losing anything? The answer is to exploit structure twice: first by spotting repeated substrings, then by assigning short codes to the symbols that show up most often. This page walks from Shannon's entropy bound to Huffman coding, LZ77, and the way gzip combines both ideas.
Lossless compression has a narrow brief: the output must decode to the exact original bytes. No approximations, no dropped detail, no “close enough.” The only move available is to describe the data more cleverly than the raw byte stream does.
That only works when the file has structure: repeated words, predictable symbols, recurring phrases, or common local patterns. If the bytes already look random, there is not much to save.
Claude Shannon's key insight was that a source emitting very predictable symbols should require fewer bits per symbol than a source emitting unpredictable ones. If one character appears half the time and the others are rare, you should not spend the same number of bits on all of them.
Entropy is the average surprise of the source. Low entropy means the next symbol is usually easy to guess. High entropy means you are closer to randomness. In a long enough stream, entropy is the theoretical lower bound for any lossless compressor that models the source correctly.
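That average surprise is easy to estimate in code. The sketch below is a deliberately simple zeroth-order model (an assumption: every byte is independent, with probabilities taken from observed counts), not a full source model:

```python
# Estimate zeroth-order entropy of a byte string: bits per symbol under
# the simplifying assumption that each byte is drawn independently.
from collections import Counter
from math import log2

def entropy_bits_per_symbol(data: bytes) -> float:
    counts = Counter(data)
    total = len(data)
    # H = -sum(p * log2(p)) over the symbols actually observed
    return -sum((c / total) * log2(c / total) for c in counts.values())

# A skewed source needs far fewer bits per symbol than a uniform one:
# 90% 'a' / 10% 'b' comes out near 0.47 bits, while a byte string that
# uses all 256 values equally often comes out at exactly 8 bits.
skewed = b"a" * 90 + b"b" * 10
uniform = bytes(range(256))
```

A compressor that spends a flat 8 bits on every byte of the skewed source is wasting most of them, which is exactly the waste Huffman coding targets.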
Huffman coding turns a frequency table into a prefix code: common symbols get short bit strings, rare symbols get longer ones, and no code is a prefix of another. That last property guarantees the bitstream can be decoded unambiguously from left to right.
The construction is greedy and elegant. Repeatedly take the two least frequent symbols, combine them into a parent, and continue until one tree remains. The path to each leaf becomes that symbol's code.
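That greedy merge maps naturally onto a min-heap of partial trees. The sketch below is one minimal realization (the tie-break counter and the dict-of-codes representation are implementation choices, not part of the algorithm); it builds the codebook, encodes, and decodes:

```python
# Minimal Huffman coder: build a prefix codebook from symbol counts by
# repeatedly merging the two least frequent subtrees.
import heapq
from collections import Counter

def huffman_codebook(text: str) -> dict[str, str]:
    if not text:
        return {}
    freq = Counter(text)
    if len(freq) == 1:                       # degenerate one-symbol input
        return {next(iter(freq)): "0"}
    # Heap entries: (count, tie-break, {symbol: code-so-far}).
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)    # two least frequent subtrees
        n2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (n1 + n2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

def encode(text: str, book: dict[str, str]) -> str:
    return "".join(book[s] for s in text)

def decode(bits: str, book: dict[str, str]) -> str:
    inverse = {c: s for s, c in book.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:                   # prefix property: match is final
            out.append(inverse[buf])
            buf = ""
    return "".join(out)
```

Because no code is a prefix of another, the decoder never has to backtrack: the first codeword that matches the buffer is the right one.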
Even on input full of repeated words, such as “compression compression compression”, Huffman alone still sees one symbol at a time: it shortens frequent characters, but it cannot exploit the repeated phrase itself.
LZ77 tackles a different kind of redundancy. Instead of asking which symbols are common, it asks whether the next few characters already appeared nearby. If they did, emit a back-reference such as “copy 6 bytes from 11 bytes ago” instead of repeating the literal text.
That is why LZ77 works so well on logs, markup, source code, and repeated words: local history is a strong predictor of what comes next. In practice, encoders use a sliding window of recent bytes and greedily choose a long enough match when one exists.
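A toy encoder makes the mechanism concrete. The sketch below assumes a small sliding window and a greedy longest-match search; tokens are either `("lit", char)` or `("ref", distance, length)`, and the window size and minimum match length are arbitrary choices for illustration:

```python
# Toy LZ77: greedy longest-match within a sliding window of recent text.
def lz77_encode(text: str, window: int = 64, min_match: int = 3):
    tokens, i = [], 0
    while i < len(text):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):   # every match start in window
            length = 0
            # Comparing past position i is fine: it models an overlapping copy.
            while i + length < len(text) and text[j + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            tokens.append(("ref", best_dist, best_len))
            i += best_len
        else:
            tokens.append(("lit", text[i]))
            i += 1
    return tokens

def lz77_decode(tokens) -> str:
    out = []
    for t in tokens:
        if t[0] == "lit":
            out.append(t[1])
        else:
            _, dist, length = t
            for _ in range(length):              # byte-by-byte so overlap works
                out.append(out[-dist])
    return "".join(out)
```

Note the decoder copies one character at a time; that detail is what lets a back-reference legally reach into bytes it is still in the middle of producing.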
A run such as AAAAAAAAAA can collapse into a short literal prefix plus a back-reference whose length exceeds its distance: the decoder simply keeps copying bytes it has only just written.

What most programmers casually call “gzip compression” is usually a gzip file wrapper around a DEFLATE stream. DEFLATE combines the two ideas above: first it finds repeated substrings with an LZ77-style parser, then it Huffman-codes the resulting literals, lengths, and distances.
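Both layers are a standard-library call away in Python: `zlib` produces the DEFLATE stream (with a small zlib header and checksum), while `gzip` adds the gzip file wrapper around the same compressed data.

```python
# Round-trip the same bytes through both containers of a DEFLATE stream.
import gzip
import zlib

data = b"compression compression compression " * 100

deflated = zlib.compress(data, level=9)   # zlib container around DEFLATE
gzipped = gzip.compress(data)             # gzip container around DEFLATE

# Lossless: both decode to the exact original bytes,
# and the highly redundant input shrinks dramatically.
assert zlib.decompress(deflated) == data
assert gzip.decompress(gzipped) == data
assert len(deflated) < len(data) // 10
```

The gzip output is slightly larger than the raw zlib output because the gzip header carries extra metadata, but the compressed payload inside is the same kind of DEFLATE stream.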
That division of labor matters. LZ77 removes repeated phrases and local structure. Huffman then removes the remaining symbol-frequency waste from the token stream. One spots patterns across bytes; the other assigns shorter codes to frequent tokens.
The snippets below are deliberately small, but they are not pseudocode. The Huffman example really encodes and decodes a string and emits the codebook needed to decode it later. The LZ77 sketch really emits tokens, allows overlapping back-references, and reconstructs the original text.