How do you make a file smaller without losing anything? The answer is to exploit structure twice: first by spotting repeated substrings, then by assigning short codes to the symbols that show up most often. This page walks from Shannon's entropy bound to Huffman coding, LZ77, and the way gzip combines both ideas.
Lossless compression has a narrow brief: the output must decode to the exact original bytes. No approximations, no dropped detail, no “close enough.” The only move available is to describe the data more cleverly than the raw byte stream does.
That only works when the file has structure: repeated words, predictable symbols, recurring phrases, or common local patterns. If the bytes already look random, there is not much to save.
Claude Shannon's key insight was that a source emitting very predictable symbols should require fewer bits per symbol than a source emitting unpredictable ones. If one character appears half the time and the others are rare, you should not spend the same number of bits on all of them.
Entropy is the average surprise of the source. Low entropy means the next symbol is usually easy to guess. High entropy means you are closer to randomness. In a long enough stream, entropy is the theoretical lower bound for any lossless compressor that models the source correctly.
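That average surprise is easy to estimate in code. The sketch below is a deliberately simple zeroth-order model (an assumption: every byte is independent, with probabilities taken from observed counts), not a full source model:

```python
# Estimate zeroth-order entropy of a byte string: bits per symbol under
# the simplifying assumption that each byte is drawn independently.
from collections import Counter
from math import log2

def entropy_bits_per_symbol(data: bytes) -> float:
    counts = Counter(data)
    total = len(data)
    # H = -sum(p * log2(p)) over the symbols actually observed
    return -sum((c / total) * log2(c / total) for c in counts.values())

# A skewed source needs far fewer bits per symbol than a uniform one:
# 90% 'a' / 10% 'b' comes out near 0.47 bits, while a byte string that
# uses all 256 values equally often comes out at exactly 8 bits.
skewed = b"a" * 90 + b"b" * 10
uniform = bytes(range(256))
```

A compressor that spends a flat 8 bits on every byte of the skewed source is wasting most of them, which is exactly the waste Huffman coding targets.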
Huffman coding turns a frequency table into a prefix code: common symbols get short bit strings, rare symbols get longer ones, and no code is a prefix of another. That last property guarantees the bitstream can be decoded unambiguously from left to right.
The construction is greedy and elegant. Repeatedly take the two least frequent symbols, combine them into a parent, and continue until one tree remains. The path to each leaf becomes that symbol's code.
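That greedy merge maps naturally onto a min-heap of partial trees. The sketch below is one minimal realization (the tie-break counter and the dict-of-codes representation are implementation choices, not part of the algorithm); it builds the codebook, encodes, and decodes:

```python
# Minimal Huffman coder: build a prefix codebook from symbol counts by
# repeatedly merging the two least frequent subtrees.
import heapq
from collections import Counter

def huffman_codebook(text: str) -> dict[str, str]:
    if not text:
        return {}
    freq = Counter(text)
    if len(freq) == 1:                       # degenerate one-symbol input
        return {next(iter(freq)): "0"}
    # Heap entries: (count, tie-break, {symbol: code-so-far}).
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        n1, _, left = heapq.heappop(heap)    # two least frequent subtrees
        n2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (n1 + n2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

def encode(text: str, book: dict[str, str]) -> str:
    return "".join(book[s] for s in text)

def decode(bits: str, book: dict[str, str]) -> str:
    inverse = {c: s for s, c in book.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:                   # prefix property: match is final
            out.append(inverse[buf])
            buf = ""
    return "".join(out)
```

Because no code is a prefix of another, the decoder never has to backtrack: the first codeword that matches the buffer is the right one.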
Even on input full of repeated words, such as “compression compression compression”, Huffman alone still sees one symbol at a time: it shortens frequent characters, but it cannot exploit the repeated phrase itself.
LZ77 tackles a different kind of redundancy. Instead of asking which symbols are common, it asks whether the next few characters already appeared nearby. If they did, emit a back-reference such as “copy 6 bytes from 11 bytes ago” instead of repeating the literal text.
That is why LZ77 works so well on logs, markup, source code, and repeated words: local history is a strong predictor of what comes next. In practice, encoders use a sliding window of recent bytes and greedily choose a long enough match when one exists.
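A toy encoder makes the mechanism concrete. The sketch below assumes a small sliding window and a greedy longest-match search; tokens are either `("lit", char)` or `("ref", distance, length)`, and the window size and minimum match length are arbitrary choices for illustration:

```python
# Toy LZ77: greedy longest-match within a sliding window of recent text.
def lz77_encode(text: str, window: int = 64, min_match: int = 3):
    tokens, i = [], 0
    while i < len(text):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):   # every match start in window
            length = 0
            # Comparing past position i is fine: it models an overlapping copy.
            while i + length < len(text) and text[j + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            tokens.append(("ref", best_dist, best_len))
            i += best_len
        else:
            tokens.append(("lit", text[i]))
            i += 1
    return tokens

def lz77_decode(tokens) -> str:
    out = []
    for t in tokens:
        if t[0] == "lit":
            out.append(t[1])
        else:
            _, dist, length = t
            for _ in range(length):              # byte-by-byte so overlap works
                out.append(out[-dist])
    return "".join(out)
```

Note the decoder copies one character at a time; that detail is what lets a back-reference legally reach into bytes it is still in the middle of producing.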
A run such as AAAAAAAAAA can collapse into a short literal prefix plus a back-reference whose length exceeds its distance: the decoder simply keeps copying bytes it has only just written.

What most programmers casually call “gzip compression” is usually a gzip file wrapper around a DEFLATE stream. DEFLATE combines the two ideas above: first it finds repeated substrings with an LZ77-style parser, then it Huffman-codes the resulting literals, lengths, and distances.
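Both layers are a standard-library call away in Python: `zlib` produces the DEFLATE stream (with a small zlib header and checksum), while `gzip` adds the gzip file wrapper around the same compressed data.

```python
# Round-trip the same bytes through both containers of a DEFLATE stream.
import gzip
import zlib

data = b"compression compression compression " * 100

deflated = zlib.compress(data, level=9)   # zlib container around DEFLATE
gzipped = gzip.compress(data)             # gzip container around DEFLATE

# Lossless: both decode to the exact original bytes,
# and the highly redundant input shrinks dramatically.
assert zlib.decompress(deflated) == data
assert gzip.decompress(gzipped) == data
assert len(deflated) < len(data) // 10
```

The gzip output is slightly larger than the raw zlib output because the gzip header carries extra metadata, but the compressed payload inside is the same kind of DEFLATE stream.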
That division of labor matters. LZ77 removes repeated phrases and local structure. Huffman then removes the remaining symbol-frequency waste from the token stream. One spots patterns across bytes; the other assigns shorter codes to frequent tokens.
The snippets below are deliberately small, but they are not pseudocode. The Huffman example really encodes and decodes a string and emits the codebook needed to decode it later. The LZ77 sketch really emits tokens, allows overlapping back-references, and reconstructs the original text.