Modern Software Tools, Explained
Approximate Answers

Good Enough Answers - Probabilistic Data Structures

Sometimes exactness is the wrong objective. If you need to ask whether a URL is in a set of ten billion URLs, or estimate the number of distinct users in a stream, the winning move is often to spend a tiny amount of memory on an answer that is almost always right and explicitly bounded when it is wrong.

Bloom filters for membership
Count-min sketch for frequencies
HyperLogLog for distinct counts
Error as a design parameter
§01

The memory wall

Suppose you want to ask a simple question: does this URL exist in a set of ten billion URLs? A fully exact hash set may be fine on paper, but not when you only have about 1GB of RAM and still need room for everything else the program does.

This is where probabilistic data structures enter. They answer narrower questions with radically less memory, and they do so by being honest about the shape of their errors. Bloom filters allow false positives but not false negatives. Count-min sketch overestimates counts but never underestimates them. HyperLogLog estimates cardinality with a known relative error. They do not make every scale point magically fit, but they often make staged designs practical where exact storage is hopeless.

The key engineering move is not "be sloppy." It is "choose a cheap kind of wrongness that the system can tolerate." At 10B URLs and 1GB RAM, that likely means an approximate local stage plus a second check, not one giant exact in-memory set.
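That staged shape is easy to sketch. The names below (`makeStagedLookup`, `approxHas`, `exactHas`) are illustrative stand-ins for a Bloom-filter probe and an authoritative store, not an API from the article:

```javascript
// Hypothetical staged membership check: a compact approximate filter answers
// most lookups in memory; only its rare "probably present" answers pay for an
// exact (e.g. on-disk) confirmation.
function makeStagedLookup(approxHas, exactHas) {
  return function seen(url) {
    if (!approxHas(url)) return false; // a Bloom-style "no" is definitive: skip the exact stage
    return exactHas(url);              // confirm the occasional false positive
  };
}

// Toy stand-ins: the approximate stage wrongly reports "c" as present.
const approxSet = new Set(["a", "b", "c"]);
const exactSet = new Set(["a", "b"]);
const seen = makeStagedLookup((u) => approxSet.has(u), (u) => exactSet.has(u));
```

Here `seen("x")` returns false without consulting the exact store at all, while `seen("c")` pays one exact lookup to catch the false positive.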
Exact set versus compact summary

Exact: store every key. Memory scales with the number of elements, plus allocator overhead, pointers, and metadata.

Probabilistic: store evidence, not the elements. Memory scales with a target error budget: bits, counters, or registers instead of the raw key set.

The running example: 10B URLs, a 1GB RAM budget, a membership question, and a likely move of "approximate filter plus staged exact check."
§02

Bloom filters

A Bloom filter is a bit array plus k hash functions. To insert a key, hash it k ways and set those bit positions to 1. To query a key, hash it the same ways and check those positions. If any bit is 0, the key is definitely absent. If all are 1, the key is probably present.

The false-positive rate is not hand-waving. For m bits, n inserted items, and k hashes, the standard approximation is (1 - e^(-kn/m))^k. That lets you size the filter from the workload instead of guessing.

Bloom filters buy tiny memory and fast negative lookups. In exchange, they cannot list contents, cannot count frequency, and cannot delete cleanly without upgrading to a counting variant.
Target a false-positive rate: from the expected item count and the target rate you can derive the bits needed, the number of hash functions, and the memory per item, then check the observed false-positive rate against a sample workload.
bloom_filter.js - membership with explicit false-positive control
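A minimal sketch of such a filter, sized from the formula above. The class and hash choices here are illustrative, not a canonical implementation; it uses double hashing (two base hashes combined as h1 + i·h2) in place of k independent hash functions:

```javascript
// Illustrative Bloom filter (string keys). Sizing uses the standard formulas:
// m = -n * ln(p) / (ln 2)^2 bits and k = (m / n) * ln 2 hash functions
// for n expected items at target false-positive rate p.
class BloomFilter {
  constructor(expectedItems, falsePositiveRate) {
    this.m = Math.ceil(
      (-expectedItems * Math.log(falsePositiveRate)) / (Math.LN2 * Math.LN2)
    );
    this.k = Math.max(1, Math.round((this.m / expectedItems) * Math.LN2));
    this.bits = new Uint8Array(Math.ceil(this.m / 8));
  }

  // Double hashing: k positions derived as h1 + i * h2 from two base hashes
  // (FNV-1a and djb2 here) instead of k independent hash functions.
  positions(key) {
    let h1 = 2166136261;
    let h2 = 5381;
    for (const ch of String(key)) {
      const c = ch.charCodeAt(0);
      h1 = Math.imul(h1 ^ c, 16777619) >>> 0;
      h2 = (Math.imul(h2, 33) + c) >>> 0;
    }
    h2 = (h2 | 1) >>> 0; // force an odd step so positions spread over the whole array
    const out = [];
    for (let i = 0; i < this.k; i++) out.push((h1 + i * h2) % this.m);
    return out;
  }

  add(key) {
    for (const p of this.positions(key)) this.bits[p >> 3] |= 1 << (p & 7);
  }

  // False negatives are impossible; false positives occur at roughly the target rate.
  has(key) {
    return this.positions(key).every((p) => (this.bits[p >> 3] & (1 << (p & 7))) !== 0);
  }
}

// 1,000 expected URLs at a 1% target false-positive rate.
const bf = new BloomFilter(1000, 0.01);
for (let i = 0; i < 1000; i++) bf.add(`url-${i}`);
```

Plugging in the formula, 1,000 items at 1% yields m = 9,586 bits (about 1.2 KB) and k = 7 hashes: the filter is sized from the workload, exactly as the text describes.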

      
§03

Count-min sketch

Count-min sketch is for a different question: how often did this item appear in a stream? It keeps a small table of counters arranged in rows and columns. Each row has its own hash function. Updating an item increments one counter per row. Querying an item returns the minimum of those counters.

Why the minimum? Collisions only push counts upward, never downward. Taking the minimum across rows gives the least inflated answer available. The knobs are usually framed as epsilon for additive error and delta for failure probability, which translate to width and depth.

Count-min sketch is not trying to know everything. It is trying to know enough to spot heavy hitters and approximate frequencies in streams that are too large to store exactly.
Frequency estimation from a stream: choosing a width and depth fixes the maximum additive error and the failure probability.
count_min_sketch.js - streaming counts with error controls
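A compact sketch of the structure, with epsilon and delta mapped to width and depth in the standard way (width = ⌈e/ε⌉, depth = ⌈ln(1/δ)⌉). Class name and hash choice are illustrative:

```javascript
// Illustrative count-min sketch. width bounds the additive error by
// epsilon * N (N = total stream count); depth bounds the failure
// probability of that guarantee by delta.
class CountMinSketch {
  constructor(epsilon, delta) {
    this.width = Math.ceil(Math.E / epsilon);
    this.depth = Math.ceil(Math.log(1 / delta));
    this.rows = Array.from({ length: this.depth }, () => new Uint32Array(this.width));
    // A distinct seed per row gives each row its own hash function.
    this.seeds = Array.from(
      { length: this.depth },
      (_, i) => (0x9e3779b9 ^ Math.imul(i + 1, 0x85ebca6b)) >>> 0
    );
  }

  hash(key, seed) {
    let h = seed;
    for (const ch of String(key)) {
      h = Math.imul(h ^ ch.charCodeAt(0), 2654435761) >>> 0;
    }
    return h % this.width;
  }

  update(key, count = 1) {
    for (let r = 0; r < this.depth; r++) {
      this.rows[r][this.hash(key, this.seeds[r])] += count;
    }
  }

  // Collisions only inflate counters, so the minimum across rows is the
  // least-inflated estimate available -- never below the true count.
  query(key) {
    let min = Infinity;
    for (let r = 0; r < this.depth; r++) {
      min = Math.min(min, this.rows[r][this.hash(key, this.seeds[r])]);
    }
    return min;
  }
}

// epsilon = 0.01, delta = 0.01 -> a 272-wide, 5-deep table of counters.
const cms = new CountMinSketch(0.01, 0.01);
cms.update("apple", 5);
cms.update("banana", 3);
```

Note the asymmetry the text describes: `query` can overestimate when rows collide, but it can never return less than the true count.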

      
§04

HyperLogLog

HyperLogLog answers a third question: how many distinct elements have I seen? Instead of storing the elements, it hashes them, uses some of the hash bits to pick one of many registers, and keeps, per register, the longest run of leading zeros observed in the remaining bits. A rare, long run of leading zeros is evidence that many distinct items must have passed through.

The main control is precision: more registers mean lower relative error, roughly 1.04 / sqrt(m) where m is the number of registers. That is why systems like Redis can offer approximate cardinality with a tiny, nearly fixed memory footprint.

HyperLogLog is one of the cleanest examples of "scale by changing the question." You stop asking for the set itself and ask only for the size of the set.
Precision versus error: the register count fixes the expected relative error; comparing the estimated distinct count against an exact count shows the actual relative error and the nearly fixed memory shape.
hyperloglog.js - approximate distinct counting with precision control
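A simplified sketch under stated assumptions: a 32-bit hash, the linear-counting correction for small cardinalities, and none of the sparse representation or bias tables that production implementations (such as Redis's) add. Class name and hash are illustrative:

```javascript
// Simplified HyperLogLog: p index bits select one of m = 2^p registers;
// each register keeps the longest leading-zero run seen in the other bits.
class HyperLogLog {
  constructor(precision = 12) {
    this.p = precision;       // register-index bits
    this.m = 1 << precision;  // number of registers
    this.registers = new Uint8Array(this.m);
  }

  // FNV-1a with a murmur3-style finalizer for well-mixed bits.
  hash(key) {
    let h = 2166136261;
    for (const ch of String(key)) h = Math.imul(h ^ ch.charCodeAt(0), 16777619) >>> 0;
    h ^= h >>> 16;
    h = Math.imul(h, 0x85ebca6b) >>> 0;
    h ^= h >>> 13;
    h = Math.imul(h, 0xc2b2ae35) >>> 0;
    h ^= h >>> 16;
    return h >>> 0;
  }

  add(key) {
    const h = this.hash(key);
    const idx = h >>> (32 - this.p);   // top p bits pick the register
    const rest = (h << this.p) >>> 0;  // remaining 32 - p bits
    const rank = rest === 0 ? 32 - this.p + 1 : Math.clz32(rest) + 1;
    if (rank > this.registers[idx]) this.registers[idx] = rank; // keep the longest run
  }

  count() {
    const alpha = 0.7213 / (1 + 1.079 / this.m);
    let sum = 0;
    let zeros = 0;
    for (const r of this.registers) {
      sum += 2 ** -r;
      if (r === 0) zeros++;
    }
    let estimate = (alpha * this.m * this.m) / sum;
    // Linear-counting correction for small cardinalities.
    if (estimate <= 2.5 * this.m && zeros > 0) {
      estimate = this.m * Math.log(this.m / zeros);
    }
    return Math.round(estimate);
  }
}

// precision 12 -> 4,096 one-byte registers, expected error about 1.04/sqrt(4096) = 1.6%.
const hll = new HyperLogLog(12);
for (let i = 0; i < 10000; i++) hll.add(`user-${i}`);
```

Because registers only keep maxima, re-adding keys the structure has already seen leaves the estimate unchanged: duplicates are free, which is exactly what a distinct count requires.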

      
§05

The philosophy

There is a recurring engineering mistake: treating approximation as a moral failure rather than a workload decision. In real systems, exactness competes with memory, latency, bandwidth, and operational simplicity. If the tolerated error is explicit and the business consequence is acceptable, an approximate answer can be the higher-quality design.

That is the deeper lesson of these structures. They do not merely compress data. They compress decision-relevant information. A Bloom filter only needs to know enough to skip work most of the time. A count-min sketch only needs to know enough to rank frequent items. A HyperLogLog only needs to know enough to estimate cardinality.

Trading exactness for scale is not surrender. It is choosing the cheapest answer that still supports the product decision you actually need to make.