Information Theory: Entropy and Surprise

Why information is measured in surprises, and what that reveals about communication

What is information? Not in the vague everyday sense, but precisely, mathematically. Claude Shannon answered this question in 1948 with a radical insight: information is surprise. And the measure of surprise has a name you might not expect — entropy.

The Bit: Information’s Atom

Let’s start simple. What’s the smallest possible unit of information?

A bit — the answer to a yes/no question.

Is the light on? Yes or no. One bit.

Did the coin land heads? Yes or no. One bit.

Every piece of information, no matter how complex, can be broken down into bits — binary choices, yes/no questions, 0s and 1s.

This isn’t just a computer thing. It’s fundamental. One bit is the smallest possible reduction of uncertainty.


Information as Surprise

Here’s Shannon’s key insight: Information is about reducing uncertainty.

If I tell you “the sun rose this morning,” I’ve given you zero information. You already knew that. No surprise, no information.

If I tell you “it’s snowing in the Sahara,” that’s highly informative. You didn’t expect it. Maximum surprise, maximum information.

Information content is inversely proportional to probability. The less likely a message, the more information it contains.

This leads to a precise formula:

I(x) = -log₂(P(x))

Information of event x = negative log of its probability.

  • Probability 1/2 (coin flip) → 1 bit of information
  • Probability 1/4 → 2 bits of information
  • Probability 1/8 → 3 bits of information

Each halving of probability adds one bit.
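To make the formula concrete, here is a tiny Python sketch of self-information (the function name `self_information` is just an illustrative choice, not a standard library call):

```python
import math

def self_information(p: float) -> float:
    """Information content of an event with probability p, in bits."""
    return -math.log2(p)

print(self_information(1/2))  # 1.0 bit  (fair coin flip)
print(self_information(1/4))  # 2.0 bits
print(self_information(1/8))  # 3.0 bits
```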

Entropy: Average Surprise

Entropy is the expected value of information — the average surprise over all possible messages.

For a fair coin (50/50 heads/tails), entropy is 1 bit. Each flip gives you 1 bit of information on average.

For a weighted coin (90% heads, 10% tails), entropy is less than 1 bit. Most flips are predictable (heads), so you learn less on average: about 0.47 bits, to be specific. The formula is H = -Σ p(x) log₂(p(x)), summed over all outcomes.

Maximum entropy = maximum uncertainty = maximum information capacity.

A fair coin has higher entropy than a weighted coin. A fair die has higher entropy than either.
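Here is a minimal sketch of that formula applied to the coins and die above (the helper `entropy` is written out by hand purely for illustration):

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit     (fair coin)
print(entropy([0.9, 0.1]))   # ~0.469 bits (weighted coin)
print(entropy([1/6] * 6))    # ~2.585 bits (fair die)
```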


Why This Matters

Shannon’s entropy isn’t just abstract math. It has profound implications:

Compression Has Limits

You can compress data by removing redundancy. English text is redundant — if I write “th_”, you can guess the next letter is probably ‘e’ or ‘a’.

But there’s a minimum size you can compress to, set by the entropy of the source. You can’t compress random data (maximum entropy) at all.

"The source coding theorem: you can compress data to its entropy, but no further." — Claude Shannon

This is why:

  • Random data doesn’t compress
  • Encrypted data looks random (high entropy)
  • Natural language compresses well (lots of redundancy, lower entropy)
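You can watch this happen with Python’s built-in zlib compressor; a quick sketch (exact compressed sizes will vary a little with input and zlib version):

```python
import os
import zlib

# Highly redundant text: compresses dramatically.
text = b"the quick brown fox jumps over the lazy dog. " * 1000
print(len(text), "->", len(zlib.compress(text)))    # 45000 -> a few hundred bytes

# Random bytes: maximum entropy, essentially incompressible.
noise = os.urandom(45000)
print(len(noise), "->", len(zlib.compress(noise)))  # 45000 -> slightly larger than the input
```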

Communication Requires Redundancy

In a noisy channel, some information gets lost. To communicate reliably, you need redundancy — saying things in multiple ways so errors can be corrected.

But redundancy lowers entropy (makes messages more predictable). There’s a tradeoff:

  • High entropy = efficient, but fragile to noise
  • Low entropy (high redundancy) = inefficient, but robust

This is why we spell words out on bad phone connections, or why DNA has error-correction mechanisms.
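The simplest error-correcting scheme makes the tradeoff explicit: a repetition code sends every bit three times and majority-votes at the receiver, tripling the message length in exchange for surviving occasional flips. A sketch with a simulated noisy channel (the 5% flip probability is an arbitrary choice for illustration):

```python
import random

def encode(bits):
    """Repetition code: transmit every bit three times (adds redundancy)."""
    return [b for b in bits for _ in range(3)]

def noisy_channel(bits, flip_prob=0.05):
    """Flip each transmitted bit independently with probability flip_prob."""
    return [b ^ (random.random() < flip_prob) for b in bits]

def decode(bits):
    """Majority vote over each group of three: corrects any single flip."""
    return [int(sum(bits[i:i+3]) >= 2) for i in range(0, len(bits), 3)]

message = [random.randint(0, 1) for _ in range(1000)]
received = decode(noisy_channel(encode(message)))
errors = sum(m != r for m, r in zip(message, received))
print(f"{errors} errors out of {len(message)} bits")  # far fewer than the ~50 expected without coding
```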

The Bit Is Universal

Once you measure information in bits, you realize:

  • A gene is information (about 2 bits per nucleotide pair)
  • A photo is information (millions of bits, compressed via JPEG)
  • A book is information (typically 1-2 MB uncompressed)
  • Your brain processes information (estimated 10-100 trillion bits stored)

All measured in the same unit, regardless of medium.


Entropy in Physics vs. Information

Here’s where it gets wild: Shannon borrowed the term “entropy” from thermodynamics. And it turns out they’re deeply connected.

Thermodynamic entropy: Measure of disorder in a physical system. High entropy = more possible microstates.

Information entropy: Measure of uncertainty in a message. High entropy = more possible messages.

They’re the same concept! Entropy is fundamentally about counting possibilities.

This led to profound insights:

  • Information has physical limits (Landauer’s principle: erasing a bit requires energy)
  • Black holes have entropy proportional to their surface area
  • The second law of thermodynamics is really about information loss

Maxwell’s demon — the thought experiment about sorting molecules — is fundamentally about information and entropy!
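Landauer’s principle puts a number on the first point above: erasing one bit dissipates at least k_B·T·ln 2 of energy. A quick back-of-the-envelope check at room temperature (the script is just arithmetic with the standard constants):

```python
import math

k_B = 1.380649e-23   # Boltzmann constant, J/K
T = 300              # room temperature, K

energy_per_bit = k_B * T * math.log(2)
print(f"{energy_per_bit:.2e} J per erased bit")  # ~2.87e-21 J: tiny, but not zero
```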

Mutual Information: Shared Surprise

If I know X, how much does that tell me about Y?

Mutual information measures how much uncertainty about Y is reduced by knowing X.

Examples:

  • Height and weight have positive mutual information (knowing one tells you something about the other)
  • Coin flips have zero mutual information (independent events)
  • A ciphertext and its plaintext, given the key, have maximal mutual information (they’re the same information in a different form); without the key, a good cipher drives it toward zero

This is crucial for:

  • Machine learning (finding features that have high mutual information with labels)
  • Neuroscience (measuring how much information neurons share)
  • Genetics (finding correlations between genes and traits)
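Formally, I(X;Y) = Σ p(x,y) log₂[ p(x,y) / (p(x)·p(y)) ]. A minimal sketch over a joint probability table (the two toy distributions below are made up for illustration):

```python
import math

def mutual_information(joint):
    """I(X;Y) in bits, from a joint distribution given as a nested dict p[x][y]."""
    px = {x: sum(row.values()) for x, row in joint.items()}
    py = {}
    for row in joint.values():
        for y, p in row.items():
            py[y] = py.get(y, 0.0) + p
    return sum(
        p * math.log2(p / (px[x] * py[y]))
        for x, row in joint.items()
        for y, p in row.items()
        if p > 0
    )

# Two fair coins flipped independently: knowing one tells you nothing about the other.
independent = {"H": {"H": 0.25, "T": 0.25}, "T": {"H": 0.25, "T": 0.25}}
print(mutual_information(independent))  # 0.0 bits

# Perfectly correlated outcomes: knowing X fully determines Y.
correlated = {"H": {"H": 0.5, "T": 0.0}, "T": {"H": 0.0, "T": 0.5}}
print(mutual_information(correlated))   # 1.0 bit
```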

Building Intuition

Key concepts to internalize:

  1. Information = surprise — Likely events carry little information; unlikely events carry lots
  2. Entropy = average information — How much you learn on average from a source
  3. Redundancy = predictability — Lower entropy means more predictable, less surprising

Why Shannon’s Framework Is Brilliant

Before Shannon, “information” was vague. Data, facts, knowledge, news — all mixed together.

Shannon’s theory:

  • Made information quantifiable (measured in bits)
  • Made it substrate-independent (doesn’t matter if it’s ink, electricity, or DNA)
  • Gave us fundamental limits (compression limits, channel capacity)
  • Connected to physics (thermodynamic entropy)

This enabled the digital age. Every text message, every video stream, every computer chip — designed using Shannon’s theory.

Real-World Applications

  • Data compression (ZIP, JPEG, MP3 — all based on entropy coding)
  • Error correction (CDs, QR codes, deep space communication)
  • Cryptography (measuring security in bits of entropy)
  • Machine learning (cross-entropy loss, mutual information)
  • Genetics (measuring information content in DNA)
  • Neuroscience (neural coding, information transfer in brain)
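To make the machine-learning bullet concrete: cross-entropy loss is the surprise the model experiences when the true label arrives, measured under its own predicted probabilities. A toy sketch (real libraries use natural log and batched tensors; bits are used here only to match the rest of the post):

```python
import math

def cross_entropy(true_label, predicted_probs):
    """Bits of surprise when the true label shows up, under the model's prediction."""
    return -math.log2(predicted_probs[true_label])

# A confident, correct prediction costs little...
print(cross_entropy("cat", {"cat": 0.9, "dog": 0.1}))  # ~0.15 bits
# ...a confident, wrong prediction costs a lot.
print(cross_entropy("cat", {"cat": 0.1, "dog": 0.9}))  # ~3.32 bits
```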

The Beautiful Paradox

Here’s what gets me: information is substrate-independent, but physically real.

You can encode the same information in:

  • Ink on paper
  • Electrical signals
  • Magnetic domains on a hard drive
  • DNA nucleotides
  • Photons in fiber optics

The information is the same. The medium is irrelevant.

Yet information still has physical consequences. It requires energy. It takes up space. It can’t travel faster than light.

Information seems abstract — just patterns, just meanings. But Shannon showed it’s as real and measurable as mass or energy.

In a sense, information might be more fundamental. Matter and energy are information in physical form.

My Takeaway

Information theory changed how I think about communication, uncertainty, and knowledge itself.

Every time I compress a file, I’m bumping up against Shannon’s limit.

Every time I explain something, I’m trying to maximize information transfer (high surprise for things you don’t know, low redundancy so I’m not wasting bits).

Every time I learn something shocking, I’m experiencing high information content.

The universe is saturated with information — in DNA, in neural firing patterns, in cosmic microwave background radiation, in the text you’re reading right now.

And thanks to Shannon, we can measure it precisely.

That’s beautiful.


Resources: Shannon’s original 1948 paper “A Mathematical Theory of Communication” is surprisingly readable. Also check out “Information Theory, Inference, and Learning Algorithms” by David MacKay (free PDF online).

