
Compression & Information Theory Lab

Encode text with Run-Length Encoding and Huffman coding, then compare their compression ratios. Build Huffman trees and visualize variable-length codes. Calculate Shannon entropy and see why it is the theoretical lower bound for lossless compression.

Guided Experiment: Entropy as Compression Lower Bound

Can any lossless compression algorithm compress a message below its Shannon entropy (in bits per symbol)? How does Huffman average bits compare to entropy?

Write your hypothesis in the Lab Report panel, then click Next.


RLE Analysis

Runs Identified
Character | Count | Encoded
'A'       | 3     | 3A
'B'       | 3     | 3B
'C'       | 4     | 4C
'D'       | 2     | 2D
'E'       | 2     | 2E
Encoded Output: 3A3B4C2D2E
Original Size: 112 bits
Encoded Size: 80 bits
Compression Ratio: 0.714 (smaller is better)
Space Savings: 28.6%
Step-by-Step
\text{Input: ``AAABBBCCCCDDEE'' (14 chars, 112 bits)}
\text{Run 1: } \texttt{A} \times 3 \rightarrow \texttt{3A}
\text{Run 2: } \texttt{B} \times 3 \rightarrow \texttt{3B}
\text{Run 3: } \texttt{C} \times 4 \rightarrow \texttt{4C}
\text{Run 4: } \texttt{D} \times 2 \rightarrow \texttt{2D}
\text{Run 5: } \texttt{E} \times 2 \rightarrow \texttt{2E}
\text{Encoded: ``3A3B4C2D2E'' (10 chars, 80 bits)}
\text{Ratio} = \frac{80}{112} = 0.714
\text{Savings: } 28.6\%
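The steps above can be reproduced with a minimal Python sketch. The `rle_encode` helper and its name are my own, not part of the lab; sizes assume 8 bits per character, matching the numbers shown:

```python
from itertools import groupby

def rle_encode(text: str) -> str:
    """Run-length encode: each run of identical chars becomes <count><char>."""
    return "".join(f"{len(list(run))}{ch}" for ch, run in groupby(text))

original = "AAABBBCCCCDDEE"
encoded = rle_encode(original)
print(encoded)                                 # 3A3B4C2D2E
ratio = (len(encoded) * 8) / (len(original) * 8)
print(round(ratio, 3))                         # 0.714
```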

Data Table

# | Input | Algorithm | Original Bits | Compressed Bits | Ratio | Entropy (bits)

Reference Guide

Run-Length Encoding

RLE replaces consecutive identical characters with a count and the character. It works best on data with long runs.

\texttt{AAABBBCC} \to \texttt{3A3B2C}

For text without runs (like English prose), RLE can actually increase the size because each character needs a count prefix.

Huffman Coding

Huffman coding assigns shorter bit codes to more frequent characters, creating an optimal prefix-free code.

\text{Frequent chars} \to \text{short codes}, \quad \text{Rare chars} \to \text{long codes}

The Huffman tree is built bottom-up by repeatedly combining the two lowest-frequency nodes.
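The bottom-up merge can be sketched with a min-heap. This is my own minimal illustration (the `huffman_codes` name and the dict-of-codes representation are assumptions, not the lab's implementation); each merge prefixes `0` to codes in the lighter subtree and `1` to the heavier one:

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict:
    """Build Huffman codes by repeatedly merging the two lowest-frequency nodes."""
    freq = Counter(text)
    # Heap entries: (frequency, tiebreaker, {char: code-so-far}).
    heap = [(f, i, {ch: ""}) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {ch: "0" + c for ch, c in left.items()}   # left subtree gets 0
        merged.update({ch: "1" + c for ch, c in right.items()})  # right gets 1
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

text = "AAABBBCCCCDDEE"
codes = huffman_codes(text)
avg_bits = sum(len(codes[ch]) for ch in text) / len(text)
print(round(avg_bits, 3))   # 2.286
```

For the lab string, the frequent symbols A, B, C get 2-bit codes and the rare D, E get 3-bit codes, for 32 bits total (versus 112 bits at 8 bits per character).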

Shannon Entropy

Shannon entropy measures the average information content per symbol. It is the theoretical minimum bits needed per symbol.

H(X) = -\sum_{x} p(x) \log_2 p(x)

Maximum entropy occurs when all symbols are equally likely. Low entropy means the data is predictable and compresses well.
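The formula above translates directly into a few lines of Python (the `shannon_entropy` name is my own). Four equally likely symbols give the maximum 2 bits per symbol, while a single repeated symbol gives 0:

```python
from collections import Counter
from math import log2

def shannon_entropy(text: str) -> float:
    """H(X) = -sum over symbols of p(x) * log2 p(x), in bits per symbol."""
    n = len(text)
    return -sum((c / n) * log2(c / n) for c in Counter(text).values())

print(shannon_entropy("ABCD"))   # 2.0  (uniform: maximum entropy)
print(shannon_entropy("AAAA"))   # 0.0  (fully predictable)
```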

Compression Ratio

The compression ratio measures how much smaller the compressed data is compared to the original.

\text{Ratio} = \frac{\text{Compressed Size}}{\text{Original Size}}, \quad \text{Savings} = 1 - \text{Ratio}

Huffman coding approaches the entropy limit: its average code length always falls within one bit of H(X). No lossless compression can consistently achieve fewer average bits per symbol than the entropy.
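The two formulas can be checked against the RLE example numbers (112 bits compressed to 80 bits); the `compression_stats` helper is my own illustration:

```python
def compression_stats(original_bits: int, compressed_bits: int):
    """Return (ratio, savings): ratio = compressed/original, savings = 1 - ratio."""
    ratio = compressed_bits / original_bits
    return ratio, 1 - ratio

ratio, savings = compression_stats(112, 80)
print(f"{ratio:.3f}  {savings:.1%}")   # 0.714  28.6%
```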