
Compression & Information Theory Lab

Encode text with Run-Length Encoding and Huffman coding, then compare their compression ratios. Build Huffman trees and visualize variable-length codes. Calculate Shannon entropy and see why it is the theoretical lower bound for lossless compression.

Guided Experiment: Entropy as Compression Lower Bound

Can any lossless compression algorithm compress a message below its Shannon entropy (in bits per symbol)? How does Huffman average bits compare to entropy?

Write your hypothesis in the Lab Report panel, then click Next.


RLE Analysis

Runs Identified
Character | Count | Encoded
'A'       | 3     | 3A
'B'       | 3     | 3B
'C'       | 4     | 4C
'D'       | 2     | 2D
'E'       | 2     | 2E
Encoded Output: 3A3B4C2D2E
Original Size: 112 bits
Encoded Size: 80 bits
Compression Ratio: 0.714 (smaller is better)
Space Savings: 28.6%
Step-by-Step
\text{Input: ``AAABBBCCCCDDEE'' (14 chars, 112 bits)}
\text{Run 1: } \texttt{A} \times 3 \rightarrow \texttt{3A}
\text{Run 2: } \texttt{B} \times 3 \rightarrow \texttt{3B}
\text{Run 3: } \texttt{C} \times 4 \rightarrow \texttt{4C}
\text{Run 4: } \texttt{D} \times 2 \rightarrow \texttt{2D}
\text{Run 5: } \texttt{E} \times 2 \rightarrow \texttt{2E}
\text{Encoded: ``3A3B4C2D2E'' (10 chars, 80 bits)}
\text{Ratio} = \frac{80}{112} = 0.714
\text{Savings: } 28.6\%
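The steps above can be reproduced with a minimal Python sketch. The `rle_encode` helper and its name are my own, not part of the lab; sizes assume 8 bits per character, matching the numbers shown:

```python
from itertools import groupby

def rle_encode(text: str) -> str:
    """Run-length encode: each run of identical chars becomes <count><char>."""
    return "".join(f"{len(list(run))}{ch}" for ch, run in groupby(text))

original = "AAABBBCCCCDDEE"
encoded = rle_encode(original)
print(encoded)                                 # 3A3B4C2D2E
ratio = (len(encoded) * 8) / (len(original) * 8)
print(round(ratio, 3))                         # 0.714
```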

Data Table

# | Input | Algorithm | Original Bits | Compressed Bits | Ratio | Entropy (bits)

Reference Guide

Run-Length Encoding

RLE replaces consecutive identical characters with a count and the character. It works best on data with long runs.

\texttt{AAABBBCC} \to \texttt{3A3B2C}

For text without runs (like English prose), RLE can actually increase the size because each character needs a count prefix.

Huffman Coding

Huffman coding assigns shorter bit codes to more frequent characters, creating an optimal prefix-free code.

\text{Frequent chars} \to \text{short codes}, \quad \text{Rare chars} \to \text{long codes}

The Huffman tree is built bottom-up by repeatedly combining the two lowest-frequency nodes.
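The bottom-up merge can be sketched with a min-heap. This is my own minimal illustration (the `huffman_codes` name and the dict-of-codes representation are assumptions, not the lab's implementation); each merge prefixes `0` to codes in the lighter subtree and `1` to the heavier one:

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict:
    """Build Huffman codes by repeatedly merging the two lowest-frequency nodes."""
    freq = Counter(text)
    # Heap entries: (frequency, tiebreaker, {char: code-so-far}).
    heap = [(f, i, {ch: ""}) for i, (ch, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {ch: "0" + c for ch, c in left.items()}   # left subtree gets 0
        merged.update({ch: "1" + c for ch, c in right.items()})  # right gets 1
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

text = "AAABBBCCCCDDEE"
codes = huffman_codes(text)
avg_bits = sum(len(codes[ch]) for ch in text) / len(text)
print(round(avg_bits, 3))   # 2.286
```

For the lab string, the frequent symbols A, B, C get 2-bit codes and the rare D, E get 3-bit codes, for 32 bits total (versus 112 bits at 8 bits per character).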

Shannon Entropy

Shannon entropy measures the average information content per symbol. It is the theoretical minimum bits needed per symbol.

H(X) = -\sum_{x} p(x) \log_2 p(x)

Maximum entropy occurs when all symbols are equally likely. Low entropy means the data is predictable and compresses well.
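The formula above translates directly into a few lines of Python (the `shannon_entropy` name is my own). Four equally likely symbols give the maximum 2 bits per symbol, while a single repeated symbol gives 0:

```python
from collections import Counter
from math import log2

def shannon_entropy(text: str) -> float:
    """H(X) = -sum over symbols of p(x) * log2 p(x), in bits per symbol."""
    n = len(text)
    return -sum((c / n) * log2(c / n) for c in Counter(text).values())

print(shannon_entropy("ABCD"))   # 2.0  (uniform: maximum entropy)
print(shannon_entropy("AAAA"))   # 0.0  (fully predictable)
```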

Compression Ratio

The compression ratio measures how much smaller the compressed data is compared to the original.

\text{Ratio} = \frac{\text{Compressed Size}}{\text{Original Size}}, \quad \text{Savings} = 1 - \text{Ratio}

Huffman coding approaches the entropy limit: its average code length always falls within one bit of H(X). No lossless compression can consistently achieve fewer average bits per symbol than the entropy.
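The two formulas can be checked against the RLE example numbers (112 bits compressed to 80 bits); the `compression_stats` helper is my own illustration:

```python
def compression_stats(original_bits: int, compressed_bits: int):
    """Return (ratio, savings): ratio = compressed/original, savings = 1 - ratio."""
    ratio = compressed_bits / original_bits
    return ratio, 1 - ratio

ratio, savings = compression_stats(112, 80)
print(f"{ratio:.3f}  {savings:.1%}")   # 0.714  28.6%
```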