# Huffman coding

The Huffman coding is a form of entropy , which in 1952 by David A. Huffman developed in the essay A Method for the Construction of Minimum Redundancy Codes was published. It assigns code words of variable length to a fixed number of source symbols. In information technology it is a prefix code that is usually used for lossless compression. Similar to other entropy encodings, more frequently occurring characters are represented with fewer bits than less frequently occurring characters.

## Basics

In order to display data with as little redundancy as possible , the source symbols must be coded with code words of different word lengths . The length of the code words ideally corresponds to their information content . In order to be able to uniquely decode a code again, it must fulfill Kraft's inequality and also be prefix-free , ie no code word may be the beginning of another.

The basic idea is to use a k -nary root tree (a tree with k children per node) to represent the code. In this so-called Huffman tree , the leaves represent the characters to be coded, while the path from the root to the leaf determines the code symbol. In contrast to the Shannon-Fano coding , the tree is doing (from the leaves to the root English bottom-up created).

In contrast to Morse code, Huffman coding does not require any separators. A separation of the code words is not necessary because the coding is prefix-free. As a result, no code word is the beginning of another code word.

The tree obtained from Huffman coding guarantees an optimal and prefix-free coding. I.e. there is no symbol-related coding method that could generate a shorter code if the probability of the symbols appearing is known.

## history

In 1951, David A. Huffman and his classmates at MIT had the choice between a term paper and a final exam in the Information Theory course. The seminar paper , supervised by Professor Robert M. Fano , should focus on finding the most efficient binary code. Huffman, unable to prove the efficiency of a code, was just about to make the decision to give up and prepare for the final exam when he hit upon the idea of using a frequency- sorted binary tree , and thus in no time at all Method could prove to be the most efficient.

In this way, Huffman surpassed his Professor Fano, who developed a similar code together with the founder of information theory, Claude Shannon . By making the tree bottom-up instead of top-down, Huffman avoided the main weakness of the suboptimal Shannon-Fano coding .

## algorithm

Definitions

• X is the source symbol alphabet - the set of characters that make up the source symbols
• p x is the a priori probability of the symbol x (the relative frequency)
• C is the code alphabet - the set of characters that make up the code words
• m is the thickness of the code alphabet C - the number of different characters${\ displaystyle \ vert C \ vert}$

Structure of the tree

1. For each source symbol, find the relative frequency , ie count how often each character occurs and divide by the number of all characters.
2. Create a single node (the simplest form of a tree) for each source symbol and note the frequency in / at the node.
3. Repeat the following steps until there is only one tree left:
1. Choose the m subtrees with the lowest frequency in the roots, if there are several options, choose the subtrees with the lowest depth.
2. Combine these trees to form a new (partial) tree.
3. Write down the sum of the frequencies in the root.

Construction of the code book

1. Assign a character from the code alphabet to each child of a node.
2. Read out the code word for each source symbol (leaf in the tree):
1. Start at the root of the tree.
2. The code characters on the edges of the path (in this order) result in the associated code word.

Coding

1. Read in a source symbol.
2. Find the corresponding code word from the code book.
3. Enter the code word.

### Medium word length

The mean length of a code word can be calculated in three ways.

• About the weighted sum of the code word lengths:
${\ displaystyle {\ overline {l}} = \ sum _ {x \ in X} p_ {x} l_ {x}}$ The sum of the number of steps in the tree multiplied by the frequency of a symbol.
• By adding up the probabilities of occurrence at all intermediate nodes of the Huffman tree.
• If the elements to be coded have the same frequency, the mean length is
${\ displaystyle l = \ log _ {2} \ m}$with as the number of elements to be coded.${\ displaystyle m \ in N}$

### example

Huffman tree to the example. (The root of the tree is on the right, the leaves on the left.)

The source alphabet contains the characters: We choose the binary code: and as the code alphabet . The text should be compressed (without the spaces). ${\ displaystyle X = \ {a, b, c, d \}}$${\ displaystyle C = \ {0.1 \}}$${\ displaystyle m = \ vert C \ vert = 2}$a ab abc abcd

Find the relative frequencies:

p a = 0.4; p b = 0.3; p c = 0.2; p d = 0.1

Construct a Huffman tree and then enter the code words on the edges. (See picture on the right)

Code dictionary:

 a 1 b 01 c 001 d 000

Code the original text:

 Original: a a b a b c a b c d Coded: 1 1 01 1 01 001 1 01 001 000

Average code word length:

• With a naive coding, every character would be coded as well.${\ displaystyle \ log _ {2} 4 = 2 \, {\ text {bit per symbol}}}$
• This Huffman coding also encodes each character
${\ displaystyle {\ overline {l}} = 0 {,} 4 \ times 1 + 0 {,} 3 \ times 2 + 0 {,} 2 \ times 3 + 0 {,} 1 \ times 3 = 0 {, } 4 + 0 {,} 6 + 0 {,} 6 + 0 {,} 3 = 1 {,} 9 \, {\ text {bit per symbol}}}$
• The entropy is included
{\ displaystyle {\ begin {aligned} H (X) & = - (0 {,} 4 \ cdot \ log _ {2} 0 {,} 4 + 0 {,} 3 \ cdot \ log _ {2} 0 {,} 3 + 0 {,} 2 \ cdot \ log _ {2} 0 {,} 2 + 0 {,} 1 \ cdot \ log _ {2} 0 {,} 1) \\ & = 0 {, } 529 + 0 {,} 521 + 0 {,} 464 + 0 {,} 332 \\ & = 1 {,} 85 \, {\ text {bit per symbol}} \ end {aligned}}}

Because the information content per source symbol is not an integer, a residual redundancy remains in the coding.

## Decoding

In order to decode a Huffman-coded data stream, the code table created in the encoder is necessary (with the classic method). Basically, the procedure is reversed as in the coding step. The Huffman tree is rebuilt in the decoder and with each incoming bit - starting from the root - the corresponding path in the tree is followed until one arrives at a leaf. This leaf is then the source symbol you are looking for, and you start decoding the next symbol again at the root.

### example

The decoder has the code dictionary:

 a 1 b 01 c 001 d 000

The path in the tree (see above) is now followed for each received bit, starting from the root, until a leaf has been reached. Once a leaf is reached, the decoder records the symbol of the leaf and starts over from the root until the next leaf is reached.

 Part path: 1 1 01 1 01 001 1 01 001 000 Corresponding sheet: a a b a b c a b c d

## Optimality

The following applies to the mean code word length of a Huffman code (see also) ${\ displaystyle {\ overline {l}}}$

${\ displaystyle \ mathrm {H} (X) \ leq {\ overline {l}} \ leq \ mathrm {H} (X) +1}$

This means that on average each code symbol needs at least as many digits as its information content, but at most one more.

${\ displaystyle {\ overline {l}} = \ mathrm {H} (X)}$holds if and only if all probabilities are powers of two ( ). In this case one says that the coding is optimal with regard to the entropy . ${\ displaystyle 2 ^ {- m_ {x}}, \; m_ {x} \ in \ mathbb {N} ^ {+}}$

If source symbols are combined into one large symbol , the mean code symbol lengths apply${\ displaystyle n}$${\ displaystyle y}$${\ displaystyle {\ overline {l}} _ {y}}$

${\ displaystyle \ mathrm {H} _ {n} (X) \ leq {\ overline {l}} _ {y} \ leq \ mathrm {H} _ {n} (X) + {\ frac {1} { n}}}$,

that is, as the number of jointly coded source symbols increases, the mean code word length asymptotically goes against the entropy - the coding is asymptotically optimal . ${\ displaystyle n}$

The adaptive Huffman coding continuously updates the tree. The initial tree is generated by assuming a given probability distribution for all source symbols (with complete ignorance of the source, a uniform distribution ). This is updated with each new source symbol, which may also change the code symbols. This update step can be reproduced in the decoder so that the code book does not need to be transmitted.

With this method, a data stream can be encoded on-the-fly . However, it is considerably more prone to transmission errors, since a single error leads to a completely wrong decoding - starting at the point of the error.