What are Huffman codes?
Huffman codes are an extremely common way of converting information to binary. While developing his encoding algorithm, Huffman had to make sure it followed these rules:
There must always be a unique way to decode a message.
The compression must be lossless: the message the sender encodes must be exactly the same as the one the receiver recovers when decoding.
Each letter must have its own distinct binary mapping (a separate requirement from the first rule).
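To see why the first rule matters, here is a minimal sketch (the mapping below is a hypothetical, badly chosen one, not from any real encoder) that enumerates every way a bit string can be read under a code that is not uniquely decodable:

```python
# Hypothetical mapping chosen to show ambiguity: "01" can be read two ways.
code = {"A": "0", "B": "01", "C": "1"}

def decodings(bits, code):
    """Return every possible way to split `bits` into letters."""
    if not bits:
        return [""]
    results = []
    for letter, word in code.items():
        if bits.startswith(word):
            for rest in decodings(bits[len(word):], code):
                results.append(letter + rest)
    return results

print(decodings("01", code))  # ['AC', 'B'] -- two different readings of the same bits
```

Because the receiver cannot tell which reading the sender meant, a mapping like this breaks the first rule even though every letter has its own distinct binary string.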
How do you quantify how good an encoding algorithm is?
It’s important to determine how we can measure information. Claude Shannon observed that probability and information are related: if an outcome is expected, not much information is gained from learning it. A statement about whether it is raining in a place where it is always raining isn’t very helpful. However, a lot more information is gained when you learn that it is raining in a place that has never seen rain.
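The quantity Shannon used to capture this is called self-information, measured in bits by taking logarithms base 2: an outcome that occurs with probability p carries

I(p) = log2(1/p) = -log2(p)

bits of information, so the rarer the outcome, the more bits it carries.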
Using the self-information formula above, we can define a quantity called information entropy, which describes how unpredictable a source of information is. It is computed as the weighted sum of the self-information of each possible symbol, where each term is weighted by that symbol's probability. A source with higher entropy is more unpredictable than one with lower entropy.
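Written out, for a source whose symbols appear with probabilities p1, p2, ..., pn, the entropy is

H = p1·log2(1/p1) + p2·log2(1/p2) + ... + pn·log2(1/pn)

each symbol's self-information weighted by how often that symbol shows up.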
Every encoding algorithm defines a mapping that takes each letter to a string of binary digits. How do you measure how good a mapping is? The quality of a mapping is determined by two things: the probability of each letter being sent and the length of its corresponding binary string. Assigning the longest binary strings to the most common letters would give extremely poor performance; instead, the most common letters should be given the shortest binary strings, reducing the overall length of an encoded message. A value called the average length per symbol combines these two factors into a single number used to analyze a mapping: for each letter, multiply the length of its binary string by its probability of appearing, then sum the results.
But how optimal can an encoding algorithm get? To keep the algorithm lossless, the average length per symbol can never go below the information entropy of the source (Shannon’s source coding theorem). The bound is met exactly when every symbol’s probability is a power of 1/2; distributions with that property are called “dyadic”.
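To make this concrete, here is a small sketch comparing a mapping's average length per symbol against the entropy of its source. The probabilities and binary strings below are made up for the example; they happen to form a dyadic distribution, so the two numbers come out equal:

```python
import math

# Hypothetical letter probabilities (dyadic: every value is a power of 1/2)
# and a hypothetical binary mapping for those letters.
probs   = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
mapping = {"A": "0", "B": "10", "C": "110", "D": "111"}

# Information entropy: probability-weighted sum of each symbol's self-information.
entropy = sum(p * math.log2(1 / p) for p in probs.values())

# Average length per symbol: probability-weighted sum of codeword lengths.
avg_len = sum(probs[s] * len(mapping[s]) for s in probs)

print(entropy, avg_len)  # 1.75 1.75 -- this mapping meets the entropy bound exactly
```

With a non-dyadic distribution, any per-letter mapping would land strictly above the entropy.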
Finding Mappings:
Shannon-Fano algorithm
The Shannon-Fano algorithm was the most common algorithm for lossless compression before Huffman’s code. Shannon-Fano attempts to create a mapping where the most common characters have the shortest binary strings and the less common ones have longer strings. Shannon-Fano is best described as a tree. The algorithm works as follows:
1. Create a root node that holds all of the characters.
2. Using the probability of each character appearing, split the characters into two groups whose total probabilities are as close to even as possible.
3. Assign a left and a right child to the current node, one for each group, then go back to step 2 on each child. Continue until each node holds at most one character.
The algorithm works similarly to a DFS. After the first split, the root is given a left and a right child, each holding its respective group. If the group in the right child has a size greater than one, it is split again based on the characters’ probabilities, and that child is assigned a left and a right child of its own. This continues on each new node until every node holds only one character, meaning no more splits can be made.
Each edge in the tree is also assigned a binary value, either 1 or 0: a right edge is a 1 and a left edge is a 0. A character’s binary string is read off the edges on the path from the root down to that character.
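Here is a minimal sketch of the whole procedure in Python, assuming a made-up set of character probabilities. It splits the characters recursively on the most even probability boundary and labels left edges 0 and right edges 1:

```python
def shannon_fano(symbols):
    """symbols: list of (character, probability) pairs, sorted by probability
    in descending order. Returns a dict mapping each character to its binary string."""
    if len(symbols) == 1:
        return {symbols[0][0]: ""}
    # Find the split point that makes the two groups' total probabilities most even.
    total = sum(p for _, p in symbols)
    running, best_split, best_diff = 0.0, 1, float("inf")
    for i in range(1, len(symbols)):
        running += symbols[i - 1][1]
        diff = abs(running - (total - running))
        if diff < best_diff:
            best_diff, best_split = diff, i
    left, right = symbols[:best_split], symbols[best_split:]
    # Left edges get a 0, right edges get a 1; recurse on each group.
    codes = {c: "0" + code for c, code in shannon_fano(left).items()}
    codes.update({c: "1" + code for c, code in shannon_fano(right).items()})
    return codes

probs = [("A", 0.4), ("B", 0.3), ("C", 0.2), ("D", 0.1)]  # hypothetical example
print(shannon_fano(probs))  # {'A': '0', 'B': '10', 'C': '110', 'D': '111'}
```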
Shannon-Fano is a robust algorithm, but it doesn’t always guarantee an optimal answer. Shannon and Fano had missed a key observation that only Huffman saw.
Huffman Codes:
Huffman codes build off of Shannon-Fano’s original algorithm. Since the binary mappings are determined by the edges, the length of each character’s binary string can be calculated by finding the depth of its node in the tree. Since longer binary strings are assigned to less common letters, the two least common letters will always be at the bottom of the tree (the greatest depth). The node joining the two least common characters is guaranteed to be in an optimal tree. The algorithm works as follows:
1. Create a heap containing the probabilities of all the characters.
2. Take the two least common characters and attach them both to a new parent node, removing their probabilities from the heap in the process. Assign the parent the sum of its children’s probabilities, then put the parent’s probability back into the heap. Repeat this step until only one node, the root, remains.
In some cases, even after summing the probabilities, the parent may still be the smallest item in the heap, in which case it will be used again in the very next merge.
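Here is a minimal sketch of the procedure in Python, assuming a made-up set of character probabilities and using the standard library’s heapq for the heap. Codewords are read off the finished tree with left edges as 0 and right edges as 1, the same convention as above:

```python
import heapq
from itertools import count

def huffman(probs):
    """probs: dict mapping character -> probability.
    Returns a dict mapping character -> binary string."""
    tiebreak = count()  # keeps ties in probability from comparing trees to strings
    # Each heap entry is (probability, tiebreaker, tree); a tree is either a single
    # character or a (left, right) pair of subtrees.
    heap = [(p, next(tiebreak), c) for c, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Pop the two least probable nodes and join them under a new parent whose
        # probability is the sum of its children's, then push the parent back.
        p1, _, t1 = heapq.heappop(heap)
        p2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, next(tiebreak), (t1, t2)))
    _, _, root = heap[0]
    # Read each codeword off the tree: left edge = 0, right edge = 1.
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"  # lone-symbol edge case
    walk(root, "")
    return codes

probs = {"A": 0.4, "B": 0.3, "C": 0.2, "D": 0.1}  # hypothetical example
print(huffman(probs))  # {'A': '0', 'B': '10', 'D': '110', 'C': '111'}
```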