# Huffman Coding Module

This Huffman coding module was my first serious attempt at writing anything useful in Python. It could be a bit better organized and perhaps I will at some point refactor it.

It does a series of fairly complex tasks:

- generating a table of the relative probability of bytes occuring within an input file
- using this data to construct a tree which represents the Huffman codes of all bytes that occur
- generating a Huffman code table from the tree
- compressing a file using a Huffman code table
- decompressing a file using a Huffman code table

## Huffman coding

Just a note for those who may not know. Huffman coding is a method of generating a code of bits of variable length representing all the symbols occuring in a given set of data. This allows information wasted by representing symbols of different probability with the same number of bits, to be removed compressing the size of the information.

For example, in the English language the letter *e* occurs far more frequently than other letters. In an English language text file, stored using one byte per character, the letter *e* takes up just as much space as less frequently used letters like *q* and *z*, 8 bits. By representing the character *e* using less than 8-bits and using more than 8-bits to represent these less frequently used characters, a more efficient representation can be achieved.

This would not produce any space savings in a set of truly random characters, but in a set which fits a very identifiable pattern, such as normal text, the savings can be significant. Symbols that do not occur need not be represented at all. Most English text uses only the 52 letters, 10 numerals and a few dozen punctuation marks including spaces and formatting characters such as new lines.

The tricky part of this process is that the variable lengths of bits used to must be distinguishable so that no combination of them can be mistaken for any other one code. A simple example of this would be using 1 to represent *e*, 01 to represent *t*, 0001 to represent *a*, and so forth. Huffman coding provides an algorithm which even more effeciently generates these codes by constructing a binary tree organized such that the path to each leaf node, containing a symbol, represents the bit sequence to be used.

These bit sequences can then be used to process and compress the file. The table itself must be stored as well since it is based on the data that was compressed, in order for decompression to take place.

## Improvements

Some of the organization of the functions is intended to allow the intermediate results to be examined for educational and troubleshooting purposes. Several improvements should be possible in however:

* there may be a more efficient way to generate the tree * there are certainly more efficient ways to store the table in the compressed file, reducing overhead * there may be a better way to handle special characters including newlines, carriage returns, and the EOF, than I currently employ

## Code

The code consists of the module Huffman.py containing all the functions and a small test.py file for running them in the correct order on a sample file. The file used for testing was the text of a book of several hundred pages. It was compressed from 1816903 bytes to 1033484 bytes about 57% of its original size. Unless otherwise specified, the compression is run on 'input.txt' in the current directory. The compressed file is placed in 'output.txt' and the decompression in 'decompress.txt'. Another file 'probability.txt' is generated showing the byte frequency counts

### Huffman.py

There is an error in the Decompress file algorithm that was fixed in the final version. I will replace this with that when I track it down. It should be nice to be able to see the exact code changes in the diff.

class Node: # Node object definition for the tree which will give us the Huffman codes val = 0 count = 0 L = [] one = None zero= None parent = None def __init__(self, value=0, c=0): self.val = value self.count = c self.L = [] def TraverseTree(rootNode, value=0, place=None, HuffTable=[]): # Traverses tree and returns the Huffman coding table if place == None: place = [] if rootNode.val != 'root': # Initialize for traversing the tree if we are at root rootNode.val = value value = (value*2) + 1 place.append(1) else: # Otherwise increment the Huffman code generation value value = 1 place.append(1) if rootNode.one != None: # Traverse up the tree as far as possible TraverseTree(rootNode.one,value, place[:], HuffTable) if rootNode.zero != None: # Traverse down if we can't go up value= value - 1 place[len(place)-1] = 0 # Decrement Huffman code generation value if going down TraverseTree(rootNode.zero,value, place[:], HuffTable) if rootNode.one == None and rootNode.zero == None: # Stop if we hit a 'leaf'; print and store the Huffman code place.pop() print "Letter", chr(rootNode.L[0]), "becomes", place #Generate Huffman Table Here HuffTable.append([chr(rootNode.L[0]), place]) if rootNode.val == 'root': # Return if we made it all the way back to root return HuffTable def OutputByte(List): # Turn the first 8 'bits' in the list into a byte n.b. Little-Endian outputByte = 0 for x in range(8): # print "Bit", x, "=", List[x] outputByte+=pow(2,x)*List[x] List.reverse() for x in range(8): List.pop() List.reverse() return outputByte def InputByte(inputByte): # Turn a byte into a list of 'bits' tempList = [] for x in range(8): # print "Bit", x, "=", List[x] tempList.append(inputByte % 2) inputByte -= inputByte % 2 inputByte /= 2 return tempList def CompressFile(HuffTable, filename='input.txt', withKey=True): import pickle tblsize = len(HuffTable) tempList = [] foundFlag = False print "Opening Input File..." f_in = open(filename,'r') f_out = open('output.txt', 'w+b') if withKey: f_out.writelines("0\n" + str(len(HuffTable)) + "\n") for x in HuffTable: f_out.writelines( x[0] + "\n" + str(x[1]) + "\n") inp = f_in.read(1) while inp != '': for x in range(len(HuffTable)): if HuffTable[x][0] == inp: tempList += HuffTable[x][1] foundFlag = True if foundFlag != True: print inp, "not found in Huffman Table" else: foundFlag = False while 8 <= len(tempList): outputByte = OutputByte(tempList) f_out.write(chr(outputByte)) inp = f_in.read(1) padding = 0 while len(tempList) != 8 and tempList != []: # Pad last byte tempList.append(0) padding += 1 if tempList != []: outputByte = OutputByte(tempList) f_out.write(chr(outputByte)) f_out.seek(0) # Record how much padding f_out.writelines(str(padding) + "\n") f_out.close() f_in.close() print "Done" def DecompressFile(filename='output.txt'): import pickle foundFlag=False HuffTable = [] tempList = [] f_in = open(filename,'rb') print "Opened Input File..." f_out = open('decompress.txt','wb') f_in.seek(0, 2) filesize = f_in.tell() # Load the size of the file and the table at the beginning f_in.seek(0) # Return the the start of the file to read it in padding = int(f_in.readline()) tblsize = int(f_in.readline()) for x in range(tblsize): # Load the Huffman Table HuffTable.append([ f_in.readline() ]) inp = f_in.readline() if inp != '\n': # Special handling for newline HuffTable[len(HuffTable)-1].append(eval(inp)) HuffTable[len(HuffTable)-1][0] = HuffTable[len(HuffTable)-1][0][0] else: HuffTable[len(HuffTable)-1].append(eval(f_in.readline())) HuffTable[len(HuffTable)-1][0] = '\n' print "Loaded Huffman Table..." inp = f_in.read(1) while f_in.tell() != filesize and inp != '': # Read each byte in the file tempList += InputByte(ord(inp)) while 16 <= len(tempList): # Once we know we have at least one character, convert for x in HuffTable: if foundFlag==False: if x[1] == tempList[0:len(x[1])]: # Compare the leading bits to each Huffman code f_out.write(x[0]) # Output the character once we find one foundFlag==True tempList.reverse() for i in range(len(x[1])): # Remove the bits from the queue tempList.pop() tempList.reverse() foundFlag==False inp = f_in.read(1) if inp != '': tempList += InputByte(ord(inp)) for i in range(padding): if tempList[len(tempList)-1]==0: tempList.pop() while 3 <= len(tempList): # Once we know we have at least one character, convert for x in HuffTable: if foundFlag==False: if x[1] == tempList[0:len(x[1])]: # Compare the leading bits to each Huffman code f_out.write(x[0]) # Output the character once we find one foundFlag=True tempList.reverse() for i in range(len(x[1])): # Remove the bits from the queue tempList.pop() tempList.reverse() if foundFlag==False: tempList.pop() foundFlag==False print "Done", tempList, " : ", inp def BuildTree(List=[], node=None, p=1.0): import copy length = len(List)-1 tmpRoot = Node('root') if len(List[0]) < 3: # If this is the first call, start the tree for x in range(len(List)): List[x] = List[x] + [0] bottomNode = Node(0) topNode = Node(1) bottomNode.L = [List[length][0]] if List[length][2] != 0: bottomNode = List[length][2] bottomNode.val = 0 bottomNode.count = List[length][1] topNode.count=List[length-1][1] topNode.L = [List[length-1][0]] if List[length-1][2] != 0: topNode = List[length-1][2] topNode.val = 1 tmpRoot.L.append(topNode.L) tmpRoot.L.append(bottomNode.L) tmpRoot.count = topNode.count + bottomNode.count List.remove(List[length]) List.remove(List[length-1]) List.append([copy.deepcopy(tmpRoot.L), tmpRoot.count, tmpRoot]) tmpRoot.one, tmpRoot.zero = topNode, bottomNode return tmpRoot def SortByProb(List, descend=True): # Sorts the list import operator List = sorted(List, key=operator.itemgetter(1)) # Specifies the second column to sort by if descend==True: List.reverse() # Default to sorting so that the lowest probability is last return List def SortByLength(List, descend=True): # Sorts the list tmpList = [] for i in len(List): for x in range(len(List)): if len(List[x+1][1]) <= len(List[x][1]): tmpList = List[x+1][1] List[x+1][1] = List[x][1] List[x][1] = tmpList def CalcProb(filename='input.txt'): print "Opening Files..." f1 = open(filename,'r') f2 = open('probability.txt','w') ############################################### # Initialize Variables ############################################### T, R = [], [] # Stores byte, byte count, and probability count = 0 # Stores total byte count inp = 'a' total = 0.0 for x in range(256): T.append([x, 0]) # Initialize T empty ############################################### # Count Characters ############################################### while inp != '': count=count+1 inp = f1.read(1) if inp != '' : T[ord(inp)][1]=T[ord(inp)][1]+1 # Increment byte count print "File character count is", count ############################################### # Calculate Probabilities ############################################### for x in T: f2.write(str(x)) f2.write("\n") f1.close() f2.close() T = SortByProb(T) for x in range(256): if T[x][1] != 0: R.append(T[x]) return R

### test.py

import huffman #imports the huffman coding module p = huffman.CalcProb('jc.txt') #calculate the probability of all bytes in file while len(p) != 1: p = huffman.SortByProb(p) #builds the encoding tree t = huffman.BuildTree(p) c = huffman.TraverseTree(t) #build the huffman coding table from the tree huffman.CompressFile(c, 'jc.txt') #compress the file with the table huffman.DecompressFile() #decompress the file