# Huffman Coding Module

This Huffman coding module was my first serious attempt at writing anything useful in Python. It could be a bit better organized and perhaps I will at some point refactor it.

It does a series of fairly complex tasks:

• generating a table of the relative probability of bytes occuring within an input file
• using this data to construct a tree which represents the Huffman codes of all bytes that occur
• generating a Huffman code table from the tree
• compressing a file using a Huffman code table
• decompressing a file using a Huffman code table

## Huffman coding

Just a note for those who may not know. Huffman coding is a method of generating a code of bits of variable length representing all the symbols occuring in a given set of data. This allows information wasted by representing symbols of different probability with the same number of bits, to be removed compressing the size of the information.

For example, in the English language the letter e occurs far more frequently than other letters. In an English language text file, stored using one byte per character, the letter e takes up just as much space as less frequently used letters like q and z, 8 bits. By representing the character e using less than 8-bits and using more than 8-bits to represent these less frequently used characters, a more efficient representation can be achieved.

This would not produce any space savings in a set of truly random characters, but in a set which fits a very identifiable pattern, such as normal text, the savings can be significant. Symbols that do not occur need not be represented at all. Most English text uses only the 52 letters, 10 numerals and a few dozen punctuation marks including spaces and formatting characters such as new lines.

The tricky part of this process is that the variable lengths of bits used to must be distinguishable so that no combination of them can be mistaken for any other one code. A simple example of this would be using 1 to represent e, 01 to represent t, 0001 to represent a, and so forth. Huffman coding provides an algorithm which even more effeciently generates these codes by constructing a binary tree organized such that the path to each leaf node, containing a symbol, represents the bit sequence to be used.

These bit sequences can then be used to process and compress the file. The table itself must be stored as well since it is based on the data that was compressed, in order for decompression to take place.

## Improvements

Some of the organization of the functions is intended to allow the intermediate results to be examined for educational and troubleshooting purposes. Several improvements should be possible in however:

```* there may be a more efficient way to generate the tree
* there are certainly more efficient ways to store the table in the compressed file, reducing overhead
* there may be a better way to handle special characters including newlines, carriage returns, and the EOF, than I currently employ
```

## Code

The code consists of the module Huffman.py containing all the functions and a small test.py file for running them in the correct order on a sample file. The file used for testing was the text of a book of several hundred pages. It was compressed from 1816903 bytes to 1033484 bytes about 57% of its original size. Unless otherwise specified, the compression is run on 'input.txt' in the current directory. The compressed file is placed in 'output.txt' and the decompression in 'decompress.txt'. Another file 'probability.txt' is generated showing the byte frequency counts

### Huffman.py

There is an error in the Decompress file algorithm that was fixed in the final version. I will replace this with that when I track it down. It should be nice to be able to see the exact code changes in the diff.

```class Node:           # Node object definition for the tree which will give us the Huffman codes
val = 0
count = 0
L = []
one = None
zero= None
parent = None
def __init__(self, value=0, c=0):
self.val = value
self.count = c
self.L = []

def TraverseTree(rootNode, value=0, place=None, HuffTable=[]):  # Traverses tree and returns the Huffman coding table
if place == None:
place = []
if rootNode.val != 'root':           # Initialize for traversing the tree if we are at root
rootNode.val = value
value = (value*2) + 1
place.append(1)
else:                               # Otherwise increment the Huffman code generation value
value = 1
place.append(1)
if rootNode.one != None:                            # Traverse up the tree as far as possible
TraverseTree(rootNode.one,value, place[:], HuffTable)
if rootNode.zero != None:                           # Traverse down if we can't go up
value= value - 1
place[len(place)-1] = 0                         # Decrement Huffman code generation value if going down
TraverseTree(rootNode.zero,value, place[:], HuffTable)
if rootNode.one == None and rootNode.zero == None:  # Stop if we hit a 'leaf'; print and store the Huffman code
place.pop()
print "Letter", chr(rootNode.L), "becomes", place
#Generate Huffman Table Here
HuffTable.append([chr(rootNode.L), place])
if rootNode.val == 'root':   # Return if we made it all the way back to root
return HuffTable

def OutputByte(List):  # Turn the first 8 'bits' in the list into a byte n.b. Little-Endian
outputByte = 0
for x in range(8):
# print "Bit", x, "=", List[x]
outputByte+=pow(2,x)*List[x]
List.reverse()
for x in range(8):
List.pop()
List.reverse()
return outputByte

def InputByte(inputByte):  # Turn a byte into a list of 'bits'
tempList = []
for x in range(8):
# print "Bit", x, "=", List[x]
tempList.append(inputByte % 2)
inputByte -= inputByte % 2
inputByte /= 2
return tempList

def CompressFile(HuffTable, filename='input.txt', withKey=True):
import pickle
tblsize = len(HuffTable)
tempList = []
foundFlag = False
print "Opening Input File..."
f_in = open(filename,'r')
f_out = open('output.txt', 'w+b')
if withKey:
f_out.writelines("0\n" + str(len(HuffTable)) + "\n")
for x in HuffTable:
f_out.writelines( x + "\n" + str(x) + "\n")
while inp != '':
for x in range(len(HuffTable)):
if HuffTable[x] == inp:
tempList += HuffTable[x]
foundFlag  = True
if foundFlag != True:
print inp, "not found in Huffman Table"
else:
foundFlag = False
while 8 <= len(tempList):
outputByte = OutputByte(tempList)
f_out.write(chr(outputByte))
while len(tempList) != 8 and tempList != []: # Pad last byte
tempList.append(0)
if tempList != []:
outputByte = OutputByte(tempList)
f_out.write(chr(outputByte))
f_out.seek(0)				# Record how much padding
f_out.close()
f_in.close()
print "Done"

def DecompressFile(filename='output.txt'):
import pickle
foundFlag=False
HuffTable = []
tempList = []
f_in = open(filename,'rb')
print "Opened Input File..."
f_out = open('decompress.txt','wb')
f_in.seek(0, 2)
filesize = f_in.tell()     # Load the size of the file and the table at the beginning
f_in.seek(0)   # Return the the start of the file to read it in
for x in range(tblsize):        # Load the Huffman Table
if inp != '\n':					# Special handling for newline
HuffTable[len(HuffTable)-1].append(eval(inp))
HuffTable[len(HuffTable)-1] = HuffTable[len(HuffTable)-1]
else:
HuffTable[len(HuffTable)-1] = '\n'
print "Loaded Huffman Table..."
while f_in.tell() != filesize and inp != '':                # Read each byte in the file
tempList += InputByte(ord(inp))
while 16 <= len(tempList):  # Once we know we have at least one character, convert
for x in HuffTable:
if foundFlag==False:
if x == tempList[0:len(x)]:  # Compare the leading bits to each Huffman code
f_out.write(x)               # Output the character once we find one
foundFlag==True
tempList.reverse()
for i in range(len(x)):      # Remove the bits from the queue
tempList.pop()
tempList.reverse()
foundFlag==False
if inp != '':
tempList += InputByte(ord(inp))
for i in range(padding):
if tempList[len(tempList)-1]==0:
tempList.pop()
while 3 <= len(tempList):  # Once we know we have at least one character, convert
for x in HuffTable:
if foundFlag==False:
if x == tempList[0:len(x)]:  # Compare the leading bits to each Huffman code
f_out.write(x)               # Output the character once we find one
foundFlag=True
tempList.reverse()
for i in range(len(x)):      # Remove the bits from the queue
tempList.pop()
tempList.reverse()
if foundFlag==False:
tempList.pop()
foundFlag==False
print "Done", tempList, " : ", inp

def BuildTree(List=[], node=None, p=1.0):
import copy
length = len(List)-1
tmpRoot = Node('root')
if len(List) < 3:             # If this is the first call, start the tree
for x in range(len(List)):
List[x] = List[x] + 
bottomNode = Node(0)
topNode = Node(1)
bottomNode.L = [List[length]]
if List[length] != 0:
bottomNode = List[length]
bottomNode.val = 0
bottomNode.count = List[length]
topNode.count=List[length-1]
topNode.L = [List[length-1]]
if List[length-1] != 0:
topNode = List[length-1]
topNode.val = 1
tmpRoot.L.append(topNode.L)
tmpRoot.L.append(bottomNode.L)
tmpRoot.count = topNode.count + bottomNode.count
List.remove(List[length])
List.remove(List[length-1])
List.append([copy.deepcopy(tmpRoot.L), tmpRoot.count, tmpRoot])
tmpRoot.one, tmpRoot.zero = topNode, bottomNode
return tmpRoot

def SortByProb(List, descend=True):   # Sorts the list
import operator
List = sorted(List, key=operator.itemgetter(1)) # Specifies the second column to sort by
if descend==True: List.reverse()                # Default to sorting so that the lowest probability is last
return List

def SortByLength(List, descend=True):   # Sorts the list
tmpList = []
for i in len(List):
for x in range(len(List)):
if len(List[x+1]) <= len(List[x]):
tmpList = List[x+1]
List[x+1] = List[x]
List[x] = tmpList

def CalcProb(filename='input.txt'):
print "Opening Files..."
f1 = open(filename,'r')
f2 = open('probability.txt','w')
###############################################
#  Initialize Variables
###############################################
T, R = [], []     # Stores byte, byte count, and probability
count = 0  # Stores total byte count
inp = 'a'
total = 0.0
for x in range(256):
T.append([x, 0])  # Initialize T empty
###############################################
#  Count Characters
###############################################
while inp != '':
count=count+1
if inp != '' : T[ord(inp)]=T[ord(inp)]+1 # Increment byte count
print "File character count is", count
###############################################
#  Calculate Probabilities
###############################################
for x in T:
f2.write(str(x))
f2.write("\n")
f1.close()
f2.close()
T = SortByProb(T)
for x in range(256):
if T[x] != 0:
R.append(T[x])
return R

```

### test.py

```import huffman                     #imports the huffman coding module
p = huffman.CalcProb('jc.txt')     #calculate the probability of all bytes in file
while len(p) != 1:
p = huffman.SortByProb(p)      #builds the encoding tree
t = huffman.BuildTree(p)
c = huffman.TraverseTree(t)        #build the huffman coding table from the tree
huffman.CompressFile(c, 'jc.txt')  #compress the file with the table
huffman.DecompressFile()           #decompress the file
```