Huffman Coding Module

From MaccabeWiki
Jump to: navigation, search

This Huffman coding module was my first serious attempt at writing anything useful in Python. It could be a bit better organized and perhaps I will at some point refactor it.

It does a series of fairly complex tasks:

  • generating a table of the relative probability of bytes occuring within an input file
  • using this data to construct a tree which represents the Huffman codes of all bytes that occur
  • generating a Huffman code table from the tree
  • compressing a file using a Huffman code table
  • decompressing a file using a Huffman code table

Huffman coding

Just a note for those who may not know. Huffman coding is a method of generating a code of bits of variable length representing all the symbols occuring in a given set of data. This allows information wasted by representing symbols of different probability with the same number of bits, to be removed compressing the size of the information.

For example, in the English language the letter e occurs far more frequently than other letters. In an English language text file, stored using one byte per character, the letter e takes up just as much space as less frequently used letters like q and z, 8 bits. By representing the character e using less than 8-bits and using more than 8-bits to represent these less frequently used characters, a more efficient representation can be achieved.

This would not produce any space savings in a set of truly random characters, but in a set which fits a very identifiable pattern, such as normal text, the savings can be significant. Symbols that do not occur need not be represented at all. Most English text uses only the 52 letters, 10 numerals and a few dozen punctuation marks including spaces and formatting characters such as new lines.

The tricky part of this process is that the variable lengths of bits used to must be distinguishable so that no combination of them can be mistaken for any other one code. A simple example of this would be using 1 to represent e, 01 to represent t, 0001 to represent a, and so forth. Huffman coding provides an algorithm which even more effeciently generates these codes by constructing a binary tree organized such that the path to each leaf node, containing a symbol, represents the bit sequence to be used.

These bit sequences can then be used to process and compress the file. The table itself must be stored as well since it is based on the data that was compressed, in order for decompression to take place.


Improvements

Some of the organization of the functions is intended to allow the intermediate results to be examined for educational and troubleshooting purposes. Several improvements should be possible in however:

* there may be a more efficient way to generate the tree
* there are certainly more efficient ways to store the table in the compressed file, reducing overhead
* there may be a better way to handle special characters including newlines, carriage returns, and the EOF, than I currently employ

Code

The code consists of the module Huffman.py containing all the functions and a small test.py file for running them in the correct order on a sample file. The file used for testing was the text of a book of several hundred pages. It was compressed from 1816903 bytes to 1033484 bytes about 57% of its original size. Unless otherwise specified, the compression is run on 'input.txt' in the current directory. The compressed file is placed in 'output.txt' and the decompression in 'decompress.txt'. Another file 'probability.txt' is generated showing the byte frequency counts

Huffman.py

There is an error in the Decompress file algorithm that was fixed in the final version. I will replace this with that when I track it down. It should be nice to be able to see the exact code changes in the diff.

class Node:           # Node object definition for the tree which will give us the Huffman codes
	val = 0
	count = 0
	L = []
	one = None
	zero= None
	parent = None
	def __init__(self, value=0, c=0):
		self.val = value
		self.count = c
		self.L = []

def TraverseTree(rootNode, value=0, place=None, HuffTable=[]):  # Traverses tree and returns the Huffman coding table
	if place == None:
		place = []
	if rootNode.val != 'root':           # Initialize for traversing the tree if we are at root
		rootNode.val = value
		value = (value*2) + 1
		place.append(1)
	else:                               # Otherwise increment the Huffman code generation value
		value = 1
		place.append(1)
	if rootNode.one != None:                            # Traverse up the tree as far as possible
		TraverseTree(rootNode.one,value, place[:], HuffTable)
	if rootNode.zero != None:                           # Traverse down if we can't go up
		value= value - 1
		place[len(place)-1] = 0                         # Decrement Huffman code generation value if going down
		TraverseTree(rootNode.zero,value, place[:], HuffTable)
	if rootNode.one == None and rootNode.zero == None:  # Stop if we hit a 'leaf'; print and store the Huffman code
		place.pop()
		print "Letter", chr(rootNode.L[0]), "becomes", place
		#Generate Huffman Table Here
		HuffTable.append([chr(rootNode.L[0]), place])
	if rootNode.val == 'root':   # Return if we made it all the way back to root
		return HuffTable

def OutputByte(List):  # Turn the first 8 'bits' in the list into a byte n.b. Little-Endian
	outputByte = 0
	for x in range(8):
		# print "Bit", x, "=", List[x]
		outputByte+=pow(2,x)*List[x]
	List.reverse()
	for x in range(8):
		List.pop()
	List.reverse()
	return outputByte

def InputByte(inputByte):  # Turn a byte into a list of 'bits'
	tempList = []
	for x in range(8):
		# print "Bit", x, "=", List[x]
		tempList.append(inputByte % 2)
		inputByte -= inputByte % 2
		inputByte /= 2
	return tempList

def CompressFile(HuffTable, filename='input.txt', withKey=True):
	import pickle
	tblsize = len(HuffTable)
	tempList = []
	foundFlag = False
	print "Opening Input File..."
	f_in = open(filename,'r')
	f_out = open('output.txt', 'w+b')
	if withKey:
		f_out.writelines("0\n" + str(len(HuffTable)) + "\n")
		for x in HuffTable:
			f_out.writelines( x[0] + "\n" + str(x[1]) + "\n")
	inp = f_in.read(1)
	while inp != '':
		for x in range(len(HuffTable)):
			if HuffTable[x][0] == inp:
				tempList += HuffTable[x][1]
				foundFlag  = True
		if foundFlag != True:
			print inp, "not found in Huffman Table"
		else:
			foundFlag = False
		while 8 <= len(tempList):
			outputByte = OutputByte(tempList)
			f_out.write(chr(outputByte))
		inp = f_in.read(1)
	padding = 0
	while len(tempList) != 8 and tempList != []: # Pad last byte
		tempList.append(0)
		padding += 1
	if tempList != []:
		outputByte = OutputByte(tempList)
		f_out.write(chr(outputByte))
	f_out.seek(0)				# Record how much padding
	f_out.writelines(str(padding) + "\n")
	f_out.close()
	f_in.close()
	print "Done"

def DecompressFile(filename='output.txt'):
	import pickle
	foundFlag=False
	HuffTable = []
	tempList = []
	f_in = open(filename,'rb')
	print "Opened Input File..."
	f_out = open('decompress.txt','wb')
	f_in.seek(0, 2)
	filesize = f_in.tell()     # Load the size of the file and the table at the beginning
	f_in.seek(0)   # Return the the start of the file to read it in
	padding = int(f_in.readline())
	tblsize = int(f_in.readline())
	for x in range(tblsize):        # Load the Huffman Table
		HuffTable.append([ f_in.readline() ])
		inp = f_in.readline()
		if inp != '\n':					# Special handling for newline
			HuffTable[len(HuffTable)-1].append(eval(inp))
			HuffTable[len(HuffTable)-1][0] = HuffTable[len(HuffTable)-1][0][0]
		else:
			HuffTable[len(HuffTable)-1].append(eval(f_in.readline()))
			HuffTable[len(HuffTable)-1][0] = '\n'
	print "Loaded Huffman Table..."
	inp = f_in.read(1)
	while f_in.tell() != filesize and inp != '':                # Read each byte in the file 
		tempList += InputByte(ord(inp))  
		while 16 <= len(tempList):  # Once we know we have at least one character, convert
			for x in HuffTable:
				if foundFlag==False:
					if x[1] == tempList[0:len(x[1])]:  # Compare the leading bits to each Huffman code
						f_out.write(x[0])               # Output the character once we find one
						foundFlag==True
						tempList.reverse()
						for i in range(len(x[1])):      # Remove the bits from the queue
							tempList.pop()
						tempList.reverse()
			foundFlag==False
		inp = f_in.read(1)
	if inp != '':
		tempList += InputByte(ord(inp))
	for i in range(padding):
		if tempList[len(tempList)-1]==0:
			tempList.pop()
	while 3 <= len(tempList):  # Once we know we have at least one character, convert
		for x in HuffTable:
			if foundFlag==False:
				if x[1] == tempList[0:len(x[1])]:  # Compare the leading bits to each Huffman code
					f_out.write(x[0])               # Output the character once we find one
					foundFlag=True
					tempList.reverse()
					for i in range(len(x[1])):      # Remove the bits from the queue
						tempList.pop()
					tempList.reverse()
		if foundFlag==False:
			tempList.pop()
		foundFlag==False
	print "Done", tempList, " : ", inp

def BuildTree(List=[], node=None, p=1.0):
	import copy
	length = len(List)-1
	tmpRoot = Node('root')
	if len(List[0]) < 3:             # If this is the first call, start the tree
		for x in range(len(List)):
			List[x] = List[x] + [0]
	bottomNode = Node(0)
	topNode = Node(1)
	bottomNode.L = [List[length][0]]
	if List[length][2] != 0:
		bottomNode = List[length][2]
		bottomNode.val = 0
	bottomNode.count = List[length][1]
	topNode.count=List[length-1][1]
	topNode.L = [List[length-1][0]]
	if List[length-1][2] != 0:
		topNode = List[length-1][2]
		topNode.val = 1
	tmpRoot.L.append(topNode.L)
	tmpRoot.L.append(bottomNode.L)
	tmpRoot.count = topNode.count + bottomNode.count
	List.remove(List[length])
	List.remove(List[length-1])
	List.append([copy.deepcopy(tmpRoot.L), tmpRoot.count, tmpRoot])
	tmpRoot.one, tmpRoot.zero = topNode, bottomNode
	return tmpRoot

def SortByProb(List, descend=True):   # Sorts the list
	import operator
	List = sorted(List, key=operator.itemgetter(1)) # Specifies the second column to sort by
	if descend==True: List.reverse()                # Default to sorting so that the lowest probability is last
	return List

def SortByLength(List, descend=True):   # Sorts the list
	tmpList = []
	for i in len(List):
		for x in range(len(List)):
			if len(List[x+1][1]) <= len(List[x][1]):
				tmpList = List[x+1][1]
				List[x+1][1] = List[x][1]
				List[x][1] = tmpList


def CalcProb(filename='input.txt'):
	print "Opening Files..."
	f1 = open(filename,'r')
	f2 = open('probability.txt','w')
###############################################
#  Initialize Variables
###############################################
	T, R = [], []     # Stores byte, byte count, and probability
	count = 0  # Stores total byte count
	inp = 'a'
	total = 0.0
	for x in range(256):
		T.append([x, 0])  # Initialize T empty
###############################################
#  Count Characters
###############################################
	while inp != '':
		count=count+1
		inp = f1.read(1)
		if inp != '' : T[ord(inp)][1]=T[ord(inp)][1]+1 # Increment byte count
	print "File character count is", count
###############################################
#  Calculate Probabilities
###############################################
	for x in T:
		f2.write(str(x))
		f2.write("\n")
	f1.close()
	f2.close()
	T = SortByProb(T)
	for x in range(256):
		if T[x][1] != 0:
			R.append(T[x])
	return R

test.py

import huffman                     #imports the huffman coding module
p = huffman.CalcProb('jc.txt')     #calculate the probability of all bytes in file
while len(p) != 1:
    p = huffman.SortByProb(p)      #builds the encoding tree
    t = huffman.BuildTree(p)
c = huffman.TraverseTree(t)        #build the huffman coding table from the tree
huffman.CompressFile(c, 'jc.txt')  #compress the file with the table
huffman.DecompressFile()           #decompress the file