Jordan Wellbaum's Huffman Tree Project

From Norsemathology
Revision as of 19:10, 10 December 2010 by Fubini (talk | contribs) (→‎Problem)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

Description of the Problem

Introduction

The problem at hand is to compress a file. The idea is to exploit the fact that a file doesn't use enough different symbols to warrant the use of 8 bit representations for each. A Huffman Tree will be used to create the prefix code that can be used in place of the usual set length ASCII. This reduces the overall length of the file by shortening the code for the symbols while keeping them distinct when scanning the file.

Problem

In order to demonstrate the technique for using Huffman Trees to compress a file, I will be using some sample text from J.D. Salinger's "The Catcher in the Rye".

The text is 590 characters long (including spaces):

Among other things, you'll find that you're not the first person who was ever 
confused and frightened and even sickened by human behavior. You're by no 
means alone on that score, you'll be excited and stimulated to know.  Many, 
many men have been just as troubled morally and spiritually as you are right 
now.  Happily, some of them kept records of their troubles.  You'll learn from 
them - if you want to. Just as someday, if you have something to offer, someone 
will learn something from you.  It's a beautiful reciprocal arrangement.  And it 
isn't education.  It's history.  It's poetry.

The first step to doing this is evaluating what symbols are present in the file you're trying to compress as well as their respective quantities.

The following was written in Mathematica to yield that information:

freq[string_] := 
  Grid[
Partition[
   Flatten[Table[{FromCharacterCode[i], 
       Length[Select[
         Flatten[ToCharacterCode[Characters[string]]], # == 
           i &]]}, {i, 1, 200}] /. {x_, 0} -> {}], 2]];

The output of the above function looks like the following:

    106

'   9

,   6

-   1

.   10

A   2

H   1

I   3

J   1

M   1

Y   2

a   33

b   8

c   8

d   15

e   50

f   13

g   7

h   18

i   24

j   1

k   3

l   21

m   17

n   35

o   44

p   7

r   29

s   27

t   39

u   20

v   5

w   6

x   1

y   17

This output should be ordered from least frequent to most frequent and needs to be used to create a Huffman Tree. An algorithm for this is described on page 464 of our text.

The algorithm produces the following tree:

Huffman Tree.jpg

The above tree is used to assign the new character code. This is done by allowing a left branch traversal to be a 0 and a right branch traversal to be a 1.

The following code is given by the tree:

    00

e   1000

o   1010

t   1110

n   1100

a   0110

r   0100

s   10010

i   10110

l   11110

u   11011

h   11010

y   01011

m   01110

d   100111

f   101111

.   111110

'   011111

c   010101

b   010100

p   1011101

g   1001101

w   1111111

,   1011100

v   0111101

k   11111101

I   10011000

Y   01111001

A   01111000

x   1001100110

j   1001100111

M   1001100100

J   1001100101

H   111111000

-   111111001

The above code can now be used to rewrite the text.

Recall, the original length of the text was 590 characters which would have 4720 ASCII bits (8 per character).

When the new code is used, a length can be found using the following formula as described on page 463 of our text:


1*10 + 1*10 + 1*10 + 1*10 + 1*9 + 1*9 + 2*8 + 2*8 + 3*8 + 3*8 + 5*7 + 
 6*7 + 6*7 + 7*7 + 7*7 + 8*6 + 8*6 + 9*6 + 10*6 + 13*6 + 15*6 + 
 17*5 + 17*5 + 18*5 + 20*5 + 21*5 + 24*5 + 27*5 + 29*4 + 33*4 + 
 35*4 + 39*4 + 44*4 + 50*4 + 106*3

This compresses the length down to 2691 bits. This puts the average code for each symbol at 4.56 bits.

2691/4720 = .57

The compressed file is only 57% the length of the original.