Here’s an interim solution. I’m working on writing and reading binary
huffman encoded files. There is code here which isn’t used yet by the
demo code. I figured that the time had come to publish my approach to
building the tree, and encoding and decoding. The current code is
more pedagogical than useful since the ‘compressed’ data is
represented by a string representation of the bit string.
There are three files attached. huffman.rb is a module and some core
extensions. huffman_demo1.rb runs the example from the quiz, dumping
the tree and various computed data. Then ‘compresses’ and
‘decompresses’ the sample string. huffman_wikipedia.rb is an earlier
version which uses the algorithm in the wikipedia article.
A few remarks.
I started out using the wikipedia algorithm. Once I got that working,
I decided that it would be better to use a priority queue instead of
two separate queues. For this I used Brian Schroeder’s PriorityQueue
gem.
To handle the problem of padding to a given word length, I add an
artificial ‘character’ to the tree with an occurrence count of 0. This
should ensure that it doesn’t take a shorter encoding away from a
character which actually occurs in the input data.
In order to decode the compressed data, I use a regex which is a
simple alternation of all of the string representations of the codes,
anchored at the begining of the string. Since the huffman encoding
algorithm ensures that no code is a prefix of any other code, this
regex can be used to find the code at the head of the input stream.
So walking through a string representation of a huffman encoded bit
stream can be done by matching this pattern against the string, the
match will be the code. After this match, the string can be replaced
by the string remaining after the match.
Extending this to binary data will involve making an input stream
class which reads the binary data and converts it into reasonably
sized binary string representations, and an output stream which
accumulates the string representation of the codes and writes them as
binary as enough data is produced. The huffman.rb file contains these
classes in an embrionic form.
–
Rick DeNatale
My blog on Ruby
http://talklikeaduck.denhaven2.com/