Huffman Encoder (#123)

The three rules of Ruby Q.:

  1. Please do not post any solutions or spoiler discussion for this quiz
    until
    48 hours have passed from the time on this message.

  2. Support Ruby Q. by submitting ideas as often as you can:

http://www.rubyquiz.com/

  1. Enjoy!

Suggestion: A [QUIZ] in the subject of emails about the problem helps
everyone
on Ruby T. follow the discussion. Please reply to the original quiz
message,
if you can.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

by Harlan

Huffman Coding is a common form of data compression where none of the
original
data gets lost. It begins by analyzing a string of data to determine
which
pieces occur with the highest frequencies. These frequencies and pieces
are
used to construct a binary tree. It is the “path” from root node to the
leaf with this data that forms its encoding. The following example
should
explain things:

Data: ABRRKBAARAA (11 bytes)

Frequency counts:
A 5
R 3
B 2
K 1

In Huffman Tree form, with frequency weights in parentheses:
ARBK (11)
/
0 1
/
A (5) RBK (6)
/
0 1
/
R (3) BK (3)
/
0 1
/
B (2) K (1)

The encoding for each character is simply the path to that character:
A 0
R 10
B 110
K 111

Here is the original data encoded:
01101010 11111000 1000 (fits in 3 bytes)

We have compressed the original information by 80%!

A key point to note is that every character encoding has a unique
prefix,
corresponding to the unique path to that character within the tree. If
this
were not so, then decoding would be impossible due to ambiguity.

The quiz this time is to write a program to implement a compression
program
using Huffman encoding.

Extra Credit:
Perform the actual encoding using your tree. You may encounter one
issue
during the decompression/decoding phase. Your encoded string may not
be a
multiple of 8. This means that when you compress your encoding into a
binary number, padding 0’s get added. Then, upon decompression, you
may
see extra characters. To counter this, one solution is to add your
own
padding of 1 extra character every time. And then simply strip it off
once you have decoded.

You may also wish to provide a way to serialize the Huffman Tree so it
can be shared among copies of your program.

./huffman_encode.rb I want this message to get encoded!

Encoded:
11111111 11111110 11111111 11101111 10111111
01100110 11111111 11110111 11111111 11011100
11111111 11010111 01110111 11011110 10011011
11111100 11110101 10010111 11101111 11111011
11111101 11111101 01111111 01111111 11111110
Encoded Bytes:
25

Original:
I want this message to get encoded!
Original Bytes:
35

Compressed: 28%

Ruby Q. [email protected] writes:

The quiz this time is to write a program to implement a compression
program using Huffman encoding.

I’d like to comment on and regularize a few of the “extra credit”
portions.

First off, without knowledge of the tree, one can’t decode the Huffman
encoding given in the quiz since the message:
I want this message to get encoded!
encodes to the same thing as:
V jnag guvf zrffntr gb trg rapbqrq!

So, with a Huffman encoding one always needs to know the tree
structure when doing the decoding. This means that usually what one
does with Huffman encoding is use some big training database to build
up your tree, and then one re-uses the same tree again and again.

Therefore, I want to add these switches:

-d Decoding mode. (Default is encoding)

-T file Generate tree and store it to file. Do not produce other
output.

-t file Use file to retrieve your tree. (for either encoding or
decoding)

-i file Use file as input instead of reading command line arguments

-o file Use file as output instead of writing to standard output
(Note: compression statistics should still be written to
standard output. file should contain only the message)

Also, I’ll note that the quiz description can give a false impression
of Huffman encoding and how to build up the tree. For example, after
reading that I’d forgive anyone who thought that Huffman encoding
always assigned to each letter a code of some number of 1s followed
by a zero, except the least frequent letter which is assigned a code
of all 1s. In fact, it only works out to that for certain frequency
distributions.

Although I don’t want to spoil things I’ll point at the section
labeled “Basic technique” in the wikipedia article on Huffman coding
(Huffman coding - Wikipedia) and mention that I won’t
quite be using the algorithm they mention there, but will be using
something vaguely similar. (That is, I will be building the tree from
the bottom up rather than the top down)

First off, without knowledge of the tree, one can’t decode the Huffman
encoding given in the quiz since the message:
I want this message to get encoded!
encodes to the same thing as:
V jnag guvf zrffntr gb trg rapbqrq!

So, with a Huffman encoding one always needs to know the tree
structure when doing the decoding. This means that usually what one
does with Huffman encoding is use some big training database to build
up your tree, and then one re-uses the same tree again and again.

Another alternative is to do an adaptive Huffman encoding, which does
not require a dictionary at the front of the file, nor a training
database. The dictionary (i.e. tree) is built up during the process of
decoding.

The tradeoff is less compression for small files, or early on in larger
files.

Ruby Q. [email protected] writes:

Encoded Bytes:
25

Original:
I want this message to get encoded!
Original Bytes:
35

Compressed: 28%

I’ll note that this phrase includes 16 distinct characters (" ", “!”,
“I”, “a”, “c”, “d”, “e”, “g”, “h”, “i”, “m”, “n”, “o”, “s”, “t”, “w”).
Even if you add a special “end-of-data” character, that’s only 17
distinct characters, so you would expect any Huffman code to perform
at least as well 5 bits per character; since your code is in theory
being developed to encode this exact phrase, I’d expect better
performance than that.

At 5 bits per character, if you add a special “end of data” character
to the end of the phrase so that you’re encooding 36 values, that’s
only 180 bits of output. 180 bits fits into 23 bytes.

I’m pointing out that the initial reference code probably has a bug in
its algorithm, since it does even worse than a naive
5-bits-per-character encoding.

For what it’s worth, my current Huffman encoder when tuned to that
phrase alone compresses it into 18 bytes; when tuned to
http://norvig.com/big.txt (a large chunk of English text) it still
manages to compress that phrase to 22 bytes.

I based my solution on the algorithm described here:

http://www.siggraph.org/education/materials/HyperGraph/video/mpeg/mpegfaq/huffman_tutorial.html

I use the rubytree gem for the tree, and subclassed its TreeNode into
the class
FreqNode, which holds a frequency, possibly a letter (just the leaves),
and a
side (either :left or :right). I made FreqNodes compare primarily based
on
their frequencies so I could build a SortedSet of them and repeatedly
take the
two lowest to combine. The actual tree is built in the build_tree
method.

It just takes an input string, builds the tree, and outputs the
compressed
data (which is slightly different from the example given in the original
quiz
message because we treat ties differently. I’m pretty sure this doesn’t
matter though). No extra credit, though it shouldn’t be too hard to add
on some
tree serialization.

irb> tree = build_tree(‘ABC’)
irb> tree.printTree
Node ID: 16 Content: 3 Parent: ROOT Children: 2 Total Nodes: 5 Side:
Node ID: 14 Content: 1C Parent: 16 Children: 0 Total Nodes: 1 Side:
left
Node ID: 15 Content: 2 Parent: 16 Children: 2 Total Nodes: 3 Side:
right
Node ID: 12 Content: 1A Parent: 15 Children: 0 Total Nodes: 1
Side: left
Node ID: 13 Content: 1B Parent: 15 Children: 0 Total Nodes: 1
Side: right
irb> encode ‘ABC’, tree
=> “10110”

$ ./huffman_encode.rb ABRRKBAARAA
Encoded:
01011111 10010100 1100
Encoded bytes: 3

Original:
ABRRKBAARAA
Original bytes: 11

Compressed: 73%

#!/usr/bin/env ruby

require ‘rubygems’
require ‘tree’
require ‘set’

class Tree::TreeNode

Yield once for each leaf node.

def each_leaf
self.each { |node| yield(node) if not node.hasChildren? }
end

Both assume exactly 2 children

def left_child; children.first; end
def right_child; children.last; end
end

FreqNodes are TreeNodes that:

1) Have generated, numeric names to ensure they’re all unique.

2) Compare based primarily on their contents.

3) Have as contents an array containing a frequency, and possibly a

letter

(for the leaves).

4) Have a @side attribute that should be set to either :left or

:right for

non-root nodes.

5) Have a few other nicities, like letter().

class FreqNode < Tree::TreeNode
attr_accessor :side

def initialize(content = nil)
super(FreqNode.next_name, content)
end

def <=>(node)
content_diff = @content <=> node.content
return content_diff if not content_diff.zero?
super(node)
end

Assumes this is a leaf.

def letter; @content[1]; end

def to_s; super() + " Side: #{@side}"; end

Generate unique names because we can’t add two nodes with the same

name as

children of the same parent node.

@next_name_num = -1
def FreqNode.next_name
(@next_name_num += 1).to_s
end
end

class Set
def shift
elem = self.min
self.delete elem
elem
end
end

Build a SortedSet containing pairs of [freq, letter] for the given

string.
def build_freqs str

Build hash of byte => count

counts = Hash.new 0
str.each_byte { |byte| counts[byte.chr] += 1 }

Build SortedSet of [freq, byte] pairs (lower freqs first).

freqs = SortedSet[]
counts.each { |bc| freqs << bc.reverse }
freqs
end

Build the tree for the given input string, and return the root node.

def build_tree str
nodes = build_freqs(str).map! { |pair| FreqNode.new(pair) }

while nodes.size > 1
child1, child2 = nodes.shift, nodes.shift
parent = FreqNode.new([child1.content.first + child2.content.first])
child1.side, child2.side = :left, :right
parent << child1
parent << child2
nodes << parent
end

nodes.min
end

Encode the given letter using the given tree of FreqNodes.

def encode_letter letter, tree
enc = ‘’

Find leaf with the right byte value

node = nil
tree.each_leaf do |leaf|
(node = leaf; break) if leaf.letter == letter
end

while not node.isRoot?
node.side == :left ? enc << ‘0’ : enc << ‘1’
node = node.parent
end

enc.reverse
end

Build a tree for the given string, and encode it using that tree.

def encode str, tree = build_tree(str)
#tree = build_tree(str)
encoded = ‘’
str.each_byte { |byte| encoded << encode_letter(byte.chr, tree) }
encoded
end

Decode the given string (which should consist of '0’s and '1’s) using

the

given tree.

def decode str, tree
dec = ‘’
i = 0
node = tree

str.each_byte do |byte|
node = (byte.chr == ‘0’ ? node.left_child : node.right_child)
if not node.hasChildren?
dec << node.byte_value.chr
node = tree
end
end

dec
end

class String

Split up the string by inserting sep between len-length chunks.

def split_up! len, sep
(((length / len.to_f).ceil - 1) * len).step(len, -len) do |i|
insert i, sep
end
self
end
end

Output a binary string, splitting it up for easier reading.

def puts_binary str
str = str.dup
str.split_up! 40, “\n”
str.each { |line| puts line.split_up!(8, ’ ') }
end

if FILE == $0
input = ARGV.join ’ ’
input_bytes = input.length
enc = encode(input)
enc_bytes = (enc.length / 8.0).ceil
compressed = ((1 - enc_bytes.to_f/input_bytes) * 100).round

puts ‘Encoded:’
puts_binary(enc)
puts “Encoded bytes: #{enc_bytes}\n\n”

puts “Original:\n#{input}”
puts “Original bytes: #{input_bytes}\n\n”

puts “Compressed: #{compressed}%”
end

Very simplistic solution. recursively partitions sorted frequencies to
build the huffman tree:

% ./huffman.rb I want this message to get encoded!
Encoded:
11001000 11111000 01010010 00001011 01111100 01100011 10100101 10111000
01001001 00001010 11000100 10010100 00001101 01101010 11100010 01100011
1000

35 => 17 (51%)

Code is rather straight forward: Parked at Loopia

Well, first time I actually submit my solution, it’s a bit hairy, but it
worked in my tests…

I made the decoder too, only problem was serialization, I did not know
what
to do with it, and how this in real apps work, so that part is a bit odd
(programming is just hobby for me).

usage:
ruby quiz123.rb encode this!

and
ruby quiz123.rb -d

The app will prompt you for the tree and the part to decode. The tree
should
be a hash definition ex: { “a” => “0”, “b” => “10” }. (To test, you can
first encode something (it will show the tree in correct format) en then
decode it again.

What puzzled me was that longer texts appeared to take up more bytes
encoded
than decoded.

anyway, here is my code:

#! /usr/local/bin/ruby

class Huffman
attr_accessor :tree, :encoded

def initialize
@encoded, @tree = nil
@line = “”
end

def encode( line )
@line = line
chars = line.split("")
uniq = chars.uniq

counts = []
uniq.each do |char|
  counts << { :char => char, :count => chars.select{ |x| x == char

}.length }
end

counts.sort!{ |x,y| x[:count] <=> y[:count]  }.reverse!

@tree = {}
counts.each_with_index do |char_hash,i|
  char = char_hash[:char]
  if i == counts.length-1
    @tree[char] = "1"*i
    break
  end
  @tree[char] = "1"*i+"0"
end

@encoded = line.split("").collect { |x| @tree[x] }.join("")
@encoded += @tree[counts[1][:char]]
@encoded += "0" * ([email protected]%8) if @encoded.length%8 != 0

return self

end

def decode( encoded, tree )
@encoded = encoded

@tree = tree

@encoded.gsub!(/ |\n/, "")

encoded = @encoded.dup
encoded.slice!((encoded.reverse.index("01")*-1)-2..-1) # slice of 

extra
character(s)
loop do
begin
part = encoded.slice!(0…encoded.index(“0”))
@line << @ tree.select{ |key, value| value == part }[0][0]
rescue NoMethodError
encoded = part + encoded
code = @tree.select{ |key, value| !value.include? “0” }[0][1]
begin
part = encoded.slice!(0…encoded.index(code)+code.length-1)
@line << @tree.select{ |key, value| value == part }[0][0]
rescue
break
end
rescue ArgumentError
break
end

end

end

def inspect
inspect = [ “Encoded:”]
inspect << @encoded.scan(/.{8}|.$/).join("
").scan(/.{45}|.
$/).join("\n")
inspect << “Encoded Bytes:”
inspect << byte_n_new = @encoded.length/8
inspect << “”
inspect << “Original:”
inspect << @line
inspect << “Original Bytes:”
inspect << byte_n = @line.each_byte{}.length
inspect << “compression:
#{((byte_n-byte_n_new).to_f/byte_n.to_f*100).round}%”
inspect << “”
inspect << @ tree.inspect
return inspect.join("\n")
end
end

@huffman = Huffman.new

unless $[0] == “-d”
@huffman.encode($
.join(" "))
puts @huffman.inspect
else
puts "tree: "
tree = eval STDIN.gets.chomp
puts "encoding: "
encoding = STDIN.gets.chomp
@huffman.decode(encoding, tree)
puts @huffman.inspect
end

I used a bit of freedom:

  • It reads from STDIN instead of ARGV.

  • Output is binary instead of 0-and-1-as-text.

  • Both encoding and decoding, for the extra points…

  • In order to avoid juggling with padding characters, the least
    frequent character is 1110 instead of 111 (in the example in
    the quiz). Every “character” in the encoded bit stream ends
    with a 0. Yes, it costs one bit, but since it’s not used
    frequently, it doesn’t really hurt.

  • 0 and 1 become 1 and 0, respectively, because we get the
    padding zeroes for free when we do “001”.pack(“B*”). So,
    rewriting the previous section: In order to avoid juggling
    with padding characters, the least frequent character is 0001
    instead of 000. Every “character” in the encoded bit stream
    ends with a 1.

Here’s the code: Parked at Loopia

Example:

$ echo -n Scooby-Dooby-Doo | ruby huffman.rb | ruby huffman.rb -d
Compression: 14/16 (87%)
Decompression: 14/16 (87%)
Scooby-Dooby-Doo

How does it work?

First of all, we count each unique byte in the original byte
stream (e.g. “Scooby-Dooby-Doo”):

{“-”=>2, “b”=>2, “y”=>2, “c”=>1, “D”=>2, “o”=>6, “S”=>1}

… which is sorted:

[[“o”, 6], [“-”, 2], [“D”, 2], [“b”, 2], [“y”, 2], [“S”, 1], [“c”,
1]]

… and cleaned:

[“o”, “-”, “D”, “b”, “y”, “S”, “c”]

The leftmost character is used most often and the rightmost
character is used least often. We call this ordered list the
“alphabet”.

Secondly, we build the translation table, where each character
in the “alphabet” is mapped to a bit sequence:

{“b”=>“0001”, “-”=>“01”, “c”=>“0000001”, “y”=>“00001”, “D”=>“001”,
“o”=>“1”, “S”=>“000001”}

We can translated each byte of the original byte stream:

[“S”, “c”, “o”, “o”, “b”, “y”, “-”, “D”, “o”, “o”, “b”, “y”, “-”,
“D”, “o”, “o”]

… to the corresponding bit stream:

[“000001”, “0000001”, “1”, “1”, “0001”, “00001”, “01”, “001”, “1”,
“1”, “0001”, “00001”, “01”, “001”, “1”, “1”]

… which is joined into a string:

“00000100000011100010000101001110001000010100111”

… which is packed (with pack(“B*”)) into another string:

“\004\016!N!N”

… which we’ll call the “encoded bit stream”.

In order to be able to decode the message, we have to serialize
the “alphabet” and concatenate it with the “encoded bit stream”:

alphabet + bit stream → message
“o-DbySc” + “\004\016!N!N” → “o-DbySc\004\016!N!N”

But where’s the boundary between “alphabet” and “encoded bit
stream”? Well, let’s add the length of the “alphabet”
([7]…pack(“C”)):

len + alphabet + bit stream → message
“\a” + “o-DbySc” + “\004\016!N!N” → “\ao-DbySc\004\016!N!N”

Conclusion? “Scooby-Dooby-Doo” becomes “\ao-DbySc\004\016!N!N”.

Well, we’ve saved 2 bytes, but it doesn’t sound that well…

gegroet,
Erik V. - http://www.erikveen.dds.nl/

Here’s mine. It’s my first submission; be gentle :slight_smile:

I’ve compared several ways of chopping up the string before building
the tree and encoding. As can be expected, the bigger the chunks the
smaller the encoded string but also the bigger the tree. There must be
a sweet spot somewhere in the middle.

$ ./rq123_huffman_rafc.rb
Encoded byte tokens:
Size of encoded data: 285903
Tree size: 1140

Total size: 287043
Original size: 500000
Compressed by: 42%
####################################
Encoded word tokens:
Size of encoded data: 145703
Tree size: 136634

Total size: 282337
Original size: 500000
Compressed by: 43%
####################################
Encoded 2byte tokens:
Size of encoded data: 246807
Tree size: 20761

Total size: 267568
Original size: 500000
Compressed by: 46%
####################################
Encoded 3byte tokens:
Size of encoded data: 218899
Tree size: 121651

Total size: 340550
Original size: 500000
Compressed by: 31%
####################################

Regards,
Raf

Ruby Q. [email protected] writes:

The quiz this time is to write a program to implement a compression program
using Huffman encoding.

And here’s mine. Instead of adding something to each character
encoded, I used an additional symbol to mean “end of data”. Although
the program will behave as in the quiz when given no options,
(encoding the phrase given in the command line arguments to 1s and 0s
displayed as that) the really interesting stuff happens when you give
one or more options.

Here’s the usage:

Generate tree: q123.rb -g -t treefile [opts|phrase]
encode: q123.rb -t treefile [opts|phrase]
decode: q123.rb -d -t treefile [opts|phrase]
[opts]: -i input -o output
-b flip binary mode (default is 0s and 1s
if -o is not given, and bytes if -o is)
-x generate extended tree to also cover bytes
not present in the training input
-q be quiet - no statistics

ObTestOutput:
C:\Temp>ruby q123.rb I want this message to get encoded!
Encoded:
11010000 10100110 00101111 00011110 00110101
01000001 01110010 10001001 10001100 01000111
10010000 11000111 10000010 10110000 10010111
00101111 01101101 10000000
Original Bytes: 35
Encoded Bytes: 18
Compressed: 48.6%

(big.txt is http://norvig.com/big.txt)
C:\Temp>ruby q123.rb -g -t english.tree big.txt
Generated tree from input size 7

C:\Temp>ruby q123.rb -g -t english.tree -i big.txt
Generated tree from input size 6500961

C:\Temp>ruby q123.rb -t english.tree -i big.txt -o nul:
Original Bytes: 6500961
Encoded Bytes: 3707406
Compressed: 43.0%

And now the script itself:

#!ruby
require ‘strscan’
require ‘stringio’

A useful method for my priority queue implementation

class Array
def swap(a,b)
self[a], self[b] = self[b], self[a]
end
end

Inspired by, but totally different from, the heap in

ruby quiz 40

class PriorityQueue
# Actually, this is more inspired by the summary on
# quiz number 98 than the implementation in quiz 40.
def initialize
@items=[]
@priorities=[]
end

def add(priority, item)
    @priorities.push priority
    @items.push item
    bubble_up
    self
end

def empty?
    @items.empty?
end

def shift
    return nil if empty?
    retval = @items.shift
    @priorities.shift
    if ! @items.empty? then
        @priorities.unshift @priorities.pop
        @items.unshift @items.pop
        bubble_down
    end
    retval
end

# Inspired by mapn in quiz 122
def eachn(&b)
    arglen = b.arity
    while !empty?
        args = Array.new(arglen){shift}
        b.call(*args)
    end
end

private

def bubble_up
    wi = @priorities.size - 1
    pr = @priorities[wi]
    until wi == 0
        pi = (wi-1)/2
        return if @priorities[pi] <= pr
        @items.swap(pi,wi)
        @priorities.swap(pi,wi)
        wi = pi
    end
end

def bubble_down
    wi = 0
    pr = @priorities[wi]
    until false   # i.e. until I return
        ci = 2*wi + 1
        return if ci >= @priorities.size
        ci += 1 if ci + 1 < @priorities.size and
                    @priorities[ci+1] < @priorities[ci]
        return if @priorities[ci] >= pr
        @items.swap(ci,wi)
        @priorities.swap(ci,wi)
        wi = ci
    end
end

end

Okay, that’s all for utilities. Now on with stuff for this quiz;

basically, I have two classes: HuffmanNode and HuffmanCoder.

HuffmanNode is the tree structure. HuffmanCoder handles the

dirty business of encoding and decoding given the tree structure.

A HuffmanNode either has data or it has both left and right children,

but not both.

HuffmanNode = Struct.new(“HuffmanNode”, :frequency, :data, :left,
:right)

class HuffmanNode
def inspect
“HuffmanNode.from_s(#{to_s.inspect})”
end

def to_s
    if self.data.nil? then
        "(" + self.left.to_s + ' ' + self.right.to_s + ")"
    else
        self.data.to_s
    end
end

def to_h(forencode, prefix='')
    if self.data.nil? then
        l = self.left.to_h(forencode,prefix + '0')
        r = self.right.to_h(forencode,prefix + '1')
        l.update(r)
    else
        if forencode then
            {self.data => prefix}
        else
            {prefix => self.data}
        end
    end
end

def HuffmanNode.from_s(s)
    begin
        return from_s_internal(StringScanner.new(s))
    rescue
        raise "Malformed string: '#{s}'"
    end
end

def HuffmanNode.from_s_internal(scanner)
    data = scanner.scan(/\s*-?\d+/)
    if data.nil? then
        scanner.scan(/\s*\(/) or raise 'err'
        rei = from_s_internal(scanner)
        scanner.scan(/\s+/) or raise 'err'
        ichi = from_s_internal(scanner)
        scanner.scan(/\s*\)/) or raise 'err'
        return new(0,nil,rei,ichi)
    else
        return new(0,data.to_i,nil,nil)
    end
end

def HuffmanNode.make_tree(freqhash, add_everything=false)
    pqueue = PriorityQueue.new
    # node with data -1 is used to mean "end of data"
    universe = {-1=>1}
    if add_everything then
        # Assume anything we haven't seen at all is an order of
        # magnitude less likely than those things we've seen once
        256.times{|i|universe[i]=0.1;}
    end
    universe.update(freqhash)
    universe.each {|charcode,freq|
        pqueue.add(freq, new(freq,charcode,nil,nil))
    }
    pqueue.eachn {|node1, node2|
        return node1 if node2.nil?

        n = new(node1.frequency + node2.frequency, \
                        nil, node2, node1)
        pqueue.add(n.frequency, n)
    }
end

end

class HuffmanCoder
attr :enchash
attr :dechash
attr :decre
def initialize(nodetree)
@enchash = nodetree.to_h(true)
@dechash = nodetree.to_h(false)
@decre = Regexp.new(‘^(’ + @dechash.keys.join(‘|’) + ‘)(.*)’)
end

def encode(io_in,io_out,crunch_binary=true,io_stats=$stderr)
    buff = ''
    outbytes = 0
    inbytes = 0
    io_out.puts "Encoded:" unless crunch_binary
    encode_bits(io_in) { |bits|
        inbytes += 1
        if bits then
            buff += bits
        else
            buff += '0' * ((-buff.length) %  8 )
        end
        while buff.length >= 8 do
            binary = buff.slice!(0..7)
            if crunch_binary then
                binary = [binary].pack("b*")
            else
                binary = binary + ' '
                binary += "\n" if 4 == outbytes % 5
            end
            io_out.print binary
            outbytes += 1
        end
    }
    if 0 != outbytes % 5 and not crunch_binary then
        io_out.print "\n"
    end
    inbytes -= 2
    if io_stats then
        io_stats.puts "Original Bytes: #{inbytes}"
        io_stats.puts "Encoded Bytes: #{outbytes}"
        io_stats.puts "Compressed: %2.1f%%"%  [100.0 - 

(100.0*outbytes)/inbytes]
end
end

def decode(io_in,io_out,crunched_binary=true,io_stats=$stderr)
    buff = ''
    outbytes = 0
    inbytes = decode_bits(io_in,crunched_binary) { |bits|
        buff += bits
        m = @decre.match buff
        while m do
            ch = @dechash[m[1]]
            if ch == -1
                if m[2] !~ /^0*$/ then
                    raise "Garbage after EOD marker"
                end
                break
            end
            io_out.putc ch
            outbytes += 1
            buff = m[2]
            m = @decre.match buff
        end
    }
    if io_stats then
        io_stats.puts "Encoded Bytes: #{inbytes}"
        io_stats.puts "Original Bytes: #{outbytes}"
        io_stats.puts "Compressed: %2.1f%%"%  [100.0 - 

(100.0*inbytes)/outbytes]
end
end

def HuffmanCoder.from_file(treefile)
    tree = nil
    File.open(treefile, "r") { |f|
        treetxt = ''
        f.each{ |treeline| treetxt += treeline }
        tree = HuffmanNode.from_s(treetxt)
    }
    new(tree)
end

def HuffmanCoder.generate(io_in, treefile, generate_extended, 

io_stats=$stderr)
bytecount = 0
d = Hash.new(0);
io_in.each_byte {|b| d[b] += 1; bytecount += 1}
tree = HuffmanNode.make_tree(d,generate_extended)
if ! treefile.nil?
File.open(treefile, “w”) {|f|
f.puts tree.to_s
}
end
if io_stats then
io_stats.puts “Generated tree from input size #{bytecount}”
end
new(tree)
end

private

def encode_bits(io_in)
    c = io_in.getc
    until c.nil?
        bits = @enchash[c]
        raise "no code for character #{c}" unless bits
        yield bits
        c = io_in.getc
    end
    yield @enchash[-1]
    yield nil
end

def decode_bits(io_in,crunched_binary)
    inbytes = 0
    if crunched_binary then
        until io_in.eof?
            b = io_in.read(4096)
            return if b.nil?
            inbytes += b.length
            yield b.unpack('b*')[0]
        end
    else
        until io_in.eof?
            b = io_in.read(4096)
            return if b.nil?
            b.tr!('^01','')
            inbytes += b.length
            yield b if b.length > 0
        end
        inbytes = (inbytes+7)/8
    end
    inbytes
end

end

That’s all the interesting stuff. Everything from here down is

argument handling. Basically uninteresting.

if FILE == $0 then
mode = :encode
input = nil
output = nil
treefile = nil
crunched_binary = true
generate_extended = false
statschannel = $stderr
while ARGV[0] and ARGV[0] =~ /^-/
opt = ARGV.shift
case opt
when ‘–’
break
when ‘-t’
treefile = ARGV.shift
when ‘-i’
input = ARGV.shift
when ‘-o’
output = ARGV.shift
when ‘-d’
mode = :decode
when ‘-b’
crunched_binary = false
when ‘-q’
statschannel = nil
when ‘-g’
mode = :gentree
when ‘-x’
mode = :gentree
generate_extended = true
when ‘-h’
puts “Usage:”
puts “Generate tree: #{$0} -g -t treefile [opts|phrase]”
puts " encode: #{$0} -t treefile [opts|phrase]"
puts " decode: #{$0} -d -t treefile [opts|phrase]"
puts “[opts]: -i input -o output”
puts " -b flip binary mode (default is 0s and 1s"
puts " if -o is not given, and bytes if -o is)"
puts " -x generate extended tree to also cover bytes"
puts " not present in the training input"
puts " -q be quiet - no statistics"
exit 0
else
$stderr.puts “Unrecognized option #{opt} – use -h for help”
exit 1
end
end
if treefile.nil? then
# allow no treefile only when encoding command line
# stuff as a demo.
if ! (input.nil? and mode == :encode)
$stderr.puts “Error: no -t option given”
exit 1
end
end
in_io = nil
if input.nil?
in_io = StringIO.new(ARGV.join(’ '))
crunched_binary = ! crunched_binary if mode == :decode
elsif input == ‘-’
in_io = $stdin
else
in_io = File.open(input, “rb”)
end
out_io = nil
if output.nil?
out_io = $stdout
crunched_binary = ! crunched_binary if mode == :encode
elsif output == ‘-’
out_io = $stdout
else
out_io = File.open(output, “wb”)
end
if mode == :gentree then
HuffmanCoder.generate(in_io, treefile, generate_extended,
statschannel)
else
coder = nil
if (treefile.nil?)
coder = HuffmanCoder.generate(in_io, nil, false, nil)
in_io.rewind
else
coder = HuffmanCoder.from_file(treefile)
end
coder.send(mode, in_io,out_io, crunched_binary, statschannel)
end
end

END

The better solution…

Here’s the code: Parked at Loopia

Example:

$ echo -n Scooby-Dooby-Doo | ruby huffman.rb | ruby huffman.rb -d
Compression: 27/17 (158%)
Decompression: 27/17 (158%)
Scooby-Dooby-Doo

How does it work?

Count each unique byte in the original byte stream (e.g.
“Scooby-Dooby-Doo”):

[[“S”, 1], [“c”, 1], [“o”, 6], [“b”, 2], [“y”, 2], [“-”, 2], [“D”,
2]]

This list is then converted to a tree, using this code:

while tree.length > 1
a, b, *tree = tree.sort_by{|_, count| count} # Not deterministic.
tree << [[a[0], b[0]], a[1]+b[1]]
end

In English: The two least frequently used leaves/nodes are
replaced by one node (a combination of both leaves),
recursively, until the list is only one leaf/node long. Step by
step, this results in the following:

[[“S”, 1], [“c”, 1], [“b”, 2], [“D”, 2], [“y”, 2], [“-”, 2], [“o”,
6]]
[[“b”, 2], [“D”, 2], [“y”, 2], [[“S”, “c”], 2], [“-”, 2], [“o”, 6]]
[[“y”, 2], [[“S”, “c”], 2], [“-”, 2], [[“b”, “D”], 4], [“o”, 6]]
[[“-”, 2], [[“b”, “D”], 4], [[“y”, [“S”, “c”]], 4], [“o”, 6]]
[[[“y”, [“S”, “c”]], 4], [“o”, 6], [[“-”, [“b”, “D”]], 6]]
[[[“-”, [“b”, “D”]], 6], [[[“y”, [“S”, “c”]], “o”], 10]]
[[[[“-”, [“b”, “D”]], [[“y”, [“S”, “c”]], “o”]], 16]]

The tree is actually still in the list (the only element in the
list), so we get it and strip the weight:

[[“-”, [“b”, “D”]], [[“y”, [“S”, “c”]], “o”]]

Secondly, we build the translation table, where each leaf in
the tree is mapped to a bit sequence, by using the path to each
leaf:

{“-”=>“00”, “b”=>“010”, “y”=>“100”, “c”=>“1011”, “D”=>“011”,
“o”=>“11”, “S”=>“1010”}

We can translated each byte of the original byte stream:

[“S”, “c”, “o”, “o”, “b”, “y”, “-”, “D”, “o”, “o”, “b”, “y”, “-”,
“D”, “o”, “o”]

… to the corresponding bit stream:

[“1010”, “1011”, “11”, “11”, “010”, “100”, “00”, “011”, “11”, “11”,
“010”, “100”, “00”, “011”, “11”, “11”]

… which is joined into a string:

“101010111111010100000111111010100000111111”

The rest, serializing the translation table and adding and
registering padding bits is not that complicated and won’t be
explained here.

The decoding (decompressing?) is, well, just the reversed order
of the steps…

I want to emphasize one line of the decode code. After having
deserialized the translation table, we can decode the whole
message with one line:

bitstring.scan(/#{table.keys.join(“|”)}/).collect{|bits|
table[bits]}.join(“”)

gegroet,
Erik V. - http://www.erikveen.dds.nl/

My solution to the quiz can be found here:

http://pastie.caboo.se/61613

This solution has a number of features which might make it different
to other solutions submitted. It encodes to and decodes from actual
binary data, not strings of “0” and “1” characters. It is able to
serialize and deserialize the Huffman encoder (which is a hash
table, not a tree). It uses a linear algorithm to create the
Huffman code from the character frequency data. And beyond that, it
was written with an attempt to be efficient time-wise and
memory-wise. It uses an end-of-message token (:eom) to mark the end
of the encoded message. It therefore is able to handle messages
that use 256 characters plus the 1 token. It recognizes that with
large messages some characters could be encoded with more than 8
bits, and it handles those situations. And finally it makes an
attempt to recognize corrupt data, both in terms of the Huffman
encoded message and the serialized encoder.

Eric

Here’s my solution. I’m getting different values, but I’m fairly sure
I’m building the tree correctly. I suppose it all depends on how you
choose to encode the values (0 = left, 1 = right in my case). I may have
some glaring stupid error, but I can’t find it if I do :slight_smile:

Output:


ruby huffman.rb ABRRKBAARAA

Encoded
10100000 01101011 0011
3 bytes

Original
ABRRKBAARAA
11 bytes

72.7272727272727% compression


ruby huffman.rb I want this message to get encoded!

Encoded
01010001 10101100 10000110 00011010 01010111
10001101 10011111 11110010 10001000 01110110
00101000 10110000 01100001 00011011 10010011
00101000 0
17 bytes

Original
I want this message to get encoded!
35 bytes

51.4285714285714% compression


  • Drew

file: huffman.rb

author: Drew O.

require ‘enumerator’

class to hold nodes in the huffman tree

class Node
attr_accessor :val,:weight,:left,:right

def initialize(val="",weight=0)
@val,@weight = val,weight
end
end

priority queue for nodes

class NodeQueue
def initialize
@queue = []
end

def enqueue node
@queue << node
@queue = @queue.sort_by{|x|[-x.weight,x.val.size]}
end

def dequeue
@queue.pop
end

def size
@queue.size
end
end

HuffmanTree represents the tree with which we perform

the encoding

class HuffmanTree

initialize the tree based on data

def initialize data
@freqs = build_frequencies(data)
@root = build_tree
end

#encode the given data
def encode data
data.downcase.split(//).inject("") do |code,char|
code + encode_char(char)
end
end

private

this method encodes a given character based on our

tree representation

def encode_char char
node = @root
coding = “”

# we do a binary search, building the representation
# of the character based on which branch we follow
while node.val != char
  if node.right.val.include? char
    node = node.right
    coding += "1"
  else
    node = node.left
    coding += "0"
  end
end
coding

end

get word frequencies in a given phrase

def build_frequencies phrase
phrase.downcase.split(//).inject(Hash.new(0)) do |hash,item|
hash[item] += 1
hash
end
end

build huffmantree using the priority queue method

def build_tree
queue = NodeQueue.new

# build a node for each character and place in pqueue
@freqs.keys.each do |char|
  queue.enqueue(Node.new(char,@freqs[char]))
end

while !queue.size.zero?

  # if only one node exists, it is the root. return it
  return queue.dequeue if queue.size == 1

  # dequeue two lightest nodes, create parent,
  # add children and enqueue newly created node
  node = Node.new
  node.right = queue.dequeue
  node.left = queue.dequeue
  node.val = node.left.val+node.right.val
  node.weight = node.left.weight+node.right.weight
  queue.enqueue node
end

end
end

get command lines args, build tree and encode data

if FILE == $0
data = ARGV.join(" “)
tree = HuffmanTree.new data
encoded = tree.encode(data).scan(/\d{1,8}/)
puts
puts “Encoded”
encoded.each_slice(5) do |slice|
puts slice.join(” ")
end
puts “#{encoded.size} bytes”
puts
puts “Original”
puts data
puts “#{data.size} bytes”
puts
compression = (100.0 - (encoded.size.to_f/data.size.to_f)*100)
puts “#{compression}% compression”
end

2007/5/15, Ball, Donald A Jr (Library) [email protected]:

I also tried writing a binary encoder which packs
and unpacks the bit strings generated by my string encoder, but it fails
to decompress the entire text for long files, e.g. the program source
code itself. I’m unsure why, perhaps someone more experienced in this
arena can suggest why.

Perhaps because the file you want to encode already contains your
end-marker. The decoder stops when it encounters the first end-marker
(line 161). A binary file would not be unlikely to contain a NULL
character somewhere.

Regards,
Raf

My solution is pastied and included below:

http://pastie.caboo.se/61871

it’s overly long, but I wanted some practice messing around with bit
arrays. My solution prepends the dictionary to the compressed text, and
adds a NULL character to the dictionary and at the end of the compressed
text to mark the end. I also tried writing a binary encoder which packs
and unpacks the bit strings generated by my string encoder, but it fails
to decompress the entire text for long files, e.g. the program source
code itself. I’m unsure why, perhaps someone more experienced in this
arena can suggest why.

In any case, it was a fun exercise, one with which I will now stop
playing as I need to do some real work. :slight_smile:

  • donald

Ruby Q. 123

Donald B.

require ‘optparse’
require ‘stringio’

module Huffman

EOS = “\000” # end of string character for binary decoding

class Leaf
@code = 1

class << self
  attr_reader :code
end

attr_reader :char, :weight

def initialize(char, weight)
  @char = char
  @weight = weight
end

def ==(other)
  self === other && @weight == other.weight
end

def ===(other)
  other.is_a?(Leaf) && @char == other.char
end

def ciphers(hash, prefix)
  hash[char] = prefix
end

def serialize
  yield self.class.code
  yield char
end

end

class Branch
@code = 0

class << self
  attr_reader :code
end

attr_reader :left, :right

def initialize(left, right)
  @left = left
  @right = right
end

def weight
  left.weight + right.weight
end

def ciphers(hash = {}, prefix = '')
  left.ciphers(hash, prefix + '0')
  right.ciphers(hash, prefix + '1')
  hash
end

def ===(other)
  other.is_a?(Branch) && @left === other.left && @right ===

other.right
end

def ==(other)
  other.is_a?(Branch) && @left == other.left && @right ==

other.right
end

def serialize(&blk)
  yield self.class.code
  left.serialize(&blk)
  right.serialize(&blk)
end

end

class Tree
attr_reader :root, :plaintext

# returns a hash of weights by character, adding EOS char with

weight 0
def self.char_freq(src)
src.split(//m).inject(Hash.new(0)) {|m, c| m[c] += 1;
m}.merge({EOS=>0})
end

# returns a sorted array of arrays of weights and chars
def self.sort_freq(src)
  char_freq(src).map {|c, v| Leaf.new(c, v)}.sort_by {|l| [l.weight,

l.char] }
end

# shifts the next smallest node from the two given arrays
def self.next_smallest(l1, l2)
  return l1.shift if l2.length == 0
  return l2.shift if l1.length == 0
  l1[0].weight <= l2[0].weight ? l1.shift : l2.shift
end

def initialize(src)
  raise if src.nil?
  if src.is_a?(Branch) || src.is_a?(Leaf)
    @root = src
    return
  end
  @plaintext = src
  leaves = self.class.sort_freq(plaintext)
  branches = []
  until leaves.length + branches.length == 1
    n1 = self.class.next_smallest(leaves, branches)
    n2 = self.class.next_smallest(leaves, branches)
    branches << Branch.new(n1, n2)
  end
  @root = branches[0]
end

def ==(other)
  @root == other.root
end

def ===(other)
  @root === other.root
end

def serialize(&blk)
  @root.serialize(&blk)
  ciphers = @root.ciphers
  @plaintext.split(//m).each {|char| yield [ciphers[char]] }
  yield [ciphers[EOS]]
end

end

encodes and decodes strings to strings of mostly 0s and 1s,

prepending

the dictionary

class StringEncoder
def self.encode(plaintext)
tree = Tree.new(plaintext)
bits = ‘’
tree.serialize do |x|
bits <<
case x
when Fixnum: x.to_s
when Array: x[0]
else x.unpack(‘B8’)[0]
end
end
bits << ‘0’ * ((8 - (bits.length % 8)) % 8)
end

def self.decode(src)
  ios = StringIO.new(src)
  root = decode_tree(ios)
  deciphers = root.ciphers.invert
  s = ''
  buffer = ''
  while bit = ios.read(1)
    buffer << bit
    if char = deciphers[buffer]
      break if char == EOS
      s << char
      buffer = ''
    end
  end
  s
end

def self.decode_tree(ios)
  case ios.read(1)
    when Branch.code.to_s
      Branch.new(decode_tree(ios), decode_tree(ios))
    when Leaf.code.to_s
      Leaf.new([ios.read(8)].pack('B8'), 0)
    else
      raise
  end
end

end

encodes and decodes strings to binary, prepending the dictionary

class BinaryEncoder
def self.encode(src)
[StringEncoder.encode(src)].pack(‘B*’)
end

def self.decode(src)
  StringEncoder.decode(src.unpack('B*')[0])
end

end

end

if $0 == FILE
options = {}
OptionParser.new do |opts|
opts.on(‘-b’, ‘–binary’, ‘Use binary encoding’) do |s|
options[:encoder] = Huffman::BinaryEncoder
end
opts.on(‘-d’, ‘–decode’, ‘Decode’) do |d|
options[:action] = :decode
end
opts.on(‘-f F’, ‘–file’, String, ‘File input’) do |f|
input = ‘’
File.open(f, ‘r’) do |file|
file.each_line do |line|
input += line
end
end
options[:input] = input
end
end.parse!
options[:encoder] ||= Huffman::StringEncoder
options[:action] ||= :encode
options[:input] ||= $stdin.read
$stdout.write options[:encoder].send(options[:action],
options[:input])
end

2007/5/15, Ball, Donald A Jr (Library) [email protected]:

why.

  • donald

Strange… I get this:

Encoding-decoding paste-61871.rb:
$ ruby -w paste-61871.rb -b -f paste-61871.rb | ruby -w paste-61871.rb
-b -d > ref
$ cmp paste-61871.rb ref

=>Identical

Encoding-decoding a binary file:
$ ruby -w paste-61871.rb -b -f dilbert2073317070515.gif | ruby -w
paste-61871.rb -b -d > ref
$ cmp dilbert2073317070515.gif ref
cmp: EOF on ref
$ hexdump -C ref
00000000 47 49 46 38 39 61 58 02 d5 |GIF89aX…|
00000009
$ hexdump -C dilbert2073317070515.gif | head -n 1
00000000 47 49 46 38 39 61 58 02 d5 00 c4 00 00 58 58 58
|GIF89aX…XXX|

=> Encoded-decoded terminates just before first NULL character in
original file

Regards,
Raf

“Ball, Donald A Jr (Library)” [email protected] writes:

I also tried writing a binary encoder which packs
and unpacks the bit strings generated by my string encoder, but it fails
to decompress the entire text for long files, e.g. the program source
code itself. I’m unsure why, perhaps someone more experienced in this
arena can suggest why.

As someone else has already been successful at this task with your
program, allow me to propose a guess as to what the problem was: you
were working on windows.

If you just open a file with mode “r” on windows, it’ll stop at the
first occurrence in the input of character 26 (ctrl-z). To read
binary files on windows, you need to open with mode “rb”. (or use
binmode on the file)

As for the issue that you can’t compress binary data, I’ll note that I
got around that by having my tree encode numbers from 0 to 255, instead
of characters, and encoded the number -1 to mean EOF.

why.

Perhaps because the file you want to encode already contains
your end-marker. The decoder stops when it encounters the
first end-marker (line 161). A binary file would not be
unlikely to contain a NULL character somewhere.

That was my first thought, but I threw in a check for the existence of
NULL characters in the frequency hash before merging the EOS=>0 hash,
and it did not complain. Besides, the file I’m trying to decompress is
the ruby source of my quiz solution, which is just plain ol’ text.

  • donald

A little late, but here’s my 2nd solution. It does decoding as well, but
I didn’t get a chance to add serialization. Nothing too fancy in here:

Output:


ruby huffman.rb ABRRKBAARAA
Original
ABRRKBAARAA
11 bytes

Encoded
10100000 01101011 0011
3 bytes

72 percent compression

Decoded
abrrkbaaraa


ruby huffman.rb I want this message to get encoded!

Original
I want this message to get encoded!
35 bytes

Encoded
01010001 10101100 10000110 00011010 01010111
10001101 10011111 11110010 10001000 01110110
00101000 10110000 01100001 00011011 10010011
00101000 0
17 bytes

51 percent compression

Decoded
i want this message to get encoded!


Code:

file: huffman.rb

author: Drew O.

require ‘enumerator’

class to hold nodes in the huffman tree

class Node
attr_accessor :val,:weight,:left,:right

def initialize(val="",weight=0)
@val,@weight = val,weight
end

def children?
return @left || @right
end
end

priority queue for nodes

class NodeQueue
def initialize
@queue = []
end

def enqueue node
@queue << node
@queue = @queue.sort_by{|x|[-x.weight,x.val.size]}
end

def dequeue
@queue.pop
end

def size
@queue.size
end
end

HuffmanTree represents the tree with which we perform

the encoding

class HuffmanTree

initialize the tree based on data

def initialize data
@freqs = build_frequencies(data)
@root = build_tree
end

#encode the given data
def encode data
data.downcase.split(//).inject("") do |code,char|
code + encode_char(char)
end
end

def decode data
node = @root

if [email protected]?
  return @root.val
end

data.split(//).inject("") do |phrase,digit|
  if digit == "0"
    node = node.left
  else
    node = node.right
  end
  if !node.children?
    phrase += node.val
    node = @root
  end
  phrase
end

end

private

this method encodes a given character based on our

tree representation

def encode_char char
node = @root
coding = “”

# encode to 0 if only one character
if [email protected]?
  return "0"
end

# we do a binary search, building the representation
# of the character based on which branch we follow
while node.val != char
  if node.right.val.include? char
    node = node.right
    coding += "1"
  else
    node = node.left
    coding += "0"
  end
end
coding

end

get word frequencies in a given phrase

def build_frequencies phrase
phrase.downcase.split(//).inject(Hash.new(0)) do |hash,item|
hash[item] += 1
hash
end
end

build huffmantree using the priority queue method

def build_tree
queue = NodeQueue.new

# build a node for each character and place in pqueue
@freqs.keys.each do |char|
  queue.enqueue(Node.new(char,@freqs[char]))
end

while !queue.size.zero?

  # if only one node exists, it is the root. return it
  return queue.dequeue if queue.size == 1

  # dequeue two lightest nodes, create parent,
  # add children and enqueue newly created node
  node = Node.new
  node.right = queue.dequeue
  node.left = queue.dequeue
  node.val = node.left.val+node.right.val
  node.weight = node.left.weight+node.right.weight
  queue.enqueue node
end

end
end

get command lines args, build tree and encode data

if FILE == $0
data = ARGV.join(" ")
tree = HuffmanTree.new data

get encoded data and split into bits

code = tree.encode(data)
encoded_bits = code.scan(/\d{1,8}/)

output

puts
puts “Original”
puts data
puts “#{data.size} bytes”
puts
puts “Encoded”
encoded_bits.each_slice(5) do |slice|
puts slice.join(" ")
end
puts “#{encoded_bits.size} bytes”
puts
puts “%d percent compression” % (100.0 -
(encoded_bits.size.to_f/data.size.to_f)*100.0)
puts
puts “Decoded”
puts tree.decode(code)
end