Forum: Ruby Huffman Encoder (#123)

Announcement (2017-05-07): www.ruby-forum.com is now read-only since I unfortunately do not have the time to support and maintain the forum any more. Please see rubyonrails.org/community and ruby-lang.org/en/community for other Rails- und Ruby-related community platforms.
James G. (Guest)
on 2007-05-11 17:21
(Received via mailing list)
The three rules of Ruby Q.:

1.  Please do not post any solutions or spoiler discussion for this quiz
until
48 hours have passed from the time on this message.

2.  Support Ruby Q. by submitting ideas as often as you can:

http://www.rubyquiz.com/

3.  Enjoy!

Suggestion:  A [QUIZ] in the subject of emails about the problem helps
everyone
on Ruby T. follow the discussion.  Please reply to the original quiz
message,
if you can.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

by Harlan

Huffman Coding is a common form of data compression where none of the
original
data gets lost.  It begins by analyzing a string of data to determine
which
pieces occur with the highest frequencies.  These frequencies and pieces
are
used to construct a binary tree.  It is the “path” from root node to the
leaf with this data that forms its encoding.  The following example
should
explain things:

  Data:  ABRRKBAARAA (11 bytes)

  Frequency counts:
  A  5
  R  3
  B  2
  K  1

  In Huffman Tree form, with frequency weights in parentheses:
       ARBK (11)
      /    \
     0      1
    /        \
  A (5)      RBK (6)
            /   \
           0     1
          /       \
        R (3)     BK (3)
                  / \
                 0   1
                /     \
              B (2)   K (1)

  The encoding for each character is simply the path to that character:
  A    0
  R   10
  B  110
  K  111

  Here is the original data encoded:
  01101010 11111000 1000 (fits in 3 bytes)

We have compressed the original information by 80%!

A key point to note is that every character encoding has a unique
prefix,
corresponding to the unique path to that character within the tree.  If
this
were not so, then decoding would be impossible due to ambiguity.

The quiz this time is to write a program to implement a compression
program
using Huffman encoding.

  Extra Credit:
  Perform the actual encoding using your tree.  You may encounter one
issue
  during the decompression/decoding phase.  Your encoded string may not
be a
  multiple of 8.  This means that when you compress your encoding into a
  binary number, padding 0’s get added.  Then, upon decompression, you
may
  see extra characters.  To counter this, one solution is to add your
own
  padding of 1 extra character every time.  And then simply strip it off
  once you have decoded.

  You may also wish to provide a way to serialize the Huffman Tree so it
  can be shared among copies of your program.

  ./huffman_encode.rb I want this message to get encoded!

  Encoded:
  11111111 11111110 11111111 11101111 10111111
  01100110 11111111 11110111 11111111 11011100
  11111111 11010111 01110111 11011110 10011011
  11111100 11110101 10010111 11101111 11111011
  11111101 11111101 01111111 01111111 11111110
  Encoded Bytes:
  25

  Original:
  I want this message to get encoded!
  Original Bytes:
  35

  Compressed:  28%
Daniel M. (Guest)
on 2007-05-11 19:24
(Received via mailing list)
Ruby Q. <removed_email_address@domain.invalid> writes:

> The quiz this time is to write a program to implement a compression
> program using Huffman encoding.

I'd like to comment on and regularize a few of the "extra credit"
portions.

First off, without knowledge of the tree, one can't decode the Huffman
encoding given in the quiz since the message:
  I want this message to get encoded!
encodes to the same thing as:
  V jnag guvf zrffntr gb trg rapbqrq!

So, with a Huffman encoding one always needs to know the tree
structure when doing the decoding.  This means that usually what one
does with Huffman encoding is use some big training database to build
up your tree, and then one re-uses the same tree again and again.

Therefore, I want to add these switches:

-d       Decoding mode.  (Default is encoding)

-T file  Generate tree and store it to file.  Do not produce other
         output.

-t file  Use file to retrieve your tree. (for either encoding or
         decoding)

-i file  Use file as input instead of reading command line arguments

-o file  Use file as output instead of writing to standard output
         (Note: compression statistics should still be written to
          standard output.  file should contain only the message)

Also, I'll note that the quiz description can give a false impression
of Huffman encoding and how to build up the tree.  For example, after
reading that I'd forgive anyone who thought that Huffman encoding
always assigned to each letter a code of some number of 1s followed
by a zero, except the least frequent letter which is assigned a code
of all 1s.  In fact, it only works out to that for certain frequency
distributions.

Although I don't want to spoil things I'll point at the section
labeled "Basic technique" in the wikipedia article on Huffman coding
(http://en.wikipedia.org/wiki/Huffman_coding) and mention that I won't
quite be using the algorithm they mention there, but will be using
something vaguely similar.  (That is, I will be building the tree from
the bottom up rather than the top down)
Matthew M. (Guest)
on 2007-05-11 20:31
(Received via mailing list)
> First off, without knowledge of the tree, one can't decode the Huffman
> encoding given in the quiz since the message:
>   I want this message to get encoded!
> encodes to the same thing as:
>   V jnag guvf zrffntr gb trg rapbqrq!
>
> So, with a Huffman encoding one always needs to know the tree
> structure when doing the decoding.  This means that usually what one
> does with Huffman encoding is use some big training database to build
> up your tree, and then one re-uses the same tree again and again.


Another alternative is to do an adaptive Huffman encoding, which does
not require a dictionary at the front of the file, nor a training
database. The dictionary (i.e. tree) is built up during the process of
decoding.

The tradeoff is less compression for small files, or early on in larger
files.
Daniel M. (Guest)
on 2007-05-12 16:51
(Received via mailing list)
Ruby Q. <removed_email_address@domain.invalid> writes:

>   Encoded Bytes:
>   25
>
>   Original:
>   I want this message to get encoded!
>   Original Bytes:
>   35
>
>   Compressed:  28%

I'll note that this phrase includes 16 distinct characters (" ", "!",
"I", "a", "c", "d", "e", "g", "h", "i", "m", "n", "o", "s", "t", "w").
Even if you add a special "end-of-data" character, that's only 17
distinct characters, so you would expect any Huffman code to perform
at least as well 5 bits per character; since your code is in theory
being developed to encode this exact phrase, I'd expect better
performance than that.

At 5 bits per character, if you add a special "end of data" character
to the end of the phrase so that you're encooding 36 values, that's
only 180 bits of output.  180 bits fits into 23 bytes.

I'm pointing out that the initial reference code probably has a bug in
its algorithm, since it does even worse than a naive
5-bits-per-character encoding.

For what it's worth, my current Huffman encoder when tuned to that
phrase alone compresses it into 18 bytes; when tuned to
http://norvig.com/big.txt (a large chunk of English text) it still
manages to compress that phrase to 22 bytes.
Jesse M. (Guest)
on 2007-05-13 18:33
(Received via mailing list)
I based my solution on the algorithm described here:

http://www.siggraph.org/education/materials/HyperG...

I use the rubytree gem for the tree, and subclassed its TreeNode into
the class
FreqNode, which holds a frequency, possibly a letter (just the leaves),
and a
side (either :left or :right). I made FreqNodes compare primarily based
on
their frequencies so I could build a SortedSet of them and repeatedly
take the
two lowest to combine. The actual tree is built in the build_tree
method.

It just takes an input string, builds the tree, and outputs the
compressed
data (which is slightly different from the example given in the original
quiz
message because we treat ties differently. I'm pretty sure this doesn't
matter though). No extra credit, though it shouldn't be too hard to add
on some
tree serialization.

irb> tree = build_tree('ABC')
irb> tree.printTree
Node ID: 16 Content: 3 Parent: ROOT Children: 2 Total Nodes: 5 Side:
    Node ID: 14 Content: 1C Parent: 16 Children: 0 Total Nodes: 1 Side:
left
    Node ID: 15 Content: 2 Parent: 16 Children: 2 Total Nodes: 3 Side:
right
        Node ID: 12 Content: 1A Parent: 15 Children: 0 Total Nodes: 1
Side: left
        Node ID: 13 Content: 1B Parent: 15 Children: 0 Total Nodes: 1
Side: right
irb> encode 'ABC', tree
=> "10110"

$ ./huffman_encode.rb ABRRKBAARAA
Encoded:
01011111 10010100 1100
Encoded bytes: 3

Original:
ABRRKBAARAA
Original bytes: 11

Compressed: 73%


#!/usr/bin/env ruby

require 'rubygems'
require 'tree'
require 'set'

class Tree::TreeNode
  # Yield once for each leaf node.
  def each_leaf
    self.each { |node| yield(node) if not node.hasChildren? }
  end

  # Both assume exactly 2 children
  def left_child;  children.first; end
  def right_child; children.last; end
end

# FreqNodes are TreeNodes that:
#   1) Have generated, numeric names to ensure they're all unique.
#   2) Compare based primarily on their contents.
#   3) Have as contents an array containing a frequency, and possibly a
letter
#      (for the leaves).
#   4) Have a @side attribute that should be set to either :left or
:right for
#      non-root nodes.
#   5) Have a few other nicities, like letter().
class FreqNode < Tree::TreeNode
  attr_accessor :side

  def initialize(content = nil)
    super(FreqNode.next_name, content)
  end

  def <=>(node)
    content_diff = @content <=> node.content
    return content_diff if not content_diff.zero?
    super(node)
  end

  # Assumes this is a leaf.
  def letter; @content[1]; end

  def to_s; super() + " Side: #{@side}"; end

  # Generate unique names because we can't add two nodes with the same
name as
  # children of the same parent node.
  @next_name_num = -1
  def FreqNode.next_name
    (@next_name_num += 1).to_s
  end
end

class Set
  def shift
    elem = self.min
    self.delete elem
    elem
  end
end

# Build a SortedSet containing pairs of [freq, letter] for the given
string.
def build_freqs str
  # Build hash of byte => count
  counts = Hash.new 0
  str.each_byte { |byte| counts[byte.chr] += 1 }

  # Build SortedSet of [freq, byte] pairs (lower freqs first).
  freqs = SortedSet[]
  counts.each { |bc| freqs << bc.reverse }
  freqs
end

# Build the tree for the given input string, and return the root node.
def build_tree str
  nodes = build_freqs(str).map! { |pair| FreqNode.new(pair) }

  while nodes.size > 1
    child1, child2 = nodes.shift, nodes.shift
    parent = FreqNode.new([child1.content.first + child2.content.first])
    child1.side, child2.side = :left, :right
    parent << child1
    parent << child2
    nodes << parent
  end

  nodes.min
end

# Encode the given letter using the given tree of FreqNodes.
def encode_letter letter, tree
  enc = ''

  # Find leaf with the right byte value
  node = nil
  tree.each_leaf do |leaf|
    (node = leaf; break) if leaf.letter == letter
  end

  while not node.isRoot?
    node.side == :left ? enc << '0' : enc << '1'
    node = node.parent
  end

  enc.reverse
end

# Build a tree for the given string, and encode it using that tree.
def encode str, tree = build_tree(str)
  #tree = build_tree(str)
  encoded = ''
  str.each_byte { |byte| encoded << encode_letter(byte.chr, tree) }
  encoded
end

# Decode the given string (which should consist of '0's and '1's) using
the
# given tree.
def decode str, tree
  dec = ''
  i = 0
  node = tree

  str.each_byte do |byte|
    node = (byte.chr == '0' ? node.left_child : node.right_child)
    if not node.hasChildren?
      dec << node.byte_value.chr
      node = tree
    end
  end

  dec
end

class String
  # Split up the string by inserting sep between len-length chunks.
  def split_up! len, sep
    (((length / len.to_f).ceil - 1) * len).step(len, -len) do |i|
      insert i, sep
    end
    self
  end
end

# Output a binary string, splitting it up for easier reading.
def puts_binary str
  str = str.dup
  str.split_up! 40, "\n"
  str.each { |line| puts line.split_up!(8, ' ') }
end

if __FILE__ == $0
  input = ARGV.join ' '
  input_bytes = input.length
  enc = encode(input)
  enc_bytes = (enc.length / 8.0).ceil
  compressed = ((1 - enc_bytes.to_f/input_bytes) * 100).round

  puts 'Encoded:'
  puts_binary(enc)
  puts "Encoded bytes: #{enc_bytes}\n\n"

  puts "Original:\n#{input}"
  puts "Original bytes: #{input_bytes}\n\n"

  puts "Compressed: #{compressed}%"
end
Charles L. (Guest)
on 2007-05-13 18:59
Very simplistic solution. recursively partitions sorted frequencies to
build the huffman tree:

% ./huffman.rb I want this message to get encoded!
Encoded:
11001000 11111000 01010010 00001011 01111100 01100011 10100101 10111000
01001001 00001010 11000100 10010100 00001101 01101010 11100010 01100011
1000

35 => 17 (51%)


Code is rather straight forward: http://pastie.caboo.se/61225
Ivo D. (Guest)
on 2007-05-13 23:39
(Received via mailing list)
Well, first time I actually submit my solution, it's a bit hairy, but it
worked in my tests...

I made the decoder too, only problem was serialization, I did not know
what
to do with it, and how this in real apps work, so that part is a bit odd
(programming is just hobby for me).

usage:
ruby quiz123.rb encode this!

and
ruby quiz123.rb -d

The app will prompt you for the tree and the part to decode. The tree
should
be a hash definition ex: { "a" => "0", "b" => "10" }. (To test, you can
first encode something (it will show the tree in correct format) en then
decode it again.

What puzzled me was that longer texts appeared to take up more bytes
encoded
than decoded.

anyway, here is my code:

#! /usr/local/bin/ruby

class Huffman
  attr_accessor :tree, :encoded

  def initialize
    @encoded, @tree = nil
    @line = ""
  end

  def encode( line )
    @line = line
    chars = line.split("")
    uniq = chars.uniq

    counts = []
    uniq.each do |char|
      counts << { :char => char, :count => chars.select{ |x| x == char
}.length }
    end

    counts.sort!{ |x,y| x[:count] <=> y[:count]  }.reverse!

    @tree = {}
    counts.each_with_index do |char_hash,i|
      char = char_hash[:char]
      if i == counts.length-1
        @tree[char] = "1"*i
        break
      end
      @tree[char] = "1"*i+"0"
    end

    @encoded = line.split("").collect { |x| @tree[x] }.join("")
    @encoded += @tree[counts[1][:char]]
    @encoded += "0" * (removed_email_address@domain.invalid%8) if 
@encoded.length%8 != 0

    return self
  end

  def decode( encoded, tree )
    @encoded = encoded

    @tree = tree

    @encoded.gsub!(/ |\n/, "")

    encoded = @encoded.dup
    encoded.slice!((encoded.reverse.index("01")*-1)-2..-1) # slice of
extra
character(s)
    loop do
      begin
        part = encoded.slice!(0..encoded.index("0"))
        @line << @ tree.select{ |key, value| value == part }[0][0]
      rescue NoMethodError
        encoded = part + encoded
        code = @tree.select{ |key, value| !value.include? "0" }[0][1]
        begin
          part = encoded.slice!(0..encoded.index(code)+code.length-1)
          @line << @tree.select{ |key, value| value == part }[0][0]
        rescue
          break
        end
      rescue ArgumentError
        break
      end

    end
  end

  def inspect
    inspect = [ "Encoded:"]
    inspect << @encoded.scan(/.{8}|.*$/).join("
").scan(/.{45}|.*$/).join("\n")
    inspect << "Encoded Bytes:"
    inspect << byte_n_new = @encoded.length/8
    inspect << ""
    inspect << "Original:"
    inspect << @line
    inspect << "Original Bytes:"
    inspect << byte_n = @line.each_byte{}.length
    inspect << "compression:
#{((byte_n-byte_n_new).to_f/byte_n.to_f*100).round}%"
    inspect << ""
    inspect << @ tree.inspect
    return inspect.join("\n")
  end
end

@huffman = Huffman.new

unless $*[0] == "-d"
  @huffman.encode($*.join(" "))
  puts @huffman.inspect
else
  puts "tree: "
  tree = eval STDIN.gets.chomp
  puts "encoding: "
  encoding = STDIN.gets.chomp
  @huffman.decode(encoding, tree)
  puts @huffman.inspect
end
Erik V. (Guest)
on 2007-05-14 02:44
(Received via mailing list)
I used a bit of freedom:

* It reads from STDIN instead of ARGV.

* Output is binary instead of 0-and-1-as-text.

* Both encoding and decoding, for the extra points...

* In order to avoid juggling with padding characters, the least
  frequent character is 1110 instead of 111 (in the example in
  the quiz). Every "character" in the encoded bit stream ends
  with a 0. Yes, it costs one bit, but since it's not used
  frequently, it doesn't really hurt.

* 0 and 1 become 1 and 0, respectively, because we get the
  padding zeroes for free when we do "001".pack("B*"). So,
  rewriting the previous section: In order to avoid juggling
  with padding characters, the least frequent character is 0001
  instead of 000. Every "character" in the encoded bit stream
  ends with a 1.

Here's the code: http://tinyurl.com/3x5nsy

Example:

 $ echo -n  Scooby-Dooby-Doo | ruby huffman.rb | ruby huffman.rb -d
 Compression: 14/16 (87%)
 Decompression: 14/16 (87%)
 Scooby-Dooby-Doo

How does it work?

First of all, we count each unique byte in the original byte
stream (e.g. "Scooby-Dooby-Doo"):

 {"-"=>2, "b"=>2, "y"=>2, "c"=>1, "D"=>2, "o"=>6, "S"=>1}

... which is sorted:

 [["o", 6], ["-", 2], ["D", 2], ["b", 2], ["y", 2], ["S", 1], ["c",
1]]

... and cleaned:

 ["o", "-", "D", "b", "y", "S", "c"]

The leftmost character is used most often and the rightmost
character is used least often. We call this ordered list the
"alphabet".

Secondly, we build the translation table, where each character
in the "alphabet" is mapped to a bit sequence:

 {"b"=>"0001", "-"=>"01", "c"=>"0000001", "y"=>"00001", "D"=>"001",
"o"=>"1", "S"=>"000001"}

We can translated each byte of the original byte stream:

 ["S", "c", "o", "o", "b", "y", "-", "D", "o", "o", "b", "y", "-",
"D", "o", "o"]

... to the corresponding bit stream:

 ["000001", "0000001", "1", "1", "0001", "00001", "01", "001", "1",
"1", "0001", "00001", "01", "001", "1", "1"]

... which is joined into a string:

 "00000100000011100010000101001110001000010100111"

... which is packed (with pack("B*")) into another string:

 "\004\016!N!N"

... which we'll call the "encoded bit stream".

In order to be able to decode the message, we have to serialize
the "alphabet" and concatenate it with the "encoded bit stream":

 alphabet  + bit stream     -> message
 "o-DbySc" + "\004\016!N!N" -> "o-DbySc\004\016!N!N"

But where's the boundary between "alphabet" and "encoded bit
stream"? Well, let's add the length of the "alphabet"
([7]..pack("C")):

 len  + alphabet  + bit stream     -> message
 "\a" + "o-DbySc" + "\004\016!N!N" -> "\ao-DbySc\004\016!N!N"

Conclusion? "Scooby-Dooby-Doo" becomes "\ao-DbySc\004\016!N!N".

Well, we've saved 2 bytes, but it doesn't sound that well...

gegroet,
Erik V. - http://www.erikveen.dds.nl/
Daniel M. (Guest)
on 2007-05-14 18:21
(Received via mailing list)
Ruby Q. <removed_email_address@domain.invalid> writes:

> The quiz this time is to write a program to implement a compression program
> using Huffman encoding.

And here's mine.  Instead of adding something to each character
encoded, I used an additional symbol to mean "end of data".  Although
the program will behave as in the quiz when given no options,
(encoding the phrase given in the command line arguments to 1s and 0s
displayed as that) the really interesting stuff happens when you give
one or more options.

Here's the usage:

Generate tree: q123.rb -g -t treefile [opts|phrase]
       encode: q123.rb -t treefile [opts|phrase]
       decode: q123.rb -d -t treefile [opts|phrase]
[opts]: -i input -o output
        -b flip binary mode (default is 0s and 1s
           if -o is not given, and bytes if -o is)
        -x generate extended tree to also cover bytes
           not present in the training input
        -q be quiet - no statistics

ObTestOutput:
C:\Temp>ruby q123.rb I want this message to get encoded!
Encoded:
11010000 10100110 00101111 00011110 00110101
01000001 01110010 10001001 10001100 01000111
10010000 11000111 10000010 10110000 10010111
00101111 01101101 10000000
Original Bytes: 35
Encoded Bytes: 18
Compressed: 48.6%

(big.txt is http://norvig.com/big.txt)
C:\Temp>ruby q123.rb -g -t english.tree big.txt
Generated tree from input size 7

C:\Temp>ruby q123.rb -g -t english.tree -i big.txt
Generated tree from input size 6500961

C:\Temp>ruby q123.rb -t english.tree -i big.txt -o nul:
Original Bytes: 6500961
Encoded Bytes: 3707406
Compressed: 43.0%

And now the script itself:

#!ruby
require 'strscan'
require 'stringio'

# A useful method for my priority queue implementation
class Array
    def swap(a,b)
        self[a], self[b] = self[b], self[a]
    end
end

# Inspired by, but totally different from, the heap in
# ruby quiz 40
class PriorityQueue
    # Actually, this is more inspired by the summary on
    # quiz number 98 than the implementation in quiz 40.
    def initialize
        @items=[]
        @priorities=[]
    end

    def add(priority, item)
        @priorities.push priority
        @items.push item
        bubble_up
        self
    end

    def empty?
        @items.empty?
    end

    def shift
        return nil if empty?
        retval = @items.shift
        @priorities.shift
        if ! @items.empty? then
            @priorities.unshift @priorities.pop
            @items.unshift @items.pop
            bubble_down
        end
        retval
    end

    # Inspired by mapn in quiz 122
    def eachn(&b)
        arglen = b.arity
        while !empty?
            args = Array.new(arglen){shift}
            b.call(*args)
        end
    end

    private

    def bubble_up
        wi = @priorities.size - 1
        pr = @priorities[wi]
        until wi == 0
            pi = (wi-1)/2
            return if @priorities[pi] <= pr
            @items.swap(pi,wi)
            @priorities.swap(pi,wi)
            wi = pi
        end
    end

    def bubble_down
        wi = 0
        pr = @priorities[wi]
        until false   # i.e. until I return
            ci = 2*wi + 1
            return if ci >= @priorities.size
            ci += 1 if ci + 1 < @priorities.size and
                        @priorities[ci+1] < @priorities[ci]
            return if @priorities[ci] >= pr
            @items.swap(ci,wi)
            @priorities.swap(ci,wi)
            wi = ci
        end
    end
end

# Okay, that's all for utilities.  Now on with stuff for this quiz;
# basically, I have two classes: HuffmanNode and HuffmanCoder.
# HuffmanNode is the tree structure.  HuffmanCoder handles the
# dirty business of encoding and decoding given the tree structure.

# A HuffmanNode either has data or it has both left and right children,
# but not both.

HuffmanNode = Struct.new("HuffmanNode", :frequency, :data, :left,
:right)

class HuffmanNode
    def inspect
        "HuffmanNode.from_s(#{to_s.inspect})"
    end

    def to_s
        if self.data.nil? then
            "(" + self.left.to_s + ' ' + self.right.to_s + ")"
        else
            self.data.to_s
        end
    end

    def to_h(forencode, prefix='')
        if self.data.nil? then
            l = self.left.to_h(forencode,prefix + '0')
            r = self.right.to_h(forencode,prefix + '1')
            l.update(r)
        else
            if forencode then
                {self.data => prefix}
            else
                {prefix => self.data}
            end
        end
    end

    def HuffmanNode.from_s(s)
        begin
            return from_s_internal(StringScanner.new(s))
        rescue
            raise "Malformed string: '#{s}'"
        end
    end

    def HuffmanNode.from_s_internal(scanner)
        data = scanner.scan(/\s*-?\d+/)
        if data.nil? then
            scanner.scan(/\s*\(/) or raise 'err'
            rei = from_s_internal(scanner)
            scanner.scan(/\s+/) or raise 'err'
            ichi = from_s_internal(scanner)
            scanner.scan(/\s*\)/) or raise 'err'
            return new(0,nil,rei,ichi)
        else
            return new(0,data.to_i,nil,nil)
        end
    end

    def HuffmanNode.make_tree(freqhash, add_everything=false)
        pqueue = PriorityQueue.new
        # node with data -1 is used to mean "end of data"
        universe = {-1=>1}
        if add_everything then
            # Assume anything we haven't seen at all is an order of
            # magnitude less likely than those things we've seen once
            256.times{|i|universe[i]=0.1;}
        end
        universe.update(freqhash)
        universe.each {|charcode,freq|
            pqueue.add(freq, new(freq,charcode,nil,nil))
        }
        pqueue.eachn {|node1, node2|
            return node1 if node2.nil?

            n = new(node1.frequency + node2.frequency, \
                            nil, node2, node1)
            pqueue.add(n.frequency, n)
        }
    end
end

class HuffmanCoder
    attr :enchash
    attr :dechash
    attr :decre
    def initialize(nodetree)
        @enchash = nodetree.to_h(true)
        @dechash = nodetree.to_h(false)
        @decre = Regexp.new('^(' + @dechash.keys.join('|') + ')(.*)')
    end

    def encode(io_in,io_out,crunch_binary=true,io_stats=$stderr)
        buff = ''
        outbytes = 0
        inbytes = 0
        io_out.puts "Encoded:" unless crunch_binary
        encode_bits(io_in) { |bits|
            inbytes += 1
            if bits then
                buff += bits
            else
                buff += '0' * ((-buff.length) %  8 )
            end
            while buff.length >= 8 do
                binary = buff.slice!(0..7)
                if crunch_binary then
                    binary = [binary].pack("b*")
                else
                    binary = binary + ' '
                    binary += "\n" if 4 == outbytes % 5
                end
                io_out.print binary
                outbytes += 1
            end
        }
        if 0 != outbytes % 5 and not crunch_binary then
            io_out.print "\n"
        end
        inbytes -= 2
        if io_stats then
            io_stats.puts "Original Bytes: #{inbytes}"
            io_stats.puts "Encoded Bytes: #{outbytes}"
            io_stats.puts "Compressed: %2.1f%%"%  [100.0 -
(100.0*outbytes)/inbytes]
        end
    end

    def decode(io_in,io_out,crunched_binary=true,io_stats=$stderr)
        buff = ''
        outbytes = 0
        inbytes = decode_bits(io_in,crunched_binary) { |bits|
            buff += bits
            m = @decre.match buff
            while m do
                ch = @dechash[m[1]]
                if ch == -1
                    if m[2] !~ /^0*$/ then
                        raise "Garbage after EOD marker"
                    end
                    break
                end
                io_out.putc ch
                outbytes += 1
                buff = m[2]
                m = @decre.match buff
            end
        }
        if io_stats then
            io_stats.puts "Encoded Bytes: #{inbytes}"
            io_stats.puts "Original Bytes: #{outbytes}"
            io_stats.puts "Compressed: %2.1f%%"%  [100.0 -
(100.0*inbytes)/outbytes]
        end
    end

    def HuffmanCoder.from_file(treefile)
        tree = nil
        File.open(treefile, "r") { |f|
            treetxt = ''
            f.each{ |treeline| treetxt += treeline }
            tree = HuffmanNode.from_s(treetxt)
        }
        new(tree)
    end

    def HuffmanCoder.generate(io_in, treefile, generate_extended,
io_stats=$stderr)
        bytecount = 0
        d = Hash.new(0);
        io_in.each_byte {|b| d[b] += 1; bytecount += 1}
        tree = HuffmanNode.make_tree(d,generate_extended)
        if ! treefile.nil?
            File.open(treefile, "w") {|f|
                f.puts tree.to_s
            }
        end
        if io_stats then
            io_stats.puts "Generated tree from input size #{bytecount}"
        end
        new(tree)
    end

    private

    def encode_bits(io_in)
        c = io_in.getc
        until c.nil?
            bits = @enchash[c]
            raise "no code for character #{c}" unless bits
            yield bits
            c = io_in.getc
        end
        yield @enchash[-1]
        yield nil
    end

    def decode_bits(io_in,crunched_binary)
        inbytes = 0
        if crunched_binary then
            until io_in.eof?
                b = io_in.read(4096)
                return if b.nil?
                inbytes += b.length
                yield b.unpack('b*')[0]
            end
        else
            until io_in.eof?
                b = io_in.read(4096)
                return if b.nil?
                b.tr!('^01','')
                inbytes += b.length
                yield b if b.length > 0
            end
            inbytes = (inbytes+7)/8
        end
        inbytes
    end
end

# That's all the interesting stuff.  Everything from here down is
# argument handling.  Basically uninteresting.

if __FILE__ == $0 then
    mode = :encode
    input = nil
    output = nil
    treefile = nil
    crunched_binary = true
    generate_extended = false
    statschannel = $stderr
    while ARGV[0] and ARGV[0] =~ /^-/
        opt = ARGV.shift
        case opt
        when '--'
            break
        when '-t'
            treefile = ARGV.shift
        when '-i'
            input = ARGV.shift
        when '-o'
            output = ARGV.shift
        when '-d'
            mode = :decode
        when '-b'
            crunched_binary = false
        when '-q'
            statschannel = nil
        when '-g'
            mode = :gentree
        when '-x'
            mode = :gentree
            generate_extended = true
        when '-h'
            puts "Usage:"
            puts "Generate tree: #{$0} -g -t treefile [opts|phrase]"
            puts "       encode: #{$0} -t treefile [opts|phrase]"
            puts "       decode: #{$0} -d -t treefile [opts|phrase]"
            puts "[opts]: -i input -o output"
            puts "        -b flip binary mode (default is 0s and 1s"
            puts "           if -o is not given, and bytes if -o is)"
            puts "        -x generate extended tree to also cover bytes"
            puts "           not present in the training input"
            puts "        -q be quiet - no statistics"
            exit 0
        else
            $stderr.puts "Unrecognized option #{opt} -- use -h for help"
            exit 1
        end
    end
    if treefile.nil? then
        # allow no treefile only when encoding command line
        # stuff as a demo.
        if ! (input.nil? and mode == :encode)
            $stderr.puts "Error: no -t option given"
            exit 1
        end
    end
    in_io = nil
    if input.nil?
        in_io = StringIO.new(ARGV.join(' '))
        crunched_binary = ! crunched_binary if mode == :decode
    elsif input == '-'
        in_io = $stdin
    else
        in_io = File.open(input, "rb")
    end
    out_io = nil
    if output.nil?
        out_io = $stdout
        crunched_binary = ! crunched_binary if mode == :encode
    elsif output == '-'
        out_io = $stdout
    else
        out_io = File.open(output, "wb")
    end
    if mode == :gentree then
        HuffmanCoder.generate(in_io, treefile, generate_extended,
                                statschannel)
    else
        coder = nil
        if (treefile.nil?)
            coder = HuffmanCoder.generate(in_io, nil, false, nil)
            in_io.rewind
        else
            coder = HuffmanCoder.from_file(treefile)
        end
        coder.send(mode, in_io,out_io, crunched_binary, statschannel)
    end
end

__END__
Raf C. (Guest)
on 2007-05-14 21:48
(Received via mailing list)
Here's mine. It's my first submission; be gentle :)

I've compared several ways of chopping up the string before building
the tree and encoding. As can be expected, the bigger the chunks the
smaller the encoded string but also the bigger the tree. There must be
a sweet spot somewhere in the middle.

$ ./rq123_huffman_rafc.rb
Encoded byte tokens:
  Size of encoded data:       285903
  Tree size:                    1140
  ----------------------------------
  Total size:                 287043
  Original size:              500000
  Compressed by:                 42%
####################################
Encoded word tokens:
  Size of encoded data:       145703
  Tree size:                  136634
  ----------------------------------
  Total size:                 282337
  Original size:              500000
  Compressed by:                 43%
####################################
Encoded 2byte tokens:
  Size of encoded data:       246807
  Tree size:                   20761
  ----------------------------------
  Total size:                 267568
  Original size:              500000
  Compressed by:                 46%
####################################
Encoded 3byte tokens:
  Size of encoded data:       218899
  Tree size:                  121651
  ----------------------------------
  Total size:                 340550
  Original size:              500000
  Compressed by:                 31%
####################################


Regards,
Raf
Drew O. (Guest)
on 2007-05-15 02:21
Here's my solution. I'm getting different values, but I'm fairly sure
I'm building the tree correctly. I suppose it all depends on how you
choose to encode the values (0 = left, 1 = right in my case). I may have
some glaring stupid error, but I can't find it if I do :)

Output:

--------------------------------------------------------

>ruby huffman.rb ABRRKBAARAA

Encoded
10100000 01101011 0011
3 bytes

Original
ABRRKBAARAA
11 bytes

72.7272727272727% compression

--------------------------------------------------------


>ruby huffman.rb I want this message to get encoded!

Encoded
01010001 10101100 10000110 00011010 01010111
10001101 10011111 11110010 10001000 01110110
00101000 10110000 01100001 00011011 10010011
00101000 0
17 bytes

Original
I want this message to get encoded!
35 bytes

51.4285714285714% compression

--------------------------------------------------------

- Drew

# file: huffman.rb
# author: Drew O.

require 'enumerator'

# class to hold nodes in the huffman tree
class Node
  attr_accessor :val,:weight,:left,:right

  def initialize(val="",weight=0)
    @val,@weight = val,weight
  end
end

# priority queue for nodes
class NodeQueue
  def initialize
    @queue = []
  end

  def enqueue node
    @queue << node
    @queue = @queue.sort_by{|x|[-x.weight,x.val.size]}
  end

  def dequeue
    @queue.pop
  end

  def size
    @queue.size
  end
end

# HuffmanTree represents the tree with which we perform
# the encoding
class HuffmanTree

  # initialize the tree based on data
  def initialize data
    @freqs = build_frequencies(data)
    @root = build_tree
  end

  #encode the given data
  def encode data
    data.downcase.split(//).inject("") do |code,char|
      code + encode_char(char)
    end
  end

  private

  # this method encodes a given character based on our
  # tree representation
  def encode_char char
    node = @root
    coding = ""

    # we do a binary search, building the representation
    # of the character based on which branch we follow
    while node.val != char
      if node.right.val.include? char
        node = node.right
        coding += "1"
      else
        node = node.left
        coding += "0"
      end
    end
    coding
  end

  # get word frequencies in a given phrase
  def build_frequencies phrase
    phrase.downcase.split(//).inject(Hash.new(0)) do |hash,item|
      hash[item] += 1
      hash
    end
  end

  # build huffmantree using the priority queue method
  def build_tree
    queue = NodeQueue.new

    # build a node for each character and place in pqueue
    @freqs.keys.each do |char|
      queue.enqueue(Node.new(char,@freqs[char]))
    end

    while !queue.size.zero?

      # if only one node exists, it is the root. return it
      return queue.dequeue if queue.size == 1

      # dequeue two lightest nodes, create parent,
      # add children and enqueue newly created node
      node = Node.new
      node.right = queue.dequeue
      node.left = queue.dequeue
      node.val = node.left.val+node.right.val
      node.weight = node.left.weight+node.right.weight
      queue.enqueue node
    end
  end
end

# get command lines args, build tree and encode data
if __FILE__ == $0
  data = ARGV.join(" ")
  tree = HuffmanTree.new data
  encoded = tree.encode(data).scan(/\d{1,8}/)
  puts
  puts "Encoded"
  encoded.each_slice(5) do |slice|
    puts slice.join(" ")
  end
  puts "#{encoded.size} bytes"
  puts
  puts "Original"
  puts data
  puts "#{data.size} bytes"
  puts
  compression = (100.0 - (encoded.size.to_f/data.size.to_f)*100)
  puts "#{compression}\% compression"
end
Erik V. (Guest)
on 2007-05-15 02:38
(Received via mailing list)
The better solution...

Here's the code: http://tinyurl.com/32tjfo

Example:

 $ echo -n Scooby-Dooby-Doo | ruby huffman.rb | ruby huffman.rb -d
 Compression: 27/17 (158%)
 Decompression: 27/17 (158%)
 Scooby-Dooby-Doo

How does it work?

Count each unique byte in the original byte stream (e.g.
"Scooby-Dooby-Doo"):

 [["S", 1], ["c", 1], ["o", 6], ["b", 2], ["y", 2], ["-", 2], ["D",
2]]

This list is then converted to a tree, using this code:

 while tree.length > 1
   a, b, *tree = tree.sort_by{|_, count| count} # Not deterministic.
   tree << [[a[0], b[0]], a[1]+b[1]]
 end

In English: The two least frequently used leaves/nodes are
replaced by one node (a combination of both leaves),
recursively, until the list is only one leaf/node long. Step by
step, this results in the following:

 [["S", 1], ["c", 1], ["b", 2], ["D", 2], ["y", 2], ["-", 2], ["o",
6]]
 [["b", 2], ["D", 2], ["y", 2], [["S", "c"], 2], ["-", 2], ["o", 6]]
 [["y", 2], [["S", "c"], 2], ["-", 2], [["b", "D"], 4], ["o", 6]]
 [["-", 2], [["b", "D"], 4], [["y", ["S", "c"]], 4], ["o", 6]]
 [[["y", ["S", "c"]], 4], ["o", 6], [["-", ["b", "D"]], 6]]
 [[["-", ["b", "D"]], 6], [[["y", ["S", "c"]], "o"], 10]]
 [[[["-", ["b", "D"]], [["y", ["S", "c"]], "o"]], 16]]

The tree is actually still in the list (the only element in the
list), so we get it and strip the weight:

 [["-", ["b", "D"]], [["y", ["S", "c"]], "o"]]

Secondly, we build the translation table, where each leaf in
the tree is mapped to a bit sequence, by using the path to each
leaf:

 {"-"=>"00", "b"=>"010", "y"=>"100", "c"=>"1011", "D"=>"011",
"o"=>"11", "S"=>"1010"}

We can translated each byte of the original byte stream:

 ["S", "c", "o", "o", "b", "y", "-", "D", "o", "o", "b", "y", "-",
"D", "o", "o"]

... to the corresponding bit stream:

 ["1010", "1011", "11", "11", "010", "100", "00", "011", "11", "11",
"010", "100", "00", "011", "11", "11"]

... which is joined into a string:

 "101010111111010100000111111010100000111111"

The rest, serializing the translation table and adding and
registering padding bits is not that complicated and won't be
explained here.

The decoding (decompressing?) is, well, just the reversed order
of the steps...

I want to emphasize one line of the decode code. After having
deserialized the translation table, we can decode the whole
message with one line:

 bitstring.scan(/#{table.keys.join("|")}/).collect{|bits|
table[bits]}.join("")

gegroet,
Erik V. - http://www.erikveen.dds.nl/
Eric I. (Guest)
on 2007-05-15 05:55
(Received via mailing list)
My solution to the quiz can be found here:

    http://pastie.caboo.se/61613

This solution has a number of features which might make it different
to other solutions submitted.  It encodes to and decodes from actual
binary data, not strings of "0" and "1" characters.  It is able to
serialize and deserialize the Huffman encoder (which is a hash
table, not a tree).  It uses a linear algorithm to create the
Huffman code from the character frequency data.  And beyond that, it
was written with an attempt to be efficient time-wise and
memory-wise.  It uses an end-of-message token (:eom) to mark the end
of the encoded message.  It therefore is able to handle messages
that use 256 characters plus the 1 token.  It recognizes that with
large messages some characters could be encoded with more than 8
bits, and it handles those situations.  And finally it makes an
attempt to recognize corrupt data, both in terms of the Huffman
encoded message and the serialized encoder.

Eric
Ball, Donald A Jr (Library) (Guest)
on 2007-05-16 01:24
(Received via mailing list)
My solution is pastied and included below:

http://pastie.caboo.se/61871

it's overly long, but I wanted some practice messing around with bit
arrays. My solution prepends the dictionary to the compressed text, and
adds a NULL character to the dictionary and at the end of the compressed
text to mark the end. I also tried writing a binary encoder which packs
and unpacks the bit strings generated by my string encoder, but it fails
to decompress the entire text for long files, e.g. the program source
code itself. I'm unsure why, perhaps someone more experienced in this
arena can suggest why.

In any case, it was a fun exercise, one with which I will now stop
playing as I need to do some real work. :)

- donald

# Ruby Q. 123
# Donald B.

require 'optparse'
require 'stringio'

module Huffman

  EOS = "\000" # end of string character for binary decoding

  class Leaf
    @code = 1

    class << self
      attr_reader :code
    end

    attr_reader :char, :weight

    def initialize(char, weight)
      @char = char
      @weight = weight
    end

    def ==(other)
      self === other && @weight == other.weight
    end

    def ===(other)
      other.is_a?(Leaf) && @char == other.char
    end

    def ciphers(hash, prefix)
      hash[char] = prefix
    end

    def serialize
      yield self.class.code
      yield char
    end
  end

  class Branch
    @code = 0

    class << self
      attr_reader :code
    end

    attr_reader :left, :right

    def initialize(left, right)
      @left = left
      @right = right
    end

    def weight
      left.weight + right.weight
    end

    def ciphers(hash = {}, prefix = '')
      left.ciphers(hash, prefix + '0')
      right.ciphers(hash, prefix + '1')
      hash
    end

    def ===(other)
      other.is_a?(Branch) && @left === other.left && @right ===
other.right
    end

    def ==(other)
      other.is_a?(Branch) && @left == other.left && @right ==
other.right
    end

    def serialize(&blk)
      yield self.class.code
      left.serialize(&blk)
      right.serialize(&blk)
    end
  end

  class Tree
    attr_reader :root, :plaintext

    # returns a hash of weights by character, adding EOS char with
weight 0
    def self.char_freq(src)
      src.split(//m).inject(Hash.new(0)) {|m, c| m[c] += 1;
m}.merge({EOS=>0})
    end

    # returns a sorted array of arrays of weights and chars
    def self.sort_freq(src)
      char_freq(src).map {|c, v| Leaf.new(c, v)}.sort_by {|l| [l.weight,
l.char] }
    end

    # shifts the next smallest node from the two given arrays
    def self.next_smallest(l1, l2)
      return l1.shift if l2.length == 0
      return l2.shift if l1.length == 0
      l1[0].weight <= l2[0].weight ? l1.shift : l2.shift
    end

    def initialize(src)
      raise if src.nil?
      if src.is_a?(Branch) || src.is_a?(Leaf)
        @root = src
        return
      end
      @plaintext = src
      leaves = self.class.sort_freq(plaintext)
      branches = []
      until leaves.length + branches.length == 1
        n1 = self.class.next_smallest(leaves, branches)
        n2 = self.class.next_smallest(leaves, branches)
        branches << Branch.new(n1, n2)
      end
      @root = branches[0]
    end

    def ==(other)
      @root == other.root
    end

    def ===(other)
      @root === other.root
    end

    def serialize(&blk)
      @root.serialize(&blk)
      ciphers = @root.ciphers
      @plaintext.split(//m).each {|char| yield [ciphers[char]] }
      yield [ciphers[EOS]]
    end
  end

  # encodes and decodes strings to strings of mostly 0s and 1s,
prepending
  # the dictionary
  class StringEncoder
    def self.encode(plaintext)
      tree = Tree.new(plaintext)
      bits = ''
      tree.serialize do |x|
        bits <<
          case x
            when Fixnum: x.to_s
            when Array: x[0]
            else x.unpack('B8')[0]
          end
      end
      bits << '0' * ((8 - (bits.length % 8)) % 8)
    end

    def self.decode(src)
      ios = StringIO.new(src)
      root = decode_tree(ios)
      deciphers = root.ciphers.invert
      s = ''
      buffer = ''
      while bit = ios.read(1)
        buffer << bit
        if char = deciphers[buffer]
          break if char == EOS
          s << char
          buffer = ''
        end
      end
      s
    end

    def self.decode_tree(ios)
      case ios.read(1)
        when Branch.code.to_s
          Branch.new(decode_tree(ios), decode_tree(ios))
        when Leaf.code.to_s
          Leaf.new([ios.read(8)].pack('B8'), 0)
        else
          raise
      end
    end
  end

  # encodes and decodes strings to binary, prepending the dictionary
  class BinaryEncoder
    def self.encode(src)
      [StringEncoder.encode(src)].pack('B*')
    end

    def self.decode(src)
      StringEncoder.decode(src.unpack('B*')[0])
    end
  end

end

if $0 == __FILE__
  options = {}
  OptionParser.new do |opts|
    opts.on('-b', '--binary', 'Use binary encoding') do |s|
      options[:encoder] = Huffman::BinaryEncoder
    end
    opts.on('-d', '--decode', 'Decode') do |d|
      options[:action] = :decode
    end
    opts.on('-f F', '--file', String, 'File input') do |f|
      input = ''
      File.open(f, 'r') do |file|
        file.each_line do |line|
          input += line
        end
      end
      options[:input] = input
    end
  end.parse!
  options[:encoder] ||= Huffman::StringEncoder
  options[:action] ||= :encode
  options[:input] ||= $stdin.read
  $stdout.write options[:encoder].send(options[:action],
options[:input])
end
Raf C. (Guest)
on 2007-05-16 01:45
(Received via mailing list)
2007/5/15, Ball, Donald A Jr (Library) <removed_email_address@domain.invalid>:
<snip>
> I also tried writing a binary encoder which packs
> and unpacks the bit strings generated by my string encoder, but it fails
> to decompress the entire text for long files, e.g. the program source
> code itself. I'm unsure why, perhaps someone more experienced in this
> arena can suggest why.

Perhaps because the file you want to encode already contains your
end-marker. The decoder stops when it encounters the first end-marker
(line 161). A binary file would not be unlikely to contain a NULL
character somewhere.

Regards,
Raf
Ball, Donald A Jr (Library) (Guest)
on 2007-05-16 01:57
(Received via mailing list)
> > why.
>
> Perhaps because the file you want to encode already contains
> your end-marker. The decoder stops when it encounters the
> first end-marker (line 161). A binary file would not be
> unlikely to contain a NULL character somewhere.

That was my first thought, but I threw in a check for the existence of
NULL characters in the frequency hash before merging the EOS=>0 hash,
and it did not complain. Besides, the file I'm trying to decompress is
the ruby source of my quiz solution, which is just plain ol' text.

- donald
Raf C. (Guest)
on 2007-05-16 02:19
(Received via mailing list)
2007/5/15, Ball, Donald A Jr (Library) <removed_email_address@domain.invalid>:
> > > why.
>
> - donald
>

Strange... I get this:


Encoding-decoding paste-61871.rb:
$ ruby -w paste-61871.rb -b -f paste-61871.rb | ruby -w paste-61871.rb
-b -d > ref
$ cmp paste-61871.rb ref

=>Identical


Encoding-decoding a binary file:
$ ruby -w paste-61871.rb -b -f dilbert2073317070515.gif | ruby -w
paste-61871.rb -b -d > ref
$ cmp dilbert2073317070515.gif ref
cmp: EOF on ref
$ hexdump -C ref
00000000  47 49 46 38 39 61 58 02  d5                       |GIF89aX..|
00000009
$ hexdump -C dilbert2073317070515.gif | head -n 1
00000000  47 49 46 38 39 61 58 02  d5 00 c4 00 00 58 58 58
|GIF89aX......XXX|

=> Encoded-decoded terminates just before first NULL character in
original file


Regards,
Raf
Daniel M. (Guest)
on 2007-05-16 15:56
(Received via mailing list)
"Ball, Donald A Jr (Library)" <removed_email_address@domain.invalid> writes:

> I also tried writing a binary encoder which packs
> and unpacks the bit strings generated by my string encoder, but it fails
> to decompress the entire text for long files, e.g. the program source
> code itself. I'm unsure why, perhaps someone more experienced in this
> arena can suggest why.

As someone else has already been successful at this task with your
program, allow me to propose a guess as to what the problem was: you
were working on windows.

If you just open a file with mode "r" on windows, it'll stop at the
first occurrence in the input of character 26 (ctrl-z).  To read
binary files on windows, you need to open with mode "rb". (or use
binmode on the file)

As for the issue that you can't compress binary data, I'll note that I
got around that by having my tree encode numbers from 0 to 255, instead
of characters, and encoded the number -1 to mean EOF.
Drew O. (Guest)
on 2007-05-16 17:41
A little late, but here's my 2nd solution. It does decoding as well, but
I didn't get a chance to add serialization. Nothing too fancy in here:

Output:

-------------------------------------------------------------

>ruby huffman.rb ABRRKBAARAA
Original
ABRRKBAARAA
11 bytes

Encoded
10100000 01101011 0011
3 bytes

72 percent compression

Decoded
abrrkbaaraa

-------------------------------------------------------------

>ruby huffman.rb I want this message to get encoded!

Original
I want this message to get encoded!
35 bytes

Encoded
01010001 10101100 10000110 00011010 01010111
10001101 10011111 11110010 10001000 01110110
00101000 10110000 01100001 00011011 10010011
00101000 0
17 bytes

51 percent compression

Decoded
i want this message to get encoded!

-------------------------------------------------------------

Code:

# file: huffman.rb
# author: Drew O.

require 'enumerator'

# class to hold nodes in the huffman tree
class Node
  attr_accessor :val,:weight,:left,:right

  def initialize(val="",weight=0)
    @val,@weight = val,weight
  end

  def children?
    return @left || @right
  end
end

# priority queue for nodes
class NodeQueue
  def initialize
    @queue = []
  end

  def enqueue node
    @queue << node
    @queue = @queue.sort_by{|x|[-x.weight,x.val.size]}
  end

  def dequeue
    @queue.pop
  end

  def size
    @queue.size
  end
end

# HuffmanTree represents the tree with which we perform
# the encoding
class HuffmanTree

  # initialize the tree based on data
  def initialize data
    @freqs = build_frequencies(data)
    @root = build_tree
  end

  #encode the given data
  def encode data
    data.downcase.split(//).inject("") do |code,char|
      code + encode_char(char)
    end
  end

  def decode data
    node = @root

    if !@root.children?
      return @root.val
    end

    data.split(//).inject("") do |phrase,digit|
      if digit == "0"
        node = node.left
      else
        node = node.right
      end
      if !node.children?
        phrase += node.val
        node = @root
      end
      phrase
    end
  end

  private

  # this method encodes a given character based on our
  # tree representation
  def encode_char char
    node = @root
    coding = ""

    # encode to 0 if only one character
    if !@root.children?
      return "0"
    end

    # we do a binary search, building the representation
    # of the character based on which branch we follow
    while node.val != char
      if node.right.val.include? char
        node = node.right
        coding += "1"
      else
        node = node.left
        coding += "0"
      end
    end
    coding
  end

  # get word frequencies in a given phrase
  def build_frequencies phrase
    phrase.downcase.split(//).inject(Hash.new(0)) do |hash,item|
      hash[item] += 1
      hash
    end
  end

  # build huffmantree using the priority queue method
  def build_tree
    queue = NodeQueue.new

    # build a node for each character and place in pqueue
    @freqs.keys.each do |char|
      queue.enqueue(Node.new(char,@freqs[char]))
    end

    while !queue.size.zero?

      # if only one node exists, it is the root. return it
      return queue.dequeue if queue.size == 1

      # dequeue two lightest nodes, create parent,
      # add children and enqueue newly created node
      node = Node.new
      node.right = queue.dequeue
      node.left = queue.dequeue
      node.val = node.left.val+node.right.val
      node.weight = node.left.weight+node.right.weight
      queue.enqueue node
    end
  end
end

# get command lines args, build tree and encode data
if __FILE__ == $0
  data = ARGV.join(" ")
  tree = HuffmanTree.new data

  # get encoded data and split into bits
  code = tree.encode(data)
  encoded_bits = code.scan(/\d{1,8}/)

  # output
  puts
  puts "Original"
  puts data
  puts "#{data.size} bytes"
  puts
  puts "Encoded"
  encoded_bits.each_slice(5) do |slice|
    puts slice.join(" ")
  end
  puts "#{encoded_bits.size} bytes"
  puts
  puts "%d percent compression" % (100.0 -
(encoded_bits.size.to_f/data.size.to_f)*100.0)
  puts
  puts "Decoded"
  puts tree.decode(code)
end
This topic is locked and can not be replied to.