Forum: Ruby Huffman Encoder (#123)

Announcement (2017-05-07): www.ruby-forum.com is now read-only since I unfortunately do not have the time to support and maintain the forum any more. Please see rubyonrails.org/community and ruby-lang.org/en/community for other Rails- und Ruby-related community platforms.
4299e35bacef054df40583da2d51edea?d=identicon&s=25 James Gray (bbazzarrakk)
on 2007-05-17 14:00
(Received via mailing list)
With the extra credits, this problem is a little involved and some
people did
write a lot of code for it.  Building the tree was our main interest in
this
problem though.

The quiz didn't detail that process too much, but several submitters
found
write-ups like the one at Wikipedia.  The trick is generally to use two
queues.
The first starts with all of the letters queued lowest frequency to the
highest
and the second starts empty.  While there is more than one node in the
combined
queues, you dequeue the two with the lowest weights, build a new node
with them
as children and a combined weight, then enqueue that in the second
queue.  When
you get down to just one node, you are finished.  That single node is
the root
of the tree.

A variation on this strategy is to use a single priority queue.  When
working
this way you can always just pull the two lowest entries, since the
queue will
keep them coming in the proper order.

Drew Olson has some pretty easy to follow code using the priority queue
approach, so let's look into that now.  First, Drew had to build a
priority
queue since one doesn't ship with Ruby:

  # priority queue for nodes
  class NodeQueue
    def initialize
      @queue = []
    end

    def enqueue node
      @queue << node
      @queue = @queue.sort_by{|x|[-x.weight,x.val.size]}
    end

    def dequeue
      @queue.pop
    end

    def size
      @queue.size
    end
  end

This is a trivial implementation that just resorts the queue after each
new
entry.  Note that the sort is on the opposite of the weights to put the
lowest
entries at the front.

This is not ideal, of course, but likely to be reasonably quick if you
are just
encoding simple text.  That's because the sort is largely in C.  For a
better
priority queue, have a peek at Daniel Martin's code.

Drew also used a trivial class to represent nodes in the tree:

  # class to hold nodes in the huffman tree
  class Node
    attr_accessor :val,:weight,:left,:right

    def initialize(val="",weight=0)
      @val,@weight = val,weight
    end

    def children?
      return @left || @right
    end
  end

As you can see, Nodes are pretty much just a Struct for tracking value,
weight,
and children.  The additional method just checks to see if this node is
a
branch, meaning that it has at least one child node.

With those tools to build on, Drew is now ready to create a HuffmanTree:

  # HuffmanTree represents the tree with which we perform
  # the encoding
  class HuffmanTree

    # initialize the tree based on data
    def initialize data
      @freqs = build_frequencies(data)
      @root = build_tree
    end

    #encode the given data
    def encode data
      data.downcase.split(//).inject("") do |code,char|
        code + encode_char(char)
      end
    end

    def decode data
      node = @root

      if !@root.children?
        return @root.val
      end

      data.split(//).inject("") do |phrase,digit|
        if digit == "0"
          node = node.left
        else
          node = node.right
        end
        if !node.children?
          phrase += node.val
          node = @root
        end
        phrase
      end
    end

    # ...

These three methods define the external interface for this class.
First, you
create HuffmanTree objects by passing in the data a tree should be
constructed
from.  Frequencies are counted for the characters in the data and a tree
is
built from those counts.

The encode() method takes the data you wish to apply the encoding to and
returns
a String of ones and zeros representing the data.  This implementation
just
iterates over the characters, using a helper method to translate them.
Note
that Drew's implementation normalizes case which results in smaller
encodings,
but this step needs to be removed if you want lossless compression.

The decode method is the most complex in the set, but still not too hard
to
grasp.  It starts at the root node and iterates over each one and zero,
selecting the correct child node.  Each time it reaches a leaf node (no
children), that character value is added to the translation and the
search
resets to the root node.

Now we just need to see the helper methods used in those methods.  This
first
one is the reverse of the decoder we just examined:

    # ...

    private

    # this method encodes a given character based on our
    # tree representation
    def encode_char char
      node = @root
      coding = ""

      # encode to 0 if only one character
      if !@root.children?
        return "0"
      end

      # we do a binary search, building the representation
      # of the character based on which branch we follow
      while node.val != char
        if node.right.val.include? char
          node = node.right
          coding += "1"
        else
          node = node.left
          coding += "0"
        end
      end
      coding
    end

    # ...

Again, the search begins with the root node and advances down the tree
branches.
This time the search is for nodes containing the character and we can
stop as
soon as we reach a leaf.  The encoding is the path of one and zero
branches that
lead to the character.

These last two methods handle tree construction:

    # ...

    # get word frequencies in a given phrase
    def build_frequencies phrase
      phrase.downcase.split(//).inject(Hash.new(0)) do |hash,item|
        hash[item] += 1
        hash
      end
    end

    # build huffmantree using the priority queue method
    def build_tree
      queue = NodeQueue.new

      # build a node for each character and place in pqueue
      @freqs.keys.each do |char|
        queue.enqueue(Node.new(char,@freqs[char]))
      end

      while !queue.size.zero?

        # if only one node exists, it is the root. return it
        return queue.dequeue if queue.size == 1

        # dequeue two lightest nodes, create parent,
        # add children and enqueue newly created node
        node = Node.new
        node.right = queue.dequeue
        node.left = queue.dequeue
        node.val = node.left.val+node.right.val
        node.weight = node.left.weight+node.right.weight
        queue.enqueue node
      end
    end
  end

The first method, build_frequencies(), is just a character counter.  The
counts
are returned in a Hash keyed by the character for a given count.

The main work is done in build_tree().  It begins by creating a priority
queue
and queuing each of the characters from the frequency count.  After
that, the
while loop is a direct translation of the process I described at the
beginning
of this summary.

The final bit of code puts the tree to work creating Drew's solution:

  # get command lines args, build tree and encode data
  if __FILE__ == $0
    require 'enumerator'

    data = ARGV.join(" ")
    tree = HuffmanTree.new data

    # get encoded data and split into bits
    code = tree.encode(data)
    encoded_bits = code.scan(/\d{1,8}/)

    # output
    puts
    puts "Original"
    puts data
    puts "#{data.size} bytes"
    puts
    puts "Encoded"
    encoded_bits.each_slice(5) do |slice|
      puts slice.join(" ")
    end
    puts "#{encoded_bits.size} bytes"
    puts
    puts "%d percent compression" %
         (100.0 - (encoded_bits.size.to_f/data.size.to_f)*100.0)
    puts
    puts "Decoded"
    puts tree.decode(code)
  end

The first few chunks of this code just run the interface methods we have
been
examining.  The last big chunk is simply the output of results using
some
straightforward printing logic.

My thanks to all who took on this challenge.  Several of you wrote
library
quality solutions.  It was impressive to see.

Tomorrow we will try some magical math, as quick as we can...
This topic is locked and can not be replied to.