Hexdump (#171)

mansfiem · July 31, 2008, 8:38pm

When learning a new programming language, the first thing many coders
do is write the traditional “Hello, world!” program. This generally
provides the bare minimum needed for coding: base program structure,
compilation if needed… In Ruby, this is very bare, as puts "Hello, world!" is sufficient. (See quiz #158 for some non-traditional
versions.)

What also seems a tradition is the question, “What should I program
now?” after “Hello, world!” is output to the console. New coders are
looking for something to try, to expand their skills, without becoming
overwhelmed. Often, I find, the easiest way to do this is to reproduce
an existing program. You can focus on learning the new language and
implementing an existing design, rather than coming up with something
novel.

This week’s quiz was chosen with this in mind; it is a good project
for new Rubyists, to dive into the language a bit without drowning.
Hex dump utilities have been around for ages, and there are plenty of
them, so we don’t have to think about implementing anything new;
rather, we can focus on learning Ruby. And writing a hex dump
program lets you deal with files, strings, arrays and output: some of
the basics of any code.

I’m going to look at parts from each of the few solutions, to
highlight some of the things you should know as a Rubyist. If you’re
new to Ruby, you might consider trying the quiz first before reading
this summary and the submissions. Then, after reading this summary,
revise and refactor your solution to be leaner and cleaner.

First, let’s look at the non-golfed (and slightly modified)
submission from Mikael Hoilund. It’s short, but dense with good
Ruby-isms.

i = 0
ARGF.read.scan(/.{0,16}/m) { |match|
  puts(("%08x " % i) + match.unpack('H4'*8).join(' '))
  i += 16
}

ARGF is a special constant. It isn’t a file, but can be treated as
such (as seen above, via the call to the IO#read method). It will
sequentially read through all files provided on the command-line or,
if none are provided, will read from standard input. It works together
with ARGV, the array of arguments provided to your program,
expecting that all values in ARGV are filenames. If you happen to
have a script that also expects command-line options (such as
--help), just make sure to process and/or remove them from ARGV
before using ARGF.
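
As a minimal sketch of that clean-up step, assume that every argument starting with a dash is an option and everything else is a filename. The helper name split_options and the temporary file are made up for illustration; they are not part of any submission.

```ruby
require 'tempfile'

# Hypothetical helper: separate dash-prefixed options from filenames.
def split_options(argv)
  argv.partition { |arg| arg.start_with?("-") }
end

# Simulate a command line such as: ruby dump.rb --help <some file>
sample = Tempfile.new("argf_demo")
sample.write("hello from ARGF\n")
sample.close

options, files = split_options(["--help", sample.path])

# Leave only filenames in ARGV, so ARGF does not try to open "--help".
ARGV.replace(files)
contents = ARGF.read

puts "options:  #{options.inspect}"
puts "contents: #{contents.inspect}"
```

A real script would act on the options it removed; the point here is only that ARGF sees whatever is left in ARGV at the time it first reads.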

Next is String#scan, which finds instances of the pattern provided in
the source string. In this case, Mikael is using a regular-expression that
grabs up to 16 characters (i.e. bytes) at a time, including newlines.
(The m in the regular-expression indicates a multi-line match, in
which newline characters are treated like any other character, rather
than terminators.)

String#scan can return an array of matches, but it can also be used
in block form, as shown above: the block is called once per match, with
the matched text passed in as the argument match.
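
Here is a small sketch of both forms of scan, and of what the m flag changes. The helper chunks_of is a made-up name, and it uses {1,16} rather than Mikael's {0,16} so that no empty match is produced at the end of the string.

```ruby
# Split a string into chunks of at most `size` characters, newlines included.
def chunks_of(data, size = 16)
  data.scan(/.{1,#{size}}/m)
end

data = "line one\nline two\nline three"

# Array form: scan returns every match.
p chunks_of(data)        # with /m, "." also matches "\n"
p data.scan(/.{1,16}/)   # without /m, a match cannot cross a newline

# Block form: the block runs once per match.
offset = 0
data.scan(/.{1,16}/m) do |chunk|
  puts "%08x %s" % [offset, chunk.inspect]
  offset += chunk.length
end
```

With the m flag the sample splits into two chunks; without it, each line becomes its own match.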

Another trick here is replication. These aren’t really “tricks”, as
they are standard functions defined on the class, but they can
certainly save typing and keep the code clearer. Try these in irb:

> 'H4' * 8
=> "H4H4H4H4H4H4H4H4"

> [1, 2, 3] * 2
=> [1, 2, 3, 1, 2, 3]

String#unpack is a powerful function for handling raw data. It uses
a format string (e.g. “H4H4H4H4H4H4H4H4”) to decode the raw data. In
this case, H4 indicates that four nybbles (i.e. two bytes) should be
decoded from the string. Doing that eight times decodes 16 bytes,
which is how much we are reading at a time in Mikael’s code above.

String#unpack (and the reverse Array#pack) can do a lot of work in
short order. It just takes a bit of practice to understand, and
easy access to the formats table. (On the command-line, type: ri String#unpack.)
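
To see the format string at work, here is a short sketch that decodes a 16-byte string with the same 'H4' * 8 format and then reverses it with Array#pack. The helper name hex_groups and the sample string are made up for illustration.

```ruby
# Decode a chunk into groups of four hex digits (two bytes per group).
def hex_groups(chunk)
  chunk.unpack('H4' * 8)
end

chunk = "Hello, Ruby Quiz"   # exactly 16 bytes
groups = hex_groups(chunk)
puts groups.join(' ')
# 4865 6c6c 6f2c 2052 7562 7920 5175 697a

# Array#pack goes the other way: 'H*' packs a whole string of hex digits.
puts [groups.join].pack('H*')
# Hello, Ruby Quiz
```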

Finally, take a quick look at Mikael’s golfed solution. Aside from
squeezing everything together, it makes use of some special globals:
$< (equivalent to ARGF) and $& (evaluates to the current match
from scan, eliminating the need for the match parameter to the
block). Globals like this can certainly make it more fun to “golf”
(i.e. the deliberate shrinking and obfuscation of a program), but
aren’t recommended for clarity.
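
A quick sketch of those two globals outside of a golfed program (the method name pairs is made up for illustration):

```ruby
# $< is another name for ARGF: this prints whether both refer to one object.
puts $<.equal?(ARGF)

# Inside a scan block, $& holds the text of the current match,
# so the block does not need a parameter.
def pairs(text)
  found = []
  text.scan(/../) { found << $& }
  found
end

p pairs("abcdef")
# ["ab", "cd", "ef"]
```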

Robert D. provides a clean, straightforward solution that needs
little explanation. Make sure to look at the whole of it, while I
briefly examine his output method.

require 'enumerator'

BYTES_PER_LINE = 0x10

def output address, line
  e = line.enum_for :each_byte
  puts "%04x %-#{BYTES_PER_LINE*3+1}s %s" % [ address,
    e.map{ |b|  "%02x" % b }.join(" "),
    e.map{ |b|
      0x20 > b || 0x7f < b  ? "." : b.chr
    }.join ]
end

The most useful bit here is the enumerator module, and the
enum_for method that returns an Enumerable::Enumerator object.
This object provides a number of ways to access the data. Here, Robert
accesses it one byte at a time, having passed the argument
:each_byte. Enumerators, of course, are not required to process each
byte of the source string: a couple calls to each_byte could have
done that as well. But the enumerator is a convenient package, which
can be used multiple times, can be used as an Enumerable, and removes
redundancy, all shown above.
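
For comparison, here is a sketch of the same idea for Ruby 1.9 and later, where each_byte without a block already returns an enumerator and no require is needed. It is a variation on Robert's method, not his code: the name format_line is made up, and it returns the string rather than printing it.

```ruby
# Format one line of a dump: address, hex bytes, then printable characters.
def format_line(address, line, bytes_per_line = 16)
  e = line.each_byte   # same enumerator as line.enum_for(:each_byte)
  hex   = e.map { |b| "%02x" % b }.join(" ")
  ascii = e.map { |b| 0x20 > b || 0x7f < b ? "." : b.chr }.join
  "%04x %-#{bytes_per_line * 3 + 1}s %s" % [address, hex, ascii]
end

puts format_line(0, "Hi there!\n")
```

The enumerator is created once and mapped over twice, which is the reuse described above.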

Enumerators also have access to other ways to enumerate… What if you
want to get three objects at a time from a collection? Overlapping or
disjoint? You can use :each_cons (overlapping) or :each_slice
(disjoint) to that effect.

> x = [1, 2, 3, 4, 5]
=> [1, 2, 3, 4, 5]

> x.enum_for(:each_cons, 3).to_a
=> [ [1, 2, 3], [2, 3, 4], [3, 4, 5] ]

> x.enum_for(:each_slice, 3).to_a
=> [ [1, 2, 3], [4, 5] ]

(Note that there are some changes going on with enumerators between
Ruby 1.8.6 and 1.9; here is some good information on the changes in
Ruby 1.9).
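
Tying this back to the quiz, each_slice is one way to break a string into fixed-width rows of bytes. This is only a sketch, assuming Ruby 1.9 or later; the method name dump_lines is made up and does not come from any submission.

```ruby
# Return [offset, bytes] pairs, one pair per row of `width` bytes.
def dump_lines(data, width = 16)
  data.each_byte.each_slice(width).each_with_index.map do |bytes, index|
    [index * width, bytes]
  end
end

dump_lines("Ruby Quiz #171: Hexdump\n").each do |offset, bytes|
  puts "%08x  %s" % [offset, bytes.map { |b| "%02x" % b }.join(" ")]
end
```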

Now we look briefly at Adam S.'s solution, in particular his
command-line option handling.

width = 16
group = 2
skip = 0
length = Float::MAX
do_ascii = true
file = $stdin

while (opt = ARGV.shift)
  if opt[0] == ?-
    case opt[1]
    when ?n
      length = ARGV.shift.to_i
    when ?s
      skip = ARGV.shift.to_i
    when ?g
      group = ARGV.shift.to_i
    when ?w
      width = ARGV.shift.to_i
    when ?a
      do_ascii = false
    else
      raise ArgumentError, "invalid Option #{opt}"
    end
  else
    file = File.new(opt)
  end
end

ARGV.shift is a common pattern. It removes the first item from
ARGV and returns it. Doing the assignment and while-loop test in one
motion with ARGV.shift is a simple way to look at all the
command-line arguments.

Adam’s arguments to his hexdump program are expected to be a single
character preceded by a single dash. The question-mark notation (e.g.
?n) returns the integer ASCII value of the character immediately
following. Likewise, indexing a string with a single position (e.g. opt[1])
also returns an integer ASCII value. (Note: This differs in 1.9, where
both return a one-character string instead.) So by checking the first
two characters of an argument pulled
from ARGV against the dash character and various other options
implemented, Adam can replace the default values provided at the top.
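
The following sketch shows what these expressions give under Ruby 1.9 or later, which is the assumption here; under 1.8.6 the first two would print integers instead. The helper option_letter is a made-up name.

```ruby
opt = "-n"

# Ruby 1.9 and later: both forms give a one-character String.
p ?n       # "n"
p opt[1]   # "n"

# The integer values are still available when needed.
p ?n.ord           # 110
p opt.getbyte(1)   # 110

# Both sides of the comparison changed together, so a test like Adam's
# still works: return the option letter, or nil for a non-option.
def option_letter(arg)
  arg[1] if arg[0] == ?-
end

p option_letter("-w")
p option_letter("file.bin")
```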

For a quick-and-dirty script, handling options in such a way is simple
and convenient. For more complex option-handling, you would do well to
make use of the standard optparse library, or the third-party
main library.
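
As a rough sketch of the optparse route, here are the same five switches declared with OptionParser. The method name, the hash keys and the help descriptions are guesses based on Adam's variable names, not his code.

```ruby
require 'optparse'

# Parse hexdump-style options; returns [settings, leftover arguments].
def parse_hexdump_options(argv)
  opts = { width: 16, group: 2, skip: 0, length: nil, ascii: true }
  parser = OptionParser.new do |o|
    o.banner = "Usage: hexdump.rb [options] [file]"
    o.on("-n LENGTH", Integer, "Bytes to dump")        { |v| opts[:length] = v }
    o.on("-s SKIP",   Integer, "Bytes to skip first")  { |v| opts[:skip] = v }
    o.on("-g GROUP",  Integer, "Bytes per group")      { |v| opts[:group] = v }
    o.on("-w WIDTH",  Integer, "Bytes per line")       { |v| opts[:width] = v }
    o.on("-a", "Turn off the ASCII column")            { opts[:ascii] = false }
  end
  files = parser.parse(argv)   # leaves argv untouched, returns what is left
  [opts, files]
end

opts, files = parse_hexdump_options(%w[-w 8 -a -n 32 data.bin])
p opts
p files
```

Compared with the hand-rolled loop, this converts the numeric arguments, rejects unknown switches with OptionParser::InvalidOption, and provides -h/--help for free.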

That’s it for this week! Thanks for the submissions; I certainly
learned a few things myself. (I can’t believe I didn’t know about
ARGF…)