Records and Arrays (#170)

mansfiem · July 24, 2008, 10:08pm

It is worthwhile to note that, for most real situations, this week’s
quiz problem isn’t necessary. From a human perspective, we naturally
organize things in aggregate, as structured records. Whether it be
object-oriented data, rows in a SQL table, XML records or other,
that’s how we naturally organize things. Except for a few specialized
conditions, collections of records are sufficient for our
organizational needs as either programmers or users.

What are those specialized conditions? I can think of a couple
situations:

Performance. When dealing with older hardware, or perhaps
specialized hardware (e.g. vector units on some processors or GPUs),
it can be quite beneficial to “de-organize” data. Tight, little loops
can then quickly process individual bits of data, and not be required
to either process nor hold the whole data record.
Interface. If the data must be provided to another library or
component that expects parallel bits of data (e.g. a graphing
component as suggested by the quiz), then obviously we need to provide
the data as requested.

In the quiz, I asked that first for a data moving solution. This is
often appropriate for performance situations, since an adapter does
not provide the compact data that performance may require. (For
example, some vector units may have limited memory, so it’s important
to extract only the data needed for vector processing.)

The extra-credit adapter (wrapper) solution is more appropriate for
interface situations. Here, especially with Ruby, we can easily wrap a
new interface around the old, and make one set of data work in
numerous situations.

Let’s start by looking at a solution that moves data into a new
“record of arrays” structure, submitted by Andrea F..

def aor_to_roa(arr)
  hash = { }
  arr.each do |record|
    record.marshal_dump.each_key { |field|
      (hash[field] ||= []) << record.send(field)
    }
  end
  OpenStruct.new(hash)
end

This function accepts arrays of OpenStruct instances, as specified
by the quiz. The outer loop enumerates over each record. For each
record, marshal_dump is called: this is an instance method of
OpenStruct and generates a hash consisting of field names associated
to the field values.

For each key in that hash, Andrea first ensures that the new hash is
either already initialized, or gets initialized to an empty array. The
old record is sent the field name as a message (which returns the
field value), which is appended to the new hash’s array for that
field. Finally, a new OpenStruct instance is returned, which takes
the new hash as its content.

(On a side note… Technically, for vector units and similar
situations, you’d need to move the actual data into a memory block,
rather than mere copying of object references as done above… but the
code is sufficient for displaying the basic idea.)

I’m going to pass on talking about the opposite function: that is,
moving data from a “record of arrays” to an “array of records.” Given
that we usually store data as an “array of records,” this is less
important since we can refer to the original data. Still, take a look
at Andrea’s roa_to_aor method, which maintains a similar approach
and simplicity.

I’m also not going to talk about pushing updates back to the original
data for this case. Generally, this isn’t desireable for the typical
cases (i.e. performance) that move/duplicate data. Either the data
isn’t being modified, or if being modified, it gets pushed out the end
of the pipe and intentionally lost. (Had I considered this when
writing the quiz, I would have skipped my request for the roa_to_aor
method.)

Situations that focus on interface, however, will prefer to work with
the original data. In fact, it’s important that the adapter allow the
original data to be modified to the extent it was originally
modifiable: read-only data should not suddenly become read-write.
(Another thing I should have put into the original quiz description…
sorry.)

First, I want to look at the adapter from James G., which is quite
simple and satifies the quiz description as was written. (I’ve left
out the method __enum__, since it only existed to support
roa_to_aor.)

class MapToEnum
   instance_methods.each { |meth|
       undef_method(meth) unless meth =~ /\A__/
   }

   def initialize(enum)
     @enum = enum
   end

   def method_missing(meth, *args, &block)
     @enum.map { |o| o.send(meth, *args, &block) }
   end
end

The first line undefines almost all the instance methods of this class
(excepting those that start with two underscores), and does so
immediately. To see what this leaves behind, try the following in irb:

> class WhatIsLeft
>   instance_methods.each { |m| undef_method(m) unless m =~ /\A__/ }
>   instance_methods
> end

=> ["__id__", "__send__"]

Basically, the intent here is that MapToEnum do very little on its
own: most work will be passed onto the data, which we’ll see in a
moment. Handly little thing… at least when using the class, but
better to comment it out while debugging it.

The MapToEnum initializer simply remembers the array passed in,
which is then used in method_missing. Here, any methods called
except those above are sent to each of the objects in @enum. Using
@enum.map means that arrays will be returned. And because all
instance methods were cleared out first, you can do some other things
besides getting the field names.

# Assuming people is an array of Person structs...

> peeps = MapToEnum.new(people)
> peeps.name
=> ["James", "Dana", "Matthew"]

> peeps.name[2]
=> "Matthew"

> peeps.class
=> [Person, Person, Person]

What doesn’t work with James’ solution is modifying the data.

> peeps.name[2] = "Bob"
=> "Bob"

> peeps.name
=> ["James", "Dana", "Matthew"]

> people.map { |x| x.name }
=> ["James", "Dana", "Matthew"]

This is because the attempt to modify peeps.name[2] actually changes
the reference contained in the third slot of the array generated by
peeps.name. To change the source data, peeps.name[2] would need to
represent the name field of the original record.

Let’s look at one more solution, this one from Frederick C..

class ArrayOfRecordsWrapper
   def initialize(array)
     @array = array
   end

   def method_missing(name, *args)
     FieldProxy.new(@array, name)
   end

   class FieldProxy
     instance_methods.each { |m| undef_method m unless m =~ /(^__)/

}

     def initialize array, name
       @array, @name = array, name
     end

     def [](index)
       @array[index].send @name
     end

     def []=(index,value)
       @array[index].send(@name.to_s + '=', value)
     end

     def to_a
       @field_array = @array.collect {|a| a.send @name}
     end

     def method_missing(*args, &block)
       to_a.send *args, &block
     end
   end
end

Frederick’s solution is, in many ways, similar to James’ solution,
though Frederick adds in a proxy object to manage named fields. In
fact, if you remove the square bracket methods [] and []= in the
proxy, simplify and refactor, you’d essentially have James’ solution.

Yet, Frederick’s solution does work for manipulating the data.

> people[2].name
=> "Matthew"

> peeps = ArrayOfRecordsWrapper.new(people)

> peeps.name
=> ["James", "Dana", "Matthew"]

> peeps.name[2] = "Bob"
=> "Bob"

> peeps.name
=> ["James", "Dana", "Bob"]

> people[2].name
=> "Bob"

This works specifically because Frederick provided the square bracket
methods, which manipulate the original data directly, rather than
through the array generated from to_a and method_missing. If we
attempt to change the original data through those (rather than the
proxy supplied square bracket methods), it won’t work.

> peeps.name
=> ["James", "Dana", "Bob"]

> peeps.name.to_a[2] = "Matthew"
=> "Matthew"

> peeps.name
=> ["James", "Dana", "Bob"]     # not changed!

It might be worthwhile to hide to_a, though I expect that’s not the
only way to circumvent the goal of modifying the original data. (And
with a dynamic language such as Ruby, I’m pretty sure of it.) If we
set a goal of defining and supporting common usage, then Frederick’s
solution is a pretty satisfactory one.