Implementing an array based class


#1

Some of the data I deal with involves hundreds of thousands of records
where each is very predictable in the number of attributes. By going
from a hash based object to an array based object I am able to
substantially decrease memory usage. However, the array based object
I have made (see below), is really lame for two reasons:

  1. It is a pain to make a different array based class
  2. When I call ‘flatten’ on an array of these objects, my objects get
    flattened, too (how do I prevent this?)

I realize that I could just use arrays of arrays for all the objects,
but for ease of programming I’d like to be able to access an object’s
attributes by name. Does anyone know of some memory efficient and
fast way of doing this (that’s insensitive to ‘flatten’)?

I’ve tried several times to create some kind of metaprogramming
version of this but always fail.

This works (as far as what I’m basically after), but it’s very ugly:

ind_keys = {} ; ind_keys_w_eq = {}; @@ind = {}

The method name to index mapping

ind_keys = {:shape => 0, :intensity => 1, :flare => 2, :alpha =>
3, :id => 4, :xposition => 5, :yposition => 6, :zposition => 7, :color
=> 8, :parent => 9 }

@@arr_size = ind_keys.size

Make a single class hash that maps getters and setters

ind_keys.each {|k,v| ind_keys_w_eq["#{k}=".to_sym] = v }
ind_keys.merge!(ind_keys_w_eq)
ind_keys.each {|k,v| @@ind[k] = v ; @@ind["#{k}"] = v}

Manually define getters and setters

def shape ; self[0] end ; def shape=(oth) ; self[0] = oth end
def intensity ; self[1] end ; def intensity=(oth) ; self[1] = oth
end
def flare ; self[2] end ; def flare=(oth) ; self[2] = oth end
def alpha ; self[3] end ; def alpha=(oth) ; self[3] = oth end
def id ; self[4] end ; def id=(oth) ; self[4] = oth end
def xposition ; self[5] end ; def xposition=(oth) ; self[5] = oth
end
def yposition ; self[6] end ; def yposition=(oth) ; self[6] = oth
end
def zposition ; self[7] end ; def zposition=(oth) ; self[7] = oth
end
def color ; self[8] end ; def color=(oth) ; self[8] = oth end
def parent ; self[9] end ; def parent=(oth) ; self[9] = oth end

The number of total proteins sharing this color

def num_tot_proteins ; self[10] end ; def num_tot_proteins=(oth) ;
self[10] = oth end

create an array of the exact object size or set from a hash

def initialize(args=nil)
super(@@arr_size.size)
if args
if args.is_a? Hash
args.each do |k,v|
self[@@ind[k]] = v
end
end
end
end

Ideas and suggestions are most welcome. Thanks.


#2

On 3 Apr 2007, at 08:35, bwv549 wrote:

I realize that I could just use arrays of arrays for all the objects,
but for ease of programming I’d like to be able to access an object’s
attributes by name. Does anyone know of some memory efficient and
fast way of doing this (that’s insensitive to ‘flatten’)?

snip

Ideas and suggestions are most welcome. Thanks.

I don’t know about speed/memory use, but perhaps Struct (in stdlib)
or Arrayfields:

http://raa.ruby-lang.org/project/arrayfields/

does what you want?

Alex G.

Bioinformatics Center
Kyoto University


#3

Kyoto University
Thanks for the suggestions. I’ve looked at both as options and took
the time to do some profiling (speed and memory). [ See the SUMMARY
at the end if you want the short answer ]
It’s really only pseudo-scientific, but it’s better than where I was
at before.

Memory profiling turns out to be kind of a pain to get going. I
created a little plugin to do this based on
http://t-a-w.blogspot.com/2007/02/ruby-and-judy.html

module MemoryUsage

MY_START_MEMORY = {}

def mem_start
file = IO.read("/proc/#{Process.pid}/smaps")
MY_START_MEMORY[‘size’] = _size(file)
MY_START_MEMORY[‘rss’] = _rss(file)
end

def mem_stop
file = IO.read("/proc/#{Process.pid}/smaps")
size_dif = _size(file) - MY_START_MEMORY[‘size’]
rss_dif = _rss(file) - MY_START_MEMORY[‘rss’]
“SIZE: #{size_dif} kB, RSS: #{rss_dif} kB”
end

def _size(file_string)
file_string.scan(/^Size:\s+(\d+) kB/).map{|x| x[0].to_i}.inject{|
a,b| a+b}
end

def _rss(file_string)
file_string.scan(/^Rss:\s+(\d+) kB/).map{|x| x[0].to_i}.inject{|
a,b| a+b}
end
end

class Object
include MemoryUsage
end

then you simply require ‘memory_usage’

and put ‘mem_start’ in your code at the beginning

and ‘puts mem_stop’ in your code at the end

I will summarize the results. A struct is really nice and easy to
use:

MyClass = Struct.new(:att1, :att2, :att3)
myobject = MyClass.new(value1, value2, value3)
myobject.att1 = a_different_value # can get/set by method
puts myobject[0] ## puts value1 (can get/set by array index)

I made an array of about 500,000 arrays (6 values each). Then tested
each style of array for speed in object creation, object access by
keyword and object access by index (where applicable).

MyObject (this is my object)

class MyObject < Array

def a ; self[0] ; end
def b ; self[1] ; end
def c ; self[2] ; end
def d ; self[3] ; end
def e ; self[4] ; end
def f ; self[5] ; end
def g ; self[6] ; end

end

STRUCT CLASS

MyStructClass = Struct.new( *$fields )

MyStructClass.new( *atts ) ## -> to create object

NORMAL OBJECT

class NormalObject
attr_accessor( *$fields )
def initialize(*args)
(@a, @b, @c, @d, @e, @f, @g) = args
end
end

Access by index:
MyObject, Structs, and Arrays all have equivalent access times by
index.

Access by keywords:
Structs have essentially equivalent access times by keyword as by
index (blazing fast) and its about the same as for a normal object.
MyObject takes about 2 times as long to access by keyword.

Initialization:
Arrays initialize fastest followed by MyObject. Struct takes about
twice as long to initialize as MyObject (I think because you have to
pass in the field as a *list). Normal objects take the longest to
initialize (even though I’m efficiently setting all properties in one
call).

MEMORY:
Arrays take about 20 times less memory than structs or MyObject.
MyObject uses a little less memory than a struct, but it depends on
how the initialization is implemented. Normal objects (in this
experiment, mileage will vary I think) took a little more than twice
as much memory as Structs or MyObject.

I didn’t test ArrayFields (even though it looks like a very useful
class) based on my understanding that each array that is created must
then be set with the fields that it will use (which seems like it
would be too much overhead for what I’m doing, although I could be
wrong).

*** SUMMARY *** (for situations with gazillions of set-length
objects):

  1. don’t use objects–if you can, use arrays. Faster and way leaner.
  2. If you have to use an object, struct seems like a good
    compromise. Simple to use, fast, extendable, fast access by keywords,
    a little slow on object creation. Significantly more memory than an
    array, but much less than a normal object.
  3. The ugly MyObject from above is probably best of you want faster
    initialization and plan to access mostly by index. However, it’s not
    implemented in any easy to create way. This is where someone with some
    meta-programming magic might be able to really clean things up…
  4. Avoid normal objects. They take a long time to initialize and
    take up a lot of memory.

#4

Thanks for the pointers. I’d love to know how to do proper memory
benchmarking. It does not seem trivial, I guess.

Regarding object initialization, I guess I’m really interested in
total time to initialize the object and fill in all attributes with
values (as this is what I’m going to be facing before I can use any of
the objects). I’ve incorporated this aspect into a similar benchmark
for comparison:

Rehearsal --------------------------------------------------
Class1 5.666667 0.533333 6.200000 ( 3.730552)
Class2 5.983333 0.516667 6.500000 ( 3.904695)
Struct 3.950000 0.250000 4.200000 ( 2.518688)
ArrayBased 2.033333 0.350000 2.383333 ( 1.431213)
Array.new 2.100000 0.316667 2.416667 ( 1.443011)
Array [] 0.833333 0.266667 1.100000 ( 0.671233)
---------------------------------------- total: 22.800000sec

                 user     system      total        real

Class1 5.550000 0.600000 6.150000 ( 3.686555)
Class2 5.716667 0.666667 6.383333 ( 3.834078)
Struct 3.883333 0.316667 4.200000 ( 2.527558)
ArrayBased 2.166667 0.266667 2.433333 ( 1.445458)
Array.new 2.150000 0.300000 2.450000 ( 1.474190)
Array [] 0.900000 0.233333 1.133333 ( 0.684567)


#!/usr/bin/ruby

require ‘benchmark’
include Benchmark

attribute_array = (1…7).to_a

class Class1
def initialize(attribute_array)
(@f1, @f2, @f3, @f4, @f5, @f6, @f7) = attribute_array
end
end

class Class2
def initialize(f1=nil, f2=nil, f3=nil, f4=nil, f5=nil, f6=nil,
f7=nil)
@f1=f1
@f2=f2
@f3=f3
@f4=f4
@f5=f5
@f6=f6
@f7=f7
end
end

StructClass = Struct.new(:f1, :f2, :f3, :f4, :f5, :f6, :f7)
filled_in = StructClass.new( *attribute_array )

class ArrayBased < Array
end

filled_in = ArrayBased.new(attribute_array)

REP = 1_000_000

bmbm(15) do |re|
re.report(“Class1”) { REP.times { Class1.new(attribute_array) } }
re.report(“Class2”) { REP.times { Class2.new(*attribute_array) } }
re.report(“Struct”) { REP.times
{ StructClass.new(*attribute_array) } }
re.report(“ArrayBased”) { REP.times
{ ArrayBased.new(attribute_array) } }
re.report(“Array.new”) { REP.times { Array.new(attribute_array) } }
re.report(“Array []”) { REP.times { [*attribute_array] } }
end


For speed, at least, my initialize observations appear to hold (in
this benchmark, anyway).
Ordered by fastest filled object creation: Array [], ArrayBased &
Array.new, Struct, Class

Kind regards,
john


#5

On Tue, 3 Apr 2007, bwv549 wrote:

would be too much overhead for what I’m doing, although I could be
wrong).

arrayfields was specifically design for the case you’re coding. in
particular
it was designed to provide fast keyword access and low memory overhead -
i was
using it with huge postgresql result sets which were arrays. as you
point
out, arrays use far less memory that any other solution. they are also
much
nicer to work with

csv = tuple.join ‘,’

for example. anyhow, the @fields object is not copied for each array -
only a
reference is held. so if one does this

fields = %w( a b c )

huge_number_tuples.each{|tuple| tuple.fields = fields}

then ALL tuples share the fields object - a copy is not taken. anyhow,
i’d
love to see your tests using it.

regards.

-a


#6

i’d love to see your tests using it.

Thanks for the insight. I’ve included an arrayfields implementation
on the above benchmark:

Rehearsal --------------------------------------------------
ArrayFields 8.433333 0.800000 9.233333 ( 5.585626)
Class1 5.566667 0.500000 6.066667 ( 3.631757)
Class2 6.050000 0.550000 6.600000 ( 3.963679)
Struct 3.583333 0.300000 3.883333 ( 2.335911)
ArrayBased 2.033333 0.266667 2.300000 ( 1.381698)
Array.new 1.950000 0.333333 2.283333 ( 1.371351)
Array [] 0.833333 0.316667 1.150000 ( 0.689010)
---------------------------------------- total: 31.516667sec

                 user     system      total        real

ArrayFields 8.516667 0.816667 9.333333 ( 5.609708)
Class1 5.366667 0.650000 6.016667 ( 3.617349)
Class2 5.933333 0.566667 6.500000 ( 3.893614)
Struct 3.616667 0.266667 3.883333 ( 2.360849)
ArrayBased 1.933333 0.316667 2.250000 ( 1.374079)
Array.new 1.883333 0.333333 2.216667 ( 1.370410)
Array [] 0.866667 0.283333 1.150000 ( 0.696223)

Here’s how I tested it (let me know if this is done incorrectly):

require ‘arrayfields’
fields = %w(f1 f2 f3 f4 f5 f6 f7)

bmbm(15) do |re|
re.report(“ArrayFields”) { REP.times { x = [*attribute_array];
x.fields = fields } }

… same as before

end

I haven’t tested the memory on it, but it would appear to be
comparable to an Array (very low memory usage) from your
implementation. It does have slightly higher overhead in the time
required for creation (if I did it correctly). Nonetheless,
arrayfields seems ideal for database work as you can simply label the
fields without creating new classes for everything. In the particular
case before me, I have only a handful of object types, but lots of
them.

Thanks,
john


#7

Based on the model I was using above for an array based object
(inherits from array) and similar to the way Struct works to
initialize it, I’ve worked up this guy (with some effor):

class ArrayClass
def self.create(fields)
new_class = Class.new(Array)
new_class.class_eval(‘undef_method :flatten’)
new_class.class_eval(‘undef_method :flatten!’)
new_class.class_eval("@@keyword_to_index = {}")
fields.each_with_index do |field,i|
field_sym = field.to_sym
## THE GETTERS AND SETTERS:
new_class.class_eval(“def #{field}() self[#{i}] end”)
new_class.class_eval(“def #{field}=(val) self[#{i}]=val end”)

  new_class.class_eval("@@keyword_to_index[:#{field}] = #{i}")
end
new_class.class_eval('def self.indices() @@keyword_to_index end')
new_class.class_eval('def indices() @@keyword_to_index end')
new_class

end
end

‘flatten’ wreaks havoc on objects that inherit from array. I’ve

tried in different ways to make the above object insensitive to
flatten, with no real luck. It will break code with other array-based
objects that want to be flattened, but allows us to use flatten on
arrays of our objects:

class Array

WARNING! I’ve redefined Array#flatten to only work on Array

objects (not

derived objects)

def flatten
new_array = []
self.each do |v|
if v.class == Array
new_array.push( *(v.flatten) )
else
new_array << v
end
end
new_array
end

WARNING! I’ve redefined Array#flatten! to only work on Array

objects (not

derived objects)

def flatten!
self.replace(flatten)
end
end

USAGE:

MyClass = ArrayClass.create( %w(f1 f2 f3 f4 f5) )
obj = MyClass.new( %w(v1 v2 v3 v4 v5) )
puts obj.f1
obj.f2 = ‘something_else’
MyClass.indices # -> hash giving index based on keyword.
obj.indices # -> same

I’ve written some tests and it seems to fly.


#8

On 03.04.2007 08:07, bwv549 wrote:

meta-programming magic might be able to really clean things up…
4. Avoid normal objects. They take a long time to initialize and
take up a lot of memory.

I am not sure I agree to your assessments about performance. A class is
actually the fastest if you omit the initialize definition.

Also, I am not sure whether your memory measurements are accurate: as
far as I can see you do neither invoke GC.start (in order to try to
force GC, which is an unreliable method anyway) nor GC.disable (in order
to stop GC completely and thus be able to measure allocation).

Kind regards

robert

09:58:18 [Temp]: ./create.rb
Rehearsal --------------------------------------------------
Struct 3.281000 0.000000 3.281000 ( 3.360000)
Class 0.703000 0.000000 0.703000 ( 0.840000)
Class (init) 6.375000 0.000000 6.375000 ( 7.962000)
Class (init2) 7.203000 0.000000 7.203000 ( 8.807000)
Array 1.016000 0.000000 1.016000 ( 1.351000)
Array.new 0.812000 0.000000 0.812000 ( 0.881000)
Array.new(5) 1.516000 0.000000 1.516000 ( 1.890000)
OpenStruct 4.672000 0.000000 4.672000 ( 5.969000)
---------------------------------------- total: 25.578000sec

                  user     system      total        real

Struct 2.687000 0.000000 2.687000 ( 3.359000)
Class 0.797000 0.000000 0.797000 ( 0.845000)
Class (init) 6.687000 0.000000 6.687000 ( 7.929000)
Class (init2) 7.016000 0.000000 7.016000 ( 8.815000)
Array 1.125000 0.000000 1.125000 ( 1.347000)
Array.new 0.781000 0.000000 0.781000 ( 0.875000)
Array.new(5) 1.547000 0.000000 1.547000 ( 1.879000)
OpenStruct 4.750000 0.000000 4.750000 ( 5.985000)
09:59:31 [Temp]: cat create.rb
#!/usr/bin/ruby

require ‘ostruct’
require ‘benchmark’
include Benchmark

St1 = Struct.new :f1, :f2, :f3, :f4, :f5

class St2; attr_accessor :f1, :f2, :f3, :f4, :f5; end

class St3
attr_accessor :f1, :f2, :f3, :f4, :f5

def initialize(f1=nil, f2=nil, f3=nil, f4=nil, f5=nil)
@f1=f1
@f2=f2
@f3=f3
@f4=f4
@f5=f5
end
end

class St4
attr_accessor :f1, :f2, :f3, :f4, :f5

def initialize(*a)
@f1, @f2, @f3, @f4, @f5 = a
end
end

REP = 1_000_000

bmbm(15) do |re|
re.report(“Struct”) { REP.times { St1.new } }
re.report(“Class”) { REP.times { St2.new } }
re.report(“Class (init)”) { REP.times { St3.new } }
re.report(“Class (init2)”) { REP.times { St4.new } }
re.report(“Array”) { REP.times { [] } }
re.report(“Array.new”) { REP.times { Array.new } }
re.report(“Array.new(5)”) { REP.times { Array.new(5) } }
re.report(“OpenStruct”) { REP.times { OpenStruct.new } }
end
09:59:33 [Temp]:


#9

On 03.04.2007 21:34, removed_email_address@domain.invalid wrote:

so, now you see what you should - all approaches are within the same
order of magnitude: there is no difference.

Well, factor two between slowest and fastest. Not dramatic but may be
noticeable when dealing with large quantities of objects.

But I also (?) have this “premature optimization” feeling about this
thread. And we don’t have seen real memory measurements yet which
seemed to have been the root concern…

Kind regards

robert


#10

On Wed, 4 Apr 2007, bwv549 wrote:

ArrayBased 2.033333 0.266667 2.300000 ( 1.381698)
Array.new 1.883333 0.333333 2.216667 ( 1.370410)
x.fields = fields } }

… same as before

end

I haven’t tested the memory on it, but it would appear to be comparable to
an Array (very low memory usage) from your implementation. It does have
slightly higher overhead in the time required for creation (if I did it
correctly). Nonetheless, arrayfields seems ideal for database work as you
can simply label the fields without creating new classes for everything. In
the particular case before me, I have only a handful of object types, but
lots of them.

seems like you’re realy comparing apples to oranges here. for instance
your
Class1 contructor only every takes a copy of ‘attribute_array’, while
some
ctors take copies, or even several. i made a mod so all calles to
attribute_array created a new array so you can actually measure the
speed of
the ctor not the speed of copying vs not copying an array (we know which
one is
faster!). and not all your implimentations provide keyword access - but
i
guess you know that…

also i used the new (version 3.7.0) arrayfields interface to build a
class
that way. here are the results

harp:~ > cat a.rb
#! /usr/bin/env ruby

require ‘rubygems’
require ‘arrayfields’ # 3.7.0!
require ‘benchmark’
include Benchmark

class Class1
def initialize(attribute_array)
(@f1, @f2, @f3, @f4, @f5, @f6, @f7) = attribute_array
end
end

class Class2
def initialize(f1=nil, f2=nil, f3=nil, f4=nil, f5=nil, f6=nil,
f7=nil)
@f1=f1
@f2=f2
@f3=f3
@f4=f4
@f5=f5
@f6=f6
@f7=f7
end
end

StructClass = Struct.new(:f1, :f2, :f3, :f4, :f5, :f6, :f7)

class ArrayBased < Array
end

=begin

this is all Array.fields does

class ArrayFieldsBased < Array
FIELDS = ArrayFields::FieldSet.new [:f1, :f2, :f3, :f4, :f5, :f6,
:f7]
include ArrayFields
def initialize *a, &b
super
ensure
@fieldset = FIELDS
end
end
=end

ArrayFieldsBased = Array.fields :f1, :f2, :f3, :f4, :f5, :f6, :f7

def attribute_array() (1…7).to_a end

REP = 1_000_00 #_000

bmbm(15) do |re|
re.report(“ArrayFieldsBased”) { REP.times { ArrayFieldsBased.new
attribute_array } }
re.report(“Class1”) { REP.times { Class1.new attribute_array } }
re.report(“Class2”) { REP.times { Class2.new *attribute_array} }
re.report(“Struct”) { REP.times { StructClass.new *attribute_array }
}
re.report(“ArrayBased”) { REP.times { ArrayBased.new attribute_array
} }
re.report(“Array.new”) { REP.times { Array.new attribute_array } }
re.report(“Array []”) { REP.times { [*attribute_array] } }
end

and here’s the results

[ahoward@localhost ~]$ ruby a.rb
Rehearsal ----------------------------------------------------
ArrayFieldsBased 1.440000 0.010000 1.450000 ( 1.709076)
Class1 1.550000 0.020000 1.570000 ( 6.274899)
Class2 1.680000 0.000000 1.680000 ( 1.945875)
Struct 1.230000 0.010000 1.240000 ( 1.486386)
ArrayBased 1.000000 0.000000 1.000000 ( 1.309320)
Array.new 0.980000 0.000000 0.980000 ( 1.178536)
Array [] 0.750000 0.000000 0.750000 ( 0.935874)
------------------------------------------- total: 8.670000sec

                    user     system      total        real

ArrayFieldsBased 1.560000 0.000000 1.560000 ( 1.787502)
Class1 1.400000 0.010000 1.410000 ( 1.627891)
Class2 1.480000 0.000000 1.480000 ( 1.742301)
Struct 1.180000 0.010000 1.190000 ( 1.299734)
ArrayBased 0.970000 0.000000 0.970000 ( 1.050021)
Array.new 0.980000 0.000000 0.980000 ( 1.089647)
Array [] 0.740000 0.010000 0.750000 ( 0.888502)

so, now you see what you should - all approaches are within the same
order of
magnitude: there is no difference.

regards.

-a