Forum: Ruby implementing an array based class

Announcement (2017-05-07): www.ruby-forum.com is now read-only since I unfortunately do not have the time to support and maintain the forum any more. Please see rubyonrails.org/community and ruby-lang.org/en/community for other Rails- und Ruby-related community platforms.
bwv549 (Guest)
on 2007-04-03 03:36
(Received via mailing list)
Some of the data I deal with involves hundreds of thousands of records
where each is very predictable in the number of attributes.  By going
from a hash based object to an array based object I am able to
substantially decrease memory usage.  However, the array based object
I have made (see below), is really lame for two reasons:

1. It is a pain to make a different array based class
2. When I call 'flatten' on an array of these objects, my objects get
flattened, too (how do I prevent this?)

I realize that I could just use arrays of arrays for all the objects,
but for ease of programming I'd like to be able to access an object's
attributes by name.  Does anyone know of some memory efficient and
fast way of doing this (that's insensitive to 'flatten')?

I've tried several times to create some kind of metaprogramming
version of this but always fail.

This works (as far as what I'm basically after), but it's very ugly:

  ind_keys = {} ; ind_keys_w_eq = {}; @@ind = {}
  ## The method name to index mapping
  ind_keys = {:shape => 0, :intensity => 1, :flare => 2, :alpha =>
3, :id => 4, :xposition => 5, :yposition => 6, :zposition => 7, :color
=> 8, :parent => 9 }

  @@arr_size = ind_keys.size

  ## Make a single class hash that maps getters and setters
  ind_keys.each {|k,v| ind_keys_w_eq["#{k}=".to_sym] = v }
  ind_keys.merge!(ind_keys_w_eq)
  ind_keys.each {|k,v| @@ind[k] = v ; @@ind["#{k}"] = v}

  ## Manually define getters and setters
  def shape ; self[0] end ; def shape=(oth) ; self[0] = oth end
  def intensity ; self[1] end ; def intensity=(oth) ; self[1] = oth
end
  def flare ; self[2] end ; def flare=(oth) ; self[2] = oth end
  def alpha ; self[3] end ; def alpha=(oth) ; self[3] = oth end
  def id ; self[4] end ; def id=(oth) ; self[4] = oth end
  def xposition ; self[5] end ; def xposition=(oth) ; self[5] = oth
end
  def yposition ; self[6] end ; def yposition=(oth) ; self[6] = oth
end
  def zposition ; self[7] end ; def zposition=(oth) ; self[7] = oth
end
  def color ; self[8] end ; def color=(oth) ; self[8] = oth end
  def parent ; self[9] end ; def parent=(oth) ; self[9] = oth end
  # The number of total proteins sharing this color
  def num_tot_proteins ; self[10] end ; def num_tot_proteins=(oth) ;
self[10] = oth end


  ## create an array of the exact object size or set from a hash
  def initialize(args=nil)
    super(@@arr_size.size)
    if args
      if args.is_a? Hash
        args.each do |k,v|
          self[@@ind[k]] = v
        end
      end
    end
  end

Ideas and suggestions are most welcome.  Thanks.
Alex G. (Guest)
on 2007-04-03 04:52
(Received via mailing list)
On 3 Apr 2007, at 08:35, bwv549 wrote:

> I realize that I could just use arrays of arrays for all the objects,
> but for ease of programming I'd like to be able to access an object's
> attributes by name.  Does anyone know of some memory efficient and
> fast way of doing this (that's insensitive to 'flatten')?

snip

> Ideas and suggestions are most welcome.  Thanks.

I don't know about speed/memory use, but perhaps Struct (in stdlib)
or Arrayfields:

http://raa.ruby-lang.org/project/arrayfields/

does what you want?

Alex G.

Bioinformatics Center
Kyoto University
bwv549 (Guest)
on 2007-04-03 10:11
(Received via mailing list)
> Kyoto University
Thanks for the suggestions.  I've looked at both as options and took
the time to do some profiling (speed and memory).  [ See the SUMMARY
at the end if you want the short answer ]
It's really only pseudo-scientific, but it's better than where I was
at before.

Memory profiling turns out to be kind of a pain to get going.  I
created a little plugin to do this based on
http://t-a-w.blogspot.com/2007/02/ruby-and-judy.html

module MemoryUsage

  MY_START_MEMORY = {}

  def mem_start
    file = IO.read("/proc/#{Process.pid}/smaps")
    MY_START_MEMORY['size'] = _size(file)
    MY_START_MEMORY['rss'] = _rss(file)
  end

  def mem_stop
    file = IO.read("/proc/#{Process.pid}/smaps")
    size_dif = _size(file) - MY_START_MEMORY['size']
    rss_dif = _rss(file) - MY_START_MEMORY['rss']
    "SIZE: #{size_dif} kB, RSS: #{rss_dif} kB"
  end

  def _size(file_string)
    file_string.scan(/^Size:\s+(\d+) kB/).map{|x| x[0].to_i}.inject{|
a,b| a+b}
  end

  def _rss(file_string)
    file_string.scan(/^Rss:\s+(\d+) kB/).map{|x| x[0].to_i}.inject{|
a,b| a+b}
  end
end

class Object
  include MemoryUsage
end

## then you simply require 'memory_usage'
## and put 'mem_start' in your code at the beginning
## and 'puts mem_stop' in your code at the end

I will summarize the results.  A struct is really nice and easy to
use:

  MyClass = Struct.new(:att1, :att2, :att3)
  myobject = MyClass.new(value1, value2, value3)
  myobject.att1 = a_different_value  # can get/set by method
  puts myobject[0] ## puts value1  (can get/set by array index)

I made an array of about 500,000 arrays (6 values each).  Then tested
each style of array for speed in object creation, object access by
keyword and object access by index (where applicable).

## MyObject (this is my object)
class MyObject < Array

  def a ; self[0] ; end
  def b ; self[1] ; end
  def c ; self[2] ; end
  def d ; self[3] ; end
  def e ; self[4] ; end
  def f ; self[5] ; end
  def g ; self[6] ; end

end

## STRUCT CLASS
MyStructClass = Struct.new( *$fields )
# MyStructClass.new( *atts )   ## -> to create object

## NORMAL OBJECT
class NormalObject
  attr_accessor( *$fields )
  def initialize(*args)
    (@a, @b, @c, @d, @e, @f, @g) = args
  end
end


Access by index:
MyObject, Structs, and Arrays all have equivalent access times by
index.

Access by keywords:
Structs have essentially equivalent access times by keyword as by
index (blazing fast) and its about the same as for a normal object.
MyObject takes about 2 times as long to access by keyword.

Initialization:
Arrays initialize fastest followed by MyObject.  Struct takes about
twice as long to initialize as MyObject (I think because you have to
pass in the field as a *list).  Normal objects take the longest to
initialize (even though I'm efficiently setting all properties in one
call).

MEMORY:
Arrays take about 20 times less memory than structs or MyObject.
MyObject uses a little less memory than a struct, but it depends on
how the initialization is implemented.  Normal objects (in this
experiment, mileage will vary I think) took a little more than twice
as much memory as Structs or MyObject.

I didn't test ArrayFields (even though it looks like a very useful
class) based on my understanding that each array that is created must
then be set with the fields that it will use (which seems like it
would be too much overhead for what I'm doing, although I could be
wrong).

*** SUMMARY *** (for situations with gazillions of set-length
objects):
1.  don't use objects--if you can, use arrays.  Faster and way leaner.
2.  If you have to use an object, struct seems like a good
compromise.  Simple to use, fast, extendable, fast access by keywords,
a little slow on object creation.  Significantly more memory than an
array, but much less than a normal object.
3.  The ugly MyObject from above is probably best of you want faster
initialization and plan to access mostly by index.  However, it's not
implemented in any easy to create way. This is where someone with some
meta-programming magic might be able to really clean things up...
4.  Avoid normal objects.  They take a long time to initialize and
take up a lot of memory.
Robert K. (Guest)
on 2007-04-03 13:46
(Received via mailing list)
On 03.04.2007 08:07, bwv549 wrote:
> meta-programming magic might be able to really clean things up...
> 4.  Avoid normal objects.  They take a long time to initialize and
> take up a lot of memory.

I am not sure I agree to your assessments about performance.  A class is
actually the fastest if you omit the initialize definition.

Also, I am not sure whether your memory measurements are accurate: as
far as I can see you do neither invoke GC.start (in order to try to
force GC, which is an unreliable method anyway) nor GC.disable (in order
to stop GC completely and thus be able to measure allocation).

Kind regards

  robert


09:58:18 [Temp]: ./create.rb
Rehearsal --------------------------------------------------
Struct           3.281000   0.000000   3.281000 (  3.360000)
Class            0.703000   0.000000   0.703000 (  0.840000)
Class (init)     6.375000   0.000000   6.375000 (  7.962000)
Class (init2)    7.203000   0.000000   7.203000 (  8.807000)
Array            1.016000   0.000000   1.016000 (  1.351000)
Array.new        0.812000   0.000000   0.812000 (  0.881000)
Array.new(5)     1.516000   0.000000   1.516000 (  1.890000)
OpenStruct       4.672000   0.000000   4.672000 (  5.969000)
---------------------------------------- total: 25.578000sec

                      user     system      total        real
Struct           2.687000   0.000000   2.687000 (  3.359000)
Class            0.797000   0.000000   0.797000 (  0.845000)
Class (init)     6.687000   0.000000   6.687000 (  7.929000)
Class (init2)    7.016000   0.000000   7.016000 (  8.815000)
Array            1.125000   0.000000   1.125000 (  1.347000)
Array.new        0.781000   0.000000   0.781000 (  0.875000)
Array.new(5)     1.547000   0.000000   1.547000 (  1.879000)
OpenStruct       4.750000   0.000000   4.750000 (  5.985000)
09:59:31 [Temp]: cat create.rb
#!/usr/bin/ruby

require 'ostruct'
require 'benchmark'
include Benchmark

St1 = Struct.new :f1, :f2, :f3, :f4, :f5

class St2; attr_accessor :f1, :f2, :f3, :f4, :f5; end

class St3
   attr_accessor :f1, :f2, :f3, :f4, :f5

   def initialize(f1=nil, f2=nil, f3=nil, f4=nil, f5=nil)
     @f1=f1
     @f2=f2
     @f3=f3
     @f4=f4
     @f5=f5
   end
end

class St4
   attr_accessor :f1, :f2, :f3, :f4, :f5

   def initialize(*a)
     @f1, @f2, @f3, @f4, @f5 = a
   end
end

REP = 1_000_000

bmbm(15) do |re|
   re.report("Struct") { REP.times { St1.new } }
   re.report("Class") { REP.times { St2.new } }
   re.report("Class (init)") { REP.times { St3.new } }
   re.report("Class (init2)") { REP.times { St4.new } }
   re.report("Array") { REP.times { [] } }
   re.report("Array.new") { REP.times { Array.new } }
   re.report("Array.new(5)") { REP.times { Array.new(5) } }
   re.report("OpenStruct") { REP.times { OpenStruct.new } }
end
09:59:33 [Temp]:
bwv549 (Guest)
on 2007-04-03 18:06
(Received via mailing list)
Thanks for the pointers.  I'd love to know how to do proper memory
benchmarking.  It does not seem trivial, I guess.

Regarding object initialization, I guess I'm really interested in
total time to initialize the object and fill in all attributes with
values (as this is what I'm going to be facing before I can use any of
the objects).  I've incorporated this aspect into a similar benchmark
for comparison:

Rehearsal --------------------------------------------------
Class1           5.666667   0.533333   6.200000 (  3.730552)
Class2           5.983333   0.516667   6.500000 (  3.904695)
Struct           3.950000   0.250000   4.200000 (  2.518688)
ArrayBased       2.033333   0.350000   2.383333 (  1.431213)
Array.new        2.100000   0.316667   2.416667 (  1.443011)
Array []         0.833333   0.266667   1.100000 (  0.671233)
---------------------------------------- total: 22.800000sec

                     user     system      total        real
Class1           5.550000   0.600000   6.150000 (  3.686555)
Class2           5.716667   0.666667   6.383333 (  3.834078)
Struct           3.883333   0.316667   4.200000 (  2.527558)
ArrayBased       2.166667   0.266667   2.433333 (  1.445458)
Array.new        2.150000   0.300000   2.450000 (  1.474190)
Array []         0.900000   0.233333   1.133333 (  0.684567)

----------------------------------------------------

#!/usr/bin/ruby

require 'benchmark'
include Benchmark

attribute_array = (1...7).to_a

class Class1
  def initialize(attribute_array)
    (@f1, @f2, @f3, @f4, @f5, @f6, @f7) = attribute_array
  end
end

class Class2
  def initialize(f1=nil, f2=nil, f3=nil, f4=nil, f5=nil, f6=nil,
f7=nil)
     @f1=f1
     @f2=f2
     @f3=f3
     @f4=f4
     @f5=f5
     @f6=f6
     @f7=f7
  end
end


StructClass = Struct.new(:f1, :f2, :f3, :f4, :f5, :f6, :f7)
filled_in = StructClass.new( *attribute_array )

class ArrayBased < Array
end

filled_in = ArrayBased.new(attribute_array)


REP = 1_000_000

bmbm(15) do |re|
  re.report("Class1") { REP.times { Class1.new(attribute_array) } }
  re.report("Class2") { REP.times { Class2.new(*attribute_array) } }
  re.report("Struct") { REP.times
{ StructClass.new(*attribute_array) } }
  re.report("ArrayBased") { REP.times
{ ArrayBased.new(attribute_array) } }
  re.report("Array.new") { REP.times { Array.new(attribute_array) } }
  re.report("Array []") { REP.times { [*attribute_array] } }
end

----------------------------------------------------

For speed, at least, my initialize observations appear to hold (in
this benchmark, anyway).
Ordered by fastest filled object creation: Array [], ArrayBased &
Array.new, Struct, Class

Kind regards,
john
unknown (Guest)
on 2007-04-03 19:16
(Received via mailing list)
On Tue, 3 Apr 2007, bwv549 wrote:

> would be too much overhead for what I'm doing, although I could be
> wrong).

arrayfields was specifically design for the case you're coding.  in
particular
it was designed to provide fast keyword access and low memory overhead -
i was
using it with huge postgresql result sets which were arrays.  as you
point
out, arrays use far less memory that any other solution.  they are also
much
nicer to work with

   csv = tuple.join ','

for example.  anyhow, the @fields object is not copied for each array -
only a
reference is held.  so if one does this

   fields = %w( a b c )

   huge_number_tuples.each{|tuple| tuple.fields = fields}

then ALL tuples share the fields object - a copy is not taken.  anyhow,
i'd
love to see your tests using it.

regards.


-a
bwv549 (Guest)
on 2007-04-03 21:05
(Received via mailing list)
> i'd love to see your tests using it.

Thanks for the insight.  I've included an arrayfields implementation
on the above benchmark:

Rehearsal --------------------------------------------------
ArrayFields      8.433333   0.800000   9.233333 (  5.585626)
Class1           5.566667   0.500000   6.066667 (  3.631757)
Class2           6.050000   0.550000   6.600000 (  3.963679)
Struct           3.583333   0.300000   3.883333 (  2.335911)
ArrayBased       2.033333   0.266667   2.300000 (  1.381698)
Array.new        1.950000   0.333333   2.283333 (  1.371351)
Array []         0.833333   0.316667   1.150000 (  0.689010)
---------------------------------------- total: 31.516667sec

                     user     system      total        real
ArrayFields      8.516667   0.816667   9.333333 (  5.609708)
Class1           5.366667   0.650000   6.016667 (  3.617349)
Class2           5.933333   0.566667   6.500000 (  3.893614)
Struct           3.616667   0.266667   3.883333 (  2.360849)
ArrayBased       1.933333   0.316667   2.250000 (  1.374079)
Array.new        1.883333   0.333333   2.216667 (  1.370410)
Array []         0.866667   0.283333   1.150000 (  0.696223)


Here's how I tested it (let me know if this is done incorrectly):

require 'arrayfields'
fields = %w(f1 f2 f3 f4 f5 f6 f7)

bmbm(15) do |re|
  re.report("ArrayFields") { REP.times { x = [*attribute_array];
x.fields = fields } }
  ## .... same as before
end

I haven't tested the memory on it, but it would appear to be
comparable to an Array (very low memory usage) from your
implementation.  It does have slightly higher overhead in the time
required for creation (if I did it correctly).  Nonetheless,
arrayfields seems ideal for database work as you can simply label the
fields without creating new classes for everything.  In the particular
case before me, I have only a handful of object types, but lots of
them.

Thanks,
john
bwv549 (Guest)
on 2007-04-03 21:15
(Received via mailing list)
Based on the model I was using above for an array based object
(inherits from array) and similar to the way Struct works to
initialize it, I've worked up this guy (with some effor):


class ArrayClass
  def self.create(fields)
    new_class = Class.new(Array)
    new_class.class_eval('undef_method :flatten')
    new_class.class_eval('undef_method :flatten!')
    new_class.class_eval("@@keyword_to_index = {}")
    fields.each_with_index do |field,i|
      field_sym = field.to_sym
      ## THE GETTERS AND SETTERS:
      new_class.class_eval("def #{field}() self[#{i}] end")
      new_class.class_eval("def #{field}=(val) self[#{i}]=val end")

      new_class.class_eval("@@keyword_to_index[:#{field}] = #{i}")
    end
    new_class.class_eval('def self.indices() @@keyword_to_index end')
    new_class.class_eval('def indices() @@keyword_to_index end')
    new_class
  end
end

## 'flatten' wreaks havoc on objects that inherit from array.  I've
tried in different ways to make the above object insensitive to
flatten, with no real luck.  It will break code with other array-based
objects that want to be flattened, but allows us to use flatten on
arrays of our objects:

class Array
  ## WARNING! I've redefined Array#flatten to only work on Array
objects (not
  # derived objects)
  def flatten
    new_array = []
    self.each do |v|
      if v.class == Array
        new_array.push( *(v.flatten) )
      else
        new_array << v
      end
    end
    new_array
  end

  ## WARNING! I've redefined Array#flatten! to only work on Array
objects (not
  # derived objects)
  def flatten!
    self.replace(flatten)
  end
end

## USAGE:
MyClass = ArrayClass.create( %w(f1 f2 f3 f4 f5) )
obj = MyClass.new( %w(v1 v2 v3 v4 v5) )
puts obj.f1
obj.f2 = 'something_else'
MyClass.indices # -> hash giving index based on keyword.
obj.indices     # -> same

I've written some tests and it seems to fly.
unknown (Guest)
on 2007-04-03 23:35
(Received via mailing list)
On Wed, 4 Apr 2007, bwv549 wrote:

> ArrayBased       2.033333   0.266667   2.300000 (  1.381698)
> Array.new        1.883333   0.333333   2.216667 (  1.370410)
> x.fields = fields } }
>  ## .... same as before
> end
>
> I haven't tested the memory on it, but it would appear to be comparable to
> an Array (very low memory usage) from your implementation.  It does have
> slightly higher overhead in the time required for creation (if I did it
> correctly).  Nonetheless, arrayfields seems ideal for database work as you
> can simply label the fields without creating new classes for everything.  In
> the particular case before me, I have only a handful of object types, but
> lots of them.

seems like you're realy comparing apples to oranges here.  for instance
your
Class1 contructor only every takes a copy of 'attribute_array', while
some
ctors take copies, or even several.  i made a mod so all calles to
attribute_array created a new array so you can actually measure the
speed of
the ctor not the speed of copying vs not copying an array (we know which
one is
faster!).  and not all your implimentations provide keyword access - but
i
guess you know that...

also i used the new (version 3.7.0) arrayfields interface to build a
class
that way.  here are the results


harp:~ > cat a.rb
#! /usr/bin/env ruby

require 'rubygems'
require 'arrayfields' # 3.7.0!
require 'benchmark'
include Benchmark

class Class1
   def initialize(attribute_array)
     (@f1, @f2, @f3, @f4, @f5, @f6, @f7) = attribute_array
   end
end

class Class2
   def initialize(f1=nil, f2=nil, f3=nil, f4=nil, f5=nil, f6=nil,
f7=nil)
      @f1=f1
      @f2=f2
      @f3=f3
      @f4=f4
      @f5=f5
      @f6=f6
      @f7=f7
   end
end

StructClass = Struct.new(:f1, :f2, :f3, :f4, :f5, :f6, :f7)

class ArrayBased < Array
end

=begin
#
# this is all Array.fields does
#
class ArrayFieldsBased < Array
   FIELDS = ArrayFields::FieldSet.new [:f1, :f2, :f3, :f4, :f5, :f6,
:f7]
   include ArrayFields
   def initialize *a, &b
     super
   ensure
     @fieldset = FIELDS
   end
end
=end

ArrayFieldsBased = Array.fields :f1, :f2, :f3, :f4, :f5, :f6, :f7

def attribute_array() (1..7).to_a end

REP = 1_000_00 #_000

bmbm(15) do |re|
   re.report("ArrayFieldsBased") { REP.times { ArrayFieldsBased.new
attribute_array } }
   re.report("Class1") { REP.times { Class1.new attribute_array } }
   re.report("Class2") { REP.times { Class2.new *attribute_array} }
   re.report("Struct") { REP.times { StructClass.new *attribute_array }
}
   re.report("ArrayBased") { REP.times { ArrayBased.new attribute_array
} }
   re.report("Array.new") { REP.times { Array.new attribute_array } }
   re.report("Array []") { REP.times { [*attribute_array] } }
end


and here's the results



[ahoward@localhost ~]$ ruby a.rb
Rehearsal ----------------------------------------------------
ArrayFieldsBased   1.440000   0.010000   1.450000 (  1.709076)
Class1             1.550000   0.020000   1.570000 (  6.274899)
Class2             1.680000   0.000000   1.680000 (  1.945875)
Struct             1.230000   0.010000   1.240000 (  1.486386)
ArrayBased         1.000000   0.000000   1.000000 (  1.309320)
Array.new          0.980000   0.000000   0.980000 (  1.178536)
Array []           0.750000   0.000000   0.750000 (  0.935874)
------------------------------------------- total: 8.670000sec

                        user     system      total        real
ArrayFieldsBased   1.560000   0.000000   1.560000 (  1.787502)
Class1             1.400000   0.010000   1.410000 (  1.627891)
Class2             1.480000   0.000000   1.480000 (  1.742301)
Struct             1.180000   0.010000   1.190000 (  1.299734)
ArrayBased         0.970000   0.000000   0.970000 (  1.050021)
Array.new          0.980000   0.000000   0.980000 (  1.089647)
Array []           0.740000   0.010000   0.750000 (  0.888502)



so, now you see what you should - all approaches are within the same
order of
magnitude: there is no difference.


regards.

-a
Robert K. (Guest)
on 2007-04-04 00:21
(Received via mailing list)
On 03.04.2007 21:34, removed_email_address@domain.invalid wrote:
> so, now you see what you should - all approaches are within the same
> order of magnitude: there is no difference.

Well, factor two between slowest and fastest.  Not dramatic but may be
noticeable when dealing with large quantities of objects.

But I also (?) have this "premature optimization" feeling about this
thread.  And we don't have seen real memory measurements yet which
seemed to have been the root concern...

Kind regards

  robert
This topic is locked and can not be replied to.