Forum: Ruby Array#inject to create a hash versus Hash-- Speed, segfault

Announcement (2017-05-07): www.ruby-forum.com is now read-only since I unfortunately do not have the time to support and maintain the forum any more. Please see rubyonrails.org/community and ruby-lang.org/en/community for other Rails- und Ruby-related community platforms.
Anthony M. (Guest)
on 2007-06-11 09:41
(Received via mailing list)
So, I was tinkering with ways to build a hash out of transforming an
array, knowing the standard/idiomatic

id_list = [:test1, :test2, :test3]
id_list.inject({}) { |a,e| a[e]=e.object_id ; a }

I also decided to try something like this:

Hash[ *id_list.collect  { |e| [e,e.object_id]}.flatten]

and further (attempt to) optimize it via
Hash[ *id_list.collect  { |e| [e,e.object_id]}.flatten!]
and
Hash[ *id_list.collect! { |e| [e,e.object_id]}.flatten!]

Running this via Benchmark#bmbm gives pretty interesting, and to me,
unexpected, results (on a 3.2 GHz P4, 1GB of RAM,  FC5 with ruby 1.8.4)

require 'benchmark'
id_list = (1..1_000_000).to_a
Benchmark::bmbm do |x|
    x.report("inject")   { id_list.inject({})      { |a,e| a[e] =
e.object_id ; a} }
    x.report("non-bang") { Hash[ *id_list.collect  { |e|
[e,e.object_id]}.flatten] }
    x.report("bang")     { Hash[ *id_list.collect  { |e|
[e,e.object_id]}.flatten!] }
    x.report("two-bang") { Hash[ *id_list.collect! { |e|
[e,e.object_id]}.flatten!] }
end

Rehearsal --------------------------------------------
inject    16.083333   0.033333  16.116667 (  9.670747)
non-bang 1657.050000   1.800000 1658.850000 (995.425642)
bang     1593.716667   0.016667 1593.733333 (956.334565)
two-bang 1604.816667   1.350000 1606.166667 (963.803356)
-------------------------------- total: 4874.866667sec

               user     system      total        real
inject     5.183333   0.000000   5.183333 (  3.102379)
non-bang zsh: segmentation fault  ruby

Ow?

Also, I just thought of a similar way to accomplish the same thing:

x.report("zip")      { Hash[ *id_list.zip(id_list.collect {|e|
e.object_id})] }

Array#collect! won't work right with this, of course, but it seems to
have equally-bad performance. Is Array#inject just optimized for this,
or something?

-- Pi
Robert K. (Guest)
on 2007-06-11 11:06
(Received via mailing list)
On 11.06.2007 07:41, Anthony M. wrote:
> and further (attempt to) optimize it via
>     x.report("inject")   { id_list.inject({})      { |a,e| a[e] = e.object_id ; a} }
> -------------------------------- total: 4874.866667sec
>
> Array#collect! won't work right with this, of course, but it seems to
> have equally-bad performance. Is Array#inject just optimized for this,
> or something?

The reason why you are seeing this (performance as well as timing) is
most likely caused by the different approach.  When you use #inject you
just create one copy of the Array (the Hash).  When you use #collect you
create at least one additional copy of the large array plus a ton of two
element arrays.  That's way less efficient considering memory usage and
GC.  You'll probably see much different results if your input array is
much shorter (try with 10 or 100 elements).

Kind regards

  robert
Caleb C. (Guest)
on 2007-06-11 11:16
(Received via mailing list)
That segfault shouldn't be happening... I guess you've found a bug. I
modified your benchmark slightly (reduced id_list to 100_000 elements,
since I don't want to wait more than an hour for it to finish) and I
didn't get a segfault, but the results are definitely strange:

Rehearsal --------------------------------------------
inject     0.320000   0.350000   0.670000 (  0.682610)
non-bang   7.700000   0.590000   8.290000 (  8.522170)
bang       7.150000   0.230000   7.380000 (  7.392471)
two-bang   6.500000   0.220000   6.720000 (  6.766079)
---------------------------------- total: 23.060000sec

               user     system      total        real
inject     0.700000   0.310000   1.010000 (  1.019340)
non-bang 131.260000   1.440000 132.700000 (132.823762)
bang     134.610000   1.280000 135.890000 (136.323224)
two-bang 137.260000   0.750000 138.010000 (138.027025)

Why is the rehearsal so much faster? Especially for the various
collect implementations?
(I'm running Debian Etch, Ruby 1.8.5.)



I expect that the fastest way to do what you want is like this:

id_list = [:test1, :test2, :test3]
result={}
id_list.each { |e| result[e]=e.object_id }

Anyway, that's what I usually do.

I have several times wanted a method on Enumerable that helps you
transform it into a hash. Something like:

module Enumerable
  def map_to_hash
    result={}
    each { |e|
       k,v=*yield(e)
       result[k]=v
    }
    return result
  end
end

id_list.map_to_hash {|x| [x,x.object_id]}

I don't know what the performance of this might be.... It's creating a
2-element hash on each loop now, but maybe that has minimal cost.
Anthony M. (Guest)
on 2007-06-11 11:24
(Received via mailing list)
On Mon, Jun 11, 2007 at 04:05:10PM +0900, Robert K. wrote:
> On 11.06.2007 07:41, Anthony M. wrote:
> >So, I was tinkering with ways to build a hash out of transforming an
[snip]
> plus a ton of two element arrays.  That's way less efficient
Ah. This does make sense. Now, I throw in the #zip variant!

> considering memory usage and GC.  You'll probably see much different
> results if your input array is much shorter (try with 10 or 100
> elements).

Well, the general trend is that any of the non-idiomatic solutions are
pretty horrible, although the two-bang gets vaguely better than 0 or 1
on
bigger arrays:

1000 elements:
Rehearsal --------------------------------------------
inject     0.010000   0.000000   0.010000 (  0.007391)
non-bang   0.000000   0.000000   0.000000 (  0.006602)
bang       0.000000   0.000000   0.000000 (  0.008503)
two-bang   0.010000   0.000000   0.010000 (  0.010136)
zip        0.790000   0.220000   1.010000 (  1.202124)
----------------------------------- total: 1.030000sec

               user     system      total        real
inject     0.010000   0.010000   0.020000 (  0.012621)
non-bang   0.020000   0.000000   0.020000 (  0.022407)
bang       0.020000   0.000000   0.020000 (  0.019274)
two-bang   0.020000   0.000000   0.020000 (  0.027521)
zip        2.240000   0.670000   2.910000 (  5.804656)


10000 elements:
Rehearsal --------------------------------------------
inject     0.050000   0.010000   0.060000 (  0.079325)
non-bang   0.140000   0.010000   0.150000 (  0.265995)
bang       0.150000   0.010000   0.160000 (  0.184713)
two-bang   0.140000   0.020000   0.160000 (  0.202579)
zip       77.510000  24.270000 101.780000 (134.493519)
--------------------------------- total: 102.310000sec

               user     system      total        real
inject     0.110000   0.030000   0.140000 (  0.146650)
non-bang   0.450000   0.020000   0.470000 (  0.526816)
bang       0.470000   0.020000   0.490000 (  0.583837)
two-bang   0.440000   0.040000   0.480000 (  0.520914)
zip      222.910000  69.370000 292.280000 (389.541023)


Extra-ouch! This is what I get for trying to be clever.
Robert K. (Guest)
on 2007-06-11 12:21
(Received via mailing list)
On 11.06.2007 09:23, Anthony M. wrote:
>> most likely caused by the different approach.  When you use #inject
> Well, the general trend is that any of the non-idiomatic solutions are
> ----------------------------------- total: 1.030000sec
> Rehearsal --------------------------------------------
> bang       0.470000   0.020000   0.490000 (  0.583837)
> two-bang   0.440000   0.040000   0.480000 (  0.520914)
> zip      222.910000  69.370000 292.280000 (389.541023)
>
>
> Extra-ouch! This is what I get for trying to be clever.

:-)

I'd say figures for the 1000 elements array are not too meaningful.  You
should probably wrap those tests in a repetition.  And if you are really
trying to get the fastest solution I'd also try

h = {}
id_list.each {|e| h[e] = e.object_id }

This is usually the fastest in my experience (unless of course the
timing is dominated by what you do in the block).

Kind regards

  robert
This topic is locked and can not be replied to.