Inserting hash value slows down as table gets larger

luislavena · March 17, 2011, 1:57pm

People,

In response to people’s suggestions about speeding up my script by
replacing output to many small files with output to one large file I
have implemented a hash table I can write out with YAML. However, I
find as the hash table gets larger, the script slows down . . but when I
try and work out what is happening by producing a small test script that
does more or less the same thing, I can’t reproduce the problem . .

The test script is:

#!/usr/bin/ruby

h1 = Hash.new( 0 )
srand = 0

seeds = [ ‘01’, ‘01’, ‘01’, ‘22’ ]

seeds = [ ‘01’, ‘01’, ‘20’, ‘22’ ]

seeds = [ ‘01’, ‘32’, ‘20’, ‘22’ ]

seeds = [ ‘50’, ‘32’, ‘20’, ‘22’ ]

for a in ‘01’ … seeds[0]
start = Time.now

puts a

for b in ‘01’ … seeds[1]

puts b

 for c in '01' .. seeds[2]

puts c

   for d in '01' .. seeds[3]

print "#{d} "

     h1[ "#{a}.#{b}.#{c}.#{d}" ] = Array.new(2){ Array.new(1){

Array.new( 20, rand(1) ) } }
end

puts

end

puts

end

puts

stop = Time.now
puts stop - start
end

The script is faster with the hash insertion commented out of course and
the time between iterations of the outer loop are constant in both
scripts - but they are longer and STILL constant with the insertion not
commented out in the test script. In my actual script, when the
insertion is not commented out the time between iterations in the outer
loop gets longer and longer eg 36 sec -> a few minutes before I kill it
about half way through . .

Can anyone suggest a way of working out why the production hash
insertion behaves differently and somewhat unexpectedly?

Thanks,

Phil.

Philip R.

GPO Box 3411
Sydney NSW 2001
Australia
E-mail: [email protected]

unebaguettesvp · March 17, 2011, 3:25pm

I suspect is it the memory related to object creation for Array.new(2){
Array.new(1){ Array.new( 20, rand(1) ) } }

With this line as originally
h1[ “#{a}.#{b}.#{c}.#{d}” ] = Array.new(2){ Array.new(1){
Array.new( 20, rand(1) ) } }
I get 20.8 seconds

set to
h1[ “#{a}.#{b}.#{c}.#{d}” ] = 3
I get 2.54 seconds

set to
Array.new(2){ Array.new(1){ Array.new( 20, rand(1) ) } }

I get 3.75 seconds

set to
# h1[ “#{a}.#{b}.#{c}.#{d}” ] = Array.new(2){ Array.new(1){
Array.new( 20, rand(1) ) } }

I get 0.47 seconds

Dave.

unebaguettesvp · March 17, 2011, 6:04pm

Dave B. wrote in post #987924:

set to
Array.new(2){ Array.new(1){ Array.new( 20, rand(1) ) } }

I get 3.75 seconds

set to
# h1[ “#{a}.#{b}.#{c}.#{d}” ] = Array.new(2){ Array.new(1){
Array.new( 20, rand(1) ) } }

I get 0.47 seconds

Try these too:

h1[ "#{a}.#{b}.#{c}.#{d}".freeze ] = 3

h1[ "#{a}.#{b}.#{c}.#{d}".freeze ] = Array.new etc

It’s a foible of Ruby that if you use a String as a hash key, it is
dup’d and frozen - unless it was frozen already, which is what the above
does.

The other thing to beware of is YAML - try using Marshal instead as a
comparison, because YAML might not be very efficient when dealing with
very large data structures. I think you said you needed a human-editable
form, but this would rule out YAML as the source of the problem.

If it is, then you can look at alternative YAML libraries, or at JSON.

unebaguettesvp · March 18, 2011, 2:42pm

The other thought I had was putting the data into a sqlite3 db - I will
try it and see what happens but I don’t imagine it would be faster than
a memory based hash table?

Probably not, but at least it persists, and it may be cheaper to make
updates to a large external data structure than rebuild the entire
structure in memory each time.

In your application, do you really need to use Strings?

Your sample code looks like it’s handling numeric-style data (although I
realise this is just a test case for the problems you’re having).
Integers in the range -2^30…+2^30 (or larger in on a 64-bit machine)
have their values encoded within the reference, so no memory allocation
is done.

Or, if you’re handling a relatively small set of unique values, you
could use symbols instead of strings. Each symbol reference again
doesn’t allocate any memory; it just points to the entry in the symbol
table.

Or you could use frozen strings and share the references.

LABEL1 = “00”.freeze
LABEL2 = “01”.freeze
MAP = {LABEL1 => LABEL1, LABEL2=>LABEL2}
a = MAP[“00”]
puts a.object_id
puts LABEL1.object_id

Although that’s more work than symbols, it might be useful depending on
your use case. For example, you could replace a subset of the values you
see with these frozen strings (which covers the majority of the data),
whilst still allowing arbitrary other strings.

unebaguettesvp · March 18, 2011, 12:29am

Brian,

On 2011-03-18 04:04, Brian C. wrote:

I get 0.47 seconds

Try these too:
 h1[ "#{a}.#{b}.#{c}.#{d}".freeze ] = 3

 h1[ "#{a}.#{b}.#{c}.#{d}".freeze ] = Array.new etc
It’s a foible of Ruby that if you use a String as a hash key, it is
dup’d and frozen - unless it was frozen already, which is what the above
does.

Adding “freeze” improves the situation by about a second on my machine:

18.3 -> 17.3

The other thing to beware of is YAML - try using Marshal instead as a
comparison, because YAML might not be very efficient when dealing with
very large data structures. I think you said you needed a human-editable
form, but this would rule out YAML as the source of the problem.

Yes that is very slow too but that is another issue - I can live with
that for the convenience of having human readable data . .

If it is, then you can look at alternative YAML libraries, or at JSON.

The other thought I had was putting the data into a sqlite3 db - I will
try it and see what happens but I don’t imagine it would be faster than
a memory based hash table?

Thanks,

Phil.

Philip R.

GPO Box 3411
Sydney NSW 2001
Australia
E-mail: [email protected]

unebaguettesvp · March 19, 2011, 10:17am

Brian,

On 2011-03-19 00:42, Brian C. wrote:

The other thought I had was putting the data into a sqlite3 db - I will
try it and see what happens but I don’t imagine it would be faster than
a memory based hash table?

Probably not, but at least it persists, and it may be cheaper to make
updates to a large external data structure than rebuild the entire
structure in memory each time.

Yes, using a db was much slower

In your application, do you really need to use Strings?

I guess not but I preferred a tag of:

01.01.01.01

to:

1010101

Your sample code looks like it’s handling numeric-style data (although I
realise this is just a test case for the problems you’re having).
Integers in the range -2^30…+2^30 (or larger in on a 64-bit machine)
have their values encoded within the reference, so no memory allocation
is done.

Are you talking about the hash key or the hash values? - the values in
the real script will all be floats . .

Or, if you’re handling a relatively small set of unique values, you
could use symbols instead of strings. Each symbol reference again
doesn’t allocate any memory; it just points to the entry in the symbol
table.

Not sure what you mean - example?

Or you could use frozen strings and share the references.

LABEL1 = “00”.freeze
LABEL2 = “01”.freeze
MAP = {LABEL1 => LABEL1, LABEL2=>LABEL2}
a = MAP[“00”]
puts a.object_id
puts LABEL1.object_id

I ran that code but I don’t understand how it helps . .

Although that’s more work than symbols, it might be useful depending on
your use case. For example, you could replace a subset of the values you
see with these frozen strings (which covers the majority of the data),
whilst still allowing arbitrary other strings.

Still not clear - examples?

The other thing that occurred to me was that on my 64-bit machine maybe
I could run 2-3 threads for inserting into the hash table?

BTW, I did end up changing to JSON - a YAML dump on a 32000x22 hash
table was deadly . .

Thanks,

Phil.

Philip R.

GPO Box 3411
Sydney NSW 2001
Australia
E-mail: [email protected]

unebaguettesvp · March 19, 2011, 10:31pm

Philip R. wrote in post #988264:

Your sample code looks like it’s handling numeric-style data (although I
realise this is just a test case for the problems you’re having).
Integers in the range -2^30…+2^30 (or larger in on a 64-bit machine)
have their values encoded within the reference, so no memory allocation
is done.

Are you talking about the hash key or the hash values?

Either.

the values in
the real script will all be floats . .

Then they will be allocated on the heap, just like strings. I presume
you’re aware of the inherent inaccuracy of floats (in any language), and
are OK with this.

1.0/2.0 == 1.0 + 1.0/2.0 - 1.0
=> true

1.0/10.0 == 1.0 + 1.0/10.0 - 1.0
=> false

Or, if you’re handling a relatively small set of unique values, you
could use symbols instead of strings. Each symbol reference again
doesn’t allocate any memory; it just points to the entry in the symbol
table.

Not sure what you mean - example?

a = []
a[0] = :foo
a[1] = :foo
a[2] = :foo

puts a[0].object_id
puts a[1].object_id
puts a[2].object_id

Or you could use frozen strings and share the references.

LABEL1 = “00”.freeze
LABEL2 = “01”.freeze
MAP = {LABEL1 => LABEL1, LABEL2=>LABEL2}
a = MAP[“00”]
puts a.object_id
puts LABEL1.object_id

I ran that code but I don’t understand how it helps . .

It uses less memory if you have (say) millions of identical strings. It
may help garbage collection performance, but not much else.

Although that’s more work than symbols, it might be useful depending on
your use case. For example, you could replace a subset of the values you
see with these frozen strings (which covers the majority of the data),
whilst still allowing arbitrary other strings.

Still not clear - examples?

Suppose the strings “foo” and “bar” comprise 80% of your hash keys or
values. Then mapping them to the same frozen string means that you only
have one instance of string “foo” and one instance of string “bar” in
the system, instead of (say) millions of distinct strings. You can still
use individual strings for the other 20%.

This is really an edge optimisation though, you really shouldn’t need to
be worrying about these things - if they are significant, then perhaps
ruby is the wrong language for the problem in hand.

The other thing that occurred to me was that on my 64-bit machine maybe
I could run 2-3 threads for inserting into the hash table?

Noooo… even in ruby 1.9, there is a global interpreter lock.
Multiple threads gain you nothing really, except for threads which are
blocked on I/O.

Even if there were not, having multiple threads contending on the same
hash (and controlling access via, say, a mutex) would be pretty much
guaranteed to make performance worse not better.

Regards,

Brian.

unebaguettesvp · March 20, 2011, 6:50am

Brian,

On 2011-03-20 08:31, Brian C. wrote:

Either.
Right - for the following in my test script:

h1[ “#{a}.#{b}.#{c}.#{d}” ] = Array.new(2){ Array.new(1){ Array.new( 20,
rand(100) ) } }
h1[ “#{a}.#{b}.#{c}.#{d}”.freeze ] = Array.new(2){ Array.new(1){
Array.new( 20, rand(100) ) } }
h1[ “#{a}.#{b}.#{c}.#{d}”.to_i ] = Array.new(2){ Array.new(1){
Array.new( 20, rand(100) ) } }
h1[ “#{a}.#{b}.#{c}.#{d}”.to_i.freeze ] = Array.new(2){ Array.new(1){
Array.new( 20, rand(100) ) } }

I get the following times:

18.350s
18.113s
4.724s
4.896s

So I guess I should live with the slight decrease of readability when
searching for particular results in the JSON output file by using ints
instead of strings for the hash keys.

=> false
I suppose I could convert them all to six or eight digit ints . . they
are measures of biological diversity and changing them backwards and
forwards is a bit of a hassle but maybe it is worth doing for the speed
advantage? - I will try my test script with floats and see what happens.

a[2] = :foo

puts a[0].object_id
puts a[1].object_id
puts a[2].object_id

The reason the numbers are called “seeds” is that they correspond to the
seed for the random dumber generator in the C/C++ simulation program -
so they are all unique for each of the 32,000 simulations.

I ran that code but I don’t understand how it helps . .

It uses less memory if you have (say) millions of identical strings.

Not the case for the keys and unlikely for the values.

Suppose the strings “foo” and “bar” comprise 80% of your hash keys or
values. Then mapping them to the same frozen string means that you only
have one instance of string “foo” and one instance of string “bar” in
the system, instead of (say) millions of distinct strings. You can still
use individual strings for the other 20%.

Unfortunately this doesn’t correspond to my case . .

This is really an edge optimisation though, you really shouldn’t need to
be worrying about these things - if they are significant, then perhaps
ruby is the wrong language for the problem in hand.

The other thing that occurred to me was that on my 64-bit machine maybe
I could run 2-3 threads for inserting into the hash table?

Noooo… even in ruby 1.9, there is a global interpreter lock.
Multiple threads gain you nothing really, except for threads which are
blocked on I/O.

Right.

Even if there were not, having multiple threads contending on the same
hash (and controlling access via, say, a mutex) would be pretty much
guaranteed to make performance worse not better.

OK - oh well it was worth a thought!

Many thanks!

Regards,

Phil.

Philip R.

GPO Box 3411
Sydney NSW 2001
Australia
E-mail: [email protected]

unebaguettesvp · March 20, 2011, 2:21pm

“#{a}.#{b}.#{c}.#{d}”.to_i.freeze

I don’t think that does what you expect. Try it in irb.

unebaguettesvp · March 21, 2011, 1:16am

Brian,

On 2011-03-21 00:21, Brian C. wrote:

“#{a}.#{b}.#{c}.#{d}”.to_i.freeze

I don’t think that does what you expect. Try it in irb.

Oops! . . forgot to take the dots out . .

In any case, in the actual live script (as opposed to the test script) I
did the same comparison of:

$recs[ “#{$seed_pre}.#{$seed_post}.#{$seed}.#{sgen}”.freeze ] = $iarr
$recs[ “#{$seed_pre}#{$seed_post}#{$seed}#{sgen}”.to_i.freeze ] = $iarr

and unfortunately there was no real difference because of the amount of
other processing that happens inside the loops . .

Thanks,

Phil.

Philip R.

GPO Box 3411
Sydney NSW 2001
Australia
E-mail: [email protected]

unebaguettesvp · March 31, 2011, 12:00pm

People,

To add to my last post:

On 2011-03-20 16:44, Philip R. wrote:

is done.
rand(100) ) } }
18.113s

I suppose I could convert them all to six or eight digit ints . . they

Not sure what you mean - example?

a = MAP[“00”]

blocked on I/O.
OK - oh well it was worth a thought!
I ran the normal number of the inner two loops and only one of the
outermost loop (ie 1/50th of the total) and got the following with
different versions of Ruby:

ruby-1.8.7-p334 [ x86_64 ] = 31.41
ruby-1.9.2-p180 [ x86_64 ] = 19.53
rbx-head [ ] = 128.42

Many thanks!

Regards,

Phil.

–
Philip R.

GPO Box 3411
Sydney NSW 2001
Australia
E-mail: [email protected]