Improving performance of hash math

I am trying to improve the performance of computing the Euclidean distance
between two hash maps. The code is pretty straightforward:

def self.euclidian_distance(lhs, rhs)
  @s = lhs.inject(0.0) do |s, (k, v)|
    rhs.has_key?(k) ? (s + (v - rhs[k]) ** 2) : (s + v ** 2)
  end
  @s = rhs.inject(@s) do |s, (k, v)|
    lhs.has_key?(k) ? s : (s + v ** 2)
  end
  @s
end

Any suggestions?

Try not to use the exponentiation operator to square arguments. Direct
exponentiation is relatively expensive unless the exponent is large; for
small powers it is almost always better to use straightforward
multiplication. Other than this I can’t think of any way to improve
performance short of dropping to C, and even there I think the
performance improvement you’ll get would be marginal.
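
A quick way to check this on your own Ruby version (a tiny sketch with an
arbitrary value; exact numbers will vary):

require 'benchmark'

v = 3.14159
n = 5_000_000

Benchmark.bm(8) do |x|
  x.report("v ** 2") { n.times { v ** 2 } }  # exponentiation operator
  x.report("v * v")  { n.times { v * v } }   # plain multiplication
end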

Have you tried profiling your code? The profiler can tell you a lot
about where it actually spends time doing stuff and can be very
enlightening about optimization.
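
For example, with the ruby-prof gem (one option; a rough sketch assuming
lhs, rhs and the distance method are already defined):

require 'ruby-prof'

# Profile 10,000 distance calculations over your sample hashes.
result = RubyProf.profile do
  10_000.times { euclidian_distance(lhs, rhs) }
end

# Print a flat profile (time per method, number of calls).
RubyProf::FlatPrinter.new(result).print(STDOUT)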

On Jan 18, 2011, at 9:55 PM, dblock wrote:

@s
end

Any suggestions?

Switch to using #each with an explicit accumulator variable instead of
#inject.

Gary W.

Switch to using #each with an explicit accumulator variable instead of #inject.

Looks about twice as slow. Either way, about 36% of the time is spent
in Hash.each.

Here are some actual numbers. I am running the above code 10K times with
a data distribution that is not entirely random (there is some overlap in keys).

 %self    total     self     wait    child    calls  name
 37.61     1.49     0.99     0.00     0.50    61446  Hash#each
 10.00     0.26     0.26     0.00     0.00   419946  Float#+

On Wed, Jan 19, 2011 at 3:55 AM, dblock [email protected] wrote:

@s
end

Any suggestions?

Use Hash’s default value mechanism or use “|| 0” and kick out
#has_key? calls. I’d also rather use a proper Matrix or Vector class
instead of a Hash. NArray also seems like a good and fast option.
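
For dense data the standard library Vector makes this almost a one-liner
(just a sketch; it ignores the sparse Hash representation you are using):

require 'matrix'

a = Vector[1.0, 0.0, 2.5]
b = Vector[0.5, 1.0, 0.0]

d  = (a - b).r  # Euclidean distance (norm of the difference)
d2 = d * d      # squared distance, matching the sums in this thread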

Btw, why are you using an instance variable here? That does not seem
to make any sense.

Also, it seems that your calculation is flawed because you calculate
the distance for keys which are present in both Hashes twice. I’d
rather

require 'set'

def euclidian_distance(lhs, rhs)
  s = 0

  (lhs.keys.to_set.merge(rhs.keys)).each do |k|
    s += ( (rhs[k] || 0) - (lhs[k] || 0) ) ** 2
  end

  s
end
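
The Hash default value route would look roughly like this (untested
sketch; it copies both hashes so missing keys simply read as 0):

def euclidian_distance(lhs, rhs)
  # Copies with a default of 0, so lookups of absent keys return 0.
  lhs = Hash.new(0).merge(lhs)
  rhs = Hash.new(0).merge(rhs)

  (lhs.keys | rhs.keys).inject(0) do |s, k|
    s + (rhs[k] - lhs[k]) ** 2
  end
end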

Kind regards

robert

Use Hash’s default value mechanism or use “|| 0” and kick out
#has_key? calls. I’d also rather use a proper Matrix or Vector class
instead of a Hash. NArray also seems like a good and fast option.

  • has_key has almost no cost, made no difference
  • my data is sparse, there’re a lot of zero values - using an array
    forces you to do N calculations (N is the max width of the array), an
    Array was 3x slower, especially with a fast growing data set (haven’t
    tried NArray)

Btw, why are you using an instance variable here? That does not seem
to make any sense.

  • fixed

Also, it seems that your calculation is flawed because you calculate
the distance for keys which are present in both Hashes twice. I’d
rather

  • re-read the code, you’re wrong, the second half only adds values
    that are not present in the first array

end

  • this is 3x slower because of the cost of to_set.merge (the code is
    nicer though :-))

On Jan 19, 2011, at 12:23 AM, dblock wrote:

Switch to using #each with an explicit accumulator variable instead of #inject.

Looks about twice as slow. Either way, about 36% of the time is spent
in Hash.each.

Surprising since most of the ‘inject solutions’ that show up on this
list are slower than using #each.

Can you post some sample data?

Gary W.

On Jan 18, 2011, at 21:23 , dblock wrote:

Switch to using #each with an explicit accumulator variable instead of #inject.

Looks about twice as slow. Either way, about 36% of the time is spent
in Hash.each.

I don’t agree with Gary’s suggestion, because in this case it is a
proper application of inject, but if you think it is 2x slower, then
your benchmarking is probably at fault.

My measurements show each to be roughly equivalent on 1.9 and quite a
bit faster on 1.8. That could be explained by my modification to your
treatment of rhs or it could also be explained by my fake data.

% ./q.rb 10000
# of iterations = 10000

                          user     system      total        real
null_time             0.000000   0.000000   0.000000 (  0.001118)
euclidian_distance1  18.960000   0.010000  18.970000 ( 19.011762)
euclidian_distance2  13.050000   0.010000  13.060000 ( 13.069581)
euclidian_distance3  13.090000   0.010000  13.100000 ( 13.109778)

against:

def euclidian_distance1(lhs, rhs)
  @s = lhs.inject(0.0) do |s, (k, v)|
    rhs.has_key?(k) ? (s + (v - rhs[k]) ** 2) : (s + v ** 2)
  end
  @s = rhs.inject(@s) do |s, (k, v)|
    lhs.has_key?(k) ? s : (s + v ** 2)
  end
  @s
end

def euclidian_distance2(lhs, rhs)
  s = 0.0
  lhs.each do |k, v|
    s += rhs.has_key?(k) ? ((v - rhs[k]) ** 2) : (v ** 2)
  end
  rhs.each do |k, v|
    s += v ** 2 if lhs.has_key?(k)
  end
  s
end

def euclidian_distance3(lhs, rhs)
  s = lhs.inject(0.0) do |n, (k, v)|
    n + (rhs.has_key?(k) ? v - rhs[k] : v) ** 2
  end
  rhs.each do |k, v|
    s += v ** 2 if lhs.has_key? k
  end
  s
end
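
(The fake data generation in q.rb isn’t shown above; a driver along these
lines, with placeholder sparse hashes, produces output in that format.)

require 'benchmark'

n = (ARGV[0] || 10_000).to_i
puts "# of iterations = #{n}"

# Placeholder data: two sparse hashes with partially overlapping keys.
lhs = Hash[(0...500).step(2).map { |k| [k, rand * 10] }]
rhs = Hash[(0...500).step(3).map { |k| [k, rand * 10] }]

Benchmark.bm(20) do |x|
  x.report("null_time")           { n.times { } }
  x.report("euclidian_distance1") { n.times { euclidian_distance1(lhs, rhs) } }
  x.report("euclidian_distance2") { n.times { euclidian_distance2(lhs, rhs) } }
  x.report("euclidian_distance3") { n.times { euclidian_distance3(lhs, rhs) } }
end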

In case someone reads this for the algorithms (not that it affects the
benchmarks much): the last two versions have bugs - the second sum needs
to check ! lhs.has_key?(k). You want (right[i] - left[i]) ** 2 for every
key in the left hash (keys in the left hash only and keys in both), then
add right[i] ** 2 for the keys that are in the right hash only.
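
In other words, the second loop in euclidian_distance2 (and similarly in
euclidian_distance3) should read:

rhs.each do |k, v|
  s += v ** 2 unless lhs.has_key?(k)  # only keys missing from lhs
end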

Back to the original question: does anyone think we can do this for
millions of items?

On Jan 19, 2011, at 6:13 PM, Ryan D. wrote:

I don’t agree with Gary’s suggestion, because in this case it is a proper
application of inject, but if you think it is 2x slower, then your benchmarking
is probably at fault.

I’m OK with being wrong. I didn’t actually try my own suggestion so I’m
not going to defend it too hard but I did notice that your
euclidian_distance2 version that uses #each was 30% faster than
euclidian_distance1 that used #inject. Your euclidian_distance3 version
uses both #inject and #each but would seem to be sensitive to the size
of the lhs argument (i.e. the number of entries in the lhs hash).

I’d like to see the sample data to better understand what is going on.

Gary W.

Back to the original question: does anyone think we can do this for
millions of items?

I meant to say “fast”, as in under a second.

On Wed, Jan 19, 2011 at 5:25 PM, dblock [email protected] wrote:

Use Hash’s default value mechanism or use “|| 0” and kick out
#has_key? calls. I’d also rather use a proper Matrix or Vector class
instead of a Hash. NArray also seems like a good and fast option.

  • has_key has almost no cost, made no difference

Still, the hash needs to be looked up twice. I prefer to do the work
only once and then decide based on the result.
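
For instance, inside the lhs loop (a sketch of the two styles):

# Two lookups per key: has_key? and then rhs[k]
s += rhs.has_key?(k) ? (v - rhs[k]) ** 2 : v ** 2

# One lookup, then decide based on the result
other = rhs[k]
s += other ? (v - other) ** 2 : v ** 2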

  • my data is sparse, there’re a lot of zero values - using an array
    forces you to do N calculations (N is the max width of the array), an
    Array was 3x slower, especially with a fast growing data set (haven’t
    tried NArray)

Can you post your test data in some way (either raw or a generator
which roughly creates what you have) so others can play around and
benchmark?

Also, it seems that your calculation is flawed because you calculate
the distance for keys which are present in both Hashes twice. I’d
rather

  • re-read the code, you’re wrong, the second half only adds values
    that are not present in the first array

Right you are! Sorry, my fault.

end

  • this is 3x slower because of the cost of to_set.merge (the code is
    nicer though :-))

:-)

How about

def euclidian_distance(lhs, rhs)
  s = 0

  lhs.each do |k, v|
    s += ( (rhs[k] || 0) - v ) ** 2
  end

  rhs.each do |k, v|
    s += v * v unless lhs[k]
  end

  s
end

Kind regards

robert