Quicker finding strings? Alternative for array, hash, set?


#1

I’ve been searching for this information already, but cannot really find
the answer I’m looking for.

I have to find values in a list of some 40,000 strings. Putting all
these in an array is not likely to be the quickest way… and it isn’t.
A set seems to be faster already. Maybe using a map where I only use the
keys is also faster. But I’m wondering: is there no such thing as a tree
to store and retrieve items in a fast way?

If I remember well, for n elements in a list it should take log(n) steps
then, where it takes n/2 just going through an array if I only want to
pick up the element. In my situation it is much closer to n because most
of the times the value is NOT in the list.

Any ideas?


#2

On Mon, Feb 2, 2009 at 12:00 PM, Patrick P. removed_email_address@domain.invalid
wrote:

then, where it takes n/2 just going through an array if I only want to
pick up the element. In my situation it is much closer to n because most
of the times the value is NOT in the list.

Any ideas?

If you are only concerned with lookup time, a hash has amortized O(1)
lookup time:

require ‘benchmark’

words = (“aaaa”…“zzzz”).to_a # => 456,976 strings

hash = {}
words.each {|w| hash[w] = true}

TIMES = 10
Benchmark.bmbm do |x|
x.report(“Array#find existing”) do
TIMES.times do
words.find {|w| w == “mmmm”}
end
end
x.report(“Array#find non-existing”) do
TIMES.times do
words.find {|w| w == “mmmmm”}
end
end
x.report(“Hash#[] existing”) do
TIMES.times do
hash[“mmmm”]
end
end
x.report(“Hash#[] non-existing”) do
TIMES.times do
hash[“mmmmm”]
end
end
end

jesus@jesus-laptop:~/personal/programacion/ruby/Benchmarks/lib$ ruby
hash_array2.rb
Rehearsal -----------------------------------------------------------
Array#find existing 3.160000 0.700000 3.860000 ( 3.870061)
Array#find non-existing 6.280000 1.630000 7.910000 ( 7.935304)
Hash#[] existing 0.000000 0.000000 0.000000 ( 0.000019)
Hash#[] non-existing 0.000000 0.000000 0.000000 ( 0.000018)
------------------------------------------------- total: 11.770000sec

                          user     system      total        real

Array#find existing 3.130000 0.710000 3.840000 ( 3.866618)
Array#find non-existing 5.820000 1.580000 7.400000 ( 9.195417)
Hash#[] existing 0.000000 0.000000 0.000000 ( 0.000030)
Hash#[] non-existing 0.000000 0.000000 0.000000 ( 0.000032)

Jesus.


#3

Rehearsal -----------------------------------------------------------
Array#find existing 3.160000 0.700000 3.860000 ( 3.870061)
Array#find non-existing 6.280000 1.630000 7.910000 ( 7.935304)
Hash#[] existing 0.000000 0.000000 0.000000 ( 0.000019)
Hash#[] non-existing 0.000000 0.000000 0.000000 ( 0.000018)
------------------------------------------------- total: 11.770000sec

                          user     system      total        real

Array#find existing 3.130000 0.710000 3.840000 ( 3.866618)
Array#find non-existing 5.820000 1.580000 7.400000 ( 9.195417)
Hash#[] existing 0.000000 0.000000 0.000000 ( 0.000030)
Hash#[] non-existing 0.000000 0.000000 0.000000 ( 0.000032)

That is impressive indeed! No need for any other containers here I
guess. :wink: Thanks for showing me this!

I ran a program over some 1.8 billion lines of data this weekend. Good
thing it was more slowed down because of IO than my Ruby programming. It
would have been faster of course, but the damage is not that big…


#4

If you do not want to attach information to those words then a Set is
more appropriate: same O(1) lookup but no key value mapping.
Is the implementation “under the hood” the same? That would mean my
choice to use a Set was not so bad after all.


#5

2009/2/2 Patrick P. removed_email_address@domain.invalid:

Hash#[] existing 0.000000 0.000000 0.000000 ( 0.000030)
Hash#[] non-existing 0.000000 0.000000 0.000000 ( 0.000032)

That is impressive indeed! No need for any other containers here I
guess. :wink: Thanks for showing me this!

If you do not want to attach information to those words then a Set is
more appropriate: same O(1) lookup but no key value mapping.

Cheers

robert


#6

On Mon, Feb 2, 2009 at 3:37 PM, Patrick P. removed_email_address@domain.invalid
wrote:

If you do not want to attach information to those words then a Set is
more appropriate: same O(1) lookup but no key value mapping.
Is the implementation “under the hood” the same? That would mean my
choice to use a Set was not so bad after all.

If you check here:

http://www.ruby-doc.org/core/classes/Set.html

you can see the source code for each method. For example this is the
method new:

File lib/set.rb, line 64

def initialize(enum = nil, &block) # :yields: o
@hash ||= Hash.new

enum.nil? and return

if block
  enum.each { |o| add(block[o]) }
else
  merge(enum)
end

end

So you can see, it uses a Hash underneath. Let’s take a look at the add
method:

File lib/set.rb, line 195

def add(o)
@hash[o] = true
self
end

So, it’s using a hash with true as the values (the same I did in my
example). Of course, if the true doesn’t mean anything to you (which
shouldn’t as you are not using the values), using a Set is a bit more
clear, and has convenience methods for Set “arithmetic”.

Jesus.


#7

Very clear. I should/could have checked this myself. I’m only using Ruby
for a few months now, and loving it more and more.

Thanks guys!


#8

On 02.02.2009, at 15:37, Patrick P. wrote:

If you do not want to attach information to those words then a Set is
more appropriate: same O(1) lookup but no key value mapping.
Is the implementation “under the hood” the same?

Set uses a Hash as storage.
See set.rb:
--------------8<-------------

The equality of each couple of elements is determined according to

Object#eql? and Object#hash, since Set uses Hash as storage.

-------------->8-------------

regards, Sandor
Sz


#9

On Mon, 02 Feb 2009 08:21:48 -0500, Robert K. wrote:

Array#find non-existing 5.820000 1.580000 7.400000 ( 9.195417)

robert

A trie might also be a good idea. Some tries were proposed as solutions
to Ruby Q. #103 (http://rubyquiz.com/quiz103.html)


#10

2009/2/3 Ken B. removed_email_address@domain.invalid:

                          user     system      total        real

Array#find existing 3.130000 0.710000 3.840000 ( 3.866618)
Array#find non-existing 5.820000 1.580000 7.400000 ( 9.195417)
Hash#[] existing 0.000000 0.000000 0.000000 ( 0.000030)
Hash#[] non-existing 0.000000 0.000000 0.000000 ( 0.000032)

That is impressive indeed! No need for any other containers here I
guess. :wink: Thanks for showing me this!

If you do not want to attach information to those words then a Set is
more appropriate: same O(1) lookup but no key value mapping.

A trie might also be a good idea. Some tries were proposed as solutions
to Ruby Q. #103 (http://rubyquiz.com/quiz103.html)

Good idea. But this is only a realistic option if the trie is
implemented in C I guess - otherwise the standard String#hash and
Set#include? are faster.

Kind regards

robert