Forum: Ruby Quicker finding strings? Alternative for array, hash, set?

Announcement (2017-05-07): www.ruby-forum.com is now read-only since I unfortunately do not have the time to support and maintain the forum any more. Please see rubyonrails.org/community and ruby-lang.org/en/community for other Rails- und Ruby-related community platforms.
Patrick P. (Guest)
on 2009-02-02 13:00
I've been searching for this information already, but cannot really find
the answer I'm looking for.

I have to find values in a list of some 40,000 strings. Putting all
these in an array is not likely to be the quickest way.... and it isn't.
A set seems to be faster already. Maybe using a map where I only use the
keys is also faster. But I'm wondering: is there no such thing as a tree
to store and retrieve items in a fast way?

If I remember well, for n elements in a list it should take log(n) steps
then, where it takes n/2 just going through an array if I only want to
pick up the element. In my situation it is much closer to n because most
of the times the value is NOT in the list.

Any ideas?
Jesús Gabriel y Galán (Guest)
on 2009-02-02 13:25
(Received via mailing list)
On Mon, Feb 2, 2009 at 12:00 PM, Patrick P. 
<removed_email_address@domain.invalid>
wrote:
> then, where it takes n/2 just going through an array if I only want to
> pick up the element. In my situation it is much closer to n because most
> of the times the value is NOT in the list.
>
> Any ideas?

If you are only concerned with lookup time, a hash has amortized O(1)
lookup time:

require 'benchmark'

words = ("aaaa".."zzzz").to_a # => 456,976 strings

hash = {}
words.each {|w| hash[w] = true}

TIMES = 10
Benchmark.bmbm do |x|
  x.report("Array#find existing") do
    TIMES.times do
      words.find {|w| w == "mmmm"}
    end
  end
  x.report("Array#find non-existing") do
    TIMES.times do
        words.find {|w| w == "mmmmm"}
  end
 end
  x.report("Hash#[] existing") do
    TIMES.times do
      hash["mmmm"]
    end
  end
  x.report("Hash#[] non-existing") do
    TIMES.times do
        hash["mmmmm"]
  end
 end
end



jesus@jesus-laptop:~/personal/programacion/ruby/Benchmarks/lib$ ruby
hash_array2.rb
Rehearsal -----------------------------------------------------------
Array#find existing       3.160000   0.700000   3.860000 (  3.870061)
Array#find non-existing   6.280000   1.630000   7.910000 (  7.935304)
Hash#[] existing          0.000000   0.000000   0.000000 (  0.000019)
Hash#[] non-existing      0.000000   0.000000   0.000000 (  0.000018)
------------------------------------------------- total: 11.770000sec

                              user     system      total        real
Array#find existing       3.130000   0.710000   3.840000 (  3.866618)
Array#find non-existing   5.820000   1.580000   7.400000 (  9.195417)
Hash#[] existing          0.000000   0.000000   0.000000 (  0.000030)
Hash#[] non-existing      0.000000   0.000000   0.000000 (  0.000032)


Jesus.
Patrick P. (Guest)
on 2009-02-02 13:39
> Rehearsal -----------------------------------------------------------
> Array#find existing       3.160000   0.700000   3.860000 (  3.870061)
> Array#find non-existing   6.280000   1.630000   7.910000 (  7.935304)
> Hash#[] existing          0.000000   0.000000   0.000000 (  0.000019)
> Hash#[] non-existing      0.000000   0.000000   0.000000 (  0.000018)
> ------------------------------------------------- total: 11.770000sec
>
>                               user     system      total        real
> Array#find existing       3.130000   0.710000   3.840000 (  3.866618)
> Array#find non-existing   5.820000   1.580000   7.400000 (  9.195417)
> Hash#[] existing          0.000000   0.000000   0.000000 (  0.000030)
> Hash#[] non-existing      0.000000   0.000000   0.000000 (  0.000032)

That is impressive indeed! No need for any other containers here I
guess. ;-) Thanks for showing me this!

I ran a program over some 1.8 billion lines of data this weekend. Good
thing it was more slowed down because of IO than my Ruby programming. It
would have been faster of course, but the damage is not that big...
Robert K. (Guest)
on 2009-02-02 16:07
(Received via mailing list)
2009/2/2 Patrick P. <removed_email_address@domain.invalid>:
>> Hash#[] existing          0.000000   0.000000   0.000000 (  0.000030)
>> Hash#[] non-existing      0.000000   0.000000   0.000000 (  0.000032)
>
> That is impressive indeed! No need for any other containers here I
> guess. ;-) Thanks for showing me this!

If you do not want to attach information to those words then a Set is
more appropriate: same O(1) lookup but no key value mapping.

Cheers

robert
Patrick P. (Guest)
on 2009-02-02 16:37
> If you do not want to attach information to those words then a Set is
> more appropriate: same O(1) lookup but no key value mapping.
Is the implementation "under the hood" the same? That would mean my
choice to use a Set was not so bad after all.
Jesús Gabriel y Galán (Guest)
on 2009-02-02 17:45
(Received via mailing list)
On Mon, Feb 2, 2009 at 3:37 PM, Patrick P. 
<removed_email_address@domain.invalid>
wrote:
>> If you do not want to attach information to those words then a Set is
>> more appropriate: same O(1) lookup but no key value mapping.
> Is the implementation "under the hood" the same? That would mean my
> choice to use a Set was not so bad after all.

If you check here:

http://www.ruby-doc.org/core/classes/Set.html

you can see the source code for each method. For example this is the
method new:

# File lib/set.rb, line 64
  def initialize(enum = nil, &block) # :yields: o
    @hash ||= Hash.new

    enum.nil? and return

    if block
      enum.each { |o| add(block[o]) }
    else
      merge(enum)
    end
  end

So you can see, it uses a Hash underneath. Let's take a look at the add
method:

# File lib/set.rb, line 195
  def add(o)
    @hash[o] = true
    self
  end

So, it's using a hash with true as the values (the same I did in my
example). Of course, if the true doesn't mean anything to you (which
shouldn't as you are not using the values), using a Set is a bit more
clear, and has convenience methods for Set "arithmetic".

Jesus.
Patrick P. (Guest)
on 2009-02-02 17:50
Very clear. I should/could have checked this myself. I'm only using Ruby
for a few months now, and loving it more and more.

Thanks guys!
Sandor Szücs (Guest)
on 2009-02-02 18:34
(Received via mailing list)
On 02.02.2009, at 15:37, Patrick P. wrote:

>> If you do not want to attach information to those words then a Set is
>> more appropriate: same O(1) lookup but no key value mapping.
> Is the implementation "under the hood" the same?

Set uses a Hash as storage.
See set.rb:
--------------8<-------------
   # The equality of each couple of elements is determined according to
   # Object#eql? and Object#hash, since Set uses Hash as storage.
-------------->8-------------

regards, Sandor
Sz
Ken B. (Guest)
on 2009-02-03 17:17
(Received via mailing list)
On Mon, 02 Feb 2009 08:21:48 -0500, Robert K. wrote:

>>> Array#find non-existing   5.820000   1.580000   7.400000 (  9.195417)
>
> robert

A trie might also be a good idea. Some tries were proposed as solutions
to Ruby Q. #103 (http://rubyquiz.com/quiz103.html)
Robert K. (Guest)
on 2009-02-03 17:52
(Received via mailing list)
2009/2/3 Ken B. <removed_email_address@domain.invalid>:
>>>>                               user     system      total        real
>>>> Array#find existing       3.130000   0.710000   3.840000 (  3.866618)
>>>> Array#find non-existing   5.820000   1.580000   7.400000 (  9.195417)
>>>> Hash#[] existing          0.000000   0.000000   0.000000 (  0.000030)
>>>> Hash#[] non-existing      0.000000   0.000000   0.000000 (  0.000032)
>>>
>>> That is impressive indeed! No need for any other containers here I
>>> guess. ;-) Thanks for showing me this!
>>
>> If you do not want to attach information to those words then a Set is
>> more appropriate: same O(1) lookup but no key value mapping.

> A trie might also be a good idea. Some tries were proposed as solutions
> to Ruby Q. #103 (http://rubyquiz.com/quiz103.html)

Good idea.  But this is only a realistic option if the trie is
implemented in C I guess - otherwise the standard String#hash and
Set#include? are faster.

Kind regards

robert
This topic is locked and can not be replied to.