On 30.04.2008 23:48, Robert K. wrote:
| I have a sentence “This is my test sentence” and an array[“is”,
| just one.
| JB
sentence, words = "This is my test sentence", ["This", "is", "my"]
irb(main):004:0> "This is my test sentence".to_enum(:scan,/\w+/).any?
The to_enum version applies the test while traversing, while the plain
scan approach first converts the whole text into words and then applies
the test, thus iterating twice over the whole text, doing more
conversions (to words) and needing more temporary memory (i.e. for the
whole sequence of words, although the overhead might be small because of
internal String memory sharing).
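A small sketch of that difference. The sample text, the word list and the
`visited` array are made up for illustration and are not part of the
benchmark; the point is only that `any?` on the enumerator can stop at the
first hit, while plain `scan` always materializes every word first.

```ruby
text  = "This is my test sentence"
words = ["is", "nope"]   # hypothetical word list

# Plain scan: builds the complete array of words, then tests it.
all_words = text.scan(/\w+/)
std_hit   = all_words.any? { |w| words.include?(w) }

# to_enum(:scan, ...): words are yielded one at a time and any?
# stops the traversal as soon as the block returns true.
visited  = []
enum_hit = text.to_enum(:scan, /\w+/).any? do |w|
  visited << w
  words.include?(w)
end

puts all_words.size    # 5 words were extracted up front
puts visited.inspect   # ["This", "is"], traversal stopped at the hit
```

Whether the early exit pays off in practice depends on where the first
match sits in the text and on the per-word overhead of the enumerator, so
it is worth measuring rather than assuming.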
The Set approach scales better for larger sets of words because a Set
lookup is O(1) on average while an Array-based lookup is O(n).
I am not saying that my approach is faster under all circumstances. But
it surely scales better.
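For illustration, the two lookup styles side by side. This is a minimal
sketch with a made-up sample sentence; the variable names are not taken
from any earlier message.

```ruby
require 'set'

words_arr = %w{a b c d e f}
words_set = words_arr.to_set

# Array#include? compares element by element (O(n)),
# Set#include? does a hash lookup (O(1) on average).
arr_hit  = words_arr.include?("f")
set_hit  = words_set.include?("f")
set_miss = words_set.include?("x")

# Either collection can serve as the tester inside any?
found = "This is a test".scan(/\w+/).any? { |w| words_set.include?(w) }

puts arr_hit, set_hit, set_miss, found
```

With only a handful of words the constant factors (hashing vs. a few
string comparisons) may well dominate, so the asymptotic advantage of the
Set only shows up once the word list grows.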
Well, I did a little benchmarking and it turns out that I probably spoke
too soon. As so often, assumptions should be verified against measurable
reality.
Here are the numbers. I leave the analysis to the reader, but keep in
mind that the situation might change significantly if the input text
needs to be read via IO (from a file etc.).
Kind regards
robert
[email protected] /cygdrive/c/Temp
$ ./scan.rb
Rehearsal -------------------------------------------------------
head arr std          7.578000   0.063000   7.641000 (  7.628000)
head arr enum         0.000000   0.000000   0.000000 (  0.000000)
head set std          8.016000   0.031000   8.047000 (  8.043000)
head set enum         0.000000   0.000000   0.000000 (  0.000000)
head rarr std         7.968000   0.016000   7.984000 (  8.041000)
head rarr enum        0.000000   0.000000   0.000000 (  0.002000)
head rx               0.000000   0.000000   0.000000 (  0.000000)
tail arr std         20.203000   0.000000  20.203000 ( 20.390000)
tail arr enum        32.079000   0.000000  32.079000 ( 33.039000)
tail set std         15.421000   0.031000  15.452000 ( 15.616000)
tail set enum        26.672000   0.016000  26.688000 ( 26.721000)
tail rarr std        19.782000   0.031000  19.813000 ( 19.811000)
tail rarr enum       31.281000   0.000000  31.281000 ( 31.360000)
tail rx               0.078000   0.000000   0.078000 (  0.080000)
mid arr std          13.828000   0.031000  13.859000 ( 13.853000)
mid arr enum         15.781000   0.000000  15.781000 ( 15.814000)
mid set std          11.485000   0.063000  11.548000 ( 11.559000)
mid set enum         12.953000   0.000000  12.953000 ( 12.961000)
mid rarr std         14.156000   0.062000  14.218000 ( 14.231000)
mid rarr enum        15.375000   0.016000  15.391000 ( 15.412000)
mid rx                0.031000   0.000000   0.031000 (  0.039000)
-------------------------------------------- total: 253.047000sec
                          user     system      total        real
head arr std          7.031000   0.062000   7.093000 (  7.086000)
head arr enum         0.000000   0.000000   0.000000 (  0.000000)
head set std          7.078000   0.063000   7.141000 (  7.131000)
head set enum         0.000000   0.000000   0.000000 (  0.000000)
head rarr std         7.000000   0.125000   7.125000 (  7.129000)
head rarr enum        0.000000   0.000000   0.000000 (  0.000000)
head rx               0.000000   0.000000   0.000000 (  0.000000)
tail arr std         19.282000   0.031000  19.313000 ( 19.341000)
tail arr enum        30.328000   0.078000  30.406000 ( 30.658000)
tail set std         14.594000   0.000000  14.594000 ( 14.600000)
tail set enum        25.360000   0.000000  25.360000 ( 25.403000)
tail rarr std        19.047000   0.016000  19.063000 ( 19.076000)
tail rarr enum       29.922000   0.000000  29.922000 ( 29.984000)
tail rx               0.078000   0.000000   0.078000 (  0.082000)
mid arr std          13.297000   0.000000  13.297000 ( 13.312000)
mid arr enum         14.453000   0.000000  14.453000 ( 14.451000)
mid set std          10.954000   0.031000  10.985000 ( 11.012000)
mid set enum         12.093000   0.000000  12.093000 ( 12.155000)
mid rarr std         13.312000   0.000000  13.312000 ( 13.346000)
mid rarr enum        14.375000   0.000000  14.375000 ( 14.389000)
mid rx                0.031000   0.000000   0.031000 (  0.037000)
[email protected] /cygdrive/c/Temp
$ cat scan.rb
#!/bin/env ruby

require 'set'
require 'enumerator'
require 'benchmark'

# Test texts: the only matching word "a" sits at the front, end or middle.
TEXT_FRONT = ("a" << (" x" * 1_000_000)).freeze
TEXT_TAIL = (("x " * 1_000_000) << "a").freeze
TEXT_MID = (("x " * 500_000) << "a" << (" x" * 500_000)).freeze

# The same word list as Array, reversed Array and Set.
WORDS = %w{a b c d e f}.freeze
REV_WORDS = WORDS.reverse.freeze
SET_WORDS = WORDS.to_set.freeze

# "\\b" is a regexp word boundary; a bare "\b" in a double-quoted
# string would be a backspace character.
RX = Regexp.new("\\b#{Regexp.union(*WORDS)}\\b")

TEXTS = {
  "head" => TEXT_FRONT,
  "mid" => TEXT_MID,
  "tail" => TEXT_TAIL,
}

TESTER = {
  "arr" => WORDS,
  "rarr" => REV_WORDS,
  "set" => SET_WORDS,
}

REPEAT = 5

Benchmark.bmbm 20 do |b|
  TEXTS.each do |tlabel, text|
    TESTER.each do |lab, enum|
      # std: build the full word array, then test it
      b.report "#{tlabel} #{lab} std" do
        REPEAT.times do
          text.scan(/\w+/).any? {|w| enum.include? w}
        end
      end

      # enum: test each word while scanning
      b.report "#{tlabel} #{lab} enum" do
        REPEAT.times do
          text.to_enum(:scan, /\w+/).any? {|w| enum.include? w}
        end
      end
    end

    # rx: a single regexp match against the whole text
    b.report "#{tlabel} rx" do
      REPEAT.times do
        RX =~ text
      end
    end
  end
end
[email protected] /cygdrive/c/Temp
$