Count substrings in string, scan too slow

dji · June 24, 2010, 5:04pm

Hello everyone,
I need to count the number of times a substring occurs in a string.
I am currently doing this using the scan method, but it is simply too
slow. I feel there should be a faster way to do this since the scan
method is really designed for more advanced things than this. I do not
need to do regex matching or to process the matches, just count
substrings. So what I want is something like this:

s = "you like to play with your yo-yo"
s.magical_count_method("yo") => 4

Once again, what I’m really looking for is something fast. I’ve tried
using external linux commands such as awk, but that was much much
slower. Any ideas?
Thanks,

Danny.

dji · June 24, 2010, 5:17pm

On Thu, Jun 24, 2010 at 5:04 PM, Danny C. [email protected]
wrote:

Once again, what I’m really looking for is something fast. I’ve tried
using external linux commands such as awk, but that was much much
slower. Any ideas?

I don’t know how slow scan is for you. An implementation using
String#index and a loop is a little faster, but not by much:

require 'benchmark'

TIMES = 100_000
s = "you like to play with your yo-yo"

Benchmark.bmbm do |x|
  x.report("scan") do
    TIMES.times do
      s.scan("yo").size
    end
  end
  x.report("while") do
    TIMES.times do
      index = -1
      count = 0
      while (index = s.index("yo", index+1))
        count += 1
      end
      count
    end
  end
end

$ ruby scan_vs_while.rb
Rehearsal -----------------------------------------
scan      0.560000   0.020000   0.580000 (  0.585972)
while     0.440000   0.060000   0.500000 (  0.492969)
-------------------------------- total: 1.080000sec

              user     system      total        real
scan      0.510000   0.010000   0.520000 (  0.519078)
while     0.470000   0.020000   0.490000 (  0.493562)

Don’t know if this is enough for you, probably not

Jesus.

dji · June 24, 2010, 5:45pm

Thanks Jesus,
This method actually decreased the runtime by quite a bit, so thanks
for the help! However, I still need something even faster if it exists,
so any other ideas would be appreciated. I may have to just implement
this part in C or something.

Danny.

Jesús Gabriel y Galán wrote:

On Thu, Jun 24, 2010 at 5:04 PM, Danny C. [email protected]
wrote:

Once again, what I’m really looking for is something fast. I’ve tried
using external linux commands such as awk, but that was much much
slower. Any ideas?

I don’t know how slow scan is for you. An implementation using
String#index and a loop is a little faster, but not by much:
…

Don’t know if this is enough for you, probably not

Jesus.

dji · June 24, 2010, 5:51pm

On Thu, Jun 24, 2010 at 5:45 PM, Danny C. [email protected]
wrote:

Thanks Jesus,
This method actually decreased the runtime by quite a bit, so thanks
for the help! However, I still need something even faster if it exists,
so any other ideas would be appreciated. I may have to just implement
this part in C or something.

I suppose that if you implement a C method that does what I did in
Ruby, that would be faster.
I mean doing the loop in C and calling String#index from there.

Jesus.

dji · June 24, 2010, 5:50pm

Anything written in Ruby may not beat the underlying library functions,
as they are written in C.

I have vague recollections of a Ruby Quiz being based on something like
this.
Dave.

dji · June 24, 2010, 6:17pm

On Thu, Jun 24, 2010 at 6:05 PM, Robert K.
[email protected] wrote:

I took the liberty to extend the benchmark a bit:

sc.rb · GitHub

I would have expected regexp to be faster…

Whether to advance by the length of the match or by 1 is debatable; it
depends on the requirements, I think.
What would you expect from:

"yoyoyoyo".magical_count_method("yoyo")

2 or 3?

If you add the length to the index you get 2. If you add 1, you get 3.

irb(main):018:0> s = "yoyoyoyo"
=> "yoyoyoyo"
irb(main):019:0> count = 0
=> 0
irb(main):020:0> len = s.length
=> 8
irb(main):021:0> search = "yoyo"
=> "yoyo"
irb(main):023:0> len = search.length
=> 4
irb(main):024:0> index = -len
=> -4
irb(main):025:0> while (index = s.index(search, index + len))
irb(main):026:1> count += 1
irb(main):027:1> end
=> nil
irb(main):028:0> count
=> 2

irb(main):029:0> count = 0
=> 0
irb(main):030:0> index = -1
=> -1
irb(main):031:0> while (index = s.index(search, index + 1))
irb(main):032:1> count += 1
irb(main):033:1> end
=> nil
irb(main):034:0> count
=> 3

So, I don’t know. Of course, if the requirement is to get 2 from the
above situation, adding the length is better.
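
The two behaviours from the irb session above can be folded into one helper. This is only a sketch: the method name count_substring and the overlapping: option are made up for illustration, and the keyword argument assumes Ruby 2.0 or later (the thread itself uses 1.8.7).

```ruby
# Hypothetical helper, not from the thread: counts occurrences of +needle+
# in +haystack+ with String#index, stepping by the needle length
# (non-overlapping) or by 1 (overlapping).
def count_substring(haystack, needle, overlapping: false)
  raise ArgumentError, "needle must not be empty" if needle.empty?

  step  = overlapping ? 1 : needle.length
  count = 0
  index = 0
  while (index = haystack.index(needle, index))
    count += 1
    index += step
  end
  count
end

puts count_substring("yoyoyoyo", "yoyo")                    # 2
puts count_substring("yoyoyoyo", "yoyo", overlapping: true) # 3
```

Stepping by the needle length gives the same count as s.scan(needle).size, since scan also returns non-overlapping matches.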

Also of note is that the block versions of scan are slower because
they have to call a block for each match.
I think I’ve read that the String#index method uses Rabin-Karp. It
would be interesting to compare this to a Boyer-Moore implementation.
Of course, the result will depend on the input data and whether it is near
the best or worst case for each algorithm, but anyway.

Jesus.

dji · June 24, 2010, 6:35pm

On Fri, Jun 25, 2010 at 12:05 AM, Robert K.
[email protected] wrote:

sc.rb · GitHub
I would have expected regexp to be faster…

you don’t like strscan ?
best regards -botp

dji · June 24, 2010, 7:03pm

I’m looking for non-overlapping matches (so a 2 in your example).
I modified your code to do this, as you showed, and it works
fine. I was thinking of trying a Boyer-Moore implementation, but I
suspect if I implement this manually in Ruby it will be much slower.
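
For reference, a pure-Ruby version might look like the sketch below. It uses Boyer-Moore-Horspool, a simplified Boyer-Moore variant swapped in here for brevity, and counts non-overlapping matches. The name horspool_count is hypothetical; the code works on bytes, assumes both strings share an encoding, and needs Ruby 1.9.3 or later for byteslice and getbyte. No timing is claimed for it; it only illustrates the skip-table idea.

```ruby
# Hypothetical Boyer-Moore-Horspool counter (non-overlapping matches).
def horspool_count(haystack, needle)
  m = needle.bytesize
  n = haystack.bytesize
  raise ArgumentError, "needle must not be empty" if m.zero?
  return 0 if m > n

  # Skip table: for each byte of the needle except the last one, the
  # distance from its last occurrence to the end of the needle.
  # Bytes that do not occur in the needle skip the full needle length.
  shift = Hash.new(m)
  needle.bytes.first(m - 1).each_with_index { |b, i| shift[b] = m - 1 - i }

  count = 0
  pos = 0
  while pos <= n - m
    if haystack.byteslice(pos, m) == needle
      count += 1
      pos += m # jump past the match so matches never overlap
    else
      # Skip according to the byte under the last position of the window.
      pos += shift[haystack.getbyte(pos + m - 1)]
    end
  end
  count
end

puts horspool_count("you like to play with your yo-yo", "yo") # 4
puts horspool_count("yoyoyoyo", "yoyo")                       # 2
```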

Jesús Gabriel y Galán wrote:

On Thu, Jun 24, 2010 at 6:05 PM, Robert K.
[email protected] wrote:

So, I don’t know. Of course, if the requirement is to get 2 from the
above situation, adding the length is better.

Also of note is that the block versions of scan are slower because
they have to call a block for each match.
I think I’ve read that the String#index method uses Rabin-Karp. It
would be interesting to compare this to a Boyer-Moore implementation.
Of course it will depend on the input data, if it’s near the best or
worst case for each, but anyway.

Jesus.

dji · June 24, 2010, 7:16pm

On Fri, Jun 25, 2010 at 1:35 AM, botp [email protected] wrote:

On Fri, Jun 25, 2010 at 12:05 AM, Robert K.
[email protected] wrote:

sc.rb · GitHub
I would have expected regexp to be faster…

you don’t like strscan ?
best regards -botp

I’ve just run some benchmarks with strscan, and it’s at least in the
same ballpark as the other approaches, unless you’re on rubinius, but
then all string processing is really slow on that anyway.

Benchmark with strscan here: sc.rb · GitHub

dji · June 24, 2010, 6:05pm

2010/6/24 Jesús Gabriel y Galán [email protected]:

Rehearsal -----------------------------------------
scan 0.560000 0.020000 0.580000 ( 0.585972)
while 0.440000 0.060000 0.500000 ( 0.492969)
-------------------------------- total: 1.080000sec
       user     system      total        real
scan 0.510000 0.010000 0.520000 ( 0.519078)
while 0.470000 0.020000 0.490000 ( 0.493562)

Don’t know if this is enough for you, probably not

I took the liberty to extend the benchmark a bit:

https://gist.github.com/rklemme/451622

gistfile2.txt

18:00:04 Temp$ allruby sc.rb
CYGWIN_NT-5.1 padrklemme1 1.7.5(0.225/5/3) 2010-04-12 19:07 i686 Cygwin
========================================
ruby 1.8.7 (2008-08-11 patchlevel 72) [i386-cygwin]
Rehearsal ----------------------------------------------
scan         2.766000   0.000000   2.766000 (  2.786000)
scan ++      4.656000   0.000000   4.656000 (  4.668000)
scan re      2.688000   0.000000   2.688000 (  2.696000)
scan re ++   4.531000   0.000000   4.531000 (  4.547000)
while        1.094000   0.000000   1.094000 (  1.135000)

(output truncated; see the gist for the full listing)

sc.rb

require 'benchmark'

TIMES = 100_000
s = "you like to play with your yo-yo"

Benchmark.bmbm do |x|
 x.report("scan") do
   TIMES.times do
       count = s.scan("yo").size
       raise count unless count == 4

(file truncated; see the gist for the full source)

I would have expected regexp to be faster…

Cheers

robert

dji · June 24, 2010, 9:03pm

On Thu, Jun 24, 2010 at 1:16 PM, Michael F.
[email protected] wrote:

I’ve just run some benchmarks with strscan, and it’s at least in the
same ballpark as the other approaches, unless you’re on rubinius, but
then all string processing is really slow on that anyway.

Benchmark with strscan here: sc.rb · GitHub

http://en.literateprograms.org/Boyer-Moore_string_search_algorithm_(Java)

require 'java'
java_import 'BoyerMoore'

x.report 'boyer_moore' do
  count = BoyerMoore.match("yo", s).size
  check count
end

$ jruby -v yomark.rb
jruby 1.5.0 (ruby 1.8.7 patchlevel 249) (2010-05-12 6769999) (Java
HotSpot™ Client VM 1.6.0_20) [i386-java]
Rehearsal -----------------------------------------------
scan         22.423000   0.000000  22.423000 ( 22.334000)
scan ++      36.738000   0.000000  36.738000 ( 36.738000)
scan re      19.451000   0.000000  19.451000 ( 19.451000)
scan re ++   39.222000   0.000000  39.222000 ( 39.222000)
while        22.621000   0.000000  22.621000 ( 22.622000)
strscan      29.075000   0.000000  29.075000 ( 29.076000)
boyer_moore   0.009000   0.000000   0.009000 (  0.009000)
------------------------------------ total: 169.539000sec

                  user     system      total        real
scan         18.050000   0.000000  18.050000 ( 18.051000)
scan ++      35.046000   0.000000  35.046000 ( 35.046000)
scan re      17.807000   0.000000  17.807000 ( 17.807000)
scan re ++   34.086000   0.000000  34.086000 ( 34.085000)
while        22.089000   0.000000  22.089000 ( 22.089000)
strscan      29.538000   0.000000  29.538000 ( 29.538000)
boyer_moore   0.005000   0.000000   0.005000 (  0.004000)

$ jruby -v --server --fast yomark.rb
jruby 1.5.0 (ruby 1.8.7 patchlevel 249) (2010-05-12 6769999) (Java
HotSpot™ Server VM 1.6.0_20) [i386-java]
yobench.rb:50 warning: Useless use of a variable in void context.
Rehearsal -----------------------------------------------
scan         17.340000   0.000000  17.340000 ( 17.154000)
scan ++      23.986000   0.000000  23.986000 ( 23.987000)
scan re      15.170000   0.000000  15.170000 ( 15.169000)
scan re ++   22.805000   0.000000  22.805000 ( 22.806000)
while        12.050000   0.000000  12.050000 ( 12.050000)
strscan      31.396000   0.000000  31.396000 ( 31.396000)
boyer_moore   0.010000   0.000000   0.010000 (  0.010000)
------------------------------------ total: 122.756999sec

                  user     system      total        real
scan         15.201000   0.000000  15.201000 ( 15.201000)
scan ++      23.758000   0.000000  23.758000 ( 23.758000)
scan re      14.770000   0.000000  14.770000 ( 14.770000)
scan re ++   22.455000   0.000000  22.455000 ( 22.455000)
while        12.182000   0.000000  12.182000 ( 12.182000)
strscan      24.497000   0.000000  24.497000 ( 24.497000)
boyer_moore   0.002000   0.000000   0.002000 (  0.002000)

dji · June 25, 2010, 6:03am

On Fri, Jun 25, 2010 at 1:16 AM, Michael F. wrote:

I’ve just run some benchmarks with strscan, and it’s at least in the
same ballpark as the other approaches, unless you’re on rubinius, but
then all string processing is really slow on that anyway.

Benchmark with strscan here: sc.rb · GitHub

that is not fair for strscan… you are recreating the object inside the
loop

outside loop do:
s=StringScanner.new "some string foo…"
s2=s.dup

inside loop do:
s=s2
… s.scan_until…

best regards -botp
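
One way to reuse a single scanner across iterations is sketched below. It swaps in StringScanner#reset, which rewinds the scan pointer, in place of the dup shown above; the helper name strscan_count is made up for illustration.

```ruby
require 'strscan'

# Hypothetical helper: counts non-overlapping matches of +pattern+,
# rewinding the shared scanner first so it can be reused across calls.
def strscan_count(scanner, pattern)
  scanner.reset
  count = 0
  count += 1 while scanner.scan_until(pattern)
  count
end

# The scanner is created once, outside the loop.
scanner = StringScanner.new("you like to play with your yo-yo")
3.times { puts strscan_count(scanner, /yo/) } # prints 4 each time
```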

dji · June 25, 2010, 9:39am

On Fri, Jun 25, 2010 at 1:01 PM, botp [email protected] wrote:

On Fri, Jun 25, 2010 at 1:16 AM, Michael F. wrote:

I’ve just run some benchmarks with strscan, and it’s at least in the
same ballpark as the other approaches, unless you’re on rubinius, but
then all string processing is really slow on that anyway.

Benchmark with strscan here: sc.rb · GitHub

that is not fair for strscan… you are recreating the object inside the loop

That’s not fair for the others, and doesn’t make any difference in the
benchmark anyway.

dji · June 24, 2010, 9:49pm

http://en.literateprograms.org/Boyer-Moore_string_search_algorithm_(Java)

require 'java'
java_import 'BoyerMoore'

x.report 'boyer_moore' do
  count = BoyerMoore.match("yo", s).size
  check count
end

that wasn’t the right one

x.report 'boyer_moore' do
  TIMES.times do
    count = BoyerMoore.match("yo", s).size
    check count
  end
end

jruby 1.5.0 (ruby 1.8.7 patchlevel 249) (2010-05-12 6769999) (Java
HotSpot™ Client VM 1.6.0_20) [i386-java]
Rehearsal -----------------------------------------------
boyer_moore  25.742000   0.000000  25.742000 ( 25.661000)
------------------------------------- total: 25.742000sec

                  user     system      total        real
boyer_moore  24.869000   0.000000  24.869000 ( 24.869000)

jruby 1.5.0 (ruby 1.8.7 patchlevel 249) (2010-05-12 6769999) (Java
HotSpot™ Server VM 1.6.0_20) [i386-java]
Rehearsal -----------------------------------------------
boyer_moore  16.733000   0.000000  16.733000 ( 16.401000)
------------------------------------- total: 16.733000sec

                  user     system      total        real
boyer_moore  15.970000   0.000000  15.970000 ( 15.971000)

dji · June 29, 2010, 10:20pm

On Thu, Jun 24, 2010 at 6:05 PM, Robert K.
[email protected] wrote:
I too took the liberty of changing the benchmark, and I found a strange
way to beat the “while” version, but only by a little.

https://gist.github.com/RobertDober/457751

scountbench.rb

require 'benchmark'

TIMES = 4_000
s = "you like to play with your yo-yo" * 100
Count = 400

def check!
  abort "count not #{Count} but #{@count}" unless @count == Count
end

(file truncated; see the gist for the full source)

dji · June 25, 2010, 12:00pm

On Fri, Jun 25, 2010 at 3:38 PM, Michael F.

That’s not fair for the others,

indeed, in general. but if multiple/repeated processes are done on the
same string, then strscan will make a very big difference.

and doesn’t make any difference in the
benchmark anyway.

which makes me think that ruby strings may already be
strscan-ready without added init load

best regards -botp

dji · June 29, 2010, 9:14pm

On Thu, Jun 24, 2010 at 2:48 PM, [email protected] wrote:

that wasn’t the right one

x.report 'boyer_moore' do
  TIMES.times do
    count = BoyerMoore.match("yo", s).size
    check count
  end
end

FYI, a large part of the overhead here is probably the Java calls,
which are a bit slower than Ruby to Ruby calls (plus it’s decoding the
“yo” string to UTF-16 each call). For a larger string and fewer calls,
the pure Java BoyerMoore performance would likely benchmark a lot
better than this.

Charlie
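
A benchmark with a larger string and fewer calls, as suggested above, could be set up roughly like this. It is a plain-Ruby sketch covering only scan and the String#index loop, not the Java BoyerMoore class; the sizes and the method names count_with_scan and count_with_index are assumptions chosen for illustration.

```ruby
require 'benchmark'

# Hypothetical sizes: far fewer iterations over a much longer string.
TIMES  = 200
TEXT   = "you like to play with your yo-yo" * 1_000
NEEDLE = "yo"

# Counts non-overlapping matches via String#scan.
def count_with_scan(text, needle)
  text.scan(needle).size
end

# Counts non-overlapping matches via a String#index loop.
def count_with_index(text, needle)
  count = 0
  index = 0
  while (index = text.index(needle, index))
    count += 1
    index += needle.length # step past the match
  end
  count
end

Benchmark.bmbm do |x|
  x.report("scan")  { TIMES.times { count_with_scan(TEXT, NEEDLE) } }
  x.report("while") { TIMES.times { count_with_index(TEXT, NEEDLE) } }
end
```

With this shape, per-call overhead is paid 200 times rather than 100,000 times, so the search itself accounts for more of the measured time.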

dji · June 30, 2010, 4:05am

On Tue, Jun 29, 2010 at 3:13 PM, Charles Oliver N.
[email protected] wrote:

“yo” string to UTF-16 each call). For a larger string and fewer calls,
the pure Java BoyerMoore performance would likely benchmark a lot
better than this.

I had a similar suspicion and had started a modified benchmark doing
fewer loops over larger data, but had to move on to other things.

This gives me a chance to try out the JRuby Mac Installer…

Original benchmark:

jruby 1.5.0 (ruby 1.8.7 patchlevel 249) (2010-05-12 6769999) (Java
HotSpot™ 64-Bit Server VM 1.6.0_20) [x86_64-java]
Rehearsal -----------------------------------------------
scan          8.851000   0.000000   8.851000 (  8.784000)
scan ++      14.186000   0.000000  14.186000 ( 14.186000)
scan re       8.594000   0.000000   8.594000 (  8.594000)
scan re ++   15.558000   0.000000  15.558000 ( 15.558000)
while         8.102000   0.000000   8.102000 (  8.101000)
strscan      14.023000   0.000000  14.023000 ( 14.023000)
boyer_moore   7.446000   0.000000   7.446000 (  7.446000)
------------------------------------- total: 76.760000sec

                  user     system      total        real
scan          8.157000   0.000000   8.157000 (  8.157000)
scan ++      13.953000   0.000000  13.953000 ( 13.953000)
scan re       8.346000   0.000000   8.346000 (  8.346000)
scan re ++   15.332000   0.000000  15.332000 ( 15.333000)
while         8.087000   0.000000   8.087000 (  8.087000)
strscan      14.303000   0.000000  14.303000 ( 14.303000)
boyer_moore   6.885000   0.000000   6.885000 (  6.885000)

Even with the Ruby to Java call overhead, the Java BoyerMoore is
coming back the fastest on this machine. For comparison:

ruby 1.8.7 (2009-06-12 patchlevel 174) [universal-darwin10.0]
Rehearsal ----------------------------------------------
scan        31.030000   0.020000  31.050000 ( 31.094718)
scan ++     62.310000   0.900000  63.210000 ( 63.227271)
scan re     31.030000   0.030000  31.060000 ( 31.110528)
scan re ++  62.820000   0.870000  63.690000 ( 63.718876)
while       26.090000   0.020000  26.110000 ( 26.095308)
strscan     28.440000   0.010000  28.450000 ( 28.485140)
----------------------------------- total: 243.570000sec

                 user     system      total        real
scan        31.240000   0.020000  31.260000 ( 31.264699)
scan ++     64.000000   0.860000  64.860000 ( 64.865223)
scan re     31.570000   0.020000  31.590000 ( 31.581045)
scan re ++  64.180000   0.980000  65.160000 ( 65.401667)
while       26.580000   0.030000  26.610000 ( 26.757658)
strscan     28.730000   0.030000  28.760000 ( 28.831860)

Unfortunately, I do not have 1.9.x on this machine at the moment.