Hello everyone,
I need to count the number of times a substring occurs in a string.
I am currently doing this using the scan method, but it is simply too
slow. I feel there should be a faster way to do this since the scan
method is really designed for more advanced things than this. I do not
need to do regex matching or to process the matches, just count
substrings. So what I want is something like this:
s = “you like to play with your yo-yo”
s.magical_count_method(“yo”) => 4
Once again, what I’m really looking for is something fast. I’ve tried
using external linux commands such as awk, but that was much much
slower. Any ideas?
Thanks,
Once again, what I’m really looking for is something fast. I’ve tried
using external linux commands such as awk, but that was much much
slower. Any ideas?
I don’t know how slow is scan for you. An implementation using
String#index and a loop is a little bit faster, but not too much:
require ‘benchmark’
TIMES = 100_000
s = “you like to play with your yo-yo”
Benchmark.bmbm do |x|
x.report(“scan”) do
TIMES.times do
s.scan(“yo”).size
end
end
x.report(“while”) do
TIMES.times do
index = -1
count = 0
while (index = s.index(“yo”, index+1))
count += 1
end
count
end
end
end
Thanks Jesus,
This method actually decreased the runtime by quite a bit, so thanks
for the help! However, I still need something even faster if it exists,
so any other ideas would be appreciated. I may have to just implement
this part is C or something.
Once again, what I’m really looking for is something fast. �I’ve tried
using external linux commands such as awk, but that was much much
slower. Any ideas?
I don’t know how slow is scan for you. An implementation using
String#index and a loop is a little bit faster, but not too much:
…
Don’t know if this is enough for you, probably not
Thanks Jesus,
This method actually decreased the runtime by quite a bit, so thanks
for the help! However, I still need something even faster if it exists,
so any other ideas would be appreciated. I may have to just implement
this part is C or something.
I suppose that if you implement a C method that does what I did in
Ruby, that would be faster.
I mean doing the loop in C and calling String#index from there.
This thing about adding the length of the match can be argued
depending on the requirements, I think.
What would you expect from:
“yoyoyoyo”.magical_count_method(“yoyo”)
2 or 3?
If you add the length to the index you get 2. If you add 1, you get 3.
irb(main):018:0> s = “yoyoyoyo”
=> “yoyoyoyo”
irb(main):019:0> count = 0
=> 0
irb(main):020:0> len = s.length
=> 8
irb(main):021:0> search = “yoyo”
=> “yoyo”
irb(main):023:0> len = search.length
=> 4
irb(main):024:0> index = -len
=> -4
irb(main):025:0> while (index = s.index(search, index + len))
irb(main):026:1> count += 1
irb(main):027:1> end
=> nil
irb(main):028:0> count
=> 2
irb(main):029:0> count = 0
=> 0
irb(main):030:0> index = -1
=> -1
irb(main):031:0> while (index = s.index(search, index + 1))
irb(main):032:1> count += 1
irb(main):033:1> end
=> nil
irb(main):034:0> count
=> 3
So, I don’t know. Of course, if the requirement is to get 2 from the
above situation, adding the length is better.
Also of notice is that the block versions of scan are slower because
they have to call a block for each match.
I think I’ve read that the String#index method uses Rabin-Karp. It
would be interesting to compare this to a Boyer-Moore implementation.
Of course it will depend on the input data, if it’s near the best or
worst case for each, but anyway.
I’m looking for non-overlapping matches (so a 2 in your example)
I modified your code to do this for me like you showed and it works
fine. I was thinking of trying a Boyer-Moore implementation, but I
suspect if I implement this manually in Ruby it will be much slower.
So, I don’t know. Of course, if the requirement is to get 2 from the
above situation, adding the length is better.
Also of notice is that the block versions of scan are slower because
they have to call a block for each match.
I think I’ve read that the String#index method uses Rabin-Karp. It
would be interesting to compare this to a Boyer-Moore implementation.
Of course it will depend on the input data, if it’s near the best or
worst case for each, but anyway.
On Fri, Jun 25, 2010 at 12:05 AM, Robert K. [email protected] wrote:
sc.rb · GitHub
I would have expected regexp to be faster…
you don’t like strscan ?
best regards -botp
I’ve just run some benchmarks with strscan, and it’s at least in the
same ballpark as the other approaches, unless you’re on rubinius, but
then all string processing is really slow on that anyway.
On Thu, Jun 24, 2010 at 1:16 PM, Michael F. [email protected] wrote:
I’ve just run some benchmarks with strscan, and it’s at least in the
same ballpark as the other approaches, unless you’re on rubinius, but
then all string processing is really slow on that anyway.
On Thu, Jun 24, 2010 at 6:05 PM, Robert K. [email protected] wrote:
I too took the liberty to change the benchmark and I found a strange
way to beat the “while”
but by little
 x.report ‘boyer_moore’ do
  TIMES.times do
   count = BoyerMoore.match(“yo”, s).size
   check count
  end
 end
FYI, a large part of the overhead here is probably the Java calls,
which are a bit slower than Ruby to Ruby calls (plus it’s decoding the
“yo” string to UTF-16 each call). For a larger string and fewer calls,
the pure Java BoyerMoore performance would likely benchmark a lot
better than this.
On Tue, Jun 29, 2010 at 3:13 PM, Charles Oliver N. [email protected] wrote:
“yo” string to UTF-16 each call). For a larger string and fewer calls,
the pure Java BoyerMoore performance would likely benchmark a lot
better than this.
I had a similar suspicion and had started a modified benchmark doing
fewer loops over larger data, but had to move on to other things.
This gives me a chance to try out the JRuby Mac Installer…