Finding positions in a string

I have a long string and I need to know the starting position of all
substrings matching a particular sequence. Most importantly, it needs
to be fast. Secondly, it would be nice if it was somewhat concise.

Here’s the method:

def substring_positions(substring, string)

fast, concise method??

end

my_substring = ‘this’
my_string = ‘this string this is a string this is it’

substring_positions(my_substring, my_string) # -> should return
[0, 12, 29]

This seems trivial to do, but looking at StringScanner, String#match,
and String#scan, nothing simple comes to mind. There must be a one-
liner somewhere for this kind of thing. I checked through facets and
didn’t see anything…

Thanks

Hi –

On Tue, 24 Jul 2007, bwv549 wrote:

I have a long string and I need to know the starting position of all
substrings matching a particular sequence. Most importantly, it needs
to be fast. Secondly, it would be nice if it was somewhat concise.

Here’s the method:

def substring_positions(substring, string)

fast, concise method??

Would you settle for concise? :slight_smile:

liner somewhere for this kind of thing. I checked through facets and
didn’t see anything…

I’ll get the ball rolling, though I doubt it’s very efficient:

require ‘enumerator’
def substring_positions(substring, string)
string.enum_for(:scan, substring).map { $~.offset(0)[0] }
end

David

Well, here’s what I have so far:

def substring_positions(substring, string)
substring_len = substring.size
s = StringScanner.new(string)
regexp = Regexp.new(Regexp.escape(substring))
positions = []
while s.scan_until(regexp)
positions << (s.pos - substring_len)
end
positions
end

In theory, it looks fairly fast, but it’s not all that concise…

Anyone have anything better?

Thanks

require ‘enumerator’
def substring_positions(substring, string)
string.enum_for(:scan, substring).map { $~.offset(0)[0] }
end

That’s the one-line magic I knew was hiding out there. I will use it
and see if it is acceptably fast. Thanks

From: bwv549 [mailto:[email protected]]

substring_positions(my_substring, my_string) # → should return

[0, 12, 29]

This seems trivial to do, but looking at StringScanner, String#match,

and String#scan, nothing simple comes to mind. There must be a one-

liner somewhere for this kind of thing.

not a raw one-liner, but you can create one :wink:

root@pc4all:~# cat -n test.rb
1 class String
2 def scan_p ss
3 a = []
4 self.scan(ss){a << Regexp.last_match.begin(0)}
5 a
6 end
7 end
8
9 p “this string this is a string this is it”.scan_p(“this”)
10
root@pc4all:~# ruby test.rb
[0, 12, 29]
root@pc4all:~#

i think the keyword is Matchdata.

kind regards -botp

On Jul 23, 9:10 pm, James Edward G. II [email protected]
wrote:

On Jul 23, 2007, at 3:54 PM, bwv549 wrote:

I have a long string and I need to know the starting position of all
substrings matching a particular sequence. Most importantly, it needs
to be fast. Secondly, it would be nice if it was somewhat concise.

Another idea is to use String#index() with the optional second
minimum index parameter which you keep incrementing.

James Edward G. II

Indeed.

d-o-ibook-g4:~ d-o$ uname -a
Darwin d-o-ibook-g4.local 8.8.0 Darwin Kernel Version 8.8.0: Fri Sep
8 17:18:57 PDT 2006; root:xnu-792.12.6.obj~1/RELEASE_PPC Power
Macintosh powerp

d-o-ibook-g4:~ d-o$ cat bs.rb
require ‘benchmark’
require ‘enumerator’

COUNT=1_000_000

class String
def scan_p ss
a = []
self.scan(ss){a << Regexp.last_match.begin(0)}
a
end
end

substr = Proc.new do |s,seq|
pos=0
index=[]

while((i=s.index(seq,pos))!=nil)
index<<i
pos=i+seq.length
end

index
end

substr_positions = Proc.new do |string,substring|
string.enum_for(:scan, substring).map { $~.offset(0)[0] }
end

str=“assbasscassmassthatassbass”
seq=“ss”

Benchmark.bm do |x|
x.report("with index(): ") do; 1.upto(COUNT) do;
substr.call(seq,str); end end
x.report("with enum_for()/map(): ") do; 1.upto(COUNT) do;
substr_positions.call(str,seq); end end
x.report("with scan(): ") do; 1.upto(COUNT) do; str.scan_p(seq); end
end
end

d-o-ibook-g4:~ d-o$ ruby bs.rb
user system total real
with index(): 10.380000 0.060000 10.440000 ( 10.670129)
with enum_for()/map(): 88.740000 0.910000 89.650000 ( 98.726685)
with scan(): 57.620000 0.590000 58.210000 ( 65.996496)

On Jul 23, 2007, at 3:54 PM, bwv549 wrote:

my_substring = ‘this’
my_string = ‘this string this is a string this is it’

substring_positions(my_substring, my_string) # -> should return
[0, 12, 29]

Another idea is to use String#index() with the optional second
minimum index parameter which you keep incrementing.

James Edward G. II

From: Skye Shaw!@#$ [mailto:[email protected]]

On Jul 23, 9:10 pm, James Edward G. II [email protected]

wrote:

> Another idea is to use String#index() with the optional second

> minimum index parameter which you keep incrementing.

cool and simple indeed.

substr = Proc.new do |s,seq|

pos=0

index=[]

^^^^^
rename. it might confuse us nubies :slight_smile:

while((i=s.index(seq,pos))!=nil)

simplify. while i=s.index(seq,pos)

index<<i

pos=i+seq.length

       ^^^^^
   this does not change. init this outside loop.

end

index

end

hmmm, even the recursive index method is not bad…

root@pc4all:~# cat -n test.rb
1 require ‘benchmark’
2 require ‘enumerator’
3
4 COUNT=100_000
5
6 class String
7 def scan_p ss
8 a = []
9 self.scan(ss){a << Regexp.last_match.begin(0)}
10 a
11 end
12
13 # recursive scan using index
14 def scan_p2 ss, pos = 0
15 return [] unless i = index(ss,pos)
16 [i] + scan_p2(ss, i + ss.length)
17 end
18 end
19
20 class String
21 def scan_i seq
22 pos=0
23 ndx=[]
24 slen = seq.length
25 while i=index(seq,pos)
26 ndx << i
27 pos = i + slen
28 end
29 ndx
30 end
31 end
32
33 class String
34 #substr_positions = Proc.new do |string,substring|
35 def scan_enum substring
36 enum_for(:scan, substring).map { $~.offset(0)[0] }
37 end
38 end
39
40 str=“assbasscassmassthatassbass”
41 seq=“ss”
42
43 Benchmark.bm do |x|
44 x.report("with index(): ") do; 1.upto(COUNT) do;
45 str.scan_i(seq); end end
46 x.report("with enum_for()/map(): ") do; 1.upto(COUNT) do;
47 str.scan_enum(seq); end end
48 x.report("with scan_p(): ") do; 1.upto(COUNT) do;
str.scan_p(seq); end end
49 x.report("with scan_p2(): ") do; 1.upto(COUNT) do;
str.scan_p2(seq); end end
50
51 end
52
53
root@pc4all:~# ruby test.rb
user system total real
with index(): 4.760000 0.010000 4.770000 ( 4.774163)
with enum_for()/map(): 22.100000 0.010000 22.110000 ( 22.139630)
with scan_p(): 16.680000 0.010000 16.690000 ( 16.702391)
with scan_p2(): 9.940000 0.010000 9.950000 ( 9.955084)
root@pc4all:~#

not bad at 100k loops :slight_smile:

root@pc4all:~# cat /proc/cpuinfo | egrep -i ‘(mhz|name)’
model name : Pentium II (Klamath)
cpu MHz : 300.061
root@pc4all:~# cat /proc/meminfo | grep -i mem
MemTotal: 255616 kB
MemFree: 7920 kB

kind regards -botp