Specify start position of Regexp matching

makoto_kuwata · November 25, 2007, 4:23pm

Hi, all.

Is it possible to specify the start position of Regexp matching?

str = "foo bar baz"
m = /ba/.match(str)
p m.begin(0)             #=> 4
m = /ba/.match(str, 5)   # is it possible?
p m.begin(0)             #=> 8 (if possible)

If it is possible, some kind of parser or scanner can be
implemented easily.

StringScanner is a little too big, I think.

makoto_kuwata · November 25, 2007, 4:40pm

On Nov 25, 10:18 am, makoto kuwata [email protected] wrote:

If it is possible, some kind of parser or scanner can be
implemented easily.

StringScanner is a little too big, I think.

You could try something like this:

m = /^.{5,}(ba)/.match(str)
p m.begin(1)

In the regular expression, you’re saying start at the beginning and
skip at least 5 characters. But then we have to use parens to “note”
the part you’re interested in, and then we have to pass 1 rather than
0 to begin, so it reports the location of the first noted match (0
would report where the entire Regexp matched, and that would be the
beginning of the line).

An alternative would be to slice the first n characters off the front
of the string and then do the match.
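Both suggestions can be sketched roughly as follows. The variable names (start, pos_by_pattern, pos_by_slice) are made up for illustration. Note that the sketch uses a non-greedy quantifier and \A with the /m flag instead of the greedy ^.{5,} above; that is a deliberate change so that the first "ba" at or after the start position is found even when several candidates (or newlines) are present.

```ruby
str   = "foo bar baz"
start = 5   # hypothetical start position

# Approach 1: skip at least `start` characters inside the pattern,
# then read the offset of capture group 1 instead of group 0.
m1 = /\A.{#{start},}?(ba)/m.match(str)
pos_by_pattern = m1.begin(1)

# Approach 2: slice off the first `start` characters, match the rest,
# and add `start` back to get an offset into the original string.
m2 = /ba/.match(str[start..-1])
pos_by_slice = m2.begin(0) + start

p pos_by_pattern   #=> 8
p pos_by_slice     #=> 8
```

With the slicing approach, remember that every offset in the MatchData is relative to the slice, not to the original string.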

Eric


makoto_kuwata · November 25, 2007, 5:28pm

On 25.11.2007 16:39, Eric I. wrote:

skip at least 5 characters. But then we have to use parens to “note”
the part you’re interested in, and then we have to pass 1 rather than
0 to begin, so it reports the location of the first noted match (0
would report where the entire Regexp matched, and that would be the
beginning of the line).

An alternative would be to slice the first n characters off the front
of the string and then do the match.

Another alternative is to use String#scan - we would have to know what
the OP really wants to parse though to decide whether it’s a feasible
solution.

Kind regards

robert

makoto_kuwata · November 25, 2007, 7:00pm

Thank you, all.

Eric I. wrote:

You could try something like this:
m = /^.{5,}(ba)/.match(str)
p m.begin(1)

In my program, the start position is variable, such as
def f(n)
  m = /^.{#{n},}(ba)/.match(str)
  …
end
In this case, a new Regexp is created on every call.
That is not efficient.

Robert K. wrote:

Another alternative is to use String#scan -

String#scan is useful only when the regexp pattern is fixed.
input.scan(/FIXED-REGEXP/) do … end
With String#scan, it is not possible to change the regexp pattern
inside the loop.

Axel E. wrote:

           temp=string[start_pos..-1]
           ref=self.match(temp)
           return temp.index(ref[0])+start_pos

In this solution, a temporary substring is created every time.
If the input string is long, that is not efficient.

Thanks for all your advice.
I’m going to propose supporting a start position in Regexp#match().

makoto_kuwata · November 25, 2007, 5:49pm

-------- Original Message --------

Date: Mon, 26 Nov 2007 00:20:25 +0900
From: makoto kuwata [email protected]
To: [email protected]
Subject: [Q] specify start position of Regexp matching

If it is possible, some kind of parser or scanner can be
implemented easily.

StringScanner is a little too big, I think.

–
makoto kuwata

Dear Makoto,

what about :

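class Regexp
  # Returns the index (in the original string) of the first match
  # at or after start_pos, or nil if there is no match.
  def match_index_offset(string, start_pos)
    temp = string[start_pos..-1]
    ref = self.match(temp)
    return nil unless ref
    return ref.begin(0) + start_pos
  end
end

str = "foo bar baz"
m = /ba/.match_index_offset(str, 5)
p m   #=> 8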

Best regards,

Axel

makoto_kuwata · November 26, 2007, 12:24am

Hi,

In message “Re: [Q] specify start position of Regexp matching”
on Mon, 26 Nov 2007 00:20:25 +0900, makoto kuwata
[email protected] writes:

str.index(/ba/, 5) ?

          matz.

makoto_kuwata · November 26, 2007, 12:30am

Robert K. [email protected] wrote:

In this solution, temp substring is created every time.
If input string is long, it is not efficient.

This is not true. Creating a substring is fairly cheap because the
character buffer is not copied (copy on write).

You are right. If the input string is not modified, creating a substring
doesn’t copy anything.
Creating a substring may be the solution I wanted.

I’m going to propose to support start position in Regexp#match().

For the time being it’s faster to use one of the other alternatives.
Also, with the new regexp engine in 1.9 your feature might be present
already.

I found that in Ruby 1.9, Regexp#match() can take an optional second
argument that specifies the position at which matching starts. Good news.
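For reference, a minimal sketch of that optional second argument, assuming Ruby 1.9 or later. The variable names are only for illustration.

```ruby
str = "foo bar baz"

first  = /ba/.match(str)      # search from the beginning of the string
second = /ba/.match(str, 5)   # search from index 5 onward

p first.begin(0)     #=> 4
p second.begin(0)    #=> 8
p second.pre_match   #=> "foo bar "
```

The offsets in the returned MatchData refer to the original string, so no manual adjustment is needed, unlike with the substring approach.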

Thank you, Robert.

makoto_kuwata · November 25, 2007, 8:21pm

On 25.11.2007 18:58, makoto kuwata wrote:

Robert K. wrote:

Another alternative is to use String#scan -

String#scan is useful only when regexp pattern is fixed.
input.scan(/FIXED-REGEXP/) do … end
With String#scan, it is not possible to change the regexp pattern
inside the loop.

But in various situations it is possible to use a unified regexp for
scanning or a regexp that comprises all other patterns.
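As a rough illustration of the "one regexp that comprises all patterns" idea, here is a small sketch. The token kinds (:num, :word) and the pattern itself are invented for this example and are not from the original poster's program.

```ruby
input  = "foo 42 bar"
tokens = []

# One alternation covers every token kind; exactly one capture group
# is non-nil per match, which tells us which alternative fired.
input.scan(/(\d+)|([a-z]+)/) do |num, word|
  tokens << (num ? [:num, num] : [:word, word])
end

p tokens   #=> [[:word, "foo"], [:num, "42"], [:word, "bar"]]
```

Whether this is feasible depends on whether the set of patterns is known up front; it does not help if the pattern must change based on what was matched earlier.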

Axel E. wrote:
           temp=string[start_pos..-1]
           ref=self.match(temp)
           return temp.index(ref[0])+start_pos
In this solution, temp substring is created every time.
If input string is long, it is not efficient.

This is not true. Creating a substring is fairly cheap because the
character buffer is not copied (copy on write).

I’m going to propose to support start position in Regexp#match().

For the time being it’s faster to use one of the other alternatives.
Also, with the new regexp engine in 1.9 your feature might be present
already.

Kind regards

robert

makoto_kuwata · November 26, 2007, 12:52am

makoto kuwata [email protected] wrote:

str.index(/ba/, 5) ?

No, String#index returns Fixnum (position), but I want MatchData.

I found that it is possible to get the MatchData from Regexp.last_match()
after calling String#index().
Well, I think Regexp#match(string, start=0) is the natural way,
but String#index(regexp, start) can be a good solution.
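A small sketch of that combination, for anyone finding this thread later. The variable names are illustrative only.

```ruby
str = "foo bar baz"

pos = str.index(/ba/, 5)    # returns the position, or nil if nothing matches
m   = Regexp.last_match     # MatchData set by the index call above (same as $~)

p pos          #=> 8
p m[0]         #=> "ba"
p m.begin(0)   #=> 8
```

Since Regexp.last_match reflects the most recent match in the current scope, it is safest to read it immediately after String#index and to check pos for nil first.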

Thank you, Matz.

makoto_kuwata · November 26, 2007, 12:50am

Yukihiro M. [email protected] wrote:

str.index(/ba/, 5) ?

No, String#index returns Fixnum (position), but I want MatchData.

Regexp#match(string, start=0) in Ruby 1.9 is exactly the solution I want.
Is there any plan to backport it to Ruby 1.8?

makoto_kuwata · November 26, 2007, 6:21am

What’s the difference between 1.9 Regexp#match(string, start=n) and
1.8 Regexp#match(string[n..-1])? You have to create a sub-string with
the 1.8 version, but according to Robert K. (above) it’s just
creating a pointer into the original string if you’re not changing the
substring or original string. Besides, even if you did get a copy,
it’s anonymous and should be garbage collected soon. If I understand
everything correctly, the 1.9 version would just basically be a
convenience feature over the 1.8 way?

$ irb19
irb(main):001:0> RUBY_VERSION
=> "1.9.0"
irb(main):002:0> m = /oo/.match("foo", start=1)
=> #<MatchData "oo">
irb(main):003:0> m[0]
=> "oo"

$ irb
irb(main):001:0> RUBY_VERSION
=> "1.8.6"
irb(main):002:0> m = /oo/.match("foo"[1..-1])
=> #<MatchData:0xb78777a8>
irb(main):003:0> m[0]
=> "oo"

Regards,
Jordan

makoto_kuwata · November 26, 2007, 7:07am

Jordan Callicoat wrote:

You have to create a sub-string with
the 1.8 version, but according to Robert K. (above) it’s just
creating a pointer into the original string if you’re not changing the
substring or original string.

I’m having a hard time confirming that:

str = "hello"
sub_str = str[1, 2]

puts str.object_id
-->76750

puts sub_str.object_id
-->76740

puts sub_str.class
-->String

makoto_kuwata · November 26, 2007, 7:50am

On Nov 26, 12:07 am, 7stud – [email protected] wrote:

Jordan Callicoat wrote:

You have to create a sub-string with
the 1.8 version, but according to Robert K. (above) it’s just
creating a pointer into the original string if you’re not changing the
substring or original string.

I’m having a hard time confirming that:

I’m not sure how to confirm it, other than just looking at the source,
and since I’m very poor at C programming, it probably wouldn’t help
for me to try that. I’m sure Robert can demonstrate. But I will say
that I’m not surprised that they have different object_id, because they
are different objects. The copy on write is just a back-end
optimization where you pretend that two objects that point to the same
data are unique copies in the front-end, but you don’t actually move
any data in the back-end until you have to (i.e., when one of the
objects is changed).

Regards,
Jordan

makoto_kuwata · November 26, 2007, 8:10am

On Nov 26, 12:45 am, MonkeeSage [email protected] wrote:

I’m not sure how to confirm it, other than just looking at the source,
and since I’m very poor at C programming, it probably wouldn’t help
for me to try that.

Well, I did anyhow…

http://svn.ruby-lang.org/repos/ruby/branches/ruby_1_8/ruby.h
http://svn.ruby-lang.org/repos/ruby/branches/ruby_1_8/string.c

And I think the functions of interest are str_new3 and str_new4
(called from rb_str_substr). Specifically, the assignment of
RSTRING(str2)->aux.shared. But like I said, I’m not great with C, so I
could be mistaken.

Regards,
Jordan

makoto_kuwata · November 27, 2007, 6:08am

On Nov 26, 2007 5:07 PM, 7stud – [email protected] wrote:

puts str.object_id
-->76750

puts sub_str.object_id
-->76740

puts sub_str.class
-->String

A new ruby object is created, but the string buffer that it points to
is only copied on write.
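The visible behaviour can be sketched like this. It only shows what Ruby code can observe (two distinct objects that do not affect each other); it does not prove that the buffer is shared internally, which is an implementation detail of the interpreter.

```ruby
str = "hello, this is the original string"
sub = str[7..-1]     # a new String object

p sub.equal?(str)    #=> false (different objects, hence different object_id)

sub << "!"           # writing to the substring...
p sub                #=> "this is the original string!"
p str                #=> "hello, this is the original string" (original untouched)
```

So different object_ids are expected either way; they say nothing about whether the underlying character data was copied.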

makoto_kuwata · November 26, 2007, 10:06am

Here’s a test to show that my reading of the source, and Robert’s
assertion, is correct (there is probably a better way to do this…):

#!/usr/bin/env ruby

# disable GC to get fair reading of actual allocation cost
GC.disable

def free_megs
  (`free -o`.split("\n")[1].split(' ')[3].to_i/1024).to_s
end

puts "Free megabytes " + free_megs

# make a one megabyte string
s1 = "a" * 1048576
s100 = "" # placeholder to be filled in below

# make 100 substrings of it
0.upto(101) { |i| eval("s#{i}=s1[0..-1]") }

puts s100.length.to_s
puts "Free megabytes " + free_megs

Output:

Free megabytes 588
1048576
Free megabytes 587

Only one meg is used, which is the length of the original string. So,
by inductive inference, the substrings are only pointers back to the
original string rather than copies of the data.

Regards,
Jordan