On Thu, Oct 18, 2012 at 2:51 AM, Matthew K. wrote:
Tangentially, it just occurred to me that ruby’s regular expression
engine does the same thing that javascript’s does, when globally
replacing /X*$/ .
This behavior is common to most regexp engines (at least I don't know
of any that behaves differently). Any regular expression of the form
X* can match the empty string, anywhere in the input.
irb(main):022:0> "####".scan /\w*/
=> ["", "", "", "", ""]
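A minimal sketch of how the same effect shows up in a global replace, using only core String methods (the variable names are just illustrative): after the trailing run of # has been consumed, the engine still finds one more empty match at the very end of the string.

```ruby
# X* matches the empty string at every position where no X starts.
empty_matches = "####".scan(/\w*/)
p empty_matches

# In a global replace anchored at the end, the trailing run is replaced
# first, and then the empty match at the end of the string is replaced too.
replaced = "foo##".gsub(/#*\z/, "X")
p replaced
```

So a single trailing run can end up producing two replacements.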
Also, when part of the expression is anchored at the end and contains
a repetition, you need to make sure that the characters it should
match are not eaten by earlier parts of the regexp.
"Naive" approach:
irb(main):026:0> %w{aaa aab abb bbb}.each {|s| /.*(b*)\z/ =~ s; printf "%p: 1:%p\n", s, $1}
"aaa": 1:""
"aab": 1:""
"abb": 1:""
"bbb": 1:""
=> ["aaa", "aab", "abb", "bbb"]
Working approaches:
- reduce greed
irb(main):027:0> %w{aaa aab abb bbb}.each {|s| /.*?(b*)\z/ =~ s; printf "%p: 1:%p\n", s, $1}
"aaa": 1:""
"aab": 1:"b"
"abb": 1:"bb"
"bbb": 1:"bbb"
=> ["aaa", "aab", "abb", "bbb"]
- negative lookbehind
irb(main):028:0> %w{aaa aab abb bbb}.each {|s| /.*(?<!b)(b*)\z/ =~ s; printf "%p: 1:%p\n", s, $1}
"aaa": 1:""
"aab": 1:"b"
"abb": 1:"bb"
"bbb": 1:"bbb"
=> ["aaa", "aab", "abb", "bbb"]
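The three variants can also be compared side by side in a small script. This is only a sketch: the helper name trailing_bs is made up for illustration, and the patterns assume the greedy prefix is `.*` and the captured group is `(b*)`, which is what the printed results imply.

```ruby
# Hypothetical helper: return what group 1 captures for each input.
# Every pattern below can match the empty string, so match never returns nil.
def trailing_bs(pattern, inputs)
  inputs.map { |s| pattern.match(s)[1] }
end

inputs = %w[aaa aab abb bbb]

greedy     = trailing_bs(/.*(b*)\z/, inputs)        # .* eats every b first
lazy       = trailing_bs(/.*?(b*)\z/, inputs)       # .*? takes as little as possible
lookbehind = trailing_bs(/.*(?<!b)(b*)\z/, inputs)  # .* may not end right after a b

p greedy, lazy, lookbehind
```

The lazy and lookbehind variants should agree on these inputs, while the greedy one leaves the group empty every time.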
Note though the special case where there is only one alternative with
a match anchored at the end:
irb(main):045:0> for b in body; for pre in segm; for post in segm; s="#{pre}#{b}#{post}"; printf "%p → %p\n",s,s[/#*\z/]; end end end
"" → ""
"#" → "#"
"##" → "##"
"#" → "#"
"##" → "##"
"###" → "###"
"##" → "##"
"###" → "###"
"####" → "####"
"foo" → ""
"foo#" → "#"
"foo##" → "##"
"#foo" → ""
"#foo#" → "#"
"#foo##" → "##"
"##foo" → ""
"##foo#" → "#"
"##foo##" → "##"
=> ["", "foo"]
Here, the simple expression works since the # are not eaten by other
portions of the regexp.
irb(main):004:0> 'foo#'.gsub(/\A#*|#*\Z/, '#')
=> "#foo##"
irb(main):005:0> 'foo##'.gsub(/\A#*|#*\Z/, '#')
=> "#foo##"
irb(main):006:0> '##foo##'.gsub(/\A#*|#*\Z/, '#')
=> "#foo##"
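To see where the extra trailing # in these results comes from, one can list every match the pattern finds. A small sketch, assuming the pattern is `/\A#*|#*\Z/` (the variable name matches is illustrative):

```ruby
# Collect [offset, matched text] for every match that gsub would replace.
matches = []
"foo#".scan(/\A#*|#*\Z/) { matches << [$~.begin(0), $~[0]] }
p matches
```

The list contains an empty match at the start, the trailing "#", and one more empty match at the very end; each of the three is replaced by "#".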
If a single regexp should be used in this case, negative lookbehind is
a viable option, since there is no preceding part in this alternative
which we could make non-greedy:
irb(main):044:0> for b in body; for pre in segm; for post in segm; s="#{pre}#{b}#{post}"; printf "%p → %p\n",s,s.gsub(/\A#*|(?<!#)#*\z/, '#'); end end end
"" → "#"
"#" → "#"
"##" → "#"
"#" → "#"
"##" → "#"
"###" → "#"
"##" → "#"
"###" → "#"
"####" → "#"
"foo" → "#foo#"
"foo#" → "#foo#"
"foo##" → "#foo#"
"#foo" → "#foo#"
"#foo#" → "#foo#"
"#foo##" → "#foo#"
"##foo" → "#foo#"
"##foo#" → "#foo#"
"##foo##" → "#foo#"
=> ["", "foo"]
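The irb loops above rely on body and segm, whose definitions are not shown in this message. A self-contained sketch, assuming body = ["", "foo"] and segm = ["", "#", "##"] (values inferred from the printed output, not confirmed) and using a made-up helper name normalize_hashes:

```ruby
# Assumed inputs: not shown in the post, inferred from its output.
body = ["", "foo"]
segm = ["", "#", "##"]

# Hypothetical helper: collapse the leading and trailing runs of # to a
# single # each. The lookbehind keeps the second alternative from matching
# the empty string right after a # that was already consumed.
def normalize_hashes(s)
  s.gsub(/\A#*|(?<!#)#*\z/, "#")
end

results = {}
body.each do |b|
  segm.each do |pre|
    segm.each do |post|
      s = "#{pre}#{b}#{post}"
      results[s] = normalize_hashes(s)
    end
  end
end

results.each { |s, r| printf "%p → %p\n", s, r }
```

Because results is a hash, duplicate inputs such as "#" and "##" are printed only once here, unlike in the irb session.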
I blogged about it here:
Matthew Kerwin :: Blog :: JS: /x*$/ in global replace
Turns out with Oniguruma there is a way to do it with a single
regexp. In fact any regexp engine with lookbehind will do.
Reference: "Notice of End of Service" (the linked page's current title)
Kind regards
robert