Forum: Ruby regexp problem

Announcement (2017-05-07): www.ruby-forum.com is now read-only since I unfortunately do not have the time to support and maintain the forum any more. Please see rubyonrails.org/community and ruby-lang.org/en/community for other Rails- und Ruby-related community platforms.
4a6bd1cc395eae0ba8c31bba4c195016?d=identicon&s=25 Chai Muang thong (novice_user)
on 2008-11-26 17:47
Hi,

I have been searching on the Internet about how to solve my problem. I
have a string that needs to be split using regexp in ruby. The problem
is that I cannot find a way to get it right. The string has a fix
format. The rules for matching are as followed:
 1)A string starts and end with any character string
 2)Matching string starts from "<<" and end with ">>"
 3)Matching string must be a pair of "<<" and ">>" and must be nested
correctly
 4)There may be any unwanted string between matching pairs but never be
inside the pairs

A sample string is as followed:
str = "<<p1 <<p2>> <<p3 <<p4>>>>>> <and> <<p5>> or <<p6 <<p7 <<p8>>>>>>"

The result string I am looking for from the above string should be as
followed:
["<<p1 <<p2>> <<p3 <<p4>>>>>>", "<<p5>>", "<<p6 <<p7 <<p8>>>>>>"]

The closest I could come up with is str.split(/(<<[^>].*(>>)+)/) and I
got the result as followed:
["", "<<p1 <<p2>> <<p3 <<p4>>>>>> <and> <<p5>> <<p6 <<p7 <<p8>>>>>>",
">>"]

I cannot change the rule (using "<<" and ">>), so I had little
experience on how to match 2 consecutive characters. It would be
appreciated if anyone has an input or suggestion. Thank you.

Chai
753dcb78b3a3651127665da4bed3c782?d=identicon&s=25 Brian Candler (candlerb)
on 2008-11-26 20:55
Chai Muang thong wrote:
> Hi,
>
> I have been searching on the Internet about how to solve my problem. I
> have a string that needs to be split using regexp in ruby. The problem
> is that I cannot find a way to get it right. The string has a fix
> format. The rules for matching are as followed:
>  1)A string starts and end with any character string
>  2)Matching string starts from "<<" and end with ">>"
>  3)Matching string must be a pair of "<<" and ">>" and must be nested
> correctly

A Regular Expression (by the computer science definition of Regular
Expression) cannot match nested pairs. You'd need a parser for a
context-free grammar.

Having said that: many "regexp" implementations have a features crammed
into them which do that. But the penalty you pay is in complexity, and
often artefacts such as exponential time taken in backtracking.

So: if you want to do it right look at a parser generator such as antlr,
racc, coco, rockit, treetop etc.

However, I admit those may be very large hammers to crack a small nut.

What regexps *are* very good at is tokenising:

str = "<<p1 <<p2>> <<p3 <<p4>>>>>> <and> <<p5>> or <<p6 <<p7 <<p8>>>>>>"
tokens = str.scan(/<<|>>|(?:[^<>]|<(?!<)|>(?!>))+/)

# => ["<<", "p1 ", "<<", "p2", ">>", " ", "<<", "p3 ", "<<", "p4", ">>",
">>", ">>", " <and> ", "<<", "p5", ">>", " or ", "<<", "p6 ", "<<", "p7
", "<<", "p8", ">>", ">>", ">>"]

The constructs I've used are:

  (?: ... )       grouping without capture
  (?! ... )       zero-width negative lookahead (match if string
                  is not followed by ... without consuming it)

From this you can count << to find matching >>, or implement a recursive
parser.

A naive counting solution:

depth = 0
out = []
while t = tokens.shift
  case t
  when "<<"
    res = "" if depth == 0
    res << t
    depth += 1
  when ">>"
    if depth > 0
      res << t
      depth -= 1
      out << res if depth == 0
    end
  else
    res << t if depth > 0
  end
end
3d401468ea7d69980d5cf9ecd18ce65a?d=identicon&s=25 Constantine Karnacevych (digital)
on 2008-11-27 03:09
Brian Candler wrote:
> Chai Muang thong wrote:
>> Hi,
>>
>> I have been searching on the Internet about how to solve my problem. I
>> have a string that needs to be split using regexp in ruby. The problem
>> is that I cannot find a way to get it right. The string has a fix
>> format. The rules for matching are as followed:
>>  1)A string starts and end with any character string
>>  2)Matching string starts from "<<" and end with ">>"
>>  3)Matching string must be a pair of "<<" and ">>" and must be nested
>> correctly
>
> A Regular Expression (by the computer science definition of Regular
> Expression) cannot match nested pairs. You'd need a parser for a
> context-free grammar.
>
> Having said that: many "regexp" implementations have a features crammed
> into them which do that. But the penalty you pay is in complexity, and
> often artefacts such as exponential time taken in backtracking.
>
> So: if you want to do it right look at a parser generator such as antlr,
> racc, coco, rockit, treetop etc.
>
> However, I admit those may be very large hammers to crack a small nut.
>
> What regexps *are* very good at is tokenising:
>
> str = "<<p1 <<p2>> <<p3 <<p4>>>>>> <and> <<p5>> or <<p6 <<p7 <<p8>>>>>>"
> tokens = str.scan(/<<|>>|(?:[^<>]|<(?!<)|>(?!>))+/)
>
> # => ["<<", "p1 ", "<<", "p2", ">>", " ", "<<", "p3 ", "<<", "p4", ">>",
> ">>", ">>", " <and> ", "<<", "p5", ">>", " or ", "<<", "p6 ", "<<", "p7
> ", "<<", "p8", ">>", ">>", ">>"]
>
> The constructs I've used are:
>
>   (?: ... )       grouping without capture
>   (?! ... )       zero-width negative lookahead (match if string
>                   is not followed by ... without consuming it)
>
> From this you can count << to find matching >>, or implement a recursive
> parser.
>
> A naive counting solution:
>
> depth = 0
> out = []
> while t = tokens.shift
>   case t
>   when "<<"
>     res = "" if depth == 0
>     res << t
>     depth += 1
>   when ">>"
>     if depth > 0
>       res << t
>       depth -= 1
>       out << res if depth == 0
>     end
>   else
>     res << t if depth > 0
>   end
> end

def match_pairs s, l, r
    d, o, e = 0, [],  '(' + Regexp.escape(l) + '|' + Regexp.escape(r) +
')'
    while m = s =~ /#{e}/ do
        $1 == l ? (o << "" if 1 === d += 1) : d -= 1
        o[o.length - 1] += o.last.empty? ? $1 : $` + $1
        s = $'
    end
    o
end

str = "<<p1 <<p2>> <<p3 <<p4>>>>>> <and> <<p5>> or <<p6 <<p7 <<p8>>>>>>"
p match_pairs str, "<<", ">>"

=> ["<<p1 <<p2>> <<p3 <<p4>>>>>>", "<<p5>>", "<<p6 <<p7 <<p8>>>>>>"]

p match_pairs str, "<", ">"
=> ["<<p1 <<p2>> <<p3 <<p4>>>>>>", "<and>", "<<p5>>", "<<p6 <<p7
<<p8>>>>>>"]

str = '{p1 {p2} {p3 {p4}}} <and> {p5} or {p6 {p7 {p8}}}'
p match_pairs str, '{', '}'

=> ["{p1 {p2} {p3 {p4}}}", "{p5}", "{p6 {p7 {p8}}}"]
4a6bd1cc395eae0ba8c31bba4c195016?d=identicon&s=25 Chai Muang thong (novice_user)
on 2008-12-02 15:35
Constantine Karnacevych wrote:
> Brian Candler wrote:
>
> def match_pairs s, l, r
>     d, o, e = 0, [],  '(' + Regexp.escape(l) + '|' + Regexp.escape(r) +
> ')'
>     while m = s =~ /#{e}/ do
>         $1 == l ? (o << "" if 1 === d += 1) : d -= 1
>         o[o.length - 1] += o.last.empty? ? $1 : $` + $1
>         s = $'
>     end
>     o
> end
>
> str = "<<p1 <<p2>> <<p3 <<p4>>>>>> <and> <<p5>> or <<p6 <<p7 <<p8>>>>>>"
> p match_pairs str, "<<", ">>"
>
> => ["<<p1 <<p2>> <<p3 <<p4>>>>>>", "<<p5>>", "<<p6 <<p7 <<p8>>>>>>"]
>
> p match_pairs str, "<", ">"
> => ["<<p1 <<p2>> <<p3 <<p4>>>>>>", "<and>", "<<p5>>", "<<p6 <<p7
> <<p8>>>>>>"]
>
> str = '{p1 {p2} {p3 {p4}}} <and> {p5} or {p6 {p7 {p8}}}'
> p match_pairs str, '{', '}'
>
> => ["{p1 {p2} {p3 {p4}}}", "{p5}", "{p6 {p7 {p8}}}"]

--------------

Thank you very much Brian. It works fine as long as an incoming string
is valid. In other words, if the string has extra pair to the left, it
will return wrong result. For example a string "Test <tag> or <<not>
correct tag" will return ["<<"] instead of [""]. Anyway, I have solved
the problem by implementing a method to handle this instead of using
only regexp. Thank you again.

Chai
4a6bd1cc395eae0ba8c31bba4c195016?d=identicon&s=25 Chai Muang thong (novice_user)
on 2008-12-02 15:45
Oops, I was pointing to wrong person in my last post. Brian method is
correct but not Karnacevych. Anyway, thank you very much for both
replies.
This topic is locked and can not be replied to.