Regexp problem

novice_user · November 26, 2008, 5:47pm

Hi,

I have been searching on the Internet about how to solve my problem. I
have a string that needs to be split using regexp in ruby. The problem
is that I cannot find a way to get it right. The string has a fix
format. The rules for matching are as followed:
1)A string starts and end with any character string
2)Matching string starts from “<<” and end with “>>”
3)Matching string must be a pair of “<<” and “>>” and must be nested
correctly
4)There may be any unwanted string between matching pairs but never be
inside the pairs

A sample string is as followed:
str = “<<p1 <> <<p3 <>>>>> <> or <<p6 <<p7 <>>>>>”

The result string I am looking for from the above string should be as
followed:
["<<p1 <> <<p3 <>>>>>", “<>”, “<<p6 <<p7 <>>>>>”]

The closest I could come up with is str.split(/(<<[^>].*(>>)+)/) and I
got the result as followed:
["", “<<p1 <> <<p3 <>>>>> <> <<p6 <<p7 <>>>>>”,
“>>”]

I cannot change the rule (using “<<” and ">>), so I had little
experience on how to match 2 consecutive characters. It would be
appreciated if anyone has an input or suggestion. Thank you.

Chai

novice_user · November 26, 2008, 8:55pm

Chai Muang thong wrote:

Hi,

I have been searching on the Internet about how to solve my problem. I
have a string that needs to be split using regexp in ruby. The problem
is that I cannot find a way to get it right. The string has a fix
format. The rules for matching are as followed:
1)A string starts and end with any character string
2)Matching string starts from “<<” and end with “>>”
3)Matching string must be a pair of “<<” and “>>” and must be nested
correctly

A Regular Expression (by the computer science definition of Regular
Expression) cannot match nested pairs. You’d need a parser for a
context-free grammar.

Having said that: many “regexp” implementations have a features crammed
into them which do that. But the penalty you pay is in complexity, and
often artefacts such as exponential time taken in backtracking.

So: if you want to do it right look at a parser generator such as antlr,
racc, coco, rockit, treetop etc.

However, I admit those may be very large hammers to crack a small nut.

What regexps are very good at is tokenising:

str = “<<p1 <> <<p3 <>>>>> <> or <<p6 <<p7 <>>>>>”
tokens = str.scan(/<<|>>|(?:[^<>]|<(?!<)|>(?!>))+/)

=> ["<<", "p1 ", “<<”, “p2”, “>>”, " ", “<<”, "p3 ", “<<”, “p4”, “>>”,

“>>”, “>>”, " ", “<<”, “p5”, “>>”, " or ", “<<”, "p6 ", “<<”, "p7
", “<<”, “p8”, “>>”, “>>”, “>>”]

The constructs I’ve used are:

(?: … ) grouping without capture
(?! … ) zero-width negative lookahead (match if string
is not followed by … without consuming it)

From this you can count << to find matching >>, or implement a recursive
parser.

A naive counting solution:

depth = 0
out = []
while t = tokens.shift
case t
when “<<”
res = “” if depth == 0
res << t
depth += 1
when “>>”
if depth > 0
res << t
depth -= 1
out << res if depth == 0
end
else
res << t if depth > 0
end
end

novice_user · November 27, 2008, 3:09am

Brian C. wrote:

Chai Muang thong wrote:

Hi,

I have been searching on the Internet about how to solve my problem. I
have a string that needs to be split using regexp in ruby. The problem
is that I cannot find a way to get it right. The string has a fix
format. The rules for matching are as followed:
1)A string starts and end with any character string
2)Matching string starts from “<<” and end with “>>”
3)Matching string must be a pair of “<<” and “>>” and must be nested
correctly

A Regular Expression (by the computer science definition of Regular
Expression) cannot match nested pairs. You’d need a parser for a
context-free grammar.

Having said that: many “regexp” implementations have a features crammed
into them which do that. But the penalty you pay is in complexity, and
often artefacts such as exponential time taken in backtracking.

So: if you want to do it right look at a parser generator such as antlr,
racc, coco, rockit, treetop etc.

However, I admit those may be very large hammers to crack a small nut.

What regexps are very good at is tokenising:

str = “<<p1 <> <<p3 <>>>>> <> or <<p6 <<p7 <>>>>>”
tokens = str.scan(/<<|>>|(?:[^<>]|<(?!<)|>(?!>))+/)

=> ["<<", "p1 ", “<<”, “p2”, “>>”, " ", “<<”, "p3 ", “<<”, “p4”, “>>”,

“>>”, “>>”, " ", “<<”, “p5”, “>>”, " or ", “<<”, "p6 ", “<<”, "p7
", “<<”, “p8”, “>>”, “>>”, “>>”]

The constructs I’ve used are:

(?: … ) grouping without capture
(?! … ) zero-width negative lookahead (match if string
is not followed by … without consuming it)

From this you can count << to find matching >>, or implement a recursive
parser.

A naive counting solution:

depth = 0
out = []
while t = tokens.shift
case t
when “<<”
res = “” if depth == 0
res << t
depth += 1
when “>>”
if depth > 0
res << t
depth -= 1
out << res if depth == 0
end
else
res << t if depth > 0
end
end

def match_pairs s, l, r
d, o, e = 0, [], ‘(’ + Regexp.escape(l) + ‘|’ + Regexp.escape® +
‘)’
while m = s =~ /#{e}/ do
$1 == l ? (o << “” if 1 === d += 1) : d -= 1
o[o.length - 1] += o.last.empty? ? $1 : $` + $1
s = $’
end
o
end

str = “<<p1 <> <<p3 <>>>>> <> or <<p6 <<p7 <>>>>>”
p match_pairs str, “<<”, “>>”

=> ["<<p1 <> <<p3 <>>>>>", “<>”, “<<p6 <<p7 <>>>>>”]

p match_pairs str, “<”, “>”
=> ["<<p1 <> <<p3 <>>>>>", “”, “<>”, “<<p6 <<p7
<>>>>>”]

str = ‘{p1 {p2} {p3 {p4}}} {p5} or {p6 {p7 {p8}}}’
p match_pairs str, ‘{’, ‘}’

=> ["{p1 {p2} {p3 {p4}}}", “{p5}”, “{p6 {p7 {p8}}}”]

novice_user · December 2, 2008, 3:45pm

Oops, I was pointing to wrong person in my last post. Brian method is
correct but not Karnacevych. Anyway, thank you very much for both
replies.

novice_user · December 2, 2008, 3:35pm

Constantine Karnacevych wrote:

Brian C. wrote:

def match_pairs s, l, r
d, o, e = 0, [], ‘(’ + Regexp.escape(l) + ‘|’ + Regexp.escape® +
‘)’
while m = s =~ /#{e}/ do
$1 == l ? (o << “” if 1 === d += 1) : d -= 1
o[o.length - 1] += o.last.empty? ? $1 : $` + $1
s = $’
end
o
end

str = “<<p1 <> <<p3 <>>>>> <> or <<p6 <<p7 <>>>>>”
p match_pairs str, “<<”, “>>”

=> ["<<p1 <> <<p3 <>>>>>", “<>”, “<<p6 <<p7 <>>>>>”]

p match_pairs str, “<”, “>”
=> ["<<p1 <> <<p3 <>>>>>", “”, “<>”, “<<p6 <<p7
<>>>>>”]

str = ‘{p1 {p2} {p3 {p4}}} {p5} or {p6 {p7 {p8}}}’
p match_pairs str, ‘{’, ‘}’

=> ["{p1 {p2} {p3 {p4}}}", “{p5}”, “{p6 {p7 {p8}}}”]

Thank you very much Brian. It works fine as long as an incoming string
is valid. In other words, if the string has extra pair to the left, it
will return wrong result. For example a string “Test or <
correct tag” will return ["<<"] instead of [""]. Anyway, I have solved
the problem by implementing a method to handle this instead of using
only regexp. Thank you again.

Chai