Brian C. wrote:
Chai Muang thong wrote:
Hi,
I have been searching on the Internet about how to solve my problem. I
have a string that needs to be split using regexp in ruby. The problem
is that I cannot find a way to get it right. The string has a fix
format. The rules for matching are as followed:
1)A string starts and end with any character string
2)Matching string starts from “<<” and end with “>>”
3)Matching string must be a pair of “<<” and “>>” and must be nested
correctly
A Regular Expression (by the computer science definition of Regular
Expression) cannot match nested pairs. You’d need a parser for a
context-free grammar.
Having said that: many “regexp” implementations have a features crammed
into them which do that. But the penalty you pay is in complexity, and
often artefacts such as exponential time taken in backtracking.
So: if you want to do it right look at a parser generator such as antlr,
racc, coco, rockit, treetop etc.
However, I admit those may be very large hammers to crack a small nut.
What regexps are very good at is tokenising:
str = “<<p1 <> <<p3 <>>>>> <> or <<p6 <<p7 <>>>>>”
tokens = str.scan(/<<|>>|(?:[^<>]|<(?!<)|>(?!>))+/)
=> ["<<", "p1 ", “<<”, “p2”, “>>”, " ", “<<”, "p3 ", “<<”, “p4”, “>>”,
“>>”, “>>”, " ", “<<”, “p5”, “>>”, " or ", “<<”, "p6 ", “<<”, "p7
", “<<”, “p8”, “>>”, “>>”, “>>”]
The constructs I’ve used are:
(?: … ) grouping without capture
(?! … ) zero-width negative lookahead (match if string
is not followed by … without consuming it)
From this you can count << to find matching >>, or implement a recursive
parser.
A naive counting solution:
depth = 0
out = []
while t = tokens.shift
case t
when “<<”
res = “” if depth == 0
res << t
depth += 1
when “>>”
if depth > 0
res << t
depth -= 1
out << res if depth == 0
end
else
res << t if depth > 0
end
end
def match_pairs s, l, r
d, o, e = 0, [], ‘(’ + Regexp.escape(l) + ‘|’ + Regexp.escape® +
‘)’
while m = s =~ /#{e}/ do
$1 == l ? (o << “” if 1 === d += 1) : d -= 1
o[o.length - 1] += o.last.empty? ? $1 : $` + $1
s = $’
end
o
end
str = “<<p1 <> <<p3 <>>>>> <> or <<p6 <<p7 <>>>>>”
p match_pairs str, “<<”, “>>”
=> ["<<p1 <> <<p3 <>>>>>", “<>”, “<<p6 <<p7 <>>>>>”]
p match_pairs str, “<”, “>”
=> ["<<p1 <> <<p3 <>>>>>", “”, “<>”, “<<p6 <<p7
<>>>>>”]
str = ‘{p1 {p2} {p3 {p4}}} {p5} or {p6 {p7 {p8}}}’
p match_pairs str, ‘{’, ‘}’
=> ["{p1 {p2} {p3 {p4}}}", “{p5}”, “{p6 {p7 {p8}}}”]