Regexp Question: Checking for [joe][/joe] pairs

joep · December 20, 2006, 9:31pm

Hey, I’ve got some text in @x and want there to be at least 1 and at
most 3 [joe][/joe] pairs, each having at least one character between the
beginning [joe] and the ending [/joe].

This is what I have now, and it seems to sometimes work, and sometimes
not.

@x.match(/([joe][\s\d\w]+?[/joe]){1,3}/)

joep · December 20, 2006, 9:37pm

Why are you doing /[\s\d\w]+?/? Just use /.+?/.

Dan

joep · December 20, 2006, 9:40pm

Daniel F. wrote:

Why are you doing /[\s\d\w]+?/? Just use /.+?/.

Dan

Good point. I was using .+? earlier, but thought that might be part of
my problem. It seems to accept @x even if it contains more than 3
[joe][/joe] pairs.

joep · December 20, 2006, 9:44pm

Hi –

On Thu, 21 Dec 2006, Daniel F. wrote:

Why are you doing /[\s\d\w]+?/? Just use /.+?/.
\d is part of \w, so [\s\w] would be OK. But . is very different. It
does not include newline (by default), and does include punctuation.

David

joep · December 20, 2006, 9:41pm

Hi Joe –

On Thu, 21 Dec 2006, Joe P. wrote:

Hey, I’ve got some text in @x and want there to be at least 1 and at
most 3 [joe][/joe] pairs, each having at least one character between the
beginning [joe] and the ending [/joe].

This is what I have now, and it seems to sometimes work, and sometimes
not.

@x.match(/([joe][\s\d\w]+?[/joe]){1,3}/)

require ‘test/unit’

class JoeTest < Test::Unit::TestCase
def setup
@re = /([joe][\s\d\w]+?[/joe]){1,3}/
end

 def test_ok
   assert("[joe]abc[/joe]".match(@re))
 end

 def test_broken
   # ???        <--- fill in the blank :-)
 end

end

David

joep · December 20, 2006, 9:52pm

The problem is that the Regexp is not anchored to the start and ends of
the string.

/^(?![joe]).?([joe].+?[\joe]){1,3}(?![joe]).?$/

joep · December 20, 2006, 9:48pm

Hi –

On Thu, 21 Dec 2006, Joe P. wrote:

Daniel F. wrote:

Why are you doing /[\s\d\w]+?/? Just use /.+?/.

Dan

Good point. I was using .+? earlier, but thought that might be part of
my problem. It seems to accept @x even if it contains more than 3
[joe][/joe] pairs.

That’s because {1,3} doesn’t mean there can’t be another. Usually
you’d anchor it or surround it with something else, like:

/Xa{1,3}X/

so it can’t keep repeating.

David

joep · December 20, 2006, 9:56pm

Joe P. wrote:

Posted via http://www.ruby-forum.com/.
@x = “[joe] [/joe] [joe][/joe] [joe] foo [/joe]”
count = @x.scan(/[joe](.*?)[/joe]/m).flatten.
reject{|s| “”==s}.size
p count

joep · December 21, 2006, 5:40pm

Joe P. wrote:

The problem is I don’t want it to accept things like:
“[joe] hello [joe] how are [/joe] you”
where there are two opening tags before a closing tag is reached.
Similarly, I don’t want to accept something like:
“hey [joe] it’s hot today[/joe] where [joe] is the ac”
where there is one correct pair but then an opening tag without a
closing one.

I missed the beginning of this thread, but if I recall correctly from my
course on formal languages, this sort if thing can’t be done with
regular expressions.

Regular expressions can be used to test whether a string belongs to a
certain regular language, which is a subset of all possible languages
(where a language is a set of strings). Regular expressions are
equivalent to finite state automata in this respect. Since a finite
state automata can only be in a finite number of states. You’d like to
match a possibly infinitely large number of [joe][/joe] pairs. The FSA
would need a new state for every extra [joe] it reads to remember it
still needs to consume a matching [/joe] for it.

If this sounds like Chinese, just remember regexpes aren’t keen on
matching this sort of stuff. Stacks on the other hand seem to be custom
designed for these purposes.

A.

joep · December 21, 2006, 5:48pm

Regular expressions can be used to test whether a string belongs to a
certain regular language, which is a subset of all possible languages
(where a language is a set of strings). Regular expressions are
equivalent to finite state automata in this respect. Since a finite
state automata can only be in a finite number of states. You’d like to
match a possibly infinitely large number of [joe][/joe] pairs. The FSA
would need a new state for every extra [joe] it reads to remember it
still needs to consume a matching [/joe] for it.

If this sounds like Chinese, just remember regexpes aren’t keen on
matching this sort of stuff. Stacks on the other hand seem to be custom
designed for these purposes.

A.
It doesn’t sound like Chinese

If wouldn’t have to be an infinite amount of states. Let’s say these
are the states:

State 1 - no [joe] yet. If finds [joe], goes to state 2. If finds
[/joe], fails.
State 2 - [joe] found but not matching [/joe]. If it finds [joe] again
in this state, then fails. If it finds [/joe], increments count by 1
and moves to state 1.

If count goes above 3, fails.

But maybe I’ll use something besides a regexp, although I thought there
would be a pretty easy way to do it.

joep · December 21, 2006, 5:52pm

Hi –

On Fri, 22 Dec 2006, Arne B. wrote:

course on formal languages, this sort if thing can’t be done with regular

If this sounds like Chinese, just remember regexpes aren’t keen on matching
this sort of stuff. Stacks on the other hand seem to be custom designed for
these purposes.

If you’re using scan, though, doesn’t that mean that you’re not really
trying to match one string to the regex, but rather a series of
strings? That means that the state machine gets completely restarted
as the scan method goes along the string. I think that’s a different
situation. You’re not really saying: match all the pairs; you’re
saying: find the first substring that has a matching pair, then
discard it and don’t worry about backtracking through it; etc.

David

joep · December 21, 2006, 4:51pm

The problem is I don’t want it to accept things like:
“[joe] hello [joe] how are [/joe] you”
where there are two opening tags before a closing tag is reached.
Similarly, I don’t want to accept something like:
“hey [joe] it’s hot today[/joe] where [joe] is the ac”
where there is one correct pair but then an opening tag without a
closing one.

joep · December 21, 2006, 5:58pm

Joe P. wrote:

If count goes above 3, fails.

But maybe I’ll use something besides a regexp, although I thought there
would be a pretty easy way to do it.

You’re right, the infinte amount of states is when you have nesting,
e.g. parentheses in programming languages.

I just came up with this solution, but it looks like that’s not what
you’re after.

count=0
‘[joe] hello [joe] how are [/joe] you’.scan(/[joe]|[/joe]/).each do
|m|
if m =~ /[joe]/
count+=1
else
count-=1
end
end

if count!=0
puts “Your joes dont match up!”
end

A.

joep · December 21, 2006, 5:54pm

On 12/21/06, Joe P. [email protected] wrote:

matching this sort of stuff. Stacks on the other hand seem to be custom
State 2 - [joe] found but not matching [/joe]. If it finds [joe] again
in this state, then fails. If it finds [/joe], increments count by 1
and moves to state 1.

If count goes above 3, fails.

But maybe I’ll use something besides a regexp, although I thought there
would be a pretty easy way to do it.

To my knowledge, you can’t do this with Ruby’s current regexp engine,
though it is possible with Perl and .NET. Both of those languages
support something roughly analogous to a stack, within the expression.
I don’t think Ruby 1.8’s regexp engine is powerful enough to handle
this, but I would be happy to be proven wrong.

It’s worth remembering that what we call ‘regular expressions’ these
days don’t actually match the formal definition of that term, and are
much more powerful in some ways.

joep · December 21, 2006, 8:11pm

Joe P. wrote:

matching this sort of stuff. Stacks on the other hand seem to be custom
State 2 - [joe] found but not matching [/joe]. If it finds [joe] again
Posted via http://www.ruby-forum.com/.
If a regular expression can’t do it, does that mean we can’t use
a regular expression?

No. We’ll still use a regexp and add some code to help it.

If all the pairs are matched, then after partitioning and zipping
we wind up with the original pairs.

[
“ok [joe] ok [/joe] right”,
“ok [joe] [/joe] [joe] foo [/joe]”,
“bad [joe] [/joe] foo [joe]”,
“bad [joe] [/joe] foo [/joe]”,
“bad [joe] [joe] foo [/joe]”,
"bad [joe] [joe] ",
"bad [/joe] [joe] ",
"bad [/joe] [/joe] "
].each { |s|
ary = s.scan( %r{[/?joe]} )
p ary
if ary == ary.partition{|t| “[joe]”==t}.inject{|a,b| a.zip(b)
}.flatten
puts “good\n”
else
puts “bad\n”
end
}

— output -----
[“[joe]”, “[/joe]”]
good
[“[joe]”, “[/joe]”, “[joe]”, “[/joe]”]
good
[“[joe]”, “[/joe]”, “[joe]”]
bad
[“[joe]”, “[/joe]”, “[/joe]”]
bad
[“[joe]”, “[joe]”, “[/joe]”]
bad
[“[joe]”, “[joe]”]
bad
[“[/joe]”, “[joe]”]
bad
[“[/joe]”, “[/joe]”]
bad

joep · December 22, 2006, 9:40pm

Andrew J. wrote:

If I am reading your specs correctly:

/^(((?![joe]).)([joe]((?![joe]).)+[/joe])){1,3}((?![joe]).)$/

[
“good [joe] [/joe] [joe] [/joe]”,
“bad [/joe] [joe] [/joe]”,
“bad [joe] [/joe] [/joe]”
].each { |s|
p s =~
/^(((?![joe]).)([joe]((?![joe]).)+[/joe])){1,3}((?![joe]).)$/
}

joep · December 22, 2006, 9:00pm

Joe P. wrote:

The problem is I don’t want it to accept things like:
“[joe] hello [joe] how are [/joe] you”
where there are two opening tags before a closing tag is reached.
Similarly, I don’t want to accept something like:
“hey [joe] it’s hot today[/joe] where [joe] is the ac”
where there is one correct pair but then an opening tag without a
closing one.

If I am reading your specs correctly:

/^(((?![joe]).)([joe]((?![joe]).)+[/joe])){1,3}((?![joe]).)$/

cheers,
andrew

joep · December 25, 2006, 4:45pm

Andrew J. wrote:

].each { |s|
].each { |s|
p s if s =~
/^(((?![/?joe]).)([joe]((?![/?joe]).)+[/joe])){1,3}((?![/?joe]).)$/
}

cheers
andrew

Yours is faster for very short strings; longer strings allow the array
method to pull ahead.

require ‘benchmark’

$n = 10_000
$strings = [
“good [joe] Wasn’t that what he was seeking? [/joe]
[joe] Can’t you see that? [/joe]”,
“bad was Peck’s boy [/joe] [joe] But he’ll never know. [/joe]”,
“bad to the bone [joe] Or will he?! [/joe] mish mash mush
Marching on Tom Tidler’s ground fatigues me. [/joe]”,
“bad: too many [joe] [/joe] [joe] [/joe] [joe] [/joe] [joe] [/joe]”,
“bad: too few”
]

def regexp
$regexp_good = 0
$n.times{ $strings.each { |s|
$regexp_good += 1 if s =~
/\A(((?![/?joe]).)([joe]((?![/?joe]).)+[/joe])){1,3}((?![/?joe]).)\Z/m
} }
end
def array
$array_good = 0
$n.times{ $strings.each { |s|
ary = s.scan( %r{[/?joe]} )
if [2,4,6].include?(ary.size) and
ary == ary.partition{|t| “[joe]”==t}.inject{|a,b| a.zip(b)}.
flatten
$array_good += 1
end
} }
end

Benchmark.bmbm do |x|
x.report(“regexp”) { regexp }
x.report(“array”) { array }
end

puts ; p $regexp_good, $array_good

Rehearsal ------------------------------------------
regexp 6.870000 0.000000 6.870000 ( 7.391000)
array 2.653000 0.000000 2.653000 ( 2.874000)
--------------------------------- total: 9.523000sec

         user     system      total        real

regexp 6.940000 0.000000 6.940000 ( 7.441000)
array 2.634000 0.000000 2.634000 ( 2.854000)

10000
10000

joep · December 25, 2006, 6:52pm

William J. wrote:
[snip]

Rehearsal ------------------------------------------
regexp 6.870000 0.000000 6.870000 ( 7.391000)
array 2.653000 0.000000 2.653000 ( 2.874000)
--------------------------------- total: 9.523000sec
         user     system      total        real
regexp 6.940000 0.000000 6.940000 ( 7.441000)
array 2.634000 0.000000 2.634000 ( 2.854000)

10000
10000

The regex engine makes a difference in this case – ruby1.8.5 + the
oniguruma engine gives:

Rehearsal ------------------------------------------
regexp 2.740000 0.010000 2.750000 ( 2.960385)
array 3.120000 0.000000 3.120000 ( 3.120808)
--------------------------------- total: 5.870000sec

           user     system      total        real

regexp 2.750000 0.000000 2.750000 ( 2.743031)
array 3.140000 0.000000 3.140000 ( 3.137098)

10000
10000

cheers,
andrew

joep · December 24, 2006, 9:40pm

William J. wrote:

Andrew J. wrote:

If I am reading your specs correctly:

/^(((?![joe]).)([joe]((?![joe]).)+[/joe])){1,3}((?![joe]).)$/

[
“good [joe] [/joe] [joe] [/joe]”,
“bad [/joe] [joe] [/joe]”,
“bad [joe] [/joe] [/joe]”
].each { |s|
p s =~
/^(((?![joe]).)([joe]((?![joe]).)+[/joe])){1,3}((?![joe]).)$/
}

Quite right:

[
“good [joe] [/joe] [joe] [/joe]”,
“bad [/joe] [joe] [/joe]”,
“bad [joe] [/joe] [/joe]”
].each { |s|
p s if s =~
/^(((?![/?joe]).)([joe]((?![/?joe]).)+[/joe])){1,3}((?![/?joe]).)$/
}

cheers
andrew