Regexp Question: Checking for [joe][/joe] pairs

Hey, I’ve got some text in @x and want there to be at least 1 and at
most 3 [joe][/joe] pairs, each having at least one character between the
beginning [joe] and the ending [/joe].

This is what I have now, and it seems to sometimes work, and sometimes
not.

@x.match(/([joe][\s\d\w]+?[/joe]){1,3}/)

Why are you doing /[\s\d\w]+?/? Just use /.+?/.

Dan

Daniel F. wrote:

Why are you doing /[\s\d\w]+?/? Just use /.+?/.

Dan

Good point. I was using .+? earlier, but thought that might be part of
my problem. It seems to accept @x even if it contains more than 3
[joe][/joe] pairs.

Hi –

On Thu, 21 Dec 2006, Daniel F. wrote:

Why are you doing /[\s\d\w]+?/? Just use /.+?/.
\d is part of \w, so [\s\w] would be OK. But . is very different. It
does not include newline (by default), and does include punctuation.

David

Hi Joe –

On Thu, 21 Dec 2006, Joe P. wrote:

Hey, I’ve got some text in @x and want there to be at least 1 and at
most 3 [joe][/joe] pairs, each having at least one character between the
beginning [joe] and the ending [/joe].

This is what I have now, and it seems to sometimes work, and sometimes
not.

@x.match(/([joe][\s\d\w]+?[/joe]){1,3}/)

require ‘test/unit’

class JoeTest < Test::Unit::TestCase
def setup
@re = /([joe][\s\d\w]+?[/joe]){1,3}/
end

 def test_ok
   assert("[joe]abc[/joe]".match(@re))
 end

 def test_broken
   # ???        <--- fill in the blank :-)
 end

end

David

The problem is that the Regexp is not anchored to the start and ends of
the string.

/^(?![joe]).?([joe].+?[\joe]){1,3}(?![joe]).?$/

Hi –

On Thu, 21 Dec 2006, Joe P. wrote:

Daniel F. wrote:

Why are you doing /[\s\d\w]+?/? Just use /.+?/.

Dan

Good point. I was using .+? earlier, but thought that might be part of
my problem. It seems to accept @x even if it contains more than 3
[joe][/joe] pairs.

That’s because {1,3} doesn’t mean there can’t be another. Usually
you’d anchor it or surround it with something else, like:

/Xa{1,3}X/

so it can’t keep repeating.

David

Joe P. wrote:

Posted via http://www.ruby-forum.com/.
@x = “[joe] [/joe] [joe][/joe] [joe] foo [/joe]”
count = @x.scan(/[joe](.*?)[/joe]/m).flatten.
reject{|s| “”==s}.size
p count

Joe P. wrote:

The problem is I don’t want it to accept things like:
“[joe] hello [joe] how are [/joe] you”
where there are two opening tags before a closing tag is reached.
Similarly, I don’t want to accept something like:
“hey [joe] it’s hot today[/joe] where [joe] is the ac”
where there is one correct pair but then an opening tag without a
closing one.

I missed the beginning of this thread, but if I recall correctly from my
course on formal languages, this sort if thing can’t be done with
regular expressions.

Regular expressions can be used to test whether a string belongs to a
certain regular language, which is a subset of all possible languages
(where a language is a set of strings). Regular expressions are
equivalent to finite state automata in this respect. Since a finite
state automata can only be in a finite number of states. You’d like to
match a possibly infinitely large number of [joe][/joe] pairs. The FSA
would need a new state for every extra [joe] it reads to remember it
still needs to consume a matching [/joe] for it.

If this sounds like Chinese, just remember regexpes aren’t keen on
matching this sort of stuff. Stacks on the other hand seem to be custom
designed for these purposes.

A.

Regular expressions can be used to test whether a string belongs to a
certain regular language, which is a subset of all possible languages
(where a language is a set of strings). Regular expressions are
equivalent to finite state automata in this respect. Since a finite
state automata can only be in a finite number of states. You’d like to
match a possibly infinitely large number of [joe][/joe] pairs. The FSA
would need a new state for every extra [joe] it reads to remember it
still needs to consume a matching [/joe] for it.

If this sounds like Chinese, just remember regexpes aren’t keen on
matching this sort of stuff. Stacks on the other hand seem to be custom
designed for these purposes.

A.
It doesn’t sound like Chinese :slight_smile:

If wouldn’t have to be an infinite amount of states. Let’s say these
are the states:

State 1 - no [joe] yet. If finds [joe], goes to state 2. If finds
[/joe], fails.
State 2 - [joe] found but not matching [/joe]. If it finds [joe] again
in this state, then fails. If it finds [/joe], increments count by 1
and moves to state 1.

If count goes above 3, fails.

But maybe I’ll use something besides a regexp, although I thought there
would be a pretty easy way to do it.

Hi –

On Fri, 22 Dec 2006, Arne B. wrote:

course on formal languages, this sort if thing can’t be done with regular

If this sounds like Chinese, just remember regexpes aren’t keen on matching
this sort of stuff. Stacks on the other hand seem to be custom designed for
these purposes.

If you’re using scan, though, doesn’t that mean that you’re not really
trying to match one string to the regex, but rather a series of
strings? That means that the state machine gets completely restarted
as the scan method goes along the string. I think that’s a different
situation. You’re not really saying: match all the pairs; you’re
saying: find the first substring that has a matching pair, then
discard it and don’t worry about backtracking through it; etc.

David

The problem is I don’t want it to accept things like:
“[joe] hello [joe] how are [/joe] you”
where there are two opening tags before a closing tag is reached.
Similarly, I don’t want to accept something like:
“hey [joe] it’s hot today[/joe] where [joe] is the ac”
where there is one correct pair but then an opening tag without a
closing one.

Joe P. wrote:

If count goes above 3, fails.

But maybe I’ll use something besides a regexp, although I thought there
would be a pretty easy way to do it.

You’re right, the infinte amount of states is when you have nesting,
e.g. parentheses in programming languages.

I just came up with this solution, but it looks like that’s not what
you’re after.

count=0
‘[joe] hello [joe] how are [/joe] you’.scan(/[joe]|[/joe]/).each do
|m|
if m =~ /[joe]/
count+=1
else
count-=1
end
end

if count!=0
puts “Your joes dont match up!”
end

A.

On 12/21/06, Joe P. [email protected] wrote:

matching this sort of stuff. Stacks on the other hand seem to be custom
State 2 - [joe] found but not matching [/joe]. If it finds [joe] again
in this state, then fails. If it finds [/joe], increments count by 1
and moves to state 1.

If count goes above 3, fails.

But maybe I’ll use something besides a regexp, although I thought there
would be a pretty easy way to do it.

To my knowledge, you can’t do this with Ruby’s current regexp engine,
though it is possible with Perl and .NET. Both of those languages
support something roughly analogous to a stack, within the expression.
I don’t think Ruby 1.8’s regexp engine is powerful enough to handle
this, but I would be happy to be proven wrong.

It’s worth remembering that what we call ‘regular expressions’ these
days don’t actually match the formal definition of that term, and are
much more powerful in some ways.

Joe P. wrote:

matching this sort of stuff. Stacks on the other hand seem to be custom
State 2 - [joe] found but not matching [/joe]. If it finds [joe] again
Posted via http://www.ruby-forum.com/.
If a regular expression can’t do it, does that mean we can’t use
a regular expression?

No. We’ll still use a regexp and add some code to help it.

If all the pairs are matched, then after partitioning and zipping
we wind up with the original pairs.

[
“ok [joe] ok [/joe] right”,
“ok [joe] [/joe] [joe] foo [/joe]”,
“bad [joe] [/joe] foo [joe]”,
“bad [joe] [/joe] foo [/joe]”,
“bad [joe] [joe] foo [/joe]”,
"bad [joe] [joe] ",
"bad [/joe] [joe] ",
"bad [/joe] [/joe] "
].each { |s|
ary = s.scan( %r{[/?joe]} )
p ary
if ary == ary.partition{|t| “[joe]”==t}.inject{|a,b| a.zip(b)
}.flatten
puts “good\n”
else
puts “bad\n”
end
}

— output -----
[“[joe]”, “[/joe]”]
good
[“[joe]”, “[/joe]”, “[joe]”, “[/joe]”]
good
[“[joe]”, “[/joe]”, “[joe]”]
bad
[“[joe]”, “[/joe]”, “[/joe]”]
bad
[“[joe]”, “[joe]”, “[/joe]”]
bad
[“[joe]”, “[joe]”]
bad
[“[/joe]”, “[joe]”]
bad
[“[/joe]”, “[/joe]”]
bad

Andrew J. wrote:

If I am reading your specs correctly:

/^(((?![joe]).)([joe]((?![joe]).)+[/joe])){1,3}((?![joe]).)$/

[
“good [joe] [/joe] [joe] [/joe]”,
“bad [/joe] [joe] [/joe]”,
“bad [joe] [/joe] [/joe]”
].each { |s|
p s =~
/^(((?![joe]).)([joe]((?![joe]).)+[/joe])){1,3}((?![joe]).)$/
}

Joe P. wrote:

The problem is I don’t want it to accept things like:
“[joe] hello [joe] how are [/joe] you”
where there are two opening tags before a closing tag is reached.
Similarly, I don’t want to accept something like:
“hey [joe] it’s hot today[/joe] where [joe] is the ac”
where there is one correct pair but then an opening tag without a
closing one.

If I am reading your specs correctly:

/^(((?![joe]).)([joe]((?![joe]).)+[/joe])){1,3}((?![joe]).)$/

cheers,
andrew

Andrew J. wrote:

].each { |s|
].each { |s|
p s if s =~
/^(((?![/?joe]).)([joe]((?![/?joe]).)+[/joe])){1,3}((?![/?joe]).)$/
}

cheers
andrew

Yours is faster for very short strings; longer strings allow the array
method to pull ahead.

require ‘benchmark’

$n = 10_000
$strings = [
“good [joe] Wasn’t that what he was seeking? [/joe]
[joe] Can’t you see that? [/joe]”,
“bad was Peck’s boy [/joe] [joe] But he’ll never know. [/joe]”,
“bad to the bone [joe] Or will he?! [/joe] mish mash mush
Marching on Tom Tidler’s ground fatigues me. [/joe]”,
“bad: too many [joe] [/joe] [joe] [/joe] [joe] [/joe] [joe] [/joe]”,
“bad: too few”
]

def regexp
$regexp_good = 0
$n.times{ $strings.each { |s|
$regexp_good += 1 if s =~
/\A(((?![/?joe]).)([joe]((?![/?joe]).)+[/joe])){1,3}((?![/?joe]).)\Z/m
} }
end
def array
$array_good = 0
$n.times{ $strings.each { |s|
ary = s.scan( %r{[/?joe]} )
if [2,4,6].include?(ary.size) and
ary == ary.partition{|t| “[joe]”==t}.inject{|a,b| a.zip(b)}.
flatten
$array_good += 1
end
} }
end

Benchmark.bmbm do |x|
x.report(“regexp”) { regexp }
x.report(“array”) { array }
end

puts ; p $regexp_good, $array_good

Rehearsal ------------------------------------------
regexp 6.870000 0.000000 6.870000 ( 7.391000)
array 2.653000 0.000000 2.653000 ( 2.874000)
--------------------------------- total: 9.523000sec

         user     system      total        real

regexp 6.940000 0.000000 6.940000 ( 7.441000)
array 2.634000 0.000000 2.634000 ( 2.854000)

10000
10000

William J. wrote:
[snip]

Rehearsal ------------------------------------------
regexp 6.870000 0.000000 6.870000 ( 7.391000)
array 2.653000 0.000000 2.653000 ( 2.874000)
--------------------------------- total: 9.523000sec

         user     system      total        real

regexp 6.940000 0.000000 6.940000 ( 7.441000)
array 2.634000 0.000000 2.634000 ( 2.854000)

10000
10000

The regex engine makes a difference in this case – ruby1.8.5 + the
oniguruma engine gives:

Rehearsal ------------------------------------------
regexp 2.740000 0.010000 2.750000 ( 2.960385)
array 3.120000 0.000000 3.120000 ( 3.120808)
--------------------------------- total: 5.870000sec

           user     system      total        real

regexp 2.750000 0.000000 2.750000 ( 2.743031)
array 3.140000 0.000000 3.140000 ( 3.137098)

10000
10000

cheers,
andrew

William J. wrote:

Andrew J. wrote:

If I am reading your specs correctly:

/^(((?![joe]).)([joe]((?![joe]).)+[/joe])){1,3}((?![joe]).)$/

[
“good [joe] [/joe] [joe] [/joe]”,
“bad [/joe] [joe] [/joe]”,
“bad [joe] [/joe] [/joe]”
].each { |s|
p s =~
/^(((?![joe]).)([joe]((?![joe]).)+[/joe])){1,3}((?![joe]).)$/
}

Quite right:

[
“good [joe] [/joe] [joe] [/joe]”,
“bad [/joe] [joe] [/joe]”,
“bad [joe] [/joe] [/joe]”
].each { |s|
p s if s =~
/^(((?![/?joe]).)([joe]((?![/?joe]).)+[/joe])){1,3}((?![/?joe]).)$/
}

cheers
andrew