Hey, I’ve got some text in @x and want there to be at least 1 and at
most 3 [joe][/joe] pairs, each having at least one character between the
beginning [joe] and the ending [/joe].
This is what I have now, and it seems to sometimes work, and sometimes
not.
@x.match(/(\[joe\][\s\d\w]+?\[\/joe\]){1,3}/)
Why are you doing /[\s\d\w]+?/? Just use /.+?/.
\d is part of \w, so [\s\w] would be OK. But . is very different. It
does not include newline (by default), and does include punctuation.
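To make the difference between . and [\s\w] concrete, here is a small sketch. The helper name accepts? and the sample characters are made up for illustration; it simply tests single characters against each pattern.

```ruby
# Hypothetical helper: true when the regexp matches the given string.
def accepts?(re, ch)
  !(re =~ ch).nil?
end

dot        = /\A.\z/        # any single character except newline
dot_multi  = /\A.\z/m       # with /m, "." also matches newline
space_word = /\A[\s\w]\z/   # whitespace, letters, digits, underscore

# Print one line per sample character showing which patterns accept it.
["a", "7", " ", "\n", "!", "["].each do |ch|
  puts "#{ch.inspect}: dot=#{accepts?(dot, ch)} " \
       "dot/m=#{accepts?(dot_multi, ch)} " \
       "[\\s\\w]=#{accepts?(space_word, ch)}"
end
```

Punctuation such as "!" or "[" gets through . but not through [\s\w], while a newline does the opposite unless /m is given.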
require 'test/unit'

class JoeTest < Test::Unit::TestCase
  def setup
    # 1 to 3 [joe]...[/joe] pairs; the brackets and the slash are escaped
    @re = /(\[joe\][\s\d\w]+?\[\/joe\]){1,3}/
  end

  def test_ok
    assert("[joe]abc[/joe]".match(@re))
  end

  def test_broken
    # ??? <--- fill in the blank :-)
  end
end
Good point. I was using .+? earlier, but thought that might be part of
my problem. It seems to accept @x even if it contains more than 3
[joe][/joe] pairs.
That’s because {1,3} doesn’t mean there can’t be another. Usually
you’d anchor it or surround it with something else, like:
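As a rough sketch of what anchoring could look like (this is my own pattern, not necessarily what was meant, and the method names are invented): with \A and \z the whole string has to consist of 1 to 3 pairs, so a fourth pair is no longer ignored. It assumes the pairs sit back to back with nothing before, between, or after them.

```ruby
# Unanchored: succeeds as soon as 1 to 3 pairs are found anywhere.
def unanchored_ok?(text)
  !(text =~ /(\[joe\][\s\w]+?\[\/joe\]){1,3}/).nil?
end

# Anchored: the entire string must be 1 to 3 pairs and nothing else.
def anchored_ok?(text)
  !(text =~ /\A(\[joe\][\s\w]+?\[\/joe\]){1,3}\z/).nil?
end

four_pairs = "[joe]a[/joe]" * 4
puts unanchored_ok?(four_pairs)   # the extra pair is simply not looked at
puts anchored_ok?(four_pairs)     # the extra pair makes the match fail
```

Note that this keeps [\s\w] between the tags. With .+? the content could swallow a whole "[/joe][joe]" and the anchored version would still accept four pairs.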
The problem is I don’t want it to accept things like:
“[joe] hello [joe] how are [/joe] you”
where there are two opening tags before a closing tag is reached.
Similarly, I don’t want to accept something like:
“hey [joe] it’s hot today[/joe] where [joe] is the ac”
where there is one correct pair but then an opening tag without a
closing one.
I missed the beginning of this thread, but if I recall correctly from my
course on formal languages, this sort of thing can’t be done with
regular expressions.
Regular expressions can be used to test whether a string belongs to a
certain regular language, which is a subset of all possible languages
(where a language is a set of strings). Regular expressions are
equivalent to finite state automata in this respect. A finite state
automaton can only be in a finite number of states, but you’d like to
match a possibly unbounded number of [joe][/joe] pairs. The FSA
would need a new state for every extra [joe] it reads to remember that it
still needs to consume a matching [/joe] for it.
If this sounds like Chinese, just remember regexps aren’t keen on
matching this sort of stuff. Stacks, on the other hand, seem to be custom
designed for these purposes.
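For what it’s worth, the stack idea could look something like the sketch below. The method name balanced? is made up, and it only checks that opening and closing tags balance (nesting is allowed); it does not enforce the 1-to-3 limit from the question.

```ruby
# Hypothetical helper: true when every [joe] is closed by a later [/joe]
# and no [/joe] appears without an open [joe].
def balanced?(text)
  stack = []
  text.scan(/\[\/?joe\]/) do |tag|
    if tag == "[joe]"
      stack.push(tag)               # remember the open tag
    else
      return false if stack.empty?  # closing tag with nothing open
      stack.pop
    end
  end
  stack.empty?                      # leftovers mean an unclosed [joe]
end

puts balanced?("[joe]abc[/joe]")
puts balanced?("hey [joe] it's hot today[/joe] where [joe] is the ac")
```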
It doesn’t sound like Chinese.
It wouldn’t have to be an infinite number of states. Let’s say these
are the states:
State 1 - no [joe] yet. If it finds [joe], it goes to state 2. If it finds
[/joe], it fails.
State 2 - [joe] found but no matching [/joe] yet. If it finds [joe] again
in this state, it fails. If it finds [/joe], it increments the count by 1
and moves to state 1.
If the count goes above 3, it fails.
But maybe I’ll use something besides a regexp, although I thought there
would be a pretty easy way to do it.
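The two states described above can be written out directly in plain Ruby. This is only a sketch: valid_joe_pairs? and its max_pairs parameter are names I made up, and the check for an empty pair is added from the original requirement (at least one character between the tags), not from the state list.

```ruby
# Hypothetical helper implementing the two-state walk over the tags.
def valid_joe_pairs?(text, max_pairs = 3)
  tag_re   = /\[\/?joe\]/
  state    = :outside   # State 1: no open [joe]
  count    = 0          # completed pairs so far
  open_end = nil        # offset just after the most recent [joe]
  pos      = 0
  while (m = tag_re.match(text, pos))
    if m[0] == "[joe]"
      return false if state == :inside        # second [joe] before a [/joe]
      state    = :inside                      # State 2: waiting for [/joe]
      open_end = m.end(0)
    else
      return false if state == :outside       # [/joe] with nothing open
      return false if m.begin(0) == open_end  # nothing between the tags
      count += 1
      return false if count > max_pairs
      state = :outside
    end
    pos = m.end(0)
  end
  state == :outside && count >= 1             # no dangling [joe], at least one pair
end

puts valid_joe_pairs?("[joe]abc[/joe]")
puts valid_joe_pairs?("[joe] hello [joe] how are [/joe] you")
```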
If you’re using scan, though, doesn’t that mean that you’re not really
trying to match one string to the regex, but rather a series of
strings? That means that the state machine gets completely restarted
as the scan method goes along the string. I think that’s a different
situation. You’re not really saying: match all the pairs; you’re
saying: find the first substring that has a matching pair, then
discard it and don’t worry about backtracking through it; etc.
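A quick sketch of that behaviour, using one of the strings from earlier in the thread (the method names joe_tags and joe_pairs are invented): scan returns each non-overlapping match in turn and says nothing about what is left over.

```ruby
# Hypothetical helpers wrapping String#scan.
def joe_tags(text)
  text.scan(/\[\/?joe\]/)           # every opening or closing tag, in order
end

def joe_pairs(text)
  text.scan(/\[joe\].+?\[\/joe\]/)  # every complete [joe]...[/joe] stretch
end

text = "hey [joe] it's hot today[/joe] where [joe] is the ac"
p joe_tags(text)    # three tags: open, close, open
p joe_pairs(text)   # only the one complete pair; the stray [joe] is ignored
```

So scanning for whole pairs cannot, on its own, tell you that an unmatched tag is left in the string.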
To my knowledge, you can’t do this with Ruby’s current regexp engine,
though it is possible with Perl and .NET. Both of those languages
support something roughly analogous to a stack, within the expression.
I don’t think Ruby 1.8’s regexp engine is powerful enough to handle
this, but I would be happy to be proven wrong.
It’s worth remembering that what we call ‘regular expressions’ these
days don’t actually match the formal definition of that term, and are
much more powerful in some ways.
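Backreferences are one example of that extra power. In the sketch below (doubled? is a made-up name) the pattern accepts exactly the strings made of some run of word characters repeated twice, which is not a regular language in the formal sense.

```ruby
# Hypothetical helper: true when the string is some word written twice.
def doubled?(word)
  !(word =~ /\A(\w+)\1\z/).nil?   # \1 must repeat whatever (\w+) captured
end

puts doubled?("abcabc")
puts doubled?("abcabd")
```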
Posted via http://www.ruby-forum.com/.
If a regular expression can’t do it, does that mean we can’t use
a regular expression?
No. We’ll still use a regexp and add some code to help it.
If all the pairs are matched, then after partitioning and zipping
we wind up with the original pairs.
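The partition-and-zip check can be sketched roughly like this. It is pieced together from the array method in the benchmark further down, and the method name pairs_ok? is my own, so treat it as an illustration rather than the exact code that was posted.

```ruby
# Hypothetical helper: true when the tags form 1 to 3 properly ordered pairs.
def pairs_ok?(text)
  tags = text.scan(%r{\[/?joe\]})
  return false unless [2, 4, 6].include?(tags.size)   # 1, 2 or 3 pairs
  # Split into opening and closing tags, then interleave them again.
  opens, closes = tags.partition { |t| t == "[joe]" }
  # Only a strictly alternating open/close sequence survives the round trip.
  tags == opens.zip(closes).flatten
end

puts pairs_ok?("[joe]a[/joe] and [joe]b[/joe]")
puts pairs_ok?("[joe] hello [joe] how are [/joe] you")
```

The helper only returns true or false; it does not print anything itself, and it does not check that there is text between the tags.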
--- output ---
["[joe]", "[/joe]"]
good
["[joe]", "[/joe]", "[joe]", "[/joe]"]
good
["[joe]", "[/joe]", "[joe]"]
bad
["[joe]", "[/joe]", "[/joe]"]
bad
["[joe]", "[joe]", "[/joe]"]
bad
["[joe]", "[joe]"]
bad
["[/joe]", "[joe]"]
bad
["[/joe]", "[/joe]"]
bad
# `strings` stands for the list of test strings, which is not preserved here.
strings.each { |s|
  p s if s =~
    /^(((?!\[\/?joe\]).)*(\[joe\]((?!\[\/?joe\]).)+\[\/joe\])){1,3}((?!\[\/?joe\]).)*$/
}
cheers
andrew
Yours is faster for very short strings; longer strings allow the array
method to pull ahead.
require 'benchmark'

$n = 10_000
$strings = [
  "good [joe] Wasn't that what he was seeking? [/joe]
[joe] Can't you see that? [/joe]",
  "bad was Peck's boy [/joe] [joe] But he'll never know. [/joe]",
  "bad to the bone [joe] Or will he?! [/joe] mish mash mush
Marching on Tom Tidler's ground fatigues me. [/joe]",
  "bad: too many [joe] [/joe] [joe] [/joe] [joe] [/joe] [joe] [/joe]",
  "bad: too few"
]

# Count the strings accepted by the single anchored regexp.
def regexp
  $regexp_good = 0
  $n.times{ $strings.each { |s|
    $regexp_good += 1 if s =~
      /\A(((?!\[\/?joe\]).)*(\[joe\]((?!\[\/?joe\]).)+\[\/joe\])){1,3}((?!\[\/?joe\]).)*\Z/m
  } }
end

# Count the strings accepted by scanning for tags and checking their order.
def array
  $array_good = 0
  $n.times{ $strings.each { |s|
    ary = s.scan( %r{\[/?joe\]} )
    if [2,4,6].include?(ary.size) and
      ary == ary.partition{|t| "[joe]"==t}.inject{|a,b| a.zip(b)}.
             flatten
      $array_good += 1
    end
  } }
end

Benchmark.bmbm do |x|
  x.report("regexp") { regexp }
  x.report("array") { array }
end