Regex and non-greedy matching?

I have a slight problem. I have strings with some tags such as

<b><lightblue>name:</>

I need to match “name:” and “lightblue”
In other words:

  • What is between <> </>
    and
  • What is inside the first <> right next to “name:”

The following regex does not work:

'<b><lightblue>name:</>' =~ /<([a-zA-Z]+)>(.+?)<\/>/

$1 # => "b"
$2 # => "<lightblue>name:"

$2 should only be name:
and $1 should only be lightblue

Hi,

At Mon, 7 Apr 2008 08:25:28 +0900,
Marc H. wrote in [ruby-talk:297262]:

The following regex does not work:

'<b><lightblue>name:</>' =~ /<([a-zA-Z]+)>(.+?)<\/>/

/<([a-zA-Z]+)>.*?([^<>]+)<\/>/ =~ "<b><lightblue>name:</>"

$2 should only be name:
and $1 should only be lightblue

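Non-greedy matching doesn’t mean finding the shortest possible match.
The match still starts at the leftmost position where one is possible;
the non-greedy quantifier only ends it as early as it can from there.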
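
A minimal sketch of the leftmost rule in Ruby, assuming the string being tested is '<b><lightblue>name:</>' (the variable names are only for illustration). The second pattern is one possible way to force the match to start at the inner tag, not necessarily the only one:

```ruby
# Assumed sample string; variable names are illustrative only.
str = '<b><lightblue>name:</>'

# Non-greedy quantifier, but the match still begins at the leftmost '<'.
lazy = /<([a-zA-Z]+)>(.+?)<\/>/.match(str)
lazy_tag  = lazy[1]   # "b"
lazy_body = lazy[2]   # "<lightblue>name:"

# Forbidding '<' and '>' in the body makes a match starting at '<b>'
# impossible, so the engine moves on to the inner tag.
strict = /<([a-zA-Z]+)>([^<>]+)<\/>/.match(str)
strict_tag  = strict[1]   # "lightblue"
strict_body = strict[2]   # "name:"

puts lazy_tag, lazy_body, strict_tag, strict_body
```

The only difference between the two patterns is what the second group is allowed to contain; the quantifier's greediness does not decide where the match begins.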

Marc H. wrote:

I have a slight problem. I have strings with some tags such as

<b><lightblue>name:</>

I need to match “name:” and “lightblue”
In other words:

  • What is between <> </>
    and
  • What is inside the first <> right next to “name:”

The following regex does not work:

'<b><lightblue>name:</>' =~ /<([a-zA-Z]+)>(.+?)<\/>/

$1 # => "b"

This is your string:

<b><lightblue>name:</>

and the first part of your regex says to look for a ‘<’, followed by one
or more letters, followed by a ‘>’. That certainly describes the
string ‘<b>’.

$2 # => "<lightblue>name:"

This is your string again:

<b> <--already matched this
<lightblue>name:</>

The second part of your regex, (.+?)<\/>, says to look for any character
one or more times, followed by ‘</>’. That certainly describes the
string ‘<lightblue>name:</>’.

Note that since the characters ‘</>’ only appear once in your string,
the non-greedy qualifier has no effect. By default, regex quantifiers
are greedy, so if your string looked like this:

'<b><lightblue>name:</>xxxxxxxxxxxxxxx</>'

then the greedy version of your regex:

/>(.+)<\/>/ <----(no ‘?’)

would match:

><lightblue>name:</>xxxxxxxxxxxxxxx</>

That’s because the portion:

<lightblue>name:</>xxxxxxxxxxxxxxx

is interpreted as “any character(.) one or more times(+)”.

On the other hand, your non-greedy regex (i.e. with the ‘?’) would match:

><lightblue>name:</>
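
The difference can be checked with a short Ruby sketch. The sample string below is an assumption built to have two ‘</>’ terminators, and the variable names are only illustrative:

```ruby
# Assumed sample string with two '</>' terminators.
str = '<b><lightblue>name:</>' + ('x' * 15) + '</>'

# Greedy: .+ runs to the end, then backs off to the LAST '</>'.
greedy = str[/>(.+)<\/>/, 1]

# Non-greedy: .+? grows one character at a time and stops at the FIRST '</>'.
lazy = str[/>(.+?)<\/>/, 1]

puts greedy
puts lazy
```

Both patterns start matching at the same ‘>’ (the one that closes ‘<b’); only the end of the match moves.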

If you examine your string again:

<b><lightblue>name:</>

the ‘lightblue’ substring is preceded by the characters ‘><’, and that
is different from what precedes ‘b’. You can use that fact to get
‘lightblue’ instead of ‘b’. This regex will get ‘lightblue’:

><([^>]+)

That says to look for ‘><’ followed by one or more characters that are
not a ‘>’. That will match:

‘><lightblue’

To get ‘name:’, you can do something similar. This is the rest of the
string after ‘lightblue’:

‘>name:</>’

Here is a regex to get ‘name:’:

>([^<]+)

That says to look for a ‘>’, followed by one or more characters that are
not a ‘<’. Here it is altogether:

pattern = /><([^>]+)>([^<]+)/
str = "<b><lightblue>name:</>"

match_obj = pattern.match(str)
puts match_obj[1]
puts match_obj[2]

--output:--

lightblue
name:

Hi--

On Apr 6, 2008, at 4:25 PM, Marc H. wrote:

The following regex does not work:

'<b><lightblue>name:</>' =~ /<([a-zA-Z]+)>(.+?)<\/>/

$1 # => "b"
$2 # => "<lightblue>name:"

$2 should only be name:
and $1 should only be lightblue

You might want to look into hpricot
(http://code.whytheluckystiff.net/hpricot/). It will give you pretty
reliable parsing of XML markup. What you have here is not valid XML
because the closing tag for <lightblue> is not </lightblue>, but on the
chance that it’s a typo, I really recommend giving hpricot a try.

2008/4/7, Marc H.:

The following regex does not work:

'<b><lightblue>name:</>' =~ /<([a-zA-Z]+)>(.+?)<\/>/

$1 # => "b"
$2 # => "<lightblue>name:"

$2 should only be name:
and $1 should only be lightblue

Constructing a more specific regexp often helps:

irb(main):001:0> s='<b><lightblue>name:</>'
=> "<b><lightblue>name:</>"

irb(main):002:0> md = %r{\s*<([^>]*)>([^<]*)</>}.match s
=> #<MatchData:0x7ff973f4>
irb(main):003:0> md.to_a
=> ["<lightblue>name:</>", "lightblue", "name:"]

irb(main):004:0> md = %r{\s*<([^>]*)>\s*([^<]*)</>}.match s
=> #<MatchData:0x7ff85b54>
irb(main):005:0> md.to_a
=> ["<lightblue>name:</>", "lightblue", "name:"]
irb(main):006:0>

See how this works without a reluctant quantifier?
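
For use outside irb, the same idea could be wrapped in a small helper. This is only a sketch: the method name `extract_tag_and_text` is hypothetical, and it assumes the input has the shape '<b><lightblue>name:</>' with a bare ‘</>’ terminator:

```ruby
# Hypothetical helper (the name is not from the thread): returns
# [tag, text] for the tag whose text runs directly up to a bare '</>',
# or nil when nothing matches.
def extract_tag_and_text(str)
  # Negated character classes keep each group inside its own delimiters,
  # so no reluctant quantifier is needed.
  md = %r{<([^>]*)>\s*([^<]*)</>}.match(str)
  md && md.captures
end

tag, text = extract_tag_and_text('<b><lightblue>name:</>')
puts tag
puts text
```

Returning nil on a failed match lets the caller tell "no tag found" apart from an empty capture.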

Cheers

robert