Reg exp comparison hanging irb

hi,

was just going trough reg exp
the lines below just hangs irb and cpu usage goes 100% on windows xp
can anyone explain me why?

irb(main):001:0>
r=/^(https?://)?[a-z0-9]+([.-_=&+/?]?[a-z0-9]+)+$/i
=> /^(https?://)?[a-z0-9]+([.-_=&+/?]?[a-z0-9]+)+$/i
irb(main):002:0> "
http://groups-beta.google.com/group/rubyonrails-talk/browse_th
read/thread/8f085b191387d799/e78a71cbd7354c0c#e78a71cbd7354c0c"=~r

i’ll appreciate if anyone helps with this.

regards
gaurav…

hi simon,
thanks for reply
so whats the way out i am not used to reg exp that much
some one helped me with reg exp that i posted
as a single entry might stop the things to work…
regards
gaurav

On 12/27/06, gaurav bagga [email protected] wrote:

hi simon,
thanks for reply
so whats the way out i am not used to reg exp that much
some one helped me with reg exp that i posted
as a single entry might stop the things to work…

do you want to do a rough validation of the entire url?
or do you want to check only the beginning?

try this one
/^(https?://)?\w+([-._=&+/?#]\w+)+$/i

On 12/27/06, gaurav bagga [email protected] wrote:

was just going trough reg exp
the lines below just hangs irb and cpu usage goes 100% on windows xp
can anyone explain me why?

irb(main):001:0> r=/^(https?://)?[a-z0-9]+([.-_=&+/?]?[a-z0-9]+)+$/i
=> /^(https?://)?[a-z0-9]+([.-_=&+/?]?[a-z0-9]+)+$/i
irb(main):002:0> "
http://groups-beta.google.com/group/rubyonrails-talk/browse_th
read/thread/8f085b191387d799/e78a71cbd7354c0c#e78a71cbd7354c0c"=~r

the GNU regexp engine has a few oddities…
here is another example that triggers an endless loop.

r=‘<META http-equiv="Content-Type content=“text/html;
charset=iso-8859-1”>’
r.scan /<(?:[^“>]+|”[^“]*”)+>/

we have in common that our regexp has nested repeating patterns.

hi simon,

can you explain regexp has nested repeating patterns.

regards
gaurav

On 12/27/06, gaurav bagga [email protected] wrote:

can you explain regexp has nested repeating patterns.

the pattern /a+/ is not nested.
putting ( )+ around it, then /(a+)+/ is nested 1 level.
putting ( )+ around it, then /((a+)+)+/ is nested 2 levels.

I suspect the GNU regexp engine sometimes go crazy when there
is a pattern that can match nothing inside a nested pattern.

by matching nothing i mean: ^ $ \b * ? lookahead lookbehind… etc

hi,
thanks for the reg exp :slight_smile:
basically that suffices simple website url(
http://www.something.com/something) or a blog url
regards
gaurav

From: “Simon S.” [email protected]

the GNU regexp engine has a few oddities…
here is another example that triggers an endless loop.

r=‘<META http-equiv="Content-Type content=“text/html; charset=iso-8859-1”>’
r.scan /<(?:[^“>]+|”[^“]*”)+>/

To clarify, it’s probably not an endless loop, just may or
may not finish in our lifetimes. :slight_smile:

If you make the string shorter, like:

r=‘<M h="C-T c=“t/h; c=i-8-1”>’

…you’ll see it finishes quickly.

Add a few characters:

r=‘<M h="C-T c=“t/h; charset=i-8-1”>’

…and there’s a slight delay before it finishes.

I found that removing greediness from your outer one-or-more
match, sped it up a lot: (changed + to +?)

r.scan /<(?:[^“>]+|”[^“]*”)+?>/

Now the match on your full string finishes in a few seconds
on my system. Still slow… just a lot faster than the
greedy version.

Incidentally, it’s the mismatched quotes in the attribute
value that are causing the backtracking.

If we allow the regex to fail “gracefully” on mismatched
quotes, we can prevent the backtracking:

r.scan /<(?:[^“>]+|”[^“]*”|")+?>/

…i.e. the thinking is, if all else fails, just gobble a
single " and keep going.

Regards,

Bill

hi,
thank you all for help
regards
gaurav

On 12/27/06, Bill K. [email protected] wrote:

From: “Simon S.” [email protected]

the GNU regexp engine has a few oddities…
here is another example that triggers an endless loop.

r=‘<META http-equiv="Content-Type content=“text/html; charset=iso-8859-1”>’
r.scan /<(?:[^“>]+|”[^“]*”)+>/

To clarify, it’s probably not an endless loop, just may or
may not finish in our lifetimes. :slight_smile:

Thanks for the clarification. I must have forgotten
a lot about regexp too… iirc stress can cause amnesia.

Simon S. wrote:

read/thread/8f085b191387d799/e78a71cbd7354c0c#e78a71cbd7354c0c"=~r

the GNU regexp engine has a few oddities…
here is another example that triggers an endless loop.

r=‘<META http-equiv="Content-Type content=“text/html; charset=iso-8859-1”>’
r.scan /<(?:[^“>]+|”[^“]*”)+>/

we have in common that our regexp has nested repeating patterns.

Much faster:

r = %r{
^
(https?://)?
(?> [a-z0-9]+ )
(?> [-._=&+/?]? [a-z0-9]+ )+
$
}xi
p ( “http://groups-beta.google.com/group/rubyonrails-talk/” +
“browse_thread/thread/8f085b191387d799/” +
“e78a71cbd7354c0c#e78a71cbd7354c0c”) =~ r

It doesn’t match (because of the #).

On Dec 26, 2006, at 21:20, gaurav bagga wrote:

irb(main):002:0> "
http://groups-beta.google.com/group/rubyonrails-talk/browse_th
read/thread/8f085b191387d799/e78a71cbd7354c0c#e78a71cbd7354c0c"=~r

i’ll appreciate if anyone helps with this.

There’s already a library for what you want to do:

require ‘uri’

uri = URI.parse “http://groups-beta.google.com/group/rubyonrails-
talk/…”


Eric H. - [email protected] - http://blog.segment7.net

I LIT YOUR GEM ON FIRE!