Reg exp comparison hanging irb

gaurav_bagga · December 27, 2006, 6:21am

hi,

I was just going through regexps.
The lines below hang irb, and CPU usage goes to 100% on Windows XP.
Can anyone explain why?

irb(main):001:0> r=/^(https?:\/\/)?[a-z0-9]+([.\-_=&+\/?]?[a-z0-9]+)+$/i
=> /^(https?:\/\/)?[a-z0-9]+([.\-_=&+\/?]?[a-z0-9]+)+$/i
irb(main):002:0> "http://groups-beta.google.com/group/rubyonrails-talk/browse_thread/thread/8f085b191387d799/e78a71cbd7354c0c#e78a71cbd7354c0c"=~r

I'd appreciate any help with this.

regards
gaurav…

gaurav_bagga · December 27, 2006, 9:41am

hi simon,
thanks for the reply
so what's the way out? I am not that used to regexps,
someone helped me with the regexp that I posted,
and a single entry like this might stop things from working…
regards
gaurav

gaurav_bagga · December 27, 2006, 11:30am

On 12/27/06, gaurav bagga wrote:

> hi simon,
> thanks for the reply
> so what's the way out? I am not that used to regexps,
> someone helped me with the regexp that I posted,
> and a single entry like this might stop things from working…

do you want to do a rough validation of the entire url?
or do you want to check only the beginning?

try this one
/^(https?:\/\/)?\w+([-._=&+\/?#]\w+)+$/i
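
A quick way to try the pattern above, as a minimal sketch. It is written with %r{} so the slashes need no escaping; the method name simple_url? and the sample inputs are only illustrative. Since the separator is no longer optional, a run of letters and digits can only be split where a separator character actually occurs.

```ruby
# Hypothetical helper wrapping the suggested pattern.
def simple_url?(text)
  !!(text =~ %r{^(https?://)?\w+([-._=&+/?#]\w+)+$}i)
end

thread_url = "http://groups-beta.google.com/group/rubyonrails-talk/" \
             "browse_thread/thread/8f085b191387d799/" \
             "e78a71cbd7354c0c#e78a71cbd7354c0c"

p simple_url?(thread_url)                            # => true
p simple_url?("http://www.something.com/something")  # => true
p simple_url?("http://")                             # => false, nothing after the scheme
```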

gaurav_bagga · December 27, 2006, 9:29am

On 12/27/06, gaurav bagga wrote:

> I was just going through regexps.
> The lines below hang irb, and CPU usage goes to 100% on Windows XP.
> Can anyone explain why?
>
> irb(main):001:0> r=/^(https?:\/\/)?[a-z0-9]+([.\-_=&+\/?]?[a-z0-9]+)+$/i
> => /^(https?:\/\/)?[a-z0-9]+([.\-_=&+\/?]?[a-z0-9]+)+$/i
> irb(main):002:0> "http://groups-beta.google.com/group/rubyonrails-talk/browse_thread/thread/8f085b191387d799/e78a71cbd7354c0c#e78a71cbd7354c0c"=~r

the GNU regexp engine has a few oddities…
here is another example that triggers an endless loop.

r='<META http-equiv="Content-Type content="text/html; charset=iso-8859-1">'
r.scan /<(?:[^">]+|"[^"]*")+>/

what our regexps have in common is nested repeating patterns.

gaurav_bagga · December 27, 2006, 12:04pm

hi simon,

can you explain what you mean by a regexp having nested repeating patterns?

regards
gaurav

gaurav_bagga · December 27, 2006, 12:50pm

On 12/27/06, gaurav bagga wrote:

> can you explain what you mean by a regexp having nested repeating patterns?

the pattern /a+/ is not nested.
putting ( )+ around it, then /(a+)+/ is nested 1 level.
putting ( )+ around it, then /((a+)+)+/ is nested 2 levels.

I suspect the GNU regexp engine sometimes goes crazy when there
is a pattern that can match nothing inside a nested pattern.

by matching nothing I mean: ^ $ \b * ? lookahead lookbehind, etc.
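
To put a number on why nesting hurts: on a run of n characters that a+ can match, (a+)+ may divide the run into pieces in 2^(n-1) different ways, and a backtracking matcher may end up trying all of them before it reports a failure. The sketch below only counts those divisions; it does not run any regexp, and the method name split_count is made up for this illustration.

```ruby
# Number of ways to cut a run of n characters into one or more
# non-empty pieces, i.e. the ways (a+)+ can cover the run.
def split_count(n)
  return 1 if n.zero?
  # Pick the length of the first piece, then split what is left.
  (1..n).sum { |first| split_count(n - first) }
end

[4, 8, 16].each do |n|
  puts "run of #{n}: #{split_count(n)} ways"
end
# run of 4: 8 ways
# run of 8: 128 ways
# run of 16: 32768 ways
```

The last two path segments of the URL in the first post are 16 characters each, so a single failed match there can already mean a very large number of attempts.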

gaurav_bagga · December 27, 2006, 12:00pm

hi,
thanks for the regexp
basically it only needs to handle a simple website URL
(http://www.something.com/something) or a blog URL
regards
gaurav

gaurav_bagga · December 27, 2006, 1:46pm

From: "Simon S."

> the GNU regexp engine has a few oddities…
> here is another example that triggers an endless loop.
>
> r='<META http-equiv="Content-Type content="text/html; charset=iso-8859-1">'
> r.scan /<(?:[^">]+|"[^"]*")+>/

To clarify, it’s probably not an endless loop, just may or
may not finish in our lifetimes.

If you make the string shorter, like:

r='<M h="C-T c="t/h; c=i-8-1">'

…you’ll see it finishes quickly.

Add a few characters:

r='<M h="C-T c="t/h; charset=i-8-1">'

…and there’s a slight delay before it finishes.

I found that removing greediness from your outer one-or-more
match sped it up a lot (changed + to +?):

r.scan /<(?:[^">]+|"[^"]*")+?>/

Now the match on your full string finishes in a few seconds
on my system. Still slow… just a lot faster than the
greedy version.

Incidentally, it’s the mismatched quotes in the attribute
value that are causing the backtracking.

If we allow the regex to fail “gracefully” on mismatched
quotes, we can prevent the backtracking:

r.scan /<(?:[^">]+|"[^"]*"|")+?>/

…i.e. the thinking is, if all else fails, just gobble a
single " and keep going.
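
A small sketch of that last version, for anyone who wants to run it. The method name scan_tags and the second sample string are assumptions made for the example; the first string is the one from earlier in the thread, typed with plain ASCII quotes.

```ruby
# Hypothetical helper around the "fail gracefully" pattern.
def scan_tags(html)
  # Alternatives, in order: text without quotes, a properly quoted
  # value, or a single stray " that could not be paired up.
  html.scan(/<(?:[^">]+|"[^"]*"|")+?>/)
end

broken = '<META http-equiv="Content-Type content="text/html; charset=iso-8859-1">'
p scan_tags(broken) == [broken]   # => true, the whole tag is one match
p scan_tags('<a href="x">')       # => ["<a href=\"x\">"]
```

With three quote characters in the tag, the last one has no partner, and the third alternative lets the match step over it instead of retrying every earlier split.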

Regards,

Bill

gaurav_bagga · December 27, 2006, 2:18pm

hi,
thank you all for help
regards
gaurav

gaurav_bagga · December 27, 2006, 2:26pm

On 12/27/06, Bill K. wrote:

> From: "Simon S."
>
>> the GNU regexp engine has a few oddities…
>> here is another example that triggers an endless loop.
>>
>> r='<META http-equiv="Content-Type content="text/html; charset=iso-8859-1">'
>> r.scan /<(?:[^">]+|"[^"]*")+>/
>
> To clarify, it’s probably not an endless loop, just may or
> may not finish in our lifetimes.

Thanks for the clarification. I must have forgotten
a lot about regexp too… iirc stress can cause amnesia.

gaurav_bagga · December 27, 2006, 9:21pm

Simon S. wrote:

>> read/thread/8f085b191387d799/e78a71cbd7354c0c#e78a71cbd7354c0c"=~r
>
> the GNU regexp engine has a few oddities…
> here is another example that triggers an endless loop.
>
> r='<META http-equiv="Content-Type content="text/html; charset=iso-8859-1">'
> r.scan /<(?:[^">]+|"[^"]*")+>/
>
> what our regexps have in common is nested repeating patterns.

Much faster:

r = %r{
  ^
  (https?://)?
  (?> [a-z0-9]+ )
  (?> [-._=&+/?]? [a-z0-9]+ )+
  $
}xi
p ( "http://groups-beta.google.com/group/rubyonrails-talk/" +
    "browse_thread/thread/8f085b191387d799/" +
    "e78a71cbd7354c0c#e78a71cbd7354c0c") =~ r

It doesn’t match (because of the #).
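
For comparison, here is a sketch of the same atomic-group pattern next to a variant that also accepts # as a separator. The variant and both method names are assumptions added for illustration, not part of the post above. An atomic group (?> ... ) keeps whatever it matched and does not hand characters back, so a failure is reported without retrying every split.

```ruby
# Pattern from the post above: "#" is not an allowed separator.
def strict_match(url)
  url =~ %r{ ^ (https?://)? (?> [a-z0-9]+ ) (?> [-._=&+/?]? [a-z0-9]+ )+ $ }xi
end

# Assumed variant: "#" added to the separator class (escaped, since the
# pattern uses the x flag).
def fragment_match(url)
  url =~ %r{ ^ (https?://)? (?> [a-z0-9]+ ) (?> [-._=&+/?\#]? [a-z0-9]+ )+ $ }xi
end

url = "http://groups-beta.google.com/group/rubyonrails-talk/" \
      "browse_thread/thread/8f085b191387d799/" \
      "e78a71cbd7354c0c#e78a71cbd7354c0c"

p strict_match(url)    # => nil
p fragment_match(url)  # => 0
```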

gaurav_bagga · December 27, 2006, 11:13pm

On Dec 26, 2006, at 21:20, gaurav bagga wrote:

> irb(main):002:0> "http://groups-beta.google.com/group/rubyonrails-talk/browse_thread/thread/8f085b191387d799/e78a71cbd7354c0c#e78a71cbd7354c0c"=~r
>
> I'd appreciate any help with this.

There’s already a library for what you want to do:

require 'uri'

uri = URI.parse "http://groups-beta.google.com/group/rubyonrails-talk/…"
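
One way this could be turned into a yes/no check, as a rough sketch. The helper name http_url? and the sample strings are assumptions for the example. Unlike the regexp in the first post, it insists on an http or https scheme.

```ruby
require 'uri'

# Hypothetical helper: true only for absolute http/https URLs with a host.
def http_url?(text)
  uri = URI.parse(text)
  # URI::HTTPS is a subclass of URI::HTTP, so this covers both schemes.
  uri.is_a?(URI::HTTP) && !uri.host.nil?
rescue URI::InvalidURIError
  false
end

p http_url?("http://www.something.com/something")  # => true
p http_url?("www.something.com/something")         # => false, no scheme
p http_url?("not a url")                           # => false, does not parse
```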

--
Eric H. - http://blog.segment7.net

I LIT YOUR GEM ON FIRE!