Forum: Ruby String.scan hangs in some cases?

Announcement (2017-05-07): www.ruby-forum.com is now read-only since I unfortunately do not have the time to support and maintain the forum any more. Please see rubyonrails.org/community and ruby-lang.org/en/community for other Rails- und Ruby-related community platforms.
Wes G. (Guest)
on 2006-05-24 04:29
Has anyone ever see a call to String.scan (the "regular" form, not the
block form) just hang when processing a string?

If so, any insight into how that might happen?

Thanks,
Wes
Daniel B. (Guest)
on 2006-05-24 04:43
(Received via mailing list)
Wes G. wrote:
> Has anyone ever see a call to String.scan (the "regular" form, not the
> block form) just hang when processing a string?
>
> If so, any insight into how that might happen?
>
> Thanks,
> Wes
>

Got an example you can share that demonstrates the problem?

Maybe the string is huge?  Maybe there's a recursive lookup happening of
some kind?  Dunno without more detail.

Regards,

Dan
Bill K. (Guest)
on 2006-05-24 04:59
(Received via mailing list)
From: "Daniel B." <removed_email_address@domain.invalid>
> Got an example you can share that demonstrates the problem?
>
> Maybe the string is huge?  Maybe there's a recursive lookup happening of
> some kind?  Dunno without more detail.

I'm guessing exponential backtracking when the expression doesn't match.

For ex:

x = "x" * 1000
x << "ab"
x =~ /x*x*a/       # matches quickly
x =~ /x*x*b/       # "hangs" trying exponential combinations with
backtracking


Regards,

Bill
Robert K. (Guest)
on 2006-05-24 11:12
(Received via mailing list)
2006/5/24, Bill K. <removed_email_address@domain.invalid>:
> >
> x << "ab"
> x =~ /x*x*a/       # matches quickly
> x =~ /x*x*b/       # "hangs" trying exponential combinations with backtracking

Yes, that's a property of NFA based regular expression engines.  Some
patterns have awful runtime characteristics because of the
backtracking.  Without looking too close at your patter I guess in
this case it's caused by having "x*" twice in your RX - not very
useful anyway.  If you are interested in more detail I recommend
reading "Mastering Regular Expressions":

http://www.amazon.com/gp/product/0596002890/

Kind regards

robert
Bill K. (Guest)
on 2006-05-24 12:16
(Received via mailing list)
From: "Robert K." <removed_email_address@domain.invalid>
>
> Yes, that's a property of NFA based regular expression engines.  Some
> patterns have awful runtime characteristics because of the
> backtracking.  Without looking too close at your patter I guess in
> this case it's caused by having "x*" twice in your RX - not very
> useful anyway.

Thanks :)  We haven't seen the OP's regexp yet; I was just
offering the above as an example of what can cause the
hanging he described.


Regards,

Bill
Wes G. (Guest)
on 2006-05-24 19:03
Bill K. wrote:
> From: "Robert K." <removed_email_address@domain.invalid>
>>
>> Yes, that's a property of NFA based regular expression engines.  Some
>> patterns have awful runtime characteristics because of the
>> backtracking.  Without looking too close at your patter I guess in
>> this case it's caused by having "x*" twice in your RX - not very
>> useful anyway.
>
> Thanks :)  We haven't seen the OP's regexp yet; I was just
> offering the above as an example of what can cause the
> hanging he described.
>
>
> Regards,
>
> Bill

All,

Thanks for giving me some idea of what might be wrong.  For now, I've
gone with a simpler regex that I think will work fine.  My original
pattern was attempting to allow several portions of the pattern to be
optional and the more that I think about it, the more it makes sense
that this might occur.

I will keep (?>pattern) in mind as a way to inhibit backtracking.

Thanks,
Wes
Wes G. (Guest)
on 2006-05-24 19:11
Just for kicks, here was the original regex - attempting to match
against a string containing an entire HTML page OR a style sheet.

/(?=<style)*?background.*?(?=-image)*?:.*?url\((.*?)\).*?(?=<\/style>)*?/

I was attempting to make the <style  and </style> parts optional.

As it turns out, there are so many different ways that a style can be
specified in a HTML page, that I just went with the much simpler:

/background.*?:.*?url\((.*?)\)/i

Wes
This topic is locked and can not be replied to.