Forum: Ruby Regex Comparison Causing Massive Memory Usage

Posted by Mike Harris (gfunk911)
on 2006-06-26 17:05
(Received via mailing list)
Has anyone run into a problem where a regex comparison against a very
large string causes memory usage to massively and rapidly spike
upwards.  A string of length ~200000 causes memory usage to reach and
exceed 2 gigs within seconds, crashing the running program.

I'm not looking to diagnose the problem (with this e-mail) or anything
like that, which is why I haven't provided details.  Just curious if
others have run into this problem.
Posted by Xavier Noria (Guest)
on 2006-06-26 17:16
(Received via mailing list)
On Jun 26, 2006, at 17:04, Mike Harris wrote:

> Has anyone run into a problem where a regex comparison against a  
> very large string causes memory usage to massively and rapidly  
> spike upwards.  A string of length ~200000 causes memory usage to  
> reach and exceed 2 gigs within seconds, crashing the running program.
> I'm not looking to diagnose the problem (with this e-mail) or  
> anything like that, which is why I haven't provided details.  Just  
> curious if others have run into this problem.

Some regexps explode because of a mixture of alternations,
quantifiers, etc. This is intrinsic to them so to speak, not
dependant on the engine. How does yours look?

-- fxn
Posted by Jeff Pritchard (jeffpritchard)
on 2006-06-26 17:17
I recently wondered about this situation.  My strings were approx one 
order of magnitude smaller than what you have described, and I had no 
problem running multiple "fancy" regex's on them (mine were 'C' files). 
Everybody told me not to worry about it.

They also mentioned that 'IO' and therefore 'file' classes have 
enumerator methods via mixin, so you can usually do all the same stuff 
directly to the file if you need to.

jp


Mike Harris wrote:
> Has anyone run into a problem where a regex comparison against a very
> large string causes memory usage to massively and rapidly spike
> upwards.  A string of length ~200000 causes memory usage to reach and
> exceed 2 gigs within seconds, crashing the running program.
> 
> I'm not looking to diagnose the problem (with this e-mail) or anything
> like that, which is why I haven't provided details.  Just curious if
> others have run into this problem.
Posted by Mike Harris (gfunk911)
on 2006-06-26 17:36
(Received via mailing list)
Xavier Noria wrote:

>
> Some regexps explode because of a mixture of alternations,  
> quantifiers, etc. This is intrinsic to them so to speak, not  
> dependant on the engine. How does yours look?
>
> -- fxn
>
>
>
>
"My" Regex is not actually mine.  It's in Ruby's SOAP Implementation.
The crash is (I believe) in this method

  def Charset.is_utf8(str)
    UTF8Regexp =~ str
  end

and the regex in question looks like this

  character_utf8 =
    "(?:#{us_ascii}|#{twobytes_utf8}|#{threebytes_utf8}|#{fourbytes_utf8})"
  UTF8Regexp = Regexp.new("\\A#{character_utf8}*\\z", nil, "NONE")
Posted by Xavier Noria (Guest)
on 2006-06-26 17:49
(Received via mailing list)
On Jun 26, 2006, at 17:35, Mike Harris wrote:

>    "(?:#{us_ascii}|#{twobytes_utf8}|#{threebytes_utf8}|# 
> {fourbytes_utf8})"
>  UTF8Regexp = Regexp.new("\\A#{character_utf8}*\\z", nil, "NONE")

Yeah, looks potentially problematic at first sight.

In each single byte the engine has up to 3 possible backtracking
points that have to be remembered. I don't know details about the
UTF-8 encoding, but if the alternatives are not exclusive you may
wrongly match #{twobytes_utf8} in the first byte to discover at the
end of the string the match fails. In that case the engine has to
backtrack a zillion times back and forth until it undoes the walked
way (potentially a huge tree) and arrives to switch to #
{threebytes_utf8} at the beginning, which in turn may fail, etc.

-- fxn
Posted by Xavier Noria (Guest)
on 2006-06-26 17:53
(Received via mailing list)
On Jun 26, 2006, at 17:48, Xavier Noria wrote:

> In each single byte

That should be "in each single match", meaning a successful match of
any of the alternatives at any given point.

-- fxn
Posted by Mike Harris (gfunk911)
on 2006-06-26 17:56
(Received via mailing list)
Xavier Noria wrote:

>>
> UTF-8 encoding, but if the alternatives are not exclusive you may  
> wrongly match #{twobytes_utf8} in the first byte to discover at the  
> end of the string the match fails. In that case the engine has to  
> backtrack a zillion times back and forth until it undoes the walked  
> way (potentially a huge tree) and arrives to switch to # 
> {threebytes_utf8} at the beginning, which in turn may fail, etc.
>
> -- fxn
>
>
Yup.  I figured it was something in the "has to store something at each
character" vein, although I had no clue what was actually happening.  I
"fixed it" by temporarily commenting out the calling code, since I'm
only receiving known data.

In your opinion, is this unavoidable, or is this something that should
ideally be fixed in the implementation?
Posted by Florian Gross (Guest)
on 2006-07-01 15:44
(Received via mailing list)
Xavier Noria wrote:

> zillion times back and forth until it undoes the walked way (potentially 
> a huge tree) and arrives to switch to #{threebytes_utf8} at the 
> beginning, which in turn may fail, etc.

I think in this case you can get away by using (?>...) instead of
(?:...) in character_utf8. Backtracking makes no sense, because the
choice is supposed to be unambiguous on a per-character basis. (?>...)
just tells the RE engine not to backtrack inside. It will skip right
over the construct.
Please log in before posting. Registration is free and takes only a minute.
Existing account (Switch to SSL-encrypted connection)
NEW: Do you have a Google/GoogleMail or Yahoo account? No registration required!
Log in with Google account | Log in with Yahoo account
No account? Register here.