Has anyone run into a problem where a regex comparison against a very large string causes memory usage to massively and rapidly spike upwards. A string of length ~200000 causes memory usage to reach and exceed 2 gigs within seconds, crashing the running program. I'm not looking to diagnose the problem (with this e-mail) or anything like that, which is why I haven't provided details. Just curious if others have run into this problem.
on 2006-06-26 17:05
on 2006-06-26 17:16
On Jun 26, 2006, at 17:04, Mike Harris wrote: > Has anyone run into a problem where a regex comparison against a > very large string causes memory usage to massively and rapidly > spike upwards. A string of length ~200000 causes memory usage to > reach and exceed 2 gigs within seconds, crashing the running program. > I'm not looking to diagnose the problem (with this e-mail) or > anything like that, which is why I haven't provided details. Just > curious if others have run into this problem. Some regexps explode because of a mixture of alternations, quantifiers, etc. This is intrinsic to them so to speak, not dependant on the engine. How does yours look? -- fxn
on 2006-06-26 17:17
I recently wondered about this situation. My strings were approx one order of magnitude smaller than what you have described, and I had no problem running multiple "fancy" regex's on them (mine were 'C' files). Everybody told me not to worry about it. They also mentioned that 'IO' and therefore 'file' classes have enumerator methods via mixin, so you can usually do all the same stuff directly to the file if you need to. jp Mike Harris wrote: > Has anyone run into a problem where a regex comparison against a very > large string causes memory usage to massively and rapidly spike > upwards. A string of length ~200000 causes memory usage to reach and > exceed 2 gigs within seconds, crashing the running program. > > I'm not looking to diagnose the problem (with this e-mail) or anything > like that, which is why I haven't provided details. Just curious if > others have run into this problem.
on 2006-06-26 17:36
Xavier Noria wrote: > > Some regexps explode because of a mixture of alternations, > quantifiers, etc. This is intrinsic to them so to speak, not > dependant on the engine. How does yours look? > > -- fxn > > > > "My" Regex is not actually mine. It's in Ruby's SOAP Implementation. The crash is (I believe) in this method def Charset.is_utf8(str) UTF8Regexp =~ str end and the regex in question looks like this character_utf8 = "(?:#{us_ascii}|#{twobytes_utf8}|#{threebytes_utf8}|#{fourbytes_utf8})" UTF8Regexp = Regexp.new("\\A#{character_utf8}*\\z", nil, "NONE")
on 2006-06-26 17:49
On Jun 26, 2006, at 17:35, Mike Harris wrote: > "(?:#{us_ascii}|#{twobytes_utf8}|#{threebytes_utf8}|# > {fourbytes_utf8})" > UTF8Regexp = Regexp.new("\\A#{character_utf8}*\\z", nil, "NONE") Yeah, looks potentially problematic at first sight. In each single byte the engine has up to 3 possible backtracking points that have to be remembered. I don't know details about the UTF-8 encoding, but if the alternatives are not exclusive you may wrongly match #{twobytes_utf8} in the first byte to discover at the end of the string the match fails. In that case the engine has to backtrack a zillion times back and forth until it undoes the walked way (potentially a huge tree) and arrives to switch to # {threebytes_utf8} at the beginning, which in turn may fail, etc. -- fxn
on 2006-06-26 17:53
On Jun 26, 2006, at 17:48, Xavier Noria wrote:
> In each single byte
That should be "in each single match", meaning a successful match of
any of the alternatives at any given point.
-- fxn
on 2006-06-26 17:56
Xavier Noria wrote: >> > UTF-8 encoding, but if the alternatives are not exclusive you may > wrongly match #{twobytes_utf8} in the first byte to discover at the > end of the string the match fails. In that case the engine has to > backtrack a zillion times back and forth until it undoes the walked > way (potentially a huge tree) and arrives to switch to # > {threebytes_utf8} at the beginning, which in turn may fail, etc. > > -- fxn > > Yup. I figured it was something in the "has to store something at each character" vein, although I had no clue what was actually happening. I "fixed it" by temporarily commenting out the calling code, since I'm only receiving known data. In your opinion, is this unavoidable, or is this something that should ideally be fixed in the implementation?
on 2006-07-01 15:44
Xavier Noria wrote: > zillion times back and forth until it undoes the walked way (potentially > a huge tree) and arrives to switch to #{threebytes_utf8} at the > beginning, which in turn may fail, etc. I think in this case you can get away by using (?>...) instead of (?:...) in character_utf8. Backtracking makes no sense, because the choice is supposed to be unambiguous on a per-character basis. (?>...) just tells the RE engine not to backtrack inside. It will skip right over the construct.
Please log in before posting. Registration is free and takes only a minute.
Existing account
(Switch to SSL-encrypted connection)
NEW: Do you have a Google/GoogleMail or Yahoo account? No registration required!
Log in with Google account | Log in with Yahoo account
Log in with Google account | Log in with Yahoo account
No account? Register here.