Regex Comparison Causing Massive Memory Usage

Has anyone run into a problem where a regex comparison against a very
large string causes memory usage to massively and rapidly spike
upwards. A string of length ~200000 causes memory usage to reach and
exceed 2 gigs within seconds, crashing the running program.

I’m not looking to diagnose the problem (with this e-mail) or anything
like that, which is why I haven’t provided details. Just curious if
others have run into this problem.

On Jun 26, 2006, at 17:04, Mike H. wrote:

Has anyone run into a problem where a regex comparison against a
very large string causes memory usage to massively and rapidly
spike upwards. A string of length ~200000 causes memory usage to
reach and exceed 2 gigs within seconds, crashing the running program.
I’m not looking to diagnose the problem (with this e-mail) or
anything like that, which is why I haven’t provided details. Just
curious if others have run into this problem.

Some regexps explode because of a mixture of alternations,
quantifiers, etc. This is intrinsic to them so to speak, not
dependant on the engine. How does yours look?

– fxn

I recently wondered about this situation. My strings were approx one
order of magnitude smaller than what you have described, and I had no
problem running multiple “fancy” regex’s on them (mine were ‘C’ files).
Everybody told me not to worry about it.

They also mentioned that ‘IO’ and therefore ‘file’ classes have
enumerator methods via mixin, so you can usually do all the same stuff
directly to the file if you need to.

jp

Mike H. wrote:

Has anyone run into a problem where a regex comparison against a very
large string causes memory usage to massively and rapidly spike
upwards. A string of length ~200000 causes memory usage to reach and
exceed 2 gigs within seconds, crashing the running program.

I’m not looking to diagnose the problem (with this e-mail) or anything
like that, which is why I haven’t provided details. Just curious if
others have run into this problem.

Xavier N. wrote:

Some regexps explode because of a mixture of alternations,
quantifiers, etc. This is intrinsic to them so to speak, not
dependant on the engine. How does yours look?

– fxn

“My” Regex is not actually mine. It’s in Ruby’s SOAP Implementation.
The crash is (I believe) in this method

def Charset.is_utf8(str)
UTF8Regexp =~ str
end

and the regex in question looks like this

character_utf8 =
“(?:#{us_ascii}|#{twobytes_utf8}|#{threebytes_utf8}|#{fourbytes_utf8})”
UTF8Regexp = Regexp.new("\A#{character_utf8}*\z", nil, “NONE”)

On Jun 26, 2006, at 17:35, Mike H. wrote:

“(?:#{us_ascii}|#{twobytes_utf8}|#{threebytes_utf8}|#
{fourbytes_utf8})”
UTF8Regexp = Regexp.new("\A#{character_utf8}*\z", nil, “NONE”)

Yeah, looks potentially problematic at first sight.

In each single byte the engine has up to 3 possible backtracking
points that have to be remembered. I don’t know details about the
UTF-8 encoding, but if the alternatives are not exclusive you may
wrongly match #{twobytes_utf8} in the first byte to discover at the
end of the string the match fails. In that case the engine has to
backtrack a zillion times back and forth until it undoes the walked
way (potentially a huge tree) and arrives to switch to #
{threebytes_utf8} at the beginning, which in turn may fail, etc.

– fxn

On Jun 26, 2006, at 17:48, Xavier N. wrote:

In each single byte

That should be “in each single match”, meaning a successful match of
any of the alternatives at any given point.

– fxn

Xavier N. wrote:

UTF-8 encoding, but if the alternatives are not exclusive you may
wrongly match #{twobytes_utf8} in the first byte to discover at the
end of the string the match fails. In that case the engine has to
backtrack a zillion times back and forth until it undoes the walked
way (potentially a huge tree) and arrives to switch to #
{threebytes_utf8} at the beginning, which in turn may fail, etc.

– fxn

Yup. I figured it was something in the “has to store something at each
character” vein, although I had no clue what was actually happening. I
“fixed it” by temporarily commenting out the calling code, since I’m
only receiving known data.

In your opinion, is this unavoidable, or is this something that should
ideally be fixed in the implementation?

Xavier N. wrote:

zillion times back and forth until it undoes the walked way (potentially
a huge tree) and arrives to switch to #{threebytes_utf8} at the
beginning, which in turn may fail, etc.

I think in this case you can get away by using (?>…) instead of
(?:…) in character_utf8. Backtracking makes no sense, because the
choice is supposed to be unambiguous on a per-character basis. (?>…)
just tells the RE engine not to backtrack inside. It will skip right
over the construct.