Regular expression too big

Peter S. [email protected] writes:

Why is there a limitation at all? I implemented the same thing in perl
and it no complains …
Is the regexp engine of perl that much better?

I’d be very interested if you could post some benchmarks comparing
that huge regexp both in Perl’s and Ruby’s regexp engines. I think a
pretty new Perl actually optimizes the regexp into a trie, so it
should be loads faster than using an naive implementation.

Paul L. wrote:

on a string.

You are not being very clear. Do you mean to match entire words, letter for
letter, beginning to end, at least some of the time? If so, for those
cases, you can use a hash table that is preloaded with the words as keys.
That will be fast.

no its a substring match kind of /hello/.match(“dear customer id like to
say hello to …”). I have to find 70000 words in a large textfile:
if lotofwords.match(bigtext) …
I need a trigger to classify the bitext depending on wether any of the
word (later regexps) are in it or not. Tries work well for substring
matching in this case but fail with regexps.

Also precompile the regexes before use. You probably already know this, but
I thought I would mention it anyway.

yes, done that.

Another option is C++, which has a readily available regexp library. The way
I would go about this is to design the entire thing in Ruby, then, if the
speed was not acceptable, recreate it in C++. This gives you the advantage
of speedy development in Ruby, followed by speedy execution.

If this is a full-on language analysis problem, you really should be using
Prolog or Lisp anyway. If the problem really is as complex as you are
hinting at, you may not be using the right language, or even class of
language.

yes, but I wanted to try it with “my favorit” language first before
doing something wild. It’s a rails application so I’d have to call
lisp/perl/… programms, interfacing them (calling on each test would
be a mess because reading somethousand lines every time is a performance
killer.)

Anyway, thanks for the replies I’ll give a perl interface a try (due to
some other reasons like robust html parsing …).

Peter S. wrote:

Paul L. wrote:

Yes, unless he is matching whole words, he is stuck with regexes. There
is very likely to be a refactoring for this problem, and it would have to
start with a clear statement of the problem to be solved.

Sorry, my fault.

The problem is to match a whole bunch (>70000) of words (later regexps)
on a string.

You are not being very clear. Do you mean to match entire words, letter
for
letter, beginning to end, at least some of the time? If so, for those
cases, you can use a hash table that is preloaded with the words as
keys.
That will be fast.

For the regexps, which one hopes are in the minority, the speed will of
course go down.

But separate the two classes of tests – match whole words using a hash
for
speed.

The actual implementation is to concat the word with | and create a big
regexp. word1|word2|word3|word4 …

Yes, but you are moving ahead to implementation before stating the
problem
to be solved.

This went well until I tested with some 100 words. Now I have the ‘big
regexp problem’ problem. The solution has to work with regexps as words
as well like: word1(the|a|an)word11|regexp2|…

Also precompile the regexes before use. You probably already know this,
but
I thought I would mention it anyway.

Another option is C++, which has a readily available regexp library. The
way
I would go about this is to design the entire thing in Ruby, then, if
the
speed was not acceptable, recreate it in C++. This gives you the
advantage
of speedy development in Ruby, followed by speedy execution.

If this is a full-on language analysis problem, you really should be
using
Prolog or Lisp anyway. If the problem really is as complex as you are
hinting at, you may not be using the right language, or even class of
language.

“Converts a list of words to a regular expression with minimum
backtracking by joining words with common prefixes. It is a port
of the Perl module MakeRegex.pm by Hakan Kjellerstrand with
some improvements.”
http://raa.ruby-lang.org/project/makeregex/

YMMV; I have never used it on anything like the scale you are.

In a little bit of testing here, it goes to long after about 8,000
words.

puts “Took #{finish - start} seconds to convert #{words.size} words into a regex #{r.size} bytes long.”

“FOO”.match(r)
end

Took 0.000372 seconds to convert 2 words into a regex 20 bytes long.
Took 0.000285 seconds to convert 3 words into a regex 25 bytes long.
Took 0.000359 seconds to convert 5 words into a regex 51 bytes long.
Took 0.000493 seconds to convert 9 words into a regex 86 bytes long.
Took 0.000973 seconds to convert 17 words into a regex 157 bytes long.
Took 0.001773 seconds to convert 33 words into a regex 285 bytes long.
Took 0.005386 seconds to convert 65 words into a regex 491 bytes long.
Took 0.00823 seconds to convert 129 words into a regex 933 bytes long.
Took 0.019234 seconds to convert 257 words into a regex 1876 bytes long.
Took 0.042557 seconds to convert 513 words into a regex 3856 bytes long.
Took 0.09146 seconds to convert 1025 words into a regex 7807 bytes long.
Took 0.196851 seconds to convert 2049 words into a regex 15669 bytes
long.
Took 0.399155 seconds to convert 4097 words into a regex 32325 bytes
long.
Took 0.968776 seconds to convert 8193 words into a regex 64671 bytes
long.
foo:14:in `match’: regular expression too big:
/(?:1(?:0(?:80\n|-point\n|th\n)|…

In a little bit of testing here, it goes to long after about 8,000
words.

require ‘makeregex’

20.times do |n|
words = IO.readlines("/usr/share/dict/words")[0…(2 ** n)]

Of course, that isn’t the best sample of words to test with.

Here is a revised version (fails with a smaller word count):

end
finish = Time.now

puts “Took #{finish - start} seconds to convert #{w.size} words into a regex #{r.size} bytes long.”

“FOO”.match®
end

Took 9.9e-05 seconds to convert 1 words into a regex 9 bytes long.
Took 0.000163 seconds to convert 2 words into a regex 18 bytes long.
Took 0.000169 seconds to convert 4 words into a regex 42 bytes long.
Took 0.000299 seconds to convert 8 words into a regex 80 bytes long.
Took 0.000618 seconds to convert 16 words into a regex 176 bytes long.
Took 0.001366 seconds to convert 32 words into a regex 324 bytes long.
Took 0.003021 seconds to convert 64 words into a regex 689 bytes long.
Took 0.007129 seconds to convert 128 words into a regex 1391 bytes long.
Took 0.019143 seconds to convert 256 words into a regex 2740 bytes long.
Took 0.149488 seconds to convert 512 words into a regex 5295 bytes long.
Took 0.304324 seconds to convert 1024 words into a regex 10236 bytes
long.
Took 0.307832 seconds to convert 2048 words into a regex 19755 bytes
long.
Took 0.404071 seconds to convert 4096 words into a regex 37308 bytes
long.
foo:26:in `match’: regular expression too big:
/(?:A(?:ME|c(?:arapis|kley|mispon)|…

Peter S. wrote:

got problem with big regexes:
I have a regex of about 70000+ words concated with ‘|’ that I’d like to
match as a regex. /bla|blub|foo|bar|…(70000)/

But unfortunately ruby gives me a ‘regular expression too big’ if I’m
trying to build such a thing.
I had a look at the regex.c code and saw the limit of 1 << 16 bytes for
regexes. Is there a way around this (without going down to 2000 words) ?

Thanks for any hint

Jeffrey S. wrote:

You could optimize the regex a little for size, e.g. by factoring out
common prefixes:

(b(l(a|ub)|ar)|foo)...

Peter S. wrote:

Thought of that.

Have you seen:
“Converts a list of words to a regular expression with minimum
backtracking by joining words with common prefixes. It is a port
of the Perl module MakeRegex.pm by Hakan Kjellerstrand with
some improvements.”
http://raa.ruby-lang.org/project/makeregex/

YMMV; I have never used it on anything like the scale you are.

Anyway, thanks for the replies I’ll give a perl interface a try (due to
some other reasons like robust html parsing …).

there’s got to be a better way to do this. I recently attempted to
implement a particle simulation algorithm for arranging nodes in a
graph in Rails, before I came to my senses.

solving this problem with regexes sounds like a nightmare on the
maintenance front. I think it’s possible that you’re using the wrong
language but more likely that you’re just using the wrong technique.

On 11/12/06, Ross B. [email protected] wrote:

On Sun, 12 Nov 2006 15:01:56 -0000, Peter S.
[email protected] wrote:

Why is there a limitation at all? I implemented the same thing in perl
and it no complains …
Is the regexp engine of perl that much better?

Irrespective of whether regex the best solution for your needs, it seems
Oniguruma will improve the situation somewhat with respect to large
regular expressions.

I built a local version of 1.8.5 with the oniguruma engine:
http://raa.ruby-lang.org/project/oniguruma/

And re-ran (a slight variation of) my test program:

[~]$ ruby foo
Using the regex engine.
Converted a list of 1 words into a regex 8 bytes long.
Converted a list of 2 words into a regex 36 bytes long.
Converted a list of 4 words into a regex 48 bytes long.
Converted a list of 8 words into a regex 73 bytes long.
Converted a list of 16 words into a regex 173 bytes long.
Converted a list of 32 words into a regex 352 bytes long.
Converted a list of 64 words into a regex 718 bytes long.
Converted a list of 128 words into a regex 1415 bytes long.
Converted a list of 256 words into a regex 2656 bytes long.
Converted a list of 512 words into a regex 5210 bytes long.
Converted a list of 1024 words into a regex 10105 bytes long.
Converted a list of 2048 words into a regex 19432 bytes long.
Converted a list of 4096 words into a regex 37509 bytes long.
@_@

[~]$ /usr/local/bin/ruby foo
Using the Oniguruma regex engine.
Converted a list of 1 words into a regex 11 bytes long.
Converted a list of 2 words into a regex 16 bytes long.
Converted a list of 4 words into a regex 38 bytes long.
Converted a list of 8 words into a regex 97 bytes long.
Converted a list of 16 words into a regex 185 bytes long.
Converted a list of 32 words into a regex 359 bytes long.
Converted a list of 64 words into a regex 686 bytes long.
Converted a list of 128 words into a regex 1387 bytes long.
Converted a list of 256 words into a regex 2715 bytes long.
Converted a list of 512 words into a regex 5264 bytes long.
Converted a list of 1024 words into a regex 10074 bytes long.
Converted a list of 2048 words into a regex 19439 bytes long.
Converted a list of 4096 words into a regex 37452 bytes long.
Converted a list of 8192 words into a regex 71931 bytes long.
Converted a list of 16384 words into a regex 135572 bytes long.
Converted a list of 32768 words into a regex 253027 bytes long.
Converted a list of 65536 words into a regex 461607 bytes long.
Converted a list of 131072 words into a regex 808171 bytes long.
Converted a list of 262144 words into a regex 1326345 bytes long.
Converted a list of 479625 words into a regex 1873539 bytes long.

On 11/17/06, [email protected] [email protected] wrote:

regular expressions.

I built a local version of 1.8.5 with the oniguruma engine:
http://raa.ruby-lang.org/project/oniguruma/

And re-ran (a slight variation of) my test program:

I thought I’d try running under jruby too:

$ ruby long_regex_test.rb
Took 0.000153 seconds to convert 1 words into a regex 17 bytes long.
Took 0.000381 seconds to convert 2 words into a regex 20 bytes long.
Took 0.000393 seconds to convert 4 words into a regex 36 bytes long.
Took 0.000629 seconds to convert 8 words into a regex 93 bytes long.
Took 0.001359 seconds to convert 16 words into a regex 180 bytes long.
Took 0.002261 seconds to convert 32 words into a regex 360 bytes long.
Took 0.007304 seconds to convert 64 words into a regex 741 bytes long.
Took 0.013601 seconds to convert 128 words into a regex 1348 bytes long.
Took 0.028273 seconds to convert 256 words into a regex 2746 bytes long.
Took 0.066228 seconds to convert 512 words into a regex 5345 bytes long.
Took 0.177105 seconds to convert 1024 words into a regex 10017 bytes
long.
Took 0.330573 seconds to convert 2048 words into a regex 19597 bytes
long.
Took 1.390542 seconds to convert 4096 words into a regex 37345 bytes
long.
long_regex_test.rb:26:in match': regular expression too big: /(?:A(?:cr(?:edula|opora)|d(?:ar|elochorda|ventis[mt])|frogaean|hepatokla|ileen|l(?:adinist|l(?:a(?:manda|sch)|otheria)|ticamelus)|m(?:bystomidae|ericanly|ioidei|phioxidae)|n(?:chisaurus|d(?:aman|romache)|olympiad|t(?:echinomys|h(?:ophila|ropozoic)))|patornis|r(?:ab|chelenis|istarch)|s(?:caridia|elli|hantee|ilidae|terias)|tropa|u(?:riculidae|stroasiatic))|B(?:a(?:cchus|eria|haism|iera|k(?:shaish|wiri)|re|sili(?:ca|scus))|e(?:atrice|l(?:g(?:ae|ic)|shazzaresque)|mbex|rn(?:inesque|oullian))|i(?:elid|lati|smarck|tis)|lackfoot|o(?:hemia|llandist|rrovian)|ra(?:m|nchiopulmonata)|u(?:nga|phthalmum)|yroni(?:cs|te))|C(?:a(?:ctales|l(?:edonia|li(?:carpa|stephus)|ochortaceae|vados|ycophorae)|m(?:bodian|orra)|ntabri|p(?:ito(?:line)?|sidae)|r(?:olan|tist)|s(?:sandra|tanospermum)|thari)|e(?:ntrarchidae|strian)|h(?:arontas|e(?:lura|makuan)|rist(?:ianomastix|li(?:keness|ness)|mas))|lathrus|o(?:bleskill|fane|l(?:letidae|ossian)|m(?:melinaceae|us)|rybantic)|rocus|u(?:cumariidae|thbert)|y(?:clospondy (RegexpError) from long_regex_test.rb:26 from long_regex_test.rb:15:in times’
from long_regex_test.rb:15

$ /opt/ruby/v1.8.5-oniguruma/bin/ruby long_regex_test.rb
Took 0.000211 seconds to convert 1 words into a regex 5 bytes long.
Took 0.000334 seconds to convert 2 words into a regex 24 bytes long.
Took 0.000215 seconds to convert 4 words into a regex 52 bytes long.
Took 0.000836 seconds to convert 8 words into a regex 92 bytes long.
Took 0.000885 seconds to convert 16 words into a regex 173 bytes long.
Took 0.002779 seconds to convert 32 words into a regex 345 bytes long.
Took 0.004934 seconds to convert 64 words into a regex 725 bytes long.
Took 0.009765 seconds to convert 128 words into a regex 1369 bytes long.
Took 0.020761 seconds to convert 256 words into a regex 2737 bytes long.
Took 0.088759 seconds to convert 512 words into a regex 5408 bytes long.
Took 0.144276 seconds to convert 1024 words into a regex 10131 bytes
long.
Took 0.246762 seconds to convert 2048 words into a regex 19531 bytes
long.
Took 0.667575 seconds to convert 4096 words into a regex 37498 bytes
long.
Took 1.677037 seconds to convert 8192 words into a regex 71352 bytes
long.
Took 2.971277 seconds to convert 16384 words into a regex 133499 bytes
long.
Took 6.078681 seconds to convert 32768 words into a regex 245318 bytes
long.
Took 13.001538 seconds to convert 65536 words into a regex 433611 bytes
long.
Took 26.791838 seconds to convert 131072 words into a regex 713229 bytes
long.
Took 47.691109 seconds to convert 262144 words into a regex 1061186
bytes long.
Took 71.050324 seconds to convert 524288 words into a regex 1354567
bytes long.

$ export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Home
$ ~/Desktop/jruby-0.9.1/bin/jruby long_regex_test.rb
Took 0.032 seconds to convert 1 words into a regex 9 bytes long.
Took 0.012 seconds to convert 2 words into a regex 18 bytes long.
Took 0.624 seconds to convert 4 words into a regex 40 bytes long.
Took 0.033 seconds to convert 8 words into a regex 95 bytes long.
Took 0.095 seconds to convert 16 words into a regex 156 bytes long.
Took 0.057 seconds to convert 32 words into a regex 358 bytes long.
Took 0.171 seconds to convert 64 words into a regex 743 bytes long.
Took 0.309 seconds to convert 128 words into a regex 1402 bytes long.
Took 0.40900000000000003 seconds to convert 256 words into a regex
2692 bytes long.
Took 1.863 seconds to convert 512 words into a regex 5341 bytes long.
Took 0.838 seconds to convert 1024 words into a regex 10328 bytes long.
Took 1.504 seconds to convert 2048 words into a regex 19733 bytes long.
Took 2.814 seconds to convert 4096 words into a regex 37334 bytes long.
Took 8.177 seconds to convert 8192 words into a regex 71593 bytes long.
Took 15.181000000000001 seconds to convert 16384 words into a regex
133779 bytes long.
Took 30.695 seconds to convert 32768 words into a regex 244280 bytes
long.
Took 61.555 seconds to convert 65536 words into a regex 432751 bytes
long.
Took 155.94400000000002 seconds to convert 131072 words into a regex
713573 bytes long.
Took 224.93 seconds to convert 262144 words into a regex 1060079 bytes
long.
Exception in thread “main” java.lang.OutOfMemoryError: Java heap space

Peter S. wrote:

/ …

The problem is to match a whole bunch (>70000) of words (later regexps)
on a string.

You are not being very clear. Do you mean to match entire words, letter
for letter, beginning to end, at least some of the time? If so, for those
cases, you can use a hash table that is preloaded with the words as keys.
That will be fast.

no its a substring match kind of /hello/.match(“dear customer id like to
say hello to …”).

Translation: “Yes, I am searching for matches with individual words or
groups of words, character by character, without always requiring
regular
expressions.” In this case, a hash approach is indicated for one tier of
your solution.

You need to realize that this:

/hello/.match(“dear customer id like to say hello to …”)

Is fantastically, unbelievably, slower than this:

hash = [ “hello” => true,“goodbye” => true,“stay a while” => true ]

if hash[word] …

All you have to do is tokenize your word list using “split” or another
similar approach, then apply the hash to it. You may have a need for
regular expressions, but it is clear that you can also use a hash and
save
a load of run time.

I have to find 70000 words in a large textfile:
if lotofwords.match(bigtext) …
I need a trigger to classify the bitext depending on wether any of the
word (later regexps) are in it or not. Tries work well for substring
matching in this case but fail with regexps.

At this point, to offer any useful advice, it would be nice to know what
the
goal is, more specifically than we’ve heard up to now.

/ …

Anyway, thanks for the replies I’ll give a perl interface a try (due to
some other reasons like robust html parsing …).

Oh, Ruby can parse HTML just fine, and in only a few lines.

#would_never_do.rb
matched = perl -e 'if ("Hello" =~ m/(Hello)/) {print "true";}'