Regular expression too big

splattael · November 12, 2006, 3:32pm

Hi,

got problem with big regexes:
I have a regex of about 70000+ words concated with ‘|’ that I’d like to
match as a regex. /bla|blub|foo|bar|…(70000)/

But unfortunately ruby gives me a ‘regular expression too big’ if I’m
trying to build such a thing.
I had a look at the regex.c code and saw the limit of 1 << 16 bytes for
regexes. Is there a way around this (without going down to 2000 words) ?

Thanks for any hint

Peter

splattael · November 12, 2006, 3:42pm

Peter S. wrote:

got problem with big regexes:
I have a regex of about 70000+ words concated with ‘|’ that I’d like to
match as a regex. /bla|blub|foo|bar|…(70000)/

But unfortunately ruby gives me a ‘regular expression too big’ if I’m
trying to build such a thing.
I had a look at the regex.c code and saw the limit of 1 << 16 bytes for
regexes. Is there a way around this (without going down to 2000 words) ?

Thanks for any hint

You could optimize the regex a little for size, e.g. by factoring out
common prefixes:

(b(l(a|ub)|ar)|foo)...

Of course, that will only help if the | alternatives have a reasonable
amount of redundancy. Alternatively, you could just break the whole
thing into multiple expressions. Instead of

if /first_part|second_part/ =~ text

You could try:

if /first_part/ =~ text or /second_part/ =~ text

splattael · November 12, 2006, 4:06pm

Jeffrey S. wrote:

Thanks for any hint

You could optimize the regex a little for size, e.g. by factoring out
common prefixes:
(b(l(a|ub)|ar)|foo)...

Thought of that.

Of course, that will only help if the | alternatives have a reasonable
amount of redundancy. Alternatively, you could just break the whole
thing into multiple expressions. Instead of
if /first_part|second_part/ =~ text
You could try:
if /first_part/ =~ text or /second_part/ =~ text

Yes, that was my next thought but where to split? Just count the bytes
and splitt near 1 <<16?

Why is there a limitation at all? I implemented the same thing in perl
and it no complains …
Is the regexp engine of perl that much better?

Thanks for the reply

splattael · November 12, 2006, 4:42pm

On Sun, 12 Nov 2006 15:01:56 -0000, Peter S.
[email protected] wrote:

Why is there a limitation at all? I implemented the same thing in perl
and it no complains …
Is the regexp engine of perl that much better?

Irrespective of whether regex the best solution for your needs, it seems
Oniguruma will improve the situation somewhat with respect to large
regular expressions.

$ wc -w acd-holmes-return.txt
112110 acd-holmes-return.txt

$ cat bigregexp.rb
txt = File.read(‘acd-holmes-return.txt’).split(’ ‘)
re = /^(#{txt.map { |e| “#{Regexp.escape(e)}” }.join(’|')})$/

p re =~ “beautiful”
p $&
END

$ ruby -v bigregexp.rb
ruby 1.8.5 (2006-08-25) [i686-linux]
bigregexp.rb:2: regular expression too big: /…[snipped]…/
(RegexpError)

$ ruby9 -v bigregexp.rb
ruby 1.9.0 (2006-09-07) [i686-linux]
0
“beautiful”

splattael · November 12, 2006, 4:28pm

This doesn’t really help with the actual question of getting past
regex size limits, but are you sure that regexen are correct solution
to this problem? Unless I’m mistaken, the above match is going to be
horribly, painfully slow; the issue you’re running into is probably an
indication that you might want to look elsewhere.

I don’t want to make any silly claims without doing any benchmarking
on your data, but I would imagine that even doing an efficient search
on a sorted array of your words would give you better performance than
the regex search. You can move up from there into hashes or other
data structures for this sort of thing.

splattael · November 12, 2006, 5:50pm

Peter S. wrote:

regexes. Is there a way around this (without going down to 2000 words) ?

Thanks for any hint
You could optimize the regex a little for size, e.g. by factoring out
common prefixes:
(b(l(a|ub)|ar)|foo)...
Thought of that.

Good for you.

Yes, that was my next thought but where to split? Just count the bytes
and splitt near 1 <<16?

Probably better not to construct the mega-regex in the first place. For
the record, finding yourself on the edges of the language’s capacity
like this might be a sign that refactoring is in order.

Even if you’re going to stick with your current technique, but work
around the size limitation, it’s probably better not to build a megex
™ you’ll have to split up. As you put the pattern together, only add
alternations as long as the cumulative size will be < 0x10000 (or a
well-commented static constant with that value).

Why is there a limitation at all? I implemented the same thing in perl
and it no complains …
Is the regexp engine of perl that much better?

As Friedl notes, Perl is darned close to being the ideal regex language.

Ruby regexes aren’t necessarily meant to be the one hammer that can
drive every nail. If you want to be able to view every problem through
a regex lens, you’ll probably have to dig a little deeper than
categorizing one language’s engine as simply “better” than another.
Big, static sizes like 2**16 are often used to avoid dynamic allocation,
or otherwise improve runtime efficiency.

Also for the record: I’m a big fan of regexes. Though a lot of people
complain about complexity or efficiency issues, I’ve never had a problem
with either. I would be interested to see a comparison of the relative
merits and limitations of various engines, e.g. regex lengths,
benchmarks, big-O complexity, and ability to handle null bytes.

splattael · November 12, 2006, 8:20pm

On Sun, 12 Nov 2006 15:25:49 +0100, Peter S. wrote:

Thanks for any hint

Peter

Maybe a trie would be useful?
http://rubyforge.org/projects/trie/
(or there’s another trie at http://kzk9.net/software/miscprograms/ruby/)

–Ken

splattael · November 12, 2006, 8:30pm

On Sun, 12 Nov 2006 19:15:12 +0000, Ken B. wrote:

I had a look at the regex.c code and saw the limit of 1 << 16 bytes for
regexes. Is there a way around this (without going down to 2000 words) ?

Thanks for any hint

Peter

Maybe a trie would be useful?
http://rubyforge.org/projects/trie/
(or there’s another trie at http://kzk9.net/software/miscprograms/ruby/)

On one more thing: to implement substring search using a trie, when
adding words to the trie, you should generate appropriate back links so
that you can implement =~ use the Knuth-Morris-Pratt algorithm for
matching.

–Ken

splattael · November 12, 2006, 8:57pm

Peter S. wrote:

Hi,

got problem with big regexes:
I have a regex of about 70000+ words concated with ‘|’ that I’d like to
match as a regex. /bla|blub|foo|bar|…(70000)/

It appears you are trying to match a word against a list of words. Yes?
From
an efficiency standpoint, if the expression is really as shown (no real
regex syntax), why not create a hash out of the data and see if the
desired
word is present that way? That would likely be much faster than running
the
(non-regex) regex against each word in the input data.

When you post an question like this, it is always a good idea to reveal
the
purpose as well as the method.

#!/usr/bin/ruby -w

data = File.read("/usr/share/dict/words")

words = data.split(%r{\s+}m)

hash = {}

words.each do |word|
hash[word] = true
end

while true
print “Enter a word:”
word = STDIN.readline.chomp
if hash[word]
puts “Valid word.”
else
puts “Not present.”
end
end

Now you have a hash that can be used for testing a list of words.

splattael · November 12, 2006, 9:47pm

On 11/12/06, Peter S. [email protected] wrote:

got problem with big regexes:
I have a regex of about 70000+ words concated with ‘|’ that I’d like to
match as a regex. /bla|blub|foo|bar|…(70000)/

if you have many words to check for then consider using a

splattael · November 12, 2006, 10:11pm

When you post an question like this, it is always a good idea to
reveal the
purpose as well as the method.

First of all: Thanks for the replies, I think I have enough input to
chew on.

And to reveal the purpose:
I’d like to match a LOT of words/strings (words with spaces) and
sometimes regexes against a lot of small and medium size( < 10000 byte )
strings.

The comparison with Perl was just a comparison of the regexp engines and
not the language itself. It was just that until now I had never to think
much about the underlying hardware … because ruby did that for me. So
I was surprised by the “too big exception”.

Peter

splattael · November 13, 2006, 2:31am

On Mon, 13 Nov 2006 05:46:30 +0900, Simon S. wrote:

On 11/12/06, Peter S. [email protected] wrote:

got problem with big regexes:
I have a regex of about 70000+ words concated with ‘|’ that I’d like to
match as a regex. /bla|blub|foo|bar|…(70000)/

if you have many words to check for then consider using a
Bloom filter - Wikipedia

The only reason I didn’t suggest that is becuase it can have false
positives.

–Ken B.
^^^^^

splattael · November 13, 2006, 9:10am

Peter S. wrote:

I’d like to match a LOT of words/strings (words with spaces) and
sometimes regexes against a lot of small and medium size( < 10000 byte )
strings.

If you are going to compare strings to strings as well as strings to
regexes, maybe a two-tiered scheme would be better. First tier, simple
textual comparison using a hash of strings. Second tier, regexes.

This removes the ambiguity that a particular data string might pass a
string
comparison but fail any of the provided regexes, or the opposite. It
will
also be much faster than a gigantic list of strings and regexes, all
presented to the regex engine as though they were all regular
expressions,
even though some are simple string comparisons.

I think this (e.g speed improvement) would have been true in Perl also.

splattael · November 13, 2006, 3:51pm

On Mon, 13 Nov 2006 00:07:52 -0800, Paul L. wrote:

comparison but fail any of the provided regexes, or the opposite. It will
also be much faster than a gigantic list of strings and regexes, all
presented to the regex engine as though they were all regular expressions,
even though some are simple string comparisons.

I think this (e.g speed improvement) would have been true in Perl also.

His regex isn’t anchored to the beginning and end of the string. This
makes
hashes useless for the kind of comparison he seems to want to perform.

–Ken

splattael · November 13, 2006, 6:21pm

Ken B. wrote:

/ …

I think this (e.g speed improvement) would have been true in Perl also.

His regex isn’t anchored to the beginning and end of the string.

Yes, I see that now. From the appearance of the data, it looks as though
he
is matching whole words, but that’s only an assumption on my part.

This makes hashes useless for the kind of comparison he seems to want to
perform.

Yes, unless he is matching whole words, he is stuck with regexes. There
is
very likely to be a refactoring for this problem, and it would have to
start with a clear statement of the problem to be solved.

splattael · November 13, 2006, 6:45pm

On 12/nov/06, at 15:30, Peter S. wrote:

for
regexes. Is there a way around this (without going down to 2000
words) ?

I friend of mine told me to suggest you to system() a perl script
which does the job

However, since he was just kidding, I think I’ll suggest you
something like this:

class String
def include_at_least_one_of?(words)
words.find { |w| include? w }
end
end

my_words = %w(bla blub foo bar)

“zump blub asd”.include_at_least_one_of? my_words # => “blub”
“nah none included”.include_at_least_one_of? my_words # => nil

You probably don’t want to hack core classes in order to do this, but
you get the idea.

splattael · November 13, 2006, 7:29pm

On Nov 13, 2006, at 11:45 AM, Gabriele M. wrote:

class String
def include_at_least_one_of?(words)
words.find { |w| include? w }
end
end

Just a tiny style related suggestion. I would use any?() instead of
find() there.

James Edward G. II

splattael · November 13, 2006, 7:19pm

Simon S. [email protected] wrote:

On 11/12/06, Peter S. [email protected] wrote:

got problem with big regexes:
I have a regex of about 70000+ words concated with ‘|’ that I’d like to
match as a regex. /bla|blub|foo|bar|…(70000)/

if you have many words to check for then consider using a
Bloom filter - Wikipedia

There’s even a library for doing Bloom filters:
http://raa.ruby-lang.org/project/rbloomfilter/

(Bloom filters were invented by Burton Bloom, who is not even remotely
related to me)

splattael · November 17, 2006, 12:06am

Paul L. wrote:

Yes, unless he is matching whole words, he is stuck with regexes. There is
very likely to be a refactoring for this problem, and it would have to
start with a clear statement of the problem to be solved.

Sorry, my fault.

The problem is to match a whole bunch (>70000) of words (later regexps)
on a string.
The actual implementation is to concat the word with | and create a big
regexp. word1|word2|word3|word4 …
This went well until I tested with some 100 words. Now I have the ‘big
regexp problem’ problem. The solution has to work with regexps as words
as well like: word1(the|a|an)word11|regexp2|…
So the currently working solutions are:
-use ruby 1.9 (the performance is far below perl, but I have to
benchmark this.)
-loop through the words/regexp and match each of them on the string
(don’t ask for performance here).
-perhaps I’ll realy pipe the problem into a
perl-implemented-matching-server (not that bad idea)

peter

splattael · November 17, 2006, 2:36am

On 11/17/06, Peter S. [email protected] wrote:

regexp. word1|word2|word3|word4 …
Is it exact word matching?

if so then you can split the input-string on blankspace,
so you get an array of words. Then you can sort the array of words

irb(main):001:0> a = %w(a b c e f g)
=> [“a”, “b”, “c”, “e”, “f”, “g”]
irb(main):002:0> b = %w(b d f)
=> [“b”, “d”, “f”]
irb(main):003:0> a & b
=> [“b”, “f”]