Peter S. wrote:
Paul L. wrote:
Yes, unless he is matching whole words, he is stuck with regexes. There is
very likely a way to refactor this problem, but it would have to start with
a clear statement of the problem to be solved.
Sorry, my fault.
The problem is to match a whole bunch (>70000) of words (later also
regexps) against a string.
You are not being very clear. Do you mean to match entire words, letter for
letter, beginning to end, at least some of the time? If so, for those cases
you can use a hash table that is preloaded with the words as keys. That will
be fast.

For the regexps, which one hopes are in the minority, the speed will of
course go down. But separate the two classes of tests: match whole words
using a hash for speed.
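Something along these lines, assuming Ruby, and with made-up names
(words.txt, WORDS, PATTERNS, scan_words, scan_patterns are all placeholders):

    require 'set'

    # exact words: preload once, then each lookup is a cheap hash probe
    WORDS = Set.new(File.readlines('words.txt').map { |l| l.chomp })

    def scan_words(text)
      text.scan(/\w+/).select { |token| WORDS.include?(token) }
    end

    # the (hopefully few) real regexps stay separate and are tried one by one
    PATTERNS = [/word1(the|a|an)word11/, /regexp2/]

    def scan_patterns(text)
      PATTERNS.select { |re| re.match(text) }
    end

The point is that the 70000 plain words never touch the regexp engine at all.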
The actual implementation is to concatenate the words with | and create one
big regexp: word1|word2|word3|word4 …
Yes, but you are moving ahead to implementation before stating the problem
to be solved.
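That said, if you do keep the big-alternation approach, Ruby's Regexp.union
will build it for you and escape regexp metacharacters in the plain words. A
rough sketch (the words array here is just a placeholder):

    words = %w[word1 word2 word3 word4]

    # plain strings are escaped, so each word matches literally
    big_re = Regexp.union(words)    # => /word1|word2|word3|word4/

    # real regexps can be mixed in as Regexp objects and keep their meaning
    big_re = Regexp.union(words + [/word1(the|a|an)word11/])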
This went well until I tested with some 100 words. Now I have the 'big
regexp problem' problem. The solution has to work with regexps as well as
plain words, like: word1(the|a|an)word11|regexp2|…
Also precompile the regexes before use. You probably already know this, but
I thought I would mention it anyway.
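By "precompile" I mean build the Regexp objects once, up front, and reuse
them, rather than constructing a fresh pattern on every call. A minimal
sketch, with patterns.txt and match_any? as placeholder names:

    # compiled once at load time
    REGEXPS = File.readlines('patterns.txt').map { |line| Regexp.new(line.chomp) }

    def match_any?(text)
      REGEXPS.any? { |re| re.match(text) }   # reuses the compiled objects
    end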
Another option is C++, which has a readily available regexp library. The way
I would go about this is to design the entire thing in Ruby, then, if the
speed was not acceptable, recreate it in C++. This gives you the advantage
of speedy development in Ruby, followed by speedy execution.
If this is a full-on language analysis problem, you really should be using
Prolog or Lisp anyway. If the problem really is as complex as you are
hinting at, you may not be using the right language, or even class of
language.