Peter Schrammel wrote:
> Jeffrey Schwab wrote:
>> Peter Schrammel wrote:
>>
>>> got a problem with big regexes:
>>> I have a regex of about 70,000+ words concatenated with '|' that I'd
>>> like to match as a single regex: /bla|blub|foo|bar|.....(70000)/
>>>
>>> But unfortunately Ruby gives me a 'regular expression too big' error
>>> when I try to build such a thing.
>>> I had a look at the regex.c code and saw the limit of 1 << 16 bytes for
>>> regexes. Is there a way around this (without going down to 2000 words) ?
>>>
>>> Thanks for any hint
>> You could optimize the regex a little for size, e.g. by factoring out
>> common prefixes:
>>
>>     (b(l(a|ub)|ar)|foo)...
> 
> Thought of that.

Good for you.
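For what it's worth, the factoring can be automated rather than done by 
hand.  A minimal sketch (the helper names are my own invention): build a 
trie from the word list, then emit a pattern in which common prefixes 
are shared:

```ruby
# Build a trie (nested hashes) from a word list; :end marks a complete word.
def trie_from(words)
  root = {}
  words.each do |word|
    node = word.chars.inject(root) { |n, ch| n[ch] ||= {} }
    node[:end] = true
  end
  root
end

# Emit a prefix-factored pattern, e.g. %w[bla blub bar foo]
# becomes (?:b(?:l(?:a|ub)|ar)|foo).
def trie_to_pattern(node)
  branches = node.reject { |k, _| k == :end }
                 .map { |ch, child| Regexp.escape(ch) + trie_to_pattern(child) }
  return '' if branches.empty?                     # leaf: a word ends here
  alt = branches.size == 1 ? branches.first : "(?:#{branches.join('|')})"
  node[:end] ? "(?:#{alt})?" : alt                 # suffix optional if a word ends here
end
```

Whether the result actually fits under the 1 << 16 limit depends, as 
noted, on how much prefix redundancy those 70,000 words really share.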

>> Of course, that will only help if the | alternatives have a reasonable
>> amount of redundancy.  Alternatively, you could just break the whole
>> thing into multiple expressions.  Instead of
>>
>>     if /first_part|second_part/ =~ text
>>
>> You could try:
>>
>>     if /first_part/ =~ text or /second_part/ =~ text
> 
> Yes, that was my next thought, but where do I split? Just count the
> bytes and split near 1 << 16?

Probably better not to construct the mega-regex in the first place.  For 
the record, finding yourself on the edges of the language's capacity 
like this might be a sign that refactoring is in order.

Even if you stick with your current technique and just work around the 
size limitation, it's probably better not to build a megex (TM) that 
you'll then have to split up.  As you put the pattern together, only add 
an alternation if the cumulative size will stay below 0x10000 (or, 
better, a well-commented named constant with that value).
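Sketched in Ruby (the constant and helper names are mine; the limit is 
the one you found in regex.c, and I assume no single word exceeds it):

```ruby
# Historical limit from Ruby's regex.c, kept in a named constant.
MAX_REGEX_BYTES = 1 << 16

# Split the word list into several regexes, each built to stay under
# max_bytes, instead of one mega-regex that must be cut up afterwards.
def build_regexes(words, max_bytes = MAX_REGEX_BYTES)
  regexes = []
  chunk   = []
  size    = 0
  words.each do |word|
    escaped = Regexp.escape(word)
    added   = escaped.bytesize + 1          # +1 for the '|' separator
    if size + added > max_bytes && !chunk.empty?
      regexes << Regexp.new(chunk.join('|'))
      chunk = []
      size  = 0
    end
    chunk << escaped
    size += added
  end
  regexes << Regexp.new(chunk.join('|')) unless chunk.empty?
  regexes
end

# Equivalent of /first_part|second_part/ =~ text across all the pieces.
def match_any?(regexes, text)
  regexes.any? { |re| re =~ text }
end
```

This way the split points fall on word boundaries automatically, rather 
than wherever the byte count happens to land.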

> Why is there a limitation at all? I implemented the same thing in Perl
> and it doesn't complain ...
> Is Perl's regexp engine that much better?

As Friedl notes, Perl is darned close to being the ideal regex language. :)

Ruby regexes aren't necessarily meant to be the one hammer that can 
drive every nail.  If you want to be able to view every problem through 
a regex lens, you'll probably have to dig a little deeper than 
categorizing one language's engine as simply "better" than another. 
Big, fixed sizes like 2**16 are often chosen to avoid dynamic 
allocation, or to otherwise improve runtime efficiency.

Also for the record:  I'm a big fan of regexes.  Though a lot of people 
complain about complexity or efficiency issues, I've never had a problem 
with either.  I would be interested to see a comparison of the relative 
merits and limitations of various engines, e.g. regex lengths, 
benchmarks, big-O complexity, and ability to handle null bytes.