On Jun 28, 2006, at 8:20 PM, Yukihiro Matsumoto wrote:

> Have you ever heard of regular expression engine (one of the hardest
> parts to implement in text processing) that handles more than 30
> different encodings without conversion, _and_ runs faster than PCRE?
...
> If you haven't, I tell you that it is named Oniguruma, regular
> expression engine comes with Ruby 1.9.

I'd heard of it but hadn't tried it until now.  I have previously
measured the performance of the Perl and Java regex engines
(conclusion: Java is faster but Perl is safer; see
http://www.tbray.org/ongoing/When/200x/2005/11/20/Regex-Promises).

I thought I would compare Oniguruma, so I downloaded it, compiled it,
ran some tests, and looked at the documentation
(http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt and
http://www.geocities.jp/kosako3/oniguruma/doc/API.txt; or is there
something better?).

Oniguruma is very clever; support for multiple different regex  
syntaxes?  Wow.

The documentation needs a little work; the example files such as
simple.c do not correspond to it very well (e.g. ONIG_OPTION_DEFAULT).

But I think I must be missing something, because I can't run my
test.  It is a fast approximate word counter for large volumes of
XML.  Here is how the regular expression is built in Perl:

my $stag = "<[^/]([^>]*[^/>])?>";
my $etag = "</[^>]*>";
my $empty = "<[^>]*/>";

my $alnum =
     "\\p{L}|" .
     "\\p{N}|" .
     "[\\x{4e00}-\\x{9fa5}]|" .
     "\\x{3007}|" .
     "[\\x{3021}-\\x{3029}]";
my $wordChars =
     "\\p{L}|" .
     "\\p{N}|" .
     "[-._:']|" .
     "\\x{2019}|" .
     "[\\x{4e00}-\\x{9fa5}]|" .
     "\\x{3007}|" .
     "[\\x{3021}-\\x{3029}]";
my $word = "(($alnum)(($wordChars)*($alnum))?)";

my $regex = "($stag)|($etag)|($empty)|$word";

full regex:

(<[^/]([^>]*[^/>])?>)|(</[^>]*>)|(<[^>]*/>)|((\p{L}|\p{N}|[\x{4e00}-\x{9fa5}]|\x{3007}|[\x{3021}-\x{3029}])((\p{L}|\p{N}|[-._:']|\x{2019}|[\x{4e00}-\x{9fa5}]|\x{3007}|[\x{3021}-\x{3029}])*(\p{L}|\p{N}|[\x{4e00}-\x{9fa5}]|\x{3007}|[\x{3021}-\x{3029}]))?)
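
To make the intent concrete, here is a minimal sketch of how a pattern like this can drive the counting.  It is an illustration only, not the actual script: the sub name count_words, the $cjk shorthand and the sample input are made up for this example.  The tag alternatives come first so that element names are consumed as markup rather than counted as words.

```perl
use strict;
use warnings;

# Same pattern pieces as in the message; the shared CJK ranges are
# factored into $cjk (a shorthand introduced only for this sketch).
my $stag  = "<[^/]([^>]*[^/>])?>";
my $etag  = "</[^>]*>";
my $empty = "<[^>]*/>";
my $cjk   = "[\\x{4e00}-\\x{9fa5}]|\\x{3007}|[\\x{3021}-\\x{3029}]";
my $alnum     = "\\p{L}|\\p{N}|$cjk";
my $wordChars = "\\p{L}|\\p{N}|[-._:']|\\x{2019}|$cjk";
my $word  = "(($alnum)(($wordChars)*($alnum))?)";
my $regex = qr/($stag)|($etag)|($empty)|$word/;

# Hypothetical helper: count the matches that are words rather than tags.
# Groups 1-4 belong to the tag alternatives, so group 5 is the whole word.
sub count_words {
    my ($text) = @_;
    my $count = 0;
    while ($text =~ /$regex/g) {
        $count++ if defined $5;
    }
    return $count;
}

# Made-up sample input, just to exercise the helper.
print count_words("<doc><p>Hello, world's end.</p><br/></doc>"), "\n";
```

Checking whether group 5 is defined is what separates word matches from tag matches; anything the pattern does not match (whitespace, punctuation between words) is simply skipped by the /g scan.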

I have a very specific idea of what I mean by "word".  \w is nice but  
it's not what I mean.

As far as I can tell, \p{L} and so on don't work, so I can't do this
in Oniguruma.  Error message: "ERROR: invalid character property name
{L}".  So is a bit more work required to support Unicode?  (Supporting
the properties from Chapter 4 of the Unicode standard is very
important.)  Or am I misreading the documentation?  I did it in C
because simple.c was there; would it make a difference if I did it
from Ruby 1.9?

   -Tim