On Jun 28, 2006, at 8:20 PM, Yukihiro Matsumoto wrote:

> Have you ever heard of regular expression engine (one of the hardest
> parts to implement in text processing) that handles more than 30
> different encodings without conversion, _and_ runs faster than PCRE?
...
> If you haven't, I tell you that it is named Oniguruma, regular
> expression engine comes with Ruby 1.9.

I'd heard of it but I hadn't tried it until now. Previously I did some
quantitative measurements of the performance of the Perl and Java regex
engines (conclusion: Java is faster but Perl is safer; see
http://www.tbray.org/ongoing/When/200x/2005/11/20/Regex-Promises). I
thought I would compare Oniguruma, so I downloaded it, compiled it, ran
some tests, and looked at the documentation
(http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt and
http://www.geocities.jp/kosako3/oniguruma/doc/API.txt, or is there
something better?).

Oniguruma is very clever; support for multiple different regex
syntaxes? Wow. The documentation needs a little work; the example files
such as simple.c do not correspond to it very well (e.g.
ONIG_OPTION_DEFAULT).

But I think I must be missing something, because I can't run my test.
It is a fast approximate word counter for large volumes of XML. Here is
how the regular expression is built in Perl:

    my $stag = "<[^/]([^>]*[^/>])?>";
    my $etag = "</[^>]*>";
    my $empty = "<[^>]*/>";
    my $alnum =
        "\\p{L}|" .
        "\\p{N}|" .
        "[\\x{4e00}-\\x{9fa5}]|" .
        "\\x{3007}|" .
        "[\\x{3021}-\\x{3029}]";
    my $wordChars =
        "\\p{L}|" .
        "\\p{N}|" .
        "[-._:']|" .
        "\\x{2019}|" .
        "[\\x{4e00}-\\x{9fa5}]|" .
        "\\x{3007}|" .
        "[\\x{3021}-\\x{3029}]";
    my $word = "(($alnum)(($wordChars)*($alnum))?)";
    my $regex = "($stag)|($etag)|($empty)|$word";

full regex:

    (<[^/]([^>]*[^/>])?>)|(</[^>]*>)|(<[^>]*/>)|((\p{L}|\p{N}|[\x{4e00}-\x{9fa5}]|\x{3007}|[\x{3021}-\x{3029}])((\p{L}|\p{N}|[-._:']|\x{2019}|[\x{4e00}-\x{9fa5}]|\x{3007}|[\x{3021}-\x{3029}])*(\p{L}|\p{N}|[\x{4e00}-\x{9fa5}]|\x{3007}|[\x{3021}-\x{3029}]))?)

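
To make the intent of the pattern above concrete, here is a minimal sketch of how a counter might apply it. This is not the original test program: the count_words helper and the sample string are invented for illustration. The only thing taken from the message is the regex itself, where capture group 5 is the outer group of the word alternative.

```perl
use strict;
use warnings;

# Building blocks, copied from the message above.
my $stag  = "<[^/]([^>]*[^/>])?>";
my $etag  = "</[^>]*>";
my $empty = "<[^>]*/>";
my $alnum =
    "\\p{L}|" . "\\p{N}|" . "[\\x{4e00}-\\x{9fa5}]|" .
    "\\x{3007}|" . "[\\x{3021}-\\x{3029}]";
my $wordChars =
    "\\p{L}|" . "\\p{N}|" . "[-._:']|" . "\\x{2019}|" .
    "[\\x{4e00}-\\x{9fa5}]|" . "\\x{3007}|" . "[\\x{3021}-\\x{3029}]";
my $word  = "(($alnum)(($wordChars)*($alnum))?)";
my $regex = qr{($stag)|($etag)|($empty)|$word};

# Hypothetical helper: scan the text left to right and count only the
# matches that came from the word alternative (capture group 5).
sub count_words {
    my ($text) = @_;
    my $count = 0;
    while ($text =~ /$regex/g) {
        $count++ if defined $5;   # groups 1, 3, 4 are tags, not words
    }
    return $count;
}

# Made-up sample: two start tags, two end tags, one empty element,
# and a word containing U+2019 (right single quotation mark).
my $sample = "<doc><p>Hello, world! Don\x{2019}t panic.</p><hr/></doc>";
print count_words($sample), "\n";   # prints 4
```

Tags are consumed whole by the first three alternatives, so element and attribute names never reach the word branch. A word has to start and end with a character from $alnum, which is why the trailing "." after "panic" is not part of the match while the apostrophe inside "Don’t" is.
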
I have a very specific idea of what I mean by "word". \w is nice but
it's not what I mean.

As far as I can tell, \p{L} and so on don't work, so I can't do this in
Oniguruma. Error message: "ERROR: invalid character property name {L}".

So is a bit more work required to support Unicode? (Supporting the
properties from Chapter 4 of the Unicode standard is very important.)
Or am I misreading the documentation? I did it in C because simple.c
was there; would it make a difference if I did it from Ruby 1.9?

 -Tim