FWIW, OniGuruma is the regex engine used by SubEthaEdit, via the OgreKit [OniGuruma RegEx Kit for Cocoa]. I am not too sure about its being faster than PCRE: tests in SEE and BBEdit don't show anything conclusive. One of my pet peeves with OgreKit/SEE is that it treats the full text as one line by default, making ^...$ useless [and in the version of SEE I use ^ doesn't work...].
OTOH, the good thing about OgreKit/SEE is that \w+ on ܸdodo will catch the whole yahzoo, whereas in PCRE/BBEdit only dodo will get caught. Then again, \p{L} works in PCRE, which helps refine what one wants to call a word, as Mr. Bray showed.
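For comparison, here is a small Ruby sketch of the same \w versus \p{L} difference. It rests on two assumptions: that Ruby 1.9 or later is used (where, as far as I know, \w is ASCII-only on UTF-8 strings while \p{L} is Unicode-aware), and that "édodo" is a fair stand-in for the test word above. The method names are made up for illustration.

```ruby
# Sketch only: compares \w+ with \p{L}+ on a word that starts with a
# non-ASCII letter. "\u00e9dodo" ("édodo") is an assumed stand-in word.

# First run of \w characters (ASCII-only in Ruby 1.9 and later).
def ascii_word_run(text)
  text[/\w+/]
end

# First run of Unicode letters, using the \p{L} property.
def letter_run(text)
  text[/\p{L}+/]
end

sample = "\u00e9dodo"
puts ascii_word_run(sample)  # the leading non-ASCII letter is skipped
puts letter_run(sample)      # the whole word is matched
```

So in Ruby the \p{L} route is the one that catches the whole word, which is closer to the PCRE behaviour than to what OgreKit/SEE does with \w.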
-- Didier
On 6/30/06, Tim Bray <tbray / textuality.com> wrote:
> On Jun 28, 2006, at 8:20 PM, Yukihiro Matsumoto wrote:
>
> > Have you ever heard of regular expression engine (one of the hardest
> > parts to implement in text processing) that handles more than 30
> > different encodings without conversion, _and_ runs faster than PCRE?
> ...
> > If you haven't, I tell you that it is named Oniguruma, regular
> > expression engine comes with Ruby 1.9.
>
> I'd heard of it but I hadn't tried it until now.  Previously I have
> done quantitative measurement of the performance of Perl vs. Java
> regex engines (conclusion: Java is faster but perl is safer, see
> http://www.tbray.org/ongoing/When/200x/2005/11/20/Regex-Promises).
>
> I thought I would compare Oniguruma, so I downloaded it and compiled
> it and ran some tests and looked at the documentation.
> (http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt and
> http://www.geocities.jp/kosako3/oniguruma/doc/API.txt, or is there
> something better?)
>
> Oniguruma is very clever; support for multiple different regex
> syntaxes?  Wow.
>
> The documentation needs a little work, the example files such as
> simple.c do not correspond very well (e.g. ONIG_OPTION_DEFAULT).
>
> But I think I must be missing something, because I can't run my
> test.  It's a fast approximate word counter for large volumes of
> XML.  Here is how the regular expression is built in Perl:
>
> my $stag = "<[^/]([^>]*[^/>])?>";
> my $etag = "</[^>]*>";
> my $empty = "<[^>]*/>";
>
> my $alnum =
>      "\\p{L}|" .
>      "\\p{N}|" .
>      "[\\x{4e00}-\\x{9fa5}]|" .
>      "\\x{3007}|" .
>      "[\\x{3021}-\\x{3029}]";
> my $wordChars =
>      "\\p{L}|" .
>      "\\p{N}|" .
>      "[-._:']|" .
>      "\\x{2019}|" .
>      "[\\x{4e00}-\\x{9fa5}]|" .
>      "\\x{3007}|" .
>      "[\\x{3021}-\\x{3029}]";
> my $word = "(($alnum)(($wordChars)*($alnum))?)";
>
> my $regex = "($stag)|($etag)|($empty)|$word";
>
> full regex: (<[^/]([^>]*[^/>])?>)|(</[^>]*>)|(<[^>]*/>)|((\p{L}|\p{N}|[\x{4e00}-\x{9fa5}]|\x{3007}|[\x{3021}-\x{3029}])((\p{L}|\p{N}|[-._:']|\x{2019}|[\x{4e00}-\x{9fa5}]|\x{3007}|[\x{3021}-\x{3029}])*(\p{L}|\p{N}|[\x{4e00}-\x{9fa5}]|\x{3007}|[\x{3021}-\x{3029}]))?)
>
> I have a very specific idea of what I mean by "word".  \w is nice but
> it's not what I mean.
>
> As far as I can tell, \p{L} and so on don't work, so I can't do this
> in Oniguruma.  Error message: "ERROR: invalid character property name
> {L}".  So a bit more work is required to support Unicode? (Supporting
> the properties from Chapter 4 is very important.)  Or am I
> misreading the documentation?  I did it in C because simple.c was there,
> would it make a difference if I did it from Ruby 1.9?
>
>    -Tim
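On the Ruby question, here is a rough sketch of how the Perl pattern above might be transliterated to Ruby regexp syntax. It assumes a Ruby whose regex engine accepts \p{L} and \p{N} on UTF-8 strings (true of current releases as far as I know; I have not checked the 2006 snapshots), and it writes the code points as \u{...} instead of Perl's \x{...}. The constant names and count_words are my own, and the capturing groups are replaced with non-capturing ones so that only "tag" and "word" are reported.

```ruby
# Sketch only: a Ruby transliteration of the Perl word-counter pattern.
# All names here (ALNUM, WORD_CHARS, TOKEN, count_words) are hypothetical.

# Start tag, end tag, or empty-element tag, in that order.
TAG = %r{<[^/](?:[^>]*[^/>])?>|</[^>]*>|<[^>]*/>}

# Characters that may begin or end a word.
ALNUM = /\p{L}|\p{N}|[\u{4e00}-\u{9fa5}]|\u{3007}|[\u{3021}-\u{3029}]/

# Characters that may appear inside a word.
WORD_CHARS = /\p{L}|\p{N}|[-._:']|\u{2019}|[\u{4e00}-\u{9fa5}]|\u{3007}|[\u{3021}-\u{3029}]/

# A word starts and ends with ALNUM; anything in WORD_CHARS may sit between.
WORD = /(?:#{ALNUM})(?:(?:#{WORD_CHARS})*(?:#{ALNUM}))?/

# Tags are tried first so that text inside markup is never counted.
TOKEN = /(?<tag>#{TAG})|(?<word>#{WORD})/

# Approximate word count: scan for tokens and count the non-tag ones.
def count_words(xml)
  count = 0
  xml.scan(TOKEN) { count += 1 if Regexp.last_match[:word] }
  count
end

puts count_words("<p>Hello, Tim's <b>na\u00efve</b> test-case 42</p>")  # => 5
```

The example counts "Hello", "Tim's", "naïve", "test-case" and "42" as words and skips the four tags, which is the behaviour the Perl pattern describes.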