>> I have a trivial script that parses UnicodeData.txt and looks for
>> properties unrecognized by Oniguruma. If there's an automated process
>> to update unicode.c, I'll provide the raw data. If, as I suspect, we
>> need to update unicode.c by hand, adding in each codepoint
>> individually, I guess I volunteer...
>
> If that's truly the case, wouldn't you rather volunteer to make it so
> that unicode.c doesn't need to be updated manually?

I would, but I doubt I'm smart enough. ;-) I looked around a while
back at how various languages implemented access to Unicode character
metadata, and the results were universally ugly. I saw one, an older
version of the Python approach, IIRC, which consisted of a switch
statement in excess of a thousand lines... It's a classic case of
trading clarity for performance. UnicodeData.txt is 1.1MB, yet the
general use case requires lightning-fast lookup. Every approach I've
seen came to the same conclusion: the best they could do was to
automatically generate array or bit vector literals for the entire
database.
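
To make the "generated array literal" idea concrete, here is a minimal
sketch, in Python rather than C purely for brevity. The table excerpt and
the names DIGIT_RANGES and in_ranges are hypothetical and are not taken
from unicode.c. It assumes a property is stored as sorted, non-overlapping
inclusive ranges and tests membership with a binary search, which is one
common shape for such tables.

```python
from bisect import bisect_right

# Hypothetical, hand-written excerpt: sorted, non-overlapping inclusive
# ranges of codepoints carrying some property (here, a few decimal digits).
DIGIT_RANGES = [
    (0x0030, 0x0039),  # ASCII digits
    (0x0660, 0x0669),  # Arabic-Indic digits
    (0x06F0, 0x06F9),  # Extended Arabic-Indic digits
]

# Precompute the range starts once so each lookup is a single bisect.
_STARTS = [first for first, _ in DIGIT_RANGES]


def in_ranges(codepoint, ranges=DIGIT_RANGES, starts=_STARTS):
    """Return True if codepoint falls inside one of the sorted ranges."""
    # Index of the last range whose start is <= codepoint, or -1 if none.
    i = bisect_right(starts, codepoint) - 1
    return i >= 0 and codepoint <= ranges[i][1]
```

Lookup cost grows with the logarithm of the number of ranges, so the
table can be regenerated wholesale without touching the lookup code.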

Likewise, all I can suggest is that we move the codepoint lists to a
separate file, so we can at least generate that automatically. Again,
though, I don't feel qualified to undertake that by myself. It would
require benchmarking the internals of a regexp library and
understanding the source in sufficient detail to reason about
optimizations...

I haven't looked at the code in a while, but from what I recall each
property is represented by a constant array whose values are the
codepoints in hex, ordered by ordinal value. I suspect I'll start by
writing a test suite in the Mini Unit style that simply iterates over
a local copy of UnicodeData.txt and checks that each character's
property is matched by Oniguruma. I'll note the failures (i.e. the new
characters that aren't yet enumerated in unicode.c). Then, for each
property I'll generate a list, adhering to the current format in
unicode.c, of _all_ codepoints with that property, that I can paste in
over the top of the current list. I can then re-run the test suite to
check I haven't made any mistakes. If that works, it shouldn't be too
difficult.
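
A rough sketch of that workflow, again in Python for illustration only.
It relies on the documented UnicodeData.txt layout (semicolon-separated
fields, codepoint in field 0, name in field 1, general category in
field 2, with large blocks given as "<..., First>" / "<..., Last>" line
pairs). The function names, the array name passed in, and the C layout
emitted are all assumptions; the real format in unicode.c would need to
be checked and matched.

```python
from collections import defaultdict


def codepoints_by_category(lines):
    """Group codepoints by general category (field 2 of UnicodeData.txt)."""
    groups = defaultdict(list)
    range_start = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        cp, name, category = int(fields[0], 16), fields[1], fields[2]
        if name.endswith(", First>"):
            range_start = cp  # the matching "Last" line closes this range
            continue
        if name.endswith(", Last>") and range_start is not None:
            groups[category].extend(range(range_start, cp + 1))
            range_start = None
            continue
        groups[category].append(cp)
    return groups


def missing(expected, known):
    """Codepoints present in the fresh data but absent from the old list."""
    return sorted(set(expected) - set(known))


def as_c_literal(name, codepoints, per_line=8):
    """Format codepoints as a C array literal (layout is hypothetical)."""
    cps = sorted(codepoints)
    rows = [", ".join("0x%04x" % cp for cp in cps[i:i + per_line])
            for i in range(0, len(cps), per_line)]
    body = ",\n".join("  " + row for row in rows)
    return "static const unsigned int %s[] = {\n%s\n};" % (name, body)
```

Calling codepoints_by_category on the open file, passing one category's
list and the currently enumerated codepoints to missing, and printing
as_c_literal for that category would give the failure report and the
paste-ready list in one pass.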