On 16/09/2008, Martin Duerst <duerst / it.aoyama.ac.jp> wrote:
> At 02:35 08/09/11, Tim Bray wrote:
> > On Sep 10, 2008, at 12:55 AM, Tanaka Akira wrote:
> >
> > > NFC (Normalization Form C) can be a solution for "ƥ".  But
> > > there are characters which don't have single codepoint (as
> > > some characters defined in JIS X 0213, for example).
> >
> > Unfortunately NFC isn't a solution because it isn't widely respected,
> > so a developer has to deal with nonstandard normalizations. :(
>
> I think what Akira meant here is that you should use some kind
> of normalization (e.g. NFC) as a preprocessing step, which
> would avoid the need to map various pre/de-composed forms
> to the same entry in your actual indexing code.
>
> I think that's true, but it won't deal with the fact that
> in text indexing, you often also want to link the index to
> a non-accented version, and so on, so you always one way or
> another end up having to look at each character/codepoint closely
> anyway.

As I understand it, Ruby currently allows iterating over bytes and over
"characters", i.e. short substrings each corresponding to a single
codepoint.
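For concreteness, here is what that existing iteration looks like on a
UTF-8 string, using 1.9's each_byte and each_char (the byte values are
simply the UTF-8 encoding of U+00E9):

```ruby
# "é" as the single precomposed codepoint U+00E9 (two bytes in UTF-8)
s = "\u00E9"

bytes = []
s.each_byte { |b| bytes << b }
# bytes == [0xC3, 0xA9]

chars = []
s.each_char { |c| chars << c }
# chars == ["é"], i.e. one single-codepoint "character"
```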

There are two requests here:

1)
to add something like

class String
  def each_codepoint
    each_char { |c| yield c.ord }
  end
end

which seems reasonable and can possibly be optimized in the C
implementation of the string class. It gives you access to another
representation of the string which may be more convenient for some
uses.
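For example, with such a definition in place (written here against 1.9's
each_char and String#ord, and named each_codepoint_compat only so it
cannot clash with a built-in of the same name):

```ruby
class String
  # Hypothetical name, to avoid shadowing any built-in each_codepoint.
  def each_codepoint_compat
    each_char { |c| yield c.ord }
  end
end

pts = []
# "e" followed by U+0301 COMBINING ACUTE ACCENT: two codepoints,
# even though it renders as one accented letter.
"e\u0301".each_codepoint_compat { |cp| pts << cp }
# pts == [0x65, 0x301]
```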

2)
To make each_char and other character-based functions return some kind
of "smart characters" that may span multiple codepoints. This makes
each_char incompatible with the above, and there is no general solution
covering all possible combinations of codepoints.

While a "smart character" might be well defined for some subset of the
Latin script, even with combining accents (though maybe not: consider
"dotless i" plus combining accents), I can imagine different people
having different ideas of what a "smart character" is for scripts like
Arabic, the various Indic scripts that combine characters, Korean, etc.
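To sketch why this is hard: here is a deliberately naive "smart
character" iterator (a hypothetical helper, not anything Ruby ships)
that only attaches the basic combining diacritics block, U+0300..U+036F,
to the preceding base character. Everything it ignores (Hangul jamo,
Indic conjuncts, other combining blocks) is exactly where different
people's definitions would start to diverge:

```ruby
COMBINING = 0x0300..0x036F  # only the basic combining diacritics block

# Naive "smart character" iterator: glue combining marks onto the
# preceding base character. Real segmentation (Unicode UAX #29
# grapheme clusters) handles many more cases than this.
def each_smart_char(str)
  cluster = ""
  str.each_char do |c|
    if COMBINING.cover?(c.ord) && !cluster.empty?
      cluster << c          # extend the current cluster
    else
      yield cluster unless cluster.empty?
      cluster = c.dup       # start a new cluster
    end
  end
  yield cluster unless cluster.empty?
end

out = []
each_smart_char("e\u0301a") { |g| out << g }
# out == ["e\u0301", "a"]: two "smart characters" from three codepoints
```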

People who want "smart characters" should write their own iterators or
preprocessing normalizing functions because only they know what their
characters should look like.

Sure, it might be useful to write libraries of such functions for
people with similar ideas of a "smart character", but there is no
universal solution that would satisfy the needs of every language and
every kind of text processing.

Thanks

Michal