Issue #10084 has been updated by Martin Dürst.


Nobuyoshi Nakada wrote:
> What will happen for a non-unicode string, raising an exception?

This is a very good question. I'm okay with whatever Matz and the community think is best.

There are many potential approaches. In general, these will be:
1) Make the operation a no-op.
2) Convert to UTF-8, normalize, then convert back.
3) Implement normalization directly in the encoding.
4) Raise an exception.
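
Approaches 1), 2), and 4) can be sketched in a few lines of Ruby. This is only an illustration under assumptions: `normalize_with_policy` and its `policy:` values are hypothetical names, and `String#unicode_normalize` (available in Ruby 2.2 and later) stands in for the `normalize` method proposed in this issue. Approach 3) is left out because it needs per-encoding data.

```ruby
# Hypothetical helper: choose what to do with a string that is not UTF-8.
# String#unicode_normalize (Ruby 2.2+) stands in for the proposed #normalize.
def normalize_with_policy(str, form: :nfc, policy: :convert)
  return str.unicode_normalize(form) if str.encoding == Encoding::UTF_8

  case policy
  when :noop
    # Approach 1: leave the string unchanged.
    str.dup
  when :convert
    # Approach 2: convert to UTF-8, normalize, convert back.
    # The final encode raises if the normalized text contains characters
    # that the original encoding cannot represent.
    str.encode(Encoding::UTF_8).unicode_normalize(form).encode(str.encoding)
  when :raise
    # Approach 4: refuse to normalize.
    raise Encoding::CompatibilityError,
          "normalization not supported for #{str.encoding}"
  else
    raise ArgumentError, "unknown policy: #{policy.inspect}"
  end
end

decomposed = "e\u0301".encode(Encoding::UTF_16LE) # "e" + combining acute accent
composed   = normalize_with_policy(decomposed, policy: :convert)
puts composed.encode(Encoding::UTF_8).length      # prints 1
```

With `:convert`, NFD or NFKD on a legacy encoding such as ISO-8859-1 can fail on the way back, because combining characters usually have no representation there.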

There is also the question of what a "non-unicode" or "unicode" string is.

UTF-8 is the preferred way to handle Unicode in Ruby, and is where normalization is really needed and will be used.

For the other encodings, unless we go with 1) or 4), the following considerations apply.

UTF8-Mac, UTF8-DoCoMo, UTF8-KDDI and UTF8-Softbank are essentially UTF-8 but with slightly different character conversions. For these encodings, the easiest thing to do is force_encoding to UTF-8, normalize, and force_encoding back. A C-level implementation may not actually need force_encoding, but a Ruby implementation does. There are some questions about what normalizing UTF8-Mac means, so that may have to be treated separately. The DoCoMo/KDDI/Softbank variants are mostly about emoji, which as far as I know are not affected by normalization.
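
For the UTF-8 variants, the force_encoding round trip might look like the following at the Ruby level. `normalize_utf8_variant` is a hypothetical name, and `String#unicode_normalize` (Ruby 2.2 and later) is used in place of the proposed `normalize`. UTF8-Mac is deliberately not covered.

```ruby
# Hypothetical helper for UTF8-DoCoMo, UTF8-KDDI and UTF8-SoftBank strings.
# The bytes are already UTF-8, so only the encoding label is switched.
def normalize_utf8_variant(str, form = :nfc)
  original = str.encoding
  # Work on a copy so the caller's string keeps its label.
  utf8 = str.dup.force_encoding(Encoding::UTF_8)
  # Normalize as UTF-8, then restore the original label without conversion.
  utf8.unicode_normalize(form).force_encoding(original)
end

docomo = "e\u0301".dup.force_encoding("UTF8-DoCoMo")
result = normalize_utf8_variant(docomo)
puts result.encoding
puts result.bytesize
```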

Then there are UTF-16LE/BE and UTF-32LE/BE. For these, it depends on the implementation. A Ruby-level implementation (unless very slow) may want to convert to UTF-8 and back. A C-level implementation may not need to do this.

Then there is also GB18030. Conversion to UTF-8 and back seems to be the best solution. Doing normalization directly in GB18030 will need too much data.

For other, truly non-unicode encodings, implementing normalization directly in the encoding would mean the following: analyze to what extent the normalization applies to the encoding in question, and apply that part. As an example, where .nfkc turns a compatibility character into '1' in UTF-8, it could do the same in Windows-31J. The analysis might take some time (but can be automated), and the data needed for each encoding would mostly be very small.
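
The automated analysis mentioned above could be sketched as follows for a single-byte encoding. It is an illustration under assumptions: `build_normalization_table` is a hypothetical name, `String#unicode_normalize` (Ruby 2.2 and later) supplies the UTF-8 normalization, and multibyte encodings such as Windows-31J would need a different way to enumerate their characters.

```ruby
# Hypothetical analysis step: collect the part of a normalization form that
# can be expressed inside a single-byte encoding.
def build_normalization_table(encoding, form = :nfkc)
  table = {}
  (0x00..0xFF).each do |byte|
    char = byte.chr.force_encoding(encoding)
    next unless char.valid_encoding?

    begin
      normalized = char.encode(Encoding::UTF_8)
                       .unicode_normalize(form)
                       .encode(encoding)
    rescue EncodingError
      # The normalized form has no representation in this encoding.
      next
    end
    # Keep only the characters that normalization actually changes.
    table[char] = normalized unless normalized == char
  end
  table
end

table = build_normalization_table(Encoding::ISO_8859_1)
superscript_one = 0xB9.chr.force_encoding(Encoding::ISO_8859_1)
puts table[superscript_one] # NFKC maps SUPERSCRIPT ONE to "1"
```

Characters whose normalized form cannot be expressed in the encoding are simply skipped here, which is one possible reading of applying only the relevant part.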


----------------------------------------
Feature #10084: Add Unicode String Normalization to String class
https://bugs.ruby-lang.org/issues/10084#change-48005

* Author: Martin Dürst
* Status: Open
* Priority: Normal
* Assignee: 
* Category: 
* Target version: 
----------------------------------------
Unicode string normalization is a frequent operation when comparing or normalizing strings.

This should be available directly on the String class.

The proposed syntax is:

   'string'.normalize       # normalize 'string' according to NFC (most frequent on the Web)
   'string'.normalize :nfc  # normalize 'string' according to NFC; :nfd, :nfkc, :nfkd also usable
   'string'.nfc             # shorter variant, but maybe too many methods

There are several "unofficial" but convenient normalization variants that could be offered, e.g.:
                           
   'string'.normalize :mac  # use Macintosh file system normalization variant
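
To try out the proposed syntax, one could reopen String as below. This is a sketch and not the implementation attached to this issue: it assumes Ruby 2.2 or later, where `String#unicode_normalize` exists, and simply delegates to it. The `:mac` variant is left out.

```ruby
# Sketch only: maps the proposed method names onto String#unicode_normalize.
class String
  # 'string'.normalize or 'string'.normalize :nfd
  def normalize(form = :nfc)
    unicode_normalize(form)
  end

  # 'string'.nfc, 'string'.nfd, 'string'.nfkc, 'string'.nfkd
  %i[nfc nfd nfkc nfkd].each do |form|
    define_method(form) { unicode_normalize(form) }
  end
end

puts "e\u0301".normalize.length # 1: "e" + combining acute composes under NFC
puts "\u00E9".nfd.length        # 2: decomposed again under NFD
```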

Implementations are already available in pure Ruby (easy for other Ruby implementations; e.g. eprun: https://github.com/duerst/eprun) and in C (unf; see http://bibwild.wordpress.com/2013/11/19/benchmarking-ruby-unicode-normalization-alternatives/).

---Files--------------------------------
Normalization.pdf (576 KB)


-- 
https://bugs.ruby-lang.org/