Issue #10084 has been updated by Martin Dürst.

Assignee changed from Nobuyoshi Nakada to Yukihiro Matsumoto

This feature is going to add one or more methods to class String (String#unicode_normalize and probably String#unicode_normalize! and String#unicode_normalized?).
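
As a rough illustration of what these three methods are meant to do, here is a minimal usage sketch. It assumes the method names listed above, assumes that they take a form symbol such as :nfc in the way the original proposal below does for 'normalize', and only runs on a Ruby that actually provides these methods.

```ruby
# Usage sketch only; assumes String#unicode_normalize, #unicode_normalize!
# and #unicode_normalized? exist and accept a form symbol (an assumption
# carried over from the 'normalize :nfc' syntax in the original proposal).

composed   = "\u00E9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"  # "e" followed by U+0301 COMBINING ACUTE ACCENT

# The two strings look identical but differ code point by code point.
equal_before = (composed == decomposed)

# After normalizing both to NFC they compare equal.
equal_after = (composed.unicode_normalize(:nfc) ==
               decomposed.unicode_normalize(:nfc))

# The predicate reports whether a string is already in the given form.
already_nfc = decomposed.unicode_normalized?(:nfc)

# The bang variant normalizes the receiver in place.
in_place = decomposed.dup
in_place.unicode_normalize!(:nfc)
```

The non-bang method returns a new string and leaves the receiver alone, which is why the in-place variant is applied to a copy here.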

The implementation also internally needs various additional methods and constants that an end user should not ever want or need to use. Just adding them to class String is the easiest solution, but this may confuse a user when e.g. calling String.instance_methods(false). In the current implementation, these methods and constants are put in module Normalize (see https://github.com/duerst/eprun/blob/master/lib/normalize.rb).

In order to proceed with the implementation of this feature, I'd like to get advice from Matz (and others) on the best alternative. I have thought through the following alternatives:

1) Use a standalone module (probably better to change the name, e.g. to UnicodeNormalizeImplementation or so)

2) Use a module inside String, e.g. String::Normalize. Advantage: the module name can be shorter, because it is local to String.

3) Use an anonymous module. This is possible, but it requires that all the related code and data are in the same physical file, which restricts potential memory optimizations. Also, it requires that all the code be rewritten, e.g. replacing 'def' with 'define_method', which will look rather clumsy.

4) Just add the necessary methods and constants to String, but use longer, more explicit names. This should be slightly faster, because currently many of the methods take an explicit string parameter, which would then simply be the receiver. We can also make the methods private to reduce user temptation.

5) Use a refinement. The advantages are that this can be distributed over more than one file, and that we can directly call the methods on Strings (see 4). The disadvantage is that we still need a public module as the refinement container.
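
To make the differences a bit more concrete, the following sketch shows alternatives 2), 4) and 5) side by side for a single toy helper. All module and method names in it are hypothetical, and the helper (a check for combining marks) only stands in for the real internal routines; it is not taken from the current implementation. Alternative 1) would look like 2) with the module at top level, and 3) is left out.

```ruby
# Sketch only: every name below is hypothetical, and the helper is a toy
# stand-in for the real internal normalization routines.

# Alternative 2: helpers live in a module nested inside String and take the
# string as an explicit parameter.
class String
  module NormalizeSketch
    def self.combining_marks?(string)
      !(string =~ /\p{Mn}/).nil?
    end
  end
end

# Alternative 4: the helper is added directly to String under a long,
# explicit name and made private; the string is simply the receiver.
class String
  def unicode_normalize_sketch_combining_marks?
    !(self =~ /\p{Mn}/).nil?
  end
  private :unicode_normalize_sketch_combining_marks?
end

# Alternative 5: the helper is defined in a refinement. A public module is
# still needed as the container for the refinement.
module UnicodeNormalizeRefinementSketch
  refine String do
    def refined_combining_marks?
      !(self =~ /\p{Mn}/).nil?
    end
  end
end

# The refined method is only visible where the refinement is activated.
module RefinementUserSketch
  using UnicodeNormalizeRefinementSketch

  def self.check(string)
    string.refined_combining_marks?
  end
end
```

With 4), the private helper does not appear in String.instance_methods(false), which lists only public and protected methods, but it is still reachable via send. With 5), the helper is not callable on Strings outside code that activates the refinement.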

I personally would like to avoid 3) if at all possible. I don't have a strong preference among the other solutions. There may also be other solutions that I haven't thought about yet.

I would like to get Matz's preference(s) as soon as possible to proceed with the implementation. Any advice from others, e.g. with respect to similar cases, performance tradeoffs, other ideas, and so on, is also greatly appreciated.


----------------------------------------
Feature #10084: Add Unicode String Normalization to String class
https://bugs.ruby-lang.org/issues/10084#change-49242

* Author: Martin Dürst
* Status: Open
* Priority: Normal
* Assignee: Yukihiro Matsumoto
* Category: 
* Target version: Ruby 2.2.0
----------------------------------------
Unicode string normalization is a frequent operation when comparing strings or bringing them into a canonical form.

This should be available directly on the String class.

The proposed syntax is:

   'string'.normalize       # normalize 'string' according to NFC (most frequent on the Web)
   'string'.normalize :nfc  # normalize 'string' according to NFC; :nfd, :nfkc, :nfkd also usable
   'string'.nfc             # shorter variant, but maybe too many methods

There are several "unofficial" but convenient normalization variants that could be offered, e.g.:
                           
   'string'.normalize :mac  # use Macintosh file system normalization variant

Implementations are already available in pure Ruby (easy for other Ruby implementations; e.g. eprun: https://github.com/duerst/eprun) and in C (unf, http://bibwild.wordpress.com/2013/11/19/benchmarking-ruby-unicode-normalization-alternatives/).

---Files--------------------------------
Normalization.pdf (576 KB)


-- 
https://bugs.ruby-lang.org/