2012/11/9 kennygrant (Kenny Grant) <kennygrant / gmail.com>:
> Thanks for the comments on this issue. I'm not clear on what the UTF8-MAC encoding represents, are there docs on this Ruby behaviour and the problems involved somewhere?

See the several lines at the end of enc/utf_8.c.

> It may return a filename marked UTF-8 which is NFD, or NFC, depending on the glob pattern you call it with (see writer.rb attachment to this issue). That's a small issue though and just indicates a wider complex problem.

The two puts calls in writer.rb output the same result.
What do you mean?

>> One issue is that people may write decomposed filenames. An imaginary use case is a program which makes a filename from the name of a track output by iTunes. iTunes manages text as UTF8-MAC, so people will be confused.
>
> OK, so in this case someone is unwittingly using a mix of UTF-8 NFC (any strings they create in ruby with legible accents) and UTF-8 NFD (any strings they get from itunes say) in their script, which could lead to issues even before writing file names. If they get NFD from itunes, then try to match on a track name with a regexp, it won't work unless they convert to NFC or explicitly create an NFD string will it?

It will work unless the regexp depends heavily on composed characters.
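
For illustration, here is a minimal sketch of that point. The track title and the helper names are hypothetical, made up for this example; the only assumption is a UTF-8 source file.

```ruby
# Hypothetical track titles: the same visible text in two forms.
NFD_TITLE = "De\u0301tente"  # "e" followed by U+0301 COMBINING ACUTE ACCENT
NFC_TITLE = "D\u00E9tente"   # precomposed U+00E9

# This pattern only touches characters that are identical in both forms,
# so it does not care how the accent is stored.
def matches_plain_part?(title)
  !!(title =~ /tente\z/)
end

# This pattern contains the precomposed U+00E9, so it depends on composition.
def matches_composed_e?(title)
  !!(title =~ /D\u00E9tente/)
end

puts matches_plain_part?(NFD_TITLE)
puts matches_composed_e?(NFD_TITLE)
```

The first pattern matches both titles; the second matches only the composed one.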

> One thing I don't understand though, is that you say there are both in normal use - in use of Ruby ignoring file systems, if you create a string or regexp, NFC is the default isn't it?

No, NFC is not the default.
The fact is that many IMEs output composed characters.
Once a decomposed character is mixed into a string, the character stays as it is.
It won't be normalized.
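
A small sketch of what "stays as it is" means in practice. The constant and helper names are hypothetical, chosen only for this example.

```ruby
# Two spellings of the same visible character, written with escapes so the
# code points are unambiguous.
COMPOSED   = "\u00E9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
DECOMPOSED = "e\u0301"  # "e" + U+0301 COMBINING ACUTE ACCENT

# Hypothetical helper: builds a string containing both spellings.
def mixed_string
  "caf" + DECOMPOSED + " / caf" + COMPOSED
end

# String#== compares what is stored; nothing is normalized first.
def equal_as_stored?(a, b)
  a == b
end

puts equal_as_stored?(COMPOSED, DECOMPOSED)
p mixed_string.codepoints
```

Both U+0301 and U+00E9 are still present in the concatenated string, and the two spellings do not compare equal.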

> So Ruby has chosen one default for UTF-8 strings created in Ruby (as it must), but has to interact with lots of systems which might or might not be using NFC. At present we seem to have a de-facto default normalization of NFC, but nothing is translated to it when it comes from the OS. That might be a very hard problem, but in principle it would be nice to have one normalization blessed as the default so that all strings in a given encoding are comparable. The results of leaving them as they are supplied are really unexpected, and people using Ruby are not going to want to manually convert every string they touch from outside Ruby to NFC in case it was touched by HFS or created as NFD.

Ruby doesn't normalize characters.
It treats them as they are.
Windows, Linux, and other file systems also don't normalize.

Moreover, NFC/NFD normalization loses information.
If a filename consists of decomposed characters on Windows or Linux,
converting the filename to NFC loses that fact.
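
A sketch of the information loss, assuming a Ruby that provides String#unicode_normalize (Ruby 2.2 or later, so not available when this thread was written). The filenames and helper names are hypothetical.

```ruby
# Assumes Ruby 2.2 or later for String#unicode_normalize.

# Hypothetical helper: the NFC form of a filename.
def nfc_name(name)
  name.unicode_normalize(:nfc)
end

# Two names that a non-normalizing file system can hold side by side.
def example_names
  ["\u00E9.txt", "e\u0301.txt"]
end

# Number of distinct names that survive normalization.
def distinct_after_nfc(names)
  names.map { |n| nfc_name(n) }.uniq.size
end

puts example_names.uniq.size
puts distinct_after_nfc(example_names)
```

Once both names are normalized they are identical, so there is no way to tell afterwards which file on disk a given name referred to.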

>> At first, Ruby 1.9.0 marked strings derived from filenames as UTF8-MAC.
>> But some people reported that if filenames are UTF8-MAC, it is hard to
>> compare them with normal UTF-8 strings.
>
> This is interesting as it's exactly the behaviour I expected (if it's not possible to cleanly translate to NFC) - if strings are coming through as UTF-8 NFD, I'd expect them to be marked as such somehow (for example by being marked as encoding UTF8-MAC) - is there any indication?

A not-so-simple point is that a UTF8-MAC string is also valid as UTF-8.
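
A minimal sketch of this, using Ruby's Encoding::UTF8_MAC constant. The helper names are hypothetical, introduced only for this example.

```ruby
# Hypothetical helper: reinterpret the same bytes as UTF-8 and check validity.
def valid_as_utf8?(str)
  str.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

# Hypothetical helper: a decomposed string tagged as UTF8-MAC.
def utf8_mac_sample
  "e\u0301".dup.force_encoding(Encoding::UTF8_MAC)
end

# The bytes are the same whichever of the two encoding tags is attached.
def same_bytes?(a, b)
  a.bytes == b.bytes
end

puts valid_as_utf8?(utf8_mac_sample)
puts same_bytes?(utf8_mac_sample, "e\u0301")
```

Because only the tag differs and not the bytes, the bytes alone cannot tell you whether a string came from HFS+ or was just a UTF-8 string that happens to contain decomposed characters.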

> Then at least it is clear that they are not comparable or compatible with the NFC ruby strings I get when creating a string s = "détente".

Even if that string happens to be composed, there is no guarantee
that a string is always composed.

>> If the translation from UTF8-MAC -> UTF-8 is entirely non-lossy and would do no harm to other UTF-8 strings
>> Yes, as long as every part of the string being converted is truly UTF8-MAC.
>
> I assumed from others' comments that UTF8-MAC was purely a sub-encoding used to indicate the use of decomposed strings, but would appreciate some more detail (if anyone has a link) on what exactly it involves, and if translation from UTF8-MAC to UTF8 can lose information that implies other differences. If the only difference is the decomposition (patterns which do not occur in NFC), I'd expect re-encoding to be idempotent and not affect NFC strings and thus harmless to apply to NFC strings or strings containing a mix. Re the file-system example, I had assumed that if you ask HFS to write to a file on a mounted file system HFS would normalize all names to NFD (as it does for any HFS files), but perhaps that is incorrect.

A UTF-8 string is not always in NFC.

> I suppose the above boils down to this question:
>
> Is there a correct way to handle this situation, and never fail when comparing a default Ruby string (NFC) against a file from any file system which may be NFD?

There is no way.
And again, a Ruby string is not always NFC.

-- 
NARUSE, Yui  <naruse / airemix.jp>