2012/11/9 kennygrant (Kenny Grant) <kennygrant / gmail.com>:
> Thanks for the comments on this issue. I'm not clear on what the
> UTF8-MAC encoding represents, are there docs on this Ruby behaviour
> and the problems involved somewhere?

See the several lines at the end of enc/utf_8.c.

> It may return a filename marked UTF-8 which is NFD, or NFC, depending
> on the glob pattern you call it with (see writer.rb attachment to this
> issue). That's a small issue though and just indicates a wider complex
> problem.

The two puts calls in writer.rb output the same result.
What do you mean?

>> An issue is people may write decomposed filename. A imaginary use
>> case is a program which make a filename from the name of a music
>> output from iTunes. iTunes manages texts with UTF8-MAC. So the people
>> will confuse.
>
> OK, so in this case someone is unwittingly using a mix of UTF-8 NFC
> (any strings they create in ruby with legible accents) and UTF-8 NFD
> (any strings they get from itunes say) in their script, which could
> lead to issues even before writing file names. If they get NFD from
> itunes, then try to match on a track name with a regexp, it won't work
> unless they convert to NFC or explicitly create an NFD string will it?

It will work unless the regexp depends heavily on composed strings.

> One thing I don't understand though, is that you say there are both in
> normal use - in use of Ruby ignoring file systems, if you create a
> string or regexp, NFC is the default isn't it?

No, NFC is not the default.
The fact is that many IMEs output composed characters.
Once a decomposed character is mixed into a string, it stays as it is.
It won't be normalized.

> So Ruby has chosen one default for UTF-8 strings created in Ruby (as
> it must), but has to interact with lots of systems which might or
> might not be using NFC. At present we seem to have a de-facto default
> normalization of NFC, but nothing is translated to it when it comes
> from the OS. That might be a very hard problem, but in principle it
> would be nice to have one normalization blessed as the default so that
> all strings in a given encoding are comparable. The results of leaving
> them as they are supplied are really unexpected, and people using Ruby
> are not going to want to manually convert every string they touch from
> outside Ruby to NFC in case it was touched by HFS or created as NFD.

Ruby doesn't normalize characters. It treats them as they are.
Windows, Linux, and other file systems also don't normalize.
Moreover, normalizing to NFC or NFD loses information.
If a filename on Windows or Linux consists of decomposed characters,
converting the filename to NFC loses that.

>> First Ruby 1.9.0 set strings derived from filenames UTF8-MAC.
>> But some reported that if filenames is UTF8-MAC, it is hard to compare
>> with normal UTF-8 strings.
>
> This is interesting as it's exactly the behaviour I expected (if it's
> not possible to cleanly translate to NFC) - if strings are coming
> through as UTF-8 NFD, I'd expect them to be marked as such somehow (for
> example by being marked as encoding UTF8-MAC) - is there any
> indication?

A not-so-simple point is that a UTF8-MAC string is also valid as UTF-8.

> Then at least it is clear that they are not comparable or compatible
> with the NFC ruby strings I get when creating a string s = "détente".

Even if that string happens to be composed, there is no guarantee that
a string is always composed.

>> If the translation from UTF8-MAC -> UTF-8 is entirely non-lossy and
>> would do no harm to other UTF-8 strings
>> Yes until all part of the converting string is truly UTF8-MAC.
>
> I assumed from others' comments that UTF8-MAC was purely a
> sub-encoding used to indicate the use of decomposed strings, but would
> appreciate some more detail (if anyone has a link) on what exactly it
> involves, and if translation from UTF8-MAC to UTF8 can lose
> information that implies other differences. If the only difference is
> the decomposition (patterns which do not occur in NFC), I'd expect
> re-encoding to be idempotent and not affect NFC strings and thus
> harmless to apply to NFC strings or strings containing a mix. Re the
> file-system example, I had assumed that if you ask HFS to write to a
> file on a mounted file system HFS would normalize all names to NFD (as
> it does for any HFS files), but perhaps that is incorrect.

A UTF-8 string is not always in NFC.

> I suppose the above boils down to this question:
>
> Is there a correct way to handle this situation, and never fail when
> comparing a default Ruby string (NFC) against a file from any file
> system which may be NFD?

No way.
And again, Ruby strings are not NFC.

-- 
NARUSE, Yui <naruse / airemix.jp>
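
For illustration, here is a minimal Ruby sketch of the points above. It assumes Ruby 2.2 or later for String#unicode_normalize, which postdates this thread, and it uses "détente" from the quoted message as sample data; the variable names are only for this example. It shows that composed and decomposed forms are both valid UTF-8 but compare as different strings, that a regexp matches either form unless it is written with the composed character, and that a comparison only succeeds after one side is explicitly converted.

```ruby
# Sketch only; String#unicode_normalize needs Ruby 2.2+ (an assumption,
# it did not exist when this thread was written).
nfc = "d\u00E9tente"    # precomposed e-acute (U+00E9)
nfd = "de\u0301tente"   # "e" followed by combining acute accent (U+0301)

# Both strings are valid UTF-8 and Ruby keeps each exactly as given,
# so they compare as different strings.
both_valid = nfc.valid_encoding? && nfd.valid_encoding?
equal_as_is = (nfc == nfd)

# A regexp that does not depend on the accented character matches both.
plain_match = [nfc, nfd].all? { |s| s.match?(/tente\z/) }

# A regexp written with the composed character matches only the composed form.
composed_match = [nfc, nfd].map { |s| s.match?(/d\u00E9/) }

# Comparing after an explicit normalization. The decomposed original is
# discarded in the process, which is the information loss mentioned above.
equal_after_nfc = (nfc == nfd.unicode_normalize(:nfc))

# Transcoding from UTF8-MAC to UTF-8 composes the decomposed sequence.
# This is only meaningful if the input really is UTF8-MAC.
from_mac = nfd.encode("UTF-8", "UTF8-MAC")

puts both_valid, equal_as_is, plain_match, composed_match.inspect,
     equal_after_nfc, (from_mac == nfc)
```

Neither conversion tells you which form a string arriving from outside Ruby was in; the caller has to decide where, and whether, to normalize.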