Issue #7267 has been updated by kennygrant (Kenny Grant).


Thanks for the explanation. I didn't think Ruby was being evil :)

If the translation from UTF-8-MAC to UTF-8 is entirely non-lossy and would do no harm to other UTF-8 strings (these decomposed patterns would only occur with the intention of displaying an accent, wouldn't they?), and Ruby is set to use UTF-8 for all encodings, perhaps the right thing to do here would be to auto-translate UTF-8-MAC to UTF-8 when reading any file name assumed to be UTF-8 on Mac OS, since the OS default is decomposed but the default in Ruby is composed. I can't think of a situation where anyone would want UTF-8-MAC in a Ruby script as opposed to UTF-8; if they did, presumably they could explicitly convert to it, and if the file system was set to use composed anyway, this translation would not affect the name. The string is coming to me as UTF-8 from Dir.glob, so at that stage Ruby knows (or rather has assumed) it is UTF-8, though in fact it seems to be UTF-8-MAC. Presumably at that stage it could do a conversion to be sure it was canonical UTF-8 before returning the string, IF the encoding was set to UTF-8 already?
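
A minimal user-level sketch of the conversion suggested above, assuming it is only applied to names already tagged as UTF-8. The helper names composed_name and composed_glob are hypothetical and not part of Ruby; this is a workaround in script code, not a proposed patch to Dir.glob itself.

```ruby
# Hypothetical helper (not part of Ruby): return a composed UTF-8 copy of
# a file name, but only when the string is already tagged as UTF-8.
def composed_name(name)
  return name unless name.encoding == Encoding::UTF_8
  # Treat the bytes as UTF-8-MAC (decomposed) and transcode to UTF-8.
  name.encode('UTF-8', 'UTF-8-MAC')
end

# Hypothetical wrapper around Dir.glob that composes every result.
def composed_glob(pattern)
  Dir.glob(pattern).map { |name| composed_name(name) }
end

# "e" followed by U+0301 (combining acute accent) is the decomposed form.
decomposed = "Teste\u0301.txt"
puts composed_name(decomposed) == "Test\u00E9.txt"
```

Names in any other encoding are returned untouched, so scripts that deliberately work with a different encoding would not be affected by this sketch.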

Hopefully this would not affect any other users/file systems, but I'm afraid I don't know enough to make that judgement call and may well have overlooked something. I'd be happy to try submitting a patch but have not done so before and would likely hinder rather than help as this is a big complex issue with lots of potential side-effects, so this is really just a long-term suggestion from an end user's point of view. 

As you say, it would be nice to have a cleaner solution to this at some point, so I hope one can be found which would get rid of this potentially confusing behaviour for those using UTF-8 on Mac OS X, and not cause more work for those using other encodings or operating systems. 
----------------------------------------
Bug #7267: Dir.glob on Mac OS X returns unexpected string encodings for unicode file names
https://bugs.ruby-lang.org/issues/7267#change-32296

Author: kennygrant (Kenny Grant)
Status: Open
Priority: Normal
Assignee: 
Category: 
Target version: 2.0.0
ruby -v: ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-darwin11.4.0]


Tested on Ruby 1.9.3-p194 and ruby-2.0.0-preview1 on Mac OS X 10.7.5

When calling file system methods with Ruby on Mac OS X, it is not possible to manipulate the resulting file name as a normal UTF-8 string, even though it reports the encoding as UTF-8. It seems to be a UTF-8-MAC string, even when the default encoding is set to UTF-8. This leads to confusion as the string can be manipulated normally except for any unicode characters, which seem to be decomposed. So a regexp using utf-8 characters won't work on the string, unless it is first converted from UTF-8-MAC. I'd expect the string encoding to be UTF-8, or at least to report that it is not a normal UTF-8 string if it has to be UTF-8-MAC for some reason. 
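
The mismatch described above can be sketched without touching the file system, assuming the name returned on Mac OS X uses "e" followed by the combining acute accent U+0301. The strings below are illustrative stand-ins, not actual Dir.glob output.

```ruby
composed   = "Test\u00E9.txt"   # é as the single code point U+00E9
decomposed = "Teste\u0301.txt"  # "e" followed by combining accent U+0301

# Both are tagged UTF-8 and display the same, but they are not equal.
puts composed.encoding
puts decomposed.encoding
puts composed == decomposed

# The accented character is one code point in one form and two in the other.
puts composed[4..4].codepoints.to_a.inspect     # [233]
puts decomposed[4..5].codepoints.to_a.inspect   # [101, 769]

# A pattern written with the composed character misses the decomposed form.
pattern = /\u00E9/
puts composed.gsub(pattern, 'TEST')
puts decomposed.gsub(pattern, 'TEST')

# Transcoding from UTF-8-MAC yields the composed form, which then matches.
fixed = decomposed.encode('UTF-8', 'UTF-8-MAC')
puts fixed.gsub(pattern, 'TEST')
```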

Example, run with a file called Testé.txt in the same folder:

# encoding: utf-8

# Print the string, then the result of replacing é with TEST.
def transform_string(s)
  puts "Testing string #{s}"
  puts s.gsub(/é/, 'TEST')
end

Dir.glob("./*.txt").each do |f|
  puts "Inline string works as expected"
  s = "./Testé.txt"
  transform_string s

  puts "File name from Dir.glob does not"
  transform_string f

  puts "Encoded file name works as expected, though it is reported as UTF-8, not UTF-8-MAC"
  f.encode!('UTF-8', 'UTF-8-MAC')
  transform_string f
end


-- 
http://bugs.ruby-lang.org/