2009/12/28 Brian Candler <b.candler / pobox.com>

> Benoit Daloze wrote:
> > But then I come with things like:
> > /Users/benoitdaloze/Library/GlestGame/data/lang/espan>~<ol.lng
> >
> > (The ~ is separated from the n, so it is no longer ñ.) The Regexp is
> > acting as if they were 2 different characters. How to handle that
> > easily? I tried to change the script encoding to MacRoman, but it
> > produced a bad-encoding error about not matching UTF-8.
>
> I don't know what you mean. If Dir.[] tells you that the file name is
> <e> <s> <p> <a> <n> <~> <o> <l> <.> <l> <n> <g>, is that not the true
> filename?
>
> I suggest you try something like this:
>
>  puts "Source encoding: #{"".encoding}"
>  puts "External encoding: #{Encoding.default_external}"
>  Dir["*.lng"].each do |fn|
>    puts "Name: #{fn.inspect}"
>    puts "Encoding: #{fn.encoding}"
>    puts "Chars: #{fn.chars.to_a.inspect}"
>    puts "Codepoints: #{fn.codepoints.to_a.inspect}"
>    puts "Bytes: #{fn.bytes.to_a.inspect}"
>    puts
>  end
>
> then post the results for this file here. Then also post what you think
> the true filename is.
>
> Then you can see whether: (1) Dir.[] is returning the correct sequence
> of bytes for the filename or not; and (2) Dir.[] is tagging the string
> with the correct encoding or not.
>
> (This is one of the thousands of cases I did *not* document in
> string19.rb; I did some of the core methods on String, but of course
> every method in every class which either returns a string or accepts a
> string argument needs to document how it handles encodings)
>
> > as output of this script (which is then not able to rename any wrong
> > file, because tr! does not seem to work on name either):
> >
> > path = ARGV[0] || "/"
> >
> > ALLOWED_CHARS = "A-Za-z0-9 %#:$@?!=+~&|'()\\[\\]{}.,\r_-"
> >
> > Dir["#{File.expand_path(path)}/**/*"].each { |f|
> >     name = File.basename(f)
> >     unless name =~ /^[#{ALLOWED_CHARS}]+$/
> >         puts File.dirname(f) + '/' + name.gsub(/([^#{ALLOWED_CHARS}]+)/, ">\\1<")
> >
> >         # Not complete, just a test, but it doesn't work even for 'filname'
> >         if name.tr!('', 'e') =~ /^[#{ALLOWED_CHARS}]+$/
> >             File.rename(f, File.dirname(f) + '/' + name)
> >             puts "\trenamed in #{name}"
> >             break
> >         end
> >     end
> > }
>
> What error do you get? Is it failing to match the character at all (tr!
> returns nil), or is an encoding error raised in tr!, or is an error
> raised by File.rename?
>
>
" And so it is.  If memory serves, Mac OS X stores filenames in normal
form D.

> How to handle that easily?

Normalize to normal form C instead.

Best,
--
Marnen Laibow-Koser "

So that solved it: converting with Iconv.
The "UTF-8-MAC" encoding would probably only work on Mac, but since it is
meant for working on HFS+, that's not really a problem.
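
To illustrate why the Regexp treated the name as containing an extra
character, here is a minimal sketch. The sample name and the character
class are made up for the illustration (they are not the ALLOWED_CHARS
from the script above), and the strings are built from explicit escapes
so that the form of each one is unambiguous:

```ruby
# Hypothetical sample names, not read from a real file system.
nfd = "espan\u0303ol.lng"   # decomposed: "n" followed by U+0303 COMBINING TILDE
nfc = "espa\u00F1ol.lng"    # precomposed: single code point U+00F1

puts nfd == nfc              # false: the code point sequences differ
puts nfd.codepoints.length   # 12
puts nfc.codepoints.length   # 11

# A character class that allows the precomposed letter only.
allowed = /\A[A-Za-z0-9.\u00F1]+\z/

p(nfc =~ allowed)            # 0   (matches from the start)
p(nfd =~ allowed)            # nil (U+0303 is not in the class)
```

Both strings look the same when printed, which is why the difference
only shows up in the codepoints or in a Regexp match.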

I found the Iconv documentation (in 1.9.2) a little messy...
For example, typing 'ri Iconv#iconv' gives:
------------------------------------------------------------ Iconv#iconv
      Iconv.iconv(to, from, *strs)

and in 1.8.7
------------------------------------------------------------ Iconv#iconv
      iconv(str, start=0, length=-1)

The ri result in 1.9.2 is the same as for 'ri Iconv::iconv', which is a
rather different method (the class method, not the instance method).

Anyway, converting every filename using this works :)

fn = Iconv.open("UTF-8", "UTF-8-MAC") { |iconv|
    iconv.iconv(fn)
}
or
fn = Iconv.iconv("UTF-8", "UTF-8-MAC", fn).shift
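
For Ruby versions where Iconv is no longer in the standard library (it
was moved out to a gem in 2.0), a rough equivalent can be sketched with
String#unicode_normalize, which assumes Ruby 2.2 or later. The helper
name nfc_name is hypothetical, and this is plain NFC normalization, not
necessarily identical to the "UTF-8-MAC" conversion in every case:

```ruby
# Hypothetical helper (not from the thread): return a file name in
# normal form C. Assumes the string is tagged with a Unicode encoding
# such as UTF-8.
def nfc_name(name)
  name.unicode_normalize(:nfc)
end

# Sample input built from escapes: "e" followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "file\u0301name"
composed   = nfc_name(decomposed)

p decomposed.codepoints.length   # 9
p composed.codepoints.length     # 8
p composed == "fil\u00E9name"    # true
```

It could then be applied to each name, e.g.
Dir["*.lng"].map { |fn| nfc_name(fn) }.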