2009/12/28 Brian Candler <b.candler / pobox.com>

> Benoit Daloze wrote:
> > But then I come with things like:
> > /Users/benoitdaloze/Library/GlestGame/data/lang/espan><ol.lng
> >
> > (The ~ is separated from the n, so it is no longer an ñ.) The Regexp is
> > treating it as 2 different characters. How can I handle that easily? I
> > tried changing the script encoding to MacRoman, but that produced an
> > error about a bad encoding not matching UTF-8.
>
> I don't know what you mean. If Dir.[] tells you that the file name is
> <e> <s> <p> <a> <n> <~> <o> <l> <.> <l> <n> <g>, is that not the true
> filename?
>
> I suggest you try something like this:
>
>  puts "Source encoding: #{"".encoding}"
>  puts "External encoding: #{Encoding.default_external}"
>  Dir["*.lng"].each do |fn|
>    puts "Name: #{fn.inspect}"
>    puts "Encoding: #{fn.encoding}"
>    puts "Chars: #{fn.chars.to_a.inspect}"
>    puts "Codepoints: #{fn.codepoints.to_a.inspect}"
>    puts "Bytes: #{fn.bytes.to_a.inspect}"
>    puts
>  end
>
> then post the results for this file here. Then also post what you think
> the true filename is.
>

The true filename is (from the Finder and Terminal):
-rw-r--r--@ 1 benoitdaloze  staff  3758 Jul 17  2008 español.lng
So, with the 'ñ'.

I don't know what encoding HFS+ uses for filenames. According to Wikipedia,
it is UTF-16, with decomposition:
"names which are also character encoded in UTF-16
<http://en.wikipedia.org/wiki/UTF-16> and normalized to a form very nearly
the same as Unicode Normalization Form D (NFD)
<http://en.wikipedia.org/wiki/Unicode_normalization> (which means that
precomposed characters like  are decomposed in the HFS+ filename and
therefore count as two characters)"
So that's probably an encoding problem in Dir.[].
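
A minimal sketch of what that decomposition means for a single character. It assumes Ruby 2.2 or later, where String#unicode_normalize exists (the 1.9.2 build used in this thread does not have it), and it only illustrates NFD in general, not the exact variant HFS+ applies:

```ruby
# Assumption: Ruby >= 2.2 (String#unicode_normalize).
precomposed = "\u00F1"                             # ñ as one codepoint (composed form)
decomposed  = precomposed.unicode_normalize(:nfd)  # n followed by U+0303 COMBINING TILDE

puts precomposed.codepoints.inspect  # [241]
puts decomposed.codepoints.inspect   # [110, 771]
puts decomposed.length               # 2, i.e. it "counts as two characters"
```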

I changed the script a little, to compare with a String hard-coded inside
the script (rn = "español.lng").

ruby 1.9.2dev (2009-12-11 trunk 26067) [x86_64-darwin10.2.0]

Source encoding: UTF-8
External encoding: UTF-8

Format:
String in the code
  filename from Dir[]

String equality: false

Name:
"español.lng"
  "español.lng"
Encoding:
UTF-8
  UTF-8
Chars:
["e", "s", "p", "a", "ñ", "o", "l", ".", "l", "n", "g"]
  ["e", "s", "p", "a", "n", "̃", "o", "l", ".", "l", "n", "g"]
Codepoints:
[101, 115, 112, 97, 241, 111, 108, 46, 108, 110, 103]
  [101, 115, 112, 97, 110, 771, 111, 108, 46, 108, 110, 103]
Bytes:
[101, 115, 112, 97, 195, 177, 111, 108, 46, 108, 110, 103]
  [101, 115, 112, 97, 110, 204, 131, 111, 108, 46, 108, 110, 103]
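
The two strings above can be rebuilt from their codepoints to show why String equality is false and how a comparison could be made to succeed. This is only a sketch: same_name? is a hypothetical helper, and unicode_normalize assumes Ruby 2.2 or later rather than the 1.9.2 build shown above.

```ruby
# Codepoints copied from the output above.
literal  = [101, 115, 112, 97, 241, 111, 108, 46, 108, 110, 103].pack("U*")
from_dir = [101, 115, 112, 97, 110, 771, 111, 108, 46, 108, 110, 103].pack("U*")

# Hypothetical helper: compare two names after composing both (NFC).
def same_name?(a, b)
  a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)
end

puts literal == from_dir             # false: different codepoint sequences
puts same_name?(literal, from_dir)   # true: same text once both are composed
puts literal.bytes.length            # 12
puts from_dir.bytes.length           # 13
```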


> Then you can see whether: (1) Dir.[] is returning the correct sequence
> of bytes for the filename or not; and (2) Dir.[] is tagging the string
> with the correct encoding or not.
>

(1) Dir[] seems to return a valid UTF-8 String, yet it differs (!!) from the
UTF-8 String literal inside the script. Looking at the codepoints and bytes,
the two are quite different...

(2) That's probably the case. Let's check by forcing the encoding to
MacRoman... or not: that gives crazy results like "espan\xCC\x83ol.lng" or
"espan\u0303ol.lng".

Well, this is beyond my limited knowledge of encodings, I'm afraid :(
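
For what it's worth, the MacRoman experiment can be sketched as follows. force_encoding only relabels the String and leaves the bytes alone, which would explain why the two UTF-8 bytes of the combining tilde (0xCC 0x83) show up as separate characters. The file name here is rebuilt from codepoints purely for illustration:

```ruby
# "espan" + U+0303 + "ol", built from codepoints so the source file stays ASCII.
name = [101, 115, 112, 97, 110, 771, 111, 108].pack("U*")

# Relabel a copy as MacRoman; no transcoding happens here.
relabeled = name.dup.force_encoding("macRoman")

puts name.bytes == relabeled.bytes   # true: the bytes are untouched
puts name.length                     # 8 characters when read as UTF-8
puts relabeled.length                # 9 characters when read as MacRoman
```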

The most frustrating part is that both strings print the same...

P.S.: I also found filenames containing "\r", quite weird, no? ("Target
Application Alias\r"; the "\r" is shown as "?" in the Terminal)

> (This is one of the thousands of cases I did *not* document in
> string19.rb; I did some of the core methods on String, but of course
> every method in every class which either returns a string or accepts a
> string argument needs to document how it handles encodings)
>
> > as output of this script (which is then unable to rename any offending
> > file, because tr! does not seem to work on name either):
> >
> > path = ARGV[0] || "/"
> >
> > ALLOWED_CHARS = "A-Za-z0-9 %#:$@?!=+~&|'()\\[\\]{}.,\r_-"
> >
> > Dir["#{File.expand_path(path)}/**/*"].each { |f|
> >     name = File.basename(f)
> >     unless name =~ /^[#{ALLOWED_CHARS}]+$/
> >         puts File.dirname(f) + '/' + name.gsub(/([^#{ALLOWED_CHARS}]+)/, ">\\1<")
> >
> >         # Here it is not complete, it is just a test, but it doesn't
> >         # work even for 'filname'
> >         if name.tr!('', 'e') =~ /^[#{ALLOWED_CHARS}]+$/
> >             File.rename(f, File.dirname(f) + '/' + name)
> >             puts "\trenamed in #{name}"
> >             break
> >         end
> >     end
> > }
>
> What error do you get? Is it failing to match the character at all (tr!
> returns nil), or is an encoding error raised in tr!, or is an error raised
> by File.rename?
>
Yes, tr! returns nil on name.tr!('ñ', 'n'), but it works on a String
literal inside the script (e.g. "eño".tr!('ñ', 'n')).
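
A sketch of why tr! returns nil here, with two possible workarounds. It assumes the name comes back decomposed, as in the codepoints shown earlier. Option 1 needs Ruby 2.2 or later for unicode_normalize; option 2 uses only String#delete and is limited to this one combining mark.

```ruby
name = "espan\u0303ol.lng"   # decomposed form: n followed by U+0303

# There is no precomposed ñ (U+00F1) in the string, so nothing is replaced.
result = name.dup.tr!("\u00F1", "n")
puts result.inspect                   # nil

# Option 1 (assumes Ruby >= 2.2): compose first, then translate.
composed = name.unicode_normalize(:nfc)
puts composed.tr("\u00F1", "n")       # espanol.lng

# Option 2: remove the combining tilde itself.
puts name.delete("\u0303")            # espanol.lng
```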