On Sep 29, 2007, at 9:47 PM, Felipe Contreras wrote:

> Hi,
>
> On 9/29/07, John Joyce <dangerwillrobinsondanger / gmail.com> wrote:
>>>
>>> Yes but what about stuff already encoded in UTF-16?
>>
>> That's why I said read up on unicode!
>> After you read that stuff you'll understand why it's no problem.
>> I'm not going to explain it. Many people understand it, but might
>> make mistakes when explaining it.
>> Read the unicode stuff carefully. It's vital for many things.
>>
>> The only thing you might run into is BOM or Endian-ness, but it's
>> doubtful it will be an issue in most cases.
>>
>> This might get you started.
>> http://www.unicode.org/faq/utf_bom.html#37
>>
>>
>> Even Joel Spolsky wrote a brief bit on unicode... mostly trumpeting
>> how programmers need to know it and how few actually do.
>> The short version is that UTF-16 is basically wasteful: it uses 2
>> bytes for low code points (the stuff also known as the ASCII
>> range) where UTF-8 uses only 1.
>
> As you suggested I read the article:
> http://www.joelonsoftware.com/articles/Unicode.html
>
> I didn't find anything new. It's just explaining character sets in a
> rather non-specific way. ASCII fits in a byte, which can store at
> most 256 characters, so it can't store all the characters in the
> world, so other character sets are needed (really? I would have never
> guessed that). UTF-16 basically stores characters in 2 bytes (that
> means more characters in the world); UTF-8 also allows more
> characters but doesn't necessarily need 2 bytes: it uses 1, and if
> the character is beyond 127 then it will use 2 or more bytes. This
> whole thing can be extended up to 6 bytes.
>
> So what exactly am I looking for here?
>
>> You really need to spend an afternoon reading about unicode. It
>> should be required in any computer science program as part of an
>> encoding course. Americans in particular are often the ones who know
>> the least about it....
>
> What is there to know about Unicode? There's a couple of character
> sets, use UTF-8, and remember that one character != one byte. Is there
> anything else for practical purposes?
>
> I'm sorry if I'm being rude, but I really don't like when people tell
> me to read stuff I already know.
>
> My question is still there:
>
> Let's say I want to rename a file "fooobar", and remove the third "o",
> but it's UTF-16, and Ruby only supports UTF-8, so I remove the "o" and
> of course there will still be a 0x00 in there. That's if the string is
> recognized at all.
>
> Why is there no issue with UTF-16 if only UTF-8 is supported?
>
> I don't mind reading some more if I can actually find the answer.
>
> Best regards.
>
> --  
> Felipe Contreras
>
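For what it's worth, the variable-width point is easy to check directly in irb. A small sketch (the character choices are arbitrary; this uses the Encoding support from Ruby 1.9+):

```ruby
# Byte widths of a few characters in each encoding. ASCII-range text is
# 1 byte in UTF-8 but 2 in UTF-16; an astral character like the musical
# clef needs 4 bytes in *both* (UTF-16 uses a surrogate pair), so even
# UTF-16 is not strictly fixed-width.
["A", "é", "€", "\u{1D11E}"].each do |ch|
  puts "#{ch.inspect}: UTF-8 #{ch.bytesize} bytes, " \
       "UTF-16 #{ch.encode('UTF-16LE').bytesize} bytes"
end
```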
Hmm... you should consider converting it to UTF-8 via iconv.
There is a gem for iconv.
This will keep your data intact, but you might need to convert it
back to UTF-16 later.
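A minimal sketch of that round-trip for the "fooobar" rename case. Shown here with the String#encode API from Ruby 1.9+; the 1.8-era iconv equivalent of each conversion would be `Iconv.conv('UTF-8', 'UTF-16LE', bytes)`:

```ruby
# Round-trip sketch: UTF-16 bytes -> UTF-8 -> edit -> back to UTF-16.
# Editing in UTF-8 avoids leaving a stray 0x00 behind, which is exactly
# what byte-level surgery on UTF-16 data would do.
utf16 = "fooobar".encode("UTF-16LE")  # pretend this arrived as UTF-16

utf8 = utf16.encode("UTF-8")          # convert before touching it
utf8.sub!("oo", "o")                  # drop one "o" safely

back = utf8.encode("UTF-16LE")        # convert back for the UTF-16 consumer
puts utf8                             # => foobar
puts back.bytesize                    # => 12 (6 chars * 2 bytes)
```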

I believe filenames on Windows are actually UTF-8,
while files' contents are generally written in UTF-16.

Could be wrong on this...
but test it and see!
Try to open a file with non-ASCII-range characters in irb and see
what happens.
If it fails, no harm done.
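Along those lines, the experiment might look like this (the filename is just an arbitrary example with one non-ASCII character):

```ruby
# Quick experiment, as suggested: create and re-open a file whose name
# contains a non-ASCII character, and see whether anything breaks.
name = "café_test.txt"

File.open(name, "w") { |f| f.write("hello") }
puts File.read(name)      # => hello
puts File.exist?(name)    # => true
File.delete(name)         # clean up after the experiment
```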