Yukihiro Matsumoto wrote:
> Hi,
> 
> In message "Re: ruby 1.9 hates you and me and the encodings we rode in on so just get used to it."
>     on Tue, 20 May 2008 19:50:08 +0900, DJ Jazzy Linefeed <john.d.perkins / gmail.com> writes:
> 
> |> Regular expression operation does not work fine on broken strings.  It
> |> seems that you specify utf-8 for your locale, yet the content of
> |> reading file is not.  If you know the encoding of the content, say
> |> iso-8859-1, you can open it with the explicit encoding:
> |>
> |>   x = File.open(path, "r:iso-8859-1")
> |>
> |> if not, you can say it
> |>
> |>   x = File.open(path, "r:ascii-8bit")
> |>
> |> unless the file content is non ASCII like UTF-16.
> 
> |It makes no sense, Matz.
> |
> |I don't get to know what the encoding is before hand, that's just it -
> |there may be every encoding. I just deal with a pile of files, I
> |think...
> 
> Since today's OSes do not provide encoding information for files, you
> HAVE TO know the encoding of the files if you want to handle them
> correctly, unfortunately.  That's life, no matter how you expect.

I ran into this in Perl last week on a Windows machine. Perl's "Encode"
library has a "guess" facility (Encode::Guess) that looks at a file and
tries to work out its encoding. Unfortunately, it could only determine
that the files were "UTF-16", not which of the (at least) two variants.
My workaround was to open the files in WordPad and save them as ASCII.

You're absolutely right ... if you don't know what encoding the writer 
of the file used, your first action should be to ask!

P.S.: I suppose I should look at how Perl attempts to guess the encoding 
and why it couldn't pick one of two UTF-16 variants. :)
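
For what it's worth, one cheap way to tell the UTF-16 variants apart is
to look for a byte-order mark (BOM) at the start of the file. Below is a
minimal Ruby sketch of that idea. The name sniff_bom is my own
hypothetical helper, not part of Ruby or of Perl's Encode, and I am not
claiming this is how Perl does its guessing. It only helps when the
writer actually emitted a BOM; without one it returns nil and you are
back to asking.

```ruby
# Guess a Unicode encoding from a leading byte-order mark (BOM).
# Returns an encoding name, or nil when no BOM is recognised.
# sniff_bom is a hypothetical helper name used only for illustration.
def sniff_bom(data)
  head = data.unpack("C4")   # first (up to) four bytes as integers

  # Check the 4-byte UTF-32 marks before the 2-byte UTF-16 ones,
  # because the UTF-32LE mark starts with the UTF-16LE mark.
  return "UTF-32LE" if head[0, 4] == [0xFF, 0xFE, 0x00, 0x00]
  return "UTF-32BE" if head[0, 4] == [0x00, 0x00, 0xFE, 0xFF]
  return "UTF-8"    if head[0, 3] == [0xEF, 0xBB, 0xBF]
  return "UTF-16LE" if head[0, 2] == [0xFF, 0xFE]
  return "UTF-16BE" if head[0, 2] == [0xFE, 0xFF]
  nil
end
```

You would feed it the first few bytes read in binary mode, e.g.
sniff_bom(File.open(path, "rb") { |f| f.read(4) }). One known
ambiguity: a UTF-16LE file whose first character is NUL looks the same
as a UTF-32LE BOM, so treat the answer as a hint, not a proof.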
> 
> If you don't need exact encoding handling, and know the file is mostly
> ASCII, use ASCII-8BIT for encoding.  It works most of the cases.

Well ... it didn't work on my UTF-16 files last week. :)
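
To illustrate why, here is a small Ruby sketch (assuming Ruby 1.9 or
later) of what a regex sees when UTF-16 bytes are treated as
ASCII-8BIT, and what it sees once the bytes are transcoded. The helper
names to_utf8 and sample_utf16_bytes are made up for this example, and
the source encoding is passed in by hand, which is exactly the
knowledge Matz says you have to have.

```ruby
# Reinterpret raw bytes as `source_encoding`, then transcode to UTF-8.
# The dup keeps the caller's string untouched.
def to_utf8(raw, source_encoding)
  raw.dup.force_encoding(source_encoding).encode("UTF-8")
end

# "hello" as UTF-16LE bytes, tagged as binary the way a read with
# "r:ascii-8bit" would tag them.
def sample_utf16_bytes
  "hello".encode("UTF-16LE").force_encoding("ASCII-8BIT")
end

raw = sample_utf16_bytes
puts raw.bytesize                                    # 10
puts((raw =~ /hello/).inspect)                       # nil
puts((to_utf8(raw, "UTF-16LE") =~ /hello/).inspect)  # 0
```

Each ASCII letter in UTF-16LE is followed by a NUL byte, so the byte
sequence "hello" never appears contiguously and the first match fails.
After transcoding, the same pattern matches at offset 0.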