In the code example you used, there's no external encoding being
defined, so the default is used, which is generally the default
encoding of the operating system (likely Windows-1252 on Windows,
MacRoman on Mac, and UTF-8 on Linux). Based on your comments, that
default is not appropriate for your data.
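You can check which defaults your interpreter picked up (the exact
values depend on your OS and locale):

```ruby
# Inspect the interpreter's encoding defaults. default_external is
# what File.open uses when you don't declare one; default_internal
# is nil unless you (or a -E/-U command-line flag) set it.
puts Encoding.default_external          # e.g. UTF-8 or Windows-1252
puts Encoding.default_internal.inspect  # usually nil
```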

There's not really a great way to determine the encoding of a file.
Generally the encoding is well defined by some standard, contract or
other out-of-band mechanism. Once you do know the encoding, the proper
way to open the file is to declare its external encoding, like this -

f = File.open('somefile.txt', 'r:iso-8859-1')
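To make that concrete, here's a throwaway sketch (the filename and
bytes are made up for illustration) that writes a few raw ISO-8859-1
bytes and reads them back with the external encoding declared:

```ruby
# 0xE8 is 'e-grave' ('è') in ISO-8859-1. Write raw bytes, then read
# with an explicit external encoding so Ruby tags the string right.
File.binwrite('somefile.txt', "Moli\xE8re\n")

f = File.open('somefile.txt', 'r:iso-8859-1')
line = f.gets
f.close

puts line.encoding        # ISO-8859-1
puts line.valid_encoding? # true
```

Without the `r:iso-8859-1` part, the same bytes would be tagged with
your OS default encoding and would print as garbage or fail
validation.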

Then when you read content from the file, the strings are tagged with
that encoding and, if an internal encoding is set, transcoded from
ISO-8859-1 to it (note that Encoding.default_internal is nil by
default, so no transcoding happens unless you ask for it). You can
define the internal encoding to use by further qualifying the open,
like this -

f = File.open('somefile.txt', 'r:iso-8859-1:utf-8')

This will open the file with an external encoding of ISO-8859-1 and an
internal encoding of UTF-8, so every line you read comes back as a
valid UTF-8 string.
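As a throwaway sketch (made-up filename and bytes again), you can
watch the transcoding happen:

```ruby
# Write one Latin-1 byte sequence, read it back with both external
# and internal encodings declared: the result is real UTF-8.
File.binwrite('somefile.txt', "Moli\xE8re\n")  # 0xE8 = 'è' in Latin-1

f = File.open('somefile.txt', 'r:iso-8859-1:utf-8')
line = f.gets
f.close

puts line.encoding  # UTF-8
puts line           # Molière (on a UTF-8 terminal)
```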

Check out this article for some more information -
http://nuclearsquid.com/writings/ruby-1-9-encodings/.
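If you really have nothing to go on, the best you can do is a
heuristic. This sketch (my own helper, not a library function) just
returns the first candidate encoding under which the bytes form a
valid string. Note that ISO-8859-1 accepts every possible byte value,
so it matches almost anything - which is exactly why guessing is
unreliable:

```ruby
# Heuristic only: return the first candidate encoding for which the
# raw bytes are valid. It cannot prove that the guess is correct.
def guess_encoding(bytes, candidates = %w[UTF-8 ISO-8859-1 Windows-1252])
  candidates.find { |name| bytes.dup.force_encoding(name).valid_encoding? }
end

puts guess_encoding("Moli\xE8re".b)  # ISO-8859-1 (invalid as UTF-8)
puts guess_encoding("Molière".b)     # UTF-8
```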

On Mon, Sep 17, 2012 at 4:52 PM, Thomas Bednarz <lists / ruby-forum.com> wrote:
> I am new to Ruby and am playing around with it a little bit at the
> moment. I have a large text file containing data with French accents
> and German umlauts. The content of this file (some hundred thousand
> lines) should be stored in a table in a Postgres database. When I
> open the file on Windows with an editor called Notepad++ it displays
> the data correctly. When I look at the output from
> File.foreach(...) { |line| puts line }, I get garbage for any
> non-ASCII character. When I try to store the records in Postgres I
> get an error as soon as data with non-ASCII characters should be
> inserted.
>
> I use RubyMine as IDE and receive the following output with the
> following code:
>
>     File.foreach("somefile.txt") do |line|
>       if counter > 0 then
>         record = line.split(";")
>         @az_addidnr = record[1]
>         az_chnr = record[2]
>         az_adr1 = record[6]
>         puts "record data: #{@az_addidnr} | #{az_chnr} | #{az_adr1}"
>         conn.exec_prepared('stmt1', [@az_addidnr, az_chnr, az_adr1])
>       end
>
> OUTPUT:
>
> record data: 512999 | CH21702301867 | Garage de la Molire SA
> Uncaught exception: FEHLER:  ungltige Byte-Sequenz fr Kodierung
> UTF8: 0xe87265
>
> I also tried az_adr1 = record[6].encode("ISO-8859-1")
>
> If I try az_adr1 = record[6].encode("ASCII") I get:
> Uncaught exception: U+00DE to US-ASCII in conversion from CP850 to UTF-8
> to US-ASCII
>
> Could anybody please explain the following to me:
> How can I find out what kind of encoding is used in a text file?
> What kind of conversion do I need to a) get correct output and b) be
> able to insert the records into PostgreSQL?
>
> Many thanks for your help.
>
> Tom
>
> --
> Posted via http://www.ruby-forum.com/.