Hi Carlos,

Thanks very much for the advice. I thought I'd start by looking at what's
already in the database, using unpack.

> 1. that application, the MySQL query tool, is not UTF-8 aware. So, it
> interprets the 2 bytes of "ł" (197, 130) as 2 characters in some single-byte
> encoding (probably latin-1), which gives "Å" and an unprintable character.
> Your test line wasn't UTF-8 encoded at all.

Yeah, for another db on the server it works fine, so I'm guessing it's
your 2nd option. Your explanation of the 2 bytes solves another
question I had though :-)

> 2. The application is UTF-8 aware, the test line is in UTF-8, but the data
> from your web pages was already in UTF-8 and you thought it wasn't and
> encoded it again to UTF-8.

> To test if a string is encoded in UTF-8, just examine its bytes
>   p str.unpack("C*")

> and see if the diacritic letters are encoded with 2 or more bytes (UTF-8),
> or only one (iso-8859-*, cp*, etc.). (If you see *four* then you encoded
> them twice :).
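To see what those three cases look like side by side, here is a minimal sketch. It assumes Ruby 1.9 or later (where strings carry an encoding); the sample word and the variable names are only for illustration.

```ruby
utf8   = "Wy\u015Blij"              # sample word with one diacritic, as UTF-8
latin2 = utf8.encode("ISO-8859-2")  # same text in a single-byte encoding
# Simulate the "encoded twice" mistake: treat the UTF-8 bytes as latin-1
# and convert them to UTF-8 again.
twice  = utf8.dup.force_encoding("ISO-8859-1").encode("UTF-8")

p utf8.unpack("C*")    # diacritic occupies 2 bytes
p latin2.unpack("C*")  # diacritic occupies 1 byte
p twice.unpack("C*")   # diacritic occupies 4 bytes
```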

Here's a test case

On web page after being loaded from DB: "Wyślij" [This is correct!]
In MySQL Analyser: "Wylij" [bad, even though MySQL analyser is
UTF-8]
In Interactive Ruby (IRB) printed to console, after loading from DB:
"Wy┼ølij" [expected in a DOS prompt!]
In IRB unpacked, after loading from DB: [87, 121, 197, 155, 108, 105,
106]

So, I can see that the character "ś" must correspond to the 3rd and
4th bytes of "Wyślij".
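The odd console output is consistent with the same bytes being read in a DOS code page. A rough sketch, assuming the console uses code page 850 (a guess on my part) and Ruby 1.9 or later:

```ruby
# The bytes exactly as they came back from the DB.
raw = [87, 121, 197, 155, 108, 105, 106].pack("C*")

# Read as UTF-8, bytes 197 and 155 form a single valid character.
as_utf8 = raw.dup.force_encoding("UTF-8")
p as_utf8.valid_encoding?
p as_utf8.length

# Read as code page 850 (assumed), each of the two bytes becomes its own
# character: a box-drawing cross (U+253C) and an o-slash (U+00F8).
as_console = raw.dup.force_encoding("IBM850").encode("UTF-8")
p as_console.unpack("U*")
```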

Looking at the Ruby help, I see I can also do

  p str.unpack("U*")

to decode the UTF-8 into Unicode code points, which gives:

[87, 121, 347, 108, 105, 106]

According to
http://www.fileformat.info/info/unicode/char/015b/index.htm, character
347 (U+015B) is in fact a "ś".
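The link between the decimal 347 and the 015b in that URL is just a base conversion. A small sketch that reproduces it from the raw bytes (assuming Ruby 1.9 or later; variable names are mine):

```ruby
bytes = [87, 121, 197, 155, 108, 105, 106]
str   = bytes.pack("C*").force_encoding("UTF-8")

# "U*" decodes the UTF-8 byte sequence into Unicode code points.
codepoints = str.unpack("U*")
p codepoints

# Print the third code point in the usual U+XXXX hex notation.
hex_name = format("U+%04X", codepoints[2])
puts hex_name

# Going the other way: a code point packed with "U*" yields the character.
round_trip = [347].pack("U*")
p round_trip == "\u015B"
```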

This would suggest that the database has UTF-8 text, and it's getting
into Ruby without corruption! Is this right?

So, the question now is why Iconv doesn't convert my UTF-8 to Latin2
correctly. That could simply be because the original text contains
additional characters outside the Latin2 set, which can't be converted.
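One way to check that theory is to list the characters that have no Latin2 equivalent. This sketch uses String#encode (Ruby 1.9 or later) rather than Iconv, which is no longer bundled with current Ruby; the helper name and the sample strings are hypothetical.

```ruby
# Hypothetical helper: return the distinct characters of a UTF-8 string
# that cannot be represented in ISO-8859-2 (Latin2).
def not_in_latin2(str)
  str.each_char.reject do |ch|
    begin
      ch.encode("ISO-8859-2")
      true                      # convertible, so drop it from the result
    rescue Encoding::UndefinedConversionError
      false                     # not convertible, so keep it
    end
  end.uniq
end

clean = not_in_latin2("Wy\u015Blij")
mixed = not_in_latin2("Wy\u015Blij \u2013 \u20AC5")  # adds an en dash and a euro sign

p clean
p mixed
```

An empty result means the whole string fits in Latin2, so any remaining problem would lie elsewhere.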

I could probably give Iconv explicit mappings for how to handle
certain characters; that may do the trick. I'll re-read your post and
see if I can find anything else.
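As a rough idea of what such an explicit mapping could look like, here is a sketch using the fallback option of String#encode (Ruby 1.9 or later) instead of Iconv. The replacement table and the helper name are made up for illustration; the real table would depend on which characters turn up in the data.

```ruby
# Hypothetical replacement table for characters outside Latin2.
FALLBACK = {
  "\u2013" => "-",    # en dash
  "\u201C" => '"',    # left double quotation mark
  "\u201D" => '"',    # right double quotation mark
  "\u20AC" => "EUR"   # euro sign
}

# Convert to Latin2; unmapped characters fall back to "?" instead of raising.
def to_latin2(str)
  str.encode("ISO-8859-2", fallback: ->(ch) { FALLBACK.fetch(ch, "?") })
end

converted = to_latin2("Wy\u015Blij \u2013 \u20AC5")
p converted.unpack("C*")
```

The fallback is only consulted for characters with no Latin2 equivalent, so letters such as the one at code point 347 still convert to their normal single byte.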

Thanks for the help, feels like I'm a few steps forward now!

If you can spot any errors in the above, a hint would be most welcome!

Tobin

> HTH. Good luck.