On 2011/10/26 6:53, Eric Hodel wrote:
> On Oct 23, 2011, at 4:05 PM, Perry Smith wrote:

>> But as far as I know, Unicode claims to be able to encode everything and
>> UTF-8 is just a more compact version of Unicode.

For many kinds of data it's more compact, but it can also get longer.
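
As a rough illustration (a sketch with sample strings I picked, not a benchmark), Ruby can show the byte counts directly. The strings and variable names below are only examples:

```ruby
# Sample text: "日本語" written with \u escapes so the literal is UTF-8
# regardless of the source file's encoding.
japanese = "\u65E5\u672C\u8A9E"
ascii    = "hello"

# Same three Japanese characters in three encodings.
ja_utf8  = japanese.encode("UTF-8")
ja_utf16 = japanese.encode("UTF-16LE")
ja_sjis  = japanese.encode("Shift_JIS")

# Same five ASCII characters in two encodings.
en_utf8  = ascii.encode("UTF-8")
en_utf16 = ascii.encode("UTF-16LE")

puts "Japanese: UTF-8=#{ja_utf8.bytesize} UTF-16=#{ja_utf16.bytesize} Shift_JIS=#{ja_sjis.bytesize}"
puts "ASCII:    UTF-8=#{en_utf8.bytesize} UTF-16=#{en_utf16.bytesize}"
```

For ASCII-only text UTF-8 is half the size of UTF-16, while for this Japanese sample it is half again as large as either UTF-16 or Shift_JIS.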

>> I believe (perhaps mistakenly)
>> that everything can be re-encoded to Unicode (and thus encoded to UTF-8).  Coding
>> everything in Unicode is how a lot of other languages deal with this problem.
>
> This is the issue.  For certain encodings you can't round-trip through Unicode and get back your input document, so ruby does not automatically perform such conversions on your behalf.  You can look back through the archives to find threads on the specifics.

It's slightly more complex than that.

There are encodings that can easily be round-tripped to Unicode and 
back, but for which there are several ways to do so.

For example, there are minor ways in which the mappings from Shift_JIS 
(the traditional Japanese encoding on the PC and the Mac) to Unicode and 
back differ between systems. See e.g. 
http://icu-project.org/charts/charset/roundtripIndex.html#aix-IBM_932-4.3.6 
(cp932 is the numerical code for Shift_JIS).
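
One such difference can be seen from Ruby itself. The sketch below assumes Ruby's bundled transcoders for "Shift_JIS" and "Windows-31J" (Ruby's name for the Microsoft cp932 variant), and uses the two-byte sequence 0x81 0x60 as the sample; the variable names are just for illustration:

```ruby
# The same two bytes, interpreted under two closely related encodings.
bytes = [0x81, 0x60].pack("C*")

as_sjis  = bytes.dup.force_encoding("Shift_JIS").encode("UTF-8")
as_cp932 = bytes.dup.force_encoding("Windows-31J").encode("UTF-8")

# Print the Unicode code point each mapping produces.
printf("Shift_JIS   -> U+%04X\n", as_sjis.ord)
printf("Windows-31J -> U+%04X\n", as_cp932.ord)
puts "same Unicode character? #{as_sjis == as_cp932}"
```

With the mapping tables I am aware of, the first comes out as WAVE DASH (U+301C) and the second as FULLWIDTH TILDE (U+FF5E), so identical input bytes end up as different Unicode characters depending on which table is used.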

This means that you can avoid certain problems by not converting to 
Unicode. You know that one and the same Shift_JIS codepoint will always 
be treated the same. Some programmers and users, especially in Japan, 
care a lot about this. Ruby can deal with this.
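
A minimal sketch of what that looks like in practice, assuming the data arrives as raw Shift_JIS bytes (here the bytes for "日本語"; the variable names are hypothetical):

```ruby
# Raw Shift_JIS bytes for three Japanese characters, tagged with their
# encoding but never converted to Unicode.
sjis = [0x93, 0xFA, 0x96, 0x7B, 0x8C, 0xEA].pack("C*").force_encoding("Shift_JIS")

puts sjis.valid_encoding?   # the bytes are well-formed Shift_JIS
puts sjis.length            # counted in characters
puts sjis.bytesize          # counted in bytes

# Character-level operations work on the Shift_JIS string directly,
# and the results stay in Shift_JIS with the original bytes.
first    = sjis[0]
reversed = sjis.reverse

puts first.encoding
puts first.bytes.map { |b| format("%02X", b) }.join(" ")
```

Because no transcoding happens, the bytes that come out are exactly the bytes that went in, whichever mapping table another system might have used.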

On the other hand, there are many cases where you have to work in 
Unicode (in particular Web stuff and everything that potentially mixes 
e.g. Japanese with data from other cultures). Ruby is well prepared for 
this, too.
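
As a small sketch of the mixed case (sample strings and variable names are mine): Ruby refuses to silently combine non-ASCII strings in different encodings, and an explicit conversion to UTF-8 makes the combination work.

```ruby
# Japanese text held as Shift_JIS, and French text held as UTF-8.
sjis = [0x93, 0xFA, 0x96, 0x7B, 0x8C, 0xEA].pack("C*").force_encoding("Shift_JIS")
utf8 = "caf\u00E9"

# Concatenating the two directly is rejected rather than guessed at.
begin
  sjis + utf8
  direct_concat_ok = true
rescue Encoding::CompatibilityError => e
  direct_concat_ok = false
  puts "cannot mix directly: #{e.message}"
end

# Converting explicitly to Unicode first makes the mix well-defined.
combined = sjis.encode("UTF-8") + " " + utf8
puts combined.encoding
puts combined.length
```

The conversion is something the program asks for, at a point it chooses, rather than something that happens behind its back.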


Regards,   Martin.