On Wed, May 16, 2012 at 9:02 AM, Brian Candler <lists / ruby-forum.com> wrote:
> I will add that the OP is not entirely alone in his opinion.

The OP may not be alone in his opinion, but that's because encodings
are broken in general.

This is *not* a Ruby problem, this is a *data* problem.

C gets it wrong because it assumes that characters, code points, and
bytes are the same (but it gets a pass because it was created in a
time when this was true).
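
A minimal sketch of the distinction, assuming Ruby 1.9 or later (the
describe helper is a made-up name for illustration, not part of any
library). The same user-visible character is spelled two ways, and the
byte count, code point count, and "character" count all disagree:

```ruby
# One user-perceived character, two spellings.
precomposed = "\u00E9"    # U+00E9, e with acute accent
decomposed  = "e\u0301"   # U+0065 followed by U+0301 (combining acute)

# Hypothetical helper: report the different "lengths" of a string.
def describe(str)
  {
    bytes: str.bytesize,                    # storage size in bytes
    code_points: str.codepoints.to_a.size,  # number of Unicode code points
    encoding: str.encoding.name             # the label Ruby attached to the bytes
  }
end

p describe(precomposed)
p describe(decomposed)
```

Both strings render as a single accented letter, yet neither the byte
count nor the code point count matches between them.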

Java gets it wrong because it uses a nominally-UTF-16 character width
(it's actually UCS-2) which doesn't allow for UTF-16 surrogates.
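
To make the surrogate issue concrete, here is a small Ruby sketch (Ruby
1.9 or later assumed; it shows what UTF-16 requires, not how any
particular Java version behaves). A code point outside the Basic
Multilingual Plane needs two 16-bit code units:

```ruby
clef  = "\u{1D11E}"               # MUSICAL SYMBOL G CLEF, above U+FFFF
utf16 = clef.encode("UTF-16BE")   # transcode to big-endian UTF-16
units = utf16.unpack("n*")        # split the bytes into 16-bit code units

puts clef.length                  # characters as Ruby counts them
puts utf16.bytesize               # bytes in the UTF-16 form
puts units.map { |u| format("0x%04X", u) }.join(" ")
```

Anything that treats one 16-bit unit as one character will see two
"characters" here, and splitting between them leaves half a surrogate
pair.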

Python and Java get it wrong because they always assume that
converting to Unicode is a safe, reliable, and reversible
transformation (and they don't work well with non-Unicode encodings).

The problem the OP had? Partially a library problem for not shifting
to Ruby 1.9 assumptions (I've slowly been moving my libraries to state
their encodings up front, but it's a pain because I just don't have
the time).

Matz and others have worked very hard to make sure that Ruby 1.9 works
well if you follow certain rules regarding your inputs and outputs.
These rules, by the way, are more or less what Joel Spolsky wrote
almost nine years ago:
http://www.joelonsoftware.com/articles/Unicode.html
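
As a rough illustration of those rules in Ruby 1.9 terms (a sketch, not
the only way to do it; the sample bytes and the choice of ISO-8859-1 are
assumptions made for the example): label bytes with their real encoding
at the boundary, transcode to one internal encoding, and encode
explicitly on the way out.

```ruby
# Bytes as they might arrive from a socket or a legacy file:
# "cafe" with an acute accent on the e, in ISO-8859-1.
raw = "caf\xE9".force_encoding("BINARY")

# Mislabeling the bytes does not convert them; it only produces garbage.
mislabeled = raw.dup.force_encoding("UTF-8")
puts mislabeled.valid_encoding?    # the bytes are not valid UTF-8

# 1. Label the input with the encoding you know it has.
labeled = raw.dup.force_encoding("ISO-8859-1")

# 2. Transcode to the encoding you work in internally.
internal = labeled.encode("UTF-8")

# 3. Encode explicitly when the data leaves your program.
output = internal.encode("ISO-8859-1")

puts internal.encoding.name
puts output.bytes.to_a.inspect
```

For files, the same labeling can be stated when opening, for example
File.open("legacy.txt", "r:iso-8859-1:utf-8"), where legacy.txt is a
placeholder name.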

If you don't respect your encodings, they will bite you. They may not
bite you up front (as they do with Ruby, because it exposes these
painful details), but they *will* bite you.

Ruby got it right, because it (a) acknowledges that this is hard and
(b) gives you the tools you need in order to make this less painful.
It also doesn't (c) incorrectly assume that everything is or can be
expressed safely in Unicode. (Shift-JIS will not roundtrip to Unicode
and back for some characters.)
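
If you want to check your own data rather than take this on faith, a
roundtrip test is easy to sketch (roundtrips? is a hypothetical helper
name; which byte sequences fail depends on the conversion tables your
Ruby ships with, so none are claimed here):

```ruby
# Hypothetical helper: true if the bytes survive legacy -> via -> legacy
# unchanged; false if they change or cannot be converted at all.
def roundtrips?(bytes, legacy, via = "UTF-8")
  original = bytes.dup.force_encoding(legacy)
  back = original.encode(via).encode(legacy)
  back.bytes.to_a == original.bytes.to_a
rescue EncodingError
  # Covers undefined conversions and invalid byte sequences.
  false
end

# Sample input: the word for "Japanese language", built from escapes so
# the source file encoding does not matter.
sjis = "\u65E5\u672C\u8A9E".encode("Shift_JIS")
puts roundtrips?(sjis, "Shift_JIS")
```

Running this over a real corpus before converting it wholesale to
Unicode tells you whether the conversion is lossless for your data.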

-a
-- 
Austin Ziegler • halostatue / gmail.com • austin / halostatue.ca
http://www.halostatue.ca/ • http://twitter.com/halostatue