Austin Ziegler wrote in post #1061436:
> This is *not* a Ruby problem, this is a *data* problem.

Leaving aside the point that not all data is text, you still need a 
clear conceptual model to be able to reason about your program.

In Python 3, there is a clear distinction between "characters" and "a 
sequence of bytes which encode those characters". They are two 
completely different classes and cannot be combined (e.g. a+b will 
always fail if a is str and b is bytes). It's also symmetrical: you 
convert from bytes to characters as text enters your program, and from 
characters to bytes as text leaves it.

(Aside: I know that Python only supports Unicode characters, but this is 
just an implementation limitation. There could be a third class 
"gb2312str" if desired, and additional classes for other character sets 
which are not subsets of Unicode)

Ruby muddles these concepts by having all strings be a sequence of bytes 
plus the encoding, which in turn muddles the concepts of "character set" 
and "a method of encoding that character set".

Now, you could argue that Ruby is actually implementing the Python 3 
approach but in a "lazy" way: by not explicitly converting bytes to 
characters until required, it avoids potentially unnecessary work. But 
if so, it's half-baked. For example, you cannot combine a UTF-16LE 
string with a UTF-16BE string, even though they are the same character 
set (Unicode). What's worse is that a UTF-16LE string will sort 
differently to a UTF-16BE string (because ruby 1.9 sorts by byte 
ordering, which happens to work for UTF-8 but not all other encodings of 
Unicode). So it kind-of behaves like a string of characters, except that 
it doesn't.
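
To make that concrete, here is a small sketch. The two sample characters are my own choice, and it assumes strings are compared byte by byte as described above; treat it as an illustration rather than a statement of what every Ruby version does.

```ruby
# Two sample characters (my choice): "a" is U+0061, "\u0100" is U+0100.
chars = ["a", "\u0100"]

le = chars.map { |s| s.encode("UTF-16LE") }
be = chars.map { |s| s.encode("UTF-16BE") }

# Same characters, same character set, but concatenation is refused.
concat_error = begin
  le[0] + be[0]
  nil
rescue Encoding::CompatibilityError => e
  e
end

# Sort each list, then transcode back to UTF-8 so the two orders can be compared.
sorted_le = le.sort.map { |s| s.encode("UTF-8") }
sorted_be = be.sort.map { |s| s.encode("UTF-8") }

puts concat_error.class
puts sorted_le == sorted_be
```

In UTF-16LE the low byte comes first, so U+0100 (bytes 00 01) sorts ahead of "a" (bytes 61 00); in UTF-16BE it is the other way round.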

Furthermore, ruby sometimes lets you combine objects representing 
"characters" and "bytes", or "characters with encoding A" and 
"characters with encoding B". Whether it is allowed or not depends on 
the run-time contents of those objects.

If a = b + c *always* crashed when b and c had different encodings, I 
would really not have a problem with any of this. Your test case would 
immediately catch it, you fix it, problem solved.

However ruby 1.9's insidious behaviour means that b + c may *or may not* 
crash depending not only on the encodings but also on the actual content of the 
strings at that instant. One perfectly reasonable set of tests may pass; 
actual application data may fail.
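
A minimal sketch of what I mean, using made-up sample data and a throwaway helper (try_concat is my name, not a Ruby method). Both right-hand operands carry exactly the same encoding tag; only their bytes differ.

```ruby
# Made-up sample data: one non-ASCII UTF-8 string and two "binary" strings.
text  = "caf\u00e9"                                       # UTF-8 with a non-ASCII character
bin_a = "abc".dup.force_encoding(Encoding::ASCII_8BIT)    # binary, happens to be 7-bit only
bin_b = "\xff".dup.force_encoding(Encoding::ASCII_8BIT)   # binary, contains a high byte

# Throwaway helper: report the encoding of x + y, or the exception class if it fails.
def try_concat(x, y)
  (x + y).encoding
rescue Encoding::CompatibilityError => e
  e.class
end

result_a = try_concat(text, bin_a)   # succeeds: bin_a only holds 7-bit bytes
result_b = try_concat(text, bin_b)   # fails: both sides hold non-ASCII bytes

puts result_a
puts result_b
```

A test suite that only ever feeds in data like bin_a will pass; the first real packet that looks like bin_b blows up.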

Finally, ruby is asymmetrical. On input, encodings are tagged; on 
output, they are ignored (by default). From files, the environment 
encoding is used; from sockets, the ASCII_8BIT encoding is used. With 
regexps, invalid strings cause an exception; with String#[] they do not. 
It is just an utter dog's breakfast of arbitrary rules which you just 
have no choice but to learn.
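
The regexp versus String#[] difference can be sketched like this. The sample string is mine, and the behaviour shown is what I'd expect from the rules above rather than anything the docs promise.

```ruby
# A string tagged as UTF-8 whose last byte is not valid UTF-8 (sample data is mine).
bad = "abc\xff".dup.force_encoding(Encoding::UTF_8)

valid = bad.valid_encoding?      # false

# Indexing quietly hands back the stray byte as if it were a character.
indexed = bad[3]

# Matching the very same string against a regexp raises instead.
match_error = begin
  bad =~ /c/
  nil
rescue ArgumentError => e
  e
end

puts valid
puts indexed.bytes.inspect
puts match_error.class
```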

Some people see ruby 1.9's highly complex encoding implementation as a 
triumph of engineering; I see it as design smell.

> Matz and others have worked very hard to make sure that Ruby 1.9 works
> well if you follow certain rules regarding your inputs and outputs.

... which one has to absorb by osmosis. Certainly the core API docs 
don't give these rules; in fact they give precious little about the 
encoding semantics of String. And you can't get much more of a core part 
of the language than String.

Want to find out what String#[] does when given a string which contains 
invalid characters in its declared encoding? The docs won't help you. 
Try it and see. Or go to the C source code.

Of course, because every String is now two-dimensional (x = sequence of 
bytes, y = Encoding) there is a much higher requirement to document 
every method which acts on a string or returns a string, because 
there is a much larger variety of inputs and outputs to consider.

Take strings with invalid characters, for example, or the fact that 
every returned string also has an encoding and you need to document how 
it is chosen. (For example Net::HTTP: does it return strings with 
encoding from the Content-Type header? You tell me)

Incidentally, strings with invalid characters are not an edge case or 
only for erroneous input. Ruby encourages you to do things like:

    txt = sock.read(4096)    # txt likely to contain a split character at the end

This could be dealt with if explicitly converting bytes to characters at 
some point (you'd buffer the extra bit). By not having this explicit 
conversion, you are quite likely to have byte patterns which don't 
represent *any* character. Yes you can do the buffering yourself; I'm 
just saying that all methods need to *document* whether they do accept 
strings with invalid bytes, and how they handle them.
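
For what it's worth, doing the buffering yourself might look roughly like the sketch below. split_complete_utf8 is a hypothetical helper of mine, not part of Ruby, and it assumes the stream is UTF-8; the two chunks simulate successive sock.read calls.

```ruby
# Hypothetical helper: split a chunk of bytes into [complete UTF-8 text, leftover bytes].
# Assumes UTF-8, where a character is at most 4 bytes, so an incomplete character
# at the end of a chunk is at most 3 bytes long.
# Limitation: any trailing invalid bytes are treated as "pending"; a real
# implementation would check that they are a genuine prefix of a character.
def split_complete_utf8(bytes)
  buf = bytes.dup.force_encoding(Encoding::UTF_8)
  0.upto([3, buf.bytesize].min) do |n|
    cut  = buf.bytesize - n
    head = buf.byteslice(0, cut)
    if head.valid_encoding?
      tail = buf.byteslice(cut, n).force_encoding(Encoding::ASCII_8BIT)
      return [head, tail]
    end
  end
  raise ArgumentError, "invalid UTF-8 that is not just a split character"
end

# Simulate two reads that split the two-byte character U+00E9 down the middle.
data   = "caf\u00e9!".dup.force_encoding(Encoding::ASCII_8BIT)
chunks = [data.byteslice(0, 4), data.byteslice(4, 2)]

pending = "".dup.force_encoding(Encoding::ASCII_8BIT)
text    = "".dup.force_encoding(Encoding::UTF_8)
chunks.each do |chunk|
  complete, pending = split_complete_utf8(pending + chunk)
  text << complete
end

puts text.valid_encoding?
puts pending.bytesize
```

The point stands, though: this is exactly the bytes-to-characters step that an explicit conversion would force you to think about, and nothing in the API nudges you towards it.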

> If you don't respect your encodings, they will bite you. They may not
> bite you up front (as they do with Ruby, because it exposes these
> things which are painful), but they *will* bite you.

Certainly you need to know about character sets and how they are 
encoded. This does not imply that ruby does it in a sane way. And as I 
said before, if Ruby were to bite you consistently, it would be much 
better.

> Ruby got it right, because it acknowledges that (a) this is hard and
> (b) gives you the tools you need in order to make this less painful.
> It also doesn't (c) incorrectly assume that everything is or can be
> expressed safely in Unicode. (Shift-JIS will not roundtrip to Unicode
> and back for some characters.)

That's kind of irrelevant, since ruby 1.9 doesn't really handle 
Shift-JIS either, except to transcode it.

-- 
Posted via http://www.ruby-forum.com/.