Austin Ziegler wrote in post #1061436:
> This is *not* a Ruby problem, this is a *data* problem.

Leaving aside the point that not all data is text, you still need a clear conceptual model to be able to reason about your program.

In Python 3, there is a clear distinction between "characters" and "a sequence of bytes which encode those characters". They are two completely different classes and cannot be combined (e.g. a+b will always fail if a is str and b is bytes). It's also symmetrical: you convert from bytes to characters as text enters your program, and from characters to bytes as text leaves it.

(Aside: I know that Python only supports Unicode characters, but this is just an implementation limitation. There could be a third class "gb2312str" if desired, and additional classes for other character sets which are not subsets of Unicode.)

Ruby muddles these concepts by having all strings be a sequence of bytes plus the encoding, which in turn muddles the concepts of "character set" and "a method of encoding that character set".

Now, you could argue that Ruby is actually implementing the Python 3 approach but in a "lazy" way: by not explicitly converting bytes to characters until required, it avoids potentially unnecessary work. But if so, it's half-baked. For example, you cannot combine a UTF-16LE string with a UTF-16BE string, even though they are the same character set (Unicode). What's worse is that a UTF-16LE string will sort differently to a UTF-16BE string (because Ruby 1.9 sorts by byte ordering, which happens to work for UTF-8 but not for all other encodings of Unicode). So it kind of behaves like a string of characters, except that it doesn't.

Furthermore, Ruby sometimes lets you combine objects representing "characters" and "bytes", or "characters with encoding A" and "characters with encoding B". Whether it is allowed or not depends on the run-time contents of those objects.
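
The content dependence described above can be reproduced in a few lines. This is only a sketch, not something from the original thread: `concat_outcome` is a hypothetical helper, the sample strings are made up, and String#b (Ruby 2.0 and later) stands in for bytes that arrived from a socket.

```ruby
# Hypothetical helper: returns the encoding of a + b, or the exception
# class if Ruby refuses to combine the two strings.
def concat_outcome(a, b)
  (a + b).encoding
rescue Encoding::CompatibilityError => e
  e.class
end

text        = "caf\u00e9"   # UTF-8, contains a non-ASCII character
ascii_bytes = "abc".b       # ASCII-8BIT, but every byte is 7-bit
high_bytes  = "\xE9".b      # ASCII-8BIT with a byte >= 0x80

# Same two encodings in both calls; only the byte content differs.
p concat_outcome(text, ascii_bytes)
p concat_outcome(text, high_bytes)

# Same character set (Unicode), different byte order.
le = "a".encode("UTF-16LE")
be = "a".encode("UTF-16BE")
p concat_outcome(le, be)
```

The first call should give back a UTF-8 string, while the second and third should be rejected with Encoding::CompatibilityError. A test suite that only ever feeds ASCII through the "bytes" side would never see the failure.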
If a = b + c *always* crashed when b and c had different encodings, I would really not have a problem with any of this. Your test case would immediately catch it, you fix it, problem solved. However, Ruby 1.9's insidious behaviour means that b + c may *or may not* crash, depending not only on the encodings but also on the actual content of the strings at that instant. One perfectly reasonable set of tests may pass; actual application data may fail.

Finally, Ruby is asymmetrical. On input, encodings are tagged; on output, they are ignored (by default). From files, the environment encoding is used; from sockets, the ASCII_8BIT encoding is used. With regexps, invalid strings cause an exception; with String#[] they do not.

It is just an utter dog's breakfast of arbitrary rules which you have no choice but to learn. Some people see Ruby 1.9's highly complex encoding implementation as a triumph of engineering; I see it as a design smell.

> Matz and others have worked very hard to make sure that Ruby 1.9 works
> well if you follow certain rules regarding your inputs and outputs.

... which one has to absorb by osmosis. Certainly the core API docs don't give these rules; in fact they say precious little about the encoding semantics of String. And you can't get much more of a core part of the language than String.

Want to find out what String#[] does when given a string which contains invalid characters in its declared encoding? The docs won't help you. Try it and see. Or go to the C source code.

Of course, because every String is now two-dimensional (x = sequence of bytes, y = Encoding), there is a much higher requirement to document every method which acts on a string or returns a string, because there is a much larger variety of inputs and outputs to consider. Take strings with invalid characters, for example, or the fact that every returned string also has an encoding and you need to document how it is chosen.
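
In the spirit of "try it and see", here is a small sketch of the regexp versus String#[] difference mentioned above. The byte string is invented for illustration, `outcome` is a hypothetical helper, and the behaviour shown is what I would expect from a current MRI rather than anything the documentation promises.

```ruby
# Hypothetical helper: returns the block's value, or the exception class
# if the block raises ArgumentError.
def outcome
  yield
rescue ArgumentError => e
  e.class
end

# Seven bytes tagged as UTF-8; 0xFF can never appear in valid UTF-8.
bad = "abc\xFFdef".b.force_encoding(Encoding::UTF_8)
p bad.valid_encoding?

# Slicing does not complain about the invalid byte.
slice = outcome { bad[0, 4] }
p slice.bytes

# Matching the same string against a regexp is expected to raise.
match = outcome { bad =~ /def/ }
p match
```

The slice should come back as a four-byte string that is itself still invalid, while the match should fail with an ArgumentError about an invalid byte sequence.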
(For example Net::HTTP: does it return strings with encoding from the Content-Type header? You tell me)

Incidentally, strings with invalid characters are not an edge case or only for erroneous input. Ruby encourages you to do things like:

    txt = sock.read(4096) # txt likely to contain a split character at the end

This could be dealt with if explicitly converting bytes to characters at some point (you'd buffer the extra bit). By not having this explicit conversion, you are quite likely to have byte patterns which don't represent *any* character.

Yes you can do the buffering yourself; I'm just saying that all methods need to *document* whether they do accept strings with invalid bytes, and how they handle them.

> If you don't respect your encodings, they will bite you. They may not
> bite you up front (as they do with Ruby, because it exposes these
> things which are painful), but they *will* bite you.

Certainly you need to know about character sets and how they are encoded. This does not imply that ruby does it in a sane way. And as I said before, if Ruby were to bite you consistently, it would be much better.

> Ruby got it right, because it acknowledges that (a) this is hard and
> (b) gives you the tools you need in order to make this less painful.
> It also doesn't (c) incorrectly assume that everything is or can be
> expressed safely in Unicode. (Shift-JIS will not roundtrip to Unicode
> and back for some characters.)

That's kind of irrelevant, since ruby 1.9 doesn't really handle Shift-JIS either, except to transcode it.

-- 
Posted via http://www.ruby-forum.com/.