Chris White wrote in post #1020283:
> Python won't save you from the complexities of encoding.

No, but the language *does* have very clearly defined semantics. It also 
makes a clear distinction between "a character" and "an encoded 
representation of that character", and it has two distinct classes for 
those things.

It is then pretty much foolproof: if you forget to decode your bytes 
into characters, or encode your characters into bytes, or try to combine 
characters with bytes, then you'll get an immediate and consistent 
runtime error.
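
To make that concrete, here is a minimal Python 3 sketch of what I mean. The helper name `try_concat` is my own, purely for illustration; the point is only that mixing `str` and `bytes` fails at once, whatever the data happens to be.

```python
text = "café"                  # str: a sequence of Unicode characters
data = text.encode("utf-8")    # bytes: one encoded representation of them


def try_concat(a, b):
    """Return a + b, or the name of the exception raised (hypothetical helper)."""
    try:
        return a + b
    except TypeError as exc:
        return type(exc).__name__


mixed = try_concat(text, data)                   # str + bytes is refused
fixed = try_concat(text, data.decode("utf-8"))   # decode first, then combine

print(len(text), len(data))   # 4 characters, 5 bytes
print(mixed)
print(fixed)
```

The failure doesn't depend on whether the bytes happen to be plain ASCII, which is why a program that works once keeps working with different input.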

Ruby has a hazy notion of these things, and hazy(*) rules which allow 
you sometimes to combine strings of characters and binary strings, and 
sometimes not. If your program runs successfully once, it doesn't mean 
that it's going to run successfully with different input data.

Furthermore, any library in Ruby 1.9 which either accepts a String or 
returns a String needs to document its encoding-related behaviour; 
almost none of them do. In Python 3, all you have to say is whether it 
uses str (characters) or bytes.

(*) Even data which I *explicitly* tag as being BINARY is taken to be 
ASCII-8BIT, whether that is true or not.

> You have to
> remember, Ruby has its base in Japan. In Japan you roughly have the
> following encodings to deal with:
> - UTF*
> - EUC-JP
> - SJIS
> - ISO-2022-JP

The confusion between "encodings" and "character sets" is pretty 
endemic, and I have fallen prey to it myself many times.

Python partly dodges this issue because it supports only one character 
set - Unicode - and then various encodings of it (like UTF*) and 
encodings of subsets of the character set (like ISO-8859-*).
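
A small sketch of that model, using only the standard codecs: one character, several byte representations, and an encoding that covers only a subset of the character set. The helper `can_encode` is a name I made up for this example.

```python
ch = "é"                          # one Unicode character, U+00E9

utf8 = ch.encode("utf-8")         # two bytes
latin1 = ch.encode("latin-1")     # one byte (ISO-8859-1)
utf16 = ch.encode("utf-16-le")    # two bytes, different from the UTF-8 ones


def can_encode(s, codec):
    """True if every character of s exists in the codec's repertoire."""
    try:
        s.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# ISO-8859-1 has no euro sign; ISO-8859-15 does.
euro_in_latin1 = can_encode("€", "latin-1")
euro_in_latin9 = can_encode("€", "iso8859-15")

print(utf8, latin1, utf16)
print(euro_in_latin1, euro_in_latin9)
```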

I understand that there are various Asian character sets which are not 
proper subsets of Unicode, and so can't be converted losslessly to and 
from Unicode. If Python 3 were to be extended to handle them, then I 
imagine there would be separate classes for EUCJPString and GB2312String 
or whatever, and methods to transcode between them (and options for what 
to do about missing characters).

And of course, Ruby 1.9 doesn't really handle ISO-2022-JP anyway, 
because it's a stateful encoding; I'm pretty sure you can't index or 
take the length or regexp-match an ISO-2022-JP string in Ruby 1.9 
without first transcoding it.
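
For comparison, this is roughly how the stateful nature of ISO-2022-JP shows up on the Python 3 side, assuming the standard `iso2022_jp` codec. It is a sketch rather than a statement about Ruby's internals: the encoded bytes carry escape sequences that switch character sets, so length and indexing only make sense after decoding.

```python
text = "こんにちは"                        # 5 characters

encoded = text.encode("iso2022_jp")        # bytes, including escape sequences
decoded = encoded.decode("iso2022_jp")     # back to characters

has_escape = b"\x1b" in encoded            # ESC marks a character-set switch
char_count = len(decoded)                  # counts characters
byte_count = len(encoded)                  # counts bytes, escapes included

# Two bytes cut from the middle lose the preceding escape sequence, so on
# their own they are read as ASCII rather than as the first kana.
fragment = encoded[3:5].decode("iso2022_jp")

print(char_count, byte_count, has_escape)
print(repr(fragment))
```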

> This is just a very broad generalization. There are even more issues
> such as multiple versions of SJIS.

Absolutely. So it's vital to have a clear distinction between

      encoded sequence  <----------->  set of characters
      of bytes

which Python 3 has; whereas Ruby 1.9 tries to work with the encoded 
sequence of bytes as-is, hoping you've remembered to tag the encoding 
correctly every time, and remorselessly tagging binary data as being 
text anyway.
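
As an illustration of why the "multiple versions of SJIS" point matters, here is a sketch using two codecs that ship with Python 3, `shift_jis` and `cp932` (the Microsoft variant). The same two bytes decode to different characters depending on which one you name, so the bytes alone never tell you what the text is.

```python
raw = b"\x81\x60"                   # one double-byte Shift_JIS sequence

as_sjis = raw.decode("shift_jis")   # WAVE DASH, U+301C
as_cp932 = raw.decode("cp932")      # FULLWIDTH TILDE, U+FF5E

same_bytes_same_text = (as_sjis == as_cp932)

print(hex(ord(as_sjis)), hex(ord(as_cp932)))
print(same_bytes_same_text)
```

Once the data is decoded, the ambiguity is gone: each result is a definite sequence of characters, and any mistake was made in one visible place (the decode call) rather than somewhere downstream.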

> just remember that every language has its ups and
> downs. Python 3 for example has many external libraries, including
> Django and some of the ui toolkits, that are not supported.

That's true, and it's Django which keeps me from skipping Python 2 
entirely and just going to 3.

Regards,

Brian.
