On Wednesday, November 24, 2010 05:14:15 am Brian Candler wrote:
> For example, an expression like
> 
>    s1 = s2 + s3
> 
> where s2 and s3 are both Strings will always work and do the obvious
> thing in 1.8, but in 1.9 it may raise an exception. Whether it does
> depends not only on the encodings of s2 and s3 at that point, but also
> their contents (properties "empty?" and "ascii_only?")

In 1.8, if those strings aren't in the same encoding, it will blindly 
concatenate them as binary values, which may result in a corrupt and 
nonsensical string.

It seems to me that the obvious thing is to raise an error when there's an 
error, instead of silently corrupting your data.
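
To make that concrete, here's the 1.9 rule in miniature (Ruby 1.9+, with a 
magic comment for the source encoding; the variable names are mine):

    # encoding: utf-8
    utf8  = "héllo"                        # tagged UTF-8, from the source file
    latin = "héllo".encode("ISO-8859-1")   # same text, different bytes

    utf8 + "world"   # => "hélloworld" -- "world" is ascii_only?, so compatible
    utf8 + latin     # raises Encoding::CompatibilityError in 1.9;
                     # 1.8 would have blindly spliced the bytes together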

> This
> means the same program with the same data may work on your machine, but
> crash on someone else's.

Better, again, than working on my machine but corrupting on someone else's. 
At least if it crashes, there's hopefully a bug report -- and maybe even a 
fix -- _before_ it corrupts someone's data, not after.

> https://github.com/candlerb/string19/blob/master/string19.rb
> https://github.com/candlerb/string19/blob/master/soapbox.rb

From your soapbox.rb:

* Whether or not you can reason about whether your program works, you will
  want to test it. 'Unit testing' is generally done by running the code with
  some representative inputs, and checking if the output is what you expect.
  
  Again, with 1.8 and the simple line above, this was easy. Give it any two
  strings and you will have sufficient test coverage.

Nope. All that proves is that you can get a string back. It says nothing about 
whether the resultant string makes sense.
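
Concretely -- simulating 1.8's byte-level concatenation with binary strings 
(runnable under 1.9; the names are mine):

    utf8_bytes  = "caf\xC3\xA9".force_encoding("BINARY")  # "café" as UTF-8 bytes
    latin_bytes = "na\xEFve".force_encoding("BINARY")     # "naïve" as Latin-1 bytes
    mixed = utf8_bytes + latin_bytes

    mixed.is_a?(String)                            # => true -- the naive test passes
    mixed.force_encoding("UTF-8").valid_encoding?  # => false -- but it's mojibake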

More relevantly:

* It solves a non-problem: how to write a program which can juggle multiple
  string segments all in different encodings simultaneously.  How many
  programs do you write like that? And if you do, can't you just have
  a wrapper object which holds the string and its encoding?

Let's see... Pretty much every program, ever, particularly web apps. The end-
user submits something in the encoding of their choice. At the very least, I 
may have to convert it to store it in a database. Or it may make more sense 
to store it in whatever encoding it arrived in, in which case the simple act 
of displaying two comments on a website involves exactly this sort of 
concatenation.
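
Under 1.9 that pattern is at least straightforward -- a hedged sketch, where 
the Latin-1 literal stands in for whatever a user actually submitted:

    # encoding: utf-8
    comment_a = "Grüße aus Berlin".encode("ISO-8859-1")  # one user's comment
    comment_b = "日本語のコメント"                        # another, already UTF-8

    # transcode at the boundary, then concatenate freely:
    page = comment_a.encode("UTF-8") + "<hr>" + comment_b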

Or maybe I pull from multiple web services. Something as simple and common as 
a "trackback" would again involve concatenating multiple strings from 
potentially different encodings.

* It's pretty much obsolete, given that the whole world is moving to UTF-8
  anyway.  All a programming language needs is to let you handle UTF-8 and
  binary data, and for non-UTF-8 data you can transcode at the boundary. 
  For stateful encodings you have to do this anyway.

Java at least did this sanely -- UTF-16 is fixed-width, at least for the 
Basic Multilingual Plane. If you're going to force a single encoding, why 
wouldn't you use a fixed-width one?

Oh, that's right -- UTF-16 spends two bytes on every character, wasting half 
your RAM on mostly-ASCII text. So UTF-8 makes the most sense... in the US.

The whole point of having multiple encodings in the first place is that other 
encodings make much more sense when you're not in the US.
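
The trade-off is easy to measure (Ruby 1.9+, UTF-8 source):

    # encoding: utf-8
    "hello world".encode("UTF-8").bytesize     # => 11
    "hello world".encode("UTF-16LE").bytesize  # => 22 -- double the RAM for ASCII
    "日本語".encode("UTF-8").bytesize           # => 9
    "日本語".encode("UTF-16LE").bytesize        # => 6  -- and the reverse elsewhere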

* It's ill-conceived. Knowing the encoding is sufficient to pick characters
  out of a string, but other operations (such as collation) depend on the
  locale.  And in any case, the encoding and/or locale information is often
  carried out-of-band (think: HTTP; MIME E-mail; ASN1 tags), or within the
  string content (think: <?xml charset?>)

How does any of this help me once I've read the string?

* It's too stateful. If someone passes you a string, and you need to make
  it compatible with some other string (e.g. to concatenate it), then you
  need to force its encoding.

You only need to do this if the string was in the wrong encoding in the first 
place. If I pass you a UTF-16 string, it's not polite at all (whether you dup 
it first or not) to just stick your fingers in your ears, go "la la la", and 
pretend it's UTF-8 so you can concatenate it. The resulting string will be 
valid as neither, and I can't imagine what it'd be useful for.
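
In 1.9 terms, the impolite and polite versions are force_encoding vs. encode 
-- one relabels the bytes, the other actually transcodes them (a sketch; the 
variable is mine):

    s = "hello".encode("UTF-16LE")

    s.dup.force_encoding("UTF-8")  # "h\x00e\x00l\x00l\x00o\x00" -- relabeled garbage
    s.encode("UTF-8")              # => "hello" -- actually transcoded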

You do seem to have some legitimate complaints, but they are somewhat 
undermined by the fact that you seem to want to pretend Unicode doesn't exist. 
As you noted:

"However I am quite possibly alone in my opinion.  Whenever this pops up on
ruby-talk, and I speak out against it, there are two or three others who
speak out equally vociferously in favour.  They tell me I am doing the
community a disservice by warning people away from 1.9."

Warning people away from 1.9 entirely, and from character encoding in 
particular, because of the problems you've pointed out, does seem incredibly 
counterproductive. It'd make a lot more sense to try to fix the real problems 
you've identified -- if it really is "buggy as hell", I imagine the ruby-core 
people could use your help.