On 6/17/06, Juergen Strobel <strobel / secure.at> wrote:
> On Sun, Jun 18, 2006 at 01:02:39AM +0900, Paul Battley wrote:
>> On 17/06/06, Austin Ziegler <halostatue / gmail.com> wrote:
>>>> - This ties Ruby's String to Unicode. A safe choice IMHO, or would
>>>> we really consider something else? Note that we don't commit to a
>>>> particular encoding of Unicode strongly.
>>> This is a wash. I think that it's better to leave the options open.
>>> After all, it *is* a hope of mine to have Ruby running on iSeries
>>> (AS/400) and *that* still uses EBCDIC.
> AFAIK, EBCDIC can be losslessly converted to Unicode and back. Right?

Which code page? EBCDIC has as many code pages (including a UTF-EBCDIC)
as any other family of 8-bit encodings does.
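
Any *single* EBCDIC code page does round-trip through Unicode; the
catch is that you have to know which one you started with. A minimal
sketch, assuming Ruby 1.8's Iconv and an iconv build that knows the
IBM037 code page (a hypothetical example, not from any particular
library):

    require 'iconv'

    ebcdic = Iconv.conv('IBM037', 'UTF-8', 'HELLO WORLD') # UTF-8 -> EBCDIC bytes
    utf8   = Iconv.conv('UTF-8', 'IBM037', ebcdic)        # ...and back
    utf8 == 'HELLO WORLD'   # => true; lossless for this particular code page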

> On the other hand, do you really trust all ruby library writers to
> accept your strings tagged with EBCDIC encoding? Or do you look
> forward to a lot of manual conversions?

It depends on the purpose of the library. Very few libraries use
strings as raw byte vectors, or treat them entirely as such. I would
expect that some of the libraries I've written would work without any
problems with EBCDIC.
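
As a rough illustration of what "treating a string as a byte vector"
means (a hypothetical snippet, using Ruby 1.8's byte-oriented String):

    s = "na\303\257ve"   # "naive" with i-diaeresis: 5 characters, 6 bytes of UTF-8
    s.length             # => 6  -- String#length counts bytes in 1.8
    s[2].chr             # => "\303" -- a lone lead byte, not a character

A library that sticks to concatenation, comparison, and pass-through
never hits this; one that indexes or slices by byte is quietly assuming
a single-byte encoding.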

>> Not to mention that Matz has explicitly stated in the past that he
>> wants Ruby to support other encodings (TRON, Mojikyo, etc.) that
>> aren't compatible with a Unicode internal representation.
>>
>> Not tying String to Unicode is also the right thing to do: it allows
>> for future developments. Java's weird encoding system is entirely
>> down to the fact that it standardised on UCS-2; when codepoints
>> beyond 65535 arrived, they had to be shoehorned in via an ugly hack.
>> As far as possible, Ruby should avoid that trap.
> That's why I explicitly stated it ties Ruby's String class to Unicode
> Character Code Points, but not to a particular Unicode encoding or
> character class, and *that* was Java's main folly. (UCS-2 is a
> strictly 16 bit per character encoding, but new Unicode standards
> specify 21 bit characters, so they had to "extend" it).

Um. Do you mean UTF-32? Because there's *no* binary representation of
Unicode Character Code Points that isn't an encoding of some sort. And
if UTF-32 is what you mean, it's unacceptable as a memory
representation.
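
To put a number on the memory point (a hypothetical snippet, Ruby 1.8,
with the string literal assumed to be UTF-8): storing "plain code
points" one per cell is just UTF-32, at four bytes per character:

    utf8  = "R\303\251sum\303\251" # "Resume" with accents: 6 characters, 8 bytes of UTF-8
    cps   = utf8.unpack('U*')      # => [82, 233, 115, 117, 109, 233] -- the code points
    utf32 = cps.pack('N*')         # the same text as UTF-32BE
    [utf8.size, utf32.size]        # => [8, 24] -- triple the bytes for mostly-ASCII text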

> I am unaware of unsolvable problems with Unicode and Eastern
> languages, I asked specifically about it. If you think Unicode is
> unfixably flawed in this respect, I guess we all should write off
> Unicode now rather than later? Can you detail why Unicode is
> unacceptable as a single world wide unifying character set?
> Especially, are there character sets which cannot be converted to
> Unicode and back, which is the main requirement to have Unicode
> Strings in a non-Unicode environment?

Legacy data and performance.

-austin
-- 
Austin Ziegler * halostatue / gmail.com * http://www.halostatue.ca/
               * austin / halostatue.ca * http://www.halostatue.ca/feed/
               * austin / zieglers.ca