On 6/17/06, Juergen Strobel <strobel / secure.at> wrote:
> On Sun, Jun 18, 2006 at 01:02:39AM +0900, Paul Battley wrote:
>> On 17/06/06, Austin Ziegler <halostatue / gmail.com> wrote:
>>>> - This ties Ruby's String to Unicode. A safe choice IMHO, or would
>>>> we really consider something else? Note that we don't commit to a
>>>> particular encoding of Unicode strongly.
>>> This is a wash. I think that it's better to leave the options open.
>>> After all, it *is* a hope of mine to have Ruby running on iSeries
>>> (AS/400) and *that* still uses EBCDIC.
> AFAIK, EBCDIC can be losslessly converted to Unicode and back. Right?

Which code page? EBCDIC has as many code pages (including a UTF-EBCDIC)
as exist in other 8-bit encodings.

> On the other hand, do you really trust all ruby library writers to
> accept your strings tagged with EBCDIC encoding? Or do you look
> forward to a lot of manual conversions?

It depends on the purpose of the library. Very few libraries end up
using byte vectors for strings or completely treat them as such. I
would expect that some of the libraries that I've written would work
without any problems in EBCDIC.

>> Not to mention that Matz has explicitly stated in the past that he
>> wants Ruby to support other encodings (TRON, Mojikyo, etc.) that
>> aren't compatible with a Unicode internal representation.
>>
>> Not tying String to Unicode is also the right thing to do: it allows
>> for future developments. Java's weird encoding system is entirely
>> down to the fact that it standardised on UCS-2; when codepoints
>> beyond 65535 arrived, they had to be shoehorned in via an ugly hack.
>> As far as possible, Ruby should avoid that trap.
> That's why I explicitly stated it ties Ruby's String class to Unicode
> Character Code Points, but not to a particular Unicode encoding or
> character class, and *that* was Java's main folly. (UCS-2 is a
> strictly 16 bit per character encoding, but new Unicode standards
> specify 21 bit characters, so they had to "extend" it).

Um. Do you mean UTF-32? Because there's *no* binary representation of
Unicode Character Code Points that isn't an encoding of some sort. If
that's the case, that's unacceptable as an in-memory representation.

> I am unaware of unsolvable problems with Unicode and Eastern
> languages, I asked specifically about it. If you think Unicode is
> unfixably flawed in this respect, I guess we all should write off
> Unicode now rather than later? Can you detail why Unicode is
> unacceptable as a single world wide unifying character set?
> Especially, are there character sets which cannot be converted to
> Unicode and back, which is the main requirement to have Unicode
> Strings in a non-Unicode environment?

Legacy data and performance.

-austin
-- 
Austin Ziegler * halostatue / gmail.com * http://www.halostatue.ca/
               * austin / halostatue.ca * http://www.halostatue.ca/feed/
               * austin / zieglers.ca
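
For readers who want to see the size trade-off behind the UCS-2 and
UTF-32 remarks in this thread, here is a minimal sketch. It assumes a
Ruby with String#encode (1.9 or later, which postdates this message),
and the helper name encoded_sizes is made up for illustration. It only
counts bytes per encoding and shows the surrogate pair that UTF-16 needs
for a code point above 65535; it says nothing about how any Ruby
implementation stores strings internally.

```ruby
# Hypothetical helper: byte counts for the same text in three
# Unicode encodings (big-endian forms, so no byte order mark is added).
def encoded_sizes(text)
  {
    utf8:  text.encode("UTF-8").bytesize,
    utf16: text.encode("UTF-16BE").bytesize,
    utf32: text.encode("UTF-32BE").bytesize
  }
end

ascii = "Ruby"        # four code points, all in the ASCII range
clef  = "\u{1D11E}"   # MUSICAL SYMBOL G CLEF, a code point above U+FFFF

p encoded_sizes(ascii)  # UTF-32 spends four bytes on every code point
p encoded_sizes(clef)   # UTF-16 needs two 16-bit units here

# Split the UTF-16 bytes into 16-bit units to show the surrogate pair.
units = clef.encode("UTF-16BE").unpack("n*").map { |u| format("%04X", u) }
p units
```

For plain ASCII text the UTF-32 form is four times the size of the UTF-8
form, which is the memory cost objected to above. The two 16-bit units
printed for the clef are the surrogate pair that a strictly 16-bit UCS-2
cannot express.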