On Sun, Jun 18, 2006 at 07:21:25AM +0900, Austin Ziegler wrote:
> On 6/17/06, Juergen Strobel <strobel / secure.at> wrote:
> >On Sun, Jun 18, 2006 at 01:02:39AM +0900, Paul Battley wrote:
> >>On 17/06/06, Austin Ziegler <halostatue / gmail.com> wrote:
> >>>>- This ties Ruby's String to Unicode. A safe choice IMHO, or would
> >>>>we really consider something else? Note that we don't commit to a
> >>>>particular encoding of Unicode strongly.
> >>>This is a wash. I think that it's better to leave the options open.
> >>>After all, it *is* a hope of mine to have Ruby running on iSeries
> >>>(AS/400) and *that* still uses EBCDIC.
> >AFAIK, EBCDIC can be losslessly converted to Unicode and back. Right?
>
> Which code page? EBCDIC has as many code pages (including a UTF-EBCDIC)
> as exist in other 8-byte encodings.

Obviously, EBCDIC -> UNICODE -> same EBCDIC Codepage as before.

>
> >On the other hand, do you really trust all ruby library writers to
> >accept your strings tagged with EBCDIC encoding? Or do you look
> >forward to a lot of manual conversions?
>
> It depends on the purpose of the library. Very few libraries end up
> using byte vectors for strings or completely treat them as such. I would
> expect that some of the libraries that I've written would work without
> any problems in EBCDIC.
>
> >>Not to mention that Matz has explicitly stated in the past that he
> >>wants Ruby to support other encodings (TRON, Mojikyo, etc.) that
> >>aren't compatible with a Unicode internal representation.
> >>
> >>Not tying String to Unicode is also the right thing to do: it allows
> >>for future developments. Java's weird encoding system is entirely
> >>down to the fact that it standardised on UCS-2; when codepoints
> >>beyond 65535 arrived, they had to be shoehorned in via an ugly hack.
> >>As far as possible, Ruby should avoid that trap.
> >That's why I explicitly stated it ties Ruby's String class to Unicode
> >Character Code Points, but not to a particular Unicode encoding or
> >character class, and *that* was Java's main folly. (UCS-2 is a
> >strictly 16 bit per character encoding, but new Unicode standards
> >specify 21 bit characters, so they had to "extend" it).
>
> Um. Do you mean UTF-32? Because there's *no* binary representation of
> Unicode Character Code Points that isn't an encoding of some sort. If
> that's the case, that's unacceptable from a memory representation.

Yes, I do mean the String *interface* to be UTF-32, or pure code
points, which is the same but less susceptible to changes in the
standard, when accessed at character level. When accessed at substring
level, a substring of a String is obviously a String, and you don't
need a bitwise representation at all.

According to my proposal, Strings do not need an encoding from the
String user's point of view when working just with Strings, and users
won't care apart from memory and performance costs, which I believe
can be made good enough with a totally encapsulated, internal storage
format to be decided later. I will avoid a premature optimization
debate here for now.

Of course encoding matters when Strings are read or written somewhere,
or converted to a bit- or bytewise representation explicitly. The
Encoding Framework, however it'll look, needs to be able to convert to
and from Unicode code points for these operations only, and not
between arbitrary encodings. (You *may* code this to recode directly
from the internal storage format for performance reasons, but that'll
be transparent to the String user.)

This breaks down for characters not represented in Unicode at all, and
is a nuisance for some characters affected by the Han Unification
issue. But Unicode set out to prevent exactly this, and if we believe
in Unicode at all, we can only hope they'll fix this in an upcoming
revision.
Meanwhile we could map any additional characters (or sets of
characters) we need to higher, unused Unicode planes; that'll be no
worse than having different, possibly incompatible kinds of Strings.

We'll need an additional class for pure byte vectors, or just use
Array for this kind of work, and I think this is cleaner.

Regarding Java, they switched from UCS-2 to UTF-16 (mostly). UCS-2 is
a pure 16 bit per character encoding and cannot represent code points
above 0xffff. UTF-16 works like UTF-8, but with 16 bit chunks. But
their abstraction of a single character, the class Char(acter), is
still only 16 bits wide, which leads to confusion, similar to the C
type char, which cannot represent all real characters either. It is
even worse than in C, because C explicitly defines char to be a memory
cell of 8 bits or more, whereas Java really meant Char to be a
character.

> >I am unaware of unsolveable problems with Unicode and Eastern
> >languages, I asked specifically about it. If you think Unicode is
> >unfixably flawed in this respect, I guess we all should write off
> >Unicode now rather than later? Can you detail why Unicode is
> >unacceptable as a single world wide unifying character set?
> >Especially, are there character sets which cannot be converted to
> >Unicode and back, which is the main requirement to have Unicode
> >Strings in a non-Unicode environment?
>
> Legacy data and performance.

Map legacy data, that is, characters still not in Unicode, to a high
plane in Unicode. That way all characters can be used together all the
time. When Unicode includes them we can change that to the official
code points. Note there are no files in String's internal storage
format, so we don't have to worry about reencoding them.

I am not worried about performance. I'd code in C if I were, or Lisp.
For one, Moore's law is at work and my whole proposal was for 2.0. My
proposal only adds a constant factor to String handling; it doesn't
have higher order complexity.
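
A minimal sketch of the legacy-mapping idea above, not a proposed API.
It assumes the "high plane" is Supplementary Private Use Area-A (plane
15, from U+F0000), and every name in it (PUA_BASE, LEGACY_CHARS,
from_legacy, to_legacy) is hypothetical and exists only for
illustration:

```ruby
# Sketch only: the identifiers and the table are hypothetical.
# Legacy characters that Unicode does not (yet) encode get code points in
# Supplementary Private Use Area-A (plane 15, starting at U+F0000).
PUA_BASE = 0xF0000

# Made-up symbolic names standing in for characters of a legacy charset.
LEGACY_CHARS = [:legacy_char_a, :legacy_char_b].freeze

LEGACY_TO_CODEPOINT =
  LEGACY_CHARS.each_with_index.map { |id, i| [id, PUA_BASE + i] }.to_h
CODEPOINT_TO_LEGACY = LEGACY_TO_CODEPOINT.invert

# Import: legacy identifiers -> String of private-use code points.
def from_legacy(ids)
  ids.map { |id| LEGACY_TO_CODEPOINT.fetch(id) }.pack("U*")
end

# Export: String -> legacy identifiers (raises KeyError on anything unmapped).
def to_legacy(str)
  str.codepoints.map { |cp| CODEPOINT_TO_LEGACY.fetch(cp) }
end

# Ordinary and legacy characters live side by side in one String.
s = "abc" + from_legacy([:legacy_char_b])
puts s.codepoints.map { |cp| format("U+%04X", cp) }.join(" ")
```

If Unicode later assigns official code points to such characters, only
the table would need to change; code working at the String level would
stay the same.
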
On the other hand, conversions need to be done at different times with
my proposal than with M17N Strings, and it depends on the application
whether that is more or less often. String-String operations never
need to do recoding, as opposed to M17N Strings. I/O always needs
conversion, and may need conversion with M17N too. I have a hunch that
allowing different kinds of Strings around (as in M17N, presumably)
would require recoding far more often.

Jürgen

>
> -austin
> --
> Austin Ziegler * halostatue / gmail.com * http://www.halostatue.ca/
>                * austin / halostatue.ca * http://www.halostatue.ca/feed/
>                * austin / zieglers.ca
>

-- 
The box said it requires Windows 95 or better so I installed Linux