On Sun, Jun 18, 2006 at 07:21:25AM +0900, Austin Ziegler wrote:
> On 6/17/06, Juergen Strobel <strobel / secure.at> wrote:
> >On Sun, Jun 18, 2006 at 01:02:39AM +0900, Paul Battley wrote:
> >>On 17/06/06, Austin Ziegler <halostatue / gmail.com> wrote:
> >>>>- This ties Ruby's String to Unicode. A safe choice IMHO, or would
> >>>>we really consider something else? Note that we don't commit to a
> >>>>particular encoding of Unicode strongly.
> >>>This is a wash. I think that it's better to leave the options open.
> >>>After all, it *is* a hope of mine to have Ruby running on iSeries
> >>>(AS/400) and *that* still uses EBCDIC.
> >AFAIK, EBCDIC can be losslessly converted to Unicode and back. Right?
> 
> Which code page? EBCDIC has as many code pages (including a UTF-EBCDIC)
> as exist in other 8-bit encodings.

Obviously, EBCDIC -> Unicode -> the same EBCDIC code page as before.

> 
> >On the other hand, do you really trust all ruby library writers to
> >accept your strings tagged with EBCDIC encoding? Or do you look
> >forward to a lot of manual conversions?
> 
> It depends on the purpose of the library. Very few libraries end up
> using byte vectors for strings or completely treat them as such. I would
> expect that some of the libraries that I've written would work without
> any problems in EBCDIC.
> 
> >>Not to mention that Matz has explicitly stated in the past that he
> >>wants Ruby to support other encodings (TRON, Mojikyo, etc.) that
> >>aren't compatible with a Unicode internal representation.
> >>
> >>Not tying String to Unicode is also the right thing to do: it allows
> >>for future developments. Java's weird encoding system is entirely
> >>down to the fact that it standardised on UCS-2; when codepoints
> >>beyond 65535 arrived, they had to be shoehorned in via an ugly hack.
> >>As far as possible, Ruby should avoid that trap.
> >That's why I explicitly stated it ties Ruby's String class to Unicode
> >Character Code Points, but not to a particular Unicode encoding or
> >character class, and *that* was Java's main folly. (UCS-2 is a
> >strictly 16 bit per character encoding, but new Unicode standards
> >specify 21 bit characters, so they had to "extend" it).
> 
> Um. Do you mean UTF-32? Because there's *no* binary representation of
> Unicode Character Code Points that isn't an encoding of some sort. If
> that's the case, that's unacceptable from a memory representation.

Yes, I do mean the String *interface* to be UTF-32, or pure code
points, which is the same thing but less susceptible to standard
changes, when accessed at the character level. When accessed at the
substring level, a substring of a String is obviously a String, and
you don't need a bitwise representation at all.

Under my proposal, Strings do not need an encoding from the String
user's point of view when working just with Strings; users won't care
apart from memory and performance consumption, which I believe can be
made good enough with a totally encapsulated internal storage format
to be decided later. I will avoid a premature-optimization debate here.
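
As a rough illustration (the class and method names below are my own
invention, not a concrete API proposal), such a String could hand out
code points and substrings while keeping its storage private:

  # Hypothetical sketch only: a String-like class whose public face is
  # pure Unicode code points, with the storage format hidden behind it.
  class CodePointString
    def initialize(codepoints)
      @cp = codepoints.dup          # internal storage; could change later
    end

    def length
      @cp.length                    # counted in characters, not bytes
    end

    def [](index, len = 1)
      CodePointString.new(@cp[index, len])  # a substring is again a String
    end

    def codepoints
      @cp.dup                       # character-level access: plain integers
    end
  end

  s = CodePointString.new([0x48, 0x65, 0x6C, 0x6C, 0x6F])  # "Hello"
  s.length            # => 5
  s[1, 3].codepoints  # => [0x65, 0x6C, 0x6C]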

Of course encoding matters when Strings are read or written somewhere,
or converted to a bit- or bytewise representation explicitly. The
Encoding Framework, however it ends up looking, only needs to be able
to convert to and from Unicode code points for these operations, not
between arbitrary encodings. (You *may* code this to recode directly
from the internal storage format for performance reasons, but that
will be transparent to the String user.)
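
A minimal sketch of what I mean, with a made-up module name and only a
Latin-1 codec (again, not a concrete API proposal): recoding happens
only at the edges, and always via code points:

  # Hypothetical sketch: conversion only at the I/O boundary, and always
  # to or from Unicode code points, never encoding-to-encoding.
  module EncodingFramework                 # name is made up
    def self.decode(bytes, external)       # external encoding -> code points
      case external
      when "ISO-8859-1" then bytes         # Latin-1 maps 1:1 to U+0000..U+00FF
      else raise "no codec for #{external}"
      end
    end

    def self.encode(codepoints, external)  # code points -> external encoding
      case external
      when "ISO-8859-1"
        codepoints.map { |cp| cp <= 0xFF ? cp : raise("not representable") }
      else raise "no codec for #{external}"
      end
    end
  end

  bytes      = [0x47, 0x72, 0xFC, 0xDF, 0x65]   # "Grüße" in Latin-1
  codepoints = EncodingFramework.decode(bytes, "ISO-8859-1")
  EncodingFramework.encode(codepoints, "ISO-8859-1") == bytes   # => true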

This breaks down for characters not represented in Unicode at all, and
is a nuisance for some characters affected by the Han Unification
issue.  But Unicode set out to prevent exactly this, and if we believe
in Unicode at all, we can only hope they'll fix this in an upcoming
revision. Meanwhile we could map any additional characters (or sets of
them) we need to higher, unused Unicode planes; that'll be no worse
than having different, possibly incompatible kinds of Strings.
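
Concretely, something like the following (the registry and names are
made up; this only shows the idea of parking characters in a
private-use plane until Unicode assigns them official code points):

  # Hypothetical sketch: map not-yet-encoded characters into Plane 15
  # (Supplementary Private Use Area-A, starting at U+F0000).
  PRIVATE_BASE = 0xF0000
  UNENCODED    = { "some-vendor-glyph" => 0 }   # made-up legacy registry

  def private_codepoint(name)
    PRIVATE_BASE + UNENCODED.fetch(name)
  end

  private_codepoint("some-vendor-glyph")  # => 0xF0000, usable next to real Unicode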

We'll need an additional class for pure byte vectors, or we could just
use Array for this kind of work; I think that separation is cleaner.
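
For byte-level work, plain arrays of integers (via pack/unpack) already
go a long way; a trivial example, with placeholder file names:

  # Treating binary data as an Array of byte values, with no character
  # semantics attached. "data.bin" and "copy.bin" are made-up names.
  raw   = File.open("data.bin", "rb") { |f| f.read }
  bytes = raw.unpack("C*")            # => Array of integers 0..255
  bytes[0, 4]                         # e.g. inspect a magic number
  File.open("copy.bin", "wb") { |f| f.write(bytes.pack("C*")) }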

Regarding Java: they switched from UCS-2 to UTF-16 (mostly). UCS-2 is
a pure 16-bits-per-character encoding and cannot represent code points
above 0xFFFF. UTF-16 works like UTF-8, but with 16-bit units. Their
abstraction of a single character, the class Char(acter), is still
only 16 bits wide, which leads to confusion, similar to the C type
char, which cannot represent all real characters either. It is even
worse than in C, because C explicitly defines char to be a memory cell
of 8 bits or more, whereas Java really meant Char to be a character.
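
To make the 16-bit limitation concrete: every code point above 0xFFFF
has to be split into a surrogate pair, i.e. two Java chars for one
character. The arithmetic, sketched in Ruby:

  # How UTF-16 spends two 16-bit units on U+1D11E (MUSICAL SYMBOL G CLEF):
  cp = 0x1D11E
  v  = cp - 0x10000            # 20-bit value to split
  hi = 0xD800 + (v >> 10)      # high (lead) surrogate
  lo = 0xDC00 + (v & 0x3FF)    # low (trail) surrogate
  "%04X %04X" % [hi, lo]       # => "D834 DD1E" -- one character, two 16-bit units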

> >I am unaware of unsolvable problems with Unicode and Eastern
> >languages, I asked specifically about it. If you think Unicode is
> >unfixably flawed in this respect, I guess we all should write off
> >Unicode now rather than later? Can you detail why Unicode is
> >unacceptable as a single world wide unifying character set?
> >Especially, are there character sets which cannot be converted to
> >Unicode and back, which is the main requirement to have Unicode
> >Strings in a non-Unicode environment?
> 
> Legacy data and performance.

Map legacy data, that is, characters still not in Unicode, to a high
plane in Unicode. That way all characters can be used together all the
time. When Unicode includes them, we can switch to the official code
points. Note there are no files in String's internal storage format,
so we don't have to worry about re-encoding them.

I am not worried about performance; if I were, I'd code in C (or
Lisp).

For one, Moore's law is at work, and my whole proposal was for 2.0. My
proposal only adds a constant factor to String handling; it doesn't
have higher-order complexity.

On the other hand, conversions need to be done at different times with
my proposal than with M17N Strings, and it depends on the application
whether that is more or less often.  String-to-String operations never
need to do recoding, as opposed to M17N Strings. I/O always needs
conversion, and may need conversion with M17N too. I have a hunch that
allowing different kinds of Strings around (as M17N presumably does)
would require recoding far more often.

Jgen

> 
> -austin
> -- 
> Austin Ziegler * halostatue / gmail.com * http://www.halostatue.ca/
>               * austin / halostatue.ca * http://www.halostatue.ca/feed/
>               * austin / zieglers.ca
> 
> 

-- 
 The box said it requires Windows 95 or better so I installed Linux
