
On Thu, Jun 15, 2006 at 06:34:11AM +0900, Randy Kramer wrote:
> On Wednesday 14 June 2006 06:01 am, Juergen Strobel wrote:
> > For my personal vision of "proper" Unicode support, I'd like to have
> > UTF-8 the standard internal string format, and Unicode Points the
> > standard character code, and *all* String functions to just work
> > intuitively "right" on a character base rather than byte base. Thus
> > the internal String encoding is a technical matter only, as long as it
> > is capable of supporting all Unicode characters, and these internal
> > details are not exposed via public methods.
> 
> Maybe Juergen is saying the same thing I'm going to say, but since I don't 
> understand / recall what UTF-8 encoding is exactly:

Wikipedia has decent articles on Unicode at http://en.wikipedia.org/wiki/UniCode.

Basically, Unicode gives every character worldwide a unique number,
called a code point. Since these numbers can be quite large (currently
up to 21 bits), and western users especially tend to use only a tiny
subset, different encodings were created to save space, or to remain
backward compatible with 7-bit ASCII.
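
To make that concrete, here is a tiny sketch in Python (Python purely
as illustration; the point is language independent):

    # ord() maps a character to its Unicode code point, chr() goes back.
    for ch in ("A", "é", "€"):
        print(f"U+{ord(ch):04X} {ch}")   # U+0041 A, U+00E9 é, U+20AC €
    assert chr(0x20AC) == "€"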

UTF-8 encodes every Unicode code point as a variable-length sequence
of 1 to 4 bytes. Most western symbols require only 1 or 2 bytes. This
encoding is space efficient, and ASCII compatible as long as only
7-bit characters are used. Certain string operations are quite hard or
inefficient, though: given a byte stream, the position of a character,
or even the length of a string, is uncertain without counting actual
characters (no pointer/index arithmetic!).
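
A quick Python illustration of both properties, the variable byte
length and the byte/character index mismatch:

    # Each code point takes 1 to 4 bytes in UTF-8.
    for ch in ("A", "é", "€", "𝄞"):
        b = ch.encode("utf-8")
        print(f"U+{ord(ch):04X} -> {len(b)} byte(s)")
    # U+0041 -> 1, U+00E9 -> 2, U+20AC -> 3, U+1D11E -> 4

    s = "héllo"
    print(len(s), len(s.encode("utf-8")))  # 5 characters, but 6 bytes:
                                           # byte offsets != char positions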

UTF-32 encodes every code point as a single 32-bit word. This enables
simple, efficient substring access, but wastes space.
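
In the same Python terms (utf-32-le just avoids the byte-order mark):

    import struct

    s = "A€𝄞"
    utf32 = s.encode("utf-32-le")       # fixed width: 4 bytes per code point
    assert len(utf32) == 4 * len(s)
    # The n-th character always sits at byte offset 4*n, so indexing is O(1):
    cp = struct.unpack_from("<I", utf32, 4 * 2)[0]
    assert chr(cp) == "𝄞"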

Other encodings have yet different characteristics, but all deal with
encoding the same code points. A Unicode String class should expose
code points, or sequences of code points (characters), not the
internal encoding used to store them; that is the core of my
argument.
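
As a deliberately naive sketch (a hypothetical class, not any existing
library), such an interface might look like this:

    class CodePointString:
        """Stores UTF-8 internally but exposes only code points."""

        def __init__(self, text):
            self._utf8 = text.encode("utf-8")  # internal detail, never exposed

        def __len__(self):
            # Length in characters, not bytes (decoding makes this O(n)).
            return len(self._utf8.decode("utf-8"))

        def __getitem__(self, i):
            # The i-th code point, however many bytes it occupies internally.
            return self._utf8.decode("utf-8")[i]

    s = CodePointString("h€llo")
    assert len(s) == 5 and s[1] == "€"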

> I'm beginning to think (with a newbie sort of perspective) that Unicode is too 
> complicated to deal with inside a program.  My suggestion would be that 
> Unicode be an external format...
> 
> What I mean is, when you have a program that must handle international text, 
> convert the Unicode to a fixed width representation for use by the program.   
> Do the processing based on these fixed width characters.  When it's complete, 
> convert it back to Unicode for output.

UTF-32 would be such an encoding. It uses four times the space of
simple 7-bit ASCII, but with such a dramatically larger total
character set, some tradeoffs are unavoidable.
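
The factor of four is easy to see:

    ascii_text = "hello world" * 100
    print(len(ascii_text.encode("utf-8")))      # 1100 bytes
    print(len(ascii_text.encode("utf-32-le")))  # 4400 bytes: 4x for pure ASCII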

> It seems to me that would make a lot of things easier.
> 
> Then I might have two basic "types" of programs--programs that can handle any 
> text (i.e., international), and other programs that can handle only English 
> (or maybe only European languages that can work with an 8 bit byte).  (I 
> suggest these two types of programs because I suspect those that have to 
> handle the international character set will be slower than those that don't.)
> 
> Aside: What would that take to handle all the characters / ideographs (is that 
> what they call them, the Japanese, Chinese, ... characters) presently in use 
> in the world--iirc, 16 bits (2**16) didn't cut it for Unicode--would 32 bits?

> Randy Kramer

Currently Unicode requires 21 bits, but this has changed in the past.
Java got bitten by that: it defined its character type as 16 bits and
hardcoded this in the VM, and now it needs kludges (surrogate pairs)
for code points above U+FFFF.
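
A surrogate pair splits one such code point across two 16-bit units.
Again in Python terms, just to illustrate:

    ch = "𝄞"                        # U+1D11E, outside the 16-bit range
    utf16 = ch.encode("utf-16-be")
    print(len(utf16) // 2)          # 2 code units for 1 character
    print(utf16.hex(" ", 2))        # "d834 dd1e": high + low surrogate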

A split into simple and Unicode-aware programs would divide code into
two camps, which would remain slightly incompatible or require dirty
hacks. I'd rather prolong the status quo, where Strings can be seen to
contain bytes in whatever encoding the user sees fit, but might break
when used with foreign code that has other notions of encoding.

Jürgen

-- 
 The box said it requires Windows 95 or better so I installed Linux
