--azLHFNyN32YCQGCU
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Thu, Jun 15, 2006 at 06:34:11AM +0900, Randy Kramer wrote:
> On Wednesday 14 June 2006 06:01 am, Juergen Strobel wrote:
> > For my personal vision of "proper" Unicode support, I'd like to have
> > UTF-8 the standard internal string format, and Unicode Points the
> > standard character code, and *all* String functions to just work
> > intuitively "right" on a character base rather than byte base. Thus
> > the internal String encoding is a technical matter only, as long as it
> > is capable of supporting all Unicode characters, and these internal
> > details are not exposed via public methods.
>
> Maybe Juergen is saying the same thing I'm going to say, but since I don't
> understand / recall what UTF-8 encoding is exactly:

Wikipedia has decent articles on Unicode at
http://en.wikipedia.org/wiki/UniCode.

Basically, Unicode gives every character worldwide a unique number,
called a code point. Since these numbers can be quite large (currently
up to 21 bits), and Western users in particular usually need only a
tiny subset, different encodings were created to save space or to
remain backward compatible with 7-bit ASCII.

UTF-8 encodes every Unicode code point as a variable-length sequence of
1 to 4 bytes. Most Western symbols require only 1 or 2 bytes. This
encoding is space efficient, and ASCII compatible as long as only 7-bit
characters are used. Certain string operations are quite hard or
inefficient, because the position of a character, or even the length of
a string, cannot be determined from the byte stream without counting
the actual characters (no pointer/index arithmetic!).

UTF-32 encodes every code point as a single 32-bit word. This enables
simple, efficient substring access, but wastes space.

Other encodings have yet other characteristics, but all of them encode
the same code points.
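The trade-off described above can be sketched in a few lines of Python.
This is only an illustration under my own assumptions: the helper names
encoded_sizes, char_at_utf32 and char_at_utf8 are hypothetical, and the
sample characters are arbitrary picks, not anything from this thread.

```python
def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-32 byte count) for one code point."""
    # "utf-32-be" fixes the byte order, so no byte-order mark is prepended.
    return len(ch.encode("utf-8")), len(ch.encode("utf-32-be"))


def char_at_utf32(data, index):
    """Fixed width: code point i occupies bytes [4*i, 4*i + 4)."""
    return data[4 * index:4 * index + 4].decode("utf-32-be")


def char_at_utf8(data, index):
    """Variable width: the byte stream must be scanned to find code point i."""
    # Continuation bytes look like 10xxxxxx; every other byte starts a
    # new code point.
    starts = [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]
    start = starts[index]
    end = starts[index + 1] if index + 1 < len(starts) else len(data)
    return data[start:end].decode("utf-8")


# Arbitrary samples: U+0041, U+00FC, U+20AC, U+1D11E.
for ch in ["A", "\u00fc", "\u20ac", "\U0001d11e"]:
    utf8_len, utf32_len = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_len} byte(s), UTF-32 {utf32_len} bytes")

text = "J\u00fcrgen"
print(len(text))                  # 6 code points
print(len(text.encode("utf-8")))  # 7 bytes, since U+00FC takes two
```

Note that char_at_utf32 is plain index arithmetic, while char_at_utf8
has to look at every byte before it can answer. That is the cost the
paragraph on UTF-8 refers to.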
A Unicode String class should expose code points, or sequences of code
points (characters), not the internal encoding used to store them.
That is the core of my argument.

> I'm beginning to think (with a newbie sort of perspective) that Unicode is too
> complicated to deal with inside a program. My suggestion would be that
> Unicode be an external format...
>
> What I mean is, when you have a program that must handle international text,
> convert the Unicode to a fixed width representation for use by the program.
> Do the processing based on these fixed width characters. When it's complete,
> convert it back to Unicode for output.

UTF-32 would be such an encoding. It uses four times the space for
simple 7-bit ASCII characters, but with such a dramatically larger
total character set, some tradeoffs are unavoidable.

> It seems to me that would make a lot of things easier.
>
> Then I might have two basic "types" of programs--programs that can handle any
> text (i.e., international), and other programs that can handle only English
> (or maybe only European languages that can work with an 8 bit byte). (I
> suggest these two types of programs because I suspect those that have to
> handle the international character set will be slower than those that don't.)
>
> Aside: What would that take to handle all the characters / ideographs (is that
> what they call them, the Japanese, Chinese, ... characters) presently in use
> in the world--iirc, 16 bits (2**16) didn't cut it for Unicode--would 32 bits?
>
> Randy Kramer

Currently Unicode requires 21 bits, but this has changed in the past.
Java got bitten by that: it defined its character type as 16 bits wide
and hardcoded this in the VM, and now it needs kludges (surrogate
pairs) to represent code points beyond that range.

A split into simple and Unicode-aware programs will divide code into
two camps, which will remain slightly incompatible or require dirty
hacks.
I'd rather prolong the status quo, where Strings can be seen to contain
bytes in whatever encoding the user sees fit, but might break if used
with foreign code which has other notions of encoding.

Jürgen

-- 
The box said it requires Windows 95 or better so I installed Linux

--azLHFNyN32YCQGCU
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: Digital signature
Content-Disposition: inline

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.3 (GNU/Linux)

iQEVAwUBRJGkNvy64gyiEfXtAQKL8AgA0puc5uMkzaFkWJ+MFpNztP4Kd5n3o43n
wpIM24AWLzAoMdxgUZjNHR6rFl7/TOXRUcfbgHmlZDxtfvRr9JIGVf0slm8XgKkg
c9Xoh4qMQcG1jItFCWDJOJNm/Kia2LZ1Mz/6CB5ODMy3MTcxBecpWKPr/Y7LCGY2
bnBaTt9VjKAGlBqxT6Ov1MGhBuVr047EMPyn4FnckGkfftHahjUErvzde2sOyQKH
V4+HCpAtT0854af6X4c/AKy8Sh+iEEJYZsfVhGdiNkcZldZnsKBlHlR2PuG3Uit3
v3obTYQvTqPNDdO1d1XIgu+S45DBT84GuyGoobx6sjvlFEEHwTHxUg
BI
-----END PGP SIGNATURE-----

--azLHFNyN32YCQGCU--