On Fri, Jun 16, 2006 at 05:27:04PM +0900, Paul Battley wrote:
> On 15/06/06, Juergen Strobel <strobel / secure.at> wrote:
> >On Fri, Jun 16, 2006 at 03:39:00AM +0900, Austin Ziegler wrote:
> ...
> >> It could be up to six bytes at one point. However, I think that there
> >> is still support for surrogate characters meaning that a single glyph
> >> *might* take as many as eight bytes to represent in the 1-4 byte
> >> representation. Even with that, though, those are rare and usually
> >> user-defined (private) ranges IIRC. This also doesn't deal with
> >> (de)composed glyphs/combining glyphs.
> >
> >No. According to Wikipedia, it is up to 4 bytes for plain UTF-8 for
> >all characters. Only Java may need more than that because of their use
> >of UTF-16 surrogates and special \0 handling in an intermediary step. See
>
> Austin's correct about six bytes, actually. The original UTF-8
> specification *was* for up to six bytes:
> http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt
>
> However, no codepoints were ever defined in the upper part of the
> range, and once Unicode was officially restricted to the range
> 1-0x10FFFF, there was no longer any need for the five- and six-byte
> sequences.
>
> Compare RFC 2279 from 1998 (six bytes)
> http://tools.ietf.org/html/2279
> and RFC 3629 from 2003 (four bytes)
> http://tools.ietf.org/html/3629
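As an aside, the four-byte ceiling is easy to check empirically. Here's a small sketch in Ruby; note that the Encoding API used here postdates the Ruby of this thread, so treat it as illustrative rather than something you could run on 1.8:

```ruby
# UTF-8 byte length per codepoint range, checked empirically.
# Integer#chr with an encoding argument is a Ruby 1.9+ API.
samples = {
  0x41     => 1, # 'A'            : U+0000..U+007F    -> 1 byte
  0x7FF    => 2, #                  U+0080..U+07FF    -> 2 bytes
  0xFFFF   => 3, #                  U+0800..U+FFFF    -> 3 bytes
  0x10FFFF => 4, # highest valid  : U+10000..U+10FFFF -> 4 bytes
}
samples.each do |cp, expected|
  actual = cp.chr(Encoding::UTF_8).bytesize
  puts format("U+%06X -> %d byte(s) (expected %d)", cp, actual, expected)
end
```

No codepoint above U+10FFFF is valid, so four bytes is the hard upper bound in RFC 3629 UTF-8.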

I don't care who is technically correct here; that's not the point.

But when working on Unicode support for Ruby, I think it would be best
to focus on the new, current standard before worrying about whether we
should support obsolete RFCs. We might take care to stay open to
future changes alongside old ones, but those are hard to predict and I
wouldn't waste time guessing. And Ruby is much more dynamic and less
vulnerable to such changes than, for example, Java.

Jürgen

>
> That Java encoding (UTF-8-encoded UTF-16) isn't really UTF-8, though,
> so you'd never get eight bytes in valid UTF-8:
>
>   The definition of UTF-8 prohibits encoding character numbers between
>   U+D800 and U+DFFF, which are reserved for use with the UTF-16
>   encoding form (as surrogate pairs) and do not directly represent
>   characters. (RFC 3629)
>
> Paul.
>
>
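The prohibition Paul quotes from RFC 3629 is observable in Ruby, too: surrogate codepoints simply cannot be expressed as UTF-8. A sketch (again using the 1.9+ Integer#chr API, newer than this thread):

```ruby
# Surrogate codepoints (U+D800..U+DFFF) exist only as UTF-16 artifacts;
# encoding them as UTF-8 raises a RangeError in Ruby.
def utf8_encodable?(codepoint)
  codepoint.chr(Encoding::UTF_8)
  true
rescue RangeError
  false
end

puts utf8_encodable?(0x41)     # an ordinary character: fine
puts utf8_encodable?(0xD800)   # first surrogate: rejected
puts utf8_encodable?(0x10FFFF) # highest valid codepoint: fine
```

So an eight-byte "character" can only arise from UTF-8-encoding UTF-16 surrogate pairs (CESU-8 / Java's modified UTF-8), never from valid UTF-8 itself.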

-- 
 The box said it requires Windows 95 or better so I installed Linux

