On Tue, Apr 8, 2008 at 3:15 PM, Thomas Kellerer
<YQDHXVLMUBXG / spammotel.com> wrote:
> Bill Kelly, 08.04.2008 14:51:
> > > Unicode (and a relevant encoding such as UTF8) should be the *standard*
> > > for all (new) programming languages and not an exception.
> > >
> >
> > Apparently not, as One Character Encoding to Rule Them All is not
> > considered satisfactory to many people.
> >
>  But Unicode/UTF8 would at least satisfy a *lot* more people than plain
> ASCII or 8bit encodings (such as ISO-8859-x)

Would it?

It comes with a very significant hit in the speed of Regex processing,
at least with the current implementation.

Enough to, for many applications, negate the speed benefit of everything
that has been optimized from 1.8 to 1.9.  This has been shown in
speed reports on this mailing list previously.


I'll note that I work in a non-US character set, and in my experience,
UTF support in a programming language has only been in the way.  So
far, what has been useful to me has always been to have strings be
lists of bytes.

I do not doubt that there are use cases where the support is useful; it
is just that so far I haven't come across them, or the support that
has been there has been unobtrusive enough that I haven't noticed that
it was useful (but I don't think so - all the data I have fits nicely in
ISO-Latin-1, because everything I work with comes from western Europe or
is in English.)

Note that this may sound like I am against transparent UTF-8 support -
that's not necessarily so.  I just want to make sure that people are
aware that there are (many) use cases where the support isn't just
neutral, it is actually a drawback (loss of speed, extra complexity, not
knowing whether the result of string.length actually means you can put
the string in a field of that length), so the upsides had better be
worthwhile.
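
To make the string.length point concrete, here is a minimal sketch,
assuming 1.9-style encoding-aware strings where length counts characters
and bytesize counts bytes.  The helper fits_in_field? and the sample name
are made up for illustration; they are not from any library.

```ruby
# Hypothetical helper: does str fit in a field limited to max_bytes bytes?
# It compares the byte count, not the character count.
def fits_in_field?(str, max_bytes)
  str.bytesize <= max_bytes
end

name = "Bj\u00F8rn"            # "Bjørn" as a UTF-8 string

chars        = name.length     # number of characters
utf8_bytes   = name.bytesize   # number of bytes in UTF-8
latin1_bytes = name.encode("ISO-8859-1").bytesize  # bytes in ISO-Latin-1

puts "chars=#{chars} utf8_bytes=#{utf8_bytes} latin1_bytes=#{latin1_bytes}"
puts "fits in 5-byte field as UTF-8?   #{fits_in_field?(name, 5)}"
puts "fits in 5-byte field as Latin-1? #{fits_in_field?(name.encode('ISO-8859-1'), 5)}"
```

With a one-byte-per-character encoding the two counts agree, which is the
property I have been relying on; with UTF-8 they can differ, so a length
check alone no longer tells you whether the string fits.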

Eivind.