"pj" <peterjohannsen / hotmail.com> wrote in message
news:127ce4a9.0303091534.7c5e7d8d / posting.google.com...

> > Here is what we did: First, the input can be in Unicode
>
> (You obviously mean UTF-8 encoded Unicode, as you make clear below.)

No, I meant Unicode as in UCS-2 - that is why the expression was
\x0080-\xffff and not \x80-\xff (although I forgot to do the same
for the lead character):

> > identifier = /[A-Za-z\x80-\xff][A-Za-z0-9\-\x0080-\xffff]*/

I then went on to argue that it also works in UTF-8 representation.
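To make that concrete, here is a small sketch (not Ruby's actual lexer) of the byte-oriented variant of the expression applied to UTF-8 input treated as raw bytes. Any byte >= 0x80 - whether a lead or a continuation byte of a multibyte UTF-8 sequence - is simply accepted as an identifier character, so the lexer never has to decode UTF-8:

```ruby
# Byte-oriented identifier pattern; /n makes the regexp operate on
# raw 8-bit bytes rather than on decoded characters.
identifier = /[A-Za-z\x80-\xff][A-Za-z0-9\-\x80-\xff]*/n

# Treat the UTF-8 source text as a plain byte string (String#b).
input = "blåbær = 42".b
name  = input[identifier]          # bytes of "blåbær"
puts name.force_encoding(Encoding::UTF_8)
```

The match stops at the space, and the matched bytes are exactly the UTF-8 encoding of the identifier, so they can go straight into the symbol table.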

>
> But, isn't there a string#length -- does it give you character length
> or byte length ?

As I also answered in another posting, that is a separate issue. I'm talking
about the lexer process and symbol table lookup. String classes can easily
use a 16 or 32 bit representation internally. The main point is that it is
easy to use an 8-bit lexer.
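The separation shows up directly in modern Ruby, where both views of a string coexist: String#length counts characters while String#bytesize counts the bytes an 8-bit lexer would scan (a sketch, not part of the original post's Ruby version):

```ruby
s = "blåbær"     # UTF-8 source text

# Character-oriented view, what a string class user sees:
puts s.length    # 6 characters

# Byte-oriented view, what the 8-bit lexer scans:
puts s.bytesize  # 8 bytes: å and æ are 2 bytes each in UTF-8
```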

> I am pretty ignorant of Ruby -- does it provide uppercase, lowercase,
> titlecase functions ? If so, I would guess that they only support
> UTF-8 if they were designed for UTF-8 ?

I'm not sure I understand your question. Ruby is case sensitive.
Generally, case-sensitive languages are great because you completely avoid
the problem of deciding what an upper case letter is (a moving target in
Unicode, i.e. a large table that must be updated frequently).
However, Ruby uses an upper case initial letter to identify constants. This
means that Ruby does not avoid the upper case problem after all.
My expression does not properly address this, except if you only allow A-Z
as the upper case start of a constant. In the end you have to use a large
table to do it properly.
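The trade-off above can be sketched as follows (a hypothetical helper, not Ruby's actual constant check): restricting "upper case" to A-Z makes the test a single-byte check needing no Unicode tables, at the cost of not recognizing non-ASCII capitals:

```ruby
# Cheap ASCII-only test: is this identifier a constant?
# No Unicode case tables required.
def ascii_constant?(name)
  name.match?(/\A[A-Z]/)
end

puts ascii_constant?("Foo")   # true
puts ascii_constant?("Åge")   # false, although Å is an upper case letter
```

Accepting "Åge" as a constant would require consulting the full (and growing) Unicode uppercase table.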


> > By converting to UTF-8, the Lexer need not be able to handle 16 bit
> > characters.
>
> Ok, you're talking about how Ruby could be altered to do things, right
> ?

Well yes - I'm partly talking about how I thought Ruby did it (and
apparently does, given the proper -K switch), and partly about my own
experience with how easily UTF-8 support can be retrofitted to a parser. I
have already done it several times, so I exploit every opportunity to spread
the merry message ;-) Actually I picked the expression almost directly from
a parser I'm hacking in Ruby.
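The retrofit described above amounts to one conversion step at the front of the pipeline (a sketch using Ruby's String#encode, which postdates the original discussion): text arriving in a 16-bit encoding is transcoded to UTF-8 before lexing, so the 8-bit lexer never sees wide characters:

```ruby
# Suppose the source file arrives as UTF-16 (16-bit characters).
utf16 = "blåbær".encode(Encoding::UTF_16BE)

# Convert once, up front; everything downstream is plain bytes.
bytes = utf16.encode(Encoding::UTF_8).b

puts bytes.bytesize   # 8 bytes, ready for the 8-bit lexer
```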

> I've not been following, but I read at least a rumor that China put
> out a new standard (GB18030 ? I probably have the numbers wrong) that
> cannot be represented in UCS-2, but can be represented in Unicode 3.1
> (ie, UTF-8, UTF-16, etc). I dunno if that matters to anyone outside of
> China though.

Is this the Big-5?
Anyway - I think this is one reason that matz refuses to settle for Unicode
only. Initially I thought this was wrong, but now I realize how important it
is not to settle for any single format. I even sent Paul Graham an email
suggesting that he consider this point in his new Arc language.

Mikkel