"pj" <peterjohannsen / hotmail.com> wrote in message news:127ce4a9.0303091534.7c5e7d8d / posting.google.com... > > Here is what we did: First, the input can be in Unicode > > (You obviously mean UTF-8 encoded Unicode, as you make clear below.) No, I meant Unicode as in UCS-2 - that is why the expression was \x0080-\xffff and not \x80-\xff (although I forget it in the lead character): > > identifier = /[A-Za-z\x80-\xff][A-Za-z0-9\-\x0080-\xffff]*/ I then went on to argue that it also works in UTF-8 representation. > > But, isn't there a string#length -- does it give you character length > or byte length ? As I also answered in another posting, it is a separate issue. I'm talking about the lexer process and symbol table lookup. String classes can easily have 16 or 32 bit representation internally. The main point is that it is easy to use an 8bit lexer. > I am pretty ignorant of Ruby -- does it provide uppercase, lowercase, > titlecase functions ? If so, I would guess that they only support > UTF-8 if they were designed for UTF-8 ? I'm not sure I understand your question. Ruby is case sensitive. Generally case sensitive languages are great because you completely avoid the problem of what is an upper case letter (a moving target in Unicode i.e. large table that must be updated frequently). However, Ruby uses uppercase to identify constants. This means that Ruby does not avoid the upper case problem after all. This problem is not properly addressed by my expression, except if you only allow A-Z upper case constant. In the end you have use a large table to do it properly. > > By converting to UTF-8, the Lexer need not be able to handle 16 bit > > characters. > > Ok, you're talking about how Ruby could be altered to do things, right > ? Well yes - I'm partly talking about how I thought Ruby did (and apparently does it using the proper -K switch), partly I'm talking about my own experiences with how UTF-8 support can easily be retrofitted a parser. 
I have already used it several times, so I exploit every opportunity to
spread the merry message ;-) Actually, I picked the expression almost
directly from a parser I'm hacking on in Ruby.

> I've not been following, but I read at least a rumor that China put
> out a new standard (GB18030 ? I probably have the numbers wrong) that
> cannot be represented in UCS-2, but can be represented in Unicode 3.1
> (ie, UTF-8, UTF-16, etc). I dunno if that matters to anyone outside of
> China though.

Is this the Big-5? Anyway - I think this is one reason that matz
refuses to settle for Unicode only. Initially I thought this was wrong,
but now I realize how important it is not to settle for any single
format. Indeed, I sent an email to Paul Graham suggesting that he
consider this point in his new Arc language.

Mikkel