On Jun 28, 2006, at 2:51 PM, Austin Ziegler wrote:

> And I believe this to be the case. But I *also* believe that Ruby's
> support for Unicode needs to be first-rate. Where I am getting most
> frustrated is that few people have understood that -- and even fewer
> have understood that viewing first-rate support for Unicode isn't
> incompatible with m17n String.

I think people understand what you want.  But those of us who've done  
a lot of i18n work know how hard it is to get things right; for  
example, the single hardest piece of writing an efficient XML parser  
is dealing with the character input/output.  Those of us who write  
search engines and have sweated the language-sensitive tokenization  
details are also paranoid about these problems.  We also know that it  
is *possible* to get things right, if you adopt the limitation that  
characters are Unicode characters.  Matz is making a strong claim:  
that he can write a class that will get Unicode right and also handle  
arbitrary other character sets and encodings, and serve as a byte  
buffer (it's a floor wax *and* a dessert topping!) and do this all  
with acceptable correctness and efficiency.  This has not previously  
been done that I know of.  If he can pull it off, that's super.  It's  
not unreasonable to worry, though.

I would offer one piece of advice for the m17n implementation: have a  
unicode/non-unicode mode bit, and in the case that it's Unicode, pick  
one encoding and stick to it (probably UTF-8, because that's  
friendlier to C programmers).  The reason that this is a good idea is  
that if you know the encoding, then for certain performance-critical  
tasks (e.g. regexp) you can do sleazy low-level optimizations that
run directly on the encoded bytes rather than on decoded characters.
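
To make the "run on the encoding" point concrete, here is a minimal
sketch of one such trick: substring search on raw UTF-8 bytes. It
relies on UTF-8 being self-synchronizing, so a valid needle can only
match at a character boundary. The helper names (byte_index,
char_index_from_byte) are made up for illustration, and String#b and
String#byteslice assume a Ruby much newer than the one under
discussion here; this is not a claim about how any m17n String would
actually be implemented.

```ruby
# Sketch only: hypothetical helpers, not part of any proposed String API.

# Find needle in haystack by comparing raw bytes, with no character decoding.
# Both arguments are assumed to be valid UTF-8.
def byte_index(haystack, needle)
  haystack.b.index(needle.b)   # byte offset of the first match, or nil
end

# Convert a byte offset (on a character boundary) back to a character index.
def char_index_from_byte(haystack, byte_offset)
  haystack.byteslice(0, byte_offset).length
end

text = "naïve café"
offset = byte_index(text, "café")        # 7, since "ï" takes two bytes
puts offset
puts char_index_from_byte(text, offset)  # 6, same as text.index("café")
```

The byte-level scan never has to ask where one character ends and the
next begins, which is exactly what you cannot assume when the encoding
is unknown.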

Yes, you'd have to do conversion of all the 8859 and JIS and Big5 and  
so on going in and out, but if the volume is big enough that you  
care, there'll be disks involved, and you can transcode way faster  
than I/O speeds, so the conversion cost will probably not be observable.
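
A rough sketch of what "conversion going in and out" could look like
at the I/O boundary, assuming a Ruby that has String#encode and
Encoding constants (later versions do; nothing like this is settled
for the m17n design). The helper names to_internal and to_external
are hypothetical.

```ruby
# Sketch only: hypothetical boundary helpers, assuming String#encode exists.

# Bytes arriving in a legacy encoding become UTF-8 on the way in.
def to_internal(bytes, source_encoding)
  bytes.dup.force_encoding(source_encoding).encode(Encoding::UTF_8)
end

# UTF-8 text is converted back to the legacy encoding on the way out.
def to_external(str, target_encoding)
  str.encode(target_encoding)
end

latin1_bytes = "caf\xE9".b   # "café" as ISO-8859-1 bytes
internal = to_internal(latin1_bytes, Encoding::ISO_8859_1)
puts internal.encoding       # UTF-8
puts internal.bytesize       # 5, because "é" is two bytes in UTF-8

external = to_external(internal, Encoding::ISO_8859_1)
puts external.bytes == latin1_bytes.bytes   # true, the round trip is lossless here
```

Everything between the two calls sees one known encoding, which is
what keeps the low-level tricks available.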

Among other things, I want to be able to process XML in Ruby really  
really fast, and in XML you *know* that it's all Unicode characters;  
so it would be nice to leave the door open for low-level
Unicode-specific optimizations.

  -Tim