At 21:03 08/09/08, Michael Selig wrote:
>On Mon, 08 Sep 2008 19:45:36 +1000, Yukihiro Matsumoto  
><matz / ruby-lang.org> wrote:

[just a bit more background]

>> |I'd like to say how amazing the character encoding implementation is. I
>> |don't know of any other language that has attempted to support all
>> |encodings internally, as you guys have. You have also done a really good
>> |job at optimizing UTF-8 string processing performance when all data is
>> |ASCII.

Well, actually, there are also some very clever optimizations for
UTF-8 strings that are not ASCII-only (mostly done by Akira). Of
course, these are still not as fast as ASCII-only strings.

>> |However, I imagine that using UTF-8 internally for strings of
>> |multi-byte characters (or any other variable-length encoding) is going
>> |to be slow. I also have a concern that supporting so many character
>> |encodings internally is making Ruby's C code (eg: string.c) hard to
>> |optimize for a particular class of encoding and when you do, messy and
>> |difficult to maintain. It would be nicer if the internal implementation
>> |of say "String" could be done in a more OO approach, based on encoding.
>> |Probably easier said than done, though!

Well, actually, that is pretty much what is already going on,
although the "method dispatch table" is implemented in plain C.
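
To give a rough idea of what such a table looks like in plain C, here
is a simplified sketch. It is not the actual code from Ruby's source;
the type `encoding`, the function names and the two table instances
are all made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical "method table" for an encoding: one function pointer
 * per operation. Real implementations have many more slots. */
typedef struct {
    const char *name;
    /* Byte length of the character starting at p (requires p < end). */
    size_t (*char_len)(const unsigned char *p, const unsigned char *end);
} encoding;

/* Single-byte encodings: every character is exactly one byte. */
static size_t single_byte_char_len(const unsigned char *p,
                                   const unsigned char *end)
{
    (void)p;
    (void)end;
    return 1;
}

/* UTF-8: the lead byte determines the length of the sequence. */
static size_t utf8_char_len(const unsigned char *p,
                            const unsigned char *end)
{
    size_t avail = (size_t)(end - p);
    size_t n;

    if (*p < 0x80)                n = 1;
    else if ((*p & 0xE0) == 0xC0) n = 2;
    else if ((*p & 0xF0) == 0xE0) n = 3;
    else if ((*p & 0xF8) == 0xF0) n = 4;
    else                          n = 1; /* invalid lead byte: skip one byte */

    return n <= avail ? n : avail;       /* clamp truncated sequences */
}

static const encoding ENC_LATIN1 = { "ISO-8859-1", single_byte_char_len };
static const encoding ENC_UTF8   = { "UTF-8",      utf8_char_len };

/* Encoding-independent caller: it dispatches through the table and
 * contains no "if single-byte ... else ..." chain. */
static size_t str_char_count(const encoding *enc,
                             const char *s, size_t nbytes)
{
    const unsigned char *p = (const unsigned char *)s;
    const unsigned char *end = p + nbytes;
    size_t count = 0;

    assert(enc != NULL && enc->char_len != NULL);
    while (p < end) {
        p += enc->char_len(p, end);
        count++;
    }
    return count;
}
```

For the four bytes "a\xC3\xA9z", str_char_count() gives 3 when called
with &ENC_UTF8 and 4 when called with &ENC_LATIN1.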

>> Having fewer classes is one of Ruby's design policies.  And
>> I feel it works well so far.

Yes, this is definitely the right thing to do at the user level.
In my opinion, even the fact that you get an exception when you
try to process strings in two different encodings in a single
method call is, in some way, against duck typing.

>Sorry, I didn't mean to propose introducing extra classes in Ruby. I was  
>talking about how the Ruby internal C code was implemented. I was getting  
>at trying to avoid a series of "ifs" in each method in say string.c to  
>optimize for different classes of encoding (eg if single-byte ... else if  
>constant-width ..... else .....).

To some extent, this is due to the fact that currently, the number
of 'methods' that implement an encoding is very small. Increasing
the number of 'methods' (with a standard implementation that makes
use of lower-level primitives for most encodings, and an
encoding-specific optimized implementation for some important cases)
should improve performance somewhat. For more details, please see
Section 4.4 of http://www.sw.it.aoyama.ac.jp/2007/pub/IUC31-ruby/Paper.html.
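
The idea of a standard implementation plus encoding-specific overrides
could look roughly like the sketch below. This is only an illustration
under my own assumptions, not code from Ruby or from the paper: the
struct `encoding_ops`, the optional `count_chars` slot and both table
instances are hypothetical, and the fast UTF-8 version assumes
well-formed input.

```c
#include <assert.h>
#include <stddef.h>

typedef struct encoding_ops encoding_ops;
struct encoding_ops {
    /* Required low-level primitive: byte length of the character at p. */
    size_t (*char_len)(const unsigned char *p, const unsigned char *end);
    /* Optional higher-level 'method'; NULL means "use the default". */
    size_t (*count_chars)(const unsigned char *p, const unsigned char *end);
};

/* Standard implementation, built only on the primitive. It works for
 * any encoding, but makes one indirect call per character. */
static size_t default_count_chars(const encoding_ops *enc,
                                  const unsigned char *p,
                                  const unsigned char *end)
{
    size_t count = 0;
    while (p < end) {
        p += enc->char_len(p, end);
        count++;
    }
    return count;
}

static size_t utf8_char_len(const unsigned char *p,
                            const unsigned char *end)
{
    size_t avail = (size_t)(end - p);
    size_t n;

    if (*p < 0x80)                n = 1;
    else if ((*p & 0xE0) == 0xC0) n = 2;
    else if ((*p & 0xF0) == 0xE0) n = 3;
    else if ((*p & 0xF8) == 0xF0) n = 4;
    else                          n = 1; /* invalid lead byte: skip one byte */

    return n <= avail ? n : avail;
}

/* Encoding-specific optimization for UTF-8: count every byte that is
 * not a continuation byte (10xxxxxx). Assumes well-formed UTF-8. */
static size_t utf8_count_chars(const unsigned char *p,
                               const unsigned char *end)
{
    size_t count = 0;
    for (; p < end; p++) {
        if ((*p & 0xC0) != 0x80)
            count++;
    }
    return count;
}

/* Same encoding twice: once relying on the default, once overridden. */
static const encoding_ops ENC_UTF8_GENERIC = { utf8_char_len, NULL };
static const encoding_ops ENC_UTF8_FAST    = { utf8_char_len, utf8_count_chars };

/* Callers use this entry point and never see which version ran. */
static size_t enc_count_chars(const encoding_ops *enc,
                              const char *s, size_t nbytes)
{
    const unsigned char *p = (const unsigned char *)s;

    assert(enc != NULL && enc->char_len != NULL);
    if (enc->count_chars != NULL)
        return enc->count_chars(p, p + nbytes);
    return default_count_chars(enc, p, p + nbytes);
}
```

On well-formed input both tables give the same answer; only the amount
of work per character differs. Adding a new encoding then requires
only the primitive, and optimized slots can be filled in later where
they matter.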

Another reason for the frequent "ifs" is that some optimization
is done on the instance level rather than the 'class' (read
encoding) level.
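
As an illustration of what instance-level optimization means, here is
a small sketch in which each string caches whether it contains only
ASCII bytes. The names (`utf8_string`, `byte_range`, and so on) are
invented for this example and are not Ruby's actual internals; the
non-ASCII path assumes well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* What is known about the bytes of one particular string. */
typedef enum { RANGE_UNKNOWN, RANGE_ASCII_ONLY, RANGE_NON_ASCII } byte_range;

typedef struct {
    const char *bytes;
    size_t      nbytes;
    byte_range  range;   /* cached per string, not per encoding */
} utf8_string;

/* Scan the string once and remember the result in the instance. */
static byte_range scan_range(utf8_string *s)
{
    if (s->range == RANGE_UNKNOWN) {
        size_t i;
        s->range = RANGE_ASCII_ONLY;
        for (i = 0; i < s->nbytes; i++) {
            if ((unsigned char)s->bytes[i] >= 0x80) {
                s->range = RANGE_NON_ASCII;
                break;
            }
        }
    }
    return s->range;
}

/* Character count. The "if" here depends on the individual string,
 * so it cannot be moved into a per-encoding dispatch table. */
static size_t utf8_string_length(utf8_string *s)
{
    size_t i, count = 0;

    assert(s != NULL);
    if (scan_range(s) == RANGE_ASCII_ONLY)
        return s->nbytes;            /* one byte per character */

    for (i = 0; i < s->nbytes; i++) {
        if (((unsigned char)s->bytes[i] & 0xC0) != 0x80)
            count++;                 /* count non-continuation bytes */
    }
    return count;
}
```

Two strings in the same encoding can therefore take different paths:
a utf8_string holding "hello" takes the fast path and returns its byte
length, while one holding "a\xC3\xA9z" goes through the byte-by-byte
loop.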

Regards,    Martin.


#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst / it.aoyama.ac.jp