Hi,

In message "Re: [ruby-core:18486] Ruby 1.9 strings & character encoding"
    on Mon, 8 Sep 2008 11:06:52 +0900, "Michael Selig" <michael.selig / fs.com.au> writes:

|1) Maybe I am blind, but I cannot find something like String#each_code to  
|return an Enumerator of the Unicode codepoints as fixnums. Is there such a  
|beast? If not, I think there should be, considering that there is a  
|String#each_byte. (Yes, you can use String#each_char and then String#ord  
|on each). Also I think there should be an equivalent of String#setbyte &  
|getbyte for unicode codepoints (String#setcode & getcode?).

The name each_code is ambiguous to me.  Do you mean codepoints?
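
For reference, the each_char + ord workaround mentioned in the quoted
text can be sketched as follows.  The helper name codepoints_of is
hypothetical, chosen only for this illustration; it is not an existing
or proposed String method.

```ruby
# A minimal sketch: collect codepoints using only each_char and ord.
# codepoints_of is a hypothetical helper name, not part of String.
def codepoints_of(str)
  str.each_char.map { |ch| ch.ord }
end

s = "h\u00e9llo"            # UTF-8 string; U+00E9 occupies two bytes

codes = codepoints_of(s)    # one integer per character
bytes = s.each_byte.to_a    # one integer per byte, for comparison
```

The string has five characters but six bytes, which is why a
per-codepoint iterator is not the same thing as each_byte.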

|2) If there are new "code" methods above as mentioned above, are the  
|methods String#getbyte, setbyte, each_byte, really necessary? You can  
|always do "force_encoding("BINARY")" if you really want to do byte  
|stuffing, and then each_code, setcode & getcode should do the same as the  
|current "byte" methods.

They are handy for hacking encoded text byte by byte, without going
back and forth between the text encoding and binary.  That happens
sometimes when you tweak low-level text.
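
A small sketch of the difference, assuming a UTF-8 string whose last
character is two bytes long.  The variable names are only for
illustration.

```ruby
s = "caf\u00e9"                        # last char is 0xC3 0xA9 in UTF-8

# With setbyte: patch one byte in place, encoding stays UTF-8.
direct = s.dup
direct.setbyte(direct.bytesize - 1, 0xA8)   # C3 A9 -> C3 A8

# Without it: switch to binary, patch, then switch back.
round_trip = s.dup
round_trip.force_encoding("BINARY")
round_trip[-1] = 0xA8.chr              # 0xA8.chr is a one-byte binary string
round_trip.force_encoding("UTF-8")
```

Both strings end up with the same bytes, but the first version never
leaves the original encoding.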

|3) Suggestion: when opening a file with mode "b" (binary), I think the  
|encoding should be automatically set to 8-bit ASCII, overriding the  
|default locale. I think this should happen on Linux & Unix as well. That  
|way "IO#readchar" and others will only try to do byte-by-byte processing  
|(I hope!).

For historical reasons, "b" does not mean "binary" encoding, but
suppression of newline processing (\r\n -> \n).  It has nothing to do
with 8-bit ASCII.
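
If byte-by-byte reading is what you want, one option is to ask for the
encoding explicitly in the mode string instead of relying on "b"
alone.  A rough sketch, using a temporary file created only for this
illustration:

```ruby
require "tempfile"

# "h", UTF-8 "é" (two bytes), then CR LF.
raw = [0x68, 0xC3, 0xA9, 0x0D, 0x0A].pack("C*")

first = nil
second = nil
data = nil

Tempfile.create("bin_demo") do |f|
  f.binmode
  f.write(raw)
  f.flush

  # "b" suppresses newline conversion; ":ASCII-8BIT" requests the
  # binary encoding explicitly.
  File.open(f.path, "rb:ASCII-8BIT") do |io|
    first  = io.readchar    # one byte
    second = io.readchar    # one byte, not the whole two-byte character
    io.rewind
    data = io.read
  end
end
```

With the encoding given explicitly, readchar returns one byte at a
time and the \r\n pair is left untouched.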

|4) I notice that some methods like String#toutf8 no longer exist, but are  
|still in the doc.

They should be removed.  Where did you find them?

|I'd like to say how amazing the character encoding implementation is. I  
|don't know of any other language that has attempted to support all  
|encodings internally, as you guys have. You have also done a really good  
|job at optimizing UTF-8 string processing performance when all data is  
|ASCII. However, I imagine that using UTF-8 internally for strings of  
|multi-byte characters (or any other variable-length encoding) is going to  
|be slow. I also have a concern that supporting so many character encodings  
|internally is making Ruby's C code (eg: string.c) hard to optimize for a  
|particular class of encoding and when you do, messy and difficult to  
|maintain. It would be nicer if the internal implementation of say "String"  
|could be done in a more OO approach, based on encoding. Probably easier  
|said than done, though!

Having fewer classes is one of Ruby's design policies.  And I feel it
has worked well so far.

							matz.