On Mon, 08 Sep 2008 19:45:36 +1000, Yukihiro Matsumoto  
<matz / ruby-lang.org> wrote:

> |1) Maybe I am blind, but I cannot find something like String#each_code to
> |return an Enumerator of the Unicode codepoints as fixnums. Is there such a
> |beast? If not, I think there should be, considering that there is a
> |String#each_byte. (Yes, you can use String#each_char and then String#ord
> |on each). Also I think there should be an equivalent of String#setbyte &
> |getbyte for unicode codepoints (String#setcode & getcode?).
>
> each_code is ambiguous for me.  codepoint?
Is "each_codepoint" too long?

> |2) If there are new "code" methods as mentioned above, are the
> |methods String#getbyte, setbyte, each_byte, really necessary? You can
> |always do "force_encoding("BINARY")" if you really want to do byte
> |stuffing, and then each_code, setcode & getcode should do the same as the
> |current "byte" methods.
>
> They are handy to hack encoded text by bytes, without going back and
> forth between encoded and binary.  It happens sometimes when you tweak
> low-level text.
OK - I was just trying to avoid introducing multiple methods with similar  
function.
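
Just to make the comparison concrete, here is roughly what I had in mind
(assuming Ruby 1.9 semantics for force_encoding and getbyte):

   # coding: utf-8
   s = "abc\xC3\xA9"                  # UTF-8 "abcé"

   # The byte methods work without touching the encoding:
   s.getbyte(3)                       #=> 195 (0xC3)
   s.each_byte.to_a.length            #=> 5

   # The alternative I was thinking of: drop to binary first, then treat
   # "characters" as bytes (which does go back and forth, as you say):
   b = s.dup.force_encoding("BINARY")
   b.each_char.map { |c| c.ord }      #=> [97, 98, 99, 195, 169]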

> |3) Suggestion: when opening a file with mode "b" (binary), I think the
> |encoding should be automatically set to 8-bit ASCII, overriding the
> |default locale. I think this should happen on Linux & Unix as well. That
> |way "IO#readchar" and others will only try to do byte-by-byte processing
> |(I hope!).
>
> For historical reasons, "b" does not mean "binary" encoding, but
> suppression of newline processing (\r\n -> \n).  It has nothing to do
> with 8bit-ascii.
That was true before the days of Unicode. But the options "t" and "b" to
the C function fopen() *do* stand for text and binary on Windows. When
Windows started to support Unicode, however, "binary mode" also came to
mean that when reading multi-byte character files, conversion to
"wide chars" is suppressed. See
http://msdn.microsoft.com/en-us/library/c4cy2b8e.aspx . In Ruby 1.9,
ASCII-8BIT is the closest thing we have to a byte string - in fact BINARY
is an alias for ASCII-8BIT in Ruby's standard encodings.
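
In other words, what I am suggesting is that a plain "rb" would behave like
the explicit form below (the mode:encoding syntax is how you have to spell
it out at the moment, if I understand the current behaviour correctly;
"data.bin" is just a placeholder filename):

   # Today, as I understand it: "b" only suppresses newline conversion,
   # and the external encoding still comes from the locale default.
   f = File.open("data.bin", "rb")

   # What I am suggesting "rb" should imply:
   f = File.open("data.bin", "rb:ASCII-8BIT")   # i.e. BINARY
   f.read.encoding                              #=> #<Encoding:ASCII-8BIT>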

>
> |4) I notice that some methods like String#toutf8 no longer exist, but are
> |still in the doc.
>
> Should be removed.  Where did you find them?
I just noticed them on ruby-doc.org under the 1.9 core doc "String". Maybe  
that is old.

> |I'd like to say how amazing the character encoding implementation is. I
> |don't know of any other language that has attempted to support all
> |encodings internally, as you guys have. You have also done a really good
> |job at optimizing UTF-8 string processing performance when all data is
> |ASCII. However, I imagine that using UTF-8 internally for strings of
> |multi-byte characters (or any other variable-length encoding) is going to
> |be slow. I also have a concern that supporting so many character encodings
> |internally is making Ruby's C code (eg: string.c) hard to optimize for a
> |particular class of encoding and when you do, messy and difficult to
> |maintain. It would be nicer if the internal implementation of say "String"
> |could be done in a more OO approach, based on encoding. Probably easier
> |said than done, though!
>
> Having fewer classes is one of Ruby's design policies.  And
> I feel it works well so far.
Sorry, I didn't mean to propose introducing extra classes in Ruby. I was
talking about how Ruby's internal C code is implemented. What I was getting
at was avoiding a series of "ifs" in each method in, say, string.c to
optimize for different classes of encoding (e.g. if single-byte ... else if
constant-width ... else ...).
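
Just to illustrate the kind of dispatch I mean (written in Ruby only
because it is shorter than C; this is not a proposal for new Ruby-level
classes, and every name below is made up for illustration):

   # coding: utf-8
   # Hypothetical per-encoding strategy objects (illustration only;
   # the real code would be C in string.c).
   class SingleByteOps
     def char_length(str)
       str.bytesize                    # 1 byte == 1 char
     end
   end

   class VariableWidthOps
     def char_length(str)
       str.each_char.count             # must walk the bytes
     end
   end

   OPS = {
     Encoding::ASCII_8BIT => SingleByteOps.new,
     Encoding::UTF_8      => VariableWidthOps.new
   }

   # One dispatch per encoding instead of an if-chain in every operation:
   def char_length(str)
     OPS[str.encoding].char_length(str)
   end

   char_length("héllo")                #=> 5

That is the general idea, anyway.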

Mike.