Hi,

In message "Re: The face of Unicode support in the future"
    on Tue, 18 Jan 2005 12:08:34 +0900, Wes Nakamura <wknaka / pobox.com> writes:

|Is there opposition to a separate unicode string class, that would
|coexist with the current byte-based string class?  I find a fixed-width
|unicode-based string type to be much easier to deal with rather
|than individual encodings.  With the byte-based system you would have to
|worry about the language of the text in each string, and check
|encodings before doing something like a string compare.

That's true of C strings (char* or wchar_t*), which you have to
allocate yourself and handle character by character, but not of
strings in Ruby, whose API operates at a much higher level of
abstraction.  Lower-level processing such as allocating and resizing
the internal buffer is handled automagically.
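As a minimal illustration of that point (the literal values here are
my own, chosen just for the example): a Ruby string grows its internal
buffer on demand, with none of the malloc/realloc bookkeeping a char*
would require.

```ruby
# Appending repeatedly to a Ruby string; the interpreter resizes
# the underlying buffer automatically as the string grows.
s = String.new
1000.times { s << "x" }   # no manual allocation or resizing
puts s.length             # => 1000
```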

|IIRC in iso-2022-jp you can't even find character boundaries unless you
|go back to the shift-in marker (UTF8 allows you to find boundaries   
|easily).  Is each encoder library going to need to be smart about
|encodings, like adding two iso-2022-jp strings together:

Stateful character encoding schemes, such as iso-2022, are not
supported by default, since, as you described, they are very
difficult to handle efficiently.  But even though it's hard, it is
theoretically possible to handle such encodings without knowing the
details of the underlying encoding.  For example, string
concatenation should work as 's1 + s2' whatever encoding the strings
are in (as long as s1 and s2 are in the same encoding), and s1.length
should give you the number of code points in the string.
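That behavior can be sketched concretely.  This is how the design
described above works in Ruby today (1.9 and later); the particular
strings are illustrative, not from the original discussion.

```ruby
# Two strings in the same encoding (UTF-8 here) concatenate
# transparently, and #length counts code points, not bytes.
s1 = "こんにちは"   # 5 characters, 15 bytes in UTF-8
s2 = "世界"         # 2 characters, 6 bytes in UTF-8

s3 = s1 + s2        # works because the encodings match
puts s3.length      # => 7  (code points)
puts s3.bytesize    # => 21 (bytes; a separate question entirely)
```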

							matz.