On Saturday 17 June 2006 13:08, Juergen Strobel wrote:
> On Thu, Jun 15, 2006 at 07:59:54PM +0900, Michal Suchanek wrote:
[...]
> > The string methods should not just blindly operate on bytes but
> > use the encoding information to operate on characters rather than
> > bytes. Sure something like byte_length is needed when the string
> > is stored somewhere outside Ruby but standard string methods
> > should work with character offsets and characters, not byte
> > offsets nor bytes.
>
> I emphatically agree. I'll even repeat and propose a new Plan for
> Unicode Strings in Ruby 2.0 in 10 points:

Juergen, I agree with most of what you have written. I will
add my thoughts.

> 1. Strings should deal in characters (code points in Unicode) and
> not in bytes, and the public interface should reflect this.
>
> 2. Strings should neither have an internal encoding tag, nor an
> external one via $KCODE. The internal encoding should be
> encapsulated by the string class completely, except for a few
> related classes which may opt to work with the gory details for
> performance reasons. The internal encoding has to be decided,
> probably between UTF-8, UTF-16, and UTF-32 by the String class
> implementor.

Full ACK. Ruby programs shouldn't need to care about the
*internal* string encoding. External string data is treated as
a sequence of bytes and is converted to Ruby strings through
an encoding API.

> 3. Whenever Strings are read or written to/from an external source,
> their data needs to be converted. The String class encapsulates the
> encoding framework, likely with additional helper Modules or
> Classes per external encoding. Some methods take an optional
> encoding parameter, like #char(index, encoding=:utf8), or
> #to_ary(encoding=:utf8), which can be used as helper Class or
> Module selector.

I think the encoding/decoding API should be separated from the
String class. IMO, the most important change is to strictly
differentiate between arbitrary binary data and character
data. Character data is represented by an instance of the
String class.

I propose adding a new core class, maybe call it ByteString
(or ByteBuffer, or Buffer, whatever) to handle strings of
bytes.

Given a specific encoding, the encoding API converts
ByteStrings to Strings and vice versa.

This could look like:

    my_character_str = Encoding::UTF8.encode(my_byte_buffer)
    buffer = Encoding::UTF8.decode(my_character_str)
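
Note that neither Encoding::UTF8 nor ByteString exists yet; the
names above are part of the proposal. For illustration only, the
intended round trip can be sketched with today's Ruby, where a
BINARY (ASCII-8BIT) string stands in for the ByteString (the
sample bytes are made up):

```ruby
# UTF-8 bytes for "héllo" ("é" is the two bytes 0xC3 0xA9).
byte_buffer = [0x68, 0xC3, 0xA9, 0x6C, 0x6C, 0x6F].pack("C*")

# Encoding::UTF8.encode(byte_buffer): bytes -> character string.
my_character_str = byte_buffer.dup.force_encoding("UTF-8")

# Encoding::UTF8.decode(my_character_str): character string -> raw bytes.
buffer = my_character_str.b

my_character_str.length  # => 5 characters ("h", "é", "l", "l", "o")
buffer.bytesize          # => 6 bytes
```

The point of the sketch is the asymmetry: the byte buffer has a
length in bytes, the character string a length in characters, and
only the encoding object knows how to map between the two.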

> 4. IO instances are associated with a (modifyable) encoding. For
> stdin, stdout this can be derived from the locale settings.
> String-IO operations work as expected.

I propose one of:

1) A low level IO API that reads/writes ByteBuffers. String IO
   can be implemented on top of this byte-oriented API.

   The basic binary IO methods could look like:

   binfile = BinaryIO.new("/some/file", "r")
   buffer = binfile.read_buffer(1024) # read 1K of binary data

   binfile = BinaryIO.new("/some/file", "w")
   binfile.write_buffer(buffer) # Write the byte buffer

   The standard File class (or IO module, whatever) has an
   encoding attribute. The default value is set by the
   constructor by querying OS settings (on my Linux system
   this could be $LANG):

   # read strings from /some/file, assuming it is encoded
   # in the system's default encoding.
   text_file = File.new("/some/file", "r")
   contents = text_file.read

   # alternatively one can explicitly set an encoding before
   # the first read/write:
   text_file = File.new("/some/file", "r")
   text_file.encoding = Encoding::UTF8

   The File class (or IO module) will probably use a BinaryIO
   instance internally.

2) The File class/IO module as of current Ruby just gets
   additional methods for binary IO (through ByteBuffers) and
   an encoding attribute. The methods that do binary IO don't
   need to care about the encoding attribute.

I think 1) is cleaner.
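
For what it's worth, the split proposed in 1) can already be
mimicked with plain Ruby file IO: read the raw bytes first, then
decode them explicitly. In the sketch below, File.binread and the
encoding option are stand-ins for the proposed BinaryIO class and
encoding attribute (the file contents are made up):

```ruby
require "tempfile"

# Write a small UTF-8 sample to a temporary file.
file = Tempfile.new("demo")
file.write("héllo")   # 5 characters, 6 bytes ("é" takes two bytes)
file.close

# Byte-oriented read, as BinaryIO#read_buffer would do:
# the result is a raw byte string (BINARY encoding).
buffer = File.binread(file.path)

# Character-oriented read, as File#read with an encoding
# attribute would do: the same bytes, decoded as UTF-8 text.
contents = File.read(file.path, encoding: "UTF-8")

buffer.bytesize   # => 6 bytes
contents.length   # => 5 characters
```

The two reads return different kinds of objects in the proposal; in
current Ruby the difference is only visible in the encoding tag.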

> 5. Since the String class is quite smart already, it can implement
> generally useful and hard (in the domain of Unicode) operations
> like case folding, sorting, comparing etc.

If strings are represented as a sequence of Unicode
codepoints, it is possible for external libraries to implement
more advanced Unicode operations.

Since IMO a new "character" class would be overkill, I propose
that the String class provide codepoint-wise iteration (and
indexing) by representing each codepoint as a Fixnum. AFAIK a
Fixnum has 31 bits on a 32-bit machine, which is more than
enough to represent the whole range of Unicode codepoints
(up to U+10FFFF).
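
String#unpack with the "U*" directive already gives a taste of
what such codepoint-wise access could look like, yielding one
Integer per codepoint (a sketch; the sample string is made up):

```ruby
s = "héllo"

# Codepoint-wise view of the string: one Integer per character.
# "é" is a single codepoint (U+00E9 = 233) even though it
# occupies two bytes in UTF-8.
codepoints = s.unpack("U*")   # => [104, 233, 108, 108, 111]

# The reverse direction: codepoints back to a string.
s2 = codepoints.pack("U*")    # => "héllo"
```

Indexing into such a codepoint array sidesteps any questions about
the underlying byte representation.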

> 6. More exotic operations can easily be provided by additional
> libraries because of Ruby's open classes. Those operations may be
> coded depending on String's public interface for simplicity,
> or work with the internal representation directly for performance.
>
> 7. This approach leaves open the possibility of String subclasses
> implementing different internal encodings for performance/space
> tradeoff reasons which work transparently together (a bit like
> FixInt and BigInt).

I think providing different internal String representations
would be too much work, especially for maintenance in the long
run.

> 8. Because Strings are tightly integrated into the language with
> the source reader and are used pervasively, much of this cannot be
> provided by add-on libraries, even with open classes. Therefore the
> need to have it in Ruby's canonical String class. This will break
> some old uses of String, but now is the right time for that.
>
> 9. The String class does not worry over character representation
> on-screen, the mapping to glyphs must be done by UI frameworks or
> the terminal attached to stdout.
>
> 10. Be flexible. <placeholder for future idea>

The advantages of this proposal over the current situation and
tagging a string with an encoding are:

* There is only one internal string (where string means a
  string of characters) representation. String operations
  don't need to be written for different encodings.

* No need for $KCODE.

* Higher abstraction.

* Separation of concerns. I always found it strange that most
  dynamic languages simply mix the handling of character data
  and arbitrary binary data (just think of pack/unpack).

* Reading of character data in one encoding and representing
  it in other encoding(s) would be easy.

It seems that the main argument against using Unicode strings
in Ruby is that Unicode doesn't work well for East Asian
languages. Perhaps there is another character set that works
better that we could use instead of Unicode. The important
point here is that there is only *one* representation of
character data in Ruby.

If Unicode is chosen as the character set, there is the
question which encoding to use internally. UTF-32 would be a
good choice with regard to simplicity of implementation,
since each codepoint takes a fixed number of bytes. Consider
indexing of Strings:

        "some string"[4]

If UTF-32 is used, this operation can internally be
implemented as a simple, constant-time array lookup. If UTF-16
or UTF-8 is used, it cannot be an array lookup, since any
codepoint before the fifth could occupy more than one (8-bit
or 16-bit) unit. Of course there is the argument against
UTF-32 that it takes too much memory. But I think that most
text processing done in Ruby spends much more memory on other
data structures than on the actual character data (just
consider a REXML document), though I haven't measured that ;)

An advantage of using UTF-8 would be that for pure ASCII files
no conversion would be necessary for IO.
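
The size trade-off between the three candidate encodings is easy
to see by encoding one sample string with today's String#encode (a
sketch; the sample mixes an ASCII, a Latin and a CJK character):

```ruby
s = "aé漢"   # three codepoints: ASCII, Latin-1 range, CJK

s.encode("UTF-32BE").bytesize  # => 12 (always 4 bytes per codepoint)
s.encode("UTF-16BE").bytesize  # => 6  (2 bytes each; all three are in the BMP)
s.encode("UTF-8").bytesize     # => 6  (1 + 2 + 3 bytes, variable width)
```

Only UTF-32 keeps the bytes-per-codepoint ratio fixed, which is
exactly what makes the constant-time indexing above possible.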

Thank you for reading this far. Just in case Matz decides to
implement something similar to this proposal, I am willing to
help with Ruby development (although I don't know much about
Ruby's internals, and not too much about Unicode either).

I do not have a CS degree and I'm not a Unicode expert, so
perhaps the proposal is garbage, in this case please tell me
what is wrong about it or why it is not realistic to implement
it.

-- 
Stefan