On Wed, 31 Jul 2002, Yukihiro Matsumoto wrote:
> In message "Unicode in Ruby now?"
>     on 02/07/31, Tobias Peters <tpeters / invalid.uni-oldenburg.de> writes:
> 
> |When I export a string to a UTF-8 encoded stream, how can I possibly
> |know its current encoding? Strings do not have an "encoding" tag. Will
> |they have one in the future?
> 
> Yes.

Nice. I still think sources and sinks of characters also need an
"encoding" property. Strings originating from some character source would
then carry the source's encoding. Strings exported to a character sink
would have to be converted on the fly if the encodings differ. We could
make the behaviour for unconvertible characters a property of the
character sink.
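
A rough sketch of the sink idea, to make it concrete. It assumes a Ruby
where strings carry an encoding and String#encode can transcode (both only
arrived with Ruby 1.9, so this does not run on the current release). The
names EncodedSink, tag_from_source and the on_unconvertible option are
invented for illustration only.

```ruby
# Hypothetical sink wrapper: EncodedSink and its option names are made up
# for illustration. It relies on String#encode (Ruby 1.9 or later).
class EncodedSink
  attr_reader :encoding

  # target: anything that responds to <<, e.g. an IO or an Array
  # on_unconvertible: :raise (default) or :replace
  def initialize(target, encoding, on_unconvertible: :raise)
    @target = target
    @encoding = Encoding.find(encoding)
    @on_unconvertible = on_unconvertible
  end

  # Convert on the fly to the sink's encoding, then hand the result on.
  def write(string)
    converted =
      if @on_unconvertible == :replace
        string.encode(@encoding, invalid: :replace, undef: :replace, replace: "?")
      else
        string.encode(@encoding) # raises on unconvertible characters
      end
    @target << converted
    self
  end
end

# A source only has to tag the raw bytes it delivers with its own encoding.
def tag_from_source(bytes, source_encoding)
  bytes.dup.force_encoding(source_encoding)
end
```

With on_unconvertible: :replace the sink substitutes "?" for anything it
cannot represent; with the default it raises, so the policy really is a
property of the sink and not of the string.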

We would then also need rules for combining strings with different
encodings. Concatenating two strings encoded in koi8-r and iso-8859-1,
respectively, may only be possible when the result is encoded in some
Unicode representation.
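
To illustrate one possible rule, here is a minimal sketch that transcodes
every operand to a common Unicode encoding before joining. It again assumes
the encoding-aware strings of Ruby 1.9 and later; the helper name combine
and the choice of UTF-8 as the target are my assumptions, not a proposal
for the actual rule.

```ruby
# Hypothetical helper: concatenate strings of any encodings by first
# transcoding each one to a common Unicode encoding (UTF-8 by default).
def combine(*strings, target: Encoding::UTF_8)
  strings.map { |s| s.encode(target) }.join
end

# Build one koi8-r and one iso-8859-1 string from Unicode escapes.
russian = "\u041f\u0440\u0438\u0432\u0435\u0442".encode("KOI8-R")
french  = "\u00e9t\u00e9".encode("ISO-8859-1")

combined = combine(russian, french)
puts combined.encoding

# Plain concatenation of the two is rejected, because neither encoding
# can hold the other's non-ASCII characters.
begin
  russian + french
rescue Encoding::CompatibilityError => e
  puts "cannot concatenate directly: #{e.message}"
end
```

In later Ruby versions the direct "+" fails at runtime, which is the
conservative variant of the rule; combine shows the "promote to Unicode"
variant.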

Are there any other relevant character sets that are not part of
Unicode yet? If not, we could probably live with just two possible
canonical character encodings. By canonical, I mean here the encoding of
a string that results from combining other strings with different
encodings.

> No.  Considering the existence of "big character set" like Mojikyo
> (charset developed in Japan, which is bigger than Unicode), there
> cannot be any ideal canonical format. 

I understand that Mojikyo will not be folded into Unicode for
political reasons. Combining Ruby strings in some Unicode encoding with
Ruby strings in some Mojikyo encoding might then result in a runtime
error.

> In addition, from my
> estimation, the cost penalty from code conversion to/from the
> canonical character set is intolerable if one processes mainly on
> non-ASCII, non-Unicode text data, like we do in Japan.

I understand that. It would affect all countries that use non-ASCII
encodings.

Due to Ruby's dynamic nature we could probably implement most of what is
required for internationalized strings with the current Ruby version. The
biggest problems that I see are:

- Determining the encoding of string literals in source code. This should
  be specified in the source file itself. Perhaps it could be implemented
  by overriding "require" and "load" to read the whole file into memory
  and convert it to a user/system-specific default character set before
  calling eval on it.
- Determining the encoding of strings that describe file system paths. I
  have no idea whether operating systems provide this information to
  user space applications.
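
For the first item, the read/convert/eval step could look roughly like
this. It is only a sketch under several assumptions: the "# charset: ..."
marker comment is a convention I made up, read_source and load_converted
are hypothetical names, UTF-8 stands in for the user/system default, and
the code uses the String#encode and force_encoding methods of Ruby 1.9 and
later. Actually hooking it into "require" and "load" is left out.

```ruby
# Hypothetical marker: a comment such as "# charset: koi8-r" on the first
# or second line names the file's encoding (a convention of this sketch).
CHARSET_COMMENT = /\A(?:.*\n)?#.*charset[:=]\s*([\w.-]+)/

# Read the whole file as raw bytes, tag it with the declared encoding
# (or the fallback), and transcode it to the target encoding.
def read_source(path, target: Encoding::UTF_8, fallback: Encoding::UTF_8)
  raw = File.binread(path)
  declared = raw[CHARSET_COMMENT, 1]
  source_encoding = declared ? Encoding.find(declared) : fallback
  raw.force_encoding(source_encoding).encode(target)
end

# Evaluate the converted source at top level, keeping the path so that
# error messages and backtraces still point at the original file.
def load_converted(path, **options)
  eval(read_source(path, **options), TOPLEVEL_BINDING, path)
end
```

String literals in a file loaded this way end up in the target encoding
no matter how the file was saved, at the cost of one conversion per load.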

Anyone interested in working on it?

  Tobias