On Wed, 31 Jul 2002, Yukihiro Matsumoto wrote:

> In message "Unicode in Ruby now?"
>     on 02/07/31, Tobias Peters <tpeters / invalid.uni-oldenburg.de> writes:
>
> |When I export a string to an utf-8 encoded stream, how can I possibly know
> |its current encoding. Strings do not have an "encoding" tag. Will they
> |have in future?
>
> Yes.

Nice. I still think sources and sinks of characters also need an
"encoding" property. Strings originating from some character source
will then have the source's encoding. Strings exported to a character
sink will have to be converted on the fly if the sink uses a different
encoding. We could make the behaviour for unconvertible characters a
property of the character sink.

We also need rules for how to combine strings with different encodings.
Concatenating two strings encoded in koi8-r and iso-8859-1,
respectively, may only be possible when the result is encoded in some
Unicode representation.

Are there any other character sets of relevance that are not part of
Unicode yet? Otherwise, we could probably live with just two possible
canonical character encodings. By canonical I mean here the encoding
of a string that results from combining other strings with different
encodings.

> No. Considering the existence of "big character set" like Mojikyo
> (charset developed in Japan, which is bigger than Unicode), there
> cannot be any ideal canonical format.

I understand that Mojikyo will not be folded into Unicode for political
reasons. Combining Ruby strings in some Unicode encoding with Ruby
strings in some Mojikyo encoding might then result in a runtime error.

> In addition, from my
> estimation, the cost penalty from code conversion to/from the
> canonical character set is intolerable if one processes mainly on
> non-ASCII, non-Unicode text data, like we do in Japan.

I understand that. It would affect all countries that use non-ASCII
encodings.

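The sink and combination rules proposed earlier in this message can be
sketched roughly as follows. This is only an illustration under stated
assumptions: it presumes strings that already carry an encoding tag and
uses String#encode, String#force_encoding and the Encoding constants as
found in later Ruby releases (1.9 and up), not in the Ruby discussed in
this thread. The helper names "export" and "combine", the
"on_unconvertible" option, and the choice of UTF-8 as the canonical
encoding are hypothetical.

```ruby
# Hypothetical helper: convert a string for a character sink.
# The sink's policy for unconvertible characters is :raise or :replace.
def export(str, sink_encoding, on_unconvertible: :raise)
  case on_unconvertible
  when :raise   then str.encode(sink_encoding)
  when :replace then str.encode(sink_encoding, undef: :replace)
  else raise ArgumentError, "unknown policy: #{on_unconvertible}"
  end
end

# Hypothetical helper: strings with the same encoding are concatenated
# directly; otherwise both are converted to a canonical Unicode
# encoding first (UTF-8 is an assumption made for this sketch).
def combine(a, b, canonical = Encoding::UTF_8)
  return a + b if a.encoding == b.encoding
  a.encode(canonical) + b.encode(canonical)
end

# Raw bytes tagged with the encoding of their (imagined) source.
russian = "\xD0\xD2\xC9\xD7\xC5\xD4".dup.force_encoding(Encoding::KOI8_R)
german  = "Gr\xFC\xDFe".dup.force_encoding(Encoding::ISO_8859_1)

# koi8-r + iso-8859-1 only fits into a Unicode representation.
combined = combine(russian, german)

# Cyrillic has no iso-8859-1 representation, so the sink policy decides:
# :replace substitutes a placeholder, :raise (the default) raises an
# Encoding::UndefinedConversionError.
lossy = export(russian, Encoding::ISO_8859_1, on_unconvertible: :replace)
```

Note that combine converts only when the encodings differ, so text that
stays within one encoding pays no conversion cost. A combination for
which no common encoding exists would raise from String#encode, which
matches the runtime error mentioned above.
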
Due to Ruby's dynamic nature, we could probably implement most of what
is required for international strings with the current Ruby version.
The biggest problems that I see are:

- Determining the encoding of string literals in source code. This
  should be specified in the source file itself. Perhaps it is possible
  to implement this by overriding "require" and "load" so that they
  read the whole file into memory and convert it to a user- or
  system-specific default character set before calling eval on it.
- Determining the encoding of strings that describe file system paths.
  I have no idea whether operating systems provide this information to
  user-space applications.

Anyone interested in working on it?

Tobias
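
A minimal sketch of the first point, the "read, convert, eval" loader.
It is hedged in the same way as any such illustration: the name
"load_converted" is hypothetical, the source encoding is passed in by
the caller rather than read from the file itself, UTF-8 stands in for
the user- or system-specific default, and the calls used
(File.binread, String#force_encoding, String#encode) come from later
Ruby releases (1.9 and up).

```ruby
require "tempfile"

# Hypothetical stand-in for an overridden "load": read the file as raw
# bytes, tag them with the file's encoding, convert the text to the
# default character set, then eval the converted source.
def load_converted(path, source_encoding, default_encoding = Encoding::UTF_8)
  raw = File.binread(path)
  source = raw.force_encoding(source_encoding).encode(default_encoding)
  eval(source, TOPLEVEL_BINDING, path)
  true
end

# Usage: a source file stored in iso-8859-1 that contains a string literal.
Tempfile.create(["latin1_source", ".rb"]) do |f|
  f.binmode
  f.write("$greeting = \"Gr\xFC\xDFe\"\n".b)
  f.flush
  load_converted(f.path, Encoding::ISO_8859_1)
end
```

After the call, the literal from the file is available in the default
encoding instead of the file's own. The sketch covers neither "require"
bookkeeping (search path, loading a file only once) nor the second
point about file system paths.
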