On 6/26/06, Austin Ziegler <halostatue / gmail.com> wrote:
> On 6/26/06, Jim Weirich <jim / weirichhouse.org> wrote:

> Before I get started, there are two orthogonal divisions here. The first
> division is about the internal representation of a String. There is a
> camp that very strongly believes that some Unicode encoding is the only
> right way to internally represent String data. Sort of like Java's
> String without the mistake of char being UCS-2. The other camp strongly
> believes that forcing a single universal encoding is a mistake for a
> variety of reasons and would rather have an unencoded internal
> representation with an interpretive encoding tag available. These two
> camps can be referred to as UnicodeString and m17nString. I think that I
> can be safely classified as in the m17nString camp -- but there are
> caveats to that which I will address in a moment.

Note that a fixed-encoding UnicodeString has several caveats:
- You have only one encoding, and while it may be optimal in some
respects it may be suboptimal in others. This leads to a split among
UnicodeString supporters about which encoding to choose. m17n solves
this neatly by letting you choose the encoding at least per
application (see the sketch after this list for the tradeoff):
  - utf-8: most likely encountered on I/O (especially the network), so
fewer conversions. Space-efficient for languages using Latin script.
  - utf-16: sometimes encountered on I/O (file names on certain
systems). Space-efficient for most(?) other languages.
  - utf-32: fast indexing/slicing, and generally easier manipulation
(but only inside the string class).
- You cannot use a non-Unicode encoding, or even have both Unicode and
non-Unicode strings (with characters outside of Unicode) without
changing the interpreter incompatibly.
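
To make the space/speed tradeoff concrete, a rough sketch (the byte
counts below are real; everything else is only illustration, not
current Ruby):

  latin = "Résumé"        # mostly Latin script
  kana  = "こんにちは"     # five Japanese characters

  # byte cost of the same text under each encoding:
  #                  utf-8   utf-16   utf-32
  # "Résumé"           8       12       24    <- utf-8 wins for Latin text
  # "こんにちは"        15       10       20    <- utf-16 wins for most CJK text
  #
  # kana[2] is O(1) with utf-32 (fixed width), but needs a scan from the
  # start of the string with utf-8 or utf-16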

Another subdivision exists within the m17n camp about which strings are
compatible. The behavior in some other languages (which some find
unfortunate) is that strings with different encodings are incompatible
(i.e. operations on two strings always have to take strings with the
same encoding). In Matz's current proposal the only improvement over
this is allowing 7-bit ASCII strings to be added to strings where this
makes sense (i.e. to ISO-8859-[12], cp85[02], utf-8).
The other position is to make strings coerce themselves automatically
whenever a lossless conversion exists (cp1251, cp852, and iso-8859-2
should be the same set of characters ordered differently, IIRC, and
most character sets can be safely converted to utf-8). I would count
myself in the autoconversion camp.
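
A rough sketch of the difference, using the tagged-string model
discussed in this thread (the helpers read_legacy_text and
read_utf8_text are made up for illustration; none of this is current
Ruby):

  ascii = "hello"            # 7-bit ASCII, compatible with almost everything
  czech = read_legacy_text   # assume it comes back tagged iso-8859-2
  utf   = read_utf8_text     # assume it comes back tagged utf-8

  # Matz's current proposal: only ASCII mixes freely
  ascii + czech   # ok, result stays iso-8859-2
  ascii + utf     # ok, result stays utf-8
  czech + utf     # error: incompatible encodings

  # autoconversion: promote when a lossless conversion exists
  czech + utf     # ok, czech is recoded to utf-8 on the fly, result is utf-8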

Yet another subdivision is about the exact meaning of string.encoding
= :utf8. It can either just change the tag, or also check that the
string is indeed a valid utf-8 byte sequence. Matz thinks that without
checking, autoconversion would be too unreliable. I think that checking
would be good for debugging or when one wants to be paranoid, but the
ability to turn it off when I think (or find out) that my application
spends lots of time checking needlessly could be handy.
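
For illustration, the two semantics might look like this (the bang
form is a name I am inventing here purely to show the distinction):

  data = socket.read             # raw bytes from some open connection

  data.encoding = :utf8          # (a) tag only: O(1), trusts the caller; a wrong
                                 #     tag only surfaces later, far from here

  data.encoding! :utf8           # (b) tag + check: scans the bytes and raises
                                 #     unless they form valid utf-8; costs a pass
                                 #     over the string, but fails early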

>
> The second division is about the suitability of a String as a
> ByteVector. Some folks believe that the twain should never meet, others
> believe that there's little to meaningfully distinguish them in practice
> and that the resulting API would be unnecessarily complex. I can safely
> be classified in the latter camp.
>
> There is an open question about the resulting String class about how
> well it will work with various arcane features of Unicode such as
> combining characters, RTL/LTR marks, etc. and these are good questions.
> Ultimately, I believe that the answer is that it should support them as
> transparently as possible without (a) hiding *too* much and (b)
> compromising support for multiple encodings.
>
> Your first question:
>
>   How do I safely concatenate strings with potentially different
>   encodings?
>
> This deals with the first division. Under the UnicodeString camp, you
> would *always* be able to safely concatenate strings because they never
> have a separate encoding. All incoming data would have to be classified
> as binary or character data and the character data would have to be
> converted from its incoming code page to the internal representation.
>
> Under the m17nString camp, Matz has promised that compatible encodings
> would work transparently. I have gone a little further and suggested
> that we have a conversion mechanism similar to #coerce for Number
> values. I could then combine text from Win1252 and SJIS to get a
> Unicode result. Or, if I knew that my target could *only* handle SJIS, I
> would force that to result in an error.

The answer also depends on which strings are compatible. If most
strings are incompatible, you would convert all strings and other data
structures you get from IO or external libraries to your chosen
encoding, and only concatenate strings with the same encoding.
With autoconversion it will just work most of the time (i.e. when you
work with strings that can be converted to Unicode).
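
A sketch of the difference in practice (the API and the db helper are
hypothetical):

  local = "déjà vu"              # tagged utf-8
  other = db.fetch_comment       # assume it comes back tagged iso-8859-1

  # strict camp: convert at the boundary or the concatenation fails
  local + other.recode(:utf8)    # ok only because of the explicit recode

  # autoconversion camp: a lossless conversion exists, so this just works
  local + other                  # result tagged utf-8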

Writing to streams that do not support all Unicode characters is going
to be a problem most of the time (when you do not work in the output
encoding), unless write attempts the conversion first and only fails
when there are non-convertible characters. Something like this (sketch
only; the encoding argument follows Austin's File.open example below):
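  out = File.open("report.txt", "w", encoding: :iso885915)

  out.write("prix: 25 €")     # convertible: € exists in iso-8859-15, so a write
                              # that attempts conversion first succeeds
  out.write("名前: 太郎")      # not convertible: only this write should raise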

>
> Your second question:
>
>   How do I do I/O with encoded strings?
>
...
>
> The one class camp keeps file IO as it is. You can "encourage" a
> particular encoding with a variant of #2:
>
>   d1 = File.open("file.txt", "rb", encoding: :utf8) { |f| f.read }
>   d2 = File.open("file.txt", "rb") { |f|
>         f.encoding = :utf8
>         f.read
>   }
>
> However, whether you use an encoding or not, you still get a String
> back. Consider:
>
>   s1 = File.open("file.txt", "rb") { |f| f.read }
>   s2 = File.open("file.txt", "rb", encoding: :utf8) { |f| f.read }
>
>   s1.class == s2.class # true
>   s1.encoding == s2.encoding # false
>
> But that doesn't mean I have to keep treating s1 as a raw data byte
> array -- or even convert it.
>
>   s1.encoding = :utf8
>   s1.encoding == s2.encoding # true
>
> I think that the fundamental difference here is whether you view encoded
> strings as fundamentally different objects, or whether you view the
> encodings as *lenses* on how to interpret the object data. I prefer the
> latter view.

If you consider s3 = File.open('legacy.txt', 'rb', :iso885915) { |f| f.read },
then without autoconversion you would have to immediately do
s3.recode :utf8, otherwise s1 + s3 would not work.

The same goes for stuff you get from database queries (unless you are
sure you always get the right encoding), text you get from the web,
emails, third-party libraries, etc.
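
One way to live with that, without autoconversion, is to funnel every
external string through a single normalizing helper (sketch only, same
hypothetical API; fetch_mail_body is a stand-in for any library call):

  def import(str, assumed = nil)
    str.encoding = assumed if assumed   # tag data whose encoding we had to guess
    str.recode(:utf8)                   # fail here, at the boundary, not later
  end

  s3   = import(File.open('legacy.txt', 'rb') { |f| f.read }, :iso885915)
  body = import(fetch_mail_body)        # assume tagged by the mail library
  s1 + s3 + body                        # safe: everything in the app is utf-8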

Thanks

Michal