On 6/26/06, Jim Weirich <jim / weirichhouse.org> wrote:
> I've been following this debate with some interest.  Alas, since my
> unicode/m17n experience is quite limited, I don't have a strong
> opinion in the matter.
>
> But the following caught my eye:
> Austin Ziegler wrote:
>> [...] Ruby *will* distinguish between a String without an encoding
>> and a String with an encoding. You're basing your opposition to
>> tomorrow's behaviour based on today's (known bad) behaviour.
> Part of the problem is that we are basing our discussions on
> descriptions of what will happen in the future, but that makes it
> difficult to understand the issues involved without real code.
>
> What I would like to see is prototype implementations of both
> approaches, and see the differences in how they affect the code.  I'm
> more interested in answering questions like "How do I safely
> concatenate strings with potentially different encodings" and "How do
> I do I/O with encoded strings" rather than addressing efficiency
> questions.  In other words, how do the different approaches affect the
> way I write code.
>
> I think it would be a great idea to prototype these ideas in real code
> to understand the advantages and disadvantages of each.

I mostly agree with you here (about prototyping), Jim. There are a few
things that I think can be done without working code. I often start from
this point in my own programs, anyway. I'll try to address each of your
questions as I understand them. Hopefully, Matz or other participants
will step in and correct me where I'm wrong.

Before I get started, there are two orthogonal divisions here. The first
division is about the internal representation of a String. There is a
camp that very strongly believes that some Unicode encoding is the only
right way to internally represent String data. Sort of like Java's
String without the mistake of char being UCS-2. The other camp strongly
believes that forcing a single universal encoding is a mistake for a
variety of reasons and would rather have an unencoded internal
representation with an interpretive encoding tag available. These two
camps can be referred to as UnicodeString and m17nString. I think that I
can be safely classified as in the m17nString camp -- but there are
caveats to that which I will address in a moment.

The second division is about the suitability of a String as a
ByteVector. Some folks believe that the twain should never meet; others
believe that there's little to meaningfully distinguish them in
practice, and that splitting them would make the resulting API
unnecessarily complex. I can safely be classified in the latter camp.

There is an open question about how well the resulting String class
will work with various arcane features of Unicode, such as combining
characters, RTL/LTR marks, etc. These are good questions. Ultimately, I
believe that the answer is that it should support them as transparently
as possible without (a) hiding *too* much and (b) compromising support
for multiple encodings.
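To make the combining-character point concrete, here is a small
illustration. It assumes a Unicode-aware String whose literals
understand \u escapes and whose #length counts characters rather than
bytes -- a sketch of the target behaviour, not something today's Ruby
does:

```ruby
# Two renderings of "e-acute": a single precomposed codepoint versus a
# base letter followed by a combining accent. They display identically
# but are different codepoint sequences, so naive comparison and
# length disagree. (Assumes a Unicode-aware String with \u literals
# and per-character #length.)
precomposed = "\u00E9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
combining   = "e\u0301"  # "e" + U+0301 COMBINING ACUTE ACCENT

precomposed == combining   # => false -- different codepoint sequences
precomposed.length         # => 1
combining.length           # => 2
```

Whether those two should compare equal -- and at what cost -- is
exactly the sort of question either camp has to answer.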

Your first question:

  How do I safely concatenate strings with potentially different
  encodings?

This deals with the first division. Under the UnicodeString camp, you
would *always* be able to safely concatenate strings because they never
have a separate encoding. All incoming data would have to be classified
as binary or character data and the character data would have to be
converted from its incoming code page to the internal representation.
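As a sketch of what that forced conversion step implies, consider
mapping a Windows-1252 byte stream to Unicode codepoints on the way
in. The table below is a three-entry excerpt and the helper method is
invented purely for illustration; a real implementation would use
complete conversion tables for every supported code page:

```ruby
# Toy version of the conversion the UnicodeString camp would require
# on every character read: incoming bytes are mapped from their code
# page (here, a tiny excerpt of Windows-1252) into Unicode codepoints
# before they may become String data.
WIN1252_TO_UNICODE = {
  0x80 => 0x20AC,  # EURO SIGN
  0x93 => 0x201C,  # LEFT DOUBLE QUOTATION MARK
  0x94 => 0x201D   # RIGHT DOUBLE QUOTATION MARK
}

def to_internal(bytes)
  # Bytes that Windows-1252 shares with ASCII map to themselves.
  bytes.map { |b| WIN1252_TO_UNICODE.fetch(b, b) }
end

to_internal([0x93, 0x68, 0x69, 0x94])  # => [0x201C, 0x68, 0x69, 0x201D]
```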

Under the m17nString camp, Matz has promised that compatible encodings
would work transparently. I have gone a little further and suggested
that we have a conversion mechanism similar to #coerce for Number
values. I could then combine text from Win1252 and SJIS to get a
Unicode result. Or, if I knew that my target could *only* handle SJIS, I
would force that to result in an error.
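Here is a rough sketch of what that #coerce-style mechanism might look
like. EncodedString and its promotion table are entirely hypothetical
-- invented names to show the shape of the idea, not a proposed API:

```ruby
# Hypothetical sketch of a #coerce-style mechanism for strings with
# different encodings, mirroring Numeric#coerce. EncodedString and
# PROMOTE are invented for illustration only.
class EncodedString
  attr_reader :text, :encoding

  # Toy table of which common encoding two encodings promote to. A
  # real mechanism would consult actual transcoding capabilities.
  PROMOTE = {
    [:win1252, :sjis] => :utf8,
    [:sjis, :win1252] => :utf8
  }

  def initialize(text, encoding)
    @text, @encoding = text, encoding
  end

  # Coerce self and other to a common encoding, or raise when no
  # common encoding exists.
  def coerce(other)
    return [self, other] if encoding == other.encoding
    common = PROMOTE[[encoding, other.encoding]] or
      raise ArgumentError, "cannot combine #{encoding} with #{other.encoding}"
    # A real implementation would transcode the bytes; the toy retags.
    [EncodedString.new(text, common), EncodedString.new(other.text, common)]
  end

  def +(other)
    a, b = coerce(other)
    EncodedString.new(a.text + b.text, a.encoding)
  end
end

a = EncodedString.new("Gruesse, ", :win1252)
b = EncodedString.new("konnichiwa", :sjis)
(a + b).encoding  # => :utf8 -- Win1252 + SJIS promotes to Unicode
```

Restricting a target to SJIS-only then falls out naturally: leave the
promotion out of the table and the combination raises instead.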

Your second question:

  How do I do I/O with encoded strings?

This deals a little with the first division, but mostly with the
second. Note, by the way, that the UnicodeString camp would *require* a
completely separate ByteArray class, because you could not then read a
JPEG into a String -- its values would be converted to Unicode
representations, rendering it unusable as a JPEG.

The two class (String/ByteArray) camp would probably require that you
either (1) change all IO operations to produce encoded strings via a
pragma-style setting, (2) change individual IO operations, (3) use a
separate API, or (4) read a ByteArray and *convert* it to a
UnicodeString. In any case, they seem to want an API where they can say
"read this IO and give me a UnicodeString as output" and conversely
"read this IO and give me a ByteArray as output." (Note: this could
apply whether we have a UnicodeString or an m17nString -- but the
requests have come most often from UnicodeString supporters.)
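The kind of split API being asked for might look roughly like this.
ByteArray and both reader methods are invented here purely to show the
shape; nothing like them exists:

```ruby
require 'stringio'

# Toy sketch of the split API the two-class camp seems to want: one
# entry point yields character data, the other raw bytes.
class ByteArray
  attr_reader :bytes
  def initialize(bytes)
    @bytes = bytes
  end
end

module SplitIO
  # "read this IO and give me a (Unicode)String as output"
  def self.read_text(io)
    io.read  # a real version would transcode into the internal form
  end

  # "read this IO and give me a ByteArray as output"
  def self.read_bytes(io)
    ByteArray.new(io.read.bytes.to_a)
  end
end

SplitIO.read_text(StringIO.new("hello")).class   # => String
SplitIO.read_bytes(StringIO.new("hello")).class  # => ByteArray
```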

The one class camp keeps file IO as it is. You can "encourage" a
particular encoding with a variant of #2:

  d1 = File.open("file.txt", "rb", encoding: :utf8) { |f| f.read }
  d2 = File.open("file.txt", "rb") { |f|
    f.encoding = :utf8
    f.read
  }

However, whether you use an encoding or not, you still get a String
back. Consider:

  s1 = File.open("file.txt", "rb") { |f| f.read }
  s2 = File.open("file.txt", "rb", encoding: :utf8) { |f| f.read }

  s1.class == s2.class # true
  s1.encoding == s2.encoding # false

But that doesn't mean I have to keep treating s1 as a raw data byte
array -- or even convert it.

  s1.encoding = :utf8
  s1.encoding == s2.encoding # true

I think that the fundamental difference here is whether you view encoded
strings as fundamentally different objects, or whether you view the
encodings as *lenses* on how to interpret the object data. I prefer the
latter view.
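A toy class makes the lens view concrete: retagging changes how the
same bytes are interpreted, and no data is copied or converted.
TaggedString is, again, purely illustrative:

```ruby
# Toy illustration of encodings as *lenses*: the object and its bytes
# stay the same; only the tag saying how to interpret them changes.
class TaggedString
  attr_accessor :encoding
  attr_reader :data

  def initialize(data, encoding = :binary)
    @data, @encoding = data, encoding
  end
end

s = TaggedString.new("caf\xC3\xA9")  # raw bytes read off the wire
s.encoding        # => :binary -- currently viewed as a byte vector
s.encoding = :utf8
s.data            # unchanged: same object, same bytes, new lens
```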

-austin
-- 
Austin Ziegler * halostatue / gmail.com * http://www.halostatue.ca/
               * austin / halostatue.ca * http://www.halostatue.ca/feed/
               * austin / zieglers.ca