On 6/26/06, Jim Weirich <jim / weirichhouse.org> wrote:
> I've been following this debate with some interest. Alas, since my
> unicode/m17n experience is quite limited, I don't have a strong
> opinion in the matter.
>
> But the following caught my eye:
> Austin Ziegler wrote:
>> [...] Ruby *will* distinguish between a String without an encoding
>> and a String with an encoding. You're basing your opposition to
>> tomorrow's behaviour based on today's (known bad) behaviour.
> Part of the problem is that we are basing our discussions on
> descriptions of what will happen in the future, but that makes it
> difficult to understand the issues involved without real code.
>
> What I would like to see is prototype implementations of both
> approaches, and see the differences in how they affect the code. I'm
> more interested in answering questions like "How do I safely
> concatenate strings with potentially different encodings" and "How do
> I do I/O with encoded strings" rather than addressing efficiency
> questions. In other words, how do the different approaches affect the
> way I write code.
>
> I think it would be a great idea to prototype these ideas in real code
> to understand the advantages and disadvantages of each.

I mostly agree with you here (about prototyping), Jim. There are a few
things that I think can be done without working code. I often start
from this point in my own programs, anyway. I'll try to address each of
your questions as I understand them. Hopefully, Matz or other
participants will step in and correct me where I'm wrong.

Before I get started, there are two orthogonal divisions here.

The first division is about the internal representation of a String.
There is a camp that very strongly believes that some Unicode encoding
is the only right way to internally represent String data, something
like Java's String without the mistake of char being UCS-2.
The other camp strongly believes that forcing a single universal
encoding is a mistake for a variety of reasons and would rather have an
unencoded internal representation with an interpretive encoding tag
available. These two camps can be referred to as UnicodeString and
m17nString. I think that I can safely be classified as in the
m17nString camp -- but there are caveats to that which I will address
in a moment.

The second division is about the suitability of a String as a
ByteVector. Some folks believe that the twain should never meet; others
believe that there's little to meaningfully distinguish them in
practice and that an API which separates them would be unnecessarily
complex. I can safely be classified in the latter camp.

An open question is how well the resulting String class will work with
various arcane features of Unicode such as combining characters,
RTL/LTR marks, etc., and these are good questions. Ultimately, I
believe that the answer is that it should support them as
transparently as possible without (a) hiding *too* much and (b)
compromising support for multiple encodings.

Your first question: How do I safely concatenate strings with
potentially different encodings?

This deals with the first division. Under the UnicodeString camp, you
would *always* be able to safely concatenate strings because they never
have a separate encoding. All incoming data would have to be classified
as binary or character data, and the character data would have to be
converted from its incoming code page to the internal representation.

Under the m17nString camp, Matz has promised that compatible encodings
would work transparently. I have gone a little further and suggested
that we have a conversion mechanism similar to #coerce for Numeric
values. I could then combine text from Win1252 and SJIS to get a
Unicode result. Or, if I knew that my target could *only* handle SJIS,
I could force that combination to result in an error instead.
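As a rough sketch of the coercion idea above, here is how the Win1252 plus SJIS case can be written against the encoding API that Ruby later shipped (String#encode, String#force_encoding, Encoding::CompatibilityError). The helper name concat_as and its argument order are made up for illustration; nothing like it was proposed in this thread, and a real #coerce-style mechanism might look quite different.

```ruby
# Sketch only: concat_as is a hypothetical helper, not a proposed API.
# It transcodes both operands to one target encoding before joining them.
def concat_as(target, a, b)
  a.encode(target) + b.encode(target)
end

# "café" as Windows-1252 bytes and "日本" as Shift_JIS bytes.
win  = [0x63, 0x61, 0x66, 0xE9].pack("C*").force_encoding("Windows-1252")
sjis = [0x93, 0xFA, 0x96, 0x7B].pack("C*").force_encoding("Shift_JIS")

# Joining them directly is refused: the two encodings are incompatible.
direct = begin
  win + sjis
rescue Encoding::CompatibilityError => e
  e.class
end

# Converting both sides to a common Unicode encoding succeeds.
unicode = concat_as("UTF-8", win, sjis)

# Insisting on an SJIS-only result fails, because "é" has no Shift_JIS
# representation.
sjis_only = begin
  concat_as("Shift_JIS", win, sjis)
rescue Encoding::UndefinedConversionError => e
  e.class
end

puts direct
puts unicode
puts sjis_only
```

The point of the sketch is that the caller picks the target: a Unicode target accepts both inputs, while a narrower target turns an impossible conversion into an explicit error rather than silent data loss.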
Your second question: How do I do I/O with encoded strings?

This also sort of deals with the first division, but it also deals with
the second. Note, by the way, that the UnicodeString camp would
*require* a completely separate ByteArray class because you could not
then read a JPEG into a String -- its values would be converted to
Unicode representations, rendering it unusable as a JPEG.

The two-class (String/ByteArray) camp would probably require that you
either (1) switch all IO operations to encoded strings using a
pragma-style setting, (2) change individual IO operations, (3) use a
separate API, or (4) read a ByteArray and *convert* it to a
UnicodeString. Whichever they choose, they seem to want an API where
they can say "read this IO and give me a UnicodeString as output" and
conversely "read this IO and give me a ByteArray as output." (Note:
this could apply whether we have a UnicodeString or an m17nString --
but the requests have come most often from UnicodeString supporters.)

The one-class camp keeps file IO as it is. You can "encourage" a
particular encoding with a variant of #2:

  d1 = File.open("file.txt", "rb", encoding: :utf8) { |f| f.read }
  d2 = File.open("file.txt", "rb") { |f|
    f.encoding = :utf8
    f.read
  }

However, whether you use an encoding or not, you still get a String
back. Consider:

  s1 = File.open("file.txt", "rb") { |f| f.read }
  s2 = File.open("file.txt", "rb", encoding: :utf8) { |f| f.read }

  s1.class == s2.class # true
  s1.encoding == s2.encoding # false

But that doesn't mean I have to keep treating s1 as a raw data byte
array -- or even convert it.

  s1.encoding = :utf8
  s1.encoding == s2.encoding # true

I think that the fundamental difference here is whether you view
encoded strings as fundamentally different objects, or whether you view
the encodings as *lenses* on how to interpret the object data. I prefer
the latter view.
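For anyone who wants to try the "lens" view in running code, here is a minimal sketch using the API that Ruby later shipped rather than the proposed syntax above: the "r:UTF-8" mode string stands in for the encoding: :utf8 option, and String#force_encoding stands in for s1.encoding = :utf8. The temp file and the variable names other than s1 and s2 are only there for illustration.

```ruby
require "tempfile"

# Write "naïve" as UTF-8 bytes: 5 characters, 6 bytes.
file = Tempfile.new("lens")
file.binmode
file.write("na\u00EFve")
file.close

s1 = File.open(file.path, "rb")      { |f| f.read }  # untagged, binary read
s2 = File.open(file.path, "r:UTF-8") { |f| f.read }  # read tagged as UTF-8

same_class     = s1.class == s2.class        # both are plain Strings
same_bytes     = s1.bytes == s2.bytes        # identical underlying data
tagged_alike   = s1.encoding == s2.encoding  # the tags differ at this point
lengths_before = [s1.length, s2.length]      # counted in bytes vs. characters

# Change the lens on s1. No bytes are converted; only the tag changes.
s1.force_encoding("UTF-8")
tagged_after  = s1.encoding == s2.encoding
lengths_after = [s1.length, s2.length]

file.unlink

puts same_class, same_bytes, tagged_alike, tagged_after
p lengths_before, lengths_after
```

Retagging leaves the bytes alone, which is why the same object can be treated as a byte array one moment and as text the next without a second class.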
-austin
--
Austin Ziegler * halostatue / gmail.com * http://www.halostatue.ca/
               * austin / halostatue.ca * http://www.halostatue.ca/feed/
               * austin / zieglers.ca