> I don't mean to shoot you down in flames, but a lot of thought and effort
> has gone into Ruby's encoding support. Ruby could have followed the Python
> route of converting everything to Unicode, but that was rejected for
> various good reasons. Automatic transcoding to solve issues of incompatible
> encodings was also rejected because it causes a number of problems; in
> particular I believe that transcoding isn't necessarily accurate, because
> for example there may be multiple or ambiguous representations of the same
> character.
>
> What *was* introduced is the concept of a "default_internal" encoding,
> which, if used by the programmer, causes I/O and other interfaces to
> transcode to the internal encoding on input & the opposite on output.
> Typically the default_internal encoding, if used, is UTF-8, and in this
> case the programmer would have to accept that, on doing I/O to a file in a
> different encoding, the transcoding *may* cause data loss.

haha. that's fine :) i expected and asked for criticism. they're just ideas you're criticising. no harm in that.

you seem to be misunderstanding the main idea and focusing on the "perhaps we could even go so far as to convert to a default superior encoding if needed during concatenation" part. that was secondary and isn't necessary to the success of the idea.

also, you say in the first paragraph that ruby rejected the idea of following python by converting everything to unicode, yet acknowledge in the second paragraph that ruby does, in fact, do this very thing using the concept of the default internal encoding; it just doesn't wave it in the programmer's face and is voluntary. is this not partly contradictory?

the data loss when the strings leave ruby would happen anyway.
whether the programmer chose to work in a better encoding within ruby (or it happened automatically under my proposal) but had to write files in a lesser encoding, or chose to stay within the restrictions of the lesser encoding the whole time, there would be no true data loss, just a loss of the benefits gained by working in the better encoding. if the output encoding is restricted, that is a problem independent of what ruby does or doesn't do. within ruby itself there would be no information loss, and that is the important thing, nor would any unnecessary errors be raised.

>> we first add a function
>> to do actual conversions between two encodings based on character, not
>> just reinterpreting the byte values. so c in latin-1 (0x63) would become
>> c in utf-32 (0x00000063).
>
> String#encode does this I believe

this was just an example. what if a string had the japanese character ka in shift jis and was being converted to utf-8? the byte values would be entirely different, and encode() is not capable of doing this, is it?

>> it could have lists of which encodings are
>> supersets of other encodings
>
> Unfortunately it turns out that the only encoding that we can reliably
> state is a subset of any other encoding is US-ASCII, and Ruby knows about
> this and optimizes for it.

well, wikipedia seems to suggest jis 0201 is a subset of shift jis (i was also thinking, falsely, that latin-1 is a subset of utf-8), but this doesn't really matter. it is only an optimisation and the success of my proposal doesn't rest on it.

i have a feeling i probably won't get anywhere with this, sadly :) ruby may have too much momentum. what does everyone else think?
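
the encode() question is easy to check, so here is a minimal sketch. it assumes a stock ruby 1.9 or later with the Shift_JIS and ISO-8859-1 transcoders available, and all the variable names are made up for illustration. it compares a character-level conversion (String#encode) with a plain relabelling of the bytes (String#force_encoding), and then tries writing the same character into a lesser encoding.

```ruby
# hiragana "ka" (U+304B), written as an escape so the source file's
# own encoding doesn't affect the example
ka_utf8 = "\u304B"

# character-level conversion: same character, different bytes
ka_sjis = ka_utf8.encode(Encoding::Shift_JIS)

hex = ->(s) { s.bytes.map { |b| format("%02X", b) }.join(" ") }
puts "shift jis bytes: #{hex.call(ka_sjis)}"
puts "utf-8 bytes:     #{hex.call(ka_utf8)}"

# converting back should give the original character again
round_trip = ka_sjis.encode(Encoding::UTF_8)
puts "round trip ok:   #{round_trip == ka_utf8}"

# force_encoding only relabels the existing bytes, no conversion happens
relabelled = ka_sjis.dup.force_encoding(Encoding::UTF_8)
puts "relabelled bytes valid as utf-8: #{relabelled.valid_encoding?}"

# leaving ruby for a lesser encoding: the character has no latin-1
# equivalent, so encode raises unless told to substitute something
lossy = begin
  ka_utf8.encode(Encoding::ISO_8859_1)
rescue Encoding::UndefinedConversionError
  ka_utf8.encode(Encoding::ISO_8859_1, undef: :replace)
end
puts "latin-1 fallback bytes: #{hex.call(lossy)}"
```

if this behaves as i expect, the two byte sequences differ while the round trip still compares equal, which would mean encode() already converts by character rather than by byte value. the last step is the "restricted output encoding" case: the loss only shows up at the point where the string is forced into the lesser encoding.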