> I don't mean to shoot you down in flames, but a lot of thought and effort
> has gone into Ruby's encoding support. Ruby could have followed the Python
> route of converting everything to Unicode, but that was rejected for
> various good reasons. Automatic transcoding to solve issues of
> incompatible encodings was also rejected because it causes a number of
> problems; in particular, I believe that transcoding isn't necessarily
> accurate, because for example there may be multiple or ambiguous
> representations of the same character.
>
> What *was* introduced is the concept of a "default_internal" encoding,
> which, if used by the programmer, causes I/O and other interfaces to
> transcode to the internal encoding on input & the opposite on output.
> Typically the default_internal encoding, if used, is UTF-8, and in this
> case the programmer would have to accept that, on doing I/O to a file in
> a different encoding, the transcoding *may* cause data loss.

haha. that's fine :) i expected and asked for criticism. they're just
ideas you're criticising. no harm in that

you seem to be misunderstanding the main idea and focusing on the "perhaps
we could even go so far as to convert to a default superior encoding if
needed during concatenation" part. that was secondary and isn't necessary
to the success of the idea

also, you say in the first paragraph that ruby rejected the idea of
following python by converting everything to unicode, yet you acknowledge
in the second paragraph that ruby does, in fact, do this very thing via
the concept of the default internal encoding; it just doesn't wave it in
the programmer's face and it's voluntary. is this not partly contradictory?
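
(for concreteness, the voluntary version is something like this, if i
understand the 1.9 api right; the filename is made up:)

    # set once at program start: files on disk assumed shift_jis,
    # program works in utf-8 internally
    Encoding.default_external = Encoding::Shift_JIS
    Encoding.default_internal = Encoding::UTF_8

    File.open("data.txt") do |f|
      line = f.gets
      line.encoding  # => #<Encoding:UTF-8>, transcoded on input
    end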

the data loss when the strings leave ruby would happen anyway. whether the
programmer chose to work in a better encoding within ruby, or it happened
automatically under my proposal, if they then had to write files in a
lesser encoding there would be no true data loss compared to staying
within the restrictions of the lesser encoding the whole time, just a loss
of the benefits gained by working in the better encoding. if the output
encoding is restricted, that is a problem independent of what ruby does or
doesn't do. within ruby itself there would be no information loss, nor any
unnecessary errors raised, and that is the important thing
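
(to make that concrete, a rough irb sketch; String#encode's replace
options are what i mean by losing only at the output boundary:)

    s = "naïve résumé"  # utf-8 inside ruby: no loss, no errors
    # loss only happens when the string is forced into a lesser encoding:
    s.encode("US-ASCII", invalid: :replace, undef: :replace, replace: "?")
    # => "na?ve r?sum?"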

>> we first add a function
>> to do actual conversions between two encodings based on character, not
>> just reinterpreting the byte values. so c in latin-1 (0x63) would
>> become c in utf-32 (0x00000063).
>
> String#encode does this I believe

this was just an example. what if a string held the japanese character ka
in shift jis and was being converted to utf-8? the byte values would be
entirely different, and encode() is not capable of doing that, is it?
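
(a quick irb check, using katakana ka; the bytes are from the shift jis
and utf-8 tables as i remember them, so worth double-checking:)

    ka = "\x83\x4A".force_encoding("Shift_JIS")  # katakana ka in shift jis
    ka.encode("UTF-8").bytes.map { |b| format("0x%02X", b) }
    # => ["0xE3", "0x82", "0xAB"], entirely different bytes, same character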

>> it could have lists of which encodings are
>> supersets of other encodings
>
> Unfortunately it turns out that the only encoding that we can reliably
> state is a subset of any other encoding is US-ASCII, and Ruby knows
> about this and optimizes for it.

well, wikipedia seems to suggest jis x 0201 is a subset of shift jis (i
was also wrongly thinking that latin-1 is a subset of utf-8), but this
doesn't really matter. it is only an optimisation and the success of my
proposal doesn't rest on it
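
(the kind of table i mean is only a few lines of ruby; these names are
made up for illustration, not a real api:)

    # toy superset table: encodings in which every character of the key
    # encoding keeps the same meaning
    SUPERSETS = {
      "US-ASCII"  => ["UTF-8", "ISO-8859-1"],
      "JIS X0201" => ["Shift_JIS"],  # per the wikipedia claim, unverified
    }

    def superset?(sub, sup)
      SUPERSETS.fetch(sub, []).include?(sup)
    end

    superset?("US-ASCII", "UTF-8")  # => true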

i have a feeling i probably won't get anywhere with this, sadly :) ruby
may have too much momentum. what does everyone else think?