On 6/14/06, Austin Ziegler <halostatue / gmail.com> wrote:
> On 6/14/06, Michal Suchanek <hramrach / centrum.cz> wrote:
> > What I want is all methods working seamlessly with unicode strings so
> > that I do not have to think about the encoding.
>
> That will *never* happen. Even with Unicode, you have to think about
> the encoding, because UTF-32 (the closest representation to the
> Platonic ideal "Unicode" you'll ever find) is unlikely to be supported
> in the general case. Matz's idea of m17n strings is the right one: you
> have a "byte stream" and an attribute which indicates how the byte
> stream is encoded. This will sort of be like $KCODE but on an
> individual string level so that you could meaningfully have Unicode
> (probably UTF-8) and ShiftJIS strings in the same data and still
> meaningfully call #length on them.
>
> You will *always* have to care about the encoding. As well as,
> ultimately, your locale.

No. Since I have a locale, stdin can be marked with the proper
encoding, so that all strings originating there carry the proper
encoding information.

The string methods should not just blindly operate on bytes but use
the encoding information to operate on characters rather than bytes.
Sure, something like byte_length is needed when the string is stored
somewhere outside Ruby, but the standard string methods should work
with character offsets and characters, not byte offsets or bytes.
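
To make the distinction concrete, here is a rough sketch. It assumes
Ruby 1.9 or later, where every String carries an encoding; `measure`
is a hypothetical helper invented for this illustration, and
`bytesize` stands in for the "byte_length" mentioned above.

```ruby
# Hypothetical helper: report character-level and byte-level facts
# about a string, relying on the string's own encoding attribute.
def measure(str)
  {
    chars: str.length,    # number of characters
    bytes: str.bytesize,  # number of bytes in the stored form
    third: str[2]         # indexing uses character offsets
  }
end

word = "na\u00EFve"       # "naive" with a diaeresis; \u makes it UTF-8
info = measure(word)

puts info[:chars]   # 5
puts info[:bytes]   # 6
puts info[:third]   # the two-byte character at character offset 2
```

The character count and the byte count differ only because one
character takes two bytes in UTF-8; the caller never has to say so.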

Since my stdout can also be marked with the correct encoding, the
strings that are output there can be converted to that encoding, even
if they originate from a source file that happens to be in a
different encoding.
Hmm, perhaps it will be necessary to mark source files with encoding
tags as well. It could be quite tedious to assign the tag manually to
every string in a source file.
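
As a sketch of what that could look like, assuming Ruby 1.9 or later:
there, a magic comment on the first line of a file tags every literal
in it at once, and `String#encode` does the conversion. The helper
`for_output` and the choice of ISO-8859-1 as the output encoding are
assumptions made up for this example.

```ruby
# encoding: utf-8
# (The comment above only takes effect as the first line of a file;
# it marks all string literals in that file as UTF-8.)

# Hypothetical helper: convert a string to the encoding the output
# side expects. ISO-8859-1 is just an example target.
def for_output(str, target = Encoding::ISO_8859_1)
  str.encode(target)
end

text   = "caf\u00E9"          # UTF-8 text from the "source file"
latin1 = for_output(text)

puts text.bytesize     # 5
puts latin1.bytesize   # 4
puts latin1.length     # 4, same character count as the original
puts latin1.encoding   # ISO-8859-1
```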

When strings are compared, concatenated, etc., the encoding is known,
so the methods should do the right thing.
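
A minimal sketch of "the right thing", assuming Ruby 1.9 or later.
The helpers `same_text?` and `join_text` are hypothetical; they bring
both strings to a common encoding first. As far as I know, plain `==`
and `+` in those versions do not transcode on their own: `==` returns
false for differently encoded non-ASCII text, and `+` raises
Encoding::CompatibilityError.

```ruby
# Hypothetical helper: compare by characters, not by stored bytes.
def same_text?(a, b, common = Encoding::UTF_8)
  a.encode(common) == b.encode(common)
end

# Hypothetical helper: concatenate after converting to one encoding.
def join_text(a, b, common = Encoding::UTF_8)
  a.encode(common) + b.encode(common)
end

utf8   = "caf\u00E9"
latin1 = utf8.encode(Encoding::ISO_8859_1)

puts utf8 == latin1                   # false, the stored bytes differ
puts same_text?(utf8, latin1)         # true
puts join_text(utf8, latin1).length   # 8
```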

I do not have to care about encoding. You may make a string
implementation that forces me to care (such as the current one). But
I do not have to. I can always turn to Perl if I get really desperate.

Thanks

Michal