Yukihiro Matsumoto wrote:

> |3a. If there is a way of creating literal strings in other encodings,
> |   is there also a way of creating literal regex's in other encodings?
> 
> If the encoding of the script is ascii (or binary, which is an alias
> to ascii), you can do it by using octet (or decimal) string
> representation + specifying encoding explicitly, e.g.
> 
>   # my family name in Japanese in euc-jp encoding
>   "\244\336\244\304\244\342\244\310".encoding="euc-jp"

Maybe that could be equivalent to String.new("\244\336...", "euc-jp")?
(And I think it would somehow need to work for all possible script 
encodings, but I'm not sure if this is possible when all string literals 
automatically use the script encoding. This might be a problem.)
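
For comparison, here is a minimal sketch of how this can be written in Ruby 1.9 and later. That is an assumption relative to the M17N prototype discussed here, whose API may differ. String#force_encoding only relabels the bytes without transcoding them, and String.new accepts an encoding: keyword from Ruby 2.3 on. The variable names are made up for illustration.

```ruby
# Octal escapes produce the raw bytes regardless of the script encoding;
# the encoding label is attached afterwards.
bytes = "\244\336\244\304\244\342\244\310"

# Relabel a copy of the bytes as EUC-JP (no transcoding happens).
name = bytes.dup.force_encoding("EUC-JP")

# Roughly the String.new(..., "euc-jp") form suggested above.
via_new = String.new(bytes, encoding: "EUC-JP")

puts name.valid_encoding?   # true: the bytes form valid EUC-JP
puts name.length            # 4 characters
puts name.bytesize          # 8 bytes
puts name.encode("UTF-8")   # transcoded copy for display

# A regex built from an encoded string takes on that string's encoding.
re = Regexp.new(name)
puts re.encoding
```

Because the bytes are written as escapes, the same lines should behave identically whatever the script encoding is.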

(I think this would need to be used in libraries that return external 
content, for example from the web. So I think it would be nice to have a 
more natural syntax for it that is guaranteed to work.)

> String explode (name might be changed) returns an array of fixnums,
> which means s.explode.length == s.length (String#size now returns the
> byte length of the string under the current M17N prototype, but I
> consider it's a wrong decision, and will be fixed in the 1.9).

Does this mean that #size and #length would do different things, or am
I just misunderstanding?

Here's how I think things work:

1) #size returns the number of code points in the String
2) #length is an alias for #size
3) #explode returns an Array of raw characters
4) str.explode.size == str.size if str.encoding == "raw"
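
To make the four points concrete, here is a small sketch using the methods that exist in Ruby 1.9 and later. This is an assumption: the prototype's #explode is not available there, so String#codepoints stands in for it, and String#b (Ruby 2.0+) stands in for the "raw" encoding.

```ruby
# Four hiragana characters, written with \u escapes so the literal is UTF-8.
s = "\u307E\u3064\u3082\u3068"

puts s.size        # 4: counts characters, not bytes
puts s.length      # 4: same method under another name
puts s.bytesize    # 12: three bytes per character in UTF-8

# Stand-in for #explode: one Integer per character.
p s.codepoints     # [12414, 12388, 12418, 12392]

# Stand-in for the "raw" encoding: every byte counts as one character.
raw = s.b
puts raw.length                      # 12
puts raw.bytes.length == raw.length  # true
```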

> |7. Will strings that, when converted to the same encoding, are identical,
> |   give different results for #intern when left in different encodings?
> |
> |   What happens to an interned string with a binary encoding?  Is it interned
> |   based on the internal bytes of the string rather than the characters?
> 
> This is also the place where I haven't made a design decision.
> Possible options are:
> 
>   * restrict symbols to 7bit ascii

Hm, what about international method and variable names? (These are 
possible with -Ku right now.)

>   * embed encoding info in Symbols

Does this mean that Symbols would not be immediate in all cases? (And 
any guesses as to how that would affect performance?)

>   * symbols just use byte sequence

Hm, I think that would work in most cases. Maybe it should not be 
possible to .intern Strings that are not fully compatible with the 
script's encoding (it should still be possible to do utf8_str.intern 
in an ascii script if the string only contains bytes 0..127).
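
As a point of comparison, this sketch shows how question 7 plays out in Ruby 1.9 and later, where Symbols carry an encoding. That behaviour is an assumption relative to the options listed here, none of which had been chosen yet. The variable names are illustrative.

```ruby
# The same four characters in two encodings.
utf8  = "\u307E\u3064\u3082\u3068"
eucjp = utf8.encode("EUC-JP")

# Different bytes and different encodings give different Symbols.
puts utf8.to_sym == eucjp.to_sym          # false
puts utf8.to_sym.encoding                 # UTF-8

# ASCII-only Strings intern to the same Symbol whatever their encoding.
ascii_utf8 = "name".encode("UTF-8")
ascii_euc  = "name".encode("EUC-JP")
puts ascii_utf8.to_sym.equal?(ascii_euc.to_sym)  # true
puts :name.encoding                              # US-ASCII
```

So 7-bit content interns the same way everywhere, while anything beyond that depends on the String's encoding.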

>   * something else I don't think of now.

It's a difficult problem for sure.