This message contains queries that probably only Matz can answer:
I'm excited to see that strings have encodings now! Thank you for your
Unicode support! I have a few questions:
1) I gather that string literals are given the encoding specified by the
-K option or by the encoding comment at the top of the file. Do you
plan any changes to the string literal syntax so that encodings can be
specified for individual literals? Will I be able to include a
UTF-8-encoded string literal within a file that is otherwise in ASCII? I
don't like Python's u"" syntax, but I'm hoping that you'll provide some
more elegant alternative.
2) This is really part of the same question: will you extend the string
literal syntax to allow the inclusion of arbitrary codepoints in
ASCII-encoded files using some kind of character escape? I'm accustomed
to Java's \uxxxx escape sequence and would like to see something like
this. (I don't know enough about SJIS and EUC to know if that would be
relevant to those encodings or not.)
Despite my relative ignorance, I suggest something along these lines:
\uxxxx: represents Unicode codepoint U+xxxx
\Uxxxxxx: represents Unicode codepoint U+xxxxxx
\Exxxx: represents EUC codepoint xxxx
\Sxxxx: represents SJIS codepoint xxxx
Here each x is a hex digit: four digits after \u, \E, and \S, six after \U.
A string may not mix codepoints from different encodings.
If a string contains a codepoint escape, it gets its encoding from that
escape.
If a string literal ends with \u, \U, \E, or \S (with no hex digits
following) then the escape specifies the encoding of the string, even
when the string does not contain any characters outside of the ASCII subset.
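To make the rules above concrete, here is a rough sketch that emulates them at run time on an ASCII source string. It is only an illustration, not proposed syntax or an existing API: the helper name decode_proposed_escapes and the PROPOSED_ENCODINGS table are hypothetical, and I am assuming that the xxxx in \E and \S is simply the two raw bytes of the character. It also assumes a current Ruby (for String#b) and makes no attempt to handle escaped backslashes.

```ruby
# Hypothetical emulation of the proposed escapes; none of this is Ruby syntax.
PROPOSED_ENCODINGS = {
  "u" => Encoding::UTF_8,
  "U" => Encoding::UTF_8,
  "E" => Encoding::EUC_JP,
  "S" => Encoding::Shift_JIS
}.freeze

def decode_proposed_escapes(src)
  chosen = nil
  # Record the encoding implied by an escape; reject a second, different one.
  pick = lambda do |letter|
    enc = PROPOSED_ENCODINGS.fetch(letter)
    raise ArgumentError, "string mixes #{chosen} and #{enc}" if chosen && chosen != enc
    chosen = enc
  end

  bytes = src.b
  # A bare trailing escape only tags the string with an encoding.
  bytes = bytes.sub(/\\([uUES])\z/) { pick.call($1); "" }
  # \u, \E, and \S take four hex digits; \U takes six.
  bytes = bytes.gsub(/\\([uES])(\h{4})|\\(U)(\h{6})/) do
    letter = $1 || $3
    hex = $2 || $4
    pick.call(letter)
    if letter == "E" || letter == "S"
      [hex].pack("H*")             # assumption: the hex digits are the raw bytes
    else
      [hex.to_i(16)].pack("U").b   # UTF-8 bytes for the Unicode codepoint
    end
  end

  # Assumption: with no escape at all, the literal stays US-ASCII.
  result = bytes.force_encoding(chosen || Encoding::US_ASCII)
  raise ArgumentError, "invalid #{result.encoding} sequence" unless result.valid_encoding?
  result
end
```

For example, decode_proposed_escapes('caf\u00e9') would yield a UTF-8 string, decode_proposed_escapes('plain\S') would yield the five ASCII characters tagged as Shift_JIS, and a string containing both \u00e9 and \EA4A2 would be rejected as mixing encodings.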
David