Hi,

In message "Re: Unicode roadmap?"
    on Mon, 19 Jun 2006 14:57:22 +0900, "Dmitry Severin" <dmitry.severin / gmail.com> writes:

|But, I can see several implementation issues and possible options that
|should be considered:

Thank you for the ideas.

|- what will happen if one tries to perform str1.operation(str2) on two
|strings with different encodings:
|  a) raise exception
|  b) silently coerce one or both strings to some "compatible"
|charset/encoding, update encoding of result, replacing non-convertible chars
|using fallback mappings? (ouch, this can be split to set of options)
|  c) same as b) but raise exception if non-loss conversion is not possible?
|  d) same as b) but warn if non-loss conversion is not possible?
|  e) downgrade encoding tag of acceptor to "raw/bytes" and process it?

a), unless one of the strings is "ascii" and the other is "ascii"
compatible.  This point is arguable.
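
A minimal sketch of this rule, assuming the String API that later
shipped in Ruby 1.9 (force_encoding, Encoding::CompatibilityError);
this message itself does not specify those names.  Note that released
Ruby checks whether an operand is ASCII-only by content, not only by
its tag.

```ruby
# Two strings with non-ASCII content and different encoding tags.
utf8   = "caf\u00e9"                                  # tagged UTF-8
latin1 = "caf\xE9".dup.force_encoding("ISO-8859-1")   # same text as Latin-1 bytes

# Option a): combining them raises instead of silently converting.
mixed_result =
  begin
    utf8 + latin1
  rescue Encoding::CompatibilityError => e
    e.class
  end

# The exception to the rule: an ASCII-only operand is compatible with
# any ASCII-compatible encoding, so this concatenation succeeds.
ascii        = "abc".dup.force_encoding("US-ASCII")
ascii_result = (latin1 + ascii).encoding   # ISO-8859-1
```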

|- what will happen if one changes encoding tag for String instance:
|  a) check and raise exception if current bytes don't represent valid
|encoding sequence?
|  b) just set new tag?
|  c) convert byte sequence to given encoding, using fallback mappings?

b), encoding conformance check shall be done lazily.  I think there's a
need for an explicit encoding conformance check method.
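
For illustration, here is how this looks with the methods that later
shipped in Ruby 1.9 (force_encoding, valid_encoding?, encode).  The
method names are taken from released Ruby, not from this message.

```ruby
bytes = "\xE9t\xE9".b   # Latin-1 bytes for "été", tagged as binary

# Option b): changing the tag leaves the bytes untouched and does not check them.
retagged   = bytes.dup.force_encoding("UTF-8")
same_bytes = retagged.bytes == bytes.bytes   # true

# The explicit conformance check: these bytes are not valid UTF-8.
valid = retagged.valid_encoding?             # false

# Option c), actual conversion, is a separate explicit step.
converted = bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")
```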

|- what to do with IO:
|  a) IO will return strings in "raw/bytes"?
|  b) IO can be tagged and will return Strings with given encoding tag?
|  c) IO can be tagged and is by default tagged with global encoding tag?
|  d) IO can be tagged, but is not tagged by default, although methods
|returning strings (such as read, readlines) will use global encoding tag?
|  e) if IO is tagged and one tries to write to it a String with different
|encoding, what will happen?

c), the global default shall be set from locale setting.
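
A sketch of tagged IO, assuming the "r:ENCODING" open mode and
IO#external_encoding from released Ruby 1.9 and later.  The temporary
file name is made up for the example.

```ruby
require "tempfile"

explicit_tag = nil
default_tag  = nil

Tempfile.create("latin1_demo") do |f|
  f.binmode
  f.write("caf\xE9".b)   # Latin-1 bytes for "café"
  f.flush

  # An IO tagged explicitly returns strings carrying that tag.
  File.open(f.path, "r:ISO-8859-1") do |io|
    explicit_tag = io.read.encoding
  end

  # An IO opened without a tag falls back to the global default,
  # which Ruby derives from the locale (Encoding.default_external).
  File.open(f.path, "r") do |io|
    default_tag = io.external_encoding
  end
end
```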

|- what will be default encoding tag for new Strings:
|  a) "raw/bytes"
|  b) derived from system properties of host platform
|  c) option b) and can be overridden in application (btw, $KCODE, as present,
|must definitely go away!!!)

The encoding of literal strings is set by a pragma.

|- how to process source code files:
|  a) restrict them to ASCII and require all non-ASCII strings to be
|externalized?
|  b) process them as "raw/bytes"?
|  c) introduce some kind of commented pragma for source files allowing to
|set encoding,

1.9 already has an encoding pragma a la Python's PEP 263.
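
A small sketch of the pragma in action.  Because the comment must sit
on the first line of a source file, the example writes a throwaway
script and loads it; the file name and the $pragma_demo global are
hypothetical.

```ruby
require "tempfile"

# A tiny script whose first line is a PEP 263 style encoding comment.
source = <<~RUBY
  # -*- coding: iso-8859-1 -*-
  $pragma_demo = [__ENCODING__, "abc".encoding]
RUBY

Tempfile.create(["pragma_demo", ".rb"]) do |f|
  f.write(source)
  f.flush
  load f.path   # both the source encoding and the literal follow the pragma
end
```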

|- at present time Ruby parser can parse only sources in ASCII compatible
|encoding.  Would it change?

No.  Ruby would not allow scripts in EBCDIC, nor UTF-16, although it
allows processing of those encodings.
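
To illustrate the distinction between source encoding and data
encoding, using Encoding#ascii_compatible? and String#encode as found
in released Ruby (an assumption relative to this message):

```ruby
# UTF-16 cannot be a source encoding because it is not ASCII compatible...
utf16_compatible = Encoding::UTF_16LE.ascii_compatible?   # false
utf8_compatible  = Encoding::UTF_8.ascii_compatible?      # true

# ...but strings in UTF-16 can still be created and processed as data.
utf16     = "ruby".encode("UTF-16LE")
byte_size = utf16.bytesize            # 8, two bytes per character
roundtrip = utf16.encode("UTF-8")     # "ruby"
```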

|- what encodings will have Numeric.to_s, Time.to_s etc., or String has to
|have/conform for String#to_f, String#to_i?

Good point.  Currently, I think they should work on ASCII.
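
As a point of comparison, released Ruby (1.9 and later) ended up
behaving this way; the sketch below only shows that outcome and is
not a commitment made in this message.

```ruby
# Numeric#to_s produces an ASCII-tagged string.
int_encoding = 42.to_s.encoding               # US-ASCII

# String#to_i works on ASCII-compatible strings directly.
parsed = "42".to_i                            # 42

# A string in an ASCII-incompatible encoding is converted explicitly first.
utf16_number = "42".encode("UTF-16LE")
parsed_utf16 = utf16_number.encode("US-ASCII").to_i
```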

|On Unicode:
|- case-independent canonical string matches/searches DO MATTER. And even for
|encodings, that code variants of glyphs with different codepoints
|"variant-insensitive" search, as for me, is desired. Will there be such
|functionality?

Casefold search/match will be provided for Regexp.  "variant
insensitive" search should be accomplished by explicit normalization
or collation.
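
A sketch of both halves, assuming the /i Regexp flag and
String#unicode_normalize (the latter only arrived in Ruby 2.2, long
after this message):

```ruby
# Casefold match through Regexp: U+00E9 (é) matches U+00C9 (É) under /i.
casefold_hit = /\u00e9cole/i.match?("\u00c9COLE")   # true

# "Variant insensitive" comparison needs explicit normalization:
composed   = "\u00e9"     # é as a single code point
decomposed = "e\u0301"    # e followed by a combining acute accent

raw_equal        = composed == decomposed           # false
normalized_equal = composed.unicode_normalize(:nfc) ==
                   decomposed.unicode_normalize(:nfc)   # true
```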

|- string comparison: will <=> use at least UCA rules for Unicode strings, or
|only byte-order comparisons will stay?

Byte order comparison.  UCA rules or such should be done explicitly
via normalization or collation.
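
For example, with byte order the accented word sorts after "zebra".
The second sort below uses NFD normalization as a crude stand-in for
collation; it is not UCA, only an illustration of doing the step
explicitly (String#unicode_normalize is from Ruby 2.2 and later).

```ruby
words = ["zebra", "\u00e9clair", "apple"]

# <=> compares bytes, so U+00E9 (bytes C3 A9) sorts after every ASCII letter.
byte_order = words.sort
# ["apple", "zebra", "éclair"]

# Explicit step: sort on a normalized key instead of the raw bytes.
approx_order = words.sort_by { |w| w.unicode_normalize(:nfd) }
# ["apple", "éclair", "zebra"]
```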

|- is_digit, is_space, is_alpha, is_foobarbaz etc. could matter, when writing
|a custom parser. Will those methods be provided for one-char strings?

Those functions will be provided via Regexp.  I am not sure if we will
provide character classification methods for strings.
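
A custom parser can build such predicates on top of Regexp.  The
CharClass module and its method names below are hypothetical helpers
for illustration, not a proposed API; String#match? needs Ruby 2.4 or
later.

```ruby
# Hypothetical one-character predicates built on POSIX bracket classes.
module CharClass
  module_function

  def digit?(ch)
    ch.match?(/\A[[:digit:]]\z/)
  end

  def space?(ch)
    ch.match?(/\A[[:space:]]\z/)
  end

  def alpha?(ch)
    ch.match?(/\A[[:alpha:]]\z/)
  end
end

CharClass.digit?("7")        # true
CharClass.digit?("77")       # false, more than one character
CharClass.space?("\t")       # true
CharClass.alpha?("\u00e9")   # true, bracket classes are Unicode aware for UTF-8
```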

							matz.