2006/6/22, Yukihiro Matsumoto <matz / ruby-lang.org>:
> Hi,
>
> In message "Re: Unicode roadmap?"
>     on Thu, 22 Jun 2006 02:17:53 +0900, "Dmitry Severin" <dmitry.severin / gmail.com> writes:
>
> |Things shouldn't be that complicated.
>
> Agreed in principle.  But it seems to be fundamental complexity of the
> world of multiple encoding.  I don't think automatic conversion would
> improve the situation.  It would cause conversion error almost
> randomly.  Do you have any idea to simplify things?
>
> I am eager to hear.
>



So what will the semantics of the encoding tag be:
 a) a weak suggestion?
 b) a strong assertion?

If the encoding tag is only a weak suggestion (and for now I see it
will be just that), it will imply:
  - a performance win (no need to check conformance to the declared encoding)
  - a win in having less complexity (most tasks use source code, text
    data input and output all in the same [default host] encoding)
  - portability drawbacks (assumptions made by the original coders stay
    implicit, but they have to be figured out when porting to another
    environment)
  - reliability drawbacks (weak suggestions are too often ignored, and
    you don't know when, where and why they will hit your app, but
    someday they will!)
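The reliability drawback can be made concrete in today's Ruby terms:
force_encoding merely relabels the bytes with no validation (a weak
tag), and valid_encoding? is the conformance check that a weak scheme
never runs for you:

```ruby
# A weak tag just relabels the bytes; nothing is validated at tag time.
s = "\xFF\xFEabc".b.force_encoding("UTF-8")   # accepted silently

# The bad tag surfaces only later, at some arbitrary point of use:
s.valid_encoding?   # => false
```

The mis-tagged string travels freely through the program until some
operation finally trips over the invalid bytes.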

If the encoding tag is a strong assertion, it will imply:
  - a probable performance loss:
     * assuring that a string tagged encoding = "none" (raw) represents
       a valid byte sequence in the asserted encoding, at about the
       same price as String#length
     * a need to recode bytes when changing the tag
  - slightly more complexity (developers will have to declare these
    assertions explicitly)
  - a portability win
  - a reliability win
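A minimal sketch of what a strong assertion would cost at tag time.
assert_encoding! is a hypothetical helper (it does not exist); it is
built here from today's force_encoding and valid_encoding?, and the
validation pass is indeed comparable in price to String#length:

```ruby
# Hypothetical strong-assertion tagging: validate when the tag is
# applied, at roughly the cost of one scan over the bytes.
def assert_encoding!(str, enc)
  tagged = str.b.force_encoding(enc)            # relabel a byte copy
  raise ArgumentError, "invalid #{enc} bytes" unless tagged.valid_encoding?
  tagged
end

assert_encoding!("caf\xC3\xA9", "UTF-8")   # => "café", valid bytes pass
begin
  assert_encoding!("caf\xE9", "UTF-8")     # a bare 0xE9 is not valid UTF-8
rescue ArgumentError
  # the bad tag is rejected up front, not deferred
end
```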

What compromise on these issues would be acceptable?

I'd prefer encoding tag as strong assertion, mostly for reliability reasons.

And for operations on Strings with different encodings, I'd like
implicit automatic encoding coercion:
-------------------------------
#
# NOTES:
#  a) String#recode!(new_encoding) replaces the current internal byte
#     representation with a new byte sequence recoded from the current
#     one. It must raise IncompatibleCharError if a char can't be
#     converted to the destination encoding.
#  b) downgrading a string from some stated encoding to the "none" tag
#     must be done only explicitly; it is not an option for implicit
#     conversion.
#  c) $APPLICATION_UNIVERSAL_ENCODING is a global var, allowed to be
#     set once and only once per application run.
#     Intent: we want all strings which aren't raw bytes to be in one
#     single predefined encoding, so all operations on strings must
#     return strings in the conformant encoding.
#     The desired encoding is the value of $APPLICATION_UNIVERSAL_ENCODING.
#     If $APPLICATION_UNIVERSAL_ENCODING is nil, we go into "democracy
#     mode", see below.
#
def coerce_encodings(str1, str2)
  enc1 = str1.encoding
  enc2 = str2.encoding

  # simple case, same encodings; will return fast in most cases
  return if enc1 == enc2

  # another simple but rare case: totally incompatible encodings,
  # as they represent incompatible charsets
  if fully_incompatible_charsets?(enc1, enc2)
    raise IncompatibleCharError, "incompatible charsets #{enc1} and #{enc2}"
  end

  # uncertainty: handling "none" together with a preset encoding
  if enc1 == "none" || enc2 == "none"
    raise UnknownIntentEncodingError,
          "can't implicitly coerce encodings #{enc1} and #{enc2}, " \
          "use explicit conversion"
  end

  # Tyranny mode:
  # we want all strings which aren't raw bytes to be in one single
  # predefined encoding
  if $APPLICATION_UNIVERSAL_ENCODING
    str1.recode!($APPLICATION_UNIVERSAL_ENCODING)
    str2.recode!($APPLICATION_UNIVERSAL_ENCODING)
    return
  end

  # Democracy mode:
  # first try to perform a lossless conversion from one encoding to the other:
  # 1) direct lossless conversion to the other encoding, e.g. UTF-8 + UTF-16
  if exists_direct_non_loss_conversion?(enc1, enc2)
    if exists_direct_non_loss_conversion?(enc2, enc1)
      # performance hint if both directions are available
      if str1.byte_length < str2.byte_length
        str1.recode!(enc2)
      else
        str2.recode!(enc1)
      end
    else
      str1.recode!(enc2)
    end
    return
  end
  if exists_direct_non_loss_conversion?(enc2, enc1)
    str2.recode!(enc1)
    return
  end

  # 2) lossless conversion to a common superset
  # (I see no reason to raise an exception on KOI8-R + CP1251; returning
  # a string in Unicode will be OK)
  if superset_encoding = find_superset_non_loss_conversion?(enc1, enc2)
    str1.recode!(superset_encoding)
    str2.recode!(superset_encoding)
    return
  end

  # A case of incomplete compatibility:
  # check whether the subset of enc1 is also a subset of enc2, so some
  # strings in enc1 can be safely recoded to enc2, e.g. two pure ASCII
  # strings, whatever ASCII-compatible encodings they have
  if exists_partial_loss_conversion?(enc1, enc2)
    if exists_partial_loss_conversion?(enc2, enc1)
      # performance hint if both directions are available
      if str1.byte_length < str2.byte_length
        str1.recode!(enc2)
      else
        str2.recode!(enc1)
      end
    else
      str1.recode!(enc2)
    end
    return
  end

  # the last thing we can try
  str2.recode!(enc1)
end
---------------------------

So, when an operation involves two Strings, or a String and a Regexp,
with different encodings, automatic coercion should be done as
described above.
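For comparison, today's Ruby already performs a very limited form of
this coercion for ASCII-compatible encodings (and raises instead of
coercing anywhere else), which can be shown runnably:

```ruby
# One operand is pure ASCII, so concatenation coerces to the richer encoding:
ascii = "abc".dup.force_encoding("US-ASCII")
utf8  = "déf"                          # UTF-8
(ascii + utf8).encoding                # => #<Encoding:UTF-8>

# Without that common ground no implicit coercion is attempted:
begin
  "déf" + "\xFF".b                     # UTF-8 + raw bytes
rescue Encoding::CompatibilityError
  # the operation fails rather than guessing a conversion
end
```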

That will probably solve coding problems (no need to think about
encodings most of the time), but it can have the following impacts:
1) after several operations, when one sends a string to external IO,
it might be internally encoded in a superset of that IO's encoding.
One has to remember that and perform external IO accordingly, i.e.
decide whether to fail on invalid chars or to use replacement chars
(like U+FFFD); but that is unavoidable.
2) some performance hits, which I expect to be rare.
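The fail-or-replace choice at the IO boundary can be sketched with
today's String#encode options (invalid:, undef: and replace: are the
current transcoding parameters, used here purely for illustration):

```ruby
s = "héllo ©"                       # internally UTF-8

# Option A: fail on unconvertible chars, caller handles the error
begin
  s.encode("US-ASCII")
rescue Encoding::UndefinedConversionError
  # "é" and "©" have no US-ASCII representation
end

# Option B: substitute replacement chars instead
s.encode("US-ASCII", undef: :replace, replace: "?")   # => "h?llo ?"
```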

Besides, there can be another class of problems with automatic
coercion: how do we ensure that character ranges work consistently in
Regexps and in String methods like count, delete, squeeze, tr, succ,
next and upto when encodings are coerced?
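The range question can be made concrete (runnable in today's Ruby): a
Cyrillic range like "а-я" denotes different byte sequences in KOI8-R
and in UTF-8, so after coercion these methods must interpret the range
in whatever encoding the strings ended up in:

```ruby
s = "абв xyz"            # UTF-8: three Cyrillic chars plus ASCII

s.count("а-я")           # => 3, range read as UTF-8 codepoints
s.tr("a-z", "A-Z")       # => "абв XYZ", ASCII range leaves Cyrillic alone
"я".succ                 # next char by codepoint, an encoding-dependent notion
```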

What I, as a Ruby user, wish for Unicode/M17N support:
1) reliability and consistency:
  a) String should be an abstraction for a character sequence;
  b) String methods shouldn't allow me to garble the internal representation;
  c) treating a String as a byte sequence is handy, but must be
explicitly stated.
2) coding comfort:
  a) no need to care what encodings strings have while working with them;
  b) no need to care what encodings strings returned from third-party
code have;
  c) using explicitly stated conversion options for external IO.
3) on Unicode and i18n: at least a set of classes for Unicode-specific
tasks (collation, normalization, string search, locale-aware
formatting etc.) that would work efficiently with Ruby strings.
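As a taste of point (3), normalization in particular can be
illustrated with today's String#unicode_normalize (collation and
locale-aware formatting still need external libraries):

```ruby
composed   = "\u00E9"       # "é" as one precomposed codepoint
decomposed = "e\u0301"      # "e" plus a combining acute accent

composed == decomposed                            # => false, different bytes
composed.unicode_normalize(:nfd) == decomposed    # => true
decomposed.unicode_normalize(:nfc) == composed    # => true
```

Without a normalization step, two strings that render identically
compare as unequal, which is exactly the kind of trap such classes
should close.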

And, for everyone out there, just ask "Which charset/encoding will fit
all the [present and future] needs?". You know the exact answer: "NONE".

> I understand the challenge, but I don't think it is common to run some
> part of your program in legacy encoding (without conversion), and
> other part in UTF-8.  You need to convert them into universal encoding
> anyway for most of the cases.  That's why I said it rare.

Uhm, and how does one convert a compiled extension library?