On Fri, Mar 18, 2011 at 11:53, Magnus Holm <judofyr / gmail.com> wrote:
> The problem is that the definition of #upcase doesn't only depend on the
> encoding used, but also the language of the encoded text. For instance, if
> you're writing in Turkish, you would expect "i".upcase to return a dotted
> uppercase I: http://www.i18nguy.com/unicode/turkish-i18n.html

I know.  The same goes for ‘i’ in Lithuanian.

> Doing this properly is *really* hard and needs to have a lot of flexibility,
> especially when it comes to non-Western languages.

This is simply not true.  Unicode defines how to deal with case
conversions.  I’m not saying that the Unicode standard is infallible,
but we can at least adhere to it.  I’m not saying that Unicode is the
only encoding that we should care about, but if we support the Unicode
transfer formats, why not support other interesting parts of the
standard?

> It's far easier for everyone that the built-in #upcase is
> simple and fast and you'll have to be explicit about any
> other I18n stuff IMO.

Easy, perhaps, but hardly useful.

My point is that the current #upcase (and similar methods) is
basically useless for anything other than ASCII.  I was looking for an
actual solution to this problem.  I have a library
(character-encodings) that does support these conversions, based on
locale and the Unicode character database (UCD).  How do we make it
easy for the user to deal with m17n?  I mean, if I say

# -*- coding: utf-8 -*-

puts "bc".upcase

I expect this to do the right thing for Unicode under the current locale.
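
A minimal sketch of what "the right thing under the current locale"
could look like for the Turkish case quoted above.  The helper
locale_upcase and the constant TURKIC_LOCALES are hypothetical names
made up for illustration; they are not part of Ruby or of
character-encodings.  Only the Turkic rule for "i" is modeled, and
everything else falls through to the built-in #upcase.

```ruby
# -*- coding: utf-8 -*-

# Hypothetical helper, for illustration only.
# Locales in which "i" uppercases to U+0130 (capital I with dot above).
TURKIC_LOCALES = %w[tr az].freeze

def locale_upcase(string, locale)
  if TURKIC_LOCALES.include?(locale)
    # Apply the locale-specific rule before the generic conversion.
    string = string.gsub("i", "\u0130")
  end
  # The built-in method handles whatever it supports for the rest.
  string.upcase
end

puts locale_upcase("i", "en")  # I
puts locale_upcase("i", "tr")  # İ
```

The point of the sketch is only that the result depends on both the
string and the locale, so the locale has to reach #upcase somehow.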

As Unicode defines how to deal with case conversions, if I tell Ruby
that “this String is encoded as UTF-8” (or, in this case, “strings in
this file are encoded as UTF-8”), I expect Ruby to respond “OK, I’ll
use the Unicode rules that govern methods like #upcase for that
String”.

The UCD requires a lot of memory, so I suggested that a library, such
as character-encodings, should be able to seamlessly add this kind of
behavior without requiring the user to write "bc".unicodify.upcase,
if the UCD can’t be included in standard Ruby runtime.

But, come to think of it, doesn’t Oniguruma need most of the UCD
information, so isn’t most of it already included in the Ruby runtime?
Adding casing information perhaps wouldn’t require much additional
space.
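
As a small illustration of the character data the regexp engine
already carries: Ruby 1.9 regular expressions accept Unicode property
classes such as \p{Lu} (uppercase letter).  The helper name
uppercase_letter? is hypothetical and exists only for this sketch.

```ruby
# -*- coding: utf-8 -*-

# Hypothetical helper: true if the string starts with a character in
# the Unicode general category Lu, as seen by the regexp engine.
def uppercase_letter?(char)
  !(char =~ /\A\p{Lu}/).nil?
end

puts uppercase_letter?("\u00C4")  # true  (Ä)
puts uppercase_letter?("\u00E4")  # false (ä)
```

This shows that category data is available to the engine; it does not
show that the case mappings themselves are present.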

If this isn’t of interest, then I’m still looking for a way to
override #upcase for Strings that use the UTF-8 encoding without
resorting to alias_method or extend (as shown earlier in this
discussion).  This seems impossible to do at the moment, as Encoding
is a completely opaque object.
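
For comparison, a sketch of how an encoding-dependent override can be
written assuming Ruby 2.0 or later, where Module#prepend is available.
The module name Utf8Casing and the three-entry table are hypothetical
stand-ins for a UCD-backed library; this is not how
character-encodings works.

```ruby
# -*- coding: utf-8 -*-

# Hypothetical stand-in for real Unicode data: three lowercase letters only.
TINY_UPCASE_TABLE = {
  "\u00E4" => "\u00C4",  # ä => Ä
  "\u00F6" => "\u00D6",  # ö => Ö
  "\u00FC" => "\u00DC"   # ü => Ü
}.freeze

module Utf8Casing
  def upcase(*args)
    # Leave every other encoding to the built-in method.
    return super unless encoding == Encoding::UTF_8
    # Sketch only: arguments are ignored on this path.
    gsub(/[\u00E4\u00F6\u00FC]/, TINY_UPCASE_TABLE).tr("a-z", "A-Z")
  end
end

# Module#prepend is private on Ruby 2.0 through 2.4, hence send.
String.send(:prepend, Utf8Casing)

puts "\u00E4bc".upcase  # ÄBC
puts "abc".b.upcase     # ABC (binary string, built-in path)
```

The dispatch here still happens inside String itself by checking
#encoding; the Encoding object stays opaque and gains no hooks.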