----- Original Message -----
From: "Tobias Reif" <tobiasreif / pinkjuice.com>
To: "ruby-talk ML" <ruby-talk / ruby-lang.org>
Sent: Thursday, October 04, 2001 11:57 AM
Subject: [ruby-talk:22067] Re: writing UTF-8 strings


> Yukihiro Matsumoto wrote:
>
>  > Just print.  UTF-8 is ASCII compatible, so when you use only ascii
>  > region characters, it's no difference between ASCII string and UTF-8
>  > string.
>
> So what do I declare?:
>
> <?xml version="1.0" encoding="UTF-8" standalone="no" ?>
>
> Is there a difference between an ASCII string and a UTF-8 string with
> only ascii region characters?

No. But remember that ASCII covers only the first 128 values that a single
byte can represent.

That is, ASCII includes only the basic Latin characters and some signs.
That is enough for writing simple English or a computer program, but not for
German, Swedish and so on.

So it is important to declare the correct encoding.
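
To make the point about the first 128 values concrete, here is a small sketch. It assumes a current Ruby (2.0 or later), which is newer than the Ruby discussed in this thread, and the method names `utf8_bytes` and `ascii_safe?` are made up for illustration only.

```ruby
# Sketch only: assumes Ruby 2.0+ (String#encode, String#bytes returning an Array).
# utf8_bytes and ascii_safe? are hypothetical helper names.

# Bytes of the string when it is written out as UTF-8.
def utf8_bytes(str)
  str.encode("UTF-8").bytes
end

# True when every byte is below 128, i.e. the UTF-8 form is also plain ASCII.
def ascii_safe?(str)
  utf8_bytes(str).all? { |b| b < 128 }
end

p utf8_bytes("Ruby")         # one byte per letter
p ascii_safe?("Ruby")
p utf8_bytes("\u00DCber")    # "Ueber" with a U-umlaut: the umlaut takes two bytes
p ascii_safe?("\u00DCber")
```

A string for which `ascii_safe?` returns true can be served as ASCII or as UTF-8 without changing a single byte; for anything else the declared encoding matters.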

> [...] And since Ruby is 8bit clean, when you type in UTF-8 strings
>  > in your script, they just work fine.
>
> XML implementations are required to handle UTF-8.
> I'm serving SVG, which is XML, so I'd like to serve UTF-8;
> with UTF-16, it seems to be the only thing I can be sure of that the
> client can handle it.
>
> With Ruby, would use UTF-8 characters, serve ASCII, and declare it's
> UTF-8?

So: as long as you only write using basic Latin letters, there is no problem.
When writing anything else, find out which encoding is used; that depends on
the app/OS.

BTW, Ruby offers some nice support for UTF-8 (in a non-clumsy way).
With the option -Ku, the interpreter reads your program as UTF-8, regexes
accept the //u option, which makes them UTF-8-aware, and the methods
Array#pack and String#unpack (with the 'U' directive) convert between lists
of Unicode values and UTF-8-encoded strings.
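
A minimal sketch of the pack/unpack round trip, using the 'U*' directive. The wrapper names `codepoints_to_utf8` and `utf8_to_codepoints` are hypothetical, and `String#bytes` returning an Array assumes Ruby 2.0 or later.

```ruby
# Sketch only: the two wrapper names are invented for illustration.

# List of Unicode values -> UTF-8-encoded string.
def codepoints_to_utf8(codepoints)
  codepoints.pack("U*")
end

# UTF-8-encoded string -> list of Unicode values.
def utf8_to_codepoints(str)
  str.unpack("U*")
end

s = codepoints_to_utf8([0x0109, 0x8A9E])  # Esperanto c^ and the kanji 'go'
p s.bytes                 # two bytes for U+0109, three bytes for U+8A9E
p utf8_to_codepoints(s)   # the same two values again, as integers
```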

So the awareness of UTF-8 is not in the strings; they are just sequences of
bytes. This means that the information about the encoding has to come
from somewhere else.
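
The following sketch shows the same two bytes read in two different ways. Note the assumption: it uses `String#b` and `String#force_encoding` from a current Ruby (2.0 or later), where each string carries an encoding label. That label did not exist in the Ruby this message is about; here it only plays the role of the outside information. `RAW` and `chars_when_read_as` are illustrative names.

```ruby
# Sketch only: assumes Ruby 2.0+; RAW and chars_when_read_as are hypothetical.

# The two bytes that UTF-8 uses for U+0109, held as plain binary data.
RAW = "\xC4\x89".b

# Count the characters seen when the bytes are read under a given encoding.
def chars_when_read_as(bytes, encoding_name)
  bytes.dup.force_encoding(encoding_name).chars.length
end

p chars_when_read_as(RAW, "UTF-8")       # read as one character
p chars_when_read_as(RAW, "ISO-8859-1")  # read as two characters
```

The bytes never change; only the stated encoding decides how many characters they are.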



---
Here is a (not very useful, but maybe illustrative) example of UTF-8-aware
string manipulation in Ruby:

Of course, if you are producing SVG, you could just as well use the
&#<decimal number>; form.

(The output will only display correctly in a UTF-8-aware app/system, e.g.
RubyWin using a capable font.)

# Use the -Ku option so the interpreter can parse this correctly (then you can
# also name variables using UTF-8 lower-case letters).

class String
  # Recognize any Unicode sign names (U+hhhh), transform them into UTF-8
  # strings and put them back.
  def u
    gsub(/U\+([0-9a-fA-F]{4})/u) { [$1.hex].pack('U*') }
  end
end

puts "jen la U+0109ielo".u  # prints the Esperanto letter c^
puts "Ruby-U+8A9E".u        # the Japanese kanji 'go'

puts "Über den Tisch... vi vidas la U+0109ielon".u   # over the table you see the heaven :-)
# The //u switch makes the regex handle the German UTF-8 character typed
# directly into the string correctly.
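
For the &#<decimal number>; form mentioned above, a rough sketch could look like this. It assumes a current Ruby (1.9 or later) where `String#ord` returns the Unicode value of a character in a UTF-8 string; the method name `to_char_refs` is made up for this example.

```ruby
# Sketch only: assumes Ruby 1.9+; to_char_refs is a hypothetical name.

# Replace every non-ASCII character with its decimal character reference,
# so the result is pure ASCII and safe to put into SVG/XML as-is.
def to_char_refs(str)
  str.gsub(/[^\x00-\x7F]/) { |ch| format("&#%d;", ch.ord) }
end

puts to_char_refs("jen la \u0109ielo")   # prints: jen la &#265;ielo
```

With this form the served bytes are ASCII only, so the document reads the same whether the client takes it as ASCII or as UTF-8.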