At 06:39 08/07/09, Jim Weirich wrote:

>BTW, did you get a chance to review that builder patch for the CP1252  
>character?  Should be dated around June 16th.  If you didn't get it, I  
>will resend.

Hello Jim,

Many thanks for rake, builder,...

I haven't seen the above patch (maybe it was sent to Sam Ruby only),
but I wouldn't mind having a look at it.

Anyway, with respect to internationalization, I think there's some
more work to do. The documentation at http://builder.rubyforge.org/ says:

"Builder correctly translates UTF-8 characters into valid XML."

There seems to be some fundamental misunderstanding here:
Except for a few special cases (such as XML declarations with an encoding
pseudo-attribute to the contrary), raw UTF-8 is perfectly fine in
well-formed and valid XML, because UTF-8 (together with UTF-16 under
certain circumstances) is the default encoding for XML documents,
and every XML processor is required to support UTF-8 (and UTF-16).

So when I write something like:
[All the following examples won't be in UTF-8 when they reach you,
but assume they are in UTF-8 because they were when I ran them.]

>>>>
require 'rubygems'
gem 'builder', '~> 2.0'
require 'builder'

builder = Builder::XmlMarkup.new(:target=>STDOUT, :indent=>2)
builder.person { |b|
  b.name("まつもとゆきひろ")
  b.place("Matsue, Shimane, Japan")
}
>>>>

I shouldn't get something like:

>>>>
<person>
  <name>&#12414;&#12388;&#12418;&#12392;&#12422;&#12365;&#12402;&#12429;</name>
  <place>Matsue, Shimane, Japan</place>
</person>
>>>>

because something like:

>>>>
<person>
  <name>まつもとゆきひろ</name>
  <place>Matsue, Shimane, Japan</place>
</person>
>>>>

would be much shorter, more readable (for those who read Matz's
name in Hiragana, and for whom that text was probably intended),
and perfectly well-formed XML (not valid because there's no DTD around).

[Even if somebody wanted numeric character references, they'd probably
want hexadecimal ones, because they could then look them up directly
rather than having to use a dec->hex calculator all the time.]
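To make the difference between the two notations concrete, here is a
small sketch in current Ruby. The helper names decimal_refs and hex_refs
are made up for this mail and are not part of Builder.

```ruby
# Hypothetical helpers for illustration only; not part of Builder.

# Render every character as a decimal numeric character reference.
def decimal_refs(str)
  str.encode("UTF-8").codepoints.map { |cp| format("&#%d;", cp) }.join
end

# Render every character as a hexadecimal numeric character reference.
def hex_refs(str)
  str.encode("UTF-8").codepoints.map { |cp| format("&#x%X;", cp) }.join
end

puts decimal_refs("ま")  # &#12414;
puts hex_refs("ま")      # &#x307E;
```

The hexadecimal form can be looked up directly as U+307E in a code chart;
the decimal form has to be converted first.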

Builder seems to take UTF-8 for granted on the input side
(which it cannot, unless $KCODE is set or, in Ruby 1.9, the
source carries a "# encoding: utf-8" comment), but not on the
output side, where it could.

Exploring some more, when I convert the input file to Shift_JIS
and run it again, the output produced is:

>>>>
<person>
  <name>&#8218;&#220;&#8218;&#194;&#8218;&#224;&#8218;&#198;&#8218;&#228;&#8218;&#171;&#8218;&#208;&#8218;&#235;</name>
  <place>Matsue, Shimane, Japan</place>
</person>
>>>>

which is total garbage. The direct culprit seems to be
String#to_xs:

   # File lib/builder/xchar.rb, line 107
   def to_xs
     unpack('U*').map {|n| n.xchr}.join # ASCII, UTF-8
   rescue
     unpack('C*').map {|n| n.xchr}.join # ISO-8859-1, WIN-1252
   end

"If it looks like UTF-8, it's UTF-8, and otherwise, it must be
windows-1252." may accidentally work in some parts of the world,
but not in Japan or in most other parts of Asia.
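The garbage can be reproduced by hand. The following is only a sketch of
what I assume the fallback branch effectively does with Shift_JIS bytes
(read each byte as windows-1252 and emit a decimal reference); the helper
name misread_as_cp1252 is made up and is not Builder code.

```ruby
# Hypothetical helper for illustration only; not part of Builder.
# Reinterpret the raw bytes of a string as windows-1252 and write each
# resulting character as a decimal numeric character reference.
def misread_as_cp1252(str)
  as_cp1252 = str.dup.force_encoding("Windows-1252").encode("UTF-8")
  as_cp1252.codepoints.map { |cp| format("&#%d;", cp) }.join
end

sjis = "ま".encode("Shift_JIS")
p sjis.bytes                  # [130, 220], i.e. 0x82 0xDC
puts misread_as_cp1252(sjis)  # &#8218;&#220;
```

Byte 0x82 is U+201A (decimal 8218) in windows-1252, and 0xDC is U+00DC
(decimal 220), which matches the start of the output above.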

To make sure builder works well with the overall 1.9 internationalization
philosophy and implementation, I suggest the following:

- Know your input. In 1.9, that's easy, each string has an encoding.
- Know what the user wants as output. There are two options here,
  either catching something like
     xml_markup.instruct! :xml, :version=>"1.0", :encoding=>"desired_encoding"
  or (because the XML declaration looks like a processing instruction,
  but isn't) creating a dedicated method.
- Make sure conversion happens if necessary. String#encode should do the
  job, although currently, the number of supported encodings is still
  a bit low.
- For characters that can't be encoded in the output character encoding,
  use hexadecimal character references (and complain if such characters
  appear in places where they shouldn't, such as element/attribute names).
  I'll gladly implement &#x...; fallbacks in String#encode exactly for
  this purpose. Note that for people who want to get mostly numeric
  character references, the easiest way is to say they want their
  output in US-ASCII.
- For those who really want decimal rather than hexadecimal numeric
  character references, provide an additional option if you like
  (the last browser I know of that didn't support hexadecimal numeric
  character references in HTML was Netscape 5; there is no XML processor
  that doesn't support them).
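
The points above could be sketched roughly as follows for character data.
This is not a patch against Builder: the method name xml_text and the
constant XML_ESCAPES are made up, and the :fallback option of
String#encode is used as it exists in current Ruby.

```ruby
# Hypothetical sketch for illustration only; not part of Builder.
XML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;" }.freeze

# Convert character data to the requested output encoding.
# Characters the output encoding cannot represent become hexadecimal
# numeric character references.
def xml_text(str, output_encoding)
  # Know your input: the string carries its own encoding, so it can be
  # converted to UTF-8 reliably before any further processing.
  unicode = str.encode("UTF-8")
  escaped = unicode.gsub(/[&<>]/, XML_ESCAPES)
  # Know your output: convert once, with a fallback for unencodable characters.
  escaped.encode(output_encoding,
                 fallback: ->(ch) { format("&#x%X;", ch.ord) })
end

puts xml_text("まつ", "UTF-8")     # raw UTF-8 passes through unchanged
puts xml_text("まつ", "US-ASCII")  # &#x307E;&#x3064;
```

Asking for US-ASCII output is then the way to get mostly numeric
character references, and Shift_JIS input is handled because the
conversion relies on the string's declared encoding rather than on
guessing from the bytes.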

Hope this helps. I'll be glad to help some more; it would be good
if we could bring Builder to a state where it would be an example
of how to use 1.9 internationalization technology. If I got something
wrong or overlooked something, please tell me.

Regards,   Martin.




#-#-#  Martin J. Du"rst, Assoc. Professor, Aoyama Gakuin University
#-#-#  http://www.sw.it.aoyama.ac.jp       mailto:duerst / it.aoyama.ac.jp