Hi,

"James Britt (rubydev)" <james / rubyxml.com> wrote:
> OK, let me rephrase that: many people lured to Ruby by the mention of XML will likely expect a DOM API.  Certainly, if you're
> looking to *write* an XML API Ruby is quite good, though the absence of Unicode support is a problem.

What kind of Unicode support do you want?
How about a Unicode converter like this?
## I haven't tested it. It's just a sample to show my idea.

####---------------------------------------------------------------------
require 'uconv'

module XMLConv
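  ## uconv's functions are module functions, so call them on Uconv directly.
  ## As the LE converters below assume, u16tou8/u4tou8 take little-endian input.

  class <<self
    def u16be_to_u8(str);  Uconv.u16tou8(Uconv.u16swap(str)); end
    def u16le_to_u8(str);  Uconv.u16tou8(str);                end
    def u4be_to_u8(str);   Uconv.u4tou8(Uconv.u4swap(str));   end
    def u4le_to_u8(str);   Uconv.u4tou8(str);                 end
    ## 2143 -> 1234 (swap bytes in each 16-bit half) -> 4321 -> UTF-8
    def u42143_to_u8(str); Uconv.u4tou8(Uconv.u4swap(Uconv.u16swap(str))); end
    ## 3412 -> 4321 (swap bytes in each 16-bit half) -> UTF-8
    def u43412_to_u8(str); Uconv.u4tou8(Uconv.u16swap(str));  end
    def u8_to_u8(str); str[3..-1];                            end ## delete BOM
    def noop(str); str; end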
  end

  @@encode_detect = {
    "\x00\x00\x00\x3C" => ["UCS-4BE", :u4be_to_u8],
    "\x00\x00\x3C\x00" => ["UCS-4-2143", :u42143_to_u8],
    "\x00\x00\xFE\xFF" => ["UCS-4BE", :u4be_to_u8],
    "\x00\x00\xFF\xFE" => ["UCS-4-2143", :u42143_to_u8],
    "\x00\x3C\x00\x00" => ["UCS-4-3412", :u43412_to_u8],
    "\x00\x3C\x00\x3F" => ["UTF-16BE", :u16be_to_u8],
    "\x3C\x00\x00\x00" => ["UCS-4LE", :u4le_to_u8],
    "\x3C\x00\x3F\x00" => ["UTF-16LE", :u16le_to_u8],
    "\xEF\xBB\xBF\x3c" => ["UTF-8", :u8_to_u8],
    "\xFE\xFF\x00\x00" => ["UCS-4-3412", :u43412_to_u8],
    "\xFE\xFF\x00\x3C" => ["UTF-16BE", :u16be_to_u8],
    "\xFF\xFE\x00\x00" => ["UCS-4LE", :u4le_to_u8],
    "\xFF\xFE\x3C\x00" => ["UTF-16LE", :u16le_to_u8],
    "\x3C\x3F\x78\x6D" => [nil, :noop] ## UTF-8 or ascii-compatible
  }

  def convert_to_utf8(str)
    str_head = str[0,4]
    encoding, func = @@encode_detect[str_head]
    if func
      return XMLConv.__send__(func, str), encoding
    else
      raise 'oops, I cannot parse!'
    end
  end
  module_function :convert_to_utf8
end

if __FILE__ == $0
  str = ARGF.read
  ## convert_to_utf8 returns the converted string first, then the encoding name
  enc_str, encoding = XMLConv::convert_to_utf8(str)
  # print "original encoding: #{encoding}\n"
  print enc_str
end
####---------------------------------------------------------------------

This program converts a string into UTF-8. It supports:

 * UCS-4BE (1234 order)
 * UCS-4   (2143 order)
 * UCS-4LE (4321 order)
 * UCS-4   (3412 order)
 * UTF-16BE  (12 order)
 * UTF-16LE  (21 order)
 * UTF-8

with or without BOM.
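
To make the byte orders in the list above concrete, here is a rough
pure-Ruby sketch that does not use uconv at all. The helper names
(ucs4_to_codepoints, ucs4_to_utf8) are made up for illustration, and
it assumes the input length is a multiple of 4 and that any BOM has
already been dealt with:

```ruby
# Hypothetical helpers, not part of uconv.
# order names where bytes 1..4 of the big-endian form sit in the input,
# e.g. "2143" means the input holds byte 2, byte 1, byte 4, byte 3.
def ucs4_to_codepoints(str, order)
  positions = order.chars.map { |c| c.to_i - 1 }   # "2143" -> [1, 0, 3, 2]
  str.bytes.each_slice(4).map do |unit|
    be = Array.new(4, 0)
    # put each input byte back into its big-endian slot
    positions.each_with_index { |dest, src| be[dest] = unit[src] }
    (be[0] << 24) | (be[1] << 16) | (be[2] << 8) | be[3]
  end
end

def ucs4_to_utf8(str, order)
  ucs4_to_codepoints(str, order).pack("U*")        # "U" packs code points as UTF-8
end

ucs4_to_utf8("\x00\x00\x3C\x00", "2143")  # => "<"
ucs4_to_utf8("\x00\x3C\x00\x00", "3412")  # => "<"
```

The two sample inputs are the same "<" patterns used as keys in the
detection table of the program.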

Of course, it's just a string converter, not an IO. But we can
write an IO-like library (e.g. SourceIO in REXML) using it.
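
As a rough idea of what such an IO-like wrapper might look like, here
is a minimal sketch. The class name Utf8Source is hypothetical, it is
not how REXML's SourceIO is implemented, and the converter here is
Ruby's built-in String#encode standing in for uconv; a block calling
XMLConv.convert_to_utf8 could be passed instead:

```ruby
require 'stringio'

# Hypothetical wrapper: reads the whole source once, converts it with the
# given block (anything that returns a UTF-8 string), then serves
# IO-style reads from an in-memory buffer.
class Utf8Source
  def initialize(io, &converter)
    @buffer = StringIO.new(converter.call(io.read))
  end

  def read(*args); @buffer.read(*args); end
  def gets(*args); @buffer.gets(*args); end
  def eof?;        @buffer.eof?;        end
end

# Stand-in input: a small UTF-16LE document held in memory as raw bytes.
raw_io = StringIO.new("<a/>\n<b/>".encode("UTF-16LE").b)

# Stand-in converter (an assumption, not uconv): reinterpret the bytes as
# UTF-16LE and transcode them to UTF-8.
src = Utf8Source.new(raw_io) { |raw| raw.dup.force_encoding("UTF-16LE").encode("UTF-8") }
first_line = src.gets
```

It reads everything up front, so it is only reasonable for documents
that fit in memory.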

Also, uconv is an extension library, but it doesn't link against
any other library, so IMHO portability is not so bad.
In fact, it works on Windows.


Regards,

TAKAHASHI 'Maki' Masayoshi     E-mail: maki / open-news.com