On 10/21/05, Peter Fitzgibbons <peter.fitzgibbons / gmail.com> wrote:
[snip]
> Second, the error messaging from the scanner tells me that I have a "Invalid
> byte 1 of 1-byte UTF-8 sequence."
> That's nice, but I have no way to tell _what_ byte is in violation.
>
> SO, if you have any reference to what might qualify as a "UTF8 Validator",
> please tell. Clearly Java and Ruby have different definitions of UTF8.

I have made a utf8 decoder capable of this, example below:

irb(main):003:0> require 'iterator'
=> true
irb(main):004:0> str = "ab\000\200\300"
=> "ab\000\200\300"
irb(main):005:0> byte_iterator = Iterator::Continuation.new(str, :each_byte)
=> #<Iterator::Continuation:0x58f480 @symbol=:each_byte, @instance="ab\000\200\300", @return_where=#<Proc:0x00000000@/usr/local/lib/ruby/site_ruby/1.8/iterator.rb:494>, @value=97, @position=0, @resume_where=#<Continuation:0x58f3f4>>
irb(main):006:0> Iterator::DecodeUTF8.new(byte_iterator).to_a
Iterator::DecodeUTF8::Malformed: unexpected continuation byte. byte-offset=3
        from /usr/local/lib/ruby/site_ruby/1.8/iterator.rb:740:in `current'
        from /usr/local/lib/ruby/site_ruby/1.8/iterator.rb:91:in `each'
        from (irb):6:in `to_a'
        from (irb):6
irb(main):007:0>

You need to install my iterator package.
http://rubyforge.org/frs/?group_id=18

--
Simon Strandgaard
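
For anyone who only wants the byte offset and does not have the iterator package installed, here is a minimal sketch in plain Ruby of the same kind of check. It is not the code from iterator.rb: the method name first_invalid_utf8 and the reason strings are made up for illustration, and the rules it applies (no stray continuation bytes, no overlong forms, no surrogates, nothing above U+10FFFF) are my reading of the usual UTF-8 definition.

```ruby
# Hypothetical helper, NOT part of the iterator package.
# Returns nil when the bytes are well-formed UTF-8, otherwise
# [byte_offset, reason] for the first problem found.
def first_invalid_utf8(str)
  bytes = str.unpack("C*")
  i = 0
  while i < bytes.length
    b = bytes[i]
    # Classify the lead byte: how many continuation bytes must follow,
    # the payload bits it carries, and the smallest code point that
    # legitimately needs a sequence of this length.
    if b < 0x80
      need, cp, min = 0, b, 0
    elsif b < 0xC0
      return [i, "unexpected continuation byte"]
    elsif b < 0xE0
      need, cp, min = 1, b & 0x1F, 0x80
    elsif b < 0xF0
      need, cp, min = 2, b & 0x0F, 0x800
    elsif b < 0xF8
      need, cp, min = 3, b & 0x07, 0x10000
    else
      return [i, "invalid lead byte"]
    end

    # Consume the continuation bytes, each of the form 10xxxxxx.
    need.times do |k|
      c = bytes[i + 1 + k]
      return [i, "truncated sequence"] if c.nil?
      return [i + 1 + k, "expected continuation byte"] unless (c & 0xC0) == 0x80
      cp = (cp << 6) | (c & 0x3F)
    end

    return [i, "overlong encoding"] if cp < min
    return [i, "surrogate code point"] if cp >= 0xD800 && cp <= 0xDFFF
    return [i, "code point above U+10FFFF"] if cp > 0x10FFFF
    i += need + 1
  end
  nil
end

p first_invalid_utf8("ab\000\200\300")  # => [3, "unexpected continuation byte"]
```

On the string from the irb session above this reports offset 3, the same offset the decoder gave. Note that the trailing \300 byte is never reached, because the scan stops at the first problem.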