On 10/21/05, Peter Fitzgibbons <peter.fitzgibbons / gmail.com> wrote:
[snip]
> Second, the error messaging from the scanner tells me that I have a "Invalid
> byte 1 of 1-byte UTF-8 sequence."
> That's nice, but I have no way to tell _what_ byte is in violation.
>
> SO, if you have any reference to what might qualify as a "UTF8 Validator",
> please tell. Clearly Java and Ruby have different definitions of UTF8.

I have written a UTF-8 decoder that reports the byte offset of the
malformed byte. Example below:

irb(main):003:0> require 'iterator'
=> true
irb(main):004:0> str = "ab\000\200\300"
=> "ab\000\200\300"
irb(main):005:0> byte_iterator = Iterator::Continuation.new(str, :each_byte)
=> #<Iterator::Continuation:0x58f480 @symbol=:each_byte,
@instance="ab\000\200\300",
@return_where=#<Proc:0x00000000@/usr/local/lib/ruby/site_ruby/1.8/iterator.rb:494>,
@value=97, @position=0, @resume_where=#<Continuation:0x58f3f4>>
irb(main):006:0> Iterator::DecodeUTF8.new(byte_iterator).to_a
Iterator::DecodeUTF8::Malformed: unexpected continuation byte. byte-offset=3
        from /usr/local/lib/ruby/site_ruby/1.8/iterator.rb:740:in `current'
        from /usr/local/lib/ruby/site_ruby/1.8/iterator.rb:91:in `each'
        from (irb):6:in `to_a'
        from (irb):6
irb(main):007:0>


To run this, you need to install my iterator package:
http://rubyforge.org/frs/?group_id=18
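
For anyone who only wants the byte offset and cannot install the package,
the same idea can be sketched in plain Ruby. This is a minimal sketch, not
how the iterator package does it: the method name
first_malformed_utf8_offset is hypothetical, and it only checks the
lead-byte/continuation-byte structure. It does not reject overlong
three- and four-byte forms or surrogate code points.

```ruby
# Hypothetical helper (not part of the iterator package).
# Returns the byte offset of the first structurally malformed UTF-8 byte,
# or nil if none is found.
def first_malformed_utf8_offset(str)
  bytes = str.unpack('C*')   # array of byte values, 0..255
  i = 0
  while i < bytes.length
    b = bytes[i]
    # Number of continuation bytes expected after this lead byte.
    need =
      if b < 0x80 then 0
      elsif b >= 0xC2 && b <= 0xDF then 1
      elsif b >= 0xE0 && b <= 0xEF then 2
      elsif b >= 0xF0 && b <= 0xF4 then 3
      else return i          # stray continuation byte or invalid lead byte
      end
    need.times do |k|
      c = bytes[i + 1 + k]
      return i if c.nil?                         # sequence truncated at end of input
      return i + 1 + k unless (c & 0xC0) == 0x80 # not a continuation byte
    end
    i += need + 1
  end
  nil
end

first_malformed_utf8_offset("ab\000\200\300")   # => 3
```

The result agrees with the byte-offset=3 reported in the session above:
\200 (0x80) is a continuation byte with no lead byte in front of it.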

--
Simon Strandgaard