On Fri, 11 Mar 2005 01:33:15 +0100, Simon Strandgaard <neoneye / gmail.com> wrote:
> On Fri, 11 Mar 2005 09:05:11 +0900, Ian Macdonald <ian / caliban.org> wrote:
> > One such allegedly bad string is the following:
> >
> > irb(main):001:0> "\032p\210\004n\306\271\310gY\002".unpack("U*")
> > ArgumentError: malformed UTF-8 character
> >         from (irb):1:in `unpack'
> >         from (irb):1
> >
> > This is supposed to be Japanese. Can a Japanese reader please confirm
> > that this is, indeed, malformed UTF-8? I need to be sure that the bug
> > does not lie with Ruby before I get back to our calendar admin and tell
> > him to go and pester Oracle.
> 
> the substring "\210\004" is invalid UTF8.
> in hex its [0x88, 0x04].
> 
> 0x88 has its uppermost bit set, so this is a dual byte sequence.
> 0x04 is not a valid continuation byte (upper bit should have been 1).

Forget this explanation, it's wrong. (I misread my test case.)


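0x88 is not a valid first byte for a multi-byte sequence.
To be a valid first byte, the two uppermost bits must both be set.
0x88 (binary 10001000) has only the uppermost of those two bits set,
which marks it as a continuation byte with no lead byte before it.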
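
A rough sketch of that rule, in case it helps to check other bytes.
The method name utf8_byte_kind is made up for this mail, and it only
looks at the high bits of one byte (it does not reject overlong or
out-of-range lead bytes such as 0xC0 or 0xF5):

```ruby
# Classify one byte by its high bits (hypothetical helper, not a Ruby API).
def utf8_byte_kind(byte)
  if    (byte & 0x80) == 0x00 then :ascii         # 0xxxxxxx
  elsif (byte & 0xC0) == 0x80 then :continuation  # 10xxxxxx
  elsif (byte & 0xE0) == 0xC0 then :lead2         # 110xxxxx
  elsif (byte & 0xF0) == 0xE0 then :lead3         # 1110xxxx
  elsif (byte & 0xF8) == 0xF0 then :lead4         # 11110xxx
  else                             :invalid
  end
end

# The string from Ian's mail, as raw byte values.
bytes = "\032p\210\004n\306\271\310gY\002".unpack("C*")

utf8_byte_kind(bytes[2])  # => :continuation (0x88, with no lead byte before it)
```

Since "p" before it is plain ASCII, the 0x88 at index 2 is a stray
continuation byte, which would explain the complaint from unpack("U*").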

--
Simon Strandgaard