On Fri 11 Mar 2005 at 10:05:26 +0900, Nikolai Weibull wrote:

> * Ian Macdonald (Mar 11, 2005 01:30):
> > irb(main):001:0> "\032p\210\004n\306\271\310gY\002".unpack("U*")
> > ArgumentError: malformed UTF-8 character
> >         from (irb):1:in `unpack'
> >         from (irb):1
> 
> utf8validate.rb:
> 
> --- cut here ---
> #! /usr/bin/ruby -w
> 
> ARGV[0] =~ /^(
>      [\x00-\x7F]                        # ASCII
>    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
>    |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
>    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
>    |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
>    |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
>    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
>    |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
>   )*/x
> 
> if $~.end(0) != ARGV[0].length
>   printf("malformed UTF-8 character starting at position %d in the input\n", $~.end(0))
>   exit 1
> end
> --- cut here ---
> 
> and from zsh:
> 
> % utf8validate.rb $'\032p\210\004n\306\271\310gY\002'
> malformed UTF-8 character starting at position 2 in the input
> % 
> 
> For your input, the \210 byte is the problem: it is a continuation byte
> (in the range \x80-\xBF) with no lead byte in front of it, so the regex
> stops matching there.  I haven't verified the regular expression
> independently, so treat it as probably correct rather than proven.
> Either way, you can now tell where in the data things blow up,
> 	nikolai
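
On a Ruby with encoding-aware strings (1.9 or later; String#b needs 2.0),
the script above would likely need small changes: a regex containing \xNN
byte escapes has to be compiled with the /n flag and matched against a
binary copy of the input. A minimal sketch under those assumptions follows.
UTF8_SEQUENCES and first_malformed_offset are names made up for
illustration; they are not part of the original script.

```ruby
# Sketch only: the same byte ranges as the regex in the quoted script,
# anchored with \A and compiled with /n so that it matches raw bytes.
# The constant and method names are hypothetical.
UTF8_SEQUENCES = /\A(?:
     [\x00-\x7F]                        # ASCII
   | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
   |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
   | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
   |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
   |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
   | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
   |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
  )*/nx

# Returns the byte offset of the first malformed sequence, or nil if the
# whole input is well-formed UTF-8.
def first_malformed_offset(input)
  bytes = input.b                               # binary copy; input is untouched
  matched = UTF8_SEQUENCES.match(bytes).end(0)  # length of the valid prefix
  matched == bytes.bytesize ? nil : matched
end

offset = first_malformed_offset("\032p\210\004n\306\271\310gY\002")
puts offset.nil? ? "well-formed UTF-8" : "malformed UTF-8 at byte offset #{offset}"
```

The pattern can always match the empty string at \A, so match never
returns nil here; the end of the match is simply the length of the longest
well-formed prefix. If only a yes/no answer is needed, String#valid_encoding?
on a UTF-8 tagged string is the shorter route.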

My thanks to you and Simon. It's especially nice to see a formal
definition of UTF-8 encapsulated in your regex. I wasn't aware of the
formal definition until someone at work pointed me at this excellent
resource:

  http://en.wikipedia.org/wiki/UTF-8
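
To make the link between that definition and the byte ranges in the regex
a little more concrete, here is a rough sketch of the encoding direction:
turning one code point into its UTF-8 bytes by hand. utf8_bytes is a
hypothetical helper written for illustration only; Ruby's own
[cp].pack("U") already does this job.

```ruby
# Sketch only: encode one Unicode code point into its UTF-8 bytes by hand.
# utf8_bytes is a hypothetical name; [cp].pack("U") is the built-in way.
def utf8_bytes(cp)
  # Reject values outside Unicode and the UTF-16 surrogate range.
  if cp < 0 || cp > 0x10FFFF || (0xD800..0xDFFF).cover?(cp)
    raise ArgumentError, "not a Unicode scalar value: #{cp}"
  end

  if cp < 0x80
    [cp]                                        # 1 byte:  0xxxxxxx
  elsif cp < 0x800
    [0xC0 | (cp >> 6),                          # 2 bytes: 110xxxxx
     0x80 | (cp & 0x3F)]                        #          10xxxxxx
  elsif cp < 0x10000
    [0xE0 | (cp >> 12),                         # 3 bytes: 1110xxxx
     0x80 | ((cp >> 6) & 0x3F),
     0x80 | (cp & 0x3F)]
  else
    [0xF0 | (cp >> 18),                         # 4 bytes: 11110xxx
     0x80 | ((cp >> 12) & 0x3F),
     0x80 | ((cp >> 6) & 0x3F),
     0x80 | (cp & 0x3F)]
  end
end

# U+01B9 is the two-byte sequence \306\271 from the original input.
puts utf8_bytes(0x1B9).map { |b| format("%03o", b) }.join(" ")
```

The smallest code point that needs two bytes, U+0080, comes out as
\xC2 \x80, which is why the regex starts its 2-byte lead range at \xC2:
\xC0 and \xC1 could only ever begin overlong encodings of ASCII.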

Ian
-- 
Ian Macdonald               | Arrakis teaches the attitude of the knife -
System Administrator        | chopping off what's incomplete and saying: 
ian / caliban.org             | "Now it's complete because it's ended
http://www.caliban.org      | here."   -- Muad'dib, "Dune" 
                            |