On Fri 11 Mar 2005 at 10:05:26 +0900, Nikolai Weibull wrote:

> * Ian Macdonald (Mar 11, 2005 01:30):
> > irb(main):001:0> "\032p\210\004n\306\271\310gY\002".unpack("U*")
> > ArgumentError: malformed UTF-8 character
> >         from (irb):1:in `unpack'
> >         from (irb):1
>
> utf8validate.rb:
>
> --- cut here ---
> #! /usr/bin/ruby -w
>
> ARGV[0] =~ /^(
>     [\x00-\x7F]                        # ASCII
>   | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
>   | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
>   | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
>   | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
>   | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
>   | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
>   | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
> )*/x
>
> if $~.end(0) != ARGV[0].length
>   printf("malformed UTF-8 character starting at position %d in the input\n", $~.end(0))
>   exit 1
> end
> --- cut here ---
>
> and from zsh:
>
> % utf8validate.rb $'p\210\004n\306\271\310gY\002'
> malformed UTF-8 character starting at position 2 in the input
> %
>
> For your input, the \210 is wrong, as this regex won't allow it.  I'm
> not 100% sure that this is actually correct, as I haven't verified that
> the regular expression is correct, but I'm guessing it is.  Anyway, now
> you can tell where in the data things blow up,
>         nikolai

My thanks to you and Simon. It's especially nice to see a formal
definition of UTF-8 encapsulated in your regex.

I wasn't aware of the formal definition until someone at work pointed me
at this excellent resource:

http://en.wikipedia.org/wiki/UTF-8

Ian
-- 
Ian Macdonald               | Arrakis teaches the attitude of the knife -
System Administrator        | chopping off what's incomplete and saying:
ian / caliban.org           | "Now it's complete because it's ended
http://www.caliban.org      | here." -- Muad'dib, "Dune"
                            |
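
For anyone trying the quoted approach on a current Ruby, here is a minimal
sketch of the same idea, wrapped in a method so it can be called on any
string rather than on ARGV[0]. The names UTF8_PREFIX and
first_invalid_utf8_offset are made up for illustration and are not part of
the original script. It assumes Ruby 2.0 or later: the /n flag and String#b
make the match byte-oriented, which the 2005 script did not need, and \A
replaces ^ so the match is anchored to the start of the string only.

```ruby
# Byte-oriented pattern (note the /n flag) that consumes the longest
# well-formed UTF-8 prefix of a binary string. Same alternatives as the
# regex quoted above.
UTF8_PREFIX = /\A(?:
    [\x00-\x7F]                        # ASCII
  | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
  | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
  | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
  | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
  | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
  | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
  | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
)*/nx

# Hypothetical helper: returns the byte offset of the first malformed
# sequence, or nil when the whole string is well-formed UTF-8.
def first_invalid_utf8_offset(data)
  bytes = data.b                           # binary copy, so offsets count bytes
  consumed = bytes[UTF8_PREFIX].bytesize   # length of the valid prefix
  consumed == bytes.bytesize ? nil : consumed
end

# The string from the irb session quoted above: \032 and p are fine,
# the stray continuation byte \210 at offset 2 is not.
p first_invalid_utf8_offset("\032p\210\004n\306\271\310gY\002")  # => 2
```

Returning nil for clean input keeps the caller's check to a single
truthiness test, and the offset is in bytes, matching what $~.end(0)
reported in the original script.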