On Sep 13, 2008, at 5:44 PM, Gregory Brown wrote:

> On Sat, Sep 13, 2008 at 6:32 PM, James Gray
> <james / grayproductions.net> wrote:
>> * Have the parser always work in ASCII-8BIT.  I imagine this could work for
>> some cases, but I assume it would do bad things to something like UTF-16.
>
> Not a good option.  Without knowing the encoding, you're really
> talking about ASCII-7bit being the only common point between the most
> popular encodings, as things like Latin1 display accents in one byte
> whereas UTF-8 uses two.
>
> But I suppose if the parser is only looking for things like commas and
> quotes and things, this might be possible.

Yeah, that's kind of the point of ASCII-8BIT, as I understand it.
ASCII is treated as ASCII (the common ground you mention), and
everything else is just bytes.
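
To make that concrete, here is a minimal sketch of the idea, not the parser itself. The sample row is made up for illustration, and it assumes the input encoding is ASCII-compatible (so not UTF-16). The same byte-level split finds the comma whether the row is UTF-8 or Latin-1:

```ruby
# encoding: UTF-8

# Hypothetical sample row; any text with accented characters would do.
utf8   = "café,naïve"
latin1 = utf8.encode("ISO-8859-1")

results = [utf8, latin1].map do |row|
  # Same bytes, relabeled as ASCII-8BIT so no character decoding happens.
  bytes = row.dup.force_encoding("ASCII-8BIT")
  # The comma is byte 0x2C in both encodings, so a byte-level split finds it.
  fields = bytes.split(/,/)
  # Relabel each field with the encoding the row came in with.
  fields.map { |f| f.force_encoding(row.encoding) }
end

results.each { |fields| p fields }
```

The accented characters take two bytes in one row and one byte in the other, but the split never has to know that.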

> Would this mean that smart quotes in UTF-8 would blow up the parser?

Well, they wouldn't match normal quotes, if that's your question:

$ cat smart_quotes.rb
#!/usr/bin/env ruby -w
# encoding: UTF-8

puts %Q{“James”} =~ /\A"/ ? "Match" : "No Match"
$ ruby_dev smart_quotes.rb
No Match

I assume there are values of UTF-8 where it would fail though.  For
example, if the last byte of a multibyte character looks like a quote
or comma.
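
That assumption can be checked by looking at the bytes directly. The sketch below uses a made-up sample string and collects any byte of a multibyte character that falls in the ASCII range (below 0x80). As far as I know, UTF-8 keeps every byte of a multibyte character at 0x80 or above, so the list should come back empty and this particular failure should not occur for UTF-8 itself:

```ruby
# encoding: UTF-8

# Hypothetical sample mixing ASCII delimiters with multibyte characters.
sample = "“James”, café, 日本語"

# Collect every byte that belongs to a multibyte character.
multibyte_bytes = sample.each_char
                        .select { |ch| ch.bytesize > 1 }
                        .flat_map { |ch| ch.bytes.to_a }

# Any of those bytes below 0x80 could be mistaken for an ASCII delimiter.
collisions = multibyte_bytes.select { |b| b < 0x80 }

p collisions
```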

That's why UTF-16 makes a great edge case for testing this stuff.
Every other byte looks like normal ASCII, so the parser would latch on
to the wrong things.  It looks like Ruby knows this though, and just
disallows it:

$ cat utf16_match.rb
#!/usr/bin/env ruby -w
# encoding: UTF-8

data = "abc,xyz"
data.encode!("UTF-16BE")
binary = "[^,]+"
binary.force_encoding("ASCII-8BIT")
re = Regexp.new(binary)

p data.encoding
p binary.encoding
p re.encoding

p data.scan(re)
$ ruby_dev utf16_match.rb
#<Encoding:UTF-16BE>
#<Encoding:ASCII-8BIT>
#<Encoding:US-ASCII>
utf16_match.rb:14:in `scan': incompatible encoding regexp match (US-ASCII
regexp with UTF-16BE string) (ArgumentError)
	from utf16_match.rb:14:in `<main>'

Ruby is "safely downgrading" my Regexp to US-ASCII there, as you can
see.  This is what Michael Selig explained in his message.  I haven't
been able to figure out a way to force it to ASCII-8BIT.
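
One possible workaround, sketched here as an assumption rather than a fix for the Regexp itself, is to relabel the data as ASCII-8BIT instead of the pattern. The match then goes through, but it also shows why byte-level matching goes wrong on UTF-16:

```ruby
# encoding: UTF-8

data  = "abc,xyz".encode("UTF-16BE")
# Same bytes, relabeled as ASCII-8BIT; nothing is transcoded.
bytes = data.dup.force_encoding("ASCII-8BIT")
re    = Regexp.new("[^,]+")

# No exception this time, since the pattern only sees raw bytes.
fields = bytes.scan(re)
p fields
p fields.map { |f| f.bytesize }

# Relabel the fields as UTF-16BE and check whether they are still valid.
p fields.map { |f| f.dup.force_encoding("UTF-16BE").valid_encoding? }
```

The first field picks up the zero byte that belongs to the two-byte comma, so it ends up with an odd number of bytes and is no longer valid UTF-16BE.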

> Please let me know when you start working on this, I'd be happy to
> help test / debug / patch.

I started about a week ago.  My hope is that I'm almost done now, as
soon as I figure out the right way to resolve these issues.

I'll definitely encourage everyone to try it out when I'm finished
though.

James Edward Gray II