On Sat, Sep 13, 2008 at 6:32 PM, James Gray <james / grayproductions.net> wrote:
> I'm trying to get the standard CSV library ready for m17n in Ruby 1.9.  I'm
> going to lay out my thoughts here and would love to get some feedback on
> these issues.
>
> First, it's important to consider what strategy CSV should use to adapt to
> encodings.  CSV generally reads from IO or String objects.  Obviously, those
> could have any number of encodings.  The "parser" is just a few regular
> expressions that break up the input.  I've considered:
>
> * Adapt the parser to the encoding of the input.  This is my current first
> choice.  The parser only cares about a few characters, so if I transcode
> those characters into the encoding of the input, my hope is that it could be
> made to process most data.  Obviously, there could be issues transcoding
> certain characters to certain encodings and I would probably just default to
> ASCII-8BIT in that case.

This sounds best, I agree.
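
To make sure I'm picturing the same thing, here is a minimal sketch of
that idea.  The helper name separator_pattern is made up for
illustration and is not part of the CSV library; it only handles the
separator, not quoting, and it has no ASCII-8BIT fallback for
characters that fail to transcode.

```ruby
# Hypothetical helper, not part of the CSV library: build a separator
# pattern in the same encoding as the data being parsed.
def separator_pattern(data, separator = ",")
  # Transcode the one character the parser cares about into the
  # encoding of the input, then escape it for use in a Regexp.
  encoded = separator.encode(data.encoding)
  Regexp.new(Regexp.escape(encoded))
end

# The same row in two different encodings.
utf16_row  = "caf\u00E9,na\u00EFve".encode("UTF-16LE")
latin1_row = "caf\u00E9,na\u00EFve".encode("ISO-8859-1")

# Each row is split by a pattern built in its own encoding, so the
# resulting fields stay in the encoding the user supplied.
utf16_fields  = utf16_row.split(separator_pattern(utf16_row))
latin1_fields = latin1_row.split(separator_pattern(latin1_row))
```

The fields come back in the caller's encoding, which I take to be the
main attraction of this approach.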

> * Have the parser always work in ASCII-8BIT.  I imagine this could work for
> some cases, but I assume it would do bad things to something like UTF-16.

Not a good option.  Without knowing the encoding, 7-bit ASCII is
really the only common ground between the most popular encodings:
Latin-1, for example, encodes accented characters in one byte,
whereas UTF-8 uses two.

But I suppose if the parser is only looking for things like commas and
quotes, this might be possible.  Would this mean that smart quotes in
UTF-8 would blow up the parser?

> * Transcode all incoming data to UTF-8 and work with that.  This is probably
> the easiest to implement, but I would much rather allow users to work in
> their preferred input.

Not ideal, but totally reasonable IMO, since so many libraries already
work this way.  It then becomes the user's job to make sure they give
you data that can be meaningfully represented in UTF-8, rather than
leaving you to guess at how to handle their encoding.
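
Something along these lines is what I have in mind.  The helper name
to_utf8 is hypothetical; String#encode does the actual work, and the
split on "," is only a placeholder for real parsing.

```ruby
# Hypothetical helper: normalize input to UTF-8 before parsing.
# The caller is responsible for labeling the data with its encoding.
def to_utf8(data, source_encoding = data.encoding)
  data.encode(Encoding::UTF_8, source_encoding)
end

# Correctly labeled Latin-1 input transcodes cleanly, and the parser
# only ever has to deal with UTF-8.
latin1 = "caf\u00E9,na\u00EFve".encode("ISO-8859-1")
fields = to_utf8(latin1).split(",")

# Unlabeled (binary) input with a high byte cannot be transcoded, so
# the problem is reported to the user instead of being guessed at.
unlabeled = "caf\xE9".b
error = nil
begin
  to_utf8(unlabeled)
rescue Encoding::UndefinedConversionError => e
  error = e
end
```

The failure case is the point: the library refuses to guess, and the
user has to say what encoding the data is really in.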

The first approach sounds fine too, but I worry that transcoding
characters into other encodings to build up your regexes might get
cumbersome...

Please let me know when you start working on this, I'd be happy to
help test / debug / patch.

-greg

-- 
Technical Blaag at: http://blog.majesticseacreature.com | Non-tech
stuff at: http://metametta.blogspot.com