On Sat, Sep 13, 2008 at 6:32 PM, James Gray <james / grayproductions.net> wrote:
> I'm trying to get the standard CSV library ready for m17n in Ruby 1.9. I'm
> going to layout my thoughts here and would love to get some feedback on
> these issues.
>
> First, it's important to consider what strategy CSV should use to adapt to
> encodings. CSV generally reads from IO or String objects. Obviously, those
> could have any number of encodings. The "parser" is just a few regular
> expressions that break up the input. I've considered:
>
> * Adapt the parser to the encoding of the input. This is my current first
> choice. The parser only cares about a few characters, so if I transcode
> those characters into the encoding of the input, my hope is that it could be
> made to process most data. Obviously, there could be issues transcoding
> certain characters to certain encodings and I would probably just default to
> ASCII-8BIT in that case.

This sounds best, I agree.

> * Have the parser always work in ASCII-8BIT. I imagine this could work for
> some cases, but I assume it would do bad things to something like UTF-16.

Not a good option. Without knowing the encoding, 7-bit ASCII is really the
only common ground between the most popular encodings, since something like
Latin-1 encodes accented characters in one byte whereas UTF-8 uses two. But
I suppose if the parser is only looking for things like commas and quotes,
this might be possible. Would this mean that smart quotes in UTF-8 would
blow up the parser?

> * Transcode all incoming data to UTF-8 and work with that. This is probably
> the easiest to implement, but I would much rather allow users to work in
> their preferred input.

Not ideal, but totally reasonable IMO. So many libraries already work this
way, and it then becomes the user's job to make sure they give you data that
can be meaningfully represented in UTF-8, rather than the library guessing
at how to handle their encoding naively.
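
To make the first two options a bit more concrete, here is a rough sketch.
It is only an illustration under my own assumptions: split_fields is a
made-up helper, not anything in the CSV library, and it ignores quoting and
the ASCII-8BIT fallback entirely. The last few lines just look at raw bytes
to show where a byte-oriented parser would and would not get confused.

```ruby
# Hypothetical helper (not the CSV library's API): split one line on `sep`
# using a regex transcoded into the line's own encoding.
def split_fields(line, sep = ",")
  # Escape while the separator is still plain ASCII, then transcode it.
  pattern = Regexp.new(Regexp.escape(sep).encode(line.encoding))
  line.split(pattern, -1) # -1 keeps trailing empty fields
end

# Sample data is made up for illustration.
utf16  = "na\u00EFve,caf\u00E9,".encode("UTF-16LE")
fields = split_fields(utf16).map { |f| f.encode("UTF-8") }

# Byte-level view of the "always ASCII-8BIT" option.
curly = "\u201Chi\u201D"              # UTF-8 smart quotes around "hi"
clash = "\u012C".encode("UTF-16LE")   # one character, stored as bytes 0x2C 0x01

curly_is_safe  = (curly.bytes & [0x22, 0x2C]).empty? # no '"' or ',' bytes
clash_has_0x2c = clash.bytes.include?(0x2C)          # looks like a ',' byte
```

If I have this right, the UTF-8 smart quotes never contain a byte that looks
like an ASCII quote or comma, so they should not trip a byte-oriented parser,
while UTF-16 data can contain a 0x2C byte that is not a comma at all.
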

The other approach sounds fine too, but I worry that transcoding characters
into other encodings to build up your regex might get cumbersome...

Please let me know when you start working on this, I'd be happy to help
test / debug / patch.

-greg

--
Technical Blaag at: http://blog.majesticseacreature.com | Non-tech
stuff at: http://metametta.blogspot.com