On Nov 30, 2006, at 2:45 PM, Paul Lutus wrote:

> Also, there is a hidden assumption in your position -- that  
> libraries, ipso facto, represent robust methods.

> For the newbies, however, it might matter. They might think library  
> contents differ from ordinary code.

I sure hope they think that!  I know I do.

There's no faster way to find bugs than to bundle up some code and  
turn it loose on the world.  That leads to more robust code.  This is  
the reason open source development works so well.

If one of us patches a library, everyone benefits.  It's like having  
a few hundred extra programmers on your staff.

Yes, I realize I'm overgeneralizing there.  There will always be
poorly supported or weak libraries, but someone just forks or
replaces those eventually.

>>> On the other hand, if your data does not exploit this CSV trait (few
>>> real-world CSV databases embed linefeeds)...
>>
>> Really?  How do they handle data with newlines in it?
>
> Linefeeds are escaped as though in a normal quoted string. This is how I
> have always dealt with embedded linefeeds, which is why I was ignorant of
> the specification's language on this (an explanation, not an excuse).

So a linefeed is \n, and then we need to escape the \ so that is \\,
I assume.  Interesting.

I would argue that is not CSV, but it's certainly debatable.  My
reasoning is that you either need to post-process the CSV-parsed data
to restore it or use a custom parser that understands CSV plus your
escaping rules.

>> Which "CSV databases" are you referring to here?
>
> MySQL, the database I am most familiar with, uses this method for import
> or export of comma- or tab-separated plain-text data. Within MySQL's own
> database protocol, linefeeds really are linefeeds, but an imported or
> exported plain-text table has them escaped within fields.

Wild.  I use MySQL every day.  Guess I've never dumped a CSV of
linefeed-containing data with it though.  (I generally walk the
database myself with a Ruby script and dump with FasterCSV.)
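
For the curious, the dump side of that kind of script looks roughly
like the sketch below.  fetch_rows() and the query are made-up
placeholders for whatever database access you use; only the FasterCSV
calls are the real interface:

  require "faster_csv"

  FasterCSV.open("people.csv", "w") do |csv|
    csv << %w[id name notes]                  # header row
    # fetch_rows() is an imaginary stand-in for your database layer
    fetch_rows("SELECT id, name, notes FROM people") do |row|
      csv << row   # FasterCSV quotes embedded linefeeds for you
    end
  end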

> It just takes longer if all the database handling (not just record
> parsing) must use the same state machine that field parsing must use.

I don't understand this comment.  Like most databases, MySQL does not
use CSV internally.

> It's very simple, really. Once you allow the record separator inside a
> field, you give up any chance to parse records quickly.

Have you heard of the FasterCSV library?  ;)  It's pretty zippy.
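
For example, embedded linefeeds are handled without any fuss (a small
sketch, assuming Ruby 1.8 with the fastercsv gem installed):

  require "faster_csv"

  data = %Q{id,notes\n1,"first line\nsecond line"\n2,plain\n}
  FasterCSV.parse(data, :headers => true) do |row|
    p row["notes"]    # => "first line\nsecond line", then "plain"
  end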

> But parsing will necessarily be slow, character by character, the entire
> database scan must use an intelligent parser (no splitting records on
> linefeeds as I have been doing), and the state machine needs a few extra
> states.

You don't really have to parse CSV character by character.  FasterCSV  
does most of its parsing with a single highly optimized (to avoid  
backtracking) regular expression and a few tricks.

Basically you can read line by line and divide each line into fields.
If you have an unclosed field at the end of the line, you've hit an
embedded linefeed.  You then just pull in the next line, append it,
and continue eating fields.

The standard CSV library does not do this, and that is one of the two
big reasons it is so slow.

James Edward Gray II