James Edward Gray II wrote:

/ ...

> So a linefeed is \n and then we need to escape the \ so that is \\, I
> assume.  Interesting.
> 
> I would argue that is not CSV, but it's certainly debatable.

Not really debatable, IMHO. Embedded literal linefeeds are optional, and
the two-character "\n" sequence consists of perfectly legal field
characters. It's an application-specific convention that happens to be
fairly common.

Consider what would happen if someone embedded a bunch of HTML break tags in
CSV fields. Does that mean it isn't CSV? Same idea. It's just a common
convention to symbolize a linefeed.
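
To make that concrete, here's a minimal Ruby sketch (using FasterCSV's
parse_line; the field contents are made up for illustration):

  require 'faster_csv'

  # In a single-quoted Ruby string, '\n' is two characters -- a
  # backslash and an 'n' -- not a real linefeed.
  row = FasterCSV.parse_line('one,line one\nline two,three')
  row[1]   # contains the literal backslash-n pair, passed through as-is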

> My 
> reasoning is that you either need to post process the CSV parsed data
> to restore it or use a custom parser that understands CSV plus your
> escaping rules.

Or you can leave it alone and let the calling application deal with it. In
a case like this, there are often two cooperating applications, both of
which know the conventions in use. The CSV parser doesn't have to address
this convention directly.
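
And if the calling application does want real linefeeds back, the
post-processing step is tiny. A sketch (assuming the escape rules quoted
above, where "\\" stands for a backslash and "\n" for a linefeed):

  # Decode the escapes left to right, so that the sequence
  # backslash-backslash-n correctly becomes a backslash followed
  # by an 'n', not a linefeed.
  decoded = field.gsub(/\\(n|\\)/) { $1 == 'n' ? "\n" : '\\' }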

/ ...

> Wild.  I use MySQL everyday.  Guess I've never dumped a CSV of
> linefeed containing data with it though.  (I generally walk the
> database myself with a Ruby script and dump with FasterCSV.)

If you have a database with embedded linefeeds, and you dump it to the
console:

echo "Select * from database.table;" | mysql (authorization)

You will see the linefeeds escaped in the output (the fields are
tab-separated, not comma-separated, but the result is the same).

Oh, that reminds me: tabs are also often represented symbolically in the
same general way, to avoid a conflict with tab field delimiters when those
are in use.
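
For the curious, the output has this general shape (the table data here is
made up, but the escaping is mysql's documented batch-mode behavior --
newlines, tabs, and backslashes come out as \n, \t, and \\):

  id      note
  1       first line\nsecond line
  2       column a\tcolumn b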

>> It just takes longer if all the database
>> handling (not just record parsing) must use the same state machine
>> that
>> field parsing must use.
> 
> I don't understand this comment.  MySQL does not use CSV internally,
> like most databases.

Yes, I know. Here I am referring to the task of parsing CSV tables; this
is no longer about MySQL.

>> It's very simple, really. Once you allow the record separator inside a
>> field, you give up any chance to parse records quickly.
> 
> Have you heard of the FasterCSV library?  ;)  It's pretty zippy.

Yes, and whatever speed it achieves, it is proportionally slower to the
degree that it follows the requirement to allow linefeeds inside fields.

>> But parsing will necessarily be slow, character by character, the
>> entire
>> database scan must use an intelligent parser (no splitting records on
>> linefeeds as I have been doing), and the state machine needs a few
>> extra
>> states.
> 
> You don't really have to parse CSV character by character.  FasterCSV
> does most of its parsing with a single highly optimized (to avoid
> backtracking) regular expression and a few tricks.

Then the regex engine parses character by character instead of explicit
Ruby code doing so. The same proportions apply as explained above, but it
is probably much faster overall (I assume Ruby's regex engine is written
in native code).
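
For readers who want the flavor of that approach, here is a deliberately
simplified field parser of my own -- not FasterCSV's actual expression --
that pushes the per-character work into the regex engine:

  require 'strscan'

  def parse_fields(line)
    fields = []
    s = StringScanner.new(line.chomp)
    until s.eos?
      if s.scan(/"((?:[^"]|"")*)"/)    # quoted field, "" = escaped quote
        fields << s[1].gsub('""', '"')
      else
        fields << s.scan(/[^,]*/)      # unquoted field
      end
      break unless s.scan(/,/)         # consume the comma delimiter
    end
    fields   # note: a trailing empty field ("a,") is skipped, for brevity
  end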

> Basically you can read line by line and divide into fields.

Actually you must do both at once. You can't simply parse record
delimiters without paying attention to the field delimiters and field
contents. I know you know this; I am just adding it for other readers.

> If you 
> have an unclosed field at the end of the line, you hit an embedded
> linefeed.  You then just pull and append the next line and continue
> eating fields.

It must be quite a regex scheme, to deal simultaneously with embedded
field delimiters, the various quoting issues and, finally, embedded record
delimiters.
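
The pull-and-append step itself is simple enough to sketch (my own
illustration, not FasterCSV's code). A record is incomplete exactly when
its double quotes don't balance; "" escapes add two quotes at a time, so
they don't disturb the parity test:

  # Keep pulling physical lines until the logical CSV record is
  # complete, i.e. until its double quotes balance.
  def read_record(io)
    record = io.gets or return nil
    while record.count('"').odd?
      more = io.gets or break   # unterminated quote at end of input
      record << more            # the embedded linefeed stays in place
    end
    record
  end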

> The standard CSV library does not do this and that is one of two big
> reasons it is so slow.

I'm glad to see there is healthy competition for quality code, even in
libraries.

-- 
Paul Lutus
http://www.arachnoid.com