James Edward Gray II wrote:

> On Nov 30, 2006, at 12:36 PM, Paul Lutus wrote:
> 
>> Your own code ... er, excuse me, your own library ... will meet your
>> requirements exactly, it won't cover cases that are not relevant to
>> the
>> problem at hand, it will be much faster overall than existing
>> solutions,
>> and you will learn things about Ruby that you would not if you used
>> someone
>> else's library.
> 
> Now you're guilty of a new sin:  encouraging people to reinvent the
> wheel.  You just can't win, can you?  ;)

If the OP has a problem not easily solved with a library, then he isn't
reinventing the wheel. And I don't care about winning.

> Different problems require different levels of paranoia.

Yes, absolutely. The larger the job and the larger the data set, the more
likely one will encounter border conditions, and the more appropriate it is to
use a state machine that understands the full specification. All at the
cost of speed.

> Sometimes a 
> little code will get you over the hump, but you may be making some
> trade-offs when you don't use a robust library.

Yes, but a robust library is not appropriate if it cannot solve the problem,
or if the learning curve is so steep that it would be easier to write one's
own scanner.

Also, there is a hidden assumption in your position -- that libraries, ipso
facto, represent robust methods.

> Sometimes those are 
> even good trade-offs, like sacrificing edge case handling to gain
> some speed.  Sometimes it's even part of the goal to avoid the
> library, like when I built FasterCSV to address some needs CSV wasn't
> meeting.

That borders on the heretical. :)

> As soon as things start getting serious though, *I* usually 
> feel safer reaching for the library.

I've noticed that. I want to emphasize once again that my style is a
personal preference, not an appeal to authority or untestable precepts.

> The people reading this list have seen us debate the issue now and be
> able to make well informed decisions about what they think is best.

I think 90% of the readers of this newsgroup won't pay any attention to
either of our opinions on this topic. They will realize that inside every
library is code written by a mortal human, so this sort of debate is
primarily tilting at windmills or describing angel occupancy requirements
for heads of pins.

For the newbies, however, it might matter. They might think library contents
differ from ordinary code. And that is true only if the writers of
libraries differ from ordinary coders. Ultimately, they don't, as Microsoft
keeps finding out.

>> On the other hand, if your data does not exploit this CSV trait (few
>> real-world CSV databases embed linefeeds)...
> 
> Really?  How do they handle data with newlines in it?

Linefeeds are escaped as though in a normal quoted string. This is how I
have always dealt with embedded linefeeds, which is why I was ignorant of
the specification's language on this (an explanation, not an excuse).

> Which "CSV databases" are you referring to here?

MySQL, the database I am most familiar with, uses this method for import or
export of comma- or tab-separated plain-text data. Within MySQL's own
database protocol, linefeeds really are linefeeds, but an imported or
exported plain-text table has them escaped within fields.

I create a lot of plain-text databases, and I am constantly presenting them
to MySQL for parsing (or getting plain text back from MySQL), which has only
confirmed my mistaken impression that linefeeds are always escaped in fields
of this class of database.

It's obvious why the specification reads as it does, and I should have known
about this long ago. It just isn't that difficult to parse a quoted field,
and it is no big deal to allow absolutely anything inside one. It just takes
longer when all the database handling (record splitting, not just field
parsing) must go through the same state machine that field parsing requires.

<OT><RANT>

It's very simple, really. Once you allow the record separator inside a
field, you give up any chance to parse records quickly.
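A quick demonstration with Ruby's standard csv library, on two logical
records whose first field contains a literal linefeed:

```ruby
require "csv"

# Two logical records, three physical lines: the first field of
# record one contains a literal linefeed per the specification.
data = %Q{a,"embedded\nlinefeed",c\nd,e,f\n}

puts data.split("\n").size  # 3 -- naive record splitting is wrong
puts CSV.parse(data).size   # 2 -- a real parser finds two records
```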

When a group of people sit down to create a specification, the highest
priority is ... utility, common sense? ... no, it's immunity to criticism.
The easiest way to avoid criticism is to allow absolutely anything, even if
this hurts performance in real-world embodiments that obey the
specification.

Someone might say, "Okay, but can you drop an entire, internally consistent
CSV database into the boundaries of a single field of another CSV database,
without any harm or lost data?" Using the present specification, the
committee can say "yes, absolutely."

But parsing will necessarily be slow, character by character: the entire
database scan must use an intelligent parser (no splitting records on
linefeeds, as I have been doing), and the state machine needs a few extra
states.
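For the curious, a bare-bones sketch of such a state machine (my own toy
parser handling quoted fields, doubled quotes, and embedded linefeeds; not a
substitute for the CSV library):

```ruby
# Character-by-character CSV reader. Because a quoted field may
# contain the record separator, the whole scan -- not just field
# parsing -- must track quoting state.
def parse_csv(text)
  records, record, field = [], [], ""
  in_quotes = false
  chars = text.chars
  i = 0
  while i < chars.size
    c = chars[i]
    if in_quotes
      if c == '"' && chars[i + 1] == '"'  # doubled quote: literal "
        field << '"'
        i += 1
      elsif c == '"'
        in_quotes = false                 # closing quote
      else
        field << c                        # anything, even "\n", is data here
      end
    else
      case c
      when '"'
        in_quotes = true                  # the extra state
      when ","
        record << field
        field = ""
      when "\n"                           # record separator counts only here
        record << field
        records << record
        field, record = "", []
      else
        field << c
      end
    end
    i += 1
  end
  records << (record << field) unless field.empty? && record.empty?
  records
end

p parse_csv(%Q{a,"x\ny",c\nd,e,f\n})
# [["a", "x\ny", "c"], ["d", "e", "f"]]
```

The extra states are exactly the in_quotes branch: while it is true, the
record separator is just data.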

I cannot tell you how many times I have foolishly said, "surely the
specification doesn't allow that!", and I cannot remember ever actually
being right after taking such a position. When I make assumptions about
committees, I am always wrong.

</RANT></OT>

-- 
Paul Lutus
http://www.arachnoid.com