On Aug 16, 2:12 pm, William James <w_a_x_... / yahoo.com> wrote:
> On Aug 16, 2:04 am, Alex Young <a... / blackkettle.org> wrote:
>
> > Michael Linfield wrote:
> > > M. Edward (Ed) Borasky wrote:
> > >> Michael Linfield wrote:
> > >>> ### this sadly only returned an output of  => []
>
> > >>> any ideas?
>
> > >>> Thanks!
> > >> OK ... first of all, define "huge" and what are your restrictions? Let
> > >> me assume the worst case just to get started -- more than 256 columns
> > >> and more than 65536 rows and you're on Windows. :)
>
> > >> Seriously, though, if this is a *recurring* use case rather than a
> > >> one-shot "somebody gave me this *$&%^# file and wants an answer by 5 PM
> > >> tonight!" use case, I'd load it into a database (assuming your database
> > >> doesn't have a column count limitation larger than the column count in
> > >> your file, that is) and then hook up to it with DBI. But if it's a
> > >> one-shot deal and you've got a command line handy (Linux, MacOS, BSD or
> > >> Cygwin) just do "grep blah1 huge-file.csv > temp-file.csv". Bonus points
> > >> for being able to write that in Ruby and get it debugged before someone
> > >> who's been doing command-line for years types that one-liner in. :)
>
> > > lol, alright, let's say the scenario will be in the range of 20k - 70k
> > > lines of data, no more than 20 columns.
> > > I want to avoid using the command line for this, because this will in
> > > fact be used to process more than one data file, which I hope to set
> > > up in optparse with a command-line arg that directs the program to
> > > the file. Also, for the meantime, I wanted to avoid putting it in any
> > > database... avoiding DBI for the meanwhile. But an idea flew through my
> > > head a few minutes ago... what if I did this --
>
> > > res = []
> > > res << File.readlines('filename.csv').grep(/Blah1/)  #thanks chris
>
> > There's a problem with using File.readlines that I don't think anyone's
> > mentioned yet.  I don't know if it's relevant to your dataset, but CSV
> > fields are allowed to contain newlines if the field is quoted.  For
> > example, this single CSV row will break your process:
>
> > 1,2,"foo
> > Blah1",bar
>
> I think that this can be handled easily by this approach:
> to extract a record from the CSV file, continue reading lines
> until the number of double quotes in the record is even.
> Something like
>
> record = ""
> begin
>   record << gets.chomp
> end until record.count( '"' ) % 2 == 0

The "chomp" is a mistake.

record = ""
begin
  record << gets
end until record.count( '"' ) % 2 == 0
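To put the pieces together, here is a small self-contained sketch of that even-quote-count loop applied to a whole file and combined with the grep filtering discussed above. StringIO and the sample data are stand-ins for illustration, not part of anyone's actual code; a real script would open the file instead.

```ruby
require "stringio"

# Stand-in for the real CSV file; the second field of the first
# record contains a quoted embedded newline.
data = StringIO.new(<<~CSV)
  1,2,"foo
  Blah1",bar
  3,4,baz,qux
CSV

records = []
until data.eof?
  record = ""
  begin
    record << data.gets   # keep the newline -- no chomp
  end until record.count('"') % 2 == 0
  records << record
end

# The multi-line record survives intact, so grep finds it.
matches = records.grep(/Blah1/)
```

Note that a record with unbalanced quotes at end-of-file would make `gets` return nil and raise; the sketch assumes well-formed input. Ruby's standard CSV library also handles quoted newlines correctly, so `CSV.parse` or `CSV.foreach` is an option if you'd rather not roll your own record splitter.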