On Aug 16, 2:04 am, Alex Young <a... / blackkettle.org> wrote:
> Michael Linfield wrote:
> > M. Edward (Ed) Borasky wrote:
> >> Michael Linfield wrote:
> >>> ### this sadly only returned an output of  => []
>
> >>> any ideas?
>
> >>> Thanks!
> >> OK ... first of all, define "huge" and what are your restrictions? Let
> >> me assume the worst case just to get started -- more than 256 columns
> >> and more than 65536 rows and you're on Windows. :)
>
> >> Seriously, though, if this is a *recurring* use case rather than a
> >> one-shot "somebody gave me this *$&%^# file and wants an answer by 5 PM
> >> tonight!" use case, I'd load it into a database (assuming your database
> >> doesn't have a column count limitation larger than the column count in
> >> your file, that is) and then hook up to it with DBI. But if it's a
> >> one-shot deal and you've got a command line handy (Linux, MacOS, BSD or
> >> Cygwin) just do "grep blah1 huge-file.csv > temp-file.csv". Bonus points
> >> for being able to write that in Ruby and get it debugged before someone
> >> who's been doing command-line for years types that one-liner in. :)
>
> > lol, alright let's say the scenario will be in the range of 20k - 70k
> > lines of data. no more than 20 columns
> > and i wanna avoid using command line to do this, because yes in fact
> > this will be used to process more than one datafile which i hope to
> > setup in optparse to have a command line arg that directs the prog to
> > the file. also i wanted to for the meantime not have to throw it on any
> > database...avoiding DBI for the meanwhile. But an idea flew through my
> > head a few minutes ago....what if i did this --
>
> > res = []
> > res << File.readlines('filename.csv').grep(/Blah1/)  #thanks chris
>
> There's a problem with using File.readlines that I don't think anyone's
> mentioned yet.  I don't know if it's relevant to your dataset, but CSV
> fields are allowed to contain newlines if the field is quoted.  For
> example, this single CSV row will break your process:
>
> 1,2,"foo
> Blah1",bar

I think this can be handled easily: to extract one record from
the CSV file, keep reading lines until the number of double quotes
in the record is even. (A quote escaped inside a field is written
as two quotes, so it does not change the parity.) Something like:

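record = ""
begin
  line = gets or break   # stop at end of input instead of calling chomp on nil
  record << line         # keep the newline; inside quotes it is part of the field
end until record.count('"').even?
record.chomp!            # drop only the record's trailing newline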
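
To tie this back to the original grep question, here is a rough sketch of
the same quote-counting idea wrapped in a method, so the records can be
filtered afterwards. The method name each_csv_record and the sample data
are made up for illustration, and it assumes quotes appear only as field
delimiters or as doubled escapes; it has not been tried on real files.

```ruby
require "stringio"

# Hypothetical helper (not from any library): yields one logical CSV
# record at a time from any IO-like object that responds to each_line.
def each_csv_record(io)
  return enum_for(:each_csv_record, io) unless block_given?

  record = +""
  io.each_line do |line|
    record << line
    # An odd number of quotes means a quoted field is still open,
    # so the record continues on the next physical line.
    next if record.count('"').odd?

    yield record.chomp
    record = +""
  end
  # Unbalanced quote at end of input: hand back what was read.
  yield record.chomp unless record.empty?
end

# Made-up sample data, including the multi-line row from the thread.
data = StringIO.new(%Q{a,b,c,d\n1,2,"foo\nBlah1",bar\n3,4,Blah1,baz\n})

records = each_csv_record(data).to_a
matches = records.grep(/Blah1/)
```

With a real file, the StringIO would be replaced by something like
File.open('filename.csv') { |f| each_csv_record(f).grep(/Blah1/) }.
This only splits the input into records; splitting a record into
fields still needs a proper CSV parser.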