Jamey Cribbs wrote:
> Thomas Mueller wrote:
>> 2006/11/30, Drew Olson <olsonas / gmail.com>:
>>> I'll give FasterCSV a try when I get home from work and out from behind
>>> this proxy. Here's another question: in some cases I need to sort the
>>> file before splitting it (in this case sorting by the 4th cell in each
>>> row). However, the current file I'm trying to sort and split is around
>>> 76 MB and ruby fails when trying to store the CSV as an array. The code
>>> and output are below. How else can I go about this?
>
> I'm coming to this party really late, so I hope I don't come across as 
> shamelessly plugging KirbyBase, but, you might want to try it for this.
>
> If you are simply trying to take a large csv file, sort it by one of 
> its fields, and split it up into smaller files that each contain 
> 40,000 records, I think it might work.
>
> Here's some code (not tested, could be incorrect) off the top of my head:
>
>
> require 'kirbybase'
>
> db = KirbyBase.new
>
> tbl = db.create_table(:foo, :field1, :String, :field2, :Integer, 
> :field3, :String............................
>
> tbl.import_csv(name_of_csv_file)
>
> rec_count = tbl.total_recs
> last_recno_written_out = 0
>
> while rec_count > 0
>  recs = tbl.select { |r| r.recno > last_recno_written_out and r.recno < last_recno_written_out + 40000 }.sort(:field4)
>
>  # ........ here is where you put the code to write these 40,000 recs to a csv output file .............
>
>  last_recno_written_out = recs.last.recno
>
>  rec_count = rec_count - 40000
> end

I realized this morning that the solution I posted last night won't work 
because you need the whole dataset sorted *before* you start splitting 
it up into 40,000-record files.  Oops!
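
To make the problem concrete, here is a tiny sketch in plain Ruby. The sample rows and the chunk size of 2 are made up for illustration; the sort key is the 4th cell, as in Drew's question. Sorting inside each chunk leaves the chunks overlapping, while sorting first and then slicing does not:

```ruby
# Made-up sample rows; the 4th cell (index 3) is the sort key.
rows = [
  ['a', 'x', 1, 'delta'],
  ['b', 'x', 2, 'alpha'],
  ['c', 'x', 3, 'charlie'],
  ['d', 'x', 4, 'bravo']
]

# What last night's code effectively did: split first, then sort each chunk.
split_then_sort = rows.each_slice(2).map { |chunk| chunk.sort_by { |r| r[3] } }

# What is actually needed: sort the whole dataset, then split.
sort_then_split = rows.sort_by { |r| r[3] }.each_slice(2).to_a

# Small helper to show only the sort keys of each chunk.
keys = ->(chunks) { chunks.map { |chunk| chunk.map { |r| r[3] } } }

p keys.call(split_then_sort)  # [["alpha", "delta"], ["bravo", "charlie"]]
p keys.call(sort_then_split)  # [["alpha", "bravo"], ["charlie", "delta"]]
```

In the first result "bravo" lands in the second file even though it sorts before "delta" in the first one, which is exactly the mistake.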

Anyway, in an attempt to recover gracefully from my mistake and also to 
give me the opportunity to shamelessly plug another one of my libraries, 
I present the following proposed solution that is totally untested and 
probably full of holes:

  require 'mongoose'
  db = Mongoose::Database.new

  db.create_table(:foo) do |tbl|
    tbl.add_column(:field1, :string)
    tbl.add_column(:field2, :string)
    tbl.add_column(:field3, :integer)
    tbl.add_indexed_column(:field4, :string)
    .
    .
    .
  end

  Foo.import(csv_filename)

  total_recs_written = 0

  while total_recs_written < Foo.last_id_used
    recs = Foo.find(:order => :field4, :offset => total_recs_written, :limit => 40000)

    # ........ here is where you put the code to write these 40,000 recs
    # to a csv output file .............

    break if recs.empty?   # don't loop forever if the ids have gaps
    total_recs_written += recs.size
  end
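
For the part I left as a placeholder, here is a rough sketch using Ruby's standard csv library. The helper name write_chunks, the prefix argument and the file-naming scheme are all made up for illustration. It also assumes each record has already been turned into a plain array of cell values, which is not necessarily what Mongoose hands back:

```ruby
require 'csv'

# Hypothetical helper: takes rows that are ALREADY sorted, cuts them into
# slices of chunk_size, and writes each slice to "<prefix>_<n>.csv".
# Returns the list of file paths it wrote.
def write_chunks(sorted_rows, chunk_size, prefix)
  paths = []
  sorted_rows.each_slice(chunk_size).with_index(1) do |rows, n|
    path = "#{prefix}_#{n}.csv"
    CSV.open(path, 'w') do |csv|
      rows.each { |row| csv << row }   # each row is an array of cells
    end
    paths << path
  end
  paths
end
```

Something like write_chunks(rows, 40_000, 'output') would then produce output_1.csv, output_2.csv and so on. This version only helps if the sorted rows fit in memory; inside the loop above you would instead write one file per batch.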


Jamey
