On 11/29/06, James Edward Gray II <james / grayproductions.net> wrote:
> On Nov 29, 2006, at 12:56 PM, Drew Olson wrote:
>
> > Here's another question: in some cases I need to sort the
> > file before splitting it (in this case sorting by the 4th cell in each
> > row). However, the current file I'm trying to sort and split is around
> > 76 MB and ruby fails when trying to store the CSV as an array. The
> > code
> > and output are below. How else can I go about this?
>
> Hmm, that's a good question.
>
> Perhaps you can collect just the key values in an Array and then use
> those to reorder the lines bit by bit.  That's not going to be fast
> with any library helping you, but I don't have a better idea.
>
> James Edward Gray II

Indeed.  That problem is difficult in general because you need to have
the whole set of elements in memory before you can begin sorting them.
As James pointed out, you might be able to use some sort of
memoization technique to track only the bits relevant to sorting.  The
problem is you'll also need some way to get back to the original
record.

Depending on how you end up parsing the records, you might be able
to store the file position of the start of the record and the record
length.

Records -> [sort_key, file.pos, record.length]

Then sort those arrays if you can fit them all in memory.  Finally,
you can use the offsets for random access to grab the records and
stick them into the new files as you've been doing.
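Something like the following sketch shows the idea.  It builds a tiny
sample file inline so it's self-contained (Drew's real 76 MB file and
4th-column key would take its place); only the small
[key, offset, length] tuples are held in memory, and the records are
re-read by byte offset when writing out the sorted result:

```ruby
require "csv"
require "tempfile"

# Stand-in for the real data file: three records, sort key in cell 4.
input = Tempfile.new("data")
input.write("a,b,c,zebra\n" \
            "d,e,f,apple\n" \
            "g,h,i,mango\n")
input.flush

# Pass 1: record [sort_key, byte offset, byte length] for each line.
index = []
File.open(input.path) do |f|
  until f.eof?
    offset = f.pos
    line   = f.gets
    key    = CSV.parse_line(line)[3]
    index << [key, offset, line.bytesize]
  end
end

# Sort the small tuples in memory (Array#sort compares by key first).
index.sort!

# Pass 2: seek back to each record in key order and pull it out.
sorted = File.open(input.path) do |f|
  index.map do |_key, offset, length|
    f.seek(offset)
    f.read(length)
  end
end

puts sorted.join
```

From there you'd write each record into the appropriate split file
instead of printing it.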

Basically, you're looking at a complicated Schwartzian transform.
Whether it will work depends on how big your records are.  If they are
fairly large, you might be able to pull it off; however, if they're
small and the problem is only that there are too many records, you'll
still have a problem.


In that case, you might want to just shove them in an RDBMS and let it
sort it for you.


-- 
Lou.