Louis J Scoras wrote:

> On 11/29/06, James Edward Gray II <james / grayproductions.net> wrote:
>
>> On Nov 29, 2006, at 12:56 PM, Drew Olson wrote:
>>
>> > Here's another question: in some cases I need to sort the
>> > file before splitting it (in this case sorting by the 4th cell in each
>> > row). However, the current file I'm trying to sort and split is around
>> > 76 MB and Ruby fails when trying to store the CSV as an array. The
>> > code and output are below. How else can I go about this?
>>
>> Hmm, that's a good question.
>>
>> Perhaps you can collect just the key values in an Array and then use
>> those to reorder the lines bit by bit.  That's not going to be fast
>> with any library helping you, but I don't have a better idea.
>>
>> James Edward Gray II
>
>
> Indeed.  That problem is difficult in general because you need to have
> the whole set of elements in memory before you can begin sorting them.
> As James pointed out, you might be able to use some sort of
> memoization technique to track only the bits relevant to sorting.  The
> problem is you'll also need some way to get back to the original
> record.
>
> Depending on how you end up parsing the records, you might be able
> to store the file position of the start of the record and the record
> length.
>
> Records -> [sort_key, file.pos, record.length]
>
> Then sort those arrays if you can fit them all in memory.  Finally,
> you can use the offsets for random access to grab the records and
> stick them into the new files as you've been doing.
>
> Basically, you're looking at a complicated Schwartzian transform.
> Whether it will work depends on how big your records are.  If they are
> fairly large, you might be able to pull it off; however, if they're
> small and the problem is only that there are too many records, you'll
> still have a problem.
>
>
> In that case, you might want to just shove them in an RDBMS and let it
> sort it for you.
>
>
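For what it's worth, the [sort_key, file.pos, record.length] index described
above could be sketched roughly like this. It is only a sketch: it assumes one
record per line (no newlines embedded inside quoted fields), and the method
name sort_file_by_offsets and the key-extracting block are names I made up for
illustration.

```ruby
# Sort a large line-oriented file without loading the records themselves.
# Only a small [key, byte_offset, byte_length] triple is kept per record.
# Assumption: one record per line. The block receives a raw line and
# returns that line's sort key.
def sort_file_by_offsets(in_path, out_path)
  index = []

  # Pass 1: remember where each record starts and how long it is.
  File.open(in_path, "rb") do |f|
    until f.eof?
      pos  = f.pos
      line = f.gets
      index << [yield(line).to_s, pos, line.bytesize]
    end
  end

  # Sort the small triples; ties keep their original file order.
  index.sort_by! { |key, pos, _len| [key, pos] }

  # Pass 2: random access back into the source, copying in sorted order.
  File.open(in_path, "rb") do |src|
    File.open(out_path, "wb") do |dst|
      index.each do |_key, pos, len|
        src.seek(pos)
        dst.write(src.read(len))
      end
    end
  end
end
```

To sort by the 4th cell you might call it as
sort_file_by_offsets("in.csv", "sorted.csv") { |line| line.chomp.split(",")[3] }.
The naive split breaks on quoted commas, so CSV.parse_line(line)[3] is the
safer key extractor for real CSV data. Keys compare as strings here; convert
them inside the block if you need numeric order.
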
Let's say you want to sort by the foo column.

Read in all the foo values and sort them.
Get every 40,000th value from the sorted list; these are the page boundaries.
Now, upon reading any row, you can determine which page it should go on.
Read the file and keep the rows for the first N pages, ignoring the rest
of the rows, where N is a number that won't run you out of memory.
Create the files for those rows.
Remove references to the rows you read in.
Repeat with the next N pages until finished.
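
A rough sketch of those steps, in case it helps. The method name split_sorted,
the page_NNNN.csv file naming, and the keyword arguments are all made up for
illustration. It assumes one record per line, no header row, and keys that
compare correctly as returned by the block.

```ruby
# Split a file into sorted pages while holding only a few pages in memory.
# The block receives a raw line and returns that line's sort key.
def split_sorted(path, out_dir, page_size: 40_000, pages_per_pass: 10, &key_of)
  # Read in only the keys, sort them, and take every page_size-th key as
  # the first key of a page. Duplicate boundaries collapse into one page,
  # so a page can grow past page_size when many rows share a key.
  keys = File.foreach(path).map(&key_of).sort
  boundaries = (0...keys.size).step(page_size).map { |i| keys[i] }.uniq
  keys = nil # the full key list is no longer needed

  (0...boundaries.size).each_slice(pages_per_pass) do |batch|
    rows = Hash.new { |h, page| h[page] = [] }

    # Re-read the file, keeping only rows that belong to this batch of pages.
    File.foreach(path) do |line|
      key  = key_of.call(line)
      # Page number = index of the last boundary that is <= key.
      page = (boundaries.bsearch_index { |b| b > key } || boundaries.size) - 1
      rows[page] << [key, line] if batch.include?(page)
    end

    # Write each page in sorted order; rows goes out of scope after the pass.
    rows.each do |page, pairs|
      File.open(File.join(out_dir, format("page_%04d.csv", page)), "w") do |out|
        pairs.sort_by(&:first).each { |_key, line| out.write(line) }
      end
    end
  end
end
```

The trade-off is one full read of the file per batch of pages, so a larger
pages_per_pass means fewer passes but more memory.
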