On Sun, 27 Apr 2008 05:32:52 -0500, Ams Lo wrote:

> Hi -
> 
> What is the fast and ruby way to transpose a large file(>2GB)??
> 
> I can't read the whole file into memory due to the file sizes..

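I assume all of the lines in the file are the same length, and that the
line length is a multiple of the disk block size. (I'm also assuming the
data is square, as drawn below, so that every chunk x,y has a partner
y,x to swap with.) Then the file is laid out like this: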

000011112222
333344445555
666677778888
9999aaaabbbb
ccccddddeeee
ffffgggghhhh
iiiijjjjkkkk
llllmmmmnnnn
ooooppppqqqq
rrrrsssstttt
uuuuvvvvwwww
xxxxyyyyzzzz

(where each digit or letter is the number of the disk block that holds
that piece of data). We're going to transpose in chunks, indicated by
separators, so that each chunk has one full disk block per row:


0000|1111|2222
3333|4444|5555
6666|7777|8888
9999|aaaa|bbbb
----+----+----
cccc|dddd|eeee
ffff|gggg|hhhh
iiii|jjjj|kkkk
llll|mmmm|nnnn
----+----+----
oooo|pppp|qqqq
rrrr|ssss|tttt
uuuu|vvvv|wwww
xxxx|yyyy|zzzz

Basically, for each pair of chunks x,y and y,x where (x>y):
   * read in chunk x,y and chunk y,x
     (here, that's 8 disk blocks)
   * transpose the data in chunk x,y in place in memory
   * transpose the data in chunk y,x in place in memory
   * write the transposed data from chunk x,y to chunk y,x
   * write the transposed data from chunk y,x to chunk x,y
Then, for each chunk x,x (the diagonal):
   * read in chunk x,x
   * transpose the data in chunk x,x in memory
   * write the transposed data back out to chunk x,x

Thus, each disk block is read exactly once and written exactly once.
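
The steps above could be sketched in Ruby roughly as follows. This is
only a sketch under some assumptions of mine, not tested on anything
large: the file holds an n-by-n square of single bytes with no line
terminators, n is a multiple of the chunk width b, and the names
read_chunk, write_chunk, transpose_chunk and transpose_file! are made
up for illustration.

```ruby
# Read the b-by-b chunk at chunk column cx, chunk row cy
# as an array of b row strings (one seek + read per row).
def read_chunk(io, n, b, cx, cy)
  Array.new(b) do |r|
    io.seek((cy * b + r) * n + cx * b)
    io.read(b)
  end
end

# Write b row strings back to the chunk at chunk column cx, chunk row cy.
def write_chunk(io, n, b, cx, cy, rows)
  rows.each_with_index do |row, r|
    io.seek((cy * b + r) * n + cx * b)
    io.write(row)
  end
end

# Transpose a chunk held in memory: byte (r, c) moves to (c, r).
def transpose_chunk(rows)
  rows.map(&:bytes).transpose.map { |bytes| bytes.pack("C*") }
end

# Transpose an n-by-n byte matrix stored in the file at path, in place,
# working on b-by-b chunks. Assumes no line terminators in the file.
def transpose_file!(path, n, b)
  raise ArgumentError, "n must be a multiple of b" unless (n % b).zero?
  chunks = n / b
  File.open(path, "r+b") do |io|
    chunks.times do |x|
      # Off-diagonal pairs: swap the transposed chunks.
      (0...x).each do |y|
        a = transpose_chunk(read_chunk(io, n, b, x, y))
        c = transpose_chunk(read_chunk(io, n, b, y, x))
        write_chunk(io, n, b, y, x, a)
        write_chunk(io, n, b, x, y, c)
      end
      # Diagonal chunk: transpose it onto itself.
      write_chunk(io, n, b, x, x, transpose_chunk(read_chunk(io, n, b, x, x)))
    end
  end
end
```

For the 12x12 picture above, the call would be something like
transpose_file!("matrix.dat", 12, 4), where "matrix.dat" is a
placeholder name. Only two chunks are ever held in memory at once, so
b can be as large as your memory allows.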

If your lines don't align precisely to disk blocks, then the chunk rows
straddle block boundaries and you get:

0000|0111|1122
2223|3333|4444
4555|5566|6667
7777|8888|8999
----+----+----
99aa|aaab|bbbb
cccc|cddd|ddee
eeef|ffff|gggg
ghhh|hhii|iiij
----+----+----
jjjj|kkkk|klll
llmm|mmmn|nnnn
oooo|oppp|ppqq
qqqr|rrrr|ssss
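
To check a picture like this one, here is a small Ruby sketch that
lists which blocks one row of a chunk touches. I'm reading the diagram
as 12-byte lines laid over 5-byte blocks; the helper name
blocks_touched and its parameters are made up for illustration.

```ruby
# Block numbers touched by one row of a chunk.
#   row:        line number within the file (0-based)
#   col:        first column of the chunk within the line
#   width:      bytes per chunk row
#   line_len:   bytes per line
#   block_size: bytes per disk block
def blocks_touched(row, col, width, line_len, block_size)
  first = row * line_len + col   # offset of the first byte of the row
  last  = first + width - 1      # offset of the last byte of the row
  (first / block_size..last / block_size).to_a
end

p blocks_touched(0, 0, 4, 12, 5)  # => [0]
p blocks_touched(0, 4, 4, 12, 5)  # => [0, 1]
p blocks_touched(0, 8, 4, 12, 5)  # => [1, 2]
```

Those three results match the first row of the picture (0000, 0111,
1122).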

You could make a pass through the file to pad the lines first. But it
looks like you could instead choose chunk sizes so that you'd touch
each disk block at most twice anyway, and that would have the same
performance characteristics (ignoring disk seek time) as making a
separate pass through the file to pad it.
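
If you do go the padding route, the pass itself might look something
like this in Ruby. It is a sketch only: the name pad_lines and the
choice of a space as the pad byte are my own assumptions, it drops the
line terminators, and whatever does the transposing afterwards still
has to account for the padding bytes.

```ruby
# Copy src to dst, right-padding every line to the next multiple of
# block_size bytes. Line terminators are dropped (an assumption of
# this sketch), so dst is a run of fixed-length, block-aligned records.
def pad_lines(src, dst, block_size, pad = " ")
  File.open(src, "rb") do |input|
    File.open(dst, "wb") do |output|
      input.each_line(chomp: true) do |line|
        # Round the line length up to a whole number of blocks.
        padded_len = ((line.bytesize + block_size - 1) / block_size) * block_size
        output.write(line.ljust(padded_len, pad))
      end
    end
  end
end
```

It reads one line at a time, so memory use stays at one line no matter
how big the file is.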

--Ken

-- 
Ken (Chanoch) Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.
http://www.iit.edu/~kbloom1/