On Sun, 27 Apr 2008 05:32:52 -0500, Ams Lo wrote:

> Hi -
>
> What is the fast and ruby way to transpose a large file(>2GB)??
>
> I can't read the whole file into memory due to the file sizes..

I assume all of the lines in the file are the same length, and the line
length is a multiple of the disk block size. Thus the file is laid out like

000011112222
333344445555
666677778888
9999aaaabbbb
ccccddddeeee
ffffgggghhhh
iiiijjjjkkkk
llllmmmmnnnn
ooooppppqqqq
rrrrsssstttt
uuuuvvvvwwww
xxxxyyyyzzzz

(where the letters and numbers indicate the block number where the data is
stored).

We're going to transpose in chunks, indicated by separators, so that each
chunk has one full disk block per row:

0000|1111|2222
3333|4444|5555
6666|7777|8888
9999|aaaa|bbbb
----+----+----
cccc|dddd|eeee
ffff|gggg|hhhh
iiii|jjjj|kkkk
llll|mmmm|nnnn
----+----+----
oooo|pppp|qqqq
rrrr|ssss|tttt
uuuu|vvvv|wwww
xxxx|yyyy|zzzz

Basically, for each chunk x,y where (x>y):
* read in chunk x,y and y,x (here, that's 8 disk blocks)
* transpose the data in chunk x,y in place in memory
* transpose the data in chunk y,x in place in memory
* write the transposed data from chunk x,y to chunk y,x
* write the transposed data from chunk y,x to chunk x,y

Now, for each chunk x,x (the diagonal):
* read in chunk x,x
* transpose the data in chunk x,x in memory
* write out the transposed data to chunk x,x

Thus, each disk block is read exactly once and written exactly once.
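The steps above could be sketched in Ruby roughly as follows. This is only a
minimal sketch under assumptions I'm adding for illustration: the file is a
square matrix of n rows of n bytes each with no line terminators, and the
chunk size divides n evenly (ideally it would equal the disk block size). The
names transpose_file!, read_chunk, write_chunk and transpose_rows are
hypothetical helpers, not part of any library.

```ruby
# Sketch: transpose an n-by-n byte matrix stored row-major in a file,
# in place, one square chunk at a time.
# ASSUMPTIONS (for illustration only): no newline bytes between rows,
# the matrix is square, and `chunk` divides `n`.

# Read the chunk at chunk-column cx, chunk-row cy as an array of row strings.
def read_chunk(io, n, chunk, cx, cy)
  Array.new(chunk) do |r|
    io.seek((cy * chunk + r) * n + cx * chunk)
    io.read(chunk)
  end
end

# Write an array of row strings back to the chunk at (cx, cy).
def write_chunk(io, n, chunk, cx, cy, rows)
  rows.each_with_index do |row, r|
    io.seek((cy * chunk + r) * n + cx * chunk)
    io.write(row)
  end
end

# Transpose a chunk held in memory (rows become columns).
def transpose_rows(rows)
  rows.map(&:bytes).transpose.map { |col| col.pack("C*") }
end

def transpose_file!(path, n, chunk)
  raise ArgumentError, "chunk must divide n" unless (n % chunk).zero?

  File.open(path, "r+b") do |io|
    (n / chunk).times do |x|
      # Diagonal chunk x,x: transpose and write back to the same place.
      diag = transpose_rows(read_chunk(io, n, chunk, x, x))
      write_chunk(io, n, chunk, x, x, diag)

      # Off-diagonal pairs: transpose both chunks, then swap their positions.
      (0...x).each do |y|
        a = transpose_rows(read_chunk(io, n, chunk, x, y))
        b = transpose_rows(read_chunk(io, n, chunk, y, x))
        write_chunk(io, n, chunk, y, x, a)
        write_chunk(io, n, chunk, x, y, b)
      end
    end
  end
end
```

Only two chunks are held in memory at any time, so memory use is about
2 * chunk * chunk bytes regardless of the file size. A call might look like
transpose_file!("matrix.bin", 4096, 512), where the path and sizes are
made-up values.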
If your lines don't align precisely to disk blocks then you get:

0000|0111|1122
2223|3333|4444
4555|5566|6667
7777|8888|8999
----+----+----
99aa|aaab|bbbb
cccc|cddd|ddee
eeef|ffff|gggg
ghhh|hhii|iiij
----+----+----
jjjj|kkkk|klll
llmm|mmmn|nnnn
oooo|oppp|ppqq
qqqr|rrrr|ssss

You could make a pass through the file to pad things first, but it looks
like you could choose appropriate chunk sizes so that you'd touch each disk
block at most twice anyway, and that would have the same performance
characteristics (ignoring disk seek time) as making a pass through the file
to pad it.

--Ken

-- 
Ken (Chanoch) Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.
http://www.iit.edu/~kbloom1/