On 06.07.2009 14:41, Greg Willits wrote:

> -- is doing 200,000 or even 500,000 at a time in chunks really any
> faster than doing them one at a time -- that I actually don't know yet,
> I am just now finishing all the code that this "chunking" touches and
> ensure I get the same results I used to. the size of the chunks isn't as
> important for speed as it is for memory management -- making sure I stay
> within 4GB.

From my experience doing one at a time is efficient enough. Of course you then cannot do the line length math. But for that I'd probably consider doing something different: what about storing all file offsets in an Array and writing it to a file "<orig>.idx" via Marshal? That way you do not need fill bytes, your files are smaller and you can process one line at a time. Reading a single entry is then easy: just slurp in the index in one go and seek to index[record_index].

robert@fussel ~
$ time ruby19 -e 'h=(1..1_000_000).map {|i| i << 6};File.open("x","wb") {|io| Marshal.dump(h,io)}'

real    0m1.848s
user    0m1.483s
sys     0m0.249s

robert@fussel ~
$ ls -l x
-rw-r--r-- 1 robert Kein 5736837 Jul  6 20:01 x

Apart from that, this will reduce the memory usage of individual processes, and you might be able to better utilize your cores. Dumping via Marshal is pretty fast and the memory overhead of that single index array is not too big.

Alternatively you can write the index file while you are writing the main data file. You just need to fix the number of bits you reserve for each file offset. Then the read operation can be done via two seek operations (first on the index, then on the data file) if you do not cache the index.

> -- as for the speed issues, we've done a lot of profiling, and even
> wrote a mini compiler to read our file transformation DSL and output a
> stream of inline variable declarations and commands which gets included
> as a module on the fly for each data source. That trick saved us from
> parsing the DSL for each data row and literally shaved hours off the
> total processing time. We attacked many other levels of optimization
> while working to keep the code as readable as possible, because it's a
> complicated layering of abstractions and processes.

Did you consider making your DSL generate the code, or is that what you are already doing?

Kind regards

robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
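
For reference, the "<orig>.idx" idea from the message above could look roughly like the sketch below. It is only an illustration under a few assumptions: records are newline-terminated lines, and the helper names (build_index, load_index, read_record) are made up here and do not come from the thread.

```ruby
require "tmpdir"

# Collect the byte offset at which each line starts and dump the
# resulting Array next to the data file as "<data_path>.idx".
# (Hypothetical helper; the name is not from the original post.)
def build_index(data_path)
  offsets = []
  File.open(data_path, "rb") do |io|
    until io.eof?
      offsets << io.pos   # start of the next line
      io.gets             # skip over the line itself
    end
  end
  File.open("#{data_path}.idx", "wb") { |io| Marshal.dump(offsets, io) }
  offsets
end

# Slurp the whole index back in one go.
# Marshal.load should only be used on files you wrote yourself.
def load_index(data_path)
  File.open("#{data_path}.idx", "rb") { |io| Marshal.load(io) }
end

# Seek straight to one record and read it, without touching the others.
def read_record(data_path, index, record_index)
  File.open(data_path, "rb") do |io|
    io.seek(index[record_index])
    io.gets.chomp
  end
end

# Small demonstration on a throwaway file.
Dir.mktmpdir do |dir|
  path = File.join(dir, "data.txt")
  File.open(path, "wb") { |io| io.write("alpha\nbravo charlie\ndelta\n") }
  build_index(path)
  index = load_index(path)
  puts read_record(path, index, 1)
end
```

Because the offsets are stored explicitly, the lines can have any length and no fill bytes are needed. The trade-off is that the whole index has to be loaded (or cached) before the first lookup.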