On 06.07.2009 14:41, Greg Willits wrote:

> -- is doing 200,000 or even 500,000 at a time in chunks really any 
> faster than doing them one at a time -- that I actually don't know yet; 
> I am just now finishing all the code that this "chunking" touches and 
> ensuring I get the same results I used to. The size of the chunks isn't 
> as important for speed as it is for memory management -- making sure I 
> stay within 4GB.

From my experience, doing one at a time is efficient enough.  Of course 
you then cannot do the line-length math.  But for that I'd consider 
doing something different: what about storing all file offsets in an 
Array and writing it to a file "<orig>.idx" via Marshal?  That way you 
do not need fill bytes, your files are smaller, and you can still 
process one line at a time.  Reading a single entry is then easy: just 
slurp in the index in one go and seek to index[record_index].
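A minimal sketch of that idea (file names and sample records are made 
up for illustration): scan the data file once, remember each line's 
starting byte offset, Marshal the Array to "<orig>.idx", and later seek 
directly to any record.

```ruby
DATA = "data.txt"
IDX  = "#{DATA}.idx"

# Write some sample variable-length records -- no fill bytes needed.
File.open(DATA, "wb") do |io|
  %w[alpha bravo charlie delta].each { |rec| io.puts(rec) }
end

# Record the byte offset of every line in a single sequential pass.
offsets = []
File.open(DATA, "rb") do |io|
  until io.eof?
    offsets << io.pos
    io.gets
  end
end

# Dump the index next to the data file.
File.open(IDX, "wb") { |io| Marshal.dump(offsets, io) }

# Random access: slurp the index in one go, then one seek per record.
index  = File.open(IDX, "rb") { |io| Marshal.load(io) }
record = File.open(DATA, "rb") do |io|
  io.seek(index[2])
  io.gets.chomp
end
# record => "charlie"
```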

robert@fussel ~
$ time ruby19 -e 'h=(1..1_000_000).map {|i| i << 6};File.open("x","wb") 
{|io| Marshal.dump(h,io)}'

real    0m1.848s
user    0m1.483s
sys     0m0.249s

robert@fussel ~
$ ls -l x
-rw-r--r-- 1 robert Kein 5736837 Jul  6 20:01 x

Apart from that, this will reduce the memory usage of individual 
processes, and you might be able to utilize your cores better.  Dumping 
via Marshal is pretty fast, and the memory overhead of that single 
index array is not too big.

Alternatively, you can write the index file while you are writing the 
main data file.  You just need to fix the number of bytes you reserve 
for each file offset.  Then a read can be done via two seek operations 
(first on the index, then on the data file) if you do not cache the 
index.
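Sketched out, again with invented file names, and assuming 8 bytes 
(64-bit unsigned, native endian) per index entry:

```ruby
OFFSET_WIDTH = 8  # bytes reserved per file offset in the index

# Write the index entry just before appending each record, so both
# files stay in sync without a second pass.
File.open("main.dat", "wb") do |dat|
  File.open("main.idx", "wb") do |idx|
    %w[first second third].each do |rec|
      idx.write([dat.pos].pack("Q"))
      dat.puts(rec)
    end
  end
end

# Read record n with two seeks: one into the index, one into the data.
def fetch(n)
  offset = File.open("main.idx", "rb") do |idx|
    idx.seek(n * OFFSET_WIDTH)
    idx.read(OFFSET_WIDTH).unpack1("Q")
  end
  File.open("main.dat", "rb") do |dat|
    dat.seek(offset)
    dat.gets.chomp
  end
end
# fetch(1) => "second"
```

Since every index entry has a known width, the index never needs to be 
loaded whole -- though slurping it once and caching it saves one seek 
per lookup.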

> -- as for the speed issues, we've done a lot of profiling, and even 
> wrote a mini compiler to read our file transformation DSL and output a 
> stream of inline variable declarations and commands which gets included 
> as a module on the fly for each data source. That trick saved us from 
> parsing the DSL for each data row and literally shaved hours off the 
> total processing time. We attacked many other levels of optimization 
> while working to keep the code as readable as possible, because it's a 
> complicated layering of abstractions and processes.

Did you consider making your DSL generate the code, or is that what you 
are doing?

Kind regards

	robert


-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/