On Thu, Feb 24, 2011 at 4:09 AM, Philip Rhoades <phil / pricom.com.au> wrote:
> I have a script that does:
>
> - statistical processing from data in 50x32x20 (32,000) large input files
>
> - writes a small text file (22 lines with one or more columns of numbers)
> for each input file
>
> - read all small files back in again for final processing.
>
> Profiling shows that IO is taking up more than 60% of the time - short of
> making fewer, larger files for the data (which is inconvenient for random
> viewing/ processing of individual results) are there other alternatives to
> using the "File" and "IO" classes that would be faster?

I think that whatever you do, as long as you neither get rid of the IO
nor improve the IO access patterns, your performance gains will only
be marginal.  Even a C extension would not help you if you stick with
the same IO patterns.

We should probably learn more about the nature of your processing, but
considering that you only write 32,000 * 22 * 80 (estimated line
length) = 56,320,000 bytes (~ 54MB), NOT writing those small files is
probably an option.  Spending 54MB of memory on a structure suitable
for later processing (i.e. you do not need to parse all those small
files) is a small price compared to the large amount of IO you need to
do to read that data back again (plus the CPU cycles for parsing).
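
A minimal sketch of what I mean, under assumptions: process_large_file
and the file names are hypothetical stand-ins for your real statistics
code and input files, and a Hash of row arrays is just one structure
that might suit the final processing.

```ruby
# Hypothetical stand-in for the real statistics: returns 22 rows of numbers.
def process_large_file(path)
  Array.new(22) { |i| [i, path.length * i] }
end

results = {}  # input path => rows, kept in memory instead of a small file

input_paths = %w[run_001.dat run_002.dat]  # placeholder names
input_paths.each do |path|
  results[path] = process_large_file(path)
end

# Final processing works on the Hash directly; nothing is re-read or re-parsed.
row_count = results.values.inject(0) { |n, rows| n + rows.size }
```

The point is only that the final step reads from results rather than
from disk.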

The second best option would be to keep the data in memory as before
but still write those small files if you really need them (for example
for viewing individual results later).  In this case you could do the
writing in a separate thread so your main processing can continue on
the state in memory.  That way the file writes overlap with the
computation instead of stalling it.
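
Roughly like this, as a sketch only: start_writer, the :done marker, the
output directory and the file names are all made up for illustration,
and the rows are dummy data in place of your real results.

```ruby
require "thread"   # Queue; harmless on newer Rubies where it is built in
require "tmpdir"

# Background thread that writes each small file it is handed.
def start_writer(queue)
  Thread.new do
    # Queue#pop blocks until a job arrives; :done is our stop marker.
    while (job = queue.pop) != :done
      path, rows = job
      File.open(path, "w") do |f|
        rows.each { |row| f.puts(row.join(" ")) }
      end
    end
  end
end

out_dir = Dir.mktmpdir("small_files")  # placeholder output location
queue   = Queue.new
writer  = start_writer(queue)
results = {}

3.times do |i|
  rows = Array.new(22) { |n| [n, n * i] }  # stand-in for real statistics
  name = "result_#{i}.txt"
  results[name] = rows                     # main thread keeps the data
  queue << [File.join(out_dir, name), rows]
end

queue << :done
writer.join  # wait for the remaining writes before exiting
```

The main thread never waits on a write; it only waits once at the end
for the writer to drain the queue.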

For reading the large files I would use at most two threads because
I assume they all reside on the same filesystem.  With two threads one
can do calculations (e.g. parsing, aggregating) while the other thread
is doing IO.  If you have more threads you'll likely see a slowdown
because concurrent reads from the same disk tend to introduce
additional seeks.
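
A sketch of the two thread setup, again under assumptions: the helper
each_file_content and its buffer size are my invention, the temp files
stand in for your large inputs, and summing numbers stands in for your
statistics.

```ruby
require "thread"    # SizedQueue; harmless on newer Rubies
require "tempfile"

# One thread reads files, the calling thread does the calculations.
def each_file_content(paths, buffer_size = 2)
  queue = SizedQueue.new(buffer_size)  # bounds how far the reader runs ahead
  reader = Thread.new do
    paths.each { |path| queue << [path, File.read(path)] }
    queue << :done                     # stop marker for the consumer
  end
  while (item = queue.pop) != :done
    yield(*item)                       # path, content
  end
  reader.join
end

# Dummy "large" inputs: file i holds i + 1 lines of "1 2 3".
files = Array.new(3) do |i|
  f = Tempfile.new("large_input")
  f.write("1 2 3\n" * (i + 1))
  f.close
  f
end

totals = {}
each_file_content(files.map(&:path)) do |path, content|
  # Parsing and aggregating happen here while the reader fetches the next file.
  totals[path] = content.split.map(&:to_i).inject(0, :+)
end
```

The SizedQueue keeps the reader from loading many large files into
memory before the calculations catch up.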

Kind regards

robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/