On 26 Jun 2008, at 20:47, Philip Rhoades wrote:
> A few things:
>
> - you left a line in the loop:
>
> 	File.open( output_filename, 'w' ) do |fout|
>
> which should be deleted

Paste in haste, repent at leisure ;)
I've corrected it to read the way it appeared in my head when I was  
looking at it: http://pastie.org/222765

> - I originally used:
>
> 	stats = []
> 	lines = File.readlines(input_filename, 'r')
>
> but found that reading the whole file (8871 lines) and then  
> processing the array was inefficient so I got rid of the array
>
> - using:
>
> 	stats << stats06

Buffering the file as a single read and then working through it in
memory guarantees that you minimise the IO cost of reading. I am of
course assuming that even at 8871 lines your file is much smaller
than your available RAM :)
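
Just as a sketch of what I mean (process_line here is a hypothetical
method standing in for whatever you do to each line), that could look
something like:

  # one buffered read, then all the work happens in memory
  lines = File.readlines(input_filename)

  File.open(output_filename, 'w') do |fout|
    lines.each do |line|
      fout.puts process_line(line)   # your own per-line transformation
    end
  end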

> and the file writing output of:
>
>       File.open(output_filename, "a") do |file|
>         file.sync = false
>         file.puts *stats
>         file.fsync
>       end
>
> looks interesting - why should that be faster?

Writing the file this way offloads the job of making it efficient to
the Ruby runtime. The file.fsync call will cost you some runtime
performance, but it ensures that the data is flushed to disk before
moving on to the next file, which is often desirable for a large
data-processing job. Personally I wouldn't store the results in
separate files but would combine them into a single file (possibly
even a database); however, I don't know how that would fit with your
use case.
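
Purely as an illustrative sketch (process_file, input_filenames and
combined_results.txt are stand-ins for whatever your code actually
uses), combining everything into one file might look like:

  # open the combined file once, then append each input file's results
  File.open("combined_results.txt", "a") do |file|
    input_filenames.each do |input_filename|
      stats = process_file(input_filename)   # hypothetical per-file step
      file.puts *stats
      file.fsync                             # flush before the next file
    end
  end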

As to the file.puts *stats call, there's no guarantee this approach
will be efficient, but compared to doing something like:

  File.open(output_filename, "a") do |file|
    stats.each { |stat| file.puts stat }
  end

it feels more natural to the problem domain.

Another alternative would be:

  File.open(output_filename, "a") do |file|
    file.puts stats.join("\n")
  end

but that's likely to use more memory, as an in-memory string is
created first and then passed to Ruby's IO code. For the size of file
you're working with that's not likely to be a problem.

I've a suspicion that your overall algorithm can also be greatly
improved. In particular, the fact that you're forming a cubic array
and then manipulating it raises warning bells: it suggests you'll
have data sparsity issues which could be handled in a different way,
but that would require a deeper understanding of your data.
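
Just to give a flavour of the kind of thing I mean (this is
speculative, since I don't know your data): a sparse three-dimensional
structure can often be held as a Hash keyed on coordinates, so only
the populated cells cost you anything:

  cube = Hash.new(0)                  # unset cells read back as 0
  cube[[1, 2, 3]] += 4.2              # only populated cells are stored
  cube[[9, 0, 7]] += 1.5

  cube.each do |(x, y, z), value|     # iterate populated cells only
    puts "#{x},#{y},#{z} => #{value}"
  end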

Ellie

Eleanor McHugh
Games With Brains
http://slides.games-with-brains.net
----
raise ArgumentError unless @reality.responds_to? :reason