Hi Folks,

	I am using Ruby to analyse a huge amount (around 60 GB) of networking 
experiment data. Let me briefly describe my technique: I have to read 
around 40 files (of around 1.5 GB each) named f1, f2, ... Each file fi 
contains traceroutes to lots of destinations at different times, i.e. a 
file is basically a list of traceroutes launched from a given src (src = 
filename) at different times. I want to get a structure like the 
following: (list of traceroutes from *all* srcs at time 1), (list of 
traceroutes from *all* srcs at time 2), and so on.

	For this I am using the following pseudocode:

	open outputfile
	open all files f1..fn
	until all files are at eof
		(f1..fn).each { |f|
			next if f.eof?
			line = f.readline
			parse the line into a structure P
			put P into the hashtable: H[P.time] << P

			if H has more than k keys   # i.e. it has become very large
				H.keys.sort.each { |t|
					outputfile << Marshal.dump(H[t])
					H.delete(t)
				}
			end
		}
	end
	close all files

Btw, I can't use an array instead of the hashtable H, as the P.time 
values read across the files needn't be the same.
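
Here is roughly what the real loop looks like in runnable form 
(parse_line is a stand-in for my actual parsing, and K is the threshold 
k above):

	# Runnable sketch of the pseudocode above.
	K = 10_000   # flush once H holds this many distinct times

	# Stand-in: my real code parses a traceroute record; here I just
	# assume each line is "time rest-of-record".
	def parse_line(line)
	  time, rest = line.chomp.split(" ", 2)
	  [time.to_i, rest]
	end

	def flush(h, out)
	  h.keys.sort.each do |t|
	    out << Marshal.dump(h[t])
	    h.delete(t)
	  end
	end

	File.open("output.dat", "wb") do |out|
	  h     = Hash.new { |hash, t| hash[t] = [] }
	  files = (1..40).map { |i| File.open("f#{i}") }
	  until files.all? { |f| f.eof? }
	    files.each do |f|
	      next if f.eof?
	      time, p = parse_line(f.readline)
	      h[time] << p
	      flush(h, out) if h.size > K
	    end
	  end
	  flush(h, out)                # dump whatever is left
	  files.each { |f| f.close }
	end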

This is miserably SLOW. I have the following questions:

	i. How fast is f.readline? I want to use the maximum buffering 
possible for the largest speed gains; in Ruby, how do I set the buffer 
size? I looked through io.c, and it seems that readline essentially uses 
getc (stopping when it gets a newline). How can I set the buffer size 
for the underlying libc FILE*? Oh, btw, each line is approx 200-400 
bytes.
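
One thing I am considering is doing my own buffering on top of IO#read. 
A rough sketch of what I have in mind (BufferedReader and CHUNK are my 
own names, and the chunk size is a guess, not a measured optimum):

	# Read CHUNK bytes at a time and hand back one line per call, so
	# it can drop into the round-robin loop above in place of
	# f.readline.
	class BufferedReader
	  CHUNK = 1 << 20                    # 1 MB; tune as needed

	  def initialize(io)
	    @io, @buf = io, ""
	  end

	  def readline
	    until nl = @buf.index("\n")      # refill until a full line is held
	      chunk = @io.read(CHUNK) or break
	      @buf << chunk
	    end
	    return nil if @buf.empty?
	    nl ? @buf.slice!(0..nl) : @buf.slice!(0..-1)
	  end

	  def eof?
	    @buf.empty? && @io.eof?
	  end
	end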

	ii. Marshal.dump is also very slow. Is there an alternative? YAML is 
even worse.
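
The best alternative I have come up with so far is one plain text line 
per time bucket instead of a Marshal blob, something like this (assuming 
each P can round-trip through to_s and a parse of my own):

	# One text line per time bucket, "time<TAB>p1<TAB>p2...", instead
	# of Marshal.dump(H[t]). Reading it back is a chomp + split.
	def dump_bucket(out, t, ps)
	  out.puts([t, ps.map { |p| p.to_s }].join("\t"))
	end

	def load_bucket(line)
	  t, *ps = line.chomp.split("\t")
	  [t.to_i, ps]                     # re-parse each p as needed
	end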

	iii. Is it bad to have around 40-50 files open at the same time?

	iv. The program does use a lot of memory, but not an extreme amount: 
around 30-40 percent of the RAM on a 1 GB machine. So I think paging 
in/out is not a problem.

	v. Would coding the readline part in C using RubyInline offer me speed 
advantages? My rough attempt is below.
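
This is as far as I have gotten with RubyInline; it is untested against 
my real data, and read_lines here just fills a hash with raw lines 
(line number => line) where the real version would parse each line into 
P:

	require 'inline'

	class FastRead
	  inline do |builder|
	    builder.include '<stdio.h>'
	    builder.c <<-'EOC'
	      VALUE read_lines(char *path) {
	        FILE *fp = fopen(path, "r");
	        char line[512];
	        long n = 0;
	        VALUE hash = rb_hash_new();
	        if (!fp) rb_raise(rb_eIOError, "cannot open %s", path);
	        setvbuf(fp, NULL, _IOFBF, 1 << 20);  /* 1 MB stdio buffer */
	        while (fgets(line, sizeof(line), fp))
	          rb_hash_aset(hash, LONG2NUM(n++), rb_str_new2(line));
	        fclose(fp);
	        return hash;
	      }
	    EOC
	  end
	end

	h = FastRead.new.read_lines("f1")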

	vi. I am thinking of trying the following to reduce the time it takes; 
I would very much welcome your comments:

		a. Remove Marshal.dump [I don't strictly need to serialize objects, 
only dump the data and read it back] and replace it with some more 
compact string form. Actually, is it possible to have something like 
fixed-length structures, as in C? For example, I would want P to be 
like: struct P { char foo[100]; int a[100]; }. That way I think the IO 
would be faster, as I could just dump a fixed number of bytes to a file. 
(I sketch what I mean with pack/unpack after this list.)

		b. Try to reduce the memory consumption by reducing k further, so 
that the program doesn't page in/out.

		c. Can someone point me to good sample code for reading a file line 
by line in C and then putting it into a Ruby hashtable? (Something along 
the lines of my RubyInline attempt under (v).)
		d. How much of the slowness is due to the fact that this is Ruby and 
not C?
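
Re (a), this is the kind of fixed-length record I have in mind, done 
with Array#pack / String#unpack (the field layout just mirrors the 
made-up struct above, not my real data):

	# Fixed-length 500-byte records mirroring
	# struct P { char foo[100]; int a[100]; }.
	FMT     = "a100N100"           # 100 raw bytes + 100 32-bit ints
	REC_LEN = 100 + 4 * 100        # 500 bytes per record

	def write_p(out, foo, ints)
	  out << [foo, *ints].pack(FMT)
	end

	def read_p(io)
	  rec = io.read(REC_LEN) or return nil
	  fields = rec.unpack(FMT)
	  [fields[0], fields[1..-1]]   # [foo, the 100 ints]
	end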

To give you an idea of how slow this actually is: just reading all the 
files line by line takes around 8-9 hrs, whereas the above easily takes 
5-6 days!! And I am quite unable to run a profiler on my code, as it is 
just too slow.

I would be very grateful for your comments, particularly if you have 
any suggestions or experience doing this in a fast way.

--Devesh Agrawal



-- 
Posted via http://www.ruby-forum.com/.