On Nov 15, 2006, at 11:21 AM, Devesh Agrawal wrote:

> Hi Folks,
>
> 	I am using Ruby to analyse a huge amount (around 60G) of my networking
> experiment data. Let me briefly describe my technique: I have to read
> around 40 files (of around 1.5G each) named f1, f2, ... Each file fi
> contains traceroutes to lots of destinations at different times, i.e. a
> file is basically a list of traceroutes launched from a given src
> (src = filename) at different times. I want to get a structure like the
> following: (list of all traceroutes from *all* srcs at time 1), (list
> of all traceroutes from *all* srcs at time 2), and so on.
>
> 	For this I am using the following pseudocode:
>
> 	outputfile.open
> 	open all files f1..fn
> 	while (!(all files have eof))
> 		(f1..fn).each{|f|
> 			next if f.eof
> 			line = f.readline
> 			parse the line, and get a structure P out of it
> 			put P into a hashtable: H[P.time] << P
>
> 			check for eof conditions on f
>
> 			if (H has more than k keys ? (ie has it become very large))
> 				H.keys.sort.each{|t|
> 					outputfile << Marshal.dump(H[t])
> 					H.delete(t)
> 				}
> 			end
> 		}
> 	end
> 	close all files
>
> //Btw I can't use an array instead of a hashtable H, as the P.time's
> read across all files needn't be same.
>
> This is performing miserably slowly. I have the following questions:

Have you profiled?  Where is your time really coming from?

Repost with a profile and then we can give some real suggestions.

> 	i. How fast is f.readline? I want to use the maximum buffering
> possible for the largest speed gains. In Ruby, how do I set the buffer
> size? I looked through io.c, and it seems that readline essentially uses
> getc (stopping when it gets a newline). How can I set the buffer size
> for the underlying libc FILE*? Oh btw, each line is approx 200-400 bytes.

I seriously doubt that this is your choke-point.

> 	ii. Marshal.dump is also very slow. Is there an alternative? YAML is
> even worse.

Marshal.dump is pretty fast, probably as fast as you're going to get  
for a serialization format.  _why did some benchmarks back in the day  
and it beat out the other P languages.

That said, why are you even using it?  Why not just add raw strings?
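A minimal sketch of what "just add raw strings" could look like: bucket the unparsed lines by time and write them straight back out, with no Marshal step. It assumes each line begins with an integer timestamp (the real trace format is not shown in this thread), and the names bucket_raw_lines, flush_buckets and max_keys are made up for illustration.

```ruby
require "stringio"

# Write every bucket to +output+ in ascending time order, emptying the hash.
def flush_buckets(buckets, output)
  buckets.keys.sort.each do |time|
    output << buckets.delete(time).join
  end
end

# Read one line per input per round, group the raw lines by timestamp,
# and flush once more than +max_keys+ distinct times are held in memory.
# Assumption: the timestamp is the leading integer on each line.
def bucket_raw_lines(inputs, output, max_keys)
  buckets = Hash.new { |hash, time| hash[time] = [] }
  open_inputs = inputs.dup
  until open_inputs.empty?
    open_inputs.reject! do |io|
      line = io.gets
      if line.nil?
        true                      # this input is exhausted, drop it
      else
        buckets[line.to_i] << line
        flush_buckets(buckets, output) if buckets.size > max_keys
        false
      end
    end
  end
  flush_buckets(buckets, output)  # write whatever is left
end
```

Like the pseudocode above, a flush in the middle of a run can split one time value across two output chunks, so max_keys has to be larger than the spread of times being read at any moment.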

> 	v. Would coding the readline part in C using RubyInline offer me speed
> advantages?

No.

(or, very unlikely)

> 	vi. I am thinking of trying the following to reduce the time it takes.
> I would very much welcome your comments:

Profile, profile, profile.

> 		a. Remove Marshal.dump [I don't strictly need to serialize objects,
> only dump the data and read it back] and replace it with some string
> form which is more compact. Actually, is it possible to have something
> like fixed-length structures as in C? For example, I would want P to be
> like this: Struct P{ char foo[100], int a[100]}. This way I think
> the IO would be faster, as I could just dump a fixed number of bytes
> to a file.

Yes, do this, simpler is better.

Try #pack and #unpack.
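For instance, a fixed-width record along the lines of the Struct P in the question might be sketched with Array#pack and String#unpack as follows. The layout (a 100-byte NUL-padded string followed by 100 signed 32-bit little-endian integers) and the names RECORD_FORMAT, pack_record and unpack_record are assumptions for illustration, not anything from the original code.

```ruby
# Assumed layout mirroring Struct P { char foo[100]; int a[100]; }:
#   a100  = 100-byte string, NUL-padded (longer strings are truncated)
#   l<100 = 100 signed 32-bit little-endian integers
RECORD_FORMAT = "a100l<100"
RECORD_SIZE   = 100 + 4 * 100   # 500 bytes per record

# Turn a string and exactly 100 integers into one 500-byte binary record.
def pack_record(foo, ints)
  raise ArgumentError, "need exactly 100 ints" unless ints.size == 100
  [foo, *ints].pack(RECORD_FORMAT)
end

# Inverse of pack_record. Z100 reads the same 100 bytes as a100 but
# stops the string at the first NUL, which removes the padding.
def unpack_record(bytes)
  fields = bytes.unpack("Z100l<100")
  [fields[0], fields[1..-1]]
end
```

Because every record is RECORD_SIZE bytes, reading them back is just io.read(RECORD_SIZE) in a loop until it returns nil, with no line parsing involved.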

> 		b. Try to reduce the memory consumption of this by reducing k
> further, so that the program doesn't page in/out.

You already said it isn't paging...

> 		c. Can someone point me to good sample code for reading a file
> line by line in C and then putting it into a Ruby hashtable?

No.  Profile, profile, profile.

> 		d. How much of the slowness is due to the fact that it is Ruby
> and not C?

We can't tell you without a profile.  Profile, profile, profile.

> To give you an idea of how slow this actually is: just reading all the
> files line by line takes around 8-9 hrs, whereas the above thing easily
> takes 5-6 days!! And I am quite unable to run a profile on my code as it
> is just too slow.

Lies.

Use a reduced dataset with ruby-prof or zenprofile.

You know nothing without a profile.
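Even before reaching for a profiler, a rough breakdown on a small sample shows which stage dominates. The sketch below is a stand-in for a real profile, not a replacement: it times parsing, grouping and Marshal.dump separately using only the standard library. The stage boundaries, the whitespace-split "parser" and the names time_stage and stage_timings are assumptions for illustration.

```ruby
# Run the block and return [its result, elapsed wall-clock seconds].
def time_stage
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = yield
  [result, Process.clock_gettime(Process::CLOCK_MONOTONIC) - start]
end

# Time each stage of the pipeline on a sample of lines.
# Assumption: parsing is a whitespace split and the first field is the time.
def stage_timings(lines)
  parsed,  parse_secs = time_stage { lines.map { |line| line.split } }
  grouped, group_secs = time_stage { parsed.group_by { |fields| fields.first } }
  _, dump_secs        = time_stage { grouped.each_value { |recs| Marshal.dump(recs) } }
  { parse: parse_secs, group: group_secs, dump: dump_secs }
end
```

A sample can be taken with something like File.foreach(path).first(100_000) on one of the input files. Whatever stage comes out largest is the one worth profiling in detail.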

> I would be very grateful for your comments, particularly if you have
> any suggestions/experience on doing this in a fast way.

Profile it, you can't make sane changes without one.

-- 
Eric Hodel - drbrain / segment7.net - http://blog.segment7.net
This implementation is HODEL-HASH-9600 compliant

http://trackmap.robotcoop.com