On Nov 15, 2006, at 11:21 AM, Devesh Agrawal wrote:

> Hi Folks,
>
> I am using ruby to analyse a huge (around 60G) amount of my networking
> experiment data. Let me briefly describe my technique: I have to read
> around 40 files (of around 1.5G each) named f1,f2 ... . Each file fi
> contains traceroutes to lots of destinations at different times. I.e. a
> file is basically a list of traceroutes launched from a given src
> (src = filename) launched at diff times. I want to get a structure like
> following: (list of all traceroutes from *all* src's at time 1), (list
> of all traceroutes from *all* src's at time 2)... and so on.
>
> For this I am using the following pseudocode:
>
> outputfile.open
> open all files f1..fn
> while (!(all files have eof))
>   (f1..fn).each{|f|
>     next if f.eof
>     line = f.readline
>     parse the line, and get a structure P out of it
>     put P into a hashtable: H[P.time] << P
>
>     check for eof conditions on f
>
>     if (H has more than k keys ? (ie has it become very large))
>       H.keys.sort{|t|
>         outputfile << Marshal.dump(H[t])
>         H.delete(t)
>       }
>     end
>   }
> end
> close all files
>
> //Btw I can't use an array instead of a hashtable H, as the P.time's
> read across all files needn't be same.
>
> This is performing miserably SLOW. I have the following questions:

Have you profiled? Where is your time really coming from? Repost with a profile and then we can give some real suggestions.

> i. How fast is f.readline ?. I want to use the maximum buffering
> possible for largest speed gains. In ruby how do I set the buffer size.
> I looked through io.c, and it seems that readline essentially uses getc
> (stopping when it gets a newline). How can I set the buffer size for the
> underlying libc FILE* ? Oh btw, each line is approx 200-400 bytes.

I seriously doubt that this is your choke-point.

> ii. Marshal.dump is also very slow. Is there an alternative, Yaml is
> even worse.

Marshal.dump is pretty fast, probably as fast as you're going to get for a serialization format. _why did some benchmarks back in the day and it beat out the other P languages.

That said, why are you even using it? Why not just add raw strings?

> v. Would coding the readline part in C using rubyinline offer me speed
> advantages ?

No. (or, very unlikely)

> vi. I am thinking of trying the following to reduce the time it takes,
> I would very much welcome your comments:

Profile, profile, profile.

> a. Remove Marshal.dump [I don't need to strictly serialize objects,
> only dump the data and read it back] and replace it with some string
> form which is more compact. Actually is it possible to have something
> like fixed length structures like in C: Example I would want P to be
> like this: Struct P{ char foo[100], int a[100]} ?. So this way I think
> the IO would be faster as I could just dump a fixed number of bytes to a
> file.

Yes, do this, simpler is better. Try #pack and #unpack.

> b. Try to reduce the memory consumption of this by reducing k
> further so as the program doesn't page in/out.

You already said it isn't paging...

> c. Can someone point me to a good sample code for reading a file
> line by line in C and then putting it into a ruby hashtable ?.

No. Profile, profile, profile.

> d. How much of the slowness is due to the fact that it is ruby
> and not C ?

We can't tell you without a profile. Profile, profile, profile.

> To give you an idea of how slow this is actually: Just reading all the
> files line by line takes around 8-9 hrs. Whereas the above thing easily
> takes 5-6 days !!. And I am quite unable to run profile on my code as it
> is just too slow.

Lies. Use a reduced dataset with ruby-prof or zenprofile. You know nothing without a profile.

> I would be very grateful for your comments, and particularly if you have
> any suggestions/experience on doing this in a fast way.

Profile it, you can't make sane changes without one.

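For the #pack / #unpack route in (a), here is a minimal sketch rather than a drop-in. The record layout is an assumption that mirrors the Struct P in your question (a 100-byte string plus 100 ints), and the names RECORD_FORMAT, RECORD_SIZE, dump_record and load_record are made up for illustration. I also assumed 32-bit little-endian ints, so adjust the format string to your real fields. Whether this actually beats Marshal on your data is something only a profile will tell you.

```ruby
require "stringio"

# Hypothetical layout: "a100" is a 100-byte string field (NUL-padded, and
# truncated if longer); "l<100" is 100 signed 32-bit little-endian ints.
RECORD_FORMAT = "a100l<100"
RECORD_SIZE   = 100 + 4 * 100 # always 500 bytes per record

# Append one fixed-length record to io.
def dump_record(io, foo, ints)
  raise ArgumentError, "need exactly 100 ints" unless ints.size == 100
  io.write([foo, *ints].pack(RECORD_FORMAT))
end

# Read the next record from io; returns [foo, ints], or nil at end of data.
def load_record(io)
  bytes = io.read(RECORD_SIZE)
  return nil if bytes.nil? || bytes.bytesize < RECORD_SIZE
  fields = bytes.unpack(RECORD_FORMAT)
  foo = fields.shift.sub(/\0+\z/, "") # drop the NUL padding added by "a100"
  [foo, fields]
end

# Round trip through an in-memory buffer; a File opened in binary mode
# ("wb" / "rb") would be used the same way.
io = StringIO.new(String.new)
dump_record(io, "src1", (1..100).to_a)
io.rewind
foo, ints = load_record(io)
```

Because every record is the same size, you can also seek straight to record n at byte offset n * RECORD_SIZE instead of reading everything before it.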
--
Eric Hodel - drbrain / segment7.net - http://blog.segment7.net
This implementation is HODEL-HASH-9600 compliant

http://trackmap.robotcoop.com