People,

Thanks to all who responded - I have concatenated the replies for ease
of response:


On 2011-02-24 19:15, pp wrote:
>
>> Date: Thu, 24 Feb 2011 12:09:48 +0900
>> From: phil / pricom.com.au
>> Subject: Fast alternatives to "File" and "IO" for large numbers of
>> files ?
>> To: ruby-talk / ruby-lang.org
>>
>> People,
>>
>> I have script that does:
>>
>> - statistical processing from data in 50x32x20 (32,000) large input
>> files
>>
>> - writes a small text file (22 lines with one or more columns of
>> numbers) for each input file
>>
>> - read all small files back in again for final processing.
>>
>> Profiling shows that IO is taking up more than 60% of the time -
>> short of making fewer, larger files for the data (which is
>> inconvenient for random viewing/ processing of individual results)
>> are there other alternatives to using the "File" and "IO" classes
>> that would be faster?
>>
>> Thanks,
>>
>> Phil.
>>
> Hi, could you be more specific on what do you do with the small
> files, read/write in per-line or whole file?for rapid file ops due to
> file system heaps(or sort) may be slow anyway.so maybe you can try
> less file ops, for example, write a file with a single string may
> serve the io cache well. or, maybe, have a lot of files to write/read
> in a new thread, so that IO may not interfere your none-IO
> calculations, if you have some

Each individual small file is written in one go, i.e. the file is
opened, written to and closed - there is no re-opening and more
writing. See later for current approach.


On 2011-02-24 19:19, Peter Zotov wrote:
>
> I can think of two approaches here.
>
> First, you can write one large file (perhaps creating it in memory
> first) and then splitting it afterwards.
>
> Second, if you're on *nix, you can write your output files to a
> tmpfs.
>
> Both should reduce number of seeks and improve performance.

After staying up all night, I eventually settled on a hash table
written out via YAML to ONE very large file.
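A minimal sketch of that hash-table-to-one-YAML-file approach, for illustration only. The key scheme (mirroring the old subdirectory names), the sample numbers and the file name results.yml are all assumptions; the real layout is not shown in this thread.

```ruby
require "yaml"
require "tmpdir"

# Hypothetical data: each key stands in for the old subdirectory path,
# each value for the column of numbers that used to go in a small file.
results = {
  "run01/site02/rep03" => [0.12, 0.34, 0.56],
  "run01/site02/rep04" => [1.5, 2.5, 3.5]
}

path = File.join(Dir.tmpdir, "results.yml") # hypothetical file name

# One open/write/close for everything, instead of one per result.
File.open(path, "w") { |f| YAML.dump(results, f) }

# Final processing: one read and one parse gets the whole hash back.
reloaded = YAML.load_file(path)
puts reloaded["run01/site02/rep03"].inspect
```

The output stays plain text, so a particular calculation can still be found by searching the file for its key.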
I need a human-friendly form for spot checking of statistical
calculations, so I have used a hash table; the key lets me find a
particular calculation in the big file in the same way I would have
found it in the similarly named subdirectories. I haven't actually
implemented this on the full system yet, so it will be interesting to
see if Vim can handle opening a 32,000 x 23 line file (and bigger
actually if each individual small file is bigger than a 23x1 array).


On 2011-02-24 19:52, Robert Klemme wrote:
>
> I think whatever you do, as long as you do not get rid of the IO or
> improve IO access patterns your performance gains will only be
> marginally. Even a C extension would not help you if you stick with
> the same IO patterns.

Right.

> We should probably learn more about the nature of your processing
> but considering that you only write 32,000 * 22 * 80 (estimated line
> length) = 56,320,000 bytes (~ 54MB) NOT writing those small files is
> probably an option. Burning 54MB of memory in a structure suitable
> for later processing (i.e. you do not need to parse all those small
> files) is a small price compared to the large amount of IO you need
> to do to read that data back again (plus the CPU cycles for
> parsing).

Yep - I came to that conclusion too and went for one big hash table and
one file.

> The second best option would be to keep the data in memory as before
> but still write those small files if you really need them (for
> example for later processing). In this case you could put this in a
> separate thread so your main processing can continue on the state in
> memory. That way you'll gain another improvement.

Interesting idea, but I'm not sure how to actually implement that. I
will see how the hash table/one file approach goes first.

> For reading of the large files I would use at most two threads
> because I assume they all reside on the same filesystem. With two
> threads one can do calculations (e.g.
> parsing, aggregating) while the
> other thread is doing IO. If you have more threads you'll likely see
> a slowdown because you may introduce too many seeks etc.

OK, this idea might help for the next stage.


On 2011-02-24 20:02, Brian Candler wrote:
> If you read in all the data files and build a single Ruby data
> structure which contains all the data you're interested in, you can
> dump it out like this:
>
> File.open("foo.msh","wb") {|f| Marshal.dump(myobj, f) }

I did read up about this stuff but I have to have human readable files.

> And you can reload it in another program like this:
>
> myobj = File.open("foo.msh","rb") {|f| Marshal.load(f) }
>
> This is *very* fast.

I might check this out as an exercise!

Thanks to all again!

Phil.
-- 
Philip Rhoades

GPO Box 3411
Sydney NSW 2001
Australia
E-mail: phil / pricom.com.au
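
For reference, a rough sketch of the separate-writer-thread idea Robert describes above. It is not taken from the thread: the output directory, the file names and the stand-in calculation are all hypothetical, and it assumes a Ruby where Queue and File.write are available without extra requires.

```ruby
require "tmpdir"
require "fileutils"

out_dir = File.join(Dir.tmpdir, "small_results") # hypothetical output directory
FileUtils.mkdir_p(out_dir)

queue = Queue.new

# Writer thread: pops [name, text] pairs and writes each small file,
# stopping when it sees the :done sentinel.
writer = Thread.new do
  while (job = queue.pop) != :done
    name, text = job
    File.write(File.join(out_dir, name), text)
  end
end

results = {}
3.times do |i|
  # Stand-in for the real statistical calculation (22 numbers per file).
  column = Array.new(22) { |n| (i + 1) * n * 0.5 }
  name = format("result_%03d.txt", i)
  results[name] = column # kept in memory for the final stage
  queue << [name, column.join("\n") + "\n"]
end

queue << :done # tell the writer there is no more work
writer.join    # wait until every queued file is on disk
```

The main loop only pushes onto the queue and carries on, so the final processing can work from the in-memory results hash without reading the small files back in.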