Glenn M. Lewis <noSpam / noSpam.com> wrote:
> Hi Robert!
>
> VERY IMPRESSIVE!!! After tweaking your two regexp's from
> (\w\d+\w) to (\w+\d+\w) (because 'AD2005U' is also a valid contract),
> I got this:
>
> 0.047s read ref
> 0.062s read file c:/src/barchart/Data/mrn09215.txt
> 0.047s read file c:/src/barchart/Data/mrn09205.txt
> 0.063s read file c:/src/barchart/Data/mrn09195.txt
> 0.078s read file c:/src/barchart/Data/mrn09165.txt
> 0.062s read file c:/src/barchart/Data/mrn09155.txt
> 0.047s read file c:/src/barchart/Data/mrn09145.txt
> 0.047s read file c:/src/barchart/Data/mrn09135.txt
> 0.078s read file c:/src/barchart/Data/mrn09125.txt
> 0.063s read file c:/src/barchart/Data/mrn09095.txt
> 0.172s read file c:/src/barchart/Data/mrn09085.txt
> 0.109s read file c:/src/barchart/Data/mrn09075.txt
> 0.094s read file c:/src/barchart/Data/mrn09065.txt
> 0.047s read file c:/src/barchart/Data/mrn09025.txt
> 0.062s read file c:/src/barchart/Data/mrn09015.txt
> 1.547s read file c:/src/barchart/Data/mrnaug05.txt
> 1.531s read file c:/src/barchart/Data/mrnjul05.txt
> 1.141s read file c:/src/barchart/Data/mrnjun05.txt
> 1.375s read file c:/src/barchart/Data/mrnmay05.txt
> 1.734s read file c:/src/barchart/Data/mrnapr05.txt
> 0.907s finished post processing
> 9.313s total
>
> 1415 total contracts
> 136164 total ticks (averages out to 96 ticks per contract)
>
> I ought to point out that the 'ref' file is actually not
> processed in this case (meaning that its ticks are not recorded),
> but that would probably add on another 0.078s or so.

Guess so. I thought it was cleaner to do this separately but if you
actually need this it's a minor change. You could integrate it into
#parse_ticks and remember if it was the first file. Untested but
cryptic:

  def parse_ticks_2(io)
    create = @contracts.nil?
    @contracts = {} if create

    io.each_line do |line|
      if %r{^
          (\w\d+\w),          # contractid
          (\d{6}),            # date
          (\d+(?:\.\d+)?),    # open
          (\d+(?:\.\d+)?),    # high
          (\d+(?:\.\d+)?),    # low
          (\d+(?:\.\d+)?),    # close
        }x =~ line
        cid = $1.freeze
        contract = (@contracts[cid] ||
          ( create && (@contracts[cid] = Contract.new cid) ) ) and
          contract.add_tick $2, $3.to_f, $4.to_f, $5.to_f, $6.to_f
      end
    end
  end

> Also, I'm expecting more average ticks than that, so
> I would have to figure out why it is missing some ticks... but
> it is probably just a minor regexp tweak.
>
> Another minor note is that volume and openInterest were
> not recorded, but that is a very minor thing to add on.

Certainly. Just add them to the struct and add_tick(). I didn't see
them processed in your code so I thought you don't need / want them.

> So now the score is:
> Glenn's Ruby-Only: ~29 seconds
> Robert's Ruby-Only: ~9 seconds
> Glenn's Ruby/C++: ~2 seconds
>
> Great job, Robert! Now, to answer your questions below...

Wow!! I didn't expect it to compete so well. Maybe I need to buy a new
machine (it's a P4 with 1.8GHz and 1GB mem) or switch off the virus
scanner. :-)

Now it would be interesting to see which difference in the code caused
the performance difference. My guess is it's any or several of these:

- I freeze hash keys. This saves a dup.freeze on the keys inserted
  into hashes (it's an internal implementation speciality of Hash to
  avoid accidental aliasing effects through key strings changed after
  the insert)

- I didn't use split thus avoiding unnecessary object creations in
  case a record is not needed.

- I probably made the regexp more selective and thus more efficient.

<snip/>

>> - How many percent of the reference contracts are present in an
>> average file?
>
> As you start out, nearly 100%... then as you go back to earlier
> and earlier dates, the reference contracts start to die out, and it
> may drop down to around 90-95% or so... but in the example above, I'm
> only going back around 100 days.

Ah, yes.
I didn't consider this in my generator script.

>> - How do dates relate to files? (I assumed a file per day plus I used
>> synthetic dates; see the generator script)
>
> Well, there are three types of files: daily updates, monthly updates,
> and yearly updates. So far, I haven't needed to go back to any of the
> yearly updates in any of the processing I've done. But suffice to say
> that a monthly update file is basically the 'cat' (concatenation)
> together of all the daily files for that month, and the yearly is the
> cat of all monthly files for that year.

Uh, sounds like yearly logs are going to be huuuge.

> Nice job! Thanks, Robert!

You're very welcome!

Kind regards

robert
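
For anyone who wants to check the frozen-key guess from the list above, here is a minimal sketch. It is not taken from the parser itself: the helper name stored_key_for is made up for illustration, and the contract id is just the sample value quoted earlier. It only shows which object a Hash ends up storing as its key.

```ruby
# Hypothetical helper (not part of the original script): returns the key
# object that a fresh Hash actually stores when +key+ is inserted.
def stored_key_for(key)
  h = {}
  h[key] = true
  h.keys.first
end

unfrozen = String.new("AD2005U")         # explicitly mutable string
frozen   = String.new("AD2005U").freeze  # same content, frozen up front

copy = stored_key_for(unfrozen)
same = stored_key_for(frozen)

# For an unfrozen String key the Hash keeps its own frozen copy.
puts "unfrozen key stored as same object? #{copy.equal?(unfrozen)}"
puts "stored copy frozen?                 #{copy.frozen?}"

# For an already frozen String key the Hash keeps the object itself.
puts "frozen key stored as same object?   #{same.equal?(frozen)}"
```

This only demonstrates that the copy is skipped for frozen keys. How much time that saves over 136164 ticks is a separate question that would need a benchmark on the real files.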