Glenn M. Lewis <noSpam / noSpam.com> wrote:
> Hi Robert!
>
> VERY IMPRESSIVE!!!  After tweaking your two regexp's from
> (\w\d+\w) to (\w+\d+\w) (because 'AD2005U' is also a valid contract),
> I got this:
>
>    0.047s read ref
>    0.062s read file c:/src/barchart/Data/mrn09215.txt
>    0.047s read file c:/src/barchart/Data/mrn09205.txt
>    0.063s read file c:/src/barchart/Data/mrn09195.txt
>    0.078s read file c:/src/barchart/Data/mrn09165.txt
>    0.062s read file c:/src/barchart/Data/mrn09155.txt
>    0.047s read file c:/src/barchart/Data/mrn09145.txt
>    0.047s read file c:/src/barchart/Data/mrn09135.txt
>    0.078s read file c:/src/barchart/Data/mrn09125.txt
>    0.063s read file c:/src/barchart/Data/mrn09095.txt
>    0.172s read file c:/src/barchart/Data/mrn09085.txt
>    0.109s read file c:/src/barchart/Data/mrn09075.txt
>    0.094s read file c:/src/barchart/Data/mrn09065.txt
>    0.047s read file c:/src/barchart/Data/mrn09025.txt
>    0.062s read file c:/src/barchart/Data/mrn09015.txt
>    1.547s read file c:/src/barchart/Data/mrnaug05.txt
>    1.531s read file c:/src/barchart/Data/mrnjul05.txt
>    1.141s read file c:/src/barchart/Data/mrnjun05.txt
>    1.375s read file c:/src/barchart/Data/mrnmay05.txt
>    1.734s read file c:/src/barchart/Data/mrnapr05.txt
>    0.907s finished post processing
>    9.313s total
>
> 1415 total contracts
> 136164 total ticks (averages out to 96 ticks per contract)
>
> I ought to point out that the 'ref' file is actually not
> processed in this case (meaning that its ticks are not recorded),
> but that would probably add on another 0.078s or so.

Guess so.  I thought it was cleaner to do this separately, but if you 
actually need it, it's a minor change.  You could integrate it into 
#parse_ticks and have it remember whether it is reading the first file.  
Untested but cryptic:

  def parse_ticks_2(io)
    create = @contracts.nil?
    @contracts = {} if create

    io.each_line do |line|
      if %r{^
             (\w+\d+\w),        # contractid, e.g. AD2005U
             (\d{6}),           # date
             (\d+(?:\.\d+)?),   # open
             (\d+(?:\.\d+)?),   # high
             (\d+(?:\.\d+)?),   # low
             (\d+(?:\.\d+)?),   # close
           }x =~ line
        cid = $1.freeze

        # Known contract, or a new one while reading the first file.
        contract = (@contracts[cid] ||
                    (create && (@contracts[cid] = Contract.new(cid)))) and
          contract.add_tick $2, $3.to_f, $4.to_f, $5.to_f, $6.to_f
      end
    end
  end
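
To try the idea outside the real program, here is a minimal self-contained 
sketch.  Tick, Contract and TickReader are hypothetical stand-ins (the real 
classes are not shown in this thread), and the sample lines are made up.  
The first IO passed in defines the set of contracts; later ones only add 
ticks to contracts that are already known:

```ruby
require 'stringio'

# Hypothetical stand-ins for the real classes, which are not shown here.
Tick = Struct.new(:date, :open, :high, :low, :close)

class Contract
  attr_reader :id, :ticks

  def initialize(id)
    @id = id
    @ticks = []
  end

  def add_tick(date, open, high, low, close)
    @ticks << Tick.new(date, open, high, low, close)
  end
end

class TickReader
  attr_reader :contracts

  LINE = %r{^
            (\w+\d+\w),        # contractid
            (\d{6}),           # date
            (\d+(?:\.\d+)?),   # open
            (\d+(?:\.\d+)?),   # high
            (\d+(?:\.\d+)?),   # low
            (\d+(?:\.\d+)?),   # close
          }x

  def parse_ticks_2(io)
    # Only the very first call may create contracts.
    create = @contracts.nil?
    @contracts = {} if create

    io.each_line do |line|
      next unless LINE =~ line
      cid = $1.freeze
      contract = @contracts[cid]
      contract ||= (@contracts[cid] = Contract.new(cid)) if create
      contract.add_tick($2, $3.to_f, $4.to_f, $5.to_f, $6.to_f) if contract
    end
  end
end

reader = TickReader.new
# First "file": acts as the reference and creates the contract.
reader.parse_ticks_2(StringIO.new("AD2005U,050921,0.76,0.77,0.75,0.765,100,200\n"))
# Second "file": XX2005Z is unknown, so its line is skipped.
reader.parse_ticks_2(StringIO.new(<<~DATA))
  AD2005U,050920,0.75,0.76,0.74,0.755,90,180
  XX2005Z,050920,1.0,1.1,0.9,1.05,10,20
DATA
```

This spells out the one-liner above in three statements; the behaviour 
should be the same.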


> Also, I'm expecting more average ticks than that, so
> I would have to figure out why it is missing some ticks... but
> it is probably just a minor regexp tweak.
>
> Another minor note is that volume and openInterest were
> not recorded, but that is a very minor thing to add on.

Certainly.  Just add them to the struct and to add_tick().  I didn't see 
them processed in your code, so I assumed you didn't need or want them.
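
A rough sketch of that change, with assumptions: FullTick, FullContract and 
FULL_LINE are hypothetical names, and I am guessing that volume and open 
interest are integer fields that directly follow the close price.  Adjust 
the last two groups if the real layout differs:

```ruby
# Hypothetical record layout; the real struct is not shown in this thread.
FullTick = Struct.new(:date, :open, :high, :low, :close,
                      :volume, :open_interest)

class FullContract
  attr_reader :id, :ticks

  def initialize(id)
    @id = id
    @ticks = []
  end

  def add_tick(date, open, high, low, close, volume, open_interest)
    @ticks << FullTick.new(date, open, high, low, close,
                           volume, open_interest)
  end
end

# Assumed line layout: id,date,open,high,low,close,volume,openInterest
FULL_LINE = %r{^
               (\w+\d+\w),        # contractid
               (\d{6}),           # date
               (\d+(?:\.\d+)?),   # open
               (\d+(?:\.\d+)?),   # high
               (\d+(?:\.\d+)?),   # low
               (\d+(?:\.\d+)?),   # close
               (\d+),             # volume (assumed integer)
               (\d+)              # open interest (assumed integer)
             }x

contract = FullContract.new("AD2005U")
if FULL_LINE =~ "AD2005U,050921,0.76,0.77,0.75,0.765,1200,3400\n"
  contract.add_tick($2, $3.to_f, $4.to_f, $5.to_f, $6.to_f,
                    $7.to_i, $8.to_i)
end
```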

> So now the score is:
> Glenn's Ruby-Only: ~29 seconds
> Robert's Ruby-Only: ~9 seconds
> Glenn's Ruby/C++: ~2 seconds
>
> Great job, Robert!  Now, to answer your questions below...

Wow!!  I didn't expect it to compete so well.  Maybe I need to buy a new 
machine (mine is a 1.8GHz P4 with 1GB of memory) or switch off the virus 
scanner. :-)

Now it would be interesting to see which difference in the code caused the 
performance difference.  My guess is that it's one or several of these:

 - I freeze hash keys.  This saves a dup.freeze on the keys inserted into 
hashes (Hash copies and freezes unfrozen String keys on insert, to avoid 
accidental aliasing effects if a key string is changed after the insert).

 - I didn't use split, thus avoiding unnecessary object creation when a 
record is not needed.

 - I probably made the regexp more selective and thus more efficient.
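
The first point can be checked with a few lines.  This is only an 
illustration of the Hash behaviour with made-up keys, not a measurement of 
how much time it saves:

```ruby
plain  = "AD2005U".dup          # unfrozen String
frozen = "AD2005Z".dup.freeze   # frozen before insertion

h = {}
h[plain]  = 1
h[frozen] = 2

# Keys come back in insertion order.
stored_plain, stored_frozen = h.keys

# The unfrozen key was copied; the hash holds a different, frozen object.
copied_plain = !stored_plain.equal?(plain) && stored_plain.frozen?

# The frozen key was stored as is, so no extra String was created for it.
kept_frozen = stored_frozen.equal?(frozen)

# Changing the original string afterwards does not disturb the hash.
plain << "X"
still_found = h.key?("AD2005U")
```

With a few thousand contracts the copy is probably cheap, so I would not 
expect this alone to explain the difference.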

<snip/>

>> - How many percent of the reference contracts are present in an
>> average file?
>
> As you start out, nearly 100%... then as you go back to earlier
> and earlier dates, the reference contracts start to die out, and it
> may drop down to around 90-95% or so... but in the example above, I'm
> only going back around 100 days.

Ah, yes.  I didn't consider this in my generator script.

>> - How do dates relate to files? (I assumed a file per day plus I used
>> synthetic dates; see the generator script)
>
> Well, there are three types of files: daily updates, monthly updates,
> and yearly updates.  So far, I haven't needed to go back to any of the
> yearly updates in any of the processing I've done.  But suffice to say
> that a monthly update file is basically the 'cat' (concatenation)
> together of all the daily files for that month, and the yearly is the
> cat of all monthly files for that year.

Uh, sounds like yearly logs are going to be huuuge.

> Nice job!  Thanks, Robert!

You're very welcome!

Kind regards

    robert