Thanks a bunch, Hugh and Eric!  The combination of your
two suggestions sped it up quite a bit.

I don't agree with Robert, though... I have written many
parsers in C++ (and before that, C) that could soak up
all the data I'm reading in under a second, whereas
this was taking approximately 9 minutes in Ruby.  With
Hugh's and Eric's recommendations it is now down to about
5 minutes, almost a factor-of-2 speedup.

I would really like an order of magnitude or more, but
for that I would almost certainly have to write it in a
compiled language.
I've done this before with Ruby and C++ using SWIG, but
this particular one seemed really challenging when having
Ruby call C++ which would then call Ruby...

My last project with Ruby/C++/SWIG had Ruby calling C++
but C++ kept all the data structures internally without
ever calling Ruby, and this was *much* easier... but not
as flexible as I would like for this case.

I may have to rewrite this whole puppy in D if I'm going
to get parsing times under one second.  Using C++ and STL
for its map containers is a royal nuisance, but D has
built-in associative arrays.  Or maybe I should try Perl
or Python and see how their file parsing speeds compare.
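
Before porting anything, it might be worth timing the two line-handling
styles in isolation with Ruby's stdlib Benchmark.  A quick harness like
this (the sample data is made up to resemble the real input):

```ruby
require "benchmark"

# Hypothetical sample: 10,000 comma-separated lines like the real input.
sample = "a,050910,1,2,3,4,5,6\n" * 10_000

Benchmark.bm(20) do |bm|
  bm.report("chomp! then split") do
    sample.each_line { |l| l.chomp!; l.split(",") }
  end
  bm.report("chained chomp.split") do
    sample.each_line { |l| l.chomp.split(",") }
  end
end
```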

Oh, and to answer Hugh's question, it is extremely rare
for a line to have fewer than 8 fields... sometimes the
last line of the file has only a ^Z on it.

Thanks again for your help!  I appreciate it.
-- Glenn


Hugh Sasse wrote:
> On Sat, 10 Sep 2005, Eric Hodel wrote:
> 
>> On 08 Sep 2005, at 20:46, Glenn M. Lewis wrote:
>>
>>> Hi!
> 
>         [...]
> 
>>>    Any ideas on how I can rewrite 'Contract.parseFile()' for
>>> speed?
>>>
>>>    Thanks!
>>> -- Glenn Lewis
>>>
>>>  def Contract.parseFile(file)
>>>    return unless File.exists?(file)
>>>    return if @@files.has_key?(file)
> 
> 
> I'd swap those two: test of a hash will be faster than test of a
> filesystem, so may as well bail out quickly. Do repeat keys happen
> often?
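
For the record, the swap would look like this (a sketch only; the
:parsed return is a placeholder so the guard order is visible):

```ruby
# Sketch: hash membership test before the filesystem stat, so repeat
# files bail out without ever touching the disk.
class Contract
  @@files = {}

  def Contract.parseFile(file)
    return if @@files.has_key?(file)  # fast in-memory test first
    return unless File.exist?(file)   # slower filesystem check second
    @@files[file] = 1
    # ... parse the file here ...
    :parsed
  end
end
```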
> 
>>>    @@files[file] = 1
> 
> 
> Maybe it only needs to be a Set, not a Hash?  Not sure how speeds
> compare.
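
A Set version might look like this (sketch; whether it actually beats
the Hash is worth benchmarking):

```ruby
require "set"

# Sketch: a Set says what we mean -- we only care about membership,
# not about the dummy 1 values a Hash would carry.
class Contract
  @@files = Set.new

  def Contract.parseFile(file)
    return if @@files.include?(file)
    return unless File.exist?(file)
    @@files << file
    # ... parse the file here ...
    :parsed
  end
end
```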
> 
>>>    print "Parsing file #{file}..."
>>>    File.open(file, "rb").each_line {|line|
> 
> 
> I think:
> 
>>>      line.chomp!
>>>      # puts line
>>>      fields = line.split(/,/)
> 
> 
> might be faster as
>         fields = line.chomp.split(/,/)
> 
> or if you only chomp the last field afterwards (shorter string to
> change)?
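
Both variants give the same fields (sketch with a made-up line):

```ruby
line = "a,050910,1,2,3,4,5,6\n"   # hypothetical input line

# Variant 1: chain chomp and split -- no separate mutation pass.
fields = line.chomp.split(",")

# Variant 2: split first, then chomp only the short last field.
fields2 = line.split(",")
fields2[-1] = fields2[-1].chomp

# fields == fields2 == ["a", "050910", "1", "2", "3", "4", "5", "6"]
```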
> 
>>>      next if fields.size < 8
> 
> 
> Maybe line.count(",") first, and bail out quickly? Again, how often
> does this happen?
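
Spelled out, that early bail-out might look like this (sketch; the
second guard stays, because trailing or empty commas still split into
fewer than 8 fields):

```ruby
# Sketch: count commas in the raw string before paying for split.
def parse_line(line)
  return nil if line.count(",") < 7   # 8 fields need at least 7 commas
  fields = line.chomp.split(",")
  return nil if fields.size < 8       # commas may be empty or trailing
  fields
end
```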
> 
>>>      datestring = fields[1]
>>>      year  = datestring[0..1].to_i
>>>      month = datestring[2..3].to_i
>>>      day   = datestring[4..5].to_i
> 
> 
> reduce array refs:
> 
>         year, month, day = datestring[0..5].scan(/../).collect do |s|
>           s.to_i
>         end
> 
> possibly
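
That transformation, spelled out on a sample date string:

```ruby
datestring = "050910"   # YYMMDD, as in the original fields[1]

year, month, day = datestring[0..5].scan(/../).collect { |s| s.to_i }
# => year == 5, month == 9, day == 10
```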
> 
> 
> That's all I can think of just now.
>         Hugh
> 
>