On Fri, 2 Sep 2005, Hugh Sasse wrote:

> I've been doing some stuff with CSV recently, having data in one flat file
> that I'm trying to normalize into tables.  Thus, I'm trying to ignore things
> I've seen before so I don't create 2 entries for them, and that sort of
> thing.
>
> Jon Bentley points out that "Data Structures => Programs" -- i.e. how the data
> is "shaped" determines how the program will be designed and perform.  Often
> the best solution is not immediately obvious, as he discusses.  So, for
> example, one can test for "seen before" by
>
>  array = []
>
>  array << item unless array.include? item
>
> which is much slower for clumped data than
>
>  array.unshift item unless array.include? item
>
> or one could use a hash, or maybe a set.
>
> So, my question is [a member of the set of "How long is a piece of string?"
> type questions]:
>
>  Are there any heuristics for performance of *Ruby* data structures
>  available, which guide one to designing code to *often* have good
>  performance?
>
> The answer will be very much "it depends, you need to test it, it's a
> function of data size, ...", but I suspect the implementation, i.e. Ruby,
> puts useful bounds on the problem.
>
> My searching has not turned up anything, though there are good performance
> hints in the PickAxe. I'm wondering if the info exists or if it would be
> worth trying to create it, or if it is just too broad a problem to be worth
> it.  However, since we Rubyists often speak of the ability to put code
> together quickly, having hints around on how to do so and yield efficient
> code seems of some worth.


here are some things i think of:

   - use class variables of some sort : never store in an object what should be
     stored in a class
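
     for example, a cache shared at the class level might look like this
     sketch (Converter and its table are invented for illustration):

```ruby
# hedged sketch: build an expensive lookup table once, at the class
# level, so every instance shares one copy instead of each building
# its own
class Converter
  @@table = (0..255).map{|i| i.chr}   # built once, shared by all instances

  def lookup(i)
    @@table[i]
  end
end

Converter.new.lookup(65)   # => "A"
```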

   - write lazy attribute getters : if your parser loads 15,000 fields from a
     header but normally only 4 or 5 are used in code, write the parser to
     parse each field on demand and cache the result
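
     a lazy, memoized getter could be sketched like so (the Header class
     and its field format are made up for the example):

```ruby
# hedged sketch: keep the raw header as a string and only parse (and
# cache) a field the first time it is asked for
class Header
  def initialize(raw)
    @raw = raw        # e.g. "a=1;b=2;c=3"
    @fields = {}      # parse-on-demand cache
  end

  def [](name)
    @fields[name] ||= parse_field(name)
  end

  private

  def parse_field(name)
    @raw[/#{Regexp.escape(name)}=([^;]*)/, 1]
  end
end

Header.new("a=1;b=2")["b"]   # => "2"
```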

   - don't write explicit IO : use mmap and let the kernel worry about it

   - use processes not threads : ipc is so easy with ruby (using local drb
     objects) that it's ridiculous not to use multiple processes, so
     multi-processor machines can take full advantage of your code.
     sometimes threads are better for a problem - but rarely - especially
     when you count debugging time ;-)
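
     the multi-process idea, sketched with plain fork and a pipe to stay
     dependency-free (drb would look similar, just with real objects over
     the wire):

```ruby
# hedged sketch: push heavy work into a child process and read the
# result back over a pipe
reader, writer = IO.pipe

pid = fork do
  reader.close
  writer.puts((1..100).inject(0){|s, i| s + i})   # the child does the work
  writer.close
end

writer.close
result = reader.read.to_i   # => 5050
Process.wait(pid)
```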

   - don't check things : assume an arg is an int - the stack trace is plenty
     good to tell you it wasn't and the check slows things down
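
     e.g. the unchecked version below fails with a perfectly usable
     backtrace anyway (succ_checked/succ_unchecked are invented names):

```ruby
# hedged sketch: skip the guard and let a bad argument surface naturally
def succ_checked(n)
  raise ArgumentError, "not an int" unless n.is_a?(Integer)
  n + 1
end

def succ_unchecked(n)
  n + 1   # nil here raises NoMethodError with a clear backtrace
end

succ_unchecked(41)   # => 42
```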

   - use rbtree : it's amazing ;-)

   - use sqlite : it's amazing ;-)

   - ruby inline : it's amazing ;-)

   - use iterator methods for filesystem operations : prefer

       require 'find'
       Find.find('.'){|f| ... }

     over

       Dir['**/*'].each{|f| ... }

     the second kills you when a dir has 100,000 files in it.
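
     a runnable sketch of the iterator style (the temp files are just
     scaffolding for the example):

```ruby
# hedged sketch: Find.find yields one path at a time, so memory stays
# flat even when the tree is huge; Dir['**/*'] would build the whole
# array up front
require 'find'
require 'fileutils'
require 'tmpdir'

dir = Dir.mktmpdir
FileUtils.touch(File.join(dir, 'a.txt'))
FileUtils.touch(File.join(dir, 'b.txt'))

seen = []
Find.find(dir){|path| seen << File.basename(path) if File.file?(path)}
seen.sort   # => ["a.txt", "b.txt"]
```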

   - compose things out of arrays, hashes, and strings whenever possible.
     these objects are pretty optimized in ruby and serialize great as yaml,
     meaning you never need to write parsers by hand.
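
     e.g. a record built only from those types round-trips through yaml
     for free:

```ruby
# hedged sketch: no hand-written parser needed - to_yaml and YAML.load
# round-trip plain hashes/arrays/strings
require 'yaml'

record = { 'name' => 'hugh', 'tags' => ['csv', 'ruby'] }
loaded = YAML.load(record.to_yaml)
loaded == record   # => true
```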

   - eliminate repeated variable creation where possible : prefer

       n = nil
       obj.each{|elem| n = elem ** 2 and p n}

     over

       obj.each{|elem| n = elem ** 2 and p n}

     pre-declaring n gives the block one shared local instead of a fresh
     block-local on every iteration.

   - anchor every single regular expression : an unanchored pattern can be
     retried at every position in the string, so performance degrades badly
     otherwise
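
     for instance, pinning the match to the start of the string means one
     attempt instead of one per offset (the pattern here is just
     illustrative):

```ruby
# hedged sketch: \A pins the attempt to position 0; without it the
# engine may retry the pattern at every offset in a long string
line = 'x' * 10_000 + '42'

anchored = /\A x* (\d+) \z/x   # /x lets us space the pattern out
line[anchored, 1]   # => "42"
```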

   - compile regular expressions only once unless they require variable
     substitution
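
     when the pattern does involve interpolation, build it once outside
     the loop (matcher_for is an invented helper):

```ruby
# hedged sketch: Regexp.new once, reuse many times, instead of
# interpolating (and recompiling) a literal inside the loop
def matcher_for(name)
  Regexp.new("\\A#{Regexp.escape(name)}=")
end

re = matcher_for('count')                    # compiled once
%w(count=1 total=2 count=3).grep(re).size    # => 2
```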

   - cache anything possible, for instance methods dynamically generated from
     method_missing : permanently add the method to the object instead of
     generating it every time
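
     a sketch of that caching trick (Record and its fields are made up):

```ruby
# hedged sketch: the first call to a missing getter defines a real
# method, so method_missing is only paid once per name
class Record
  def initialize(data)
    @data = data
  end

  def method_missing(name, *args)
    if @data.key?(name)
      self.class.class_eval do
        define_method(name){ @data[name] }   # cache as a real method
      end
      send(name)
    else
      super
    end
  end
end

r = Record.new(:city => 'boulder')
r.city   # => "boulder" (and Record#city now exists)
```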


now, i'm not saying these are 'right' - just that they are what i think of
when trying to make ruby fast.

cheers.

-a
-- 
===============================================================================
| email :: ara [dot] t [dot] howard [at] noaa [dot] gov
| phone :: 303.497.6469
| Your life dwells among the causes of death
| Like a lamp standing in a strong breeze.  --Nagarjuna
===============================================================================