One last follow-up (sorry, I'm bored on board a plane) :)

I ran one manual memory test comparing the VM (virtual memory) used by
the Set storage versus the Trie storage, using the previously measured
469-word document and a second document with 1007 words. The results
were as I expected:

469 words:
                          user     system      total        real
     create set:     16.040000   1.100000  17.140000 ( 21.742738)
     159MB of VM

     create matcher: 85.430000   1.340000  86.770000 ( 96.524512)
     68MB of VM


1007 words:
                          user     system      total        real
     create set:    137.470000   9.400000 146.870000 (166.828737)
     ~1GB of VM

     create matcher: 746.690000  11.050000 757.740000 (806.450292)
     149MB of VM

Conclusion: if you have the RAM to spare, the Set-based approach is
quite speedy (roughly 5x faster to build in both tests), but it gets
greedy as your full phrase base grows (~1GB of VM versus 149MB at 1007
words). If you need to save some memory and can spare the time, go
with the Trie-based approach.
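
As a rough illustration of why the Set gets greedy: the sketch below
assumes the Set stores every contiguous sub-phrase of the document as
its own string. That is an assumption on my part (the storage code is
not shown in this message), and sub_phrases is a hypothetical helper
name, not the code that was benchmarked. Under that assumption a
document of n words yields up to n*(n+1)/2 entries, which is about
110,000 for 469 words and about 507,000 for 1007 words.

```ruby
require 'set'

# Hypothetical helper (assumed design, not the benchmarked code):
# collect every contiguous run of words as one space-joined string.
def sub_phrases( words )
  phrases = Set.new
  words.each_index{ |start|
    ( start...words.length ).each{ |stop|
      phrases << words[ start..stop ].join( ' ' )
    }
  }
  phrases
end

words   = "the quick brown fox".split
phrases = sub_phrases( words )

puts phrases.size                        # 4 words -> 4+3+2+1 = 10 entries
puts phrases.include?( "quick brown" )   # contiguous run, so it is present
puts phrases.include?( "the fox" )       # not contiguous, so it is absent
```

Lookups are a single hash probe, which fits the speed seen above, but
the entry count grows quadratically with the word count.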



Now, having done all this work...if all you want is sub-phrase  
matching, why not use a regexp?


469 words:
                               user     system      total        real
     create clean string:  0.010000   0.010000   0.020000 (  0.003050)
     run 100k matches:    10.750000   0.140000  10.890000 ( 15.839430)
     28MB of VM

1007 words:
                               user     system      total        real
     create clean string:  0.010000   0.010000   0.020000 (  0.432572)
     run 100k matches:    19.350000   0.200000  19.550000 ( 27.612700)
     28MB of VM



[Slim:~/Desktop/Match Phrases] gavinkis% cat regexp.rb
require 'benchmark'

cleaned = nil
matcher = Regexp.new( "\\b#{ARGV[1]}\\b" )

Benchmark.bm( 20 ){ |x|
         x.report( "create clean string:" ){
                 cleaned = IO.read( ARGV[0] ).downcase.scan( /[a-z']+/ ).join( ' ' )
         }
         x.report( "run 100k matches:"){
                 100_000.times{
                         cleaned =~ matcher
                         cleaned =~ /the brown fox/
                 }
         }
}
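
One caveat with the script above: ARGV[1] is interpolated into the
pattern as-is, so any regexp metacharacters in the phrase would be
treated as pattern syntax. A minimal sketch of a safer variant follows.
phrase_matcher is a hypothetical helper name and the sample text is
made up for illustration; Regexp.escape and Regexp.new are core Ruby.

```ruby
# Hypothetical helper: build a whole-word matcher for a literal phrase.
def phrase_matcher( phrase )
  # Regexp.escape neutralizes characters such as . * ? in the phrase.
  Regexp.new( "\\b#{Regexp.escape( phrase )}\\b" )
end

# Same cleanup as the script above: lowercase, keep letters and
# apostrophes, and join the words with single spaces.
cleaned = "The quick brown fox jumps.".downcase.scan( /[a-z']+/ ).join( ' ' )

puts "found"     if cleaned =~ phrase_matcher( "brown fox" )
puts "not found" unless cleaned =~ phrase_matcher( "brown fo" )
```

The \b anchors keep a partial word like "brown fo" from matching
inside "brown fox".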