David,
After reading your results I thought I would try to make a couple of
simple changes. I attempted to clean up the 'insert' routine, since that
is where most of the processing time seemed to be spent. I also added
the ability to search on multiple terms (passed as individual terms or
as a single string). This will worsen the look-up times, but it might be
a good change.
If possible, could you run this version through your test to see how it
does?
class IndexHash
  def initialize( documents=nil )
    # The default [] is one shared array; it is never mutated in place,
    # because insert builds a new array with += and find only reads it.
    @index = Hash.new( [] )
    input( documents ) if documents
  end

  def input( documents )
    documents.each_pair do |symbol, contents|
      contents.split.each { |word| insert( symbol, word) }
    end
  end

  def insert( document_symbol, word )
    w = word.downcase
    @index[w] += [ document_symbol ] unless @index[w].include?( document_symbol )
  end

  # Accepts several terms, one string of terms, or a mix of both.
  def find( *strings )
    result = []
    strings.each do |string|
      string.split.each do |word|
        result += @index[ word.downcase ]
      end
    end
    result.uniq
  end

  def words
    @index.keys.sort
  end
end
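To illustrate the two ways of calling the multi-term 'find', here is a minimal sketch. The MiniIndex class is a hypothetical, condensed stand-in for IndexHash (it is not the class above verbatim), and the two documents are made up for the example.

```ruby
# Hypothetical stand-in for IndexHash, condensed for illustration only.
class MiniIndex
  def initialize(documents = {})
    @index = {}
    documents.each_pair do |symbol, contents|
      contents.split.each { |word| insert(symbol, word) }
    end
  end

  # Record that the document named by symbol contains word.
  def insert(symbol, word)
    list = (@index[word.downcase] ||= [])
    list << symbol unless list.include?(symbol)
  end

  # Accepts separate terms, a single space-separated string, or both.
  def find(*strings)
    terms = strings.flat_map { |string| string.split }
    terms.flat_map { |word| @index.fetch(word.downcase, []) }.uniq
  end
end

# Made-up sample documents.
docs = {
  :doc1 => "The quick brown fox",
  :doc2 => "Jumped over the brown dog"
}
index = MiniIndex.new(docs)

p index.find("fox")          # one term
p index.find("fox", "dog")   # individual terms
p index.find("fox dog")      # same terms as a single string
```

Both multi-term calls return every document that contains at least one of the terms, so the search is an OR rather than an AND.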
class IndexBitmap
  def initialize( documents=nil )
    @index = []                  # word list; a word's position is its bit number
    @documents = Hash.new( 0 )   # document symbol => bitmap of the words it contains
    input( documents ) if documents
  end

  def input( documents )
    documents.each_pair do |symbol, contents|
      contents.split.each { |word| insert( symbol, word) }
    end
  end

  def insert( document_symbol, word )
    w = word.downcase
    @index.push( w ) unless @index.include?( w )
    @documents[ document_symbol ] |= (1<<@index.index( w ))
  end

  def find( *strings )
    result = []
    mask = 0
    strings.each do |string|
      string.split.each do |word|
        # Look the word up once; unknown words contribute no bit.
        position = @index.index( word.downcase )
        mask |= (1<<position) if position
      end
    end
    @documents.each_pair do |symbol, value|
      # 0 is truthy in Ruby, so the test must compare against zero.
      result.push( symbol ) if (value & mask) != 0
    end
    result
  end

  def words
    @index.sort
  end
end
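One Ruby detail matters for any bitmap test like this: 0 is truthy, so the result of a bitwise AND has to be compared against zero explicitly. A minimal sketch of the mask logic, using made-up word positions and document bitmaps (mask_for and matching are hypothetical helper names, not part of the classes above):

```ruby
# Hypothetical data: each word's array position is its bit number.
words = ["brown", "dog", "fox"]   # bit 0, bit 1, bit 2
documents = {
  :doc1 => 0b101,                 # contains "brown" and "fox"
  :doc2 => 0b011                  # contains "brown" and "dog"
}

# Build a mask with one bit set for each known search term.
def mask_for(words, terms)
  terms.inject(0) do |mask, term|
    position = words.index(term.downcase)
    position ? mask | (1 << position) : mask
  end
end

# Select the documents sharing at least one bit with the mask.
# The explicit "!= 0" is required because 0 counts as true in Ruby.
def matching(documents, mask)
  documents.select { |_symbol, bits| (bits & mask) != 0 }.keys
end

p matching(documents, mask_for(words, ["fox"]))
p matching(documents, mask_for(words, ["dog", "fox"]))
p matching(documents, mask_for(words, ["cat"]))
```

With the comparison in place, a search for a word that was never indexed yields a mask of 0 and therefore no documents, rather than all of them.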