On Mon, 27 Nov 2006 15:20:32 +0900, Edwin Fine wrote: >> Wow. My solution's a real slowpoke compared to yours, even thought >> mine's >> asymptotically faster: O(string.length) versus >> O(string.length* max(keys.length)) >> >> I wonder what's slowing mine down so much, since these are in many ways >> the same. > > Well, do we get the same results from the scan of the big file? Recall > that in the scan I skip over strings once I see they are in the > dictionary, which cuts out a lot of work and may yield far fewer > results. > > For example, if the text to be scanned is > > abacusqwertyark > > and the dictionary contains the words abacus and ark, the index into the > string will move as follows: > > abacusqwertyark # Begin > ^ > > abacusqwertyark # Found abacus, skip it > ^ > abacusqwertyark # Found ark > ^ > > A solution that did not skip the words would find additional substrings, > for example: > > abacusqwertyark # Begin > ^ > > abacusqwertyark # Found abacus, start again on next character > ^ > .. omit some steps ... > > abacusqwertyark # Found "us" > ^ > > Maybe I'm cheating or there's some other bug in my code that omits > results that you find :) > > Apparently, part of my problem was the use of the Enumerator object. Eliminating that and keeping track of the index manually saves about 5 seconds off the benchmark. I also took my solution and reimplemented it in your code -- instead of having Node objects, I just use your hashes now and for the special fields I need, I use hash[:failure] and hash[:depth]. This saved about 15 seconds, so my new solution (based on your code) is only about a second slower than your solution, which is probably acceptable given the length of the words in the dictionary file. If the entries in the dictionary were longer, then my new solution should start to win. And by the way, both of these solutions win out over fairly large regular expressions pretty easily, since regular expressions have to backtrack to try each branch. user system total real Edwin Fine -- fill 0.840000 0.090000 0.930000 ( 0.933338) Edwin Fine -- test 5.650000 0.540000 6.190000 ( 6.194274) Ken Bloom -- fill 3.800000 0.300000 4.100000 ( 4.104002) Ken Bloom -- test 6.800000 0.370000 7.170000 ( 7.167073) #based on Edwin Fine's solution, this reimplements my solution with #less overhead all around. class DictionaryMatcher attr_reader :word_count def initialize @trie = {} @word_count = 0 end def add_word(word) @word_count += 1 container = @trie containers=[] i=0 word.each_byte do |b| container[b] = {} unless container.has_key? b container[:depth]=i containers << container container = container[b] i+=1 end containers << container container[0] = true # Mark end of word ff=compute_failure_function word ff.zip(containers).each do |pointto,container| container[:failure]=containers[pointto] if pointto end end def compute_failure_function p m=p.size pi=[nil,0] k=0 2.upto m do |q| k=pi[k] while k>0 and p[k] != p[q-1] k=k+1 if p[k]==p[q-1] pi[q]=k end pi end private :compute_failure_function def include?(word) container = @trie word.each_byte do |b| break unless container.has_key? b container = container[b] end container[0] end def =~ text internal_match text {|pos,len| return pos} nil end def internal_match string node=@trie pos=0 string.each_byte do |b| advance=false until advance nextnode=node[b] if not nextnode if node[:failure] node=node[:failure] else advance=true end elsif nextnode[0] yield pos, nextnode[:depth] advance=true node=@trie else advance=true node=nextnode end pos+=1 end end end private :internal_match def find_all_matching(text, &block) matches=[] block= lambda{ |pos,len| matches << [pos,len] } unless block internal_match(text,&block) matches end alias_method :scan, :find_all_matching # implement much of the rest of the interface implemented by Regexps alias_method :===, :=~ alias_method :match, :=~ alias_method :<<, :add_word # Add words from a file def add_words(words_file) IO.foreach(words_file) do |line| add_word line.chomp end end end -- Ken Bloom. PhD candidate. Linguistic Cognition Laboratory. Department of Computer Science. Illinois Institute of Technology. http://www.iit.edu/~kbloom1/