On Mon, 27 Nov 2006 15:20:32 +0900, Edwin Fine wrote:

>> Wow. My solution's a real slowpoke compared to yours, even thought 
>> mine's
>> asymptotically faster: O(string.length) versus
>> O(string.length* max(keys.length))
>> 
>> I wonder what's slowing mine down so much, since these are in many ways
>> the same.
> 
> Well, do we get the same results from the scan of the big file? Recall 
> that in the scan I skip over strings once I see they are in the 
> dictionary, which cuts out a lot of work and may yield far fewer 
> results.
> 
> For example, if the text to be scanned is
> 
> abacusqwertyark
> 
> and the dictionary contains the words abacus and ark, the index into the 
> string will move as follows:
> 
> abacusqwertyark # Begin
> ^
> 
> abacusqwertyark # Found abacus, skip it
>       ^
> abacusqwertyark # Found ark
>                ^
> 
> A solution that did not skip the words would find additional substrings, 
> for example:
> 
> abacusqwertyark # Begin
> ^
> 
> abacusqwertyark # Found abacus, start again on next character
>  ^
> .. omit some steps ...
> 
> abacusqwertyark # Found "us"
>     ^
> 
> Maybe I'm cheating or there's some other bug in my code that omits 
> results that you find :)
> 
>

Apparently, part of my problem was the use of the Enumerator object.
Eliminating that and keeping track of the index manually saves about 5
seconds off the benchmark.

I also took my solution and reimplemented it in your code --
instead of having Node objects, I just use your hashes now and for the
special fields I need, I use hash[:failure] and hash[:depth]. This saved
about 15 seconds, so my new solution (based on your code) is only about a
second slower than your solution, which is probably acceptable given the
length of the words in the dictionary file. If the entries in the
dictionary were longer, then my new solution should start to win. And by
the way, both of these solutions win out over fairly large regular
expressions pretty easily, since regular expressions have to backtrack to
try each branch.

                        user     system      total        real
Edwin Fine -- fill  0.840000   0.090000   0.930000 (  0.933338)
Edwin Fine -- test  5.650000   0.540000   6.190000 (  6.194274)
Ken Bloom -- fill   3.800000   0.300000   4.100000 (  4.104002)
Ken Bloom -- test   6.800000   0.370000   7.170000 (  7.167073)

#based on Edwin Fine's solution, this reimplements my solution with
#less overhead all around.
class DictionaryMatcher
  attr_reader :word_count

  def initialize
    @trie = {}
    @word_count = 0
  end

  def add_word(word)
    @word_count += 1
    container = @trie
    containers=[]

    i=0
    word.each_byte do |b|
      container[b] = {} unless container.has_key? b
      container[:depth]=i
      containers << container
      container = container[b]
      i+=1
    end
    containers << container

    container[0] = true # Mark end of word
     
    ff=compute_failure_function word
    ff.zip(containers).each do |pointto,container|
      container[:failure]=containers[pointto] if pointto
    end

  end

  def compute_failure_function p
    m=p.size
    pi=[nil,0]
    k=0
    2.upto m do |q|
      k=pi[k] while k>0 and p[k] != p[q-1]
      k=k+1 if p[k]==p[q-1]
      pi[q]=k
    end
    pi
  end
  private :compute_failure_function

  def include?(word)
    container = @trie
    word.each_byte do |b|
      break unless container.has_key? b
      container = container[b]
    end
    container[0]
  end

  def =~ text
    internal_match text {|pos,len| return pos}
    nil
  end

  def internal_match string
      node=@trie
      pos=0
      string.each_byte do |b|
	 advance=false
	 until advance
	    nextnode=node[b]
	    if not nextnode
	       if node[:failure]
		  node=node[:failure]
	       else
		  advance=true 
	       end
	    elsif nextnode[0]
	       yield pos, nextnode[:depth] 
	       advance=true
	       node=@trie
	    else
	       advance=true
	       node=nextnode
	    end
	    pos+=1
	 end
      end
  end
  private :internal_match


  def find_all_matching(text, &block)
    matches=[]
    block= lambda{ |pos,len| matches << [pos,len] } unless block
    internal_match(text,&block)
    matches
  end

  alias_method :scan, :find_all_matching

  # implement much of the rest of the interface implemented by Regexps
  alias_method :===, :=~
  alias_method :match, :=~
  alias_method :<<, :add_word

  # Add words from a file
  def add_words(words_file)
    IO.foreach(words_file) do |line|
      add_word line.chomp
    end
  end
end


-- 
Ken Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.
http://www.iit.edu/~kbloom1/