On Sat, 25 Nov 2006 08:21:24 +0900, Ruby Quiz wrote:
> by Ken Bloom
> 
> From time to time someone asks on ruby-talk how they can write a regexp of the
> form:
> 
> 	/alligator|crocodile|bear|dinosaur|...|seven-thousandth-word/
> 
> It's not hard to write such a regexp, but Ruby has in internal limit on how big
> the regular expression can be, so users find they can't do this matching
> function easily.
> 
> Implement a class DictionaryMatcher that determines whether any of the strings
> added to it are substrings of a string S. This should function as almost a
> drop-in replacement for a Regexp, therefore your implementation should support
> the following operations:
> 
> 	# creates a new empty matcher
> 	dm=DictionaryMatcher.new
> 	
> 	# adds strings to the matcher
> 	dm << "string"
> 	dm << "Ruby"
> 	
> 	# determines whether a given word was one of those added to the matcher
> 	dm.include?("Ruby")                 # => true
> 	dm.include?("missing")              # => false
> 	dm.include?("stringing you along")  # => false
> 	
> 	# Regexp-like substing search
> 	dm =~ "long string"            # => 5
> 	dm =~ "rub you the wrong way"  # => nil
> 	
> 	# will automatically work as a result of implementing
> 	# DictionaryMatcher#=~ (see String#=~)
> 	"long string" =~ dm  # => true
> 	
> 	# implement the rest of the interface implemented by Regexps (well, almost)
> 	class DictionaryMatcher
> 	  alias_method :===, :=~
> 	  alias_method :match, :=~
> 	end
> 
> If you can add additional features, like a case insensitivity option when
> creating a new DictionaryMatcher this is also very useful.

This DictionaryMatcher implementation implements a trie (prefix tree) 
combined with Knuth-Morris-Pratt string matching on O(n) time. The funny 
thing is I gave this solution as an answer to the same question on my 
algorithms final last semester, and it was marked wrong. The professor 
just didn't understand what I was doing.

I violated the interface a little since knowing which words matched can 
also be pretty important, and added the #scan method, because the 
questioner who inspired me to write this quiz actually told me that he
wanted to count how many matches there were in the string.

require 'enumerator'

# The DictionaryMatcher class holds a collection of strings. It allows lookups to
# determine which strings are included, and it can be used similarly
# to a +Regexp+ in substring matching.
class DictionaryMatcher
   #Create a DictionaryMatcher with no words in it
   def initialize
      @internal=Node.new
   end


   #Add a word to the DictionaryMatcher
   def add string
      array=@internal.add string
      parent_indexes=compute_failure_function string
      array.zip(parent_indexes).each do |node,parentindex|
	 node.failure=array[parentindex] if parentindex
      end
      nil
   end
   alias_method :<<, :add

   #Determine whether +string+ was previously <tt>add</tt>ed to the 
   #DictionaryMatcher.
   def include? string
      @internal.include? string
   end

   #Determine whether one of the words in the DictionaryMatcher is a substring of
   #+string+. Returns a DictionaryMatcher::MatchData object if found, +nil+ if not 
   #found.
   def match string
      internal_match(string){|md| return md}
      return nil
   end

   #Scans +string+ for all occurrances of strings in the DictionaryMatcher.
   #Overlapping matches are skipped (only the first one is yielded), and 
   #when some strings in the
   #DictionaryMatcher are substrings of others, only the shortest match at a given 
   #position is found.
   def scan string
      internal_match(string){|matchdata| yield matchdata}
      nil
   end

   #Case equality. Similar to =~, but returns true or false.
   def === string
      not match(string).nil?
   end

   #Determines whether one of the words in the DictionaryMatcher is a substring of
   #+string+. Returns the index of the match if found, +nil+ if not 
   #found.
   def =~ string
      x=match(string)
      x && x.index
   end

   #Contains the index matched, and the word matched
   MatchData=Struct.new(:index,:match)

   private
   #Doing this globally for the whole word feels kludgy, but I didn't
   #want to figure out how to do this as a per-node function.
   #Basically copied from Cormen, Leiserson, Rivest and Stein 
   #"Introduction to Algorithms" 2nd ed.
   def compute_failure_function p
      m=p.size
      pi=[0,0]
      k=0
      2.upto m do |q|
	 k=pi[k] while k>0 and p[k] != p[q-1]
	 k=k+1 if p[k]==p[q-1]
	 pi[q]=k
      end
      pi
   end

   def internal_match string
      node=@internal
      e=Enumerable::Enumerator.new(string,:each_byte)
      e.each_with_index do |b,index|
	 advance=false
	 until advance
	    nextnode=node.transitions[b]
	    if not nextnode
	       advance=true if node==node.failure #loops happen at the root
	       node=node.failure
	    elsif nextnode.endword?
	       yield MatchData.new(index-node.depth,string[index-node.depth..index])
	       advance=true
	       node=@internal
	    else
	       advance=true
	       node=nextnode
	    end
	 end
      end
   end

   class Node #:nodoc:
      attr_accessor :transitions, :failure, :endword, :depth
      alias_method :endword?, :endword
      def initialize depth=0
	 @depth=depth
	 @transitions={}
	 @endword=false
      end
      def add string,offset=0, array=[]
	 first=string[offset]
	 if offset==string.size
	    @endword=true
	    array << self
	    return array
	 end
	 array << self
	 node=(@transitions[first] ||= Node.new(@depth+1))
	 return node.add(string,offset+1,array)
      end
      def include? string, offset=0
	 first=string[offset]
	 if offset==string.size
	    return @endword
	 elsif not @transitions.include?(first)
	    return false
	 else
	    return @transitions[first].include?(string,offset+1)
	 end
      end
      def inspect; "x"; end
   end

end


-- 
Ken Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.
http://www.iit.edu/~kbloom1/