On Sat, 25 Nov 2006 08:21:24 +0900, Ruby Quiz wrote: > by Ken Bloom > > From time to time someone asks on ruby-talk how they can write a regexp of the > form: > > /alligator|crocodile|bear|dinosaur|...|seven-thousandth-word/ > > It's not hard to write such a regexp, but Ruby has in internal limit on how big > the regular expression can be, so users find they can't do this matching > function easily. > > Implement a class DictionaryMatcher that determines whether any of the strings > added to it are substrings of a string S. This should function as almost a > drop-in replacement for a Regexp, therefore your implementation should support > the following operations: > > # creates a new empty matcher > dm=DictionaryMatcher.new > > # adds strings to the matcher > dm << "string" > dm << "Ruby" > > # determines whether a given word was one of those added to the matcher > dm.include?("Ruby") # => true > dm.include?("missing") # => false > dm.include?("stringing you along") # => false > > # Regexp-like substing search > dm =~ "long string" # => 5 > dm =~ "rub you the wrong way" # => nil > > # will automatically work as a result of implementing > # DictionaryMatcher#=~ (see String#=~) > "long string" =~ dm # => true > > # implement the rest of the interface implemented by Regexps (well, almost) > class DictionaryMatcher > alias_method :===, :=~ > alias_method :match, :=~ > end > > If you can add additional features, like a case insensitivity option when > creating a new DictionaryMatcher this is also very useful. This DictionaryMatcher implementation implements a trie (prefix tree) combined with Knuth-Morris-Pratt string matching on O(n) time. The funny thing is I gave this solution as an answer to the same question on my algorithms final last semester, and it was marked wrong. The professor just didn't understand what I was doing. I violated the interface a little since knowing which words matched can also be pretty important, and added the #scan method, because the questioner who inspired me to write this quiz actually told me that he wanted to count how many matches there were in the string. require 'enumerator' # The DictionaryMatcher class holds a collection of strings. It allows lookups to # determine which strings are included, and it can be used similarly # to a +Regexp+ in substring matching. class DictionaryMatcher #Create a DictionaryMatcher with no words in it def initialize @internal=Node.new end #Add a word to the DictionaryMatcher def add string array=@internal.add string parent_indexes=compute_failure_function string array.zip(parent_indexes).each do |node,parentindex| node.failure=array[parentindex] if parentindex end nil end alias_method :<<, :add #Determine whether +string+ was previously <tt>add</tt>ed to the #DictionaryMatcher. def include? string @internal.include? string end #Determine whether one of the words in the DictionaryMatcher is a substring of #+string+. Returns a DictionaryMatcher::MatchData object if found, +nil+ if not #found. def match string internal_match(string){|md| return md} return nil end #Scans +string+ for all occurrances of strings in the DictionaryMatcher. #Overlapping matches are skipped (only the first one is yielded), and #when some strings in the #DictionaryMatcher are substrings of others, only the shortest match at a given #position is found. def scan string internal_match(string){|matchdata| yield matchdata} nil end #Case equality. Similar to =~, but returns true or false. def === string not match(string).nil? end #Determines whether one of the words in the DictionaryMatcher is a substring of #+string+. Returns the index of the match if found, +nil+ if not #found. def =~ string x=match(string) x && x.index end #Contains the index matched, and the word matched MatchData=Struct.new(:index,:match) private #Doing this globally for the whole word feels kludgy, but I didn't #want to figure out how to do this as a per-node function. #Basically copied from Cormen, Leiserson, Rivest and Stein #"Introduction to Algorithms" 2nd ed. def compute_failure_function p m=p.size pi=[0,0] k=0 2.upto m do |q| k=pi[k] while k>0 and p[k] != p[q-1] k=k+1 if p[k]==p[q-1] pi[q]=k end pi end def internal_match string node=@internal e=Enumerable::Enumerator.new(string,:each_byte) e.each_with_index do |b,index| advance=false until advance nextnode=node.transitions[b] if not nextnode advance=true if node==node.failure #loops happen at the root node=node.failure elsif nextnode.endword? yield MatchData.new(index-node.depth,string[index-node.depth..index]) advance=true node=@internal else advance=true node=nextnode end end end end class Node #:nodoc: attr_accessor :transitions, :failure, :endword, :depth alias_method :endword?, :endword def initialize depth=0 @depth=depth @transitions={} @endword=false end def add string,offset=0, array=[] first=string[offset] if offset==string.size @endword=true array << self return array end array << self node=(@transitions[first] ||= Node.new(@depth+1)) return node.add(string,offset+1,array) end def include? string, offset=0 first=string[offset] if offset==string.size return @endword elsif not @transitions.include?(first) return false else return @transitions[first].include?(string,offset+1) end end def inspect; "x"; end end end -- Ken Bloom. PhD candidate. Linguistic Cognition Laboratory. Department of Computer Science. Illinois Institute of Technology. http://www.iit.edu/~kbloom1/