This solution uses a digital trie to encode the strings to search.
A good background to this data structure can be found here:
http://www.ddj.com/184410528,

which also discusses a better solution that I did not explore (ternary 
search trees).

A trie has a search speed that is O(m), where m is the number of 
characters in the string to find in the dictionary. A hash table is 
O(1), which is theoretically faster than a trie, but in practice the 
trie is faster because the has table has to hash all characters in the 
string to create the hash key, whereas the trie can reject a string by 
matching as little as 1 character.

The solution trades memory usage for speed. Every string in the 
dictionary is organized into a hierarchy of character codes. 
Essentially, the dictionary encodes a DFA (deterministic finite 
automaton) where each character code is a transition from one node to 
the next one. This provides very fast search speeds, especially when a 
non-matching string is encountered. In other words, the trie can reject 
incorrect strings very quickly (as soon as a non-matching prefix to a 
valid string is seen).

To store a string in the dictionary, we start at the root (which is just 
a hash
or array of the character codes of the first character of every string 
in the
dictionary).

trie = root
for each character code in the string to store
  trie[character code] ||= {} # or [], if arrays used
  trie = trie[character code] # drop down 1 level
end
trie[0] = true # Mark end of word (code 0 will never be used)

It is necessary to mark the end of a word because you can get words that 
are prefixes of other words (for example, can, canna, and cannabis). 
This allows you to decide if you want to use a least-prefix or greedy 
search. This code uses a least-prefix search that returns the shortest 
matching string in the dictionary. This is due to the requirements of 
the quiz. However, it should be easy enough to code a greedy search.

The search algorithm is surprisingly simple. The search space is the 
text that
we want to search for matching strings. The code starts at the first 
character of the search space and tries to find a match in the 
dictionary anchored at that position. If it does not detect a match, it 
moves the anchor to the next character and starts again.

trie = root
for each character code in the search space
  break unless character code is in trie
  trie = trie[character code]
end
found a valid string if trie[0] exists

Each character in the dictionary to be searched is an index into a hash.

Informal benchmarks on my system (Intel Core 2 Duo E6600 on Win XP 32 
bit)
shows a memory usage of about 15MB when using a hash as the basic trie
storage, and 30+Mb using an array, when storing a dictionary of 24,001 
words.
The speed seems to be about the same either way, so the hash solution is 
the best bet.

The test code finds all words in the 24,001 word dictionary in a text of 
about
600KB. On my system, this takes around 2 seconds. This does include a 
possibly
incorrect optimization: once a word is matched, the anchor point is 
moved to
after the word so that substrings within the word are not matched. This 
may
or may not be what is desired, but removing this optimization almost 
doubles
the run time.

The code itself is fairly short:

class DictionaryMatcher
  attr_reader :word_count

  def initialize
    @trie = {}
    @word_count = 0
  end

  def add_word(word)
    @word_count += 1
    container = @trie

    word.each_byte do |b|
      container[b] = {} unless container.has_key? b
      container = container[b]
    end

    container[0] = true # Mark end of word
  end

  def include?(word)
    container = @trie
    word.each_byte do |b|
      break unless container.has_key? b
      container = container[b]
    end
    container[0]
  end

  def =~(text)
    text_end = text.length - 1
    pos = 0

    while pos <= text_end do
      container = @trie

      pos.upto(text_end) do |i|
        b = text[i]
        break unless container.has_key? b
        container = container[b]
      end

      return pos if container[0] # Match
      pos += 1
    end

    nil
  end

  # Return container of matches in text [[pos, len], ...]
  # or call block if provided (returns [])
  def find_all_matching(text, &block)
    matches = []
    block = lambda { |pos, len| matches << [pos, len] } unless block
    pos = 0
    text_end = text.length - 1

    while pos <= text_end do
      container = @trie
      len = 0

      pos.upto(text_end) do |i|
        b = text[i]
        break unless container.has_key?(b)
        container = container[b]
        len += 1
      end

      if container[0] # Match
        block.call(pos, len)
        pos += len # Skip over word
      else
        pos += 1
      end
    end

    matches
  end

  # implement much of the rest of the interface implemented by Regexps
  alias_method :===, :=~
  alias_method :match, :=~
  alias_method :<<, :add_word

  # Add words from a file
  def add_words(words_file)
    IO.foreach(words_file) do |line|
      add_word line.chomp
    end
  end
end

I have posted the code, Test::Unit tests, the test file, and dictionary 
file on my web site here:

http://finecomputerconsultants.com/ruby_quiz/dictionary_matcher-0.1.tgz

I hope I have not violated any copyrights/lefts by supplying the text 
files; if so, let me know and I will remedy this.

-- 
Posted via http://www.ruby-forum.com/.