On Mon, 27 Nov 2006 11:47:38 +0900, Edwin Fine wrote:

> This solution uses a digital trie to encode the strings to search.
> A good background to this data structure can be found here:
> http://www.ddj.com/184410528, which also discusses a better solution
> that I did not explore (ternary search trees).
> 
> A trie has a search speed that is O(m), where m is the number of
> characters in the string to find in the dictionary. A hash table
> lookup is O(1), which sounds faster in theory, but in practice the
> trie wins: computing the hash key must touch every character of the
> string, whereas the trie can reject a non-matching string after
> examining as little as one character.
> 
> The solution trades memory usage for speed. Every string in the 
> dictionary is organized into a hierarchy of character codes. 
> Essentially, the dictionary encodes a DFA (deterministic finite 
> automaton) where each character code is a transition from one node to 
> the next one. This provides very fast search speeds, especially when a 
> non-matching string is encountered. In other words, the trie can reject 
> incorrect strings very quickly (as soon as a non-matching prefix to a 
> valid string is seen).
> 
> To store a string in the dictionary, we start at the root (which is
> just a hash or array keyed by the character codes of the first
> character of every string in the dictionary).
> 
> trie = root
> for each character code in the string to store
>   trie[character code] ||= {} # or [], if arrays used
>   trie = trie[character code] # drop down 1 level
> end
> trie[0] = true # Mark end of word (code 0 will never be used)
> 
> It is necessary to mark the end of a word because you can get words that 
> are prefixes of other words (for example, can, canna, and cannabis). 
> This allows you to decide if you want to use a least-prefix or greedy 
> search. This code uses a least-prefix search that returns the shortest 
> matching string in the dictionary. This is due to the requirements of 
> the quiz. However, it should be easy enough to code a greedy search.
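The insertion pseudocode above might be fleshed out in Ruby roughly as follows (a sketch; the class name `TrieDict` is illustrative, not from the posted code):

```ruby
# Minimal trie insertion, following the pseudocode above.
# Each node is a Hash keyed by character code; the key 0 marks end-of-word
# (byte 0 never occurs in normal text, so it is free to use as a sentinel).
class TrieDict
  attr_reader :root

  def initialize
    @root = {}
  end

  # Store a word by walking/creating one hash level per character code.
  def <<(word)
    node = @root
    word.each_byte do |code|
      node = (node[code] ||= {})   # drop down one level, creating as needed
    end
    node[0] = true                 # mark end of word
    self                           # allow chaining: t << "a" << "b"
  end
end

t = TrieDict.new
t << "can" << "cannabis"
```

The end-of-word marker is what lets "can" and "cannabis" coexist: the node reached after c-a-n carries both the marker and a child for the second "n".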
> 
> The search algorithm is surprisingly simple. The search space is the 
> text that
> we want to search for matching strings. The code starts at the first 
> character of the search space and tries to find a match in the 
> dictionary anchored at that position. If it does not detect a match, it 
> moves the anchor to the next character and starts again.
> 
> trie = root
> for each character code in the search space
>   break unless character code is in trie
>   trie = trie[character code]
> end
> found a valid string if trie[0] exists
> 
> Each character in the dictionary to be searched is an index into a hash.
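A least-prefix walk over that hash-of-hashes structure might look like this (again a sketch with illustrative names, not the posted code):

```ruby
# Least-prefix match: starting at a given anchor position in the text,
# walk the trie one byte at a time and stop at the first end-of-word
# marker (hash key 0). Returns the matched substring, or nil.
def match_at(root, text, anchor)
  node = root
  anchor.upto(text.length - 1) do |i|
    node = node[text.getbyte(i)]
    return nil unless node            # reject on the first mismatch
    return text[anchor..i] if node[0] # shortest match wins (least-prefix)
  end
  nil
end

# Build a tiny dictionary inline so the sketch is self-contained.
root = {}
%w[can canna].each do |w|
  node = root
  w.each_byte { |c| node = (node[c] ||= {}) }
  node[0] = true
end

match_at(root, "a cannabis field", 2)   # => "can" (not "canna")
```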
> 
> Informal benchmarks on my system (Intel Core 2 Duo E6600 on Win XP
> 32-bit) show a memory usage of about 15 MB when using a hash as the
> basic trie storage, and 30+ MB using an array, when storing a
> dictionary of 24,001 words. The speed seems to be about the same
> either way, so the hash solution is the best bet.
> 
> The test code finds all words in the 24,001-word dictionary in a text
> of about 600 KB. On my system, this takes around 2 seconds. This does
> include a possibly incorrect optimization: once a word is matched, the
> anchor point is moved to just past that word, so that substrings
> within the word are not matched. This may or may not be what is
> desired, but removing this optimization almost doubles the run time.
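The anchor-skipping loop described above might be sketched like this (self-contained; function and variable names are mine, not from the posted code):

```ruby
# Scan with the anchor-skipping optimization: after a match, resume
# scanning just past the matched word rather than at the next character,
# so substrings inside a matched word are never reported.
def scan_with_skip(root, text)
  matches = []
  anchor  = 0
  while anchor < text.length
    node    = root
    matched = nil
    anchor.upto(text.length - 1) do |i|
      node = node[text.getbyte(i)]
      break unless node
      if node[0]                      # least-prefix: stop at first end marker
        matched = text[anchor..i]
        break
      end
    end
    if matched
      matches << matched
      anchor += matched.length        # the optimization: skip past the word
    else
      anchor += 1                     # no match here; slide the anchor by one
    end
  end
  matches
end

# Tiny inline dictionary.
root = {}
%w[can field].each do |w|
  node = root
  w.each_byte { |c| node = (node[c] ||= {}) }
  node[0] = true
end

scan_with_skip(root, "a cannabis field")   # => ["can", "field"]
```

Dropping the `anchor += matched.length` line in favor of `anchor += 1` everywhere gives the unoptimized behavior, which would also report matches starting inside already-matched words.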

[CODE SNIPPED]

> 
> I have posted the code, Test::Unit tests, the test file, and dictionary 
> file on my web site here:
> 
> http://finecomputerconsultants.com/ruby_quiz/dictionary_matcher-0.1.tgz
> 
> I hope I have not violated any copyrights/lefts by supplying the text 
> files; if so, let me know and I will remedy this.
>

Wow. My solution's a real slowpoke compared to yours, even though mine's
asymptotically faster: O(string.length) versus
O(string.length * max(keys.length)).

I wonder what's slowing mine down so much, since these are in many ways
the same.

                        user     system      total        real
Edwin Fine -- fill  0.810000   0.130000   0.940000 (  0.939537)
Edwin Fine -- test  5.580000   0.630000   6.210000 (  6.244736)
Ken Bloom -- fill   4.550000   0.460000   5.010000 (  5.010708)
Ken Bloom -- test  25.210000   0.960000  26.170000 ( 27.519233)

(tested using the files you provided -- I modified your interface just a
little to alias your #find_all_matches method to #scan)

I'd be happy to benchmark anybody else's solution who implements #scan.

#!/usr/bin/env ruby
open("practical-file-system-design.txt") do |f|
  FILEDATA = f.read
end

open("words_en.txt") do |f|
  DICTIONARY = f.readlines.map { |x| x.chomp }
end

require 'benchmark'
include Benchmark

# The following files contain various implementations, renamed so as not to
# conflict with each other.
require 'trie'
require 'finedm'

TESTCLASSES = { "Ken Bloom"  => KenDictionaryMatcher,
                "Edwin Fine" => FineDictionaryMatcher }

bm(TESTCLASSES.keys.collect { |x| x.length }.max + 8) do |benchmarker|
  matcher = nil
  TESTCLASSES.each do |name, klass|
    benchmarker.report("#{name} -- fill") do
      matcher = klass.new
      DICTIONARY.each { |x| matcher << x }
    end
    benchmarker.report("#{name} -- test") do
      matcher.scan(FILEDATA) {}
    end
  end
end
__END__

-- 
Ken Bloom. PhD candidate. Linguistic Cognition Laboratory.
Department of Computer Science. Illinois Institute of Technology.
http://www.iit.edu/~kbloom1/