On Mon, 27 Nov 2006 11:47:38 +0900, Edwin Fine wrote: > This solution uses a digital trie to encode the strings to search. > A good background to this data structure can be found here: > http://www.ddj.com/184410528, > > which also discusses a better solution that I did not explore (ternary > search trees). > > A trie has a search speed that is O(m), where m is the number of > characters in the string to find in the dictionary. A hash table is > O(1), which is theoretically faster than a trie, but in practice the > trie is faster because the has table has to hash all characters in the > string to create the hash key, whereas the trie can reject a string by > matching as little as 1 character. > > The solution trades memory usage for speed. Every string in the > dictionary is organized into a hierarchy of character codes. > Essentially, the dictionary encodes a DFA (deterministic finite > automaton) where each character code is a transition from one node to > the next one. This provides very fast search speeds, especially when a > non-matching string is encountered. In other words, the trie can reject > incorrect strings very quickly (as soon as a non-matching prefix to a > valid string is seen). > > To store a string in the dictionary, we start at the root (which is just > a hash > or array of the character codes of the first character of every string > in the > dictionary). > > trie = root > for each character code in the string to store > trie[character code] ||= {} # or [], if arrays used > trie = trie[character code] # drop down 1 level > end > trie[0] = true # Mark end of word (code 0 will never be used) > > It is necessary to mark the end of a word because you can get words that > are prefixes of other words (for example, can, canna, and cannabis). > This allows you to decide if you want to use a least-prefix or greedy > search. This code uses a least-prefix search that returns the shortest > matching string in the dictionary. This is due to the requirements of > the quiz. However, it should be easy enough to code a greedy search. > > The search algorithm is surprisingly simple. The search space is the > text that > we want to search for matching strings. The code starts at the first > character of the search space and tries to find a match in the > dictionary anchored at that position. If it does not detect a match, it > moves the anchor to the next character and starts again. > > trie = root > for each character code in the search space > break unless character code is in trie > trie = trie[character code] > end > found a valid string if trie[0] exists > > Each character in the dictionary to be searched is an index into a hash. > > Informal benchmarks on my system (Intel Core 2 Duo E6600 on Win XP 32 > bit) > shows a memory usage of about 15MB when using a hash as the basic trie > storage, and 30+Mb using an array, when storing a dictionary of 24,001 > words. > The speed seems to be about the same either way, so the hash solution is > the best bet. > > The test code finds all words in the 24,001 word dictionary in a text of > about > 600KB. On my system, this takes around 2 seconds. This does include a > possibly > incorrect optimization: once a word is matched, the anchor point is > moved to > after the word so that substrings within the word are not matched. This > may > or may not be what is desired, but removing this optimization almost > doubles > the run time. [CODE SNIPPED] > > I have posted the code, Test::Unit tests, the test file, and dictionary > file on my web site here: > > http://finecomputerconsultants.com/ruby_quiz/dictionary_matcher-0.1.tgz > > I hope I have not violated any copyrights/lefts by supplying the text > files; if so, let me know and I will remedy this. > Wow. My solution's a real slowpoke compared to yours, even thought mine's asymptotically faster: O(string.length) versus O(string.length* max(keys.length)) I wonder what's slowing mine down so much, since these are in many ways the same. user system total real Edwin Fine -- fill 0.810000 0.130000 0.940000 ( 0.939537) Edwin Fine -- test 5.580000 0.630000 6.210000 ( 6.244736) Ken Bloom -- fill 4.550000 0.460000 5.010000 ( 5.010708) Ken Bloom -- test 25.210000 0.960000 26.170000 ( 27.519233) (tested using the files you provided -- I modified your interface just a little to alias your #find_all_matches method to #scan) I'd be happy to benchmark anybody else's solution who implements #scan. #!/usr/bin/env ruby open("practical-file-system-design.txt") do |f| FILEDATA=f.read end open("words_en.txt") do |f| DICTIONARY=f.readlines.map{|x| x.chomp} end require 'benchmark' include Benchmark #the following files contain various implementations, renamed so as not to #conflict with each other require 'trie' require 'finedm' TESTCLASSES={"Ken Bloom" => KenDictionaryMatcher, "Edwin Fine" => FineDictionaryMatcher} bm(TESTCLASSES.keys.collect{|x| x.length}.max + 8) do |benchmarker| matcher=nil TESTCLASSES.each do |name,klass| benchmarker.report("#{name} -- fill") do matcher=klass.new DICTIONARY.each {|x| matcher << x} end benchmarker.report("#{name} -- test") do matcher.scan(FILEDATA){} end end end __END__ -- Ken Bloom. PhD candidate. Linguistic Cognition Laboratory. Department of Computer Science. Illinois Institute of Technology. http://www.iit.edu/~kbloom1/