Here is my solution to this very interesting quiz.

My MarkovChainer can work with any order and it is sentence oriented: I  
split the input text with /([.!?])/ and then scan each sentence with  
/[\w']+/, so all punctuation (except ".", "!" and "?") is ignored.
It also stores the first <order> words of each sentence, to use them as  
start words when generating sentences.

The script accepts command line arguments for the order (-o) and the  
number of sentences to generate (-n), the defaults are -o2 -n10. Then it  
uses ARGF.read as input text and prints n generated sentences.

Now lets talk a bit about the data structure:

My first implementation used trees of hashes and because symbols are  
faster as hash keys than strings, I stored the words as symbols:

For the text "q s a w s. a s w q s. q w s a a w. s q a w s.", that looked  
like:

{:q=>{:s=>{:"."=>1, :a=>1}, :a=>{:w=>1}, :w=>{:s=>1}},
  :s=>{:q=>{:a=>1}, :w=>{:q=>1}, :a=>{:a=>1, :w=>1}},
  :w=>{:q=>{:s=>1}, :s=>{:"."=>2, :a=>1}},
  :a=>{:s=>{:w=>1}, :a=>{:w=>1}, :w=>{:"."=>1, :s=>2}}}

Then I thought it might be faster to put all the last words into one array  
instead of building a frequency hash, because I built that array anyway,  
when I was looking for the next word: That looked like:

{:q=>{:s=>[:a, :"."], :a=>[:w], :w=>[:s]},
  :s=>{:q=>[:a], :a=>[:w, :a], :w=>[:q]},
  :a=>{:s=>[:w], :a=>[:w], :w=>[:s, :".", :s]},
  :w=>{:q=>[:s], :s=>[:".", :a, :"."]}}

And it also was faster. Then I wanted to know if symbols really are faster  
then strings:

{"w"=>{"q"=>["s"], "s"=>[".", "a", "."]},
  "a"=>{"a"=>["w"], "w"=>["s", ".", "s"], "s"=>["w"]},
  "q"=>{"a"=>["w"], "w"=>["s"], "s"=>["a", "."]},
  "s"=>{"w"=>["q"], "a"=>["w", "a"], "q"=>["a"]}}

It turned out that strings are faster. I think that is because converting  
a symbol to a string (when generating the sentences) needs one hash lookup  
and generates a new String object. So the faster hash lookup of symbols  
doesn't pay off.

And finally I wanted to see if at least the hash tree is worth it. I tried  
the following:

{["q", "a"]=>["w"],
  ["q", "w"]=>["s"],
  ["s", "q"]=>["a"],
  ["w", "q"]=>["s"],
  ["s", "w"]=>["q"],
  ["a", "s"]=>["w"],
  ["w", "s"]=>[".", "a", "."],
  ["s", "a"]=>["w", "a"],
  ["q", "s"]=>["a", "."],
  ["a", "a"]=>["w"],
  ["a", "w"]=>["s", ".", "s"]}

Well, the hash tree also turned out to be slower than this simple flat  
hash (multiple hash lookups are slower than computing the hash of a multi  
element array and one lookup).
(I also tried joined strings as keys (e.g. "q a"=>["w"]), but they are  
slower than the array keys, because more transformations are needed)

So, the morale of all this:
- don't use symbols if they have to be converted to a string often
- hash lookups might be slower than you think
- premature optimization...


Dominik


class MarkovChainer
   attr_reader :order
   def initialize(order)
     @order = order
     @beginnings = []
     @freq = {}
   end

   def add_text(text)
     # make sure each paragraph ends with some sentence terminator
     text.gsub!(/\n\s*\n/m, ".")
     text << "."
     seps = /([.!?])/
     sentence = ""
     text.split(seps).each { |p|
       if seps =~ p
         add_sentence(sentence, p)
         sentence = ""
       else
         sentence = p
       end
     }
   end

   def generate_sentence
     res = @beginnings[rand(@beginnings.size)]
     loop {
       unless nw = next_word_for(res[-order, order])
         return res[0..-2].join(" ") + res.last
       end
       res << nw
     }
   end

   private

   def add_sentence(str, terminator)
     words = str.scan(/[\w']+/)
     return unless words.size > order # ignore short sentences
     words << terminator
     buf = []
     words.each { |w|
       buf << w
       if buf.size == order + 1
         (@freq[buf[0..-2]] ||= []) << buf[-1]
         buf.shift
       end
     }
     @beginnings << words[0, order]
   end

   def next_word_for(words)
     arr = @freq[words]
     arr && arr[rand(arr.size)]
   end

end


if $0 == __FILE__
   order = 2
   n = 10
   while ARGV[0] =~ /\A-([on])([1-9]\d*)\z/
     if $1 == "o"
       order = Integer($2)
     else
       n = Integer($2)
     end
     ARGV.shift
   end
   mc = MarkovChainer.new(order)
   mc.add_text(ARGF.read)
   n.times {
     puts mc.generate_sentence
   }
end