I changed the bool data type (which in this program is really an int) to
char, and that both shrank the program and sped it up:

real    0m1.885s
user    0m1.649s
sys     0m0.168s

I think this is because the topmost loop, when processing the top row,
basically just zooms down a given row of the compared array. With four
chars packed into the space one bool used to occupy, more of the tests
will use data that is already in the CPU's cache.