Hello - A friend and I have been working on a Ruby implementation of a
Bayesian spam filter, as described in Paul Graham's "A Plan for Spam".
It's fully functional, but I've been trying to squeeze more performance
out of it, as it's quite slow at the moment (15 minutes to run across
20 megs of email). Using profile and rbprof, I've determined that our
tokenizer method is the main source of slowness, and after some careful
benchmarking I've narrowed the problem down to this:

# This is run about a million times
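# iptok is our token pattern; h counts how often each token appears
# (Hash.new(0) makes missing counts default to 0)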
h = Hash.new(0)
data.scan(iptok).each do |tok|
  h[tok] += 1
end

At first, I thought I could do something like this:

# This is run about a million times
h = Hash.new(0)
data.scan(iptok).each do |tok|
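  # succ just returns h[tok] + 1; nothing ever stores it back in h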
  h[tok].succ
end

But I realized that it doesn't modify the final value; I did notice,
however, that it ran twice as fast as the former example. So, my
question is: does the assignment in the first example really have that
much overhead? If so, is there any way to implement the first example
using Inline::C or something similar?
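
In the meantime, one pure-Ruby thing I plan to try is the block form of
scan, which skips building the intermediate array that each iterates
over. Here's a stand-alone sketch of that comparison; the regexp and
data below are only placeholders, not our real iptok or mail corpus:

require 'benchmark'

# Placeholder pattern and data, just so this runs on its own.
iptok = /[\w'$-]+/
data  = "From: foo Subject: free offer click here now " * 20_000

Benchmark.bm(16) do |bm|
  bm.report('scan.each + +=') do
    h = Hash.new(0)
    data.scan(iptok).each { |tok| h[tok] += 1 }
  end

  bm.report('scan block + +=') do
    h = Hash.new(0)
    # scan yields each match straight to the block, so no temp array
    data.scan(iptok) { |tok| h[tok] += 1 }
  end
end

If scan is allocating an intermediate array on every call, the
allocation and GC pressure from that could account for some of the
time as well.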

Thanks in advance,
Travis