I wrote a C version of your code which uses glib.  I'm attaching it as 
Filter.c, as well as the extconf.rb.  If you have glib, you can just do:

ruby extconf.rb
make

This creates Filter.so.  There seems to be a performance improvement, but 
it's not as large as I might have hoped.  Here are 3 tests:

<----- t1.rb  ---- Your original code ------------------->
5_000.times do
  h = Hash.new(0)
  %w{a b c a g d e f a g h g i j e e k l m g d d a b c}.each do |tok|
    h[tok] += 1
  end
end
<----- t2.rb  ---- Replaced Array#each by a for loop. --->
5_000.times do
  h = Hash.new(0)
  for tok in %w{a b c a g d e f a g h g i j e e k l m g d d a b c}
    h[tok] += 1
  end
end
<----- t3.rb ---- Uses the C version. ------------------->
require "Filter"

5_000.times do
  h = Filter.process %w{a b c a g d e f a g h g i j e e k l m g d d a b c}
end
<----------------------------->
And here are some results:

dcarrera $ time ruby t1.rb 

real    0m3.205s
user    0m3.130s
sys     0m0.010s
dcarrera $ time ruby t2.rb 

real    0m3.006s
user    0m2.860s
sys     0m0.030s
dcarrera $ time ruby t3.rb 

real    0m1.677s
user    0m1.490s
sys     0m0.090s

Who knows, this might behave better with your data.

On Thu, Feb 20, 2003 at 07:53:14AM +0900, Travis Whitton wrote:
> Hello - A friend and I have been working on a Ruby implementation of a
> bayesian spam filter as described in Paul Graham's, Plan For Spam.
> It's fully functional, but I've been trying to squeeze more performance out
> of it as it's quite slow ATM(15 minutes to run across 20 megs of email).
> By using profile and rbprof, I've determined that our tokenizer method is
> the main source of slowness. After some careful benchmarking, I've found this
> to be the problem.
> 
> # This is ran about a million times
> h = Hash.new(0)
> data.scan(iptok).each do |tok|
>   h[tok] += 1
> end
> 
> At first, I though I could do something like this:
> 
> # This is ran about a million times
> h = Hash.new(0)
> data.scan(iptok).each do |tok|
>   h[tok].succ
> end
> 
> But I realized that it doesn't modify the final value; however, I did notice
> that it ran twice as fast than the former example. So, my question is, does the
> assignment in the first example really have that much overhead? If so, is there
> any way to do the first example using Inline::C or something similar?
> 
> Thanks in advance,
> Travis
> 
> 

-- 
Daniel Carrera
Graduate Teaching Assistant.  Math Dept.
University of Maryland.  (301) 405-5137