I wrote a C version of your code which uses glib. I'm attaching it as
Filter.c, as well as the extconf.rb. If you have glib, you can just do:
ruby extconf.rb
make
This creates Filter.so. There seems to be a performance improvement, but
it's not as large as I might have hoped. Here are 3 tests:
<----- t1.rb ---- Your original code ------------------->
5_000.times do
h = Hash.new(0)
%w{a b c a g d e f a g h g i j e e k l m g d d a b c}.each do |tok|
h[tok] += 1
end
end
<----- t2.rb ---- Replaced Array#each by a for loop. --->
5_000.times do
h = Hash.new(0)
for tok in %w{a b c a g d e f a g h g i j e e k l m g d d a b c}
h[tok] += 1
end
end
<----- t3.rb ---- Uses the C version. ------------------->
require "Filter"
5_000.times do
h = Filter.process %w{a b c a g d e f a g h g i j e e k l m g d d a b c}
end
<----------------------------->
And here are some results:
dcarrera $ time ruby t1.rb
real 0m3.205s
user 0m3.130s
sys 0m0.010s
dcarrera $ time ruby t2.rb
real 0m3.006s
user 0m2.860s
sys 0m0.030s
dcarrera $ time ruby t3.rb
real 0m1.677s
user 0m1.490s
sys 0m0.090s
Who knows, this might behave better with your data.
On Thu, Feb 20, 2003 at 07:53:14AM +0900, Travis Whitton wrote:
> Hello - A friend and I have been working on a Ruby implementation of a
> bayesian spam filter as described in Paul Graham's, Plan For Spam.
> It's fully functional, but I've been trying to squeeze more performance out
> of it as it's quite slow ATM(15 minutes to run across 20 megs of email).
> By using profile and rbprof, I've determined that our tokenizer method is
> the main source of slowness. After some careful benchmarking, I've found this
> to be the problem.
>
> # This is ran about a million times
> h = Hash.new(0)
> data.scan(iptok).each do |tok|
> h[tok] += 1
> end
>
> At first, I though I could do something like this:
>
> # This is ran about a million times
> h = Hash.new(0)
> data.scan(iptok).each do |tok|
> h[tok].succ
> end
>
> But I realized that it doesn't modify the final value; however, I did notice
> that it ran twice as fast than the former example. So, my question is, does the
> assignment in the first example really have that much overhead? If so, is there
> any way to do the first example using Inline::C or something similar?
>
> Thanks in advance,
> Travis
>
>
--
Daniel Carrera
Graduate Teaching Assistant. Math Dept.
University of Maryland. (301) 405-5137