"Travis Whitton" <whitton / atlantic.net> wrote in news post
news:BhZ4a.10243$Mr5.3967 / fe06.atl2.webusenet.com...
> > 1) Are the tokens strings?
>
> Yes, my program goes through two files. One consists of only non-spam
> messages, and the other is only spam messages. It goes through each file
> line by line and divides each line into tokens of interesting data. Here are
> the relevant portions of my tokenizer method.
>
> def tokenizer(fh)
>   hash   = Hash.new(0)
>   ipaddr = '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+'

Maybe you can improve performance by changing this to:

ipaddr = '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}'

or

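ipaddr = '[0-9]{1,3}(?:\.[0-9]{1,3}){3}'

(The group is non-capturing because String#scan returns the captured groups
instead of the whole match when the pattern contains a capturing group.)

This should make the regexp fail faster for longer sequences of digits.
Just a guess, but maybe worth trying.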
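
A rough way to try this out is sketched below. The sample line and the iteration count are made up for illustration, and the timings will vary with the Ruby version and the real data, so treat it as a starting point rather than a result. It also shows why a grouped pattern needs care with scan: a capturing group such as (\.[0-9]{1,3}){3} makes scan return the captures rather than the matched address.

```ruby
require 'benchmark'

# The original pattern and the two bounded variants discussed above.
patterns = {
  'loose'   => /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/,
  'bounded' => /[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}/,
  'grouped' => /[0-9]{1,3}(?:\.[0-9]{1,3}){3}/
}

# Made-up sample line with one address and a few long digit runs.
line = 'Received: from host (10.0.0.1) id 1032834757 1032834509 0000000272'

# Collect what each pattern finds in the sample line.
found = patterns.transform_values { |re| line.scan(re) }
p found

# With a capturing group, scan yields the captures, not the whole match.
capturing = line.scan(/[0-9]{1,3}(\.[0-9]{1,3}){3}/)
p capturing

n = 20_000 # arbitrary iteration count
Benchmark.bm(8) do |bm|
  patterns.each do |name, re|
    bm.report(name) { n.times { line.scan(re) } }
  end
end
```

Running this against a few thousand lines of a real mailbox would give a more honest picture than a single synthetic line.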

Regards

    robert

>   token  = "[A-Za-z$][A-Za-z0-9$'.-]+[A-Za-z0-9$]"
>   iptok  = Regexp.compile("#{token}|#{ipaddr}")
>
>   fh.each do |data|
>     data.chomp!
>     # do a number of string substitutions which use negligible amounts of time
>     data.scan(iptok).each do |tok|
>       hash[tok] = hash[tok].succ
>     end
>   end
>   hash
> end
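
For anyone who wants to run the quoted tokenizer on its own, here is a minimal self-contained sketch. It is not the bsproc code: the string substitutions are left out, the bounded non-capturing IP pattern is swapped in for the original one, the counter uses += 1 instead of succ, and the three-line sample input is invented for the example.

```ruby
require 'stringio'

# Count tokens per line of an IO-like object (File, StringIO, ...).
def tokenizer(fh)
  counts = Hash.new(0)
  ipaddr = '[0-9]{1,3}(?:\.[0-9]{1,3}){3}'
  token  = "[A-Za-z$][A-Za-z0-9$'.-]+[A-Za-z0-9$]"
  iptok  = Regexp.compile("#{token}|#{ipaddr}")

  fh.each_line do |data|
    data.chomp!
    # No capturing groups in iptok, so scan yields each match as a String.
    data.scan(iptok) { |tok| counts[tok] += 1 }
  end
  counts
end

# Invented sample input, loosely modelled on the mbox headers in the thread.
sample = StringIO.new(<<~MBOX)
  From MAILER-DAEMON Mon Sep 23 22:32:37 2002
  Subject: DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA
  Received: from 10.0.0.1 by 10.0.0.1
MBOX

counts = tokenizer(sample)
p counts
```

Note that the token pattern needs at least three characters, so short words such as "by" are never counted, and matching is case-sensitive, so "From" and "from" are separate keys.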
>
> The messages are standard Unix messages (mbox format?) like so:
> <message>
> From MAILER-DAEMON Mon Sep 23 22:32:37 2002
> Date: 23 Sep 2002 22:32:37 -0400
> From: Mail System Internal Data <MAILER-DAEMON / grub.ath.cx>
> Subject: DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA
> Message-ID: <1032834757 / grub.atlantic.net>
> X-IMAP: 1032834509 0000000272
> Status: RO
>
> This text is part of the internal format of your mail folder, and is not
> a real message.  It is created automatically by the mail system software.
> If deleted, important folder data will be lost, and it will be re-created
> with the data reset to initial values.
> </message>
>
> > 2) Are you running Linux?
>
> Yes, and I only intend for this program to run under Unix-based systems.
>
> > You might be able to use glib hashes to do this in C and then translate
> > those to Ruby hashes.  Just a thought.  I'm putting together some code to
> > see if I can figure it out.
>
> Thanks very much for your help. I sincerely appreciate it! As a side note,
> the program is already on the RAA:
>
> http://raa.ruby-lang.org/list.rhtml?name=bsproc
>
> So, you can grok through the code if you would like to; however, it's not
> exactly the same as the development version. As a second side note,
> although it is slow to create the probability database, the program is
> _extremely_ good at filtering spam. A very small minority of spam messages
> make their way into my inbox, and it feels damned good to have written my
> own spam filter. Also, I've tried strscan in place of scan, and the speedup
> wasn't significant. As it turns out, most of the calculation time is spent
> doing hash lookups. I've considered RJudy, but apparently, its hashes are
> slower than Ruby native hashes... go figure.
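
One way to double-check where the time goes is to time the scanning and the counting separately, roughly as sketched below. The input here is a single invented header line repeated many times, so the split it reports says little about a real mailbox; it only shows the measuring approach.

```ruby
require 'benchmark'

# Hypothetical input: one mbox-like header line repeated many times.
lines = ["Subject: DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA"] * 20_000
iptok = /[A-Za-z$][A-Za-z0-9$'.-]+[A-Za-z0-9$]|[0-9]{1,3}(?:\.[0-9]{1,3}){3}/

# Phase 1: regexp scanning only, collecting every token into one array.
tokens = nil
scan_time = Benchmark.realtime do
  tokens = lines.flat_map { |line| line.scan(iptok) }
end

# Phase 2: hash updates only, over the tokens collected above.
counts = Hash.new(0)
count_time = Benchmark.realtime do
  tokens.each { |tok| counts[tok] += 1 }
end

printf("scan:  %.3fs\ncount: %.3fs\n", scan_time, count_time)
```

If the second number dominates on real data, the hash really is the bottleneck; if not, the regexp or the substitutions deserve another look.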
>
> Thanks much,
> Travis
>