"Travis Whitton" <whitton / atlantic.net> schrieb im Newsbeitrag news:BhZ4a.10243$Mr5.3967 / fe06.atl2.webusenet.com... > > 1) Are the tokens strings? > > Yes, my program goes through two files. One consists of only non-spam > messages, and the other is only spam messages. It goes through each file > line by line and divides each line into tokens of interesting data. Here are > the relevant portions of my tokenizer method. > > def tokenizer(fh) > hash = Hash.new(0) > ipaddr = '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' Maybe you can improve performance by changing this to: ipaddr = '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' or ipaddr = '[0-9]{1,3}(\.[0-9]{1,3}){3}' This should make the regexp fail faster for longer sequences of digits. Just a guess, but maybe worth trying. Regards robert > token = "[A-Za-z$][A-Za-z0-9$'.-]+[A-Za-z0-9$]" > iptok = Regexp.compile("#{token}|#{ipaddr}") > > fh.each do |data| > data.chomp! > # do a number of string substitutions which use negligible amounts of time > data.scan(iptok).each do |tok| > hash[tok] = hash[tok].succ > end > end > hash > end > > The messages are standard unix messages(mbox format?) like so: > <message> > From MAILER-DAEMON Mon Sep 23 22:32:37 2002 > Date: 23 Sep 2002 22:32:37 -0400 > From: Mail System Internal Data <MAILER-DAEMON / grub.ath.cx> > Subject: DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA > Message-ID: <1032834757 / grub.atlantic.net> > X-IMAP: 1032834509 0000000272 > Status: RO > > This text is part of the internal format of your mail folder, and is not > a real message. It is created automatically by the mail system software. > If deleted, important folder data will be lost, and it will be re-created > with the data reset to initial values. > </message> > > > 2) Are you running Linux? > > Yes, and I only intend for this program to run under unix based systems. > > > You might be able to use glib hashes to do this in C and then translate > > those to Ruby hashes. Just a thought. 
> > I'm putting together some code to
> > see if I can figure it out.
>
> Thanks very much for your help. I sincerely appreciate it! As a side note,
> the program is already on the RAA:
>
> http://raa.ruby-lang.org/list.rhtml?name=bsproc
>
> So, you can grok through the code if you would like to; however, it's not
> exactly the same as the development version. As a second side note, although
> it is slow to create the probability database, the program is _extremely_
> good at filtering spam. A very small minority of spam messages make their way
> into my inbox, and it feels damned good to have written my own spam filter.
> Also, I've tried strscan in place of scan, and the speedup wasn't significant.
> As it turns out, most of the calculation time is spent doing hash lookups. I've
> considered RJudy, but apparently, its hashes are slower than Ruby native
> hashes... go figure.
>
> Thanks much,
> Travis
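
For anyone who wants to try the suggestion, here is a rough sketch that runs the quoted tokenizer loop with the different ipaddr patterns. The names (count_tokens, IP_PLUS, IP_BOUNDED, IP_CAPTURE, SAMPLE) and the sample lines are made up for illustration and are not taken from bsproc. Whether the bounded pattern is actually faster depends on the Ruby version and its regexp engine, so the timing at the end is something to measure, not a promised result.

```ruby
require 'benchmark'

# Illustrative names only; none of these come from the bsproc source.
IP_PLUS    = '[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+'  # pattern from the quoted tokenizer
IP_BOUNDED = '[0-9]{1,3}(?:\.[0-9]{1,3}){3}'   # bounded repeats, non-capturing group
IP_CAPTURE = '[0-9]{1,3}(\.[0-9]{1,3}){3}'     # same, but with a capturing group
TOKEN      = "[A-Za-z$][A-Za-z0-9$'.-]+[A-Za-z0-9$]"

# Count tokens per line, following the loop in the quoted tokenizer.
def count_tokens(lines, ipaddr)
  re = Regexp.compile("#{TOKEN}|#{ipaddr}")
  counts = Hash.new(0)
  lines.each do |line|
    line.scan(re) { |tok| counts[tok] += 1 }
  end
  counts
end

SAMPLE = [
  "Received: from mail.example.com [192.168.0.1]",
  "relay 192.168.0.1 ok"
]

# Whole-match tokens as string keys.
p count_tokens(SAMPLE, IP_BOUNDED)

# With a capturing group, scan yields arrays of group contents,
# so the keys are arrays rather than the matched tokens.
p count_tokens(SAMPLE, IP_CAPTURE)

# Rough timing on long digit runs without dots, the case the
# bounded pattern is meant to reject sooner.
digits = ["1234567890" * 20] * 200
[IP_PLUS, IP_BOUNDED].each do |ip|
  t = Benchmark.realtime { count_tokens(digits, ip) }
  puts format("%-35s %.4fs", ip, t)
end
```

One thing to keep in mind: the bounded pattern is not anchored, so in a run like "1234.5.6.7" it can still match starting inside the digits. If that matters, a word boundary around the pattern would be the next thing to try.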