Jan Arts' approach to the problem, forwarded from the bioruby list:
Hey George,

So if I understand correctly, you've got a huge number of amino acid 
sequences (how many?) and about 400 regular expressions. For each of 
the amino acid sequences: if it matches at least one of the regular 
expressions it is put in box A, and if it matches none of the regexps 
it goes into box B. Correct?

It just happens that something very similar was the subject of a talk 
by Jim Tisdall (of Beginning Perl for Bioinformatics fame) at the 
bioinformatics course we're teaching at the moment :-)

First thing: avoid nested loops. You don't want to loop over all 
regexps for each AA sequence, or the other way around.

Are all regexps of the same length? It would be nice if they are, but 
it's not critical. My approach would be to go over the data just once. 
So suppose all the regexps are of the same length.

A. Prepare your data:
  a. "Decode" the regexps into literal strings: e.g. /A[BC]D/ becomes 
"ABD" and "ACD".
  b. Create a hash that contains all those things as keys.
  c. Concatenate all AA sequences together, joined with a non-AA, let's 
say a semicolon ";". E.g. CAARGNDLYSKNIG;GGARGNDLYSKNIG;KKARGNDLYSKNIG
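
The preparation steps above could look roughly like this in Ruby. This 
is only a sketch: expand_pattern is a hypothetical helper, and it 
assumes the regexps contain nothing but uppercase letters and simple 
character classes like [BC] (no quantifiers, ranges or negation). The 
two example patterns are made up for illustration.

```ruby
# Hypothetical helper: expand a pattern made only of literal letters and
# simple character classes (e.g. "A[BC]D") into every string it matches.
def expand_pattern(source)
  # "A[BC]D" -> ["A", "[BC]", "D"]
  tokens = source.scan(/\[[A-Z]+\]|[A-Z]/)
  return [] if tokens.empty?

  # ["A", "[BC]", "D"] -> [["A"], ["B", "C"], ["D"]]
  choices = tokens.map { |t| t.gsub(/[\[\]]/, '').chars }

  # Cartesian product of the alternatives, joined back into strings
  first, *rest = choices
  first.product(*rest).map(&:join)
end

# a. Decode the regexps into literal strings (example patterns are assumed)
patterns = [/A[BC]D/, /K[KR]A/]
literals = patterns.flat_map { |re| expand_pattern(re.source) }

# b. Hash with the literal strings as keys, for fast lookup
lookup = {}
literals.each { |s| lookup[s] = true }

# c. Concatenate all AA sequences, joined with a non-AA character
sequences = %w[CAARGNDLYSKNIG GGARGNDLYSKNIG KKARGNDLYSKNIG]
joined = sequences.join(';')
```

Note that a pattern with many character classes expands into a lot of 
literal strings (the product of the class sizes), so it is worth 
checking how big the hash gets for your 400 regexps.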

B. Do the actual search
  a. If the length of the strings to match (what used to be the regexps, 
and are now the keys in the hash) is 5: take the first 5 characters of 
your concatenated AA string and check if that substring exists as a key 
in the hash. If so: you know that the AA sequence between the 
surrounding ";" characters should go in box A.
  b. Advance 1 position: take AAs 2 to 6.
  c. Go back to a.
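
A minimal Ruby sketch of that search loop, assuming all keys in the 
hash have the same length. The method name sort_into_boxes and the two 
example keys are made up for illustration. A window that straddles a 
";" can never be a key, so no special handling is needed for it; the 
code only counts separators to know which sequence it is in.

```ruby
# Slide a fixed-width window over the concatenated string once and sort
# each sequence into box A (some window is a key) or box B (none is).
def sort_into_boxes(sequences, lookup, width, separator = ';')
  joined = sequences.join(separator)
  hit = Array.new(sequences.size, false)
  seq_index = 0 # index of the sequence the window starts in

  (0..joined.length - width).each do |pos|
    if joined[pos] == separator
      seq_index += 1 # window starts on a separator: next sequence begins
      next
    end
    hit[seq_index] = true if lookup.key?(joined[pos, width])
  end

  box_a = []
  box_b = []
  sequences.each_with_index { |seq, i| (hit[i] ? box_a : box_b) << seq }
  [box_a, box_b]
end

# Example keys are assumptions, chosen to match the first and third sequence
lookup = { 'CAARG' => true, 'KKARG' => true }
sequences = %w[CAARGNDLYSKNIG GGARGNDLYSKNIG KKARGNDLYSKNIG]
box_a, box_b = sort_into_boxes(sequences, lookup, 5)
```

If the regexps turn out to have different lengths, one option is to run 
this once per distinct key length, each time with a hash holding only 
the keys of that length.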

You might have to tweak this approach to exactly fit your requirements, 
but if your code used to take a very long time, this might speed things 
up immensely.

(George: can you forward this to the ruby mailing list it was discussed 
on initially? Cheers)

Good luck,
jan.
-- 
Posted via http://www.ruby-forum.com/.