2009/10/15 George George <george.githinji / gmail.com>: > >> I'm not sure though about the sanity aspect of this. ¨Βεοςηε¬ γαωο>> disclose more details about the nature of your matching? ¨Βαωβε >> there's a better and / or more efficient solution. >> >> Kind regards >> >> robert > > Thank you so much for all the replies and suggestions! > > ¨Βοβεςτ¬ μενε δισγμοσε σονε νοςε ιξζοςνατιοαβουτθξατυςε οζ τθ> problem. Actually I have Protein ¨Βερυεξγεσ ¨στςιξησ οζ φαςιαβμε μεξητθ > composed of a 20 letter alphabet) for example "CAARGNDLYSKNIG" can be > considered as a protein sequence. basically it's just a string. > > In my case each of the strings that i have may fall into 2 distinct > classes or groups depending on whether they match a collection of ¨Ββουτ > 400 distict patterns in the first instance or another 200 or so patterns > in the second instance. I do not have information on which is the most > common pattern. > > ¨Βλξοχ χθιγοζ τθτχηςουπσ τθατ εαγθ οζ νω σερυεξγεζαμμ ιξτο> resulted in to writing the code that i had presented ealier but realised > that it may be inferior, inefficient or buggy and would like to improve > it. > > I ¨Βοπτθισ θεμπτο δεζιξε τθπςοβμεν® > > In a nutshell; Given a set of strings A, each of which may belong to ¨Β > group and where a group is characterised by a huge set of patterns(to be > matched against), create a method that classifies each of the strings > depending on which pattern the string contains. Looks like this is tricky to become right. There seem to be some people around that do bioinformatics, maybe some of them do have a solution already. Things you could do off the top of my head: since it seems your patterns are only strings you could optimize the pattern by building a trie from it and then creating a RX from that. I've done this before (see below). Advantage is that your regular expression gets smaller and matching becomes more efficient (because of eliminated backtracking for NDA based RX engines). http://en.wikipedia.org/wiki/Trie There might be other options using one of the string matching algorithms around (boyer moore and the like). Kind regards robert def treeify(strings) root = TextTreeNode.new strings.each {|s| root.put s} root.to_s end TextTreeNode = Struct.new :children, :char, :term do def initialize(chr = nil) self.char = chr self.children = Hash.new {|h,k| h[k] = self.class.new(k)} end def put(s) node = self s.each_byte do |char| node = node.children[char] end node.term = true self end def to_s dump("") end def dump(out) out << Regexp.escape(char.chr) if char ch = children.values unless ch.empty? first = ch.shift bracket = term || !ch.empty? out << "(?:" if bracket first.dump(out) ch.each do |v| out << "|" v.dump(out) end out << ")" if bracket out << "?" if term end out end end puts treeify %w{foo fort bar baz battery} -- remember.guy do |as, often| as.you_can - without end http://blog.rubybestpractices.com/