2009/10/15 George George <george.githinji / gmail.com>:
>
>> I'm not sure though about the sanity aspect of this.   >> disclose more details about the nature of your matching?  
>> there's a better and / or more efficient solution.
>>
>> Kind regards
>>
>> robert
>
> Thank you so much for all the replies and suggestions!
>
>         > problem. Actually I have Protein      
> composed of a 20 letter alphabet) for example "CAARGNDLYSKNIG" can be
> considered as a protein sequence. basically it's just a string.
>
> In my case each of the strings that i have may fall into 2 distinct
> classes or groups depending on whether they match a collection of  
> 400 distict patterns in the first instance or another 200 or so patterns
> in the second instance. I do not have information on which is the most
> common pattern.
>
>          > resulted in to writing the code that i had presented ealier but realised
> that it may be inferior, inefficient or buggy and would like to improve
> it.
>
> I     
>
> In a nutshell; Given a set of strings A, each of which may belong to  
> group and where a group is characterised by a huge set of patterns(to be
> matched against), create a method that classifies each of the strings
> depending on which pattern the string contains.

Looks like this is tricky to become right.  There seem to be some
people around that do bioinformatics, maybe some of them do have a
solution already.

Things you could do off the top of my head: since it seems your
patterns are only strings you could optimize the pattern by building a
trie from it and then creating a RX from that.  I've done this before
(see below).  Advantage is that your regular expression gets smaller
and matching becomes more efficient (because of eliminated
backtracking for NDA based RX engines).

http://en.wikipedia.org/wiki/Trie

There might be other options using one of the string matching
algorithms around (boyer moore and the like).

Kind regards

robert


def treeify(strings)
  root = TextTreeNode.new
  strings.each {|s| root.put s}
  root.to_s
end

TextTreeNode = Struct.new :children, :char, :term do
  def initialize(chr = nil)
    self.char = chr
    self.children = Hash.new {|h,k| h[k] = self.class.new(k)}
  end

  def put(s)
    node = self
    s.each_byte do |char|
      node = node.children[char]
    end
    node.term = true
    self
  end

  def to_s
    dump("")
  end

  def dump(out)
    out << Regexp.escape(char.chr) if char
    ch = children.values

    unless ch.empty?
      first = ch.shift
      bracket = term || !ch.empty?

      out << "(?:" if bracket
      first.dump(out)

      ch.each do |v|
        out << "|"
        v.dump(out)
      end

      out << ")" if bracket
      out << "?" if term
    end

    out
  end

end

puts treeify %w{foo fort bar baz battery}



-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/