On Tue, Apr 19, 2011 at 12:30 PM, Martin Hansen <mail / maasha.dk> wrote:
>>> =A0def match?(char1, char2)
>>> =A0 =A0(EQUAL[char1.upcase.ord] & EQUAL[char2.upcase.ord]) !=3D 0
>>> =A0end
>>
>> That would be really easy to write tests and a benchmark against and
>> then rewrite in C using RubyInline... without tests and a benchmark tho,
>> you won't know that you've done it correctly and provides a measurable
>> benefit.
>
> They bottlenecks are the iteration over the sequence (while loop at line
> 68) and the vector_update (line 120). I am a bit surprised that match?
> is that slow - I expect it to be almost instantaneous in C. I would like
> to test that with a benchmark and Inline C.

Frankly, I find your code has a design issue: it seems you mix data
and iteration in a single class.  This is visible from how #match
works

def match(pattern, pos =3D 0, max_edit_distance =3D 0)
    @pattern           =3D pattern
    @pos               =3D pos
    @max_edit_distance =3D max_edit_distance
    @vector            =3D vector_init

...

IMHO it would be better to separate representation of the sequence and
the matching process.  The matcher then would only carry a reference
to the sequence and all the data it needs to do matching.

Also #vector_update creates a lot of objects and does so for each
position in the sequence.  That's likely where you can improve things.

I am not sure what the matching algorithm is exactly.  Can you summarize it=
?

Ah, and one thing: if you add lowercase entries

EQUAL['a'.ord] =3D BIT_A
EQUAL['t'.ord] =3D BIT_T
EQUAL['u'.ord] =3D BIT_T
EQUAL['c'.ord] =3D BIT_C
EQUAL['g'.ord] =3D BIT_G
EQUAL['m'.ord] =3D (BIT_A|BIT_C)
EQUAL['r'.ord] =3D (BIT_A|BIT_G)
EQUAL['w'.ord] =3D (BIT_A|BIT_T)
EQUAL['s'.ord] =3D (BIT_C|BIT_G)
EQUAL['y'.ord] =3D (BIT_C|BIT_T)
EQUAL['k'.ord] =3D (BIT_G|BIT_T)
EQUAL['b'.ord] =3D (BIT_C|BIT_G|BIT_T)
EQUAL['d'.ord] =3D (BIT_A|BIT_G|BIT_T)
EQUAL['h'.ord] =3D (BIT_A|BIT_C|BIT_T)
EQUAL['v'.ord] =3D (BIT_A|BIT_C|BIT_G)
EQUAL['n'.ord] =3D (BIT_A|BIT_C|BIT_G|BIT_T)

you can make matching simpler

def match?(char1, char2)
    (EQUAL[char1.ord] & EQUAL[char2.ord]) !=3D 0
end

You might as well consider changing the Array into a Hash.  Then you
can even get rid of the #ord call.

A different approach would just use regular expressions, e.g.

MAP =3D {
  'a' =3D> 1,
  't' =3D> 2,
  'w' =3D> '[12]',
...
}

seq =3D Array.new(20) { %w{1 2 4 8}.sample }
pat =3D "acgw"
rx =3D Regexp.new(pat.chars.map {|c| MAP[c.downcase]}.join)

seq.scan pat do |m|
  p m
end

Cheers

robert


--=20
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/