WJ wrote in post #993576:
> Martin Hansen wrote:
>
>> The below code is too slow for practical use. I need it to run at least
>> 20 times faster. Perhaps that is possible with some C code? I have no
>> experience with writing Ruby extensions. What are the pitfalls? Which
>> part of the code should be ported? Any pointers to get me started?
>
> Please give a clear description of the algorithm, and then
> give some sample input and output.


Here is a working version of the code that can be profiled (though it 
will take forever with 20M iterations):

http://pastie.org/1808127

The slow part according to profiler is:

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 29.39     2.66      2.66     1521     1.75    11.78  Range#each
 15.80     4.09      1.43    33000     0.04     0.06  Seq#match?
 10.72     5.06      0.97    78940     0.01     0.03  Kernel.dup
  9.28     5.90      0.84    78940     0.01     0.01 
Kernel.initialize_dup
  6.63     6.50      0.60   142380     0.00     0.00 
Seq::Score#edit_distance
  5.30     6.98      0.48    22220     0.02     0.03  Seq#deletion?
  3.54     7.30      0.32    66016     0.00     0.00  String#ord
  3.43     7.61      0.31    14680     0.02     0.04  Seq#mismatch?
  3.31     7.91      0.30     8300     0.04     0.05  Seq#insertion?


The input is DNA sequences. Basically strings of ATCG and Ns of length 
50-100. These comes in files with 20M-30M sequences per file. I've got 
~50 of these files and more incoming. The output will be truncated 
sequences based on the match position located with this bit of code.

The algorithm is of the dynamic programming flavor and was inspired by 
the paper by Bruno Woltzenlogel Paleo (page 197):

http://www.logic.at/people/bruno/Papers/2007-GATE-ESSLLI.pdf

Locating variable length matches is tricky!



Cheers,

Martin

-- 
Posted via http://www.ruby-forum.com/.