On 7/18/07, Marc Hoeppner <marc.hoeppner / molbio.su.se> wrote:
> Hi,
>
> I have yet another question about how to write a specific text parser in
> ruby...
> So, without further ado - this is what the source file looks like:
>
> Query= gi|23510597|emb|CAD48982.1| ring-infected erythrocyte surface
> antigen precursor [Plasmodium falciparum 3D7]
>          (1085 letters)
>
> Database: KOG
>            112,920 sequences; 47,500,486 total letters
>
> Searching..................................................done
>
>
>
>                                                        Score     E
> Sequences producing significant alignments:            (bits)  Value
>
> At2g21510                                                96    3e-19
> At4g39150                                                95    1e-18
> At1g76700
>
> and so on...
>
> What I want to do is the following:
> Read the source file - and if a line starts with "Query=", strip
> everything from the line but the expression "gi|xxxxx". That part was no
> problem with gsub, mind you. But, now the tricky thing (or not, I
> guess...).
> Go from there until you find a line starting with "Sequence", skip this
> line and the following and puts the third line together with the
> "gi|xxxxx"
> So from the above example it would look like this:
>
> gi|23510597 At2g21510
>
> Now, ideally I wouldn't have to include this skip-lines part, but I can't
> find a regexp that lets me reliably identify the first line of the
> results block (not all possible results start with At...).
>
> How I tried to do it:
>
> def stripname line
>   s = line.gsub(/Query=/, '')
>   u = s.gsub(/\|emb.*/, '')
> end
>
> count = 0 # initializing variables
> t = nil
> v = nil
>
> ARGF.each do |l|
>
>   puts l unless count.zero?
>   count = [0, count-1].max
>
>   if l.match(/^Query=/)
>     t = stripname l
>   elsif l.match(/^Sequences/)
>     l = $1
>     count = 2
>     puts "#{t}#{l}"
>   else
>   end
> end
>
> But the output looks terrible:
> gi|23510597
>
> At2g21510
> 96   3e-19
>  gi|23510599
>
> At5g14980
> 58   3e-08
>  gi|23510600
>
> And no matter what I try, I can't get the gi|xxxx and the corresponding
> "best hit" on the same line.
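It is a terrible thing, and it happens to me all the time: one tends to
forget these \n's. Every line that ARGF.each hands you still ends in "\n",
and since "." in a regexp does not match a newline, your gsub leaves it at
the end of t.
Fortunately we have #chomp, but maybe you want to use #strip, which
removes trailing (and leading) whitespace, \n included.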
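
Here is a minimal sketch of how that could look. It is only an
illustration: the names best_hits and SAMPLE are made up for this mail,
the sample text is trimmed from your example, and I am assuming that the
first non-blank line after the "Sequences producing" header is always
the best hit, which would also spare you the line counting.

```ruby
# Hypothetical helper (not from the original post): pairs each "gi|xxxxx"
# id with the first hit listed after the "Sequences producing" header.
def best_hits(lines)
  results = []
  query = nil
  awaiting_hit = false

  lines.each do |raw|
    line = raw.strip                      # drops the trailing "\n" and any padding
    if line.start_with?('Query=')
      query = line[/gi\|\d+/]             # keep only the "gi|xxxxx" part
      awaiting_hit = false
    elsif line.start_with?('Sequences producing')
      awaiting_hit = true                 # the hit list starts after this header
    elsif awaiting_hit && !line.empty?
      # Assumption: the first non-blank line after the header is the best hit.
      results << "#{query} #{line.split.first}"
      awaiting_hit = false
    end
  end
  results
end

# Sample input trimmed from the question.
SAMPLE = <<~TEXT
  Query= gi|23510597|emb|CAD48982.1| ring-infected erythrocyte surface
  antigen precursor [Plasmodium falciparum 3D7]
           (1085 letters)

                                                         Score     E
  Sequences producing significant alignments:            (bits)  Value

  At2g21510                                                96    3e-19
  At4g39150                                                95    1e-18
TEXT

puts best_hits(SAMPLE.each_line)
```

For the sample this prints "gi|23510597 At2g21510". To run it on a real
file you could pass ARGF instead of SAMPLE.each_line.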

HTH
Robert
> Tried it with hashes, but frankly don't know
> enough about those yet.
> So if anyone has a helpful comment or solution, I would be extremely
> grateful!
>
> Cheers,
>
> Marc


-- 
I always knew that one day Smalltalk would replace Java.
I just didn't know it would be called Ruby
-- Kent Beck