On 7/18/07, Marc Hoeppner <marc.hoeppner / molbio.su.se> wrote:
> Hi,
>
> I have yet another question about how to write a specific text parser in
> ruby...
> So, without further ado - this is what the source file looks like:
>
> Query= gi|23510597|emb|CAD48982.1| ring-infected erythrocyte surface
> antigen precursor [Plasmodium falciparum 3D7]
>          (1085 letters)
>
> Database: KOG
>            112,920 sequences; 47,500,486 total letters
>
> Searching..................................................done
>
>
>
>                                                  Score    E
> Sequences producing significant alignments:     (bits)  Value
>
> At2g21510                                          96   3e-19
> At4g39150                                          95   1e-18
> At1g76700
>
> and so on...
>
> What I want to do is the following:
> Read the source file - and if a line starts with "Query=", strip
> everything from the line but the expression "gi|xxxxx". That part was no
> problem with gsub, mind you. But, now the tricky thing (or not, I
> guess...).
> Go from there until you find a line starting with "Sequence", skip this
> line and the following and puts the third line together with the
> "gi|xxxxx"
> So from the above example it would look like this:
>
> gi|23510597 At2g21510
>
> No, ideally I wouldnt have to include this skip-lines part, but I cant
> find a regexp, that lets me reliably identify the first line of the
> results block (not all possible results start with At...).
>
> How I tried to do it:
>
> def stripname line
>   s = line.gsub(/Query=/, '')
>   u = s.gsub(/\|emb.*/, '')
> end
>
> count = 0 # initializing variables
> t = nil
> v = nil
>
> ARGF.each do |l|
>
>   puts l unless count.zero?
>   count = [0, count-1].max
>
>   if l.match(/^Query=/)
>     t = stripname l
>   elsif l.match(/^Sequences/)
>     l = $1
>     count = 2
>     puts "#{t}#{l}"
>   else
>   end
> end
>
> But the output looks terrible:
> gi|23510597
>
> At2g21510 96 3e-19
> gi|23510599
>
> At5g14980 58 3e-08
> gi|23510600
>
> And no matter what I try, I cant get the gi|xxxx and the corresponding
> "best hit" in the same line.

It is a terrible thing, and it happens to me all the time: one tends to
forget these \n's. Every line you read from ARGF still ends with its
newline, so the string returned by stripname does too. Fortunately we
have #chomp, but maybe you want to use #strip, which removes trailing
(and leading) whitespace, \n included.

HTH
Robert

> Tried it with hashes, but frankly dont know
> enough about those yet.
> So If anyone has a helpful comment or solution, I would be extremely
> grateful!
>
> Cheers,
>
> Marc
>
> --
> Posted via http://www.ruby-forum.com/.
>

--
I always knew that one day Smalltalk would replace Java.
I just didn't know it would be called Ruby
-- Kent Beck
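
P.S. For what it is worth, here is a minimal sketch of how the whole thing could look once the newlines are stripped. It is only one possible approach, not a definitive solution. The method name best_hits and the in_hits flag are made up for illustration, and the sketch assumes that the first non-blank line after the "Sequences" header is always the best hit. Note also that in the quoted script `l = $1` ends up nil, because /^Sequences/ has no capture group, so the flag below replaces the line counting.

```ruby
# Sketch only: pair each query's "gi|xxxxx" id with the first hit that
# follows the "Sequences producing significant alignments" header.
# best_hits is a hypothetical helper name, not part of the original script.
def best_hits(lines)
  results = []
  gi = nil         # id taken from the most recent "Query=" line
  in_hits = false  # true once the "Sequences" header has been seen

  lines.each do |line|
    line = line.strip                  # removes the trailing "\n" as well
    if line.start_with?("Query=")
      gi = line[/gi\|\d+/]             # keep only the "gi|xxxxx" part
      in_hits = false
    elsif line.start_with?("Sequences")
      in_hits = true
    elsif in_hits && !line.empty?
      # Assumption: the first non-blank line after the header is the best hit.
      results << "#{gi} #{line.split.first}"
      in_hits = false
    end
  end

  results
end
```

To run it over a file given on the command line, something like `puts best_hits(ARGF.each_line)` should do. Because the blank line after the header is skipped by the emptiness check rather than by counting, the sketch does not care how many blank lines follow the header.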