Hi,
I have yet another question about how to write a specific text parser in
Ruby…
So, without further ado - this is what the source file looks like:
Query= gi|23510597|emb|CAD48982.1| ring-infected erythrocyte surface
antigen precursor [Plasmodium falciparum 3D7]
(1085 letters)
Database: KOG
112,920 sequences; 47,500,486 total letters
Searching…done
                                                   Score     E
Sequences producing significant alignments:        (bits)  Value

At2g21510                                            96    3e-19
At4g39150                                            95    1e-18
At1g76700
and so on…
What I want to do is the following:
Read the source file - and if a line starts with “Query=”, strip
everything from the line but the expression “gi|xxxxx”. That part was no
problem with gsub, mind you. But, now the tricky thing (or not, I
guess…).
Go on from there until you find a line starting with “Sequences”, skip
that line and the one after it, and print the name at the start of the
third line together with the “gi|xxxxx”.
So from the above example it would look like this:
gi|23510597 At2g21510
Now, ideally I wouldn't have to include this skip-lines part, but I can't
find a regexp that lets me reliably identify the first line of the
results block (not all possible results start with At…).
How I tried to do it:
def stripname line
  s = line.gsub(/Query=/, '')
  u = s.gsub(/\|emb.*/, '')
end

count = 0 # initializing variables
t = nil
v = nil

ARGF.each do |l|
  puts l unless count.zero?
  count = [0, count-1].max
  if l.match(/^Query=/)
    t = stripname l
  elsif l.match(/^Sequences/)
    l = $1
    count = 2
    puts "#{t}#{l}"
  else
  end
end
But the output looks terrible:
gi|23510597
At2g21510      96   3e-19
gi|23510599
At5g14980      58   3e-08
gi|23510600
And no matter what I try, I can't get the gi|xxxx and the corresponding
“best hit” onto the same line. I tried it with hashes, but frankly I don't
know enough about those yet.
So if anyone has a helpful comment or solution, I would be extremely
grateful!
Cheers,
Marc
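
For reference, the behaviour described above could be sketched roughly as
follows. This is only one possible approach, not a tested solution for every
BLAST report: best_hits is a made-up helper name, and the sketch assumes that
every "Query=" line carries an id of the form gi|<digits> and that the best
hit is the first non-blank line after the line starting with "Sequences".
Remembering the id and setting a flag avoids counting lines to skip, and the
id and hit name are collected first and printed together on one line.

```ruby
# Sketch only: best_hits is a hypothetical helper, not part of any library.
# Assumptions: ids look like gi|<digits>, and the best hit is the first
# non-blank line that follows the "Sequences ..." header line.
def best_hits(lines)
  results  = []
  query    = nil    # gi id of the query currently being read
  want_hit = false  # true once the "Sequences" header has been seen

  lines.each do |line|
    if line =~ /^Query=\s*(gi\|\d+)/
      query    = $1
      want_hit = false
    elsif line =~ /^Sequences/
      want_hit = true
    elsif want_hit && line =~ /\S/
      # First non-blank line after the header: its first field is the hit name.
      results << "#{query} #{line.split.first}"
      want_hit = false
    end
  end

  results
end
```

Called as `puts best_hits(ARGF)`, this would print one "gi|xxxxx hit" pair
per query. A query without a "Sequences" line simply produces no output line.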