File position and buffers

Hi all,

In a bit of a rut. Have a file with a lot of text. I want to separate
the text in this file into entries. Each entry would be separated
using IO.pos: when the cursor reaches a certain
character in the file, it will ideally place all the content before that
character into a buffer. Then the cursor will continue reading until it
hits that same character again and put that content into a buffer, and so
on. (The character I’ll be reading is a greater-than symbol.)

Would I use a do iterator or use a while loop with a gets method? Or
readlines perhaps?

File:

>entry 1
rubyrubyrubyrubyrubyrubyrubyruby
(newline here which I don’t want)
>entry 2
rubyrubyrubyrubyrubyrubyrubyruby

Entry1 and entry2 will be in separate buffers which I would be able to
access again.

buffer1 = >entry 1
rubyrubyrubyrubyrubyrubyrubyruby

buffer2 = >entry 2
rubyrubyrubyrubyrubyrubyrubyruby

PS. The file is huge, so I don’t want to read it into memory. What is
the best way to approach this? Any suggestions or comments would be
helpful. Thanks!

hi Cee -

this may well be WAY too simple for your needs, but it seems to me you
could do something like this:

(0text.txt is a file with 7 lines that say rubyrubyrubyetc.)

f = "0text.txt"
file = File.open(f)
buffer = []
bufferindex = 0

# store each line, minus its trailing newline, in its own slot
file.each_line{|line|
  buffer[bufferindex] = line.chomp
  bufferindex += 1
}

p buffer[0]
p buffer[1]
p buffer[2]
# etc

of course you could also set a maximum number of lines per buffer:

f = "0text.txt"
file = File.open(f)
# each missing key gets its own empty array the first time it is accessed
buffer = Hash.new{|hash, key| hash[key] = []}
bufferkey = 0
maxbuflength = 3

file.each_line{|line|
  # move on to the next buffer once the current one is full
  bufferkey += 1 if buffer[bufferkey].length == maxbuflength
  buffer[bufferkey] << line.chomp
}

p buffer[0]
p buffer[1]
p buffer[2]

if the file’s extremely long i guess you’d want to write a method to
dump the buffers at some point too.

maybe this is dumb, i hope not!
cheers,

-j

Cee J. wrote in post #995381:

Hi all,

In a bit of a rut. Have a file with a lot of text. I want to separate
the text in this file into entries. Each entry would be separated
using IO.pos: when the cursor reaches a certain
character in the file, it will ideally place all the content before that
character into a buffer. Then the cursor will continue reading until it
hits that same character again and put that content into a buffer, and so
on. (The character I’ll be reading is a greater-than symbol.)

There is absolutely no reason to use pos() to read that file.

Would I use a do iterator or use a while loop with a gets method? Or
readlines perhaps?

File:

>entry 1
rubyrubyrubyrubyrubyrubyrubyruby
(newline here which I don’t want)

chomp() removes one newline, if present, at the end of a string.
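
A small sketch of that behaviour, using a made-up string rather than the real file. The variable names are only for illustration:

```ruby
# String#chomp strips a single trailing line terminator ("\n", "\r\n" or "\r").
line = "rubyruby\n\n"

once     = line.chomp        # one newline removed, one still left
twice    = line.chomp.chomp  # both newlines removed
all_gone = line.chomp("")    # an empty-string argument strips every trailing newline

p once, twice, all_gone
p "no newline".chomp         # unchanged when there is nothing to remove
```

So an entry that ends in two newlines needs either two chomp() calls, chomp(""), or a sub() with a regex.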

You say your file looks like this:

>entry 1 <-- WHAT’S AT THE END OF THIS LINE??
rubyrubyrubyrubyruby <-- WHAT’S AT THE END OF THIS LINE??
(newline here which I don’t want)

Are there newlines at the end of each of those strings? Are you saying
that your data is organized into paragraphs, i.e. separated by two
newlines? Like this:

>entry1\n
rubyrubyruby\n
\n
>entry2\n
rubyrubyruby\n
\n
>entry3

Paragraphs are separated by two consecutive newlines. Note
that in ruby the default line separator is one newline. But you can
change that to two newlines, or to any other string:

require 'stringio'

str = <<ENDOFSTRING
>entry1
11111111111

>entry2
22222222222

>entry3
33333333333
ENDOFSTRING

input = StringIO.new(str)
$/ = "\n\n"

input.each do |para|
  p para.sub(/\n+ \z/xms, "")
end

--output:--
">entry1\n11111111111"
">entry2\n22222222222"
">entry3\n33333333333"

Entry1 and entry2 will be in separate buffers which
I would be able to access again.

buffer1 = >entry 1
rubyrubyrubyrubyrubyrubyrubyruby

buffer2 = >entry 2
rubyrubyrubyrubyrubyrubyrubyruby

So you want to read two paragraphs at a time?

e = input.enum_for(:each)

e.each_slice(2) do |buffer1, buffer2|
  p buffer1, buffer2
end

--output:--
">entry1\n11111111111\n\n"
">entry2\n22222222222\n\n"
">entry3\n33333333333\n"
nil

This shows the output better:

e = input.enum_for(:each) # You can do this for a File too.

e.each_slice(2) do |buffer1, buffer2|
  puts "buffer1: #{buffer1.inspect}"
  puts "buffer2: #{buffer2.inspect}"
  puts "-" * 10
end

--output:--
buffer1: ">entry1\n11111111111\n\n"
buffer2: ">entry2\n22222222222\n\n"
----------
buffer1: ">entry3\n33333333333\n"
buffer2: nil
----------

Before doing the sub() on buffer2, you will have to check if it’s nil:

if buffer2.nil?
  # don't do a sub()
else
  # do the sub()
end

On Wed, Apr 27, 2011 at 10:02 PM, Cee J. wrote:

Would I use a do iterator or use a while loop with a gets method? Or
readlines perhaps?

One of the simplest approaches is to use Ruby’s ability to use
arbitrary record delimiters:

File.foreach file_name, ">" do |chunk|
  chunk.chomp! ">"
  chunk.gsub!(/\r\n?|\n/, '') # remove line terminators

  # if you need the leading ">":
  chunk[0,0] = ">"

  p chunk
end

Kind regards

robert

You could use foreach, checking if each line starts with ‘>’. If it
doesn’t, you accumulate in a buffer; if it does, you do something with
the current buffer and start a new one.
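
A minimal sketch of that suggestion, assuming entries begin with a ">" header line and that blank lines between entries can be ignored. The helper name each_entry and the sample data are made up for illustration; a StringIO stands in for the real file so the snippet runs on its own:

```ruby
require 'stringio'

# Hypothetical helper: walks the input line by line and yields one
# (header, sequence) pair at a time, so only the current entry is held
# in memory.
def each_entry(io)
  header = nil
  sequence = "".dup
  io.each_line do |line|
    line = line.chomp
    next if line.empty?                   # skip blank separator lines
    if line.start_with?(">")
      yield header, sequence if header    # hand over the finished entry
      header = line
      sequence = "".dup                   # start a new buffer
    else
      sequence << line                    # accumulate, newline already removed
    end
  end
  yield header, sequence if header        # the last entry has no ">" after it
end

sample = StringIO.new(">entry 1\nAAAA\nCCCC\n\n>entry 2\nGGGG\n")
entries = []
each_entry(sample) { |header, sequence| entries << [header, sequence] }
p entries
```

With a real file, File.open(path) { |f| each_entry(f) { ... } } should behave the same way, and processing each entry inside the block instead of collecting them in an array keeps memory use down to a single entry.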

Jesus

Thanks guys for your helpful comments. I will be more descriptive. I am
an intern and my mentor wants me to use IO.pos to read the
characters of the file until the cursor reaches the “>” symbol. So,
when the cursor reaches the “>” symbol (which is the start of a new
entry), he wants me to place the previous entry in a buffer. Here is
the actual test file I am working with:

>gi|329295464|ref|NM_2005745.3Acc1| Def1 zgc:65895 (zgc:65895), mRNA\n
AGCTCGGGGGCTCTAGCGATTTAAGGAGCGATGCGATCGAGCTGACCGTCGCG\n
\n
>gi|456299107|ref|NM_2342343.3Acc2| Def2 zgc:65895 (zgc:65895), mRNA\n
GTCGCTGGGTCGAAAAGTGGTGCTATATCGCGGCTCGCGTCGATGTCGCGATG\n
CGTGCGCGCGAGAGCGCGCTATGATGAAAGGATGAGAGAG\n
\n
>gi|3542945647|ref|NM_7453343.5Acc3| Def3 zgc:65895 (zgc:65895), mRNA\n
CGTGCGGGGABCCGTACGTGCCGTGGGGGTTTAATAGCGCGCCATCTGAGCAG\n
TTAGTCGCTGACGCATGCACG\n
\n

7stud, you are right there are two consecutive newlines which I failed
to mention. This should be the output of a buffer for one entry:

>gi|456299107|ref|NM_2342343.3Acc2| Def2 zgc:65895 (zgc:65895), mRNA\n
GTCGCTGGGTCGAAAAGTGGTGCTATATCGCGGCTCGCGTCGATGTCGCGATG <-- no “\n”
CGTGCGCGCGAGAGCGCGCTATGATGAAAGGATGAGAGAG <-- no “\n”

Notice how the newlines are gone. So with the exception of the header in
each entry, the newlines should be gone and be placed in a buffer. I am
lost on how to use the IO.pos and a file iterator to make sure each
respective entry goes into a buffer without the file being indexed into
memory.

Thanks in advance, I’m new to the language and trying to wrap my head
around it.

Robert K. wrote in post #995478:

One of the simplest approaches is to use Ruby’s ability to use
arbitrary record delimiters:

File.foreach file_name, ">" do |chunk|
  chunk.chomp! ">"
  chunk.gsub!(/\r\n?|\n/, '') # remove line terminators

Cee J., are you reading the file in binary mode or text mode?

7stud – wrote in post #995589:

Cee J., are you reading the file in binary mode or
text mode?

If you don’t know, then show us the line in your code where you open the
file.

You still have not told us what you are supposed to do with the stuff
you read in?? You can read a file line by line and print out each line
as you go and the maximum amount of memory used will be one line’s
worth. However, if you are supposed to store all the lines in an
array, then you will read the whole file into memory.

Thanks guys for your helpful comments. I will be more
descriptive. I am an intern and my mentor wants me to
use the IO.pos to read the characters of the file
until the character reaches the “>” symbol.

What problems is that giving you? You can create a loop, read the
character at pos(i), then increment i, and do what Jesús Gabriel y Galán
suggested. You can use three buffers since there will always be three
lines. The additional housekeeping of chomp()'ing some lines and not
others is straightforward: you leave buffer1 alone, you chomp()
buffer2 once, and you chomp() buffer3 twice (or use sub() on both with the
same regex; see my previous post).

7stud – wrote in post #995596:

7stud – wrote in post #995589:

Cee J., are you reading the file in binary mode or
text mode?

If you don’t know, then show us the line in your code where you open the
file.

f = File.open("test.fasta", "r")

Where test.fasta contains the entries i posted earlier…

7stud – wrote in post #995581:

You still have not told us what you are supposed to do with the stuff
you read in?? You can read a file line by line and print out each line
as you go and the maximum amount of memory used will be one line’s
worth. However, if you are supposed to store all the lines in an
array, then you will read the whole file into memory.

Thanks guys for your helpful comments. I will be more
descriptive. I am an intern and my mentor wants me to
use the IO.pos to read the characters of the file
until the character reaches the “>” symbol.

I am extracting text from each entry I read in, something I have figured
out already. I want to read the file line by line and just store each
entry into a buffer when it reaches the “>” symbol. Then extract
specific info from it later. The entry lengths vary: some are long and
some are short. The file is in text mode.

What problems is that giving you? You can create a loop, read the
character at pos(i), then increment i, and do what Jesús Gabriel y Galán
suggested.

Could you show me a simple example or refer me to a link?

  2. Read one entry, do something to the entry, then discard it and read
    in the next entry.

This is what I want to do. Read one entry, extract information from it,
then read next entry. He says using an array will take up a lot of
memory so he said use a buffer.

But, you will end up reading every entry twice, which
is stupid. The easiest way to read in the file and prepare each entry
is to set the input separator to “\n\n”, then use each() to read in a
paragraph, then use split("\n") to split each entry into lines, then add
back a \n to the first line.

Also, are you aware that this:

>gi|456299107|ref|NM_2342343.3Acc2| Def2 zgc:65895 (zgc:65895), mRNA\n
GTCGCTGGGTCGAAAAGTGGTGCTATATCGCGGCTCGCGTCGATGTCGCGATG <-- no “\n”
CGTGCGCGCGAGAGCGCGCTATGATGAAAGGATGAGAGAG <-- no “\n”

is equivalent to:

>gi|456299107|ref|NM_2342343.3Acc2| Def2 zgc:65895 (zgc:65895), mRNA
GTCGCTGGGTCGAAAAGTGGTGCTATATCGCGGCTCGCGTCGATGTCGCGATGCGTGCGCGCGAGAGCGCGCTATGATGAAAGGATGAGAGAG

Yes I am aware of that - I just put “no \n” for emphasis. Regarding the
pos(), I think he said to use it as a guide to help with the detection
of each “>” . Thanks for being patient and helping out.

Cee J. wrote in post #995597:

my mentor wants me to use the IO.pos to read the
characters of the file until the character reaches the “>” symbol.

IO.pos() does not read in data, so you are going to have to ask your
mentor what he means. You should also ask your mentor if this is a
lesson in how not to do things. If he doesn’t reply in the affirmative,
then you should find a new mentor.

I am extracting text from each entry I read in, something I have figured
out already. I want to read the file line by line and just store each
entry into a buffer when it reaches the “>” symbol. Then extract
specific info from it later.

You told us you were not supposed to read the whole file into memory.
If you store every line in an array, then you will have read the whole
file into memory. Once again, you are not being clear on what you want
to do with the data. You need to tell us which of the following you
want to do:

  1. Store every entry in an array, so that after reading the whole file
    all the entries will be in the array, in order for you to be able to
    “extract specific info from it later”.

  2. Read one entry, do something to the entry, then discard it and read
    in the next entry.

The entry lengths vary: some are long and
some are short. The file is in text mode.

Ok.

What problems is that giving you? You can create a loop, read the
character at pos(i), then increment i, and do what Jesús Gabriel y Galán
suggested.

Could you show me a simple example or refer me to a link?

Not really. If you have to mix pos() into your code by any tortuous
means necessary, you could use each_byte to read the file char by char
(that assumes your file contains all ascii characters), and when you find a
‘>’, you could seek() back to old_pos (initially the start of the file),
and use read() to read this number of characters (sysread() should not
be mixed with buffered reads such as each_byte):

old_pos = 0
number_of_chars_to_read = pos() - old_pos

Then do something like:

old_pos = pos()
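
A rough sketch of one way pos() could be worked in, which may or may not be what the mentor has in mind. It uses gets rather than each_byte, records the offset of every ">" header line in a first pass, then goes back with seek() and read(). The helper names entry_offsets and read_entry are invented for illustration, and a StringIO stands in for the file:

```ruby
require 'stringio'

# Hypothetical helper: returns the byte offset at which each ">" header
# line starts, plus the end-of-file offset that closes the last entry.
def entry_offsets(io)
  offsets = []
  io.rewind
  loop do
    start = io.pos                 # position before the next line is read
    line = io.gets
    break if line.nil?             # end of file
    offsets << start if line.start_with?(">")
  end
  offsets << io.pos
  offsets
end

# Hypothetical helper: re-reads the bytes between two recorded offsets.
def read_entry(io, from, to)
  io.seek(from)
  io.read(to - from)
end

input = StringIO.new(">entry1\nAAAA\n\n>entry2\nCCCC\nGGGG\n")
offsets = entry_offsets(input)
entries = offsets.each_cons(2).map { |from, to| read_entry(input, from, to) }

p offsets
p entries
```

Only the integer offsets are kept between the two passes, so the index stays small even for a large file.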

But, you will end up reading every entry twice, which
is stupid. The easiest way to read in the file and prepare each entry
is to set the input separator to “\n\n”, then use each() to read in a
paragraph, then use split("\n") to split each entry into lines, then add
back a \n to the first line.

Also, are you aware that this:

>gi|456299107|ref|NM_2342343.3Acc2| Def2 zgc:65895 (zgc:65895), mRNA\n
GTCGCTGGGTCGAAAAGTGGTGCTATATCGCGGCTCGCGTCGATGTCGCGATG <-- no “\n”
CGTGCGCGCGAGAGCGCGCTATGATGAAAGGATGAGAGAG <-- no “\n”

is equivalent to:

>gi|456299107|ref|NM_2342343.3Acc2| Def2 zgc:65895 (zgc:65895), mRNA
GTCGCTGGGTCGAAAAGTGGTGCTATATCGCGGCTCGCGTCGATGTCGCGATGCGTGCGCGCGAGAGCGCGCTATGATGAAAGGATGAGAGAG

hi Cee -

copying the text you posted above into the file “0text.txt” and
running this:

f = "0text.txt"
file = File.open(f)
buffer = []
bufferindex = 0

file.each_line(sep=">"){|line|
  buffer[bufferindex] = line.chomp
  bufferindex += 1
}

p buffer[0]
p buffer[1]
p buffer[2]
p buffer[3]

i get this as output:

#=> “>”
#=> “gi|329295464|ref|NM_2005745.3Acc1| Def1 zgc:65895 (zgc:65895),
mRNA\n\nAGCTCGGGGGCTCTAGCGATTTAAGGAGCGATGCGATCGAGCTGACCGTCGCG\n\n\n\n>”
#=> “gi|456299107|ref|NM_2342343.3Acc2| Def2 zgc:65895 (zgc:65895),
mRNA\n\nGTCGCTGGGTCGAAAAGTGGTGCTATATCGCGGCTCGCGTCGATGTCGCGATG\n\nCGTGCGCGCGAGAGCGCGCTATGATGAAAGGATGAGAGAG\n\n\n\n>”
#=> “gi|3542945647|ref|NM_7453343.5Acc3| Def3 zgc:65895 (zgc:65895),
mRNA\n\nCGTGCGGGGABCCGTACGTGCCGTGGGGGTTTAATAGCGCGCCATCTGAGCAG\n\nTTAGTCGCTGACGCATGCACG\n\n\n”

does this work for you? you could easily write ways to deal with,
dump, and reset the buffers when they fill up. you can of course also
clean up all the “\n”'s…

i agree with 7stud that using #.pos and #.gets seems like a long walk
off a short pier. i’m pretty green myself, and there are probably
better ways to iterate through the file, but #.each_line(sep=">") works
just fine, and doesn’t eat up memory.

-j

If you don’t have to use pos(), then see my first post. At some point,
you might ask him why he thinks that pos() would be of any help at all!

7stud – wrote in post #995683:

If you don’t have to use pos(), then see my first post. At some point,
you might ask him why he thinks that pos() would be of any help at all!

Thanks jake and 7stud for replying. I tried this in irb for your first
post:

e = File.open("test/test.fasta").enum_for(:each)
=> #<Enumerable::Enumerator:0x1005777a8>

$/ = "\n\n"
=> "\n\n"

Before doing the sub() on buffer2, you will have to check if it’s nil:

if buffer2.nil?
#don’t do a sub()
else
#do the sub()
end

e.each_slice(2) do |buf1, buf2|
  p buf1, buf2

  if buf2.nil?
    puts "Done"
  else
    buf2.sub(/\n+ \z/xms, "")
  end
end

Output:
“>gi|329299107|ref|NM_2005745.3Acc1| Def1 zgc:65895 (zgc:65895),
mRNA\nAGCTCGGGGGCTCTAGCGATTTAAGGAGCGATGCGATCGAGCTGACCGTCGCG\n\n”
“>gi|329299107|ref|NM_2342343.3Acc2| Def2 zgc:65895 (zgc:65895),
mRNA\nGTCGCTGGGTCGAAAAGTGGTGCTATATCGCGGCTCGCGTCGATGTCGCGATG\nCGTGCGCGCGAGAGCGCGCTATGATGAAAGGATGAGAGAG\n\n”
“>gi|329299107|ref|NM_7453343.5Acc3| Def3 zgc:65895 (zgc:65895),
mRNA\nCGTGCGGGGABCCGTACGTGCCGTGGGGGTTTAATAGCGCGCCATCTGAGCAG\nTTAGTCGCTGACGCATGCACG\n”
nil
Done
=> nil

It still returns nil, am I doing what you suggested wrong?

I suggest that people never use irb because it has too many quirks.

The first thing you need to realize is that ‘>’ is
not the separator you want to look for. That is the second bit of
erroneous advice your mentor gave you. That’s because you don’t care
what character marks the beginning of every entry, rather you care what
character marks the end of every entry. The end of every entry in your
file is marked by the string “\n\n”, so you should use that as your
input line terminator. Remember, ruby uses “\n” for the input line
separator by default, which means that when you read a file using
IO#each, ruby reads lines, where the end of a line is marked by a
newline. However, you can change the input line separator to the string
“\n\n” (or any other string, e.g. “HELLO”) like this:

$/ = "\n\n"

Then when you read a file with each(), ruby will read up to and
including the input line separator, and return that as a “line”.

Once you read in an entry, then you just need to do a little
housekeeping and remove some of the “\n” characters.

require 'stringio'

str = <<ENDOFSTRING
>gi|329295464|ref|NM_2005745.3Acc1| Def1 zgc:65895 (zgc:65895), mRNA
AGCTCGGGGGCTCTAGCGATTTAAGGAGCGATGCGATCGAGCTGACCGTCGCG

>gi|456299107|ref|NM_2342343.3Acc2| Def2 zgc:65895 (zgc:65895), mRNA
GTCGCTGGGTCGAAAAGTGGTGCTATATCGCGGCTCGCGTCGATGTCGCGATG
CGTGCGCGCGAGAGCGCGCTATGATGAAAGGATGAGAGAG

>gi|3542945647|ref|NM_7453343.5Acc3| Def3 zgc:65895 (zgc:65895), mRNA
CGTGCGGGGABCCGTACGTGCCGTGGGGGTTTAATAGCGCGCCATCTGAGCAG
TTAGTCGCTGACGCATGCACG

ENDOFSTRING

input = StringIO.new(str) # Now input is just like a File

input.each(sep = "\n\n") do |para|
  buffer = ''

  lines = para.split("\n")
  buffer << lines.shift << "\n" # keep the newline after the header

  lines.each do |line|
    buffer << line              # join the sequence lines together
  end

  puts buffer # ...or do whatever else you need to do to buffer
  puts "-" * 20
end

p $/

--output:--
>gi|329295464|ref|NM_2005745.3Acc1| Def1 zgc:65895 (zgc:65895), mRNA
AGCTCGGGGGCTCTAGCGATTTAAGGAGCGATGCGATCGAGCTGACCGTCGCG
--------------------
>gi|456299107|ref|NM_2342343.3Acc2| Def2 zgc:65895 (zgc:65895), mRNA
GTCGCTGGGTCGAAAAGTGGTGCTATATCGCGGCTCGCGTCGATGTCGCGATGCGTGCGCGCGAGAGCGCGCTATGATGAAAGGATGAGAGAG
--------------------
>gi|3542945647|ref|NM_7453343.5Acc3| Def3 zgc:65895 (zgc:65895), mRNA
CGTGCGGGGABCCGTACGTGCCGTGGGGGTTTAATAGCGCGCCATCTGAGCAGTTAGTCGCTGACGCATGCACG
--------------------
"\n"

Note that passing the new input line separator as an argument to
each() leaves the global input line separator $/ untouched, so there is
nothing to restore once the block has finished, which is a good thing.
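
A short check of that point, using a made-up four-line string. The separator is passed straight to each_line, and $/ is compared before and after:

```ruby
require 'stringio'

input = StringIO.new("a\nb\n\nc\nd\n")

before = $/
# The separator applies to this call only; the global $/ is never assigned.
paragraphs = input.each_line("\n\n").to_a
after = $/

p paragraphs
p before == after
```

Assigning $/ directly, by contrast, changes the default for every later gets and each call in the program until it is set back.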

7stud – wrote in post #995821:

I suggest that people never use irb because it has too many quirks.

The first thing you need to realize is that ‘>’ is
not the separator you want to look for. That is the second bit of
erroneous advice your mentor gave you. That’s because you don’t care
what character marks the beginning of every entry, rather you care what
character marks the end of every entry. The end of every entry in your
file is marked by the string “\n\n”, so you should use that as your
input line terminator. Remember, ruby uses “\n” for the input line
separator by default, which means that when you read a file using
IO#each, ruby reads lines, where the end of a line is marked by a
newline.

I understand the logic, it makes sense. What if the file looked like
this, where there is one newline separating the entries?

>gi|329295464|ref|NM_2005745.3Acc1| Def1 zgc:65895 (zgc:65895), mRNA
AGCTCGGGGGCTCTAGCGATTTAAGGAGCGATGCGATCGAGCTGACCGTCGCG
>gi|456299107|ref|NM_2342343.3Acc2| Def2 zgc:65895 (zgc:65895), mRNA
GTCGCTGGGTCGAAAAGTGGTGCTATATCGCGGCTCGCGTCGATGTCGCGATG
CGTGCGCGCGAGAGCGCGCTATGATGAAAGGATGAGAGAG
>gi|3542945647|ref|NM_7453343.5Acc3| Def3 zgc:65895 (zgc:65895), mRNA
CGTGCGGGGABCCGTACGTGCCGTGGGGGTTTAATAGCGCGCCATCTGAGCAG
TTAGTCGCTGACGCATGCACG

Would an if-else (regarding “\n” and “\n\n”) do the trick? I wanted to
write my code so that it would handle both scenarios. Or maybe:

case
when "\n\n"

when "\n"

end

something to that extent? Suggestions?

hi Cee -

hmm, i’m getting a bit confused as to what exactly you’re trying to do
- but if you want to load all this stuff into a buffer without the
newlines, and regardless of how many newlines you have between each
entry (assuming that an “entry” is something that starts with “>”) - i
don’t see why this wouldn’t work:

f = "0text.txt"
file = File.open(f)
buffer = []
bufferindex = 0

file.each(sep = ">"){|line|
  buffer[bufferindex] = line
  bufferindex += 1
}

# here you would do something more interesting

buffer.collect{|line|
  line = line.delete("\n")
  p ">#{line}"
}

which will return…
“>>”
“>gi|329295464|ref|NM_2005745.3Acc1| Def1 zgc:65895 (zgc:65895),
mRNAAGCTCGGGGGCTCTAGCGATTTAAGGAGCGATGCGATCGAGCTGACCGTCGCG>”
“>gi|456299107|ref|NM_2342343.3Acc2| Def2 zgc:65895 (zgc:65895),
mRNAGTCGCTGGGTCGAAAAGTGGTGCTATATCGCGGCTCGCGTCGATGTCGCGATGCGTGCGCGCGAGAGCGCGCTATGATGAAAGGATGAGAGAG>”
“>gi|3542945647|ref|NM_7453343.5Acc3| Def3 zgc:65895 (zgc:65895),
mRNACGTGCGGGGABCCGTACGTGCCGTGGGGGTTTAATAGCGCGCCATCTGAGCAGTTAGTCGCTGACGCATGCACG”

…whether you have 0 or 100,000 newlines between each entry. is this
not what you’re looking for?
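
A variation on the same idea, sketched under two assumptions: ">" never appears inside a header or sequence, and the header should stay separate from the sequence. The helper name each_record and the sample data are made up, and a StringIO stands in for the file:

```ruby
require 'stringio'

# Hypothetical helper: reads ">"-delimited chunks one at a time and yields
# the header and the joined sequence, so only one entry is in memory.
def each_record(io)
  io.each_line(">") do |chunk|
    chunk = chunk.chomp(">")           # drop the delimiter that ends the chunk
    next if chunk.strip.empty?         # the text before the first ">" is empty
    header, *rest = chunk.split(/\r?\n/)
    yield ">#{header}", rest.join      # blank lines vanish when joined
  end
end

input = StringIO.new(">entry1\nAAAA\n\n\n>entry2\nCCCC\nGGGG\n")
records = []
each_record(input) { |header, sequence| records << [header, sequence] }
p records
```

Because blank lines contribute nothing to the joined sequence, the same code should cope with one, two, or more newlines between entries.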

-j