My script just read one line?

junhuiliao · July 29, 2010, 1:43pm

Dear all,

My script tried to read from one original tsv file and distribute into
new multiple tsv files.

Each line of the original file is like this: time_1, signal_1, time_2,
signal_2… time_4096, signal_4096.
I would like to write them into file_1, file_2, … file_4096
accordingly, and these files contain time_1, signal_1; time_2, signal_2;
… time_4096, signal_4096 separately.

My script works only if the original file contains ONE line.
If the original file has two or more lines, I get the following
error message:

new_split.rb:17: undefined method `+' for nil:NilClass (NoMethodError)

Tracing the output suggests that the script read just ONE line,
since the puts output looks like this:

…(many lines omitted here)
8182
"4.08963252844486E+00"
"-2.3E-03"
8184
"4.09063219413236E+00"
"-3.1E-03"
8186
"4.09163185987611E+00"
"-7E-04"
8188
"4.09263152560423E+00"
"-3.7E-03"
8190
"4.09363119136048E+00"
"3.6E-03"
8192
nil
nil

And my script is like this:

    @a = []
    @itemnum = 4096
    @counter = 0
    @linenum = 10
    File.open("../original_data/test_2lines.tsv").each_line do |record| # "^M"
    #File.open("../original_data/one_line.tsv").each_line do |record|
      @a = record.chomp.split("\t")

      @itemnum.times do |n|
        File.open("#{n}_debug_split"+".tsv" , "w") do |f|
          puts @counter
          puts @a[@counter].inspect + "\n"
          puts @a[@counter+1].inspect + "\n"
          f << @a[@counter] + "\t" + @a[@counter+1] + "\n"
          @counter += 2
        end
      end
    end

Thanks a lot for your comments in advance !
Junhui

BTW, at the end of each line in the original tsv file there is a "^M"
appended.
I don't know where it comes from or whether it affects anything.

junhuiliao · July 29, 2010, 2:30pm

On Thu, Jul 29, 2010 at 1:43 PM, Junhui L. wrote:

My script did well only if the original file contains ONE line.
If the original file has two or more lines, the error message like
following,

new_split.rb:17: undefined method `+' for nil:NilClass (NoMethodError)

And my script is like this:
           @a = []

you don’t need to declare this, because you later are assigning
directly to @a again

           @itemnum = 4096
           @counter = 0
           @linenum = 10

and, by the way, you probably don’t need instance variables, probably
local variables could suffice, itemnum looks like a constant and
linenum is not used, so:

ITEM_NUM = 4096
counter = 0

    File.open("../original_data/test_2lines.tsv").each_line do |record| # "^M"
    #File.open("../original_data/one_line.tsv").each_line do |record|
    @a = record.chomp.split("\t")

a = record.chomp.split("\t") # although maybe fields or line_fields are better names than a

end

You are adding 2 to the counter on every iteration, but not resetting
it after every line. So, on the second line, counter will still be 8192
(4096 iterations of += 2), and you will try to get an element from the
array that is out of bounds, which returns nil and raises the
NoMethodError, because you are calling the + method on nil. I think you
are complicating the issue with the counting and so on; usually the
Ruby iterators are a cleaner way to traverse lists of things. You can
remove the use of itemnum, counter and so on like this (untested):

    File.open("../original_data/test_2lines.tsv").each_line do |record|
      a = record.chomp.split("\t")
      a.each_slice(2).with_index do |(time,signal), index|
        File.open("#{index}_debug_split"+".tsv" , "w") do |f|
          f << "#{time}\t#{signal}\n"
        end
      end
    end

Although this will open and close the 4096 files for every line. Are
there many lines? If not, you can read the whole file and build a
structure in memory (a hash of arrays) to store the lines that belong
to every file, and then write them at once to each file.
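
The "hash of arrays" idea can be sketched roughly as follows. This is
only an illustration under stated assumptions: the two sample rows and
the names `rows` and `columns` are made up, and the actual file writing
is left as a comment so the sketch runs without touching the disk.

```ruby
# Sketch only: `rows` stands in for the lines of the real TSV file.
rows = [
  "0.0\t1.5\t0.1\t2.5",   # time_1, signal_1, time_2, signal_2
  "1.0\t1.6\t1.1\t2.6"
]

# Every missing key starts out as its own empty array.
columns = Hash.new { |hash, key| hash[key] = [] }

rows.each do |record|
  record.chomp.split("\t").each_slice(2).with_index do |(time, signal), index|
    # Collect the pair under the index of the output file it belongs to.
    columns[index] << "#{time}\t#{signal}"
  end
end

columns.each do |index, lines|
  # A real script would write here instead, for example:
  #   File.open("#{index}_debug_split.tsv", "w") { |f| f.puts lines }
  puts "file #{index}: #{lines.inspect}"
end
```

Each output file is then opened once, after all lines have been read,
instead of once per input line.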

Jesus.

junhuiliao · July 29, 2010, 4:04pm

You are adding 2 to the counter on every iteration, but not resetting
it after every line. So, on the second line, counter will still be 8192
(4096 iterations of += 2), and you will try to get an element from the
array that is out of bounds, which returns nil and raises the
NoMethodError, because you are calling the + method on nil. I think you
are complicating the issue with the counting and so on; usually the
Ruby iterators are a cleaner way to traverse lists of things. You can
remove the use of itemnum, counter and so on like this (untested):

Many thanks for your comment !

    File.open("../original_data/test_2lines.tsv").each_line do |record|
      a = record.chomp.split("\t")
      a.each_slice(2).with_index do |(time,signal), index|
        File.open("#{index}_debug_split"+".tsv" , "w") do |f|
          f << "#{time}\t#{signal}\n"
        end
      end
    end

I tried the script after adding "require 'enumerator'".
Still, there is a problem like this:
new_split_Jesus.rb:1:in `each_slice': no block given (LocalJumpError)

After searching this forum, I found that this is because the Ruby on
my Mac is 1.8.6, while your code needs 1.9 or later.
However, I don't know how to satisfy "requires a block to be passed to
it".
Please refer to this link: "No block given" error -- spurious? - Ruby - Ruby-Forum
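
One way to avoid the block-less call is to pass the block directly to
each_slice and keep the index by hand. This is a minimal sketch, not
the code from the thread: the `fields` values are made up, and on Ruby
1.8.6 it assumes "require 'enumerator'" has been done first (later
versions need no require).

```ruby
# Sketch only: `fields` stands in for one line after chomp.split("\t").
fields = %w[0.0 1.5 0.1 2.5]

pairs = []
index = 0
# The block is given to each_slice itself, so no enumerator object is needed.
fields.each_slice(2) do |time, signal|
  pairs << [index, time, signal]
  index += 1   # manual replacement for with_index
end

p pairs
```

Inside the block, `index` plays the role that with_index provides in the
1.9 version.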

Although this will open and close the 4096 files for every line. Are
there many lines? If not, you can read the whole file and build a
structure in memory (a hash of arrays) to store the lines that belong
to every file, and then write them at once to each file.

Yes, my file has 2048 lines in total, ~260M.
So reading the whole file into memory
may not be very efficient.

Thanks again for your help !

Best,
Junhui

junhuiliao · July 30, 2010, 9:34am

On Fri, Jul 30, 2010 at 1:58 AM, Junhui L. wrote:

This code ran well at 1.9.1 version of ruby. Since I tried at our
server where ruby is this version.

BTW, I’m using 1.8.7. And also, File.open().each_line doesn’t properly
close the file,
so we should be using File.foreach()
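
A small sketch of the difference, using a temporary file with made-up
contents in place of the real TSV (the names `tmp`, `records` and
`first` are illustrative only):

```ruby
require "tempfile"

# Sketch only: a throwaway file stands in for the real data file.
tmp = Tempfile.new("sample")
tmp.write("0.0\t1.5\n1.0\t1.6\n")
tmp.close

records = []
# File.foreach opens the file, yields each line, and closes it afterwards.
File.foreach(tmp.path) do |record|
  records << record.chomp.split("\t")
end
p records

# The block form of File.open also closes the file when the block returns.
first = File.open(tmp.path) { |file| file.readline.chomp }
p first

tmp.unlink
```

By contrast, File.open(path).each_line leaves the File object open
until it is garbage collected.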

I would like to do, time_2.1 = time_2.1 - time_1.1 , time_2.2 =
time_2.2 - time_1.2 , …
However, i could not print just time or signal
value.
What I'd do is create an array with the times of the first line, and
use it afterwards to subtract.
I've refactored a little bit to simplify (this is completely untested):

    File.open("../original_data/test_2lines.tsv") do |file|
      first_line = file.readline
      first_line_times = first_line.chomp.split("\t").each_slice(2).map {|time,signal| time}
      write_line_to_file first_line
      file.each_line do |record|
        line_data = record.chomp.split("\t")
        write_line_to_file line_data, first_line_times
      end
    end

    def write_line_to_file line, base_time = Hash.new(0)
      line_data.each_slice(2).with_index do |(time,signal), index|
        File.open("#{index}_debug_split"+".tsv" , "w") do |f|
          f << "#{time.to_i - base_time[index]}\t#{signal}\n"
        end
      end
    end

Hope this gives you an idea to explore,

Jesus.

junhuiliao · July 30, 2010, 1:58am

Dear Jesús Gabriel y Galán and all,

    File.open("../original_data/test_2lines.tsv").each_line do |record|
      a = record.chomp.split("\t")
      a.each_slice(2).with_index do |(time,signal), index|
        File.open("#{index}_debug_split"+".tsv" , "w") do |f|
          f << "#{time}\t#{signal}\n"
        end
      end
    end

This code ran well under Ruby 1.9.1, which is the version installed
on our server.

Actually, I also need to do this: subtract the first line's time values
from the corresponding time values of the other lines.

First line: time_1.1, signal_1.1, time_1.2, signal_1.2… time_1.4096,
signal_1.4096.
Second line: time_2.1, signal_2.1, time_2.2, signal_2.2… time_2.4096,
signal_2.4096.
…

I would like to do, time_2.1 = time_2.1 - time_1.1 , time_2.2 =
time_2.2 - time_1.2 ,
… time_2.4096 = time_2.4096 - time_1.4096.
…
The same applies to the time values of the other lines.

I tried to use a counter to pick out the first line (a stupid way, I
know), save it in an array, and subtract that array from the other
lines' time values, but failed. It seemed that with the enumerator I
could not access the individual items. "puts a[index]" printed the two
items (time and signal) fine, but I could not print just the time or
just the signal value.

Thanks a lot in advance!
Best,
Junhui

junhuiliao · July 30, 2010, 2:17pm

    File.open("../original_data/test_2lines.tsv") do |file|
      first_line = file.readline
      first_line_times = first_line.chomp.split("\t").each_slice(2).map {|time,signal| time}
      write_line_to_file first_line
      file.each_line do |record|
        line_data = record.chomp.split("\t")
        write_line_to_file line_data, first_line_times
      end
    end

    def write_line_to_file line, base_time = Hash.new(0)
      line_data.each_slice(2).with_index do |(time,signal), index|
        File.open("#{index}_debug_split"+".tsv" , "w") do |f|
          f << "#{time.to_i - base_time[index]}\t#{signal}\n"
        end
      end
    end

Hope this gives you an idea to explore,

Jesus.

Dear Jesús Gabriel y Galán,

Thanks a lot for your second comment.

However, there are a few problems when running the script above.

1, For the line "write_line_to_file first_line" and
the line "write_line_to_file line_data, first_line_times",
the error message is as follows:

in `block (2 levels) in <class:File_spliter>': undefined method `write_line_to_file' for File_spliter:Class (NoMethodError)

It seems that "write_line_to_file" has not been defined yet when it is
called. I tried to make a class to "surround" your code, but the same
error occurred.

2, I don't understand these two lines.
2.1, "write_line_to_file first_line". The definition of
write_line_to_file has two arguments, while here there is only one,
and I don't know the purpose of this line.

2.2, "line_data.each_slice(2).with_index do |(time,signal), index|".
Why is it not "line.each_slice(2).with_index do |(time,signal), index|" ?

Thanks again !
Best regards,
Junhui

junhuiliao · July 30, 2010, 2:47pm

On Fri, Jul 30, 2010 at 2:18 PM, Junhui L. wrote:

It seemed the "write_line_to_file" has no definition before calling.

Sorry, the method write_line_to_file should be defined before the
other block of code, so that it exists when it's called.

2, I don’t understand this two lines.
2.1, “write_line_to_file first_line”. Since the definition of
write_line_to_file has
two arguments, while here just only one, and I don’t know the purpose of
this
line.

It has two arguments, but the second is optional, and if not passed,
it will be assigned a new Hash.new(0), which returns 0 for any index,
so the times of the first line are written unchanged.
See the def of the method.
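
The effect of the optional argument can be seen in isolation with a
small sketch. `shifted_time` is a hypothetical helper written only for
this illustration; it is not part of the script in the thread, and it
uses to_f on both sides so that strings and the default 0 are handled
alike.

```ruby
# Sketch only: hypothetical helper showing a default argument of Hash.new(0).
def shifted_time(time, index, base_time = Hash.new(0))
  # Hash.new(0)[index] is 0 for every index, so nothing is subtracted
  # unless the caller passes real base times.
  time.to_f - base_time[index].to_f
end

without_base = shifted_time("4.5", 3)                 # default argument used
with_base    = shifted_time("4.5", 1, ["1.0", "2.5"]) # array of first-line times

p without_base
p with_base
```

Because both a Hash and an Array respond to [], the same method body
works for the first line (no base) and for all later lines.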

2.2, “line_data.each_slice(2).with_index do |(time,signal), index|”.
Why is not “line.each_slice(2).with_index do |(time,signal), index|” ?

It’s a mistake on my part. Copy/paste error.

Jesus.

junhuiliao · July 30, 2010, 3:58pm

Dear Jesús Gabriel y Galán,

It's quite strange. There is an error message for this line:

every_line.each_slice(2).with_index do |(time,signal), index|

split_time_subtract.rb:5:in `write_line_to_file': undefined method `each_slice' for #<String:0x9ebe608> (NoMethodError)

The version of ruby is 1.9.1. And I checked ‘each_slice’ worked well
like this:

irb(main):001:0> [1,2,3].each_slice(2){|s| p s}
[1, 2]
[3]
=> nil

Any suggestions ?
Thanks a lot in advance!

Junhui

junhuiliao · July 30, 2010, 8:40pm

On Fri, Jul 30, 2010 at 4:00 PM, Junhui L. wrote:

Dear Jesús Gabriel y Galán,

It’s quite strange. There is such a error message.

To this line, every_line.each_slice(2).with_index do |(time,signal),
index| .

split_time_subtract.rb:5:in `write_line_to_file': undefined method `each_slice' for #<String:0x9ebe608> (NoMethodError)

What I'm typing is untested; it is just to give you some ideas. In any
case, this error is because I meant the first_line variable to contain
an array, but I misplaced the call to split:

Replace this:

    first_line = file.readline
    first_line_times = first_line.chomp.split("\t").each_slice(2).map {|time,signal| time}

with this:

    first_line = file.readline.chomp.split("\t")
    first_line_times = first_line.each_slice(2).map {|time,signal| time}

Jesus.

junhuiliao · August 2, 2010, 10:53am

On Sun, Aug 1, 2010 at 11:10 PM, Junhui L. wrote:

    f << "#{time.to_f - base_time[index].to_f}\t#{signal}\n"
    file.each_line do |record|
    line_data = record.chomp.split("\t")
    write_line_to_file line_data, first_line_times
    end
    end

However, at least two things still need to be improved.
Item 1, this code took ~2 hours to write the 4096 files.
BTW, the original tsv file is around 250M. I wonder if there are
some tricks to speed it up?

Maybe you can read it completely in memory, reorganize the contents
per file, and then write each file at once.
I think that should speed it up, although it implies a complete
refactor of the code.

Item 2, the original data has a 21-line header. It could be deleted
before the script reads the file, but I would rather update the script
so that it skips the first 21 header lines.

If you do a first file.readline after opening the file, you will read
the first line.
Then continue with what you already had.

Jesus.

junhuiliao · August 1, 2010, 11:10pm

Hi, Jesus.

Thanks a lot for your help!
I modified the script a little and got it running as expected.

Here is the code:

    def write_line_to_file every_line, base_time = Hash.new(0)
      every_line.each_slice(2).with_index do |(time,signal), index|
        File.open("header_split_#{index}"+".tsv" , "a") do |f|
          f << "#{time.to_f - base_time[index].to_f}\t#{signal}\n"
        end
      end
    end

    #count = 0
    first_line = file.readline.chomp.split("\t")
    # counter +=1
    # if counter >= 2
    # puts "here!"
    first_line_times = first_line.each_slice(2).map{|time,signal| time}
    file.each_line do |record|
      line_data = record.chomp.split("\t")
      write_line_to_file line_data, first_line_times
    end
    end

However, at least two things still need to be improved.
Item 1, this code took ~2 hours to write the 4096 files.
BTW, the original tsv file is around 250M. I wonder if there are
some tricks to speed it up?

Item 2, the original data has a 21-line header. It could be deleted
before the script reads the file, but I would rather update the script
so that it skips the first 21 header lines.

I tried two ways to do this, but both failed.

The first way was based on how a CSV file's header is read:

    require 'csv'
    reader = File.open("../data/test_7_10lines.tsv") do |file|
    header = reader.shift
    reader.each {|row| process(header,2)} # Suppose the first two lines are header.

The error message was:
undefined method `shift' for nil:NilClass (NoMethodError)

The other attempt was to insert the two lines that are commented out
in the code above:

    counter +=1
    if counter >= 2

But it seemed that counter +=1 did not work at all, since
counter was always <= 1!

What’s wrong and any suggestions?

Best regards,
Junhui

junhuiliao · August 3, 2010, 12:15am

Hi, Jesus,

Maybe you can read it completely in memory, reorganize the contents
per file, and then write each file at once.
I think that should speed it up, although it implies a complete
refactor of the code.

Could you please give me some tips on how to organize this new script?
I have no idea how to do this at all.

Item 2, the original data has a 21-line header. It could be deleted
before the script reads the file, but I would rather update the script
so that it skips the first 21 header lines.

If you do a first file.readline after opening the file, you will read
the first line.
Then continue with what you already had.

I solved this problem by inserting this code:

    until file.readline =~ /Data:/
      file.readline
    end

Thanks a lot in advance !
Cheers,
Junhui

junhuiliao · August 3, 2010, 8:18am

On Tue, Aug 3, 2010 at 12:15 AM, Junhui L. wrote:

I have no idea to do this at all.

You could read the lines one by one as you are doing now, but instead
of writing them to each file on each step, accumulate them in arrays.
For example you could have an array that contains an array of lines
for each file. Then when you've read the whole file, iterate through
the array writing each subarray to a file.

until file.readline =~ /Data:/
file.readline
end

Be careful, you are doing two readlines per iteration, so you might
skip the important line. If you know you have only one line before the
important data you can just do file.readline and continue.
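
A header of any length can be skipped with a single readline per
iteration. The following is only a sketch: the header text is made up
(just the final "Data:" marker comes from the thread), and a StringIO
stands in for the opened file so the example runs on its own.

```ruby
require "stringio"

# Sketch only: two invented header lines, the "Data:" marker, then data rows.
file = StringIO.new("Instrument: demo\nChannels: 2\nData:\n0.0\t1.5\n1.0\t1.6\n")

# Exactly one readline per iteration, so no line is silently skipped.
loop do
  break if file.readline =~ /Data:/
end

# The next readline returns the first real data line.
first_line = file.readline.chomp.split("\t")
p first_line
```

If the marker never appears, readline raises EOFError at the end of the
file, which makes a malformed input fail loudly instead of quietly.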

Jesus.

junhuiliao · August 3, 2010, 9:51am

Hi,

Jesus,

Thanks a lot for your comments.

Be careful, you are doing two readlines per iteration, so you might
skip the important line. If you know you have only one line before the
important data you can just do file.readline and continue.

In total, the header has 21 lines, and the last line of the header is
"Data:".
Fortunately, 21 is an odd number, so the readline in the loop condition
lands on the "Data:" line and the result is the same as if the header
had only one line :-). For sure, this is not robust.
I will try to update it.

Cheers,
Junhui

junhuiliao · August 4, 2010, 11:21pm

Hi, Jesus,

You could read the lines one by one as you are doing now, but instead
of writing them to each file on each step, accumulate them in arrays.
For example you could have an array that contains an array of lines
for each file. Then when you’ve read the whole file, iterate through
the array writing each subarray to a file.

I tried to use the following code to collect the data of all lines
in one array.
But my code only keeps the last line of the file. Could you please tell
me why?

    File.open("../data/test_2lines.tsv") do |file|
      total_data = []# Array containing all of the line_datas
      line_data = [] # Array just containing one line's data
      first_line = file.readline.chomp.split("\t")
      first_line_times = first_line.each_slice(2).map{|time,signal| time}
      file.each_line do |record|
        line_data = record.chomp.split("\t")# write all time and signal into array line_data.
      end
      total_data = total_data + line_data# append line_data into total_data
      puts total_data.length
      File.open("total_data.txt","w") do |f|
        f << total_data
      end
    end

However, under irb,

irb(main):001:0> a = [1,2,3]
=> [1, 2, 3]
irb(main):002:0> b = [4,5,6]
=> [4, 5, 6]
irb(main):003:0> a + b
=> [1, 2, 3, 4, 5, 6]

Thanks a lot in advance!
Cheers,
Junhui

junhuiliao · August 6, 2010, 12:05am

This is how I’d do it (untested, just an idea):

    File.open("../data/test_2lines.tsv") do |file|
      total_data = Array.new(4096) {[]} # Array containing an array for each file we will generate
      first_line = file.readline.chomp.split("\t")
      first_line_times = first_line.each_slice(2).map{|time,signal| time}
      file.each_line do |record|
        record.chomp.split("\t").each_slice(2).with_index do |(time,signal), index|
          total_data[index] << "#{time.to_f - first_line_times[index].to_f}\t#{signal}\n"
        end
      end

      total_data.each_with_index do |lines, index|
        File.open("header_split_#{index}"+".tsv" , "w") do |f|
          f.puts total_data[index]
        end
      end
    end

Hope this helps,

Jesus.

Dear Jesus,

I modified it a little, and then the code worked.
This new script took just 3 (THREE) minutes, it's incredible !!!
Millions of thanks to you!

My main reference book is <>,
I don't know why I could not find this information in the book, like
each_with_index,
and like the example of two items between two bars "||",
map{|time,signal| time} ?
Where can I find this information?

Thanks a lot in advance!
Cheers,
Junhui

junhuiliao · August 6, 2010, 8:07am

On Fri, Aug 6, 2010 at 12:05 AM, Junhui L. wrote:

Millions of thanks to you!

My main reference book is <>,

Are you referring to this one:
http://www.ruby-doc.org/docs/ProgrammingRuby/

I don’t know why in this book, I could not find these information, like
each_with_index,

Well, I don’t know if it’s mentioned in the text, but there’s a
reference:

http://www.ruby-doc.org/docs/ProgrammingRuby/html/ref_m_enumerable.html#Enumerable.each_with_index

and like the example of two items between two bars “||”,
map{|time,signal| time} ?
Where I can find these information ?

… but I recommend you take a look at the API reference here:

http://ruby-doc.org/core/

and study the classes: Array, Hash and Enumerable, which are pretty
important to master (of course there are others like String and so
on…)
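
As a rough illustration of the "two items between two bars" syntax: when
a block receives an array, listing two parameters splits it into its two
elements. The values below are made up, and the variable names `pairs`,
`times` and `labelled` exist only for this sketch.

```ruby
# Sketch only: `fields` stands in for one line after chomp.split("\t").
fields = %w[0.0 1.5 0.1 2.5]

# each_slice(2) groups the flat list into [time, signal] pairs.
pairs = fields.each_slice(2).to_a

# |time, signal| unpacks each pair; the block's value (time) is collected by map.
times = fields.each_slice(2).map { |time, signal| time }

# each_with_index hands the block an element together with its position.
labelled = times.each_with_index.map { |time, index| "#{index}: #{time}" }

p pairs
p times
p labelled
```

The same unpacking applies to |(time,signal), index| in the earlier
code, where the parentheses unpack the pair and index is the second
block argument.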

Jesus.

junhuiliao · August 5, 2010, 8:16am

On Wed, Aug 4, 2010 at 11:22 PM, Junhui L. wrote:

containing in an array.
But my code just read the last line of file. Could you please tell my
why?

    File.open("../data/test_2lines.tsv") do |file|
    total_data = []# Array containing all of the line_datas
    line_data = [] # Array just containing one line's data

You don’t need to declare the variable. You will only use it inside
the loop and assign it a value, so you should remove the above line.

    first_line = file.readline.chomp.split("\t")
    first_line_times = first_line.each_slice(2).map{|time,signal| time}

    file.each_line do |record|
    line_data = record.chomp.split("\t")# write all time and signal into array line_data.
    end

The problem is here: you are reading the full file splitting each
line, but the result of the split is thrown away.
You don’t do anything with line_data in each iteration.

total_data = total_data + line_data# append line_data into total_data

After the loop, you use the last line_data, which contains the last
line of the file.
BTW, it’s better to do total_data << line_data, because the way you
are doing it, you are creating intermediate arrays that you don’t
need.
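
Note that + and << do not build the same result, which matters when
choosing between them. A small sketch with made-up arrays (the names
`sum`, `nested` and `flat` are illustrative only):

```ruby
a = [1, 2]
b = [3, 4]

# + builds a brand-new array on every call and leaves a untouched.
sum = a + b

# << appends its argument as ONE element, so the array ends up nested.
nested = [1, 2] << b

# concat appends the elements in place, without an intermediate array.
flat = [1, 2].concat(b)

p sum
p nested
p flat
```

So << gives one sub-array per line, while concat is the in-place
counterpart of +.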
This is how I’d do it (untested, just an idea):

    File.open("../data/test_2lines.tsv") do |file|
      total_data = Array.new(4096) {[]} # Array containing an array for each file we will generate
      first_line = file.readline.chomp.split("\t")
      first_line_times = first_line.each_slice(2).map{|time,signal| time}
      file.each_line do |record|
        record.chomp.split("\t").each_slice(2).with_index do |(time,signal), index|
          total_data[index] << "#{time.to_f - first_line_times[index].to_f}\t#{signal}\n"
        end
      end

      total_data.each_with_index do |lines, index|
        File.open("header_split_#{index}"+".tsv" , "w") do |f|
          f.puts total_data[index]
        end
      end
    end

Hope this helps,

Jesus.

junhuiliao · August 6, 2010, 12:25pm

Hi, Jesus,

Are you referring to this one:
http://www.ruby-doc.org/docs/ProgrammingRuby/

I don’t know why in this book, I could not find these information, like
each_with_index,

It is this book. And I made a mistake: the book (I bought a PDF
version) does cover each_with_index. I did search for "each_with_index"
in the book before, but found nothing. Maybe I made a typing mistake
when searching.

Well, I don’t know if it’s mentioned in the text, but there’s a
reference:

Programming Ruby: The Pragmatic Programmer's Guide

and like the example of two items between two bars “||”,
map{|time,signal| time} ?
Where I can find these information ?

… but I recommend you take a look at the API reference here:

RDoc Documentation

and study the classes: Array, Hash and Enumerable, which are pretty
important to master (of course there are others like String and so
on…)

Thanks a lot for all of these references !

Cheers,
Junhui