Reading lines from a file into an array

jimb · May 1, 2010, 2:19pm

Using Ruby, I am trying to read in lines from a two-column HTML table
and store each row in a two-element array. Each two-element array is in
turn stored in one large array.

The table rows look like this:

Row 1 - Column 1 Row 1 - Column 2 ...

And, when I'm done, I'm hoping for this:
[["row1 - col1", "row1, col2"], ["row2 - col1", "row2, col2"], …]

Can anyone give me any pointers on the correct way to do this?
The code I have come up with so far is this:

f = File.open("file_containing_table.txt", "r")
lines = f.readlines
array_to_hold_rows = []
index = 0
loop do
  if lines[index] == nil
    break
  elsif lines[index].match "<td"
    array_to_hold_rows << ["#{lines[index]}", "#{lines[index+1]}"]
    index += 2
  else
    index += 1
  end
end

This works and does what I want, but I would like to know if this is the
best / most effective way to go about what I am trying to achieve.

Would be grateful for any help.
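
For comparison, here is a minimal sketch of the same pairing without the manual index bookkeeping. It assumes every line containing `<td` is immediately followed by its partner cell, which is what the loop above relies on as well. The sample `lines` array is hypothetical and only stands in for `f.readlines`; the real file layout may differ.

```ruby
# Hypothetical sample input standing in for f.readlines.
lines = [
  "<tr>\n",
  "<td>Row 1 - Column 1</td>\n",
  "<td>Row 1 - Column 2</td>\n",
  "</tr>\n",
  "<tr>\n",
  "<td>Row 2 - Column 1</td>\n",
  "<td>Row 2 - Column 2</td>\n",
  "</tr>\n"
]

# Keep only the lines that contain a cell, then group them two at a time.
array_to_hold_rows = lines.grep(/<td/).each_slice(2).to_a
```

One difference worth noting: the loop pairs a `<td` line with whatever line follows it, while this version only pairs lines that themselves contain `<td`. Like the loop, it keeps the markup and the trailing newlines in each element.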

jimb · May 1, 2010, 2:50pm

On Sat, May 1, 2010 at 2:19 PM, Jim B. wrote:

…
index = 0
loop do
  if lines[index] == nil
    break
  elsif lines[index].match "<td"
    array_to_hold_rows << ["#{lines[index]}", "#{lines[index+1]}"]
    index += 2
  else
    index += 1
  end
end

Usually parsing html with regular expressions is risky. I’d use a
parser instead, if possible, for example nokogiri:

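require 'nokogiri'
html = <<END
<table>
<tr class="odd"><td>Row 1 - Column 1</td><td>Row 1 - Column 2</td></tr>
<tr class="odd"><td> Row2 - col1</td><td> row2 - col2</td></tr>
</table>
END
doc = Nokogiri::HTML(html)
result = []
# Collect every <td> under the matching rows and pair them up two at a time.
doc.xpath('//tr[@class="odd"]/td').each_slice(2) do |first, second|
  result << [first.inner_html, second.inner_html]
end

result #=> [["Row 1 - Column 1", "Row 1 - Column 2"], [" Row2 - col1", " row2 - col2"]]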

This works and does what I want, but I would like to know if this is the
best / most effective way to go about what I am trying to achieve.

I think in your solution you are not stripping the markup, so your
array still contains the <td> and </td> tags.
Now that I think about it, you might want to tweak the XPath I wrote,
because maybe not all the trs have class="odd".

Hope this helps,

Jesus.

jimb · May 1, 2010, 3:32pm

Hi Jesus,

Thanks for your reply. That really helped.

Usually parsing html with regular expressions is risky. I’d use a
parser instead, if possible, for example nokogiri:

You are of course correct.
I have followed your suggestion, installed ‘nokogiri’ and am currently
looking at some examples. I will then tweak what you wrote and use that
in my final version as it is very neat and concise.

However, this:

doc.xpath('//tr[@class="odd"]/td').each_slice(2) do |first, second|
  result << [first.inner_html, second.inner_html]
end

gave me the idea of writing this:

lines.each_slice(4) { |one, two, three, four| result << [two.chomp, three.chomp] }

which also does exactly what I want, so thanks again!
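
A hedged sketch of that `each_slice(4)` idea in self-contained form, using `map` instead of appending to an outer array. It assumes the file has exactly four lines per table row (an opening `<tr>`, two `<td>` lines, a closing `</tr>`) with no header or blank lines in between; the sample `lines` below are made up for illustration.

```ruby
# Hypothetical sample input: four lines per table row.
lines = [
  "<tr>\n",
  "<td>Row 1 - Column 1</td>\n",
  "<td>Row 1 - Column 2</td>\n",
  "</tr>\n",
  "<tr>\n",
  "<td>Row 2 - Column 1</td>\n",
  "<td>Row 2 - Column 2</td>\n",
  "</tr>\n"
]

# Take the lines four at a time and keep the middle two, minus the newline.
# The leading underscore marks block parameters that are deliberately unused.
result = lines.each_slice(4).map do |_open, first, second, _close|
  [first.chomp, second.chomp]
end
```

The cells still carry their `<td>` tags here, and the approach depends entirely on the four-line layout: if the line count is not a multiple of four, the last slice is shorter, a block parameter may be `nil`, and `chomp` would then raise a `NoMethodError`.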

Essentially, I have just finished reading the first part of the Pickaxe
book and am trying to implement some of their suggestions and think
about the code I write as opposed to doing the “quick and dirty” method.