Using Ruby, I am trying to read in lines from a two-column HTML table
and store each row in a two-element array. Each two-element array is in
turn stored in one large array.
The table rows look like this:
<tr class="odd">
<td>Row 1 - Column 1</td>
<td>Row 1 - Column 2</td>
</tr>
...
And, when I'm done, I am hoping for this:
[["row1 - col1", "row1 - col2"], ["row2 - col1", "row2 - col2"], …]
Can anyone give me any pointers on the correct way to do this?
The code I have come up with so far is this:
f = File.open("file_containing_table.txt", "r")
lines = f.readlines
array_to_hold_rows = []
index = 0
loop do
  if lines[index] == nil
    break
  elsif lines[index].match "<td"
    array_to_hold_rows << ["#{lines[index]}", "#{lines[index+1]}"]
    index += 2
  else
    index += 1
  end
end
This works and does what I want, but I would like to know if this is the
best / most effective way to go about what I am trying to achieve.
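
One possible tightening of the loop above, as a rough sketch rather than the definitive answer: pick out the lines containing "<td" with grep and pair them up with each_slice(2). The sample lines stand in for the contents of file_containing_table.txt, and their exact markup is an assumption. Unlike the loop, this version also chomps the trailing newline and pairs consecutive "<td" lines rather than a "<td" line with whatever follows it.

```ruby
# Sample input standing in for File.readlines("file_containing_table.txt").
# The markup is assumed; only the "<td" lines matter here.
lines = [
  "<tr class=\"odd\">\n",
  "<td>Row 1 - Column 1</td>\n",
  "<td>Row 1 - Column 2</td>\n",
  "</tr>\n",
  "<tr class=\"even\">\n",
  "<td>Row 2 - Column 1</td>\n",
  "<td>Row 2 - Column 2</td>\n",
  "</tr>\n"
]

# Keep only the cell lines, trim the newline, then group them in pairs.
array_to_hold_rows = lines.grep(/<td/).map(&:chomp).each_slice(2).to_a
```

The cells still include their td tags at this point; removing them is a separate step.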
Usually parsing HTML with regular expressions is risky. I'd use a
parser instead, if possible, for example Nokogiri:
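require 'nokogiri'

# Sample table; the tr/td markup (including class="odd") is assumed
# from the XPath used below.
html = <<END
<table>
  <tr class="odd">
    <td>Row 1 - Column 1</td>
    <td>Row 1 - Column 2</td>
  </tr>
  <tr class="odd">
    <td>Row2 - col1</td>
    <td>row2 - col2</td>
  </tr>
</table>
END

doc = Nokogiri::HTML(html)
result = []
# Walk the cells two at a time so each pair becomes one row.
doc.xpath('//tr[@class="odd"]/td').each_slice(2) do |first, second|
  result << [first.inner_html, second.inner_html]
end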
I think in your solution you are not stripping the markup, so your
array still contains the <td> and </td> tags.
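
To illustrate the point about markup: a line read straight from the file keeps its tags unless they are removed explicitly. A minimal sketch, assuming each cell sits on a single line and contains no nested markup. The helper name strip_tags is made up for this example, and a parser remains the safer route for anything less regular.

```ruby
# Hypothetical helper: removes anything that looks like a tag,
# then trims surrounding whitespace (including the newline).
def strip_tags(line)
  line.gsub(/<[^>]*>/, "").strip
end

# Sample row as it would come out of the line-based loop (markup assumed).
row = ["<td>Row 1 - Column 1</td>\n", "<td>Row 1 - Column 2</td>\n"]
cleaned = row.map { |cell| strip_tags(cell) }
# cleaned => ["Row 1 - Column 1", "Row 1 - Column 2"]
```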
Now that I think about it, you might want to tweak the XPath I wrote,
because maybe not all the tr elements have class="odd".
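
One way that tweak could look, as a sketch only: match every tr with '//tr' and collect the td children of each row, so the class attribute no longer matters. To keep the example runnable without installing a gem it uses REXML (shipped with standard Ruby installs) instead of Nokogiri, which only works here because the sample is well-formed XML. The sample markup and class names are assumptions. With Nokogiri the same two XPath expressions can be passed to doc.xpath and tr.xpath.

```ruby
require 'rexml/document'

# Sample table; the markup and class names are assumed for illustration.
html = <<END
<table>
  <tr class="odd">
    <td>Row 1 - Column 1</td>
    <td>Row 1 - Column 2</td>
  </tr>
  <tr class="even">
    <td>Row 2 - Column 1</td>
    <td>Row 2 - Column 2</td>
  </tr>
</table>
END

doc = REXML::Document.new(html)

# '//tr' matches every row regardless of class; the td children of
# each row become one inner array of cell texts.
result = REXML::XPath.match(doc, '//tr').map do |tr|
  tr.get_elements('td').map(&:text)
end
```

Grouping by row instead of slicing a flat list of cells also avoids pairing cells from different rows if one row happens to have a missing cell.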
You are of course correct.
I have followed your suggestion, installed ‘nokogiri’ and am currently
looking at some examples. I will then tweak what you wrote and use that
in my final version as it is very neat and concise.
However, this:
doc.xpath('//tr[@class="odd"]/td').each_slice(2) do |first, second|
  result << [first.inner_html, second.inner_html]
end
gave me the idea of writing this:
lines.each_slice(4) { |one, two, three, four| result << [two.chomp, three.chomp] }
which also does exactly what I want, so thanks again!
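
For reference, a sketch of how the each_slice(4) version behaves, assuming every table row occupies exactly four lines (opening tr, two td lines, closing tr). That layout is an assumption about the file. The cells keep their td tags, and a trailing group too short to hold both cells would otherwise raise on nil.chomp, so this sketch skips such a group.

```ruby
# Sample input standing in for the lines read from the file (markup assumed).
lines = [
  "<tr class=\"odd\">\n",
  "<td>Row 1 - Column 1</td>\n",
  "<td>Row 1 - Column 2</td>\n",
  "</tr>\n",
  "<tr class=\"even\">\n",
  "<td>Row 2 - Column 1</td>\n",
  "<td>Row 2 - Column 2</td>\n",
  "</tr>\n"
]

result = []
lines.each_slice(4) do |_open, first, second, _close|
  next if second.nil? # skip a trailing group too short to hold both cells
  result << [first.chomp, second.chomp]
end
```

This depends entirely on the four-lines-per-row layout; a blank line or a wrapped cell anywhere in the file would shift every later group.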
Essentially, I have just finished reading the first part of the Pickaxe
book and am trying to implement some of their suggestions and think
about the code I write as opposed to doing the “quick and dirty” method.