Parsing html table cells

casper_the_ghost · November 11, 2006, 2:31pm

I am trying to parse an HTML page that has strings that look like this

4 47 1 19

to get the numbers inside the table cells.

I would like to end up with a simple string that looks like this (for this
row)
4 47 1 19

The number of table cells in a row that have numbers may vary for
different rows.
I’m new to Ruby, so bear with me. I’m also learning to use hpricot and
have been able to get the table rows using it.

thanks,

Luis

casper_the_ghost · November 11, 2006, 2:55pm

[email protected] wrote:

The number of table cells in a row that have numbers may vary for
different rows.
I’m new to Ruby, so bear with me. I’m also learning to use hpricot and
have been able to get the table rows using it.

I’d use XPath, though I’m not sure if that’s doable with hpricot’s CSS
selectors or its (admittedly, I think) basic XPath support.

If you know the webpage is valid xhtml, I’d say switch to REXML, if not,
massage with tidy (maybe hpricot can do this better too) and then switch
to REXML.

The code would probably be something like (where doc is the REXML
document):

bg2_strings = doc.elements.to_a(%{//tr[@class='bg2']}).map { |bg2_row|
  bg2_row.elements.to_a('td').map { |cell| cell.text }.join(' ').strip.gsub(/\s+/, ' ')
}

Which might be horribly wrong, because I find REXML’s XPath API hard to
memorise. YMMV. (It also hates the text() axis specifier with a passion,
whence the second map.)
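For anyone who wants to try the snippet above end to end, here is a minimal self-contained sketch. The sample markup, the SAMPLE_XHTML constant and the row_strings helper are assumptions made up for illustration (the real page is not shown in this thread); only the class name bg2 and the numbers come from the posts above.

```ruby
require 'rexml/document'

# Assumed sample markup: well-formed XHTML with a varying number of cells per row.
SAMPLE_XHTML = <<~XHTML
  <table>
    <tr><td>header</td></tr>
    <tr class="bg2"><td>4</td><td>47</td><td>1</td><td>19</td></tr>
    <tr class="bg2"><td>7</td><td>49</td></tr>
  </table>
XHTML

# Hypothetical helper: one space-separated string per <tr class="bg2"> row.
def row_strings(xhtml)
  doc = REXML::Document.new(xhtml)
  doc.elements.to_a("//tr[@class='bg2']").map do |row|
    # Element#text returns nil for an empty cell, hence to_s.
    row.elements.to_a('td').map { |cell| cell.text.to_s.strip }.join(' ')
  end
end

puts row_strings(SAMPLE_XHTML)
```

REXML only accepts well-formed XML, so a real-world page would probably need the tidy step mentioned above first.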

David V.

casper_the_ghost · November 11, 2006, 4:25pm

Thanks for your help. I was able to get it with some hpricot code

intCells = tr.search("td").length

1.upto(intCells-1) do |i|
  print tr.search("td:eq(#{i})").inner_html + ' '
end

thanks,

Luis

casper_the_ghost · November 12, 2006, 1:30am

[email protected] wrote:

The number of table cells in a row that have numbers may vary for
different rows.

Try this:

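#!/usr/bin/ruby -w

table = "<table>" +
  "<tr><td>4</td><td>47</td><td>1</td><td>19</td></tr>" +
  "<tr><td>7</td><td>49</td><td>4</td><td>39</td></tr>" +
  "<tr><td>14</td><td>17</td><td>19</td><td>21</td></tr>" +
  "</table>"

# Grab each <tr>...</tr> block, non-greedily
rows = table.scan(%r{<tr>.*?</tr>})

rows.each do |row|
  # Capture the contents of every <td> in this row
  fields = row.scan(%r{<td>(.*?)</td>})
  puts fields.join(",")
end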

Output:

4,47,1,19
7,49,4,39
14,17,19,21
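
If the real rows carry attributes (such as the class='bg2' mentioned earlier in the thread), span several lines, or mix numeric and non-numeric cells, the same scan idea can be loosened a little. This is only a sketch under those assumptions: numeric_rows and SAMPLE_HTML are made-up names, and the sample markup is invented for illustration.

```ruby
# Assumed sample markup: rows with attributes, a text cell, and stray whitespace.
SAMPLE_HTML = <<~HTML
  <table>
    <tr class="bg2"><td>Totals</td><td>4</td><td>47</td><td>1</td><td>19</td></tr>
    <tr class="bg2">
      <td>7</td><td> 49 </td>
    </tr>
  </table>
HTML

# Hypothetical helper: one space-separated string of numeric cells per row.
def numeric_rows(html)
  # m lets . match newlines, i ignores tag case, [^>]* skips any attributes.
  html.scan(%r{<tr\b[^>]*>(.*?)</tr>}mi).flatten.map do |row_body|
    cells = row_body.scan(%r{<td\b[^>]*>(.*?)</td>}mi).flatten
    # Keep only cells that are plain integers once trimmed.
    cells.map(&:strip).grep(/\A\d+\z/).join(' ')
  end
end

puts numeric_rows(SAMPLE_HTML)
```

Regexes like these will still trip over nested tables or unclosed tags, so a real parser is the safer choice for messy pages.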