Vikash Kumar wrote:
…
ie.goto("http://finance.yahoo.com/q?s=IBM")
Now, what's next?
What? What's next? You have already assumed that the Watir and Hpricot libraries are the optimal solution for this problem. Not necessarily. There are many circumstances where a simple Ruby solution is better. And the more you need to know about the process of page scraping, the more likely it is that you will want to understand and tune the details.
Also, let's suppose we want to get all the values of a table, and we don't know the table structure; then what should be the correct solution?
How about this approach:
#!/usr/bin/ruby -w

require 'net/http'

# read the page data
http = Net::HTTP.new('finance.yahoo.com', 80)
resp, page = http.get('/q?s=IBM', nil)

# BEGIN processing HTML
def parse_html(data,tag)
  return data.scan(%r{<#{tag}\s*.*?>(.*?)</#{tag}>}im).flatten
end

output = []
table_data = parse_html(page,"table")
table_data.each do |table|
  out_row = []
  row_data = parse_html(table,"tr")
  row_data.each do |row|
    cell_data = parse_html(row,"td")
    cell_data.each do |cell|
      # strip any markup left inside the cell
      cell.gsub!(%r{<.*?>},"")
    end
    out_row << cell_data
  end
  output << out_row
end

# END processing HTML

# examine the result
def parse_nested_array(array,tab = 0)
  n = 0
  array.each do |item|
    if(item.size > 0)
      puts "#{"\t" * tab}[#{n}] {"
      if(item.class == Array)
        parse_nested_array(item,tab+1)
      else
        puts "#{"\t" * (tab+1)}#{item}"
      end
      puts "#{"\t" * tab}}"
    end
    n += 1
  end
end

parse_nested_array(output)
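
As a quick sanity check, here is what parse_html returns for a contrived fragment (made up for illustration, not taken from the Yahoo page):

# each captured group becomes one element of the flattened array
p parse_html("<tr><td>a</td><td>b</td></tr>", "td")
# => ["a", "b"]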
Notice about this program that about half the code parses the Web page and creates an array of arrays, while the remainder shows the array. The entire task of scraping the page is carried out in the middle of the program.

If you examine the array display created in the latter part of the program, you will see that all the data are placed in an array that can be indexed by table, row and cell. Simply select which array elements you want.
I want to emphasize something. The 21 lines, including spaces and comments, between "# BEGIN processing HTML" and "# END processing HTML" are all that is required to scrape the page. After this, you simply choose which table cells you want to use by indexing the array.
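
For example (the indices below are placeholders; which table actually holds the quote data depends on how Yahoo lays out the page, so pick the indices after looking at the parse_nested_array display):

# dump the third table on the page (index 2), then pull out one cell
output[2].each_with_index do |row, i|
  puts "row #{i}: #{row.join(' | ')}"
end
puts output[2][0][1]   # hypothetical: table 2, row 0, cell 1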
This way of scraping pages is better if you have to post-process the extracted data, or you need a lightweight solution for environments with limited resources, or if you want to exercise detailed control over the scraping process, or if you don't want to try to figure out how to use a large, powerful library that can do absolutely anything, or if you want to learn how to create Ruby programs.
And this way of scraping pages is not for everyone.
Also, I must add, if the Web page contains certain kinds of HTML syntax errors, in particular any unpaired <table>, <tr> or <td> tags, my program will break, and Hpricot probably won't. If, on the other hand, the page is syntactically correct, this program is perfectly adequate to extract the data.
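
To make that failure mode concrete, here is a contrived fragment (not real Yahoo markup) with an unclosed <td>. The non-greedy capture swallows everything up to the next </td>, so the two cells silently come back as one item:

broken = "<tr><td>IBM<td>120.50</td></tr>"
p parse_html(broken, "td")
# => ["IBM<td>120.50"]  -- one merged cell instead of two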
Obligatory editorial comment: Yahoo exists because it can expose you to
advertising. That is the foundation of their business model. When you
scrape pages, you avoid having to look at their advertising. If everyone
did this, for better or worse Yahoo would go out of business (or change
their business model).
Those are the facts. If this page scraping business becomes commonplace, eventually Yahoo and other similar Web sites will choose a different strategy, for example, they might sell subscriptions. Or they might try to do more than they are already doing to discourage scraping. This activity might end up being a contest between the scrapers and the scrapees, with the scrapees making their pages more and more complex.
I think eventually these content providers might put up their content as graphics rather than text, as the spammers are now doing. Then the scrapers would have to invest in OCR to get the content.
This scraping activity isn’t illegal, unless of course you exploit or
re-post the scraped content.
End of editorial.