Logging in to a page and scraping values

vikashkumar051 · January 12, 2007, 11:31am

I am running a test case in which I first have to log in to a web page,
then go to a particular page on the same web site, and then extract some
data from that page. The data is in a table.

For example, the script first calls http://localhost/login.asp, enters
the user name and password, and clicks the login button. Once logged in,
it goes to http://localhost/achievements.asp, and from that page we want
to extract the data in an HTML table. What would be the right approach
to do this?

I can use the code below to extract the data when I do not have to log
in to the web site.

require 'net/http'

# read the page data

http = Net::HTTP.new('kvcrpf.org', 80)
resp = http.get('/achievements.htm')
page = resp.body

# BEGIN processing HTML

# Return the inner content of every <tag>...</tag> pair found in data.
def parse_html(data, tag)
  data.scan(%r{<#{tag}\s*.*?>(.*?)</#{tag}>}im).flatten
end

output = []
table_data = parse_html(page, "table")
table_data.each do |table|
  out_row = []
  row_data = parse_html(table, "tr")
  row_data.each do |row|
    cell_data = parse_html(row, "td")
    cell_data.each do |cell|
      cell.gsub!(%r{<.*?>}, "")   # strip any tags nested inside the cell
    end
    out_row << cell_data
  end
  output << out_row
end

# END processing HTML

# examine the result

# Print a nested array as an indented, indexed tree.
def parse_nested_array(array, tab = 0)
  n = 0
  array.each do |item|
    if item.size > 0
      puts "#{"\t" * tab}[#{n}] {"
      if item.is_a?(Array)
        parse_nested_array(item, tab + 1)
      else
        puts "#{"\t" * (tab + 1)}#{item}"
      end
      puts "#{"\t" * tab}}"
    end
    n += 1
  end
end

parse_nested_array(output[2][4])

aa, ab, ac, ad = output[2][4]

puts "hello"
puts aa + "\t" + ab + "\t" + ac + "\t" + ad
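
To make the shape of `output` easier to see, here is a small sketch that runs the same regex approach on an inline HTML string instead of a live page. The sample markup and the helper name `extract_tables` are made up for illustration. Note that a header row built from `<th>` cells yields an empty array, because only `<td>` cells are scanned; that is why the printing code above skips items whose size is zero.

```ruby
# Same regex-based helper as in the post above.
def parse_html(data, tag)
  data.scan(%r{<#{tag}\s*.*?>(.*?)</#{tag}>}im).flatten
end

# Hypothetical sample page: one table, one header row, two data rows.
SAMPLE_HTML = <<~HTML
  <table>
    <tr><th>Name</th><th>Year</th></tr>
    <tr><td>Asha</td><td><b>2005</b></td></tr>
    <tr><td>Ravi</td><td>2006</td></tr>
  </table>
HTML

# Build a nested array indexed as tables[table][row][cell].
def extract_tables(html)
  parse_html(html, 'table').map do |table|
    parse_html(table, 'tr').map do |row|
      # Remove any markup nested inside a cell, such as <b>...</b>.
      parse_html(row, 'td').map { |cell| cell.gsub(/<.*?>/m, '').strip }
    end
  end
end

tables = extract_tables(SAMPLE_HTML)
p tables[0][1]   # second row of the first table: ["Asha", "2005"]
```

So an index such as `output[2][4]` in the original script means "third table on the page, fifth row", and the value is the array of cell strings for that row.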

vikashkumar051 · January 13, 2007, 4:29am


The code given above can be used to extract values from a web page if we
don’t have to log in and we know in advance which URL holds the data.
The problem is to first log in to a page, then go to the desired
location and scrape values from it.

Please help me out in doing this.
Thanks in advance
Vikash

vikashkumar051 · January 14, 2007, 5:55pm


There are a few ways of doing this. If you are on Windows, Watir [1] can
help you with the login. The tricky part may be getting the data, but I
am sure there is a method that lets you extract the whole HTML.

[1] http://wtr.rubyforge.org/


vikashkumar051 · January 15, 2007, 4:19am


I am working on the Windows platform. I have tried a lot to first log in
to a web page and then go to the desired page to get some data from it,
but I have been unable to do it.

Anyone’s help will be appreciated.
Thanks
Vikash

vikashkumar051 · January 15, 2007, 7:23am

Try a combination of WWW::Mechanize (gem install mechanize), and Hpricot
(gem install hpricot).

I am new to Mechanize and Hpricot. I have installed both, but I am still
having trouble scraping values by first logging in to the web site and
then going to another page to extract data from it.

Please help me.
Vikash

vikashkumar051 · January 15, 2007, 5:43am


Try a combination of WWW::Mechanize (gem install mechanize), and Hpricot
(gem install hpricot).
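
For reference, here is a minimal standard-library sketch of the login-then-fetch flow that Mechanize automates. It assumes the site tracks the login with a session cookie sent in the response to the login POST; some sites set the cookie earlier or redirect after login, which this sketch does not handle. The helper names `cookie_header` and `fetch_after_login` and the form field names `username` and `password` are placeholders; the real field names have to be read from the login page's HTML source.

```ruby
require 'net/http'

# Turn the Set-Cookie header values of a response into one Cookie
# request header ("name=value; name2=value2"), dropping attributes
# such as path, expires and HttpOnly.
def cookie_header(set_cookie_fields)
  Array(set_cookie_fields).map { |field| field.split(';').first.strip }.join('; ')
end

# Log in with a form POST, then fetch a second page in the same session
# and return its HTML. Nothing is sent until this method is called.
def fetch_after_login(host, login_path, data_path, fields)
  Net::HTTP.start(host, 80) do |http|
    login = Net::HTTP::Post.new(login_path)
    login.set_form_data(fields)
    response = http.request(login)

    page = Net::HTTP::Get.new(data_path)
    cookies = cookie_header(response.get_fields('Set-Cookie'))
    page['Cookie'] = cookies unless cookies.empty?
    http.request(page).body
  end
end

# Example call (field names are assumptions, see above):
#   html = fetch_after_login('localhost', '/login.asp', '/achievements.asp',
#                            'username' => 'me', 'password' => 'secret')
```

The returned HTML can then be passed to whatever table extraction code is already in use. Mechanize does the cookie handling, form lookup and redirects for you, which is why it is usually the more comfortable choice.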

vikashkumar051 · January 15, 2007, 6:05pm

You can also try SWExplorerAutomation (SWEA) from http://webiussoft.com.
SWEA is a .NET API, but it can be used from Ruby through RubyCLR.

example:

require 'rubyclr'
RubyClr::reference 'System'
RubyClr::reference 'SWExplorerAutomationClient'
include SWExplorerAutomation::Client
include SWExplorerAutomation::Client::Controls
include SWExplorerAutomation::Client::DialogControls
explorerManager = ExplorerManager.new
explorerManager.Connect(-1)
explorerManager.LoadProject('google.htp')
explorerManager.Navigate('http://www.google.com/')
scene = explorerManager['Scene_0']
scene.WaitForActive(30000)
scene['q'].Value = 'c#'
scene['btnG'].Click()
scene = explorerManager['Scene_1']
scene.WaitForActive(30000)
explorerManager.DisconnectAndClose()

vikashkumar051 · January 19, 2007, 4:29pm

Hi,

I’ve been successfully using Selenium (check out openqa.org) to do
similar (and more complex) web page interactions, querying, etc. on both
Linux and Windows, using Ruby to drive things. If written thoughtfully,
it’s very easy to get code that runs on both platforms without any code
changes required to migrate between the two.

You get a nice ‘@selenium’ object which has a large set of methods you
can use.

Apologies in advance if you already knew about this.

Kp.

On 1/14/07, Rodrigo B. wrote:

There are a few ways of doing this, if



vikashkumar051 · January 19, 2007, 4:29pm

If you are running on a Windows platform, then you should look at Watir.
It will let you control Internet Explorer and log in to a site.

Luis

vikashkumar051 · January 19, 2007, 4:30pm

Vikash Kumar wrote:

I can use the code below to extract the data when I do not have to log
in to the web site.

In 2 days I am going to release a web extraction toolkit which will do
exactly what you want (and more, of course, but this is a basic use
case). It’s based on Mechanize (which handles login-like steps) and
Hpricot for extracting the relevant content. The scenario you described
is an absolutely typical one, so you could try it with my toolkit.

I will post here an announcement after the release.

Cheers,
Peter

__
http://www.rubyrailways.com