Hi,
I’m new to Ruby and my company has given me an assignment in Ruby. It
involves HTML extraction. It works fine except for some sites such as
http://www.youtube.com and http://www.gmail.com, where I get the errors
‘400 Bad Request’ and ‘getaddrinfo: Name or service not known
(SocketError)’ respectively. I have heard that this may be because the
URL is being redirected, but I’m not sure about it. My code for HTML
extraction is:
require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'dbi'
puts "Enter domain name :"
domain = gets
# concatenating 'http://www.' with the domain to open the page
url = "http://www." + domain
document = open(url)
# getting the original URL of the site
url2 = document.base_uri.to_s
Can anybody please help? It is urgent. I’ll be really grateful to those
who reply.
Regards,
Arun K.
Arun K. wrote:
Hi,
I’m new to Ruby and my company has given me an assignment in Ruby. It
involves HTML extraction.
You probably want Mechanize.
domain = gets
# concatenating 'http://www.' with the domain to open the page
url = "http://www." + domain
Take a look at that URL – I’d say you don’t need the ‘www’ in it.
But I’m guessing what’s really hurting is the newline at the end of it:
gets returns the input including the trailing newline.
Quick fix:
domain = gets.chomp
url = "http://#{domain}"
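To illustrate the quick fix: gets returns the typed text together with the trailing newline, and that newline then ends up inside the URL. A minimal sketch, assuming a hypothetical helper named build_url (it is not part of the original code) and a simulated input string in place of a real gets call:

```ruby
# build_url is a hypothetical helper name, used only for illustration.
# It removes the trailing newline that `gets` leaves on the input
# before interpolating the domain into the URL.
def build_url(raw_input)
  domain = raw_input.chomp
  "http://#{domain}"
end

# Simulates what `gets` returns when the user types a domain and presses Enter.
typed = "youtube.com\n"

puts build_url(typed).inspect        # URL built from the chomped input
puts "http://www.#{typed}".inspect   # the original approach keeps the newline
```

Printing with inspect makes the stray newline visible, which is easy to miss with a plain puts.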
On Tue, Mar 17, 2009 at 11:28 AM, Arun K. wrote:
Sorry to say David, I tried that but I still get the same error. Is it
because I’ve not set the user agent? Can u please tell me how to set the
user_agent for Mozilla.
http://mechanize.rubyforge.org/mechanize/EXAMPLES_rdoc.html has some
examples of setting the user agent. Google around and see what the
Mozilla user agent string should be -
List of User-Agents (Spiders, Robots, Browser) has an extensive list, for
instance.
Thanks for ur immediate reply
Don’t do that, it’s annoying.
martin
2009/3/17 Arun K.:
Can I use user agents with Hpricot, or can they be used only with
Mechanize? I’ve found a user agent for Mozilla:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR
1.1.4322; .NET CLR 2.0.50727)
But it still shows the same error.
I found this:
http://schf.uc.org/articles/2007/02/14/scraping-gmail-with-mechanize-and-hpricot
It scrapes Gmail. If I remember correctly, that is one of the sites
giving you problems.
Cheers,
Serabe
Martin DeMello wrote:
On Tue, Mar 17, 2009 at 11:28 AM, Arun K. wrote:
Sorry to say David, I tried that but I still get the same error. Is it
because I’ve not set the user agent? Can u please tell me how to set the
user_agent for Mozilla.
http://mechanize.rubyforge.org/mechanize/EXAMPLES_rdoc.html has some
examples of setting the user agent. Google around and see what the
Mozilla user agent string should be -
List of User-Agents (Spiders, Robots, Browser) has an extensive list, for
instance.
Thanks for ur immediate reply
Don’t do that, it’s annoying.
martin
Can I use user agents with Hpricot, or can they be used only with
Mechanize? I’ve found a user agent for Mozilla:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR
1.1.4322; .NET CLR 2.0.50727)
But it still shows the same error.
Martin DeMello wrote:
On Tue, Mar 17, 2009 at 11:55 AM, Arun K. wrote:
Can I use user agents with Hpricot, or can they be used only with
Mechanize?
Hpricot is an HTML parser; I don’t think it concerns itself with
actually fetching the page. Use Mechanize for that.
What’s more, Mechanize doesn’t even use Hpricot anymore – it uses
Nokogiri.
On Tue, Mar 17, 2009 at 11:55 AM, Arun K. wrote:
Can I use user agents with Hpricot, or can they be used only with
Mechanize?
Hpricot is an HTML parser; I don’t think it concerns itself with
actually fetching the page. Use Mechanize for that.
martin
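Since Hpricot only parses, the user agent has to be set wherever the page is actually fetched. The thread recommends Mechanize for that; as a standard-library alternative, the rough sketch below uses open-uri, which accepts request headers as string-keyed options. The names USER_AGENT, request_headers and fetch_page, and the user-agent string itself, are assumptions made for illustration. URI.open assumes a newer Ruby; the 2009 code in this thread used a bare open instead.

```ruby
require 'open-uri'

# Illustrative user-agent string (an assumption); substitute whatever
# string the target site is expected to accept.
USER_AGENT = 'Mozilla/5.0 (compatible; ExampleFetcher/0.1)'

# open-uri treats string-keyed options as HTTP request headers.
def request_headers(user_agent = USER_AGENT)
  { 'User-Agent' => user_agent }
end

# Fetches the page and returns [final_url, html].
# base_uri reflects the URL after any redirects open-uri followed.
def fetch_page(url, user_agent = USER_AGENT)
  URI.open(url, request_headers(user_agent)) do |page|
    [page.base_uri.to_s, page.read]
  end
end

# Usage (needs network access, so it is left commented out):
#   final_url, html = fetch_page('http://example.com')
#   doc = Hpricot(html)   # parsing is a separate step after the fetch
```

Keeping the header in one small method makes it easy to try different user-agent strings without touching the fetching or parsing code.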