Youtube...urgent, please help

arunvoip · March 17, 2009, 5:45am

Hi,

I’m new to Ruby and my company has given me an assignment in Ruby. It
involves HTML extraction. It works fine except for some sites like
http://www.youtube.com and http://www.gmail.com, where I get the errors
‘400 Bad Request’ and ‘getaddrinfo: Name or service not known
(SocketError)’ respectively. I have read that this may be because the
URL is being redirected, but I’m not sure about it. My code for HTML
extraction is:

require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'dbi'

puts "Enter domain name :"
domain = gets
# concatenating 'http://www.' with the domain to open the page
url = "http://www." + domain
document = open(url)
# getting the original url of the site
url2 = document.base_uri.to_s

Can anybody please help? It is urgent. I’ll be really grateful to those
who reply.

Regards,
Arun K.

arunvoip · March 17, 2009, 6:45am

Arun K. wrote:

Hi,

I’m new to Ruby and my company has given me an assignment in Ruby. It
involves HTML extraction.

You probably want Mechanize.

domain = gets
# concatenating 'http://www.' with the domain to open the page
url = "http://www." + domain

Take a look at that URL – I’d say you don’t need ‘www’ in that.

But I’m guessing what’s hurting is the newline at the end of it.

Quick fix:

domain = gets.chomp
url = "http://#{domain}"
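
To see why the trailing newline matters, here is a minimal sketch of the difference. Note that build_url is a hypothetical helper name, not part of the original script, and stripping a scheme the user may already have typed is an extra assumption on top of the quick fix.

```ruby
# Hypothetical helper (not from the original post): turn raw input into a URL.
def build_url(raw_input)
  domain = raw_input.strip                    # drop the "\n" that gets leaves behind, plus stray spaces
  domain = domain.sub(%r{\Ahttps?://}i, "")   # assumption: tolerate input that already has a scheme
  "http://#{domain}"
end

raw    = "youtube.com\n"         # what gets returns after typing youtube.com and pressing Enter
broken = "http://www." + raw     # the original approach: the newline ends up inside the URL
fixed  = build_url(raw)

p broken   # => "http://www.youtube.com\n"
p fixed    # => "http://youtube.com"
```

With the newline still attached, the host name or request line sent out is malformed, which would be consistent with errors like the ones reported, although other causes are possible.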

arunvoip · March 17, 2009, 7:00am

Sorry to say, David, I tried that but I still get the same error. Is it
because I’ve not set the user agent? Can u please tell me how to set the
user_agent for Mozilla?
Thanks for ur immediate reply

arunvoip · March 17, 2009, 7:12am

On Tue, Mar 17, 2009 at 11:28 AM, Arun K. wrote:

Sorry to say, David, I tried that but I still get the same error. Is it
because I’ve not set the user agent? Can u please tell me how to set the
user_agent for Mozilla?

http://mechanize.rubyforge.org/mechanize/EXAMPLES_rdoc.html has some
examples setting the user agent. Google around and see what the
mozilla user agent should be -
List of User-Agents (Spiders, Robots, Browser) has an extensive list, for
instance.

Thanks for ur immediate reply

Don’t do that, it’s annoying.

martin

arunvoip · March 17, 2009, 7:38am

2009/3/17 Arun K.:

Can I use user agents with Hpricot, or can they be used only with
Mechanize? I’ve found a user agent for Mozilla:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR
1.1.4322; .NET CLR 2.0.50727)
But it is still showing the same error.

I found this:

http://schf.uc.org/articles/2007/02/14/scraping-gmail-with-mechanize-and-hpricot

It scrapes Gmail which, if my memory doesn’t fail me, is one of the
sites giving you problems.

Cheers,

Serabe

arunvoip · March 17, 2009, 7:28am

Can I use user agents with Hpricot, or can they be used only with
Mechanize? I’ve found a user agent for Mozilla:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR
1.1.4322; .NET CLR 2.0.50727)
But it is still showing the same error.

arunvoip · March 17, 2009, 8:26pm

Martin DeMello wrote:

On Tue, Mar 17, 2009 at 11:55 AM, Arun K. wrote:

Can I use user agents with Hpricot, or can they be used only with
Mechanize?

Hpricot is an HTML parser; I don’t think it concerns itself with
actually fetching the page. Use Mechanize for that.

What’s more, Mechanize doesn’t even use Hpricot anymore; it uses
Nokogiri.

arunvoip · March 17, 2009, 7:56am

On Tue, Mar 17, 2009 at 11:55 AM, Arun K. wrote:

Can I use user agents with Hpricot, or can they be used only with
Mechanize?

Hpricot is an HTML parser; I don’t think it concerns itself with
actually fetching the page. Use Mechanize for that.

martin
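
Since the user agent belongs to the fetching step rather than to the parser, here is a rough sketch of setting it with only the standard library, for anyone not ready to switch to Mechanize. build_request is a hypothetical helper name, the URL is just the site from the question, and the user agent string is the one quoted earlier in the thread. Whether a different user agent actually cures the reported errors is not established here.

```ruby
require "net/http"
require "uri"

# User agent string quoted earlier in this thread.
UA = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; " \
     ".NET CLR 1.1.4322; .NET CLR 2.0.50727)"

# Hypothetical helper: build a GET request carrying a custom User-Agent header.
def build_request(uri, user_agent)
  request = Net::HTTP::Get.new(uri)
  request["User-Agent"] = user_agent   # replaces Net::HTTP's default value
  request
end

uri     = URI.parse("http://youtube.com/")
request = build_request(uri, UA)
puts request["User-Agent"]

# Sending it needs network access, so it is left commented out:
# response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
# html = response.body   # hand this string to the HTML parser of your choice
#
# open-uri also accepts header fields as string keys:
# require "open-uri"
# html = URI.open(uri.to_s, "User-Agent" => UA).read
```

The parser only ever sees the HTML string, so the same fetching code works whether the page is then handed to Hpricot or Nokogiri.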