Youtube...urgent, please help

Hi,

I’m new to Ruby and my company has given me an assignment in Ruby. It is
about HTML extraction. It works fine except for some sites such as
http://www.youtube.com and http://www.gmail.com, where I get the errors
‘400 Bad Request’ and ‘getaddrinfo: Name or service not known
(SocketError)’ respectively. I have read that this may be because the URL
is being redirected, but I’m not sure. My code for the HTML extraction is:

require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'dbi'

puts "Enter domain name :"
domain = gets
# concatenating 'http://www.' with the url to open the page
url = "http://www." + domain
document = open(url)
# getting the original url of the site
url2 = document.base_uri.to_s

Can anybody please help? It is urgent. I’ll be really grateful to those
who reply.

Regards,
Arun K.

Arun K. wrote:

Hi,

I’m new to ruby and my co. has given me an assignment in ruby. It is
regarding html extraction.

You probably want Mechanize.

domain = gets
# concatenating 'http://www.' with the url to open the page
url = "http://www." + domain

Take a look at that URL – I’d say you don’t need ‘www’ in that.

But I’m guessing what’s hurting is the newline at the end of it.

Quick fix:

domain = gets.chomp
url = "http://#{domain}"
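
To make the newline point concrete, here is a minimal sketch that needs no network access. The string `raw_input` stands in for what `gets` returns when someone types a domain and presses Enter, and `build_url` is a hypothetical helper name used only for this illustration.

```ruby
# Simulates what `gets` returns when the user types "youtube.com" and presses Enter.
raw_input = "youtube.com\n"

# Hypothetical helper: strip surrounding whitespace (including the newline)
# before building the URL.
def build_url(input)
  "http://#{input.strip}"
end

broken_url = "http://www." + raw_input  # the trailing "\n" is still there
clean_url  = build_url(raw_input)

p broken_url  # => "http://www.youtube.com\n"
p clean_url   # => "http://youtube.com"
```

`strip` is used here rather than `chomp` so that stray spaces around the input are removed too; `chomp` alone is enough for the newline. Whether this cures the errors reported above is not confirmed anywhere in the thread.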

David M. wrote:

Quick fix:

domain = gets.chomp
url = "http://#{domain}"

Sorry to say, David, I tried that but I still get the same error. Is it
because I’ve not set the user agent? Can u please tell me how to set the
user_agent for Mozilla?
Thanks for ur immediate reply

On Tue, Mar 17, 2009 at 11:28 AM, Arun K. wrote:

Sorry to say David, I tried that but the same error is producing. Is it
because i’ve not set the user agent. Can u please tell me how to set the
user_agent for mozilla.

http://mechanize.rubyforge.org/mechanize/EXAMPLES_rdoc.html has some
examples setting the user agent. Google around and see what the
mozilla user agent should be -
List of User-Agents (Spiders, Robots, Browser) has an extensive list, for
instance.

Thanks for ur immediate reply

Don’t do that, it’s annoying.

martin

2009/3/17 Arun K.:

Can i use user-agents in hpricot? or if it can be used only for
mechanize. I’ve found a user-agent for mozilla :
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR
1.1.4322; .NET CLR 2.0.50727)
But still it is showing the same error.

I found this:

http://schf.uc.org/articles/2007/02/14/scraping-gmail-with-mechanize-and-hpricot

It scrapes Gmail. If my memory serves, that is one of the sites giving
you problems.

Cheers,

Serabe

Martin DeMello wrote:

http://mechanize.rubyforge.org/mechanize/EXAMPLES_rdoc.html has some
examples setting the user agent. Google around and see what the
mozilla user agent should be -
List of User-Agents (Spiders, Robots, Browser) has an extensive list, for
instance.

Can I use user agents in Hpricot, or can they be used only with
Mechanize? I’ve found a user agent for Mozilla:
Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR
1.1.4322; .NET CLR 2.0.50727)
But it still shows the same error.

Martin DeMello wrote:

On Tue, Mar 17, 2009 at 11:55 AM, Arun K. wrote:

Can i use user-agents in hpricot? or if it can be used only for
mechanize.

Hpricot is an html parser, I don’t think it concerns itself with
actually fetching the page. Use mechanize for that.

What’s more, mechanize doesn’t even use hpricot anymore – it uses
nokogiri.
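
Since fetching and parsing are separate steps, the User-Agent belongs to whatever does the fetching. As a rough sketch of that split using only the standard library's open-uri (swapped in here for Mechanize so the example has no gem dependencies): `BROWSER_UA`, `request_headers` and `fetch_page` are names made up for this illustration, and the User-Agent string is just a sample value. On current Ruby versions the call is `URI.open`; the plain `open(url)` used earlier in the thread no longer accepts URLs in Ruby 3.

```ruby
require 'open-uri'

# Sample browser-like User-Agent string, chosen only for illustration.
BROWSER_UA = 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0'

# Hypothetical helper: build the request headers that open-uri will send.
def request_headers(user_agent = BROWSER_UA)
  { 'User-Agent' => user_agent }
end

# Hypothetical helper: fetch a page and return the final URL (after any
# redirects open-uri followed) together with the HTML body as a string.
def fetch_page(url, user_agent = BROWSER_UA)
  URI.open(url, request_headers(user_agent)) do |response|
    [response.base_uri.to_s, response.read]
  end
end
```

A call such as `final_url, html = fetch_page("http://www.youtube.com")` would then give a string that can be handed to the parser, for example `Hpricot(html)` or `Nokogiri::HTML(html)`. Whether a different User-Agent removes the ‘400 Bad Request’ is an open question in this thread, so treat this as something to try rather than a known fix.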

On Tue, Mar 17, 2009 at 11:55 AM, Arun K. wrote:

Can i use user-agents in hpricot? or if it can be used only for
mechanize.

Hpricot is an html parser, I don’t think it concerns itself with
actually fetching the page. Use mechanize for that.

martin