Urgent please: HTTPS HTML parsing


#1

Hi,
Is there any way to extract the HTML of an https:// website with
Hpricot? When I use Hpricot to access an https:// website, I receive the
following error.

/usr/local/lib/site_ruby/1.8/rubygems/custom_require.rb:31:in `gem_original_require': no such file to load -- net/https (LoadError)
from /usr/local/lib/site_ruby/1.8/rubygems/custom_require.rb:31:in `require'
from /usr/lib/ruby/1.8/open-uri.rb:230:in `open_http'
from /usr/lib/ruby/1.8/open-uri.rb:616:in `buffer_open'
from /usr/lib/ruby/1.8/open-uri.rb:164:in `open_loop'
from /usr/lib/ruby/1.8/open-uri.rb:162:in `catch'
from /usr/lib/ruby/1.8/open-uri.rb:162:in `open_loop'
from /usr/lib/ruby/1.8/open-uri.rb:132:in `open_uri'
from /usr/lib/ruby/1.8/open-uri.rb:518:in `open'
from /usr/lib/ruby/1.8/open-uri.rb:30:in `open'
from demo.rb:15:in `valid'
from demo.rb:93

I’m also not able to load the HTML from Gmail, YouTube, etc. Is it
because I’m using Hpricot? Is there any other way to extract https
websites? Please help me.

Regards
Arun K.


#2

Did you require ‘net/https’? It seems that lib is just not loaded or
not present.

On Mar 17, 1:05 pm, Arun K. removed_email_address@domain.invalid


#3

Harm wrote:

Did you require ‘net/https’? It seems that lib is just not loaded or
not present.

On Mar 17, 1:05 pm, Arun K. removed_email_address@domain.invalid

Can you please explain how to include ‘net/https’?

Thanks a lot
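
For reference, here is a minimal sketch of what "requiring net/https" looks like in practice. It assumes the library is installed; `http_client_for` and `fetch_body` are hypothetical helper names made up for this example, not part of Hpricot or the standard library.

```ruby
require 'net/https' # pulls in net/http together with the OpenSSL bindings
require 'uri'

# Build a Net::HTTP client for the given URL and switch on SSL
# when the scheme is https. Nothing is sent until a request is made.
def http_client_for(url)
  uri = URI.parse(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == 'https')
  http
end

# Hypothetical helper: fetch the page body as a string.
# This performs a real network request when called.
def fetch_body(url)
  uri = URI.parse(url)
  http_client_for(url).get(uri.request_uri).body
end
```

The string returned by `fetch_body` could then be handed to a parser, for example `Hpricot(html)`. Certificate verification settings are left at the library defaults here.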


#4

-1 for effort on the part of the poster…

Please go read
http://www.ruby-doc.org/stdlib/libdoc/net/http/rdoc/classes/Net/HTTP.html

and learn about what you are trying to use


#5

Please help. I’ll be really thankful

Regards
Arun K.

A quick Google search for ‘rails https’ yielded this on the 5th entry
found. It seems like exactly what you need to do.

http://railsruby.blogspot.com/2006/02/https-open-uri-basic-authentication.html


#6

Ar Chron wrote:

-1 for effort on the part of the poster…

Please go read
http://www.ruby-doc.org/stdlib/libdoc/net/http/rdoc/classes/Net/HTTP.html

and learn about what you are trying to use

I learned about ‘net/http’ and ‘hpricot’, but it is showing the same
error even for YouTube. The code snippet I used for URL extraction is:

require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'dbi'

class Url
  def valid
    begin
      puts "Enter domain name :"
      domain = gets.chomp
      # Prepend 'http://' to the domain to build the URL of the page to open
      url = "http://#{domain}"
      document = open(url, "User-Agent" => "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; .NET CLR 1.0.3705)")
      # Get the actual URL of the site (it may differ after redirects)
      realUrl = document.base_uri.to_s
    rescue
      puts "Unable to open the URL. Please check if you have entered a valid URL."
    end
    # Return the entered domain and the resolved URL
    [domain, realUrl]
  end
end

I’m able to extract the data from every site except
‘http://www.youtube.com’, ‘gmail.com’ and other ‘https’ sites.
Please help. I’ll be really thankful.

Regards
Arun K.


#7

On Mar 17, 2:37 pm, Ar Chron removed_email_address@domain.invalid wrote:

Please help. I’ll be really thankful

Regards
Arun K.

A quick Google search for ‘rails https’ yielded this on the 5th entry
found. It seems like exactly what you need to do.

http://railsruby.blogspot.com/2006/02/https-open-uri-basic-authentica

It also looks completely out of date: that patch doesn’t look like it
would apply to current versions of Ruby 1.8.6.
If net/https can’t be required, then I would assume this is on a Linux
distribution where Ruby is split into multiple packages, one of which
is usually the one with the SSL stuff in it (libopenssl-ruby in Ubuntu).

Fred
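
A quick way to check whether this is the problem is to try the require and see whether it fails. This is only a sketch; `https_available?` is a hypothetical helper name, and the name of the package to install depends on the distribution.

```ruby
# Report whether this Ruby installation can load its HTTPS support.
# On distributions that split Ruby into several packages, the require
# raises LoadError until the SSL package is installed.
def https_available?
  require 'net/https'
  true
rescue LoadError
  false
end

if https_available?
  puts "net/https loaded"
else
  puts "net/https missing: install your distribution's Ruby OpenSSL package"
end
```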


#8

Frederick C. wrote:

On Mar 17, 2:37 pm, Ar Chron removed_email_address@domain.invalid wrote:

Please help. I’ll be really thankful

Regards
Arun K.

A quick Google search for ‘rails https’ yielded this on the 5th entry
found. It seems like exactly what you need to do.

http://railsruby.blogspot.com/2006/02/https-open-uri-basic-authentica

It also looks completely out of date: that patch doesn’t look like it
would apply to current versions of Ruby 1.8.6.
If net/https can’t be required, then I would assume this is on a Linux
distribution where Ruby is split into multiple packages, one of which
is usually the one with the SSL stuff in it (libopenssl-ruby in Ubuntu).

Fred

Yes, I think so too. As a newcomer to Ruby, I didn’t understand a bit
of the code, and as you said it looks outdated. Do you have any tricks
of the trade to parse HTML content from at least this site?
http://www.youtube.com

I’m receiving an error like this while extracting data from the site:
`open_http': 400 Bad Request (OpenURI::HTTPError)
This is not the error that is displayed in the case of ‘https://’
sites.

Please help

Regards
Arun K.
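
The two failures reported in this thread are different kinds of exception, and it may help to tell them apart in code. The sketch below assumes a hypothetical helper named `describe_failure`. One relevant fact: LoadError is a ScriptError rather than a StandardError, so a bare `rescue` like the one in the snippet from #6 does not catch it, which is why the full trace is printed instead of the friendly message. OpenURI::HTTPError is a StandardError and means the server answered with an error status such as 400.

```ruby
require 'open-uri'

# Run the given block and describe what went wrong, if anything.
def describe_failure
  yield
  "ok"
rescue OpenURI::HTTPError => e
  # The server was reached but replied with an error status.
  "HTTP error from server: #{e.message}"
rescue LoadError => e
  # A library could not be loaded; a bare `rescue` would miss this.
  "missing library: #{e.message}"
end

# Usage (performs a real request when run):
#   puts describe_failure { open("http://www.youtube.com") }
# On newer Rubies, use URI.open instead of open for URLs.
```

This only reports which kind of failure occurred. It does not explain why the server sent 400 Bad Request.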