Problem with getting info from several websites

Hi there,

The code below gets me the HTML for a specific id on www.securityfocus.com.
What I'm trying to do is get the content of the div id = "vulnerability"
only, but for all the different ids available (currently around 25,000). I
think it is something like next_page('Next', :limit => 25000), but where do
I need to put it, and how can I get the div content only? I appreciate your
help.

require 'rubygems'
require 'hpricot'
require 'open-uri'

# load the Securityfocus page (id 715 to start)
doc = Hpricot(open("http://www.securityfocus.com/bid/715"))

# print the altered HTML
puts doc

-tom

Hi,

Is something like this what you have in mind?

doc = Hpricot(open("http://www.securityfocus.com/bid/715"))
p (doc/'#vulnerability')

George

George,

Thanks, however p (doc/'#vulnerability') still delivers me the whole site…
I only need the content of the div id = "vulnerability". How do I proceed
with that?

So long,
Tom

Hi,

That’s odd. If I do something like:

v = doc/'#vulnerability'

p v.to_html
p v.inner_html
p v.inner_text

I only see results relevant to the content of the 'vulnerability' div
(no header, banners, navigation, etc.). Maybe if you save your results
to a file they will be easier to inspect?

George

Tom B. wrote:

George,

Thanks, however p (doc/'#vulnerability') still delivers me the whole site…
I only need the content of the div id = "vulnerability". How do I proceed
with that?

Does this solve your problem?

require 'rubygems'
require 'hpricot'
require 'open-uri'

doc = Hpricot(open("http://www.securityfocus.com/bid/715"))
p doc/"div[@id='vulnerability']"

If you don’t want to scrape the table further, then the above solution
should be enough - but if you want to go on and drill down the table,
you could check out scRUBYt! (http://scrubyt.org), a Ruby web scraping
tool which is designed to handle such issues.

Cheers,
Peter
__
http://www.rubyrailways.com :: Ruby and Web2.0 blog
http://scrubyt.org :: Ruby web scraping framework
http://rubykitchensink.ca/ :: The indexed archive of all things Ruby.

Sorry, I meant:

p doc/"//div[@id='vulnerability']"

Cheers,
Peter

Tom,

Man, you are mixing pure Hpricot and scRUBYt! together. This syntax:

securityfocus_data.to_xml.write($stdout, 1)

is from scRUBYt!, but you are gathering the data with Hpricot; how
would you like to pull this off?
Maybe I don't get something, but I am a bit confused…

Cheers,
Peter

Hi Peter and George,

I appreciate it, though in the end I went with:

# load the Securityfocus page (id 715 to start)
doc = Hpricot(open("http://www.securityfocus.com/bid/715"))

# get the content of the div id = "vulnerability"
p = (doc/'#vulnerability').inner_html

# print div id = 'vulnerability'
puts p

However, this is only for the record with the id 715. What do I need to
add to fetch the content for all the different ids (1..25000) on
securityfocus.com? Once I have them I'll have an XML created for each
using:

securityfocus_data.to_xml.write($stdout, 1)

Thanks,
Tom

Hi there,

I'm trying to get data from a couple of websites at the same time by
using:

(1..10).each { |p| print p } # should get the content from the pages with
the ids 1 to 10…

# load the Securityfocus page (id 1 to start)
doc = Hpricot(open("http://www.securityfocus.com/bid/1"))

# get the content of the div id = "vulnerability"
p = (doc/'#vulnerability').inner_html

# print div id = 'vulnerability'
puts p

Must I use an array?

Thanks,
Tom

First of all, I'd like to apologize. I got a bit confused myself as I
received various different answers to my questions… So, I now have the
HTML of the content I'm interested in for a single id (715). Two questions:

1. How can I get the HTML for all the ids on
   www.securityfocus.com/bid/? Something like:

securityfocus_data = Scrubyt::Extractor.define do
  # (1..10) as a test, only for the ids 1 to 10
  fetch("www.securityfocus.com/bid")

  link("/html/body/div/div/a") do
    url("href", { :type => :attribute })
  end
  next_page("Next", { :limit => 10 })
end

2. How can I create a text file out of this HTML, using Hpricot if
   possible, so that I have something like:

Title: Berkeley Sendmail Group Permissions Vulnerability
Bugtraq ID: 715
Class: Access Validation Error

May I use something like:

(doc/:result).each do |el|
  title = (el/:title).text
end
Thanks,
Tom
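For the second question, the div's text could be split into label/value lines with plain string handling. This is only a sketch: the `raw` sample below is made up, and the actual layout of the vulnerability table's inner_text would need checking against a real page.

```ruby
# Made-up sample of what (doc/'#vulnerability').inner_text might look like
# once Hpricot strips the table tags: labels and values separated by tabs
# and blank lines.
raw = "Title:\tBerkeley Sendmail Group Permissions Vulnerability\n\n" \
      "Bugtraq ID:\t715\n\nClass:\tAccess Validation Error\n"

# Collapse runs of tabs/spaces into single spaces, drop blank lines, and
# keep one "Label: value" pair per line.
lines = raw.split(/\n+/).map { |l| l.gsub(/[\t ]+/, ' ').strip }.reject(&:empty?)
puts lines
```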

Hello,

Does this work for you?

%w(rubygems hpricot open-uri).each { |e| require e }

(1..10).each do |id|
  doc = Hpricot(open("http://www.securityfocus.com/bid/#{id}"))
  p (doc/'#vulnerability').inner_html
end

George
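One wrinkle with looping over many ids: a lot of the 25,000 bids may not exist, and open-uri raises OpenURI::HTTPError on a 404, which would kill the loop. The rescue pattern can be sketched with a hypothetical fetch_bid standing in for the real Hpricot(open(...)) call:

```ruby
# Hypothetical stand-in for Hpricot(open("http://www.securityfocus.com/bid/#{id}"));
# raises for missing ids, the way open-uri raises OpenURI::HTTPError on a 404.
class MissingBidError < StandardError; end

def fetch_bid(id)
  raise MissingBidError, "404 Not Found" if id.even?  # pretend even ids are missing
  "content of bid #{id}"
end

fetched = []
(1..6).each do |id|
  begin
    fetched << fetch_bid(id)
  rescue MissingBidError => e
    warn "skipping bid #{id}: #{e.message}"  # log and move on to the next id
  end
end

puts fetched.length  # the three odd ids survive
```

In the real loop, rescuing OpenURI::HTTPError around the open call gives the same behavior.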

Hello,

To open and write something into a file I created:

File.open("bid.txt", "w+") do |file|
  file.write("Howdy!")
end

Though, how can I pass in the content of the div id = "vulnerability",
and how can I create a file for each id separately?

Thanks!!

Indeed, I appreciate it!! I actually have two more questions regarding this
topic:

1. I'd like to create either a .txt or an .xml file for each id. Can I
   go ahead and use aFile = File.new("bid.txt", "w") and aFile.close?
2. How can I chomp the \n and \t characters in the terminal output? I
   only need the text without tags, best on separate lines…
   I owe you a pitcher :slight_smile:

# get the content of the div id = "vulnerability", either as text or html
(1..10).each do |id|
  doc = Hpricot(open("http://www.securityfocus.com/bid/#{id}"))
  p (doc/'#vulnerability').inner_text.strip
end

# create a new file
aFile = File.new("bid.txt", "w")

# … code to process the file

# close the file
aFile.close

# print div id = 'vulnerability'
puts p

Hello,

(1..10).each do |id|
  doc = Hpricot(open("http://www.securityfocus.com/bid/#{id}"))
  File.open("#{id}.txt", "w") do |f|
    f << (doc/'#vulnerability').inner_html
  end
end

Should write every page to a different file.

You can use stuff like gsub to replace/remove any unwanted characters.

George
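A minimal sketch of the gsub idea, on a made-up fragment of the kind of HTML inner_html returns: strip the tags with one gsub, then collapse the leftover \n and \t runs with another.

```ruby
# Made-up fragment standing in for (doc/'#vulnerability').inner_html.
html = "<tr>\n\t<td>Class:</td>\n\t<td>Access Validation Error</td>\n</tr>"

# Remove the tags, then squeeze the whitespace runs into single spaces.
text = html.gsub(/<[^>]+>/, ' ').gsub(/\s+/, ' ').strip
puts text  # "Class: Access Validation Error"
```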