Hi there,
The code below gives me the HTML for a specific id on
www.securityfocus.com. What I'm trying to do is get the info from the
div with id="vulnerability" only, but for all the available ids -
currently around 25,000. I think it is something like
next_page 'Next', :limit => 25000, but where do I need to put it, and
how can I get only that div's content? I appreciate your help.
require 'rubygems'
require 'hpricot'
require 'open-uri'

# load the Securityfocus page (bid 715 to start)
doc = Hpricot(open("http://www.securityfocus.com/bid/715"))

# print the parsed HTML
puts doc
-tom
Hi,
Is something like this what you have in mind?
doc = Hpricot(open("http://www.securityfocus.com/bid/715"))
p(doc/'#vulnerability')
George
George,
Thanks, however p(doc/'#vulnerability') still delivers the whole
site... I only need the content of the div with id="vulnerability".
How do I proceed with that?
So long,
Tom
--
Hi,
That’s odd. If I do something like:
v = doc/'#vulnerability'
p v.to_html
p v.inner_html
p v.inner_text
I only see results relevant to the content of the ‘vulnerability’ div
(no header, banners, navigation, etc). Maybe if you save your results
to a file it would be easier to inspect by looking at them?
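For example, something along these lines should dump just that fragment
to a file you can open on its own (a minimal sketch; the output file
name is arbitrary):

require 'rubygems'
require 'hpricot'
require 'open-uri'

doc = Hpricot(open("http://www.securityfocus.com/bid/715"))
v = doc/'#vulnerability'

# write only the matched fragment, not the whole page
File.open("vulnerability_715.html", "w") do |f|
  f << v.to_html
end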
George
Tom B. wrote:
George,
Thanks, however p(doc/'#vulnerability') still delivers the whole
site... I only need the content of the div with id="vulnerability".
How do I proceed with that?
Does this solve your problem?
require 'rubygems'
require 'hpricot'
require 'open-uri'

doc = Hpricot(open("http://www.securityfocus.com/bid/715"))
p doc/"div[@id='vulnerability']"
If you don’t want to scrape the table further, then the above solution
should be enough - but if you want to go on and drill down the table,
you could check out scRUBYt! (http://scrubyt.org), a Ruby web scraping
tool which is designed to handle such issues.
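For what it's worth, Hpricot also has an at method that returns just
the first match, which can be handy when you expect exactly one such
div on the page - a small sketch along those lines:

require 'rubygems'
require 'hpricot'
require 'open-uri'

doc = Hpricot(open("http://www.securityfocus.com/bid/715"))

# doc/'#vulnerability' returns a collection of matches;
# doc.at returns only the first matching element (or nil if none)
div = doc.at("div#vulnerability")
puts div.inner_html if div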
Cheers,
Peter
__
http://www.rubyrailways.com :: Ruby and Web2.0 blog
http://scrubyt.org :: Ruby web scraping framework
http://rubykitchensink.ca/ :: The indexed archive of all things Ruby.
Sorry, I meant:
p doc/"//div[@id='vulnerability']"
Cheers,
Peter
__
http://www.rubyrailways.com :: Ruby and Web2.0 blog
http://scrubyt.org :: Ruby web scraping framework
http://rubykitchensink.ca/ :: The indexed archive of all things Ruby.
Tom,
Man, you are mixing pure Hpricot and scRUBYt! together. This syntax:
securityfocus_data.to_xml.write($stdout, 1)
is from scRUBYt!, but you are gathering the data with Hpricot - how
would you like to pull this off?
Maybe I'm missing something, but I am a bit confused...
Cheers,
Peter
__
http://www.rubyrailways.com :: Ruby and Web2.0 blog
http://scrubyt.org :: Ruby web scraping framework
http://rubykitchensink.ca/ :: The indexed archive of all things Ruby.
Hi Peter and George,
I appreciate it, though in the end I went with:
# load the Securityfocus page (bid 715 to start)
doc = Hpricot(open("http://www.securityfocus.com/bid/715"))

# get the content of the div id="vulnerability"
p = (doc/'#vulnerability').inner_html

# print the div's content
puts p
However, this is only for the record with id 715. What do I need to
add to fetch the content for all the different ids (1..25000) on
securityfocus.com? Once I have them, I'll create an XML file for each
using:
securityfocus_data.to_xml.write($stdout, 1)
Thanks,
Tom
--
Hi there,
I'm trying to get data from several pages at once by using:
(1..10).each { |p| print p } # should get the content from the pages with ids 1 to 10
# load the Securityfocus page (bid 1 to start)
doc = Hpricot(open("http://www.securityfocus.com/bid/1"))

# get the content of the div id="vulnerability"
p = (doc/'#vulnerability').inner_html

# print the div's content
puts p
Must I use an array?
Thanks,
Tom
First of all, I'd like to apologize; I got a bit confused myself as I
received several different answers to my questions. I now have the
HTML content I'm interested in for a single id (715). Two questions:
1) How can I get the HTML for all the ids on
www.securityfocus.com/bid/? Something like:
securityfocus_data = Scrubyt::Extractor.define do
  (1..10) # as a test, only for ids 1 to 10
  fetch("www.securityfocus.com/bid")
  ...
  link("/html/body/div/div/a") do
    url("href", { :type => :attribute })
  end
  next_page("Next", { :limit => 10 })
end
2) How can I create a text file out of this HTML (using Hpricot, if
possible) so that I have something like:
Title: Berkeley Sendmail Group Permissions Vulnerability
Bugtraq ID: 715
Class: Access Validation Error
…
May I use something like:
(doc/:result).each do |el|
title = (el/:title).text
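I was also wondering whether something like this might work, assuming
the info inside the div is laid out as a table of label/value rows (I
haven't checked the exact markup, so that's just a guess):

require 'rubygems'
require 'hpricot'
require 'open-uri'

doc = Hpricot(open("http://www.securityfocus.com/bid/715"))

# assumes each row of the table inside the div holds a label cell and a value cell
((doc/'#vulnerability')/'tr').each do |row|
  cells = (row/'td').map { |cell| cell.inner_text.strip }
  puts cells.join(': ') unless cells.empty?
end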
Thanks,
Tom
--
Hello,
Does this work for you?
%w(rubygems hpricot open-uri).each { |e| require e }

(1..10).each do |id|
  doc = Hpricot(open("http://www.securityfocus.com/bid/#{id}"))
  p (doc/'#vulnerability').inner_html
end
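And if you want each record as XML rather than raw HTML (you mentioned
to_xml.write($stdout, 1) earlier), here is a rough sketch using REXML,
which ships with Ruby; the element names and wrapping structure are
just placeholders, not anything scRUBYt produces:

require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'rexml/document'

(1..10).each do |id|
  doc = Hpricot(open("http://www.securityfocus.com/bid/#{id}"))
  fragment = (doc/'#vulnerability').inner_html

  # wrap the scraped fragment in a tiny XML document and print it
  xml = REXML::Document.new
  root = xml.add_element('vulnerability')
  root.add_element('bid').text = id.to_s
  root.add_element('content').text = fragment
  xml.write($stdout, 1)
end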
George
Hello,
To open and write something into a file I created:
File.open("bid.txt", "w+") do |file|
  file.write("Howdy!")
end
However, how can I pass in the content of the div id="vulnerability",
and how can I create a separate file for each id?
Thanks!!
Indeed, I appreciate it! I actually have two more questions on this
topic:
1) I'd like to create either a .txt or an .xml file for each id. Can I
go ahead and use aFile = File.new("bid.txt", "w") and aFile.close?
2) How can I get rid of the \n and \t characters in the terminal
output? I only need the text without tags, ideally on separate lines.
I owe you a pitcher
# get the content of the div id="vulnerability", either as text or html
p (doc/'#vulnerability').inner_text.strip
end

# create a new file
aFile = File.new("bid.txt", "w")
# ... code to process the file
# close the file
aFile.close

# print the div's content
puts p
--
Hello,
(1..10).each do |id|
  doc = Hpricot(open("http://www.securityfocus.com/bid/#{id}"))
  File.open("#{id}.txt", "w") do |f|
    f << (doc/'#vulnerability').inner_html
  end
end
Should write every page to a different file.
You can use stuff like gsub to replace/remove any unwanted characters.
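For example, a rough way to squeeze the tabs and blank lines out of the
inner_text (just one possible clean-up; adjust the patterns to taste):

require 'rubygems'
require 'hpricot'
require 'open-uri'

doc = Hpricot(open("http://www.securityfocus.com/bid/715"))
text = (doc/'#vulnerability').inner_text

# collapse runs of spaces/tabs, strip each line and drop the empty ones
lines = text.gsub(/[ \t]+/, ' ').split("\n")
lines = lines.map { |line| line.strip }.reject { |line| line.empty? }
puts lines.join("\n")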
George