Using Nokogiri to scrape multiple websites

Hi everyone,

I’m looking to use Nokogiri to scrape about 10 websites for their anchor
texts and output them on screen. What would be the best way to achieve
this?

I have tried doing something like this without much luck…

def index
  sites = ["site1.com", "site2.com", "site3.com"]
  @textlinks = []
  sites.each do |site|
    @textlinks << scrape(site)
  end
end

def scrape(website)
  require 'open-uri'
  require 'nokogiri'

  doc = Nokogiri::HTML(open(website))

  doc.xpath('//a')
end

Thanks

On Sat, Sep 4, 2010 at 4:24 PM, Ryan M. [email protected] wrote:

What exactly is the problem? You need to write the full URL starting
with http:// so that open-uri works correctly. After that, scrape will
return an array of Nokogiri elements, each of them representing a link.
You are then putting each of these arrays into another array called
@textlinks. In order to output the links to the screen, take a look
at the to_html method of Nokogiri::XML::Element.
This worked for me:

sites = %w{http://www.google.com http://www.yahoo.com}
links = []
# the scrape method is the one you wrote above
sites.each {|site| links.concat(scrape(site))}
links.each {|link| puts link.to_html}
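The reason for `concat` rather than `<<` here: `<<` appends its argument as a single element, so you would end up with an array of arrays, while `concat` splices the elements in flat. A minimal sketch in plain Ruby (no Nokogiri needed):

```ruby
# << appends its argument as ONE element, nesting the arrays:
nested = []
nested << [1, 2]
nested << [3]
# nested is now [[1, 2], [3]]

# concat splices the other array's elements in, keeping it flat:
flat = []
flat.concat([1, 2])
flat.concat([3])
# flat is now [1, 2, 3]
```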

Hope this helps,

Jesus.

Hi Jesús,

I’m looking to output the information to an .html document (using the
Rails framework) and I’m getting the following error: can’t convert
Fixnum into Array

Also, what I’m actually trying to do is scrape each of the websites
to see whether they contain links with specific text, so I would need to
pass in a list of about 3-4 keywords for each of the domains.

So something like

def index
  keywords = %w{accounts resources membership}
  sites = %w{http://www.google.com http://www.yahoo.com}
  links = []
  sites.each {|site| links.concat(scrape(site, keywords))}
end

def scrape(website, inputtext)
  require 'open-uri'
  require 'nokogiri'

  doc = Nokogiri::HTML(open(website))

  for sample in doc.xpath('//a')
    if sample.text == inputtext
      keywords = doc.xpath('//a')
    else
      keywords = "MISSING"
    end
  end
end

Thanks for your time.

McKenzie

On Mon, Sep 6, 2010 at 5:01 PM, Ryan M. [email protected] wrote:

So you want to iterate twice: for each site, search for links that
contain the specified word? Do you want to also record which
word and site each result comes from? If so, I’d do something like:

def index
  keywords = %w{accounts resources membership}
  sites = %w{http://www.google.com http://www.yahoo.com}
  links_by_site = Hash.new {|h,k| h[k] = {}}
  sites.each do |site|
    keywords.each do |keyword|
      links_by_site[site][keyword] = scrape(site, keyword)
    end
  end
  links_by_site
end
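The Hash.new {|h,k| h[k] = {}} default block is what makes the nested assignment work: reading a missing key installs (and stores) an empty inner hash first, so you never have to initialize each site yourself. A small self-contained sketch:

```ruby
# A hash whose default block creates and stores an empty inner hash
# the first time an unknown key is accessed.
links_by_site = Hash.new { |h, k| h[k] = {} }

# No need to create links_by_site["example.com"] first:
links_by_site["example.com"]["accounts"]  = ["<a>accounts</a>"]
links_by_site["example.com"]["resources"] = []

# links_by_site is now
# {"example.com"=>{"accounts"=>["<a>accounts</a>"], "resources"=>[]}}
```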

def scrape(website, inputtext)
  require 'open-uri'   # these could maybe go at the start of the script
  require 'nokogiri'

  regex = /#{inputtext}/
  links_that_match = []
  doc = Nokogiri::HTML(open(website))
  doc.xpath('//a').each do |link|
    if regex =~ link.inner_text
      links_that_match << link.to_html
    end
  end
  links_that_match
end

Untested, but it can give you some ideas. The resulting hash will have
something like:

{"http://www.google.com" => {"accounts" => [...], "resources" => [...], "membership" => [...]},
 "http://www.yahoo.com" => {...}}

Jesus.

Jesús Gabriel y Galán wrote:

That works great! Thank you.

Instead of pulling the items from a hash, though, I would really
like to try pulling them from a database for when the list gets extremely
large. I’ve tried using the hash to pull from a variable, but it produces
an error which says the hash is an odd length. It is only going to be a
flat-table database, so all of the data will be read from
@backlinks.title (the keyword(s)) and @backlinks.permalink (for the site):

def index
  @links = Hash.new { |ha,lnk| ha[lnk] = {} }
  @backlinks = Backlink.find(:all)
  keywords = %w{@backlinks.concat(title)}
  sites = %w{@backlinks.concat(permalink)}
  links_by_site = Hash.new {|h,k| h[k] = {}}
  sites.each do |site|
    keywords.each do |keyword|
      @links[site][keyword] = scrape(site, keyword)
    end
  end
end

Thanks again.

McKenzie

On Tue, Sep 7, 2010 at 11:42 AM, Ryan M. [email protected] wrote:

an error which says the hash is an odd length.
I don’t understand what you mean here.

It is only going to be a
flat table database so all of the data will be called under
@backlinks.title (the keyword(s)), @backlinks.permalink (for the site)

def index
  @links = Hash.new { |ha,lnk| ha[lnk] = {} }
  @backlinks = Backlink.find(:all)
  keywords = %w{@backlinks.concat(title)}
  sites = %w{@backlinks.concat(permalink)}

irb(main):004:0> keywords = %w{@backlinks.concat(title)}
=> ["@backlinks.concat(title)"]

You probably mean:

keywords = @backlinks.map {|bl| bl.title}
sites = @backlinks.map {|bl| bl.permalink}

but I don’t know exactly what @backlinks is (probably an ActiveRecord?)
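Assuming Backlink really is an ActiveRecord model with title and permalink columns, the map idea can be sketched with a plain Struct standing in for the model (the Struct and sample records are invented so the example is self-contained):

```ruby
# Stand-in for an ActiveRecord Backlink record (title + permalink columns).
Backlink = Struct.new(:title, :permalink)

backlinks = [
  Backlink.new("accounts",  "http://www.google.com"),
  Backlink.new("resources", "http://www.yahoo.com")
]

# map extracts one attribute per record into a plain array.
keywords = backlinks.map { |bl| bl.title }
sites    = backlinks.map { |bl| bl.permalink }

# keywords == ["accounts", "resources"]
# sites    == ["http://www.google.com", "http://www.yahoo.com"]
```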

links_by_site = Hash.new {|h,k| h[k] = {}}
sites.each do |site|
  keywords.each do |keyword|
    @links[site][keyword] = scrape(site, keyword)
  end
end

This code produces the error: “hash is an odd number length”?

Jesus.
