Scraping web pages for Cisco products

I submitted a post a few days ago about scraping the web for Cisco
products. I didn't receive much input, so I thought I would ask
again. Here are the requirements. I have a list of 2000 URLs that all
have Cisco in their domain names.
(e.g. http://www.soldbycisco.net
http://www.ciscoindia.net
http://www.ciscobootcamp.net
http://www.cisco-guy.net)

and I want to scrape through them and determine which websites are
selling new Cisco products. I'm actually looking for around 20 or so
products (e.g. WIC-1T, NM-4E, WS-G2950-24). One idea I was given was to
split the pages into ones with forms and those without forms. Those
without forms probably won't have anything for sale, so I can eliminate
those. But then I really don't know how to proceed after that. Does
anyone have a different/better approach? Any help would be appreciated.

On 9/19/07, Chuck D. [email protected] wrote:

http://www.cisco-guy.net

Not to make your problem worse, but you will need to differentiate
between new and used equipment too.


"Hey brother Christian with your high and mighty errand, Your actions
speak so loud, I can't hear a word you're saying."

-Greg Graffin (Bad Religion)

On 9/19/07, Glen H. [email protected] wrote:

http://www.ciscoindia.net

I don't remember who, but someone suggested using Froogle and parsing
that output. Froogle and a few other sites like Pricewatch might be a
far less complicated approach. You won't find all of them, but then
again I don't think you can possibly find everything anyway.



Quoth Glen H.:

(ex. http://www.soldbycisco.net
anyone have a different/better approach? Any help would be appreciated.


That was me. Seems to me you shouldn't parse Froogle so much as just
use it. Writing a script is a lot more work and won't get you what you
want; Froogle will.

Konrad M. wrote:

Quoth Glen H.:

(ex. http://www.soldbycisco.net
anyone have a different/better approach? Any help would be appreciated.


That was me. Seems to me you shouldn't parse Froogle so much as just
use it. Writing a script is a lot more work and won't get you what you
want; Froogle will.

But see, I need to use only the list that I have with Cisco in the
domain name (e.g. usedcisco.com, ciscoequipment.com). Can Froogle look
up website names like the ones I have?

On 9/19/07, Chuck D. [email protected] wrote:


Why is the domain important if you are looking for fraudulent
equipment based on selling price? I don't think you can search by URL,
and I don't see why anyone looking for a specific product would need
to do that.



Quoth Chuck D.:

Konrad M. wrote:

Quoth Glen H.:

(ex. http://www.soldbycisco.net
anyone have a different/better approach? Any help would be appreciated.

But see, I need to use only the list that I have with Cisco in the
domain name (e.g. usedcisco.com, ciscoequipment.com). Can Froogle look
up website names like the ones I have?

Assuming it uses a similar interface to Google (I don't know much
about it), yes: "site:usedcisco.com" etc.

Why do you need the list? Just search for anything below 60% of MSRP,
and ANY website selling counterfeit Cisco devices should come up.
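The below-60%-of-MSRP filter suggested above is easy to sketch once you
have a price list for the parts in question. The list prices below are
invented placeholders, not real Cisco figures:

```ruby
# Hypothetical sketch of the "anything below 60% of MSRP" filter.
# These MSRP values are made-up placeholders, not real list prices.
MSRP = {
  "WIC-1T"      => 400.0,
  "NM-4E"       => 1200.0,
  "WS-G2950-24" => 2000.0
}

# Flag a listing as suspicious when its price is under 60% of MSRP.
# Unknown part numbers are never flagged.
def suspicious?(part, price)
  msrp = MSRP[part]
  !msrp.nil? && price < 0.6 * msrp
end
```

For example, `suspicious?("WIC-1T", 199.0)` is true (199 is below 60%
of 400), while `suspicious?("WIC-1T", 350.0)` is false.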

Glen H. wrote:

On 9/19/07, Chuck D. [email protected] wrote:


Why is the domain important if you are looking for fraudulent
equipment based on selling price? I don't think you can search by URL,
and I don't see why anyone looking for a specific product would need
to do that.



I'm looking for copyright infringement on Cisco's name too. So I'm not
only looking for those companies that are selling counterfeit Cisco
equipment but also those who are infringing on Cisco's name.

On 9/19/07, Chuck D. [email protected] wrote:

One idea I was given was to
split the pages into ones with forms and those without forms. Those
without forms probably won't have anything for sale, so I can eliminate
those. But then I really don't know how to proceed after that.

Here’s a naive implementation of binning by forms:

cat sites
www.cnn.com
www.usedcisco.com
www.rubyforge.org
slashdot.org
technocrat.net
bk.com

cat firstbin.rb
#!/usr/bin/env ruby

require 'rubygems'
require 'mechanize'

agent = WWW::Mechanize.new

sites = File.readlines("sites")
bin1 = []
bin2 = []
bin3 = []

sites.each do |site|
  site.chomp!

  page = agent.get "http://#{site}"
  forms = page.forms
  # a form counts as a search form if its name or action mentions "search"
  search_forms = forms.select { |f|
    (f.name and f.name.match(/search/i)) or
    (f.action and f.action.to_s.match(/search/i))
  }

  if search_forms.size > 0
    bin1 << site
  elsif forms.size > 0
    bin2 << site
  else
    bin3 << site
  end
end

p bin1
p bin2
p bin3

ruby firstbin.rb
["www.cnn.com", "www.rubyforge.org", "slashdot.org"]
["www.usedcisco.com", "technocrat.net"]
["bk.com"]

On 9/19/07, Chuck D. [email protected] wrote:

why anyone looking for a specific product would need to do that.
I’m looking for copywright infrigment on Cisco’s name 2. So I’m not only
looking for those companies that are selling Cisco counterfeit equipment
but also those who are infringing on Cisco’s name as well.


Kind of smelled like it. Too heavy-handed for me, sorry.



With this method, do I need to know the name of the form to use it?
With Mechanize I thought you had to look at the form name first before
you could use it.

It helps to know some way to distinguish the form you're looking for
from the other forms on the page. It would be possible to iterate
through all the forms on a page, entering some text into the text
fields in each form and submitting them; but most of the time the
script would probably be in either the wrong form or the wrong field
in the right form (and, of course, there are other issues, e.g. forms
that require multiple fields to be filled in). I don't see any way to
avoid customizing the code for each site (though, if you get a good
framework built, the effort per site should decrease).
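One way to keep that per-site customization manageable, sketched here
with plain page text instead of real Mechanize objects: a hash of
site-specific handler lambdas with a generic fallback, so most sites
share one code path and only the awkward ones get custom code. The
domain used for the override is illustrative.

```ruby
# Hypothetical sketch of a per-site handler registry. Each handler
# takes the page text and returns the part numbers it finds; sites
# without a custom handler fall back to a generic regexp scan.
PARTS = /WIC-1T|NM-4E|WS-G2950-24/

HANDLERS = Hash.new(->(text) { text.scan(PARTS) })

# Site-specific override (domain is illustrative, not from the list).
HANDLERS["usedcisco.com"] = lambda do |text|
  # a real handler would drive this site's search form first;
  # this stub just reuses the generic scan as a placeholder
  text.scan(PARTS)
end

def products_for(site, text)
  HANDLERS[site].call(text)
end
```

With this shape, building the framework means writing `products_for`
once and adding one small lambda per site that needs special handling.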

unknown wrote:

On 9/19/07, Chuck D. [email protected] wrote:

One idea I was given was to
split the pages into ones with forms and those without forms. Those
without forms probably won't have anything for sale, so I can eliminate
those. But then I really don't know how to proceed after that.

With this method, do I need to know the name of the form to use it?
With Mechanize I thought you had to look at the form name first before
you could use it.

unknown wrote:

On 9/19/07, Chuck D. [email protected] wrote:

One idea I was given was to
split the pages into ones with forms and those without forms. Those
without forms probably won't have anything for sale, so I can eliminate
those. But then I really don't know how to proceed after that.

Here’s a naive implementation of binning by forms:

page = agent.get "http://#{site}"
forms = page.forms
search_forms = forms.select { |f|
  (f.name and f.name.match(/search/i)) or
  (f.action and f.action.to_s.match(/search/i))
}

if search_forms.size > 0
  bin1 << site
elsif forms.size > 0
  bin2 << site
else
  bin3 << site
end
end

I'm checking the size of the forms like in the code above, but when it
gets to the 13th URL the script just exits. Does anyone know why? How
can I debug this?
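A likely cause, though it's a guess without the error output: Mechanize
raises an exception on failed fetches (DNS errors, timeouts, non-2xx
responses), and nothing in the loop rescues it, so the first bad site
kills the script. A minimal sketch of a fetch wrapper that logs the
failure and lets the loop continue:

```ruby
# Sketch: rescue fetch errors so one bad site doesn't stop the run.
# Returns the page, or nil if the fetch raised; failed sites are
# collected so you can inspect them afterwards.
def fetch_page(agent, site, failed)
  agent.get("http://#{site}")
rescue StandardError => e
  warn "#{site}: #{e.class}: #{e.message}"
  failed << site
  nil
end
```

In the loop you would then write `page = fetch_page(agent, site,
failed)` followed by `next if page.nil?`. Printing each site before
fetching would also show exactly which URL is the 13th and what it
raises.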

On 9/19/07, Chuck D. [email protected] wrote:

I submitted a post a few days ago about scraping the web for Cisco
products. I didn't receive much input, so I thought I would ask
again. Here are the requirements. I have a list of 2000 URLs that all
have Cisco in their domain names.
(e.g. http://www.soldbycisco.net
http://www.ciscoindia.net
http://www.ciscobootcamp.net
http://www.cisco-guy.net)

I suspect that if Cisco has a problem with counterfeit products that
hurt their long-term bottom line, it would most certainly come from
web sites that do not have the word Cisco in the DNS name.

You should have asked about scraping for some more generic term, maybe?

There are basically two things that bother me with your question:

1. There is something fundamentally wrong with using an open source
product to protect the integrity of a select few relatively expensive
products.

2. An employee of Cisco would have no problem securing funds for a
proposal that was delivered at the hardware level (unless Cisco is
having some monetary problems I'm not aware of). If you don't know
what I'm talking about, then I'll shut up.

Todd

unknown wrote:

With this method, do I need to know the name of the form to use it?
With Mechanize I thought you had to look at the form name first before
you could use it.

It helps to know some way to distinguish the form you're looking for
from the other forms on the page. It would be possible to iterate
through all the forms on a page, entering some text into the text
fields in each form and submitting them; but most of the time the
script would probably be in either the wrong form or the wrong field
in the right form (and, of course, there are other issues, e.g. forms
that require multiple fields to be filled in). I don't see any way to
avoid customizing the code for each site (though, if you get a good
framework built, the effort per site should decrease).

I agree, but I have around 2000 sites to look at and I can't look at
each and every form; that would take way too long. Do you think a
better approach would be to use a search engine's API to search for
the products on each site? I've never used any search engine API. If I
know the website name, the product name, and a price I want, can I use
those parameters in the search to find results?
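Most search APIs won't take a list of URLs directly, but they do accept
free-text queries, so you can approximate the list restriction by
issuing one site:-scoped query per (domain, product) pair. A sketch of
the query construction; the endpoint here is a placeholder, not a real
search API:

```ruby
require "cgi"

# Hypothetical sketch: build one site:-restricted query URL per
# (domain, product) pair. search.example.com is a placeholder, not a
# real search API endpoint; real APIs also need keys and have quotas.
def search_url(domain, part)
  query = %(site:#{domain} "#{part}")
  "https://search.example.com/search?q=#{CGI.escape(query)}"
end

domains = %w[usedcisco.com ciscoequipment.com]
parts   = %w[WIC-1T NM-4E]

# one query per (domain, part) combination: 4 URLs here, ~40,000 for
# 2000 domains and 20 parts, so mind the API's rate limits
urls = domains.product(parts).map { |d, p| search_url(d, p) }
```

Price is harder: search engines treat price text inconsistently, so it
is probably safer to fetch the result pages and compare prices yourself
(e.g. against the MSRP filter discussed earlier in the thread).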