Screen Scraping Advice

I work for Cisco Systems in San Jose, CA. I proposed a project to perform
a screen-scrape/spider hack to go out and look for websites with the
Cisco name in their domain (e.g. usedcisco.com, ciscoequipment.com,
etc.) and see whether those companies are selling Cisco equipment. I want to
look for specific products (e.g. WIC-1T, NM-4E, WS-2950-24) on these
websites and see if they are being sold for under 60% of their MSRP. We
are trying to track down companies that are selling counterfeit
equipment. I started by downloading the DNS list of all domain names
so I can read through it and extract every domain name with Cisco in
it. Once I've done that I want to go to each page and search/scrape for
these products, but I don't really know the best approach to take. Can
anyone give me advice? Should I just do keyword searches for those 20+
products? Or is there a better approach?

On Sep 17, 2007, at 12:25 PM, Charles P. wrote:

Doesn't sound like much scraping, just searching text for a string.
You could even do a lot of that work with Google, or just download the
file and search for the string, then create a data file of your own that
records which line the string was found on.
Scraping is really for getting data from other sites, using the DOM
structure they have, to get (for example) the weather report.
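
Something like this, assuming the zone dump is a plain text file with one
domain per line (the file names here are made up):

# read the zone dump, record every line that mentions "cisco"
# along with its line number
File.open("cisco_domains.txt", "w") do |out|
  File.open("com_zone_dump.txt") do |f|
    f.each_with_index do |line, i|
      out.puts "#{i + 1}: #{line.strip}" if line =~ /cisco/i
    end
  end
end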

John J. wrote:

Doesn't sound like much scraping, just searching text for a string.
You could even do a lot of that work with Google.
Scraping is really for getting data from other sites, using the DOM
structure they have, to get (for example) the weather report.

Well, I disagree. Once I have all the websites with Cisco in their domain
names and I look through them, there are lots of pages that won't show me
any info unless I do a search within the site itself (e.g. usedcisco.com).
To search for specific items on such a site I would have to use the
search bar on the page itself to search for, say, "WIC-1T" and then
check whether the price for that item is below a specific amount.

Quoth Chuck D.:

To search for specific items on such a site I would have to use the
search bar on the page itself to search for, say, "WIC-1T" and then
check whether the price for that item is below a specific amount.

Do a search on Froogle for "cisco productname" with the max price set at
60% of MSRP. That should turn up a few hits.

HTH,

On 9/17/07, Charles P. [email protected] wrote:

I don't really know the best approach to take. Can anyone give me
advice? Should I just do keyword searches for those 20+ products? Or is
there a better approach?

If someone knows of a super library that can recognize and interact
with arbitrary search forms, I would love to see it :)

My first suggestion would be to write a simple script using Mechanize
to connect to the homepage of each site in an input list and check for
any forms. Bin the sites into three groups: no forms at all, at least one
form matching the regex /search/i, and some forms but none matching. Then
start by focusing on the ones which appear to have some sort of search
form (which may be a small or a large subset :-).
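
Something along these lines, assuming a plain-text input list of domains,
one per line (the file name is made up), and the mechanize gem:

require "rubygems"
require "mechanize"

agent = WWW::Mechanize.new
agent.user_agent_alias = "Mac Safari"

# bin each site by what kind of forms its homepage has
bins = { :no_forms => [], :search_form => [], :other_forms => [] }

File.readlines("cisco_domains.txt").each do |domain|
  domain = domain.strip
  next if domain.empty?
  begin
    page = agent.get("http://#{domain}/")
  rescue StandardError => e
    warn "#{domain}: #{e.message}"
    next
  end

  forms = page.forms
  if forms.empty?
    bins[:no_forms] << domain
  elsif forms.any? { |f| "#{f.name} #{f.action}" =~ /search/i }
    bins[:search_form] << domain
  else
    bins[:other_forms] << domain
  end
end

bins.each { |name, sites| puts "#{name}: #{sites.size} site(s)" }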

On Sep 17, 2007, at 1:52 PM, Chuck D. wrote:

To search for specific items on such a site I would have to use the
search bar on the page itself to search for, say, "WIC-1T" and then
check whether the price for that item is below a specific amount.

What I mean is, scraping usually relies on the document's structure
in some way. Without looking at the structure a given site uses
(or a given page, if it isn't a templated, dynamically generated page)
there is no way to know what corresponds to what. Page structure is
pretty arbitrary. Presentation and structure don't necessarily
correspond well, or in a way you could guess.
Ironically, the better their web designers, the easier it will be.

But if you are talking about searching a dynamically generated site,
you still have to find out whether it has a search mechanism and what
it calls the form fields and submit buttons. The names in HTML can be
arbitrary, especially if they use graphic buttons.
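
With the Mechanize approach suggested above, a quick way to see what a
site actually calls things (assuming "page" is a homepage you have
already fetched with WWW::Mechanize):

# list every form on the page along with its field and button names
page.forms.each_with_index do |form, i|
  puts "form #{i}: name=#{form.name.inspect} action=#{form.action.inspect}"
  form.fields.each  { |field|  puts "  field:  #{field.name}" }
  form.buttons.each { |button| puts "  button: #{button.name}" }
end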

If you have a long list of products to search for, you will still save
yourself some work, but scraping involves some visual inspection of
pages and page source to get things going. Be aware that their
sysadmin may spot you doing a big blast of searches all at once and
block you from the site. If they check their logs and see that
somebody is searching for all the Cisco stuff in an automated fashion,
they might just block you anyway, whether or not they are legit
themselves. Many sysadmins don't like bots searching their
databases! They might see it as probing for exploits.
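
So pace the crawl. A trivial way to do that, assuming "agent" is a
WWW::Mechanize instance and "domains" is your list of domain names (the
delay values are an arbitrary choice, not a rule):

domains.each do |domain|
  begin
    page = agent.get("http://#{domain}/")
    # ... run the product searches for this site here ...
  rescue StandardError => e
    warn "#{domain}: #{e.message}"
  end
  sleep 5 + rand(5)   # wait a few seconds so you don't hammer anyone's server
end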

I started by downloading the DNS list of all domain names so I can read
through it and extract every domain name with Cisco in it. Once I've done
that I want to go to each page and search/scrape for these products, but
I don't really know the best approach to take. Can anyone give me advice?
Should I just do keyword searches for those 20+ products? Or is there a
better approach?

I’m slightly biased, but scrubyt should be able to do most of the
remaining heavy lifting for you

http://scrubyt.org/

Glenn

On 9/20/07, Glenn G. [email protected] wrote:

I’m slightly biased, but scrubyt should be able to do most of the
remaining heavy lifting for you

http://scrubyt.org/

On that note:

require "rubygems"
require "scrubyt"

froogle_data = Scrubyt::Extractor.define do
  # search Google Products (Froogle) for the part number
  fetch "http://www.google.com/products"
  fill_textfield "q", "WIC-1T"
  submit

  # teach scrubyt the record structure by example
  info do
    product "WIC-1T"
    vendor "NEW2U Hardware from ..."
    price "$40.00"
  end
  next_page "Next", :limit => 10
end

puts froogle_data.to_xml

(tons of improvement needed, but):

<root>
  <info><product>WIC-1T</product><vendor>NEW2U Hardware from ...</vendor><price>$40.00</price></info>
  <info><product>WIC-1T</product><vendor>ATS Computer Systems...</vendor><price>$353.95</price></info>
  <info><product>WIC-1T</product><vendor>eBay</vendor><price>$49.95</price></info>
  <info><product>WIC-1T</product><vendor>eBay</vendor><price>$149.99</price></info>
  <info><product>WIC-1T</product><vendor>PCsForEveryone.com</vendor><price>$337.07</price></info>
  <info><product>WIC-1T</product><vendor>COL - Computer Onlin...</vendor><price>$149.00</price></info>
  <info><product>WIC-1T</product><vendor>eCOST.com</vendor><price>$297.14</price></info>
  <info><product>WIC-1T</product><vendor>eBay</vendor><price>$45.00</price></info>
  <info><product>WIC-1T</product><vendor>ATACOM</vendor><price>$291.95</price></info>
  <info><product>WIC-1T</product><vendor>Express IT Options</vendor><price>$216.44</price></info>
</root>

On Sep 17, 1:25 pm, Charles P. [email protected] wrote:

I don't really know the best approach to take. Can anyone give me
advice? Should I just do keyword searches for those 20+ products? Or is
there a better approach?

Hpricot (http://code.whytheluckystiff.net/hpricot/) is a great
screen-scraping library for Ruby.

Scraping each vendor site might not be the best approach, though, because
each site/page uses a different layout, so the same scrape recipe probably
won't work for another page.

You could scrape Froogle (Google Products?) or some other aggregate
consumer sales site instead: it has a single interface and probably a lot
of data. You might also want to check whether Froogle offers a web
service, which is usually better than scraping.
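
If you do end up scraping a site whose layout you have already eyeballed,
a minimal Hpricot sketch could look something like this (the URL is made
up, and it assumes the search results come back in table cells, which you
would confirm by looking at the page source):

require "rubygems"
require "open-uri"
require "hpricot"

doc = Hpricot(open("http://www.example-reseller.com/search?q=WIC-1T"))

# dump any table cell that mentions the part number next to a dollar amount
(doc/"td").each do |cell|
  text = cell.inner_text.strip
  puts text if text =~ /WIC-1T/i && text =~ /\$\s*\d/
end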

On 20/09/2007, at 7:54 PM, [email protected] wrote:

(tons of improvement needed, but):

It's by no means a silver bullet, but it could very well get you 80% of
the way there. Set up a basic learning extractor that looks for fairly
generic terms you know will exist on the domains you want (say, a model
number and a dollar sign?), have it loop over the URLs with products
on them, output the learner to a production extractor, and then tweak
the sites that aren't giving you the exact results you want.

Or, make life easier if you can and let Froogle put it all into a
single format for you.
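
Once everything is in one XML format, checking it against the
60%-of-MSRP threshold is simple. A sketch with REXML, assuming the
element names from the Froogle extractor earlier in the thread, a saved
results file, and a hand-maintained price table (the file name and the
MSRP figures below are placeholders):

require "rexml/document"

# placeholder MSRP table -- fill in the real list prices
MSRP = { "WIC-1T" => 400.00, "NM-4E" => 1800.00, "WS-2950-24" => 1500.00 }

doc = REXML::Document.new(File.read("froogle_results.xml"))

# flag any listing priced under 60% of MSRP
doc.elements.each("//info") do |info|
  product = info.elements["product"].text
  vendor  = info.elements["vendor"].text
  price   = info.elements["price"].text.delete("$,").to_f
  msrp    = MSRP[product]
  next unless msrp

  if price < 0.6 * msrp
    puts "suspicious: #{vendor} sells #{product} for $#{price} (MSRP $#{msrp})"
  end
end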

Best of luck,

Glenn