Google, MSN, Yahoo spiders crawling off my 'database universe'?

I recently figured out how to create a fairly complex Google Sitemap
file and am happy to share this code with anyone who asks. As I have a
highly nested database, a common URL for me looks something like:

www.MyWebsite.com/parents/2/children/4

The spiders come to my website, ‘crawl’ along, and increment these
sequences, which eventually takes them to:

www.MyWebsite.com/parents/23/children/46

and as this URL has gone off the edge of my database universe, my
exception_notification plugin sends me an email.
Is there a way to put logic somewhere so that if a spider (or person)
is messing around and requests a URL that isn’t there, a routine
kicks in, shows them a view saying “You’ve gone off the edge of my
database universe”, and then takes them back to where they were?

Thank you for any thoughts you may offer.
Kathleen

Well, you can set up a proper robots.txt and disallow certain paths
from being crawled on your site. This is the only way to put
restrictions on crawling.
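
For example, a minimal robots.txt in the site root might look like
this (the /parents/ path is taken from Kathleen’s example URLs; you
would not normally block your real content, this just shows the
syntax):

# robots.txt - served from www.MyWebsite.com/robots.txt
# Ask all well-behaved crawlers to skip the nested resource paths
User-agent: *
Disallow: /parents/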

Hope this helps

Thanks

Dhaval P.
Software Engineer

sales(AT)railshouse(DOT)com


Hi there,

I don’t think that spiders work that way. They follow existing
links - rather than looking at a URL & guessing what another one
could be.

If there are no links to a page, a spider should not get to it.

Unless the link used to exist, but does not now - ie: you are
dynamically generating your URLs using table ids & have removed some
records from the table.

In this case, the links will still be in the search engines’ indexes &
polled every so often. This is not a bad thing, as the link will
eventually fall out of the index. You can request removal through the
webmaster admin tools if this bugs you.

Yahoo’s spider deliberately requests a dummy URL to trigger a “404 -
page not found” error, to see how your site handles missing pages. So
don’t be too worried that Yahoo has weird links for your site in its
index.
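
If you want those probes (or any bad URL) to get a clean 404 page
instead of triggering an exception email, one option is a catch-all
route. This is only a sketch assuming Rails 2-style routing;
ErrorsController and the message text are invented for illustration:

# At the very bottom of config/routes.rb, after all other routes:
map.connect '*path', :controller => 'errors', :action => 'not_found'

# app/controllers/errors_controller.rb (hypothetical controller)
class ErrorsController < ApplicationController
  def not_found
    # Send a real 404 status so stale links eventually drop out of
    # the search engines' indexes.
    render :text => "You've gone off the edge of my database universe.",
           :status => 404
  end
end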

Otherwise, you can put something like this in your logic:

# request.user_agent can be nil, so convert it to a string first
user_agent = request.user_agent.to_s.downcase
unless ['msnbot', 'yahoo! slurp', 'googlebot'].any? { |bot| user_agent.include?(bot) }
  # This request is not from one of the MSN, Yahoo, or Google
  # spiders, so process accordingly.
end
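
And for the “takes them back to where they were” part of Kathleen’s
question, a rescue_from handler in ApplicationController is one way to
do it. Again only a sketch - the handler name, flash message, and
fallback path are assumptions:

class ApplicationController < ActionController::Base
  # Catch lookups for records that no longer exist (e.g. stale ids
  # still sitting in a spider's index) before exception_notification
  # emails you about them.
  rescue_from ActiveRecord::RecordNotFound, :with => :off_the_edge

  private

  def off_the_edge
    flash[:notice] = "You've gone off the edge of my database universe."
    # Send a human back where they came from; spiders usually send no
    # referer, so fall back to the home page.
    redirect_to(request.env['HTTP_REFERER'] || '/')
  end
end

Note that for spiders a plain 404, as in the catch-all sketch above,
is preferable to a redirect: the 404 is what makes a stale link fall
out of the index.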

rgds,