Help: How to deal with content scrapers?

Guys,

I need some help. Over the past few months, the site I administer
(it’s a large medical non-profit/charity) has been attacked by
content-scraping bots. Basically, these content thieves scrape our
sites and then repost the information on their own domains, interspersed
with malware and ads. Those copies quite often rank fairly high on
Google, and when a user gets infected, they blame us. I’ve been asking
Google to delist these sites, but that takes days or weeks.

These scrapers obviously don’t care about robots.txt; they just scrape
the content indiscriminately and ignore all the rules. I’ve been
blocking them manually, but by the time I’m aware of the problem it’s
already too late. They inflict a lot of damage on our database
performance, and many users complain that the site is too slow at
times. When we correlate the data, we see that the slowdowns occur
while these thieves are scraping the site.

What’s the best way to limit the number of requests an IP can make in a
given time period, say 15 minutes? Is there a way to block them at the
webserver (nginx) layer and move this away from the application layer,
since app-layer blocking incurs too much of a performance hit? I’m
looking for something that would simply count the number of requests
over a particular time period and add the IP to iptables if it ever
crosses the limit.
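
The closest I’ve found so far is nginx’s limit_req module, something
along the lines of the sketch below (the zone size, rate and burst
values are just guesses on my part), but as far as I can tell it only
answers excess requests with 503s rather than feeding the offending IP
to iptables:

    http {
        # one bucket per client IP, ~2 requests/second sustained (placeholder values)
        limit_req_zone  $binary_remote_addr  zone=scrapers:10m  rate=2r/s;

        server {
            listen 80;

            location / {
                # allow short bursts; beyond that nginx returns 503
                limit_req  zone=scrapers  burst=20;
            }
        }
    }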

Any advice is much appreciated!!

Thank you,

Dave

On Wed, Apr 22, 2009 at 5:17 PM, davidr [email protected] wrote:

What’s the best way to limit the number of requests an IP can make in a given time period, say 15 minutes? Is there a way to block them at the webserver (nginx) layer and move this away from the application layer, since app-layer blocking incurs too much of a performance hit? I’m looking for something that would simply count the number of requests over a particular time period and add the IP to iptables if it ever crosses the limit.

You could try fail2ban - it’s pretty easy to build rules for it.

The trick is that you don’t want it monitoring your main nginx log.
The solution is to place links to bogus URLs in your HTML pages that
are invisible to a human. When their scraper hits the bogus links,
nginx writes entries to your error log, and your fail2ban jail catches
x of those within y amount of time and blocks the host in iptables.
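
Roughly like this (untested, just a sketch: the trap path
/trap/honeypot.html, the log location and the thresholds are
placeholders, and the failregex will almost certainly need adjusting to
your exact error-log lines). The hidden link itself can be something
like <a href="/trap/honeypot.html" style="display:none"></a> somewhere
in your templates.

    # /etc/fail2ban/filter.d/nginx-scrapertrap.conf
    [Definition]
    # match nginx error-log lines for requests to the hidden trap URL
    failregex = open\(\) ".*/trap/honeypot\.html" failed.*client: <HOST>
    ignoreregex =

    # added to /etc/fail2ban/jail.local
    [nginx-scrapertrap]
    enabled  = true
    filter   = nginx-scrapertrap
    action   = iptables[name=ScraperTrap, port=http, protocol=tcp]
    logpath  = /var/log/nginx/error.log
    maxretry = 3
    findtime = 600
    bantime  = 86400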

Cheers
Kon

Some tips I learned from fighting email spam:

Sometimes the best thing you can do in this situation isn’t to block,
but to identify and throttle, and/or change the content you serve them.

- If you block, they’ll try and try again before swapping IPs or moving
  on to the next victim.
- If you throttle to something crazy like 1 byte/second, most bot
  operators won’t notice, and you’ll also end up tying up their
  connections (see the sketch below).
- You can also send them alternate content: a mixture of gibberish and
  text that identifies them as a spammer or would drop the search
  relevance of their copies.
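
For the throttling part, something like this nginx sketch might do it
(the netblock is obviously a placeholder, both blocks go inside
http { }, and $limit_rate is in bytes per second):

    # flag known scraper addresses
    geo $scraper {
        default        0;
        192.0.2.0/24   1;    # placeholder netblock
    }

    server {
        listen 80;

        location / {
            # serve flagged clients at a crawl, everyone else at full speed
            if ($scraper) {
                set $limit_rate 100;
            }
        }
    }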

On Apr 22, 2009, at 8:17 PM, davidr wrote:

Is there a way to block them at the webserver (nginx) layer and move
this away from the application layer, since app-layer blocking incurs
too much of a performance hit?

// Jonathan V.

e. [email protected]
w. FindMeOn®.com : connect your online profiles
blog. http://destructuring.net

| - - - - - - - - - -
| Founder/CEO - FindMeOn, Inc.
| FindMeOn.com - The cure for Multiple Web Personality Disorder
| - - - - - - - - - -
| CTO - ArtWeLove, LLC
| ArtWeLove.com - Explore Art On Your Own Terms
| - - - - - - - - - -
| Founder - SyndiClick
| RoadSound.com - Tools for Bands, Stuff for Fans
| - - - - - - - - - -