I need some help. In the past few months, the site that I administer
(it's a large medical non-profit/charity) has been attacked by
content-scraping bots. These content thieves scrape our pages, repost
the information on their own domains, and intersperse it with malware
and ads. The scraped copies often rank fairly high on Google, and when
a user gets infected, they blame us. I've been asking Google to delist
these sites, but that takes days or weeks.
These scrapers obviously don't care about robots.txt; they just scrape
everything indiscriminately and ignore all the rules. I've been
blocking them manually, but by the time I'm aware of the problem, it's
already too late. They inflict real damage on our database
performance, and many users complain that the site is too slow at
times. When we correlate the data, the slowdowns line up with the
periods when these thieves are scraping the site.
What's the best way to limit the number of requests an IP can make
within a given time period, say 15 minutes? Is there a way to block
them at the webserver (nginx) layer and move it away from the
application layer, since app-layer blocking incurs too much of a
performance hit? I'm looking for something that would simply count
the number of requests over a particular time period and add the IP
to iptables if it ever crosses the limit. I've sketched below roughly
what I have in mind.
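On the nginx side, the limit_req module looks like the right tool,
though as far as I can tell it expresses the limit as an average rate
(r/s or r/m) rather than a count per 15-minute window. Something like
this is what I'm picturing; the zone name and all the numbers here are
just guesses on my part, not tested values:

    # In the http{} block: track the request rate per client IP.
    # A 10m shared-memory zone should hold state for tens of
    # thousands of addresses.
    limit_req_zone $binary_remote_addr zone=scrapers:10m rate=2r/s;

    server {
        location / {
            # Allow short bursts of up to 20 requests, then reject
            # the excess (nginx answers them with 503 by default).
            limit_req zone=scrapers burst=20 nodelay;
        }
    }

My understanding is that the counting happens entirely in nginx's
shared memory, so it would never touch the application layer, which is
exactly what I want.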
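For the "add the IP to iptables" part, I've seen two approaches
mentioned: fail2ban tailing the nginx access log and inserting DROP
rules when an IP crosses a threshold, or the iptables "recent" module
doing the counting in the kernel. If I understand the recent module
correctly, it counts new TCP connections rather than HTTP requests
(so keep-alive changes the math), and by default it only tracks up to
20 hits per address unless the module is loaded with a larger
ip_pkt_list_tot. Roughly, with thresholds invented for illustration:

    # Drop any source that has opened more than 20 new connections
    # to port 80 within the last 15 minutes (900 seconds).
    iptables -A INPUT -p tcp --dport 80 -m state --state NEW \
        -m recent --name scrapers --update --seconds 900 --hitcount 20 -j DROP
    # Otherwise, just record the new connection for that source IP.
    iptables -A INPUT -p tcp --dport 80 -m state --state NEW \
        -m recent --name scrapers --set

Is one of these a better fit for this kind of scraping, or is there a
cleaner way to do it?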
Any advice is much appreciated!!