Hi all,
How can I maintain two rate limit strategies? One for spiders and one
for regular users?

I can get the IP address list of spiders from http://www.iplists.com/ .
Can I separate them by geo? Have people attempted this?
My website is being pounded by some screen scrapers and I want to block
them, but not at the risk of blocking search engine spiders.
-Quintin
> My website is being pounded by some screen scrapers and I want to block
> them, but not at the risk of blocking search engine spiders.
Do you realize that by going that way, regular users will be subject to
the same request limiting as the bad spiders? You can try further
selection on the User-Agent (see the sketch below), but bad spiders have
the habit of providing bogus UA strings.
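As a minimal sketch (not from the original setup), that UA-based
selection could be done with the map module, also at the http level; the
variable name and patterns here are only illustrative and, as noted,
trivial to forge:

map $http_user_agent $claims_to_be_spider {
    default                  0;
    # hypothetical patterns; real lists need regular maintenance
    "~*(googlebot|bingbot)"  1;
}

Treat something like this only as a secondary signal next to the
IP-based geo check below.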
At the http level:

geo $good_spider {
    default 0;
    # list all good spider IPs here
}

limit_req_zone $binary_remote_addr zone=bad_spiders:10m rate=1r/s;
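If those ranges come from something like the iplists.com data, the geo
block can also read them from an include file, so the list can be
regenerated without touching the main config. A sketch only, with a
hypothetical path and a documentation-only example range:

geo $good_spider {
    default 0;
    # hypothetical file containing lines such as "192.0.2.0/24 1;"
    include /etc/nginx/good_spiders.conf;
}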
On the vhost (server level):

location / {
    limit_req zone=bad_spiders burst=5;

    # requests flagged as good spiders return 418 below, and error_page
    # re-dispatches them to the named location without the limit
    error_page 418 @good-spiders;

    if ($good_spider) {
        return 418;
    }
    #...
}

location @good-spiders {
    # no limits here
    #...
}
--appa
>
> location / {
>     limit_req zone=bad_spiders burst=5;
>
>     error_page 418 @good-spiders;
Oops. This should be:

error_page 418 =200 @good-spiders;

Otherwise the good spiders would still get the response with a 418
status code instead of 200.
--appa