How to block fake Google spiders and fake web browser access?

Hi All,

Recently I found that some people are trying to mirror my website. They
are doing this in two ways:

  1. Pretending to be Google spiders. The access logs look like this:

89.85.93.235 - - [05/May/2015:20:23:16 +0800] "GET /robots.txt HTTP/1.0" 444 0 "http://www.example.com" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "66.249.79.138"
79.85.93.235 - - [05/May/2015:20:23:34 +0800] "GET /robots.txt HTTP/1.0" 444 0 "http://www.example.com" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "66.249.79.154"

The http_x_forwarded_for addresses are Google addresses.

  2. Pretending to be a normal web browser.

I’m trying to use the configuration below to block their access.

For case 1, I check the X-Forwarded-For address: if the user agent is a
spider and X-Forwarded-For is not empty, I block the request.
I’m using:

map $http_x_forwarded_for $xf {
    default 1;
    "" 0;
}
map $http_user_agent $fakebots {
    default 0;
    "~*bot" $xf;
    "~*bing" $xf;
    "~*search" $xf;
}
if ($fakebots) {
    return 444;
}

With this configuration, it seems the fake Google spiders can’t access
the root of my website. But they can still access my PHP files, while
they can’t access any js or css files. Very strange. I don’t know what’s
wrong.

  2. For user agents that do not declare themselves as spiders, I use
    ngx_lua to generate a random value, add it to a cookie, and then
    check whether the client sends the value back. If it can’t, I treat
    it as a robot and block access.

map $http_user_agent $ifbot {
    default 0;
    "~*Yahoo" 1;
    "~*archive" 1;
    "~*search" 1;
    "~*Googlebot" 1;
    "~Mediapartners-Google" 1;
    "~*bingbot" 1;
    "~*msn" 1;
    "~*rogerbot" 3;
    "~*ChinasoSpider" 3;
}

if ($ifbot = "0") {
    set $humanfilter 1;
}
# below section is to exclude flash upload
if ( $request_uri !~ "~mod=swfupload&action=swfupload" ) {
    set $humanfilter "${humanfilter}1";
}

if ($humanfilter = "11") {
    rewrite_by_lua '
        local random = ngx.var.cookie_random
        if (random == nil) then
            random = math.random(999999)
        end
        local token = ngx.md5("hello" .. ngx.var.remote_addr .. random)
        if (ngx.var.cookie_token ~= token) then
            ngx.header["Set-Cookie"] = {"token=" .. token, "random=" .. random}
            return ngx.redirect(ngx.var.scheme .. "://" .. ngx.var.host .. ngx.var.request_uri)
        end
    ';
}
But with the above configuration, it seems that Googlebot is also
blocked, which it shouldn’t be.

Can anyone help?

Thanks

Posted at Nginx Forum:

It seems that I can’t edit my post, so I have to post my question here:
I tried to use “deny” to deny access from an IP, but it seems that it
can still access my server.

In my http part:

deny 69.85.92.0/23;
deny 69.85.93.235;

But when I check the log, I can still find:

69.85.93.235 - - [05/May/2015:19:44:22 +0800] "GET /thread-1251687-1-1.html HTTP/1.0" 302 154 "http://www.example.com" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)" "123.125.71.107"
69.85.93.235 - - [05/May/2015:19:50:06 +0800] "GET /thread-1072432-1-1.html HTTP/1.0" 302 154 "http://www.example.com" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)" "220.181.108.151"

It seems deny is not working.

Can anyone help?

Hey,

Why not just compare their X-Forwarded-For address against the
connecting IP? If they don’t match and it’s a bot, drop it.
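
A minimal sketch of that comparison in nginx configuration, assuming the bot check reuses a user-agent map like the one in the original post. The variable names `$xff_mismatch`, `$claims_bot` and `$drop_fake_bot` are hypothetical, and the second regex relies on a PCRE backreference to test that both halves of the combined string are identical:

```nginx
# http{} context.
# 1 when X-Forwarded-For is present and differs from the connecting address.
map "$remote_addr|$http_x_forwarded_for" $xff_mismatch {
    default         1;
    "~\|$"          0;   # header absent or empty
    "~^(.+)\|\1$"   0;   # header equals the connecting address
}

# 1 when the user agent claims to be a crawler (patterns are illustrative).
map $http_user_agent $claims_bot {
    default     0;
    "~*bot"     1;
    "~*spider"  1;
}

# Combine both flags: only "claims bot AND mismatch" is dropped.
map "$claims_bot:$xff_mismatch" $drop_fake_bot {
    default  0;
    "1:1"    1;
}

server {
    listen 80;

    if ($drop_fake_bot) {
        return 444;
    }
}
```

Note that a header carrying a list of addresses (for example from a chain of proxies) would also count as a mismatch here, so this may be too strict for crawlers that legitimately arrive through a proxy.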


Payam C.
Network Engineer / Security Specialist

The only way you can stop people from mirroring your site is to pull the
plug. Anything you set up can be bypassed by a scraper that behaves like
a normal user would. If you put CAPTCHAs on every page, someone
motivated can get really smart people in poor countries to type in the
letters, click the blue box, complete the pattern, etc. on the cheap.

That being said, the legitimate Googlebot operates from a well-defined
subset of IP blocks, always identifies itself, and honors robots.txt, so
you can look those blocks up and whitelist them.
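
As a rough sketch of such a whitelist, the following treats a request as a fake Googlebot when the user agent says Googlebot but the connecting address is outside a listed range. The single range shown is only the CIDR form of the 66.249.64.0-66.249.95.255 entry from the poster's own `geo` block in this thread, not an authoritative list, and the variable names are hypothetical:

```nginx
# http{} context.
# geo matches against $remote_addr by default.
geo $from_google_range {
    default          0;
    66.249.64.0/19   1;   # illustrative only; look up the current ranges
}

map $http_user_agent $claims_googlebot {
    default         0;
    "~*Googlebot"   1;
}

# "1:0" = claims to be Googlebot but connects from outside the listed range.
map "$claims_googlebot:$from_google_range" $fake_googlebot {
    default  0;
    "1:0"    1;
}

server {
    listen 80;

    if ($fake_googlebot) {
        return 444;
    }
}
```

The design keeps the decision in `map` and `geo` blocks, so the only `if` needed is a single one whose body is just `return`.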

Any traffic from Amazon EC2, Google Cloud, and Digital Ocean is
immediately suspect. You can filter it out by IP block, because those
clients are probably not going to identify themselves as bots. However,
you may lose traffic from real people running VPNs and proxies through
those providers as a consequence, so think it through before you act.

And there is no shortage of other providers for people to turn to if you
block the big clouds, so it comes back to pulling the plug if you want
to keep your content locked down.

Thanks for your suggestion.

My thought is:

  1. Is it a robot?
  2. If yes, does it have an X-Forwarded-For address?
  3. If yes, then deny.

Your method is:

  1. Is it a robot?
  2. If yes, is the X-Forwarded-For address the same as the real IP?
  3. If no, then deny.

I don’t think there is a big difference…

On Tue, May 05, 2015 at 09:07:41AM -0400, meteor8488 wrote:

Hi there,

I tried to use “deny” to deny access from an IP. But it seems that it can
still access my server.

In my http part:

deny 69.85.92.0/23;
deny 69.85.93.235;

A request comes in to nginx. nginx chooses one server{} block in its
configuration to handle it. nginx chooses one location{} block in that
server{} configuration to handle it. Only configuration directives in,
or inherited into, that location{} are relevant.

(If you use any rewrite-module directives, things may be different.)
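
One way things can differ: rewrite-module directives such as `return` run in an earlier request-processing phase than the access module's `allow` and `deny`. A minimal sketch, with a hypothetical port, location and redirect target, in which a denied client still receives the redirect rather than a 403:

```nginx
server {
    listen 8080;

    location /demo {
        deny all;                # access phase: on its own this would answer 403
        return 302 /elsewhere;   # rewrite phase: runs first, so the 302 is sent
    }
}
```

If the 302 responses in the log come from a redirect issued during the rewrite phase, that would be consistent with the request never reaching the `deny` check, though this is only a guess without the full configuration.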

69.85.93.235 - - [05/May/2015:19:44:22 +0800] "GET /thread-1251687-1-1.html HTTP/1.0" 302 154 "http://www.example.com" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)" "123.125.71.107"

What is the one location{} that handles this request? What “allow” and
“deny” directives are in that location{}? And in the enclosing server{}?

Can you provide a complete nginx.conf that shows the behaviour you
report?

(It doesn’t have to be your production config. Something smaller
that shows this problem on a test machine, may make obvious where the
problem is.)

Thanks,

f

Francis D.

Hi Francis,

I put the “deny” directives in http{} part.

Here is my nginx.conf.

http {

deny 4.176.128.153;
deny 23.105.85.0/24;
deny 36.44.146.99;
deny 42.62.36.167;
deny 42.62.74.0/24;
deny 50.116.28.209;
deny 50.116.30.23;
deny 52.0.0.0/11;
deny 54.72.0.0/13;
deny 54.80.0.0/12;
deny 54.160.0.0/12;
deny 54.176.0.0/12;
deny 54.176.195.13;
deny 54.193.0.0/16;
deny 54.193.212.129;
deny 54.208.0.0/15;
deny 54.212.0.0/15;
deny 54.219.0.0/16;
deny 54.224.0.0/12;
deny 58.208.0.0/12;
deny 61.135.219.2;
deny 61.173.11.234;
deny 61.177.134.164;
deny 61.178.110.42;
deny 69.85.92.0/23;
deny 69.85.93.235;
deny 101.226.62.63;
deny 101.226.167.237;
deny 101.226.168.225;
deny 101.231.74.38;
deny 101.231.74.40;
deny 103.19.84.0/22;
deny 106.186.112.0/21;
deny 111.20.18.224;
deny 111.20.19.148;
deny 111.67.200.68;
deny 112.90.51.35;
deny 112.235.133.139;
deny 113.74.83.46;
deny 113.120.156.252;
deny 114.80.109.30;
deny 114.80.116.164;
deny 114.86.54.43;
deny 114.87.109.129;
deny 114.112.103.46;
deny 115.226.236.69;
deny 116.7.169.91;
deny 116.208.12.74;
deny 116.228.41.122;
deny 116.232.27.33;
deny 116.234.130.64;
deny 117.27.152.197;
deny 117.27.152.198;
deny 117.151.97.223;
deny 118.144.32.66;
deny 119.85.190.7;
deny 119.147.225.177;
deny 119.254.64.12;
deny 119.254.86.240;
deny 119.254.86.246;
deny 121.202.22.154;
deny 122.4.149.168;
deny 122.49.5.11;
deny 122.49.5.14;
deny 122.49.5.15;
deny 122.96.36.167;
deny 123.151.176.198;
deny 124.156.6.198;
deny 124.226.42.78;
deny 125.125.41.167;
deny 128.199.153.220;
deny 128.199.78.7;
deny 136.243.36.95;
deny 139.200.132.233;
deny 171.108.67.30;
deny 171.112.242.65;
deny 174.2.171.84;
deny 180.153.72.92;
deny 180.153.211.148;
deny 180.153.229.0/24;
deny 180.171.146.137;
deny 182.16.44.26;
deny 182.33.66.29;
deny 182.41.45.241;
deny 182.240.7.79;
deny 183.8.83.248;
deny 183.129.200.250;
deny 183.156.102.146;
deny 183.156.108.133;
deny 183.157.68.141;
deny 183.250.40.194;
deny 188.143.232.40;
deny 188.143.232.72;
deny 198.58.96.215;
deny 198.58.99.82;
deny 198.58.102.117;
deny 198.58.102.155;
deny 198.58.102.156;
deny 198.58.102.158;
deny 198.58.102.49;
deny 198.58.102.95;
deny 198.58.102.96;
deny 198.58.103.102;
deny 198.58.103.114;
deny 198.58.103.115;
deny 198.58.103.158;
deny 198.58.103.160;
deny 198.58.103.28;
deny 198.58.103.36;
deny 198.58.103.91;
deny 198.58.103.92;
deny 202.1.232.243;
deny 203.195.219.37;
deny 204.236.128.0/17;
deny 209.141.40.22;
deny 211.97.148.191;
deny 218.148.90.164;
deny 220.240.235.158;
deny 222.73.68.103;
deny 222.95.129.93;
deny 222.175.185.14;
deny 222.175.186.18;
geo $geo {
ranges;
111.67.200.68-111.67.200.68 badip;
58.213.119.20-58.213.119.21 badip;
54.208.0.0-54.209.255.255 badip;
54.176.0.0-54.191.255.255 badip;
54.219.0.0-54.219.255.255 badip;
54.193.0.0-54.193.255.255 badip;
54.160.0.0-54.175.255.255 badip;
106.145.17.0-106.145.17.255 badip;
112.235.133.139-112.235.133.139 spider;
5.255.253.77-5.255.253.77 spider;
69.85.93.235-69.85.93.235 spider;
54.160.105.130-54.160.105.130 spider;
95.108.158.146-95.108.158.146 spider;
131.253.21.0-131.253.47.255 spider;
157.54.0.0-157.60.255.255 spider;
202.160.176.0-202.160.191.255 spider;
207.46.0.0-207.46.255.255 spider;
207.68.128.0-207.68.207.255 spider;
209.191.64.0-209.191.127.255 spider;
209.85.128.0-209.85.255.255 spider;
216.239.32.0-216.239.63.255 spider;
64.233.160.0-64.233.191.255 spider;
64.4.0.0-64.4.63.255 spider;
65.52.0.0-65.55.255.255 spider;
66.102.0.0-66.102.15.255 spider;
66.196.64.0-66.196.127.255 spider;
66.228.160.0-66.228.191.255 spider;
66.249.64.0-66.249.95.255 spider;
67.195.0.0-67.195.255.255 spider;
68.142.192.0-68.142.255.255 spider;
72.14.192.0-72.14.255.255 spider;
72.30.0.0-72.30.255.255 spider;
74.125.0.0-74.125.255.255 spider;
74.6.0.0-74.6.255.255 spider;
8.12.144.0-8.12.144.255 spider;
98.136.0.0-98.139.255.255 spider;
203.208.32.0-203.208.63.255 spider;
}

map $request_method $bad_method {
default 1;
~(?i)(GET|HEAD|POST) 0;
}

map $http_referer $bad_referer {
default 0;
~(?i)(babes|click|forsale|jewelry|nudit|organic|poker|porn|amnesty|poweroversoftware|webcam|zippo|casino|replica|CDR) 1;
}

map $query_string $spam {
default 0;
~"\b(ultram|unicauca|valium|viagra|vicodin|xanax|ypxaieo)\b" 1;
~"\b(erections|hoodia|huronriveracres|impotence|levitra|libido)\b" 1;
~"\b(ambien|blue\spill|cialis|cocaine|ejaculation|erectile)\b" 1;
~"\b(lipitor|phentermin|pro[sz]ac|sandyauer|tramadol|troyhamby)\b" 1;
}

map $http_x_forwarded_for $xf {
default 1;
"" 0;
}
map $http_user_agent $fakebots {
default 0;
"~*bot" $xf;
"~*bing" $xf;
"~*search" $xf;
"~*Baidu" $xf;
}

map $http_user_agent $ifbot {
default 0;
"~*rogerbot" 3;
"~*ChinasoSpider" 3;
"~*Yahoo" 1;
"~*archive" 1;
"~*search" 1;
"~*Googlebot" 1;
"~Mediapartners-Google" 1;
"~*bingbot" 1;
"~*YandexBot" 1;
"~*Baiduspider" 1;
"~*Feedly" 2;
"~*Superfeedr" 2;
"~*QuiteRSS" 2;
"~*g2reader" 2;
"~*Digg" 2;
"~*AhrefsBot" 3;
"~*ia_archiver" 3;
"~*trendiction" 3;
"~*AhrefsBot" 3;
"~*curl" 3;
"~*Ruby" 3;
"~*Player" 3;
"~*Go\ http\ package" 3;
"~*Lynx" 3;
"~*Sleuth" 3;
"~*Python" 3;
"~*Wget" 3;
"~*perl" 3;
"~*httrack" 3;
"~*JikeSpider" 3;
"~*PHP" 3;
"~*WebIndex" 3;
"~*magpie-crawler" 3;
"~*JUC" 3;
"~*Scrapy" 3;
"~*libfetch" 3;
"~*WinHTTrack" 3;
"~*htmlparser" 3;
"~*urllib" 3;
"~*Zeus" 3;
"~*scan" 3;
"~*Indy\ Library" 3;
"~*libwww-perl" 3;
"~*GetRight" 3;
"~*GetWeb!" 3;
"~*Go!Zilla" 3;
"~*Go-Ahead-Got-It" 3;
"~*Download\ Demon" 3;
"~*TurnitinBot" 3;
"~*WebscanSpider" 3;
"~*WebBench" 3;
"~*YisouSpider" 3;
"~*check_http" 3;
"~*webmeup-crawler" 3;
"~*omgili" 3;
"~*blah" 3;
"~*fountainfo" 3;
"~*MicroMessenger" 3;
"~*QQDownload" 3;
"~*shoulu.jike.com" 3;
"~*omgilibot" 3;
"~*pyspider" 3;
"~*mysite" 3;
}

server {
listen 80 accept_filter=httpready;
index index.html index.htm index.php;
access_log /var/log/server_access.log main;

location / {
  root   /var/www;

  if ( $geo = "badip" ) {
    return 444;
  }
  if ( $geo = "spider" ) {
    set $spiderip 1;
  }
  if ($bad_method = 1) {
    return 444;
  }
  if ($spam = 1) {
    return 444;
  }
  set $humanfilter 0;
  if ($ifbot = "0") {
    set $humanfilter 1;
  }
  if ( $request_uri !~ "~mod\=swfupload\&action\=swfupload" ) {
    set $humanfilter "${humanfilter}1";
  }
  if ($humanfilter = "11"){
    rewrite_by_lua '
      local random = ngx.var.cookie_random
      if(random == nil) then
        random = math.random(999999)
      end
      local token = ngx.md5("guessguess" .. ngx.var.remote_addr .. random)
      if (ngx.var.cookie_token ~= token) then
        ngx.header["Set-Cookie"] = {"token=" .. token, "random=" .. random}
        return ngx.redirect(ngx.var.scheme .. "://" .. ngx.var.host .. ngx.var.request_uri)
      end
    ';
  }

  if ($ifbot = "1") {
    set $spiderbot 1;
  }
  if ($ifbot = "2") {
    set $rssbot 1;
  }
  if ($ifbot = "3") {
    return 444;
  }

  if ($fakebots) {
    return 444;
  }

  if ($bad_referer = 1) {
    return 410;
  }
  location ~ \.php$ {
    try_files $uri =404;
    fastcgi_pass   backend;
    fastcgi_index  index.php;
    fastcgi_param  SCRIPT_FILENAME  /scripts$fastcgi_script_name;
    include        fastcgi_params;
    access_log /web/log/php.log  main;
  }
}

}

On Tue, May 05, 2015 at 07:05:59PM -0400, meteor8488 wrote:

Hi there,

location / {
  root   /var/www;

  if ( $geo = "badip" ) {
    return 444;
  }
  if ( $geo = "spider" ) {
    set $spiderip 1;
  }

You are using “if” inside “location” and doing something other than
“return”.

That combination makes it too hard for me to understand what is
happening.

I won’t be surprised to learn that that combination is the reason your
“deny” directives do not act the way you want them to.

It looks to me like you can safely move all of these "if"s to server{}
level, outside the location{}.

If you do that, does it change the response that you get at all?
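
A rough sketch of what that move could look like, assuming the `geo` and `map` blocks from the posted nginx.conf stay in http{} unchanged. Only the checks whose body is a plain `return` are shown; the `set` flags and the Lua cookie check are left out, and the location bodies are abbreviated:

```nginx
server {
    listen 80;
    root   /var/www;

    # Evaluated for every request to this server, before a location is chosen.
    if ($geo = "badip")    { return 444; }
    if ($bad_method = 1)   { return 444; }
    if ($spam = 1)         { return 444; }
    if ($ifbot = "3")      { return 444; }
    if ($fakebots)         { return 444; }
    if ($bad_referer = 1)  { return 410; }

    location / {
        # static files
    }

    location ~ \.php$ {
        # fastcgi settings as in the posted configuration
    }
}
```

Placed this way, the checks no longer depend on which location{} ends up handling the request, so PHP requests and static files should be treated the same.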

Cheers,

f

Francis D.