Blocking by user agent if IP doesn't match

Hello everyone. Sorry for double posting this question in the How To section
of the forum, but I noticed afterwards that there are a lot of unanswered
threads there, and this mailing list is more active, so I'm assuming I have a
better chance of getting some help here.

I am hoping this is possible and I'd really appreciate some help configuring it.
We have some bots hitting our site that use the Google spider user agent, but
they are fake and their IP range has nothing to do with Google.
So I am looking for a solution that matches visitors whose user agent says
Google but whose IP doesn't start with, for example, 66.x or 70.x, and blocks
them from accessing the site.

I've found this example on one site:

if ($http_user_agent ~ (Purebot|Lipperhey|MaMaCaSpEr|libwww-perl|Mail.Ru|gold crawler)) {
    return 403;
}

But I don't know how to add an IP regex match to that, or what the end result
would look like. Or a negative match, I guess, in this case.
In .htaccess it is something like !^66.*$ to match IPs that don't start with
66., for example. But again, I am not sure what the right syntax would be in
the nginx config, or whether it is even possible to match both conditions.
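
The closest I could work out on my own is a bare negative match like the
sketch below (not sure the syntax is even right), but as far as I know a
single if can't also check the user agent, since nginx doesn't allow combining
two conditions in one if:

# only checks the address, not the user agent
if ($remote_addr !~ "^(66|70)\.") {
    return 403;
}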


On 31 May 2011 19h49 WEST, [email protected] wrote:

start with for example 66.x or 70.x and block them from accessing.

I’ve found this example on one site
if ($http_user_agent ~ (Purebot|Lipperhey|MaMaCaSpEr|libwww-perl|Mail.Ru|gold crawler)) {
    return 403;
}

Try this:

if ($http_user_agent ~* "Google Bot") {
    allow 66.x;
    allow 70.x;
    deny all;
}

— appa

On 5/31/11, karabaja [email protected] wrote:

syntax would be in nginx config or if it is even possible to match both
statements.

It is possible and there are different ways to match two conditions. I
like this one:

geo $google {
    default    0;
    66.0.0.0/8 1;
}

map $http_user_agent $googlebot {
    default 0;
    ~google $google;
}

server {
    location / {
        if ($googlebot) {

        }
    }
}
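
The map entry hands back the current value of $google, so the two directives
compose roughly like this (example addresses only):

# UA matches ~google, client IP 66.1.2.3  ->  $google = 1  ->  $googlebot = 1
# UA matches ~google, client IP 93.1.2.3  ->  $google = 0  ->  $googlebot = 0
# UA does not match the pattern           ->  $googlebot = 0 (default)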

On Tue, May 31, 2011 at 10:19:13PM +0300, Alexandr G. wrote:


It works since 0.9.6. I'm going to change the case sensitivity like this:

~Google   # case sensitive
~*google  # case insensitive

Expression compatible with both the old and new syntax (for a graceful upgrade):

~(?i)google  # case insensitive

As to configuration, it's better to use this logic:

geo $not_google {
    default    1;
    66.0.0.0/8 0;
}

map $http_user_agent $bots {
    default     0;
    ~(?i)google $not_google;
    "~(?i)(Purebot|Lipperhey|MaMaCaSpEr|libwww-perl|Mail.Ru|gold crawler)" 1;
}

server {
    location / {
        if ($bots) {
            return 403;
        }
    }
}
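
In other words, for typical requests this works out as (a summary, using the
same patterns as above):

# google UA, IP in 66.0.0.0/8   ->  $not_google = 0  ->  $bots = 0  ->  allowed
# google UA, IP anywhere else   ->  $not_google = 1  ->  $bots = 1  ->  403
# Purebot, Lipperhey, etc. UA   ->  $bots = 1  ->  403, regardless of IP
# any other UA                  ->  $bots = 0  ->  allowed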


Igor S.

On 5/31/11, Igor S. [email protected] wrote:

It works since 0.9.6. I’m going to change case sensitivity like this

~Google # case sensitive
~*google # case insensitive

It would be nice, because the current implementation is a bit confusing,
i.e. inconsistent with the other regexes.

On 31 May 2011 20h34 WEST, [email protected] wrote:

It works since 0.9.6. I’m going to change case sensitivity like this

~Google # case sensitive
~*google # case insensitive

Expression compatible with both the old and new syntax (for a graceful
upgrade):

Question: Is using the map directive more efficient than using a
“naive” if with access control directives?

Thanks,
— appa

On 6/1/11, António P. P. Almeida [email protected] wrote:

Question: Is using the map directive more efficient than using a
“naive” if with access control directives?

I think it's just considered bad practice to use anything other than return
or rewrite inside if.
In terms of efficiency: geo uses red-black trees and the access module uses
arrays. But I don't think it's about that. It's more of an example to
illustrate how to use multiple conditions and new features in nginx.

On 31 May 2011 23h01 WEST, [email protected] wrote:

Thanks everyone for being so helpful. I've ended up applying Igor's
suggestion. But I've dropped this line as I wasn't sure what to do with it:
"~(?i)(Purebot|Lipperhey|MaMaCaSpEr|libwww-perl|Mail.Ru|gold crawler)" 1;

I am guessing it can be used if I want to match more than just Google's user
agent. But in any case, what I did worked very nicely. I tested it using a
Firefox user agent and got the forbidden page, then tried adding my IP to the
geo bit and I was allowed.

Yes, following Alexandr's and Igor's advice, you can create similar variables
using more IP blocks inside the geo directive. E.g.:

geo $bad_bot {
    default        1;
    66.0.0.0/8     0;
    xx.yy.zz.ww/16 1; # for Yahoo!
    (…)
}

You'll also need to add to the map directive the regexes for the User-Agent
strings of the remaining bots that you want to whitelist:

map $http_user_agent $bots {
    default             0;
    ~(?i)(google|yahoo) $bad_bot;
}

— appa

Thanks everyone for being so helpful. I've ended up applying Igor's
suggestion.
But I've dropped this line as I wasn't sure what to do with it:
"~(?i)(Purebot|Lipperhey|MaMaCaSpEr|libwww-perl|Mail.Ru|gold crawler)" 1;

I am guessing it can be used if I want to match more than just Google's user
agent. But in any case, what I did worked very nicely.
I tested it using a Firefox user agent and got the forbidden page, then tried
adding my IP to the geo bit and I was allowed.

And there are a lot fewer "Google spiders" on our site now, and all of them
have IPs from the allowed list.

Thanks again for the fast and great advice.

Since I am a noob when it comes to nginx, and now that I know I can get some
help here, I am sure I'll have more questions :)


Thanks Antonio, so far we've only had fake Google bots, so it seems fine as it
is for now. But I'll apply those if there is ever an issue with Yahoo or some
other spider being faked.
I am assuming that other user agents are not affected by the current rule.


On 31 May 2011 23h25 WEST, [email protected] wrote:

Oops. This is now correct:

geo $bad_bot {
    default        1;
    66.0.0.0/8     0;
    xx.yy.zz.ww/16 0; # for Yahoo!
    (…)
}
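
Put together with the map and the if from before, the whole thing would read
along these lines (the Yahoo range is still a placeholder):

geo $bad_bot {
    default        1;
    66.0.0.0/8     0;    # Google
    xx.yy.zz.ww/16 0;    # Yahoo! (fill in the real range)
}

map $http_user_agent $bots {
    default             0;
    ~(?i)(google|yahoo) $bad_bot;
}

server {
    location / {
        if ($bots) {
            return 403;
        }
    }
}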

— appa

On 31 May 2011 23h44 WEST, [email protected] wrote:

Thanks Antonio, so far we’ve only had fake google bots so it seems
fine as it is for now. But I’ll apply those if there is any issue
with Yahoo or some other spider being faked. I am assuming that
other user agents are not affected by the current rule.

No, the default value of $bots is 0. Only when the UA string contains google
(case-insensitively) is the IP verified. Otherwise, since you dropped the line
that matched the bad bots' UAs, all other bots are allowed.

This is a good approach for handling bots spoofing the UA string and
also a cleaner way to block unwanted bots.

— appa

On Tue, May 31, 2011 at 06:01:19PM -0400, karabaja wrote:

Thanks everyone for being so helpful. I've ended up applying Igor's
suggestion.
But I've dropped this line as I wasn't sure what to do with it:
"~(?i)(Purebot|Lipperhey|MaMaCaSpEr|libwww-perl|Mail.Ru|gold crawler)" 1;

I am guessing it can be used if I want to match more than just Google's user
agent. But in any case, what I did worked very nicely.
I tested it using a Firefox user agent and got the forbidden page, then tried
adding my IP to the geo bit and I was allowed.

I misread your message and thought that you were already blocking these bots,
so I added them to be blocked regardless of their IPs.


Igor S.

On Tue, May 31, 2011 at 10:04:11PM +0100, António P. P. Almeida wrote:

Question: Is using the map directive more efficient than using a
“naive” if with access control directives?

Due to the "if" block implementation, this may not work in some
configurations.


Igor S.

On 6/1/2011 2:34 AM, Igor S. wrote:

As to configuration, it's better to use this logic:

geo $not_google {
    default    1;
    66.0.0.0/8 0;
}

Will the geo directive also work for IPv6?

i.e.:

geo $not_google {
    default        1;
    66.0.0.0/8     0;
    2404:6800::/32 0;
}

I could hardly find anything on the wiki about IPv6
(http://wiki.nginx.org/HttpGeoModule).

thanks


On Wed, Jun 01, 2011 at 03:42:12PM +0700, Hari Hendaryanto wrote:

i.e.:

geo $not_google {
    default        1;
    66.0.0.0/8     0;
    2404:6800::/32 0;
}

I could hardly find anything on the wiki about IPv6
(http://wiki.nginx.org/HttpGeoModule).

No, currently geo does not support IPv6.


Igor S.