GeoIP Question - Speed & efficiency

Hello,

I am thinking of using the GeoIP module with input from the MaxMind
database converted with the perl script linked from the nginx site.

I’m curious whether the country/IP pairs are managed efficiently so that
lookups are very fast. That is, does the module do something like sort
the list and then use a binary tree to quickly locate the country? Is the
whole thing loaded into memory? This country database is quite huge, and
if this lookup happens on every hit, or even only on a selected entry
page, it could be very slow. Does anyone here have experience with this?

For my purposes I only really need to detect continents, for deciding
which of a few server locations visitors should pull from. So presumably
it should be possible to combine many countries into larger blocks so
that there are fewer steps in the lookup. Any input on how fast or
efficient this has proven to be would be very helpful.

Thanks,
Chris :)

Hello!

On Sat, Aug 16, 2008 at 07:43:45AM +0700, Chris S. wrote:

I am thinking of using the GeoIP module with input from the MaxMind
database converted with the perl script linked from the nginx site.

I’m curious whether the country/IP pairs are managed efficiently so that
lookups are very fast. That is, does the module do something like sort
the list and then use a binary tree to quickly locate the country? Is the
whole thing loaded into memory?

The geo module builds an in-memory radix tree when loading the config.
This is the same data structure that is used for routing tables, and
lookups are really fast.
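
For illustration only (the addresses and codes below are made up), the
converted base is just a long list of geo entries:

    geo $geo {
        default          ZZ;    # fallback for addresses not in the base
        192.0.2.0/24     US;
        198.51.100.0/24  DE;
        203.0.113.0/24   RU;
    }

Each prefix becomes a node in the radix tree, so a lookup is just a walk
down the bits of the client address.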

This country database is quite huge, and if this lookup happens on every
hit, or even only on a selected entry page, it could be very slow. Does
anyone here have experience with this?

The only inconvenience of using really large geo bases is config reading
time. Mine currently takes about 30 seconds to load, but that’s for more
than 30 MB of data, and not only countries.

For my purposes I only really need to detect continents, for deciding
which of a few server locations visitors should pull from. So presumably
it should be possible to combine many countries into larger blocks so
that there are fewer steps in the lookup. Any input on how fast or
efficient this has proven to be would be very helpful.

Aggregating blocks is a good thing to do if you don’t need detailed
information, but you’ll hardly notice any difference.
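
As a sketch of what aggregation means (made-up ranges): per-country
entries like

    193.200.0.0/22  DE;
    193.200.4.0/22  FR;
    193.200.8.0/21  GB;

collapse into a single entry once everything maps to one continent and
contiguous blocks are merged before being fed to geo:

    193.200.0.0/20  EU;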

Maxim D.

On Sat, Aug 16, 2008 at 07:22:20AM +0400, Maxim D. wrote:

This country database is quite huge, and if this lookup happens on every
hit, or even only on a selected entry page, it could be very slow. Does
anyone here have experience with this?

The only inconvenience of using really large geo bases is config reading
time. Mine currently takes about 30 seconds to load, but that’s for more
than 30 MB of data, and not only countries.

If you have many unique values for the networks, then the long load time
is caused by searching for duplicates of the data in an array. Otherwise,
it may be caused by the insertions into the radix tree.

Thanks Maxim. Sounds cool: fast, but perhaps a bit of a memory hog. For
loading time, I would think the way to improve that is to compile a
binary representation on disk that can be loaded into memory as a
“pre-made tree” so that no insertion scan needs to be done, or to
pre-sort the data so it can be inserted with a minimum of searching.

Anyway, I may write a small script to see if I can amalgamate countries
into big blocks as that would help both speed and memory.

I gather that, being configured at the http level, this means it is
“always on”. I could see it being useful to be able to put it in a
location block so that only certain requests go through the lookup. For
example, I’ve no need for static images to get country codes, but my
index page would be a great place, as I would set a “best choice” value
for serving the user for all further requests in the session. It sounds
like it doesn’t use much CPU time, but I expect to be serving vast
amounts of small thumbnails, so reducing cycles on those is always a good
thing (25 thumbs/page/user ad nauseam; photo app).

Cheers for the excellent info.
Chris :)

On Sat, Aug 16, 2008 at 03:05:56PM +0700, Chris S. wrote:

Thanks Maxim. Sounds cool: fast, but perhaps a bit of a memory hog. For
loading time, I would think the way to improve that is to compile a
binary representation on disk that can be loaded into memory as a
“pre-made tree” so that no insertion scan needs to be done, or to
pre-sort the data so it can be inserted with a minimum of searching.

Anyway, I may write a small script to see if I can amalgamate countries
into big blocks as that would help both speed and memory.

We at Rambler use a geo base with countries and Russian regions:

wc geo.conf
141240 282480 2979471 geo.conf

Your base will probably be even smaller (as Russia will be a single country).

I gather that, being configured at the http level, this means it is
“always on”. I could see it being useful to be able to put it in a
location block so that only certain requests go through the lookup. For
example, I’ve no need for static images to get country codes, but my
index page would be a great place, as I would set a “best choice” value
for serving the user for all further requests in the session. It sounds
like it doesn’t use much CPU time, but I expect to be serving vast
amounts of small thumbnails, so reducing cycles on those is always a good
thing (25 thumbs/page/user ad nauseam; photo app).

All nginx variables are evaluated on demand only, therefore geo
variables are looked up only if they are really used in a request.
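
For example, with something along these lines (paths and the backend
address are placeholders), the lookup would only ever run for requests
hitting the index location:

    geo $geo {
        default  ZZ;
        include  geo.conf;                  # the converted country map
    }

    server {
        location = /index.php {
            include        fastcgi_params;
            fastcgi_param  COUNTRY $geo;    # $geo is evaluated only here
            fastcgi_pass   127.0.0.1:9000;
        }

        location /thumbs/ {
            root  /var/www/images;          # static files, no geo lookup
        }
    }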

Hello!

On Sat, Aug 16, 2008 at 08:58:03AM +0400, Igor S. wrote:

If you have many unique values for the networks, then the long load time
is caused by searching for duplicates of the data in an array. Otherwise,
it may be caused by the insertions into the radix tree.

Yes, I’ve read the code, and in my case it looks like it’s the unique
values search. One day I’ll probably try to implement an rbtree there,
but currently it doesn’t bug me too much. :)

Maxim D.

On Sat, Aug 16, 2008 at 05:27:47PM +0700, Chris S. wrote:

All nginx variables are evaluated on demand only, therefore geo variables
are looked up only if they are really used in a request.

OK, excellent. So if I only include the fastcgi_param line for one
location, say for index.php, then it would only be evaluated in that
case, to pass the value through to PHP, like this:

fastcgi_param COUNTRY $geo;

Which is easy then…

Yes. Actually, even if you set fastcgi_param at the http level, it will
be inherited at the location level (unless overridden), but it will only
be evaluated when fastcgi_pass actually passes the request to the
backend.
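
A minimal sketch of that (backend address and paths are placeholders):

    http {
        include        fastcgi_params;
        fastcgi_param  COUNTRY $geo;          # set once at http level

        server {
            location = /index.php {
                fastcgi_pass  127.0.0.1:9000; # inherited params, including
                                              # $geo, are evaluated only here
            }

            location /static/ {
                root  /var/www;               # no fastcgi_pass, so $geo is
                                              # never looked up for these
            }
        }
    }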

I wrote a quick PHP script to amalgamate the IP ranges into larger
regions than countries. It takes large groups of countries and merges
them into user-defined groups (for me NA, EU, AS). Doing this drops the
line count from 104K to about 33K, and after running it through the perl
script the conf file is 1.5 MB instead of over 3 MB, so that’s not a bad
saving. I checked a lot of the regions manually to be sure it was
working, so I think it’s OK.

I’ll post the code here just in case anyone else can use it. Sorry it’s
not Perl; I learned it a decade ago but never use it, so I didn’t want to
brush up. This works. I just want to set the correct image server for a
visitor so they get faster photos.

I guess the best thing would be to do a set of GETs from the client to
each server on demand and then choose the image server with the best
times, so it adapts in real time. Didn’t think of that till now…

Chris :)

<?php
// Combine per-country ranges in the GeoIP CSV database into larger
// regional blocks. Reads GeoIPCountryWhois.csv and writes the merged
// CSV to stdout.

$regions = array(
    'NA' => 'US CA MX PR VI BM BO BS DM AR BZ BR CL PN AD AI AG AW AT BB BA BG KY CO '.
            'CR CU DM EC SV GQ GP GT HT HN JM NR NI PY PE PL RU RO TT TC ',
    'EU' => 'EU GB DE FR IT ES SE IR NL BE IE IL CH AL AM BY HR CY CZ DK EE FI GE GI '.
            'GR GL GG HU IS LB LY LI LT LU MC ME MS NO PT RS SK SI TR UA VA '.
            'ZA GA EG NA NG ZW BJ GH CG MW UG SC TZ TM KE RW TZ SO SR SY TM AE UZ AF DZ AO AZ BI '.
            'CV CF TD IQ JE LV MR MQ MU MN SA SL CI NE LS SZ MG SL AO BF MU TG LY SN SD RE CV GQ '.
            'ZM BW CD TN BJ TG BT BW DJ ER ET JO KZ KW KG LB OM QA ',
    'AS' => 'JP IN AU NZ TH CN HK MY PK KR HK SG BD ID TW PH LK VN AP AS AQ TO KH '.
            'CK FJ GN LA MO MM NP NC PG PN WS ST'
);
$other = 'NA';   // fallback region for any country code not listed above

$geo = fopen('GeoIPCountryWhois.csv', 'r');
$r = $w = 0;

// Seed the first range, then keep extending it while the region stays the same.
$last        = fgetcsv($geo);
$last_region = region($last);

while ($line = fgetcsv($geo)) {
    $r++;
    if (($region = region($line)) != $last_region) {
        // Region changed: flush the accumulated range and start a new one.
        print '"'.join('","', array($last[0], $last[1], $last[2], $last[3], $last_region, '-'))."\"\n";
        $last        = $line;
        $last_region = $region;
        $w++;
    } else {
        // Same region: just extend the end of the current range.
        $last[1] = $line[1];
        $last[3] = $line[3];
    }
}

// Flush the final accumulated range.
print '"'.join('","', array($last[0], $last[1], $last[2], $last[3], $last_region, '-'))."\"\n";

fclose($geo);
//print "$r => $w\n";   // uncomment to see how many input rows became output rows

// Map a CSV row to one of the user-defined regions via its country code
// (field 4), falling back to $other for anything unlisted.
function region($vars)
{
    global $regions, $other;
    $found = false;
    foreach ($regions as $r => $codes) {
        if (strpos($codes, $vars[4]) !== false) {
            $found = $r;
            break;
        }
    }
    return $found ? $found : $other;
}
?>
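
(The script reads GeoIPCountryWhois.csv from the current directory and
writes the merged CSV to stdout, so running it is just something like
“php combine.php > merged.csv”, with whatever file names you prefer, and
then the merged file goes through the perl converter as before.)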

Thanks very much,
Chris :)