IP to Country (#139)

The three rules of Ruby Q.:

  1. Please do not post any solutions or spoiler discussion for this quiz
    until
    48 hours have passed from the time on this message.

  2. Support Ruby Q. by submitting ideas as often as you can:

http://www.rubyquiz.com/

  1. Enjoy!

Suggestion: A [QUIZ] in the subject of emails about the problem helps
everyone
on Ruby T. follow the discussion. Please reply to the original quiz
message,
if you can.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

This week’s Ruby Q. is to write a simple utility. Your program should
accept
an IP address as a command-line argument and print out the two letter
code for
the country that IP is assigned in. You can find a database for the
matching
at:

http://software77.net/cgi-bin/ip-country/geo-ip.pl

To keep the problem interesting though, let’s write our programs with a
focus on
speed and memory efficiency.

$ time ruby ip_to_country.rb 68.97.89.187
US

real 0m0.314s
user 0m0.259s
sys 0m0.053s

My solution suffers from over-generality (is supposed to work for any
sorted
string file by any sorting criteria), and from over-limiting memory
usage -
I like Simon’s approach, reading in chunks makes everything easier and
faster.

BTW, all solutions already submitted will lie for subnets 1,2 and 5 :slight_smile:
Most (but not all) will break on out of bounds submissions
(256.256.256.256
or 0.0.0.-1, latter if comments are stripped out)

(yes, I know, just trying to find an excuse for not very elegant
solution -
edge cases kill elegancy :slight_smile:


def bin_find(file,search)
if block_given?
compare = lambda { |a, b| yield(a, b) }
else
compare = lambda { |a, b| a <=> b }
end
open(file) do |f|
return bin_search(f,search,f.lstat.size,&compare)
end
end

def bin_search(f,search,size)
start,finish=0,size
while start<finish
dist=(finish-start)/2
f.seek(start+dist)
f.readline unless start+dist==0
case (l1=f.readline.chomp rescue false) ? yield(search,l1) : -1
when -1
next if (finish=start+dist)>0
break
when 0
return l1
else
case (l2=f.readline.chomp rescue false) ? yield(search,l2) : -1
when -1
return l1
when 0
return l2
else
start+=dist; next
end
end
end
nil
end

nums=[]
out=true
if ARGV[0]==‘test’
n=ARGV[1].to_i
n.times{nums << rand(4294967296)}
out=false
else
ARGV.each do |argv|
nums << ((($1.to_i*256)+$2.to_i)*256+$3.to_i)*256+$4.to_i if
argv=~/(\d{1,3}).(\d{1,3}).(\d{1,3}).(\d{1,3})/
end
end
if nums.empty?
puts “Please enter valid ip(s) (or use ‘#{$0} test NNN’ for testing)”
exit
end

nums.each do |num|
ctry=‘Unknown’
res=bin_find(‘IpToCountry.csv’,num) { |search, str|
str.empty? || str[0,1]!=’"’ ? 1 : search <=>
str.gsub(’"’,’’).split(’,’)[0].to_i
}.gsub(’"’,’’).split(’,’)
ctry=res[4] if (res[0].to_i…res[1].to_i)===num
puts ctry if out
end

Ruby Q. wrote:

[…]

$ time ruby ip_to_country.rb 68.97.89.187
US

real 0m0.314s
user 0m0.259s
sys 0m0.053s

Is an ‘initialisation run’ allowed to massage the data?
(we should at least split the benchmarks to keep it fair)

Is it motivating or a spoiler to post timings?

cheers

Simon

On Sep 14, 2007, at 2:20 PM, Simon Kröger wrote:

Is an ‘initialisation run’ allowed to massage the data?
(we should at least split the benchmarks to keep it fair)

My script does need and initialization run, yes. I don’t see any
harm in paying a one time penalty to set things up right.

Is it motivating or a spoiler to post timings?

Motivating, definitely. :slight_smile:

James Edward G. II

From: “Eugene K.” [email protected]

BTW, all solutions already submitted will lie for subnets 1,2 and 5 :slight_smile:
Most (but not all) will break on out of bounds submissions (256.256.256.256
or 0.0.0.-1, latter if comments are stripped out)

Hi, could you clarify what is meant by lying about subnets
1, 2, and 5?

Thanks,

Bill

Ruby Q. wrote:

$ time ruby ip_to_country.rb 68.97.89.187
US

real 0m0.314s
user 0m0.259s
sys 0m0.053s

This is my first submission to Ruby Q., so don’t expect
ground-breaking techniques from this implementation.

Let me characterize my solution: Instead of looking up the csv file
directly, I precompile it. The search can then be performed quickly. I
think my solution will beat some others when it comes to massive lookups
in the range of 100th of thousand lookups.

First, I wrote a random IP address generator. It can alternatively
output the IP addresses as a space delimited string or each IP address
on a separate line. On my bash submitting all IP address worked up to
about 8 k entries until the argument list was too long, so I implemented
that the addresses can be given on $stdin also. For a proper argument
list on stdin, the entries should be separated by \n, so just add a
second argument to rand_ip.rb to generate the list of ips – the script
doesn’t care what the second parameter is, it just has to be there (i
use “byline”).

I use the script like this:

$ ./rand_ip.rb 100000 byline > ip-test-list

Second, I use a precompile run. I don’t know how folks here can stay
under 100 milliseconds without parsing the file beforehand. My
precompilation generates a binary table that stores 10 bytes for each
entry: 4 bytes each for start/end of address space and 2 bytes for the
country code. Additionally, for saving memory, I squeeze memory areas
with adjacent addresses in the same country. This saves more than 50% of
the records while it doesn’t significantly speed up the search (just
about 1 step more in my binary search algorithm, see below). the packer
(packr.rb) uses pack(“NNa2”) for building the binary table, and look,
the addresses are stored in network byte order (i.e. big endian). The
output table holds about 32 k Entries.

The packer doesn’t accept any parameters, just call

$ ./packer.rb

Third comes the actual address search. I implemented two versions for
comparison: One with a small RAM footprint that looks up everything in
the packed table file. The other one reads the whole table into memory
and performs all lookups in memory.

The algorithm is the same in both implementations: I use binary search
over all entries until the ip address either matches one range or there
is no match at all.

Comparison is sped up by making 4-letter string comparisons of the
respective binary addresses. That way “search_ip >= addr_start” works
even when the addresses are strings. This saves a lot of time because at
search time there is no .to_i calculation avoided.

I run the searchers (search_file.rb, search_str.rb) for small numbers of
random IP address using the command line:

$ time ./search_file.rb ./rand_ip.rb 3000 > /dev/null

For more addresses I use one of the following – the last usage is to
make successive runs with equal input:

$ time ./rand_ip.rb 100000 byline | ./search_file.rb > /dev/null
$ ./rand_ip.rb 100000 byline > ip-test-list; time ./search_file.rb <
ip-test-list > /dev/null

My results for 100000 lookups are:

$ time ./packer.rb

real 0m1.634s
user 0m1.576s
sys 0m0.056s
$ time ./rand_ip.rb 100000 byline > ip-test-list

real 0m0.797s
user 0m0.740s
sys 0m0.056s
$ time ./search_file.rb < ip-test-list > /dev/null

real 0m11.091s
user 0m9.673s
sys 0m1.420s
$ time ./search_str.rb < ip-test-list > /dev/null

real 0m7.131s
user 0m6.960s
sys 0m0.168s

btw: I don’t understand the following result – can you help me
improving this? 129 ms just for firing ruby up! Can’t be!
$ time ruby /dev/null

real 0m0.129s
user 0m0.120s
sys 0m0.012s
$ uname -a
Linux server 2.6.21-modgentoo #2 SMP Wed May 2 19:07:13 CEST 2007 x86_64
AMD Athlon™ 64 Processor 3000+ AuthenticAMD GNU/Linux
$ ruby --version
ruby 1.8.6 (2007-03-13 patchlevel 0) [x86_64-linux]

Cheers,

  • Matthias

After this:

On 9/16/07, steve d [email protected] wrote:

Here’s my submission. It’s almost exactly the same as Jesus’.
[…]
while low <= high
[…]

and this:

On 9/16/07, Eugene K. [email protected] wrote:

BTW, all solutions already submitted will lie for subnets 1,2 and 5 :slight_smile:
Most (but not all) will break on out of bounds submissions (256.256.256.256
or 0.0.0.-1, latter if comments are stripped out)

I made a couple of modifications to my code. Here is the latest version:

require ‘ftools’

ip = ARGV[0].split(/./)
ip = ip[0].to_i * 16777216 + ip[1].to_i * 65536 + ip[2].to_i * 256 +
ip[3].to_i
file = ARGV[1] || ‘ipdb.csv’

File.open(file) do |f|
low = 0
high = f.stat.size
f.seek(high / 2)
while low < high
while (((a = f.getc) != 10) && (f.pos > 2))
f.seek(-2, IO::SEEK_CUR)
end
pos = f.pos
line = f.readline.split(“,”)
low_range = line[0][1…-2].to_i
high_range = line[1][1…-2].to_i
if (low_range > ip)
high = pos
offset = (f.pos - pos) + ((high - low) / 2)
f.seek(-offset, IO::SEEK_CUR)
elsif (high_range < ip)
low = f.pos
f.seek((high-low) / 2, IO::SEEK_CUR)
else
puts line[4][1…-2]
exit
end
end
puts “No country found”
end

I changed to while low < high, and also the walking backwards failed
when reading the first line (btw, Steve, won’t your solution also fail
for that case?).

Now:

time ruby quiz139b.rb 2.1.1.1
No country found

real 0m0.030s
user 0m0.016s
sys 0m0.004s

time ruby quiz139b.rb 5.1.1.1
No country found

real 0m0.032s
user 0m0.008s
sys 0m0.008s

time ruby quiz139b.rb 1.1.1.1
No country found

real 0m0.033s
user 0m0.016s
sys 0m0.000s

time ruby quiz139b.rb 256.256.256.256
No country found

real 0m0.096s
user 0m0.008s
sys 0m0.000s

Thanks,

Jesus.

Ruby Q. wrote:

This week’s Ruby Q. is to write a simple utility. Your program should accept
an IP address as a command-line argument and print out the two letter code for
the country that IP is assigned in.

I assume that it’s not ok for folk to use my Ruby G.IP gem?
It has a reasonable compromise between speed and memory usage.
:slight_smile:

Clifford H…

Greetings!

Thanks for the fun quiz.

Looks like my solution is a little more verbose than many
submitted so far.

Like many others, I’m performing a binary search on the
original data file. For better or for worse, I imposed the
following restrictions on my design, just out of curiosity
for how it would turn out:

  • non-recursive binary search

  • only using seek() and gets() to access the database
    (using gets means no backward scanning for the beginning
    of a record)

Regards,

Bill

====== Begin file: 139_ip_to_country.rb =====

IP FROM IP TO REGISTRY ASSIGNED CTRY CNTRY COUNTRY

“1346797568”,“1346801663”,“ripencc”,“20010601”,“IL”,“ISR”,“ISRAEL”

IpRec = Struct.new(:from, :to, :registry, :assigned, :ctry, :cntry,
:country) do
def contains?(ip_value)
ip_value >= self.from && ip_value <= self.to
end
end

class IpToCountry
DATABASE_FNAME = “IptoCountry.csv”

def initialize(db_fname=DATABASE_FNAME)
@db_size = File.stat(db_fname).size
@db = File.open(db_fname, “r”)
@db_pos = 0
end

def close
@db.close
@db = nil
end

Lookup IpRec containing ip. Exception is raised if not found.

def search(ip)
ip_value = self.class.ip_to_int(ip)
find_rec(ip_value)
end

Binary search through sorted IP database.

The search uses non-record-aligned byte offsets into the

file, but always scans forward for the next parsable

record from a given offset. It keeps track of the

byte offset past the end of the parsed record in @db_pos.

def find_rec(ip_value)
lower, upper = 0, @db_size - 1
prev_rec = nil
loop do
ofst = lower + ((upper - lower) / 2)
rec = next_rec_from_ofst(ofst)
if rec == prev_rec
# We have narrowed the search to where we’re hitting
# the same record each time. Can’t get any narrower.
# But these are variable-length records, so there may
# be one or more at our lower bound we haven’t seen.
# Do a linear scan from our lower bound to examine the
# few records remaining.
ofst = lower
while (rec = next_rec_from_ofst(ofst)) != prev_rec
return rec if rec.contains? ip_value
ofst = @db_pos
end
raise(“no record found for ip_value #{ip_value}”)
end
return rec if rec.contains? ip_value
if ip_value < rec.from
upper = ofst
else
lower = @db_pos
end
prev_rec = rec
end
end

def next_rec_from_ofst(ofst)
@db.seek(ofst)
@db_pos = ofst
while line = @db.gets
@db_pos += line.length
break if rec = self.class.parse_rec(line)
end
rec || raise(“no record found after ofst #{ofst}”)
end

def self.ip_to_int(ip_str)
ip_str.split(".").map{|s|s.to_i}.pack(“c4”).unpack(“N”).first
end

NOTE: Using a strict regexp instead of a simpler split operation,

because it’s important we find a valid record, not one embedded

in a comment or such.

def self.parse_rec(line)
if line =~ %r{\A \s*"(\d+)"\s*,
\s*"(\d+)"\s*,
\s*"(\w+)"\s*,
\s*"(\d+)"\s*,
\s*"(\w+)"\s*,
\s*"(\w+)"\s*,
\s*"([^"]+)"\s* \z
}x
rec = IpRec.new($1.to_i, $2.to_i, $3, $4.to_i, $5, $6, $7)
end
end
end

if $0 == FILE

Accepts zero-or-more IP addresses on the command line.

ip2c = IpToCountry.new

ARGV.each do |ip|
rec = ip2c.search(ip) rescue nil
puts( rec ? rec.ctry : “(#{ip} not found)” )
end

end

====== End file: 139_ip_to_country.rb =====

====== Begin file: test_139_ip_to_country.rb ======

require ‘139_ip_to_country’

require ‘test/unit’

class TestIpToCountry < Test::Unit::TestCase

NOTE: the following are the first two and last two records in

my database file:

REC_0 =
IpToCountry.parse_rec(%{“0”,“16777215”,“IANA”,“410227200”,“ZZ”,“ZZZ”,“RESERVED”})
REC_1 =
IpToCountry.parse_rec(%{“50331648”,“67108863”,“ARIN”,“572572800”,“US”,“USA”,“UNITED
STATES”})
REC_NEG_1 =
IpToCountry.parse_rec(%{“4261412864”,“4278190079”,“IANA”,“410227200”,“ZZ”,“ZZZ”,“RESERVED”})
REC_LAST =
IpToCountry.parse_rec(%{“4278190080”,“4294967295”,“IANA”,“410227200”,“ZZ”,“ZZZ”,“RESERVED”})

def test_find_rec
ip2c = IpToCountry.new
assert_equal( REC_0, ip2c.find_rec(REC_0.from) )
assert_equal( REC_0, ip2c.find_rec(REC_0.to) )
assert_equal( REC_1, ip2c.find_rec(REC_1.from) )
assert_equal( REC_1, ip2c.find_rec(REC_1.to) )
assert_equal( REC_NEG_1, ip2c.find_rec(REC_NEG_1.from) )
assert_equal( REC_NEG_1, ip2c.find_rec(REC_NEG_1.to) )
assert_equal( REC_LAST, ip2c.find_rec(REC_LAST.from) )
assert_equal( REC_LAST, ip2c.find_rec(REC_LAST.to) )
ip2c.close
end

def test_search
ip2c = IpToCountry.new
rec = ip2c.search(“67.19.248.74”)
assert_not_nil( rec )
assert_equal( “ARIN”, rec.registry )
assert_equal( “US”, rec.ctry )
end

end

====== End file: test_139_ip_to_country.rb ======

“Bill K.” [email protected] wrote in message
news:00b301c7f897$a8c9dc20$6442a8c0@musicbox…

Check what ccountry is 5.1.1.1. If you get any valid answer - this
answer
is a lie :slight_smile:

James Edward G. II wrote:

sys     0m0.053s

James Edward G. II

Ok, my script does not need any initialization, it uses the file
IpToCountry.csv exactly as downloaded.


$ ruby -v
ruby 1.8.4 (2005-12-24) [i386-cygwin]

$ time ruby quiz139.rb 68.97.89.187
US

real 0m0.047s
user 0m0.030s
sys 0m0.030s

$ time ruby quiz139.rb 84.191.4.10
DE

real 0m0.046s
user 0m0.046s
sys 0m0.015s

This is on a Pentium M 2.13GHz Laptop with 2GB RAM and rather slow HD.

cheers

Simon

Hi,

one of the (without a doubt) many binary search solutions. This one
reads a 1K
chunk of the file around the current position and uses a regexp to
extract
valid lines (and values) from this chunk.

The reasoning behind that is that data from disk is read in blocks
anyway (in
most cases even larger than 1K) and the last 2 or 3 iterations of the
algorithm
can be avoided. (max depth encountered so far was 14)

Using the regexp frees us from ‘manually’ looking for newlines (it will
find
about 10 valid sets of data in each iteration).

I tried a lot of stuff like biasing the median towards the side where
the
solution will be most likely found, using ‘static’ buffers for the
File#read
calls or unrolling the recursion but nothing proofed realy usefull so i
decided
to go with the short and fast enough™ solution.


require ‘ipaddr’

class IPAddr
def country_code()
open(‘IpToCountry.csv’){|file| code(file, to_i, 0, file.stat.size)}
end

private
def code(file, ip, min, max)
median = (min + max) / 2
file.seek(median - 512)
ary =
file.read(1024).scan(/^"(\d*)","(\d*)","(?:\w*)","(?:\d*)","(\w*)"/)
return code(file, ip, median, max) if ary.empty? || ary.last[1].to_i
< ip
return code(file, ip, min, median) if ary.first[0].to_i > ip
ary.find{|l| l[1].to_i >= ip}[2]
end
end

ARGV.each{|arg| puts IPAddr.new(arg).country_code} if $0 == FILE

timings:

$ time ruby quiz139.rb 0.0.0.0 68.97.89.187 84.191.4.10 80.79.64.128
210.185.128.123 202.10.4.222 192.189.119.1 255.255.255.255
ZZ
US
DE
RU
JP
AU
EU
ZZ

real 0m0.062s
user 0m0.046s
sys 0m0.030s

cheers

Simon

On 9/14/07, Ruby Q. [email protected] wrote:

This week’s Ruby Q. is to write a simple utility. Your program should accept
an IP address as a command-line argument and print out the two letter code for
the country that IP is assigned in. You can find a database for the matching
at:

    http://software77.net/cgi-bin/ip-country/geo-ip.pl

To keep the problem interesting though, let’s write our programs with a focus on
speed and memory efficiency.

Hi,

This is my second submission to Ruby Q., any tip will be greatly
appreciated. First I downloaded the file and removed all comments,
then tried a straightforward approach. Here it is:

require ‘faster_csv’
ip = ARGV[0].split(/./)
ip = ip[0].to_i * 16777216 + ip[1].to_i * 65536 + ip[2].to_i * 256 +
ip[3].to_i
file = ARGV[1] || ‘ipdb.csv’

FasterCSV.foreach(file) do |line|
if (line[0].to_i <= ip && line[1].to_i >= ip)
puts line[4]
exit
end
end

It allows to pass the IP as the first parameter, then transforms it to
the format in the file (the “base” 256). The file can be passed as the
second argument, by default I used my modified file without the
comments.

The method uses FasterCSV to parse the file line by line, checking ip
against the range. The result for the IP in the example was:

time ruby quiz139a.rb 68.97.89.187
US

real 0m0.355s
user 0m0.328s
sys 0m0.024s

So, it was a little bit more than the one presented with the problem.
Unfortunately, things don’t go so good when you search for an IP that
matches entries at the end of the file:

time ruby quiz139a.rb 255.0.0.1
ZZ

real 0m5.337s
user 0m5.036s
sys 0m0.292s

More than 5 seconds :frowning:

So, I went on to build a second version. This time the approach I
took, as the file is ordered is a binary search directly on the file.
Here it is:

require ‘ftools’

ip = ARGV[0].split(/./)
ip = ip[0].to_i * 16777216 + ip[1].to_i * 65536 + ip[2].to_i * 256 +
ip[3].to_i
file = ARGV[1] || ‘ipdb.csv’

File.open(file) do |f|
low = 0
high = f.stat.size
f.seek(high / 2)
while true
while ((a = f.getc) != 10)
f.seek(-2, IO::SEEK_CUR)
end
pos = f.pos
line = f.readline.split(“,”)
low_range = line[0][1…-2].to_i
high_range = line[1][1…-2].to_i
if (low_range > ip)
high = pos
offset = (f.pos - pos) + ((high - low) / 2)
f.seek(-offset, IO::SEEK_CUR)
elsif (high_range < ip)
low = f.pos
f.seek((high-low) / 2, IO::SEEK_CUR)
else
puts line[4][1…-2]
exit
end
end
end

What the method does is a binary search in the file. It starts a low
and high variables to manage the partition, then positions the cursor
in the file to the middle. Every time, the method reads back until it
finds a 10 (\n), then reads that line to check the IP range. If the
lower range is higher than the IP, the method moves the cursor in the
file, down to half of the current low-high range. If the higher range
is lower than the IP, then it moves up half of the range. The timings
here are much better:

time ruby quiz139b.rb 68.97.89.187
US

real 0m0.077s
user 0m0.048s
sys 0m0.004s

time ruby quiz139b.rb 255.0.0.1
ZZ

real 0m0.069s
user 0m0.060s
sys 0m0.008s

Apart from the general strategy of the binary search, I haven’t had
the time to give too much thought to the details, like how to extract
the values after the readline. Any tips regarding that would be
greatly appreciated. Also I don’t know if there’s a better way of
moving back in the file to the previous line break (the while (a =
f.getc) != 10 part). Tips?

Thanks for this quiz. I hope I had time to work on every quiz.

Regards,

Jesus.

Hello,

Had some fun with this one :). Basically, my approach was to download
the
.csv database file and use that file to convert from IP => Country code.
The
program reads this file each time it processes an individual lookup,
with no
precomputations required.

Initially I coded up a linear search (SLOW!) and a binary search that
read
the entire file into an array (somewhat faster). But the final “binary
seek”
approach is more efficient - it uses IO.seek to only read the specific
lines
that it needs, thus providing a good balance of CPU/Memory/IO. Binary
seek
does this by taking advantage of the fact that IP ranges in the file are
in
numerical order. I was planning to also code up a file that uses a web
service to do the conversion, but I doubt that would win any speed
contests
:).

Anyway, the algorithm is a simple modification of a standard binary
search:

def binary_seek(ip_addr_num)
low = 0
high = File.size(Filename)
fp = File.open(Filename)

Find first line of real data and set “low” placeholder accordingly

line = fp.gets
while line.strip.size == 0 or line[0].chr == “#”
line = fp.gets
low = low + line.size
end

Then find the corresponding the IP Range, if any

while (low <= high)
mid = (low + high) / 2
fp.seek(mid, IO::SEEK_SET)

   line = fp.gets # read once to find the next line
   line = fp.gets # read again to get the next full line
   from, to, country_code = parse_line(line)

   if (from == nil or from > ip_addr_num) # Safety check
       high = mid - 1
   elsif (to < ip_addr_num)
       low = mid + 1
   else
       fp.close
       return "Binary Seek Search: #{country_code}"
   end

end

fp.close
return “No Country Found”
end

A hard-coded constant points to the look up file:

Filename = “IpToCountry.csv” # IP to Country Code database file

And utility methods are provided to make the core algorithm easier to
follow:

Convert an IP (V4) address (as string) to a decimal number

Example from file:

1.2.3.4 = 4 + (3 * 256) + (2 * 256 * 256) + (1 * 256 * 256 * 256)

is 4 + 768 + 13,1072 + 16,777,216 = 16,909,060

def convert_ip_str_to_num(ip_addr_str)
ip_addr = ip_addr_str.split(“.”)
ip_addr_num = ip_addr[0].to_i * 256 ** 3 +
ip_addr[1].to_i * 256 ** 2 +
ip_addr[2].to_i * 256 ** 1 +
ip_addr[3].to_i
return ip_addr_num
end

Parse an IP range field of the IP/Country DB file

def parse_line(line)
line_parsed = line.split(‘,’)

Validate to handle comments and unexpected data

if line_parsed != nil and line_parsed.size >= 5
from = line_parsed[0].tr(‘"’, ‘’).to_i
to = line_parsed[1].tr(‘"’, ‘’).to_i
country_code = line_parsed[4]
end

return from, to, country_code
end

Finally, the program simply iterates over a list of input IP addresses:

for ip in ARGV
puts binary_seek(convert_ip_str_to_num(ip))
end

Here is a test run on a Pentium 4 2.4 Ghz Desktop with 512 MB RAM, using
one
of the test cases from the initial email chain:

justin@marketgarden:~/Desktop/139_IP_to_Country$ time ruby
IP_to_Country.rb
68.97.89.187 84.191.4.10 80.79.64.128 210.185.128.123 202.10.4.222
192.189.119.1
Binary Seek Search: “US”
Binary Seek Search: “DE”
Binary Seek Search: “RU”
Binary Seek Search: “JP”
Binary Seek Search: “AU”
Binary Seek Search: “EU”

real 0m0.020s
user 0m0.016s
sys 0m0.004s

A pastie is available here: Parked at Loopia
This file contains Binary Seek as well as linear and binary search
implementations.
Thanks,

Justin

I decided to make one extra, reusable (?), abstraction layer:
Class OrderedLinesFile implements method find_line, which is
given a block. This block is called with a line of the file.
The block has to parse this line and returns -1, 0 or 1 for “go
back”, “bingo!” and “go forward”, respectively. Method
find_line then repositions the pointer in the file and calls
block again, if necessary.

Please shoot…

gegroet,
Erik V. - http://www.erikveen.dds.nl/

BTW, lines 17899 and 17900 of the downloaded file both contain
“2624585728”, which is strange… (Downloaded: 2007-09-17 08:34
(UTC).)


$ time ruby quiz139a.rb 11.22.33.44
11.22.33.44 US

real 0m0.011s
user 0m0.006s
sys 0m0.004s


$ wc all_ip_blocks
82358 82358 905115 all_ip_blocks

$ time ruby quiz139b.rb all_ip_blocks

real 0m45.681s # That’s 1800 IPs per second.
user 0m40.351s
sys 0m5.300s


class OrderedLinesFile
def initialize(file_name)
@file_name = File.expand_path(file_name)
end

def find_line(&block)
line = nil

 File.open(@file_name) do |io|
   position  = 0
   delta     = File.size(@file_name)/2
   line      = io.gets       # The first line of the file.
   line      = io.gets       while /^\s*#/ =~ line or /^\s*$/ =~

line

   while delta > 0 and line and (space_ship = block.call(line)) !=

0
position += space_ship < 0 ? -delta : +delta
delta /= 2

     if position > 0
       io.pos        = position
       broken_line   = io.gets       # Skip the current (broken)

line.
line = io.gets # The current line of the
file.
line = io.gets while /^\s*#/ =~ line or /^\s*
$/ =~ line
else
delta = 0 # Stop.
end
end

   line      = nil   if delta == 0   # Nothing found.
 end

 line

end
end

class IpToCountry
FILE = “IpToCountry.csv”

def country_of(ip_addr)
ip_addr = ip_addr.split(/./).collect{|n| n.to_i}.inject{|n,
m| n*256+m} # “1.2.3.4” → 16909060
olf = OrderedLinesFile.new(FILE)
res = nil

 olf.find_line do |line|
   ip_from, ip_to, registry, assigned, ctry, cntry, country  =

line.gsub(/"/, “”).split(/,/, 7)

   if ip_addr < ip_from.to_i
     -1                              # Go back in the file.
   elsif ip_addr > ip_to.to_i
     +1                              # Go forward in the file.
   else
     res     = ctry
     0                               # Bingo!
   end
 end

 res

end
end

itc = IpToCountry.new

ARGV.each do |ip_addr|
puts “%-16s %s” % [ip_addr, itc.country_of(ip_addr)||“??”]
end

$ ruby quiz139.rb 0.1.1.1 1.1.1.1 2.1.1.1 3.1.1.1 4.1.1.1 5.1.1.1
0.1.1.1 ZZ
1.1.1.1 ??
2.1.1.1 ??
3.1.1.1 US
4.1.1.1 US
5.1.1.1 ??

:}

gegroet,
Erik V.

I just performed the tests another time on my desktop computer with a
freshly installed Sabayon Linux. I don’t know why this one performs that
much better – it’s gentoo-based, it’s the same file system, but it has
no SCSI controller - maybe that makes it faster? Or is it the dual-core
… anyway.

It’s about twice as fast and requires a fraction of the “sys” times.
Also the “empty” ruby run is only 5 ms. Mysteriously.

The packer is down at 0.8s, and 100000 lookups are at 6.2 s (file-based
aka memory efficient) and 3.7 s (everything in-memory).

Anyway – i’d like to see a 100000 lookups comparison :slight_smile: hehe

Matthias Wächter schrieb:

My results for 100000 lookups are:

$ time ./packer.rb

real 0m1.634s
user 0m1.576s
sys 0m0.056s

$ time ./packer.rb

real 0m0.716s
user 0m0.708s
sys 0m0.008s

$ time ./rand_ip.rb 100000 byline > ip-test-list

real 0m0.797s
user 0m0.740s
sys 0m0.056s

$ time ./rand_ip.rb 100000 byline > ip-test-list

real 0m0.322s
user 0m0.317s
sys 0m0.005s

$ time ./search_file.rb < ip-test-list > /dev/null

real 0m11.091s
user 0m9.673s
sys 0m1.420s

$ time ./search_file.rb /dev/null

real 0m6.201s
user 0m5.316s
sys 0m0.885s

$ time ./search_str.rb < ip-test-list > /dev/null

real 0m7.131s
user 0m6.960s
sys 0m0.168s

$ time ./search_str.rb /dev/null

real 0m3.714s
user 0m3.707s
sys 0m0.006s

btw: I don’t understand the following result – can you help me improving this? 129 ms just for firing ruby up! Can’t be!
$ time ruby /dev/null

real 0m0.129s
user 0m0.120s
sys 0m0.012s

$ time ruby /dev/null

real 0m0.005s
user 0m0.004s
sys 0m0.001s

$ uname -a
Linux server 2.6.21-modgentoo #2 SMP Wed May 2 19:07:13 CEST 2007 x86_64 AMD Athlon™ 64 Processor 3000+ AuthenticAMD GNU/Linux

$ uname -a
Linux sabayon2me 2.6.22-sabayon #1 SMP Mon Sep 3 00:33:06 UTC 2007
x86_64 Intel® Core™2 CPU 6600 @ 2.40GHz GenuineIntel GNU/Linux

$ ruby --version
ruby 1.8.6 (2007-03-13 patchlevel 0) [x86_64-linux]

$ ruby --version
ruby 1.8.6 (2007-03-13 patchlevel 0) [x86_64-linux]

  • Matthias

On Sep 18, 2007, at 4:00 AM, Matthias Wächter wrote:

Anyway – i’d like to see a 100000 lookups comparison :slight_smile: hehe

Great. Run it for us and let us know how we do. :wink:

James Edward G. II

On 9/18/07, Bill K. [email protected] wrote:

On Sep 18, 2007, at 4:00 AM, Matthias Wächter wrote:

Here are the results of the supplied solutions so far, and it looks like my solution can take the 100k-performance victory :slight_smile:

If I’ve understood correctly, it looks like my solution seems
to be the fastest (so far) of those that operate on the
unmodified .csv file?

It depends what you mean by unmodified - my algorithm runs off the
original file, the only “modification” I am doing in the setup stage
is searching for and saving the byte offset of the first and last
records. It looks like l could have done that every time my script
was run and only added 5 ms.

I would have bet Simon’s would be faster. Strange!

I thought block file reads would be faster too, that was the next
thing I was planning to try. Maybe it’s the regexp that slowed it
down.

-Adam

James Edward G. II schrieb:

On Sep 18, 2007, at 4:00 AM, Matthias Wächter wrote:

Anyway – i’d like to see a 100000 lookups comparison :slight_smile: hehe
Great. Run it for us and let us know how we do. :wink:

Here are the results of the supplied solutions so far, and it looks like
my solution can take the 100k-performance victory :slight_smile:

First Table: Compilation (Table Packing)

          real    user    sys

Adam[*] 0.005 0.002 0.003
Luis 0.655 0.648 0.007
James[**] 21.089 18.142 0.051
Jesse 1.314 1.295 0.020
Matthias 0.718 0.711 0.008

[*]: Adam does not perform a real compression but he builds two
boundaries to search within the original .csv he subsequently uses.
[**]: Upon rebuild, James fetches the .csv sources from the web making
his solution look slow. This output highly depends on your–actually
my–ISP speed.

Second Table: Run (100_000 Addresses)

          real    user    sys

Adam 24.943 22.993 1.951
Bill 35.080 33.029 2.051
Luis 16.149 13.706 2.444
Eugene[] 52.307 48.689 3.620
Eugene 65.790 61.984 3.805
James 14.803 12.449 2.356
Jesse 14.016 12.343 1.673
Jesus_a[]
Jesus_b[
]
Kevin[
**]
Matt_file 6.192 5.332 0.859
Matt_str 3.704 3.699 0.005
Simon 69.417 64.679 4.706
Justin 56.639 53.292 3.345
steve 63.659 54.355 9.294

[]: Eugene already implements a random generator. But to make things
fair, I changed his implementation to read the same values from $stdin
as all the other implementations. The “Star” version is using his own
random generator and runs outside competition, the starless version is
my modified one.
[]: O Jesus :), I can’t make your FasterCSV version (a) run, and in
the later version you sent your direct parsing breaks when it comes to
detecting the commented lines in the first part of the file. I couldn’t
manage to make it run, sorry.
[
]: Although I managed to write the missing SQL insertion script and
to even add separate indexes for the address limits, Kevin’s SQLite3
version took simply too long. I estimated a run time of over an hour. I
am willing to replay the test if someone tells me how to speed up things
with SQLite3 to make it competitive.

Note that I slightly changed all implementations to contain a loop that
iterates on $stdin.each instead of ARGV or using just ARGV[0]. For the
test the script was run only once and was supplied with all addresses in
one run. The test set consisted of 100_000 freshly generated random IP
addresses written to a file and supplied using the following syntax:

$ (time ruby IpToCountry.rb /dev/null) 2>100k.time

I didn’t check the output of the scripts, although I checked one address
upfront. This was mainly because all scripts have a different output
format. My tests were just for measuring the performance.

Just for Info:

$ uname -a
Linux sabayon2me 2.6.22-sabayon #1 SMP Mon Sep 3 00:33:06 UTC 2007
x86_64 Intel® Core™2 CPU 6600 @ 2.40GHz GenuineIntel GNU/Linux
$ ruby --version
ruby 1.8.6 (2007-03-13 patchlevel 0) [x86_64-linux]
$ cat /etc/sabayon-release
Sabayon Linux x86-64 3.4

  • Matthias