IP to Country (#139)

Matthias Wätcher made a classic mistake this week. He sent in his first
ever
quiz solution and told us not to expect anything special out of it. We
later
learned that it was a lot faster than all of the other solutions, thanks
to more
effort from Matthias. His code taught me some new tricks and I want to
share
those with you.

But first, let’s talk speed.

Most of the solutions are quite quick, relatively speaking. What’s the
shared
secret of speed? Binary search. The records are in sorted order, so
it’s
possible to perform a simple
cut-the-possible-matches-in-half-at-each-step
lookup. That favored algorithm makes short work of what could otherwise
be a
lengthy search.

Now you can do a binary search on the database file as is and several
solutions
did. This is a little trickier, because the records are not of a
standard size.
You have to be super careful not to accidently skip records. While the
solutions handled the challenge very well, you can make things quite a
bit
easier if you are willing to preprocess the file.

You can also speed things up with some clever preprocessing. That was
the other
trick Matthias used to find answers so quickly.

Matthias realized that while the actual records contain quite a bit of
data, we
only really care about the start of a range, the end of a range, and the
country
code. You can fit all of that in just ten bytes with four for each
address and
two for the country.

The file also contains some consecutive ranges that can be collapsed for
the
purposes of our search. Such ranges may differ in some attributes, but
as long
as their country codes are the same we can treat them as a single unit.

Now that you know the goal, let’s take a look at the packing code
Matthias sent
in:

#!/usr/bin/ruby

comment

last_start=nil
last_end=nil
last_country=nil
File.open(“packed-ip.dat”,“wb”) do |wfile|
IO.foreach(“geo-ip.csv”) do |line|
next if !(line =~ /^"/ )
s,e,d1,d2,co=line.delete!(""").split(",")
s,e = s.to_i,e.to_i
if !last_start

initialize with first entry

      last_start,last_end,last_country = s,e,co
    else
      if s==last_end+1 and co==last_country

squeeze if successive ranges have zero gap

        last_end=e
      else

append last entry, remember new one

        wfile << [last_start,last_end,last_country].pack("NNa2")
        last_start,last_end,last_country = s,e,co
      end
    end
end

print last entry

if last_start
  wfile << [last_start,last_end,last_country].pack("NNa2")
end

end

The three variables declared up front are for tracking the ranges.
These will
be used to collapse consecutive ranges.

The next bit of code creates the binary database we will write into and
parses
the CSV formated data. Though the data is in the CSV format, the actual
content
is trivial and you don’t need a proper parser to get at it. As you can
see,
Matthias just checks for lines starting with a quote, pulls out all
quote
characters, and split()s on commas. This is faster than using a parser.

The if/else chain in the middle of the code does most of the work.
First, it
stores the initial range in the tracking variables. For all other
ranges it
tests to see if it is consecutive with the last one recorded and has the
same
country code. When it does, the endpoint of the last range is just
bumped to
include them both. When it doesn’t, the last range is written to the
file and
the code starts tracking the new range. The final if statement ensures
that we
write out the final range before exiting.

This really isn’t too much work and it runs in under two seconds on my
box.
That’s a small price to pay for a speed gain every time we look up an
address.

Let’s dive into the search code now. Here’s how it begins:

#!/usr/bin/ruby

take command line or stdin – the latter has performance advantage

for long lists

if ARGV[0]
arr=ARGV
else
arr=$stdin
end

As the comment tells us, Matthias added another way to feed the program
addresses. Where I said we should take one from the command-line
arguments,
this code actually handles any number of addresses from command-line
arguments
or STDIN.

This next bit of code opens up our database:

the binary table file is looked up with each request

File.open(“packed-ip.dat”,“rb”) do |rfile|
rfile.seek(0,IO::SEEK_END)
record_max=rfile.pos/10-1

# ...

The seek() and pos() calls here are used to find the end of the file.
You need
to know both ends of a data set to perform a binary search. Note that
dividing
by ten gives us the count of records since they are a fixed width and
Matthias
subtracts one because we will never need to seek() to the end of the
last record
again.

Now there’s a touch more preparation for each address we want to find:

# ...

arr.each { |argv|
  # build a 4-char string representation of IP address
  # in network byte order so it can be a string compare below
  ipstr= argv.split(".").map {|x| x.to_i.chr}.join

  # ...

In order to avoid a bunch of to_i() calls on each range extracted from
the
database, Matthias just encodes the address we are hunting for into the
character encoding used in the database. This way simple String
comparisons can
determine if the address it in the range.

We’re now ready for the actual search code:

  # ...

  # low/high water marks initialized
  low,high=0,record_max
  while true
    mid=(low+high)/2       # binary search median
    rfile.seek(10*mid)     # one record is 10 byte, seek to position
    str=rfile.read(8)      # for range matching, we need only 8 

bytes
# at comparison, values are big endian, i.e. packed(“N”)
if ipstr>=str[0,4] # is this IP not below the current range?
if ipstr<=str[4,4] # is this IP not above the current range?
puts rfile.read(2) # a perfect match, voila!
break
else
low=mid+1 # binary search: raise lower limit
end
else
high=mid-1 # binary search: reduce upper limit
end
if low>high # no entries left? nothing found
puts “no country”
break
end
end
}
end

This is a pretty textbook binary search. You always begin by going to
the
middle of the current low and high. In this case that’s a seek() call
and we
have to remember to multiply the midpoint by our record size of ten.

Once we are at the right record, the code read()s the first eight bytes
and
compares the address against the low and high for the range. When it is
in the
range, the final two bytes are read and printed. If the address is
below the
current range, we drop the high to below the current range. If it’s
above, we
raise the low to above the current range. A final check is added to
catch the
cases where no match was found, in which case our low will bypass the
high.

Remember that a second goal of the quiz described search was to be
memory
friendly. This code does terrific on that front since only one record
needs to
be in memory at a time. Once the checks are done, it can be replaced by
the
next record.

My thanks to all who showed so many great examples of this classic
algorithm.

Tomorrow we will write programs that can help me to better understand my
wardrobe…