--------------000706010906020103020108
Content-Type: text/plain; charset=ISO-8859-15
Content-Transfer-Encoding: 7bit

Ruby Quiz wrote:
> This week's Ruby Quiz is to write a simple utility.  Your program should accept
> an IP address as a command-line argument and print out the two letter code for
> the country that IP is assigned in.  You can find a database for the matching
> at:
> 
> 	http://software77.net/cgi-bin/ip-country/geo-ip.pl
> 
> To keep the problem interesting though, let's write our programs with a focus on
> speed and memory efficiency.
> 
> 	$ time ruby ip_to_country.rb 68.97.89.187
> 	US
> 	
> 	real    0m0.314s
> 	user    0m0.259s
> 	sys     0m0.053s

This is my first submission to Ruby Quiz, so don't expect ground-breaking techniques from this implementation.

Let me characterize my solution: Instead of looking up the csv file directly, I precompile it. The search can then be performed quickly. I think my solution will beat some others when it comes to massive lookups in the range of 100th of thousand lookups.

First, I wrote a random IP address generator. It can alternatively output the IP addresses as a space delimited string or each IP address on a separate line.  On my bash submitting all IP address worked up to about 8 k entries until the argument list was too long, so I implemented that the addresses can be given on $stdin also. For a proper argument list on stdin, the entries should be separated by \n, so just add a second argument to rand_ip.rb to generate the list of ips -- the script doesn't care what the second parameter is, it just has to be there (i use "byline"). 

I use the script like this: 

$ ./rand_ip.rb 100000 byline > ip-test-list


Second, I use a precompile run. I don't know how folks here can stay under 100 milliseconds without parsing the file beforehand. My precompilation generates a binary table that stores 10 bytes for each entry: 4 bytes each for start/end of address space and 2 bytes for the country code. Additionally, for saving memory, I squeeze memory areas with adjacent addresses in the same country. This saves more than 50% of the records while it doesn't significantly speed up the search (just about 1 step more in my binary search algorithm, see below). the packer (packr.rb) uses pack("NNa2") for building the binary table, and look, the addresses are stored in network byte order (i.e. big endian). The output table holds about 32 k Entries.

The packer doesn't accept any parameters, just call

$ ./packer.rb


Third comes the actual address search. I implemented two versions for comparison: One with a small RAM footprint that looks up everything in the packed table file. The other one reads the whole table into memory and performs all lookups in memory.

The algorithm is the same in both implementations: I use binary search over all entries until the ip address either matches one range or there is no match at all.

Comparison is sped up by making 4-letter string comparisons of the respective binary addresses. That way "search_ip > ddr_start" works even when the addresses are strings. This saves a lot of time because at search time there is no .to_i calculation avoided.

I run the searchers (search_file.rb, search_str.rb) for small numbers of random IP address using the command line:

$ time ./search_file.rb `./rand_ip.rb 3000` > /dev/null

For more addresses I use one of the following -- the last usage is to make successive runs with equal input:

$ time ./rand_ip.rb 100000 byline | ./search_file.rb > /dev/null
$ ./rand_ip.rb 100000 byline > ip-test-list; time ./search_file.rb < ip-test-list > /dev/null



My results for 100000 lookups are:

$ time ./packer.rb

real    0m1.634s
user    0m1.576s
sys     0m0.056s
$ time ./rand_ip.rb 100000 byline > ip-test-list

real    0m0.797s
user    0m0.740s
sys     0m0.056s
$ time ./search_file.rb < ip-test-list > /dev/null

real    0m11.091s
user    0m9.673s
sys     0m1.420s
$ time ./search_str.rb < ip-test-list > /dev/null

real    0m7.131s
user    0m6.960s
sys     0m0.168s

btw: I don't understand the following result -- can you help me improving this? 129 ms just for firing ruby up! Can't be!
$  time ruby /dev/null

real    0m0.129s
user    0m0.120s
sys     0m0.012s
$  uname -a
Linux server 2.6.21-modgentoo #2 SMP Wed May 2 19:07:13 CEST 2007 x86_64 AMD Athlon(tm) 64 Processor 3000+ AuthenticAMD GNU/Linux
$ ruby --version
ruby 1.8.6 (2007-03-13 patchlevel 0) [x86_64-linux]

Cheers,
- Matthias

--------------000706010906020103020108
Content-Type: text/plain;
 nameand_ip.rb"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
 filenameand_ip.rb"

#!/usr/bin/ruby

r ۰ݮ
EOL RGV[1] ? "\n" : " "

r.times {
  print "#{rand(256)}.#{rand(256)}.#{rand(256)}.#{rand(256)}#{EOL}"
}

--------------000706010906020103020108
Content-Type: text/plain;
 nameacker.rb"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
 filenameacker.rb"

#!/usr/bin/ruby
# comment

last_startl
last_endl
last_countryl
File.open("packed-ip.dat","wb") do |wfile|
  IO.foreach("geo-ip.csv") do |line|
    next if !(line /^"/ )
      s,e,d1,d2,cone.delete!("\"").split(",")
      s,e  .to_i,e.to_i
      if !last_start
# initialize with first entry
        last_start,last_end,last_country  ,e,co
      else
        if sast_end+1 and coast_country
# squeeze if successive ranges have zero gap
          last_end       else
# append last entry, remember new one
          wfile << [last_start,last_end,last_country].pack("NNa2")
          last_start,last_end,last_country  ,e,co
        end
      end
  end
# print last entry
  if last_start
    wfile << [last_start,last_end,last_country].pack("NNa2")
  end
end

--------------000706010906020103020108
Content-Type: text/plain;
 nameearch_file.rb"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
 filenameearch_file.rb"

#!/usr/bin/ruby

# take command line or stdin -- the latter has performance advantage for long lists
if ARGV[0]
  arr else
  arrtdin
end

# the binary table file is looked up with each request
File.open("packed-ip.dat","rb") do |rfile|
  rfile.seek(0,IO::SEEK_END)
  record_maxile.pos/10-1

  arr.each { |argv|
    # build a 4-char string representation of IP address
    # in network byte order so it can be a string compare below
    ipstr rgv.split(".").map {|x| x.to_i.chr}.join

    # low/high water marks initialized
    low,highrecord_max
    while true
      midow+high)/2       # binary search median
      rfile.seek(10*mid)     # one record is 10 byte, seek to position
      strile.read(8)      # for range matching, we need only 8 bytes
      # at comparison, values are big endian, i.e. packed("N")
      if ipstr>r[0,4]     # is this IP not below the current range?
        if ipstr<r[4,4]   # is this IP not above the current range?
          puts rfile.read(2) # a perfect match, voila!
          break
        else
          lowd+1          # binary search: raise lower limit
        end
      else
        highd-1           # binary search: reduce upper limit
      end
      if low>high            # no entries left? nothing found
        puts "no country"
        break
      end
    end
  }
end

--------------000706010906020103020108
Content-Type: text/plain;
 nameearch_str.rb"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline;
 filenameearch_str.rb"

#!/usr/bin/ruby

# read the binary table upfront
fe.open("packed-ip.dat","rb")
tableead
record_max
ble.length/10-1
f.close

# take command line or stdin -- the latter has performance advantage for long lists
if ARGV[0]
  arr else
  arrtdin
end  

arr.each { |argv|
  # build a 4-char string representation of IP address
  # in network byte order so it can be a string compare below
  ipstr rgv.split(".").map {|x| x.to_i.chr}.join

  # low/high water marks initialized
  low,highrecord_max
  while true
    midow+high)/2              # binary search median
    # at comparison, values are big endian, i.e. packed("N")
    if ipstr>
ble[10*mid,4]     # is this IP not below the current range?
      if ipstr<
ble[10*mid+4,4] # is this IP not above the current range?
        puts table[10*mid+8,2]    # a perfecct match, voila!
        break
      else
        lowd+1                 # binary search: raise lower limit
      end
    else
      highd-1                  # binary search: reduce upper limit
    end
    if low>high                   # no entries left? nothing found
      puts "no country"
      break
    end
  end
}


--------------000706010906020103020108--