--------------000706010906020103020108 Content-Type: text/plain; charset=ISO-8859-15 Content-Transfer-Encoding: 7bit Ruby Quiz wrote: > This week's Ruby Quiz is to write a simple utility. Your program should accept > an IP address as a command-line argument and print out the two letter code for > the country that IP is assigned in. You can find a database for the matching > at: > > http://software77.net/cgi-bin/ip-country/geo-ip.pl > > To keep the problem interesting though, let's write our programs with a focus on > speed and memory efficiency. > > $ time ruby ip_to_country.rb 68.97.89.187 > US > > real 0m0.314s > user 0m0.259s > sys 0m0.053s This is my first submission to Ruby Quiz, so don't expect ground-breaking techniques from this implementation. Let me characterize my solution: Instead of looking up the csv file directly, I precompile it. The search can then be performed quickly. I think my solution will beat some others when it comes to massive lookups in the range of 100th of thousand lookups. First, I wrote a random IP address generator. It can alternatively output the IP addresses as a space delimited string or each IP address on a separate line. On my bash submitting all IP address worked up to about 8 k entries until the argument list was too long, so I implemented that the addresses can be given on $stdin also. For a proper argument list on stdin, the entries should be separated by \n, so just add a second argument to rand_ip.rb to generate the list of ips -- the script doesn't care what the second parameter is, it just has to be there (i use "byline"). I use the script like this: $ ./rand_ip.rb 100000 byline > ip-test-list Second, I use a precompile run. I don't know how folks here can stay under 100 milliseconds without parsing the file beforehand. My precompilation generates a binary table that stores 10 bytes for each entry: 4 bytes each for start/end of address space and 2 bytes for the country code. Additionally, for saving memory, I squeeze memory areas with adjacent addresses in the same country. This saves more than 50% of the records while it doesn't significantly speed up the search (just about 1 step more in my binary search algorithm, see below). the packer (packr.rb) uses pack("NNa2") for building the binary table, and look, the addresses are stored in network byte order (i.e. big endian). The output table holds about 32 k Entries. The packer doesn't accept any parameters, just call $ ./packer.rb Third comes the actual address search. I implemented two versions for comparison: One with a small RAM footprint that looks up everything in the packed table file. The other one reads the whole table into memory and performs all lookups in memory. The algorithm is the same in both implementations: I use binary search over all entries until the ip address either matches one range or there is no match at all. Comparison is sped up by making 4-letter string comparisons of the respective binary addresses. That way "search_ip > ddr_start" works even when the addresses are strings. This saves a lot of time because at search time there is no .to_i calculation avoided. I run the searchers (search_file.rb, search_str.rb) for small numbers of random IP address using the command line: $ time ./search_file.rb `./rand_ip.rb 3000` > /dev/null For more addresses I use one of the following -- the last usage is to make successive runs with equal input: $ time ./rand_ip.rb 100000 byline | ./search_file.rb > /dev/null $ ./rand_ip.rb 100000 byline > ip-test-list; time ./search_file.rb < ip-test-list > /dev/null My results for 100000 lookups are: $ time ./packer.rb real 0m1.634s user 0m1.576s sys 0m0.056s $ time ./rand_ip.rb 100000 byline > ip-test-list real 0m0.797s user 0m0.740s sys 0m0.056s $ time ./search_file.rb < ip-test-list > /dev/null real 0m11.091s user 0m9.673s sys 0m1.420s $ time ./search_str.rb < ip-test-list > /dev/null real 0m7.131s user 0m6.960s sys 0m0.168s btw: I don't understand the following result -- can you help me improving this? 129 ms just for firing ruby up! Can't be! $ time ruby /dev/null real 0m0.129s user 0m0.120s sys 0m0.012s $ uname -a Linux server 2.6.21-modgentoo #2 SMP Wed May 2 19:07:13 CEST 2007 x86_64 AMD Athlon(tm) 64 Processor 3000+ AuthenticAMD GNU/Linux $ ruby --version ruby 1.8.6 (2007-03-13 patchlevel 0) [x86_64-linux] Cheers, - Matthias --------------000706010906020103020108 Content-Type: text/plain; name and_ip.rb" Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename and_ip.rb" #!/usr/bin/ruby r ¨ÂÖÛ°Ý®ôïß EOL RGV[1] ? "\n" : " " r.times { print "#{rand(256)}.#{rand(256)}.#{rand(256)}.#{rand(256)}#{EOL}" } --------------000706010906020103020108 Content-Type: text/plain; name acker.rb" Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename acker.rb" #!/usr/bin/ruby # comment last_start l last_end l last_country l File.open("packed-ip.dat","wb") do |wfile| IO.foreach("geo-ip.csv") do |line| next if !(line /^"/ ) s,e,d1,d2,co ne.delete!("\"").split(",") s,e .to_i,e.to_i if !last_start # initialize with first entry last_start,last_end,last_country ,e,co else if s ast_end+1 and co ast_country # squeeze if successive ranges have zero gap last_end else # append last entry, remember new one wfile << [last_start,last_end,last_country].pack("NNa2") last_start,last_end,last_country ,e,co end end end # print last entry if last_start wfile << [last_start,last_end,last_country].pack("NNa2") end end --------------000706010906020103020108 Content-Type: text/plain; name earch_file.rb" Content-Transfer-Encoding: 7bit Content-Disposition: inline; filename earch_file.rb" #!/usr/bin/ruby # take command line or stdin -- the latter has performance advantage for long lists if ARGV[0] arr ¨Âelse arr tdin end # the binary table file is looked up with each request File.open("packed-ip.dat","rb") do |rfile| rfile.seek(0,IO::SEEK_END) record_maxile.pos/10-1 arr.each { |argv| # build a 4-char string representation of IP address # in network byte order so it can be a string compare below ipstr rgv.split(".").map {|x| x.to_i.chr}.join # low/high water marks initialized low,high record_max while true mid ow+high)/2 # binary search median rfile.seek(10*mid) # one record is 10 byte, seek to position strile.read(8) # for range matching, we need only 8 bytes # at comparison, values are big endian, i.e. packed("N") if ipstr>