Facebook Places vs. Foursquare API

December 15, 2010

Lately I’ve been doing a lot of work related to location and geocoding, and had a chance to integrate with both the Facebook Places API and the Foursquare API. Here is a simple side-by-side comparison that you may find useful.

Both APIs are JSON-based HTTP REST interfaces. They offer place search by query string (e.g. ‘pizza’), coordinates (lat & lon) and search radius. Authentication is based on OAuth, though Foursquare can also be used without a user token, within the rate limit.

Facebook Places API

The Places API is part of the larger Graph API and adds place queries alongside queries for people, objects, etc.

A simple example query with the keyword pizza in New Jersey.

https://graph.facebook.com/search?q=pizza&type=place&center=40.82,-74.1&distance=1000&access_token=<<oauth access token>>
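If you assemble this query in Node.js, something like the following could work. It is only a minimal sketch: placesSearchUrl is a hypothetical helper name, 'TOKEN' is a placeholder for a real OAuth access token, and the parameters are just the ones visible in the example query above.

```javascript
// Hypothetical helper: builds a Places search URL from the parameters
// used in the example query above.
function placesSearchUrl(query, lat, lon, distance, accessToken) {
  const params = new URLSearchParams({
    q: query,
    type: 'place',
    center: lat + ',' + lon,      // "lat,lon"
    distance: String(distance),   // search radius
    access_token: accessToken
  });
  return 'https://graph.facebook.com/search?' + params.toString();
}

const url = placesSearchUrl('pizza', 40.82, -74.1, 1000, 'TOKEN');
console.log(url);
```

URLSearchParams percent-encodes the comma in the center parameter, which is equivalent to the plain comma in the hand-written URL.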

Example return value JSON

{u'data': [{u'category': u'Local business',
               u'id': u'154150281291249',
               u'location': {u'city': u'Wallington',
                   u'latitude': 40.843026999999999,
                   u'longitude': -74.105144999999993,
                   u'state': u'NJ',
                   u'street': u'435 Paterson Ave',
                   u'zip': u'07057-2202'},
               u'name': u"Marina's Pizza & Restaurant"},
             {u'category': u'Local business',
              u'id': u'116347165056160',
              u'location': {u'city': u'Carlstadt',
                   u'latitude': 40.841267000000002,
                   u'longitude': -74.101149000000007,
                   u'state': u'NJ',
                   u'street': u'326 Garden St',
                   u'zip': u'07072-1626'},
              u'name': u'Garden Pizza Ice Cream & Cafe'},
...

Note that at the time of writing, the Places API does not work outside the United States. If you call it from an IP address outside the US, you’ll probably get the following cryptic error response:

{
   "error": {
      "type": "OAuthException",
      "message": "(#603) The table you requested does not exist"
   }
}

The API is pretty fast, but data quality is mediocre at best. You pretty much get only the name, location and some sparse address data, which might be enough.

Cannot be used anonymously; you need the user’s OAuth access token.
Results often have duplicates and typos, and include strange places that seem to have been automatically scraped from some database.
Rate limit is not a problem. Facebook does enforce some kind of (high) rate limit, but it’s not documented. If you stay below 1-2 requests per second per access token, you are in the clear.
Every location, even mountains, seems to be categorized as ‘Local business’.
Currently cannot be used from outside the US, and it rarely returns results for locations outside the US.
Results can be paged.

Until Facebook adds proper category info, this API is useful only for finding places by name.

FourSquare v1 API (will be deprecated mid-2011)

The corresponding query in the Foursquare v1 API is the ‘venues’ call. It is slower, but the data quality is much better and denser. You also get category information that is pretty good for filtering and can also be used as a human-readable description of the venue.

Example query

http://api.foursquare.com/v1/venues.json?geolat=40.82&geolong=-74.1&l=50&q=pizza

Example result JSON

{'groups': [{'type': 'Matching Places',
             'venues': [{'address': '85 Rte 17 South',
                        'city': 'East Rutherford',
                        'distance': 1299,
                        'geolat': 40.830497762250729,
                        'geolong': -74.093245267868042,
                        'id': 370040,
                        'name': "CiCi's Pizza Buffet",
                        'phone': '2014388200',
                        'primarycategory': {
                            'fullpathname': 'Food:Pizza',
                            'iconurl': 'http://foursquare.com/img/categories/food/pizza.png',
                            'id': 79081,
                            'nodename': 'Pizza'},
                        'state': 'NJ',
                        'stats': {'herenow': '0'},
                        'verified': False,
                        'zip': '07073'},
                        {'address': '271 Main St.',
                         'city': 'Belleville',
                         'distance': 5168,
                         'geolat': 40.789736300000001,
                         'geolong': -74.146524900000003,
                         'id': 1206994,
                         'name': u'Pizza Village Caf\xe9 II',
                         'phone': '9734501818',
                         'primarycategory': {
                             'fullpathname': 'Food:Pizza',
                             'iconurl': 'http://foursquare.com/img/categories/food/pizza.png',
                             'id': 79081,
                             'nodename': 'Pizza'},
                         'state': 'New Jersey',
                         'stats': {'herenow': '0'},
                         'verified': False,
                         'zip': '07109'},
                         ...

Data quality is better than Facebook’s and results seem much more relevant, at least for now.

Query latency varies a lot.
Few duplicates, some because of user typos.
Lots of results for non-US locations too.
Strict rate limit: you can only do a few requests per second and the daily total is a few hundred (computed per IP + authenticated user).
You should authenticate your users with Foursquare via OAuth, otherwise the rate limit will make the API impossible to use.
Maximum number of results per query is 50.
Sometimes returns pretty irrelevant places like homes, as category cannot be used as a query filter.
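Since the category cannot be used in the query itself, one workaround is to filter on the client side. Here is a rough sketch that assumes the response has the same shape as the example result above; venuesInCategory is a hypothetical helper and the sample data is cut down to the fields it needs.

```javascript
// Keep only venues whose primary category path starts with the given prefix.
// `response` is assumed to have the shape of the v1 result shown above.
function venuesInCategory(response, prefix) {
  const matches = [];
  for (const group of response.groups || []) {
    for (const venue of group.venues || []) {
      const cat = venue.primarycategory;
      // Venues without a category are skipped.
      if (cat && typeof cat.fullpathname === 'string' &&
          cat.fullpathname.indexOf(prefix) === 0) {
        matches.push(venue);
      }
    }
  }
  return matches;
}

const sample = {
  groups: [{
    type: 'Matching Places',
    venues: [
      { name: "CiCi's Pizza Buffet", primarycategory: { fullpathname: 'Food:Pizza' } },
      { name: 'Uncategorized place' }
    ]
  }]
};
const foodNames = venuesInCategory(sample, 'Food').map(function (v) { return v.name; });
console.log(foodNames);
```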

FourSquare v2 API (Beta, work in progress)

The v2 API promises improvements over the old v1 and is now available for public use. However, it’s still a work in progress and there is no guarantee that it stays backwards compatible. Official info here: http://developer.foursquare.com/docs/overview.html

Unlike v1, v2 does not allow completely anonymous access, so you have to register your application here. Also, only HTTPS is supported. User authentication uses OAuth2, which is much simpler to implement than the OAuth1 used in v1.

Example (unauthenticated) query.

https://api.foursquare.com/v2/venues/search?client_secret=<<my app secret>>&ll=60.205065%2C24.654196&client_id=<<my app key>>&l=100&llAcc=100

Example response

{'meta': {'code': 200},
 'response': {'groups':
                    [{'items':
                        [{'categories':
                              [{'icon': 'http://foursquare.com/img/categories/food/pizza.png',
                                'id': '4bf58dd8d48988d1ca941735',
                                'name': 'Pizza',
                                'parents': ['Food'],
                                'primary': True},
                               {'icon': 'http://foursquare.com/img/categories/nightlife/wine.png',
                                'id': '4bf58dd8d48988d123941735',
                                'name': 'Wine Bar',
                                'parents': ['Nightlife']},
                               {'icon': 'http://foursquare.com/img/categories/nightlife/default.png',
                                'id': '4bf58dd8d48988d116941735',
                                'name': 'Bar',
                                'parents': ['Nightlife']}],
                         'contact': {'phone': '4155589991',
                                     'twitter': 'patxispizza'},
                         'hereNow': {'count': 0},
                         'id': '43d7e5dff964a5205d2e1fe3',
                         'location': {'address': '511 Hayes St',
                                      'city': 'San Francisco',
                                      'crossStreet': 'at Octavia Blvd.',
                                      'distance': 474,
                                      'lat': 37.776435900000003,
                                      'lng': -122.425003,
                                      'postalCode': '94102',
                                      'state': 'CA'},
                         'name': "Patxi's Chicago Pizza",
                         'specials': [{'description': 'every check-in',
                                       'id': '4c06d44386ba62b5945588b3',
                                       'message': 'Check-in, show your server, and your first fountain beverage (w/ purchase) is free or your first PBR is only $1. Free 10-inch half-baked pizza of any type, once a week, for the Mayor.',
                                       'type': 'frequency'}],
                         'stats': {'checkinsCount': 2269,
                                   'usersCount': 1115},
                                   'verified': True},
                         ...

Improvements and changes over v1 API

Rate limited as v1
Maximum number of results is now over 100
Better category structure
Venue Twitter handle
Address detail includes crossStreet info
Specials, etc.

They also changed the venue id format from a number to a hash string, which makes things difficult for those who have already used the old API. I also wish there were a way to filter out private homes from the results.

Filed under Experimental

Node.js Application Configuration Files

December 13, 2010

What is the best practice for making a configuration file for your Node.js application? Writing a property file parser or passing parameters on the command line is cumbersome.

Eval

One easy way to separate configuration and application code is the eval statement. Define your configuration as a simple JavaScript associative array, then load and evaluate it on app startup.

Example configuration file myconfig.js

settings = {
    a: 10,
    // this is used for something
    SOME_FILE: "/tmp/something"
}

Then at start of your application

var fs = require('fs');
eval(fs.readFileSync('myconfig.js', 'ascii'));

Now the settings object can be used as your program settings, e.g.

var mydata = fs.readFileSync(settings.SOME_FILE);
for (var i = 0; i < settings.a; i++) {
   // do something
}

Require

Another way to load configuration, as pointed out in the comments, is to define the configuration as a module file and require it.

//-- configuration.js
module.exports = {
  a: 10,
  SOME_FILE: '/tmp/foo'
}

Then require the file in application code

var settings = require('./configuration');

This prevents other global variables from creeping into the global scope, but dynamic configuration reloading is hackier. If you detect that the file has changed and want to reload it at runtime, you must delete its entry from the require cache and re-require the file. Another minor complication is that require uses its own search path (that you can extend with the NODE_PATH env. variable), so it’s more work to define a dynamic location for the configuration file in your app (e.g. to set it from the command line).

// to reload file with require
// require.resolve gives the exact key that the require cache uses
var filename = require.resolve('./configuration');
delete require.cache[filename];
var settings = require('./configuration');
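Setting the configuration location from the command line could look roughly like this. It is a minimal sketch: loadConfig is a hypothetical helper name, and the temporary file only stands in for a real configuration file whose path would come from process.argv.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: (re)loads a configuration module from any location.
function loadConfig(file) {
  // Resolve against the current working directory, so the path can come
  // straight from the command line, e.g. process.argv[2].
  const filename = require.resolve(path.resolve(file));
  delete require.cache[filename];   // forget the previously loaded copy
  return require(filename);
}

// Usage with a throwaway file standing in for the real configuration.
const demo = path.join(os.tmpdir(), 'demo-configuration-' + process.pid + '.js');
fs.writeFileSync(demo, 'module.exports = { a: 10 };');
const first = loadConfig(demo);
fs.writeFileSync(demo, 'module.exports = { a: 20 };');
const second = loadConfig(demo);   // picks up the changed file
console.log(first.a, second.a);
fs.unlinkSync(demo);
```

Because the path is resolved to an absolute filename first, the module search path does not come into play at all.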

Using plain JavaScript as the configuration file has the benefit (and downside) that it’s possible to run any JavaScript in the settings. For example:

settings = {
    started: new Date(),
    nonce: ~~(1E6 * Math.random()),
    a: 10,
    SOME_FILE: "/tmp/something"
}

Choosing between these two methods is mostly a matter of taste. Eval is a bit riskier as it allows leaking variables into the global namespace, but you’ll never have anything “stupid” in the configuration files anyway. Right?

JSON file

I’m not a fan of using JSON as a configuration format, as it’s cumbersome to write and most editors are not able to show syntax errors in it. JSON also does not support comments, which can be a big problem in more complicated configuration files.

Example configuration file myconfig.json

{
   "a":10,
   "SOME_FILE":"/tmp/something"
}

Then at the start of your application, read the JSON file and parse it into an object.

var fs = require('fs');
var settings = JSON.parse(fs.readFileSync('myconfig.json', 'ascii'));

And then use settings as usual

var mydata = fs.readFileSync(settings.SOME_FILE);
for (var i = 0; i < settings.a; i++) {
   // do something
}

Merging configuration files

One way to significantly simplify configuration management is hierarchical configuration. For example, have a single base configuration file and then define overrides for development, testing and production use.

For this we need a merge function.

// merges o2 properties into o1 (o1 is modified and returned)
var merge = exports.merge = function(o1, o2) {
    for (var prop in o2) {
         var val = o2[prop];
         if (o1.hasOwnProperty(prop)) {
             // merge nested objects recursively; arrays and scalars are overridden
             if (val && typeof val == 'object' && val.constructor != Array
                     && o1[prop] && typeof o1[prop] == 'object') {
                 val = merge(o1[prop], val);
             }
         }
         o1[prop] = val; // copy and override
    }
    return o1;
}

You can use merge to combine configurations. For example, take these two configuration objects.

// base configuration from baseconf.js
var baseconfig = {
    a: "someval",
    env: {
        name: "base",
        code: 1
    }
}

// test config from localconf.js
var localconfig = {
    env: {
        name: "test",
        db: 'localhost'
    },
    test: true
}

Now it’s possible to merge these easily

var settings = merge( baseconfig, localconfig );

console.log( settings.a ); // prints 'someval'
console.log( settings.env.name ); // prints 'test'
console.log( settings.env.code ); // prints 1
console.log( settings.env.db ); // prints 'localhost'
console.log( settings.test ); // prints true
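Note that merge modifies its first argument, so baseconfig itself changes in the call above. If that is a problem, a non-mutating variant could be sketched like this. The names merged and isPlainObject are made up for this example; it is one possible approach, not a replacement for the function above.

```javascript
// True for {...} style objects only, so Dates, arrays etc. are copied as values.
function isPlainObject(v) {
  return v !== null && typeof v === 'object' &&
         Object.getPrototypeOf(v) === Object.prototype;
}

// Variant of merge that builds a new object instead of modifying o1.
function merged(o1, o2) {
  const out = {};
  for (const src of [o1, o2]) {
    for (const prop of Object.keys(src)) {
      const val = src[prop];
      if (isPlainObject(val)) {
        // Recurse so nested objects are copied, never shared with the inputs.
        out[prop] = merged(isPlainObject(out[prop]) ? out[prop] : {}, val);
      } else {
        out[prop] = val;   // scalars and arrays override as before
      }
    }
  }
  return out;
}

const base = { a: 'someval', env: { name: 'base', code: 1 } };
const local = { env: { name: 'test', db: 'localhost' }, test: true };
const combined = merged(base, local);
console.log(combined.env.name, base.env.name);
```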

Filed under Javascript Tagged with node.js

Socket Pooling on Node.js

December 13, 2010

UPDATE: This post is pretty old and nowadays there are quite a few socket pooling implementations for Node.js. I recommend taking a look at Jackpot (https://github.com/3rd-Eden/jackpot). It’s simple and does the job.

I was looking for connection pooling for my Apple Push Notification proxy but couldn’t find a proper pool implementation, just some experiments that didn’t do proper error handling or could not handle basic real-life requirements. I also needed a pool that could be called in both blocking and non-blocking mode.

Basic functional requirements of a typical connection pool:

Ensure that connections are available and that no more than the maximum number of parallel connections exist.
Do not keep unnecessary connections open for long. Less traffic, fewer connections.
Handle connection decay, for example when a connection gets closed by the remote peer while it’s waiting in the pool.
Sane retry logic. Most importantly, do not retry connections as fast as the CPU allows in case of error.
Either guarantee a maximum waiting time, or support a way of checking connection availability.

Let’s first define the interface as a Node.js module that exports a pool with reserve and release methods.

exports.ConnectionPool = function(factory) {
 var self = {};
 var waitlist = [] ;   // callbacks waiting for connection
 var connections = [];  // unused connections currently in pool
 var ccount = 0;  // number of current connections

 // called to reserve connection from pool. Calls callback without connection if wait is false
 self.reserve = function(callback, wait) { ... }
 // returns connection to the pool. Connection is destroyed if destroy is true
 self.release = function(connection, destroy) { ... }

 return self;  // the pool is created by calling ConnectionPool(factory) without new
}

The pool requires a user-provided factory, a dictionary that defines three functions and one property.

var factory = {
  create: function(callback) { ... },     // create connection and call callback with it
  validate: function(connection) { ... }, // return true or false for connection
  destroy: function(connection) { ... },  // destroy connection
  max: 5                                  // maximum number of connections
}

The pool implementation needs several methods for housekeeping. checkWaiters() is called to create new connections for waiting callbacks.

function checkWaiters() {
  if(waitlist.length > 0 && ccount < factory.max) {
     ccount++;
     factory.create(function(connection) {          
       if(!connection) {
         ccount--; // failed                    
       } else {
         if(waitlist.length > 0)
            waitlist.shift()(connection);
         else
           connections.push(connection);
       }
    });                           
  }      
 }

Then its counterpart, destroyConnection(), which removes a connection from the pool for good. This function can be called from several places and situations, so it adds a “destroyed” flag to the connection to avoid duplicate processing. It also tries to create a new connection immediately (if needed) by calling checkWaiters().

function destroyConnection(connection) {
    if(connection.destroyed) {
       return;
    }
    connection.destroyed = true;
    for(var i=0; i < connections.length; i++) {
        // remove from pool if it's there                                
        if(connection == connections[i]) {
           clearTimeout(connection.timeoutid);
           connections.splice(i,1);
        }
    }
    ccount--;
    factory.destroy(connection);

    // connection was lost, we need to create new one if there are       
    // waiting requests                                                  
    checkWaiters();
 }

Then the actual reserve interface method. The function has two modes: if wait is false, it calls back immediately without a connection when it fails to provide one; otherwise the callback goes to the waiting list. Connections from the pool are validated with the factory before they are passed to the callback.

self.reserve = function(callback, wait) {
    if (wait === undefined) {
        wait = true;
    }
    if(connections.length > 0) {  // pool has available connections
        var connection = connections.shift();
        if(factory.validate(connection)) {  // is it still valid
           clearTimeout(connection.timeoutid);  // cancel the cleanup timeout
           callback(connection);
           return;
        } else {
            destroyConnection(connection);  // stale connection
        }
    }
    if(ccount >= factory.max) { // maximum number of connections created
       if(!wait) {
          callback();
       } else {
          waitlist.push(callback);
       }
       return;
    }
    ccount++;  // try to create connection
    factory.create(function(connection) {
        if(!connection) {
           ccount--; // failed                                          
           if(!wait) {
              callback();
           } else {
              waitlist.push(callback);
           }
        } else {
          callback(connection); // connection created successfully
        }
    });
 }

And the release method. It destroys the connection if requested and then forgets about it completely. Otherwise it tries to find a callback from the waiting list and passes the connection on for immediate reuse. If nobody is waiting, it puts the connection back in the pool and schedules connection cleanup in 10 seconds.

self.release = function(connection, destroy) {
    if(destroy) {
       destroyConnection(connection);
    } else {
       if(waitlist.length > 0) {
          waitlist.shift()(connection);
         return;
       }
       connections.push(connection);
       connection.timeoutid = setTimeout(function() {
           destroyConnection(connection);
       }, 10000);
    }
 }

And finally the background polling that is mainly responsible for the connection retry logic. It calls checkWaiters() once per second; as you remember, that function creates connections if anyone is waiting for one. Normally the connections are created on demand in the reserve() call.

function poll() {
   checkWaiters();
   setTimeout(poll, 1000);
}
setTimeout(poll, 1000);  // start poller

How does this handle the different error cases?

Idle connections are cleaned up by the timeout that is set in release().
User code can request connection deletion by setting the destroy flag to true in the call to release().
Connections that go bad while in the pool are (hopefully) intercepted by the user-provided factory.validate().
The counter keeps track of the number of created connections, and because it’s increased before creating a connection it also limits the maximum number of parallel connection attempts!
When connection creation fails, the callback goes to the waiting list and the once-per-second poll tries to create new connections.
User code that needs to proceed immediately can set the wait flag to false when reserving a connection. The code could also be changed to call the callback on timeout, so wait could be defined in milliseconds instead of the current choice between instant fail and infinite wait.
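The timed wait mentioned in the last point does not necessarily need changes inside the pool. A rough sketch of a wrapper around reserve() follows; reserveWithTimeout is a hypothetical name, and the stub pool exists only so the example runs on its own.

```javascript
// Hypothetical wrapper: calls callback with a connection, or with nothing
// if the pool cannot provide one within `ms` milliseconds.
function reserveWithTimeout(pool, ms, callback) {
  let done = false;
  const timer = setTimeout(function () {
    done = true;
    callback();               // give up, same convention as wait === false
  }, ms);
  pool.reserve(function (connection) {
    if (done) {
      // Too late: the caller has moved on, so hand the connection back.
      if (connection) pool.release(connection);
      return;
    }
    done = true;
    clearTimeout(timer);
    callback(connection);
  }, true);
}

// Stub pool for demonstration; a real pool comes from ConnectionPool(factory).
const stubPool = {
  released: [],
  reserve: function (cb) { cb({ id: 1 }); },
  release: function (c) { this.released.push(c); }
};
let got;
reserveWithTimeout(stubPool, 1000, function (c) { got = c; });
console.log(got);
```

The callback stays in the waiting list after the timeout fires, so a connection that arrives late is simply released back to the pool.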

Now, how to use it.

HTTP Pool Example

var pool = require('./pool');
 var http = require('http');

var httppool = pool.ConnectionPool({
 create: function(callback) { callback(http.createClient(80, "www.google.com")); },
 validate: function(connection) { return true; /* no need to validate  */ },
 destroy : function(httpclient) {  /* nothing to destroy */},
 max: settings.couchdbmax
 });

Using the pool

 httppool.reserve(function(connection) {
   var req = connection.request("GET", "/index.html");
   req.on('error', function(error) {
     httppool.release(connection, true);
   });
   req.on('response', function(response) {
     var body = '';
     response.on('data', function(data) {
       body += data;
     });
     response.on('end', function() {
       console.log(body);
       httppool.release(connection);
     });
   });
   req.end();  // send the request
 });

Client socket pool

var pool = require('./pool');
var net = require('net');

var apnpool = pool.ConnectionPool({
 create: function(callback) {
   function errorcb(error) {  // error handler
     require('util').puts(error.stack);
     callback();
   }
   var connection = net.createConnection(12345, 'server.example.com');
   connection.once('error', errorcb); 
   connection.on('connect', function() {
     connection.removeListener('error', errorcb); // clear error handler before passing forward
     callback(connection);
   });
 },
 validate: function(connection) { return connection.writable; },
 destroy : function(connection) { connection.end(); },
 max: 5,
 });

Reuse is a little bit tricky with sockets, as you probably need to set and clear the ‘error’ and ‘data’ event handlers for reading the responses in each worker.
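One way to keep the handler bookkeeping manageable is a small helper that attaches the handlers and returns a function to remove them again. This is only a sketch: attachHandlers is a hypothetical name, and a plain EventEmitter stands in for a real socket so the example is self-contained.

```javascript
const EventEmitter = require('events').EventEmitter;

// Hypothetical helper: attaches per-request handlers to a pooled socket and
// returns a function that removes them again before the socket is released.
function attachHandlers(connection, onData, onError) {
  connection.on('data', onData);
  connection.on('error', onError);
  return function detach() {
    connection.removeListener('data', onData);
    connection.removeListener('error', onError);
  };
}

// Demonstration with a plain EventEmitter standing in for a net.Socket.
const fakeSocket = new EventEmitter();
const seen = [];
const detach = attachHandlers(fakeSocket,
  function (chunk) { seen.push(chunk); },
  function (err) { seen.push('error: ' + err.message); });
fakeSocket.emit('data', 'first');
detach();
fakeSocket.emit('data', 'second');   // ignored, handlers are gone
console.log(seen);
```

A worker would call detach() right before handing the connection back with release().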

Filed under Experimental, Javascript Tagged with node.js

Apple Push Notifications with Node.js

December 9, 2010

When your iPhone app backend needs to send Apple Push Notifications, it must do so over a raw SSL socket using Apple’s proprietary binary interface. Standard Web REST is not supported. This kind of sucks, because if your entire backend is web based you need to break that cleanliness with an external HTTP to APN proxy. One option is to use a service like Urban Airship, but you can also build the proxy yourself.

One potential platform for this is the hyped Node.js, the rising JavaScript engine for building ad-hoc web servers. The web is full of examples of building a simple HTTP server or proxy with Node.js, so this post covers only the part where we open a secure connection to the Apple server and send push notifications with plain Node.js JavaScript.

Please note that Apple assumes that you pool sockets and keep them open as long as you have notifications to send. So don’t make a naive implementation that opens a new socket for each HTTP request. Some simple pooling and reuse is a must for a real implementation.

In addition to sending the push notifications, your app also needs to poll the APNS feedback service to find out which devices have uninstalled the app and should not be sent new notifications. See more details in the post Apple Push Notification feedback service.

1. Get Certificates

Apple’s Push notification server authenticates the application with SSL certificates. There is no additional authentication handshake after the secure connection has been established.

First we need the certificates in PEM format, which you can get by exporting them from Keychain Access. Also export the Apple Worldwide CA certificate. See this excellent blog post (up to step 5) for details on how to acquire the PEM files: http://blog.boxedice.com/2010/06/05/how-to-renew-your-apple-push-notification-push-ssl-certificate/

Now you should have the following certificate files.

app-cert.pem (Application certificate)
app-key-noenc.pem (Application private key)
apple-worldwide-certificate-authority.cer (Apple CA certificate)

2. Open Connection to Push Server

UPDATE: See a more complete TLS example here.

Moving on to the actual implementation in Node.js. This is quite simple: you just read the various certificate files as strings and use them as credentials.

You must also have SSL support built into your Node.js binary.

var fs = require('fs');
var crypto = require('crypto');
var tls = require('tls');

var certPem = fs.readFileSync('app-cert.pem', 'ascii');
var keyPem = fs.readFileSync('app-key-noenc.pem', 'ascii');
var caCert = fs.readFileSync('apple-worldwide-certificate-authority.cer', 'ascii');
var options = { key: keyPem, cert: certPem, ca: [ caCert ] };

function connectAPN( next ) {
    var stream = tls.connect(2195, 'gateway.sandbox.push.apple.com', options, function() {
        // connected
        next( !stream.authorized, stream );
    });
}

3. Write Push Notification

After secure connection is established, you can simply write push notifications to the socket as binary data. Push notification is addressed to a device with 32 byte long push token that must be acquired by your iPhone application and sent to your backend somehow.

An easy format for the token is a plain hexadecimal string, so we first define a helper method that converts the hexadecimal string to a binary buffer on the server side.

function hextobin(hexstr) {
   var buf = new Buffer(hexstr.length / 2);
   for(var i = 0; i < hexstr.length/2 ; i++) {
      buf[i] = (parseInt(hexstr[i * 2], 16) << 4) + (parseInt(hexstr[i * 2 + 1], 16));
   }
   return buf;
 }

Then define the data you want to send. The push payload is a serialized JSON string, that has one mandatory property ‘aps’. The JSON may contain additionally application specific custom properties.

var pushnd = { aps: { alert:'This is a test' }};
// Push token from iPhone app. 32 bytes as hexadecimal string
var hextoken = '85ab4a0cf2 ... 238adf';

Now we can construct the actual push binary PDU (Protocol Data Unit). Note that the payload length is the length of the UTF-8 encoded string in bytes, not the number of characters. This would also be a good place to check the maximum payload length (256 bytes).

var payload = JSON.stringify(pushnd);
var payloadlen = Buffer.byteLength(payload, 'utf-8');
var tokenlen = 32;
var buffer = new Buffer(1 +  4 + 4 + 2 + tokenlen + 2 + payloadlen);
var i = 0;
buffer[i++] = 1; // command
var msgid = 0xbeefcace; // message identifier, can be left 0
buffer[i++] = msgid >> 24 & 0xFF;
buffer[i++] = msgid >> 16 & 0xFF;
buffer[i++] = msgid >> 8 & 0xFF;
buffer[i++] = msgid & 0xFF;

// expiry in epoch seconds (1 hour)
var seconds = Math.round(new Date().getTime() / 1000) + 1*60*60;
buffer[i++] = seconds >> 24 & 0xFF;
buffer[i++] = seconds >> 16 & 0xFF;
buffer[i++] = seconds >> 8 & 0xFF;
buffer[i++] = seconds & 0xFF;

buffer[i++] = tokenlen >> 8 & 0xFF; // token length
buffer[i++] = tokenlen & 0xFF;
var token = hextobin(hextoken);
token.copy(buffer, i, 0, tokenlen)
i += tokenlen;
buffer[i++] = payloadlen >> 8 & 0xFF; // payload length
buffer[i++] = payloadlen & 0xFF;

var payloadbuf = new Buffer(payload, 'utf-8');  // UTF-8 encoded payload bytes
payloadbuf.copy(buffer, i, 0, payloadlen);

stream.write(buffer);  // write push notification

And that’s it.
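For comparison, here is a sketch of the same PDU layout built with the Buffer read/write helpers of more recent Node.js versions, including the payload length check. buildNotification and MAX_PAYLOAD are names invented for this example, the token is a dummy value, and the 256 byte limit is an assumption you should verify against Apple's documentation.

```javascript
const MAX_PAYLOAD = 256;  // assumed payload limit in bytes

// Hypothetical helper: builds the same PDU layout as above.
function buildNotification(msgid, expiry, hextoken, pushnd) {
  const token = Buffer.from(hextoken, 'hex');
  const payload = Buffer.from(JSON.stringify(pushnd), 'utf8');
  if (payload.length > MAX_PAYLOAD) {
    throw new Error('payload too long: ' + payload.length + ' bytes');
  }
  const buffer = Buffer.alloc(1 + 4 + 4 + 2 + token.length + 2 + payload.length);
  let i = 0;
  i = buffer.writeUInt8(1, i);                 // command
  i = buffer.writeUInt32BE(msgid >>> 0, i);    // message identifier
  i = buffer.writeUInt32BE(expiry >>> 0, i);   // expiry, epoch seconds
  i = buffer.writeUInt16BE(token.length, i);   // token length
  i += token.copy(buffer, i);                  // token bytes
  i = buffer.writeUInt16BE(payload.length, i); // payload length
  payload.copy(buffer, i);                     // payload bytes
  return buffer;
}

const expiry = Math.floor(Date.now() / 1000) + 60 * 60;  // one hour from now
const dummyToken = 'ab'.repeat(32);                       // 32 bytes as hex, not a real token
const pdu = buildNotification(0xbeefcace, expiry, dummyToken, { aps: { alert: 'This is a test' } });
console.log(pdu.length);
```

The write methods return the offset after the written bytes, which replaces the manual shifting and masking.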

4. Handling Error Messages

Apple does not return anything from the socket unless there was an error. In that case Apple server sends you single binary error message with reason code (offending message is identified by the message id you set in push message) and closes connection immediately after that.

To parse the error message, read the bytes from the data event. No stream encoding is set, so we get a Buffer instance as the data argument.

stream.on('data', function(data) {
   var command = data[0] & 0x0FF;  // always 8
   var status = data[1] & 0x0FF;  // error code
   // >>> 0 keeps the identifier unsigned so it matches the msgid that was sent
   var msgid = ((data[2] << 24) + (data[3] << 16) + (data[4] << 8) + data[5]) >>> 0;
   console.log(command+':'+status+':'+msgid);
 });

This implementation assumes that all data (6 bytes) is received in a single event. In theory Node.js might deliver the data in smaller pieces.
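If you want to be safe against fragmented delivery, the chunks can be collected until all 6 bytes have arrived. A minimal sketch, with onErrorResponse as a hypothetical helper name and an EventEmitter standing in for the TLS stream:

```javascript
const EventEmitter = require('events').EventEmitter;

// Collects incoming chunks until a full 6 byte error message is available.
function onErrorResponse(stream, callback) {
  let pending = Buffer.alloc(0);
  stream.on('data', function (data) {
    pending = Buffer.concat([pending, data]);
    if (pending.length < 6) {
      return;                               // wait for the rest
    }
    const command = pending.readUInt8(0);   // always 8
    const status = pending.readUInt8(1);    // error code
    const msgid = pending.readUInt32BE(2);  // unsigned message identifier
    pending = pending.slice(6);
    callback(command, status, msgid);
  });
}

// Demonstration with an EventEmitter standing in for the TLS stream.
const fakeStream = new EventEmitter();
const results = [];
onErrorResponse(fakeStream, function (command, status, msgid) {
  results.push([command, status, msgid]);
});
fakeStream.emit('data', Buffer.from([8, 8, 0xbe]));        // first fragment
fakeStream.emit('data', Buffer.from([0xef, 0xca, 0xce]));  // remainder
console.log(results);
```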

5. Reading Apple Feedback notifications

Apple requires that you read feedback notifications daily, so you know which push tokens have expired or belong to devices where the app was uninstalled. See the blog post Polling Apple Push Notification feedback service with Node.js for details.

Filed under Experimental, Javascript Tagged with apns, nodejs, ssl

Simple Reverse Geocoding with CouchDB

December 2, 2010

Real-world reverse geocoding searches require a minimum of three parameters: two for the location (lat, lon) and a filter. The filter can be a keyword, type of location, name or something else. A common use case is to find the nearest restaurants or the closest address. This kind of query is simple for a relational database (though not necessarily easy to shard) and the problem has been solved cleanly in many of them (PostGIS, MySQL Spatial extensions, ...).

Geocoding is trickier to implement in a typical NoSQL database that supports only one-dimensional key range queries. Geohashes are the classical solution, but in my experience they are too inaccurate for dense data. A simple method that works quite well is geoboxes, where the earth is divided into a grid that is used as an index lookup table. Every location maps to a box that can be addressed with a unique id.

This is an experiment to implement simple geobox-based geocoding on CouchDB from scratch. I assume you’re already familiar with CouchDB basics. The examples here are written in Python with the couchdb-python client library.

1. Preparations

First we need a geobox function that quantizes location coordinates to a box. Boxes cover the earth as a grid. Latitude and longitude are quantized to the coordinates of the geobox center at the desired resolution.

from decimal import Decimal as D
D1 = D('1')
def geobox(lat, lng, res):
  def _q(c):
      return (c / res).quantize(D1) * res
  return (_q(lat), _q(lng))

Based on this function, we define a function that computes the geobox and its neighbors and returns them as a list of strings.

import math
def geonbr(lat, lon, res):
   blat,blon = geobox(lat, lon, res)
   boxes = [(dlat, dlon)
            for dlon in [blon - res, blon, blon + res]
            for dlat in [blat - res, blat, blat + res]]
   def _bf(box):
       (dlat, dlon) = box
       return math.fabs(dlon - lon) < float(res)*0.8 \
              and math.fabs(dlat - lat) < float(res)*0.8
   return filter(_bf, boxes)

def geoboxes(lat, lon, res):
   # Decimals must be converted to strings before they can be joined
   return list(set(['#'.join([str(dlat), str(dlon)])
                    for (dlat, dlon) in geonbr(lat, lon, res)]))

The constant 0.8 defines how close the location can be at the geobox border before we include neighbor box in the list.

For example, calling geoboxes with lat lon (32.1234, -74.23233) will yield following geoboxes. Numbers are handled as Decimal instances to avoid float rounding problems.

>>> from decimal import Decimal as D
>>> geoboxes(D('32.1234'), D('-74.23233'), D('0.05'))
['32.15#-74.20', '32.10#-74.20', '32.15#-74.25', '32.10#-74.25']
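If the client that queries the database runs on Node.js instead of Python, the same geobox keys could be computed roughly like this. This is a sketch rather than an exact port: it uses plain floats and Math.round, so values exactly on a box border may round differently than with Python's Decimal.quantize, and the decimals parameter is an addition of mine for formatting the keys.

```javascript
// JavaScript sketch of geoboxes(): returns the keys of the box containing
// the location plus the neighbors that are close enough to its border.
function geoboxes(lat, lon, res, decimals) {
  function candidates(c) {
    const k = Math.round(c / res);               // grid index of the box containing c
    return [k - 1, k, k + 1]
      .map(function (n) { return n * res; })     // box centers
      .filter(function (center) {
        return Math.abs(center - c) < res * 0.8; // same 0.8 rule as above
      })
      .map(function (center) { return center.toFixed(decimals); });
  }
  const keys = [];
  candidates(lat).forEach(function (la) {
    candidates(lon).forEach(function (lo) { keys.push(la + '#' + lo); });
  });
  return keys;
}

const boxes = geoboxes(32.1234, -74.23233, 0.05, 2);
console.log(boxes);
```

For the example location this yields the same four boxes as the Python version, possibly in a different order.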

2. Data Import

Data can be anything with a location and some keyword, so let’s use real-world places. The place name will be our searchable term in this example.

The Geonames.org geocoding service makes its data available to everyone. Find the country you want at the address below, then download and unpack the selected data file. I used ‘FI.zip’.

http://download.geonames.org/export/dump/

The data is tab-delimited, one record per line. We need to define a few helper functions for reading and importing it.

from decimal import Decimal as D

def place_dict(entry):
    return {'_id': entry[0],
            'name': entry[1].encode('utf-8'),
            'areas': entry[17].encode('utf-8').split('/'),
            'loc': {
                'lat': entry[4],
                'lon': entry[5],
            },
            # geoboxes() returns JSON-friendly strings such as '66.85#25.75'
            'gboxes': geoboxes(D(entry[4]), D(entry[5]), D('0.05'))
    }

def readnlines(f, n):
    while True:
        # read up to n lines and drop the empty strings returned at end of file
        batch = [l for l in (f.readline() for _ in range(n)) if l]
        if not batch:
            break
        yield batch

The place_dict function converts a line from the Geonames dump file to a JSON document for CouchDB. The readnlines function is just a helper for making updates in batches. The geobox resolution is 0.05 degrees, which makes roughly 5 x 5 km geoboxes.
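As a rough check of the box size, the following sketch converts a resolution in degrees to kilometers. It assumes a spherical earth with a mean radius of 6371 km, which is only an approximation, and the function name box_size_km is made up for this illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius, spherical approximation

def box_size_km(lat_deg, res_deg):
    """Return (north-south, east-west) size in km of a res_deg box at lat_deg."""
    km_per_degree = 2 * math.pi * EARTH_RADIUS_KM / 360.0
    north_south = res_deg * km_per_degree
    # Meridians converge toward the poles, so the east-west size
    # shrinks with cos(latitude).
    east_west = north_south * math.cos(math.radians(lat_deg))
    return north_south, east_west

for lat in (0, 40, 60):
    ns, ew = box_size_km(lat, 0.05)
    print("lat %2d: %.2f km x %.2f km" % (lat, ns, ew))
```

Under these assumptions a box is about 5.6 km tall everywhere, but only about 2.8 km wide at 60 degrees north, so in Finland the boxes are noticeably narrower than they are tall.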

Then just load the data into the database. First we create the database on the server, then open the UTF-8 encoded file and write it to CouchDB in batches.

>>> import codecs
>>> import couchdb
>>> s = couchdb.Server()
>>> places = s.create('places')
>>> f = codecs.open('/home/user/Downloads/FI.txt', encoding='utf-8')
>>> for batch in readnlines(f, 100):
...     batch = [l.split('\t') for l in batch]
...     places.update([place_dict(entry) for entry in batch])
[(True, '631542', '1-239590f242b46d45b33516687c0b1df3'), ...

This takes a few moments. You can follow the progress on CouchDB Futon: http://localhost:5984/_utils/index.html

The place_dict function does not validate the content, so the import might stop at a broken line in the dump file. In that case you need to filter out the offending lines and rerun.
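One way to avoid stopping on a broken line is to filter the entries before building documents. The sketch below is an assumption and not part of the original import: is_valid_entry is a hypothetical helper that checks that the line has the columns place_dict reads (indexes up to 17) and that latitude and longitude parse as sensible decimals. The sample entries are made up.

```python
from decimal import Decimal, InvalidOperation

MIN_COLUMNS = 18  # place_dict reads columns 0..17

def is_valid_entry(entry):
    """Return True if a split dump line has the fields place_dict needs."""
    if len(entry) < MIN_COLUMNS:
        return False
    try:
        lat, lon = Decimal(entry[4]), Decimal(entry[5])
    except InvalidOperation:
        return False
    # Reject NaN/Infinity and out-of-range coordinates.
    return (lat.is_finite() and lon.is_finite()
            and abs(lat) <= 90 and abs(lon) <= 180)

# Made-up sample entries: one usable, one truncated, one with a bad latitude.
good = ['1', 'Name', 'Name', '', '60.1', '25.0'] + [''] * 11 + ['Europe/Helsinki', '2010-01-01']
truncated = ['2', 'Broken']
bad_lat = list(good)
bad_lat[4] = 'abc'

entries = [good, truncated, bad_lat]
valid = [e for e in entries if is_valid_entry(e)]
print(len(valid))
```

In the import loop, the same check could be applied to each batch before calling place_dict.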

Query a few places by id and verify that the data really is there and has the right format:

>>> places['638155']
<Document '638155'@'1-174cbb83a2794c33c40645ddf681fc76'
{'gboxes': ['66.85#25.75', '66.90#25.75'],
'loc': {'lat': '66.88333', 'lon': '25.7534'}, 'name': 'Saittajoki',
'areas': ['Europe', 'Helsinki']}>

3. Define View

CouchDB views (i.e. queries) are defined by storing a design document in the database. The couchdb Python API provides a simple way to update design documents.

The query we need is defined in CouchDB by the following script:

function(d) {
  if(d.name) {
    var i;
    for (i = 0; i < d.gboxes.length; i += 1) {
      emit([d.gboxes[i], d.name], null);
    }
  }
}

This view builds a composite key index by the geobox string and the place name.
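To illustrate what the composite key index looks like, the sketch below mimics the map function in plain Python and sorts the emitted keys. This is only a model: the documents are hand-written samples, map_place is a made-up name, and CouchDB orders keys by its own collation rules rather than by Python's list comparison.

```python
def map_place(doc):
    """Plain-Python model of the view's map function: yield (key, value) pairs."""
    if doc.get('name'):
        for gbox in doc['gboxes']:
            yield ([gbox, doc['name']], None)

# Hand-written sample documents for illustration.
docs = [
    {'_id': 'a', 'name': 'Herttoniemi', 'gboxes': ['60.20#25.00', '60.20#25.05']},
    {'_id': 'b', 'name': 'Hevossalmi', 'gboxes': ['60.20#25.05', '60.15#25.05']},
    {'_id': 'c', 'gboxes': ['60.20#25.05']},  # no name, so nothing is emitted
]

# One index row per (geobox, name) pair, sorted by key.
index = sorted((key, doc['_id']) for doc in docs for key, _ in map_place(doc))
for key, doc_id in index:
    print(key, doc_id)
```

Each document appears once per geobox it belongs to, and all rows for one geobox sit next to each other sorted by name, which is what makes the range queries later in this post possible.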

Load the view into CouchDB. Note that the JavaScript function is not validated until the next query, and you will get strange error messages if it does not parse or execute correctly. Be careful!

>>> from couchdb.design import ViewDefinition
>>> viewfunc = 'function(d) { if(d.name) { var i; for (i = 0; i < d.gboxes.length; i += 1) { emit([d.gboxes[i], d.name], null); }}}'
>>> view = ViewDefinition('places', 'around', viewfunc)
>>> view.sync(places)

4. Materialize View

CouchDB builds view indexes on the first query, and in this case the first query will take a long time. This is because CouchDB does not update the index on insert, so after a bulk import the index build takes some time. Monitor the progress on the Futon status page (http://localhost:5984/_utils/status.html).

Start indexing with a simple query.

>>> list(places.view('places/around', limit=1))
[<Row id='123456', key=['65.00#25.05', u'Some Place'], value=None>]

Note that we call ‘list’ on the query to force execution; the view member function returns a result object that is evaluated lazily. The limit is one, so a single entry is returned to verify success.

5. Making Queries

Now we can search places by name and location. To do that, let’s first compute the geoboxes for a location.

>>> geoboxes(D('60.198765'), D('25.016443'), D('0.05'))
['60.15#25.05', '60.20#25.00', '60.20#25.05', '60.15#25.00']

To search all locations in a single geobox, use a query like this:

list(places.view('places/around', startkey=['60.20#25.05'],
                                  endkey=['60.20#25.05',{}]))

To search by place name in a geobox, include the search term in both the start and end keys. The search term in the end key is followed by a high Unicode character to define the upper bound.

list(places.view('places/around', startkey=['60.20#25.05', 'Ha'],
                                  endkey=['60.20#25.05', 'Ha'+u'\u777d']))
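The start and end keys select a contiguous slice of the sorted index. The sketch below models this with Python's bisect module on a small hand-written list of keys. It assumes plain code point ordering, whereas CouchDB uses its own collation, so treat it as an illustration of the idea only; range_query is a hypothetical name.

```python
from bisect import bisect_left, bisect_right

HIGH = u'\u777d'  # the upper-bound character used in the end key

# Hand-written sample keys: (geobox, name), kept sorted like a view index.
keys = sorted([
    ('60.20#25.00', u'Hotel Avion'),
    ('60.20#25.00', u'Hylkysaari'),
    ('60.20#25.05', u'Haakoninlahti'),
    ('60.20#25.05', u'Herttoniemi'),
    ('60.20#25.05', u'Hevossalmi'),
])

def range_query(box, prefix):
    """Return keys k with (box, prefix) <= k <= (box, prefix + HIGH)."""
    lo = bisect_left(keys, (box, prefix))
    hi = bisect_right(keys, (box, prefix + HIGH))
    return keys[lo:hi]

print(range_query('60.20#25.05', u'Ha'))
print(range_query('60.20#25.05', u'He'))
```

Under this simplified ordering, a name whose next character sorts above U+777D would fall outside the range, so the trick is a practical upper bound rather than a strict one.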

Define a simple helper function:

def around(box, s):
   return list(places.view('places/around', startkey=[box, s],
                                            endkey=[box, s+u'\u777d']))

Now, to search for all places around a location that start with a search term (here ‘H’), call the around function for each geobox of that location.

>>> l = geoboxes(D('60.19'), D('25.01'), D('0.05'))
>>> for gb in l:
...     around(gb, 'H')
...
[<Row id='659403', key=['60.15#25.05', 'Haakoninlahti'], value=None>, <Row id='658086', key=['60.15#25.05', 'Hevossalmi'], value=None>]
[<Row id='659403', key=['60.20#25.00', 'Haakoninlahti'], value=None>, <Row id='6545255', key=['60.20#25.00', 'Herttoniemenranta'], value=None>, <Row id='658132', key=['60.20#25.00', 'Herttoniemi'], value=None>, <Row id='651476', key=['60.20#25.00', u'H\xf6gholmen'], value=None>, <Row id='6514261', key=['60.20#25.00', 'Hotel Avion'], value=None>, <Row id='6528458', key=['60.20#25.00', 'Hotel Fenno'], value=None>, <Row id='798734', key=['60.20#25.00', 'Hylkysaari'], value=None>]
[<Row id='659403', key=['60.20#25.05', 'Haakoninlahti'], value=None>, <Row id='6545255', key=['60.20#25.05', 'Herttoniemenranta'], value=None>, <Row id='658132', key=['60.20#25.05', 'Herttoniemi'], value=None>, <Row id='658086', key=['60.20#25.05', 'Hevossalmi'], value=None>]
[<Row id='659403', key=['60.15#25.00', 'Haakoninlahti'], value=None>, <Row id='651476', key=['60.15#25.00', u'H\xf6gholmen'], value=None>, <Row id='6514261', key=['60.15#25.00', 'Hotel Avion'], value=None>, <Row id='6528458', key=['60.15#25.00', 'Hotel Fenno'], value=None>, <Row id='798734', key=['60.15#25.00', 'Hylkysaari'], value=None>]

Our geobox resolution (0.05) guarantees a minimum search radius of 2.5 km and a maximum of 7.5 km. To improve results we could use several resolutions, use more boxes, or always search the location box and the 8 boxes around it.
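The last option, always searching the location box and its 8 neighbors, could be sketched as follows. The function name geoboxes9 is hypothetical; it uses the same quantization as geobox but skips the 0.8 distance filter.

```python
from decimal import Decimal as D

D1 = D('1')

def geoboxes9(lat, lon, res):
    """Return the home box and its 8 neighbors as 'lat#lon' strings."""
    # Snap the location to the center of its home box.
    blat = (lat / res).quantize(D1) * res
    blon = (lon / res).quantize(D1) * res
    # Step one box in each direction on both axes (3 x 3 grid).
    return ['#'.join([str(blat + i * res), str(blon + j * res)])
            for i in (-1, 0, 1)
            for j in (-1, 0, 1)]

boxes = geoboxes9(D('60.19'), D('25.01'), D('0.05'))
print(boxes)
```

This always issues 9 queries per search, but the searched area then extends at least one full box width from the location in every direction.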

Note the duplicates, which you have to filter out in memory. Now it’s a simple thing to fetch the interesting places and compute whatever presentation you want to give to the user.

>>> q = places.view('_all_docs', keys=['659403', '658086'], include_docs=True)
>>> for row in q:
...     print row.doc
...
<Document '659403'@'1-5f7fe8f63ae034ea9562c20a8c9b6ae7' {'gboxes': ['60.15#25.05', '60.20#25.00', '60.20#25.05', '60.15#25.00'], 'loc': {'lat': '60.16694', 'lon': '60.16694'}, 'name': 'Haakoninlahti', 'areas': ['Europe', 'Helsinki']}>
<Document '658086'@'1-ecaf156721b392411f025a3b00e27d62' {'gboxes': ['60.20#25.05', '60.15#25.05'], 'loc': {'lat': '60.16167', 'lon': '60.16167'}, 'name': 'Hevossalmi', 'areas': ['Europe', 'Helsinki']}>
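The in-memory duplicate filtering mentioned above could look like the sketch below. It works on any rows that have an id attribute; the Row namedtuple is only a stand-in for the rows returned by couchdb, and unique_ids is a hypothetical helper name.

```python
from collections import namedtuple

# Stand-in for couchdb result rows; only the attributes used here.
Row = namedtuple('Row', ['id', 'key'])

def unique_ids(result_lists):
    """Collect document ids from several result lists, keeping first-seen order."""
    seen = set()
    ids = []
    for rows in result_lists:
        for row in rows:
            if row.id not in seen:
                seen.add(row.id)
                ids.append(row.id)
    return ids

# Two per-geobox result lists that share one place.
results = [
    [Row('659403', ['60.15#25.05', 'Haakoninlahti']),
     Row('658086', ['60.15#25.05', 'Hevossalmi'])],
    [Row('659403', ['60.20#25.05', 'Haakoninlahti']),
     Row('658132', ['60.20#25.05', 'Herttoniemi'])],
]
print(unique_ids(results))
```

The resulting list of ids can then be passed as the keys argument of an _all_docs query like the one above.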