Simple Reverse Geocoding with CouchDB

Real world reverse geocoding searches require minimum of three parameters, two  for location (lat, lon) and a filter. Filter can be keyword, type of location, name or  something else. Common use case is to find nearest restaurants or closest address. This kind of query is simple for relational database (though not necessarily easy to shard) and the problem has been solved cleanly in many of them. (PostGIS, MySQL Spatial extensions, ..)

Geocoding is trickier to implement in typical NoSQL database that supports only one dimensional key range queries. Geohashes are classical solution, but in my experience they are too inaccurate for dense data. Simple method that works quite well are geoboxes, where earth is divided to grid that is used as index lookup table. Every location maps to a box that can be addressed with unique id.

This is experiment to implement simple geobox based geocoding on CouchDB from scratch. I assume you’re already familiar with CouchDB basics. The examples here are written with Python with couchdb-python client library.

1. Preparations

First we need geobox function that quantizes location coordinates to a list of boxes. Boxes cover the earth as grid. Latitude and location are quantized to coordinates that present geobox center on desired resolution.

from decimal import Decimal as D
D1 = D('1')
def geobox(lat, lng, res):
  def _q(c):
      return (c / res).quantize(D1) * res
  return (_q(lat), _q(lng))

Based on this function, we define a function that computes the geobox and its neighbors and retuns list of strings.

import math
def geonbr(lat, lon, res):
   blat,blon = geobox(lat, lon, res)
   boxes = [(dlat, dlon)
            for dlon in [blon - res, blon, blon + res]
            for dlat in [blat - res, blat, blat + res]]
   def _bf(box):
       (dlat, dlon) = box
       return math.fabs(dlon - lon) < float(res)*0.8 \
              and math.fabs(dlat - lat) < float(res)*0.8
   return filter(_bf, boxes)

def geoboxes(lat, lon, res):
   return list(set(['#'.join([dlat, dlon])
                    for (dlat, dlon) in geonbr(lat, lon, res)]))

The constant 0.8 defines how close the location can be at the geobox border before we include neighbor box in the list.

For example, calling geoboxes with lat lon (32.1234, -74.23233) will yield following geoboxes. Numbers are handled as Decimal instances to avoid float rounding problems.

>>> from decimal import Decimal as D
>>> geoboxes(D('32.1234'), D('-74.23233'), D('0.05'))
['32.15#-74.20', '32.10#-74.20', '32.15#-74.25', '32.10#-74.25'

2. Data Import

Data can be anything with location and some keyword, so let’s use real world places. Place name will be our searchable term in this example.

Geonames.org geocoding service makes its data available for everyone. Find here country you want and download & unpack selected data file. I did use ‘FI.zip’.

http://download.geonames.org/export/dump/

Data is tab-delimited and line-separated. We need to define few helper functions for reading and importing it.

from decimal import Decimal as D

def place_dict(entry):
   return {'_id': entry[0],
      'name': entry[1].encode('utf-8'),
      'areas': entry[17].encode('utf-8').split('/'),
      'loc': {
      'lat': entry[4],
      'lon': entry[5],
    },
    'gboxes': geobox(D(entry[4]), D(entry[5]), D('0.05'))
 }

def readnlines(f, n):
    while True:
      l = f.readline()
      if not l:
         break
      yield [l] + [f.readline() for _ in range(1, n)]

The place_dict converts line from Geonames dump file to JSON document  for CouchDB. The readnlines is just helper to make updates in batches. Geobox resolution is 0.05 that makes roughly 5 x 5 km geoboxes.

Then just load the data to database. First we create database in server, open the utf-8 encoded file and write it as batches to the CouchDB.

>>>import codecs
>>>import couchdb
>>>s = couchdb.Server()
>>>places = couchdb.create('places')
>>>f = codecs.open('/home/user/Downloads/FI.txt', encoding='utf-8')
>>>for batch in readnlines(f, 100):
...   batch = [l.split('\t') for l in batch]
...   places.update([place_dict(entry) for entry in batch])
[(True, '631542', '1-239590f242b46d45b33516687c0b1df3'), ...

This takes a few moments. You can follow the progress on  CouchDB Futon: http://localhost:5984/_utils/index.html

The place_dict does not validate the content, so the import might stop at broken line in the dump file, in that case you need to filter out the offending lines and rerun.

Query few places by id and verify that the data really is there and has right format

>>> places['638155']
<Document '638155'@'1-174cbb83a2794c33c40645ddf681fc76'
{'gboxes': ['66.85#25.75', '66.90#25.75'],
'loc': {'lat': '66.88333', 'lon': '25.7534'}, 'name': 'Saittajoki',
'areas': ['Europe', 'Helsinki']}>

3. Define View

CouchDB views (i.e. queries) are defined by storing a design document in the database. The couchdb Python API provides simple way to update design documents.

The query we need is defined in CouchDB by following script

function(d) {
  if(d.name) {
    var i;
    for (i = 0; i < d.gboxes.length; i += 1) {
      emit([d.gboxes[i], d.name], null);
    }
  }
}

This view builds a composite key index by the geobox string and the place name.

Load the view to CouchDB. Note that Javascript function is not validated until next query, and you will get strange error messages if it does not parse or execute correctly. Be careful!

>>>from couchdb.design impor ViewDefinition
>>>viewfunc = 'function(d) { if(d.name) { var i; for (i = 0; i < d.gboxes.length; i += 1) { emit([d.gboxes[i], d.name], null); }}}'
>>>view = ViewDefinition('places', 'around', viewfunc)
>>>view.sync(places)

4. Materialize View

CouchDB indexes views on first query and the first query will take a long time in this case. This is because CouchDB does not update index on insert, so after bulk import index building will take some time. Monitor the progress on Futon status page. (http://localhost:5984/_utils/status.html).

Init indexing by simple query.

>>> list(places.view('places/around', limit=1))
[<Row id='123456', key=['65.00#25.05', u'Some Place'], value=None>

Note that we call ‘list’ for the query to force execution. The view member function returns just generator. The limit is one to return one entry to verify success.

5. Making Queries

Now we can search places by name and location. To do that, lets compute first the geoboxes for a location.

>>> geonbr(D('60.198765'), D('25.016443'), D('0.05'))
['60.15#25.05', '60.20#25.00', '60.20#25.05', '60.15#25.00']

To search all locations in single gebox, use query like this:

list(places.view('places/around', startkey=['60.20#25.05'],
                                  endkey=['60.20#25.05',{}]))

To search by place name in a geobox, just include the search term both in start and end keys. The search term in endkey is appended with high Unicode character to define upper bound.

list(places.view('places/around', startkey=['60.20#25.05', 'Ha'],
                                  endkey=['60.20#25.05', 'Ha'+u'\u777d']))

Define simple helper function

def around(box, s):
   return list(places.view('places/around', startkey=[box, s],
                                            endkey=[box, s+u'\u777d']))

Now, to search all places around location that start with search term (e.g. here ‘H’), call the around function for each geobox for that location.

>>> l = geonbr(D('60.19'), D('25.01'), D('0.05'))
>>> for gb in l:
...     around(gb, 'H')
...
[<Row id='659403', key=['60.15#25.05', 'Haakoninlahti'], value=None>, <Row id='658086', key=['60.15#25.05', 'Hevossalmi'], value=None>]
[<Row id='659403', key=['60.20#25.00', 'Haakoninlahti'], value=None>, <Row id='6545255', key=['60.20#25.00', 'Herttoniemenranta'], value=None>, <Row id='658132', key=['60.20#25.00', 'Herttoniemi'], value=None>, <Row id='651476', key=['60.20#25.00', u'H\xf6gholmen'], value=None>, <Row id='6514261', key=['60.20#25.00', 'Hotel Avion'], value=None>, <Row id='6528458', key=['60.20#25.00', 'Hotel Fenno'], value=None>, <Row id='798734', key=['60.20#25.00', 'Hylkysaari'], value=None>]
[<Row id='659403', key=['60.20#25.05', 'Haakoninlahti'], value=None>, <Row id='6545255', key=['60.20#25.05', 'Herttoniemenranta'], value=None>, <Row id='658132', key=['60.20#25.05', 'Herttoniemi'], value=None>, <Row id='658086', key=['60.20#25.05', 'Hevossalmi'], value=None>]
[<Row id='659403', key=['60.15#25.00', 'Haakoninlahti'], value=None>, <Row id='651476', key=['60.15#25.00', u'H\xf6gholmen'], value=None>, <Row id='6514261', key=['60.15#25.00', 'Hotel Avion'], value=None>, <Row id='6528458', key=['60.15#25.00', 'Hotel Fenno'], value=None>, <Row id='798734', key=['60.15#25.00', 'Hylkysaari'], value=None>]

Our geobox resolution (0.05) guarantees minimum search radius 2.5km and maximum 7.5km.  We could use several resolutions, more boxes or always search from location box and 8 boxes around the location to improve results.

Note the duplicates that you have to filter out in memory.  Now it’s simple thing to fetch the interesting places and compute what ever presentation you want to give to the user.

>>> q = places.view('_all_docs', keys=['659403', '658086'], include_docs=True)
>>> for row in q:
...     print row.doc
...
<Document '659403'@'1-5f7fe8f63ae034ea9562c20a8c9b6ae7' {'gboxes': ['60.15#25.05', '60.20#25.00', '60.20#25.05', '60.15#25.00'], 'loc': {'lat': '60.16694', 'lon': '60.16694'}, 'name': 'Haakoninlahti', 'areas': ['Europe', 'Helsinki']}>
<Document '658086'@'1-ecaf156721b392411f025a3b00e27d62' {'gboxes': ['60.20#25.05', '60.15#25.05'], 'loc': {'lat': '60.16167', 'lon': '60.16167'}, 'name': 'Hevossalmi', 'areas': ['Europe', 'Helsinki']}>

5 Responses to Simple Reverse Geocoding with CouchDB

  1. Gagan says:

    Hi Ikonen,

    Quite a good descriptive blog about the geobox working. Your approach with geobox for spatial indexing seems promising in CouchDB, as existing solution with GeoCouch does not support the reduce and geohash has flaw of edge case.
    Well I have one query, is this solution is extendable to index polygon or polyline features. What I think, if we increase the neighbors to match the extent (bbox) of polygon and index all these geobox than we can query the polygon features also.
    I am new to the NoSQL/CouchDB, another doubt I have about the performance in case the query bbox size is large. This will lead to more numbers of geoboxes that need to be checked in view.
    I am interested in polygon and line features indexing and searching through the approach you have explained, please let me know how much suitable is this.

    Regards,
    Gagan

    • tikonen says:

      Hi, I believe it’s possible to use this solution to search entities inside a polygon. In this case I would index each place coordinate with varying resolutions that make possible to limit the search to desired area size.

      0.0001 = 10m
      0.0005 = 50m
      0.001 = 100m
      0.0025 = 200m

      0.1 = 10 km
      0.2 = 20 km

      Then when searching polygons, compute the polygon bounding box and select best granularity for the polygon. In some cases it could 10km box, in some cases 100m box. depends of the polygon size.
      Then simply compute the search keys in selected resolution from polygon vertex locations with neighbors, union them to avoid unnecessary double queries and query & merge results.
      This is relatively straightforward to optimize further by selecting optimal number of queried geoboxes based on size and form of the polygon. For example, query could consist of single 1km box and one 20m box where polygon bleeds over edge.

      Drawback is that with large number of geoboxes the space and first query cost is high. Also, when adding more query parameters (like the name as in example) the number of keys grow exponentially.

      • Gagan says:

        Thanks Ikonen,

        I got your point and I’ll experiment with this approach. I was going through the geobox concept in more detail, got one doubt that is the geobox string just a unique id for that box or it has more significance.

        Regards,
        Gagan

      • tikonen says:

        Geobox string is just unique id for the box. In my example it’s concatenated quantized coordinates with ‘#’ in the middle so I it’s easily recognizable and you can tell immediately the size of the box from the string. It also works nicely with even (e.g. 75.0, 54.0) coordinates and you should be able to add other resolutions later without computing all of them again. (of course some care is needed so that you don’t use resolutions that are divisible so that they produce same id for different box sizes)

  2. Gagan says:

    Thanks Again Ikonen for the detail explanation.

    Gagan

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: