Real world reverse geocoding searches require minimum of three parameters, two for location (lat, lon) and a filter. Filter can be keyword, type of location, name or something else. Common use case is to find nearest restaurants or closest address. This kind of query is simple for relational database (though not necessarily easy to shard) and the problem has been solved cleanly in many of them. (PostGIS, MySQL Spatial extensions, ..)
Geocoding is trickier to implement in typical NoSQL database that supports only one dimensional key range queries. Geohashes are classical solution, but in my experience they are too inaccurate for dense data. Simple method that works quite well are geoboxes, where earth is divided to grid that is used as index lookup table. Every location maps to a box that can be addressed with unique id.
This is experiment to implement simple geobox based geocoding on CouchDB from scratch. I assume you’re already familiar with CouchDB basics. The examples here are written with Python with couchdb-python client library.
1. Preparations
First we need geobox function that quantizes location coordinates to a list of boxes. Boxes cover the earth as grid. Latitude and location are quantized to coordinates that present geobox center on desired resolution.
from decimal import Decimal as D
D1 = D('1')
def geobox(lat, lng, res):
def _q(c):
return (c / res).quantize(D1) * res
return (_q(lat), _q(lng))
Based on this function, we define a function that computes the geobox and its neighbors and retuns list of strings.
import math
def geonbr(lat, lon, res):
blat,blon = geobox(lat, lon, res)
boxes = [(dlat, dlon)
for dlon in [blon - res, blon, blon + res]
for dlat in [blat - res, blat, blat + res]]
def _bf(box):
(dlat, dlon) = box
return math.fabs(dlon - lon) < float(res)*0.8 \
and math.fabs(dlat - lat) < float(res)*0.8
return filter(_bf, boxes)
def geoboxes(lat, lon, res):
return list(set(['#'.join([dlat, dlon])
for (dlat, dlon) in geonbr(lat, lon, res)]))
The constant 0.8 defines how close the location can be at the geobox border before we include neighbor box in the list.
For example, calling geoboxes with lat lon (32.1234, -74.23233) will yield following geoboxes. Numbers are handled as Decimal instances to avoid float rounding problems.
>>> from decimal import Decimal as D
>>> geoboxes(D('32.1234'), D('-74.23233'), D('0.05'))
['32.15#-74.20', '32.10#-74.20', '32.15#-74.25', '32.10#-74.25'
2. Data Import
Data can be anything with location and some keyword, so let’s use real world places. Place name will be our searchable term in this example.
Geonames.org geocoding service makes its data available for everyone. Find here country you want and download & unpack selected data file. I did use ‘FI.zip’.
http://download.geonames.org/export/dump/
Data is tab-delimited and line-separated. We need to define few helper functions for reading and importing it.
from decimal import Decimal as D
def place_dict(entry):
return {'_id': entry[0],
'name': entry[1].encode('utf-8'),
'areas': entry[17].encode('utf-8').split('/'),
'loc': {
'lat': entry[4],
'lon': entry[5],
},
'gboxes': geobox(D(entry[4]), D(entry[5]), D('0.05'))
}
def readnlines(f, n):
while True:
l = f.readline()
if not l:
break
yield [l] + [f.readline() for _ in range(1, n)]
The place_dict converts line from Geonames dump file to JSON document for CouchDB. The readnlines is just helper to make updates in batches. Geobox resolution is 0.05 that makes roughly 5 x 5 km geoboxes.
Then just load the data to database. First we create database in server, open the utf-8 encoded file and write it as batches to the CouchDB.
>>>import codecs
>>>import couchdb
>>>s = couchdb.Server()
>>>places = couchdb.create('places')
>>>f = codecs.open('/home/user/Downloads/FI.txt', encoding='utf-8')
>>>for batch in readnlines(f, 100):
... batch = [l.split('\t') for l in batch]
... places.update([place_dict(entry) for entry in batch])
[(True, '631542', '1-239590f242b46d45b33516687c0b1df3'), ...
This takes a few moments. You can follow the progress on CouchDB Futon: http://localhost:5984/_utils/index.html
The place_dict does not validate the content, so the import might stop at broken line in the dump file, in that case you need to filter out the offending lines and rerun.
Query few places by id and verify that the data really is there and has right format
>>> places['638155']
<Document '638155'@'1-174cbb83a2794c33c40645ddf681fc76'
{'gboxes': ['66.85#25.75', '66.90#25.75'],
'loc': {'lat': '66.88333', 'lon': '25.7534'}, 'name': 'Saittajoki',
'areas': ['Europe', 'Helsinki']}>
3. Define View
CouchDB views (i.e. queries) are defined by storing a design document in the database. The couchdb Python API provides simple way to update design documents.
The query we need is defined in CouchDB by following script
function(d) {
if(d.name) {
var i;
for (i = 0; i < d.gboxes.length; i += 1) {
emit([d.gboxes[i], d.name], null);
}
}
}
This view builds a composite key index by the geobox string and the place name.
Load the view to CouchDB. Note that Javascript function is not validated until next query, and you will get strange error messages if it does not parse or execute correctly. Be careful!
>>>from couchdb.design impor ViewDefinition
>>>viewfunc = 'function(d) { if(d.name) { var i; for (i = 0; i < d.gboxes.length; i += 1) { emit([d.gboxes[i], d.name], null); }}}'
>>>view = ViewDefinition('places', 'around', viewfunc)
>>>view.sync(places)
4. Materialize View
CouchDB indexes views on first query and the first query will take a long time in this case. This is because CouchDB does not update index on insert, so after bulk import index building will take some time. Monitor the progress on Futon status page. (http://localhost:5984/_utils/status.html).
Init indexing by simple query.
>>> list(places.view('places/around', limit=1))
[<Row id='123456', key=['65.00#25.05', u'Some Place'], value=None>
Note that we call ‘list’ for the query to force execution. The view member function returns just generator. The limit is one to return one entry to verify success.
5. Making Queries
Now we can search places by name and location. To do that, lets compute first the geoboxes for a location.
>>> geonbr(D('60.198765'), D('25.016443'), D('0.05'))
['60.15#25.05', '60.20#25.00', '60.20#25.05', '60.15#25.00']
To search all locations in single gebox, use query like this:
list(places.view('places/around', startkey=['60.20#25.05'],
endkey=['60.20#25.05',{}]))
To search by place name in a geobox, just include the search term both in start and end keys. The search term in endkey is appended with high Unicode character to define upper bound.
list(places.view('places/around', startkey=['60.20#25.05', 'Ha'],
endkey=['60.20#25.05', 'Ha'+u'\u777d']))
Define simple helper function
def around(box, s):
return list(places.view('places/around', startkey=[box, s],
endkey=[box, s+u'\u777d']))
Now, to search all places around location that start with search term (e.g. here ‘H’), call the around function for each geobox for that location.
>>> l = geonbr(D('60.19'), D('25.01'), D('0.05'))
>>> for gb in l:
... around(gb, 'H')
...
[<Row id='659403', key=['60.15#25.05', 'Haakoninlahti'], value=None>, <Row id='658086', key=['60.15#25.05', 'Hevossalmi'], value=None>]
[<Row id='659403', key=['60.20#25.00', 'Haakoninlahti'], value=None>, <Row id='6545255', key=['60.20#25.00', 'Herttoniemenranta'], value=None>, <Row id='658132', key=['60.20#25.00', 'Herttoniemi'], value=None>, <Row id='651476', key=['60.20#25.00', u'H\xf6gholmen'], value=None>, <Row id='6514261', key=['60.20#25.00', 'Hotel Avion'], value=None>, <Row id='6528458', key=['60.20#25.00', 'Hotel Fenno'], value=None>, <Row id='798734', key=['60.20#25.00', 'Hylkysaari'], value=None>]
[<Row id='659403', key=['60.20#25.05', 'Haakoninlahti'], value=None>, <Row id='6545255', key=['60.20#25.05', 'Herttoniemenranta'], value=None>, <Row id='658132', key=['60.20#25.05', 'Herttoniemi'], value=None>, <Row id='658086', key=['60.20#25.05', 'Hevossalmi'], value=None>]
[<Row id='659403', key=['60.15#25.00', 'Haakoninlahti'], value=None>, <Row id='651476', key=['60.15#25.00', u'H\xf6gholmen'], value=None>, <Row id='6514261', key=['60.15#25.00', 'Hotel Avion'], value=None>, <Row id='6528458', key=['60.15#25.00', 'Hotel Fenno'], value=None>, <Row id='798734', key=['60.15#25.00', 'Hylkysaari'], value=None>]
Our geobox resolution (0.05) guarantees minimum search radius 2.5km and maximum 7.5km. We could use several resolutions, more boxes or always search from location box and 8 boxes around the location to improve results.
Note the duplicates that you have to filter out in memory. Now it’s simple thing to fetch the interesting places and compute what ever presentation you want to give to the user.
>>> q = places.view('_all_docs', keys=['659403', '658086'], include_docs=True)
>>> for row in q:
... print row.doc
...
<Document '659403'@'1-5f7fe8f63ae034ea9562c20a8c9b6ae7' {'gboxes': ['60.15#25.05', '60.20#25.00', '60.20#25.05', '60.15#25.00'], 'loc': {'lat': '60.16694', 'lon': '60.16694'}, 'name': 'Haakoninlahti', 'areas': ['Europe', 'Helsinki']}>
<Document '658086'@'1-ecaf156721b392411f025a3b00e27d62' {'gboxes': ['60.20#25.05', '60.15#25.05'], 'loc': {'lat': '60.16167', 'lon': '60.16167'}, 'name': 'Hevossalmi', 'areas': ['Europe', 'Helsinki']}>