Scalable, fast, accurate geo apps using Google App Engine + geohash + faultline correction

GeoMeme is a web app (and also a mobile web app for iPhone and Android) that I recently developed as a pet project. It measures real-time local twitter trends.

Visitors to GeoMeme choose a location on the map, and two search terms to compare. GeoMeme then measures and compares the number of matching tweets within the bounds of the map, based on public data from a number of mobile twitter apps.

As an example, GeoMeme can work out that :) beats :( in San Francisco:

A large amount of geo-data is generated by GeoMeme, and so arises a need shared by many geo apps: scalable, fast, and accurate spatial queries, used to select a subset of geo-data for display as markers on a map, or on Google Earth.

:)Google App Engine

Google App Engine is an obvious choice for hosting your geo app. The App Engine datastore is built on top of Google’s BigTable technology which scales very well, and is optimized for fast data retrieval. And it doesn’t cost the earth like some traditional GIS database solutions.

:( Inequality constraint

If you are coming from a background of relational databases, you might think the solution here would be to store the latitude and longitude of all your markers in a database table, and do a simple query to retrieve only those contained within the bounds of the map.

However, the flipside of being optimized for fast data retrieval is that BigTable only allows inequality filters on a single dimension, to avoid the burden of full table scans. For example, the following form of spatial query is not supported because it specifies inequality filters on both latitude and longitude dimensions:

SELECT latitude, longitude, title FROM myMarkers
WHERE latitude >= :south AND latitude <= :north
  AND longitude >= :west AND longitude <= :east

>>> BadFilterError: invalid filter: Only one property per query may have inequality filters 

:) Geohash to the rescue

Fortunately, there is an answer to this: geohash, the geocoding system invented by Gustavo Niemeyer.

My suspicion is that Niemeyer has travelled back in time after working out how to collapse two-dimensional space into a single dimension, but you might want to read the Wikipedia explanation of the algorithm instead.

An example: the location of the Sydney Opera House can be specified in two dimensions as {latitude: -33.858, longitude: 151.215}, or in a single dimension as {geohash: r3gx2ur29zzg7}.

Schuyler Erle has written an open source geohash python module, which enables the following form of spatial query on Google App Engine, because the inequality filter is specified only on a single dimension:

sw_geohash = Geohash((west, south))
ne_geohash = Geohash((east, north))

SELECT latitude, longitude, title FROM myMarkers
WHERE geohash >= :sw_geohash AND geohash <= :ne_geohash

;(Don’t hash me, coz I’m close to the edge

However…! We’re not done yet, because an artifact of the geohash algorithm is that queries like the one above can often return rogue markers which are outside of the required bounds, when the map spans what we shall call a geohash “faultline”.

This problem is particularly evident near the equator and the Greenwich Meridian, which are the biggest faultlines, but there are actually faultlines all over the place at every zoom level.

:DFaultline correction

GeoMeme solves this problem using “faultline correction”, an approach that I would like to share here:

  • If a spatial query is vulnerable to faultlines, it is split into multiple sub-queries that do not cross the faultline.
  • Sub-query limits are approximately weighted according to their relative size.
  • Sub-queries are executed in parallel, taking advantage of BigTable’s distributed goodness, and the results combined, so all this happens very fast.
  • Even though the sub-queries are executed in parallel without any significant impact on user experience, App Engine CPU costs can increase to around 2x the CPU cost of a single less accurate query. Sub-query results are memcached to reduce this CPU overhead.

Here’s a demo showing the effect of geohash faultlines, and the relative accuracy of spatial queries with or without faultline correction. Rogue markers are shown in the area surrounding the map. Switch between Correction: off / on / double and compare the accuracy.

Note: “double” correction is an advanced option which splits the sub-queries into sub-sub-queries so that they do not cross any faultlines either. This can further increase accuracy, but with further CPU cost (up to 8x the CPU cost of a single less accurate query).

All the source code can be found here, including ffGeoSearch, a python module to handle faultline-friendly geo search, if you want to use this technique on your own geo app.

Enjoy!

This post is one of a series that aims to share some of the technology innovation that can be found in GeoMeme. Other posts cover topics such as fast map re-location using Google Static Maps v2, and location-aware mobile web apps using Google Maps v3.

This entry was posted in mobile geo social and tagged , , , , , . Bookmark the permalink.