CouchDB cleanup script for purging old docs

CouchDB does not have straightforward ways to clean up old data. This is one simple way to delete entries by date, but it requires that

  • Your documents have a date or timestamp property (see the example document below)
  • There is a view in each database to fetch documents by that property
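
For example, a document like this qualifies. The fields are purely illustrative; the only thing the cleanup relies on is created, a Unix timestamp in seconds:

{
    "_id": "event-20120515-001",
    "type": "event",
    "created": 1337049581
}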

Prerequisites

  1. Node.js
  2. jss module, installed with ‘npm install jss’.

1. Prepare views for Cleanup

Define a view in each database that needs regular cleanup. Use something like the following, where the emitted key is a timestamp in seconds and the value is the document revision.

views: {
    created: {
        map: function(doc) {
            if ( doc.created ) {
                emit(doc.created, doc._rev);
            }
        }
    }
    ....
}
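
Querying such a view returns rows where id is the document id, key is the timestamp and value is the revision, which is exactly what the cleanup script below relies on. An illustrative response (ids and revisions made up):

{"total_rows": 2, "offset": 0, "rows": [
    {"id": "event-20120515-001", "key": 1337049581, "value": "1-967a00dff5e02add41819138abb3284d"},
    {"id": "event-20120516-002", "key": 1337135981, "value": "2-eec205a9d413992850a6e32678485900"}
]}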

2. The Cleanup script

The script queries old doc ids from the cleanup view and marks them as deleted. Documents are not deleted immediately but are removed physically on the next CouchDB compaction. CouchDB 1.2.0 supports automatic compaction, so just enable it and don't worry about it.

#!/bin/bash

DBHOST=localhost

# Get key for entries that are over 6 months old. This assumes that created view can be queried using timestamps as keys.
if uname -a | grep -i darwin > /dev/null
then
	TODAY=$(date '+%Y-%m-%d')
	MONTHSAGO=$(date -v -24w '+%Y-%m-%d')
	MONTHSAGO_E=$(date -v -24w '+%s')
else
	TODAY=$(date '+%Y-%m-%d')
	MONTHSAGO=$(date -d '24 weeks ago' '+%Y-%m-%d')
	MONTHSAGO_E=$(date -d '24 weeks ago' '+%s')
fi

PATH=$PATH:/usr/local/bin

# JSON scripting tool
JSS=$(npm bin)/jss

cleanup() {
	DATABASE=$1
	DESIGN=$2

	echo "Cleaning $DATABASE/$DESIGN"
	curl --silent -S http://$DBHOST:5984/$DATABASE/_design/$DESIGN/_view/created?endkey=$MONTHSAGO_E | \
		$JSS --bulk_docs '$.id' '{_id: $.id, _rev:$.value, _deleted:true}' | \
		curl --silent -S -X POST -d @-  -H "Content-Type:application/json" http://$DBHOST:5984/$DATABASE/_bulk_docs | \
		sed 's/\({[^}]*}\),/\1\n/g' | tr -d '[]' | \
		$JSS '$.ok != true'
}

echo "STATS CLEANUP <= $MONTHSAGO - Start" `date`

# Put databases and views here
cleanup somedb1 someview
cleanup somedb1 otherview
cleanup somedb2 alsoview

echo "STATS CLEANUP - Done" `date`

The script does this

  1. Get the expired doc ids, e.g. curl ‘http://localhost:5984/mydatabase/_design/mydesign/_view/created?endkey=1337049581’
  2. Build a bulk doc delete request (jss); see the example request body after this list
  3. Issue the bulk delete request (curl POST)
  4. Sanitize CouchDB output, i.e. add newlines and remove brackets (sed)
  5. Print the failed ones
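
The request body built in step 2 follows CouchDB's _bulk_docs format; an illustrative example (ids and revisions made up):

{"docs": [
    {"_id": "event-20120515-001", "_rev": "1-967a00dff5e02add41819138abb3284d", "_deleted": true},
    {"_id": "event-20120516-002", "_rev": "2-eec205a9d413992850a6e32678485900", "_deleted": true}
]}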

Note that the default version of jss doesn't output proper JSON if no documents are found; use my fork to work around this problem if you don't want to see errors in the logs.

npm install https://github.com/tikonen/jss/tarball/master

Cross Domain data channel with HTML5 Canvas

Standard Ajax is restricted by the same-origin policy, so JSONP is the de-facto way of exchanging data with cross-domain sources, and it works pretty well. An alternative, though a bit hacky, way is to use HTML5 Canvas as a cross-domain workaround, with pseudo images as a “covert channel”.

The basic idea is simple: JavaScript in the client requests an image file from the 3rd party site, and the server encodes the data into the image; the client can use cookies and URL parameters to identify itself as desired. The client then renders the image and decodes the data from the image pixels.

Backend

The backend needs to be able to construct images with custom pixel-level data. In this example we use Node.js and the canvas module, a server-side HTML5 Canvas implementation based on the Cairo graphics library.
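
The module is available on npm (it builds against Cairo, so the Cairo development libraries need to be installed on the system):

npm install canvas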

This function accepts any object and returns a canvas object that contains the object's JSON representation encoded in the image pixels.

function encodeDataToImage( data ) {

	// Convert data to binary buffer while being utf-8
	var s = encodeURIComponent( JSON.stringify(data) );
	var buffer = new Buffer(s, 'utf8');
	var pixelc = Math.ceil(buffer.length / 3); // 3 data bytes per pixel

	// Encode data as PNG image
	var Canvas = require('canvas');
	var canvas = new Canvas(pixelc, 1)
	var ctx = canvas.getContext('2d');
	var imgdata = ctx.getImageData(0, 0, pixelc, 1);

	for (var i=0, k=0; i < pixelc * 4; i += 4 ) {
		imgdata.data[i + 3] = 0xFF; // full alpha, so premultiplication does not alter the RGB data bytes
		for (var j=0; j < 3 && k < buffer.length; k++, j++ ) {
			imgdata.data[i + j] = buffer[k];
		}
	}
	// set "image" data
	ctx.putImageData(imgdata, 0, 0);
	return canvas;
}
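
A quick way to sanity check the result is to dump the encoded image to a file (a minimal sketch; the file name is arbitrary):

var fs = require('fs');
var canvas = encodeDataToImage({ hello: 'world' });
fs.writeFileSync('out.png', canvas.toBuffer());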

Define an xd request handler that builds and sends the data-encoded image to the client (example in Express.js).

someapp.get('/xd', function(req, res ) {
    // do here something with query or cookies, like resolve uid and set
    // data.
    // Example data
    var data = { a: 1, en: 'owl', fi: 'pöllö', es: 'búho', uid: req.query.uid }

    var canvas = encodeDataToImage( data );
    var img = canvas.toBuffer();
    res.contentType('png');
    res.header('Content-Length', img.length);
    res.send( img );
});

Browser
On the browser side, load the image and decode it back into an object.

function queryXD( query, callback ) {

	var img = new Image();
	img.src = 'http://some.site.example.com/xd?' + query;
	img.addEventListener('load', function() {

		// Image loaded, create temporary canvas
		var canvas = document.createElement('canvas');
		var ctx = canvas.getContext('2d');

		// draw image on canvas
		canvas.width = img.width;
		canvas.height = img.height;
		ctx.drawImage( img, 0, 0 );

		// collect bytes from image pixels
		var bytes = [];
		var imgdata = ctx.getImageData(0, 0, img.width, img.height);
		for (var i=0; i < img.width * 4; i++ ) {
			// skip the alpha component of each pixel
			if ( i && (i + 1) % 4 == 0) {
				i++;
			}
			var b = imgdata.data[i];
			// stop at the zero padding in the last pixel
			if (!b) {
				break;
			}
			bytes.push( b );
		}

		// convert bytes to string and parse JSON
		var s = decodeURIComponent( String.fromCharCode.apply(null, bytes) );
		var data = JSON.parse(s);

		callback(false, data);
	}, false);

        // image failed to load
	img.addEventListener('error', function(err) {
		callback(err);
	}, false);
}

And now it's simple to do cross-domain data exchange like this:

queryXD('uid=2134', function(err, data) {
   alert(data.en + ' is ' + data.fi + ' in Finnish and ' + data.es + ' in Spanish');
});

Caveats
Proxies and browsers like to cache the images, so add a unique dummy parameter to every request to force a fetch.
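
For example, a throwaway timestamp parameter works (the parameter name here is arbitrary; queryXD is the helper defined above):

queryXD('uid=2134&_nocache=' + Date.now(), function(err, data) {
    // handle err and data as before
});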