CouchDB cleanup script for purging old docs

CouchDB does not have straightforward ways to clean up old data. This is one simple way do delete entries by date, but it requires that

  • Your documents have date or timestamp property
  • There is view for each database to fetch documents for that property

Prerequisities

  1. Node.js
  2. jss module, i.e. ‘npm install jss’.

1. Prepare views for Cleanup

Define view in each database that needs regular cleanup. Use something like this where emitted key field is timestamp in seconds.

views: {
    created: {
	map: function(doc) {
		if ( doc.created ) {
			emit(doc.created, doc._rev);
		}
	}
    }
    ....
}

2. The Cleanup script

Script queries old doc ids from the cleanup view and marks them as deleted. Documents are not deleted immediately but are removed physically on next CouchDB compact. CouchDB 1.2.0 supports autocompact so just enable it and don’t worry about it.

#!/bin/bash

DBHOST=localhost

# Get key for entries that are over 6 months old. This assumes that created view can be queried using timestamps as keys.
if uname -a | grep -i darwin > /dev/null
then
	TODAY=$(date '+%Y-%m-%d')
	MONTHSAGO=$(date -v -24w '+%Y-%m-%d')
	MONTHSAGO_E=$(date -v -24w '+%s')
else
	TODAY=$(date '+%Y-%m-%d')
	MONTHSAGO=$(date -d '24 weeks ago' '+%Y-%m-%d')
	MONTHSAGO_E=$(date -d '24 weeks ago' '+%s')
fi

PATH=$PATH:/usr/local/bin

# JSON scripting tool
JSS=$(npm bin)/jss

cleanup() {
	DATABASE=$1
	DESIGN=$2

	echo "Cleaning $DATABASE/$DESIGN"
	curl --silent -S http://$DBHOST:5984/$DATABASE/_design/$DESIGN/_view/created?endkey=$MONTHSAGO_E | \
		$JSS --bulk_docs '$.id' '{_id: $.id, _rev:$.value, _deleted:true}' | \
		curl --silent -S -X POST -d @-  -H "Content-Type:application/json" http://$HOST:$PORT/$DATABASE/_bulk_docs | \
		sed 's/\({[^}]*}\),/\1\n/g' | tr -d '[]' | \
		$JSS '$.ok != true'
}

echo "STATS CLEANUP <= $MONTHSAGO - Start" `date`

# Put databases and views here
cleanup somedb1 someview
cleanup somedb1 otherview
cleanup somedb2 alsoview

echo "STATS CLEANUP - Done" `date`

The script does this

  1. Get expired docs, e.g. curl ‘http://localhost:5984/mydatabase/_design/mydesign/_view/created?endkey=1337049581&#8217;
  2. Build bulk doc delete request (jss)
  3. Issue delete bulk request (curl post)
  4. sanitize couchdb output, i.e. add newlines and remove brackets (sed)
  5. print failed ones

Note that default version of jss doesn’t output proper JSON if no documents are found, use my fork to workaround this problem if you dont want to see errors in logs.

npm install https://github.com/tikonen/jss/tarball/master

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: