Cached sequential unique identifiers with Node.js and MongoDB

Acquiring a sequential id with MongoDB is simple, because it supports the $inc operator for atomic increments. However, a naive implementation has to hit the database every single time an id is required, which can create latency and overhead issues. A typical case is user tracking, where the application needs a globally unique id for every user across a load-balanced array of Node.js instances.

This method is more efficient and works when you are running several Node.js instances simultaneously. Each instance fetches a unique range of numbers from the database, hands them out from memory, and fetches a new range when it runs out. This example assumes you use https://github.com/mongodb/node-mongodb-native .
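To make the idea concrete, here is a minimal sketch of the range allocation on its own, with the database replaced by an in-memory counter. The names makeFakeCounter, makeAllocator and reserve are made up for this illustration and are not part of the MongoDB driver; reserve only mimics what an atomic $inc that returns the new value would give you.

```javascript
// Hypothetical in-memory stand-in for the database counter.
// reserve(step) mimics an atomic $inc: it advances the shared
// counter by `step` and returns the new (upper) value.
function makeFakeCounter(start) {
    var value = start;
    var hits = 0;
    return {
        reserve: function (step) { hits++; value += step; return value; },
        hits: function () { return hits; }
    };
}

// One allocator per Node.js instance: hands out ids from memory and
// reserves a new range only when the current one is used up.
function makeAllocator(counter, step) {
    var index = 0, high = 0; // empty range, so the first call reserves
    return function nextId() {
        if (index >= high) {
            high = counter.reserve(step);
            index = high - step; // start of the range we just reserved
        }
        return index++;
    };
}

var counter = makeFakeCounter(1000);
var nextId = makeAllocator(counter, 10);
var ids = [];
for (var i = 0; i < 25; i++) ids.push(nextId());

console.log(ids[0], ids[24]); // 1000 1024
console.log(counter.hits());  // 3
```

With a step of 10, handing out 25 ids costs three round trips to the counter instead of 25.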

Initialize the id starting point.

On instance startup, the implementation initializes the id to a starting value (if it does not exist yet) and fetches the current status from the database. In this example the starting value is 1000.

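var idcollection; // shared with next_id below

function init_id( seqname, next ) {
    idcollection = new mongodb.Collection(client, 'ids');
    function _findId() {
        idcollection.findOne({_id: seqname}, function(err, doc) {
            if ( err ) { console.log( 'ERROR MONGO', 'ids', err ); return next(err); }
            if( doc ) {
                // start with an empty range (high == index) so the first next_id call fetches
                return next( false, { _id: seqname, waiters: [], high: doc.index, index: doc.index } );
            }
            // sequence does not exist yet: create it, then read it back
            idcollection.insert( {_id: seqname, index: 1000}, {safe: true}, function(err, doc) {
                // 11000 = duplicate key: another instance created the sequence first
                if ( err && err.code !== 11000 ) { console.log( 'ERROR MONGO', 'ids', err ); return next(err); }
                return _findId();
            });
        });
    }
    _findId();
}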

The callback ‘next’ is called with an object initialized to the current range from the database.

    init_id( 'myseq', function(err, idstatus ) {
        // we have now id status
    });
...

Sequence generation function

Next we define the function that is called to fetch the next id. The tricky part is that when the code needs to fetch the next batch of unique identifiers, it has to queue the other callers until the fetch completes, so that we do not end up fetching more than one range at a time.

The high and index properties were set to the same value during initialization, so the first call to next_id always triggers a fetch.

var INDEX_STEP = 10; // range to prefetch per query

function next_id( idstatus, next ) {

    if (idstatus.high > idstatus.index) {
        // id available from memory
        return next(false, idstatus.index++);
    }

    // need to fetch, put callback in wait list
    idstatus.waiters.push( next )

    if (idstatus.infetch) {
       // already fetch in progress
       return;
    }

    // initiate fetch
    _fetch( INDEX_STEP );

    function _fetch( step ) {
        // use findandmodify to increment index and return new value
        idstatus.infetch = true;
        idcollection.findAndModify( {_id: idstatus._id}, [['_id','asc']],
                                    {$inc: {index: step}},
                                    {new: true}, _after_fetch);
    }

    function _after_fetch(err, object) {

        function _notify_waiters( err ) {
            // give id to all waiters
            while ( idstatus.waiters.length ) {
                if ( err ) {
                    (idstatus.waiters.shift())( err )
                } else {
                     if (idstatus.high <= idstatus.index) {
                        // we got more waiters during fetch and
                        // exhausted this batch, get next batch
                        return _fetch( INDEX_STEP );
                     }
                    (idstatus.waiters.shift())( false, idstatus.index++ )
                }
            }
           idstatus.infetch = false;
        }

        if (err) return _notify_waiters( err )
        if (!object) return _notify_waiters(new Error('index not found'))

        idstatus.high = object.index

        // the current index must be reset to the allocated range
        // start, because there could be several parallel nodes making
        // incremental queries to the db so each node does not get
        // sequential ranges.
        idstatus.index = object.index - INDEX_STEP

        _notify_waiters();
    }
}

The calling code gets the next id as an argument to its callback:

    next_id( idstatus, function(err, id) {

        // 'id' is next unique id to use!
    });

Notes

  • Identifiers handed out by one instance are sequential (growing) but not always consecutive, because other Node.js instances reserve ranges from the same counter in between.
  • Each startup advances the current value of the sequence in the database by INDEX_STEP if next_id is called at least once, so ids that the previous run had reserved but not used are skipped.
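  • INDEX_STEP should be large enough that the callers queued during a single fetch do not use up the whole range, otherwise the code has to fetch again immediately. Ideally the implementation would add some kind of exponential retry.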
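The first two notes can be seen in a small simulation. This is only a sketch under assumptions: the shared variable dbIndex and the helpers reserve and makeInstance are hypothetical stand-ins for the 'ids' document and for separate Node.js processes, and everything runs synchronously in one process.

```javascript
// Shared counter standing in for the 'ids' document (hypothetical helper).
var dbIndex = 1000;
function reserve(step) { dbIndex += step; return dbIndex; }

// Each "instance" keeps its own in-memory range, like idstatus does.
function makeInstance(step) {
    var index = 0, high = 0;
    return function nextId() {
        if (index >= high) { high = reserve(step); index = high - step; }
        return index++;
    };
}

var STEP = 3;
var a = makeInstance(STEP), b = makeInstance(STEP);

// Two instances taking turns, four ids each.
var fromA = [], fromB = [];
for (var i = 0; i < 4; i++) { fromA.push(a()); fromB.push(b()); }

console.log(fromA); // [ 1000, 1001, 1002, 1006 ]
console.log(fromB); // [ 1003, 1004, 1005, 1009 ]

// A freshly started instance always reserves a new range, so the ids
// still sitting in memory of a and b (1007, 1008, 1010, 1011) would be
// lost if those instances were the ones being restarted.
var restarted = makeInstance(STEP);
var firstAfterRestart = restarted();
console.log(firstAfterRestart); // 1012
```

Every id is unique, but neither instance sees a gap-free sequence, and a restart leaves holes of up to INDEX_STEP - 1 ids.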

EC2 EBS Backup Python script

This is a simple EC2 backup script that snapshots the listed EBS volumes daily. The script keeps a maximum number of daily, weekly and monthly snapshots per volume, and checks whether the daily backup has already been done or is in progress, so it does not create duplicates for a single day.

Prerequisites

1. EC2 command line tools.
Check that you can run them from the command line:

$ ec2-describe-snapshots
SNAPSHOT	snap-070cba6c	vol-123123	completed	2012-04-19T02:06:54+0000	100%	457025778133		my.com root
SNAPSHOT	snap-170cba7c	vol-455445	completed	2012-04-19T02:07:09+0000	100%	457025778133	10	my.net root
...

2. Fabric administration and deployment scripting tool

Install with easy_install or pip

$ sudo easy_install fabric

See http://docs.fabfile.org/en/1.4.1/installation.html for more details

3. The script.
Copy the following to ec2-backup.py and replace the BACKUP_VOLS dictionary with your own volumes and their descriptions. The script is also available on GitHub.

import os, sys, time
import dateutil.parser
from datetime import date, timedelta, datetime

# only local() and settings() are used by this script
from fabric.api import local, settings

# For each volume, define how many daily, weekly and monthly backups
# you want to keep. For weekly backups Monday's snapshot is kept, and for
# monthly backups the one from the 1st day of the month.
BACKUP_VOLS = {
	'vol-abc1234': {'comment': 'my.com database', 'days': 7, 'weeks': 4, 'months': 4},
	'vol-1234565': {'comment': 'my.com root', 'days': 7, 'weeks': 4, 'months': 4},
}


today = date.today()

snapshots = {}
hastoday = {}
savedays = {}	# retained snapshot days for each volume

for (volume, conf) in BACKUP_VOLS.items():
	daylist = savedays[volume] = []
	# last n days
	for c in range(conf['days'] - 1, -1, -1):
		daylist.append(today - timedelta(days=c))
	# last n weeks (get mondays)
	monday = today - timedelta(days=today.isoweekday() - 1)
	daylist.append(monday)
	for c in range(conf['weeks'] - 1, 0, -1):
		daylist.append(monday - timedelta(days=c * 7))
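	# last n months (first day of month)
	for c in range(conf['months'] - 1, -1, -1):
		# count months from year 0 so that stepping back wraps into the
		# previous year instead of producing month 0 or a negative month
		months = today.year * 12 + today.month - 1 - c
		daylist.append(date(months // 12, months % 12 + 1, 1))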

SNAPSHOTS = local('ec2-describe-snapshots', capture=True).split('\n')

SNAPSHOTS = [tuple(l.split('\t')) for l in SNAPSHOTS if l.startswith('SNAPSHOT')]

for (_, snapshot, volume, status, datestr, progress, _, _, _) in SNAPSHOTS:
	snapshotdate = dateutil.parser.parse(datestr).date()
	if volume in BACKUP_VOLS:
		if snapshotdate == today:
			hastoday[volume] = {'status': status, 'snapshot': snapshot, 'progress': progress.replace('%', '')}
		if volume not in snapshots:
			snapshots[volume] = []
		snapshots[volume].append((snapshot, status, snapshotdate))

for snapshotlist in snapshots.values():
	snapshotlist.sort(key=lambda x: x[2], reverse=True)

for volume in BACKUP_VOLS.keys():
	if volume not in snapshots:
		snapshots[volume] = []

print "VOLUME\tSNAPSHOT\tSTATUS\tDATE\tDESC"
for (volume, snapshotlist) in snapshots.items():
	# loop variable is named 'day' so it does not shadow the imported date class
	for (snapshot, status, day) in snapshotlist:
		datestr = day.strftime('%Y-%m-%d')
		print "%s\t%s\t%s\t%s\t%s" % (volume, snapshot, status, datestr, BACKUP_VOLS[volume]['comment'])


def status():
	# nothing to do here: the snapshot table above is printed when fab loads this file
	pass


def backup(dryrun=False):
	print "\nCREATING SNAPSHOTS"
	for (volume, snapshotlist) in snapshots.items():
		if volume in hastoday:
			print '%s has %s%% %s snapshot %s for today "%s"' % (volume,
															hastoday[volume]['progress'],
															hastoday[volume]['status'],
															hastoday[volume]['snapshot'],
															BACKUP_VOLS[volume]['comment'])
		else:
			print 'creating snapshot for %s "%s"' % (volume, BACKUP_VOLS[volume]['comment'])
			snapshotlist.insert(0, ('new', 'incomplete', today))
			if not dryrun:
				local('ec2-create-snapshot %s -d "%s"' % (volume, BACKUP_VOLS[volume]['comment']))

	print "\nDELETING OLD SNAPSHOTS"
	for (volume, snapshotlist) in snapshots.items():
		for (snapshot, _, day) in snapshotlist:
			# delete every snapshot whose date is not on the retention list
			if day not in savedays[volume]:
				datestr = day.strftime('%Y-%m-%d')
				print "deleting %s %s for %s (%s)" % (snapshot, datestr, volume, BACKUP_VOLS[volume]['comment'])
				if not dryrun:
					with settings(warn_only=True):
						local('ec2-delete-snapshot %s' % snapshot)


def dryrun():
	print """

*** DRY RUN ***

"""
	backup(dryrun=True)

You can dry run the script first to see what it would do:

$ fab -f ec2-backup.py dryrun

To make the actual backup:

$ fab -f ec2-backup.py backup

Example output

$ fab -f ec2-backup.py backup
[localhost] local: ec2-describe-snapshots
VOLUME	SNAPSHOT	STATUS	DATE	DESC
vol-abc1234	snap-48fe4023	completed	2012-04-24	my.com database
vol-abc1234	snap-23863a48	completed	2012-04-23	my.com database
vol-abc1234	snap-838131e8	completed	2012-04-20	my.com database
vol-abc1234	snap-1b0cba70	completed	2012-04-19	my.com database
vol-abc1234	snap-0d4ffb66	completed	2012-04-17	my.com database
vol-1234565	snap-42fe4029	completed	2012-04-24	my.com root
vol-1234565	snap-25863a4e	completed	2012-04-23	my.com root
vol-1234565	snap-858131ee	completed	2012-04-20	my.com root
vol-1234565	snap-1f0cba74	completed	2012-04-19	my.com root
vol-1234565	snap-034ffb68	completed	2012-04-17	my.com root

CREATING SNAPSHOTS
creating snapshot for vol-abc1234 "my.com database"
[localhost] local: ec2-create-snapshot vol-abc1234 -d "my.com database"
SNAPSHOT	snap-8ccd74e7	vol-abc1234	pending	2012-04-25T02:18:58+0000		457025778133	50	my.com database
creating snapshot for vol-1234565 "my.com root"
[localhost] local: ec2-create-snapshot vol-1234565 -d "my.com root"
SNAPSHOT	snap-86cd74ed	vol-1234565	pending	2012-04-25T02:19:03+0000		457025778133	8	my.com root

DELETING OLD SNAPSHOTS
deleting snap-0d4ffb66 2012-04-17 for vol-abc1234 (my.com database)
[localhost] local: ec2-delete-snapshot snap-0d4ffb66
SNAPSHOT	snap-0d4ffb66
deleting snap-034ffb68 2012-04-17 for vol-1234565 (my.com root)
[localhost] local: ec2-delete-snapshot snap-034ffb68
SNAPSHOT	snap-034ffb68

Done.

If you try to run it again on the same day, it reports the snapshots that already exist or are still in progress:

...

CREATING SNAPSHOTS
vol-abc1234 has 55% pending snapshot snap-8ccd74e7 for today "my.com database"
vol-1234565 has 100% completed snapshot snap-86cd74ed for today "my.com root"

...