Quick and robust C++ CSV reader with boost

This is quick and simple CSV reader based on Boost regular expression token iterator. Parser splits the input with a regular expressions and returns the result as a collection of vectors of strings.
Regular expression handles neatly lot of the complicated edge cases such as empty columns, quoted text, etc..

Parser code

#include <boost/regex.hpp>

// used to split the file in lines
const boost::regex linesregx("\\r\\n|\\n\\r|\\n|\\r");

// used to split each line to tokens, assuming ',' as column separator
const boost::regex fieldsregx(",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))");

typedef std::vector<std::string> Row;

std::vector<Row> parse(const char* data, unsigned int length)
{
    std::vector<Row> result;

    // iterator splits data to lines
    boost::cregex_token_iterator li(data, data + length, linesregx, -1);
    boost::cregex_token_iterator end;

    while (li != end) {
        std::string line = li->str();
        ++li;

        // Split line to tokens
        boost::sregex_token_iterator ti(line.begin(), line.end(), fieldsregx, -1);
        boost::sregex_token_iterator end2;

        std::vector<std::string> row;
        while (ti != end2) {
            std::string token = ti->str();
            ++ti;
            row.push_back(token);
        }
        if (line.back() == ',') {
            // last character was a separator
            row.push_back("");
        }
        result.push_back(row);
    }
    return result;
}

Example

CSV data with common problem cases, such as empty quotes, commas inside quotes and empty last column.

a,b,c
1,"cat",3
",2",dog,4
3,a b,5
4,empty,
5,,empty
6,"",empty2
7,x,long story no commas
8,y,"some, commas, here,"

Read and parse the CSV data above and output the parsed result

int main(int argc, char*argv[])
{
	// read example file
	std::ifstream infile;
	infile.open("example.csv");
	char buffer[1024];
	infile.read(buffer, sizeof(buffer));
	buffer[infile.tellg()] = '\0';

	// parse file
	std::vector<Row> result  = parse(buffer, strlen(buffer));

	// print out result
	for(size_t r=0; r < result.size(); r++) {
		Row& row = result[r];
		for(size_t c=0; c < row.size() - 1; c++) {
			std::cout << row[c] << "\t";
		}
		std::cout << row.back() << std::endl;
	}
}

Output

$ ./reader
a      	b      	c
1      	"cat"  	3
",2"   	dog    	4
3      	a b    	5
4      	empty
5      	        empty
6      	""      empty2
7      	x      	long story no commas
8      	y      	"some, commas, here,"

See full example code in Github: https://github.com/tikonen/blog/tree/master/boostcsvreader

Finding maximal rectangles in a grid

Maximal rectangles problem is to find rectangles inside an area that have maximum possible surface area. This is often needed in games to show the player where in area a certain item can fit etc.. There are many ways to approach the problem, and for some cases (such as areas defined by polygon) they can get complex to implement. Here is one relatively simple method for looking up those maximum rectangles inside an area in a common grid. It’s pretty efficient and does support following restrictions

  • Rectangles can have minimum size (e.g. larger than 2×2).
  • There can be more than one maximum rectangle in the area. (in case two large areas are joined with narrow path)

First part is defining function that looks up rectangle dimensions. This function starts from a tile that will be left top corner of rectangle and returns its width and height. The listings here are in pseudocode.

Listing 1. Get rectangle size

FUNC get_rectangle_size( tile, tiles )
BEGIN
    # find rectangle width
    width = 1
    WHILE contains( (tile.x + width, c.y,  tiles )
      width = width + 1
   
    # find rectangle height
    height = 0
    stop = false
    REPEAT
       height = height + 1
       LOOP FROM x = tile.x TO tile.x + width
         IF NOT contains( (x, tile.y + height), tiles )
         THEN
            stop = true
            BREAK
         ELSE
            x = x + 1
    UNTIL stop  

    RETURN (width, height)
END

Function finds the maximum width of the rectangle starting from the left corner and working right (lines 5-6), then it finds rectangle height by looking up rows of same width (lines 11-20).

The actual algorithm that finds the rectangles accepts list of grid tiles as input and outputs the found rectangles that are maximally sized. The pseudocode listing here assumes some helper functions for simplicity.

  • sort ( list, compare ) – Sorts list in place by compare function
  • add( item, list ) – Adds item to a list
  • remove (item, list) – Removes item from list
  • contains( item, list ) – Returns true if item is found from list
  • length( list ) – Returns lists length
  • overlap (rect1, rect2) – Returns true if two rectangles overlap

Listing 2. Find maximal rectangles. Unoptimized for readability!

FUNC find_tiles( tiles )  # the list of tiles as (x,y) tuples
BEGIN
  rectangles = []  # list of found rectangles

  # sort with compare order by y and x.
  sort(tiles,  (a, b) => a.y == b.y ? a.x - b.x : a.y - b.y)

  WHILE length(tiles) > 0
    c = head(tiles)   # take first item from the sorted list
    
    # get rectangle size
    (width, height) = get_rectangle_size( c, tiles ) 

    # remove tiles that can not be part of any larger rectangle
    LOOP FROM x = c.x TO c.x + width
      LOOP FROM y = c.y TO c.y + height
         IF NOT contains( (c.x - 1, y, tiles) AND
            NOT contains( (c.x + width, y), tiles) AND
            NOT contains( (x, c.y - 1), tiles) AND
            NOT contains( (x, c.y + height), tiles)
         THEN
            remove( (x, y), tiles )
 
    # if big enough rectangle add it to list
    IF width > 1 AND height > 1
    THEN
      add ( (c.x, c.y, width, height), rectangles ) 

  # sort rectangles by area size
  sort( rectangles, (r1, r2) => r2.width * r2.height - r1.width * r1.height )

  # remove overlapping rectangles
  i = 0
  WHILE i < length(rectangles) - 1
    LOOP FROM j = length(rectangles) - 1 TO i + 1 # descending order 
        IF overlaps(rectangles[i], rectangles[j])
        THEN
          remove(rectangles[j], rectangles)
    i = i + 1

  RETURN rectangles
END

Here is example how the algorithm works, here white area presents the working set of tiles that algorithm works on, white presents the tiles that are in list tiles.
step1-rect

Algorithm sorts first the tiles in top-to-bottom and left-to-right order so it starts from top left tile. The pink denotes the current working tile c. The green represents the tiles that belong to rectangle as determined by call to get_rectangle_size.
step2-rect

Next step is to remove tiles from working set that can not be part of any other rectangle. This is determined by checking if the row and column of tile are bounded inside the current rectangle. The purple presents tiles that were removed from the working set tiles. Found rectangle is added in the list rectangles if it’s large enough (in this case larger than 2×2).
Then loop is executed again with next corner tile c.
step3-rect

Next iteration of the main loop returns another rectangle and closed tiles are removed and rectangle added to list.
step4-rect

Final loop of the rectangle. There are no more tiles to work so main loop stops.
step5-rect

Finally algorithm sorts the found rectangles in descending surface area order and removes overlap. The resulting rectangles are returned as output.

Finding Patterns in Data Sequences

Some time ago I was presented question if there would be easy and quick way to recognize certain usage patterns in huge amount of event data. The events in question were common user activity like clicking buttons, waiting certain time between and so on. This data could then used to improve user experience or catch cheaters that write bots to play their game for them and so on.

There are multiple ways to approach this problem but I came up with relatively simple method that seems to have flexibility. We can think the events of interest as sequence of symbols and in case also time between events is of interest, quantize the time as symbols for short, long and medium time. Then, sequence of events can be presented as list of symbols

ABC123ABC123ABC123ABC123XABC123ABC123ABC123ABC123XABC123ABC123ABC123ABC123X

Here A could be button click to open window, B a tab click in window, C close button, 1 a second pause before moving mouse, 2 a few second pause before moving mouse and X is collect resources etc..

Now we can approach problem with pattern finding. First let’s define function that is supposed to find recurring patterns and output them and input as sequence of those patterns. (Following is written in Python)

def findpattern(s, trivial):
	patterns = {}
	symbolmap = {}

	p = patterns
	acc = []
	output = []
	for c in s:
		if not c in p:
			if not c in patterns or (not trivial and acc[-1] == c):
				p[c] = {}
				p = p[c]
				acc.append(c)
			else:
				acc = ''.join(map(lambda x: str(x),acc))
				if not acc in symbolmap:
					symbolmap[acc] = true
				output.append(acc)

				acc = [c]
				p = patterns[c]
		else:
			p = p[c]
			acc.append(c)

	if acc:
		acc = ''.join(map(lambda x: str(x),acc))
		if not acc in symbolmap:
			symbolmap[acc] = true
		output.append(acc)

	return (symbolmap.keys(), output)

Function accepts two arguments, the list of symbols and flag if trivial pattern should be accepted. This is needed to determine if you consider a input list “AAAAA” to be one instance of pattern “AAAAA” or 5 instances of pattern “A”.

When this function is applied on the example string above, we get list of core patterns and the input as pattern list.

> s = "ABC123ABC123ABC123ABC123XABC123ABC123ABC123ABC123XABC123ABC123ABC123ABC123X"
> findpattern(s, False)
['ABC123', 'ABC123X'] ['ABC123', 'ABC123', 'ABC123', 'ABC123X', 'ABC123', 'ABC123', 'ABC123', 'ABC123X', 'ABC123', 'ABC123', 'ABC123', 'ABC123X']

Here it’s easy to see that the string is composed of two main patterns, ABC123 and ABC123X. The second output list is the input string split as instances of pattern.
Real data is noisier than the example here, so proper data cleaning and event definition / filtering are vital for any kind of useful pattern recognition.

In this case one could guess that the player might be bot, as the same sequence occur regularly with too much precision and too little variability for human player. (Remember that the symbols 1, 2 and 3 presented also time in this example.)

Going forward we can also define function that recursively uses the output of the findpattern to find patterns of patterns up to the “prime” pattern of the input, i.e. the pattern that when repeated N times produces the original input string.

def findprimepattern(s, trivial=False):
	o = s
	while True:
		s, o = findpattern(o, trivial)
		print s,o
		if len(s) == 1:
			return s[0]

Example with more complicated pattern string

> s = (((("ABC"*2)+"DE")*3+"X")*2+"O")*4
"ABCABCDEABCABCDEABCABCDEXABCABCDEABCABCDEABCABCDEXOABCABCDEABCABCDEABCABCDEXABCABCDEABCABCDEABCABCDEXOABCABCDEABCABCDEABCABCDEXABCABCDEABCABCDEABCABCDEXOABCABCDEABCABCDEABCABCDEXABCABCDEABCABCDEABCABCDEXO"
> findprimepattern(s)
['ABCDEXO', 'ABC', 'ABCDE', 'ABCDEX'] ['ABC', 'ABCDE', 'ABC', 'ABCDE', 'ABC', 'ABCDEX', 'ABC', 'ABCDE', 'ABC', 'ABCDE', 'ABC', 'ABCDEXO', 'ABC', 'ABCDE', 'ABC', 'ABCDE', 'ABC', 'ABCDEX', 'ABC', 'ABCDE', 'ABC', 'ABCDE', 'ABC', 'ABCDEXO', 'ABC', 'ABCDE', 'ABC', 'ABCDE', 'ABC', 'ABCDEX', 'ABC', 'ABCDE', 'ABC', 'ABCDE', 'ABC', 'ABCDEXO', 'ABC', 'ABCDE', 'ABC', 'ABCDE', 'ABC', 'ABCDEX', 'ABC', 'ABCDE', 'ABC', 'ABCDE', 'ABC', 'ABCDEXO']
['ABCABCDE', 'ABCABCDEX', 'ABCABCDEXO'] ['ABCABCDE', 'ABCABCDE', 'ABCABCDEX', 'ABCABCDE', 'ABCABCDE', 'ABCABCDEXO', 'ABCABCDE', 'ABCABCDE', 'ABCABCDEX', 'ABCABCDE', 'ABCABCDE', 'ABCABCDEXO', 'ABCABCDE', 'ABCABCDE', 'ABCABCDEX', 'ABCABCDE', 'ABCABCDE', 'ABCABCDEXO', 'ABCABCDE', 'ABCABCDE', 'ABCABCDEX', 'ABCABCDE', 'ABCABCDE', 'ABCABCDEXO']
['ABCABCDEABCABCDEABCABCDEX', 'ABCABCDEABCABCDEABCABCDEXO'] ['ABCABCDEABCABCDEABCABCDEX', 'ABCABCDEABCABCDEABCABCDEXO', 'ABCABCDEABCABCDEABCABCDEX', 'ABCABCDEABCABCDEABCABCDEXO', 'ABCABCDEABCABCDEABCABCDEX', 'ABCABCDEABCABCDEABCABCDEXO', 'ABCABCDEABCABCDEABCABCDEX', 'ABCABCDEABCABCDEABCABCDEXO']
['ABCABCDEABCABCDEABCABCDEXABCABCDEABCABCDEABCABCDEXO'] ['ABCABCDEABCABCDEABCABCDEXABCABCDEABCABCDEABCABCDEXO', 'ABCABCDEABCABCDEABCABCDEXABCABCDEABCABCDEABCABCDEXO', 'ABCABCDEABCABCDEABCABCDEXABCABCDEABCABCDEABCABCDEXO', 'ABCABCDEABCABCDEABCABCDEXABCABCDEABCABCDEABCABCDEXO']
ABCABCDEABCABCDEABCABCDEXABCABCDEABCABCDEABCABCDEXO

And indeed the input string is the prime pattern ABCABCDEABCABCDEABCABCDEXABCABCDEABCABCDEABCABCDEXO concatenated 4 times.

Note effect of parameter trival on the function.

> findprimepattern('AAAA', False)
['AAAAA'] ['AAAAA']
AAAAA
> findprimepattern('AAAA', True)
['A'] ['A', 'A', 'A', 'A', 'A']
A

Former is single instance of pattern “AAAA” and latter is 4 instances of pattern “A”.

Permutation of a list

Let’s consider following requirement. We have list of objects, say questions, that we want to show our users. Let’s also assume that the list is very long, 500 messages.

Each message is unique and we want to

  1. show each question only once for each user.
  2. show the questions in different order to each user.

Each user responds to only few questions each day so it will take several months before they exhaust the list. We need to remember over each session where user left off.

Basic way to implement this is to take the list of questions, make a copy of it, shuffle it’s content and save it as part of users profile data. This has obvious drawbacks of the data duplication and maintenance. Also if the list changes, every single copy of the list has to be updated.

Cheaper way to implement this is to use prime walk. In this method we can keep only single copy of the list and store only number of answered questions in the users profile.

  1. First define the list, say we have 500 items.
  2. Find prime number that is greater than 500. For example 503.
  3. Select seed for each user. This seed is any number that is constant for an user, like user id or time of account creation.

Then we define function that uses these values. Note that this implementation is tail-recursive for code simplicity, efficient implementation should not use recursion.

var prime = 503;
var list = ["question1", "question2", ... "question500"];

function pick_nth(seed, n, p, l) {
    if(!n) return l[seed % p];
    return pick_nth((seed + l.length) % p, n - 1, l);
}

Now picking up n:th question from list becomes easy. For example:

pick_nth(userId, 0, prime, list);  // first question
pick_nth(userId, 1, prime, list);  // second question
...
pick_nth(userId, 499, prime, list);  // 500th question

Function returns items from list in different order for every unique user id. Only thing needed to keep track of progress is to store the number of questions user has answered.

Changing list requires some care. To delete question, just mark the entry in list to null to skip it. To add questions, add another list that is used after user has exhausted the original one.
If you change the list length, the order of selection changes and it’s not anymore guaranteed that user sees every question and sees every question only once.

Selecting biased random objects from a set

Normal random pick from set of objects will return every object equally likely.
For example, following function returns any character with equal probability from the argument string.

function randomselect(s) {
    return s[~~(Math.random() * s.length)];
}

This is usually the desired functionality, but games often require to pick one thing more likely than another. Typical use case are random drops from enemies, the common loot should drop most often and the good stuff should be much more rare.

Biased random pick is selecting object from a set randomly, so that some of them selected more often than others.

Calling randomselect("abc") will return each “a”, “b” and “c” with 33% probability.
Selection can be biased by just adding more of characters in the string. Calling randomselect("aaabbc") will return “a” with 50%,”b” with 33% and “c” with 16% propability.

Just adding entries in a list is bit clumsy, and more generic method is to use integral of probability density.

// define elements and probability for each
var elems = [
  ['a', 0.5],
  ['b', 0.333],
  ['c', 0.166]
]

// Biased random entry selection
function select(elems) {
    var r = Math.random();
    var ref = 0;

    for(var i=0; i < elems.length; i++) {
        var elem= elems[i];
        ref += elem[1];
        if (r < ref) return elem[0];
    }
    // This happens only if probabilities do not add up to 1
    return null;
}

Calling select(elems) returns a with 50%, b with 33.3% and c with 16.6% probability.
If defined probabilities do not sum to 1 you can get null, i.e not selection. In this case 0.5 + 0.333 + 0.166 is 0.999 so there is 0.1% probability (one out 1000) to get nothing.

For example to pick 100 times randomly from the set.

var elems = [
  ['a', 0.5],
  ['b', 0.333],
  ['c', 0.166]
];

var s = '';
for(var i=0; i < 100; i++) {
	s += select(elems);
}
console.log(s);

Run this with node to get string of randomly picked characters. Here you can see that selection bias works, half of the letters are ‘a’, one third ‘b’ and rest ‘c’.

$ node biased.js
babbaaaabaabbaabaccbaacabbbabbacbaabacabccacacacbacaaaaacbcaaaababaccbcbbbcaabcaabaccabbbcbcbacaabaa

Simple Cooldown and Rate Control Algorithm

Software that accepts requests from external, potentially unfriendly sources often need to implement some kind of rate restrictions or cooldowns to limit the amount of resources single user or app client can use.

Typical uses for cooldowns are

  • Limit number of login attempts per user
  • Limit rate of API calls
  • Batch incoming messages or jobs
  • Throttle access to shared resource

There are several approaches how limitations are calculated, but it usually boils down to simple TPS limit, i.e. transactions per second.

Well behaving rate limiter must allow bursts, as client may not have control over the network buffering and even when it sends on steady rate server can receive requests as chunked together. Therefore it does not make sense to enforce strict TPS limit that only understands about seconds but it’s better to use average over time. Optimally rate limiter should also not require lot of data management nor storage.

The algorithm

This is proposal for implementing described rate limiter with minimal overhead as variant of Leaky Bucket algorithm.

  • Let T be time span in seconds and n number of operations we want to allow per T. For example n = 3 and T = 60 would mean 3 operations per minute.
  • Let S = 0 be initial score for rate limited resource.

Every time resource is accessed,

  1. Compute trial score Sn where now is seconds since some globally fixed reference time.
    • Sn = now + T/n if S=0
    • Sn = Sn + T/n otherwise.
  2. Compare trial score Sn and current time.
    • Update S = now + T/n and allow request if Sn < now.
    • Reject request if Sn > now + T
    • Otherwise update score S = Sn and allow request.

This algorithm has some performance benefits as it only needs to store a single integer for score for each throttle and it does not need to be updated if request is rejected so its system load characteristics are well bounded. Algorithm also accepts max n requests in T as burst and averages to (n/T) TPS for constant load.
Implementation also fits well in distributed environments where redis or memcached can be used to atomically increment the score value.

Example

Here is javascript example for reference, save it in file and run with node.js.



var n = 3; // # of requests
var T = 60; // timestamp

var S = 0;
function throttle() {

	var Sn = S;
	var now = ~~(Date.now() / 1000);
	if (!Sn) {
		Sn = now + T/n;
	} else {
		Sn += T/n;
	}
	if ( Sn < now ) {
		Sn = now + T/n; // reset back to initial value
	} else if ( Sn > now + T ) {
		console.log('REJECT', S - now);
		return false; // throttled
	}
	S = Sn;

	console.log('ALLOW', S - now);
	return true; // ok
}


console.log('Allow '+n+' requests per '+T+'s');
var readline = require('readline');
var rl = readline.createInterface({
  input: process.stdin,
  output: process.stdout
});
rl.setPrompt('hit>');
rl.prompt();
rl.on('line', function() {
	// each enter is request
	throttle();
	rl.prompt();
});

When you run the example, hit enter to simulate request. It tells if you if it was accepted or rejected with the current score delta.

~/work $ node throttle.js
Allow 3 requests per 60s
hit>
ALLOW 20
hit>
ALLOW 40
hit>
ALLOW 60
hit>
REJECT 59
hit>
REJECT 55
hit>
REJECT 50
hit>
REJECT 45
hit>
ALLOW 59
hit>
REJECT 58
hit>

Adjust n and T to your liking. This implementation is also easy to change to return number of seconds client has to wait until trying again.

See also Simple popularity algorithm with decay how to apply this method for popularity scoring.

Simple popularity algorithm with decay

Most social web sites have user created objects such as pictures, posts, links, etc.. that need to be shown in some dynamic popularity or ranking order on website. The measurement of popularity is typically some user activity such as votes, downvotes, number of clicks or shares per object.

It would be trivial just to calculate total sum of the actions, and sort objects by the sum but this doesn’t take into account the bandwagon effect where people start voting the already top items so the top list freezes quickly and no new items have change to bubble on top. Another problem of this approach is that once object reaches high score it stays high even when no actions would anymore happen. Good popularity measurement needs to decay somehow.

First page of top list (typically 10-20 items)  should always show fresh “hot” items together with long time favorites. We want that older objects fade and need more and more actions to stay on the top list as time passes by.

Simple way to do this is with relative scoring, where base score  grows constantly so that score for younger object is always naturally higher than for old objects with same number of actions. Object creation timestamp can be used trivially as base score. For example:

object_score = created_timestamp + number_of_actions

created_timestamp is object creation time as number of hours or days from some arbitrary fixed reference time . The popularity list is simply all objects sorted by the score. New objects have naturally higher score and old ones need more and more actions to stay on top.

Example in Javascript

Fixed reference time 1st Jan 2000.  Timestamp as hours

Object A created on midnight 1st May 2012, 30 votes

(new Date('2012-5-1') - new Date('2000-1-1'))/1000/60/60 + 30 = 108126

Object B created on midnight 2nd May 2012 2 votes

(new Date('2012-5-2') - new Date('2000-1-1'))/1000/60/60 + 2 = 108122

So, on 2nd May the object A score would still be ahead B but the B needs only 5 more votes to pass A. The A will fall behind quickly unless it gets constantly more votes. In fact with this trivial formula object has to get 24 actions per day just to keep up with the increasing base score.

Major benefit of growing base score is that its one pass, i.e. app  updates score for every new action and thats it. There is no need to do computation on query and no periodic maintenance is required. It’s also compatible with Map-Reduce based views (e.g. CouchDB) as its not dependent of query time.

I assume many high traffic web sites (Digg for sure) use  some derivative of this basic method, but each site needs to tune it based on their actions, traffic and user behavior.

See also Simple Cooldown and Rate Control Algorithm for another perspective to time based scoring.