Quick and robust C++ CSV reader with Boost

This is a quick and simple CSV reader based on the Boost regular expression token iterator. The parser splits the input with regular expressions and returns the result as a collection of vectors of strings.
The regular expressions neatly handle many of the complicated edge cases, such as empty columns and quoted text. Quoted fields that contain line breaks are not supported, because the input is split into lines before the fields are examined.

Parser code

#include <string>
#include <vector>

#include <boost/regex.hpp>

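// used to split the file into lines
const boost::regex linesregx("\\r\\n|\\n\\r|\\n|\\r");

// used to split each line into tokens, assuming ',' as the column separator;
// the lookahead accepts a comma only when an even number of quotes follows it,
// i.e. when the comma is outside quoted text
const boost::regex fieldsregx(",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))");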

typedef std::vector<std::string> Row;

std::vector<Row> parse(const char* data, unsigned int length)
{
    std::vector<Row> result;

    // iterator splits data to lines
    boost::cregex_token_iterator li(data, data + length, linesregx, -1);
    boost::cregex_token_iterator end;

    while (li != end) {
        std::string line = li->str();
        ++li;

        // Split line to tokens
        boost::sregex_token_iterator ti(line.begin(), line.end(), fieldsregx, -1);
        boost::sregex_token_iterator end2;

        std::vector<std::string> row;
        while (ti != end2) {
            std::string token = ti->str();
            ++ti;
            row.push_back(token);
        }
        if (!line.empty() && line.back() == ',') {
            // the token iterator does not report a trailing empty token,
            // so add the empty last column explicitly
            row.push_back("");
        }
        result.push_back(row);
    }
    return result;
}
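
To see what the field expression does on its own, here is a minimal sketch that splits a single line. It is an illustration only: std::regex (C++11) stands in for boost::regex so that the snippet builds without Boost, and split_fields is a hypothetical helper name that is not part of the parser above.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper for illustration; std::regex stands in for boost::regex.
std::vector<std::string> split_fields(const std::string& line)
{
    // A comma counts as a separator only if an even number of quotes follows it.
    static const std::regex fields(",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))");

    std::vector<std::string> row;
    if (line.empty()) {
        return row; // blank line: no columns
    }
    // -1 selects the text between the matches instead of the matches themselves
    std::sregex_token_iterator it(line.begin(), line.end(), fields, -1);
    std::sregex_token_iterator end;
    for (; it != end; ++it) {
        row.push_back(it->str());
    }
    if (line.back() == ',') {
        // a trailing empty token is not reported, so add the last column by hand
        row.push_back("");
    }
    return row;
}
```

Calling split_fields on the line 8,y,"some, commas, here," should give three fields, because every comma inside the quotes is followed by an odd number of quote characters and is therefore skipped.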

Example

CSV data with common problem cases, such as empty quotes, commas inside quotes and an empty last column:

a,b,c
1,"cat",3
",2",dog,4
3,a b,5
4,empty,
5,,empty
6,"",empty2
7,x,long story no commas
8,y,"some, commas, here,"

Read and parse the CSV data above and print the parsed result:

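#include <fstream>
#include <iostream>

int main(int argc, char* argv[])
{
	// read example file; leave room for the terminating '\0'
	std::ifstream infile;
	infile.open("example.csv");
	char buffer[1024];
	infile.read(buffer, sizeof(buffer) - 1);
	// gcount() is the number of characters actually read; tellg() returns -1
	// once read() has hit the end of the file
	std::streamsize length = infile.gcount();
	buffer[length] = '\0';

	// parse file
	std::vector<Row> result = parse(buffer, static_cast<unsigned int>(length));

	// print out result
	for(size_t r=0; r < result.size(); r++) {
		Row& row = result[r];
		if (row.empty()) {
			// blank input line; row.size() - 1 would underflow here
			continue;
		}
		for(size_t c=0; c < row.size() - 1; c++) {
			std::cout << row[c] << "\t";
		}
		std::cout << row.back() << std::endl;
	}
}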
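
The fixed-size buffer is enough for the example file, but it truncates anything larger. Here is a minimal sketch of reading the whole file into a std::string instead. read_file is a hypothetical helper name, not part of the original code, and it returns an empty string if the file cannot be opened.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical helper: returns the whole file as a string,
// or an empty string if the file cannot be opened.
std::string read_file(const std::string& path)
{
    // binary mode keeps the line endings exactly as stored in the file
    std::ifstream in(path.c_str(), std::ios::binary);
    std::ostringstream contents;
    contents << in.rdbuf(); // copy the complete stream buffer
    return contents.str();
}
```

With a string called data holding the result, the parser could then be called as parse(data.data(), static_cast<unsigned int>(data.size())).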

Output

$ ./reader
a      	b      	c
1      	"cat"  	3
",2"   	dog    	4
3      	a b    	5
4      	empty
5      	       	empty
6      	""     	empty2
7      	x      	long story no commas
8      	y      	"some, commas, here,"
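
As the output shows, quoted fields keep their surrounding quotes. If the bare values are needed, each token can be post-processed. The sketch below assumes the common CSV convention that a doubled quote inside a quoted field stands for one literal quote. unquote is a hypothetical helper name, not part of the original code.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: strips the surrounding quotes from a quoted field and
// collapses doubled quotes; unquoted fields are returned unchanged.
std::string unquote(const std::string& field)
{
    if (field.size() < 2 || field.front() != '"' || field.back() != '"') {
        return field;
    }
    std::string out;
    // walk the characters between the opening and the closing quote
    for (std::size_t i = 1; i + 1 < field.size(); ++i) {
        out += field[i];
        // a doubled quote stands for one literal quote: skip the second one
        if (field[i] == '"' && i + 2 < field.size() && field[i + 1] == '"') {
            ++i;
        }
    }
    return out;
}
```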

See the full example code on GitHub: https://github.com/tikonen/blog/tree/master/boostcsvreader
