Quick and robust C++ CSV reader with boost
September 17, 2016
This is a quick and simple CSV reader based on the Boost regular expression token iterator. The parser splits the input with regular expressions and returns the result as a collection of vectors of strings.
Regular expressions neatly handle many of the complicated edge cases, such as empty columns and quoted text.
Parser code
#include <boost/regex.hpp>
#include <string>
#include <vector>

// used to split the file in lines
const boost::regex linesregx("\\r\\n|\\n\\r|\\n|\\r");

// used to split each line to tokens, assuming ',' as column separator
const boost::regex fieldsregx(",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))");

typedef std::vector<std::string> Row;

std::vector<Row> parse(const char* data, unsigned int length)
{
    std::vector<Row> result;

    // iterator splits data to lines
    boost::cregex_token_iterator li(data, data + length, linesregx, -1);
    boost::cregex_token_iterator end;

    while (li != end) {
        std::string line = li->str();
        ++li;

        // Split line to tokens
        boost::sregex_token_iterator ti(line.begin(), line.end(), fieldsregx, -1);
        boost::sregex_token_iterator end2;

        std::vector<std::string> row;
        while (ti != end2) {
            std::string token = ti->str();
            ++ti;
            row.push_back(token);
        }
        // check for an empty line first: back() on an empty string is undefined
        if (!line.empty() && line.back() == ',') {
            // last character was a separator, so the last column is empty
            row.push_back("");
        }
        result.push_back(row);
    }
    return result;
}
Example
CSV data with common problem cases, such as empty quotes, commas inside quotes and empty last column.
a,b,c
1,"cat",3
",2",dog,4
3,a b,5
4,empty,
5,,empty
6,"",empty2
7,x,long story no commas
8,y,"some, commas, here,"
Read and parse the CSV data above and output the parsed result
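#include <cstring>
#include <fstream>
#include <iostream>

int main(int argc, char* argv[])
{
    // read example file (at most sizeof(buffer) - 1 bytes, leaving room for the terminator)
    std::ifstream infile;
    infile.open("example.csv");
    char buffer[1024];
    infile.read(buffer, sizeof(buffer) - 1);
    // gcount() is the number of bytes actually read;
    // tellg() returns -1 once the read has hit the end of the file
    buffer[infile.gcount()] = '\0';

    // parse file
    std::vector<Row> result = parse(buffer, strlen(buffer));

    // print out result
    for (size_t r = 0; r < result.size(); r++) {
        Row& row = result[r];
        if (row.empty()) {
            continue;
        }
        for (size_t c = 0; c + 1 < row.size(); c++) {
            std::cout << row[c] << "\t";
        }
        std::cout << row.back() << std::endl;
    }
}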
Output
$ ./reader
a	b	c
1	"cat"	3
",2"	dog	4
3	a b	5
4	empty	
5		empty
6	""	empty2
7	x	long story no commas
8	y	"some, commas, here,"
See the full example code on GitHub: https://github.com/tikonen/blog/tree/master/boostcsvreader