Simpler and faster parsing code with std::views::split

Parsing text files is often confusing irrespective of your programming language. It can also be surprisingly slow.

As an example, let us consider the following problem. You have a comma-separated-value file. It is a text file made of several lines and each line is separated into fields by a comma. Excel spreadsheets are often exported as comma-separated-value files. You want to extract just one column and iterate over it.

I asked ChatGPT for a solution in C++, and it gave me the following code:

std::vector<std::string> get_column(const std::string& input, 
       size_t row_number, char delimiter = ',') {
    std::vector<std::string> rows;
    std::istringstream iss(input);
    std::string row;

    // Split the string into rows
    while (std::getline(iss, row)) {
        rows.push_back(row);
    }

    std::vector<std::string> result;
    for (const auto& r : rows) {
        std::istringstream row_stream(r);
        std::string field;
        size_t index = 0;

        while (std::getline(row_stream, field, delimiter)) {
            if (index == row_number) {
                result.push_back(field);
                break;
            }
            ++index;
        }
    }

    return result;
}
        

The function takes a string input representing CSV-like data, a column index row_number, and an optional delimiter (defaulting to comma). It splits the input string into rows, then for each row, it further splits the row into fields based on the delimiter. It collects the field at the specified column index from each row into a vector of strings, which it then returns.

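It is pretty bad code: it copies every row into a temporary vector, builds a new std::istringstream for each row, and copies each candidate field into a std::string. I do not think that any professional C++ programmer would ever write such bad code. But then again, I could be surprised.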

If you have a modern system with C++20, you can write much less code and get high performance with std::views::split. Let me demonstrate:

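#include <cstddef>
#include <iterator>
#include <ranges>
#include <string_view>

// Note: despite its name, row_number is the index of the column to extract.
auto get_column_cxx20(std::string_view data, 
    size_t row_number, char delimiter = ',') {
  // Lazily split the input into rows; nothing is copied.
  auto rows = data | std::views::split('\n');
  auto column =
      rows |
      std::views::transform(
          [delimiter, row_number](auto &&row) {
            // Split the row into fields and step to the requested one.
            // This assumes that the row has more than row_number fields.
            auto fields = row | std::views::split(delimiter);
            auto it = std::ranges::begin(fields);
            std::advance(it, row_number);
            return *it;
          }) |
      // Convert each field (a subrange of characters) to a std::string_view.
      std::views::transform([](auto &&rng) -> std::string_view {
        return std::string_view(&*rng.begin(), 
             std::ranges::distance(rng));
      });
  return column;
}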

The function uses C++20’s ranges and views to extract a specific column from CSV-like data represented as a std::string_view. It first splits the input data into rows using newline characters, then applies a series of transformations: it splits each row into fields based on a delimiter, selects the field at the specified column index (row_number), and finally converts these selected fields back into std::string_view objects. This approach is memory-efficient as it avoids copying data, using lazy evaluation to only process data when necessary.

I wrote a small benchmark: I take an existing CSV file, ask for the second column, and add up the widths of the fields in that column. Using LLVM 16 and an Apple M2 processor, I get the following results:


[Figure: benchmark results]

Importantly, there is no disk access. Everything is in memory. So the old-school approach is entirely compute-bound at a puny 0.11 GB/s. My source code is available on GitHub.

I did not attempt to tune the std::views::split approach. I suspect that we could get more performance out of it by further tuning. But it is already 20 times faster than the naive approach.

The main annoyance with applying std::views::split to strings is that it works naturally with ranges and subranges whereas I much prefer working with std::string_view instances. Thankfully, you can write your own converter and reuse it as needed:

auto to_view = std::views::transform([](auto &&rng) -> std::string_view { 
    return std::string_view(&*rng.begin(), std::ranges::distance(rng)); });


auto get_column_cxx20more(std::string_view data, 
    size_t row_number, char delimiter = ',') {
  auto rows = data | std::views::split('\n');
  auto column =
      rows |
      std::views::transform(
          [delimiter, row_number](auto &&row) {
            auto fields = row | std::views::split(delimiter);
            auto it = std::ranges::begin(fields);
            std::advance(it, row_number);
            return *it;
          }) | to_view;
  return column;
}        
