The Little Things: Everyday efficiencies

At some point, we have all heard a quote attributed to Donald Knuth, saying that:

Premature optimization is the root of all evil

There have been many fights over whether this applies, when is an optimization premature, and so on. This post is not meant to participate in these fights^[1], but I do want to quote Donald Knuth in full before continuing:

Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.

The full quote says that we should avoid pessimizing maintenance in the name of performance, unless we know that the performance matters. Thus the topic of this post: some ways to speed-up frequently written code without sacrificing the maintainability of the code.

We will look at two boring, yet commonly done, things, and see how we can easily lose (or gain) performance based on how we implement them. The two things are:

iterating over containers
bulk data transformation

Iterating over containers

C++11 added a new type of loop, called range-based for loop (or for-each loop, or range-for loop). It serves to iterate over all elements in a range, as defined by the iterators returned from begin() and end(). Using range-based for loops greatly simplifies some patterns relying on iterators, such as iterating over all entries in a std::set.

// pre C++11
for (std::set<std::string>::const_iterator it = set.begin(); it != set.end(); ++it) {
    std::cout << *it << '\n';
}

// post C++11
for (auto const& elem : set) {
    std::cout << elem  << '\n';
}

The most significant advantage of using range-for is that it is more limited than other forms of loops. Inside the loop you cannot refer to the index or iterator of the element^[2], which helps communicate your intent: you want to iterate all elements, and there is no index-based trickery going on.

There is also a secondary advantage, though, and that is its potential to improve runtime performance. We will look at some examples and compare the generated assembly for an index loop over a std::vector with the assembly generated when using a range loop instead.

Consider these two simple functions:

void indexed(std::vector<int>& in) {
    for (size_t idx = 0; idx < vec.size(); ++idx) {
        vec[idx] *= 2;
    }
}

void foreach(std::vector<int>& in) {
    for (auto& elem : vec) {
        vec *= 2;
    }
}

both of them do the same thing, that is, multiply each element in a vector by 2. However, when using GCC 10.2 -O2, they do not compile into quite the same assembly (godbolt link):

indexed(std::vector<int, std::allocator<int> >&):
        mov     rax, QWORD PTR [rdi]
        mov     rdx, QWORD PTR [rdi+8]
        sub     rdx, rax
        mov     rcx, rdx
        shr     rcx, 2
        je      .L1
        add     rdx, rax
.L3:
        sal     DWORD PTR [rax]
        add     rax, 4
        cmp     rdx, rax
        jne     .L3
.L1:
        ret

foreach(std::vector<int, std::allocator<int> >&):
        mov     rax, QWORD PTR [rdi]
        mov     rdx, QWORD PTR [rdi+8]
        cmp     rax, rdx
        je      .L9
.L11:
        sal     DWORD PTR [rax]
        add     rax, 4
        cmp     rax, rdx
        jne     .L11
.L9:
        ret

The critical part, the inner loop itself, is the same for both - 4 instructions, but indexed has 7 instructions before the loop, while foreach has only 4. While the difference is tiny, and with larger inputs completely negligible, we should understand where it comes from before moving onto more complex examples.

The explanation is quite simple. std::vector consists of 3 pointers^[3], one for the start of allocated memory, one for the first empty slot, and one that points one-past-the-allocation. This representation then means that std::vector::size has to be implemented as a subtraction between two pointers, which adds the extra instructions to the start of indexed.

So, for a simple example, the performance advantage goes to for-range loop, but it is only a constant factor advantage. This means that the larger the actual input, the smaller the difference between the two loops is.

Now, let's take a look at a more complex example. More specifically, we will look at what happens if we call an opaque function inside the loop:

void foo(std::vector<int> const&);

void indexed(std::vector<std::vector<int>> const& in) {
    for (size_t idx = 0; idx < in.size(); ++idx) {
        foo(in[idx]);
    }
}

void foreach(std::vector<std::vector<int>> const& in) {
    for (auto& elem : in) {
        foo(elem);
    }
}

again, both of them do the same thing, that is, call foo on every element in in, and again, they compile into different assembly. But this time, the assembly is significantly different (godbolt link):

indexed(std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > > const&):
        mov     rdx, QWORD PTR [rdi]
        cmp     QWORD PTR [rdi+8], rdx
        je      .L6
        push    r12
        mov     r12, rdi
        push    rbp
        movabs  rbp, -6148914691236517205
        push    rbx
        xor     ebx, ebx
.L3:
        lea     rax, [rbx+rbx*2]
        add     rbx, 1
        lea     rdi, [rdx+rax*8]
        call    foo(std::vector<int, std::allocator<int> > const&)
        mov     rdx, QWORD PTR [r12]
        mov     rax, QWORD PTR [r12+8]
        sub     rax, rdx
        sar     rax, 3
        imul    rax, rbp
        cmp     rbx, rax
        jb      .L3
        pop     rbx
        pop     rbp
        pop     r12
        ret
.L6:
        ret

foreach(std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > > const&):
        push    rbp
        push    rbx
        sub     rsp, 8
        mov     rbx, QWORD PTR [rdi]
        mov     rbp, QWORD PTR [rdi+8]
        cmp     rbx, rbp
        je      .L10
.L12:
        mov     rdi, rbx
        add     rbx, 24
        call    foo(std::vector<int, std::allocator<int> > const&)
        cmp     rbp, rbx
        jne     .L12
.L10:
        add     rsp, 8
        pop     rbx
        pop     rbp
        ret

This time, the inner loops differ significantly, and foreach has a significant performance advantage. In indexed, the inner loop consists of 11 instructions, while in foreach it consists of only 5 instructions. The reason for this difference is due to the opaque call to foo.

The call to foo forbids the compiler from assuming that in is unchanged^[4] between iterations. Without this assumption, in.size() has to be recalculated every iteration^[5], which requires loading in in's members from memory, followed by a subtraction+division^[6] to get the actual size.

The foreach function does not have to reload in on every iteration for a very simple reason: range-for is syntax sugar for an iterator loop that stores the begin and end iterators locally, before the loop starts. Thanks to this, the range-for loop does not have to recalculate size on every iteration^[7]. There is a cost to this difference though. If foo does modify in, and causes reallocation, then foreach invokes UB, while indexed works correctly.

Bulk data transformation

Another common operation is bulk transforming data from one representation to another, e.g. extracting list of UserIds from JSON. Let's take a look at two simple functions:

std::vector<int> no_reserve(std::vector<int> const& input) {
    std::vector<int> ret;
    for (int elem : input) {
        ret.push_back(2 * elem);
    }
    return ret;
}

std::vector<int> do_reserve(std::vector<int> const& input) {
    std::vector<int> ret;
    ret.reserve(input.size());
    for (int elem : input) {
        ret.push_back(2 * elem);
    }
    return ret;
}

Both these functions take a vector<int> and return new vector, with all elements multiplied by two. The difference is that do_reserve reserves sufficient space in the return vector before filling it. Obviously this is going to perform better, but how much? Let's benchmark it, using Catch2's benchmarking support:

#include <catch2/catch_test_macros.hpp>
#include <catch2/generators/catch_generators.hpp>
#include <catch2/benchmark/catch_benchmark_all.hpp>
#include <vector>

namespace {

std::vector<int> generate_ints(size_t sz) {
    std::vector<int> ret;
    ret.reserve(sz);
    
    for (size_t i = 0; i < sz; ++i) {
        ret.push_back(i % 128);
    }
    
    return ret;
}

std::vector<double> no_reserve(std::vector<int> const& input) { ... }
std::vector<double> do_reserve(std::vector<int> const& input) { ... }

} // end unnamed namespace


TEST_CASE("Benchmark reserving vectors", "[reserve][benchmark]") {
    const auto size = GENERATE(10'000,
                              100'000,
                            1'000'000,
                           10'000'000);
    auto data = generate_ints(size);
    CAPTURE(size);
    BENCHMARK("no reserve") {
        auto tripled = no_reserve(data);
        return tripled;
    };
    BENCHMARK("reserve") {
        auto tripled = do_reserve(data);
        return tripled;
    };
    SUCCEED();
}

Compiling the above with in Release configuration, using Clang 10 and running it on my machine, I get these results:

size	`no_reserve`	`do_reserve`	relative speedup
10K	9.89 ± 0.08 us	7.42 ± 0.01 us	1.15x
100K	94.34 ± 0.31 us	76.56 ± 0.27 us	1.23x
1M	1.01 ± 0.00 ms	0.79 ± 0.00 ms	1.27x
10M	36.06 ± 0.02 ms	17.70 ± 0.01 ms	2.04x

The exact timings aren't important. What is important is that the speed-up increases with the increasing size of the data. The speed-up increases because the larger the input size is, the more times the no_reserve function ends up reallocating the return vector, and the more times the elements inside it are copied. Given that both functions perform the same transformation, the difference is entirely due to the superfluous reallocations.

When interpreting the numbers above, you should keep in mind that in our example, the transformation work per element is trivial^[8]. If the per-element work was less trivial, the relative speed-up would be lesser. An example with the inner loop changed to calculate exp(elem) is shown in this table:

size	`no_reserve`	`do_reserve`	relative speedup
10K	119.15 ± 0.41 us	115.94 ± 0.42 us	1.03x
100K	1.19 ± 0.00 ms	1.16 ± 0.00 ms	1.03x
1M	12.13 ± 0.00 ms	11.60 ± 0.00 ms	1.05x
10M	183.38 ± 0.04 ms	138.98 ± 0.03 ms	1.32x

As with using range-for to iterate ranges, calling vector::reserve when we know the final size of a vector will improve the code's performance without impacting future maintainability of the code. Thus, we should use it when possible.

However, calling vector::reserve multiple times on a single instance is very likely a performance bug. Repeat calls to vector::reserve on the same instance can easily lead to O(n^2) overall complexity for appending elements (or O(n) for single vector::push_back call). This problem commonly occurs when we insert elements in batches of, say, 100, and every time we "helpfully" reserve current_size + batch_size.

As a general rule, unless you 100% know what you are doing, reserve should never be called on one vector instance more than once during its lifetime. Ideally, you know what the final size will be and can reserve that outright. Less ideally, you can guard the call to reserve with a check that the vector instance does not have allocated any capacity yet. Doing so can improve the performance of repeated batch inserts without risking the accidentally quadratic behaviour.

Bonus: inserting newlines into streams

Even though std::format has been standardized into C++20, and should be preferred to formatting using streams, I expect that we will still be dealing with streams and stream formatting for a long time^[9]. Because streams are commonly poorly taught, many people end up writing unintentionally pesimized code, and I would prefer if they did not. Luckily, keeping with theme of this post, the better performing code is also more maintainable.

Let's say that we want to write a bunch of strings to a stream, with each string being on its own line. A straightforward implementation of such function could look like this:

void write_strings(std::ostream& out, std::vector<std::string> const& input) {
    for (auto const& elem : input) {
        out << elem << std::endl;
    }
}

This code works, but the use of std::endl to write the newlines is inefficient because it does more than just write a newline. It also flushes the stream, which is an expensive operation. Keeping with the theme of this post, the way to remove this inefficiency is, once again, to explicitly state our intent in the code, and insert \n to the stream.

void write_strings(std::ostream& out, std::vector<std::string> const& input) {
    for (auto const& elem : input) {
        out << elem << "\n";
    }
}

But wait, why are we appending a string consisting of a single character to the stream? We only want to add single character, not a string. This gives us our third implementation:

void write_strings(std::ostream& out, std::vector<std::string> const& input) {
    for (auto const& elem : input) {
        out << elem << '\n';
    }
}

I wrote a quick benchmark, where these functions wrote out a bunch of strings^[10] to a file. Running it on Linux machine with SSD as the main drive, I get the following numbers:

n	`std::endl`	`"\n"`	`'\n'`	`endl` vs `"\n"` speedup	`"\n"` vs `'\n'` speedup
100k	1.90 ms	1.61 ms	1.60 ms	1.18x	1.01x
1M	19.59 ms	16.79 ms	16.47 ms	1.17x	1.02x
10M	196.43 ms	169.23 ms	166.93 ms	1.16x	1.01x

From the numbers, you can see that going from std::endl to "\n" is a significant improvement, and there is also a tiny improvement going from "\n" (inserting the newline as a string of single character) to '\n' (inserting the newline as a single character).

Putting it all together, if you want to insert a newline to a stream, you should insert it as \n, whether as a part of a string, or as a single character. If you want to also flush the stream at the same time, you should use \n + std::flush, to explicitly document^[11] your intent of flushing the stream, rather than using std::endl.

That's all for this post. Maybe the next one will come in sooner than in 6 months.

My opinion on this boils down to "it depends, be smart about it, consider more than just the code in isolation". The long version could easily take up another 4k words post, so I will not go there today. ↩︎
You can, of course, explicitly ask for the index by wrapping the iterated range in an enumerate, which then turns the returned element into an (index, element) pair. In doing so, you state you are using the indices for something up-front, at the loop declaration, which is still helpful in determining the programmer's intent. ↩︎
This is true for all three big implementations, libc++, libstdc++, MS's STL, and also some other, like EA STL, and FB's Folly. The reason for this is that it improves the performance of insert-at-the-end operations (push_back, emplace_back, etc), and calls to end(), at the cost of worsened performance of size() and capacity() queries. As those should be used relatively rarely, this tradeoff is generally seen as beneficial. ↩︎
Yes, foo is allowed to modify in, and the only thing preventing that is that it would be a terrible programming style. While both functions take in by const&, the constness there only constrains the indexed/foreach functions, nothing else. Especially not the actual argument passed by the const&. ↩︎
Actually, the loop in indexed could be modified to provide the same performance by writing it as for (size_t idx = 0, end = in.size(); idx < end; ++idx). However, this version is less common, and thus less maintainable than the range-for loop. ↩︎
The compiler is smart enough to replace the division with multiplication by reciprocal constant, which has the same final effect but performs better. ↩︎
Note that just using iterators for the loop is not enough. If you use them naively and formulate your end condition as iter != in.end(), then the loop will be simpler but will still try to reload in in every iteration. You have to store the iterator returned from in.end() outside the loop. ↩︎
Multiplying int by two should be turned into a single bitshift by any optimizing compiler worth the name. ↩︎
One of the big reasons they won't die out soon is that appending to a stream was the big customization point on how to serialize a type since C++98. There is significant inertia in that, and there is a reason format has special support for types that know how to write themselves to a stream but do not provide custom formatter. ↩︎
This was a very hacky benchmark, so the input strings are nowhere near evenly distributed. Most of the strings have 30 characters, about 10% have 20 characters, and 1% has 10 characters. The strings are also strictly numeric. ↩︎
Yes, you could use std::endl and add a comment saying that you do want the flushing behaviour. However, doing so is more work than having a separate newline and stream flush, and also makes for a worse maintenance experience later. Because the use of std::endl is usually an antipattern, everyone touching the code will have to read the comment and understand that this one use is fine, unlike usual. ↩︎

← The Coding Nest

Iterating over containers

Bulk data transformation

Bonus: inserting newlines into streams