At some point, we have all heard a quote attributed to Donald Knuth, saying that:
Premature optimization is the root of all evil
There have been many fights over whether this applies, when an optimization is premature, and so on. This post is not meant to participate in these fights[1], but I do want to quote Donald Knuth in full before continuing:
Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.
The full quote says that we should avoid pessimizing maintenance in the name of performance, unless we know that the performance matters. Thus the topic of this post: some ways to speed up frequently written code without sacrificing the maintainability of the code.
We will look at two boring, yet commonly done, things, and see how we can easily lose (or gain) performance based on how we implement them. The two things are:
- iterating over containers
- bulk data transformation
Iterating over containers
C++11 added a new type of loop, called the range-based for loop (or for-each loop, or range-for loop). It serves to iterate over all elements in a range, as defined by the iterators returned from begin() and end(). Using range-based for loops greatly simplifies some patterns relying on iterators, such as iterating over all entries in a std::set.
// pre C++11
for (std::set<std::string>::const_iterator it = set.begin(); it != set.end(); ++it) {
    std::cout << *it << '\n';
}

// post C++11
for (auto const& elem : set) {
    std::cout << elem << '\n';
}
The most significant advantage of using range-for is that it is more limited than other forms of loops. Inside the loop you cannot refer to the index or iterator of the element[2], which helps communicate your intent: you want to iterate over all elements, and there is no index-based trickery going on.
There is also a secondary advantage, though, and that is its potential to improve runtime performance. We will look at some examples and compare the assembly generated for an index loop over a std::vector with the assembly generated when using a range loop instead.
Consider these two simple functions:
void indexed(std::vector<int>& vec) {
    for (size_t idx = 0; idx < vec.size(); ++idx) {
        vec[idx] *= 2;
    }
}

void foreach(std::vector<int>& vec) {
    for (auto& elem : vec) {
        elem *= 2;
    }
}
Both of them do the same thing, that is, multiply each element in a vector by 2. However, when compiled with GCC 10.2 at -O2, they do not compile into quite the same assembly (godbolt link):
indexed(std::vector<int, std::allocator<int> >&):
mov rax, QWORD PTR [rdi]
mov rdx, QWORD PTR [rdi+8]
sub rdx, rax
mov rcx, rdx
shr rcx, 2
je .L1
add rdx, rax
.L3:
sal DWORD PTR [rax]
add rax, 4
cmp rdx, rax
jne .L3
.L1:
ret
foreach(std::vector<int, std::allocator<int> >&):
mov rax, QWORD PTR [rdi]
mov rdx, QWORD PTR [rdi+8]
cmp rax, rdx
je .L9
.L11:
sal DWORD PTR [rax]
add rax, 4
cmp rax, rdx
jne .L11
.L9:
ret
The critical part, the inner loop itself, is the same for both: 4 instructions. However, indexed has 7 instructions before the loop, while foreach has only 4. While the difference is tiny, and with larger inputs completely negligible, we should understand where it comes from before moving on to more complex examples.
The explanation is quite simple. std::vector consists of 3 pointers[3]: one for the start of the allocated memory, one for the first empty slot, and one that points one past the allocation. This representation means that std::vector::size has to be implemented as a subtraction between two pointers, which adds the extra instructions to the start of indexed.
So, for a simple example, the performance advantage goes to the range-for loop, but it is only a constant-factor advantage. This means that the larger the actual input, the smaller the difference between the two loops is.
Now, let's take a look at a more complex example. More specifically, we will look at what happens if we call an opaque function inside the loop:
void foo(std::vector<int> const&);

void indexed(std::vector<std::vector<int>> const& in) {
    for (size_t idx = 0; idx < in.size(); ++idx) {
        foo(in[idx]);
    }
}

void foreach(std::vector<std::vector<int>> const& in) {
    for (auto& elem : in) {
        foo(elem);
    }
}
Again, both of them do the same thing, that is, call foo on every element in in, and again, they compile into different assembly. But this time, the assembly is significantly different (godbolt link):
indexed(std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > > const&):
mov rdx, QWORD PTR [rdi]
cmp QWORD PTR [rdi+8], rdx
je .L6
push r12
mov r12, rdi
push rbp
movabs rbp, -6148914691236517205
push rbx
xor ebx, ebx
.L3:
lea rax, [rbx+rbx*2]
add rbx, 1
lea rdi, [rdx+rax*8]
call foo(std::vector<int, std::allocator<int> > const&)
mov rdx, QWORD PTR [r12]
mov rax, QWORD PTR [r12+8]
sub rax, rdx
sar rax, 3
imul rax, rbp
cmp rbx, rax
jb .L3
pop rbx
pop rbp
pop r12
ret
.L6:
ret
foreach(std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > > const&):
push rbp
push rbx
sub rsp, 8
mov rbx, QWORD PTR [rdi]
mov rbp, QWORD PTR [rdi+8]
cmp rbx, rbp
je .L10
.L12:
mov rdi, rbx
add rbx, 24
call foo(std::vector<int, std::allocator<int> > const&)
cmp rbp, rbx
jne .L12
.L10:
add rsp, 8
pop rbx
pop rbp
ret
This time, the inner loops differ significantly, and foreach has a significant performance advantage. In indexed, the inner loop consists of 11 instructions, while in foreach it consists of only 5 instructions. The reason for this difference is the opaque call to foo.
The call to foo forbids the compiler from assuming that in is unchanged[4] between iterations. Without this assumption, in.size() has to be recalculated on every iteration[5], which requires loading in's members from memory, followed by a subtraction and a division[6] to get the actual size.
The foreach function does not have to reload in on every iteration for a very simple reason: range-for is syntax sugar for an iterator loop that stores the begin and end iterators locally, before the loop starts. Thanks to this, the range-for loop does not have to recalculate the size on every iteration[7]. There is a cost to this difference, though. If foo modifies in and causes a reallocation, then foreach invokes UB, while indexed works correctly.
Bulk data transformation
Another common operation is bulk transforming data from one representation to another, e.g. extracting a list of UserIds from JSON. Let's take a look at two simple functions:
std::vector<int> no_reserve(std::vector<int> const& input) {
    std::vector<int> ret;
    for (int elem : input) {
        ret.push_back(2 * elem);
    }
    return ret;
}

std::vector<int> do_reserve(std::vector<int> const& input) {
    std::vector<int> ret;
    ret.reserve(input.size());
    for (int elem : input) {
        ret.push_back(2 * elem);
    }
    return ret;
}
Both of these functions take a vector<int> and return a new vector, with all elements multiplied by two. The difference is that do_reserve reserves sufficient space in the return vector before filling it. Obviously this is going to perform better, but how much better? Let's benchmark it, using Catch2's benchmarking support:
#include <catch2/catch_test_macros.hpp>
#include <catch2/generators/catch_generators.hpp>
#include <catch2/benchmark/catch_benchmark_all.hpp>

#include <vector>

namespace {

std::vector<int> generate_ints(size_t sz) {
    std::vector<int> ret;
    ret.reserve(sz);
    for (size_t i = 0; i < sz; ++i) {
        ret.push_back(i % 128);
    }
    return ret;
}

std::vector<int> no_reserve(std::vector<int> const& input) { ... }
std::vector<int> do_reserve(std::vector<int> const& input) { ... }

} // end unnamed namespace

TEST_CASE("Benchmark reserving vectors", "[reserve][benchmark]") {
    const auto size = GENERATE(10'000,
                               100'000,
                               1'000'000,
                               10'000'000);
    auto data = generate_ints(size);
    CAPTURE(size);
    BENCHMARK("no reserve") {
        auto doubled = no_reserve(data);
        return doubled;
    };
    BENCHMARK("reserve") {
        auto doubled = do_reserve(data);
        return doubled;
    };
    SUCCEED();
}
Compiling the above in the Release configuration using Clang 10, and running it on my machine, I get these results:
| size | no_reserve | do_reserve | relative speedup |
|---|---|---|---|
| 10K | 9.89 ± 0.08 us | 7.42 ± 0.01 us | 1.15x |
| 100K | 94.34 ± 0.31 us | 76.56 ± 0.27 us | 1.23x |
| 1M | 1.01 ± 0.00 ms | 0.79 ± 0.00 ms | 1.27x |
| 10M | 36.06 ± 0.02 ms | 17.70 ± 0.01 ms | 2.04x |
The exact timings aren't important. What is important is that the speed-up increases with the size of the data. This is because the larger the input size is, the more times the no_reserve function ends up reallocating the return vector, and the more times the elements inside it are copied. Given that both functions perform the same transformation, the difference is entirely due to the superfluous reallocations.
When interpreting the numbers above, you should keep in mind that in our example the transformation work per element is trivial[8]. If the per-element work were less trivial, the relative speed-up would be smaller. An example with the inner loop changed to calculate exp(elem) is shown in this table:
| size | no_reserve | do_reserve | relative speedup |
|---|---|---|---|
| 10K | 119.15 ± 0.41 us | 115.94 ± 0.42 us | 1.03x |
| 100K | 1.19 ± 0.00 ms | 1.16 ± 0.00 ms | 1.03x |
| 1M | 12.13 ± 0.00 ms | 11.60 ± 0.00 ms | 1.05x |
| 10M | 183.38 ± 0.04 ms | 138.98 ± 0.03 ms | 1.32x |
As with using range-for to iterate over ranges, calling vector::reserve when we know the final size of a vector improves the code's performance without hurting its future maintainability. Thus, we should use it when possible.
However, calling vector::reserve multiple times on a single instance is very likely a performance bug. Repeated calls to vector::reserve on the same instance can easily lead to O(n^2) overall complexity for appending elements (or O(n) complexity for a single vector::push_back call). This problem commonly occurs when we insert elements in batches of, say, 100, and every time we "helpfully" reserve current_size + batch_size.
As a general rule, unless you are 100% sure you know what you are doing, reserve should never be called on one vector instance more than once during its lifetime. Ideally, you know what the final size will be and can reserve that amount outright. Less ideally, you can guard the call to reserve with a check that the vector instance does not have any capacity allocated yet. Doing so can improve the performance of repeated batch inserts without risking accidentally quadratic behaviour.
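Here is one way the guarded-reserve pattern could look. This is a sketch, and the names (append_batch, expected_total) are illustrative, not from any real API:

```cpp
#include <cstddef>
#include <vector>

// Sketch of the guarded-reserve pattern for batch inserts.
// append_batch and expected_total are illustrative names.
void append_batch(std::vector<int>& out,
                  std::vector<int> const& batch,
                  std::size_t expected_total) {
    // Reserve only if nothing has been allocated yet; calling
    // reserve(current_size + batch_size) on every batch can make the
    // overall insertion accidentally quadratic.
    if (out.capacity() == 0) {
        out.reserve(expected_total);
    }
    out.insert(out.end(), batch.begin(), batch.end());
}
```

If expected_total turns out to be an underestimate, the vector simply falls back to its normal geometric growth, so the guard costs nothing in correctness.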
Bonus: inserting newlines into streams
Even though std::format has been standardized in C++20, and should be preferred over formatting with streams, I expect that we will still be dealing with streams and stream formatting for a long time[9]. Because streams are commonly taught poorly, many people end up writing unintentionally pessimized code, and I would prefer it if they did not. Luckily, keeping with the theme of this post, the better-performing code is also more maintainable.
Let's say that we want to write a bunch of strings to a stream, with each string on its own line. A straightforward implementation of such a function could look like this:
void write_strings(std::ostream& out, std::vector<std::string> const& input) {
    for (auto const& elem : input) {
        out << elem << std::endl;
    }
}
This code works, but the use of std::endl to write the newlines is inefficient, because it does more than just write a newline: it also flushes the stream, which is an expensive operation. Keeping with the theme of this post, the way to remove this inefficiency is, once again, to state our intent explicitly in the code, and insert \n into the stream.
void write_strings(std::ostream& out, std::vector<std::string> const& input) {
    for (auto const& elem : input) {
        out << elem << "\n";
    }
}
But wait, why are we appending a string consisting of a single character to the stream? We only want to add a single character, not a string. This gives us our third implementation:
void write_strings(std::ostream& out, std::vector<std::string> const& input) {
    for (auto const& elem : input) {
        out << elem << '\n';
    }
}
I wrote a quick benchmark, in which these functions wrote out a bunch of strings[10] to a file. Running it on a Linux machine with an SSD as the main drive, I get the following numbers:
| n | std::endl | "\n" | '\n' | endl vs "\n" speedup | "\n" vs '\n' speedup |
|---|---|---|---|---|---|
| 100k | 1.90 ms | 1.61 ms | 1.60 ms | 1.18x | 1.01x |
| 1M | 19.59 ms | 16.79 ms | 16.47 ms | 1.17x | 1.02x |
| 10M | 196.43 ms | 169.23 ms | 166.93 ms | 1.16x | 1.01x |
From the numbers, you can see that going from std::endl to "\n" is a significant improvement, and that there is also a tiny improvement in going from "\n" (inserting the newline as a single-character string) to '\n' (inserting the newline as a single character).
Putting it all together, if you want to insert a newline into a stream, you should insert it as \n, whether as part of a string or as a single character. If you also want to flush the stream at the same time, you should use \n + std::flush, to explicitly document[11] your intent of flushing the stream, rather than using std::endl.
That's all for this post. Maybe the next one will come sooner than in 6 months.
1. My opinion on this boils down to "it depends, be smart about it, consider more than just the code in isolation". The long version could easily take up another 4k-word post, so I will not go there today. ↩︎
2. You can, of course, explicitly ask for the index by wrapping the iterated range in an enumerate, which then turns the returned element into an (index, element) pair. In doing so, you state up-front, at the loop declaration, that you are using the indices for something, which is still helpful in determining the programmer's intent. ↩︎
3. This is true for all three big implementations, libc++, libstdc++, and MS's STL, and also for some others, like EA STL and FB's Folly. The reason for this is that it improves the performance of insert-at-the-end operations (push_back, emplace_back, etc.) and of calls to end(), at the cost of worsened performance of size() and capacity() queries. As those should be used relatively rarely, this tradeoff is generally seen as beneficial. ↩︎
4. Yes, foo is allowed to modify in, and the only thing preventing that is that it would be a terrible programming style. While both functions take in by const&, the constness there only constrains the indexed/foreach functions, nothing else. Especially not the actual argument passed by const&. ↩︎
5. Actually, the loop in indexed could be modified to provide the same performance by writing it as for (size_t idx = 0, end = in.size(); idx < end; ++idx). However, this version is less common, and thus less maintainable, than the range-for loop. ↩︎
6. The compiler is smart enough to replace the division with a multiplication by a reciprocal constant, which has the same final effect but performs better. ↩︎
7. Note that just using iterators for the loop is not enough. If you use them naively and formulate your end condition as iter != in.end(), the loop will be simpler but will still reload in on every iteration. You have to store the iterator returned from in.end() outside the loop. ↩︎
8. Multiplying an int by two should be turned into a single bitshift by any optimizing compiler worth the name. ↩︎
9. One of the big reasons they won't die out soon is that appending to a stream has been the big customization point for how to serialize a type since C++98. There is significant inertia in that, and there is a reason format has special support for types that know how to write themselves to a stream but do not provide a custom formatter. ↩︎
10. This was a very hacky benchmark, so the input strings are nowhere near evenly distributed. Most of the strings have 30 characters, about 10% have 20 characters, and 1% have 10 characters. The strings are also strictly numeric. ↩︎
11. Yes, you could use std::endl and add a comment saying that you do want the flushing behaviour. However, doing so is more work than having a separate newline and stream flush, and it also makes for a worse maintenance experience later. Because the use of std::endl is usually an antipattern, everyone touching the code will have to read the comment and understand that this one use is fine, unlike usual. ↩︎