The Little Things: The Missing Performance in std::vector
std::vector is often said to be the default container. This is because it provides a simple abstraction over a growable contiguous chunk of memory, thus providing good baseline performance for common operations, such as adding more elements or iterating over them.
However, many non-standard reimplementations of std::vector have emerged over time. These are usually driven by performance considerations, e.g. avoiding allocations for small vectors[1], having static capacity[2], tightly integrating with the OS for faster reallocations[3], or optimizing moving elements with bitwise copying[4].
Most of these cannot be merged into std::vector without changing it significantly, e.g. the small vector optimization completely changes the expected complexity of moving the vector[5]. Others require further library and language changes, such as support for bitwise moves. But recently, I ran into a simple API change that can improve the performance of a common usage pattern by 10+%.
Let's consider a simple function that takes one vector and returns a transformation of the input.
std::vector<size_t> convert_to_indices(std::vector<Key> const& keys) {
    std::vector<size_t> indices;
    for (Key k : keys) {
        indices.push_back(to_index(k));
    }
    return indices;
}
As I mentioned in a previous article, this implementation leaves a bunch of performance on the table. We can easily make it faster by reserving enough memory for indices ahead of time, like so:
std::vector<size_t> convert_to_indices(std::vector<Key> const& keys) {
    std::vector<size_t> indices;
    indices.reserve(keys.size());
    for (Key k : keys) {
        indices.push_back(to_index(k));
    }
    return indices;
}
But this code still leaves some performance on the table, depending on the compiler[6] and stdlib implementation. So, how can we fix this?
We cannot, not without changing std::vector. The lost performance comes from the fact that vector::push_back has to check whether it needs to grow the vector, even though that will never happen in the convert_to_indices function. A sufficiently smart compiler could optimize this check out, but neither GCC nor Clang does (and MSVC makes a complete hash of things).
What we need is a function that inserts a new element into the vector but does not check for capacity[7]. I like the name push_back_unchecked for this functionality, so I will use it in the rest of this post.
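To make the idea concrete, here is a minimal sketch of what such a member function could look like, together with convert_to_indices rewritten to use it. This is a hypothetical API; the member names (m_next, m_end) are only illustrative of a vector stored as three pointers, not actual std::vector internals:

// Hypothetical member of a vector represented by three pointers
// (m_first, m_next, m_end). The caller promises there is spare
// capacity; violating that promise is UB.
void push_back_unchecked(T&& elem) {
    // No capacity check and no growth path, just construct and bump.
    std::construct_at(m_next, std::move(elem));
    ++m_next;
}

// How the earlier example would use it (hypothetical API):
std::vector<size_t> convert_to_indices(std::vector<Key> const& keys) {
    std::vector<size_t> indices;
    indices.reserve(keys.size());
    for (Key k : keys) {
        // Safe: we reserved keys.size() slots above.
        indices.push_back_unchecked(to_index(k));
    }
    return indices;
}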
Benchmarks
How much of a difference does using push_back_unchecked make? To figure this out, we would need a reimplementation of std::vector which also includes push_back_unchecked. So I made a very simple one. It only supports push_back and push_back_unchecked, and a few miscellaneous functions like size or reserve.
Let's run two different benchmarks, both following the basic structure from the snippet earlier, that is, create an output vector, reserve enough space for all elements, and then add the elements to the output vector.
The transformation operation is going to be:
- The Key -> index conversion snippet from earlier
- Multiplying every element by 2
For each of these, we will compare the performance of std::vector::push_back, nonstd::vector::push_back, and nonstd::vector::push_back_unchecked[8].
You can check the benchmark implementation in the GitHub repository.
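To give a rough idea of the shape of each benchmark, the measured body looks something like the sketch below (simplified; the timing harness, input generation, and the trick that prevents the compiler from optimizing the result away are all in the repository):

#include <cstddef>
#include <vector>

// Sketch of the "copy + mult" benchmark body, parameterized on the
// vector type so the same code can run against std::vector and
// nonstd::vector. The _unchecked variant only differs in the call
// inside the loop.
template <typename Vector>
Vector copy_and_mult(std::vector<std::size_t> const& input) {
    Vector out;
    out.reserve(input.size());
    for (std::size_t v : input) {
        out.push_back(v * 2);   // or out.push_back_unchecked(v * 2)
    }
    return out;
}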
I ended up collecting results for the three main compilers, specifically MSVC 19.29.30151[9], GCC 10.5, and Clang 15.0.7 (using libc++-15)[10]. GCC and Clang were run under WSL2 and MSVC on Windows 10 directly.
Results
I collated the results for all benchmarks into one table per compiler.
- The std column is the result of using std::vector::push_back
- The nonstd column is the result of using nonstd::vector::push_back
- The unchecked column is the result of using nonstd::vector::push_back_unchecked
- \(\frac{\textbf{std}}{\textbf{nonstd}}\) is the speedup from using nonstd::vector::push_back instead of std::vector::push_back
- \(\frac{\textbf{nonstd}}{\textbf{unchecked}}\) is the speedup from using nonstd::vector::push_back_unchecked instead of nonstd::vector::push_back
Generally we want to see that the \(\frac{\textbf{std}}{\textbf{nonstd}}\) is around 1, and \(\frac{\textbf{nonstd}}{\textbf{unchecked}}\) is more than 1.
Clang
| benchmark | n | std | nonstd | unchecked | \(\frac{\textbf{std}}{\textbf{nonstd}}\) | \(\frac{\textbf{nonstd}}{\textbf{unchecked}}\) |
|---|---|---|---|---|---|---|
| copy + mult | 1'000 | 540 ± 48 ns | 551 ± 27 ns | 93 ± 2 ns | 0.98 | 5.92 |
| | 100'000 | 51 ± 2.0 us | 51 ± 4.1 us | 10 ± 3.0 us | 1.00 | 5.10 |
| | 10'000'000 | 9.5 ± 0.2 ms | 9.5 ± 0.3 ms | 7.3 ± 0.7 ms | 1.00 | 1.30 |
| to_index | 1'000 | 460 ± 19 ns | 470 ± 40 ns | 168 ± 10 ns | 0.98 | 2.80 |
| | 100'000 | 44 ± 3.5 us | 44 ± 8.9 us | 15 ± 4.6 us | 1.00 | 2.93 |
| | 10'000'000 | 14.2 ± 0.1 ms | 14.5 ± 0.2 ms | 13.1 ± 0.1 ms | 0.98 | 1.11 |
The results from Clang+libc++ are about what we want to see. The performance of using plain push_back is roughly the same between nonstd::vector and libc++'s std::vector, and push_back_unchecked provides improvements ranging from significant for N == 10'000'000 to massive for smaller inputs.
GCC
| benchmark | n | std | nonstd | unchecked | \(\frac{\textbf{std}}{\textbf{nonstd}}\) | \(\frac{\textbf{nonstd}}{\textbf{unchecked}}\) |
|---|---|---|---|---|---|---|
| copy + mult | 1'000 | 546 ± 12 ns | 540 ± 17 ns | 167 ± 8 ns | 1.01 | 3.23 |
| | 100'000 | 56 ± 11.4 us | 55 ± 4.9 us | 12 ± 2.9 us | 1.02 | 4.58 |
| | 10'000'000 | 9.7 ± 0.2 ms | 9.7 ± 0.1 ms | 7.5 ± 0.8 ms | 1.00 | 1.29 |
| to_index | 1'000 | 528 ± 29 ns | 552 ± 91 ns | 175 ± 17 ns | 0.95 | 3.15 |
| | 100'000 | 51 ± 7.7 us | 52 ± 8.2 us | 15 ± 4.8 us | 0.98 | 3.47 |
| | 10'000'000 | 15.1 ± 0.1 ms | 15.2 ± 0.1 ms | 13.4 ± 1.6 ms | 0.99 | 1.13 |
The results from GCC+libstdc++ are very similar to those from Clang+libc++. Compared to Clang's results, there is a slightly greater difference between using std::vector and nonstd::vector, but the differences are still minor. GCC does seem to be worse at taking advantage of push_back_unchecked in the copy + mult benchmark, where it provides significantly worse performance than Clang for small input sizes.
MSVC
| benchmark | n | std | nonstd | unchecked | \(\frac{\textbf{std}}{\textbf{nonstd}}\) | \(\frac{\textbf{nonstd}}{\textbf{unchecked}}\) |
|---|---|---|---|---|---|---|
| copy + mult | 1'000 | 792 ± 36 ns | 599 ± 32 ns | 525 ± 33 ns | 1.32 | 1.14 |
| | 100'000 | 73 ± 2.4 us | 54 ± 3.2 us | 46 ± 3.9 us | 1.35 | 1.17 |
| | 10'000'000 | 24.0 ± 2.6 ms | 13.3 ± 5.1 ms | 11.5 ± 4.1 ms | 1.80 | 1.16 |
| to_index | 1'000 | 791 ± 39 ns | 563 ± 35 ns | 397 ± 23 ns | 1.40 | 1.42 |
| | 100'000 | 91 ± 31.5 us | 50 ± 5.9 us | 34 ± 9.5 us | 1.82 | 1.47 |
| | 10'000'000 | 29.4 ± 4.3 ms | 20.4 ± 7.3 ms | 18.4 ± 6.6 ms | 1.44 | 1.11 |
The results for MSVC are, in a way, the most interesting ones. This is because nonstd::vector already significantly outperforms MS's std::vector. MSVC also does not take advantage of push_back_unchecked to generate massively faster code for small inputs the way GCC and Clang do.
Conclusion
From the benchmarks, push_back_unchecked provides a nice (10+%) improvement for a common usage pattern. The improvement could even be called "massive" for specific compiler and input size combinations. Obviously, it is not a panacea; the improvement comes from simplifying the code the compiler has to reason about when adding a new element to the vector. As adding the new element (or computing the new element) becomes more expensive, the total improvement trends to zero.
The trade-off for the increased performance is a whole new way to trigger UB when using a vector. Is this trade-off worth it? I think it certainly is, and we should add something like this to the actual std::vector. There is even a precedent for this in the standard library, std::basic_string::resize_and_overwrite[11], which also allows the user to bypass the capacity checks when operating on a basic_string instance.
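For illustration, this is roughly how resize_and_overwrite (C++23) is used: the callback gets direct access to the buffer and reports how many characters it actually wrote, with no per-character capacity checks in between. The example itself is mine, not taken from the proposal:

#include <cstring>
#include <string>

std::string make_message() {
    std::string s;
    // Ask for up to 64 characters of storage, write part of it
    // directly, and return how many characters were actually used;
    // the string is then resized to that count.
    s.resize_and_overwrite(64, [](char* buf, std::size_t) {
        constexpr char msg[] = "hello, unchecked world";
        constexpr std::size_t len = sizeof(msg) - 1;
        std::memcpy(buf, msg, len);
        return len;
    });
    return s;
}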
And finally, push_back is not the only member function on std::vector we could apply this treatment to. emplace_back is an obvious candidate here, as it also operates on individual elements and is often used in cases where we know the required capacity ahead of time. To me, the question mark is over things like assign and other range-based functions. On the one hand, doing one capacity check for the whole range should be cheap enough not to matter. On the other, if we add push_back_unchecked, adding, e.g., assign_unchecked would be symmetrical, and there is no doubt that it would help someone out.
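A hypothetical emplace_back_unchecked would be an equally thin wrapper, sketched here under the same assumptions as the push_back_unchecked sketch earlier:

// Hypothetical: construct the new element in place from arbitrary
// arguments, again with no capacity check. UB if there is no spare
// capacity left.
template <typename... Args>
T& emplace_back_unchecked(Args&&... args) {
    std::construct_at(m_next, std::forward<Args>(args)...);
    return *m_next++;
}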
Other lessons learned
Having learned my lesson from trying to improve the performance of Catch2, I always benchmarked my own vector implementation against std's. These baseline benchmarks paid off almost immediately, because in the first implementation, my vector's push_back was almost twice as slow as libstdc++'s std::vector's.
The code below is the first implementation of push_back I had; see if you can spot the issue:
void push_back(T&& elem) {
    if (m_next != m_end) {
        Detail::construct_at(m_next, std::move(elem));
        ++m_next;
        return;
    }
    const auto current_size = size();
    const auto new_cap = calculate_new_cap(current_size + 1);
    auto* new_memory = Detail::allocate_uninit<T>(new_cap);
    // Constructing new element has to happen before we move
    // old elements, in case someone is doing `v.push_back(v[0])`
    try {
        Detail::construct_at(new_memory + current_size, std::move(elem));
    }
    catch (...) {
        Detail::deallocate_no_destroy(new_memory, new_cap);
        throw;
    }
    m_next = std::uninitialized_move(m_first, m_next, new_memory);
    // Account for the inserted element
    ++m_next;
    Detail::deallocate_no_destroy(m_first, current_size);
    m_first = new_memory;
    m_end = m_first + new_cap;
}
There is no real issue, if you define "real issue" as code that is inefficient on its own. However, when I benchmarked this implementation against libstdc++'s std::vector, I found that its performance was up to 4x worse than the actual std::vector.
Why? Because inlining is the most important optimization in the world of modern C++, and GCC's inliner did not like how big the function was. When I instead outlined the last few lines into their own helper function, adopt_new_memory, I got performance on par with std::vector.
This is the "fixed" version:
void push_back(T&& elem) {
    if (m_next != m_end) {
        Detail::construct_at(m_next, std::move(elem));
        ++m_next;
        return;
    }
    const auto current_size = size();
    const auto new_cap = calculate_new_cap(current_size + 1);
    auto* new_memory = Detail::allocate_uninit<T>(new_cap);
    // Constructing new element has to happen before we move
    // old elements, in case someone is doing `v.push_back(v[0])`
    try {
        Detail::construct_at(new_memory + current_size, std::move(elem));
    }
    catch (...) {
        Detail::deallocate_no_destroy(new_memory, new_cap);
        throw;
    }
    adopt_new_memory(new_memory, new_cap);
    // Account for the element inserted ahead of time
    ++m_next;
}
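The helper itself is just the lines that moved out of the original push_back; roughly the following (a reconstruction from the code above, the exact helper lives in the repository):

// Roughly the code outlined out of push_back: move the old elements
// into the new buffer, release the old buffer, and adopt the new one.
void adopt_new_memory(T* new_memory, size_t new_cap) {
    const auto old_size = size();
    m_next = std::uninitialized_move(m_first, m_next, new_memory);
    Detail::deallocate_no_destroy(m_first, old_size);
    m_first = new_memory;
    m_end = m_first + new_cap;
}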
Next up was benchmarking against MSVC's stdlib. Compared to MSVC's std::vector, the implementation above was significantly slower, up to 3x. Because both std::vector and nonstd::vector were significantly slower when compiled with MSVC than when compiled with GCC, I had to look for some pattern that broke MSVC's optimizer badly.
After some thinking, I noticed that I could completely remove the try-catch block, because in nonstd::vector, I static_assert that the elements in it are no-throw move-constructible.
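The assertion amounts to something like this (a sketch; the exact message and placement in the class differ):

// nonstd::vector refuses to compile for types whose move constructor
// can throw, which is what lets push_back drop the try-catch.
static_assert(std::is_nothrow_move_constructible_v<T>,
              "nonstd::vector only supports no-throw move-constructible types");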
This meant that the implementation of push_back(T&&)[12] could be simplified down to just:
void push_back(T&& elem) {
    if (m_next != m_end) {
        Detail::construct_at(m_next, std::move(elem));
        ++m_next;
        return;
    }
    const auto current_size = size();
    const auto new_cap = calculate_new_cap(current_size + 1);
    auto* new_memory = Detail::allocate_uninit<T>(new_cap);
    // Constructing new element has to happen before we move
    // old elements, in case someone is doing `v.push_back(v[0])`
    Detail::construct_at(new_memory + current_size, std::move(elem));
    adopt_new_memory(new_memory, new_cap);
    // Account for the element inserted ahead of time
    ++m_next;
}
This change made no difference in the performance with GCC (or, as I later checked, Clang), but this version of push_back(T&&) now beats MS's std::vector easily. Why? Because the MSVC version I have is bad at handling try-catch blocks where no exceptions can be thrown.
Let's take a look at an example where two functions should compile into the same ASM if the compiler properly optimizes try-catch blocks:
#include <cstdio>

void f1(int* ptr, int i) {
    *ptr = i;
}

void f2(int* ptr, int i) {
    try {
        *ptr = i;
    } catch (...) {
        puts("RIP\n");
        throw;
    }
}
Notice that GCC is smart enough to compile both f1 and f2 into the same ASM. MSVC 19.29 instead makes a hash of f2, compiling in the whole exception handler. MSVC 19.37 performs a lot better; the two functions still aren't identical, but now the difference is only that f2 allocates 40 bytes of stack space for no reason.
And that's all for this article. Reimplementing a tiny subset of std::vector in a way that keeps the comparison fair was an interesting exercise, and figuring out why it optimizes differently was quite fun. Standardizing std::vector::push_back_unchecked would be nice, but I am not touching WG21 unless I am well-compensated for doing so.
1. This is usually called a "Small Vector". The idea is to have a buffer for some N elements inline with the object so that the vector does not need to allocate memory while it contains less than N elements. std::string already does this because it has the advantage that its elements are trivial types that cannot tell the difference. Overall, I think this is the most common vector mutation in the wild, see e.g. LLVM, Abseil, Boost, many hits on GitHub and in 3 of 4 commercial codebases I've worked on.
2. This is usually called "Static Vector". It boils down to a small vector that does not allow increasing capacity. This mutation used to be useful for constexpr programming, but C++20 allows us to use std::vector in most constexpr contexts. This leaves the static vector with a much smaller niche; cases where there is a statically known bound on the size, the bound is small, and we either won't need to move the vector, or the elements are cheap to move. This was another fairly common vector mutation, but lately, it became less common. It can still be found in, e.g. Boost.
3. As far as I know, this is a relatively new idea, usually called something like "Pinned Vector". The idea is that a vector can reserve lots of contiguous virtual memory on construction and only ask for it to be committed and backed by actual physical memory once it is used. This lets the pinned vector avoid copying/moving elements during reallocation, thus providing performance benefits when the final size is not known up-front. Pinned vectors also have a secondary advantage in that adding more elements does not invalidate pointers to already inserted elements.
4. Many types are movable without invoking their move constructor/assignment operator, just by copying their bits into another place. Both std::vector itself and std::unique_ptr<T> are examples of such types because their representation is not self-referencing, even though they need a custom destructor. The C++ standard currently does not support this, even though there have been some proposals to introduce support for this into the language (by adding a new keyword along the lines of =default and =delete) and into the standard library (by adding a trait the types can opt into). An example of this approach is the vector reimplementation in Facebook's folly library. While folly::fbvector fails to compile if the type it stores does not specialize a trait to indicate that it can be moved bitwise, some implementations instead leave checking this to their user.
5. std::string already does small string optimization, but it has the advantage that its elements are always trivial. This means that it can easily move the elements without risking exceptions being thrown or causing unexpected performance overhead.
6. A sufficiently smart compiler could fix this. But I have to use GCC/Clang/MSVC, so I have no sufficiently smart compiler.
7. Yes, this means that calling it without sufficient capacity invokes UB.
8. The reason to include std::vector::push_back in the benchmarks is to check that the implementation of nonstd::vector::push_back is not doing something stupid, and thus giving an unfair advantage to nonstd::vector::push_back_unchecked. I needed to make multiple changes for my custom vector to match std::vector on GCC and MSVC. I describe some of them in the last section.
9. This MSVC version corresponds to fully patched VS 2019 as of the time of writing.
10. These compiler versions are what I use at work, so they are what I already have set up locally. Given the results, it would be nice to rerun the benchmark with newer MSVC to see how much the improved optimizer helps.
11. This API covers a few more use cases than just plain push_back_unchecked would, but it also imposes a lot more compilation costs and takes advantage of character types being much simpler to work with. I don't think a copy of this API would be a good fit for a generic vector type.
12. Note that we cannot make the same change for push_back(T const&), because we only assert no-throwability of moves, not copies. For copies, we generally assume that they will want to acquire resources, e.g. allocate memory, thus throwing copy constructors are reasonably common.