`std::vector` is often said to be the default container. This is because it provides a simple abstraction over a growable contiguous chunk of memory, thus providing good baseline performance for common operations, such as adding more elements or iterating over them.
However, many non-standard reimplementations of `std::vector` have emerged over time. These are usually driven by performance considerations, e.g. avoiding allocations for small vectors[1], having static capacity[2], tightly integrating with the OS for faster reallocations[3], or optimizing moving elements with bitwise copying[4]. Most of these cannot be merged into `std::vector` without changing it significantly, e.g. small vector optimization completely changes the expected complexity of moving the vector[5]. Others require further library and language changes, such as support for bitwise moves. But recently, I ran into a simple API change that can improve the performance of a common usage pattern by 10+%.
Let's consider a simple function that takes one vector and returns a transformation of the input.
```cpp
std::vector<size_t> convert_to_indices(std::vector<Key> const& keys) {
    std::vector<size_t> indices;
    for (Key k : keys) {
        indices.push_back(to_index(k));
    }
    return indices;
}
```
As I mentioned in a previous article, this implementation leaves a bunch of performance on the table. We can easily make it faster by reserving enough memory for `indices` ahead of time, like so:
```cpp
std::vector<size_t> convert_to_indices(std::vector<Key> const& keys) {
    std::vector<size_t> indices;
    indices.reserve(keys.size());
    for (Key k : keys) {
        indices.push_back(to_index(k));
    }
    return indices;
}
```
But this code still leaves some performance on the table, depending on the compiler[6] and stdlib implementation. So, how can we fix this?
We cannot, not without changing `std::vector`. The lost performance comes from the fact that `vector::push_back` has to check whether it needs to grow the vector, even though that will never happen in the `convert_to_indices` function. A sufficiently smart compiler could optimize this check out, but neither GCC nor Clang does (and MSVC makes a complete hash of things).
What we need is a function that inserts a new element into the vector but does not check for capacity[7]. I like the name `push_back_unchecked` for this functionality, so I will use it in the rest of this post.
Benchmarks
How much of a difference does using `push_back_unchecked` make? To figure this out, we would need a reimplementation of `std::vector` that also includes `push_back_unchecked`. So I made a very simple one. It only supports `push_back` and `push_back_unchecked`, and a few miscellaneous functions like `size` or `reserve`.
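To make the later snippets easier to follow, here is a rough sketch of the shape such a minimal vector could take. The three-pointer layout and the member names mirror the implementation snippets shown later in this post; everything else (growth policy, the full member list) is elided or assumed, so treat this as an illustration rather than the code from the repository:

```cpp
#include <cstddef>
#include <memory>
#include <utility>

namespace nonstd {

// Rough structural sketch only; the real implementation lives in the
// linked repository and has more members (destructor, move operations, ...).
template <typename T>
class vector {
    T* m_first = nullptr; // start of the allocation
    T* m_next = nullptr;  // one past the last constructed element
    T* m_end = nullptr;   // one past the end of the allocation

public:
    std::size_t size() const { return static_cast<std::size_t>(m_next - m_first); }
    std::size_t capacity() const { return static_cast<std::size_t>(m_end - m_first); }

    void reserve(std::size_t n); // grows the allocation; declaration only in this sketch
    void push_back(T&& elem);    // checks capacity and grows if full; shown later in the post

    // The interesting part: no capacity check and no growth path.
    // Precondition: size() < capacity(), otherwise the behaviour is undefined.
    void push_back_unchecked(T&& elem) {
        std::construct_at(m_next, std::move(elem)); // the post's code uses a Detail::construct_at helper
        ++m_next;
    }
};

} // namespace nonstd
```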
Let's run two different benchmarks, both following the basic structure from the snippet earlier, that is, create an output vector, reserve enough space for all elements, and then add the elements to the output vector.
The transformation operation is going to be

- The Key -> index conversion snippet from earlier
- Multiplying every element by 2
For each of these, we will compare the performance of `std::vector::push_back`, `nonstd::vector::push_back`, and `nonstd::vector::push_back_unchecked`[8].
You can check the benchmark implementation in the GitHub repository.
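In rough outline, each benchmark body follows the reserve-then-push pattern described above. Here is a simplified sketch of the multiply-by-two case using the unchecked call; `times_two` is a made-up name, and the actual harness code in the repository differs:

```cpp
// Simplified sketch; the real benchmarks also time the std::vector and
// plain nonstd::vector::push_back variants of the same loop.
nonstd::vector<size_t> times_two(std::vector<size_t> const& input) {
    nonstd::vector<size_t> out;
    out.reserve(input.size());          // one up-front allocation, capacity guaranteed
    for (size_t v : input) {
        out.push_back_unchecked(v * 2); // no per-element capacity check
    }
    return out;
}
```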
I ended up collecting results for the three main compilers, specifically MSVC 19.29.30151[9], GCC 10.5, and Clang 15.0.7 (using libc++-15)[10]. GCC and Clang were run under WSL2; MSVC was run directly on Windows 10.
Results
I collated the results for all benchmarks into one table per compiler.
- **std** column is the result of using `std::vector::push_back`
- **nonstd** is the result of using `nonstd::vector::push_back`
- **unchecked** is the result of using `nonstd::vector::push_back_unchecked`
- \(\frac{\textbf{std}}{\textbf{nonstd}}\) is the speedup from using `nonstd::vector::push_back` instead of `std::vector::push_back`
- \(\frac{\textbf{nonstd}}{\textbf{unchecked}}\) is the speedup from using `nonstd::vector::push_back_unchecked` instead of `nonstd::vector::push_back`
Generally, we want to see \(\frac{\textbf{std}}{\textbf{nonstd}}\) around 1, and \(\frac{\textbf{nonstd}}{\textbf{unchecked}}\) above 1.
Clang
| benchmark | n | std | nonstd | unchecked | \(\frac{\textbf{std}}{\textbf{nonstd}}\) | \(\frac{\textbf{nonstd}}{\textbf{unchecked}}\) |
| --- | --- | --- | --- | --- | --- | --- |
| copy + mult | 1'000 | 540 ± 48 ns | 551 ± 27 ns | 93 ± 2 ns | 0.98 | 5.92 |
| | 100'000 | 51 ± 2.0 us | 51 ± 4.1 us | 10 ± 3.0 us | 1.00 | 5.10 |
| | 10'000'000 | 9.5 ± 0.2 ms | 9.5 ± 0.3 ms | 7.3 ± 0.7 ms | 1.00 | 1.30 |
| to_index | 1'000 | 460 ± 19 ns | 470 ± 40 ns | 168 ± 10 ns | 0.98 | 2.80 |
| | 100'000 | 44 ± 3.5 us | 44 ± 8.9 us | 15 ± 4.6 us | 1.00 | 2.93 |
| | 10'000'000 | 14.2 ± 0.1 ms | 14.5 ± 0.2 ms | 13.1 ± 0.1 ms | 0.98 | 1.11 |
The results from Clang+libc++ are about what we want to see. The performance of using plain `push_back` is roughly the same between `nonstd::vector` and libc++'s `std::vector`, and `push_back_unchecked` provides improvements ranging from significant for `N == 10'000'000` to massive for smaller inputs.
GCC
| benchmark | n | std | nonstd | unchecked | \(\frac{\textbf{std}}{\textbf{nonstd}}\) | \(\frac{\textbf{nonstd}}{\textbf{unchecked}}\) |
| --- | --- | --- | --- | --- | --- | --- |
| copy + mult | 1'000 | 546 ± 12 ns | 540 ± 17 ns | 167 ± 8 ns | 1.01 | 3.23 |
| | 100'000 | 56 ± 11.4 us | 55 ± 4.9 us | 12 ± 2.9 us | 1.02 | 4.58 |
| | 10'000'000 | 9.7 ± 0.2 ms | 9.7 ± 0.1 ms | 7.5 ± 0.8 ms | 1.00 | 1.29 |
| to_index | 1'000 | 528 ± 29 ns | 552 ± 91 ns | 175 ± 17 ns | 0.95 | 3.15 |
| | 100'000 | 51 ± 7.7 us | 52 ± 8.2 us | 15 ± 4.8 us | 0.98 | 3.47 |
| | 10'000'000 | 15.1 ± 0.1 ms | 15.2 ± 0.1 ms | 13.4 ± 1.6 ms | 0.99 | 1.13 |
The results from GCC+libstdc++ are very similar to those from Clang+libc++. Compared to Clang's results, there is a slightly greater difference between using `std::vector` and `nonstd::vector`, but the differences are still minor. GCC does seem to be worse at taking advantage of `push_back_unchecked` in the copy + mult benchmark, where it produces significantly worse performance for small input sizes.
MSVC
| benchmark | n | std | nonstd | unchecked | \(\frac{\textbf{std}}{\textbf{nonstd}}\) | \(\frac{\textbf{nonstd}}{\textbf{unchecked}}\) |
| --- | --- | --- | --- | --- | --- | --- |
| copy + mult | 1'000 | 792 ± 36 ns | 599 ± 32 ns | 525 ± 33 ns | 1.32 | 1.14 |
| | 100'000 | 73 ± 2.4 us | 54 ± 3.2 us | 46 ± 3.9 us | 1.35 | 1.17 |
| | 10'000'000 | 24.0 ± 2.6 ms | 13.3 ± 5.1 ms | 11.5 ± 4.1 ms | 1.80 | 1.16 |
| to_index | 1'000 | 791 ± 39 ns | 563 ± 35 ns | 397 ± 23 ns | 1.40 | 1.42 |
| | 100'000 | 91 ± 31.5 us | 50 ± 5.9 us | 34 ± 9.5 us | 1.82 | 1.47 |
| | 10'000'000 | 29.4 ± 4.3 ms | 20.4 ± 7.3 ms | 18.4 ± 6.6 ms | 1.44 | 1.11 |
The results for MSVC are, in a way, the most interesting ones. This is because `nonstd::vector` already significantly outperforms MS' `std::vector`. MSVC also does not take advantage of `push_back_unchecked` to generate massively faster code for small inputs the way GCC and Clang do.
Conclusion
From the benchmarks, `push_back_unchecked` provides a nice (10+%) improvement for a common usage pattern. The improvement could even be called "massive" for specific compiler and input size combinations. Obviously, it is not a panacea; the improvement comes from simplifying the code the compiler has to reason about when adding a new element to the vector. As adding the new element (or computing the new element) becomes more expensive, the total improvement trends to zero.
The trade-off for the increased performance is a whole new way to trigger UB when using a vector. Is this trade-off worth it? I think it certainly is, and we should add something like this to the actual `std::vector`. There is even a precedent for this in the standard library, `std::basic_string::resize_and_overwrite`[11], which also allows the user to bypass the capacity checks when operating on a `basic_string` instance.
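For reference, `resize_and_overwrite` (a C++23 addition) hands the caller a buffer of at least the requested size and trusts the callback to report how many characters it actually wrote. A small usage sketch; the function name and the message are made up for illustration:

```cpp
#include <cstring>
#include <string>

std::string make_message() {
    std::string s;
    // Ask for room for 16 characters; the lambda writes into the buffer
    // and returns the number of characters actually used.
    s.resize_and_overwrite(16, [](char* buf, std::size_t) {
        constexpr char msg[] = "hello";
        std::memcpy(buf, msg, sizeof(msg) - 1);
        return sizeof(msg) - 1; // becomes the resulting size() of the string
    });
    return s;
}
```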
And finally, `push_back` is not the only member function on `std::vector` we could apply this treatment to. `emplace_back` is an obvious candidate here, as it also operates on individual elements and is often used in cases where we know the required capacity ahead of time. To me, the question mark is over things like `assign` and other range-based functions. On the one hand, doing one capacity check for the whole range should be cheap enough not to matter. On the other, if we add `push_back_unchecked`, adding e.g. `assign_unchecked` would be symmetrical, and there is no doubt that it would help someone out.
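For concreteness, one purely hypothetical shape such a member could take is sketched below; neither the name nor the semantics exist anywhere today, it just mirrors `push_back_unchecked` for a whole range:

```cpp
// Purely hypothetical member function sketch; not part of any real vector API.
// Precondition: capacity() >= the length of the incoming range.
template <typename InputIt>
void assign_unchecked(InputIt first, InputIt last) {
    clear(); // destroy the old elements, keep the allocation
    for (; first != last; ++first) {
        std::construct_at(m_next, *first); // no per-element capacity check
        ++m_next;
    }
}
```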
Other lessons learned
Having learned my lesson from trying to improve the performance of Catch2, I always benchmarked my own vector implementation against std's. These baseline benchmarks paid off almost immediately, because in the first implementation, my vector's `push_back` was almost twice as slow as libstdc++'s `std::vector`'s.
The code below is the first implementation of `push_back` I had; see if you can spot the issue:
```cpp
void push_back(T&& elem) {
    if (m_next != m_end) {
        Detail::construct_at(m_next, std::move(elem));
        ++m_next;
        return;
    }
    const auto current_size = size();
    const auto new_cap = calculate_new_cap(current_size + 1);
    auto* new_memory = Detail::allocate_uninit<T>(new_cap);
    // Constructing new element has to happen before we move
    // old elements, in case someone is doing `v.push_back(v[0])`
    try {
        Detail::construct_at(new_memory + current_size, std::move(elem));
    } catch (...) {
        Detail::deallocate_no_destroy(new_memory, new_cap);
        throw;
    }
    m_next = std::uninitialized_move(m_first, m_next, new_memory);
    // Account for the inserted element
    ++m_next;
    Detail::deallocate_no_destroy(m_first, current_size);
    m_first = new_memory;
    m_end = m_first + new_cap;
}
```
There is no real issue, if you define "real issue" as code that is inefficient on its own. However, when I benchmarked this implementation against libstdc++'s `std::vector`, I found that its performance was up to 4x worse than the actual `std::vector`.
Why? Because inlining is the most important optimization in the world of modern C++, and GCC's inliner did not like how big the function was. When I instead outlined the last few lines into their own helper function, `adopt_new_memory`, I got performance on par with `std::vector`.
This is the "fixed" version:
```cpp
void push_back(T&& elem) {
    if (m_next != m_end) {
        Detail::construct_at(m_next, std::move(elem));
        ++m_next;
        return;
    }
    const auto current_size = size();
    const auto new_cap = calculate_new_cap(current_size + 1);
    auto* new_memory = Detail::allocate_uninit<T>(new_cap);
    // Constructing new element has to happen before we move
    // old elements, in case someone is doing `v.push_back(v[0])`
    try {
        Detail::construct_at(new_memory + current_size, std::move(elem));
    } catch (...) {
        Detail::deallocate_no_destroy(new_memory, new_cap);
        throw;
    }
    adopt_new_memory(new_memory, new_cap);
    // Account for the element inserted ahead of time
    ++m_next;
}
```
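The post does not show `adopt_new_memory` itself; a plausible sketch, assuming it simply contains the outlined tail of the original `push_back` above, would be:

```cpp
// Assumed shape: just the cold tail of the original push_back, moved into
// a separate member function so the hot path stays small enough to inline.
void adopt_new_memory(T* new_memory, size_t new_cap) {
    const auto current_size = size();
    // Move the old elements over; m_next ends up pointing one past the
    // last moved element in the new allocation.
    m_next = std::uninitialized_move(m_first, m_next, new_memory);
    Detail::deallocate_no_destroy(m_first, current_size);
    m_first = new_memory;
    m_end = m_first + new_cap;
}
```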
Next up was benchmarking against MSVC's stdlib. Compared to MSVC's `std::vector`, the implementation above was significantly slower, up to 3x. Because both `std::vector` and `nonstd::vector` were significantly slower when compiled with MSVC than when compiled with GCC, I had to look for some pattern that broke MSVC's optimizer badly.
After some thinking, I noticed that I could completely remove the try-catch block, because in `nonstd::vector` I `static_assert` that the elements in it are no-throw move-constructible.
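The assertion itself is tiny; something along these lines, with the exact message and placement being assumptions on my part:

```cpp
// Inside the class definition of nonstd::vector<T> (needs <type_traits>).
static_assert(std::is_nothrow_move_constructible_v<T>,
              "nonstd::vector requires nothrow-move-constructible elements");
```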
This meant that the implementation of `push_back(T&&)`[12] could be simplified down to just:
```cpp
void push_back(T&& elem) {
    if (m_next != m_end) {
        Detail::construct_at(m_next, std::move(elem));
        ++m_next;
        return;
    }
    const auto current_size = size();
    const auto new_cap = calculate_new_cap(current_size + 1);
    auto* new_memory = Detail::allocate_uninit<T>(new_cap);
    // Constructing new element has to happen before we move
    // old elements, in case someone is doing `v.push_back(v[0])`
    Detail::construct_at(new_memory + current_size, std::move(elem));
    adopt_new_memory(new_memory, new_cap);
    // Account for the element inserted ahead of time
    ++m_next;
}
```
This change made no difference in the performance with GCC (or, as I later checked, Clang), but this version of `push_back(T&&)` now beats MS's `std::vector` easily. Why? Because the MSVC version I have is bad at handling try-catch blocks where no exceptions can be thrown.
Let's take a look at an example where two functions should compile into the same ASM if the compiler properly optimizes try-catch blocks:
```cpp
#include <cstdio>

void f1(int* ptr, int i) {
    *ptr = i;
}

void f2(int* ptr, int i) {
    try {
        *ptr = i;
    } catch (...) {
        puts("RIP\n");
        throw;
    }
}
```
Notice that GCC is smart enough to compile both `f1` and `f2` into the same ASM. MSVC 19.29 instead makes a hash of `f2`, compiling in the whole exception handler. MSVC 19.37 performs a lot better; the two functions still aren't identical, but now the difference is only that `f2` allocates 40 bytes of stack space for no reason.
And that's all for this article. Reimplementing a tiny subset of `std::vector` in a way that keeps the comparison fair was an interesting exercise, and figuring out why it optimizes differently was quite fun.
Standardizing `std::vector::push_back_unchecked` would be nice, but I am not touching WG21 unless I am well-compensated for doing so.
This is usually called a "Small Vector". The idea is to have a buffer for some N elements inline with the object so that the vector does not need to allocate memory while it contains less than N elements.
std::string
already does this because it has the advantage that its elements are trivial types that cannot tell the difference. Overall, I think this is the most common vector mutation in the wild, see e.g. LLVM, Abseil, Boost, many hits on GitHub and in 3 of 4 commercial codebases I've worked on. ↩︎This is usually called "Static Vector". It boils down to a small vector that does not allow increasing capacity. This mutation used to be useful for constexpr programming, but C++20 allows us to use
std::vector
in most constexpr contexts. This leaves the static vector with a much smaller niche; cases where there is a statically known bound on the size, the bound is small, and we either won't need to move the vector, or the elements are cheap to move. This was another fairly common vector mutation, but lately, it became less common. It can still be found in, e.g. Boost. ↩︎As far as I know, this is a relatively new idea, usually called something like "Pinned Vector". The idea is that a vector can reserve lots of contiguous virtual memory on construction and only ask for it to be committed and backed by actual physical memory once it is used. This lets the pinned vector avoid copying/moving elements during reallocation, thus providing performance benefits when the final size is not known up-front. Pinned vectors also have a secondary advantage in that adding more elements does not invalidate pointers to already inserted elements. ↩︎
Many types are movable without invoking their move constructor/assignment operator, just by copying their bits into another place. Both
std::vector
itself andstd::unique_ptr<T>
are examples of such types because their representation is not self-referencing, even though they need a custom destructor. The C++ standard currently does not support this, even though there have been some proposals to introduce support for this into the language (by adding a new keyword along the lines of=default
and=delete
) and into the standard library (by adding a trait the types can opt into). An example of this approach is the vector reimplementation in Facebook's folly library. Whilefolly::fbvector
fails to compile if the type it stores does not specialize a trait to indicate that it can be moved bitwise, some implementations instead leave checking this to their user. ↩︎std::string
already does small string optimization, but it has the advantage that its elements are always trivial. This means that it can easily move the elements without risking exceptions being thrown or causing unexpected performance overhead. ↩︎A sufficiently smart compiler could fix this. But I have to use GCC/Clang/MSVC, so I have no sufficiently smart compiler. ↩︎
Yes, this means that calling it without sufficient capacity invokes UB. ↩︎
The reason to include
std::vector::push_back
in the benchmarks is to check that the implementation ofnonstd::vector::push_back
is not doing something stupid, and thus giving an unfair advantage tononstd::vector::push_back_unchecked
. I needed to make multiple changes for my custom vector to matchstd::vector
on GCC and MSVC. I describe some of them in the last section. ↩︎This MSVC version corresponds to fully patched VS 2019 as of the time of writing. ↩︎
These compiler versions are what I use at work, so it is what I already have set up locally. Given the results, it would be nice to rerun the benchmark with newer MSVC to see how much the improved optimizer helps. ↩︎
This API covers a few more use cases than just plain
push_back_unchecked
would, but it also imposes a lot more compilation costs and takes advantage of character types being much simpler to work with. I don't think a copy of this API would be a good fit for a generic vector type. ↩︎Note that we cannot make the same change for
push_back(T const&)
, because we only assert no-throwability of moves, not copies. For copies, we generally assume that they will want to acquire resources, e.g. allocate memory, thus, throwing copy constructors are reasonably common. ↩︎