The Little Things: Speeding up C++ compilation

The Little Things is a new series of posts based on Locksley's internal training sessions. Often the contents are either proprietary (e.g. the inner workings of specific master key platforms) or not generally interesting (e.g. our internal libraries and tooling), but sometimes the contents are suitable for a wider audience, in which case I want to share them.

This post will be about some source-level techniques for speeding up C++ compilation, and their (dis)advantages. It will not talk about things external to C++, such as buying better hardware, using a better build system, or using smarter linker^[1]. It will also not talk about the tooling that can find compilation bottlenecks, as that will be a subject of a later post.

Overview of C++ compilation model

I will start with a quick overview of the C++ compilation model, to provide context for some of the tricks I will show later. Note that this overview will be very coarse, if you want a detailed look at the subtleties of the 9 phase compilation model defined in the C++ standard, look elsewhere.

We will consider the compilation of C++ binary to happen in 3 steps:

Preprocessing
Compilation
Linking

Preprocessing

The first step is preprocessing. During it, the preprocessor takes a .cpp file and parses it, looking for preprocessor directives, such as #include, #define, #ifdef, etc.

Let's take this super simple file as an example

// tiny.cpp
#define KONSTANTA 123

int main() {
    return KONSTANTA;
}

It contains one preprocessor directive, #define. It says that any following occurence of KONSTANTA should be replaced with 123. Running the file through a preprocessor leads to output like this one:

$ clang++ -E tiny.cpp
# 1 "tiny.cpp"
# 1 "<built-in>" 1
# 1 "<built-in>" 3
# 383 "<built-in>" 3
# 1 "<command line>" 1
# 1 "<built-in>" 2
# 1 "tiny.cpp" 2


int main() {
    return 123;
}

We can see that in return KONSTANTA the KONSTANTA part was replaced with 123, as it should be. We also see that the compiler left itself a bunch of other notes, that we do not care about that much^[2].

The big problem with the preprocessor model is that the #include directive literally means "copy-paste all of this file's contents here". Of course, if that file's contents contain further #include directives, then more files will be opened, their contents copied out, and in turn, the compiler will have more code to deal with. In other words, preprocessing increases the size of the input, usually significantly so.

The following is a simple "Hello World" in C++, using streams.

// hello-world.cpp
#include <iostream>

int main() {
    std::cout << "Hello World\n";
}

After preprocessing, the file will have 28115^[3] lines for the next step, compilation, to deal with.

$ clang++ -E hello-world.cpp | wc -l
28115

Compilation

After a file is preprocessed, it is compiled into an object file. Object files contain the actual code to run, but cannot be run without linking. One of the reasons for this is that object files can refer to symbols (usually functions) that they do not have the definition (code) for. This happens, e.g. if a .cpp file uses a function that has been declared, but not defined, like so:

// unlinked.cpp
void bar(); // defined elsewhere (hopefully)

void foo() {
    bar();
}

You can look inside a compiled object file to see what symbols it provides and what symbols it needs, using nm (Linux) or dumpbin (Windows). If we look at the output for the unlinked.cpp file, we get this:

$ clang++ -c unlinked.cpp && nm -C unlinked.o
                 U bar()
0000000000000000 T foo()

U means that the symbol is not defined in this object file. T means that the symbol is in the text/code section and that it is exported, which means that other object files can get foo from this unlinked.o. It is important to know that symbols might also be present in an object file, but not be available to other object files. Such symbols are marked with t.

Linking

After all the files have been compiled into object files, they have to be linked into the final binary artefact. During linking, all the various object files are smashed together in a specific format, e.g. ELF, and the various references to undefined symbols in object files are resolved with the address of the symbol, as provided by a different object file (or library).

With this overview done, we can start tackling the different ways to speed up the compilation of your code. Let's start simple.

`#include` less

Including a file usually brings in a lot of extra code, which the compiler then needs to parse and check. Thus the simplest, and usually also the biggest, way to speed up the compilation of your code, is to just #include fewer files. Reducing the include set is especially beneficial in header files, as they are likely to be included from other files, thus amplifying the impact of your improvements.

The easiest way to do this is to remove any unused includes. Unused includes shouldn't happen often, but sometimes they are left behind during refactoring, and using a tool like IWYU can^[4] make it simple to do. However, just cleaning up unused includes is unlikely to provide many benefits, and so you will have to reach for bigger guns, forward declarations and manual outlining.

But before explaining forward declarations and manual outlining, I want to go over the costs of header inclusion quickly, so we can build up intuition on what sort of speed-ups we can expect from pruning down include graphs.

The cost of header inclusion

The table below shows the time required by Clang^[5] to compile a file that only includes some stdlib headers.

header(s) included	time to compile (ms)	difference from baseline (ms)
none	11.3 ± 0.2	-
`<vector>`	68.8 ± 0.3	57.5 ± 0.36
`<string>`	136.3 ± 0.8	125.0 ± 0.82
`<stdexcept>`	137.0 ± 0.8	125.7 ± 0.82
`<vector>`, `<string>`	155.3 ± 0.9	144.0 ± 0.92
`<string>`, `<stdexcept>`	136.7 ± 0.7	125.4 ± 0.73
`<vector>`, `<string>`, `<stdexcept>`	156.1 ± 0.8	144.8 ± 0.82

The first row shows the time needed to compile a completely empty file, to provide a baseline time required by the compiler to start, read the file, and do nothing. The other lines are more interesting. As the second line says, just including <vector> adds 57 ms to compilation times, even though there will be no actual line emitted. As we can see, the cost to include <string> is more than double of <vector>, and the cost to include <stdexcept> is about the same as for <string>.

More interesting are the rows for combinations of headers, because no combination of headers is as expensive as compiling each of them on its own. The reason is quite simple: their internal include overlap. The most extreme case is <string> + <stdexcept>, because <stdexcept> is basically <string> + couple of types deriving from std::exception.

What you should take away from this are two things:

Even if you do not use anything from a header, you still have to pay for it.
Include costs do not neatly sum, nor subtract.

Now let's go through techniques we can use to include fewer files.

Forward declarations

Quite often, when we mention a type, we only need to know that it exists but do not need to know its definition. The common case is creating a pointer or a reference to a type, in which case you need a knowledge that the type exists (a forward declaration), but not what it looks like (a definition).

As an example, this header is valid:

class KeyShape; // forward declaration

size_t count_differences(KeyShape const& lhs, KeyShape const& rhs);

as long as the implementation file includes the appropriate headers:

#include "key-shape.hpp" // provides the full definition of KeyShape

size_t count_differences(KeyShape const& lhs, KeyShape const& rhs) {
    assert(lhs.positions() == rhs.positions());
    ...
}

You can also use forward declaration together with some templated classes, whose size does not change depending on the template argument, e.g. std::unique_ptr and std::vector^[6]. However, doing so can force you to outline your constructors, destructors and other special member functions (SMFs), as those usually need to see the full definition of the type. Your code then ends up looking like this:

// foo.hpp
#include <memory>

class Bar;

class Foo {
    std::unique_ptr<Bar> m_ptr;
public:
    Foo(); // = default;
    ~Foo(); // = default;
};

// foo.cpp
#include "bar.hpp"

Foo::Foo() = default;
Foo::~Foo() = default;

Notice that we still use the compiler-generated default constructor and destructor, but do so in the .cpp file, where we see the full definition of Bar. I also like to use the // = default; comment to signal to other programmers reading the code that the SMF is explicitly declared but will be defaulted, and thus there won't be any special logic in it.

When using this technique, please remember that the outlined functions cannot be inlined without LTO. In other words, you probably do not want to outline every function just because you can, because calling trivial functions can be much more expensive than inlining their code directly.

Explicit outlining

The idea underlying explicit outlining is quite simple: sometimes we get better results if a piece of code is explicitly split away from a function. One of the most common reasons is, perhaps ironically, improving inlining by making the common path of a function small. However, in our case, the reason for doing this is to improve the compilation times.

If a piece of code is expensive to compile, and inlining it is not crucial for performance, only one TU has to pay for compiling it. The canonical example of this is throwing an exception in general, and exceptions from <stdexcept> in particular. Throwing an exception generates quite a lot of code, and throwing more complex standard exception types, such as std::runtime_error, also requires an expensive^[7] header, <stdexcept> to be included.

By instead replacing all throw foo; statements with calls to a helper function along the lines of [[noreturn]] void throw_foo(char const* msg), the call sites become smaller, and all compilation costs related to the throw statements are concentrated in a single TU. This is a useful optimization even for code that is only present in a .cpp file. For code in headers^[8], this optimization is almost critical, due to the multiplicative effect of textual code inclusion.

Let's try this with a simple example: consider a toy constexpr static_vector^[9] implementation. It will throw std::logic_error from push_back if there is no more capacity, and we will test two version: one that throws the exception inline, and one that instead calls a helper function to do it.

The inline-throwing implementation looks something like this:

#include <stdexcept>

class static_vector {
    int arr[10]{};
    std::size_t idx = 0;
public:
    constexpr void push_back(int i) {
        if (idx >= 10) {
            throw std::logic_error("overflew static vector");
        }
        arr[idx++] = i;
    }
    constexpr std::size_t size() const { return idx; }
    
    // other constexpr accessors and modifiers as appropriate
};

The only change in the out-of-line throwing implementation is that the throw std::logic_error(...) line is replaced with a call to a throw_logic_error helper function. Otherwise, they are the same.

We will now create 5 TUs that include the static vector header, and contain a simple function that uses the static vector, like this:

#include "static-vector.hpp"

void foo1(int n) {
    static_vector vec;
    for (int i = 0; i < n / 2; ++i) {
        vec.push_back(i);
    }
}

Using the same compiler, settings^[5:1], and machine as before, compiling a full binary in the inline-throwing case takes up 883.2 ms (± 1.8), while the out-of-line-throwing case takes up 285.5 ms (± 0.8). This is a significant (~3x) improvement, and the improvement grows with the number of compiled TUs that include the static-vector.hpp header. Of course, it is good to also keep in mind that the more complex the TUs would be, the smaller the improvement would be, as the cost of the <stdexcept> header becomes a smaller part of the total cost of the TU.

There is not much more to be said about improving your build times by just including less stuff, so it is time to look at another trick: using hidden friends.

Hidden Friends

Hidden friends is the name of a technique that uses relatively obscure rule about the visibility of names (functions/operators) to reduce the size of overload sets. The basic idea is that a friend function declared only inside a class can only be found and called via Argument Dependent Lookup (ADL). This then means that the function does not participate in overload resolution unless its "owning" type is present in the expression.

Hidden friends are best explained with some examples.

operator<< as hidden friend

struct A {
    friend int operator<<(A, int); // hidden friend
    friend int operator<<(int, A); // not a hidden friend
};
int operator<<(int, A);

In the snippet above, only the first overload of operator<< is a hidden friend. The second overload is not, because it is also declared outside of A's declaration.

Pruning the overload set has multiple advantages:

Shorter compilation errors when overload resolution fails. Compare the error for the same expression with hidden friends versus without them.
Less chance for implicit conversions to happen. For an implicit conversion to happen, at least one argument has to already have the target type, overload that would require implicit conversions of all arguments cannot be selected. Example
Faster compilation, because the compiler has less work to do.

Given the topic of this post, that last advantage is what we care about. So how much difference does using hidden friends make? To test this, I generated a simple .cpp file with 200 structs like the one above, giving a total of 400^[10] overloads of operator<<. The TU also contains a one-line function that returns A1{} << 1, to induce overload resolution of operator<<.

When using hidden overloads, it took Clang^[5:2] 25.4 (± 0.1) ms to compile this TU into an object file. Without hidden overloads, it took 36.7 (± 0.2) ms. This is already a nice speed-up, the question is, will the speed-up scale with more overload resolutions in the TU? Let's try modifying the function to contain 1/10/50/100 summed up operator<< calls, and see the results.

`operator<<` calls	hidden (ms)	non-hidden (ms)	speed up
1	25.4 ± 0.1	36.7 ± 0.2	1.44 ± 0.01
10	25.3 ± 0.1	40.2 ± 0.2	1.59 ± 0.01
50	27.6 ± 0.2	57.9 ± 0.6	2.10 ± 0.02
100	29.9 ± 0.1	79.9 ± 1.4	2.67 ± 0.05

As we can see, the speed up increases with the number of overload resolutions required by the TU, even though the overload resolution always happens for the same expression. However, even for large TUs, with large overload sets and many overload resolutions, the difference in absolute number is ~50 ms. This is a nice speed-up, but if you remember the table on the cost of including different stdlib headers, you know that this is less than the difference between compiling an empty file and a file that includes <vector>.

In practice, this means that you are more likely to see larger improvements in compilation times from pruning unnecessary #includes than using hidden friends. However, hidden-friends also improve your code in different ways and are surprisingly powerful in highly templated code.

There is one disadvantage to using hidden friends. The header where you declare the class and the hidden friend must contain all other declarations involved in declaring the hidden friend. This can increase the header's heft significantly, e.g. if you need to include <iosfwd> for std::ostream& for stream insertion operator^[11].

To sum it all up, using hidden friends improves your compilation times, improves your error messages, and also prevents some cases of implicit conversions. This means that you should default to providing operator overloads and ADL customization points as hidden friends^[12].

Now let's look at the last trick we will look at today, putting less pressure on the linker.

Link less

There are two ways to have the linker do less work. The first one is to hide symbols from linking, the second one is to make symbols names shorter. Because the latter is... not worth it except in extreme cases^[13], we will only look at the former.

During the compilation model overview, I mentioned that a symbol could be present in an object file without being available to other object files. Such symbol is said to have an internal linkage (as opposed to having external linkage). The compilation speed advantage of symbols with internal linkage comes from the fact that the linker does not have to keep track of it as available, and thus has less work to do.

As we will see later, there are also runtime-performance and object file size benefits to symbol hiding, but first, let's look at an example.

// local-linkage.cpp
static int helper1() { return -1; }

namespace {
int helper2() { return  1; }
}

int do_stuff() { return helper1() + helper2(); }

In the example above, both helper1 and helper2 have internal linkage. helper1 because of the static keyword, helper2 because it is enclosed in an unnamed^[14] namespace. We can check this with nm:

$ clang++ -c local-linkage.cpp && nm -C local-linkage.o
0000000000000000 T do_stuff()
0000000000000030 t helper1()
0000000000000040 t (anonymous namespace)::helper2()

What's even more interesting is that if we up the optimization level, both helper1 and helper2 disappear entirely. This is because they are small enough to be inlined in do_stuff, and no code from different TU can refer to them, because they have internal linkage.

$ clang++ -c local-linkage.cpp -O1 && nm -C local-linkage.o
0000000000000000 T do_stuff()

This is also how internal linkage can improve runtime performance. Because the compiler sees all places where the symbol is used, it has more motivation to inline it into the call sites to remove the function altogether. And even if it cannot, it can optimize the code with extra knowledge based on its call sites.

The compilation performance improvements from hiding your symbols are generally small. After all, the amount of work a linker does per-symbol is small, especially if your linker is smart about it. However, large binaries can have millions of symbols, and just as with hidden friends, there are also non-compilation performance benefits to hiding symbols, namely preventing ODR violations between helper functions.

That's all for this post. In a later post, I intend to write about tools that can be used to find places where your compile times suffer unnecessarily, and about some other techniques to mitigate this.

However, if your day job is writing C++, you should be using ninja, work on a machine with lots of cores and RAM, and see if lld works for you. ↩︎
These can have many different uses. One of the more interesting ones is that compiler can mark down which file (and line) a line came from, so it then can issue diagnostics with the proper context. MSVC STL uses a similar trick to tell the compiler that it should step through all layers of, e.g. std::function internals, when the user is asking for "Just My Code" debugging. ↩︎
At least against a specific version of libstdc++ using a specific version of Clang. Changing the version of either can change the exact number, but not that much... on the other hand, switching to MSVC + MSVC STL gives you about 50k lines after preprocessing. ↩︎
As of the time of writing, IWYU has a big problem in that it hardcodes assumptions from Google's C++ style guide around includes. This means that if your project uses the same style, it will work quite well for you, even though you still have to check what changes it made, or it will include <iosfwd> to provide size_t. If your project uses <> style for internal includes, then it will suggest replacing every single one of your includes. ↩︎
The compilation used Clang in version 10, compiling against libstdc++ version 8 (release date 20200304), using -g -std=c++17 -c -o /dev/null command line arguments. ↩︎ ↩︎ ↩︎
Remember that std::vector consists of three pointers to a chunk of dynamically allocated memory. Because sizeof(T*) does not change based on T, sizeof(std::vector<T>) also does not change based on T. ↩︎
At least if you are not already including <string>. ↩︎
This can mean templated code, constexpr code, or just plain code that is important to inline even without LTO build. ↩︎
static_vector is a vector with fixed capacity, usually allocated in the object itself. This allows it to be constexpr before C++20's constexpr allocations^[15], with the tradeoff being that you have to know how much memory you will need ahead of time and that you have to fit onto the stack. ↩︎
I think that using 400 overloads simulates a TU in a relatively large (but not impossibly so) codebase. If you feel the number seems too high, consider that basically every QT type has operator<< for QDebug and that your own custom types are likely to provide operator<< for std::ostream. Alternatively, think of highly generic libraries that create many different types, which leads to many different instantiations of templated functions. ↩︎
Our codebase avoids using hidden friends for specific classes, because even on Linux, including <iosfwd> increases the header heft 3+ times. With MSVC, the difference is orders-of-magnitude. ↩︎
Like with the advice to use std::vector by default, this does not mean that there are never cases where using something else is better. Just that you need to have a reason to deviate, and you should always document the reasons you had. ↩︎
The one case where I can see myself recommending shortening your symbols names for compilation performance is for internal types in expression templates. If you are doing something like converting brainfuck code into a complex nested type, using shorter names for the internal types might very well be worth it, and the type will be unreadable either way. ↩︎
It is more often called anonymous namespace, but unnamed namespace is more technically correct. ↩︎
Note that even with C++20, the memory allocation cannot outlive the compilation phase and be accessible during runtime. If you need a dynamically sized constexpr data structure, you still need something like the static_vector. ↩︎

← The Coding Nest