The Little Things: Why you should always have benchmarks ready

At the end of January, I removed some low-hanging allocations from Catch2. This meant that running the SelfTest binary made 8k fewer allocations with v3.3.1 than with v3.3.0. Unbeknownst to me, it also made section tracking in v3.3.1 ~4x slower. This post is about how that happened and what I/you should learn from this.

Let's start at the beginning.

The beginning

As I already mentioned, I decided to work on the number of allocations in Catch2. This was motivated by this topic coming up at work^[1] and by looking at Snitch, which is a test framework that provides API similar to Catch2's but with zero allocations^[2].

As this was originally just a one-day project^[3], I did not want to do any significant refactoring, only small local changes, like reserving enough memory ahead of time or ensuring that things end up moved rather than copied. I say originally because I noticed that the section tracking requires a quadratic number of allocations in the number of sibling sections, and I decided to fix this, even though it required a larger surgery.

The result was that running tests/SelfTest with v3.3.1 needed just over 8k fewer allocations than it needed with v3.3.0. And also, unbeknownst to me, that section tracking became about 4x slower. I did not notice this until quite a lot later because I did not benchmark the changes. After all, all the changes were trivial, and having fewer allocations means better runtime, no?

The realization

About three weeks later, I wrote some microbenchmarks for specific extreme uses of Catch2^[4]. The most interesting results came from a microbenchmark for section tracking, with 1000 sibling sections:

TEST_CASE("Some simple test hahaha", "[tag1][tag2]"){
    SECTION("Some section name 0"){  }
    SECTION("Some section name 1"){  }
    SECTION("Some section name 2"){  }
    SECTION("Some section name 3"){  }
    SECTION("Some section name 4"){  }
    ...
    SECTION("Some section name 999"){  }
}

Tracking allocations coming from running this empty test case gave me some impressive numbers; v3.3.0 made 2013410^[5] allocations, but v3.3.1 made only 11410 allocations^[6].

This is a great improvement, but there was a ~~Catch2~~ catch. Running the test compiled against v3.3.1 was noticeably slower than running the one compiled against v3.3.0. So I benchmarked the binaries using hyperfine and found out that the difference was between 2.5x (debug configuration) to more than 4x (release configuration).

At this point, I looked through the 8 commits that went into v3.3.1 again, expecting to quickly find some very suspicious change that would make the issue obvious. I found absolutely nothing suspicious, so I had to use a more systematic approach...

The investigation

With only 8 commits that could've caused the difference, the simplest way of narrowing down the suspects was to just build the benchmark against all of them and compare their runtimes. This clearly narrowed down the slowdown to single commit^[7], unsurprisingly the most complex one.

Try looking at the linked commit as if you were doing a code review, and see if you find the issue by inspection. Do you think you see it?

> Click here for a hint <

It is one of the first changes in catch_test_case_tracker.cpp

I sure didn't, so I had to dig deeper to figure it out. First, I tried profiling the binary built with v3.3.1. Let's take a look at a summary of functions on the hot path:

Function Name	Inclusive CPU in %	Exclusive CPU in %
`Catch::TestCaseTracking::SectionTracker::acquire`	97.14%	0.09%
`Catch::TestCaseTracking::ITracker::findChild`	96.75%	17.22%
`Catch::StringRef::operator==`	77.78%	27.14%
`[External Call] vcruntime140.dll`	50.64%	50.64%

Looking at the inclusive^[8] time, we spend the absolute majority of the time looking for a child of a tracker. This is done when checking whether Catch2 should enter the currently encountered section. Given that our microbenchmark only ever walks over SECTION macros (it will walk over 1M SECTIONs in one execution), this makes sense and is about what we would expect.

Looking for the child section spends most of the time comparing StringRefs, which in turn spends most of the time in a runtime library, specifically in std::memcmp, to compare the contents.

With the benefit of hindsight, the issue is obvious from the trace.

However, at the time, I had no idea, so I tried various things:

Make StringRef::operator== eligible for inlining without LTO by moving it to the header. This helped a bit, but not enough to offset the slowdown.
Avoid passing the newly introduced NameAndLocationRef by value. The difference this made did not surpass noise.
Better inlining and devirtualization of various functions in trackers. In aggregate, these changes improved performance slightly, but not nearly enough to bring the perf back to previous levels.

After spinning my wheels for an embarrassingly long time, I finally had an idea;

I could compare profiles from running the benchmark with v3.3.0 and v3.3.1.

This is the hot path for the same benchmark with v3.3.0:

Function Name	Inclusive CPU in %	Exclusive CPU in %
`Catch::TestCaseTracking::SectionTracker::acquire`	79.75%	0.08%
`Catch::TestCaseTracking::ITracker::findChild`	79.25%	22.76%
`Catch::SourceLineInfo::operator==`	53.61%	53.61%

We still spend the majority of time looking for trackers, but as it runs faster in v3.3.0, it takes up less of the total CPU time. However, this time the majority of the time is taken up by a different operator== than with v3.3.1^[9].

After seeing this, the issue became obvious.

Look at the commit in question again. Do you see it now?

> Click here for the answer <

         auto it = std::find_if(
             m_children.begin(),
             m_children.end(),
             [&nameAndLocation]( ITrackerPtr const& tracker ) {
-                return tracker->nameAndLocation().location ==
-                           nameAndLocation.location &&
-                       tracker->nameAndLocation().name == nameAndLocation.name;
+                return tracker->nameAndLocation() == nameAndLocation;
             } );

> Click here for the explanation <

Changing the lambda to use NameAndLocation::operator== directly changed the order in which the tracker's members are compared. The original version checked the tracker's location first and the tracker's name second. op== instead checks the name first and then the location second. op== uses this order because that is the order they are defined in NameAndLocation.

In most cases, the line where a SECTION is defined is a better discriminant than the SECTION's name. This is because it is a lot faster to compare the lines than it is to compare their names. Thus, when the refactoring changed the order in which they were checked, it slowed down the comparison significantly, and thus the section tracking also slowed down significantly.

Takeaways

Even innocuous changes can have massive consequences.

The refactoring was an obvious way to make the code cleaner. Instead of manually comparing two members, we called the appropriate operator==. The result was cleaner code that took significantly longer to run, due to differences in how it processes real-world inputs.

You should always have some benchmarks, and run them regularly^[10].

I ran into a similar issue last week at work. I noticed a function that copied a vector, and optionally also XORed the data with a constant. I thought it was overcomplicated and cleaned it up, but because the benchmarks showed that the result was slower^[11], the changes never made it to git.

When investigating differences in running time, profiling just one binary might not be enough.

You should profile both binaries and then diff the resulting traces. I am not aware of any tools for smart trace diffing, but usually, just opening both of them side by side is enough.

We did not need to reduce the number of allocations by itself, but we needed to reduce the fragmentation caused by having interspersed shortlived and longlived allocations. ↩︎
It makes some interesting tradeoffs, some of which I agree with (e.g. if it did not break backwards compatibility, TEST_CASE would also accept only string literals), and some of which I don't (matcher description has to be saved internally in the object, SECTION tracking is much more limited). ↩︎
I really don't have much spare time for projects nowadays, even less for Catch2. ↩︎
E.g. 1000s of TEST_CASEs in a single file, 1000s of REQUIREs in a single test case, 1000s of sibling SECTIONs, huge generators and so on. ↩︎
Given that the absolute majority of these allocations come from std::strings, having shorter section names would lead to less allocations. Trying to figure out a realistic distribution of section name lengths is beyond the scope of this post though. For the sake of completeness, if the names are shortened to just the number, the result is 8410 vs 6410 allocations. ↩︎
Further cleanups made in v3.3.2 brought this down to 10409 allocations, and 6409 allocations for the short name case. ↩︎
Other commits also had some effect, both positive and negative, but this one was responsible for the absolute majority of the slowdown. ↩︎
Simplified a bit, function's inclusive time is the time spent between entering the function and exiting the function. This means that time spent in functions called from the function also count in its inclusive time. Function's exclusive time is the time spent only inside that one function. ↩︎
If you noticed that this time there is no separate line for calls into runtime for std::strcmp, yeah, there isn't. There should be a bunch of calls there, but the profile shows nothing. I suspect that MSVC treats calls to std::strcmp differently from std::memcmp, and inlines it. ↩︎
Incidentally, I am looking for help setting up some benchmarks for Catch2. The hard part is tracking the results over time, so that trends can be identified, rather than just getting the results of the latest run. ↩︎
I investigated why the new code was slower, and the result turned out quite surprising and funny. The compiler was able to vectorize the rewritten function, so it should've been quite a lot faster. But because all inputs were very short, the vectorized loop did not pay off. ↩︎

← The Coding Nest

The beginning

The realization

The investigation

Takeaways