The Little Things: My ?radical? opinions about unit tests

Due to maintaining Catch2 and generally caring about software correctness, I spend a lot of my time thinking about tests^[1]. This has left me with many opinions about tests, some conventional and some heterodox.

Originally I wanted to make a lightning talk out of these, but I won't be giving any this year, so instead, I wrote them up here. And since this is a blog post, I also have enough space to explain the reasons behind the opinions.

Unit test opinions

Note that the list below is in no particular order. I added opinions to the list as I remembered them.

You should be using Catch2.

Okay, I am not entirely serious about this one, but I had to include it anyway.

If the test runner finishes without running any tests, it should return a non-zero error code.

When I was making this change to Catch2, I expected it to be non-controversial. After all, everyone has run into having green CI while tests weren't running due to a misconfiguration, no? And making the test runner fail if it didn't run any tests would easily catch this problem.

I ran two Twitter polls^[2] about this, and as it turns out, the answer depends on how the runner was invoked. If the test runner is called without a filter, e.g. just ./tests, then the majority of people voted to return 0. But when there is a filter, e.g. ./tests "some test filter or another", then the majority of people voted to return a non-zero error code.

There is an easy argument about the logical purity of returning 0. A non-zero return code means that at least one test has failed. If no tests were run, then no tests could've failed; thus, returning 0 is the correct answer.

And logically, it is the correct answer. But pragmatically, returning a non-zero return code when no tests were run is much more useful behaviour, and I prefer usefulness in practice over logical purity.

If you want to read the arguments people made for/against returning 0 in either case, look at the replies to the poll tweets.
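The pragmatic policy can be sketched as a small function that maps a run summary to an exit code. This is only an illustration: RunSummary, exit_code_for, the allow_empty_run opt-out and the concrete exit code values are hypothetical names chosen for this sketch, not Catch2's implementation.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical summary of one test-runner invocation.
struct RunSummary {
    std::size_t tests_run;
    std::size_t tests_failed;
};

// Hypothetical exit codes; the property that matters is zero vs non-zero.
constexpr int exit_ok = 0;
constexpr int exit_tests_failed = 1;
constexpr int exit_no_tests_run = 2;

// Pragmatic policy: an empty run is an error unless the caller opts out.
int exit_code_for(const RunSummary& summary, bool allow_empty_run = false) {
    assert(summary.tests_failed <= summary.tests_run);
    if (summary.tests_failed > 0) {
        return exit_tests_failed;
    }
    if (summary.tests_run == 0 && !allow_empty_run) {
        return exit_no_tests_run;
    }
    return exit_ok;
}
```

With this policy, a misconfigured CI job that selects zero tests turns red, while a caller who really expects an empty run can still ask for the logically pure answer through the opt-out.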

Unit tests aren't good because they provide correctness. In fact, they are bad at providing correctness.

Unit tests only provide a little bit of correctness. What makes unit tests valuable is that they also need only a little effort to provide that small bit of correctness. They also provide correctness under the pay-as-you-go model.

This means that there is good correspondence between the effort you expend upon the unit tests and how many correctness guarantees you get out of your effort. And you can put in as much, or as little, effort as you want/can afford. This contrasts sharply with formal verification, complex fuzzing setups, etc., where you have to invest a significant amount of effort up-front for them to pay out. And if you stop investing the effort halfway through, you get nothing back.

The downside to the pay-as-you-go model is that if you want a high degree of correctness guarantees, you will pay a lot more than you would for using a different approach.

You should not be running your unit tests in isolation.

There are two main reasons to run your tests in isolation. The first one is to run them in parallel. The second one is to avoid positive interference between tests, where one test changes the global state in a way that causes a later test to pass. Both are good reasons, but there is a better way to achieve these goals.

There are also two issues with running your tests in isolation. The first one is that on some platforms, the static overhead from running each test in its own process can be a significant part of the total runtime of the tests. The other is that running your unit tests in isolation also hides negative interference between tests, where one test changes the global state in a way that would cause a later test to fail.

The first issue is rarely a problem in practice^[3]. The real problem with running tests in isolation is the second issue. In my experience, negative interference between tests usually means issues in the code under test rather than the tests themselves, and thus it is important to find it and fix it.
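Here is a small sketch of negative interference that points at a bug in the code under test. Everything in it is hypothetical: rate_in_cents stands in for the code under test, and the two tests are plain functions so that the example does not depend on any test framework.

```cpp
#include <cassert>
#include <string>

namespace app {
// Hypothetical code under test. The bug: the cache ignores which currency
// was requested, so the first caller decides the answer for everyone.
int rate_in_cents(const std::string& currency) {
    assert(currency == "EUR" || currency == "USD");
    static int cached = -1;  // global state hidden inside a function
    if (cached < 0) {
        cached = (currency == "EUR") ? 100 : 108;
    }
    return cached;
}
}  // namespace app

// Two stand-in tests. Each one passes when it runs in its own process.
bool test_eur_rate() { return app::rate_in_cents("EUR") == 100; }
bool test_usd_rate() { return app::rate_in_cents("USD") == 108; }
```

Run in separate processes, both tests pass and the broken cache goes unnoticed. Run in one process, with the EUR test first, the USD test fails and exposes the bug.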

If you instead run your tests in randomized order inside a single process, you will eventually expose both positive and negative interference between tests. The disadvantage to this approach is that you don't get parallel test execution and that figuring out which tests interfere with each other is complicated due to the shuffling.

The thing is, good tools^[4] can solve both of these disadvantages. You can get back parallel test execution if your test runner supports splitting up tests into randomized batches. And debugging test interference is easy if your test runner's shuffle implementation is subset-invariant^[5].
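One way to get a subset-invariant shuffle is to sort the tests by a hash of their name mixed with the random seed, so the relative order of two tests depends only on those two tests and the seed. The sketch below shows that idea together with a simple split into batches. The names seeded_hash, shuffle_tests and take_shard are made up for this example, and it should not be read as the implementation any particular test runner uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// FNV-1a-style hash of the test name, with the seed mixed into the initial state.
std::uint64_t seeded_hash(const std::string& name, std::uint64_t seed) {
    std::uint64_t hash = 14695981039346656037ull ^ seed;
    for (unsigned char c : name) {
        hash ^= c;
        hash *= 1099511628211ull;
    }
    return hash;
}

// Orders tests by their seeded hash. The position of a test relative to
// another test does not depend on which other tests are present.
std::vector<std::string> shuffle_tests(std::vector<std::string> names,
                                       std::uint64_t seed) {
    std::sort(names.begin(), names.end(),
              [seed](const std::string& lhs, const std::string& rhs) {
                  const auto lhs_hash = seeded_hash(lhs, seed);
                  const auto rhs_hash = seeded_hash(rhs, seed);
                  // Fall back to the name so hash collisions stay deterministic.
                  return lhs_hash != rhs_hash ? lhs_hash < rhs_hash : lhs < rhs;
              });
    return names;
}

// Picks every shard_count-th test, starting at shard_index, so that each
// batch can run in its own process.
std::vector<std::string> take_shard(const std::vector<std::string>& ordered,
                                    std::size_t shard_count,
                                    std::size_t shard_index) {
    assert(shard_count > 0 && shard_index < shard_count);
    std::vector<std::string> shard;
    for (std::size_t i = shard_index; i < ordered.size(); i += shard_count) {
        shard.push_back(ordered[i]);
    }
    return shard;
}
```

Because the order is a pure function of each name and the seed, removing unrelated tests while hunting for the interfering pair does not reorder the tests that remain, as long as the seed stays the same.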

Tests should be run in random order by default.

Running tests in random order is the correct behaviour for the long-term health of tests. Your defaults should adhere to the correct configuration unless there is an overriding constraint.

Catch2 currently does not default to this but will with the next major (breaking) release. During the last major release, I considered this bit too radical, especially in conjunction with the other changes to defaults I made.

Test Driven Development (TDD) is not useful for actual development.

In my experience, there are two options when I need to write a larger piece of code. The first one is that I know how to implement it; thus, I get no benefit from the "write tests first" part of TDD. The second one is that I don't know how to implement something, and then TDD is still useless because what I actually need is to do a bunch of design up-front... and then writing tests first has the same issue as in the first case.

These issues are compounded further by the fact that I often work on things where the "run tests" step can require multiple CPU years to evaluate appropriately, or means crunching through half a terabyte of data and comparing the output with the current implementation.

Don't read this to mean that you should not write tests. You should be writing tests. You just don't have to dogmatically write them first. Write them when it makes sense, and don't be afraid to write them in large batches.

TDD is a good learning tool.

While I don't think TDD is useful in day-to-day work (see the previous opinion), I would still recommend that every starting developer try using it for some time, say a year, while they are learning to write code. The reason behind this is quite simple. A common complaint when starting with writing unit tests is that writing unit tests is annoying due to the interface of the code under test being annoying to use.

Being forced to use the interface before writing the implementation (through writing the tests first) is an excellent way to determine whether the interface is usable. And since the implementation is not written yet, the barrier to changing the interface is low.

Eventually, the developer should be able to evaluate the interface without being forced to write code that uses it, which is a good time to stop using TDD.

Your use of tests will be influenced by your test framework. This makes picking a good test framework critical.

This does not seem like a radical opinion, but I think few people appreciate how much this is true and what it means.

A straightforward example is the difference between the test names you end up with, if your test names have to be legal identifiers (e.g. GTest) or if they can be arbitrary strings (e.g. Catch2). Which one of these do you prefer to read?

TEST(Exceptions, ExceptionMessagesContainTimestampAndLocation) {
    ...
}

TEST_CASE("Exception messages contain timestamp and location", "[exceptions]") {
    ...
}

I ran into another, worse, example this year at NDC TechTown. During a discussion about tests and why tests should be run shuffled and not isolated (see the opinion about not running unit tests in isolation above), someone told me that debugging shuffled tests is too hard. Why? Because you have to find which tests cause the issue and removing tests changes their order, subsetting the tests is annoying, and so on.

This is only true if your test framework does not provide good support for test shuffling, but if you picked such a framework, then trying to do the right thing is now more painful than it has to be. Maybe even painful enough that you won't bother running the tests shuffled, losing out on a significant correctness improvement.

I also saw a different example last year, incidentally also at NDC TechTown. In his keynote "Testing as an equal 1st class citizen (to coding)", Jon Jagger had a part where he notes that he no longer uses descriptive test names because writing them as a valid identifier is annoying and ends up unreadable anyway. Instead, he uses UUID-ish names, like test_de724a, test_de724b and so on. The description of the test is then pushed down into the test's docstring (this is in Python). Because the docstring is an arbitrary string, it can contain newlines and long sentences, making it more readable than the identifier-like name.

Another supposed bonus to this approach was that it makes running all tests related to some functionality easy, e.g. pytest -k de7, because the test name prefix encoded other properties of the test.

I think this idea is pretty ingenious.

But it is an ingenious workaround for inadequate tools. Users shouldn't have to write test names as valid identifiers, and users should be able to group tests without messing with the test names.

update 14.09.2022

Shortly after publication, I remembered another great example of this principle, the "Lakos rule"^[6]. The purpose behind the rule was to make defensive checks in narrow contract^[7] functions "testable". The idea was that the assertions would be configured to throw instead of terminating, and then the tests would check for exceptions being thrown. But for this to work, the narrow contract functions couldn't be marked noexcept because that disallows them from throwing.
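The approach described above can be sketched as follows. The macro CONTRACT_ASSERT, the switch CONTRACT_CHECKS_THROW, the contract_violation type and the helper violates_contract are all hypothetical names invented for this illustration; they are not the names used by any real assertion library.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical build switch: pretend this translation unit is a test build.
#define CONTRACT_CHECKS_THROW 1

struct contract_violation : std::logic_error {
    using std::logic_error::logic_error;
};

#if CONTRACT_CHECKS_THROW
// Test builds: a failed check throws, so a test can observe it.
#define CONTRACT_ASSERT(cond) \
    do { if (!(cond)) throw contract_violation(#cond); } while (false)
#else
// Regular builds: a failed check terminates the program.
#define CONTRACT_ASSERT(cond) assert(cond)
#endif

// Narrow contract: i must be a valid index. Deliberately NOT noexcept,
// because a throwing check inside a noexcept function calls std::terminate.
int element_at(const int* data, std::size_t size, std::size_t i) {
    CONTRACT_ASSERT(i < size);
    return data[i];
}

// Test helper: reports whether calling f trips a contract check.
template <typename F>
bool violates_contract(F&& f) {
    try {
        f();
    } catch (const contract_violation&) {
        return true;
    }
    return false;
}
```

A test can then call violates_contract with an out-of-range index and expect true. Adding noexcept to element_at would make the same test terminate the whole test binary instead.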

There is a relatively simple alternative to this: something called "death tests". A death test checks whether a specific expression terminates the binary that executes it. GTest is a good example of a test framework that supports death tests and supports them well.
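To show the mechanism, here is a minimal POSIX-only sketch of a death check: it runs the expression in a forked child process and reports whether the child was killed by a signal. The names checked_element and dies are hypothetical, and this is far simpler than what a real framework does; in GTest you would reach for its death test assertions such as EXPECT_DEATH instead.

```cpp
#undef NDEBUG  // keep assert active in this sketch, even in release builds
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Narrow contract function that can stay noexcept: the defensive check
// terminates the process (a failed assert calls std::abort) instead of throwing.
int checked_element(const int* data, std::size_t size, std::size_t i) noexcept {
    assert(i < size);
    return data[i];
}

// Minimal death check: run f in a forked child and report whether the
// child was killed by a signal (std::abort raises SIGABRT).
template <typename F>
bool dies(F&& f) {
    const pid_t child = fork();
    if (child < 0) {
        throw std::runtime_error("fork failed");
    }
    if (child == 0) {
        // Never let the child return into the caller's test code.
        try {
            f();
        } catch (...) {
            _exit(1);
        }
        _exit(0);  // f returned normally, so the child did not die
    }
    int status = 0;
    if (waitpid(child, &status, 0) != child) {
        throw std::runtime_error("waitpid failed");
    }
    return WIFSIGNALED(status);
}
```

Since the check runs in a separate process, the function under test keeps its noexcept specifier and its terminating assertion, and the test binary itself survives the failed check.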

For a bit over 8 years^[8], the Lakos rule was used to guide the use of the noexcept specifier in the C++ standard library. Many functions that could be marked noexcept are not because they have a narrow contract and the Lakos rule said that they should not be marked noexcept.

So a lack of good death test support in the test framework Bloomberg uses caused a non-trivial difference in the contents of the C++ standard. That's a nice impact for something as trivial as a test framework choice, right?

Other random testing opinions

I also decided to add a smattering of opinions that are not about unit tests. I will not try to explain these as much as the ones about unit tests; maybe I will in a later article.

Property-based testing is neat; we should do more of it. But we need good tooling support first.
Dogmatically insisting on one assertion per test case is stupid. Writing more assertions is often easier than defining a new matcher.
Being pragmatic is more important than being clean/correct/logical.
Fuzzing is good. The issue is making it work in resource-constrained environments.
Mocking should be only used very rarely.
Overtesting is worse than (slight) undertesting.
Using special function names to declare that a function is a test is wrong.

[1] And I don't mean just unit tests, e.g. my undergrad thesis was about applying symbolic execution towards testing hard real-time safety-critical systems. Later on I did a bunch of work with formal verification tools, namely Alloy and UPPAAL.
[2] One Twitter poll covered the case without a filter, and another covered the case with a test filter. Later on, I also ran a Twitter poll about the case with a tag filter.
[3] I've run into a case where the static overhead from running each test in its own process was about 15% of the total runtime. This was due to a combination of most test cases being very cheap to execute, spawning processes on that platform being expensive, and large test binaries. Remember that all the tests in the binary must be registered to run even a single one.
[4] Such as Catch2.
[5] Subset-invariant shuffle means that the relative order of two test cases is the same, no matter how many other test cases are also shuffled at the time. Obviously, this assumes a fixed random seed. Different random seeds can (and should) change the relative order of the same two test cases.
[6] Lakos rule says that, in general, narrow contract functions should not be marked noexcept. There is a bit more to it than just that, e.g. it also says that non-throwing functions with a wide contract^[7] should be marked noexcept and has a part about special functions with a conditionally wide contract.
[7] Narrow contract functions, or sometimes just "narrow functions", are functions that can be called only when some preconditions are met. The classic example in C++ is std::vector::operator[] because it expects to be called only with a valid index as the argument. Calling it with an invalid index is undefined behaviour and a user error. Wide contract functions, or "wide functions", are functions that have no preconditions. The classic example in C++ is std::vector::at because it accepts out-of-range indices; it reacts to them by throwing an exception.
[8] N3279 formalized stdlib's noexcept policy following arguments made in N3248. P1656 changed the policy at the end of 2019. Funnily enough, P1656 explicitly calls out that neither libc++ nor libstdc++ use in-process validation because trying to do so ran into too many issues to be helpful. When P1656 was presented, MSVC's STL implementer had a similar, maybe even a bit more extreme, position. (ISO rules say that the meeting notes are not public, so you need to be part of WG21 to open the link.)
