Recently, I was once again on CppCast, and once again[1], I talked about Catch2 and C++'s standard library support for random numbers. One of the things I said was that random distributions are not one-size-fits-all.

This post is about how this applies to Lemire's algorithm, which is the state of the art for generating uniformly distributed integers in specific intervals.

Introduction

In C++, <random> splits the work of generating random numbers between Random Number Engines, which generate integers in some predefined range, and distributions, which turn these integers into other numbers in the range requested by the user. This is a powerful, expert-friendly design, even though the actual standardized library leaves a lot to be desired.
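
To make the split concrete, here is the design in action with the standard library's own types: the engine produces raw uniformly distributed integers, and the distribution maps them onto the requested range.

#include <iostream>
#include <random>

int main() {
  // Engine: produces uniformly distributed 32-bit integers.
  std::mt19937 engine( 42 );
  // Distribution: maps the raw engine output onto the range [1, 6].
  std::uniform_int_distribution<int> dist( 1, 6 );
  std::cout << dist( engine ) << '\n';
}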

Notably, the distributions are specified as types and are allowed to be stateful. This means that a distribution could be written to optimize for repeatedly generating numbers from the same instance rather than generating a single number as fast as possible[2].

This post will look specifically at Lemire's algorithm for generating uniformly distributed integers. This is because I recently implemented reproducible uniform integer distribution for Catch2 using Lemire's algorithm, so I can use my own code for the benchmarks[3].

Lemire's algorithm is the current state of the art for generating uniformly distributed random integers in an interval [0, max)[4]. I won't go over the proof of the algorithm in this post (if you want to know why it works, go read the paper), but we will look at a pseudo-implementation.

// Pseudo-implementation: random32() stands for an engine call
// returning a uniformly distributed uint32_t.
uint32_t lemire( uint32_t max ) {
  uint64_t long_mult = random32() * uint64_t(max);
  uint32_t low_bits = uint32_t(long_mult);
  // Cheap probabilistic check: only results whose low 32 bits fall
  // below max can be in the biased region.
  if (low_bits < max) {
    // Unsigned negation wraps, so this computes (2^32 - max) % max.
    uint32_t rejection_threshold = -max % max;
    // Precise check
    while (low_bits < rejection_threshold) {
      long_mult = random32() * uint64_t(max);
      low_bits = uint32_t(long_mult);
    }
  }
  // The high 32 bits of the product are uniform in [0, max).
  return long_mult >> 32;
}

The core trick is the cheap probabilistic check guarding the expensive modulo operation used to calculate the precise rejection threshold. The chance that we will need the modulo is \( \frac{\textbf{max}}{2^{32}} \), which translates to "basically zero" for a lot of the common use cases[5].

But we can also end up computing rejection_threshold on every invocation. And if you look at how we compute it, you will notice that it does not depend on the generated random number, only on max. This is where a stateful object can optimize the code for generating multiple numbers; it can compute rejection_threshold once and use it every time it generates a new number.
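
A minimal sketch of such a stateful implementation, assuming an Rng whose operator() returns uniformly distributed uint32_t values (the class name is a placeholder, not the actual Catch2 code):

#include <cstdint>

class reuse_lemire {
  uint32_t m_max;
  uint32_t m_rejection_threshold;
public:
  // max must be > 0; the expensive modulo happens exactly once, here.
  explicit reuse_lemire( uint32_t max ):
    m_max( max ), m_rejection_threshold( -max % max ) {}

  template <typename Rng>
  uint32_t operator()( Rng& rng ) {
    uint64_t long_mult = rng() * uint64_t(m_max);
    uint32_t low_bits = uint32_t(long_mult);
    // No modulo on this path; the threshold is already known.
    while (low_bits < m_rejection_threshold) {
      long_mult = rng() * uint64_t(m_max);
      low_bits = uint32_t(long_mult);
    }
    return uint32_t(long_mult >> 32);
  }
};

Note that with the threshold precomputed, the cheap probabilistic check is no longer needed; every draw compares directly against the threshold.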

Will precomputing the threshold pay off? Let's write some benchmarks and see.

Benchmarks

Please note that while this post has benchmark results for both uint32_t and uint64_t, the latter are not representative of an optimized implementation. Lemire's algorithm relies on extended multiplication, but the implementation taken from Catch2 falls back to naive long multiplication for 64-bit operands. This means that the uint64_t results are much slower than they would be with an optimized implementation. They can still show the trend, but the performance difference between the two implementations is smaller than it should be.

Setup

All the code for the benchmarks is publicly available on GitHub. I took the code for the RNG and Lemire's algorithm from Catch2, and the benchmarks themselves are done using Catch2's micro-benchmarking support.

There are three benchmarks:

  1. Stacked heavily for reuse.
    The larger the target interval, the greater the chance that Lemire's algorithm will need to compute the modulo. Thus, we will generate numbers in the interval [0, numeric_limits<type>::max() - 1][6], and generate a lot of them.
  2. Stacked heavily against reuse.
    As long as we never reuse the distribution object instance, there is no advantage to precomputing the modulo. If we also always create the distribution object with a new bound, this corresponds to how it would be used during a Fisher-Yates shuffle (see the sketch after this list).
  3. Benchmark based on use in Catch2.
    Catch2 internally uses uniform int distribution in two places. Both places keep reusing the same distribution but with significantly different bounds and expected number of invocations.
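
For reference, this is the usage pattern behind the second benchmark: a hypothetical Fisher-Yates shuffle, where lemire_algorithm stands in for the benchmarked distribution type.

#include <cstddef>
#include <utility>
#include <vector>

template <typename T, typename Rng>
void fisher_yates_shuffle( std::vector<T>& values, Rng& rng ) {
  for (std::size_t i = values.size(); i > 1; --i) {
    // Every iteration needs a different upper bound, so a fresh
    // distribution instance is constructed each time and any
    // per-instance precomputation is wasted.
    lemire_algorithm<std::size_t> dist( 0, i - 1 );
    std::swap( values[i - 1], values[dist( rng )] );
  }
}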

For every benchmark, we have three contenders:

  1. Just the RNG
    This does not generate the required numbers, but it acts as a baseline to tell us what the overhead of the distribution algorithm is.
  2. Plain implementation of Lemire's algorithm
  3. Implementation of Lemire's algorithm with precomputed thresholds

The numbers in this post were taken from my personal machine with a Ryzen 7 4750 CPU, using MSVC in version 19.38.33133 (this MSVC version comes installed with VS 17.8.3) and compiling for CMake-generated Release configuration[7].

Results

Benchmark stacked for reuse

uint32_t

n | RNG | plain Lemire | reuse Lemire | \(\frac{\textbf{plain}}{\textbf{reuse}} \)
100'000 | 271.75 ± 2.34 us | 291.29 ± 3.57 us | 292.30 ± 4.15 us | 1.00
1'000'000 | 2.70 ± 0.01 ms | 2.94 ± 0.05 ms | 2.92 ± 0.03 ms | 1.01
10'000'000 | 27.06 ± 0.05 ms | 29.26 ± 0.16 ms | 29.33 ± 0.16 ms | 1.00

uint64_t

n | RNG | plain Lemire | reuse Lemire | \(\frac{\textbf{plain}}{\textbf{reuse}} \)
100'000 | 430.37 ± 8.55 us | 545.54 ± 11.43 us | 540.70 ± 10.84 us | 1.01
1'000'000 | 4.36 ± 0.03 ms | 5.42 ± 0.04 ms | 5.48 ± 0.04 ms | 0.99
10'000'000 | 43.37 ± 0.09 ms | 54.46 ± 0.11 ms | 54.45 ± 0.13 ms | 1.00

As we can see, when we stack the deck towards the reuse-optimized implementation, the reuse-optimized implementation performs …

… exactly the same as the basic implementation.

Wait what?

Benchmarking is hard: the optimizer is good at its job

As it turns out, when the optimizer sees code like this

constexpr uint32_t high = std::numeric_limits<uint32_t>::max()-1;
BENCHMARK("noreuse, iters=" + std::to_string(iters)) {
	uint32_t sum = 0;
	lemire_algorithm_no_reuse<uint32_t> dist(0, high);
	for (size_t n = 0; n < iters; ++n) {
		sum += dist(rng);
	}
	return sum;
};

it uses the fact that high is statically known to inline everything and precompute the rejection threshold during compilation, so there is no div instruction in the resulting code. Thus, the plain implementation never pays extra cost by recomputing the modulo and cannot lose.

You can play with a simplified example on Compiler Explorer. To keep the generated ASM short, the code only generates a single number, and the RNG implementation is opaque to avoid inlining it.

Since the compiler does more or less the same for the reuse-optimized implementation, neither has an advantage when the target interval is statically known. So, to measure the effect we care about, we have to hide the interval from the compiler. In the benchmark code, this is done by calling an opaque function, which returns the desired value but cannot be inlined due to lack of LTO.
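
A minimal sketch of the trick, assuming the definition lives in a separate translation unit and LTO is off (the function names are placeholders, not the actual benchmark code):

#include <cstdint>

// bounds.cpp -- a separate translation unit
uint32_t opaque_bound() {
  return 4294967294; // numeric_limits<uint32_t>::max() - 1
}

// benchmark.cpp -- only the declaration is visible here, so the
// compiler cannot precompute -bound % bound and must keep the
// runtime div instruction.
uint32_t opaque_bound();

uint32_t generate_one( RNG& rng ) {
  lemire_algorithm<uint32_t> dist( 0, opaque_bound() );
  return dist( rng );
}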

Benchmark results revisited

Benchmark stacked for reuse (again)

uint32_t

n | RNG | plain Lemire | reuse Lemire | \(\frac{\textbf{plain}}{\textbf{reuse}} \) | \(\frac{\textbf{plain - RNG}}{\textbf{reuse - RNG}} \)
100'000 | 271.68 ± 4.14 us | 335.53 ± 3.65 us | 275.80 ± 4.79 us | 1.22 | 15.50
1'000'000 | 2.69 ± 0.01 ms | 3.38 ± 0.01 ms | 2.74 ± 0.01 ms | 1.23 | 13.80
10'000'000 | 26.89 ± 0.05 ms | 33.76 ± 0.04 ms | 28.00 ± 0.04 ms | 1.21 | 6.19

uint64_t

n | RNG | plain Lemire | reuse Lemire | \(\frac{\textbf{plain}}{\textbf{reuse}} \) | \(\frac{\textbf{plain - RNG}}{\textbf{reuse - RNG}} \)
100'000 | 418.77 ± 6.33 us | 638.53 ± 11.42 us | 588.60 ± 11.42 us | 1.08 | 1.29
1'000'000 | 4.35 ± 0.04 ms | 6.26 ± 0.04 ms | 6.07 ± 0.03 ms | 1.03 | 1.11
10'000'000 | 42.51 ± 0.09 ms | 64.74 ± 0.12 ms | 59.53 ± 0.12 ms | 1.09 | 1.31

As we can see, after we stop the optimizer from optimizing out the important part, precomputing the modulo speeds up the random number generation by \(\approx 20\%\), with the distribution itself being up to \(15\times\) faster.[8] Remember that this benchmark is the best case for the stateful implementation, as we stacked everything in its favour.

Let's take a look at the opposite benchmark.

Benchmark stacked against reuse

uint32_t

n | RNG | plain Lemire | reuse Lemire | \(\frac{\textbf{plain}}{\textbf{reuse}} \) | \(\frac{\textbf{plain - RNG}}{\textbf{reuse - RNG}} \)
100'000 | 263.53 ± 3.69 us | 284.64 ± 5.10 us | 525.50 ± 3.14 us | 0.54 | 0.08
1'000'000 | 2.63 ± 0.01 ms | 2.86 ± 0.02 ms | 4.90 ± 0.01 ms | 0.58 | 0.10
10'000'000 | 26.39 ± 0.20 ms | 28.37 ± 0.08 ms | 44.97 ± 0.04 ms | 0.63 | 0.11

uint64_t

n | RNG | plain Lemire | reuse Lemire | \(\frac{\textbf{plain}}{\textbf{reuse}} \) | \(\frac{\textbf{plain - RNG}}{\textbf{reuse - RNG}} \)
100'000 | 449.11 ± 10.71 us | 630.46 ± 12.96 us | 928.24 ± 6.31 us | 0.68 | 0.38
1'000'000 | 4.54 ± 0.04 ms | 6.27 ± 0.04 ms | 8.96 ± 0.03 ms | 0.70 | 0.39
10'000'000 | 44.84 ± 0.14 ms | 62.77 ± 0.10 ms | 84.98 ± 0.07 ms | 0.74 | 0.45

Once again, the difference in the overhead of the two implementations is massive: for uint32_t, plain Lemire's overhead is 9 to 12 times smaller, and even for uint64_t, where the slow multiplication shrinks the gap, it is still more than \(2\times\) smaller. This boils down to this benchmark being an inversion of the first one. Where the first benchmark forced the plain Lemire implementation to compute the modulo for every number and let the reuse variant compute it exactly once, this benchmark forces the reuse variant to always compute the modulo, while plain Lemire avoids the computation most of the time.

The difference in the total runtime is much more significant this time around, with the reuse variant being 60-80% slower. This is a terrible result and shows that the reuse variant does not make a good default.

Benchmark(s) from Catch2

The two places in Catch2 needing uniformly distributed integers are generating uniformly distributed floating point numbers[9] and resampling benchmark measurements. The interval for the former depends on the range the user asks for, but we will use \([0, 33554430]\) and \([0, 18014398509481982]\) for the benchmark[10]. These numbers represent the upper bound on what we can expect, and thus are the most favourable case for the reuse implementation. The number of iterations is unknown, but generally, I would expect it to be low (\(<1000\)).

Resampling benchmark measurements defaults to interval \([0, 100)\) and generates 10M random numbers with the one instance it creates[11]. The user can customize both of these, but generally they won't.
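
For context, the resampling usage pattern boils down to a loop like this hypothetical sketch (the names are placeholders, and Catch2's actual resampler differs in the details):

#include <cstddef>
#include <vector>

template <typename Rng>
std::vector<double> resample_means( const std::vector<double>& samples,
                                    std::size_t n_resamples, Rng& rng ) {
  // One distribution instance, bounded by the sample count, is queried
  // for every element of every resample.
  lemire_algorithm<std::size_t> dist( 0, samples.size() - 1 );
  std::vector<double> means;
  means.reserve( n_resamples );
  for (std::size_t i = 0; i < n_resamples; ++i) {
    double sum = 0;
    for (std::size_t j = 0; j < samples.size(); ++j) {
      sum += samples[dist( rng )];
    }
    means.push_back( sum / samples.size() );
  }
  return means;
}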

Generating floats

uint32_t

n | RNG | plain Lemire | reuse Lemire | \(\frac{\textbf{plain}}{\textbf{reuse}} \) | \(\frac{\textbf{plain - RNG}}{\textbf{reuse - RNG}} \)
1 | 2.64 ± 0.11 ns | 2.86 ± 0.10 ns | 4.06 ± 0.06 ns | 0.70 | 0.15
5 | 13.20 ± 0.45 ns | 13.89 ± 0.76 ns | 13.70 ± 0.23 ns | 1.01 | 1.38
10 | 26.30 ± 0.37 ns | 27.60 ± 0.42 ns | 27.06 ± 0.45 ns | 1.02 | 1.71
100 | 262.76 ± 3.77 ns | 276.58 ± 3.71 ns | 270.62 ± 0.52 ns | 1.02 | 1.76
1000 | 2.63 ± 0.03 us | 2.75 ± 0.05 us | 2.69 ± 0.04 us | 1.02 | 2.00

uint64_t

n | RNG | plain Lemire | reuse Lemire | \(\frac{\textbf{plain}}{\textbf{reuse}} \) | \(\frac{\textbf{plain - RNG}}{\textbf{reuse - RNG}} \)
1 | 3.93 ± 0.13 ns | 8.34 ± 0.36 ns | 8.44 ± 0.10 ns | 0.99 | 0.98
5 | 20.08 ± 0.31 ns | 31.09 ± 0.50 ns | 31.35 ± 0.80 ns | 0.99 | 0.98
10 | 40.48 ± 0.94 ns | 59.29 ± 0.70 ns | 60.26 ± 4.21 ns | 0.98 | 0.95
100 | 407.88 ± 14.16 ns | 575.40 ± 8.08 ns | 580.08 ± 10.15 ns | 0.99 | 0.97
1000 | 4.08 ± 0.08 us | 5.74 ± 0.23 us | 5.69 ± 0.19 us | 1.01 | 1.03

Unsurprisingly, when you only want to get a single random (floating point) number, the plain Lemire algorithm has a large advantage. What is surprising is how few numbers we need to generate for the reuse implementation to catch up and eventually gain an advantage. Even 1000 numbers should be too low to require the plain implementation to recompute the division often: the chance per call is \(\frac{33554430}{2^{32}} \approx 0.8\%\), so we expect only about 8 recomputations in the whole run.

For uint64_t, we see the slooow multiplication implementation hide any other difference.

Benchmark resampling

uint32_t

n | RNG | plain Lemire | reuse Lemire | \(\frac{\textbf{plain}}{\textbf{reuse}} \) | \(\frac{\textbf{plain - RNG}}{\textbf{reuse - RNG}} \)
1'000'000 | 2.63 ± 0.01 ms | 2.72 ± 0.01 ms | 2.71 ± 0.01 ms | 1.00 | 1.13
10'000'000 | 26.33 ± 0.04 ms | 27.21 ± 0.05 ms | 27.19 ± 0.04 ms | 1.00 | 1.05
100'000'000 | 263.29 ± 0.17 ms | 271.99 ± 0.17 ms | 271.77 ± 0.16 ms | 1.00 | 1.03
1'000'000'000 | 2.63 ± 0.01 s | 2.83 ± 0.14 s | 2.79 ± 0.02 s | 1.01 | 1.25

uint64_t

n | RNG | plain Lemire | reuse Lemire | \(\frac{\textbf{plain}}{\textbf{reuse}} \) | \(\frac{\textbf{plain - RNG}}{\textbf{reuse - RNG}} \)
1'000'000 | 4.18 ± 0.01 ms | 5.79 ± 0.02 ms | 5.84 ± 0.01 ms | 0.99 | 0.97
10'000'000 | 41.87 ± 0.04 ms | 57.77 ± 0.15 ms | 58.41 ± 0.07 ms | 0.99 | 0.96
100'000'000 | 418.45 ± 0.23 ms | 577.70 ± 0.58 ms | 585.25 ± 1.64 ms | 0.99 | 0.95
1'000'000'000 | 4.19 ± 0.02 s | 5.80 ± 0.02 s | 5.78 ± 0.01 s | 1.00 | 1.01

For benchmark resampling, there are minimal differences between the two implementations. Plain Lemire implementation is very unlikely to pay the extra division cost due to the small upper bound, and the reuse implementation easily amortizes the additional cost over the many generated numbers. When generating a lot[12] of numbers, the reuse implementation starts winning again, but only by small margins.

Conclusion

So, what did we learn from the benchmarks?

First, if the compiler can see the interval bounds during compilation time, it does not matter which of the two shown implementations we pick, because it will optimize the division away in either case. In other words, a function like this

uint32_t roll_D6(RNG& rng) {
  lemire_algorithm<uint32_t> dist(1, 6);
  return dist(rng);
}

will always perform optimally with either implementation. But when the compiler cannot see the bounds, the performance difference can be massive. If we craft benchmarks specifically to advantage one implementation over the other, we see more than a 10x difference between the overheads imposed by the two implementations.

From how we crafted the edge-case benchmarks, we also see that if we can expose only one implementation, the one optimized for reuse is a bad default. The benchmark favouring reuse is contrived, and as we make it less so, the advantage of the reuse implementation decreases. On the other hand, the benchmark favouring plain Lemire is based on a real-world usage pattern, the Fisher-Yates shuffle.

We have also learned that Lemire's algorithm strongly relies on fast extended multiplication. If we do not have optimized extended multiplication for a specific width, other implementation considerations do not matter, and Lemire's algorithm becomes much slower.
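
For illustration, here is roughly what fast extended multiplication looks like for 64-bit operands. This is a sketch assuming GCC/Clang's unsigned __int128 where available; the names are mine, not Catch2's.

#include <cstdint>

struct u128 {
  uint64_t high;
  uint64_t low;
};

// 64x64 -> 128-bit multiplication. With compiler support this is a
// single wide multiply; the fallback is the schoolbook long
// multiplication over 32-bit halves that makes the uint64_t numbers
// in this post slower than they should be.
inline u128 extended_mult( uint64_t lhs, uint64_t rhs ) {
#if defined(__SIZEOF_INT128__)
  auto product = (unsigned __int128)lhs * rhs;
  return { uint64_t(product >> 64), uint64_t(product) };
#else
  uint64_t a = lhs >> 32, b = lhs & 0xFFFFFFFF;
  uint64_t c = rhs >> 32, d = rhs & 0xFFFFFFFF;
  uint64_t lo_lo = b * d, hi_lo = a * d;
  uint64_t lo_hi = b * c, hi_hi = a * c;
  // Sum the middle partial products, tracking the carry out.
  uint64_t cross = (lo_lo >> 32) + (hi_lo & 0xFFFFFFFF) + lo_hi;
  return { hi_hi + (hi_lo >> 32) + (cross >> 32),
           (cross << 32) | (lo_lo & 0xFFFFFFFF) };
#endif
}

On MSVC, the same effect can be had with the _umul128 intrinsic.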

From the workload-specific benchmarks, we've learned that Catch2 is indeed better off with the reuse-optimized implementation.[13] Catch2 does not use random shuffle[14], and while the performance improvement will be slight, it is functionally free because Catch2 had to implement its own distribution anyway.


That's it for this post. I will write at least one follow-up post about the case where we care about the statistical properties of the generated numbers but not their reproducibility. This means that we can use completely different algorithms, rather than looking into different implementation strategies for the same one.



  1. Previously, I was on episode 248, where the topic was "Catch2 and std::random". ↩︎

  2. Technically, some distributions already generate more than one number at a time, e.g. the Marsaglia polar method generates two normally distributed numbers per run. ↩︎

  3. I also implemented reproducible uniform floating point distribution for Catch2. This might come up later in this post. ↩︎

  4. The transformation from the [0, max) interval to an [a, b) interval is trivial: use max = b - a and add a to the result. For a closed interval [a, b], use max = b - a + 1. ↩︎

  5. Consider simulating a fair D6. In that case, \(\text{max} = 6\), and the chance of any call having to calculate the exact rejection threshold is \(\frac{3}{2^{31}}\). But even in less nice cases, the chance that we have to calculate the modulo is small. When randomly shuffling 100k elements, there is still \(\approx 31\%\) chance of not having to compute modulo at all[15]. ↩︎

  6. Why numeric_limits::max() - 1 and not numeric_limits::max()? Because when all values in the type are valid, returning the bits from the random number generator directly is a common optimization. Thus, we ask for the largest number in the type to be invalid, to force the actual rejection algorithm to run. ↩︎

  7. Visual Studio solutions generated by CMake do not have LTO enabled for Release builds. However, all relevant functions are visible and inlinable in the benchmarks, so there should be no difference from this. ↩︎

  8. Why look at both the total runtime and the distribution overhead? Because both are interesting. The overall runtime is what the end user sees, but the speed of the RNG largely dominates it. Meanwhile, the difference in the distribution overhead tells us what effect the implementation choices have, but it does not tell us whether the difference matters in practice. ↩︎

  9. Catch2 implements Frédéric Goualard's algorithm for drawing uniformly distributed floating point numbers in an interval. As part of the process, it generates a uniformly distributed integer first. ↩︎

  10. \(33554430\) (or \(2^{25} - 2\)) is how many equidistant floats there are between -FLT_MAX and FLT_MAX. \(18014398509481982\) (or \(2^{54} - 2\)) is how many equidistant doubles there are between -DBL_MAX and DBL_MAX. ↩︎

  11. This is because the default is to perform 100k resamples of 100 samples each. Selecting each sample requires generating a random number first. ↩︎

  12. Originally, I thought benchmarking with 1'000'000'000 numbers was unreasonable, but when I was writing this post, someone came onto Catch2's Discord for help with slow benchmarks. They set benchmarking samples to 2000, which means that just preparing all the resamplings required generating 2 (short) billion numbers from the same distribution. ↩︎

  13. But I do need to implement 64 -> 128-bit multiplication before I can use it for resampling benchmark results, or it will be a significant slowdown from the currently used std::uniform_int_distribution. ↩︎

  14. Catch2 supports random shuffling of tests, but it is not implemented as a random shuffle to guarantee subset invariance during the shuffle. ↩︎

  15. You could also decrease the chance of needing modulo to almost zero by using 64-bit indices to shuffle the elements, but doubling the output width of the generator will likely cost you more than the occasional modulo would. ↩︎