Random Numbers - The Coding Nest

Random distributions are not one-size-fits-all (part 2.5)

Martin Hořeňovský — Mon, 01 Jul 2024 12:05:28 GMT

I recently realized that I could try to implement the reuse-oriented distributions from part 2 using libdivide instead of native division (modulo). This post will quickly go through the results of benchmarking this approach.

Introduction

If you are not familiar with libdivide, it is a library for optimizing integer division in workloads where you will be using the same runtime^[1] divisor multiple times. libdivide does not support computing modulo, but we can fake it by substituting n % d for n - (n / d) * d.

In the distribution reuse workload from part 2, both the OpenBSD and the Java algorithm compute modulo using the same modulus every time, so precomputing a faster division has the potential to improve performance.

Benchmarks

For this post, I just extended the benchmark for generating many random variates from the same range from part 2. For more details, look there.

Results

As in part 2, all running time shown is in time per single element, and you can click all images for a fullscreen version. There is also once again a separate interactive graph with all data and distributions.

Interestingly, using libdivide is not always better. It provides a significant speed-up for small ranges but is a slowdown for large ranges. This is interesting, because it means that for large divisors, the performance of the hardware divider is close to the performance of multiplication^[2].

Compared to Lemire's algorithm, OpenBSD + libdivide no longer loses significantly for small ranges. But it also does not win by a meaningful margin for bigger ranges. Technically, it wins against Lemire's with slower multiplication, but that is a deeply unfair comparison; both algorithms now rely on 64x64->128 extended multiplication, so we cannot use slower implementation just for one of them.

Conclusion

Using libdivide is not a strict win for either OpenBSD or Java^[3] algorithm. In a world where Lemire's algorithm does not exist, it would be an interesting improvement, as it evens out the performance differences between the different output range sizes.

However, Lemire's algorithm does exist, and using libdivide turns the OpenBSD algorithm into a worse Lemire's algorithm. OpenBSD + libdivide replaces the modulo operation with an extended multiplication, addition, shift, and another multiplication + subtraction. Lemire's algorithm uses a clever insight about extended multiplication + downshift to range reduce the generated random number directly.

If the divisor is known at compile time, the compiler will optimize it for you. ↩︎
libdivide works by doing extended multiplication between the input and a precomputed constant, followed by some adds and shifts to fix up the results^[4]. For our use case, that is, taking modulo, we add extra multiplication and subtraction on top of this. ↩︎
I did not talk about results from Java's algorithm in this post for a simple reason: it was worse for this use case in part 2, and using libdivide has changed nothing about the underlying reason for this. I included the data in the big graph, but the tl;dr is that using libdivide for Java's algorithm is the same as with OpenBSD: better performance for small ranges, worse performance for large ranges. ↩︎
Does this sound weirdly familiar? It should; Lemire's algorithm is based around a similar emul + downshift trick to reduce the input range into a predefined range. ↩︎

Random distributions are not one-size-fits-all (part 2)

Martin Hořeňovský — Mon, 01 Apr 2024 14:13:42 GMT

Updated 30.06.2024: I noticed a bug in the Java distribution implementation, which changed the performance profile. I reran the benchmarks and updated text below to reflect the reality.

In the previous post, we looked at Lemire's algorithm for generating uniformly distributed integers and how it can be specialized for different use cases. In this post, we will look at different algorithms for generating uniformly distributed random numbers and determine whether they fit specific use cases better than Lemire's algorithm.

Introduction

The original paper that introduced Lemire's algorithm compared it to two other algorithms, dubbed "OpenBSD" and "Java", based on the project they were taken from. Lemire's algorithm has won, by quite a lot. But when I was running the benchmarks for part 1, I noticed that when generating uint64_t, the expected differences decreased, or even disappeared entirely, due to the slow implementation of extended 64x64 -> 128-bit multiplication.

That got me thinking: Would Lemire's algorithm still win if we did not have access to hardware that has native support for 64x64->128 multiplication? Alternatively, how slow would the multiplication have to be for the implementations relying on modulo to win?

Let's first look at the different implementations of extended multiplication we will be using, and then we will look at the alternative algorithms for generating uniformly distributed random integers.

Extended multiplication

This post will benchmark three different approaches towards extended multiplication.

Naive
This is the approach we all learned in elementary school as "long multiplication". Since we are working on a computer, we will use 32-bit integers as the individual digits. The advantage of this approach is that it is evidently correct and easy to implement.
Optimized
You can find this approach floating around in places like Hacker's Delight. It also implements long multiplication, but unlike the naive approach, we do not keep to strictly 32-bit digits. This allows us to significantly reduce the number of additions that happen before we get the result. The advantage of this approach is that it is reasonably performant and also widely portable.
Intrinsic
This approach uses the compiler intrinsic for extended multiplication. This leads to great results^[1] on CPUs that natively support 64x64 extended multiplication, but it is not available on all platforms, e.g. 32-bit x86 or ARM.

You can look at all of these on Compiler Explorer. When you look at the generated assembly, the optimized approach does less work than the naive approach. There is still the same total number of multiplications, but there are fewer additions, and fewer shifts and the critical chain is shorter.

Obviously, not all ASM instructions are equal, and counting them does not tell the whole story. So, we will benchmark all three approaches and compare their results later.

Alternative algorithms

Let's take a look at the pseudo-implementations of the two alternative algorithms.

OpenBSD

uint64_t openbsd(uint64_t max) {
    uint64_t threshold = (-max) % max;
    uint64_t x;
    do {
        x = random64();
    } while (x < threshold);
    return x % max;
}

As written, this algorithm needs to compute modulo twice per returned number, which virtually ensures terrible performance. Lemire's algorithm innovation is that, in most cases, it does not need even 1 modulo.

However, if we know we will be generating numbers from the same distribution repeatedly, one of the modulos can be amortized as it depends purely on the distribution range (see part 1). The specialized implementation will require only one modulo per generated number, thus providing reasonable performance.

For use cases where we cannot amortize the first modulo for computing threshold, we can safely assume that this algorithm is not competitive.

Java

uint64_t java(uint64_t max) {
    uint64_t x = random64();
    uint64_t r = x % max;
    while (x - r > -max) {
        x = random64();
        r = x % max;
    }
    return r;
}

This algorithm needs on average 1 modulo per generated number, but it might require multiple. Unlike the OpenBSD and Lemire algorithms, there are no optimizations we can apply for generating multiple numbers from the same distribution^[2].

We should expect this to be about equal to the OpenBSD algorithm for generating many numbers in the same distribution, and much better for generating numbers from different distributions.

Benchmarks

Once again, all the benchmarking code is on GitHub. In fact, it is in the same repository. For this post, we have three different benchmarks:

Extended multiplication
Generating multiple random numbers from the same range
Generating random numbers from different ranges

Setup

Just as with the previous post, the numbers in this post were taken from my personal machine with a Ryzen 7 4750 CPU, but for this post, I am using slightly newer MSVC in version 19.39.33520 (this MSVC version comes installed with VS 17.9.1) and compiling for CMake-generated Release configuration^[3].

When benchmarking extended multiplication, we generate two arrays of varying sizes, multiply each pair, and sum the results to ensure that the computation is not optimized away. We do this because we cannot benchmark the multiplication of a single pair of numbers, as the overhead from Catch2's benchmarking becomes a significant part of the total runtime.

We use the same basic setup as in the previous part for the distribution benchmarks. There will be one benchmark where we generate a number from a different distribution every time and one where we generate many numbers from the same distribution. However, unlike the previous post, we do not attempt to model some real-world usage and instead look for bounds where Lemire's algorithm loses.

Because the previous post has shown that we can gain performance from specializing the implementation for generating multiple numbers from one distribution, I implemented each distribution in this post twice, once as standard for the no-reuse benchmark and once specialized for the reuse benchmark.

Results

All running time shown is in time per single element. You can click all images for a fullscreen version.

Extended multiplication

You can also play around with the interactive version of the graph

We see some slowdown for long inputs. This slowdown is not interesting to us, as it is caused by running out of space in lower levels of cache, but in distribution code, we will only ever multiply two numbers at once. Looking at the results for small inputs, we see that using naive multiplication is more than 4 times slower than getting the compiler to "do the right thing" using intrinsics. Even if we optimize our implementation, there is still a 2.6x slowdown from using portable code.

This shows that other algorithms might be able to win if we do not use native 64x64 extended multiplication.

Single-use distributions

This is the case where we generate every number from a different range.

You can also play around with the interactive version of the graph

As expected, the OpenBSD algorithm is not competitive for this use case. It takes roughly twice as long as the Java one because it pays the cost of the modulo twice per number. But Java is not competitive either. Lemire's algorithm is much faster even when it uses the naive extended multiplication implementation, and with the best extended multiplication implementation, it is roughly twice as fast.

Let's take a look at generating a lot of numbers from the same range instead.

Reused distributions

Updated 30.06.2024: I noticed a bug in the Java distribution implementation, which changed its performance profile. I reran the benchmarks and updated text below to reflect the reality.

This is the case where we generate a lot of numbers from the same range.

I originally collected a lot more data than is shown below, but I could not make the graph with all of them readable in image form. If you want to see the data in detail, here is an interactive graph.

You can also play around with the interactive version of the graph

The first thing we see here is that the time needed to generate one uniformly distributed number varies across different bounds. There are two reasons for this. The first is that the bounds change how many numbers from the underlying random number generators will be rejected^[4]. The second reason is that the DIV^[5] instruction has variable latency based on the input values.

Because Lemire's algorithm does not use modulo, the differences in its runtime come purely from the effect of rejection frequency. For N == UINT64_MAX / 2 + 1, each output from the random number generator has only \(\approx 50\%\) chance of being usable, which causes the massive spike there. Both OpenBSD and Java algorithms use modulo, and we can see that their runtime decreases^[6] even between bounds with roughly the same amount of rejection.

When we compare the distributions between themselves, we can see that the OpenBSD algorithm consistently wins over Java. This is a reversal of the results from the previous use case and is caused by the Java algorithm needing to compute one modulo per candidate number from a random number generator. In comparison, OpenBSD only needs one modulo per returned variate.

We can also see the OpenBSD algorithm slightly winning over Lemire's algorithm for specific, very high bounds. However, Lemire's algorithm is the better option for a generic use case, as it outperforms OpenBSD significantly for low bounds and performs equally or only slightly worse for high bounds.

The situation would be more complicated if we forced Lemire's algorithm to use portable extended multiplication. In this case, the OpenBSD algorithm achieves performance parity with Lemire's at lower bounds than against intrinsic extended multiplication and, for higher bounds, actually outperforms it.

Conclusion

In the shuffle-like workload, Lemire's algorithm outperforms both Java and OpenBSD algorithms by a significant margin. In fact, OpenBSD performs so badly for this workload type that if you have to pick a single algorithm for all your needs, you should never pick the OpenBSD algorithm.

The OpenBSD algorithm can outperform Lemire's when generating many numbers from the same, very large, range. However, the biggest measured win for OpenBSD is ~8%, while it loses by much bigger margins for small ranges, so Lemire's algorithm is the better workload-independent choice. However, we can gain performance by choosing and picking different algorithms based on the size of the range from which to generate. This links back to my old claim that an ideal library should expose the different distribution algorithms under their names rather than under abstract names based on their statistical properties. Back then I based this opinion on allowing future extensions to the standard library, as people invent better algorithms to generate random numbers with specific properties, but these results provide another reason; usage-dependent performance differences.

That's it for this part. The next part will be about generating uniformly distributed floating point numbers and how different algorithms lead not just to different performance, but to fundamentally different generated numbers.

For x64 the emul turns into a single mul instruction (+ few movs to move the results into the right registers), while for ARM64 it turns into a umulh and mul pair. ↩︎
Technically, when writing a C++-style distribution object, we can precompute the max value from the user-provided [a, b] range. This optimization has no measurable effect, but I did this in the linked benchmark code anyway, just for completeness's sake. ↩︎
Visual Studio solutions generated by CMake do not have LTO enabled for Release builds. However, all relevant functions are visible and inlinable in the benchmarks, so there should be no difference from this. ↩︎
Remember that generating uniformly distributed random numbers from a random bit generator requires a range reduction and rejecting the "surplus" numbers that cannot be fully mapped to the target range. ↩︎
On x64, div returns the division result and the remainder. On ARM64, udiv returns only the division result (and needs to be followed by msub to get the remainder), but it also has variable latency. ↩︎
The larger the divisor, the faster the modulo runs. I threw together a benchmark for modulo performance over the different bounds. Here are the results as an interactive graph (as still image). ↩︎

Random distributions are not one-size-fits-all (part 1)

Martin Hořeňovský — Sat, 13 Jan 2024 23:52:51 GMT

Recently, I was once again on CppCast, and once again^[1], I talked about Catch2 and C++'s standard library support for random numbers. One of the things I said was that random distributions are not one-size-fits-all.

This post is about how this applies to Lemire's algorithm, which is the state of the art for generating uniformly distributed integers in specific intervals.

Introduction

In C++, separates the concern of generating random numbers between Random Number Engines, which generate integers in some predefined range, and distributions, which turn these into other numbers in the range requested by the user. This is a powerful, expert-friendly design, even though the actual standardized library leaves a lot to be desired.

Notably, the distributions are specified as types and are allowed to be stateful. This means that a distribution could be written to optimize for repeatedly generating numbers from the same instance rather than generating a single number as fast as possible^[2].

This post will look specifically at Lemire's algorithm for generating uniformly distributed integers. This is because I recently implemented reproducible uniform integer distribution for Catch2 using Lemire's algorithm, so I can use my own code for the benchmarks^[3].

Lemire's algorithm is the current state of the art for generating uniformly distributed random integers in an interval [0, max)^[4]. I won't go over the proof of the algorithm in this post; if you want to know why it works, go read the paper, but we will look at a pseudo-implementation.

uint32_t lemire( uint32_t max ) {
  uint64_t long_mult = random32() * uint64_t(max);
  uint32_t low_bits = uint32_t(long_mult);
  // Cheap probabilistic check
  if (low_bits < max) {
    uint32_t rejection_threshold = -max % max;
    // Precise check
    while (low_bits < rejection_threshold) {
      long_mult = random32() * uint64_t(max);
      low_bits = uint32_t(long_mult);
    }
  }
  return long_mult >> 32;
}

The core trick is in the probabilistic check guarding the expensive modulo operation used to calculate precise rejection threshold. The chance that we will need the modulo is \( \frac{\textbf{max}}{2^{32}} \), which translates to "basically zero" for a lot of the common use cases^[5].

But we can also end up computing rejection_threshold on every invocation. And if you look at how we compute it, you will notice that it does not depend on the generated random number, only on max. This is where a stateful object can optimize the code for generating multiple numbers; it can compute rejection_threshold once and use it every time it generates a new number.

Will precomputing the threshold pay off? Let's write some benchmarks and see.

Benchmarks

Please note that while this post has benchmark results for uint32_t and uint64_t, the latter are not representative of an optimized implementation. Lemire's algorithm uses extended multiplication, but the used implementation from Catch2 uses naive long-multiplication. This means that the results for uint64_t are much slower than they would be with optimized implementation. Thus, they can show the trend, but the performance difference between the two implementations is smaller than it should be.

Setup

All the code for the benchmarks is publicly available on GitHub. I took the code for the RNG and Lemire's algorithm from Catch2, the benchmarks themselves are done using Catch2's micro benchmarking support.

There are three benchmarks:

Stacked heavily for reuse.
The larger the target interval, the greater the chance Lemire's algorithm will need to compute modulo. Thus, we will generate numbers in interval [0, numeric_limits - 1)^[6], and generate lot of them.
Stacked heavily against reuse.
As long as we never reuse the distribution object instance, there is no advantage to precomputing the modulo. If we also always create the distribution object with a new bound, this corresponds to how it would be used during a Fisher-Yates shuffle.
Benchmark based on use in Catch2.
Catch2 internally uses uniform int distribution in two places. Both places keep reusing the same distribution but with significantly different bounds and expected number of invocations.

For every benchmark, we have three contenders:

Just the RNG
This does not generate the required numbers, but it acts as a baseline to tell us what the overhead of the distribution algorithm is.
Plain implementation of Lemire's algorithm
Implementation of Lemire's algorithm with precomputed thresholds

The numbers in this post were taken from my personal machine with a Ryzen 7 4750 CPU, using MSVC in version 19.38.33133 (this MSVC version comes installed with VS 17.8.3) and compiling for CMake-generated Release configuration^[7].

Results

Benchmark stacked for reuse

uint32_t

n	RNG	plain Lemire	reuse Lemire	\(\frac{\textbf{plain}}{\textbf{reuse}} \)
100'000	271.75 ± 2.34 us	291.29 ± 3.57 us	292.30 ± 4.15 us	1.00
1'000'000	2.70 ± 0.01 ms	2.94 ± 0.05 ms	2.92 ± 0.03 ms	1.01
10'000'000	27.06 ± 0.05 ms	29.26 ± 0.16 ms	29.33 ± 0.16 ms	1.00

uint64_t

n	RNG	plain Lemire	reuse Lemire	\(\frac{\textbf{plain}}{\textbf{reuse}} \)
100'000	430.37 ± 8.55 us	545.54 ± 11.43 us	540.70 ± 10.84 us	1.01
1'000'000	4.36 ± 0.03 ms	5.42 ± 0.04 ms	5.48 ± 0.04 ms	0.99
10'000'000	43.37 ± 0.09 ms	54.46 ± 0.11 ms	54.45 ± 0.13 ms	1.00

As we can see, when we stack the deck towards the reuse-optimized implementation, the reuse-optimized implementation performs …

…

… exactly the same as the basic implementation.

Wait what?

Benchmarking is hard: the optimizer is good at its job

As it turns out, when the optimizer sees code like this

constexpr uint32_t high = std::numeric_limits::max()-1;
BENCHMARK("noreuse, iters=" + std::to_string(iters)) {
	uint32_t sum = 0;
	lemire_algorithm_no_reuse dist(0, high);
	for (size_t n = 0; n < iters; ++n) {
		sum += dist(rng);
	}
	return sum;
};

it uses the fact that high is statically known to inline everything and precompute the rejection threshold during compilation, so there is no div instruction in the resulting code. Thus, the plain implementation never pays extra cost by recomputing the modulo and cannot lose.

You can play with a simplified example on Compiler Explorer. To keep the generated ASM short, the code only generates a single number, and the RNG implementation is opaque to avoid inlining it.

Since the compiler does more or less the same for the reuse-optimized implementation, neither has an advantage when the target interval is statically known. So, to measure the effect we care about, we have to hide the interval from the compiler. In the benchmark code, this is done by calling an opaque function, which returns the desired value but cannot be inlined due to lack of LTO.

Benchmark results revisited

Benchmark stacked for reuse (again)

uint32_t

n	RNG	plain Lemire	reuse Lemire	\(\frac{\textbf{plain}}{\textbf{reuse}} \)	\(\frac{\textbf{plain - RNG}}{\textbf{reuse - RNG}} \)
100'000	271.68 ± 4.14 us	335.53 ± 3.65 us	275.80 ± 4.79 us	1.22	15.50
1'000'000	2.69 ± 0.01 ms	3.38 ± 0.01 ms	2.74 ± 0.01 ms	1.23	13.80
10'000'000	26.89 ± 0.05 ms	33.76 ± 0.04 ms	28.00 ± 0.04 ms	1.21	6.19

uint64_t

n	RNG	plain Lemire	reuse Lemire	\(\frac{\textbf{plain}}{\textbf{reuse}} \)	\(\frac{\textbf{plain - RNG}}{\textbf{reuse - RNG}} \)
100'000	418.77 ± 6.33 us	638.53 ± 11.42 us	588.60 ± 11.42 us	1.08	1.29
1'000'000	4.35 ± 0.04 ms	6.26 ± 0.04 ms	6.07 ± 0.03 ms	1.03	1.11
10'000'000	42.51 ± 0.09 ms	64.74 ± 0.12 ms	59.53 ± 0.12 ms	1.09	1.31

As we can see, after we stop the optimizer from optimizing out the important part, precomputing the modulo speeds up the random number generation by \(\approx 20\%\), with the distribution itself being up to \(15\times\) faster.^[8] Remember that this benchmark is the best case for the stateful implementation, as we stacked everything in its favour.

Let's take a look at the opposite benchmark.

Benchmark stacked against reuse

uint32_t

n	RNG	plain Lemire	reuse Lemire	\(\frac{\textbf{plain}}{\textbf{reuse}} \)	\(\frac{\textbf{plain - RNG}}{\textbf{reuse - RNG}} \)
100'000	263.53 ± 3.69 us	284.64 ± 5.10 us	525.50 ± 3.14 us	0.54	0.08
1'000'000	2.63 ± 0.01 ms	2.86 ± 0.02 ms	4.90 ± 0.01 ms	0.58	0.10
10'000'000	26.39 ± 0.20 ms	28.37 ± 0.08 ms	44.97 ± 0.04 ms	0.63	0.11

uint64_t

n	RNG	plain Lemire	reuse Lemire	\(\frac{\textbf{plain}}{\textbf{reuse}} \)	\(\frac{\textbf{plain - RNG}}{\textbf{reuse - RNG}} \)
100'000	449.11 ± 10.71 us	630.46 ± 12.96 us	928.24 ± 6.31 us	0.68	0.38
1'000'000	4.54 ± 0.04 ms	6.27 ± 0.04 ms	8.96 ± 0.03 ms	0.70	0.39
10'000'000	44.84 ± 0.14 ms	62.77 ± 0.10 ms	84.98 ± 0.07 ms	0.74	0.45

Once again, the difference in the overhead of the two implementations is massive, with plain Lemire being 9 to 12 times faster. This boils down to this benchmark being an inversion of the first one. Where the first benchmark forced the plain Lemire implementation to compute modulo for every number and let the reuse variant compute it exactly once, this benchmark forces the reuse variant to always compute modulo while plain Lemire avoids the computation most of the time.

The difference in the total runtime is much more significant this time around, with the reuse variant being 60-80% slower. This is a terrible result and shows that the reuse variant does not make a good default.

Benchmark(s) from Catch2

The two places in Catch2 needing uniformly generated integers are generating uniformly distributed floating point numbers^[9], and resampling benchmark measurements. The interval for the former depends on the range the user asks for, but we will use \([0, 33554430]\) and \([0, 18014398509481982]\) for the benchmark^[10]. These numbers present the upper bound on what we can expect and thus are most favourable for the reuse implementation. The number of iterations is unknown, but generally, I would expect it to be low (\(<1000\)).

Resampling benchmark measurements defaults to interval \([0, 100)\) and generates 10M random numbers with the one instance it creates^[11]. The user can customize both of these, but generally they won't.

Generating floats

uint32_t

n	RNG	plain Lemire	reuse Lemire	\(\frac{\textbf{plain}}{\textbf{reuse}} \)	\(\frac{\textbf{plain - RNG}}{\textbf{reuse - RNG}} \)
1	2.64 ± 0.11 ns	2.86 ± 0.10 ns	4.06 ± 0.06 ns	0.70	0.15
5	13.20 ± 0.45 ns	13.89 ± 0.76 ns	13.70 ± 0.23 ns	1.01	1.38
10	26.30 ± 0.37 ns	27.60 ± 0.42 ns	27.06 ± 0.45 ns	1.02	1.71
100	262.76 ± 3.77 ns	276.58 ± 3.71 ns	270.62 ± 0.52 ns	1.02	1.76
1000	2.63 ± 0.03 us	2.75 ± 0.05 us	2.69 ± 0.04 us	1.02	2.00

uint64_t

n	RNG	plain Lemire	reuse Lemire	\(\frac{\textbf{plain}}{\textbf{reuse}} \)	\(\frac{\textbf{plain - RNG}}{\textbf{reuse - RNG}} \)
1	3.93 ± 0.13 ns	8.34 ± 0.36 ns	8.44 ± 0.10 ns	0.99	0.98
5	20.08 ± 0.31 ns	31.09 ± 0.50 ns	31.35 ± 0.80 ns	0.99	0.98
10	40.48 ± 0.94 ns	59.29 ± 0.70 ns	60.26 ± 4.21 ns	0.98	0.95
100	407.88 ± 14.16 ns	575.40 ± 8.08 ns	580.08 ± 10.15 ns	0.99	0.97
1000	4.08 ± 0.08 us	5.74 ± 0.23 us	5.69 ± 0.19 us	1.01	1.03

Unsurprisingly, when you only want to get a single random (floating point) number, the plain Lemire algorithm has a large advantage. What is surprising is how few we need to generate for the reuse implementation to catch up and eventually gain an advantage. Even 1000 numbers should be too low to require the plain implementation to recompute the division repeatedly.

For uint64_t, we see the slooow multiplication implementation hide any other difference.

Benchmark resampling

uint32_t

n	RNG	plain Lemire	reuse Lemire	\(\frac{\textbf{plain}}{\textbf{reuse}} \)	\(\frac{\textbf{plain - RNG}}{\textbf{reuse - RNG}} \)
1'000'000	2.63 ± 0.01 ms	2.72 ± 0.01 ms	2.71 ± 0.01 ms	1.00	1.13
10'000'000	26.33 ± 0.04 ms	27.21 ± 0.05 ms	27.19 ± 0.04 ms	1.00	1.05
100'000'000	263.29 ± 0.17 ms	271.99 ± 0.17 ms	271.77 ± 0.16 ms	1.00	1.03
1'000'000'000	2.63 ± 0.01 s	2.83 ± 0.14 s	2.79 ± 0.02 s	1.01	1.25

uint64_t

n	RNG	plain Lemire	reuse Lemire	\(\frac{\textbf{plain}}{\textbf{reuse}} \)	\(\frac{\textbf{plain - RNG}}{\textbf{reuse - RNG}} \)
1'000'000	4.18 ± 0.01 ms	5.79 ± 0.02 ms	5.84 ± 0.01 ms	0.99	0.97
10'000'000	41.87 ± 0.04 ms	57.77 ± 0.15 ms	58.41 ± 0.07 ms	0.99	0.96
100'000'000	418.45 ± 0.23 ms	577.70 ± 0.58 ms	585.25 ± 1.64 ms	0.99	0.95
1'000'000'000	4.19 ± 0.02 s	5.80 ± 0.02 s	5.78 ± 0.01 s	1.00	1.01

For benchmark resampling, there are minimal differences between the two implementations. Plain Lemire implementation is very unlikely to pay the extra division cost due to the small upper bound, and the reuse implementation easily amortizes the additional cost over the many generated numbers. When generating a lot^[12] of numbers, the reuse implementation starts winning again, but only by small margins.

Conclusion

So, what did we learn from the benchmarks?

First, if the compiler can see the interval bounds during compilation time, it does not matter which of the two shown implementations we pick, because it will optimize the division away in either case. In other words, a function like this

uint32_t roll_D6(RNG& rng) {
	lemire_algorithm dist(1, 6);    
    return dist(rng);
}

will always perform optimally with either implementation. But when the compiler cannot see the bounds, their performance difference can be massive. If we craft benchmarks specifically to advantage one implementation over the other, we see more than a 10x difference between the overhead imposed by the two implementations.

From how we crafted the edge case benchmarks, we also see that if we can expose only one implementation, the implementation optimized for reuse is a bad default. The benchmark favouring reuse is contrived, and as we make it less so, the advantage of the reuse implementation decreases. On the other hand, the benchmark favouring plain Lemire is based on a real-world usage pattern representing the Fisher-Yates shuffle.

We have also learned that Lemire's algorithm strongly relies on fast extended multiplication. If we do not have optimized extended multiplication for a specific width, other implementation considerations do not matter, and Lemire's algorithm becomes much slower.

From the workload-specific benchmarks, we've learned that Catch2 is indeed better off with the reuse-optimized implementation.^[13] Catch2 does not use random shuffle^[14], and while the performance improvement will be slight, it is functionally free because Catch2 had to implement its own distribution anyway.

That's it for this post. I will write at least one follow-up post about the case where we care about the statistical properties of the generated numbers but not their reproducibility. This means that we can use completely different algorithms, rather than looking into different implementation strategies for the same one.)

Previously, I was on for episode 248, and the topic was "Catch2 and std::random". ↩︎
Technically, some distributions already generate more than one number at a time, e.g. the Marsaglia polar method generates two normally distributed numbers per run. ↩︎
I also implemented reproducible uniform floating point distribution for Catch2. This might come up later in this post. ↩︎
The transformation from [0, max) interval to an [a, b) interval is trivial; just add a to the result. For [a, b], you use max = b + 1. ↩︎
Consider simulating a fair D6. In that case, \(\text{max} = 6\), and the chance of any call having to calculate the exact rejection threshold is \(\frac{3}{2^{31}}\). But even in less nice cases, the chance that we have to calculate the modulo is small. When randomly shuffling 100k elements, there is still \(\approx 31\%\) chance of not having to compute modulo at all^[15]. ↩︎
Why numeric_limits::max() - 1 and not numeric_limits::max()? Because when all values in the type are valid, returning bits from the random number generator directly is a common optimization. Thus, we ask for the largest number in type to be invalid to force the actual rejection algorithm to run. ↩︎
Visual Studio solutions generated by CMake do not have LTO enabled for Release builds. However, all relevant functions are visible and inlinable in the benchmarks, so there should be no difference from this. ↩︎
Why look at both the total runtime and the distribution overhead? Because both are interesting. The overall runtime is what the end user sees, but the speed of the RNG largely dominates it. Meanwhile, the difference in the distribution overhead tells us what effect the implementation choices have, but it does not tell us whether the difference matters in practice. ↩︎
Catch2 implements Frédéric Goualard's algorithm for drawing uniformly distributed floating point numbers in an interval. As part of the process, it generates a uniformly distributed integer first. ↩︎
\(33554430\) (or \(2^{25} - 2)\) is how many equidistant floats are there between -FLT_MAX and FLT_MAX. \(18014398509481982\) (or \(2^{54} - 2\)) is how many equidistant doubles are there between -DBL_MAX and DBL_MAX. ↩︎
This is because the default is to perform 100k resamples of 100 samples each. Selecting each sample requires generating a random number first. ↩︎
Originally, I thought benchmarking with 1'000'000'000 numbers was unreasonable, but when I was writing this post, someone came onto Catch2's Discord for help with slow benchmarks. They set benchmarking samples to 2000, which means that just preparing all the resamplings required generating 2 (short) billion numbers from the same distribution. ↩︎
But I do need to implement 64 -> 128-bit multiplication before I can use it for resampling benchmark results, or it will be a significant slowdown from the currently used std::uniform_int_distribution. ↩︎
Catch2 supports random shuffling of tests, but it is not implemented as a random shuffle to guarantee subset invariance during the shuffle. ↩︎
You could also decrease the chance of needing modulo to almost zero by using 64-bit indices to shuffle the elements, but doubling the output width of the generator will likely cost you more than the occasional modulo would. ↩︎

Generating random numbers using C++ standard library: the solutions

Martin Hořeňovský — Sat, 23 May 2020 19:59:01 GMT

Last week I wrote about the various problem with using C++'s standard library (mainly ) to generate random numbers. This week I will outline what I think are the (standardizable) solutions to fix the functionality in ^[1] and make it widely usable.

The content of this post is based on the three C++ standardization papers I presented in Prague, P2058, P2059, P2060, and various conversations I had afterwards on the same topic.

Now, onto the solutions themselves.

Fixing `std::random_device`

In my last post, I complained that std::random_device is allowed to be not random at all, and there is no way to find out, because std::random_device::entropy is interpreted very differently across different standard library implementations.

My ideal way to fix this would be to mandate that a standard library implementation only provides std::random_device if it provides proper randomness. And by proper, I mean cryptographically strong. While this sounds onerous, the three major implementations already provide this in practice, they just do not advertise it... However, I also think that such a proposal would never pass the standard committee, and so we need to fix it differently.

Provide users with better queries for the properties of the implementation

Users generally care about one of two things.

Whether the random_device is random, that is, it does not produce the same sequence every time the code is run.
Whether the random_device produces cryptographically secure outputs.

Obviously, the second property is much stronger, because a random_device that is cryptographically secure is also random, but random_device can be random while not being cryptographically secure. As currently standardized, a random_device is also allowed to be neither random nor cryptographically secure^[2].

A nice feature of these properties is that they are binary, so the answer to them is either yes, or no, with no possibilities in-between. They are also reasonably well defined, which should avoid an entropy-like fiasco with implementations interpreting them differently and causing them to be useless in practice.

My proposal to fix std::random_device in standard simply follows from the above. std::random_device interface should be extended with 2 new member functions:

class random_device {
   ...
   // Returns true if different instances generate different bytes
   constexpr bool is_random() const;
   
   // Returns true if generated bytes are cryptographically secure
   bool is_cryptographically_secure() const;
};

You might notice that only is_random is constexpr. The reason for that is that it is the weaker property and, outside of maliciously constructed cases, the implementation should know whether the random_device is randomized. is_random could even be made static, if we restricted users from using the explicit random_device(const string& token) constructor^[3].

is_cryptographically_secure is not constexpr to increase implementations' latitude to handle things like hardware errata, which can only be checked at runtime. Just like is_random, it could be made static if we imposed further restrictions on users of random_device.

Deprecate `std::random_device::entropy`

Now that random_device provides a way to query basic properties of its implementation, we should also ~~remove~~ deprecate^[4] random_device::entropy, because it is wholly useless, and (very) potentially even dangerous.

Provide reproducible distributions

How should reproducible distributions be standardized is the place where I changed my opinion the most since writing a paper. Initially, my preferred solution was to standardize the algorithms underlying std::*_distribution, but that is no longer the case. Nowadays, my prefered solution is to:

Standardize specific algorithms as distributions

The basic idea is simple, we standardize specific algorithms under their own name, and users who want reproducibility just use one of these specific algorithms. As an example, one of the possible algorithms to implement std::normal_distribution is Marsaglia polar method. To provide reproducible normal distribution, it would be standardized as std::marsaglia_polar_method_distribution.

This solution has a significant advantage in that it is both backwards compatible because it does not change the meaning of existing code, and it allows future extensions. If, we standardize some set of algorithms as the reproducible distributions, and 10 years after that someone comes up with a better algorithm for generating normally^[5] distributed numbers, then it can easily be standardized in the next C++ standard. C++ code then can adopt this new algorithm if they do not need backwards compatibility, or keep using the old ones, if they do need backwards compatibility.

It is also very expert friendly, as different algorithms have different performance and numeric characteristics, which the experts might care about. As an example, Marsaglia polar method calls the underlying RNG more often than Box-Muller transform does, but it does not use trigonometric functions and provides slightly better numeric properties.

This approach is not without its negatives. The two big ones are that it introduces a lot of new types, and thus maintenance burden, into the standard library, and that it makes using even less user-friendly. A user that wants a reproducible distribution has to pick which exact algorithm to use. Doing so requires either obtaining a significant amount of expert knowledge, or picking one essentially at random.

Other considered (and rejected) options

Back at Prague's meeting, I proposed two other alternatives^[6] to the option above. In fact, I considered the option outlined above the worst one. However, I've changed my mind since then and no longer consider them good. They are:

Mandate specific implementation of all std::foo_distribution types
Provide std::reproducible_foo_distribution types with specified implementation

Both of these options share the same problem, that they do not provide future extensibility, and the same advantage in that they introduce less burden on both the maintainers and the non-expert users of . They also provide some different trade-offs in regards to backwards compatibility, implementation latitude and so on.

Challenges, problems, and pitfalls

All three options mentioned above share one big problem, floating-point numbers. This problem further splits into two more problems, floating-point representations, and transcendental functions.

The problem with floating representations is that the C++ standard does not mandate a specific one. In practice, it is unlikely to encounter a platform that does not support IEEE-754, but the C++ standard allows them. There is also the issue of floating-point dialects, caused by compiler flags, such as -ffast-math.

This means that any standard-provided reproducible distribution over floating-point numbers will require some wording to the effect of "results are only reproducible between platforms with the same floating-point number representation"^[7].

The other challenge to providing reproducible floating-point distributions is the fact that most algorithms for e.g. normal distribution use transcendental functions, such as trigonometric operations (Box-Muller), or logarithms (Marsaglia). The problem is that transcendental functions are computed by approximation, both the result and the precision of such approximations vary, and which approximation your code ends up using is compiler, platform, and settings dependent^[8].

There are two possible workarounds for the transcendental functions issue:

Standard mandates specific implementation for use in
We use algorithms that avoid these issues at the cost of performance^[9]

Neither of these options is great, but they are workable. I don't think that would be well served by just option 2, but I also don't think it should be overlooked.

Rework seeding of Random Number Engines

The last of my complaints in the previous post was that there is no right way to seed an unknown Random Number Engine^[10] properly. This issue is caused by a combination of the requirements on Seed Sequence being overly restrictive, and that there is no way to ask an RNE how much seeding it requires upfront.

Strictly speaking, it is possible to fix this with just one change, letting users query any random number engine on how much data it requires for seeding itself. However, that would still leave proper seeding very unergonomic, and so I propose more changes, to fix this. They are:

Let users query RNEs for required seed size
Provide a weaker version of the Seed Sequence requirements
Modify std::random_device to fulfil these requirements

Let users query Random Number Engines required seed size

The idea behind this change is simple. If we know how much random data is required to seed some RNE, we can generate that much randomness ahead of time, and then use a straightforward Seed Sequence type that just copies randomness in and out, while obeying all Seed Sequence requirements.

To do this, we add static constexpr size_t required_seed_size member function to the requirements on Random Number Engines. Its return value is the number of bytes the RNE requires to fully seed itself. Together with a simple, randomness-copying Seed Sequence sized_seed_seq, the code to fully seed a mt19937 with random data would look something like this:

// This prepares the seed sequence
constexpr auto data_needed = std::mt19337::required_seed_size() / sizeof(std::random_device::result_type);
std::array random_data;
std::generate(random_data.begin(), random_data.end(), std::random_device{});

// Actual seeding
std::mt19937 urbg(sized_seed_seq(random_data.begin(), random_data.end()));

While this works and does what we want, the usability is terrible. To fix the usability for the typical case of random seeding, we need to change the requirements of Seed Sequence.

Provide a weaker version of Seed Sequence requirements

In the ideal world, we would just pass a std::random_device to the constructor of the engine, like so:

std::mt19937(std::random_device{});

However, std::random_device is not a Seed Sequence, and thus the code above does not work. The requirements of Seed Sequence are also such that we cannot create a simple wrapper around random_device that fulfils them. Let's see what requirements we have to drop before a randomized_seed_seq, a seed sequence that just wraps std::random_device, is implementable.

Many of the requirements on Seed Sequence boil down to requiring Seed Sequence instances to be serializable and reproducible. A Seed Sequence-ish that wraps std::random_device cannot provide either, which means that

We should drop both param and size member functions. Without param, size is useless, and param cannot be implemented on top of random_device.
We should also drop both the range and the initializer list constructors. They require that the bits provided therein are used in the seed sequence, but that cannot be done with random_device.

Removing these functions leaves us with the default constructor and the generate member function. And also with the result_type typedef, but that is almost trivial^[11]. We obviously ~~want~~ need to keep the default constructor, but we cannot satisfy the requirements that the state of all default-constructed instances is the same, so we will drop that part. The same thing applies to the generate member function. Any reasonable Seed Sequence has to provide it, but we would need to drop the requirement that the output depends on the inputs during construction (not that there are any).

Thus I propose a new set of named requirements, Basic Seed Sequence^[12]. Type only has to fulfil 3 requirements to be considered a Basic Seed Sequence, namely:

It provides result_type typedef that is an unsigned integer type of at least^[13] 32 bits.
It provides a default constructor with constant runtime complexity.
It provides a generate(rb, re) where rb and re are mutable random access iterators^[14] which fills [rb, re) with 32-bit quantities. There are no constraints on the generated data.

This is the minimal set of requirements for a useful Seed Sequence-ish type, and a wrapper type over std::random_device can easily fullfill them:

class randomized_seed_seq {
    std::random_device m_dev;
    
    static_assert(32 <= sizeof(std::random_device::result_type) * CHAR_BIT,
                  "I don't wanna handle this case");
public:

    using result_type = std::random_device::result_type;
    
    template 
    void generate(Iter first, Sentinel last) {
        using dest_type = typename std::iterator_traits::value_type;
        // We should also check that it is unsigned, but eh.
        static_assert(32 <= sizeof(dest_type) * CHAR_BIT, "");
        
        
        while (first != last) {
            // Note that we are _required_ to only output 32 bits
            *first++ = static_cast(m_dev());
        }
    }
};

With the wrapper above, we can now seed any Random Number Engine like this:

randomized_seed_seq sseq;
std::mt19937 rng(sseq);

RNEs take the SeedSequence constructor argument using plain ref, so we cannot quite write an oneliner, but compared to the original monstrosity, this is good enough. However, I also think that users shouldn't have to wrap std::random_device in their own type to get this behaviour, but rather the standard should provide it. This leads me to my last suggestion:

Turn `std::random_device` into a Basic Seed Sequence

This one is simple. If we add generate to std::random_device, it becomes a Basic Seed Sequence as per the definition above. This would let users write these two lines to get a randomly seeded Random Number Engine:

std::random_device dev;
std::mt19937 rng(dev);

Users who require a large number of random bytes could also use this interface to achieve significant performance gain over successively calling random_device::operator()^[15].

Other possible improvements

Until now, this post was about fixing the problems outlined in the previous one. However, in that post, I skipped over "small" issues with , ones that are annoying but do not make it unusable. In this section, I want to also go over some other issues with . These issues are too small to prevent people from using std.random, but they are still annoying enough while using it.

The following issues are mentioned in no specific order.

There are no modern PRNGs in . The best PRNG in is probably^[16] the Mersenne Twister, but using Mersenne Twister instead of say Xorshift, or a PCG variant leaves a lot of performance lying on the table. This lack of modern PRNGs means that serious users will end up writing their own, even if all issues with seeding, distributions, and so on, are fixed.

Most (all?) of the PRNGs in could be constexpr, but they are not. As far as I can tell, this is caused by the fact that nobody actually uses enough to care about constexpr-ing it, rather than any technical reasons.

Random Number Engines take Seed Sequence arguments by plain reference. This prevents creating and fully seeding an RNE from being an oneliner.

There are no ease-of-use utilities. If all the fixes proposed in this post were incorporated, seeding a PRNG becomes easy. However, selecting a random element from
a std::vector would still require a significant amount of boilerplate.

There are likely many more tiny issues with that I am either unaware of completely, or that I haven't run into recently enough to remember them. The point is that if all of my proposed changes were standardized, would become much better but definitely not perfect.

That's it for this post, and for my writing about . At some point in the future I want to write a post about my standardization efforts towards fixing , but that will be a non-technical post about the standardization process itself, rather than about the technical details of .

I don't think the C library and std::rand can be fixed without effectively standardizing , but for C and fixed. ↩︎
I think that a nonrandom random_device is useless, and if a standard library implementation cannot provide random random_device, then it should not provide random_device at all. However, I do not see such a requirement getting past the WG21 committee. ↩︎
MSVC already disallows it and libstdc++ allows the arguments to be /dev/urandom and /dev/random. libc++ is the odd one out here, but given that using the constructor is already non-portable, I believe it could be made more restrictive as well. ↩︎
One does not simply remove things from C++ standard. I am honestly surprised how many things were actually removed after a period of deprecation, but it still takes time even if the removal happens. ↩︎
Or any other specific distribution. ↩︎
Given the experience from the meeting, I now know that this was a fundamentally wrong approach. Present the committee with one opinionated option if you want anything to be done. ↩︎
And probably no guarantees for -ffast-math and equivalents, because that allows the compiler to reorder operations arbitrarily, potentially allowing different results from the same code due to different computational context. ↩︎
As an example, if you are using x86 CPU, it almost definitely has hardware instruction for computing both sin and cos at the same time. No compiler worth its salt uses them, because it is significantly slower than a well-tuned software implementation. Instead, compilers call into sin/cos from the supporting math library. However, some library implementations punt on implementing these functions and use the CPU-provided instruction instead... ↩︎
I am not aware of many algorithms in this category, but I am also not a numerics expert, and I did not perform much of research on this. This paper on generating normally distributed numbers without involving floating-point arithmetics at all looks interesting though. ↩︎
Random Number Engine is any type that fulfils a set of named requirements in the standard. ↩︎
We cannot blindly typedef it to random_device::result_type, because random_device::result_type is unsigned int, and thus is allowed to be less than 32 bits. However, papering over this difference is easy enough, and not even needed on most platforms. ↩︎
I would prefer to give this set of requirements the name Seed Sequence and rename the current Seed Sequence to something like Serializable Seed Sequence, or Repeatable Seed Sequence, or Deterministic Seed Sequence, but it is easier to introduce a new name for the new requirements than to deal with backwards compatibility concerns. ↩︎
The fact that this is at least 32 bits, rather than exactly 32 bits, has caused some interesting bugs in the real world, but let's be compatible with Seed Sequence. ↩︎
The random access requirements are to keep as close to the current Seed Sequence as possible. If we went with the principle of requiring minimal iterator strength for the operation, we should probably require just forward iterators. On the other hand, I think it is unlikely that a random number engine will store its state in something like a list, and if we required contiguous iterators, we could utilize batch writes for significant performance gains. ↩︎
As an example, MSVC's random device generates, (IIRC) 256 random bits per call to operator(). It then returns 32 of them and throws the rest away. Trying to generate a substantial amount of randomness via operator() then leads to a lot of wasted work. ↩︎
It depends on your use case, but if you just want some PRNG, Mersenne Twister is a fine default. ↩︎

Generating random numbers using C++ standard library: the problems

Martin Hořeňovský — Sun, 17 May 2020 19:23:20 GMT

Recently I found myself once again writing a long forum post about the problems with standard-provided random number generation facilities (both C++'s , and C's rand) in C++. Since I keep writing these, I decided to write it all down into one blog post so that I can link it to people later. This is that blog post.

A quick summary of this post would be "Using C++'s standard library for random number generation is a bad idea, and you should either roll your own, or use an existing library. I recommend C++ PCG utilities, or, if you already use Boost, Boost.Random".

Now, onto the actual content itself.

In this post, we will use what should be a straightforward task: generate a bunch of uniformly distributed integers in the range [0, 100k).

C's standard library facilities

Let's start with some C-style random number generation.

// Seed based on time. Not really random.
std::srand(std::time(nullptr));

// Generate 1'000 random numbers in range 0-100'000
for (size_t _ = 0; _ < 1'000; ++_) {
    std::cout << std::rand() % 100'000 << '\n';
}

This code is simple enough to write and to understand but comes with a host of problems.

The resulting numbers will not be uniformly distributed. The results will be biased towards lower numbers, due to the use of modulo.
Numbers above 32767 might not be present at all.
Whether the code is thread-safe is up to the implementation. Which functions invoke rand is also up to the implementation, so data races can happen without you expecting them.

If you do not see why converting the numbers using modulo cause non-uniformly distributed results, consider a simple case, where std::rand can only return 0, 1, or 2, each with the same probability, and we desire numbers in the range [0, 2). There are 2 ways to get 0, 0 % 2, and 2 % 2, while there is only one way to get 1, 1 % 2. In other words, we get 2:1 ratio of 0s to 1s due to using modulo.

The second problem is more obscure, but simpler to understand. The range of possible values generated by std::rand is specified as [0, RAND_MAX), where RAND_MAX can be any constant larger-or-equal to 32767. On platforms that use this lower bound^[1], the example above will never print number larger than 32767.

The last problem is just a symptom of the original C specification ignored threading.

The first two problems are solvable. Replacing modulo with rejection sampling (and potentially calling std::rand multiple times if needed) solves the bias issue. To generate values larger than RAND_MAX, you can just concatenate the result of multiple calls to std::rand.

The thread-safety is impossible to solve in general case^[2], but in specific cases, you can guard user code calls to std::rand with a mutex, and it should work well enough. Some implementations provide a per-thread std::rand, which is a much better solution, but you cannot rely on this.

However, solving all of this is either impossible, or a lot of non-trivial work, and even then you run into the problem that std::rand is allowed to return different numbers on different platforms given the same seed. At this point, it is easier to write your own set of random number generation tools, and so C++11 standardized its own set, in the form of .

C++'s standard library facilities

At first glance, seems exceedingly complex for a simple task. You have to pick a templated Uniform Random Bit Generator, possibly seed it, pick a templated Distribution, and then pass an instance of your URBG to the distribution to get a number... This is the C example rewritten using :

// Truly random seed. 
std::mt19937 rng(std::random_device{}());

// Avoid constructing distribution all the time
std::uniform_int_distribution<> dist(0, 100'000);

// Generate 1'000 random numbers in range 0-100'000
for (size_t _ = 0; _ < 1'000; ++_) {
    std::cout << dist(rng) << '\n';
}

There is bit more code than there was with C, but bearably so, and most of the issues are fixed. The distribution will be uniform, all numbers in the desired interval are possible, and the code is thread-safe.

At second glance, is awesome, even if there is a bit of boilerplate for simple operations. The decomposed and pluggable design means that you can customize your random numbers by replacing only a small part of the random number generation pipeline. The standard also provides a wide range of Random Number Engines and distributions^[3], so you should be able to do most things you want out of the box. It even provides an abstraction for getting actually random numbers for seeding the generators, std::random_device.

At the third glance, when you've started using extensively and started digging deeper, you will find out that every single part of it is deeply flawed, and the best solution is to avoid using it completely.

Distributions are nonportable

Did you notice that the text above said

most of the issues are fixed

and then did not talk about portability? That's because both of the snippets, the C one and the C++ one, share one issue. Even if you hardcode the seed, the snippets will give you different results on different platforms^[4]. For bonus points, the results are not even guaranteed to be portable between different versions of the same standard library, as the standard library implementations are allowed to change how they implement std::uniform_int_distribution^[5].

What this boils down to is that if you have repeatability requirements for your generated random numbers^[6], then you cannot use the standard-provided distributions. Luckily, generating random numbers using is properly decomposed, and you can "just" write your own distributions, and keep using the rest of , right?

Well...

`std::random_device` might not be random, and there is no way to check

The C++ snippet uses std::random_device to generate some initial randomness to seed our instance of Mersenne Twister in the form of std::mt19937. The problem is that std::random_device is poorly specified, and inscrutable.

In theory, it should serve as an abstraction over some external source of entropy. In practice, an implementation is allowed to use any deterministic random number engine to implement it, e.g. a Mersenne twister, and there is no way to find out. There is a member function std::random_device::entropy(), which is in theory there to detect such case, but it does not work in practice.

The blame for this is shared between the standard and the implementations. The function's full signature is double entropy() const noexcept, and it is the return type that breaks it. The standard provides a definition of entropy^[7], but it does not provide any sort of guidance on how to count entropy of an external source of randomness, or expected return values for different cases.

This, in turn, caused different implementations to do their own thing. We will take a look at the big three, MS's STL, libc++ and libstdc++.

MS's implementation handles this the best. It knows its random_device is just a thin wrapper over kernel's cryptographically secure random, so it always returns 32 and inlines the member function into the header to allow for constant propagation^[8].

In order of sanity of implementation, libc++ is next, because it always just returns 0. This return value does not reflect reality, 4 out of 5 possible configurations^[9] of libc++'s random_device use strong random backend, and the last one also provides strong random bytes unless the user deliberately sabotages themselves. The return value also makes libc++'s implementation of std::random_device::entropy useless, but at least it is obviously useless, so the user is not given false hopes and expectations. There is value in this.

The worst implementation of std::random_device::entropy can be found in libstdc++. The reason it is the worst is that it is not obviously useless, you have to think about it for a bit to figure out why the return value is useless. This is because, unlike libc++, libstdc++ can return non-zero values. In most configurations, libstdc++ always returns 0^[10], but when it is configured to read from /dev/urandom (or /dev/random), it uses RNDGETENTCNT to check how much entropy the kernel thinks it has available and returns that to the user.

The underlying problem of this approach is TOCTOU. If you first check whether there is enough randomness^[11], and only then ask for that randomness, then by the time you ask for the randomness it could've been depleted, and you cannot get it anymore.

At this point, we know that we will likely have to implement our own distributions, and either implement our own random_device, or detect which standard library we are compiling against, and hardcode versions that provide good random_device::operator() implementations. But at least we can still use all the different Random Number Engines provided by the standard library, right?

Well...

There is no way to seed a Random Number Engine properly

The Random Number Engines almost work. But if something only almost works, it is broken.

Let's go back to the first line of the C++ example.

std::mt19937 rng(std::random_device{}());

It seeds a specific version of Mersenne Twister with unsigned int worth of random data. Let's assume sizeof(unsigned int) == 4. The internal state of mt19937 is 2496 (624 * 4) bytes. Taken together, this means that for every state we can seed the rng into, there are \(2^{4984}\) states that we cannot seed the rng into.

This has some interesting implications. For example, the program below will never print 7^[12].

int main() {
    std::mt19937 urbg(std::random_device{}());
    std::cout << urbg() << '\n';
}

Some output values also uniquely identify their seed. If I tell you that the code program printed 3046098682, then you can quickly^[13] find the seed generated by random_device, and thus predict all future outputs of a Mersenne twister seeded in this way^[14].

In theory, the standard provides a way to seed the Mersenne twister properly. The tool is called SeedSequence, and there is an implementation of it in the standard library, std::seed_seq. Once again, when you try to use it in practice, it breaks down.

std::seed_seq is essentially a wrapper over std::vector that you can give a bunch of randomness to, and then a random number engine can extract (stretched) randomness out. It is used like this:

auto rd_dev = std::random_device{};
std::seed_seq seq{rd_dev(), rd_dev(), rd_dev(), rd_dev()};
std::mt19937 urbg(seq);

This time we initialized our instance of mt19937 with 16 (4 * 4) bytes of randomness. Progress! There are two problems with this snippet though:

There is no way to know how much randomness you have to provide to a RandomNumberEngine T, and thus how much randomness you have to feed into seed_seq.
std::seed_seq is very tightly specified by the standard. The implementation forced by the standard is not a bijection^[15].

A fun fact about 1. is that std::mersenne_twister_engine provides a member variable you can query to find out how much data it needs^[16]. However, this is an accident of standardization, and no other standard-provided random number engine provides a way to retrieve this information.

The second problem means that even if you hardcode seed sizes of all random number engine types your program uses, you still couldn't use std::seed_seq for initialization, because it loses entropy... here is an example of this on Godbolt:

#include 
#include 
#include 

int main() {
    std::seed_seq seq1({0xf5e5b5c0, 0xdcb8e4b1}),
                  seq2({0xd34295df, 0xba15c4d0});

    std::array arr1, arr2;
    seq1.generate(arr1.begin(), arr1.end());
    seq2.generate(arr2.begin(), arr2.end());

    // prints 1 because seed_seq::generate is not a bijection
    std::cout << (arr1 == arr2) << '\n';
}

In other words, even if you write your own type that fulfils the SeedSequence named requirements, you have to hardcode the sizes of your Random Number Engine types somewhere.

Recap

To recapitulate, generating random numbers using C standard library has many problems, with some fixable at great programming effort, and other unfixable. If you are for some reason stuck with just the C library, you should definitely write your own.

Generating random numbers using C++ standard library fixes most of the problems of using the C library. However, the operative word here is most, and it introduces its own problems instead. In the end, whether you can successfully use depends on your requirements.

If you need cross-platform reproducibility, then you cannot use standard-provided distributions at all, and you have to write your own.
If you need actually random data for whatever reason, you either have to write your own version of random_device, or hardcode a list of platforms + configurations where you can use std::random_device.
if you want to properly seed a Random Number Engine, you have to write your own SeedSequence, and also hardcode the required seed sizes of all your Random Number Engines.

My use cases for usually require cross-platform reproducibility, need properly random seed values, and would prefer fully seeded RNEs. This means that I either have to write 90% of on my own, or use a different implementation, such as Boost.Random or PCG random utilities...

And I am not the only one. When I was writing a couple of standardization proposals for fixing , I made an informal Reddit poll asking people about their use of . The absolute majority of people answered either that they have their own implementation, or use Boost.Random. Few people used other open source libraries, and very, very, very few people use the standard random.

That's it for this post. The next post explores possible avenues for fixing and making it usable by more people in more domains.

This is not a theoretical problem, e.g. MSVC defines RAND_MAX as 32767. ↩︎
There are two problems. The first is that your standard library is allowed to call std::rand from arbitrary other functions. The other is that the other libraries in your binary are also allowed to call std::rand, and you have no way to tell them to lock a mutex before doing so. ↩︎
Including some less known distributions, like Student's t-distribution. ↩︎
Both these configurations share the same libc, so the result of std::rand will be the same. However, it obviously differs from MSVC's results, because MSVC's std::rand cannot return a number that big. ↩︎
Microsoft's standard library implementation is in fact considering a change to a more performant implementation of std::uniform_int_distribution. ↩︎
There are many different use cases for reproducible random generation. Fuzzing, procedural generation, networked clients, physics simulations, etc. ↩︎
I am reasonably sure that standard library maintainers are smart enough to look it up on their own, but ¯\_(ツ)_/¯. ↩︎
This does not mean you can use it for compile-time checking, but at least it is something. ↩︎
Libc++ can either use getentropy, arc4random, nacl_secure_random, rand_s, or read from a file, like /dev/urandom. In theory the user can specify their own file, and thus get nonrandom data. ↩︎
As an example, current versions of libstdc++ use rand_s when on Windows. rand_s is kernel-provided cryptographically secure random number generator, but libstdc++ still returns 0. ↩︎
I do not subscribe to the view that we can use up kernel's entropy after it has initialized its pools properly, but I can see how a good-faith implementation can arrive here. ↩︎
It actually cannot print over 30% of all possible values of a 4-byte unsigned integer. ↩︎
It takes about 10 minutes on my desktop. ↩︎
With a properly seeded Mersenne twister, you would have to observe 624 successive outputs to be able to predict all future generated numbers. ↩︎
If you assume that std::seed_seq never gets more random data than a random number engine will request later, then it would suffice for it to be an injection. It is not an injection either. ↩︎
You actually have to query more than one member variable and do some arithmetics, but it is possible to find out how much random data you need to feed a std::mersenne_twister_engine to initialize it fully. ↩︎

Random Numbers - The Coding Nest

Random distributions are not one-size-fits-all (part 2.5)

Introduction

Benchmarks

Results

Conclusion

Random distributions are not one-size-fits-all (part 2)

Introduction

Extended multiplication

Alternative algorithms

OpenBSD

Java

Benchmarks

Setup

Results

Extended multiplication

Single-use distributions

Reused distributions

Conclusion

Random distributions are not one-size-fits-all (part 1)

Introduction

Benchmarks

Setup

Results

Benchmark stacked for reuse

Benchmarking is hard: the optimizer is good at its job

Benchmark results revisited

Benchmark stacked for reuse (again)

Benchmark stacked against reuse

Benchmark(s) from Catch2

Generating floats

Benchmark resampling

Conclusion

Generating random numbers using C++ standard library: the solutions

Fixing std::random_device

Provide users with better queries for the properties of the implementation

Deprecate std::random_device::entropy

Provide reproducible distributions

Standardize specific algorithms as distributions

Other considered (and rejected) options

Challenges, problems, and pitfalls

Rework seeding of Random Number Engines

Let users query Random Number Engines required seed size

Provide a weaker version of Seed Sequence requirements

Turn std::random_device into a Basic Seed Sequence

Other possible improvements

Generating random numbers using C++ standard library: the problems

C's standard library facilities

C++'s standard library facilities

Distributions are nonportable

std::random_device might not be random, and there is no way to check

There is no way to seed a Random Number Engine properly

Recap

Fixing `std::random_device`

Deprecate `std::random_device::entropy`

Turn `std::random_device` into a Basic Seed Sequence

`std::random_device` might not be random, and there is no way to check