Last week I wrote about the various problems with using C++'s standard library (mainly <random>) to generate random numbers. This week I will outline what I think are the (standardizable) solutions to fix the functionality in <random>[1] and make it widely usable.

The content of this post is based on the three C++ standardization papers I presented in Prague (P2058, P2059, P2060) and on various conversations I had afterwards on the same topic.

Now, onto the solutions themselves.


Fixing std::random_device

In my last post, I complained that std::random_device is allowed to be not random at all, and there is no way to find out, because std::random_device::entropy is interpreted very differently across different standard library implementations.

My ideal way to fix this would be to mandate that a standard library implementation only provides std::random_device if it provides proper randomness. And by proper, I mean cryptographically strong. While this sounds onerous, the three major implementations already provide this in practice; they just do not advertise it... However, I also think that such a proposal would never make it past the committee, and so we need to fix it differently.

Provide users with better queries for the properties of the implementation

Users generally care about one of two things.

  1. Whether the random_device is random, that is, it does not produce the same sequence every time the code is run.
  2. Whether the random_device produces cryptographically secure outputs.

Obviously, the second property is much stronger: a random_device that is cryptographically secure is also random, but a random_device can be random while not being cryptographically secure. As currently standardized, a random_device is also allowed to be neither random nor cryptographically secure[2].

A nice feature of these properties is that they are binary: the answer is either yes or no, with no possibilities in-between. They are also reasonably well-defined, which should avoid an entropy-like fiasco, where implementations interpret them differently and thus make them useless in practice.

My proposal to fix std::random_device in the standard follows directly from the above: the std::random_device interface should be extended with two new member functions:

class random_device {
   ...
   // Returns true if different instances generate different bytes
   constexpr bool is_random() const;
   
   // Returns true if generated bytes are cryptographically secure
   bool is_cryptographically_secure() const;
};

You might notice that only is_random is constexpr. The reason is that it is the weaker property, and, outside of maliciously constructed cases, the implementation should know whether its random_device is randomized. is_random could even be made static, if we restricted users from using the explicit random_device(const string& token) constructor[3].

is_cryptographically_secure is not constexpr to increase implementations' latitude to handle things like hardware errata, which can only be checked at runtime. Just like is_random, it could be made static if we imposed further restrictions on users of random_device.
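
With these queries in place, calling code can make its requirements explicit. A hypothetical usage sketch, assuming the extended interface above:

std::random_device dev;
if (!dev.is_random()) {
    // Deterministic device, e.g. fall back to asking the user for a seed
}
if (!dev.is_cryptographically_secure()) {
    // Fine for games and simulations, but refuse to generate key material
}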

Deprecate std::random_device::entropy

Now that random_device provides a way to query the basic properties of its implementation, we should also deprecate[4] random_device::entropy, because it is wholly useless and potentially even dangerous.

Provide reproducible distributions

How reproducible distributions should be standardized is where I have changed my opinion the most since writing the papers. Initially, my preferred solution was to standardize the algorithms underlying std::*_distribution, but that is no longer the case. Nowadays, my preferred solution is to:

Standardize specific algorithms as distributions

The basic idea is simple: we standardize specific algorithms under their own names, and users who want reproducibility just use one of these specific algorithms. As an example, one of the possible algorithms for implementing std::normal_distribution is the Marsaglia polar method. To provide a reproducible normal distribution, it would be standardized as std::marsaglia_polar_method_distribution.
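
To illustrate, user code under this proposal might look something like the sketch below. The type name is the hypothetical one from the example above, and I am assuming it mirrors the interface of std::normal_distribution:

std::mt19937 rng(42); // fixed seed -> reproducible engine output
// Hypothetical reproducible distribution, with the same (mean, stddev)
// parameters as std::normal_distribution
std::marsaglia_polar_method_distribution<double> dist(0.0, 1.0);
double x = dist(rng); // same value on every platform, modulo the
                      // floating-point caveats discussed below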

This solution has a significant advantage in that it is backwards compatible, because it does not change the meaning of existing code, while still allowing future extensions. If we standardize some set of algorithms as the reproducible distributions, and 10 years later someone comes up with a better algorithm for generating normally[5] distributed numbers, then the new algorithm can easily be standardized in the next C++ standard. Code that does not need backwards compatibility can then adopt it, while code that does can keep using the old ones.

It is also very expert-friendly, as different algorithms have different performance and numeric characteristics, which experts might care about. As an example, the Marsaglia polar method calls the underlying RNG more often than the Box-Muller transform does, but it does not use trigonometric functions and provides slightly better numeric properties.
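
For illustration, here is a minimal sketch of the polar method as a free function, not a distribution type; a real implementation would also cache the second generated value (v * factor) and return it on the next call:

#include <cmath>
#include <random>

template <typename URBG>
double polar_normal(URBG& rng) {
    // A reproducible version would also need a specified way of turning
    // the RNG's output into uniform doubles.
    std::uniform_real_distribution<double> unit(-1.0, 1.0);
    double u, v, s;
    // The rejection loop consumes a variable number of RNG calls, which
    // is why the polar method calls the RNG more often than Box-Muller.
    do {
        u = unit(rng);
        v = unit(rng);
        s = u * u + v * v;
    } while (s >= 1.0 || s == 0.0);
    // Only sqrt and log are needed, no trigonometric functions.
    double factor = std::sqrt(-2.0 * std::log(s) / s);
    return u * factor;
}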

This approach is not without its negatives. The two big ones are that it introduces a lot of new types, and thus maintenance burden, into the standard library, and that it makes using <random> even less user-friendly. A user that wants a reproducible distribution has to pick which exact algorithm to use. Doing so requires either obtaining a significant amount of expert knowledge, or picking one essentially at random.

Other considered (and rejected) options

Back at the Prague meeting, I proposed two other alternatives[6] to the option above. In fact, at the time I considered the option outlined above the worst of the three. However, I have changed my mind since then and no longer consider the alternatives good. They are:

  1. Mandate specific implementation of all std::foo_distribution types
  2. Provide std::reproducible_foo_distribution types with specified implementation

Both of these options share the same problem: they do not provide future extensibility. They also share the same advantage, in that they place less burden on both the maintainers and the non-expert users of <random>. Beyond that, they provide somewhat different trade-offs in regards to backwards compatibility, implementation latitude, and so on.

Challenges, problems, and pitfalls

All three options mentioned above share one big problem: floating-point numbers. This problem further splits into two, floating-point representations and transcendental functions.

The problem with floating-point representations is that the C++ standard does not mandate a specific one. In practice, you are unlikely to encounter a platform that does not use IEEE-754, but the C++ standard allows such platforms to exist. There is also the issue of floating-point dialects caused by compiler flags, such as -ffast-math.

This means that any standard-provided reproducible distribution over floating-point numbers will require some wording to the effect of "results are only reproducible between platforms with the same floating-point number representation"[7].
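
As a small illustration of the dialect problem, consider what the reassociation permitted by -ffast-math can do to an ordinary expression:

// Evaluated left-to-right in IEEE-754 doubles, this is 0.0, because
// 1e16 + 1.0 rounds back to 1e16 (doubles cannot represent 1e16 + 1).
double x = (1e16 + 1.0) - 1e16;
// Under -ffast-math, the compiler may reassociate this into
// (1e16 - 1e16) + 1.0, which evaluates to 1.0 instead.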

The other challenge to providing reproducible floating-point distributions is the fact that most algorithms for e.g. the normal distribution use transcendental functions, such as trigonometric operations (Box-Muller) or logarithms (Marsaglia). The problem is that transcendental functions are computed by approximation; both the result and the precision of such approximations vary, and which approximation your code ends up using is compiler-, platform-, and settings-dependent[8].

There are two possible workarounds for the transcendental functions issue:

  1. The standard mandates a specific implementation for use in <random>
  2. We use algorithms that avoid these issues at the cost of performance[9]

Neither of these options is great, but they are workable. I don't think that <random> would be well served by just option 2, but I also don't think it should be overlooked.

Rework seeding of Random Number Engines

The last of my complaints in the previous post was that there is no right way to properly seed an unknown Random Number Engine[10]. This issue is caused by a combination of the requirements on Seed Sequence being overly restrictive and the fact that there is no way to ask an RNE upfront how much seeding data it requires.

Strictly speaking, it is possible to fix this with just one change: letting users query any random number engine about how much data it requires to seed itself. However, that would still leave proper seeding very unergonomic, and so I propose a few more changes to fix this. They are:

  1. Let users query RNEs for required seed size
  2. Provide a weaker version of the Seed Sequence requirements
  3. Modify std::random_device to fulfil these requirements

Let users query Random Number Engines for their required seed size

The idea behind this change is simple. If we know how much random data is required to seed some RNE, we can generate that much randomness ahead of time, and then use a straightforward Seed Sequence type that just copies randomness in and out, while obeying all Seed Sequence requirements.

To do this, we add a static constexpr size_t required_seed_size() member function to the requirements on Random Number Engines. Its return value is the number of bytes the RNE requires to fully seed itself. Together with a simple, randomness-copying Seed Sequence, sized_seed_seq, the code to fully seed an mt19937 with random data would look something like this:

// This prepares the seed sequence
constexpr auto data_needed = std::mt19937::required_seed_size() / sizeof(std::random_device::result_type);
std::array<std::random_device::result_type, data_needed> random_data;
// std::random_device is not copyable and std::generate takes its
// generator by value, so we pass it through std::ref
std::random_device dev;
std::generate(random_data.begin(), random_data.end(), std::ref(dev));

// Actual seeding. The seed sequence must be a named variable, because
// engines take the Seed Sequence argument by plain (lvalue) reference.
sized_seed_seq sseq(random_data.begin(), random_data.end());
std::mt19937 urbg(sseq);

While this works and does what we want, the usability is terrible. To fix the usability for the typical case of random seeding, we need to change the requirements of Seed Sequence.

Provide a weaker version of Seed Sequence requirements

In the ideal world, we would just pass a std::random_device to the constructor of the engine, like so:

std::mt19937 rng(std::random_device{});

However, std::random_device is not a Seed Sequence, and thus the code above does not work. The requirements of Seed Sequence are also such that we cannot create a simple wrapper around random_device that fulfils them. Let's see what requirements we have to drop before a randomized_seed_seq, a seed sequence that just wraps std::random_device, is implementable.

Many of the requirements on Seed Sequence boil down to requiring Seed Sequence instances to be serializable and reproducible. A Seed Sequence-ish that wraps std::random_device cannot provide either, which means that

  • We should drop both param and size member functions. Without param, size is useless, and param cannot be implemented on top of random_device.
  • We should also drop both the range and the initializer list constructors. They require that the bits provided therein are used in the seed sequence, but that cannot be done with random_device.

Removing these functions leaves us with the default constructor and the generate member function (and also the result_type typedef, but that is almost trivial[11]). We obviously need to keep the default constructor, but we cannot satisfy the requirement that the state of all default-constructed instances is the same, so we will drop that part. The same thing applies to the generate member function: any reasonable Seed Sequence has to provide it, but we need to drop the requirement that the output depends on the inputs provided during construction (not that there are any).

Thus I propose a new set of named requirements, Basic Seed Sequence[12]. A type only has to fulfil three requirements to be considered a Basic Seed Sequence, namely:

  • It provides a result_type typedef that is an unsigned integer type of at least[13] 32 bits.
  • It provides a default constructor with constant runtime complexity.
  • It provides a generate(rb, re) member function, where rb and re are mutable random access iterators[14], which fills [rb, re) with 32-bit quantities. There are no constraints on the generated data.

This is the minimal set of requirements for a useful Seed Sequence-ish type, and a wrapper type over std::random_device can easily fulfil them:

class randomized_seed_seq {
    std::random_device m_dev;
    
    static_assert(32 <= sizeof(std::random_device::result_type) * CHAR_BIT,
                  "I don't wanna handle this case");
public:

    using result_type = std::random_device::result_type;
    
    template <typename Iter, typename Sentinel>
    void generate(Iter first, Sentinel last) {
        using dest_type = typename std::iterator_traits<Iter>::value_type;
        // We should also check that it is unsigned, but eh.
        static_assert(32 <= sizeof(dest_type) * CHAR_BIT, "");
        
        while (first != last) {
            // Note that we are _required_ to only output 32 bits
            *first++ = static_cast<uint32_t>(m_dev());
        }
    }
};

With the wrapper above, we can now seed any Random Number Engine like this:

randomized_seed_seq sseq;
std::mt19937 rng(sseq);

RNEs take the Seed Sequence constructor argument by plain reference, so we cannot quite write a one-liner, but compared to the original monstrosity, this is good enough. However, I also think that users shouldn't have to wrap std::random_device in their own type to get this behaviour; the standard should provide it instead. This leads me to my last suggestion:

Turn std::random_device into a Basic Seed Sequence

This one is simple. If we add generate to std::random_device, it becomes a Basic Seed Sequence as per the definition above. This would let users write these two lines to get a randomly seeded Random Number Engine:

std::random_device dev;
std::mt19937 rng(dev);

Users who require a large number of random bytes could also use this interface to achieve a significant performance gain over successively calling random_device::operator()[15].
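
A sketch of such bulk usage, assuming generate is added to std::random_device as proposed above:

std::random_device dev;
std::array<std::uint32_t, 256> buffer;
// One bulk call lets the implementation use the underlying entropy source
// efficiently, instead of 256 operator() calls that may each throw away
// most of the generated bits.
dev.generate(buffer.begin(), buffer.end());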

Other possible improvements

Until now, this post was about fixing the problems outlined in the previous one. However, in that post I skipped over the "small" issues with <random>, ones that are annoying but do not make it unusable. In this section, I want to also go over some of those other issues. They are too small to stop people from using <random>, but they are still annoying enough when using it.

The following issues are mentioned in no specific order.

There are no modern PRNGs in <random>. The best PRNG in <random> is probably[16] the Mersenne Twister, but using the Mersenne Twister instead of, say, Xorshift or a PCG variant leaves a lot of performance lying on the table. This lack of modern PRNGs means that serious users will end up writing their own, even if all the issues with seeding, distributions, and so on are fixed.

Most (all?) of the PRNGs in <random> could be constexpr, but they are not. As far as I can tell, this is because nobody actually uses <random> enough to care about constexpr-ing it, rather than because of any technical reason.
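
Both of these complaints are easy to illustrate at once. Below is a sketch of Marsaglia's xorshift64; it is not state of the art either, but it shows how little code such an engine needs, and that there is no technical obstacle to making one constexpr (the default seed is an arbitrary nonzero constant):

#include <cstdint>

class xorshift64 {
    std::uint64_t m_state;
public:
    using result_type = std::uint64_t;
    
    constexpr explicit xorshift64(std::uint64_t seed = 0x2545F4914F6CDD1Dull)
        : m_state(seed != 0 ? seed : 1) {} // the state must be nonzero
    
    // A nonzero state never produces 0, hence min() == 1
    static constexpr result_type min() { return 1; }
    static constexpr result_type max() { return UINT64_MAX; }
    
    constexpr result_type operator()() {
        m_state ^= m_state << 13;
        m_state ^= m_state >> 7;
        m_state ^= m_state << 17;
        return m_state;
    }
};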

Random Number Engines take Seed Sequence arguments by plain reference. This prevents creating and fully seeding an RNE from being a one-liner.

There are no ease-of-use utilities. If all the fixes proposed in this post were incorporated, seeding a PRNG would become easy. However, selecting a random element from a std::vector would still require a significant amount of boilerplate.
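
For comparison, this is roughly what that boilerplate looks like today, with every user re-deriving the same three steps (device, engine, distribution):

std::vector<int> items = {1, 2, 3, 4, 5};

std::random_device dev;
std::mt19937 rng(dev()); // underseeded, but the usual shortcut
std::uniform_int_distribution<std::size_t> idx(0, items.size() - 1);
int chosen = items[idx(rng)];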

There are likely many more tiny issues with <random> that I am either completely unaware of, or that I haven't run into recently enough to remember. The point is that even if all of my proposed changes were standardized, <random> would become much better, but definitely not perfect.


That's it for this post, and for my writing about <random>. At some point in the future I want to write a post about my standardization efforts towards fixing <random>, but that will be a non-technical post about the standardization process itself, rather than about the technical details of <random>.


  1. I don't think the C library and std::rand can be fixed without effectively standardizing <random> for C. ↩︎

  2. I think that a nonrandom random_device is useless, and if a standard library implementation cannot provide a random random_device, then it should not provide random_device at all. However, I do not see such a requirement getting past the WG21 committee. ↩︎

  3. MSVC already disallows it, and libstdc++ allows the argument to be /dev/urandom and /dev/random. libc++ is the odd one out here, but given that using the constructor is already non-portable, I believe it could be made more restrictive as well. ↩︎

  4. One does not simply remove things from the C++ standard. I am honestly surprised how many things have actually been removed after a period of deprecation, but it still takes time even when the removal happens. ↩︎

  5. Or any other specific distribution. ↩︎

  6. Given the experience from the meeting, I now know that this was a fundamentally wrong approach. Present the committee with one opinionated option if you want anything to be done. ↩︎

  7. And probably no guarantees for -ffast-math and equivalents, because that allows the compiler to reorder operations arbitrarily, potentially allowing different results from the same code due to different computational context. ↩︎

  8. As an example, if you are using an x86 CPU, it almost definitely has a hardware instruction for computing both sin and cos at the same time. No compiler worth its salt emits it, because it is significantly slower than a well-tuned software implementation. Instead, compilers call into sin/cos from the supporting math library. However, some library implementations punt on implementing these functions and use the CPU-provided instruction instead... ↩︎

  9. I am not aware of many algorithms in this category, but I am also not a numerics expert, and I did not do much research on this. This paper on generating normally distributed numbers without involving floating-point arithmetic at all looks interesting, though. ↩︎

  10. Random Number Engine is any type that fulfils a set of named requirements in the standard. ↩︎

  11. We cannot blindly typedef it to random_device::result_type, because random_device::result_type is unsigned int, and thus is allowed to be less than 32 bits. However, papering over this difference is easy enough, and not even needed on most platforms. ↩︎

  12. I would prefer to give this set of requirements the name Seed Sequence and rename the current Seed Sequence to something like Serializable Seed Sequence, or Repeatable Seed Sequence, or Deterministic Seed Sequence, but it is easier to introduce a new name for the new requirements than to deal with backwards compatibility concerns. ↩︎

  13. The fact that this is at least 32 bits, rather than exactly 32 bits, has caused some interesting bugs in the real world, but let's be compatible with Seed Sequence. ↩︎

  14. The random access requirements are to keep as close to the current Seed Sequence as possible. If we went with the principle of requiring minimal iterator strength for the operation, we should probably require just forward iterators. On the other hand, I think it is unlikely that a random number engine will store its state in something like a list, and if we required contiguous iterators, we could utilize batch writes for significant performance gains. ↩︎

  15. As an example, MSVC's random_device generates (IIRC) 256 random bits per call to operator(). It then returns 32 of them and throws the rest away. Trying to generate a substantial amount of randomness via operator() thus leads to a lot of wasted work. ↩︎

  16. It depends on your use case, but if you just want some PRNG, Mersenne Twister is a fine default. ↩︎