The Little Things: Comparing Floating Point Numbers

Martin Hořeňovský — Thu, 02 Sep 2021 21:53:01 GMT

There is a lot of confusion about floating-point numbers and a lot of bad advice going around. IEEE-754 floating-point numbers are a complex beast^[1], and comparing them is not always easy, but in this post, we will take a look at different approaches and their tradeoffs.

Note that this whole post assumes binary IEEE-754 floating-point numbers. There are more different types of floating-point numbers, e.g. IBM likes decimal floating-point numbers enough to support them in hardware. However, most of the text below should be applicable to different representations too.

Floating point basics

I do not want to get into too many details about the representation of floating-point numbers or their arithmetics, but we still need to go over some important points. They are required to build an understanding of the different comparison methods we will look at later.

Floating-point numbers are a (one) way of dealing with real numbers in fixed-size storage inside a computer. The binary representation consists of 3 parts, the sign bit, the mantissa, and the exponent.

The sign bit should be self-explanatory. It decides which sign the number resulting from the rest of the bits will have^[2]. The mantissa stores the digits of the represented number, while the exponent stores the magnitude of the number.

Because the total number of bits split between these three parts is fixed, we must logically lose precision when representing some numbers due to insufficient bits in the mantissa. The fact that the bit allocation to each part of the representation is also fixed^[3] means that as we represent higher numbers, the absolute loss of precision increases. However, the relative loss of precision remains the same.

Floating-point numbers also contain some special values used to represent specific "states" outside of normal operations. As an example, if a number is so big it overflows the floating-point type, it will be represented as infinity (or negative infinity in case of underflow). The other important special kind of values are the NaN (Not a Number) values.

There are different types of NaN, but the important part of them is that they are the result of invalid floating point operation, e.g. \(\frac{0}{0}\) or \(\frac{\infty}{\infty}\) and that they behave unintuitively, because \(\textrm{NaN} \neq \textrm{NaN}\)^[4].

With this knowledge we can now look at how we can compare two floating-point numbers.

Comparing floating-point numbers

There are 4 (5) different ways to compare floating-point numbers. They are:

Bitwise comparison
Direct ("exact") IEEE-754 comparison
Absolute margin comparison
Relative epsilon comparison
ULP (Unit In Last Place) based comparison

Apart from bitwise comparison, all of them have their merits (and drawbacks). The bitwise comparison is included only to contrast it with the "exact" comparison, I am not aware of any use for it in the real world.

Bitwise and direct comparison

The idea behind bitwise comparison is exceedingly simple. Two floating-point numbers are equal iff their bit representations are the same.

This is not what happens if you write lhs == rhs^[5] in your code.

If you write lhs == rhs in your code, you get what is often called "exact" comparison. However, this doesn't mean that the numbers are compared bitwise because e.g. -0. == 0. and NaN != NaN, even though in the first case both sides have different bit representations, and in the latter case, both sides might have the exact same bit representation

Direct comparison is useful only rarely, but it is not completely useless. Because the basic operations^[6] are specified exactly, any computation using only them should^[7] provide specific output for an input. The situation is worse for various transcendental functions^[8], but reasonably fast correctly rounded libraries are beginning to exist.

All in all, if you are writing code that does some computations with floating-point numbers and you require the results to be portable, you should have a bunch of tests relying purely on direct comparison.

Absolute margin comparison

Absolute margin comparison is the name for writing \(|\textrm{lhs} - \textrm{rhs}| \leq \textrm{margin}\)^[9]. This means that two numbers are equal if their distance is less than some fixed margin.

The two big advantages of absolute margin comparison are that it is easy to reason about decimally ("I want to be within 0.5 of the correct result") and that it does not break down close to 0. The disadvantage is that it instead breaks down for large values of lhs or rhs, where it decays into direct comparison^[10].

Relative epsilon comparison

The relative epsilon^[11] comparison is the name for writing \(|\textrm{lhs} - \textrm{rhs}| \leq \varepsilon * \max(|\textrm{lhs}|, |\textrm{rhs}|)\)^[12]. This means that two numbers are equal if they are within some factor of each other.

Unlike margin comparison, epsilon comparison does not break down for large lhs and rhs values. The tradeoff is that it instead breaks down (by decaying to exact comparison) around 0^[13]. Just like margin comparison, it is quite easy to reason about decimally ("I want to be within 5% of the correct result").

You can also swap the maximum for a minimum of the two numbers, which gives you a stricter comparison^[14] but with the same advantages and disadvantages.

ULP-based comparison

The last option is to compare two numbers based on their ULP distance. The ULP distance of two numbers is how many representable floating-point numbers there is between them + 1. This means that if two numbers do not have any other representable numbers between them, their ULP distance is 1. If there is one number between them, the distance is 2, etc.

The big advantage of using ULP comparisons is that it automatically scales across different magnitudes of compared numbers. It doesn't break down around 0, nor does it break down for large numbers. ULP based comparison is also very easy to reason about numerically. You know what operations happened to the input and thus how far the output can be from the canonical answer and still be considered correct.

The significant disadvantage is that it is ~~very hard~~ impossible to reason about decimally without being an expert in numerical computations. Imagine explaining to a non-technical customer that you guarantee to be within 5 ULP of the correct answer.

So, what does all this mean? What comparison should you use in your code?

Sadly there is no one-size-fits-all answer. When comparing two floating-point numbers, you need to understand your domain and how the numbers came to be and then decide based on that.

What about Catch2?

I maintain a popular testing framework, Catch2, so you might be wondering how does Catch2 handle comparing floating-point numbers. Catch2 provides some useful tools for comparing floating-point numbers, namely Approx and 3 different floating-point matchers, but doesn't make any decisions for you.

Approx is a type that provides standard relational operators, so it can be used directly in assertions and provides both margin and epsilon comparisons. Approx equals a number if the number is either margin or epsilon (or both) equal to the target.

There are two crucial things^[15] to remember about Approx. The first is that the epsilon comparison scales only with the Approx'd value, not the min/max of both sides of the comparison. The other is that a default-constructed Approx instance only performs epsilon comparison (margin defaults to 0).

The matchers each implement one of the three approximate comparisons, and since they are matchers, you can arbitrarily combine them to compare two numbers with the desired semantics. However, it is important to remember that the ULP matcher does have a slightly non-standard interpretation of ULP distance.

The ULP matcher's underlying assumption is that the distance between two numbers that directly compare equal should be 0, even though this is not the interpretation by the standard library, e.g. through std::nextafter. This means that e.g. ulpDistance(-0, 0) == 0 as far as the ULP matcher is concerned, leading to other minor differences from naive ULP distance calculations.

Summarizing the behaviour of the ULP matcher:
\[
\begin{align}
x = y &\implies \textrm{ulpDistance}(x, y) = 0 \\
\textrm{ulpDistance}(\textrm{max-finite}, \infty) &= 1 \\
\textrm{ulpDistance}(x, -x) &= 2 \times \textrm{ulpDistance}(x, 0) \\
\textrm{ulpDistance}(\textrm{NaN}, x) &= \infty
\end{align}
\]

That is all for this post. Now you can go and fix floating-point comparisons in your code. Or use this post to win internet arguments. As long as you don't give advice assuming that floating-point comparisons are one-size-fits-all, it is fine by me.

As it turns out, trying to represent real numbers (of which there is an uncountable infinity) using only fixed space is a very complex problem. ↩︎
This simplifies some things but also means that zero can be either positive or negative. Luckily they compare equal, but that makes the cases where you cannot use them interchangeably even more surprising. ↩︎
For single-precision floating-point numbers (usually float), the exponent has 8 bits, and mantissa has 23 bits. For double-precision floating-point numbers (usually double), the exponent has 11 bits, and the mantissa has 52 bits. ↩︎
Not only are NaNs never equal to any floating value (including other NaNs), trying to compare them also always produces an unordered result. This means that all of \(\textrm{NaN} < x\), \(\textrm{NaN} \le x\), \(\textrm{NaN} > x\), \(\textrm{NaN} \ge x\) are false, even if \(x\) is another NaN. ↩︎
Obviously this assumes that a and b are of a floating-point type, e.g. double. ↩︎
This means addition, subtraction, multiplication, division, and, surprisingly, sqrt given specific rounding mode. ↩︎
This doesn't apply if you use -ffast-math or equivalent, if your code targets x87 FPU, or if your compiler likes to use fused multiply-add instructions. ↩︎
These are your logarithms, sine and cosine, etc. The definition of transcendental function is that they are not expressible as a finite combination of additions, subtractions, multiplications, divisions, powers or roots. ↩︎
Interestingly, you will get different results if you implement the comparison using absolute value versus splitting it into two comparion (so you get `\((\textrm{lhs} + \textrm{margin} \geq \textrm{rhs} \wedge \textrm{rhs} + \textrm{margin} \geq \textrm{lhs}\)). Specifically comparison using absolute value will reject two infinities, while the split comparison accepts infinities as equal. For this reason Catch2 internally uses the latter implementation. ↩︎
This is because of the floating nature of the floating-point numbers, and the smallest representable difference becoming bigger for bigger numbers. Remember that for large enough numbers, X + 1 == X. ↩︎
To make matters more confusing, people also sometimes talk about something called machine epsilon. Machine epsilon is the difference between 1.0 and the next higher representable value (in other words, a value that is 1 ULP from 1.0 in the direction of positive infinity). This epsilon is a different epsilon than the one used in relative comparisons, even though they can be related (e.g. Catch2 defaults relative epsilon to 100 * machine epsilon of the type). ↩︎
Notice that the formulas for absolute margin and relative epsilon comparisons are very similar. The only difference is whether we scale the allowed differences or not. ↩︎
To understand how this happens, consider a comparison between 0 and some other number. The only way this comparison can ever pass is to set epsilon to 1, thus allowing up to 100% difference between the two sides. Such epsilon, in turn, can only fail for numbers with different signs, making the comparison useless.

Even if the other side is the smallest representable positive number and thus likely "close enough" for approximate comparisons, we would still have to set the epsilon to 1 for the comparison to succeed. ↩︎
Consider \(\varepsilon = 0.1\), \(\textrm{lhs} = 10\), and \(\textrm{rhs} = 11.1\). ↩︎
Both of these behaviours are there for legacy reasons because I do not want to break backwards compatibility in a way that can cause tests to pass when they would not before. Generally, I recommend considering Approx legacy and building the comparisons out of matchers. ↩︎

Floating Point Numbers - The Coding Nest