What is a floating-point number?
Digital computers represent numbers in binary form, that is, as patterns of high and low voltages that stand for a “1” or a “0”. With a fixed number of bits, this base-2 numbering system can only represent a limited set of numbers.
A signed number can be either positive or negative. An additional bit of information, called the sign bit, is needed to record whether the number is positive or negative. If a number is unsigned, it is assumed to be positive, and the bit that would have held the sign is used as an additional data bit.
In computers, the number of bits the processor operates on is an important part of the design. That is why software typically comes in two versions, a 32-bit version and a 64-bit version; those values describe the bit length of the numbers the processor works with natively. On a 32-bit system, the largest unsigned integer a single register can hold is 2^32 - 1, or about 4.29 billion. The largest numbers you could add together without overflowing would each be only around 2 billion, and the largest numbers you could multiply would be far smaller still. Enter floating point numbers. With a 32-bit floating point number you can represent values all the way up to around 3.4×10^38, with varying degrees of precision.
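A quick sketch, assuming NumPy is installed, makes the difference in range concrete:

```python
# Compare the largest value a 32-bit integer can hold with the largest
# value a 32-bit IEEE 754 float can represent.
import numpy as np

print(np.iinfo(np.int32).max)    # 2147483647, about 2.1 billion
print(np.finfo(np.float32).max)  # about 3.4028235e+38
```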
A floating point number is a number in an exponential notation, similar to the scientific notation we were taught in school. There are three parts to a floating point representation: the sign, the mantissa, and the exponent. The specific lengths of these parts are standardized in the IEEE 754 standard.
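As a rough illustration, here is a minimal sketch that pulls those three parts out of a Python float, which is a 64-bit IEEE 754 double (1 sign bit, 11 exponent bits, 52 mantissa bits); the example value is arbitrary:

```python
import struct

value = -6.25
bits = struct.unpack(">Q", struct.pack(">d", value))[0]  # raw 64-bit pattern

sign     = bits >> 63                 # 1 bit
exponent = (bits >> 52) & 0x7FF       # 11 bits, biased by 1023
mantissa = bits & ((1 << 52) - 1)     # 52 bits, fractional part of the mantissa

print(sign, exponent - 1023, hex(mantissa))
# 1 2 0x9000000000000   (-6.25 = -1.5625 * 2**2)
```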
How do we use it?
The reason for the floating point number’s existence, then, is to represent numbers across a far wider range than is possible with a plain binary integer. Picture a number line of every value a floating point number can take: the representable values are densely packed near zero, and as you get further away the precision decreases. This can have some subtle, nasty effects, because the accuracy of some operations depends on the magnitude of the numbers you are using.
Where are these used?
Floating point numbers are used everywhere, all the time. Almost any non-integer mathematical operation on a computer uses them, and the cores of almost all major GPU and CPU designs include dedicated hardware for floating point arithmetic.
What are some issues?
Rounding Errors
Rounding errors are caused by the inability of floating point representations to represent all numbers in their range. If you were to make a line graph of all of the floating point values, you would notice that many of them are concentrated close to zero. As the exponent gets larger, the representable values get further apart, so more and more real numbers have to be rounded to the nearest value the format can hold.
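A small sketch using math.ulp (available in Python 3.9 and later) shows how the gap to the next representable value grows with magnitude:

```python
import math

# Print the spacing between x and the next representable float.
for x in (1.0, 1_000.0, 1_000_000.0, 1e10, 1e16):
    print(f"{x:>12g}  gap to next float: {math.ulp(x):.3g}")
# The gap near 1.0 is about 2.2e-16; by 1e16 it has grown to 2.0.
```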
Catastrophic Cancellation (precision limits)
Catastrophic cancellation is the reason you always want to avoid relying on small differences between large numbers. As a floating point number gets larger it loses precision, so large numbers carry more rounding error. Eventually a point is reached where small differences are mostly rounding error. This can have the effect of cancelling out meaningful calculations and causing undesired behavior.
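A tiny illustration (the specific values are just for demonstration): near 1e16 adjacent doubles are 2.0 apart, so a genuine difference of 1.5 between two large numbers cannot survive the subtraction.

```python
a = 1e16 + 1.5   # rounds to the nearest representable double, 1e16 + 2
b = 1e16
print(a - b)     # prints 2.0, not 1.5; the true difference is lost
```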
Case Studies
Patriot Missile
In 1991, a Patriot missile battery in Dhahran, Saudi Arabia failed to intercept an incoming Iraqi Scud missile. The computers that processed data from the tracking radars kept time by counting tenths of a second and multiplying that count by 1/10 to get the time in seconds. The registers that performed this calculation were 24 bits wide, so the fractional value of 1/10, which has a non-terminating binary expansion, was truncated 24 places after the radix point. The core issue is that converting a real number into a finite binary representation loses precision. The truncated value was then converted into a 48-bit floating point number, and that lost precision propagated through every timing cycle, so the system’s clock drifted further from true time the longer it ran.
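A back-of-the-envelope sketch of that drift, using the per-tick truncation error of about 0.000000095 seconds and the roughly 100 hours of continuous operation quoted in the Patriot report cited below; the rest is plain arithmetic:

```python
TICKS_PER_SECOND = 10          # the clock counted tenths of a second
ERROR_PER_TICK = 0.000000095   # seconds lost per tick to the chopped 1/10 (per the report)

uptime_hours = 100             # the battery had been running for about 100 hours
ticks = uptime_hours * 3600 * TICKS_PER_SECOND
drift = ticks * ERROR_PER_TICK

print(f"accumulated clock error after {uptime_hours} h: {drift:.3f} s")
# roughly 0.34 s; at Scud speeds that corresponds to over half a kilometre of travel
```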
Ariane 5
On its maiden flight in 1996, the Ariane 5 rocket exploded about 37 seconds after launch, destroying the rocket and its payload of 4 satellites. A software exception was thrown when the software tried to convert a 64-bit floating point value into a 16-bit signed integer; the value was too large to fit, and the diagnostic bit pattern the failed unit then output was interpreted as real flight data.
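A hedged Python sketch of the same failure mode (the value and names here are illustrative, not from the Ariane code, which was written in Ada): any 64-bit floating point value outside the 16-bit signed range simply cannot be packed into a 16-bit signed integer.

```python
import struct

value = 40000.0   # illustrative; anything beyond the int16 range of -32768..32767

try:
    struct.pack("h", int(value))          # "h" is a 16-bit signed integer
except struct.error as exc:
    print("conversion overflowed:", exc)  # struct refuses the out-of-range value
```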
Example
Basic Arithmetic
Our first example uses Python to demonstrate how operations with floating-point numbers can go awry. Open a Python interpreter and try a few simple subtractions of two floating point numbers. The answer to each of these calculations should be exactly 0.1, but that’s not what we get. A session along these lines (the exact values typed are illustrative) shows the problem:
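```python
>>> 1.2 - 1.1                        # should be exactly 0.1
0.09999999999999987
>>> 10000000000.2 - 10000000000.1    # should also be exactly 0.1
0.10000038146972656
```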
The errors jump above and below 0.1, but what we’re really interested in is how the magnitude of the errors relates to the magnitude of the numbers we are subtracting. When we’re just subtracting 1.1 from 1.2 the error is relatively small, but when we subtract 10 billion point 1 from 10 billion point 2 the magnitude of the error is billions of times greater than the magnitude of the first error!
When we plot the deviations from the exact result, the errors grow along with the magnitude of the numbers being subtracted.
This example shows how the catastrophic cancellation phenomenon can happen, and it gives us a rule of thumb to keep in mind when using floating point numbers for any important calculation:
“Be wary of small differences between very large numbers.”
Algorithmic Problems
Many scientific and engineering applications use computers to solve large systems of equations or complicated differential equations, and an entire field of study, numerical analysis, is largely dedicated to figuring out how the propagation of small errors can derail an algorithm (a property known as numerical stability). For this example, we’ll look at a simple program that stands in for a physics simulation and see how forgetting about floating point behavior can affect the control flow of an algorithm.
Let’s say we have the main loop of a physics simulation. A high-end physics simulation, such as a computational fluid dynamics (CFD) program, typically evaluates a function over many discrete time steps. A CFD solver may need to run for hundreds of thousands or even millions of time steps, and results aren’t saved at every one of them. The simulation may step on a scale of milliseconds, but you only want to save roughly every 40th time step so that you can render an animation running at 24 frames per second.
Someone who is inexperienced might write something like the following (a minimal sketch; the variable names and the one-millisecond time step are illustrative):
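```python
# Naive approach: step the simulation in 1 ms increments and try to save a
# frame whenever the elapsed time lands on a multiple of 41 ms.
time = 0.0
dt = 0.001                 # 1 ms time step, in seconds

while time < 10.0:         # simulate 10 seconds
    # ... physics update would go here ...
    time += dt

    # BUG: 0.001 has no exact binary representation, so `time` drifts away
    # from exact multiples of 0.041 and this condition is essentially never true.
    if time % 0.041 == 0:
        print(f"saving frame at t = {time:.3f} s")
```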
Let me preface this with the fact that you should never use the modulus (%) operator with floating point numbers, for this very reason. The idea seems sound enough: 1000 ms divided by 24 frames per second is around 41-42 ms, so whenever the elapsed time is a multiple of 41 ms we save a frame. But it will not behave as expected. Because of accumulated rounding error, the time will never exactly equal a multiple of 0.041, so nothing is ever printed to the console.
We can fix this problem by using an integer, rather than a floating point number, to keep track of the number of time steps, and converting it back into an actual time value whenever we need it (again, a sketch along the same lines):
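```python
# Fixed approach: count whole time steps with an integer and derive the
# simulation time from that count only when it is needed for output.
dt = 0.001                      # 1 ms time step, in seconds
STEPS_PER_FRAME = 41            # ~41.7 ms per frame at 24 fps, rounded down

for step in range(1, 10_001):   # 10,000 steps = 10 simulated seconds
    # ... physics update would go here ...

    # Integer arithmetic is exact, so this condition fires reliably.
    if step % STEPS_PER_FRAME == 0:
        time = step * dt        # convert back to seconds only for output
        print(f"saving frame at t = {time:.3f} s")
```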
Sources
- “The Patriot Missile Failure.” Accessed: Nov. 02, 2022. [Online]. Available: https://www-users.cse.umn.edu/~arnold/disasters/patriot.html
- “ARIANE 5 Failure – Full Report.” Accessed: Nov. 02, 2022. [Online]. Available: https://www-users.cse.umn.edu/~arnold/disasters/ariane5rep.html
- L. Varshney, “The Deadly Consequences of Rounding Errors,” Slate, Oct. 31, 2019. Accessed: Nov. 14, 2022. [Online]. Available: https://slate.com/technology/2019/10/round-floor-software-errors-stock-market-battlefield.html