Hacking Hardware to Find and Fix Floating Point Bugs in C++ Code

In the crazy world of game development, developers often stumble upon bugs that are as easy to find as a needle in a haystack... in a tornado. These bugs can morph game models into spaghetti or send characters soaring into the digital ether faster than you can say, "Where did my player go?" One sneaky culprit behind these wild behaviors is floating-point exceptions.

Floating-point exceptions happen when the floating-point unit (FPU) of a CPU encounters an operation it cannot carry out cleanly: division by zero, overflow, underflow, or an invalid operation whose result is a NaN (Not a Number). Normally, these exceptions are masked, meaning they don't immediately cause the program to crash or throw a tantrum. Instead, the FPU quietly sets a status flag, hands back a NaN or an infinity, and lets it propagate through every later calculation, leading to unpredictable results and hair-pulling debugging sessions.
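To make the "quiet propagation" concrete, here is a minimal sketch in standard C++. The names integrate, is_sane and simulate are made up for illustration and are not from any engine; the point is only that one bad input silently poisons every value computed after it.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical physics step: advance a position by velocity * dt.
float integrate(float position, float velocity, float dt) {
    return position + velocity * dt;
}

// Hypothetical check: true only for ordinary finite values (not NaN, not +/-inf).
bool is_sane(float v) {
    return std::isfinite(v);
}

// Runs several steps. No step crashes or complains; a single NaN velocity
// simply turns every later position into NaN as well.
float simulate(float start, float velocity, float dt, int steps) {
    float position = start;
    for (int i = 0; i < steps; ++i) {
        position = integrate(position, velocity, dt);
    }
    return position;
}
```

Calling simulate with a velocity of std::numeric_limits<float>::quiet_NaN() returns a NaN without any error being reported along the way, which is why a check like is_sane at the end of a frame tells you that something went wrong but not where.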

The Problem

Imagine you're knee-deep in game development, and suddenly your character starts an unscheduled trip to the moon. It only happens in some rare, hard-to-reproduce 3-body physics interaction, yet QA manages to find it in every build. After hours of debugging, you're still lost. What if there was a way to catch these floating-point errors right when they happen, rather than letting them wreak havoc?

Let's take a hyperbolic but illustrative example to highlight the concept. Consider the following C++ code snippet, which appears to perform a straightforward floating-point calculation:

#include <iostream>
#include <cmath>

float compute(float a, float b) {
    return std::sqrt(a) / b;
}

int main() {
    float result = compute(-1.0f, 0.0f);
    std::cout << "Result: " << result << std::endl;
    return 0;
}

Now, this example is intentionally exaggerated, but it highlights the kinds of issues that can occur. It contains two potential pitfalls:

The square root of a negative number results in a NaN.

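Division by zero produces an infinity when the numerator is finite and non-zero. (Here the numerator is already a NaN, so the NaN simply carries through the division.)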

These issues might not be immediately evident and can lead to strange behaviors if not handled properly. Let's not even consider how GPU drivers get a case of the Blue Screen crazies when other hardware is fed a plate of infinities with a side of NaN (they mostly handle it very well these days, but a GPU crash is darned near impossible to untangle).
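Even while the exceptions are masked, the FPU records them in sticky status flags that standard C++ can read through <cfenv>. The sketch below is one way to peek at those flags after the fact. The helper flags_raised_by_compute is hypothetical, and the volatile variables are only there on the assumption that they discourage the compiler from folding the arithmetic at compile time.

```cpp
#include <cassert>
#include <cfenv>
#include <cmath>

// Same calculation as the compute() example above.
float compute(float a, float b) {
    return std::sqrt(a) / b;
}

// Hypothetical helper: runs compute() and reports which of the three
// exceptions of interest were flagged while it ran.
int flags_raised_by_compute(float a, float b) {
    std::feclearexcept(FE_ALL_EXCEPT);        // start from a clean slate
    volatile float va = a;                    // volatile: keep the math at run time
    volatile float vb = b;
    volatile float result = compute(va, vb);
    (void)result;                             // only the flags matter here
    return std::fetestexcept(FE_INVALID | FE_DIVBYZERO | FE_OVERFLOW);
}
```

For the inputs (-1.0f, 0.0f) the returned mask should contain FE_INVALID from the square root, while (4.0f, 0.0f) should report FE_DIVBYZERO. This only tells you that something happened since the last clear, not which instruction did it, which is the gap that trapping fills.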

Performance Penalties

It's not just the bizarre bugs that make floating-point errors a nightmare. When garbage (invalid) floating-point calculations are made, they can introduce significant performance penalties. Performing invalid floating-point operations can run 15 to 30 times slower than clean calculations. These performance hits are due to the additional cycles the CPU spends handling these invalid states, which further emphasizes the importance of catching and fixing floating-point errors early. Maybe some coder applies sane constraints after calculations, and everything is functionally correct, but the game is getting slower. Who knows?

Enabling Floating-Point Exceptions

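To catch these floating-point exceptions as they occur, we can tell the hardware to trap whenever such an event happens. Note that this is a hardware trap, not a C++ exception you can catch with try/catch; on Linux it reaches the process as a SIGFPE signal. This way, execution halts immediately (ideally under a debugger) and we can examine the state of the program at the exact point of failure.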

Before performing any floating-point operations, we enable the desired exceptions. In this example, we'll enable exceptions for division by zero, invalid operations (like taking the square root of a negative number), and overflow.

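#include <fenv.h>   // feenableexcept() is a glibc (GNU/Linux) extension, not standard C++
#include <cmath>
#include <iostream>

// Same function as in the earlier example
float compute(float a, float b) {
    return std::sqrt(a) / b;
}

int main() {
    // Unmask these floating-point exceptions so the CPU traps (SIGFPE) when one occurs
    feenableexcept(FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW);

    float result = compute(-1.0f, 0.0f);  // should trap here: sqrt(-1.0f) is an invalid operation
    std::cout << "Result: " << result << std::endl;
    return 0;
}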

Enabling floating-point exceptions catches errors precisely when they occur, saving countless hours of debugging and ensuring the code runs more smoothly than a greased penguin on an ice slide. This is a debugging tool and should not be part of any release build. It is only of value to programmers chasing down bugs in code. If you are a fan of Test-Driven Development (TDD) and integrate very fast tests as part of your build, this can be turned on just for the test run to catch potential bugs before they make their way to source control and ruin the day for other developers (or worse, players).
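One way to keep trapping confined to a test run is a small scope guard. The sketch below is untested and rests on some assumptions: ScopedFpTraps is a made-up name, and feenableexcept, fedisableexcept and fegetexcept are glibc extensions, so this assumes Linux with glibc on x86-64 and will not build as-is on other platforms.

```cpp
#include <cassert>
#include <fenv.h>   // glibc extensions: feenableexcept, fedisableexcept, fegetexcept

// Hypothetical RAII guard: unmasks the chosen exceptions for the lifetime of
// the object and restores the previous trap mask when it goes out of scope.
class ScopedFpTraps {
public:
    explicit ScopedFpTraps(int excepts = FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW)
        : previous_(fegetexcept()) {          // remember what was enabled before
        feclearexcept(FE_ALL_EXCEPT);         // clear stale sticky flags first
        feenableexcept(excepts);              // from here on, these exceptions trap
    }

    ~ScopedFpTraps() {
        feclearexcept(FE_ALL_EXCEPT);
        fedisableexcept(FE_ALL_EXCEPT);       // mask everything...
        feenableexcept(previous_);            // ...then restore the earlier mask
    }

    ScopedFpTraps(const ScopedFpTraps&) = delete;
    ScopedFpTraps& operator=(const ScopedFpTraps&) = delete;

private:
    int previous_;
};
```

A test fixture could create a ScopedFpTraps at the top of each test body, so any invalid operation, division by zero or overflow inside the test stops the run at the offending instruction, while code outside the scope keeps the default masked behavior.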

Understanding how computers actually work (a dark art not commonly taught in most university CS programs) can help track down inexplicable bugs and find the culprit hiding in plain sight.

This is not a silver bullet. I have had to employ this very niche and specialized approach maybe twice or thrice in my entire career since the 1990s. When I did need to, it nuked 80% of the bugs bogging down development and helped ship what became well-loved and delightful games that I hope many of you reading this have enjoyed. Old coders collect a lot of these special little incantations and cantrips in their bag of tricks. Maybe a future article will cover the very dark voodoo of page protection, memory trashers, under-runs and over-runs; or some other dark arts of using code to leverage actual hardware in the pursuit of better engineering.

(Please forgive typos. I am writing a LinkedIn blog post. The code here is to illustrate. I haven't compiled it or run it. Comments on errors or other editorial changes are welcome.)

More articles by Justin Randall