Achieving Greater Accuracy in Automated Performance Analysis within Unit Tests
Comparing fixed timings across heterogeneous testing environments does not produce reliable or provable data. Worse, first attempts at solving the problem usually take a very long time to execute, which is the opposite of what automated testing needs.
It is death by a thousand little cuts as applications (especially games) pile on feature after feature. The automated test process should add only milliseconds, or at most seconds, to a build, so that everyone runs the tests whenever changes are made. Every developer should have a very high degree of confidence that a change makes things better, not worse. Just as developers rely on "does it compile?" as a prerequisite to submitting changes to source control, so too should they run the test suite before submission and be confident nothing got slower or broken as a result. From coders to artists to designers -- everyone wants to know they are making things better, and nobody wants to find out they made things worse once the changes travel from their keyboard to the customer.
One approach (and it is no panacea) is to measure complexity rather than time. Use Big-O notation to check complexity at the unit-test level, rather than relying on integration tests timed as part of the CI build.
Make the tests self-comparative so that it does not matter whether they run on bare metal or are farmed out to some container or virtual machine.
Relative Timing
For performance measurements, conduct three very fast runs with increasing workloads: 1 unit, 10 units, and 100 units (or whatever makes the most sense for what is being tested). The runs could measure call counts to see whether they grow linearly, logarithmically, or exponentially; the workload could be the number of objects to process. It is up to the developer to provide appropriate context, but there should be an increasing workload and relative timings to estimate the measured complexity against what is expected, as sketched below.
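Here is a minimal sketch of what those runs might look like in C++. The work(n) callback stands in for the code under test, and the timing helper and the 1/10/100 workloads are illustrative assumptions, not a prescribed API:

#include <array>
#include <chrono>
#include <cstddef>

// Time one run of the code under test at a workload of n units.
double time_workload(void (*work)(std::size_t), std::size_t n) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    work(n);                                 // the code under test
    const auto stop = clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Collect the three relative timings at 1, 10, and 100 units of work.
// Call counters from instrumentation would slot in here just as well.
std::array<double, 3> relative_timings(void (*work)(std::size_t)) {
    return { time_workload(work, 1),
             time_workload(work, 10),
             time_workload(work, 100) };
}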
Collect and compare these timings. Sure, there are cache-line considerations and other factors to adjust for, but if the complexity is expected to be O(1) and it becomes O(n) or, worse, O(n^2), there is a problem. Klaxons should sound, strobe lights should flash, and whoever is making the change needs to take a closer look before it propagates.
The increase in timings may not be precise, but there is a wide delta between O(1) and O(n), between O(log n) and O(n), and certainly between O(log n) and O(n^2).
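A coarse classification can lean on exactly those wide deltas. In this sketch the enum names and cut-off ratios are assumptions chosen to sit in the gaps between classes, using the growth from the 10-unit run to the 100-unit run (the 1-unit run can double as a warm-up):

enum class Complexity { Constant, Logarithmic, Linear, Quadratic };

// Classify growth between the 10-unit and 100-unit runs. For 10x more
// work: O(1) stays ~1x, O(log n) roughly doubles, O(n) grows ~10x,
// and O(n^2) grows ~100x -- the deltas are wide enough for coarse cuts.
Complexity classify(double t10, double t100) {
    const double growth = t100 / t10;
    if (growth < 1.5)  return Complexity::Constant;
    if (growth < 5.0)  return Complexity::Logarithmic;
    if (growth < 30.0) return Complexity::Linear;
    return Complexity::Quadratic;
}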
Most coders reading this are probably already imagining a TEST() helper that takes in the timings, checks where they fall on the calculated complexity curve versus what is expected, and returns true or false based on what is measured.
EXPECT_TRUE(time1, time2, time3, ENUM_COMPLEXITY).
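A literal EXPECT_TRUE cannot take four arguments in GoogleTest, but a small wrapper gets the same effect. The sketch below assumes GoogleTest plus the relative_timings() and classify() helpers from the earlier sketches; EXPECT_COMPLEXITY and run_lookups() are made-up names for illustration:

#include <cstddef>
#include <gtest/gtest.h>

// Hypothetical code under test: perform n look-ups against a container.
void run_lookups(std::size_t n);

// Compare the measured growth against the expected complexity class.
#define EXPECT_COMPLEXITY(t10, t100, expected) \
    EXPECT_EQ(classify((t10), (t100)), (expected))

TEST(LookupPerf, StaysLogarithmic) {
    const auto t = relative_timings(&run_lookups);   // timings at 1, 10, 100 units
    EXPECT_COMPLEXITY(t[1], t[2], Complexity::Logarithmic);
}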
Conclusion
Measure relative complexity, not absolute time.
This is merely another conceptual tool, not a silver bullet that solves every problem of measuring performance and catching slow-downs early. There is no substitute for experience and real skin in the game when tackling these problems.