
Regression Tests Not Portable Between Different Compilers and Systems #3820

@jtramm

Description


The Problem

The regression testing strategy in OpenMC compares results for each test against previously generated result files. This strategy is simple and highly effective at ensuring PRs don't break working code. However, the downside is that it fundamentally assumes that different compilers and systems will produce the same outputs given the same version of the code. I've found that I often get test failures (e.g., different test results) when compiling on both my local MacBook and Linux cluster systems, as compared to what OpenMC's GitHub CI produces. Ideally, we would like our tests to run cleanly everywhere.

Causes

We can get different results from OpenMC on different compilers and systems for at least four reasons:

  1. Floating point non-associativity. This is not a big deal for serial operations, but when running with multiple OpenMP threads (or sometimes with MPI, depending on whether the code requires strict rank ordering), certain floating point operations (like tallying) can happen in an inconsistent order, depending on system conditions that cause threads to execute in different orders. This problem isn't typically very serious, though, as it only shows up in a few places and usually doesn't cause macro-scale divergence.

  2. Differences in system transcendental math operations (e.g., std::sin(), std::exp(), etc.). In the C++ regions of OpenMC, these functions map to the system's C standard library implementation (libm). On Linux systems, this is typically the GNU C Library (glibc); on macOS it is Apple's libSystem (e.g., libSystem.B.dylib); and on Windows it is provided by the Microsoft C runtime. All of these libraries have independent implementations. Furthermore, transcendental functions are not required by IEEE 754 to be correctly rounded, so each implementation may round differently. As such, something like std::exp(-0.1) may give a slightly different result, even when running on Linux but with two different versions of glibc.

It's easy to see how differences in log() might affect transport and cause macro-scale divergence over time (e.g., in Particle::event_advance()):

    collision_distance() = -std::log(prn(current_seed())) / macro_xs().total;

Similarly, for scattering operations, sin/cos and friends will slightly adjust outgoing angles, causing a particle that previously just nicked a surface to miss it, and causing macro-scale divergence.

  3. Differences in compiler FMA contraction. Modern CPUs and GPUs typically have specialized floating point instructions for fused multiply-add (FMA) operations. Consider a line of code like a = b + c*d. Normally a compiler would treat this as a multiply operation and then an addition operation, with an implicit rounding operation happening after each (i.e., two FP operations + two FP roundings). This can be "contracted" into a single FMA instruction. The FMA instruction is notable in that it is just one FP operation + one FP rounding. As such, it can be more accurate due to only having to round once at the end, which can be quite valuable in some cases. Additionally, depending on the architecture, it may be faster, as one FMA might take just one clock cycle compared to needing two for separate multiply and add instructions.

Compilers have different default configurations for how they decide whether or not to "contract" a candidate operation into an FMA instruction, and by default they may have different levels of aggressiveness. Even under the same flags, it's possible that one compiler limits its analysis to single-line contractions, whereas another does more analysis to determine whether two lines of C++ code can be contracted into a single FMA. As such, when any level of FMA contraction is enabled, you can get different instructions being generated between compilers and compiler versions, even on the same system. Because FMA operations affect rounding, a single line of code like a = b + c*d may give different results for a.

This problem is notable in that it occurs right out of the gate in OpenMC. For example, in material.cpp in Material::normalize_density(), as we are initializing material data right after reading it in, we are already going to be rounding things differently depending on FMA treatment:

  for (int i = 0; i < nuclide_.size(); ++i) {
    int i_nuc = nuclide_[i];
    double awr = settings::run_CE ? data::nuclides[i_nuc]->awr_ : 1.0;
    int z = settings::run_CE ? data::nuclides[i_nuc]->Z_ : 0;
    density_gpcc_ += atom_density_(i) * awr * MASS_NEUTRON / N_AVOGADRO;  // Potential FMA
    charge_density_ += atom_density_(i) * z;  // Potential FMA
  }

Thus, before we've even sampled the first particle, our underlying material/XS data is a little different. It won't take long before you get macro-scale particle divergence.

  4. Misc. There are a few other potential differences between compilers that might cause problems, though these all seem less severe than what is listed above. For instance, some systems might treat subnormal floating point values differently (e.g., flush them to zero). Some compilers might auto-vectorize a reduction loop, in which case the floating point ordering may change significantly.

Implications

The implication of the above four issues is that you may see minor differences in simulation outputs (e.g., tally results) caused by minor changes in FP rounding. By themselves, such small changes in rounding are not a big deal, as we can make tests tolerant of tiny rounding differences. However, the bigger issue at play is that tiny differences in XS data, sampling, and scatter operations can rapidly cause macro-scale divergence. A particle that previously just barely nicks a curved surface may instead miss it, potentially causing it to sample its next reaction in a different material altogether, and thus to scatter to a completely different angle or sample a totally different sort of reaction. This can cause more significant changes in tally results for fixed source problems (especially if we are only running a few hundred particles, as is typically the case for fast regression testing). Furthermore, it can cause complete divergence in eigenvalue mode, where a minor change in one history in the first batch can affect which particles get sampled in the next batch, causing the simulations to be running completely different particles after just a few batches.

Severity and Potential Fixes

Each of the problems listed in the "Causes" section above has a different severity and potential remedy.

  1. Parallel floating point non-associativity. For most cases I can think of, I don't see this causing macro-scale divergence, but rather just minor rounding differences in tallies. As such, I don't think we are sensitive to these types of differences yet, as our testing framework seems pretty reproducible (at least for a given compiler + system) regardless of how many threads are used. Thus, no remedy may even be needed.

  2. Math transcendentals. This is a much higher severity problem, and I believe it does cause many test failures. Different Linux systems may have different glibc versions, macOS systems will use the Apple standard library instead of the GNU C standard library as on Linux, etc. Thus, the only way I can see to get around this is to use a fixed implementation of the transcendental functions instead of the system library. There are actually good options here, like the core-math library (https://gitlab.inria.fr/core-math/core-math/). In theory this could be ingested as a vendored dependency. Or, as it is distributed under the MIT license, we could ingest it permanently into the OpenMC repo, which would make things like adding GPU flags much easier.

  3. FMA contraction. This is also a high severity problem, which is most noticeable when compiling with LLVM Clang vs. GNU GCC. Thankfully, there are compiler flags (e.g., -ffp-contract= on GCC and Clang) that can globally disable/enable these policies, so consistent behavior can be achieved by simply altering the OpenMC CMakeLists.txt to be more explicit about what we want, rather than leaving it up to default CMake/compiler-specific choices.

  4. Misc. I'm guessing these are all low severity and we aren't affected by them, but we may need to look more closely at these areas if the above items are all remedied and differences are still showing up in tests.
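One way the fixed-transcendental remedy in item 2 could be staged is to funnel all transcendental calls through a single OpenMC-owned wrapper header, so call sites stop depending on the system libm and the backing implementation can be swapped for a correctly-rounded library like core-math later. The openmc::math namespace and function names below are hypothetical, not an existing OpenMC API:

```cpp
#include <cmath>

// Hypothetical wrapper layer (sketch, not an existing OpenMC API).
// For now these forward to the system libm; replacing the bodies with a
// vendored correctly-rounded implementation (e.g., core-math) would make
// results portable across glibc, libSystem, and the MSVC runtime without
// touching any call sites.
namespace openmc::math {

inline double exp(double x) { return std::exp(x); }
inline double log(double x) { return std::log(x); }
inline double sin(double x) { return std::sin(x); }
inline double cos(double x) { return std::cos(x); }

}  // namespace openmc::math
```

Call sites such as the collision distance sampling would then use openmc::math::log instead of std::log, making the swap a one-file change.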
