MIBM Performance Optimizations#1157

Open
danieljvickers wants to merge 66 commits into MFlowCode:master from danieljvickers:gpu-optimizations

Conversation

@danieljvickers
Member

@danieljvickers danieljvickers commented Feb 17, 2026

User description

Description

Following the refactor of the levelset, there were several performance optimizations left to be made to the code. This PR introduces optimizations that will make multi-particle MIBM code viable. It also expands the upper bound of allowed number of immersed boundaries to 1000. Performance was measured on 1-4 ranks of ACC GPU compute using A100 GPUs.

This PR has extended optimization to STL IBs, which should significantly improve accuracy, performance, and code cleanliness. The primary optimizations are as follows:

  • IB compute is now fully supported on the GPU. Processing of the mesh into memory has been parallelized, as have IB marker generation and levelset compute.
  • Interpolation of IB markers was removed in favor of a projection-based distance search. This means we no longer need to interpolate (saving significant pre-processing time and memory), we iterate over significantly fewer vertices (by a factor of 50 for the coarsest grids), and we get exact values for the levelset instead of approximations limited by how fine the interpolated mesh was.
  • Unified distance and normal compute. We no longer perform a distance search followed by a separate normal vector search. Both are handled in a single subroutine, cutting search time in half.
  • Deleted about 1000 lines of code related to the maintenance and computation of STL models, leaving a cleaner base to build on.
  • Modified STL generation to be cell-center based instead of volume-fraction based. This prevents edge cases where the cell center is outside the model but the point is labeled as inside the model. That error produces levelsets that point into the immersed boundaries and causes instabilities.
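A projection-based exact distance query of the kind described above is typically implemented as a closest-point-on-triangle test. The following is a minimal Python sketch of that standard algorithm (illustrative only; MFC's actual routine is the Fortran s_distance_normals_3D, and the function names here are hypothetical):

```python
import numpy as np

def closest_point_on_triangle(p, a, b, c):
    """Exact closest point on triangle (a, b, c) to point p.

    Classifies p into a vertex, edge, or face Voronoi region and
    projects accordingly -- no interpolated surface mesh required.
    """
    ab, ac, ap = b - a, c - a, p - a
    d1, d2 = ab @ ap, ac @ ap
    if d1 <= 0 and d2 <= 0:
        return a                                   # vertex region a
    bp = p - b
    d3, d4 = ab @ bp, ac @ bp
    if d3 >= 0 and d4 <= d3:
        return b                                   # vertex region b
    vc = d1 * d4 - d3 * d2
    if vc <= 0 and d1 >= 0 and d3 <= 0:
        return a + (d1 / (d1 - d3)) * ab           # edge ab
    cp = p - c
    d5, d6 = ab @ cp, ac @ cp
    if d6 >= 0 and d5 <= d6:
        return c                                   # vertex region c
    vb = d5 * d2 - d1 * d6
    if vb <= 0 and d2 >= 0 and d6 <= 0:
        return a + (d2 / (d2 - d6)) * ac           # edge ac
    va = d3 * d6 - d5 * d4
    if va <= 0 and (d4 - d3) >= 0 and (d5 - d6) >= 0:
        w = (d4 - d3) / ((d4 - d3) + (d5 - d6))
        return b + w * (c - b)                     # edge bc
    denom = va + vb + vc                           # interior: project onto face
    v, w = vb / denom, vc / denom
    return a + v * ab + w * ac

def distance_to_triangle(p, a, b, c):
    """Exact (unsigned) distance from p to the triangle."""
    return np.linalg.norm(p - closest_point_on_triangle(p, a, b, c))
```

Taking the minimum of this query over all mesh triangles gives the exact levelset magnitude, which is why the PR can drop the interpolated vertices entirely.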

Type of change

  • New feature
  • Refactor
  • Other: Performance Tuning

Testing

All changes pass the IBM section of the test suite on GPUs with the NVHPC compiler. Performance was measured with a case of 1000 particles with viscosity enabled. The particles are all resolved 3D spheres given random non-overlapping positions generated by the following case file:

import json
import argparse

num_cells = [240, 240, 240]
dim = [8., 8., 8.]
num_particles = 1000

# create a stationary fluid
case = {
    "run_time_info": "T",
    "parallel_io": "T",
    "m": num_cells[0]-1,
    "n": num_cells[1]-1,
    "p": num_cells[2]-1,
    "dt": 0.005,
    "t_step_start": 0,
    "t_step_stop": 2,
    "t_step_save": 1,
    "num_patches": 1,
    "model_eqns": 2,
    "alt_soundspeed": "F",
    "num_fluids": 1,
    "mpp_lim": "F",
    "mixture_err": "T",
    "time_stepper": 3,
    "recon_type": 1,
    "weno_eps": 1e-16,
    "riemann_solver": 2,
    "wave_speeds": 1,
    "avg_state": 2,
    "precision": 2,
    "format": 1,
    "prim_vars_wrt": "T",
    "E_wrt": "T",
    "viscous": "T",
    "x_domain%beg": -0.5*dim[0],
    "x_domain%end": 0.5*dim[0],
    "y_domain%beg": -0.5*dim[1],
    "y_domain%end": 0.5*dim[1],
    "z_domain%beg": -0.5*dim[2],
    "z_domain%end": 0.5*dim[2],
    "bc_x%beg": -3,
    "bc_x%end": -3,
    "bc_y%beg": -3,
    "bc_y%end": -3,
    "bc_z%beg": -3,
    "bc_z%end": -3,
    "patch_icpp(1)%geometry": 9,
    "patch_icpp(1)%z_centroid": 0.0,
    "patch_icpp(1)%length_z": dim[2],
    "patch_icpp(1)%y_centroid": 0.0,
    "patch_icpp(1)%length_y": dim[1],
    "patch_icpp(1)%x_centroid": 0.0,
    "patch_icpp(1)%length_x": dim[0],
    "weno_order": 5,
    "patch_icpp(1)%pres": 1.0,
    "patch_icpp(1)%alpha_rho(1)": 1.0,
    "patch_icpp(1)%alpha(1)": 1.0,
    "patch_icpp(1)%vel(1)": 0.0,
    "patch_icpp(1)%vel(2)": 0.0,
    "patch_icpp(1)%vel(3)": 0.0,
    "fluid_pp(1)%gamma": 2.5000000000000004,
    "fluid_pp(1)%pi_inf": 0.0,
    "fluid_pp(1)%Re(1)": 2500000,
}

import random
random.seed(42)

dx = [float(dim[i]) / float(num_cells[i]) for i in range(3)]

#set particle properties
particle_radius_cells = 5 # particle radius in grid cells
particle_cell_spacing = particle_radius_cells*2 + 5 # set the spacing to be double the radius plus 5 to guarantee no image points in other IBs
mpi_cell_spacing = particle_radius_cells + 5 # space safely away from MPI halo regions to prevent out of bounds errors
genreation_bounds = [6., 6., 6.] # generate particles in this box safely away from the boundary
velocity_magnitude = 1.

# convert non-dimensional grid cell units to the units of the simulation
radius_units = float(particle_radius_cells) * dx[0]
paticle_units_spacing = float(particle_cell_spacing) * dx[0]
paticle_units_spacing_squared = paticle_units_spacing**2
mpi_units_spacing = [float(mpi_cell_spacing) * dx[i] for i in range(3)]

# generate an array of xyz values that guarantees non-overlapping grid cells
particles = []
while len(particles) < num_particles:
  # generate a completely random position
  position = [random.random() for i in range(3)]
  position = [(position[i] - 0.5) * genreation_bounds[i] for i in range(3)]

  # first check if the particle is too close to the MPI halo regions
  valid = True
  # for i in range(3):
  #   valid = valid and abs(position[i]) >= mpi_units_spacing[i]

  # check for the minimum spacing between all particles, exiting once we find an error
  for particle in particles:
    distance_squared = sum([(particle[i] - position[i])**2 for i in range(3)])
    valid = valid and distance_squared >= paticle_units_spacing_squared
    if not valid:
      break

  if valid:
    particles.append(position)
    # print(f"\rProgress: {100.*float(len(particles))/float(num_particles)}%", end="", flush=True)

# print()
# convert our array of positions to valid JSON for the immersed boundary
ib_properties = {"ib": "T", "num_ibs": num_particles,}
for i in range(len(particles)):
  ib_properties[f'patch_ib({i+1})%radius'] = radius_units
  ib_properties[f'patch_ib({i+1})%slip'] = 'F'
  ib_properties[f'patch_ib({i+1})%geometry'] = 8
  ib_properties[f'patch_ib({i+1})%moving_ibm'] = 2
  ib_properties[f'patch_ib({i+1})%mass'] = 1.
  ib_properties[f'patch_ib({i+1})%x_centroid'] = particles[i][0]
  ib_properties[f'patch_ib({i+1})%y_centroid'] = particles[i][1]
  ib_properties[f'patch_ib({i+1})%z_centroid'] = particles[i][2]
  # move the particle away radially to guarantee they never touch during the simulation
  position_mag = (sum([(particles[i][j])**2 for j in range(3)]))**0.5
  for j in range(3):
    ib_properties[f'patch_ib({i+1})%vel({j+1})'] = particles[i][j] * velocity_magnitude / position_mag

print(json.dumps({**case, **ib_properties}))

These optimizations yield nearly a 1000x speedup in the moving IBM propagation and generation code. Prior to these optimizations, this was the result of the benchmark case using the NVIDIA NSight profiler, showing 45 seconds to run a single RK substep:

[NSight profiler screenshot, Feb 16, 2026: pre-optimization timeline]

Following these optimizations, the same profile achieves almost 50 ms per RK substep:
[NSight profiler screenshot, Feb 17, 2026: post-optimization timeline]

For STLs, the optimizations were tested on an 822,000-vertex mesh of a Mach 0.4 corgi, given by this STL:
https://www.thingiverse.com/thing:4721563/files

The final simulation finished in a total of 25 minutes on a 200^3 grid for 4k time steps on a single A100 GPU. All of the code related to the STL model (file reading, preprocessing, IB marker generation, and levelset compute) took only 20 seconds of the run time. The result of that simulation can be viewed here:
https://www.youtube.com/watch?v=h44BNCKo0Hs

Checklist

  • I updated documentation if user-facing behavior changed

See the developer guide for full coding standards.

GPU changes (expand if you modified src/simulation/)
  • GPU results match CPU results
  • Tested on NVIDIA GPU or AMD GPU

CodeAnt-AI Description

GPU-accelerate STL immersed-boundary compute and support up to 1000 IBs

What Changed

  • STL models are read, transformed, packed into GPU-friendly flat arrays and uploaded so immersed-boundary (IB) marker and levelset checks run on the device instead of on CPU-side model structures
  • Levelset/inside-model queries now use GPU-ready routines (including a GPU-safe random-number step and a flattened triangle intersection path) so per-cell STL inside/distance tests execute on the GPU
  • IB patch routines and levelset application limit loops to grid regions that overlap each patch (binary-searched bounding indices), reducing the number of cells inspected for every patch
  • API and bookkeeping now handle many more IBs: the global patch limit raised to 1000 and model metadata (counts, bounding boxes) are kept in GPU-updatable arrays
  • 2D/3D model distance and normal computation changed to projection-based nearest-point checks (exact triangle/edge projections) and triangle normals are normalized when reading STLs
  • Minor runtime observability: NVTX profiling ranges added around immersed-boundary propagation
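The binary-searched bounding indices mentioned above can be sketched with Python's bisect module over the sorted cell-center coordinates (hypothetical names; MFC's Fortran implementation differs):

```python
from bisect import bisect_left, bisect_right

def patch_index_bounds(cell_centers, lo, hi):
    """Return the first/last cell indices whose centers fall in [lo, hi].

    cell_centers must be sorted ascending; returns (beg, end) inclusive,
    with beg > end when no cell center lies inside the patch bounds.
    Per-patch loops can then run over [beg, end] instead of every cell.
    """
    beg = bisect_left(cell_centers, lo)
    end = bisect_right(cell_centers, hi) - 1
    return beg, end

# 10 uniform cells on [0, 10): centers 0.5, 1.5, ..., 9.5
x_cc = [i + 0.5 for i in range(10)]
# A patch spanning x in [2.0, 5.0] only touches cells 2..4,
# so the patch loop inspects 3 cells instead of 10.
```

For a small spherical IB on a large grid, this narrowing is where most of the per-patch savings come from.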

Impact

✅ Faster IB marker generation
✅ Lower CPU during IB setup and levelset evaluation
✅ Support up to 1000 immersed boundaries


@danieljvickers danieljvickers marked this pull request as ready for review February 19, 2026 18:39
@sbryngelson
Member

Claude Code Review

Head SHA: 7abaffc747841846b0e65447ef86f64bd8a70503
Files changed: 21


Summary

The PR makes significant and valuable performance improvements by moving IBM computations to the GPU with proper Fypp macro usage. The new projection-based distance computation is mathematically cleaner than the interpolation approach. GPU macro usage is correct throughout — all GPU parallelism uses GPU_PARALLEL_LOOP/END_GPU_PARALLEL_LOOP, GPU_ATOMIC, GPU_ROUTINE, GPU_DECLARE, GPU_UPDATE. No raw !$acc or !$omp directives found. Precision discipline is respected.

However, there are several issues that should be addressed before merging.


Critical Issues

1. stl_bounding_boxes allocated inside a loop — will crash with multiple STL patches

src/common/m_model.fpp, in s_instantiate_STL_models:

stl_bounding_boxes is allocated inside the do patch_id = 1, num_ibs loop with allocate(stl_bounding_boxes(patch_id, 1:3, 1:3)). On the first STL patch, this allocates a (1, 3, 3) array. When a second STL patch is encountered, the array is already allocated, causing a Fortran runtime error.

Fix: Allocate stl_bounding_boxes(1:num_ibs, 1:3, 1:3) once before the loop.

2. num_patches_max raised from 10 to 1000 — massive unintended memory inflation

src/common/m_constants.fpp, line 26.

num_patches_max controls both IB patches and initial condition patches. It sizes patch_icpp(num_patches_max) and patch_ib(num_patches_max) in both pre_process and simulation. Each ic_patch_parameters contains alter_patch(0:num_patches_max-1) (now 1000 booleans), plus many reals and strings. With 1000 entries, patch_icpp alone balloons from ~40KB to ~4MB of static data per MPI rank — even for cases with zero IBs.

Fix: Introduce a separate num_ibs_max = 1000 constant for IB patches, leaving num_patches_max at 10.
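The quadratic scaling is easy to check with back-of-the-envelope arithmetic: each patch_icpp entry carries an alter_patch array sized by the same constant. The sketch below assumes 4-byte logicals (actual sizes are compiler-dependent):

```python
def alter_patch_bytes(num_patches_max, logical_bytes=4):
    """Static bytes consumed by alter_patch fields alone: num_patches_max
    entries, each holding an alter_patch(0:num_patches_max-1) array."""
    return num_patches_max * num_patches_max * logical_bytes

old = alter_patch_bytes(10)      # 400 bytes at the old limit
new = alter_patch_bytes(1000)    # 4,000,000 bytes (~4 MB) per MPI rank
```

This is why a dedicated num_ibs_max avoids inflating initial-condition patch storage that never needs 1000 entries.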

3. Division by zero in s_distance_normals_2D when point coincides with vertex

src/common/m_model.fpp, in s_distance_normals_2D, the branches for t < 0 and t > 1:

dist = sqrt((point(1) - v1(1))**2 + (point(2) - v1(2))**2)
norm = norm/dist   ! Division by zero if dist == 0

The 3D counterpart (s_distance_normals_3D) correctly guards with if (norm_mag > 0._wp), but the 2D version does not.

Fix: Add if (dist > 0._wp) guard before normalizing.
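The suggested guard mirrors the 3D routine's norm_mag check. A minimal Python sketch of the guarded 2D normalization (the zero-distance fallback vector here is a hypothetical choice, not taken from the PR):

```python
import math

def edge_normal_2d(point, v1):
    """Direction from vertex v1 toward point, normalized, with a guard
    against the degenerate case where point coincides with v1."""
    nx, ny = point[0] - v1[0], point[1] - v1[1]
    dist = math.hypot(nx, ny)
    if dist > 0.0:
        return (nx / dist, ny / dist), dist
    # Degenerate: the point sits exactly on the vertex, so no direction
    # is defined; return an arbitrary unit vector (hypothetical fallback).
    return (1.0, 0.0), 0.0
```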

4. Missing @:DEALLOCATE for GPU arrays allocated in s_instantiate_STL_models

src/common/m_model.fpp allocates these module-level GPU arrays with @:ALLOCATE:

  • gpu_ntrs, gpu_trs_v, gpu_trs_n, gpu_boundary_edge_count, gpu_total_vertices, gpu_boundary_v

None are deallocated anywhere, violating the @:ALLOCATE/@:DEALLOCATE pairing rule. Additionally, stl_bounding_boxes and models(:) are allocated with plain allocate but never deallocated.

Fix: Add a s_finalize_model_module routine that @:DEALLOCATEs all GPU arrays, called from s_finalize_ibm_module.


Important Issues

5. error stop used instead of call s_mpi_abort()

src/simulation/m_ibm.fpp, ~line 504:

error stop "Ghost Point and Image Point on Different Processors"

Per project rules, use call s_mpi_abort() instead.

6. Stale error message in case_validator.py

toolchain/mfc/case_validator.py, line 687:

self.prohibit(ib and (num_ibs <= 0 or num_ibs > 1000),
             "num_ibs must be between 1 and num_patches_max (10)")

Limit changed to 1000 but message still says "(10)".

7. Unbounded do while on GPU in s_compute_image_points

src/simulation/m_ibm.fpp, in s_compute_image_points:

The bounds check that prevents index from going out of range is #ifdef-disabled on GPU builds. If an image point falls outside the domain, this creates an infinite loop on the GPU or out-of-bounds memory access.

Fix: Add a bounded iteration limit and exit if exceeded.
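The bounded-iteration pattern the fix suggests can be sketched as follows (Python illustration, not the Fortran code; a point can never need more steps than there are cells, so that count makes a safe cap):

```python
def find_containing_cell(s_cc, start_index, direction, loc, max_iter=None):
    """Walk cell faces s_cc until loc lies in [s_cc[i], s_cc[i+1]].

    A hard iteration cap turns a would-be infinite loop (image point
    outside the local domain) into a detectable failure instead of an
    unbounded GPU loop or out-of-bounds access.
    """
    if max_iter is None:
        max_iter = len(s_cc)
    i, iters = start_index, 0
    while not (s_cc[i] <= loc <= s_cc[i + 1]):
        i += direction
        iters += 1
        if i < 0 or i + 1 >= len(s_cc) or iters > max_iter:
            raise RuntimeError("image point outside local domain")
    return i
```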


Minor Notes

  • Commented-out GPU_ROUTINE directives in f_model_random_number and f_model_is_inside (src/common/m_model.fpp, lines 511, 531) — clarify whether these are intentionally CPU-only.
  • s_find_ghost_points ghost point ordering is nondeterministic on GPU due to atomic capture — acceptable if downstream code is order-insensitive.
  • s_check_boundary uses copy clause on large local arrays — may be slow for large 2D models but only runs at initialization.

Verdict

The GPU offloading approach is correct and the Fypp macro usage follows project conventions. The four critical issues (loop allocation crash, memory inflation, division by zero, missing deallocations) should be fixed before merging. The important issues are lower priority but worth addressing.

@danieljvickers
Member Author

danieljvickers commented Feb 27, 2026

@sbryngelson The Frontier CPU test failed due to a compile-time segfault. This happened two days ago, so I manually got onto Frontier, did an ./mfc.sh clean, and rebuilt. It built fine and passed the test suite when I ran it manually. I just submitted it again assuming it was a spurious failure, but that appears not to be true. The strangeness is compounded by the fact that the OpenMP and OpenACC tests compiled fine on Frontier. It took me about 30 hours to get that runner, and I would like to avoid the game of blindly committing changes hoping one fixes the compiler segfault. Before I go down that rabbit hole, do you think this could in any way be related to the caching changes you made on Frontier? And if so, do you think we can test building this one from scratch to see if it passes? I will work on this in the background until I hear either way.

@github-actions

Claude Code Review

Head SHA: 370aa4c
Files changed: 21

Changed files:

  • src/common/include/parallel_macros.fpp
  • src/common/m_constants.fpp
  • src/common/m_derived_types.fpp
  • src/common/m_helper.fpp
  • src/common/m_model.fpp (largest change: +475/-488)
  • src/pre_process/m_icpp_patches.fpp
  • src/simulation/m_compute_levelset.fpp
  • src/simulation/m_ib_patches.fpp
  • src/simulation/m_ibm.fpp
  • src/simulation/m_mpi_proxy.fpp (+5/-187)
  • src/simulation/m_time_steppers.fpp
  • tests/ (3 test UUIDs, golden files regenerated on GPU/nvfortran)
  • toolchain/ (param definitions, validator, test cases)

Summary:

  • Removes STL interpolation in favour of exact projection-based distance / inside queries (both 2D and 3D), eliminating ~1,000 lines of interpolation code
  • Packs GPU-friendly flat arrays (gpu_trs_v, gpu_trs_n, gpu_boundary_v) and uploads to device so levelset/inside-model compute runs fully on-GPU
  • Introduces bounding-index narrowing for every IB patch type, drastically reducing the per-patch cell loop
  • Removes s_mpi_sendrecv_ib_buffers / s_populate_ib_buffers and instead extends IB-marker loops into ghost-cell ranges
  • Raises num_patches_max from 10 → 1000

Findings

🔴 Bug — stl_bounding_boxes re-allocated inside loop

stl_bounding_boxes is a module-level allocatable. The first STL patch allocates it with shape (1,3,3); the second STL patch hits an already-allocated variable and causes a runtime error. Fix: allocate once before the loop:

allocate(stl_bounding_boxes(1:num_ibs, 1:3, 1:3))

Then remove the per-patch allocate inside the loop.


🔴 Correctness — MPI halo exchange for IB markers removed

s_mpi_sendrecv_ib_buffers and s_populate_ib_buffers were deleted. The IB-marker loops now extend into ghost-cell index ranges locally, but this only helps if the IB geometry overlaps the local subdomain's ghost region. When two ranks share an IB near their subdomain boundary, the ghost cells of rank A are filled by rank A's local marker kernel (correct), but rank A's interior ghost cells that should reflect rank B's IB markers are never populated via MPI. The result is that s_find_num_ghost_points and s_find_ghost_points see stale zeros in neighbour ghost cells, causing ghost points near MPI boundaries to be silently missed.

This is difficult to hit in single-rank GPU benchmarks (which is how the PR was validated), but it is a regression for any multi-rank MIBM run where an IB boundary crosses a subdomain edge.
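The failure mode can be illustrated with a toy 1D two-rank marker field (a pure-Python stand-in with array copies in place of MPI sends; no relation to MFC's actual buffers):

```python
import numpy as np

ng = 2                      # ghost-layer width
# Two "ranks", each owning 8 interior cells plus ng ghost cells per side.
rank_a = np.zeros(8 + 2 * ng, dtype=int)
rank_b = np.zeros(8 + 2 * ng, dtype=int)

# An IB straddles the A|B subdomain boundary: its markers fill A's last
# interior cells and B's first interior cells.
rank_a[ng + 6:ng + 8] = 1   # A interior, adjacent to the shared face
rank_b[ng:ng + 2] = 1       # B interior, adjacent to the shared face

# Without a halo exchange, A's right ghost cells stay 0 even though the
# neighbouring interior of B holds markers there -- so A's ghost-point
# search sees stale zeros and silently misses points.
assert rank_a[-ng:].sum() == 0

# A lightweight exchange copies the neighbour's boundary interior cells:
rank_a[-ng:] = rank_b[ng:2 * ng]
rank_b[:ng] = rank_a[-2 * ng:-ng]
```

After the exchange, both ranks see the full marker footprint near the shared face, which is the property the deleted s_mpi_sendrecv_ib_buffers provided.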


🟡 Portability — Named DO exits inside (, and )

Named DO constructs with exit label inside GPU parallel loops are not guaranteed to compile/work correctly on Cray CCE or AMD flang backends. In the Cray OpenMP path (--gpu mp), exit from a named construct inside a target teams distribute region is undefined behaviour. Consider replacing with a scalar flag (is_gp) and using cycle on the outer loops, or restructuring to avoid early exit.


🟡 GPU atomic capture correctness (, )

The PR adds END_GPU_ATOMIC_CAPTURE() to parallel_macros.fpp but uses the existing GPU_ATOMIC(atomic='capture') macro as the opening:

If the existing GPU_ATOMIC macro only emits ! atomic update (ignoring the capture argument), then the closing ! end atomic from the new macro will produce a dangling directive and likely a compile error with nvfortran/CCE. Please verify that GPU_ATOMIC(atomic='capture') does indeed emit ! atomic capture / ! atomic capture rather than update.
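What atomic capture must provide, unlike a plain atomic update, is the pre-increment value as a unique slot index for each thread. A Python emulation with a lock (illustrative only; on the GPU this is a single hardware fetch-and-add):

```python
import threading

class AtomicCounter:
    """Emulates `atomic capture`: increment and return the pre-increment
    value in one indivisible step, as in ghost-point counting."""
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def fetch_add(self):
        with self._lock:
            old = self._value      # captured value becomes this thread's slot
            self._value += 1
            return old

counter = AtomicCounter()
slots = []
slots_lock = threading.Lock()

def worker():
    s = counter.fetch_add()
    with slots_lock:
        slots.append(s)

threads = [threading.Thread(target=worker) for _ in range(64)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Every thread received a distinct slot, but the ordering in `slots` is
# nondeterministic -- which is why downstream code must be order-insensitive.
```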


🟡 Ray direction bug (~line 224–226)

This sets the raw (unnormalised) ray direction component to point(k) + random - 0.5. The point coordinates are added to what should be a zero-centred direction, biasing all rays toward the origin. The original code used random - 0.5 without adding point(k). The normalisation on the next line partially recovers, but introduces a systematic directional bias proportional to the cell location. Consider removing point(k) +.
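The bias is easy to demonstrate: once the query point is more than half a unit from the origin, adding its coordinates pushes every ray into the same octant, while the zero-centred form samples all directions (Python illustration of the two formulas quoted above):

```python
import random

random.seed(0)
point = (5.0, 5.0, 5.0)   # a cell centre away from the origin

biased, unbiased = [], []
for _ in range(100):
    r = [random.random() for _ in range(3)]
    # buggy form: point(k) + rand - 0.5  ->  every component near +5
    biased.append(tuple(point[k] + r[k] - 0.5 for k in range(3)))
    # intended form: zero-centred direction in [-0.5, 0.5)^3
    unbiased.append(tuple(r[k] - 0.5 for k in range(3)))

def octant(d):
    return tuple(c > 0 for c in d)

# All biased rays point into the (+,+,+) octant; the unbiased rays cover
# many octants, as a robust ray-casting inside/outside test requires.
assert {octant(d) for d in biased} == {(True, True, True)}
assert len({octant(d) for d in unbiased}) > 4
```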


🟡 GPU-callable function references module-level globals without GPU_DECLARE or arguments

The new GPU-callable function uses the module-level variable p (z-grid size) to decide 2D vs 3D behaviour, and the loop uses gpu_trs_v/gpu_trs_n as module-level GPU-declared arrays. For Cray OpenMP target offload, module-level variables used inside GPU_ROUTINE functions must be explicitly listed in a declare target clause (or the function must receive them as arguments). p and the gpu_* arrays are not in GPU_DECLARE here — verify this is handled correctly for --gpu mp builds.


ℹ️ Minor — Unused variables

Variables point and local_point are declared but never referenced in the new 2D s_ib_model body. They should be removed to avoid compiler warnings.


ℹ️ Minor — Golden files regenerated only on GPU/nvfortran

The three updated golden files (0A362971, 4E0FBE72, EA8FA07E) were regenerated on an NVIDIA GPU with nvfortran 23.11. Per the project checklist, golden files should normally also be verified against the gfortran CPU path. The metadata confirms no CPU regeneration was done.


Overall

The performance approach is sound and the ~1,000× speedup claim is plausible given the algorithmic improvements (projection-based distance, GPU offload of all IB compute, bounding-index narrowing). The critical item is the stl_bounding_boxes allocation bug, which will crash on any simulation with ≥ 2 STL patches. The MPI-correctness regression is the second-highest priority before merge. Recommend:

  1. Fix the allocate-in-loop bug.
  2. Re-evaluate whether removing the IB-marker MPI halo exchange is safe, or re-add a lightweight exchange.
  3. Verify GPU_ATOMIC(atomic='capture') expands correctly.
  4. Validate on ≥ 2 MPI ranks with an STL patch that spans the domain split.

@coderabbitai

coderabbitai bot commented Feb 27, 2026

📝 Walkthrough

Walkthrough

This pull request implements GPU-optimized changes for STL model handling and immersed boundary computation. Key modifications include: expanding maximum patch limits from 10 to 1000; refactoring STL model geometry APIs to use GPU-resident buffers and data structures (introducing gpu_ntrs, gpu_trs_v, gpu_trs_n, and related arrays); replacing logical return types with integer flags in intersection testing; transitioning from interpolation-dependent geometry checks to direct distance-normal calculations via deterministic projections; migrating IBM patch processing to GPU parallel constructs with atomic operations for point counting; adding NVTX profiling markers around IBM workflows; introducing seeded Fortran-based RNG for deterministic GPU execution; and extending MPI broadcasts for model-related parameters. Test metadata files reflect updated hardware environments and compiler toolchain changes to NVHPC with OpenACC enabled.

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 50.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title check ✅ — The title 'MIBM Performance Optimizations' is concise, clear, and directly summarizes the primary focus of the changeset: performance improvements to the moving immersed boundary method (MIBM) code.
  • Description check ✅ — The PR description provides comprehensive context: clear motivation (making multi-particle MIBM viable), specific optimizations (GPU acceleration, projection-based distance, unified normal/distance compute), measurable performance results (1000x speedup in IB propagation), testing details, and supporting artifacts (case file, benchmark images, video link).


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/simulation/m_ibm.fpp (1)

491-506: ⚠️ Potential issue | 🔴 Critical

Bound the image-point search on GPU and remove error stop.

Line 494 disables the bounds-failure block for GPU builds, so the loop in Lines 491-493 can run index out of range and continue dereferencing s_cc(index + 1). Also, Line 504 uses error stop, which is disallowed.

💡 Proposed fix
-        integer :: dir
-        integer :: index
+        integer :: dir
+        integer :: index, iter, max_iter
@@
-                    do while ((temp_loc < s_cc(index) &
-                               .or. temp_loc > s_cc(index + 1)))
+                    iter = 0
+                    max_iter = 2*(bound + buff_size + 2)
+                    do while ((temp_loc < s_cc(index) &
+                               .or. temp_loc > s_cc(index + 1)))
                         index = index + dir
-#if !defined(MFC_OpenACC) && !defined(MFC_OpenMP)
-                        if (index < -buff_size .or. index > bound) then
+                        iter = iter + 1
+                        if (index < -buff_size .or. index > bound .or. iter > max_iter) then
                             print *, "A required image point is not located in this computational domain."
                             print *, "Ghost Point is located at ", [x_cc(i), y_cc(j), z_cc(k)], " while moving in dimension ", dim
                             print *, "We are searching for image point at ", ghost_points_in(q)%ip_loc(:)
@@
-                            error stop "Ghost Point and Image Point on Different Processors"
+                            call s_mpi_abort()
                         end if
-#endif
                     end do

As per coding guidelines: "Never use stop or error stop. Use call s_mpi_abort() or @:PROHIBIT()/@:ASSERT() instead."

src/simulation/m_ib_patches.fpp (1)

268-284: ⚠️ Potential issue | 🟠 Major

Privatize airfoil temporaries used inside the collapsed GPU loop.

xa, yc, and dycdxc are assigned inside the kernel but are not listed in the private clause at Line 268.

💡 Proposed fix
-        $:GPU_PARALLEL_LOOP(private='[i,j,xy_local,k,f]', &
+        $:GPU_PARALLEL_LOOP(private='[i,j,xy_local,k,f,xa,yc,dycdxc]', &
                   & copyin='[patch_id,center,inverse_rotation,offset,ma,ca_in,airfoil_grid_u,airfoil_grid_l]', collapse=2)

As per coding guidelines: "Declare private(...) on loop-local variables in GPU kernels to avoid unintended data sharing between threads."

🧹 Nitpick comments (4)
src/common/m_helper.fpp (1)

336-336: Guard GPU_ROUTINE in src/common with MFC_SIMULATION.

Please wrap this directive with #:if MFC_SIMULATION / #:endif in src/common to avoid GPU decoration leaking into non-simulation targets.

Suggested change
-        $:GPU_ROUTINE(parallelism='[seq]')
+        #:if MFC_SIMULATION
+            $:GPU_ROUTINE(parallelism='[seq]')
+        #:endif
As per coding guidelines: Only `src/simulation/` is GPU-accelerated; guard GPU macros with `#:if MFC_SIMULATION` for code in `src/common/`.
toolchain/mfc/params/definitions.py (1)

14-14: Use NI as the single source of truth for num_ibs max.

1000 is now duplicated. Reusing NI in CONSTRAINTS prevents drift.

Suggested change
-    "num_ibs": {"min": 0, "max": 1000},
+    "num_ibs": {"min": 0, "max": NI},

Also applies to: 675-675

src/simulation/m_ib_patches.fpp (1)

945-959: Use the copied threshold scalar inside the kernel.

Line 957 reads patch_ib(patch_id)%model_threshold even though threshold is already copied into the kernel, adding unnecessary global reads.

💡 Proposed fix
-                    if (eta > patch_ib(patch_id)%model_threshold) then
+                    if (eta > threshold) then
                         ib_markers%sf(i, j, k) = patch_id
                     end if
src/common/m_model.fpp (1)

547-560: Use direction-only random vectors for ray casting.

Line 558 adds point(k) into ray_dirs, which biases directions by absolute position and reduces angular diversity of the ray test.

💡 Proposed fix
-                ray_dirs(i, k) = point(k) + f_model_random_number(rand_seed) - 0.5_wp
+                ray_dirs(i, k) = f_model_random_number(rand_seed) - 0.5_wp

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 35b2134 and 370aa4c.

📒 Files selected for processing (21)
  • src/common/include/parallel_macros.fpp
  • src/common/m_constants.fpp
  • src/common/m_derived_types.fpp
  • src/common/m_helper.fpp
  • src/common/m_model.fpp
  • src/pre_process/m_icpp_patches.fpp
  • src/simulation/m_compute_levelset.fpp
  • src/simulation/m_ib_patches.fpp
  • src/simulation/m_ibm.fpp
  • src/simulation/m_mpi_proxy.fpp
  • src/simulation/m_time_steppers.fpp
  • tests/0A362971/golden-metadata.txt
  • tests/0A362971/golden.txt
  • tests/4E0FBE72/golden-metadata.txt
  • tests/4E0FBE72/golden.txt
  • tests/EA8FA07E/golden-metadata.txt
  • tests/EA8FA07E/golden.txt
  • toolchain/mfc/case_validator.py
  • toolchain/mfc/params/definitions.py
  • toolchain/mfc/params_tests/test_definitions.py
  • toolchain/mfc/test/cases.py

  integer, parameter :: num_fluids_max = 10 !< Maximum number of fluids in the simulation
  integer, parameter :: num_probes_max = 10 !< Maximum number of flow probes in the simulation
- integer, parameter :: num_patches_max = 10
+ integer, parameter :: num_patches_max = 1000

⚠️ Potential issue | 🟠 Major

Decouple IB scaling from global patch limit.

Line 26 increases num_patches_max to 1000, which expands all patch-sized static structures globally, not just IB capacity. This creates avoidable memory pressure and can impact startup/runtime stability. Consider keeping num_patches_max at its prior scope and introducing a dedicated IB limit constant (e.g., num_ibs_max).

Suggested constant split
-    integer, parameter :: num_patches_max = 1000
+    integer, parameter :: num_patches_max = 10
+    integer, parameter :: num_ibs_max = 1000
Based on learnings: Code Review Priorities (in order): (1) Correctness, (2) Precision discipline, (3) Memory management, (4) MPI correctness, (5) GPU code, (6) Physics consistency, (7) Compiler portability.

Comment on lines +29 to 44
#ifdef MFC_SIMULATION
public :: s_instantiate_STL_models
#endif

!! array of STL models that can be allocated and then used in IB marker and levelset compute
type(t_model_array), allocatable, target :: models(:)
!! GPU-friendly flat arrays for STL model data
integer, allocatable :: gpu_ntrs(:)
real(wp), allocatable, dimension(:, :, :, :) :: gpu_trs_v
real(wp), allocatable, dimension(:, :, :) :: gpu_trs_n
real(wp), allocatable, dimension(:, :, :, :) :: gpu_boundary_v
integer, allocatable :: gpu_boundary_edge_count(:)
integer, allocatable :: gpu_total_vertices(:)
real(wp), allocatable :: stl_bounding_boxes(:, :, :)
$:GPU_DECLARE(create='[gpu_ntrs,gpu_trs_v,gpu_trs_n,gpu_boundary_v,gpu_boundary_edge_count,gpu_total_vertices]')


⚠️ Potential issue | 🟠 Major

Guard GPU declarations in src/common with #:if MFC_SIMULATION.

This module is in src/common; keeping GPU declarations unconditional here increases non-simulation build risk.

💡 Proposed fix
-    $:GPU_DECLARE(create='[gpu_ntrs,gpu_trs_v,gpu_trs_n,gpu_boundary_v,gpu_boundary_edge_count,gpu_total_vertices]')
+#ifdef MFC_SIMULATION
+    $:GPU_DECLARE(create='[gpu_ntrs,gpu_trs_v,gpu_trs_n,gpu_boundary_v,gpu_boundary_edge_count,gpu_total_vertices]')
+#endif

Based on learnings: "Only src/simulation/ is GPU-accelerated; guard GPU macros with #:if MFC_SIMULATION for code in src/common/."

Comment on lines +157 to 160
read (line(13:), *) normal
v_norm = sqrt(normal(1)**2 + normal(2)**2 + normal(3)**2)
if (v_norm > 0._wp) model%trs(i)%n = normal/v_norm


⚠️ Potential issue | 🟠 Major

Initialize ASCII facet normals in the degenerate-normal path.

If v_norm == 0 at Line 159, model%trs(i)%n is never assigned in this path and remains undefined.

💡 Proposed fix
             call s_skip_ignored_lines(iunit, buffered_line, is_buffered)
             read (line(13:), *) normal
             v_norm = sqrt(normal(1)**2 + normal(2)**2 + normal(3)**2)
-            if (v_norm > 0._wp) model%trs(i)%n = normal/v_norm
+            model%trs(i)%n = 0._wp
+            if (v_norm > 0._wp) model%trs(i)%n = normal/v_norm

Comment on lines +1024 to 1034
else if (t < 0._wp) then ! negative t means that v1 is the closest point on the edge
dist = sqrt((point(1) - v1(1))**2 + (point(2) - v1(2))**2)
norm(1) = v1(1) - point(1)
norm(2) = v1(2) - point(2)
norm = norm/dist
else ! t > 1 means that v2 is the closest point on the line edge
dist = sqrt((point(1) - v2(1))**2 + (point(2) - v2(2))**2)
norm(1) = v2(1) - point(1)
norm(2) = v2(2) - point(2)
norm = norm/dist
end if

⚠️ Potential issue | 🔴 Critical

Prevent divide-by-zero when normalizing vertex vectors.

In Lines 1028 and 1033, norm = norm/dist can divide by zero when the query point coincides with v1 or v2.

💡 Proposed fix
             else if (t < 0._wp) then ! negative t means that v1 is the closest point on the edge
                 dist = sqrt((point(1) - v1(1))**2 + (point(2) - v1(2))**2)
                 norm(1) = v1(1) - point(1)
                 norm(2) = v1(2) - point(2)
-                norm = norm/dist
+                if (dist > 0._wp) then
+                    norm = norm/dist
+                else
+                    norm = 0._wp
+                end if
             else ! t > 1 means that v2 is the closest point on the line edge
                 dist = sqrt((point(1) - v2(1))**2 + (point(2) - v2(2))**2)
                 norm(1) = v2(1) - point(1)
                 norm(2) = v2(2) - point(2)
-                norm = norm/dist
+                if (dist > 0._wp) then
+                    norm = norm/dist
+                else
+                    norm = 0._wp
+                end if
             end if
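The clamped-projection-plus-guard logic can be sketched compactly in Python (a sketch of the review's fix, with hypothetical names, not the Fortran routine itself):

```python
import math

def segment_distance_normal(point, v1, v2):
    """2D point-to-segment distance and unit vector from the point toward
    the closest point on the segment, with a guard for the degenerate
    case where the point coincides with the closest point."""
    ex, ey = v2[0] - v1[0], v2[1] - v1[1]
    seg_len2 = ex * ex + ey * ey
    # Projection parameter onto the infinite line, clamped to [0, 1] so
    # t < 0 resolves to v1 and t > 1 resolves to v2.
    t = 0.0 if seg_len2 == 0.0 else ((point[0] - v1[0]) * ex + (point[1] - v1[1]) * ey) / seg_len2
    t = min(max(t, 0.0), 1.0)
    cx, cy = v1[0] + t * ex, v1[1] + t * ey  # closest point on the segment
    nx, ny = cx - point[0], cy - point[1]
    dist = math.hypot(nx, ny)
    if dist > 0.0:
        return dist, (nx / dist, ny / dist)
    return 0.0, (0.0, 0.0)  # degenerate: the point lies on the segment
```

Without the `dist > 0` guard, a cell center sitting exactly on a boundary vertex produces a NaN/Inf normal, which is precisely the failure mode flagged above.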

Comment on lines +1138 to +1141
allocate (stl_bounding_boxes(patch_id, 1:3, 1:3))
stl_bounding_boxes(patch_id, 1, 1:3) = [bbox%min(1), (bbox%min(1) + bbox%max(1))/2._wp, bbox%max(1)]
stl_bounding_boxes(patch_id, 2, 1:3) = [bbox%min(2), (bbox%min(2) + bbox%max(2))/2._wp, bbox%max(2)]
stl_bounding_boxes(patch_id, 3, 1:3) = [bbox%min(3), (bbox%min(3) + bbox%max(3))/2._wp, bbox%max(3)]

⚠️ Potential issue | 🔴 Critical

stl_bounding_boxes allocation will fail with multiple STL patches.

Allocating stl_bounding_boxes inside the patch loop with allocate (stl_bounding_boxes(patch_id, ...)) will fail on the second STL patch because the allocatable is already allocated.

💡 Proposed fix
-        do patch_id = 1, num_ibs
+        if (.not. allocated(stl_bounding_boxes)) then
+            allocate (stl_bounding_boxes(1:num_ibs, 1:3, 1:3))
+            stl_bounding_boxes = 0._wp
+        end if
+
+        do patch_id = 1, num_ibs
@@
-                allocate (stl_bounding_boxes(patch_id, 1:3, 1:3))
                 stl_bounding_boxes(patch_id, 1, 1:3) = [bbox%min(1), (bbox%min(1) + bbox%max(1))/2._wp, bbox%max(1)]
                 stl_bounding_boxes(patch_id, 2, 1:3) = [bbox%min(2), (bbox%min(2) + bbox%max(2))/2._wp, bbox%max(2)]
                 stl_bounding_boxes(patch_id, 3, 1:3) = [bbox%min(3), (bbox%min(3) + bbox%max(3))/2._wp, bbox%max(3)]
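The allocate-once-then-fill pattern behind this fix can be sketched in Python (function names are hypothetical; rows mirror the `stl_bounding_boxes(patch_id, axis, 1:3)` layout in the diff):

```python
def pack_bounding_box(bmin, bmax):
    # One row per axis: [min, midpoint, max].
    return [[bmin[a], 0.5 * (bmin[a] + bmax[a]), bmax[a]] for a in range(3)]

def build_stl_bounding_boxes(patch_bboxes):
    # Allocate the full (num_ibs, 3, 3) table once, before the patch loop,
    # then fill per patch. Re-creating the table inside the loop is the
    # analogue of the flagged per-iteration allocate, which fails on the
    # second STL patch.
    table = [[[0.0] * 3 for _ in range(3)] for _ in patch_bboxes]
    for pid, bbox in enumerate(patch_bboxes):
        if bbox is not None:  # non-STL patches leave their slot zeroed
            table[pid] = pack_bounding_box(*bbox)
    return table
```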

Comment on lines +1173 to +1209
@:ALLOCATE(gpu_ntrs(1:num_ibs))
@:ALLOCATE(gpu_trs_v(1:3, 1:3, 1:max_ntrs, 1:num_ibs))
@:ALLOCATE(gpu_trs_n(1:3, 1:max_ntrs, 1:num_ibs))
@:ALLOCATE(gpu_boundary_edge_count(1:num_ibs))
@:ALLOCATE(gpu_total_vertices(1:num_ibs))

gpu_ntrs = 0
gpu_trs_v = 0._wp
gpu_trs_n = 0._wp
gpu_boundary_edge_count = 0
gpu_total_vertices = 0

if (max_bv1 > 0) then
@:ALLOCATE(gpu_boundary_v(1:max_bv1, 1:max_bv2, 1:max_bv3, 1:num_ibs))
gpu_boundary_v = 0._wp
end if

do pid = 1, num_ibs
if (allocated(models(pid)%model)) then
gpu_ntrs(pid) = models(pid)%ntrs
gpu_trs_v(:, :, 1:models(pid)%ntrs, pid) = models(pid)%trs_v
gpu_trs_n(:, 1:models(pid)%ntrs, pid) = models(pid)%trs_n
gpu_boundary_edge_count(pid) = models(pid)%boundary_edge_count
gpu_total_vertices(pid) = models(pid)%total_vertices
end if
if (allocated(models(pid)%boundary_v) .and. p == 0) then
gpu_boundary_v(1:size(models(pid)%boundary_v, 1), &
1:size(models(pid)%boundary_v, 2), &
1:size(models(pid)%boundary_v, 3), pid) = models(pid)%boundary_v
end if
end do

$:GPU_UPDATE(device='[gpu_ntrs, gpu_trs_v, gpu_trs_n, gpu_boundary_edge_count, gpu_total_vertices]')
if (allocated(gpu_boundary_v)) then
$:GPU_UPDATE(device='[gpu_boundary_v]')
end if
end if

⚠️ Potential issue | 🟠 Major

Add a finalizer for newly allocated module GPU buffers.

@:ALLOCATE is used for gpu_ntrs, gpu_trs_v, gpu_trs_n, gpu_boundary_edge_count, gpu_total_vertices, and gpu_boundary_v, but this file adds no matching teardown path. Repeated setup/restarts will leak memory.

💡 Proposed fix (module-side finalizer)
+    subroutine s_finalize_model_module()
+        integer :: pid
+
+        if (allocated(gpu_boundary_v)) @:DEALLOCATE(gpu_boundary_v)
+        if (allocated(gpu_trs_n)) @:DEALLOCATE(gpu_trs_n)
+        if (allocated(gpu_trs_v)) @:DEALLOCATE(gpu_trs_v)
+        if (allocated(gpu_total_vertices)) @:DEALLOCATE(gpu_total_vertices)
+        if (allocated(gpu_boundary_edge_count)) @:DEALLOCATE(gpu_boundary_edge_count)
+        if (allocated(gpu_ntrs)) @:DEALLOCATE(gpu_ntrs)
+
+        if (allocated(stl_bounding_boxes)) deallocate(stl_bounding_boxes)
+
+        if (allocated(models)) then
+            do pid = 1, size(models)
+                if (allocated(models(pid)%trs_v)) deallocate(models(pid)%trs_v)
+                if (allocated(models(pid)%trs_n)) deallocate(models(pid)%trs_n)
+                if (allocated(models(pid)%boundary_v)) deallocate(models(pid)%boundary_v)
+                if (allocated(models(pid)%model)) then
+                    if (allocated(models(pid)%model%trs)) deallocate(models(pid)%model%trs)
+                    deallocate(models(pid)%model)
+                end if
+            end do
+            deallocate(models)
+        end if
+    end subroutine s_finalize_model_module

As per coding guidelines: "Always pair @:ALLOCATE calls with matching @:DEALLOCATE calls in the corresponding finalization subroutine."
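The pairing rule reduces to: every buffer created in setup gets a matching, guarded release in a finalizer that is safe to call in any state. A minimal Python sketch of that shape (the class and its fields are hypothetical, echoing the module arrays):

```python
class StlGpuBuffers:
    """Every buffer created in setup() has a matching release in
    finalize(), and finalize() is safe to call whether or not setup()
    ran or ran only partially."""

    def __init__(self):
        self.gpu_ntrs = None
        self.gpu_boundary_v = None

    def setup(self, num_ibs, max_bv):
        self.gpu_ntrs = [0] * num_ibs
        if max_bv > 0:  # conditionally allocated, like gpu_boundary_v
            self.gpu_boundary_v = [[0.0] * max_bv for _ in range(num_ibs)]

    def finalize(self):
        # Release in reverse order of setup; guard each buffer because
        # the conditional allocation above may have been skipped.
        if self.gpu_boundary_v is not None:
            self.gpu_boundary_v = None
        if self.gpu_ntrs is not None:
            self.gpu_ntrs = None

buf = StlGpuBuffers()
buf.setup(num_ibs=4, max_bv=6)
buf.finalize()
buf.finalize()  # idempotent: a repeated teardown is a no-op
```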

Comment on lines +661 to +662
boundary_edge_count = gpu_boundary_edge_count(patch_id)
total_vertices = gpu_total_vertices(patch_id)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Find the routine containing lines 661-662 and show surrounding context
rg -n -B30 'boundary_edge_count = gpu_boundary_edge_count' src/simulation/m_compute_levelset.fpp | head -100

Repository: MFlowCode/MFC

Length of output: 1287


🏁 Script executed:

# Show more context after line 662 to see if total_vertices is used
rg -n -A30 'total_vertices = gpu_total_vertices' src/simulation/m_compute_levelset.fpp

Repository: MFlowCode/MFC

Length of output: 1564


🏁 Script executed:

# Check distance routine signatures to see if they use total_vertices parameter
rg -n -A15 'subroutine[[:space:]]+s_distance_normals' src/simulation src/common

Repository: MFlowCode/MFC

Length of output: 4161


🏁 Script executed:

# Search for all uses of total_vertices in the file
rg -n 'total_vertices' src/simulation/m_compute_levelset.fpp

Repository: MFlowCode/MFC

Length of output: 192


Remove the unused total_vertices load on line 662.

The variable total_vertices is assigned from gpu_total_vertices(patch_id) but is never consumed in this routine. The distance computation routines (s_distance_normals_2D and s_distance_normals_3D) do not accept total_vertices as a parameter, and the variable is not used elsewhere in s_model_levelset. Remove line 662 and the declaration of total_vertices from line 649.

Comment on lines 588 to 617
$:GPU_PARALLEL_LOOP(private='[i,j,k,ii,jj,kk,is_gp,local_idx]', copyin='[count,count_i, x_domain, y_domain, z_domain]', firstprivate='[gp_layers,gp_layers_z]', collapse=3)
do i = 0, m
do j = 0, n
if (p == 0) then
! 2D
if (ib_markers%sf(i, j, 0) /= 0) then
subsection_2D = ib_markers%sf( &
i - gp_layers:i + gp_layers, &
j - gp_layers:j + gp_layers, 0)
if (any(subsection_2D == 0)) then
ghost_points_in(count)%loc = [i, j, 0]
patch_id = ib_markers%sf(i, j, 0)
ghost_points_in(count)%ib_patch_id = &
patch_id

ghost_points_in(count)%slip = patch_ib(patch_id)%slip
! ghost_points(count)%rank = proc_rank
do k = 0, p
if (ib_markers%sf(i, j, k) /= 0) then
is_gp = .false.
marker_search: do ii = i - gp_layers, i + gp_layers
do jj = j - gp_layers, j + gp_layers
do kk = k - gp_layers_z, k + gp_layers_z
if (ib_markers%sf(ii, jj, kk) == 0) then
! if any neighbors are not in the IB, it is a ghost point
is_gp = .true.
exit marker_search
end if
end do
end do
end do marker_search

if (is_gp) then
$:GPU_ATOMIC(atomic='capture')
count = count + 1
local_idx = count
$:END_GPU_ATOMIC_CAPTURE()

ghost_points_in(local_idx)%loc = [i, j, k]
patch_id = ib_markers%sf(i, j, k)
ghost_points_in(local_idx)%ib_patch_id = patch_id

ghost_points_in(local_idx)%slip = patch_ib(patch_id)%slip


⚠️ Potential issue | 🟠 Major

patch_id should be private in this GPU kernel.

patch_id is assigned per-iteration (Lines 613 and 651) but is not listed in the private clause at Line 588, which can cause data races between threads.

💡 Proposed fix
-        $:GPU_PARALLEL_LOOP(private='[i,j,k,ii,jj,kk,is_gp,local_idx]', copyin='[count,count_i, x_domain, y_domain, z_domain]', firstprivate='[gp_layers,gp_layers_z]', collapse=3)
+        $:GPU_PARALLEL_LOOP(private='[i,j,k,ii,jj,kk,is_gp,local_idx,patch_id]', copyin='[count,count_i, x_domain, y_domain, z_domain]', firstprivate='[gp_layers,gp_layers_z]', collapse=3)

As per coding guidelines: "Declare private(...) on loop-local variables in GPU kernels to avoid unintended data sharing between threads."

Comment on lines +688 to 689
self.prohibit(ib and (num_ibs <= 0 or num_ibs > 1000),
"num_ibs must be between 1 and num_patches_max (10)")

⚠️ Potential issue | 🟡 Minor

Fix stale validation message for num_ibs upper bound.

The check allows up to 1000, but the error text still says 10, which will mislead users.

Suggested change
-        self.prohibit(ib and (num_ibs <= 0 or num_ibs > 1000),
-                     "num_ibs must be between 1 and num_patches_max (10)")
+        self.prohibit(ib and (num_ibs <= 0 or num_ibs > 1000),
+                     "num_ibs must be between 1 and 1000")
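A simple way to prevent this class of drift is to derive the message from the same constant as the check. A sketch of that idea (the constant and function are hypothetical, not the toolchain's `prohibit()` helper):

```python
NUM_IBS_MAX = 1000  # hypothetical mirror of the new upper bound

def check_num_ibs(ib, num_ibs):
    """Return an error string if IBs are enabled with an out-of-range
    count, else None. The message interpolates NUM_IBS_MAX, so the text
    can never disagree with the bound being enforced."""
    if ib and not (1 <= num_ibs <= NUM_IBS_MAX):
        return f"num_ibs must be between 1 and {NUM_IBS_MAX}"
    return None
```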

Comment on lines +1043 to +1044
"3D_IGR_33jet","1D_multispecies_diffusion",
"2D_ibm_stl_MFCCharacter"]

⚠️ Potential issue | 🟡 Minor

Add a reason for the new skip entry.

Line 1044 adds 2D_ibm_stl_MFCCharacter to casesToSkip, which drops coverage on the same subsystem this PR optimizes. Please add a short reason and tracking issue so this skip is auditable and temporary.

@danieljvickers
Member Author

This branch now contains the contents for a generalized periodic IB implementation. We can close the implementation from @conraddelgado and move forward with this one. Conrad, you can close your branch and just work off of this one if you like.

@github-actions

Claude Code Review

Head SHA: 6c3a125ca28e1023bb2bc72a9cdd8f3c53bc413b

Files changed: 24 files (+1639 / -1609)

Key files:

  • src/common/m_model.fpp (+475 / -488)
  • src/simulation/m_ib_patches.fpp (+543 / -304)
  • src/simulation/m_ibm.fpp (+238 / -275)
  • src/simulation/m_compute_levelset.fpp (+40 / -86)
  • src/simulation/m_mpi_proxy.fpp (+5 / -187)
  • src/common/include/parallel_macros.fpp (+10 / -0)
  • src/common/m_constants.fpp (+1 / -5) — num_patches_max 10 → 1000

Summary

  • GPU-offloads STL IB marker generation and levelset compute by packing triangle/edge data into flat GPU-friendly arrays (gpu_trs_v, gpu_trs_n, gpu_boundary_v).
  • Replaces interpolation-based distance search with an exact projection-based algorithm (s_distance_normals_3D, s_distance_normals_2D).
  • Replaces random ray-casting inside-model test with 26 fixed-direction rays, enabling GPU execution (f_model_is_inside_flat).
  • Adds bounding-box index pruning for IB patch loops (binary search via get_bounding_indices), reducing per-timestep work dramatically.
  • Removes MPI IB-marker halo exchange, replacing it with each rank independently filling ghost cells from globally known patch positions. Adds periodic image patch support.
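The bounding-box pruning in the summary reduces a per-timestep full-grid sweep to a window of cells around each patch. A hedged Python sketch of the idea, using binary search on sorted cell centers (the Fortran helper's exact interface may differ):

```python
from bisect import bisect_left, bisect_right

def get_bounding_indices(cell_centers, lo, hi):
    """Conservative index window (il, ir) of sorted cell centers covering
    the physical interval [lo, hi]: one cell past each bound, clamped to
    the array, so no candidate cell is ever pruned incorrectly."""
    il = max(bisect_left(cell_centers, lo) - 1, 0)
    ir = min(bisect_right(cell_centers, hi), len(cell_centers) - 1)
    return il, ir
```

IB patch loops then iterate `il:ir` per axis instead of the full extent, which is where the per-timestep savings come from.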

Findings

BUG (Critical): stl_bounding_boxes re-allocated inside a loop

File: src/common/m_model.fpp, inside s_instantiate_STL_models

do patch_id = 1, num_ibs
    if (patch_ib(patch_id)%geometry == 5 .or. patch_ib(patch_id)%geometry == 12) then
        ...
        allocate (stl_bounding_boxes(patch_id, 1:3, 1:3))  ! BUG: re-allocates each iteration
        stl_bounding_boxes(patch_id, ...) = ...
    end if
end do

Each loop iteration allocates stl_bounding_boxes with first dimension = current patch_id. When a second STL patch is encountered, stl_bounding_boxes is already allocated — a Fortran runtime error (abort). The dimension patch_id also means only patches 1..patch_id could be indexed. This needs a single allocation before the loop:

allocate (stl_bounding_boxes(1:num_ibs, 1:3, 1:3))
do patch_id = 1, num_ibs
    if (...) then
        stl_bounding_boxes(patch_id, ...) = ...
    end if
end do

This will silently work for the single-STL case but crash for multiple STL patches — exactly the multi-particle MIBM use case this PR targets.


BUG (Precision/Correctness): Division by zero in s_distance_normals_2D

File: src/common/m_model.fpp, subroutine s_distance_normals_2D

When t < 0 (closest point is vertex v1) or t > 1 (closest point is vertex v2):

dist = sqrt((point(1) - v1(1))**2 + (point(2) - v1(2))**2)
norm(1) = v1(1) - point(1)
norm(2) = v1(2) - point(2)
norm = norm/dist   ! undefined if dist == 0

If the cell center exactly coincides with a boundary vertex, dist = 0 causes a NaN/Inf normal. Needs a guard: if (dist > 0._wp) norm = norm/dist.


CONCERN: GPU_PARALLEL_LOOP called with unrecognized parallelism='[seq]' kwarg

File: src/simulation/m_ibm.fpp, s_fill_ghost_points and the interp-coefficient loop

$:GPU_PARALLEL_LOOP(private='[q,i,...]', parallelism='[seq]')
do q = 1, num_gps

parallelism is a parameter for GPU_ROUTINE, not GPU_PARALLEL_LOOP. Depending on how the Fypp macro handles unknown kwargs, this either (a) silently ignores it and parallelizes correctly, or (b) causes a Fypp error at build time. This should be verified — a GPU_PARALLEL_LOOP with parallelism='[seq]' is contradictory.


CONCERN: Named-loop exit inside GPU_PARALLEL_LOOP

File: src/simulation/m_ibm.fpp, ghost-point detection loop

$:GPU_PARALLEL_LOOP(private='[i,j,k,ii,jj,kk,is_gp]', ..., collapse=3)
do i = 0, m
  do j = 0, n
    do k = 0, p
      marker_search: do ii = ...
        ...
        exit marker_search   ! named exit from inner nested loop
      end do marker_search

Named EXIT from inner loops is supported by NVHPC/OpenACC, but behavior under OpenMP target offload (Cray ftn, AMD flang) may differ. The CI gate includes Cray ftn — confirm this compiles and produces correct results with --gpu mp.


CONCERN: Removal of MPI halo exchange for IB markers

File: src/simulation/m_mpi_proxy.fpp (-187 lines), src/simulation/m_ibm.fpp

s_populate_ib_buffers / s_mpi_sendrecv_ib_buffers is deleted entirely. IB patch routines now iterate over ghost-cell-extended index ranges (e.g., -gp_layers-1:m+gp_layers+1) using globally known centroid positions, so each rank independently fills its own ghost region without MPI communication.

This is conceptually correct for geometric IB patches (all ranks know all centroids via the MPI broadcast in m_mpi_proxy.fpp). Please confirm this is intentional and has been validated in multi-rank (≥2) configurations — especially for the moving_ibm=2 case where centroids change each timestep. The new centroid data (patch_ib%x_centroid) needs to be re-broadcast to all ranks after every update.


MINOR: Missing subroutine name on end subroutine

File: src/common/m_model.fpp, s_pack_model_for_gpu

    end subroutine   ! should be: end subroutine s_pack_model_for_gpu

Style convention; not a correctness issue.


MINOR: Leftover comment in production loop

File: src/simulation/m_ib_patches.fpp, s_ib_3d_model

do k = kl, kr
    ! do i = -gp_layers, m+gp_layers    ← dead comment

Notes

  • The new projection-based 3D distance computation in s_distance_normals_3D is algorithmically sound (barycentric + edge/vertex clamping). The early-exit if (abs(d) >= dist_min) cycle is a valid optimization since n is normalized.
  • Triangle normal normalization on read (binary and ASCII STL paths) is a welcome correctness fix.
  • The encode_patch_periodicity / decode_patch_periodicity base-3 encoding with num_ibs+1=1001 gives maximum encoded ID of ~27000 — no integer overflow at 1000 IBs.
  • Bounding-index binary search in get_bounding_indices has correct semantics (conservative: returns one cell past the physical bound).
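The overflow claim in the encoding note checks out arithmetically. A plausible base-1001 scheme with one base-3 shift digit per dimension reproduces the stated maximum (the function bodies below are an assumption consistent with the described bound, not the actual Fortran implementation):

```python
def encode_patch_periodicity(patch_id, sx, sy, sz, num_ibs=1000):
    """Pack a patch id plus per-axis periodic shifts (each in {-1, 0, +1})
    into one integer: patch_id in the low base-(num_ibs+1) digit, the
    three shifts as a base-3 number above it."""
    base = num_ibs + 1  # 1001
    digits = (sx + 1) + 3 * (sy + 1) + 9 * (sz + 1)  # 0..26
    return patch_id + base * digits

def decode_patch_periodicity(encoded, num_ibs=1000):
    base = num_ibs + 1
    patch_id = encoded % base
    digits = encoded // base
    return patch_id, digits % 3 - 1, (digits // 3) % 3 - 1, digits // 9 - 1
```

The largest encodable value is 1000 + 1001 * 26 = 27026, comfortably within default integer range, matching the "~27000, no overflow at 1000 IBs" observation.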

🤖 Generated with Claude Code


Labels

size:XXL This PR changes 1000+ lines, ignoring generated files


2 participants