MIBM Performance Optimizations#1157

Open
danieljvickers wants to merge 66 commits into MFlowCode:master from danieljvickers:gpu-optimizations

Conversation

@danieljvickers
Member

@danieljvickers danieljvickers commented Feb 17, 2026

User description

Description

Following the refactor of the levelset, there were several performance optimizations left to be made to the code. This PR introduces optimizations that will make multi-particle MIBM code viable. It also expands the upper bound of allowed number of immersed boundaries to 1000. Performance was measured on 1-4 ranks of ACC GPU compute using A100 GPUs.

This PR has extended optimization to STL IBs, which should significantly improve accuracy, performance, and code cleanliness. The primary optimizations are as follows:

  • IB compute is now fully supported on the GPU. Processing of the mesh into memory has been parallelized, as have IB marker generation and levelset compute.
  • Interpolation of IB markers was removed in favor of a projection-based distance search. This means we no longer need to interpolate (saving significant pre-processing time and memory), we iterate over significantly fewer vertices (by a factor of 50 for the coarsest grids), and we get exact values for the levelset instead of approximations limited by how fine the interpolated mesh was.
  • Unified distance and normal compute. We no longer perform a distance search followed by a separate normal vector search. Both are handled in a single subroutine, cutting search time in half.
  • Deleted about 1000 lines of code related to the maintenance and computation of STL models, leaving a cleaner base to build on.
  • Modified STL generation to be cell-center based instead of volume-fraction based. This prevents edge cases where the cell center is outside the model but the point is labeled as inside the model. That error produces levelsets that point into the immersed boundaries and causes instabilities.
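A projection-based exact distance query of the kind described above is typically implemented as a closest-point-on-triangle test. The following is a minimal Python sketch of that standard algorithm (illustrative only; MFC's actual routine is the Fortran s_distance_normals_3D, and the function names here are hypothetical):

```python
import numpy as np

def closest_point_on_triangle(p, a, b, c):
    """Exact closest point on triangle (a, b, c) to point p.

    Classifies p into a vertex, edge, or face Voronoi region and
    projects accordingly -- no interpolated surface mesh required.
    """
    ab, ac, ap = b - a, c - a, p - a
    d1, d2 = ab @ ap, ac @ ap
    if d1 <= 0 and d2 <= 0:
        return a                                   # vertex region a
    bp = p - b
    d3, d4 = ab @ bp, ac @ bp
    if d3 >= 0 and d4 <= d3:
        return b                                   # vertex region b
    vc = d1 * d4 - d3 * d2
    if vc <= 0 and d1 >= 0 and d3 <= 0:
        return a + (d1 / (d1 - d3)) * ab           # edge ab
    cp = p - c
    d5, d6 = ab @ cp, ac @ cp
    if d6 >= 0 and d5 <= d6:
        return c                                   # vertex region c
    vb = d5 * d2 - d1 * d6
    if vb <= 0 and d2 >= 0 and d6 <= 0:
        return a + (d2 / (d2 - d6)) * ac           # edge ac
    va = d3 * d6 - d5 * d4
    if va <= 0 and (d4 - d3) >= 0 and (d5 - d6) >= 0:
        w = (d4 - d3) / ((d4 - d3) + (d5 - d6))
        return b + w * (c - b)                     # edge bc
    denom = va + vb + vc                           # interior: project onto face
    v, w = vb / denom, vc / denom
    return a + v * ab + w * ac

def distance_to_triangle(p, a, b, c):
    """Exact (unsigned) distance from p to the triangle."""
    return np.linalg.norm(p - closest_point_on_triangle(p, a, b, c))
```

Taking the minimum of this query over all mesh triangles gives the exact levelset magnitude, which is why the PR can drop the interpolated vertices entirely.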

Type of change

  • New feature
  • Refactor
  • Other: Performance Tuning

Testing

All changes pass the IBM section of the test suite on GPUs with the NVHPC compiler. Performance was measured with a case of 1000 particles with viscosity enabled. The particles are all resolved 3D spheres given random non-overlapping positions generated by the following case file:

import json
import argparse

num_cells = [240, 240, 240]
dim = [8., 8., 8.]
num_particles = 1000

# create a stationary fluid
case = {
    "run_time_info": "T",
    "parallel_io": "T",
    "m": num_cells[0]-1,
    "n": num_cells[1]-1,
    "p": num_cells[2]-1,
    "dt": 0.005,
    "t_step_start": 0,
    "t_step_stop": 2,
    "t_step_save": 1,
    "num_patches": 1,
    "model_eqns": 2,
    "alt_soundspeed": "F",
    "num_fluids": 1,
    "mpp_lim": "F",
    "mixture_err": "T",
    "time_stepper": 3,
    "recon_type": 1,
    "weno_eps": 1e-16,
    "riemann_solver": 2,
    "wave_speeds": 1,
    "avg_state": 2,
    "precision": 2,
    "format": 1,
    "prim_vars_wrt": "T",
    "E_wrt": "T",
    "viscous": "T",
    "x_domain%beg": -0.5*dim[0],
    "x_domain%end": 0.5*dim[0],
    "y_domain%beg": -0.5*dim[1],
    "y_domain%end": 0.5*dim[1],
    "z_domain%beg": -0.5*dim[2],
    "z_domain%end": 0.5*dim[2],
    "bc_x%beg": -3,
    "bc_x%end": -3,
    "bc_y%beg": -3,
    "bc_y%end": -3,
    "bc_z%beg": -3,
    "bc_z%end": -3,
    "patch_icpp(1)%geometry": 9,
    "patch_icpp(1)%z_centroid": 0.0,
    "patch_icpp(1)%length_z": dim[2],
    "patch_icpp(1)%y_centroid": 0.0,
    "patch_icpp(1)%length_y": dim[1],
    "patch_icpp(1)%x_centroid": 0.0,
    "patch_icpp(1)%length_x": dim[0],
    "weno_order": 5,
    "patch_icpp(1)%pres": 1.0,
    "patch_icpp(1)%alpha_rho(1)": 1.0,
    "patch_icpp(1)%alpha(1)": 1.0,
    "patch_icpp(1)%vel(1)": 0.0,
    "patch_icpp(1)%vel(2)": 0.0,
    "patch_icpp(1)%vel(3)": 0.0,
    "fluid_pp(1)%gamma": 2.5000000000000004,
    "fluid_pp(1)%pi_inf": 0.0,
    "fluid_pp(1)%Re(1)": 2500000,
}

import random
random.seed(42)

dx = [float(dim[i]) / float(num_cells[i]) for i in range(3)]

#set particle properties
particle_radius_cells = 5 # particle radius in grid cells
particle_cell_spacing = particle_radius_cells*2 + 5 # set the spacing to be double the radius plus 5 to guarantee no image points in other IBs
mpi_cell_spacing = particle_radius_cells + 5 # space safely away from MPI halo regions to prevent out of bounds errors
genreation_bounds = [6., 6., 6.] # generate particles in this box safely away from the boundary
velocity_magnitude = 1.

# convert non-dimensional grid cell units to the units of the simulation
radius_units = float(particle_radius_cells) * dx[0]
paticle_units_spacing = float(particle_cell_spacing) * dx[0]
paticle_units_spacing_squared = paticle_units_spacing**2
mpi_units_spacing = [float(mpi_cell_spacing) * dx[i] for i in range(3)]

# generate an array of xyz values that guarantees non-overlapping grid cells
particles = []
while len(particles) < num_particles:
  # generate a completely random position
  position = [random.random() for i in range(3)]
  position = [(position[i] - 0.5) * genreation_bounds[i] for i in range(3)]

  # first check if the particle is too close to the MPI halo regions
  valid = True
  # for i in range(3):
  #   valid = valid and abs(position[i]) >= mpi_units_spacing[i]

  # check for the minimum spacing between all particles, exiting once we find an error
  for particle in particles:
    distance_squared = sum([(particle[i] - position[i])**2 for i in range(3)])
    valid = valid and distance_squared >= paticle_units_spacing_squared
    if not valid:
      break

  if valid:
    particles.append(position)
    # print(f"\rProgress: {100.*float(len(particles))/float(num_particles)}%", end="", flush=True)

# print()
# convert our array of positions to valid JSON for the immersed boundary
ib_properties = {"ib": "T", "num_ibs": num_particles,}
for i in range(len(particles)):
  ib_properties[f'patch_ib({i+1})%radius'] = radius_units
  ib_properties[f'patch_ib({i+1})%slip'] = 'F'
  ib_properties[f'patch_ib({i+1})%geometry'] = 8
  ib_properties[f'patch_ib({i+1})%moving_ibm'] = 2
  ib_properties[f'patch_ib({i+1})%mass'] = 1.
  ib_properties[f'patch_ib({i+1})%x_centroid'] = particles[i][0]
  ib_properties[f'patch_ib({i+1})%y_centroid'] = particles[i][1]
  ib_properties[f'patch_ib({i+1})%z_centroid'] = particles[i][2]
  # move the particle away radially to guarantee they never touch during the simulation
  position_mag = (sum([(particles[i][j])**2 for j in range(3)]))**0.5
  for j in range(3):
    ib_properties[f'patch_ib({i+1})%vel({j+1})'] = particles[i][j] * velocity_magnitude / position_mag

print(json.dumps({**case, **ib_properties}))

These optimizations yield nearly a 1000x speedup in the moving IBM propagation and generation code. Prior to these optimizations, this was the result of the benchmark case using the NVIDIA NSight profiler, showing 45 seconds to run a single RK substep:

[NSight profiler screenshot, Feb 16, 2026: pre-optimization timeline]

Following these optimizations, the same profile achieves almost 50 ms per RK substep:
[NSight profiler screenshot, Feb 17, 2026: post-optimization timeline]

For STLs, the optimizations were tested on an 822,000-vertex mesh of a Mach 0.4 corgi, given by this STL:
https://www.thingiverse.com/thing:4721563/files

The final simulation finished in a total of 25 minutes on a 200^3 grid for 4k time steps on a single A100 GPU. All of the code related to the STL model (file reading, preprocessing, IB marker generation, and levelset compute) took only 20 seconds of the run time. The result of that simulation can be viewed here:
https://www.youtube.com/watch?v=h44BNCKo0Hs

Checklist

  • I updated documentation if user-facing behavior changed

See the developer guide for full coding standards.

GPU changes (expand if you modified src/simulation/)
  • GPU results match CPU results
  • Tested on NVIDIA GPU or AMD GPU

CodeAnt-AI Description

GPU-accelerate STL immersed-boundary compute and support up to 1000 IBs

What Changed

  • STL models are read, transformed, packed into GPU-friendly flat arrays and uploaded so immersed-boundary (IB) marker and levelset checks run on the device instead of on CPU-side model structures
  • Levelset/inside-model queries now use GPU-ready routines (including a GPU-safe random-number step and a flattened triangle intersection path) so per-cell STL inside/distance tests execute on the GPU
  • IB patch routines and levelset application limit loops to grid regions that overlap each patch (binary-searched bounding indices), reducing the number of cells inspected for every patch
  • API and bookkeeping now handle many more IBs: the global patch limit raised to 1000 and model metadata (counts, bounding boxes) are kept in GPU-updatable arrays
  • 2D/3D model distance and normal computation changed to projection-based nearest-point checks (exact triangle/edge projections) and triangle normals are normalized when reading STLs
  • Minor runtime observability: NVTX profiling ranges added around immersed-boundary propagation
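The binary-searched bounding indices mentioned above can be sketched with Python's bisect module over the sorted cell-center coordinates (hypothetical names; MFC's Fortran implementation differs):

```python
from bisect import bisect_left, bisect_right

def patch_index_bounds(cell_centers, lo, hi):
    """Return the first/last cell indices whose centers fall in [lo, hi].

    cell_centers must be sorted ascending; returns (beg, end) inclusive,
    with beg > end when no cell center lies inside the patch bounds.
    Per-patch loops can then run over [beg, end] instead of every cell.
    """
    beg = bisect_left(cell_centers, lo)
    end = bisect_right(cell_centers, hi) - 1
    return beg, end

# 10 uniform cells on [0, 10): centers 0.5, 1.5, ..., 9.5
x_cc = [i + 0.5 for i in range(10)]
# A patch spanning x in [2.0, 5.0] only touches cells 2..4,
# so the patch loop inspects 3 cells instead of 10.
```

For a small spherical IB on a large grid, this narrowing is where most of the per-patch savings come from.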

Impact

✅ Faster IB marker generation
✅ Lower CPU during IB setup and levelset evaluation
✅ Support up to 1000 immersed boundaries


@danieljvickers danieljvickers marked this pull request as ready for review February 19, 2026 18:39
@sbryngelson
Member

Claude Code Review

Head SHA: 7abaffc747841846b0e65447ef86f64bd8a70503
Files changed: 21


Summary

The PR makes significant and valuable performance improvements by moving IBM computations to the GPU with proper Fypp macro usage. The new projection-based distance computation is mathematically cleaner than the interpolation approach. GPU macro usage is correct throughout — all GPU parallelism uses GPU_PARALLEL_LOOP/END_GPU_PARALLEL_LOOP, GPU_ATOMIC, GPU_ROUTINE, GPU_DECLARE, GPU_UPDATE. No raw !$acc or !$omp directives found. Precision discipline is respected.

However, there are several issues that should be addressed before merging.


Critical Issues

1. stl_bounding_boxes allocated inside a loop — will crash with multiple STL patches

src/common/m_model.fpp, in s_instantiate_STL_models:

stl_bounding_boxes is allocated inside the do patch_id = 1, num_ibs loop with allocate(stl_bounding_boxes(patch_id, 1:3, 1:3)). On the first STL patch, this allocates a (1, 3, 3) array. When a second STL patch is encountered, the array is already allocated, causing a Fortran runtime error.

Fix: Allocate stl_bounding_boxes(1:num_ibs, 1:3, 1:3) once before the loop.

2. num_patches_max raised from 10 to 1000 — massive unintended memory inflation

src/common/m_constants.fpp, line 26.

num_patches_max controls both IB patches and initial condition patches. It sizes patch_icpp(num_patches_max) and patch_ib(num_patches_max) in both pre_process and simulation. Each ic_patch_parameters contains alter_patch(0:num_patches_max-1) (now 1000 booleans), plus many reals and strings. With 1000 entries, patch_icpp alone balloons from ~40KB to ~4MB of static data per MPI rank — even for cases with zero IBs.

Fix: Introduce a separate num_ibs_max = 1000 constant for IB patches, leaving num_patches_max at 10.
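The quadratic scaling is easy to check with back-of-the-envelope arithmetic: each patch_icpp entry carries an alter_patch array sized by the same constant. The sketch below assumes 4-byte logicals (actual sizes are compiler-dependent):

```python
def alter_patch_bytes(num_patches_max, logical_bytes=4):
    """Static bytes consumed by alter_patch fields alone: num_patches_max
    entries, each holding an alter_patch(0:num_patches_max-1) array."""
    return num_patches_max * num_patches_max * logical_bytes

old = alter_patch_bytes(10)      # 400 bytes at the old limit
new = alter_patch_bytes(1000)    # 4,000,000 bytes (~4 MB) per MPI rank
```

This is why a dedicated num_ibs_max avoids inflating initial-condition patch storage that never needs 1000 entries.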

3. Division by zero in s_distance_normals_2D when point coincides with vertex

src/common/m_model.fpp, in s_distance_normals_2D, the branches for t < 0 and t > 1:

dist = sqrt((point(1) - v1(1))**2 + (point(2) - v1(2))**2)
norm = norm/dist   ! Division by zero if dist == 0

The 3D counterpart (s_distance_normals_3D) correctly guards with if (norm_mag > 0._wp), but the 2D version does not.

Fix: Add if (dist > 0._wp) guard before normalizing.
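The suggested guard mirrors the 3D routine's norm_mag check. A minimal Python sketch of the guarded 2D normalization (the zero-distance fallback vector here is a hypothetical choice, not taken from the PR):

```python
import math

def edge_normal_2d(point, v1):
    """Direction from vertex v1 toward point, normalized, with a guard
    against the degenerate case where point coincides with v1."""
    nx, ny = point[0] - v1[0], point[1] - v1[1]
    dist = math.hypot(nx, ny)
    if dist > 0.0:
        return (nx / dist, ny / dist), dist
    # Degenerate: the point sits exactly on the vertex, so no direction
    # is defined; return an arbitrary unit vector (hypothetical fallback).
    return (1.0, 0.0), 0.0
```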

4. Missing @:DEALLOCATE for GPU arrays allocated in s_instantiate_STL_models

src/common/m_model.fpp allocates these module-level GPU arrays with @:ALLOCATE:

  • gpu_ntrs, gpu_trs_v, gpu_trs_n, gpu_boundary_edge_count, gpu_total_vertices, gpu_boundary_v

None are deallocated anywhere, violating the @:ALLOCATE/@:DEALLOCATE pairing rule. Additionally, stl_bounding_boxes and models(:) are allocated with plain allocate but never deallocated.

Fix: Add a s_finalize_model_module routine that @:DEALLOCATEs all GPU arrays, called from s_finalize_ibm_module.


Important Issues

5. error stop used instead of call s_mpi_abort()

src/simulation/m_ibm.fpp, ~line 504:

error stop "Ghost Point and Image Point on Different Processors"

Per project rules, use call s_mpi_abort() instead.

6. Stale error message in case_validator.py

toolchain/mfc/case_validator.py, line 687:

self.prohibit(ib and (num_ibs <= 0 or num_ibs > 1000),
             "num_ibs must be between 1 and num_patches_max (10)")

Limit changed to 1000 but message still says "(10)".

7. Unbounded do while on GPU in s_compute_image_points

src/simulation/m_ibm.fpp, in s_compute_image_points:

The bounds check that prevents index from going out of range is #ifdef-disabled on GPU builds. If an image point falls outside the domain, this creates an infinite loop on the GPU or out-of-bounds memory access.

Fix: Add a bounded iteration limit and exit if exceeded.
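The bounded-iteration pattern the fix suggests can be sketched as follows (Python illustration, not the Fortran code; a point can never need more steps than there are cells, so that count makes a safe cap):

```python
def find_containing_cell(s_cc, start_index, direction, loc, max_iter=None):
    """Walk cell faces s_cc until loc lies in [s_cc[i], s_cc[i+1]].

    A hard iteration cap turns a would-be infinite loop (image point
    outside the local domain) into a detectable failure instead of an
    unbounded GPU loop or out-of-bounds access.
    """
    if max_iter is None:
        max_iter = len(s_cc)
    i, iters = start_index, 0
    while not (s_cc[i] <= loc <= s_cc[i + 1]):
        i += direction
        iters += 1
        if i < 0 or i + 1 >= len(s_cc) or iters > max_iter:
            raise RuntimeError("image point outside local domain")
    return i
```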


Minor Notes

  • Commented-out GPU_ROUTINE directives in f_model_random_number and f_model_is_inside (src/common/m_model.fpp, lines 511, 531) — clarify whether these are intentionally CPU-only.
  • s_find_ghost_points ghost point ordering is nondeterministic on GPU due to atomic capture — acceptable if downstream code is order-insensitive.
  • s_check_boundary uses copy clause on large local arrays — may be slow for large 2D models but only runs at initialization.

Verdict

The GPU offloading approach is correct and the Fypp macro usage follows project conventions. The four critical issues (loop allocation crash, memory inflation, division by zero, missing deallocations) should be fixed before merging. The important issues are lower priority but worth addressing.

@danieljvickers
Member Author

danieljvickers commented Feb 27, 2026

@sbryngelson The Frontier CPU test failed due to a compile-time segfault. This happened two days ago, so I manually got onto Frontier, did an ./mfc.sh clean, and rebuilt. It built fine and passed the test suite when I ran it manually. I just submitted it again assuming it was a spurious failure, but that appears not to be true. The strangeness is compounded by the fact that the OpenMP and OpenACC tests compiled fine on Frontier. It took me about 30 hours to get that runner, and I would like to avoid the game of blindly committing changes hoping one fixes the compiler segfault. Before I go down that rabbit hole, do you think this could in any way be related to the caching changes you made on Frontier? And if so, do you think we can test building this one from scratch to see if it passes? I will work on this in the background until I hear either way.

@github-actions

Claude Code Review

Head SHA: 370aa4c
Files changed: 21

Changed files:

  • src/common/include/parallel_macros.fpp
  • src/common/m_constants.fpp
  • src/common/m_derived_types.fpp
  • src/common/m_helper.fpp
  • src/common/m_model.fpp (largest change: +475/-488)
  • src/pre_process/m_icpp_patches.fpp
  • src/simulation/m_compute_levelset.fpp
  • src/simulation/m_ib_patches.fpp
  • src/simulation/m_ibm.fpp
  • src/simulation/m_mpi_proxy.fpp (+5/-187)
  • src/simulation/m_time_steppers.fpp
  • tests/ (3 test UUIDs, golden files regenerated on GPU/nvfortran)
  • toolchain/ (param definitions, validator, test cases)

Summary:

  • Removes STL interpolation in favour of exact projection-based distance / inside queries (both 2D and 3D), eliminating ~1,000 lines of interpolation code
  • Packs GPU-friendly flat arrays (gpu_trs_v, gpu_trs_n, gpu_boundary_v) and uploads to device so levelset/inside-model compute runs fully on-GPU
  • Introduces bounding-index narrowing for every IB patch type, drastically reducing the per-patch cell loop
  • Removes s_mpi_sendrecv_ib_buffers / s_populate_ib_buffers and instead extends IB-marker loops into ghost-cell ranges
  • Raises num_patches_max from 10 → 1000

Findings

🔴 Bug — stl_bounding_boxes re-allocated inside loop

stl_bounding_boxes is a module-level allocatable. The first STL patch allocates it with shape (1,3,3); the second STL patch hits an already-allocated variable and causes a runtime error. Fix: allocate once before the loop:

allocate(stl_bounding_boxes(1:num_ibs, 1:3, 1:3))

Then remove the per-patch allocate inside the loop.


🔴 Correctness — MPI halo exchange for IB markers removed

s_mpi_sendrecv_ib_buffers and s_populate_ib_buffers were deleted. The IB-marker loops now extend into ghost-cell index ranges locally, but this only helps if the IB geometry overlaps the local subdomain's ghost region. When two ranks share an IB near their subdomain boundary, the ghost cells of rank A are filled by rank A's local marker kernel (correct), but rank A's interior ghost cells that should reflect rank B's IB markers are never populated via MPI. The result is that s_find_num_ghost_points and s_find_ghost_points see stale zeros in neighbour ghost cells, causing ghost points near MPI boundaries to be silently missed.

This is difficult to hit in single-rank GPU benchmarks (which is how the PR was validated), but it is a regression for any multi-rank MIBM run where an IB boundary crosses a subdomain edge.
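The failure mode can be illustrated with a toy 1D two-rank marker field (a pure-Python stand-in with array copies in place of MPI sends; no relation to MFC's actual buffers):

```python
import numpy as np

ng = 2                      # ghost-layer width
# Two "ranks", each owning 8 interior cells plus ng ghost cells per side.
rank_a = np.zeros(8 + 2 * ng, dtype=int)
rank_b = np.zeros(8 + 2 * ng, dtype=int)

# An IB straddles the A|B subdomain boundary: its markers fill A's last
# interior cells and B's first interior cells.
rank_a[ng + 6:ng + 8] = 1   # A interior, adjacent to the shared face
rank_b[ng:ng + 2] = 1       # B interior, adjacent to the shared face

# Without a halo exchange, A's right ghost cells stay 0 even though the
# neighbouring interior of B holds markers there -- so A's ghost-point
# search sees stale zeros and silently misses points.
assert rank_a[-ng:].sum() == 0

# A lightweight exchange copies the neighbour's boundary interior cells:
rank_a[-ng:] = rank_b[ng:2 * ng]
rank_b[:ng] = rank_a[-2 * ng:-ng]
```

After the exchange, both ranks see the full marker footprint near the shared face, which is the property the deleted s_mpi_sendrecv_ib_buffers provided.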


🟡 Portability — Named DO exits inside (, and )

Named DO constructs with exit label inside GPU parallel loops are not guaranteed to compile/work correctly on Cray CCE or AMD flang backends. In the Cray OpenMP path (--gpu mp), exit from a named construct inside a target teams distribute region is undefined behaviour. Consider replacing with a scalar flag (is_gp) and using cycle on the outer loops, or restructuring to avoid early exit.


🟡 GPU atomic capture correctness (, )

The PR adds END_GPU_ATOMIC_CAPTURE() to parallel_macros.fpp but uses the existing GPU_ATOMIC(atomic='capture') macro as the opening:

If the existing GPU_ATOMIC macro only emits ! atomic update (ignoring the capture argument), then the closing ! end atomic from the new macro will produce a dangling directive and likely a compile error with nvfortran/CCE. Please verify that GPU_ATOMIC(atomic='capture') does indeed emit ! atomic capture / ! atomic capture rather than update.
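What atomic capture must provide, unlike a plain atomic update, is the pre-increment value as a unique slot index for each thread. A Python emulation with a lock (illustrative only; on the GPU this is a single hardware fetch-and-add):

```python
import threading

class AtomicCounter:
    """Emulates `atomic capture`: increment and return the pre-increment
    value in one indivisible step, as in ghost-point counting."""
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def fetch_add(self):
        with self._lock:
            old = self._value      # captured value becomes this thread's slot
            self._value += 1
            return old

counter = AtomicCounter()
slots = []
slots_lock = threading.Lock()

def worker():
    s = counter.fetch_add()
    with slots_lock:
        slots.append(s)

threads = [threading.Thread(target=worker) for _ in range(64)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Every thread received a distinct slot, but the ordering in `slots` is
# nondeterministic -- which is why downstream code must be order-insensitive.
```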


🟡 Ray direction bug (~line 224–226)

This sets the raw (unnormalised) ray direction component to point(k) + random - 0.5. The point coordinates are added to what should be a zero-centred direction, biasing all rays toward the origin. The original code used random - 0.5 without adding point(k). The normalisation on the next line partially recovers, but introduces a systematic directional bias proportional to the cell location. Consider removing point(k) +.
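The bias is easy to demonstrate: once the query point is more than half a unit from the origin, adding its coordinates pushes every ray into the same octant, while the zero-centred form samples all directions (Python illustration of the two formulas quoted above):

```python
import random

random.seed(0)
point = (5.0, 5.0, 5.0)   # a cell centre away from the origin

biased, unbiased = [], []
for _ in range(100):
    r = [random.random() for _ in range(3)]
    # buggy form: point(k) + rand - 0.5  ->  every component near +5
    biased.append(tuple(point[k] + r[k] - 0.5 for k in range(3)))
    # intended form: zero-centred direction in [-0.5, 0.5)^3
    unbiased.append(tuple(r[k] - 0.5 for k in range(3)))

def octant(d):
    return tuple(c > 0 for c in d)

# All biased rays point into the (+,+,+) octant; the unbiased rays cover
# many octants, as a robust ray-casting inside/outside test requires.
assert {octant(d) for d in biased} == {(True, True, True)}
assert len({octant(d) for d in unbiased}) > 4
```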


🟡 GPU-callable function references module-level globals without GPU_DECLARE or arguments

The new GPU-callable function uses the module-level variable p (z-grid size) to decide 2D vs 3D behaviour, and the loop uses gpu_trs_v/gpu_trs_n as module-level GPU-declared arrays. For Cray OpenMP target offload, module-level variables used inside GPU_ROUTINE functions must be explicitly listed in a declare target clause (or the function must receive them as arguments). p and the gpu_* arrays are not in GPU_DECLARE here — verify this is handled correctly for --gpu mp builds.


ℹ️ Minor — Unused variables

Variables point and local_point are declared but never referenced in the new 2D s_ib_model body. They should be removed to avoid compiler warnings.


ℹ️ Minor — Golden files regenerated only on GPU/nvfortran

The three updated golden files (0A362971, 4E0FBE72, EA8FA07E) were regenerated on an NVIDIA GPU with nvfortran 23.11. Per the project checklist, golden files should normally also be verified against the gfortran CPU path. The metadata confirms no CPU regeneration was done.


Overall

The performance approach is sound and the ~1,000× speedup claim is plausible given the algorithmic improvements (projection-based distance, GPU offload of all IB compute, bounding-index narrowing). The critical item is the stl_bounding_boxes allocation bug, which will crash on any simulation with ≥ 2 STL patches. The MPI-correctness regression is the second-highest priority before merge. Recommend:

  1. Fix the allocate-in-loop bug.
  2. Re-evaluate whether removing the IB-marker MPI halo exchange is safe, or re-add a lightweight exchange.
  3. Verify GPU_ATOMIC(atomic='capture') expands correctly.
  4. Validate on ≥ 2 MPI ranks with an STL patch that spans the domain split.

@coderabbitai

coderabbitai bot commented Feb 27, 2026

📝 Walkthrough

Walkthrough

This pull request implements GPU-optimized changes for STL model handling and immersed boundary computation. Key modifications include: expanding maximum patch limits from 10 to 1000; refactoring STL model geometry APIs to use GPU-resident buffers and data structures (introducing gpu_ntrs, gpu_trs_v, gpu_trs_n, and related arrays); replacing logical return types with integer flags in intersection testing; transitioning from interpolation-dependent geometry checks to direct distance-normal calculations via deterministic projections; migrating IBM patch processing to GPU parallel constructs with atomic operations for point counting; adding NVTX profiling markers around IBM workflows; introducing seeded Fortran-based RNG for deterministic GPU execution; and extending MPI broadcasts for model-related parameters. Test metadata files reflect updated hardware environments and compiler toolchain changes to NVHPC with OpenACC enabled.

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 50.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title check ✅ — The title 'MIBM Performance Optimizations' is concise, clear, and directly summarizes the primary focus of the changeset: performance improvements to the moving immersed boundary method (MIBM) code.
  • Description check ✅ — The PR description provides comprehensive context: clear motivation (making multi-particle MIBM viable), specific optimizations (GPU acceleration, projection-based distance, unified normal/distance compute), measurable performance results (1000x speedup in IB propagation), testing details, and supporting artifacts (case file, benchmark images, video link).


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 10

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/simulation/m_ibm.fpp (1)

491-506: ⚠️ Potential issue | 🔴 Critical

Bound the image-point search on GPU and remove error stop.

Line 494 disables the bounds-failure block for GPU builds, so the loop in Lines 491-493 can run index out of range and continue dereferencing s_cc(index + 1). Also, Line 504 uses error stop, which is disallowed.

💡 Proposed fix
-        integer :: dir
-        integer :: index
+        integer :: dir
+        integer :: index, iter, max_iter
@@
-                    do while ((temp_loc < s_cc(index) &
-                               .or. temp_loc > s_cc(index + 1)))
+                    iter = 0
+                    max_iter = 2*(bound + buff_size + 2)
+                    do while ((temp_loc < s_cc(index) &
+                               .or. temp_loc > s_cc(index + 1)))
                         index = index + dir
-#if !defined(MFC_OpenACC) && !defined(MFC_OpenMP)
-                        if (index < -buff_size .or. index > bound) then
+                        iter = iter + 1
+                        if (index < -buff_size .or. index > bound .or. iter > max_iter) then
                             print *, "A required image point is not located in this computational domain."
                             print *, "Ghost Point is located at ", [x_cc(i), y_cc(j), z_cc(k)], " while moving in dimension ", dim
                             print *, "We are searching for image point at ", ghost_points_in(q)%ip_loc(:)
@@
-                            error stop "Ghost Point and Image Point on Different Processors"
+                            call s_mpi_abort()
                         end if
-#endif
                     end do

As per coding guidelines: "Never use stop or error stop. Use call s_mpi_abort() or @:PROHIBIT()/@:ASSERT() instead."

src/simulation/m_ib_patches.fpp (1)

268-284: ⚠️ Potential issue | 🟠 Major

Privatize airfoil temporaries used inside the collapsed GPU loop.

xa, yc, and dycdxc are assigned inside the kernel but are not listed in the private clause at Line 268.

💡 Proposed fix
-        $:GPU_PARALLEL_LOOP(private='[i,j,xy_local,k,f]', &
+        $:GPU_PARALLEL_LOOP(private='[i,j,xy_local,k,f,xa,yc,dycdxc]', &
                   & copyin='[patch_id,center,inverse_rotation,offset,ma,ca_in,airfoil_grid_u,airfoil_grid_l]', collapse=2)

As per coding guidelines: "Declare private(...) on loop-local variables in GPU kernels to avoid unintended data sharing between threads."

🧹 Nitpick comments (4)
src/common/m_helper.fpp (1)

336-336: Guard GPU_ROUTINE in src/common with MFC_SIMULATION.

Please wrap this directive with #:if MFC_SIMULATION / #:endif in src/common to avoid GPU decoration leaking into non-simulation targets.

Suggested change
-        $:GPU_ROUTINE(parallelism='[seq]')
+        #:if MFC_SIMULATION
+            $:GPU_ROUTINE(parallelism='[seq]')
+        #:endif
As per coding guidelines: Only `src/simulation/` is GPU-accelerated; guard GPU macros with `#:if MFC_SIMULATION` for code in `src/common/`.
toolchain/mfc/params/definitions.py (1)

14-14: Use NI as the single source of truth for num_ibs max.

1000 is now duplicated. Reusing NI in CONSTRAINTS prevents drift.

Suggested change
-    "num_ibs": {"min": 0, "max": 1000},
+    "num_ibs": {"min": 0, "max": NI},

Also applies to: 675-675

src/simulation/m_ib_patches.fpp (1)

945-959: Use the copied threshold scalar inside the kernel.

Line 957 reads patch_ib(patch_id)%model_threshold even though threshold is already copied into the kernel, adding unnecessary global reads.

💡 Proposed fix
-                    if (eta > patch_ib(patch_id)%model_threshold) then
+                    if (eta > threshold) then
                         ib_markers%sf(i, j, k) = patch_id
                     end if
src/common/m_model.fpp (1)

547-560: Use direction-only random vectors for ray casting.

Line 558 adds point(k) into ray_dirs, which biases directions by absolute position and reduces angular diversity of the ray test.

💡 Proposed fix
-                ray_dirs(i, k) = point(k) + f_model_random_number(rand_seed) - 0.5_wp
+                ray_dirs(i, k) = f_model_random_number(rand_seed) - 0.5_wp

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 35b2134 and 370aa4c.

📒 Files selected for processing (21)
  • src/common/include/parallel_macros.fpp
  • src/common/m_constants.fpp
  • src/common/m_derived_types.fpp
  • src/common/m_helper.fpp
  • src/common/m_model.fpp
  • src/pre_process/m_icpp_patches.fpp
  • src/simulation/m_compute_levelset.fpp
  • src/simulation/m_ib_patches.fpp
  • src/simulation/m_ibm.fpp
  • src/simulation/m_mpi_proxy.fpp
  • src/simulation/m_time_steppers.fpp
  • tests/0A362971/golden-metadata.txt
  • tests/0A362971/golden.txt
  • tests/4E0FBE72/golden-metadata.txt
  • tests/4E0FBE72/golden.txt
  • tests/EA8FA07E/golden-metadata.txt
  • tests/EA8FA07E/golden.txt
  • toolchain/mfc/case_validator.py
  • toolchain/mfc/params/definitions.py
  • toolchain/mfc/params_tests/test_definitions.py
  • toolchain/mfc/test/cases.py

  integer, parameter :: num_fluids_max = 10 !< Maximum number of fluids in the simulation
  integer, parameter :: num_probes_max = 10 !< Maximum number of flow probes in the simulation
- integer, parameter :: num_patches_max = 10
+ integer, parameter :: num_patches_max = 1000

⚠️ Potential issue | 🟠 Major

Decouple IB scaling from global patch limit.

Line 26 increases num_patches_max to 1000, which expands all patch-sized static structures globally, not just IB capacity. This creates avoidable memory pressure and can impact startup/runtime stability. Consider keeping num_patches_max at its prior scope and introducing a dedicated IB limit constant (e.g., num_ibs_max).

Suggested constant split
-    integer, parameter :: num_patches_max = 1000
+    integer, parameter :: num_patches_max = 10
+    integer, parameter :: num_ibs_max = 1000
Based on learnings: Code Review Priorities (in order): (1) Correctness, (2) Precision discipline, (3) Memory management, (4) MPI correctness, (5) GPU code, (6) Physics consistency, (7) Compiler portability.

Comment on lines +29 to 44
#ifdef MFC_SIMULATION
public :: s_instantiate_STL_models
#endif

!! array of STL models that can be allocated and then used in IB marker and levelset compute
type(t_model_array), allocatable, target :: models(:)
!! GPU-friendly flat arrays for STL model data
integer, allocatable :: gpu_ntrs(:)
real(wp), allocatable, dimension(:, :, :, :) :: gpu_trs_v
real(wp), allocatable, dimension(:, :, :) :: gpu_trs_n
real(wp), allocatable, dimension(:, :, :, :) :: gpu_boundary_v
integer, allocatable :: gpu_boundary_edge_count(:)
integer, allocatable :: gpu_total_vertices(:)
real(wp), allocatable :: stl_bounding_boxes(:, :, :)
$:GPU_DECLARE(create='[gpu_ntrs,gpu_trs_v,gpu_trs_n,gpu_boundary_v,gpu_boundary_edge_count,gpu_total_vertices]')


⚠️ Potential issue | 🟠 Major

Guard GPU declarations in src/common with #:if MFC_SIMULATION.

This module is in src/common; keeping GPU declarations unconditional here increases non-simulation build risk.

💡 Proposed fix
-    $:GPU_DECLARE(create='[gpu_ntrs,gpu_trs_v,gpu_trs_n,gpu_boundary_v,gpu_boundary_edge_count,gpu_total_vertices]')
+#ifdef MFC_SIMULATION
+    $:GPU_DECLARE(create='[gpu_ntrs,gpu_trs_v,gpu_trs_n,gpu_boundary_v,gpu_boundary_edge_count,gpu_total_vertices]')
+#endif

Based on learnings: "Only src/simulation/ is GPU-accelerated; guard GPU macros with #:if MFC_SIMULATION for code in src/common/."

Comment on lines +157 to 160
read (line(13:), *) normal
v_norm = sqrt(normal(1)**2 + normal(2)**2 + normal(3)**2)
if (v_norm > 0._wp) model%trs(i)%n = normal/v_norm


⚠️ Potential issue | 🟠 Major

Initialize ASCII facet normals in the degenerate-normal path.

If v_norm == 0 at Line 159, model%trs(i)%n is never assigned in this path and remains undefined.

💡 Proposed fix
             call s_skip_ignored_lines(iunit, buffered_line, is_buffered)
             read (line(13:), *) normal
             v_norm = sqrt(normal(1)**2 + normal(2)**2 + normal(3)**2)
-            if (v_norm > 0._wp) model%trs(i)%n = normal/v_norm
+            model%trs(i)%n = 0._wp
+            if (v_norm > 0._wp) model%trs(i)%n = normal/v_norm

Comment on lines +1024 to 1034
else if (t < 0._wp) then ! negative t means that v1 is the closest point on the edge
dist = sqrt((point(1) - v1(1))**2 + (point(2) - v1(2))**2)
norm(1) = v1(1) - point(1)
norm(2) = v1(2) - point(2)
norm = norm/dist
else ! t > 1 means that v2 is the closest point on the line edge
dist = sqrt((point(1) - v2(1))**2 + (point(2) - v2(2))**2)
norm(1) = v2(1) - point(1)
norm(2) = v2(2) - point(2)
norm = norm/dist
end if

⚠️ Potential issue | 🔴 Critical

Prevent divide-by-zero when normalizing vertex vectors.

In Lines 1028 and 1033, norm = norm/dist can divide by zero when the query point coincides with v1 or v2.

💡 Proposed fix
             else if (t < 0._wp) then ! negative t means that v1 is the closest point on the edge
                 dist = sqrt((point(1) - v1(1))**2 + (point(2) - v1(2))**2)
                 norm(1) = v1(1) - point(1)
                 norm(2) = v1(2) - point(2)
-                norm = norm/dist
+                if (dist > 0._wp) then
+                    norm = norm/dist
+                else
+                    norm = 0._wp
+                end if
             else ! t > 1 means that v2 is the closest point on the line edge
                 dist = sqrt((point(1) - v2(1))**2 + (point(2) - v2(2))**2)
                 norm(1) = v2(1) - point(1)
                 norm(2) = v2(2) - point(2)
-                norm = norm/dist
+                if (dist > 0._wp) then
+                    norm = norm/dist
+                else
+                    norm = 0._wp
+                end if
             end if
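The clamped-projection-plus-guard logic can be sketched compactly in Python (a sketch of the review's fix, with hypothetical names, not the Fortran routine itself):

```python
import math

def segment_distance_normal(point, v1, v2):
    """2D point-to-segment distance and unit vector from the point toward
    the closest point on the segment, with a guard for the degenerate
    case where the point coincides with the closest point."""
    ex, ey = v2[0] - v1[0], v2[1] - v1[1]
    seg_len2 = ex * ex + ey * ey
    # Projection parameter onto the infinite line, clamped to [0, 1] so
    # t < 0 resolves to v1 and t > 1 resolves to v2.
    t = 0.0 if seg_len2 == 0.0 else ((point[0] - v1[0]) * ex + (point[1] - v1[1]) * ey) / seg_len2
    t = min(max(t, 0.0), 1.0)
    cx, cy = v1[0] + t * ex, v1[1] + t * ey  # closest point on the segment
    nx, ny = cx - point[0], cy - point[1]
    dist = math.hypot(nx, ny)
    if dist > 0.0:
        return dist, (nx / dist, ny / dist)
    return 0.0, (0.0, 0.0)  # degenerate: the point lies on the segment
```

Without the `dist > 0` guard, a cell center sitting exactly on a boundary vertex produces a NaN/Inf normal, which is precisely the failure mode flagged above.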

Comment on lines +1138 to +1141
allocate (stl_bounding_boxes(patch_id, 1:3, 1:3))
stl_bounding_boxes(patch_id, 1, 1:3) = [bbox%min(1), (bbox%min(1) + bbox%max(1))/2._wp, bbox%max(1)]
stl_bounding_boxes(patch_id, 2, 1:3) = [bbox%min(2), (bbox%min(2) + bbox%max(2))/2._wp, bbox%max(2)]
stl_bounding_boxes(patch_id, 3, 1:3) = [bbox%min(3), (bbox%min(3) + bbox%max(3))/2._wp, bbox%max(3)]

⚠️ Potential issue | 🔴 Critical

stl_bounding_boxes allocation will fail with multiple STL patches.

Allocating stl_bounding_boxes inside the patch loop with allocate (stl_bounding_boxes(patch_id, ...)) will fail on the second STL patch because the allocatable is already allocated.

💡 Proposed fix
-        do patch_id = 1, num_ibs
+        if (.not. allocated(stl_bounding_boxes)) then
+            allocate (stl_bounding_boxes(1:num_ibs, 1:3, 1:3))
+            stl_bounding_boxes = 0._wp
+        end if
+
+        do patch_id = 1, num_ibs
@@
-                allocate (stl_bounding_boxes(patch_id, 1:3, 1:3))
                 stl_bounding_boxes(patch_id, 1, 1:3) = [bbox%min(1), (bbox%min(1) + bbox%max(1))/2._wp, bbox%max(1)]
                 stl_bounding_boxes(patch_id, 2, 1:3) = [bbox%min(2), (bbox%min(2) + bbox%max(2))/2._wp, bbox%max(2)]
                 stl_bounding_boxes(patch_id, 3, 1:3) = [bbox%min(3), (bbox%min(3) + bbox%max(3))/2._wp, bbox%max(3)]
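The allocate-once-then-fill pattern behind this fix can be sketched in Python (function names are hypothetical; rows mirror the `stl_bounding_boxes(patch_id, axis, 1:3)` layout in the diff):

```python
def pack_bounding_box(bmin, bmax):
    # One row per axis: [min, midpoint, max].
    return [[bmin[a], 0.5 * (bmin[a] + bmax[a]), bmax[a]] for a in range(3)]

def build_stl_bounding_boxes(patch_bboxes):
    # Allocate the full (num_ibs, 3, 3) table once, before the patch loop,
    # then fill per patch. Re-creating the table inside the loop is the
    # analogue of the flagged per-iteration allocate, which fails on the
    # second STL patch.
    table = [[[0.0] * 3 for _ in range(3)] for _ in patch_bboxes]
    for pid, bbox in enumerate(patch_bboxes):
        if bbox is not None:  # non-STL patches leave their slot zeroed
            table[pid] = pack_bounding_box(*bbox)
    return table
```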

Comment on lines +1173 to +1209
@:ALLOCATE(gpu_ntrs(1:num_ibs))
@:ALLOCATE(gpu_trs_v(1:3, 1:3, 1:max_ntrs, 1:num_ibs))
@:ALLOCATE(gpu_trs_n(1:3, 1:max_ntrs, 1:num_ibs))
@:ALLOCATE(gpu_boundary_edge_count(1:num_ibs))
@:ALLOCATE(gpu_total_vertices(1:num_ibs))

gpu_ntrs = 0
gpu_trs_v = 0._wp
gpu_trs_n = 0._wp
gpu_boundary_edge_count = 0
gpu_total_vertices = 0

if (max_bv1 > 0) then
@:ALLOCATE(gpu_boundary_v(1:max_bv1, 1:max_bv2, 1:max_bv3, 1:num_ibs))
gpu_boundary_v = 0._wp
end if

do pid = 1, num_ibs
if (allocated(models(pid)%model)) then
gpu_ntrs(pid) = models(pid)%ntrs
gpu_trs_v(:, :, 1:models(pid)%ntrs, pid) = models(pid)%trs_v
gpu_trs_n(:, 1:models(pid)%ntrs, pid) = models(pid)%trs_n
gpu_boundary_edge_count(pid) = models(pid)%boundary_edge_count
gpu_total_vertices(pid) = models(pid)%total_vertices
end if
if (allocated(models(pid)%boundary_v) .and. p == 0) then
gpu_boundary_v(1:size(models(pid)%boundary_v, 1), &
1:size(models(pid)%boundary_v, 2), &
1:size(models(pid)%boundary_v, 3), pid) = models(pid)%boundary_v
end if
end do

$:GPU_UPDATE(device='[gpu_ntrs, gpu_trs_v, gpu_trs_n, gpu_boundary_edge_count, gpu_total_vertices]')
if (allocated(gpu_boundary_v)) then
$:GPU_UPDATE(device='[gpu_boundary_v]')
end if
end if

⚠️ Potential issue | 🟠 Major

Add a finalizer for newly allocated module GPU buffers.

@:ALLOCATE is used for gpu_ntrs, gpu_trs_v, gpu_trs_n, gpu_boundary_edge_count, gpu_total_vertices, and gpu_boundary_v, but this file adds no matching teardown path. Repeated setup/restarts will leak memory.

💡 Proposed fix (module-side finalizer)
+    subroutine s_finalize_model_module()
+        integer :: pid
+
+        if (allocated(gpu_boundary_v)) @:DEALLOCATE(gpu_boundary_v)
+        if (allocated(gpu_trs_n)) @:DEALLOCATE(gpu_trs_n)
+        if (allocated(gpu_trs_v)) @:DEALLOCATE(gpu_trs_v)
+        if (allocated(gpu_total_vertices)) @:DEALLOCATE(gpu_total_vertices)
+        if (allocated(gpu_boundary_edge_count)) @:DEALLOCATE(gpu_boundary_edge_count)
+        if (allocated(gpu_ntrs)) @:DEALLOCATE(gpu_ntrs)
+
+        if (allocated(stl_bounding_boxes)) deallocate(stl_bounding_boxes)
+
+        if (allocated(models)) then
+            do pid = 1, size(models)
+                if (allocated(models(pid)%trs_v)) deallocate(models(pid)%trs_v)
+                if (allocated(models(pid)%trs_n)) deallocate(models(pid)%trs_n)
+                if (allocated(models(pid)%boundary_v)) deallocate(models(pid)%boundary_v)
+                if (allocated(models(pid)%model)) then
+                    if (allocated(models(pid)%model%trs)) deallocate(models(pid)%model%trs)
+                    deallocate(models(pid)%model)
+                end if
+            end do
+            deallocate(models)
+        end if
+    end subroutine s_finalize_model_module

As per coding guidelines: "Always pair @:ALLOCATE calls with matching @:DEALLOCATE calls in the corresponding finalization subroutine."
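The pairing rule reduces to: every buffer created in setup gets a matching, guarded release in a finalizer that is safe to call in any state. A minimal Python sketch of that shape (the class and its fields are hypothetical, echoing the module arrays):

```python
class StlGpuBuffers:
    """Every buffer created in setup() has a matching release in
    finalize(), and finalize() is safe to call whether or not setup()
    ran or ran only partially."""

    def __init__(self):
        self.gpu_ntrs = None
        self.gpu_boundary_v = None

    def setup(self, num_ibs, max_bv):
        self.gpu_ntrs = [0] * num_ibs
        if max_bv > 0:  # conditionally allocated, like gpu_boundary_v
            self.gpu_boundary_v = [[0.0] * max_bv for _ in range(num_ibs)]

    def finalize(self):
        # Release in reverse order of setup; guard each buffer because
        # the conditional allocation above may have been skipped.
        if self.gpu_boundary_v is not None:
            self.gpu_boundary_v = None
        if self.gpu_ntrs is not None:
            self.gpu_ntrs = None

buf = StlGpuBuffers()
buf.setup(num_ibs=4, max_bv=6)
buf.finalize()
buf.finalize()  # idempotent: a repeated teardown is a no-op
```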

Comment on lines +661 to +662
boundary_edge_count = gpu_boundary_edge_count(patch_id)
total_vertices = gpu_total_vertices(patch_id)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Find the routine containing lines 661-662 and show surrounding context
rg -n -B30 'boundary_edge_count = gpu_boundary_edge_count' src/simulation/m_compute_levelset.fpp | head -100

Repository: MFlowCode/MFC

Length of output: 1287


🏁 Script executed:

# Show more context after line 662 to see if total_vertices is used
rg -n -A30 'total_vertices = gpu_total_vertices' src/simulation/m_compute_levelset.fpp

Repository: MFlowCode/MFC

Length of output: 1564


🏁 Script executed:

# Check distance routine signatures to see if they use total_vertices parameter
rg -n -A15 'subroutine[[:space:]]+s_distance_normals' src/simulation src/common

Repository: MFlowCode/MFC

Length of output: 4161


🏁 Script executed:

# Search for all uses of total_vertices in the file
rg -n 'total_vertices' src/simulation/m_compute_levelset.fpp

Repository: MFlowCode/MFC

Length of output: 192


Remove the unused total_vertices load on line 662.

The variable total_vertices is assigned from gpu_total_vertices(patch_id) but is never consumed in this routine. The distance computation routines (s_distance_normals_2D and s_distance_normals_3D) do not accept total_vertices as a parameter, and the variable is not used elsewhere in s_model_levelset. Remove line 662 and the declaration of total_vertices from line 649.

Comment on lines 588 to 617
$:GPU_PARALLEL_LOOP(private='[i,j,k,ii,jj,kk,is_gp,local_idx]', copyin='[count,count_i, x_domain, y_domain, z_domain]', firstprivate='[gp_layers,gp_layers_z]', collapse=3)
do i = 0, m
do j = 0, n
if (p == 0) then
! 2D
if (ib_markers%sf(i, j, 0) /= 0) then
subsection_2D = ib_markers%sf( &
i - gp_layers:i + gp_layers, &
j - gp_layers:j + gp_layers, 0)
if (any(subsection_2D == 0)) then
ghost_points_in(count)%loc = [i, j, 0]
patch_id = ib_markers%sf(i, j, 0)
ghost_points_in(count)%ib_patch_id = &
patch_id

ghost_points_in(count)%slip = patch_ib(patch_id)%slip
! ghost_points(count)%rank = proc_rank
do k = 0, p
if (ib_markers%sf(i, j, k) /= 0) then
is_gp = .false.
marker_search: do ii = i - gp_layers, i + gp_layers
do jj = j - gp_layers, j + gp_layers
do kk = k - gp_layers_z, k + gp_layers_z
if (ib_markers%sf(ii, jj, kk) == 0) then
! if any neighbors are not in the IB, it is a ghost point
is_gp = .true.
exit marker_search
end if
end do
end do
end do marker_search

if (is_gp) then
$:GPU_ATOMIC(atomic='capture')
count = count + 1
local_idx = count
$:END_GPU_ATOMIC_CAPTURE()

ghost_points_in(local_idx)%loc = [i, j, k]
patch_id = ib_markers%sf(i, j, k)
ghost_points_in(local_idx)%ib_patch_id = patch_id

ghost_points_in(local_idx)%slip = patch_ib(patch_id)%slip


⚠️ Potential issue | 🟠 Major

patch_id should be private in this GPU kernel.

patch_id is assigned per-iteration (Lines 613 and 651) but is not listed in the private clause at Line 588, which can cause data races between threads.

💡 Proposed fix
-        $:GPU_PARALLEL_LOOP(private='[i,j,k,ii,jj,kk,is_gp,local_idx]', copyin='[count,count_i, x_domain, y_domain, z_domain]', firstprivate='[gp_layers,gp_layers_z]', collapse=3)
+        $:GPU_PARALLEL_LOOP(private='[i,j,k,ii,jj,kk,is_gp,local_idx,patch_id]', copyin='[count,count_i, x_domain, y_domain, z_domain]', firstprivate='[gp_layers,gp_layers_z]', collapse=3)

As per coding guidelines: "Declare private(...) on loop-local variables in GPU kernels to avoid unintended data sharing between threads."

Comment on lines +688 to 689
self.prohibit(ib and (num_ibs <= 0 or num_ibs > 1000),
"num_ibs must be between 1 and num_patches_max (10)")

⚠️ Potential issue | 🟡 Minor

Fix stale validation message for num_ibs upper bound.

The check allows up to 1000, but the error text still says 10, which will mislead users.

Suggested change
-        self.prohibit(ib and (num_ibs <= 0 or num_ibs > 1000),
-                     "num_ibs must be between 1 and num_patches_max (10)")
+        self.prohibit(ib and (num_ibs <= 0 or num_ibs > 1000),
+                     "num_ibs must be between 1 and 1000")
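A simple way to prevent this class of drift is to derive the message from the same constant as the check. A sketch of that idea (the constant and function are hypothetical, not the toolchain's `prohibit()` helper):

```python
NUM_IBS_MAX = 1000  # hypothetical mirror of the new upper bound

def check_num_ibs(ib, num_ibs):
    """Return an error string if IBs are enabled with an out-of-range
    count, else None. The message interpolates NUM_IBS_MAX, so the text
    can never disagree with the bound being enforced."""
    if ib and not (1 <= num_ibs <= NUM_IBS_MAX):
        return f"num_ibs must be between 1 and {NUM_IBS_MAX}"
    return None
```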

Comment on lines +1043 to +1044
"3D_IGR_33jet","1D_multispecies_diffusion",
"2D_ibm_stl_MFCCharacter"]

⚠️ Potential issue | 🟡 Minor

Add a reason for the new skip entry.

Line 1044 adds 2D_ibm_stl_MFCCharacter to casesToSkip, which drops coverage on the same subsystem this PR optimizes. Please add a short reason and tracking issue so this skip is auditable and temporary.

@danieljvickers
Member Author

This branch now contains the contents for a generalized periodic IB implementation. We can close the implementation from @conraddelgado and move forward with this one. Conrad, you can close your branch and just work off of this one if you like.

@github-actions

Claude Code Review

Head SHA: 6c3a125ca28e1023bb2bc72a9cdd8f3c53bc413b

Files changed: 24 files (+1639 / -1609)

Key files:

  • src/common/m_model.fpp (+475 / -488)
  • src/simulation/m_ib_patches.fpp (+543 / -304)
  • src/simulation/m_ibm.fpp (+238 / -275)
  • src/simulation/m_compute_levelset.fpp (+40 / -86)
  • src/simulation/m_mpi_proxy.fpp (+5 / -187)
  • src/common/include/parallel_macros.fpp (+10 / -0)
  • src/common/m_constants.fpp (+1 / -5) — num_patches_max 10 → 1000

Summary

  • GPU-offloads STL IB marker generation and levelset compute by packing triangle/edge data into flat GPU-friendly arrays (gpu_trs_v, gpu_trs_n, gpu_boundary_v).
  • Replaces interpolation-based distance search with an exact projection-based algorithm (s_distance_normals_3D, s_distance_normals_2D).
  • Replaces random ray-casting inside-model test with 26 fixed-direction rays, enabling GPU execution (f_model_is_inside_flat).
  • Adds bounding-box index pruning for IB patch loops (binary search via get_bounding_indices), reducing per-timestep work dramatically.
  • Removes MPI IB-marker halo exchange, replacing it with each rank independently filling ghost cells from globally known patch positions. Adds periodic image patch support.
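The bounding-box pruning in the summary reduces a per-timestep full-grid sweep to a window of cells around each patch. A hedged Python sketch of the idea, using binary search on sorted cell centers (the Fortran helper's exact interface may differ):

```python
from bisect import bisect_left, bisect_right

def get_bounding_indices(cell_centers, lo, hi):
    """Conservative index window (il, ir) of sorted cell centers covering
    the physical interval [lo, hi]: one cell past each bound, clamped to
    the array, so no candidate cell is ever pruned incorrectly."""
    il = max(bisect_left(cell_centers, lo) - 1, 0)
    ir = min(bisect_right(cell_centers, hi), len(cell_centers) - 1)
    return il, ir
```

IB patch loops then iterate `il:ir` per axis instead of the full extent, which is where the per-timestep savings come from.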

Findings

BUG (Critical): stl_bounding_boxes re-allocated inside a loop

File: src/common/m_model.fpp, inside s_instantiate_STL_models

do patch_id = 1, num_ibs
    if (patch_ib(patch_id)%geometry == 5 .or. patch_ib(patch_id)%geometry == 12) then
        ...
        allocate (stl_bounding_boxes(patch_id, 1:3, 1:3))  ! BUG: re-allocates each iteration
        stl_bounding_boxes(patch_id, ...) = ...
    end if
end do

Each loop iteration allocates stl_bounding_boxes with first dimension = current patch_id. When a second STL patch is encountered, stl_bounding_boxes is already allocated — a Fortran runtime error (abort). The dimension patch_id also means only patches 1..patch_id could be indexed. This needs a single allocation before the loop:

allocate (stl_bounding_boxes(1:num_ibs, 1:3, 1:3))
do patch_id = 1, num_ibs
    if (...) then
        stl_bounding_boxes(patch_id, ...) = ...
    end if
end do

This will silently work for the single-STL case but crash for multiple STL patches — exactly the multi-particle MIBM use case this PR targets.


BUG (Precision/Correctness): Division by zero in s_distance_normals_2D

File: src/common/m_model.fpp, subroutine s_distance_normals_2D

When t < 0 (closest point is vertex v1) or t > 1 (closest point is vertex v2):

dist = sqrt((point(1) - v1(1))**2 + (point(2) - v1(2))**2)
norm(1) = v1(1) - point(1)
norm(2) = v1(2) - point(2)
norm = norm/dist   ! undefined if dist == 0

If the cell center exactly coincides with a boundary vertex, dist = 0 causes a NaN/Inf normal. Needs a guard: if (dist > 0._wp) norm = norm/dist.


CONCERN: GPU_PARALLEL_LOOP called with unrecognized parallelism='[seq]' kwarg

File: src/simulation/m_ibm.fpp, s_fill_ghost_points and the interp-coefficient loop

$:GPU_PARALLEL_LOOP(private='[q,i,...]', parallelism='[seq]')
do q = 1, num_gps

parallelism is a parameter for GPU_ROUTINE, not GPU_PARALLEL_LOOP. Depending on how the Fypp macro handles unknown kwargs, this either (a) silently ignores it and parallelizes correctly, or (b) causes a Fypp error at build time. This should be verified — a GPU_PARALLEL_LOOP with parallelism='[seq]' is contradictory.


CONCERN: Named-loop exit inside GPU_PARALLEL_LOOP

File: src/simulation/m_ibm.fpp, ghost-point detection loop

$:GPU_PARALLEL_LOOP(private='[i,j,k,ii,jj,kk,is_gp]', ..., collapse=3)
do i = 0, m
  do j = 0, n
    do k = 0, p
      marker_search: do ii = ...
        ...
        exit marker_search   ! named exit from inner nested loop
      end do marker_search

Named EXIT from inner loops is supported by NVHPC/OpenACC, but behavior under OpenMP target offload (Cray ftn, AMD flang) may differ. The CI gate includes Cray ftn — confirm this compiles and produces correct results with --gpu mp.


CONCERN: Removal of MPI halo exchange for IB markers

File: src/simulation/m_mpi_proxy.fpp (-187 lines), src/simulation/m_ibm.fpp

s_populate_ib_buffers / s_mpi_sendrecv_ib_buffers is deleted entirely. IB patch routines now iterate over ghost-cell-extended index ranges (e.g., -gp_layers-1:m+gp_layers+1) using globally known centroid positions, so each rank independently fills its own ghost region without MPI communication.

This is conceptually correct for geometric IB patches (all ranks know all centroids via the MPI broadcast in m_mpi_proxy.fpp). Please confirm this is intentional and has been validated in multi-rank (≥2) configurations — especially for the moving_ibm=2 case where centroids change each timestep. The new centroid data (patch_ib%x_centroid) needs to be re-broadcast to all ranks after every update.


MINOR: Missing subroutine name on end subroutine

File: src/common/m_model.fpp, s_pack_model_for_gpu

    end subroutine   ! should be: end subroutine s_pack_model_for_gpu

Style convention; not a correctness issue.


MINOR: Leftover comment in production loop

File: src/simulation/m_ib_patches.fpp, s_ib_3d_model

do k = kl, kr
    ! do i = -gp_layers, m+gp_layers    ← dead comment

Notes

  • The new projection-based 3D distance computation in s_distance_normals_3D is algorithmically sound (barycentric + edge/vertex clamping). The early-exit if (abs(d) >= dist_min) cycle is a valid optimization since n is normalized.
  • Triangle normal normalization on read (binary and ASCII STL paths) is a welcome correctness fix.
  • The encode_patch_periodicity / decode_patch_periodicity base-3 encoding with num_ibs+1=1001 gives maximum encoded ID of ~27000 — no integer overflow at 1000 IBs.
  • Bounding-index binary search in get_bounding_indices has correct semantics (conservative: returns one cell past the physical bound).
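The overflow claim in the encoding note checks out arithmetically. A plausible base-1001 scheme with one base-3 shift digit per dimension reproduces the stated maximum (the function bodies below are an assumption consistent with the described bound, not the actual Fortran implementation):

```python
def encode_patch_periodicity(patch_id, sx, sy, sz, num_ibs=1000):
    """Pack a patch id plus per-axis periodic shifts (each in {-1, 0, +1})
    into one integer: patch_id in the low base-(num_ibs+1) digit, the
    three shifts as a base-3 number above it."""
    base = num_ibs + 1  # 1001
    digits = (sx + 1) + 3 * (sy + 1) + 9 * (sz + 1)  # 0..26
    return patch_id + base * digits

def decode_patch_periodicity(encoded, num_ibs=1000):
    base = num_ibs + 1
    patch_id = encoded % base
    digits = encoded // base
    return patch_id, digits % 3 - 1, (digits // 3) % 3 - 1, digits // 9 - 1
```

The largest encodable value is 1000 + 1001 * 26 = 27026, comfortably within default integer range, matching the "~27000, no overflow at 1000 IBs" observation.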

🤖 Generated with Claude Code


Labels

size:XXL This PR changes 1000+ lines, ignoring generated files


2 participants