MIBM Performance Optimizations #1157
danieljvickers wants to merge 66 commits into MFlowCode:master from
Conversation
…mething about that...
…generating golden file
Claude Code Review

Head SHA: …

Summary

The PR makes significant and valuable performance improvements by moving IBM computations to the GPU with proper Fypp macro usage. The new projection-based distance computation is mathematically cleaner than the interpolation approach. GPU macro usage is correct throughout — all GPU parallelism uses …

However, there are several issues that should be addressed before merging.

Critical Issues

1. Loop allocation crash.
Fix: Allocate …

2. Memory inflation from the raised global patch limit.
Fix: Introduce a separate …

3. Division by zero in …

dist = sqrt((point(1) - v1(1))**2 + (point(2) - v1(2))**2)
norm = norm/dist ! Division by zero if dist == 0

The 3D counterpart has the same issue.
Fix: Add …

4. Missing deallocations.

None are deallocated anywhere, violating the …
Fix: Add a …

Important Issues

5. error stop "Ghost Point and Image Point on Different Processors"

Per project rules, use …

6. Stale error message in …

self.prohibit(ib and (num_ibs <= 0 or num_ibs > 1000),
    "num_ibs must be between 1 and num_patches_max (10)")

The limit changed to 1000 but the message still says "(10)".

7. Unbounded …

The bounds check that prevents …
Fix: Add a bounded iteration limit and exit if exceeded.

Minor Notes

Verdict

The GPU offloading approach is correct and the Fypp macro usage follows project conventions. The four critical issues (loop allocation crash, memory inflation, division by zero, missing deallocations) should be fixed before merging. The important issues are lower priority but worth addressing.
@sbryngelson The Frontier CPU test failed due to a compile-time segfault. This happened two days ago, so I manually got onto Frontier and did an
Claude Code Review

Head SHA: 370aa4c

Changed files:
Summary:
Findings

🔴 Bug — … re-allocated inside loop

Then remove the per-patch …

🔴 Correctness — MPI halo exchange for IB markers removed

This is difficult to hit in single-rank GPU benchmarks (which is how the PR was validated), but it is a regression for any multi-rank MIBM run where an IB boundary crosses a subdomain edge.

🟡 Portability — Named DO exits inside GPU loops

Named DO constructs with …

🟡 GPU atomic capture correctness

The PR adds … If the existing …

🟡 Ray direction bug (~line 224–226)

This sets the raw (unnormalised) ray direction component to …

🟡 GPU-callable function references a global without an argument

The new GPU-callable function uses the module-level variable …

ℹ️ Minor — Unused variables

Variables …

ℹ️ Minor — Golden files regenerated only on GPU/nvfortran

The three updated golden files …

Overall

The performance approach is sound and the ~1000× speedup claim is plausible given the algorithmic improvements (projection-based distance, GPU offload of all IB compute, bounding-index narrowing). The critical item is the …
📝 Walkthrough

This pull request implements GPU-optimized changes for STL model handling and immersed boundary computation. Key modifications include: expanding maximum patch limits from 10 to 1000; refactoring STL model geometry APIs to use GPU-resident buffers and data structures (introducing gpu_ntrs, gpu_trs_v, gpu_trs_n, and related arrays); replacing logical return types with integer flags in intersection testing; transitioning from interpolation-dependent geometry checks to direct distance-normal calculations via deterministic projections; migrating IBM patch processing to GPU parallel constructs with atomic operations for point counting; adding NVTX profiling markers around IBM workflows; introducing seeded Fortran-based RNG for deterministic GPU execution; and extending MPI broadcasts for model-related parameters. Test metadata files reflect updated hardware environments and compiler toolchain changes to NVHPC with OpenACC enabled.

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 10
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
src/simulation/m_ibm.fpp (1)
491-506: ⚠️ Potential issue | 🔴 Critical

Bound the image-point search on GPU and remove error stop.

Line 494 disables the bounds-failure block for GPU builds, so the loop in Lines 491-493 can run index out of range and continue dereferencing s_cc(index + 1). Also, Line 504 uses error stop, which is disallowed.

💡 Proposed fix

- integer :: dir
- integer :: index
+ integer :: dir
+ integer :: index, iter, max_iter
@@
- do while ((temp_loc < s_cc(index) &
-            .or. temp_loc > s_cc(index + 1)))
+ iter = 0
+ max_iter = 2*(bound + buff_size + 2)
+ do while ((temp_loc < s_cc(index) &
+            .or. temp_loc > s_cc(index + 1)))
      index = index + dir
-#if !defined(MFC_OpenACC) && !defined(MFC_OpenMP)
-     if (index < -buff_size .or. index > bound) then
+     iter = iter + 1
+     if (index < -buff_size .or. index > bound .or. iter > max_iter) then
          print *, "A required image point is not located in this computational domain."
          print *, "Ghost Point is located at ", [x_cc(i), y_cc(j), z_cc(k)], " while moving in dimension ", dim
          print *, "We are searching for image point at ", ghost_points_in(q)%ip_loc(:)
@@
-         error stop "Ghost Point and Image Point on Different Processors"
+         call s_mpi_abort()
      end if
-#endif
  end do

As per coding guidelines: "Never use stop or error stop. Use call s_mpi_abort() or @:PROHIBIT()/@:ASSERT() instead."

src/simulation/m_ib_patches.fpp (1)
268-284: ⚠️ Potential issue | 🟠 Major

Privatize airfoil temporaries used inside the collapsed GPU loop.

xa, yc, and dycdxc are assigned inside the kernel but are not listed in the private clause at Line 268.

💡 Proposed fix

- $:GPU_PARALLEL_LOOP(private='[i,j,xy_local,k,f]', &
+ $:GPU_PARALLEL_LOOP(private='[i,j,xy_local,k,f,xa,yc,dycdxc]', &
      & copyin='[patch_id,center,inverse_rotation,offset,ma,ca_in,airfoil_grid_u,airfoil_grid_l]', collapse=2)

As per coding guidelines: "Declare private(...) on loop-local variables in GPU kernels to avoid unintended data sharing between threads."
🧹 Nitpick comments (4)
src/common/m_helper.fpp (1)
336-336: Guard GPU_ROUTINE in src/common with MFC_SIMULATION.

Please wrap this directive with #:if MFC_SIMULATION / #:endif in src/common to avoid GPU decoration leaking into non-simulation targets.

As per coding guidelines: Only `src/simulation/` is GPU-accelerated; guard GPU macros with `#:if MFC_SIMULATION` for code in `src/common/`.

Suggested change

- $:GPU_ROUTINE(parallelism='[seq]')
+ #:if MFC_SIMULATION
+     $:GPU_ROUTINE(parallelism='[seq]')
+ #:endif

toolchain/mfc/params/definitions.py (1)
14-14: Use NI as the single source of truth for the num_ibs max.

1000 is now duplicated. Reusing NI in CONSTRAINTS prevents drift.

Suggested change

- "num_ibs": {"min": 0, "max": 1000},
+ "num_ibs": {"min": 0, "max": NI},

Also applies to: 675-675
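The drift-proof pattern the reviewer suggests can be sketched in Python. `NI`, the `CONSTRAINTS` shape, and `validate_num_ibs` are illustrative stand-ins here, not the toolchain's actual code:

```python
# Hypothetical sketch of the single-source-of-truth pattern suggested above.
NI = 1000  # maximum number of immersed boundaries

CONSTRAINTS = {
    "num_ibs": {"min": 0, "max": NI},
}

def validate_num_ibs(num_ibs):
    bounds = CONSTRAINTS["num_ibs"]
    if not (bounds["min"] <= num_ibs <= bounds["max"]):
        # the message is generated from the same constant it enforces,
        # so raising the limit can never leave a stale "(10)" behind
        raise ValueError(
            f"num_ibs must be between {bounds['min']} and {bounds['max']}")

validate_num_ibs(500)  # within bounds: no error
```

Deriving both the check and its error text from one constant would also have prevented the stale "(10)" message flagged later in this review.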
src/simulation/m_ib_patches.fpp (1)
945-959: Use the copied threshold scalar inside the kernel.

Line 957 reads patch_ib(patch_id)%model_threshold even though threshold is already copied into the kernel, adding unnecessary global reads.

💡 Proposed fix

- if (eta > patch_ib(patch_id)%model_threshold) then
+ if (eta > threshold) then
      ib_markers%sf(i, j, k) = patch_id
  end if

src/common/m_model.fpp (1)
547-560: Use direction-only random vectors for ray casting.

Line 558 adds point(k) into ray_dirs, which biases directions by absolute position and reduces angular diversity of the ray test.

💡 Proposed fix

- ray_dirs(i, k) = point(k) + f_model_random_number(rand_seed) - 0.5_wp
+ ray_dirs(i, k) = f_model_random_number(rand_seed) - 0.5_wp
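To see why the flagged line biases the cast, here is a small Python demonstration. `biased_dir` and `unbiased_dir` are hypothetical stand-ins for the Fortran code, with Python's seeded `random.Random` playing the role of `f_model_random_number`:

```python
import random

def biased_dir(point, rng):
    # mirrors the flagged line: ray_dirs(i, k) = point(k) + rand - 0.5
    return [point[k] + rng.random() - 0.5 for k in range(3)]

def unbiased_dir(rng):
    # direction-only, as in the proposed fix: rand - 0.5 per component
    return [rng.random() - 0.5 for _ in range(3)]

rng = random.Random(1234)  # seeded for deterministic behaviour
point = [10.0, 10.0, 10.0]

# rand - 0.5 lies in (-0.5, 0.5), so for a point far from the origin every
# biased component keeps the sign of the coordinate: all rays point into one
# octant and the angular diversity of the ray test collapses.
for _ in range(1000):
    assert all(c > 9.5 for c in biased_dir(point, rng))
```

With the fix, each component is an independent value in (-0.5, 0.5), so ray directions cover all octants regardless of where the query point sits.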
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (21)
src/common/include/parallel_macros.fpp
src/common/m_constants.fpp
src/common/m_derived_types.fpp
src/common/m_helper.fpp
src/common/m_model.fpp
src/pre_process/m_icpp_patches.fpp
src/simulation/m_compute_levelset.fpp
src/simulation/m_ib_patches.fpp
src/simulation/m_ibm.fpp
src/simulation/m_mpi_proxy.fpp
src/simulation/m_time_steppers.fpp
tests/0A362971/golden-metadata.txt
tests/0A362971/golden.txt
tests/4E0FBE72/golden-metadata.txt
tests/4E0FBE72/golden.txt
tests/EA8FA07E/golden-metadata.txt
tests/EA8FA07E/golden.txt
toolchain/mfc/case_validator.py
toolchain/mfc/params/definitions.py
toolchain/mfc/params_tests/test_definitions.py
toolchain/mfc/test/cases.py
  integer, parameter :: num_fluids_max = 10 !< Maximum number of fluids in the simulation
  integer, parameter :: num_probes_max = 10 !< Maximum number of flow probes in the simulation
- integer, parameter :: num_patches_max = 10
+ integer, parameter :: num_patches_max = 1000
Decouple IB scaling from global patch limit.
Line 26 increases num_patches_max to 1000, which expands all patch-sized static structures globally, not just IB capacity. This creates avoidable memory pressure and can impact startup/runtime stability. Consider keeping num_patches_max at its prior scope and introducing a dedicated IB limit constant (e.g., num_ibs_max).
Suggested constant split
- integer, parameter :: num_patches_max = 1000
+ integer, parameter :: num_patches_max = 10
+ integer, parameter :: num_ibs_max = 1000

  #ifdef MFC_SIMULATION
  public :: s_instantiate_STL_models
  #endif

  !! array of STL models that can be allocated and then used in IB marker and levelset compute
  type(t_model_array), allocatable, target :: models(:)
  !! GPU-friendly flat arrays for STL model data
  integer, allocatable :: gpu_ntrs(:)
  real(wp), allocatable, dimension(:, :, :, :) :: gpu_trs_v
  real(wp), allocatable, dimension(:, :, :) :: gpu_trs_n
  real(wp), allocatable, dimension(:, :, :, :) :: gpu_boundary_v
  integer, allocatable :: gpu_boundary_edge_count(:)
  integer, allocatable :: gpu_total_vertices(:)
  real(wp), allocatable :: stl_bounding_boxes(:, :, :)
  $:GPU_DECLARE(create='[gpu_ntrs,gpu_trs_v,gpu_trs_n,gpu_boundary_v,gpu_boundary_edge_count,gpu_total_vertices]')
Guard GPU declarations in src/common with #:if MFC_SIMULATION.
This module is in src/common; keeping GPU declarations unconditional here increases non-simulation build risk.
💡 Proposed fix
- $:GPU_DECLARE(create='[gpu_ntrs,gpu_trs_v,gpu_trs_n,gpu_boundary_v,gpu_boundary_edge_count,gpu_total_vertices]')
+#ifdef MFC_SIMULATION
+ $:GPU_DECLARE(create='[gpu_ntrs,gpu_trs_v,gpu_trs_n,gpu_boundary_v,gpu_boundary_edge_count,gpu_total_vertices]')
+#endif

Based on learnings: "Only src/simulation/ is GPU-accelerated; guard GPU macros with #:if MFC_SIMULATION for code in src/common/."
read (line(13:), *) normal
v_norm = sqrt(normal(1)**2 + normal(2)**2 + normal(3)**2)
if (v_norm > 0._wp) model%trs(i)%n = normal/v_norm
Initialize ASCII facet normals for degenerate normals.
If v_norm == 0 at Line 159, model%trs(i)%n is never assigned in this path and remains undefined.
💡 Proposed fix
call s_skip_ignored_lines(iunit, buffered_line, is_buffered)
read (line(13:), *) normal
v_norm = sqrt(normal(1)**2 + normal(2)**2 + normal(3)**2)
- if (v_norm > 0._wp) model%trs(i)%n = normal/v_norm
+ model%trs(i)%n = 0._wp
+ if (v_norm > 0._wp) model%trs(i)%n = normal/v_norm

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
read (line(13:), *) normal
v_norm = sqrt(normal(1)**2 + normal(2)**2 + normal(3)**2)
model%trs(i)%n = 0._wp
if (v_norm > 0._wp) model%trs(i)%n = normal/v_norm
else if (t < 0._wp) then ! negative t means that v1 is the closest point on the edge
    dist = sqrt((point(1) - v1(1))**2 + (point(2) - v1(2))**2)
    norm(1) = v1(1) - point(1)
    norm(2) = v1(2) - point(2)
    norm = norm/dist
else ! t > 1 means that v2 is the closest point on the line edge
    dist = sqrt((point(1) - v2(1))**2 + (point(2) - v2(2))**2)
    norm(1) = v2(1) - point(1)
    norm(2) = v2(2) - point(2)
    norm = norm/dist
end if
Prevent divide-by-zero when normalizing vertex vectors.
In Lines 1028 and 1033, norm = norm/dist can divide by zero when the query point coincides with v1 or v2.
💡 Proposed fix
else if (t < 0._wp) then ! negative t means that v1 is the closest point on the edge
dist = sqrt((point(1) - v1(1))**2 + (point(2) - v1(2))**2)
norm(1) = v1(1) - point(1)
norm(2) = v1(2) - point(2)
- norm = norm/dist
+ if (dist > 0._wp) then
+ norm = norm/dist
+ else
+ norm = 0._wp
+ end if
else ! t > 1 means that v2 is the closest point on the line edge
dist = sqrt((point(1) - v2(1))**2 + (point(2) - v2(2))**2)
norm(1) = v2(1) - point(1)
norm(2) = v2(2) - point(2)
- norm = norm/dist
+ if (dist > 0._wp) then
+ norm = norm/dist
+ else
+ norm = 0._wp
+ end if
  end if

📝 Committable suggestion
else if (t < 0._wp) then ! negative t means that v1 is the closest point on the edge
    dist = sqrt((point(1) - v1(1))**2 + (point(2) - v1(2))**2)
    norm(1) = v1(1) - point(1)
    norm(2) = v1(2) - point(2)
    if (dist > 0._wp) then
        norm = norm/dist
    else
        norm = 0._wp
    end if
else ! t > 1 means that v2 is the closest point on the line edge
    dist = sqrt((point(1) - v2(1))**2 + (point(2) - v2(2))**2)
    norm(1) = v2(1) - point(1)
    norm(2) = v2(2) - point(2)
    if (dist > 0._wp) then
        norm = norm/dist
    else
        norm = 0._wp
    end if
end if
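For intuition, the whole point-to-segment computation, including the guarded normalization, can be sketched in a standalone Python form. This is a hypothetical 2D version for illustration; the project's actual routine works on `wp` reals and returns through arguments:

```python
import math

def segment_distance_normal(point, v1, v2):
    """Distance from `point` to segment v1-v2 and the unit vector toward the
    closest point, guarding the dist == 0 case as in the fix above."""
    ex, ey = v2[0] - v1[0], v2[1] - v1[1]
    px, py = point[0] - v1[0], point[1] - v1[1]
    denom = ex * ex + ey * ey
    # parameter along the edge; t < 0 and t > 1 select the endpoints
    t = (px * ex + py * ey) / denom if denom > 0.0 else 0.0
    t = max(0.0, min(1.0, t))
    cx, cy = v1[0] + t * ex, v1[1] + t * ey  # closest point on the segment
    dist = math.hypot(cx - point[0], cy - point[1])
    if dist > 0.0:
        norm = ((cx - point[0]) / dist, (cy - point[1]) / dist)
    else:
        norm = (0.0, 0.0)  # query point lies on the segment: no direction
    return dist, norm

# point directly above the segment midpoint region
assert segment_distance_normal((0.0, 1.0), (0.0, 0.0), (2.0, 0.0)) == (1.0, (0.0, -1.0))
# degenerate case: point coincides with v1, would divide by zero unguarded
assert segment_distance_normal((0.0, 0.0), (0.0, 0.0), (2.0, 0.0)) == (0.0, (0.0, 0.0))
```

Clamping `t` to [0, 1] folds the three branches of the Fortran code (interior, v1-closest, v2-closest) into one expression, and the single guard covers both endpoint cases.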
allocate (stl_bounding_boxes(patch_id, 1:3, 1:3))
stl_bounding_boxes(patch_id, 1, 1:3) = [bbox%min(1), (bbox%min(1) + bbox%max(1))/2._wp, bbox%max(1)]
stl_bounding_boxes(patch_id, 2, 1:3) = [bbox%min(2), (bbox%min(2) + bbox%max(2))/2._wp, bbox%max(2)]
stl_bounding_boxes(patch_id, 3, 1:3) = [bbox%min(3), (bbox%min(3) + bbox%max(3))/2._wp, bbox%max(3)]
stl_bounding_boxes allocation will fail with multiple STL patches.
Allocating stl_bounding_boxes inside the patch loop with allocate (stl_bounding_boxes(patch_id, ...)) will fail on the second STL patch because the allocatable is already allocated.
💡 Proposed fix
- do patch_id = 1, num_ibs
+ if (.not. allocated(stl_bounding_boxes)) then
+ allocate (stl_bounding_boxes(1:num_ibs, 1:3, 1:3))
+ stl_bounding_boxes = 0._wp
+ end if
+
+ do patch_id = 1, num_ibs
@@
- allocate (stl_bounding_boxes(patch_id, 1:3, 1:3))
stl_bounding_boxes(patch_id, 1, 1:3) = [bbox%min(1), (bbox%min(1) + bbox%max(1))/2._wp, bbox%max(1)]
stl_bounding_boxes(patch_id, 2, 1:3) = [bbox%min(2), (bbox%min(2) + bbox%max(2))/2._wp, bbox%max(2)]
  stl_bounding_boxes(patch_id, 3, 1:3) = [bbox%min(3), (bbox%min(3) + bbox%max(3))/2._wp, bbox%max(3)]

@:ALLOCATE(gpu_ntrs(1:num_ibs))
@:ALLOCATE(gpu_trs_v(1:3, 1:3, 1:max_ntrs, 1:num_ibs))
@:ALLOCATE(gpu_trs_n(1:3, 1:max_ntrs, 1:num_ibs))
@:ALLOCATE(gpu_boundary_edge_count(1:num_ibs))
@:ALLOCATE(gpu_total_vertices(1:num_ibs))

gpu_ntrs = 0
gpu_trs_v = 0._wp
gpu_trs_n = 0._wp
gpu_boundary_edge_count = 0
gpu_total_vertices = 0

if (max_bv1 > 0) then
    @:ALLOCATE(gpu_boundary_v(1:max_bv1, 1:max_bv2, 1:max_bv3, 1:num_ibs))
    gpu_boundary_v = 0._wp
end if

!> This procedure determines the levelset of interpolated 2D models.
!! @param interpolated_boundary_v Group of all the boundary vertices of the interpolated 2D model
!! @param total_vertices Total number of vertices after interpolation
!! @param point The cell centers of the current levelset cell
!! @return Distance which the levelset distance without interpolation
function f_interpolated_distance(interpolated_boundary_v, total_vertices, point) result(distance)

do pid = 1, num_ibs
    if (allocated(models(pid)%model)) then
        gpu_ntrs(pid) = models(pid)%ntrs
        gpu_trs_v(:, :, 1:models(pid)%ntrs, pid) = models(pid)%trs_v
        gpu_trs_n(:, 1:models(pid)%ntrs, pid) = models(pid)%trs_n
        gpu_boundary_edge_count(pid) = models(pid)%boundary_edge_count
        gpu_total_vertices(pid) = models(pid)%total_vertices
    end if
    if (allocated(models(pid)%boundary_v) .and. p == 0) then
        gpu_boundary_v(1:size(models(pid)%boundary_v, 1), &
                       1:size(models(pid)%boundary_v, 2), &
                       1:size(models(pid)%boundary_v, 3), pid) = models(pid)%boundary_v
    end if
end do

$:GPU_ROUTINE(parallelism='[seq]')
$:GPU_UPDATE(device='[gpu_ntrs, gpu_trs_v, gpu_trs_n, gpu_boundary_edge_count, gpu_total_vertices]')
if (allocated(gpu_boundary_v)) then
    $:GPU_UPDATE(device='[gpu_boundary_v]')
end if
end if
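The host-side packing in the excerpt above — ragged per-patch triangle data copied into zero-initialised, uniformly sized buffers sized by the largest patch, so a GPU kernel can index every model the same way — can be sketched in Python. The list-based layout and names (`patches`, `ntrs`, `trs_v`) loosely mirror the Fortran arrays but are illustrative, not MFC's actual data layout:

```python
# Two hypothetical STL patches with different triangle counts.
patches = [
    [[(0, 0, 0), (1, 0, 0), (0, 1, 0)]],                         # patch 1: 1 tri
    [[(0, 0, 0), (1, 0, 0), (0, 0, 1)],
     [(1, 1, 1), (2, 1, 1), (1, 2, 1)]],                         # patch 2: 2 tris
]

max_ntrs = max(len(p) for p in patches)   # size buffers by the largest patch
ntrs = [len(p) for p in patches]          # per-patch valid-element count

# zero-initialised dense buffer: [patch][triangle][vertex][component]
trs_v = [[[[0.0] * 3 for _ in range(3)] for _ in range(max_ntrs)]
         for _ in range(len(patches))]

for pid, tris in enumerate(patches):
    for t, tri in enumerate(tris):
        for v, vert in enumerate(tri):
            trs_v[pid][t][v] = list(map(float, vert))
# at this point a single bulk device update would copy trs_v and ntrs over

assert ntrs == [1, 2]
assert trs_v[0][1] == [[0.0, 0.0, 0.0]] * 3  # padding past ntrs[0] stays zero
```

Kernels then loop `for t in range(ntrs[pid])`, never touching the zero padding; the cost is over-allocation up to `max_ntrs` per patch, which is the memory trade-off this review discusses.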
Add a finalizer for newly allocated module GPU buffers.
@:ALLOCATE is used for gpu_ntrs, gpu_trs_v, gpu_trs_n, gpu_boundary_edge_count, gpu_total_vertices, and gpu_boundary_v, but this file adds no matching teardown path. Repeated setup/restarts will leak memory.
💡 Proposed fix (module-side finalizer)
+ subroutine s_finalize_model_module()
+ integer :: pid
+
+ if (allocated(gpu_boundary_v)) @:DEALLOCATE(gpu_boundary_v)
+ if (allocated(gpu_trs_n)) @:DEALLOCATE(gpu_trs_n)
+ if (allocated(gpu_trs_v)) @:DEALLOCATE(gpu_trs_v)
+ if (allocated(gpu_total_vertices)) @:DEALLOCATE(gpu_total_vertices)
+ if (allocated(gpu_boundary_edge_count)) @:DEALLOCATE(gpu_boundary_edge_count)
+ if (allocated(gpu_ntrs)) @:DEALLOCATE(gpu_ntrs)
+
+ if (allocated(stl_bounding_boxes)) deallocate(stl_bounding_boxes)
+
+ if (allocated(models)) then
+ do pid = 1, size(models)
+ if (allocated(models(pid)%trs_v)) deallocate(models(pid)%trs_v)
+ if (allocated(models(pid)%trs_n)) deallocate(models(pid)%trs_n)
+ if (allocated(models(pid)%boundary_v)) deallocate(models(pid)%boundary_v)
+ if (allocated(models(pid)%model)) then
+ if (allocated(models(pid)%model%trs)) deallocate(models(pid)%model%trs)
+ deallocate(models(pid)%model)
+ end if
+ end do
+ deallocate(models)
+ end if
+ end subroutine s_finalize_model_module

As per coding guidelines: "Always pair @:ALLOCATE calls with matching @:DEALLOCATE calls in the corresponding finalization subroutine."
boundary_edge_count = gpu_boundary_edge_count(patch_id)
total_vertices = gpu_total_vertices(patch_id)
🧩 Analysis chain
🏁 Script executed:
# Find the routine containing lines 661-662 and show surrounding context
rg -n -B30 'boundary_edge_count = gpu_boundary_edge_count' src/simulation/m_compute_levelset.fpp | head -100

Repository: MFlowCode/MFC
Length of output: 1287
🏁 Script executed:
# Show more context after line 662 to see if total_vertices is used
rg -n -A30 'total_vertices = gpu_total_vertices' src/simulation/m_compute_levelset.fpp

Repository: MFlowCode/MFC
Length of output: 1564
🏁 Script executed:
# Check distance routine signatures to see if they use total_vertices parameter
rg -n -A15 'subroutine[[:space:]]+s_distance_normals' src/simulation src/common

Repository: MFlowCode/MFC
Length of output: 4161
🏁 Script executed:
# Search for all uses of total_vertices in the file
rg -n 'total_vertices' src/simulation/m_compute_levelset.fpp

Repository: MFlowCode/MFC
Length of output: 192
Remove the unused total_vertices load on line 662.
The variable total_vertices is assigned from gpu_total_vertices(patch_id) but is never consumed in this routine. The distance computation routines (s_distance_normals_2D and s_distance_normals_3D) do not accept total_vertices as a parameter, and the variable is not used elsewhere in s_model_levelset. Remove line 662 and the declaration of total_vertices from line 649.
src/simulation/m_ibm.fpp
Outdated
$:GPU_PARALLEL_LOOP(private='[i,j,k,ii,jj,kk,is_gp,local_idx]', copyin='[count,count_i, x_domain, y_domain, z_domain]', firstprivate='[gp_layers,gp_layers_z]', collapse=3)
do i = 0, m
    do j = 0, n
        if (p == 0) then
            ! 2D
            if (ib_markers%sf(i, j, 0) /= 0) then
                subsection_2D = ib_markers%sf( &
                    i - gp_layers:i + gp_layers, &
                    j - gp_layers:j + gp_layers, 0)
                if (any(subsection_2D == 0)) then
                    ghost_points_in(count)%loc = [i, j, 0]
                    patch_id = ib_markers%sf(i, j, 0)
                    ghost_points_in(count)%ib_patch_id = patch_id

                    ghost_points_in(count)%slip = patch_ib(patch_id)%slip
                    ! ghost_points(count)%rank = proc_rank

        do k = 0, p
            if (ib_markers%sf(i, j, k) /= 0) then
                is_gp = .false.
                marker_search: do ii = i - gp_layers, i + gp_layers
                    do jj = j - gp_layers, j + gp_layers
                        do kk = k - gp_layers_z, k + gp_layers_z
                            if (ib_markers%sf(ii, jj, kk) == 0) then
                                ! if any neighbors are not in the IB, it is a ghost point
                                is_gp = .true.
                                exit marker_search
                            end if
                        end do
                    end do
                end do marker_search

                if (is_gp) then
                    $:GPU_ATOMIC(atomic='capture')
                    count = count + 1
                    local_idx = count
                    $:END_GPU_ATOMIC_CAPTURE()

                    ghost_points_in(local_idx)%loc = [i, j, k]
                    patch_id = ib_markers%sf(i, j, k)
                    ghost_points_in(local_idx)%ib_patch_id = patch_id

                    ghost_points_in(local_idx)%slip = patch_ib(patch_id)%slip
patch_id should be private in this GPU kernel.
patch_id is assigned per-iteration (Lines 613 and 651) but is not listed in the private clause at Line 588, which can cause data races between threads.
💡 Proposed fix
- $:GPU_PARALLEL_LOOP(private='[i,j,k,ii,jj,kk,is_gp,local_idx]', copyin='[count,count_i, x_domain, y_domain, z_domain]', firstprivate='[gp_layers,gp_layers_z]', collapse=3)
+ $:GPU_PARALLEL_LOOP(private='[i,j,k,ii,jj,kk,is_gp,local_idx,patch_id]', copyin='[count,count_i, x_domain, y_domain, z_domain]', firstprivate='[gp_layers,gp_layers_z]', collapse=3)

As per coding guidelines: "Declare private(...) on loop-local variables in GPU kernels to avoid unintended data sharing between threads."
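The capture pattern in the kernel above — atomically incrementing the shared counter and keeping the captured value as a thread-private slot index — can be mimicked on the CPU with a lock. This is a Python analogy of the idea, not the GPU implementation, and all names are illustrative:

```python
import threading

count = 0                      # shared counter, like the kernel's `count`
lock = threading.Lock()        # stands in for the hardware atomic
ghost_points = [None] * 64     # preallocated output, like ghost_points_in

def record_ghost_point(loc):
    global count
    with lock:                 # atomic capture: increment + read as one step
        count += 1
        local_idx = count
    # slot local_idx is exclusively owned by this thread, so the write
    # below cannot race even though many threads run concurrently
    ghost_points[local_idx - 1] = loc

threads = [threading.Thread(target=record_ghost_point, args=((i, 0, 0),))
           for i in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert count == 16
assert sorted(gp[0] for gp in ghost_points[:16]) == list(range(16))
```

The key property is that increment and read happen as one indivisible step: if they were separate, two threads could observe the same index and overwrite each other's ghost point, which is exactly what the atomic capture prevents on the GPU.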
self.prohibit(ib and (num_ibs <= 0 or num_ibs > 1000),
              "num_ibs must be between 1 and num_patches_max (10)")
Fix stale validation message for num_ibs upper bound.
The check allows up to 1000, but the error text still says 10, which will mislead users.
Suggested change
- self.prohibit(ib and (num_ibs <= 0 or num_ibs > 1000),
- "num_ibs must be between 1 and num_patches_max (10)")
+ self.prohibit(ib and (num_ibs <= 0 or num_ibs > 1000),
+     "num_ibs must be between 1 and 1000")

  "3D_IGR_33jet", "1D_multispecies_diffusion",
  "2D_ibm_stl_MFCCharacter"]
Add a reason for the new skip entry.
Line 1044 adds 2D_ibm_stl_MFCCharacter to casesToSkip, which drops coverage on the same subsystem this PR optimizes. Please add a short reason and tracking issue so this skip is auditable and temporary.
… into gpu-optimizations
This branch now contains the contents for a generalized periodic IB implementation. We can close the implementation from @conraddelgado and move forward with this one. Conrad, you can close your branch and just work off of this one if you like.
Claude Code Review

Head SHA: …

Files changed: 24 files (+1639 / -1609)

Key files:
Summary
Findings

BUG (Critical): …
User description
Description
Following the refactor of the levelset, there were several performance optimizations left to be made to the code. This PR introduces optimizations that will make multi-particle MIBM code viable. It also expands the upper bound of allowed number of immersed boundaries to 1000. Performance was measured on 1-4 ranks of ACC GPU compute using A100 GPUs.
This PR has extended optimization to STL IBs, which should significantly improve accuracy, performance, and code cleanliness. The primary optimizations are as follows:
Type of change
Testing
All changes pass the IBM section of the test suite on GPUs with the NVHPC compiler. Performance was measured with a case of 1000 particles with viscosity enabled. The particles are all resolved 3D spheres given random non-overlapping positions generated by the following case file:
These optimizations deliver nearly a 1000× speedup in the moving-IBM propagation and generation code. Prior to these optimizations, profiling the benchmark case with NVIDIA Nsight showed 45 seconds to run a single RK substep:
Following these optimizations, the same profile achieves almost 50 ms per RK substep:

For STLs, the optimizations were tested on an 822,000-vertex mesh of a Mach 0.4 corgi, given by this STL:
https://www.thingiverse.com/thing:4721563/files
The final simulation finished in a total of 25 minutes on a 200^3 grid for 4k time steps on a single A100 GPU. All of the code related to the STL model (file reading, preprocessing, IB marker generation, and levelset compute) took only 20 seconds of the run time. The result of that simulation can be viewed here:
https://www.youtube.com/watch?v=h44BNCKo0Hs
Checklist
See the developer guide for full coding standards.
GPU changes (expand if you modified src/simulation/)

CodeAnt-AI Description
GPU-accelerate STL immersed-boundary compute and support up to 1000 IBs
What Changed
Impact
✅ Faster IB marker generation
✅ Lower CPU usage during IB setup and levelset evaluation
✅ Support for up to 1000 immersed boundaries