Add test sharding, proactive clean, and retry logic for self-hosted CI#1171
Add test sharding, proactive clean, and retry logic for self-hosted CI#1171sbryngelson wants to merge 17 commits intoMFlowCode:masterfrom
Conversation
The -s check already guarantees the file is non-empty, so NUM_FAILED > 0 is always true in that branch. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…zero-match guard - Include shard in SLURM job_slug to prevent output file collisions between parallel shards (e.g., test-gpu-acc-1-of-2.out) - Consolidate frontier/ and frontier_amd/ submit.sh and test.sh into identical scripts that derive compiler flag and config from directory - Add $shard_opts to CPU test branch for future-proofing - Add zero-match guard for --only filter to fail loudly instead of silently exiting 0 when no tests match - Hoist failed_uuids_path to single definition at top of test() - Compute log slug dynamically in test.yml for shard-aware filenames - Remove unnecessary shard: '' from non-sharded matrix entries - Replace useless cat|tr pipeline with tr < file Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The --only filter now detects whether each term is a UUID (8-char hex) or a trace label and applies appropriate matching: - Labels: AND logic (--only 2D Bubbles matches tests with both) - UUIDs: OR logic (--only UUID1 UUID2 matches tests with either) - Mixed: keep case if all labels match OR any UUID matches This preserves the documented behavior for label filtering while correctly supporting the CI retry path that passes multiple UUIDs. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
submit.sh now auto-detects job type (bench vs test) from the submitted script's basename, selecting the appropriate SBATCH account, time limit, and partition. This eliminates three submit-bench.sh files and makes frontier/ and frontier_amd/ scripts byte-identical via directory-name detection for compiler flags and cluster-specific options. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Raise MFCException when --shard produces zero cases (prevents silent green CI with nothing executed) - Pin nick-fields/retry to commit SHA for security on self-hosted runners with cluster credentials Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace per-case case-optimized builds with one generic build, reducing build time from ~34 min to ~5-10 min. Halve benchmark timesteps to compensate for slower non-optimized runtime. Reduce GPU --mem from 12 to 4 GB. Lower test build retry timeout from 480 to 60 minutes. Closes MFlowCode#1275 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (4)
.github/workflows/phoenix/bench.sh (1)
37-37: Quote$(nproc)to prevent word splitting.The static analysis tool correctly identifies that
$(nproc)should be quoted to prevent potential word splitting issues.Proposed fix
- if ./mfc.sh build -j $(nproc) $build_opts; then + if ./mfc.sh build -j "$(nproc)" $build_opts; thenAlso apply to line 53:
-./mfc.sh bench $bench_opts -j $(nproc) -o "$job_slug.yaml" -- -c phoenix-bench $device_opts -n $n_ranks +./mfc.sh bench $bench_opts -j "$(nproc)" -o "$job_slug.yaml" -- -c phoenix-bench $device_opts -n $n_ranks.github/workflows/frontier_amd/build.sh (1)
46-46: Quote the command substitution to prevent word splitting.Per shellcheck SC2046, the unquoted
$(...)can cause unexpected word splitting. While the current output is a single flag, quoting improves robustness.Suggested fix
- if ./mfc.sh test -v -a --dry-run $([ "$cluster_name" = "frontier" ] && echo "--rdma-mpi") -j 8 $build_opts; then + if ./mfc.sh test -v -a --dry-run "$([ "$cluster_name" = "frontier" ] && echo "--rdma-mpi")" -j 8 $build_opts; thenNote: Quoting the empty string when the condition is false will pass an empty argument. If
mfc.shcannot handle empty arguments gracefully, use a conditional approach instead:rdma_opt="" [ "$cluster_name" = "frontier" ] && rdma_opt="--rdma-mpi" if ./mfc.sh test -v -a --dry-run $rdma_opt -j 8 $build_opts; then.github/workflows/frontier_amd/submit.sh (1)
21-22: Quote$1to handle filenames with spaces or special characters.While unlikely in this CI context, quoting the variable is a shell best practice.
Suggested fix
if [ ! -z "$1" ]; then - sbatch_script_contents=`cat $1` + sbatch_script_contents="$(cat "$1")".github/workflows/phoenix/submit.sh (1)
12-13: Quote$1for robustness.Same recommendation as the frontier submit scripts.
Suggested fix
if [ ! -z "$1" ]; then - sbatch_script_contents=`cat $1` + sbatch_script_contents="$(cat "$1")"
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (24)
.github/scripts/submit_and_monitor_bench.sh.github/workflows/bench.yml.github/workflows/frontier/bench.sh.github/workflows/frontier/build.sh.github/workflows/frontier/submit-bench.sh.github/workflows/frontier/submit.sh.github/workflows/frontier/test.sh.github/workflows/frontier_amd/bench.sh.github/workflows/frontier_amd/build.sh.github/workflows/frontier_amd/submit-bench.sh.github/workflows/frontier_amd/submit.sh.github/workflows/frontier_amd/test.sh.github/workflows/phoenix/bench.sh.github/workflows/phoenix/submit-bench.sh.github/workflows/phoenix/submit.sh.github/workflows/test.ymlbenchmarks/5eq_rk3_weno3_hllc/case.pybenchmarks/hypo_hll/case.pybenchmarks/ibm/case.pybenchmarks/igr/case.pybenchmarks/viscous_weno5_sgb_acoustic/case.pytoolchain/mfc/bench.pytoolchain/mfc/cli/commands.pytoolchain/mfc/test/test.py
💤 Files with no reviewable changes (3)
- .github/workflows/frontier/submit-bench.sh
- .github/workflows/frontier_amd/submit-bench.sh
- .github/workflows/phoenix/submit-bench.sh
| shard_suffix="" | ||
| if [ -n "$4" ]; then | ||
| shard_suffix="-$(echo "$4" | sed 's|/|-of-|')" | ||
| fi | ||
| job_slug="`basename "$1" | sed 's/\.sh$//' | sed 's/[^a-zA-Z0-9]/-/g'`-$2-$3${shard_suffix}" |
There was a problem hiding this comment.
Validate shard format before deriving job_slug.
An invalid shard value currently gets queued and only fails later in test argument parsing. Early validation here avoids wasting scheduler time.
🔧 Proposed fix
shard_suffix=""
if [ -n "$4" ]; then
- shard_suffix="-$(echo "$4" | sed 's|/|-of-|')"
+ if [[ ! "$4" =~ ^[1-9][0-9]*/[1-9][0-9]*$ ]]; then
+ echo "ERROR: Invalid shard '$4'. Expected i/n (e.g., 1/2)."
+ exit 1
+ fi
+ shard_i="${4%/*}"
+ shard_n="${4#*/}"
+ if [ "$shard_i" -gt "$shard_n" ]; then
+ echo "ERROR: Invalid shard '$4'. Expected i<=n."
+ exit 1
+ fi
+ shard_suffix="-${shard_i}-of-${shard_n}"
fi📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| shard_suffix="" | |
| if [ -n "$4" ]; then | |
| shard_suffix="-$(echo "$4" | sed 's|/|-of-|')" | |
| fi | |
| job_slug="`basename "$1" | sed 's/\.sh$//' | sed 's/[^a-zA-Z0-9]/-/g'`-$2-$3${shard_suffix}" | |
| shard_suffix="" | |
| if [ -n "$4" ]; then | |
| if [[ ! "$4" =~ ^[1-9][0-9]*/[1-9][0-9]*$ ]]; then | |
| echo "ERROR: Invalid shard '$4'. Expected i/n (e.g., 1/2)." | |
| exit 1 | |
| fi | |
| shard_i="${4%/*}" | |
| shard_n="${4#*/}" | |
| if [ "$shard_i" -gt "$shard_n" ]; then | |
| echo "ERROR: Invalid shard '$4'. Expected i<=n." | |
| exit 1 | |
| fi | |
| shard_suffix="-${shard_i}-of-${shard_n}" | |
| fi | |
| job_slug="`basename "$1" | sed 's/\.sh$//' | sed 's/[^a-zA-Z0-9]/-/g'`-$2-$3${shard_suffix}" |
| uuids = [t for t in ARG("only") if is_uuid(t)] | ||
| labels = [t for t in ARG("only") if not is_uuid(t)] | ||
|
|
||
| for case in cases[:]: | ||
| check = set(case.trace.split(" -> ")) | ||
| check.add(case.get_uuid()) | ||
|
|
||
| label_ok = all(label in check for label in labels) if labels else True | ||
| uuid_ok = any(u in check for u in uuids) if uuids else True |
There was a problem hiding this comment.
Normalize --only UUID terms before matching.
Line 53 accepts uppercase UUID input, but Lines 60 and 63 perform case-sensitive membership checks. That can miss valid UUID filters like AB12CD34.
🔧 Proposed fix
- uuids = [t for t in ARG("only") if is_uuid(t)]
+ uuids = [t.lower() for t in ARG("only") if is_uuid(t)]
@@
- check.add(case.get_uuid())
+ check.add(case.get_uuid().lower())
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #1171 +/- ##
==========================================
- Coverage 44.05% 44.04% -0.02%
==========================================
Files 70 70
Lines 20496 20499 +3
Branches 1991 1993 +2
==========================================
- Hits 9029 9028 -1
- Misses 10328 10330 +2
- Partials 1139 1141 +2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Summary
Hardens self-hosted CI with test sharding, retry logic, and script deduplication.
Test sharding & retry
--shard i/nflag to./mfc.sh test— splits tests via modular arithmetic for even distribution--onlyand--shard— empty results raise an error instead of silent green CItests/failed_uuids.txtfailed_uuids.txtto prevent stale retries--onlyfilter improvements--onlymatching zero tests now raises an error instead of silently passingCI script consolidation
submit-bench.shintosubmit.shfor all 3 clusters (frontier, frontier_amd, phoenix) —submit.shauto-detects bench vs test mode from the submitted script's basenamefrontier/andfrontier_amd/scripts via directory-name detection —build.sh,bench.sh,submit.sh, andtest.share now byte-identical across both directoriesOther
--qos=normalon batch partition (1h59m, CFD154 account)--requeueon Phoenix SLURM jobs for preemption recoverynick-fields/retryto commit SHA for security on self-hosted runnersDepends on: #1170
Test plan
--requeueand preemption recoverysubmit.sh(bench mode auto-detected)frontier/andfrontier_amd/scripts are identical and detect cluster correctly--shardwith zero resulting tests raises an error (not silent pass)