
ci: add interrupted-OTA swap-move canary #2648

Open
neilberkman wants to merge 1 commit into mcu-tools:main from neilberkman:tardigrade-ci

@neilberkman neilberkman commented Mar 4, 2026

MCUboot has had real interrupted-OTA regressions in swap-using-move and revert handling, where a power cut during an upgrade leaves the device bricked or booting the wrong image. This PR adds a post-merge canary for that failure class on main so that new regressions in that path are caught earlier in CI.

The canary is implemented with tardigrade, a Renode-based fault-injection harness for OTA boot paths. In one representative swap-using-move configuration on nrf52840dk_nrf52840, it builds MCUboot, builds and signs two hello_world images, injects power-loss faults across the upgrade path, and fails if any fault point bricks the device or leaves it on the wrong image.
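The pass/fail rule described above can be sketched as follows. This is a minimal illustration, not tardigrade's implementation: the `FaultResult` record, its field names, and the `allowed_versions` parameter are hypothetical, chosen only to show how a sweep over fault points collapses into a single CI verdict. Which image counts as "right" after a given interruption depends on the scenario, so the sketch takes the acceptable set as an input.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class FaultResult:
    """Outcome of one injected power cut (hypothetical record, not tardigrade's schema)."""
    fault_point: int               # index of the point in the upgrade where power was cut
    booted: bool                   # True if the device reached a running image afterwards
    image_version: Optional[str]   # version that ended up running, None if nothing booted


def failing_points(results: Iterable[FaultResult],
                   allowed_versions: Set[str]) -> List[FaultResult]:
    """Return every fault point that bricked the device or left it on a wrong image."""
    bad = []
    for r in results:
        bricked = not r.booted
        wrong_image = r.booted and r.image_version not in allowed_versions
        if bricked or wrong_image:
            bad.append(r)
    return bad


def canary_passes(results: Iterable[FaultResult],
                  allowed_versions: Set[str]) -> bool:
    """The canary passes only if no fault point produced a bad outcome."""
    return not failing_points(results, allowed_versions)
```

A CI wrapper built on this idea would exit non-zero whenever `canary_passes` returns `False` and would print the offending `fault_point` values so that a failure can be reproduced locally.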

What this workflow does:

  • builds MCUboot main in one representative swap-move configuration
  • builds two signed hello_world images with MCUboot's RSA-2048 test key so the image format matches the bootloader configuration
  • runs on main when relevant OTA files change, weekly on a schedule, and manually via workflow_dispatch
  • pins a stable Zephyr release tag so the signal stays about MCUboot rather than unrelated Zephyr main churn
  • pins the tardigrade action by full commit SHA
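Pinning by full commit SHA, as in the last item, can be checked mechanically. The helper below is a small sketch of such a check and is not part of this PR: the function name and the example references are hypothetical. It relies only on the fact that a full Git SHA-1 commit ID is 40 hexadecimal characters, so tags, branches, and abbreviated SHAs are rejected.

```python
import re

# owner/repo (optionally with a sub-path), then "@", then exactly 40 lowercase hex chars.
_FULL_SHA_REF = re.compile(r"[^@\s]+@[0-9a-f]{40}")


def is_sha_pinned(uses_ref: str) -> bool:
    """Return True if a workflow `uses:` reference is pinned to a full commit SHA."""
    return _FULL_SHA_REF.fullmatch(uses_ref.strip()) is not None
```

For example, `is_sha_pinned("example-org/example-action@v1")` is false, because a tag can be moved to a different commit later, while a reference ending in a 40-character SHA is accepted.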

Scope:

  • intended to catch swap-move resume and revert regressions in the #2100 / #2199 class
  • does not cover swap-scratch (#2109), different-sized images, alternate validation policies, or other swap modes
  • tardigrade can exercise those configurations locally, but this PR intentionally wires in one representative swap-move canary first

About the external action:

  • the tardigrade action is a stateless composite action
  • it downloads the caller-provided Renode tarball, installs Python dependencies, runs the local audit CLI, and uploads the resulting JSON report
  • there is no external service, backend, or telemetry path beyond the tool downloads already visible in the workflow

@neilberkman neilberkman requested a review from d3zd3z as a code owner March 4, 2026 04:44
@de-nordic de-nordic requested a review from davidvincze March 5, 2026 20:34
@de-nordic de-nordic added the CI label Mar 5, 2026
@neilberkman neilberkman changed the title ci: add tardigrade OTA resilience testing ci: add OTA resilience canary workflow Mar 5, 2026
@neilberkman neilberkman changed the title ci: add OTA resilience canary workflow ci: add swap-move OTA resilience canary Mar 6, 2026
@neilberkman neilberkman force-pushed the tardigrade-ci branch 2 times, most recently from 8fa90d5 to 498516c Compare March 6, 2026 12:07
@neilberkman neilberkman changed the title ci: add swap-move OTA resilience canary ci: add interrupted-OTA swap-move canary Mar 6, 2026
Pin tardigrade action at 2dd6e87 which includes:
- update_sequence multi-phase support
- sweep_hash_bypass_symbols rename (no more footgun)
- hook-fault command_drop implementation
- VTOR settle fix for copy-on-boot

Signed-off-by: Neil Berkman <neil@mirala.io>

2 participants