Skip to content

perf: speed up firecracker standby with diff snapshot reuse#146

Merged
rgarcia merged 14 commits intomainfrom
firecracker-diff-snapshot-bases-pr
Mar 16, 2026
Merged

perf: speed up firecracker standby with diff snapshot reuse#146
rgarcia merged 14 commits intomainfrom
firecracker-diff-snapshot-bases-pr

Conversation

@rgarcia
Copy link
Contributor

@rgarcia rgarcia commented Mar 13, 2026

Summary

  • cut Firecracker standby latency by reusing a hidden retained snapshot base across restore/standby cycles so later standby snapshots can use fast diff snapshots instead of rewriting a full memory image
  • keep the user-facing standby semantics the same: snapshot-latest is only visible while an instance is actually in Standby, while the retained snapshot-base stays internal and gets cleaned up on stop
  • model snapshot-base reuse and graceful VMM shutdown as hypervisor capabilities instead of encoding those decisions in Firecracker-specific paths or ErrNotSupported control flow
  • move static capability lookup to a type-level registry so orchestration code can make lifecycle decisions from hypervisor type without constructing starter/client values just to ask about static behavior
  • harden snapshot failure rollback by discarding a promoted snapshot target after snapshot errors so we do not reuse a partially written retained base on the next standby cycle
  • local benchmark harness and workflow live in companion PR feat: add firecracker standby autoresearch harness #145

Why These Refactors Are In This PR

  • The original performance win came from Firecracker-specific behavior, but the lifecycle code was cleaner once the reusable parts were expressed as capabilities. SupportsSnapshotBaseReuse lets the manager reason about retained bases generically instead of branching on hypervisor.Type.
  • The graceful shutdown optimization started as "skip the wait when Shutdown() returns ErrNotSupported", but that is really a static property of the hypervisor, not a runtime error case. SupportsGracefulVMMShutdown makes that intent explicit and keeps error handling separate from control flow.
  • Registering capabilities by hypervisor type avoids the earlier smell of needing a starter or zero-value client instance just to answer static questions about what a hypervisor supports.

Firecracker Snapshot Behavior

  • Firecracker now uses SnapshotType: Full when no retained memory base exists.
    This covers the first standby after boot and retries after a failed snapshot discarded the retained base.
  • Firecracker uses SnapshotType: Diff only when the retained memory base already exists.
    In the normal happy path, that means the second and later standby cycles reuse snapshot-base and only write changed pages.
  • After a successful restore, snapshot-latest is moved back into the hidden snapshot-base so the next standby can reuse it.

Behavior Changes

  • StopInstance now removes both visible standby snapshots and any retained hidden base.
  • On snapshot failure after promoting a retained base, the promoted directory is discarded instead of renamed back into place.
  • Graceful VMM shutdown behavior is now driven by hypervisor capabilities, and unsupported graceful shutdown paths skip the wait and go straight to force-kill.

Performance

Local benchmark harness from #145:

  • Baseline: score_ms 5146.5, standby_p50_ms 5032.0
  • Cleaned-up retained diff-base result: score_ms 264.7, standby_p50_ms 157.0

Integration timing run from go test ./lib/instances -run TestFirecrackerStandbyAndRestore -v:

  • first standby, full snapshot expected: 3.33545878s
  • first restore to running: 101.817536ms
  • first restore to exec-ready: 1.618021087s
  • second standby, diff snapshot expected: 557.88755ms
  • second restore to running: 103.666293ms
  • second restore to exec-ready: 115.73631ms

That integration test also verifies guest state persists across both cycles by writing one file before the first standby, another after the first restore, and asserting both files still exist after the second restore.

Test Plan

  • make test TEST=TestFirecrackerStandbyAndRestore
  • make test TEST=TestFirecrackerStopClearsStaleSnapshot
  • go test ./lib/hypervisor/...
  • go test ./lib/hypervisor/firecracker -run TestSnapshotParamPaths
  • go test ./lib/instances -run 'TestDiscardPromotedRetainedSnapshotTargetAfterSnapshotError'
  • go test ./lib/instances -run TestFirecrackerStandbyAndRestore -v
  • benchmark locally with the harness from feat: add firecracker standby autoresearch harness #145 via prepare.go + run.go -budget 180s

Note

Medium Risk
Touches core instance lifecycle (standby/restore/stop) and process-shutdown behavior, so regressions could leave stale snapshots or incorrectly kill/retain VMM processes. Changes are capability-gated and covered by expanded unit/integration tests, reducing but not eliminating risk.

Overview
Speeds up Firecracker standby by reusing a hidden retained snapshot base across restore/standby cycles, so later standbys can write diff snapshots instead of full memory images; a new internal snapshot-base directory is introduced and snapshot-latest remains the only user-visible standby artifact.

Adds a type-level hypervisor capability registry (RegisterCapabilities/CapabilitiesForType) and extends hypervisor.Capabilities with SupportsSnapshotBaseReuse and SupportsGracefulVMMShutdown, which now drive orchestration decisions (retaining snapshot base on restore, cleaning it on stop, and skipping graceful shutdown waits for hypervisors that can’t do it).

Hardens snapshot lifecycle handling: promotes/rolls back retained bases around standby snapshot creation, moves snapshot-latest to snapshot-base after successful restore, and ensures both visible and hidden snapshot dirs are removed on stop; tests are updated/added to validate full-vs-diff snapshot selection and retained-base cleanup/rollback behavior.

Written by Cursor Bugbot for commit 5b0c4c1. This will update automatically on new commits. Configure here.

rgarcia added 2 commits March 13, 2026 15:44
Avoid spending standby time waiting for hypervisors without a shutdown API to exit gracefully when the process will need an immediate kill anyway.

Made-with: Cursor
Keep a hidden Firecracker diff snapshot base across restores so standby only writes changed pages while preserving the existing standby snapshot semantics exposed to users.

Made-with: Cursor
rgarcia added 2 commits March 13, 2026 16:05
Express retained snapshot-base behavior through hypervisor capabilities and generic snapshot-base paths so the diff snapshot optimization does not leak Firecracker-specific semantics.

Made-with: Cursor
Move static hypervisor capabilities into a type-level registry so snapshot base reuse checks do not depend on starter instances or zero-value client construction.

Made-with: Cursor
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Snapshot failure rollback preserves potentially corrupted base
    • When snapshot creation fails after promoting a retained base, the rollback now deletes snapshot-latest instead of renaming it back to snapshot-base, preventing reuse of potentially partially written data.

Create PR

Or push these changes by commenting:

@cursor push 967ad1b1de
Preview (967ad1b1de)
diff --git a/lib/instances/standby.go b/lib/instances/standby.go
--- a/lib/instances/standby.go
+++ b/lib/instances/standby.go
@@ -105,8 +105,10 @@
 		log.ErrorContext(ctx, "snapshot failed, attempting to resume VM", "instance_id", id, "error", err)
 		hv.Resume(ctx)
 		if promotedRetainedBase {
-			if rollbackErr := restoreRetainedSnapshotBase(snapshotDir, retainedBaseDir); rollbackErr != nil {
-				log.WarnContext(ctx, "failed to restore retained snapshot base after snapshot error", "instance_id", id, "error", rollbackErr)
+			// Snapshot errors can happen after partial file writes; discard the
+			// promoted base to avoid reusing potentially corrupted data.
+			if rollbackErr := os.RemoveAll(snapshotDir); rollbackErr != nil {
+				log.WarnContext(ctx, "failed to discard promoted snapshot target after snapshot error", "instance_id", id, "error", rollbackErr)
 			}
 		}
 		return nil, fmt.Errorf("create snapshot: %w", err)

rgarcia added 2 commits March 14, 2026 18:21
Use explicit hypervisor capabilities to decide whether standby should attempt a graceful VMM shutdown, keeping unsupported shutdown semantics out of the core control flow.

Made-with: Cursor
Avoid reusing partially written retained snapshot bases after standby snapshot failures, and lock in the rollback behavior with a regression test.

Made-with: Cursor
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Lost base after failed snapshot yields incomplete diff
    • When snapshot creation fails after promoting a retained base, the rollback path now moves snapshot-latest back to snapshot-base instead of deleting it, preserving a base for safe retry diff snapshots.

Create PR

Or push these changes by commenting:

@cursor push 6875956f62
Preview (6875956f62)
diff --git a/lib/instances/standby.go b/lib/instances/standby.go
--- a/lib/instances/standby.go
+++ b/lib/instances/standby.go
@@ -105,7 +105,7 @@
 		log.ErrorContext(ctx, "snapshot failed, attempting to resume VM", "instance_id", id, "error", err)
 		hv.Resume(ctx)
 		if promotedExistingBase {
-			if rollbackErr := discardPromotedRetainedSnapshotTarget(snapshotDir); rollbackErr != nil {
+			if rollbackErr := discardPromotedRetainedSnapshotTarget(snapshotDir, retainedBaseDir); rollbackErr != nil {
 				log.WarnContext(ctx, "failed to discard promoted snapshot target after snapshot error", "instance_id", id, "error", rollbackErr)
 			}
 		}
@@ -209,8 +209,8 @@
 	return false, nil
 }
 
-func discardPromotedRetainedSnapshotTarget(snapshotDir string) error {
-	return os.RemoveAll(snapshotDir)
+func discardPromotedRetainedSnapshotTarget(snapshotDir string, retainedBaseDir string) error {
+	return restoreRetainedSnapshotBase(snapshotDir, retainedBaseDir)
 }
 
 func restoreRetainedSnapshotBase(snapshotDir string, retainedBaseDir string) error {

diff --git a/lib/instances/standby_test.go b/lib/instances/standby_test.go
--- a/lib/instances/standby_test.go
+++ b/lib/instances/standby_test.go
@@ -29,10 +29,12 @@
 	// Simulate a partially written snapshot target before the snapshot API returns an error.
 	require.NoError(t, os.WriteFile(filepath.Join(snapshotDir, "partial-marker"), []byte("partial"), 0644))
 
-	require.NoError(t, discardPromotedRetainedSnapshotTarget(snapshotDir))
+	require.NoError(t, discardPromotedRetainedSnapshotTarget(snapshotDir, retainedBaseDir))
 
 	_, err = os.Stat(snapshotDir)
 	assert.True(t, os.IsNotExist(err), "snapshot failures should discard the promoted snapshot target")
 	_, err = os.Stat(retainedBaseDir)
-	assert.True(t, os.IsNotExist(err), "snapshot failures should not restore the promoted base for reuse")
+	require.NoError(t, err, "snapshot failures should restore the promoted base for reuse")
+	_, err = os.Stat(filepath.Join(retainedBaseDir, "base-marker"))
+	assert.NoError(t, err, "base snapshot contents should be restored for retry")
 }

rgarcia added 2 commits March 15, 2026 11:42
Skip the graceful-exit wait when a hypervisor cannot shut down its VMM cleanly, and log best-effort resume failures so standby recovery errors are visible.

Made-with: Cursor
Restore the structural shutdown comments and split the graceful-exit decision into explicit branches so unsupported shutdown paths are easier to follow.

Made-with: Cursor
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Ambiguous duplicate forceKillHypervisorProcess name in same package
    • Renamed the standalone standby helper to forceKillHypervisorPID and updated its call sites to remove the ambiguous duplicate name with the manager method.

Create PR

Or push these changes by commenting:

@cursor push b3649dc9e9
Preview (b3649dc9e9)
diff --git a/lib/instances/standby.go b/lib/instances/standby.go
--- a/lib/instances/standby.go
+++ b/lib/instances/standby.go
@@ -263,13 +263,13 @@
 				log.DebugContext(ctx, "hypervisor shutdown gracefully", "instance_id", inst.Id, "pid", pid)
 			} else {
 				log.WarnContext(ctx, "hypervisor did not exit gracefully in time, force killing process", "instance_id", inst.Id, "pid", pid)
-				if err := forceKillHypervisorProcess(pid); err != nil {
+				if err := forceKillHypervisorPID(pid); err != nil {
 					return err
 				}
 			}
 		} else {
 			log.DebugContext(ctx, "skipping graceful exit wait; force killing hypervisor process", "instance_id", inst.Id, "pid", pid)
-			if err := forceKillHypervisorProcess(pid); err != nil {
+			if err := forceKillHypervisorPID(pid); err != nil {
 				return err
 			}
 		}
@@ -282,7 +282,7 @@
 	return nil
 }
 
-func forceKillHypervisorProcess(pid int) error {
+func forceKillHypervisorPID(pid int) error {
 	if err := syscall.Kill(pid, syscall.SIGKILL); err != nil {
 		if err == syscall.ESRCH {
 			return nil

rgarcia added 3 commits March 15, 2026 12:13
Use full Firecracker snapshots when the retained memory base is missing so first-time standby and post-failure retries do not produce incomplete diff snapshots.

Made-with: Cursor
Exercise Firecracker standby/restore with guest file persistence across the initial full snapshot cycle and a subsequent retained-base diff snapshot cycle, and log the observed timings in the integration test.

Made-with: Cursor
Rename the standby-only SIGKILL helper to avoid colliding with the stop-path implementation, and only clean up retained snapshot bases on stop for hypervisors that actually support snapshot base reuse.

Made-with: Cursor
@rgarcia rgarcia requested a review from sjmiller609 March 15, 2026 16:56
Keep the full-versus-diff standby timing data in the Firecracker integration test while avoiding hard CI assertions on runner-dependent latency.

Made-with: Cursor
Copy link

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Stale snapshot dir blocks retries when reusing base
    • Snapshot creation now removes stale snapshotDir unless a retained base was actually promoted, so retries no longer get stuck in invalid diff-snapshot mode.

Create PR

Or push these changes by commenting:

@cursor push ccfc1bcdcd
Preview (ccfc1bcdcd)
diff --git a/lib/instances/standby.go b/lib/instances/standby.go
--- a/lib/instances/standby.go
+++ b/lib/instances/standby.go
@@ -102,7 +102,7 @@
 		}
 	}
 	log.DebugContext(ctx, "creating snapshot", "instance_id", id, "snapshot_dir", snapshotDir)
-	if err := createSnapshot(ctx, hv, snapshotDir, reuseSnapshotBase); err != nil {
+	if err := createSnapshot(ctx, hv, snapshotDir, reuseSnapshotBase, promotedExistingBase); err != nil {
 		// Snapshot failed - try to resume VM
 		log.ErrorContext(ctx, "snapshot failed, attempting to resume VM", "instance_id", id, "error", err)
 		if resumeErr := hv.Resume(ctx); resumeErr != nil {
@@ -166,13 +166,14 @@
 	return &finalInst, nil
 }
 
-// createSnapshot creates a snapshot using the hypervisor interface
-func createSnapshot(ctx context.Context, hv hypervisor.Hypervisor, snapshotDir string, reuseSnapshotBase bool) error {
+// createSnapshot creates a snapshot using the hypervisor interface.
+// When snapshot-base reuse is enabled, we only keep snapshotDir contents if a
+// retained base was promoted into place for this snapshot attempt.
+func createSnapshot(ctx context.Context, hv hypervisor.Hypervisor, snapshotDir string, reuseSnapshotBase bool, promotedExistingBase bool) error {
 	log := logger.FromContext(ctx)
 
-	// Remove old snapshot if the hypervisor does not support reusing snapshots
-	// (diff-based snapshots).
-	if !reuseSnapshotBase {
+	// Remove stale snapshot data unless we are actively reusing a promoted base.
+	if !reuseSnapshotBase || !promotedExistingBase {
 		os.RemoveAll(snapshotDir)
 	}
 

diff --git a/lib/instances/standby_test.go b/lib/instances/standby_test.go
--- a/lib/instances/standby_test.go
+++ b/lib/instances/standby_test.go
@@ -1,10 +1,14 @@
 package instances
 
 import (
+	"context"
+	"fmt"
 	"os"
 	"path/filepath"
 	"testing"
+	"time"
 
+	"github.com/kernel/hypeman/lib/hypervisor"
 	"github.com/stretchr/testify/assert"
 	"github.com/stretchr/testify/require"
 )
@@ -36,3 +40,66 @@
 	_, err = os.Stat(retainedBaseDir)
 	assert.True(t, os.IsNotExist(err), "snapshot failures should not restore the promoted base for reuse")
 }
+
+type snapshotValidationHypervisor struct {
+	validateSnapshotDir func(destPath string) error
+}
+
+func (h snapshotValidationHypervisor) DeleteVM(context.Context) error { return nil }
+
+func (h snapshotValidationHypervisor) Shutdown(context.Context) error { return nil }
+
+func (h snapshotValidationHypervisor) GetVMInfo(context.Context) (*hypervisor.VMInfo, error) {
+	return nil, hypervisor.ErrNotSupported
+}
+
+func (h snapshotValidationHypervisor) Pause(context.Context) error { return nil }
+
+func (h snapshotValidationHypervisor) Resume(context.Context) error { return nil }
+
+func (h snapshotValidationHypervisor) Snapshot(_ context.Context, destPath string) error {
+	if h.validateSnapshotDir != nil {
+		if err := h.validateSnapshotDir(destPath); err != nil {
+			return err
+		}
+	}
+	return os.WriteFile(filepath.Join(destPath, "snapshot-created"), []byte("ok"), 0644)
+}
+
+func (h snapshotValidationHypervisor) ResizeMemory(context.Context, int64) error {
+	return hypervisor.ErrNotSupported
+}
+
+func (h snapshotValidationHypervisor) ResizeMemoryAndWait(context.Context, int64, time.Duration) error {
+	return hypervisor.ErrNotSupported
+}
+
+func (h snapshotValidationHypervisor) Capabilities() hypervisor.Capabilities {
+	return hypervisor.Capabilities{SupportsSnapshot: true}
+}
+
+func TestCreateSnapshotRemovesStaleTargetWhenNoBasePromoted(t *testing.T) {
+	t.Parallel()
+
+	snapshotDir := filepath.Join(t.TempDir(), "snapshot-latest")
+	staleMarker := filepath.Join(snapshotDir, "memory")
+	require.NoError(t, os.MkdirAll(snapshotDir, 0755))
+	require.NoError(t, os.WriteFile(staleMarker, []byte("partial"), 0644))
+
+	hv := snapshotValidationHypervisor{
+		validateSnapshotDir: func(destPath string) error {
+			if _, err := os.Stat(filepath.Join(destPath, "memory")); err == nil {
+				return fmt.Errorf("stale memory file should be removed before snapshot")
+			}
+			return nil
+		},
+	}
+
+	err := createSnapshot(context.Background(), hv, snapshotDir, true, false)
+	require.NoError(t, err)
+
+	_, err = os.Stat(staleMarker)
+	assert.True(t, os.IsNotExist(err), "stale snapshot memory file should be removed on retry")
+	_, err = os.Stat(filepath.Join(snapshotDir, "snapshot-created"))
+	require.NoError(t, err, "snapshot API should still run after stale cleanup")
+}

rgarcia added 2 commits March 15, 2026 16:04
Keep the network integration checks focused on TAP existence, bridge attachment, and end-to-end connectivity instead of requiring a specific oper state that varies on the shared runner.

Made-with: Cursor
Discard leftover snapshot-latest scratch state before promoting a retained base so failed Firecracker standby attempts can retry cleanly instead of getting stuck on a stale diff snapshot target.

Made-with: Cursor
@rgarcia rgarcia merged commit 6d85d59 into main Mar 16, 2026
6 checks passed
@rgarcia rgarcia deleted the firecracker-diff-snapshot-bases-pr branch March 16, 2026 17:38
@ryw
Copy link
Contributor

ryw commented Mar 18, 2026

love use of autoresearch :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants