perf: speed up firecracker standby with diff snapshot reuse#146
perf: speed up firecracker standby with diff snapshot reuse#146
Conversation
Avoid spending standby time waiting for hypervisors without a shutdown API to exit gracefully when the process will need an immediate kill anyway. Made-with: Cursor
Keep a hidden Firecracker diff snapshot base across restores so standby only writes changed pages while preserving the existing standby snapshot semantics exposed to users. Made-with: Cursor
Express retained snapshot-base behavior through hypervisor capabilities and generic snapshot-base paths so the diff snapshot optimization does not leak Firecracker-specific semantics. Made-with: Cursor
Move static hypervisor capabilities into a type-level registry so snapshot base reuse checks do not depend on starter instances or zero-value client construction. Made-with: Cursor
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Autofix Details
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Snapshot failure rollback preserves potentially corrupted base
- When snapshot creation fails after promoting a retained base, the rollback now deletes
snapshot-latestinstead of renaming it back tosnapshot-base, preventing reuse of potentially partially written data.
- When snapshot creation fails after promoting a retained base, the rollback now deletes
Or push these changes by commenting:
@cursor push 967ad1b1de
Preview (967ad1b1de)
diff --git a/lib/instances/standby.go b/lib/instances/standby.go
--- a/lib/instances/standby.go
+++ b/lib/instances/standby.go
@@ -105,8 +105,10 @@
log.ErrorContext(ctx, "snapshot failed, attempting to resume VM", "instance_id", id, "error", err)
hv.Resume(ctx)
if promotedRetainedBase {
- if rollbackErr := restoreRetainedSnapshotBase(snapshotDir, retainedBaseDir); rollbackErr != nil {
- log.WarnContext(ctx, "failed to restore retained snapshot base after snapshot error", "instance_id", id, "error", rollbackErr)
+ // Snapshot errors can happen after partial file writes; discard the
+ // promoted base to avoid reusing potentially corrupted data.
+ if rollbackErr := os.RemoveAll(snapshotDir); rollbackErr != nil {
+ log.WarnContext(ctx, "failed to discard promoted snapshot target after snapshot error", "instance_id", id, "error", rollbackErr)
}
}
return nil, fmt.Errorf("create snapshot: %w", err)Use explicit hypervisor capabilities to decide whether standby should attempt a graceful VMM shutdown, keeping unsupported shutdown semantics out of the core control flow. Made-with: Cursor
Avoid reusing partially written retained snapshot bases after standby snapshot failures, and lock in the rollback behavior with a regression test. Made-with: Cursor
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Autofix Details
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Lost base after failed snapshot yields incomplete diff
- When snapshot creation fails after promoting a retained base, the rollback path now moves
snapshot-latestback tosnapshot-baseinstead of deleting it, preserving a base for safe retry diff snapshots.
- When snapshot creation fails after promoting a retained base, the rollback path now moves
Or push these changes by commenting:
@cursor push 6875956f62
Preview (6875956f62)
diff --git a/lib/instances/standby.go b/lib/instances/standby.go
--- a/lib/instances/standby.go
+++ b/lib/instances/standby.go
@@ -105,7 +105,7 @@
log.ErrorContext(ctx, "snapshot failed, attempting to resume VM", "instance_id", id, "error", err)
hv.Resume(ctx)
if promotedExistingBase {
- if rollbackErr := discardPromotedRetainedSnapshotTarget(snapshotDir); rollbackErr != nil {
+ if rollbackErr := discardPromotedRetainedSnapshotTarget(snapshotDir, retainedBaseDir); rollbackErr != nil {
log.WarnContext(ctx, "failed to discard promoted snapshot target after snapshot error", "instance_id", id, "error", rollbackErr)
}
}
@@ -209,8 +209,8 @@
return false, nil
}
-func discardPromotedRetainedSnapshotTarget(snapshotDir string) error {
- return os.RemoveAll(snapshotDir)
+func discardPromotedRetainedSnapshotTarget(snapshotDir string, retainedBaseDir string) error {
+ return restoreRetainedSnapshotBase(snapshotDir, retainedBaseDir)
}
func restoreRetainedSnapshotBase(snapshotDir string, retainedBaseDir string) error {
diff --git a/lib/instances/standby_test.go b/lib/instances/standby_test.go
--- a/lib/instances/standby_test.go
+++ b/lib/instances/standby_test.go
@@ -29,10 +29,12 @@
// Simulate a partially written snapshot target before the snapshot API returns an error.
require.NoError(t, os.WriteFile(filepath.Join(snapshotDir, "partial-marker"), []byte("partial"), 0644))
- require.NoError(t, discardPromotedRetainedSnapshotTarget(snapshotDir))
+ require.NoError(t, discardPromotedRetainedSnapshotTarget(snapshotDir, retainedBaseDir))
_, err = os.Stat(snapshotDir)
assert.True(t, os.IsNotExist(err), "snapshot failures should discard the promoted snapshot target")
_, err = os.Stat(retainedBaseDir)
- assert.True(t, os.IsNotExist(err), "snapshot failures should not restore the promoted base for reuse")
+ require.NoError(t, err, "snapshot failures should restore the promoted base for reuse")
+ _, err = os.Stat(filepath.Join(retainedBaseDir, "base-marker"))
+ assert.NoError(t, err, "base snapshot contents should be restored for retry")
}Skip the graceful-exit wait when a hypervisor cannot shut down its VMM cleanly, and log best-effort resume failures so standby recovery errors are visible. Made-with: Cursor
Restore the structural shutdown comments and split the graceful-exit decision into explicit branches so unsupported shutdown paths are easier to follow. Made-with: Cursor
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Autofix Details
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Ambiguous duplicate
forceKillHypervisorProcessname in same package- Renamed the standalone standby helper to
forceKillHypervisorPIDand updated its call sites to remove the ambiguous duplicate name with the manager method.
- Renamed the standalone standby helper to
Or push these changes by commenting:
@cursor push b3649dc9e9
Preview (b3649dc9e9)
diff --git a/lib/instances/standby.go b/lib/instances/standby.go
--- a/lib/instances/standby.go
+++ b/lib/instances/standby.go
@@ -263,13 +263,13 @@
log.DebugContext(ctx, "hypervisor shutdown gracefully", "instance_id", inst.Id, "pid", pid)
} else {
log.WarnContext(ctx, "hypervisor did not exit gracefully in time, force killing process", "instance_id", inst.Id, "pid", pid)
- if err := forceKillHypervisorProcess(pid); err != nil {
+ if err := forceKillHypervisorPID(pid); err != nil {
return err
}
}
} else {
log.DebugContext(ctx, "skipping graceful exit wait; force killing hypervisor process", "instance_id", inst.Id, "pid", pid)
- if err := forceKillHypervisorProcess(pid); err != nil {
+ if err := forceKillHypervisorPID(pid); err != nil {
return err
}
}
@@ -282,7 +282,7 @@
return nil
}
-func forceKillHypervisorProcess(pid int) error {
+func forceKillHypervisorPID(pid int) error {
if err := syscall.Kill(pid, syscall.SIGKILL); err != nil {
if err == syscall.ESRCH {
return nilUse full Firecracker snapshots when the retained memory base is missing so first-time standby and post-failure retries do not produce incomplete diff snapshots. Made-with: Cursor
Exercise Firecracker standby/restore with guest file persistence across the initial full snapshot cycle and a subsequent retained-base diff snapshot cycle, and log the observed timings in the integration test. Made-with: Cursor
Rename the standby-only SIGKILL helper to avoid colliding with the stop-path implementation, and only clean up retained snapshot bases on stop for hypervisors that actually support snapshot base reuse. Made-with: Cursor
Keep the full-versus-diff standby timing data in the Firecracker integration test while avoiding hard CI assertions on runner-dependent latency. Made-with: Cursor
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix prepared a fix for the issue found in the latest run.
- ✅ Fixed: Stale snapshot dir blocks retries when reusing base
- Snapshot creation now removes stale
snapshotDirunless a retained base was actually promoted, so retries no longer get stuck in invalid diff-snapshot mode.
- Snapshot creation now removes stale
Or push these changes by commenting:
@cursor push ccfc1bcdcd
Preview (ccfc1bcdcd)
diff --git a/lib/instances/standby.go b/lib/instances/standby.go
--- a/lib/instances/standby.go
+++ b/lib/instances/standby.go
@@ -102,7 +102,7 @@
}
}
log.DebugContext(ctx, "creating snapshot", "instance_id", id, "snapshot_dir", snapshotDir)
- if err := createSnapshot(ctx, hv, snapshotDir, reuseSnapshotBase); err != nil {
+ if err := createSnapshot(ctx, hv, snapshotDir, reuseSnapshotBase, promotedExistingBase); err != nil {
// Snapshot failed - try to resume VM
log.ErrorContext(ctx, "snapshot failed, attempting to resume VM", "instance_id", id, "error", err)
if resumeErr := hv.Resume(ctx); resumeErr != nil {
@@ -166,13 +166,14 @@
return &finalInst, nil
}
-// createSnapshot creates a snapshot using the hypervisor interface
-func createSnapshot(ctx context.Context, hv hypervisor.Hypervisor, snapshotDir string, reuseSnapshotBase bool) error {
+// createSnapshot creates a snapshot using the hypervisor interface.
+// When snapshot-base reuse is enabled, we only keep snapshotDir contents if a
+// retained base was promoted into place for this snapshot attempt.
+func createSnapshot(ctx context.Context, hv hypervisor.Hypervisor, snapshotDir string, reuseSnapshotBase bool, promotedExistingBase bool) error {
log := logger.FromContext(ctx)
- // Remove old snapshot if the hypervisor does not support reusing snapshots
- // (diff-based snapshots).
- if !reuseSnapshotBase {
+ // Remove stale snapshot data unless we are actively reusing a promoted base.
+ if !reuseSnapshotBase || !promotedExistingBase {
os.RemoveAll(snapshotDir)
}
diff --git a/lib/instances/standby_test.go b/lib/instances/standby_test.go
--- a/lib/instances/standby_test.go
+++ b/lib/instances/standby_test.go
@@ -1,10 +1,14 @@
package instances
import (
+ "context"
+ "fmt"
"os"
"path/filepath"
"testing"
+ "time"
+ "github.com/kernel/hypeman/lib/hypervisor"
"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
)
@@ -36,3 +40,66 @@
_, err = os.Stat(retainedBaseDir)
assert.True(t, os.IsNotExist(err), "snapshot failures should not restore the promoted base for reuse")
}
+
+type snapshotValidationHypervisor struct {
+ validateSnapshotDir func(destPath string) error
+}
+
+func (h snapshotValidationHypervisor) DeleteVM(context.Context) error { return nil }
+
+func (h snapshotValidationHypervisor) Shutdown(context.Context) error { return nil }
+
+func (h snapshotValidationHypervisor) GetVMInfo(context.Context) (*hypervisor.VMInfo, error) {
+ return nil, hypervisor.ErrNotSupported
+}
+
+func (h snapshotValidationHypervisor) Pause(context.Context) error { return nil }
+
+func (h snapshotValidationHypervisor) Resume(context.Context) error { return nil }
+
+func (h snapshotValidationHypervisor) Snapshot(_ context.Context, destPath string) error {
+ if h.validateSnapshotDir != nil {
+ if err := h.validateSnapshotDir(destPath); err != nil {
+ return err
+ }
+ }
+ return os.WriteFile(filepath.Join(destPath, "snapshot-created"), []byte("ok"), 0644)
+}
+
+func (h snapshotValidationHypervisor) ResizeMemory(context.Context, int64) error {
+ return hypervisor.ErrNotSupported
+}
+
+func (h snapshotValidationHypervisor) ResizeMemoryAndWait(context.Context, int64, time.Duration) error {
+ return hypervisor.ErrNotSupported
+}
+
+func (h snapshotValidationHypervisor) Capabilities() hypervisor.Capabilities {
+ return hypervisor.Capabilities{SupportsSnapshot: true}
+}
+
+func TestCreateSnapshotRemovesStaleTargetWhenNoBasePromoted(t *testing.T) {
+ t.Parallel()
+
+ snapshotDir := filepath.Join(t.TempDir(), "snapshot-latest")
+ staleMarker := filepath.Join(snapshotDir, "memory")
+ require.NoError(t, os.MkdirAll(snapshotDir, 0755))
+ require.NoError(t, os.WriteFile(staleMarker, []byte("partial"), 0644))
+
+ hv := snapshotValidationHypervisor{
+ validateSnapshotDir: func(destPath string) error {
+ if _, err := os.Stat(filepath.Join(destPath, "memory")); err == nil {
+ return fmt.Errorf("stale memory file should be removed before snapshot")
+ }
+ return nil
+ },
+ }
+
+ err := createSnapshot(context.Background(), hv, snapshotDir, true, false)
+ require.NoError(t, err)
+
+ _, err = os.Stat(staleMarker)
+ assert.True(t, os.IsNotExist(err), "stale snapshot memory file should be removed on retry")
+ _, err = os.Stat(filepath.Join(snapshotDir, "snapshot-created"))
+ require.NoError(t, err, "snapshot API should still run after stale cleanup")
+}Keep the network integration checks focused on TAP existence, bridge attachment, and end-to-end connectivity instead of requiring a specific oper state that varies on the shared runner. Made-with: Cursor
Discard leftover snapshot-latest scratch state before promoting a retained base so failed Firecracker standby attempts can retry cleanly instead of getting stuck on a stale diff snapshot target. Made-with: Cursor
|
love use of autoresearch :) |


Summary
snapshot-latestis only visible while an instance is actually inStandby, while the retainedsnapshot-basestays internal and gets cleaned up on stopErrNotSupportedcontrol flowWhy These Refactors Are In This PR
SupportsSnapshotBaseReuselets the manager reason about retained bases generically instead of branching onhypervisor.Type.Shutdown()returnsErrNotSupported", but that is really a static property of the hypervisor, not a runtime error case.SupportsGracefulVMMShutdownmakes that intent explicit and keeps error handling separate from control flow.Firecracker Snapshot Behavior
SnapshotType: Fullwhen no retained memory base exists.This covers the first standby after boot and retries after a failed snapshot discarded the retained base.
SnapshotType: Diffonly when the retained memory base already exists.In the normal happy path, that means the second and later standby cycles reuse
snapshot-baseand only write changed pages.snapshot-latestis moved back into the hiddensnapshot-baseso the next standby can reuse it.Behavior Changes
StopInstancenow removes both visible standby snapshots and any retained hidden base.Performance
Local benchmark harness from #145:
score_ms 5146.5,standby_p50_ms 5032.0score_ms 264.7,standby_p50_ms 157.0Integration timing run from
go test ./lib/instances -run TestFirecrackerStandbyAndRestore -v:3.33545878s101.817536ms1.618021087s557.88755ms103.666293ms115.73631msThat integration test also verifies guest state persists across both cycles by writing one file before the first standby, another after the first restore, and asserting both files still exist after the second restore.
Test Plan
make test TEST=TestFirecrackerStandbyAndRestoremake test TEST=TestFirecrackerStopClearsStaleSnapshotgo test ./lib/hypervisor/...go test ./lib/hypervisor/firecracker -run TestSnapshotParamPathsgo test ./lib/instances -run 'TestDiscardPromotedRetainedSnapshotTargetAfterSnapshotError'go test ./lib/instances -run TestFirecrackerStandbyAndRestore -vprepare.go+run.go -budget 180sNote
Medium Risk
Touches core instance lifecycle (standby/restore/stop) and process-shutdown behavior, so regressions could leave stale snapshots or incorrectly kill/retain VMM processes. Changes are capability-gated and covered by expanded unit/integration tests, reducing but not eliminating risk.
Overview
Speeds up Firecracker standby by reusing a hidden retained snapshot base across restore/standby cycles, so later standbys can write diff snapshots instead of full memory images; a new internal
snapshot-basedirectory is introduced andsnapshot-latestremains the only user-visible standby artifact.Adds a type-level hypervisor capability registry (
RegisterCapabilities/CapabilitiesForType) and extendshypervisor.CapabilitieswithSupportsSnapshotBaseReuseandSupportsGracefulVMMShutdown, which now drive orchestration decisions (retaining snapshot base on restore, cleaning it on stop, and skipping graceful shutdown waits for hypervisors that can’t do it).Hardens snapshot lifecycle handling: promotes/rolls back retained bases around standby snapshot creation, moves
snapshot-latesttosnapshot-baseafter successful restore, and ensures both visible and hidden snapshot dirs are removed on stop; tests are updated/added to validate full-vs-diff snapshot selection and retained-base cleanup/rollback behavior.Written by Cursor Bugbot for commit 5b0c4c1. This will update automatically on new commits. Configure here.