Add startup probe to calico-node for faster rolling updates#4562

Closed
caseydavenport wants to merge 1 commit into tigera:master from caseydavenport:casey-readiness-probe-interval

@caseydavenport (Member) commented Mar 17, 2026

The calico-node readiness probe checks Felix health and BIRD status via cheap local calls (HTTP to localhost, birdcl on a unix socket). These complete in milliseconds.

This adds a startup probe with periodSeconds: 5 and failureThreshold: 24 (2 minute startup budget). Kubernetes doesn't start readiness/liveness probes until the startup probe succeeds, so this gives fast initial ready detection during pod startup while keeping the steady-state readiness check at the default 10s interval.
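For illustration, the probe settings described above might look roughly like the following in the calico-node DaemonSet container spec. This is only a sketch: the PR's actual diff is not shown in this conversation, and the `exec` command and its flags are an assumption based on the readiness check in the upstream calico-node manifest. Only `periodSeconds: 5`, `failureThreshold: 24`, and the default 10s readiness period come from the description.

```yaml
# Hypothetical fragment of the calico-node container spec (not the PR's actual diff).
containers:
  - name: calico-node
    # Startup probe: runs the same check as readiness, but polls every 5s.
    # Readiness and liveness probes do not start until this succeeds.
    startupProbe:
      exec:
        command:            # assumed command, mirroring the readiness check
          - /bin/calico-node
          - -felix-ready
          - -bird-ready
      periodSeconds: 5
      failureThreshold: 24  # 24 x 5s = 120s startup budget
    # Steady-state readiness probe stays at the Kubernetes default period.
    readinessProbe:
      exec:
        command:            # assumed command
          - /bin/calico-node
          - -felix-ready
          - -bird-ready
      periodSeconds: 10
```

One design consequence worth keeping in mind: if a startup probe fails `failureThreshold` times in a row, the kubelet kills the container and applies the pod's restart policy, so the 120s budget is a hard ceiling on startup time rather than just a readiness delay.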

The main benefit is decoupling startup from steady state: if we ever want to relax the readiness probe period for large clusters, the startup probe ensures rollout speed isn't affected. The immediate improvement is modest (~5-10s per node during rolling updates).

The calico-node readiness probe checks Felix health and BIRD status
via cheap local calls (HTTP to localhost and a unix socket command).
Previously the readiness probe used the Kubernetes default 10s period,
which meant each node took 10-30s to be marked ready during rollouts.

Add a startup probe with a 5s period that runs the same check. K8s
doesn't start the readiness/liveness probes until the startup probe
succeeds, so this gives fast initial detection during rolling updates
while keeping steady-state probes at the default interval. The startup
probe allows up to 2 minutes for initial startup (failureThreshold=24
x periodSeconds=5).

On a 4-node cluster this reduces DaemonSet rollout from ~5 minutes
to ~2 minutes. On larger clusters the improvement scales linearly.

@marvin-tigera added this to the v1.42.0 milestone Mar 17, 2026
@caseydavenport marked this pull request as ready for review March 17, 2026 15:10
@caseydavenport requested a review from a team as a code owner March 17, 2026 15:10

@caseydavenport (Member, Author) commented

Closing — the startup probe adds risk of restart loops on slow-starting nodes for marginal rollout speed improvement (~5-10s per node). The readiness and startup checks are identical, so the startup probe doesn't buy us much here.
