
Enforce auction timeout in orchestrator wait logic #469

Open
ChristianPavilonis wants to merge 4 commits into main from fix/auction-timeout


ChristianPavilonis (Collaborator) commented Mar 10, 2026

Summary

  • Enforce the configured auction timeout (settings.auction.timeout_ms) in the orchestrator's select() loop — previously, waits could extend to the backend's hardcoded 15s first_byte_timeout, ignoring the auction deadline.
  • Two complementary mechanisms: (1) each auction provider's backend first_byte_timeout is set to timeout_ms instead of 15s, and (2) the orchestrator checks elapsed time after each select() return, dropping remaining requests when the deadline is exceeded.
  • Mediator receives remaining time budget after the bidding phase, preventing mediation from extending the auction past the configured deadline.
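
The third point can be sketched as a small decision step that runs once bidding has finished. This is an illustration only, not the orchestrator's code: the `Bid` and `Outcome` types, the `after_bidding` function, and the "highest bid wins" fallback are all assumptions made for the example.

```rust
/// Hypothetical bid type, for illustration only.
#[derive(Debug, Clone, PartialEq)]
struct Bid {
    provider: &'static str,
    cpm_micros: u64,
}

/// What happens once the bidding phase is over.
#[derive(Debug, PartialEq)]
enum Outcome {
    /// Call the mediator, giving it only the time that is left.
    Mediate { timeout_ms: u32 },
    /// Budget exhausted: skip the mediator and pick a winner locally.
    Fallback(Option<Bid>),
}

/// Decide whether mediation still fits inside the auction deadline.
fn after_bidding(timeout_ms: u32, bidding_elapsed_ms: u32, bids: &[Bid]) -> Outcome {
    // saturating_sub keeps the budget at 0 when bidding overran the deadline.
    let remaining = timeout_ms.saturating_sub(bidding_elapsed_ms);
    if remaining == 0 {
        // Assumed fallback for this sketch: the highest bid wins.
        let best = bids.iter().max_by_key(|b| b.cpm_micros).cloned();
        return Outcome::Fallback(best);
    }
    Outcome::Mediate { timeout_ms: remaining }
}
```

With a 1000 ms auction and 400 ms spent on bidding, the mediator would get a 600 ms budget; with 1000 ms or more spent, mediation is skipped.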

Changes

| File | Change |
|---|---|
| crates/common/src/backend.rs | Add configurable first_byte_timeout field to BackendConfig (default 15s), builder method, timeout-aware backend naming, and from_url_with_first_byte_timeout() convenience method |
| crates/common/src/auction/orchestrator.rs | Add deadline enforcement in select() loop: track auction_start, drop remaining requests when timeout exceeded, pass only remaining time to mediator |
| crates/common/src/integrations/prebid.rs | Use from_url_with_first_byte_timeout() with context.timeout_ms for bid requests |
| crates/common/src/integrations/aps.rs | Use from_url_with_first_byte_timeout() with context.timeout_ms for bid requests; rename _context → context |
| crates/common/src/integrations/adserver_mock.rs | Use from_url_with_first_byte_timeout() with context.timeout_ms for mediation requests |

Closes

Closes #405

Test plan

  • cargo test --workspace
  • cargo clippy --all-targets --all-features -- -D warnings
  • cargo fmt --all -- --check
  • JS tests: cd crates/js/lib && npx vitest run
  • JS format: cd crates/js/lib && npm run format
  • Docs format: cd docs && npm run format
  • WASM build: cargo build --bin trusted-server-fastly --release --target wasm32-wasip1
  • Manual testing via fastly compute serve
  • Other:

Checklist

  • Changes follow CLAUDE.md conventions
  • No unwrap() in production code — use expect("should ...")
  • Uses tracing macros (not println!)
  • New code has tests
  • No secrets or credentials committed

@ChristianPavilonis ChristianPavilonis self-assigned this Mar 10, 2026
- Pass remaining time budget to each provider instead of full timeout,
  so backend first_byte_timeout cannot exceed the auction deadline
- Extract remaining_budget_ms helper and add unit tests for it
- Simplify from_url_with_first_byte_timeout to take Duration (not Option)
- Fix misleading doc on ensure() to note timeout is not in backend name
- Update TODO comment with specific untested timeout paths

aram356 (Collaborator) left a comment

Staff-level review of auction timeout enforcement

Good work overall. The approach of threading remaining_budget_ms through the orchestrator and into backend first_byte_timeout is sound. The doc comments are thorough and the compute_name / ensure split in BackendConfig is a clean refactor. A few issues to address before merging, ranging from a semantic correctness concern to testing gaps.

Summary of findings

| Priority | Issue |
|---|---|
| P0 | Backend timeout is first-registration-wins; later providers in the same auction may inherit a stale timeout |
| P1 | Requests still launch when remaining budget is already 0 ms |
| P1 | select() blocking means wall-clock can exceed timeout_ms (documented, but consider mitigation) |
| P2 | Per-provider timeout_ms() trait method is ignored at the transport layer |
| P2 | Critical timeout paths remain untested (acknowledged in TODO, but risky to merge without) |
| P3 | as u32 truncation in remaining_budget_ms |
| P3 | URL parsing logic triplicated across three methods |

aram356 (Collaborator) left a comment

Summary

This PR adds auction deadline enforcement to the orchestrator — providers now receive a shrinking time budget via remaining_budget_ms(), backends get a matching first_byte_timeout, and the select() loop checks the deadline after each response. The mediator is skipped when the budget is exhausted. The backend_name() trait method no longer registers backends as a side-effect.

Blocking

🔧 wrench

  • 0ms budget guard missing in provider loop: requests still launch when remaining budget is 0ms — the mediator path checks for this but the provider loop does not (orchestrator.rs:284)
  • Backend timeout poisoning: first_byte_timeout is not part of the backend name, so first-registration-wins — if bidders and mediators share an origin, the mediator's tighter timeout is silently ignored (backend.rs:128)

❓ question

  • Can bidders and mediators share the same origin? This determines whether the timeout poisoning is a real issue in the current deployment topology (backend.rs:128)

Non-blocking

🤔 thinking

  • as u32 truncation: remaining_budget_ms casts u128 to u32 — safe in practice but technically unsound (orchestrator.rs:21)
  • Provider timeout_ms() unused: per-provider timeout config is ignored at the transport layer (orchestrator.rs:289)

♻️ refactor

  • Triplicated URL parsing: three methods have identical parsing logic — extract a helper (backend.rs:222)

🌱 seedling

  • Untested timeout paths: core behavioral change has no integration test coverage (orchestrator.rs:691)

🏕 camp site

  • Log ordering: "Running auction with strategy" log on line 83 fires after the auction completes — reads as though the auction is starting. Consider moving it before the if/else or changing to "Auction completed with strategy".

👍 praise

  • backend_name() side-effect removed: no longer registers a backend on read (prebid.rs:955)
  • Remaining-budget mediator design: clean early-return and fallback (orchestrator.rs:121)

…ate URL parsing

- Include first_byte_timeout in backend name to prevent first-registration-wins
  poisoning where later requests silently inherit an earlier timeout
- Add 0ms budget guard in provider loop (consistent with mediator path)
- Use min(provider.timeout_ms(), remaining_budget) to respect per-provider caps
- Fix as u32 truncation in remaining_budget_ms with try_from/unwrap_or
- Extract parse_origin helper to deduplicate URL parsing across 3 methods
- Update backend_name trait method to accept timeout_ms for correct name prediction
- Add comment documenting select() blocking dependency on first_byte_timeout
- Expand TODO with follow-up guidance for testability improvements

ChristianPavilonis (Collaborator, Author) commented Mar 11, 2026

Review feedback addressed (48fb712)

Thanks for the thorough review! All findings have been addressed in the latest push. Here's a breakdown:

Blocking — resolved

🔧 0ms budget guard in provider loop (orchestrator.rs)
Added an if effective_timeout == 0 { continue; } guard after computing the remaining budget, consistent with the mediator path at line 123. Providers are now skipped with a warning log instead of launching a doomed request with a 0ms timeout.
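
A minimal sketch of the guard, assuming a simplified `Provider` struct and a `plan_launches` helper that are not part of the PR. The remaining budget is modelled as a closure so that it can shrink between providers, and the `min` with the provider's own timeout follows the description further down in this comment.

```rust
/// Hypothetical stand-in for an auction provider; the real code uses a trait.
struct Provider {
    name: &'static str,
    timeout_ms: u32,
}

/// Returns (provider name, effective timeout in ms) for every request that
/// would actually be launched. `remaining_ms` is re-read for each provider,
/// mimicking a budget that shrinks as the loop progresses.
fn plan_launches(
    providers: &[Provider],
    mut remaining_ms: impl FnMut() -> u32,
) -> Vec<(&'static str, u32)> {
    let mut launches = Vec::new();
    for provider in providers {
        // Respect both the auction deadline and the provider's own cap.
        let effective_timeout = remaining_ms().min(provider.timeout_ms);
        if effective_timeout == 0 {
            // The PR logs a warning here; the request would be doomed anyway.
            continue;
        }
        launches.push((provider.name, effective_timeout));
    }
    launches
}
```

If three providers see budgets of 800, 300, and 0 ms in turn, the first two launch with their capped timeouts and the third is skipped.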

🔧 Backend timeout poisoning (backend.rs)
first_byte_timeout is now part of the backend name (appended as _t{ms}), so different timeout values produce different backend registrations. This eliminates the first-registration-wins issue entirely. In practice bidders and mediators use distinct origins in the current deployment topology, but the name now encodes the timeout defensively.

The backend_name() trait method and backend_name_for_url() were updated to accept the timeout so the predicted name matches the actual registration. The orchestrator loop was reordered to compute effective_timeout before calling backend_name(effective_timeout).
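
To make the poisoning concrete, here is a toy model of a first-registration-wins registry. Only the `_t{ms}` suffix comes from the description above; the `Registry` type, the shape of `ensure`, and the base name `backend_https_example_com` are assumptions for illustration and do not reflect the real BackendConfig API.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Hypothetical name builder: an origin-derived base plus a `_t{ms}` suffix.
fn backend_name(base: &str, first_byte_timeout: Duration) -> String {
    format!("{}_t{}", base, first_byte_timeout.as_millis())
}

/// Toy registry with first-registration-wins semantics.
#[derive(Default)]
struct Registry {
    backends: HashMap<String, Duration>,
}

impl Registry {
    /// Registers the backend if the name is new, then returns the timeout
    /// that is actually in effect for that name.
    fn ensure(&mut self, name: String, timeout: Duration) -> Duration {
        *self.backends.entry(name).or_insert(timeout)
    }
}
```

Registering the same bare name twice with 1000 ms and then 200 ms leaves the 1000 ms timeout in effect for both callers. Once the timeout is folded into the name, the two registrations no longer collide.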

Non-blocking — resolved

🤔 as u32 truncation (orchestrator.rs:21)
Replaced start.elapsed().as_millis() as u32 with u32::try_from(start.elapsed().as_millis()).unwrap_or(u32::MAX) — the saturating_sub then correctly produces 0 for any absurdly large elapsed time.
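
For reference, the pattern looks roughly like this. The function signatures are assumptions (the thread only names remaining_budget_ms), and `elapsed_ms_saturating` is a helper split out here so the conversion can be exercised without a real clock.

```rust
use std::convert::TryFrom;
use std::time::{Duration, Instant};

/// Converts an elapsed duration to whole milliseconds, clamping to u32::MAX
/// instead of wrapping the way `as u32` would.
fn elapsed_ms_saturating(elapsed: Duration) -> u32 {
    u32::try_from(elapsed.as_millis()).unwrap_or(u32::MAX)
}

/// Milliseconds left before the auction deadline, never below zero.
/// Assumed signature for this sketch.
fn remaining_budget_ms(start: Instant, timeout_ms: u32) -> u32 {
    timeout_ms.saturating_sub(elapsed_ms_saturating(start.elapsed()))
}
```

The difference only shows up at extreme values: an elapsed time of exactly 2^32 ms wraps to 0 under `as u32`, which would make the budget look untouched, while the clamped version yields u32::MAX and therefore a remaining budget of 0.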

🤔 Provider timeout_ms() unused at transport layer (orchestrator.rs)
The effective timeout is now remaining_ms.min(provider.timeout_ms()), respecting both the auction deadline and the provider's own configured latency expectation.

♻️ Triplicated URL parsing (backend.rs)
Extracted a private parse_origin() helper that from_url_with_first_byte_timeout and backend_name_for_url both call, eliminating the duplicated Url::parse → scheme() → host_str() → port() chain.

🌱 Untested timeout paths (orchestrator.rs)
Expanded the TODO comment with concrete follow-up guidance: introduce a trait abstraction over select() for unit-testability, and consider an #[ignore] Viceroy integration test. Also added the new provider-skip path to the list of untested paths.
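
One possible shape for that trait abstraction, offered as a sketch rather than the planned design. The `Selector` trait, `FakeSelector`, and `collect_until_deadline` are hypothetical names. For simplicity the fake reports the elapsed time together with each completion, and the loop keeps the response that just arrived before dropping the rest; the thread does not say whether the real loop keeps or discards that response.

```rust
use std::collections::VecDeque;

/// Hypothetical abstraction over the platform's blocking select() call.
trait Selector {
    /// Blocks until one pending request completes. Returns the provider name
    /// and the elapsed ms since auction start, or None when nothing is pending.
    fn select(&mut self) -> Option<(&'static str, u32)>;
    /// Drops all still-pending requests and returns how many were dropped.
    fn drop_pending(&mut self) -> usize;
}

/// Collects responses until the deadline is reached, then drops the rest.
fn collect_until_deadline<S: Selector>(selector: &mut S, timeout_ms: u32) -> Vec<&'static str> {
    let mut responses = Vec::new();
    while let Some((provider, elapsed_ms)) = selector.select() {
        responses.push(provider);
        // Deadline check happens after each select() return.
        if elapsed_ms >= timeout_ms {
            selector.drop_pending();
            break;
        }
    }
    responses
}

/// Test double that replays scripted completions in order.
struct FakeSelector {
    pending: VecDeque<(&'static str, u32)>,
}

impl Selector for FakeSelector {
    fn select(&mut self) -> Option<(&'static str, u32)> {
        self.pending.pop_front()
    }

    fn drop_pending(&mut self) -> usize {
        let dropped = self.pending.len();
        self.pending.clear();
        dropped
    }
}
```

A unit test can then script completions at 100, 900, 1200, and 1500 ms against a 1000 ms deadline and assert that the fourth request is dropped without ever being awaited.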

Created issue #473

Comment about select() blocking (orchestrator.rs)
Added a note above the select() loop explaining that hard deadline enforcement depends on every backend's first_byte_timeout being set to at most the remaining auction budget, which Phase 1 guarantees.


All changes pass cargo fmt, cargo clippy -D warnings, and cargo test --workspace (477 tests).

Development

Successfully merging this pull request may close these issues.

Auction timeout config not enforced by orchestrator wait logic

2 participants