[Tests][UserEvents] Improve test timing and re-enable tests #124919

Open
mdh1418 wants to merge 5 commits into dotnet:main from mdh1418:fix/userevents-flaky-tests

Conversation

@mdh1418 mdh1418 commented Feb 26, 2026

Fix flaky userevents tracing tests

Fixes #123442

Root Causes

Investigation identified three independent causes of test flakiness:

1. Race between tracee event generation and record-trace setup

record-trace must discover the tracee process, send an EventPipe IPC command (which the runtime processes to register user_events tracepoints), and enable its PerfSession (ring buffer collection) before the tracee writes any events. The tracee has no callback to know when these steps are complete. With the original 200ms delay, the tracee could write events before PerfSession was enabled — a race observed when the tracee is discovered during record-trace's /proc scan. Events written in this window are silently lost.

Measured on a 2-core system (the configuration CI runs on, which hit the failures more often than my local WSL2 environment), tracepoint registration and PerfSession enable both completed within 229ms at p99. The delay is increased from 200ms to 700ms, roughly 3x the measured p99.
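
As a rough illustration of the timing budget, the sketch below holds the two numbers quoted above and a delay-then-act helper. `EventGenerationDelayMs` is the name used in the commit message; `TraceeTiming`, `MeasuredSetupP99Ms`, and `RunAfterSetupWindow` are hypothetical names introduced only for this example and are not taken from the test runner.

```csharp
using System;
using System.Threading;

public static class TraceeTiming
{
    // Delay the tracee waits before generating events (value from this PR).
    public const int EventGenerationDelayMs = 700;

    // Measured p99 for tracepoint registration plus PerfSession enable (value from this PR).
    public const int MeasuredSetupP99Ms = 229;

    // Ratio of the chosen delay to the measured p99: 700 / 229, roughly 3.06.
    public static double SafetyMargin =>
        (double)EventGenerationDelayMs / MeasuredSetupP99Ms;

    // Hypothetical helper: sleep through record-trace's setup window,
    // then run the event-generating action.
    public static void RunAfterSetupWindow(Action generateEvents, int delayMs = EventGenerationDelayMs)
    {
        Thread.Sleep(delayMs);
        generateEvents();
    }
}
```

A fixed sleep is the simplest option, but it only makes the race unlikely; it does not remove it.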

2. Cross-process event contamination

record-trace captures events from all .NET processes on the machine, not just the tracee. On CI machines running multiple .NET processes, the trace validators could match events from unrelated processes or fail to find the tracee's events among the noise.

Each validator now receives the tracee's PID and filters events by ProcessID, ignoring events from other processes.
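
A minimal sketch of the filtering idea follows. It does not reproduce the validators in this PR: `TraceRecord` is a hypothetical stand-in for a parsed trace event, and `PidFilteringValidator.Validate` is an illustrative name. The only parts taken from the description are the `traceePid` parameter and the comparison against `ProcessID`.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for one event read from the trace file.
public sealed record TraceRecord(int ProcessID, string EventName);

public static class PidFilteringValidator
{
    // Returns true only if the expected event was emitted by the tracee itself.
    public static bool Validate(IEnumerable<TraceRecord> events, int traceePid, string expectedEventName)
    {
        int fromOtherProcesses = 0;
        bool found = false;

        foreach (TraceRecord e in events)
        {
            if (e.ProcessID != traceePid)
            {
                // Event from an unrelated .NET process: count it, but never match on it.
                fromOtherProcesses++;
                continue;
            }

            if (e.EventName == expectedEventName)
            {
                found = true;
            }
        }

        Console.WriteLine($"Ignored {fromOtherProcesses} event(s) from other processes.");
        if (!found)
        {
            Console.Error.WriteLine($"Expected event '{expectedEventName}' not found for PID {traceePid}.");
        }
        return found;
    }
}
```

Counting the skipped events, rather than dropping them silently, keeps the noise level visible in the test log when a run fails.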

3. Permission errors cleaning up diagnostic ports

EnsureCleanDiagnosticPorts deletes zombie diagnostic IPC sockets in /tmp, but on shared CI machines these sockets may be owned by other users. UnauthorizedAccessException and IOException during deletion would crash the test before it even started.

These exceptions are now caught and logged, allowing the test to proceed.
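
The tolerant cleanup could look roughly like the sketch below. The class name, method name, parameters, and return value are assumptions made for this example; only the `dotnet-diagnostic-*` pattern, the `/tmp` location, the log message, and the two exception types come from the PR. Both types need to be listed because `UnauthorizedAccessException` does not derive from `IOException`.

```csharp
using System;
using System.IO;

public static class DiagnosticPortCleanup
{
    // Deletes leftover diagnostic port files and returns how many were removed.
    // Files that cannot be deleted (for example, owned by another user) are logged and skipped.
    public static int DeleteLeftoverPorts(string directory = "/tmp", string pattern = "dotnet-diagnostic-*")
    {
        int deleted = 0;

        foreach (FileInfo fi in new DirectoryInfo(directory).EnumerateFiles(pattern))
        {
            try
            {
                Console.WriteLine($"Deleting zombie diagnostic port: {fi.FullName}");
                fi.Delete();
                deleted++;
            }
            catch (Exception ex) when (ex is IOException or UnauthorizedAccessException)
            {
                // Not fatal: the file is left in place and the test continues.
                Console.WriteLine($"Could not delete {fi.FullName}: {ex.Message}");
            }
        }

        return deleted;
    }
}
```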

Additional Improvements

  • Output ordering: Subprocess stderr is routed to stdout with [stdout]/[stderr] tags so test output appears in chronological order, making timing-sensitive failures easier to diagnose.
  • record-trace version: Bumped from 0.1.33304 to 0.1.33421 to pick up fixes for EventPipe IPC handling.
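
For the output-ordering change, one way to merge both subprocess streams into the parent's stdout with tags is sketched here using the standard `System.Diagnostics.Process` events. `TaggedProcessRunner` and `Run` are hypothetical names, and the actual runner may wire this up differently.

```csharp
using System;
using System.Diagnostics;

public static class TaggedProcessRunner
{
    // Runs a subprocess and echoes both of its streams to this process's stdout,
    // prefixing each line with the stream it came from. Returns the exit code.
    public static int Run(string fileName, string arguments)
    {
        var startInfo = new ProcessStartInfo(fileName, arguments)
        {
            RedirectStandardOutput = true,
            RedirectStandardError = true,
            UseShellExecute = false,
        };

        using var process = new Process { StartInfo = startInfo };

        // e.Data is null when the stream closes, so skip that notification.
        process.OutputDataReceived += (_, e) =>
        {
            if (e.Data is not null) Console.WriteLine($"[stdout] {e.Data}");
        };
        process.ErrorDataReceived += (_, e) =>
        {
            if (e.Data is not null) Console.WriteLine($"[stderr] {e.Data}");
        };

        process.Start();
        process.BeginOutputReadLine();
        process.BeginErrorReadLine();
        process.WaitForExit();
        return process.ExitCode;
    }
}
```

Because the two handlers are invoked asynchronously, the interleaving is only approximately chronological, but writing everything to a single stream avoids the larger reordering that comes from stdout and stderr being buffered separately.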

Validation

Stress-tested with 500 consecutive runs (100 per scenario) on a 2-core system — 500/500 passed (0% failure rate, up from ~65% with the original code).

mdh1418 and others added 5 commits February 26, 2026 02:28
Update the Microsoft.OneCollect.RecordTrace package to 0.1.33421 which
includes a fix to drain perf_event ring buffers after session disable,
ensuring all buffered events are flushed to the trace file on SIGINT.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add a 700ms delay (EventGenerationDelayMs) before the tracee performs its
event-generating action. record-trace must discover the process, send an
EventPipe IPC command, register tracepoints, and enable PerfSession before
the tracee writes events. Without this delay, events can be lost if the
tracee writes before PerfSession is enabled — a race observed when the
tracee is discovered during record-trace's /proc scan.

Route all subprocess stderr to stdout with [stdout]/[stderr] tags so test
output appears in the correct chronological order.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
record-trace traces system-wide (pid=-1), so the resulting nettrace file
contains events from ALL processes on the system. Pass the tracee PID to
each test's trace validator so it can filter events to only those from
the test process, preventing false positives and false negatives caused
by concurrent event sources.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
EnsureCleanDiagnosticPorts deletes leftover Unix domain sockets from
/tmp/dotnet-diagnostic-*. On multi-user systems, these files may be
owned by other users. Catch IOException and UnauthorizedAccessException
when deleting files to avoid test failures from permission errors.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove CLRTestTargetUnsupported that was set in response to persistent
flakiness (issue dotnet#123442). The root causes have been addressed:
- OneCollect 0.1.33421 drains ring buffers on SIGINT
- Synchronization delays cover tracepoint registration and
  PerfSession::enable ordering
- PID filtering prevents cross-process event contamination

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Pull request overview

This pull request fixes flaky user events tracing tests that were failing intermittently on CI. The tests were originally disabled due to issue #123442 where traces would not contain expected events. The PR identifies and addresses three independent root causes: race conditions in trace setup timing, cross-process event contamination, and permission errors when cleaning diagnostic ports.

Changes:

  • Increased tracee event generation delay from 200ms to 700ms based on empirical measurements to prevent race conditions
  • Added ProcessID filtering to all trace validators to eliminate cross-process event contamination
  • Added exception handling for diagnostic port cleanup to handle permission errors on shared CI machines
  • Improved test output ordering by routing stderr to stdout with clear tags for better diagnostics
  • Bumped record-trace version from 0.1.33304 to 0.1.33421 for EventPipe IPC handling fixes
  • Re-enabled the tests by removing the CLRTestTargetUnsupported property

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| src/tests/tracing/userevents/Directory.Build.props | Re-enables the user events tests by removing the CLRTestTargetUnsupported property that was added due to issue #123442 |
| eng/Versions.props | Updates MicrosoftOneCollectRecordTraceVersion from 0.1.33304 to 0.1.33421 to pick up EventPipe IPC handling fixes |
| src/tests/tracing/userevents/common/UserEventsTestRunner.cs | Core infrastructure changes: increases event generation delay to 700ms, adds PID parameter to validator signature, improves output tagging, adds exception handling for diagnostic port cleanup |
| src/tests/tracing/userevents/basic/basic.cs | Updates validator to accept and filter by traceePid, tracks and logs events from other processes, includes PID in error messages |
| src/tests/tracing/userevents/activity/activity.cs | Updates validator to accept and filter by traceePid, tracks and logs events from other processes, includes PID in error messages |
| src/tests/tracing/userevents/custommetadata/custommetadata.cs | Updates validator to accept and filter by traceePid, tracks and logs events from other processes, includes PID in error messages |
| src/tests/tracing/userevents/managedevent/managedevent.cs | Updates validator to accept and filter by traceePid, tracks and logs events from other processes, includes PID in error messages |
| src/tests/tracing/userevents/multithread/multithread.cs | Updates validator to accept and filter by traceePid, tracks and logs events from other processes, includes PID in error messages |

// must discover the process, send an EventPipe IPC command that the runtime
// processes to register user_events tracepoints, and enable its PerfSession
// (ring buffer collection). The tracee has no callback to know when these steps
// are complete, so this delay provides a sufficient window. Without it, events

What about EventSource.IsEnabled() or EventSource.OnEventCommand? IsEnabled should toggle from false->true and OnEventCommand should be invoked when the runtime receives that IPC command. As long as record-trace enables the ring buffer/subscribes to the events prior to the IPC command and the runtime updates its internal filtering state prior to updating the C# visible state I'd expect an event emitted immediately after this to be captured. I'm pretty sure the runtime does use that ordering but I'm not sure about record-trace.

If that is reliable ideally we don't need a delay.
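
A callback-based wait along the lines suggested here might be sketched as follows. The source name, class name, and `WaitUntilEnabled` helper are hypothetical; `OnEventCommand`, `EventCommandEventArgs`, and `EventCommand.Enable` are the standard `System.Diagnostics.Tracing` members. Whether record-trace has its ring buffer enabled by the time this callback fires is exactly the open question raised above, so this shows the tracee side only.

```csharp
using System;
using System.Diagnostics.Tracing;
using System.Threading;

[EventSource(Name = "Hypothetical-Tracee-Source")]
public sealed class TraceeEventSource : EventSource
{
    public static readonly TraceeEventSource Log = new TraceeEventSource();

    // Set once the runtime delivers an Enable command for this source.
    // Field initializers run before the base constructor, so this is safe
    // even if a command arrives during construction.
    private readonly ManualResetEventSlim _enabled = new ManualResetEventSlim(false);

    protected override void OnEventCommand(EventCommandEventArgs command)
    {
        if (command.Command == EventCommand.Enable)
        {
            _enabled.Set();
        }
    }

    // Blocks until a session has enabled this source, or the timeout elapses.
    public bool WaitUntilEnabled(TimeSpan timeout) => _enabled.Wait(timeout);

    [Event(1)]
    public void Ping() => WriteEvent(1);
}
```

The tracee would call `TraceeEventSource.Log.WaitUntilEnabled(...)` and emit its events only after it returns `true`, keeping the timeout as a bounded fallback rather than a fixed sleep.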

fi.Delete();
try
{
Console.WriteLine($"Deleting zombie diagnostic port: {fi.FullName}");

Do you know why we need to do this at all? In theory the runtime should do a good enough job cleaning up after itself rather than relying on external tools to do it. I don't think we need to address it in this PR but I suspect either the test is doing needless work or the test is actively masking a runtime issue.

Development

Successfully merging this pull request may close these issues.

[Test][UserEvents] Trace file does not contain expected events