[Tests][UserEvents] Improve test timing and re-enable tests#124919
[Tests][UserEvents] Improve test timing and re-enable tests#124919mdh1418 wants to merge 5 commits intodotnet:mainfrom
Conversation
Update the Microsoft.OneCollect.RecordTrace package to 0.1.33421 which includes a fix to drain perf_event ring buffers after session disable, ensuring all buffered events are flushed to the trace file on SIGINT. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add a 700ms delay (EventGenerationDelayMs) before the tracee performs its event-generating action. record-trace must discover the process, send an EventPipe IPC command, register tracepoints, and enable PerfSession before the tracee writes events. Without this delay, events can be lost if the tracee writes before PerfSession is enabled — a race observed when the tracee is discovered during record-trace's /proc scan. Route all subprocess stderr to stdout with [stdout]/[stderr] tags so test output appears in the correct chronological order. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
record-trace traces system-wide (pid=-1), so the resulting nettrace file contains events from ALL processes on the system. Pass the tracee PID to each test's trace validator so it can filter events to only those from the test process, preventing false positives and false negatives caused by concurrent event sources. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
EnsureCleanDiagnosticPorts deletes leftover Unix domain sockets from /tmp/dotnet-diagnostic-*. On multi-user systems, these files may be owned by other users. Catch IOException and UnauthorizedAccessException when deleting files to avoid test failures from permission errors. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove CLRTestTargetUnsupported that was set in response to persistent flakiness (issue dotnet#123442). The root causes have been addressed: - OneCollect 0.1.33421 drains ring buffers on SIGINT - Synchronization delays cover tracepoint registration and PerfSession::enable ordering - PID filtering prevents cross-process event contamination Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
This pull request fixes flaky user events tracing tests that were failing intermittently on CI. The tests were originally disabled due to issue #123442 where traces would not contain expected events. The PR identifies and addresses three independent root causes: race conditions in trace setup timing, cross-process event contamination, and permission errors when cleaning diagnostic ports.
Changes:
- Increased tracee event generation delay from 200ms to 700ms based on empirical measurements to prevent race conditions
- Added ProcessID filtering to all trace validators to eliminate cross-process event contamination
- Added exception handling for diagnostic port cleanup to handle permission errors on shared CI machines
- Improved test output ordering by routing stderr to stdout with clear tags for better diagnostics
- Bumped record-trace version from 0.1.33304 to 0.1.33421 for EventPipe IPC handling fixes
- Re-enabled the tests by removing the
CLRTestTargetUnsupportedproperty
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| src/tests/tracing/userevents/Directory.Build.props | Re-enables the user events tests by removing the CLRTestTargetUnsupported property that was added due to issue #123442 |
| eng/Versions.props | Updates MicrosoftOneCollectRecordTraceVersion from 0.1.33304 to 0.1.33421 to pick up EventPipe IPC handling fixes |
| src/tests/tracing/userevents/common/UserEventsTestRunner.cs | Core infrastructure changes: increases event generation delay to 700ms, adds PID parameter to validator signature, improves output tagging, adds exception handling for diagnostic port cleanup |
| src/tests/tracing/userevents/basic/basic.cs | Updates validator to accept and filter by traceePid, tracks and logs events from other processes, includes PID in error messages |
| src/tests/tracing/userevents/activity/activity.cs | Updates validator to accept and filter by traceePid, tracks and logs events from other processes, includes PID in error messages |
| src/tests/tracing/userevents/custommetadata/custommetadata.cs | Updates validator to accept and filter by traceePid, tracks and logs events from other processes, includes PID in error messages |
| src/tests/tracing/userevents/managedevent/managedevent.cs | Updates validator to accept and filter by traceePid, tracks and logs events from other processes, includes PID in error messages |
| src/tests/tracing/userevents/multithread/multithread.cs | Updates validator to accept and filter by traceePid, tracks and logs events from other processes, includes PID in error messages |
| // must discover the process, send an EventPipe IPC command that the runtime | ||
| // processes to register user_events tracepoints, and enable its PerfSession | ||
| // (ring buffer collection). The tracee has no callback to know when these steps | ||
| // are complete, so this delay provides a sufficient window. Without it, events |
There was a problem hiding this comment.
What about EventSource.IsEnabled() or EventSource.OnEventCommand? IsEnabled should toggle from false->true and OnEventCommand should be invoked when the runtime receives that IPC command. As long as record-trace enables the ring buffer/subscribes to the events prior to the IPC command and the runtime updates its internal filtering state prior to updating the C# visible state I'd expect an event emitted immediately after this to be captured. I'm pretty sure the runtime does use that ordering but I'm not sure about record-trace.
If that is reliable ideally we don't need a delay.
| fi.Delete(); | ||
| try | ||
| { | ||
| Console.WriteLine($"Deleting zombie diagnostic port: {fi.FullName}"); |
There was a problem hiding this comment.
Do you know why we need to do this at all? In theory the runtime should do a good enough job cleaning up after itself rather than relying on external tools to do it. I don't think we need to address it in this PR but I suspect either the test is doing needless work or the test is actively masking a runtime issue.
Fix flaky userevents tracing tests
Fixes #123442
Root Causes
Investigation identified three independent causes of test flakiness:
1. Race between tracee event generation and record-trace setup
record-trace must discover the tracee process, send an EventPipe IPC command (which the runtime processes to register user_events tracepoints), and enable its PerfSession (ring buffer collection) before the tracee writes any events. The tracee has no callback to know when these steps are complete. With the original 200ms delay, the tracee could write events before PerfSession was enabled — a race observed when the tracee is discovered during record-trace's
/procscan. Events written in this window are silently lost.Empirically measured on a 2-core system, which CI runs on and hit the failures more frequently than my local WSL2 environment, both tracepoint registration and PerfSession enable completed within 229ms (p99). The delay is increased to 700ms, providing a ~3x safety margin.
2. Cross-process event contamination
record-trace captures events from all .NET processes on the machine, not just the tracee. On CI machines running multiple .NET processes, the trace validators could match events from unrelated processes or fail to find the tracee's events among the noise.
Each validator now receives the tracee's PID and filters events by
ProcessID, ignoring events from other processes.3. Permission errors cleaning up diagnostic ports
EnsureCleanDiagnosticPortsdeletes zombie diagnostic IPC sockets in/tmp, but on shared CI machines these sockets may be owned by other users.UnauthorizedAccessExceptionandIOExceptionduring deletion would crash the test before it even started.These exceptions are now caught and logged, allowing the test to proceed.
Additional Improvements
[stdout]/[stderr]tags so test output appears in chronological order, making timing-sensitive failures easier to diagnose.Validation
Stress-tested with 500 consecutive runs (100 per scenario) on a 2-core system — 500/500 passed (0% failure rate, up from ~65% with the original code).