Skip to content

Fixing the Flaky Test for org.apache.fluss.client.table.scanner.batch.KvSnapshotBatchScannerITCase.testKvSnapshotLease#2808

Open
hemanthsavasere wants to merge 2 commits intoapache:mainfrom
hemanthsavasere:2807-fix-Flaky-test-KvSnapshotBatchScannerITCase.testKvSnapshotLease
Open

Fixing the Flaky Test for org.apache.fluss.client.table.scanner.batch.KvSnapshotBatchScannerITCase.testKvSnapshotLease#2808
hemanthsavasere wants to merge 2 commits intoapache:mainfrom
hemanthsavasere:2807-fix-Flaky-test-KvSnapshotBatchScannerITCase.testKvSnapshotLease

Conversation

@hemanthsavasere
Copy link
Contributor

@hemanthsavasere hemanthsavasere commented Mar 6, 2026

Purpose

  • Fix a race condition in FlussClusterExtension.triggerSnapshot() that caused testKvSnapshotLease to intermittently fail with expected: [1L, 1L, 1L] but was: [1L, 0L, 1L].

Linked issue: close #2807

Brief change log

File: fluss-server/src/test/java/org/apache/fluss/server/testutils/FlussClusterExtension.java

  • Before: Read currentSnapshotId() synchronously right after triggerSnapshot(); if unchanged, returned null and skipped the bucket
  • After: Poll currentSnapshotId() in a loop (10ms sleep, 30s timeout) until the ID increments, confirming the async trigger was actually processed; fail with a clear message on timeout

Tests

  • Test: fluss-client/src/test/java/org/apache/fluss/client/table/scanner/batch/KvSnapshotBatchScannerITCase.java
  • Fix target: fluss-server/src/test/java/org/apache/fluss/server/testutils/FlussClusterExtension.java:747-775

API and Format

  • No API changes. Changes only related to tests.

Documentation

  • No new features are introduced

….KvSnapshotBatchScannerITCase.testKvSnapshotLeas
….KvSnapshotBatchScannerITCase.testKvSnapshotLease timeout fix
throw new RuntimeException(e);
}
}
return snapshotId;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use retry() instead of while loop to assert the snapshot id incremental?

Copy link
Contributor Author

@hemanthsavasere hemanthsavasere Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @swuferhong,
As per my analysis

  • Both retry() and waitUntil() throw/fail on timeout. The current while loop returns null on timeout, a fundamentally different semantic. The null return is intentional: the caller triggerAndWaitSnapshot() (line 737) checks for null and calls fail() only if no snapshot was triggered, distinguishing "not triggered" from "legitimately skipped.

  • Also, wrapping retry or waitUntil in try-catch to swallow its failure does not serve the its purpose and is arguably less clear than the current while loop.

A new utility like pollUntil() returning Optional could be added

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky Test: KvSnapshotBatchScannerITCase.testKvSnapshotLease

2 participants