feat: add async 'for' loop support to LogScanner (#424) #438
qzyu999 wants to merge 6 commits into apache:main
Conversation
@qzyu999 Ty for the PR, but I checked this branch out and the Python integration tests still hang even when I run them locally. PTAL
Hi @fresh-borzoni, applied the fix. Ran … within a local scope in `to_arrow`.
Pull request overview
This PR adds Python `async for` support to the PyO3 `LogScanner` binding to address Issue #424, letting PyFluss users iterate scanner results via the native async-iterator protocol instead of manual polling loops.
Changes:
- Added a new Python integration test covering `async for record in scanner` on a record-based `LogScanner`.
- Refactored the Rust `LogScanner` binding to store scanner state behind `Arc<tokio::sync::Mutex<_>>` with a pending-records buffer for per-record yielding.
- Implemented `__aiter__`/`__anext__` in the Rust binding (via `future_into_py`) to produce awaitable next-items for async iteration.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| bindings/python/test/test_log_table.py | Adds an async-iterator integration test for LogScanner. |
| bindings/python/src/table.rs | Introduces async iterator support and refactors scanner state management for Python bindings. |
```rust
fn __aiter__<'py>(slf: PyRef<'py, Self>) -> PyResult<Bound<'py, PyAny>> {
    let py = slf.py();
    let code = pyo3::ffi::c_str!(
        r#"
async def _adapter(obj):
    while True:
        try:
            yield await obj.__anext__()
        except StopAsyncIteration:
            break
"#
    );
    let globals = pyo3::types::PyDict::new(py);
    py.run(code, Some(&globals), None)?;
    let adapter = globals.get_item("_adapter")?.unwrap();
    // Return _adapter(self)
    adapter.call1((slf.into_bound_py_any(py)?,))
}
```
`__aiter__` recompiles and executes Python source via `py.run()` on every iteration start. Consider caching the adapter function (e.g., in a `PyOnceLock`) or returning `self` directly as the async iterator if possible; this avoids repeated code compilation and reduces overhead per `async for` loop.
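For reference, a pure-Python sketch of what the adapter does, with the adapter defined once at module scope rather than recompiled per loop. `FakeScanner` is a hypothetical stand-in for the PyO3 binding, not the real class:

```python
import asyncio
import inspect

async def _adapter(obj):
    # Wrap an object exposing only __anext__ into a genuine async
    # generator, so inspect.isasyncgen() checks pass. Defined once at
    # import time -- the caching the review comment asks for.
    while True:
        try:
            yield await obj.__anext__()
        except StopAsyncIteration:
            break

class FakeScanner:
    # Hypothetical stand-in for the LogScanner binding.
    def __init__(self, items):
        self._items = iter(items)

    async def __anext__(self):
        try:
            return next(self._items)
        except StopIteration:
            raise StopAsyncIteration

    def __aiter__(self):
        # Reuse the module-level adapter: no per-loop compilation.
        return _adapter(self)

async def main():
    return [r async for r in FakeScanner([1, 2, 3])]

print(asyncio.run(main()))          # -> [1, 2, 3]
print(inspect.isasyncgen(FakeScanner([]).__aiter__()))  # -> True
```

The same caching effect is what storing the compiled `_adapter` in a `PyOnceLock` would achieve on the Rust side.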
bindings/python/src/table.rs
Outdated
```rust
let mut state = state_arc.lock().await;

// 1. If we already have buffered records, pop and return immediately
if let Some(record) = state.pending_records.pop_front() {
    return Ok(record.into_any());
}

// 2. Buffer is empty, we must poll the network for the next batch.
//    The underlying kind must be a Record-based scanner.
let scanner = match state.kind.as_record() {
    Ok(s) => s,
    Err(_) => {
        return Err(pyo3::exceptions::PyStopAsyncIteration::new_err(
            "Stream Ended",
        ));
    }
};

// Poll with a reasonable internal timeout before unblocking the event loop
let timeout = core::time::Duration::from_millis(5000);

let mut current_records = scanner
    .poll(timeout)
    .await
    .map_err(|e| FlussError::from_core_error(&e))?;

// If it's a real timeout with zero records, loop or throw StopAsyncIteration?
// Since it's a streaming log, we can yield None or block. Blocking requires a loop in the future.
while current_records.is_empty() {
    current_records = scanner
        .poll(timeout)
        .await
        .map_err(|e| FlussError::from_core_error(&e))?;
}
```
`__anext__` holds `state_arc.lock()` across `scanner.poll(timeout).await` (and the retry loop). This blocks all other methods needing the same mutex (e.g., `subscribe`/`unsubscribe`/`poll`/`to_arrow`) for the full network wait time and can lead to poor responsiveness or deadlock-like behavior under concurrent use. Consider narrowing the critical section (e.g., split locks for `kind` vs `pending_records`, or temporarily take/move the scanner out of the state while polling).
bindings/python/src/table.rs
Outdated
```rust
let scanner_ref =
    unsafe { &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>) };
let lock = TOKIO_RUNTIME.block_on(async { scanner_ref.lock().await });
```
Avoid the unsafe pointer cast when accessing self.state. You can lock the mutex directly via self.state.lock() (or clone the Arc first) without unsafe; the current cast is unnecessary and introduces unsoundness risk if the field type ever changes.
Suggested change:
```diff
-let scanner_ref =
-    unsafe { &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>) };
-let lock = TOKIO_RUNTIME.block_on(async { scanner_ref.lock().await });
+let lock = TOKIO_RUNTIME.block_on(async { self.state.lock().await });
```
bindings/python/src/table.rs
Outdated
```rust
let scanner_ref =
    unsafe { &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>) };
let lock = TOKIO_RUNTIME.block_on(async { scanner_ref.lock().await });
```
query_latest_offsets() uses the same unsafe cast pattern to lock self.state. Please replace this with a safe lock on self.state (or a cloned Arc) to avoid unnecessary unsafe in the Python bindings.
Suggested change:
```diff
-let scanner_ref =
-    unsafe { &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>) };
-let lock = TOKIO_RUNTIME.block_on(async { scanner_ref.lock().await });
+let lock = TOKIO_RUNTIME.block_on(async { self.state.lock().await });
```
bindings/python/src/table.rs
Outdated
```rust
return Err(pyo3::exceptions::PyStopAsyncIteration::new_err(
    "Stream Ended",
```
`__anext__` treats the batch-based scanner variant as end-of-stream (`StopAsyncIteration`). That will silently terminate `async for` on scanners created via `create_record_batch_log_scanner()`, and it also masks the helpful error message from `as_record()`. Either implement async iteration for the batch variant (yielding `RecordBatch`/Arrow), or raise a `TypeError` explaining that async iteration is only supported for record scanners.
Suggested change:
```diff
-return Err(pyo3::exceptions::PyStopAsyncIteration::new_err(
-    "Stream Ended",
+return Err(PyTypeError::new_err(
+    "Async iteration is only supported for record scanners; \
+     use create_record_log_scanner() instead.",
```
```python
scanner = await table.new_scan().create_log_scanner()
num_buckets = (await admin.get_table_info(table_path)).num_buckets
scanner.subscribe_buckets({i: fluss.EARLIEST_OFFSET for i in range(num_buckets)})
```
This test only covers `async for` on a record-based scanner (`create_log_scanner()`). Since `LogScanner` can also wrap the batch variant (`create_record_batch_log_scanner()`), consider adding a companion test for async iteration on the batch scanner (or explicitly asserting that async iteration is unsupported there) so the intended behavior is locked in by tests.
bindings/python/src/table.rs
Outdated
```rust
let scanner_ref =
    unsafe { &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>) };
let lock = TOKIO_RUNTIME.block_on(async { scanner_ref.lock().await });
let scanner = lock.kind.as_batch()?;
```
Same as in poll(): please remove the unsafe cast used to get scanner_ref. Lock self.state directly; keeping this unsafe here makes the method harder to reason about and can hide real lifetime/aliasing issues.
Suggested change:
```diff
-let scanner_ref =
-    unsafe { &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>) };
-let lock = TOKIO_RUNTIME.block_on(async { scanner_ref.lock().await });
-let scanner = lock.kind.as_batch()?;
+let lock = TOKIO_RUNTIME.block_on(async { self.state.lock().await });
+let scanner = lock.kind.as_batch()?;
```
bindings/python/src/table.rs
Outdated
```rust
let scanner_ref =
    unsafe { &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>) };
let lock = TOKIO_RUNTIME.block_on(async { scanner_ref.lock().await });
```
Remove the unsafe pointer cast when locking self.state in poll_arrow(). This can be expressed safely with self.state.lock().await (via TOKIO_RUNTIME.block_on) and avoids introducing UB hazards.
Suggested change:
```diff
-let scanner_ref =
-    unsafe { &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>) };
-let lock = TOKIO_RUNTIME.block_on(async { scanner_ref.lock().await });
+let lock = TOKIO_RUNTIME.block_on(async { self.state.lock().await });
```
bindings/python/src/table.rs
Outdated
```rust
let scanner_ref = unsafe {
    &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>)
};
```
to_arrow() also uses an unsafe cast to access self.state. This should be rewritten to safely clone/borrow self.state and lock it without unsafe to keep the bindings memory-safe.
Suggested change:
```diff
-let scanner_ref = unsafe {
-    &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>)
-};
+let scanner_ref = &self.state;
```
bindings/python/src/table.rs
Outdated
```rust
let scanner_ref =
    unsafe { &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>) };
```
poll_until_offsets() also relies on the unsafe cast to access self.state. This should be refactored to lock self.state safely; keeping unsafe here is especially risky because this method can run for a long time and is on a hot path for to_arrow().
Suggested change:
```diff
-let scanner_ref =
-    unsafe { &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>) };
+let scanner_ref = &self.state;
```
@qzyu999 Ty, took a look at the approach, have some ideas. PTAL. The scanner is already thread-safe internally (`&self` on all methods), so the Mutex isn't needed; it just adds locking to every call and forces 5 unsafe pointer casts to work around borrow issues it created. The `__anext__` loop is also problematic: it runs inside `tokio::spawn`, so breaking out of `async for` leaves it polling forever in the background. Simpler idea: store the scanner in an `Arc` and keep existing methods as-is. Add `_async_poll(timeout_ms)` that does one bounded poll and returns a list. `__aiter__` returns a small Python async generator that calls `_async_poll` and yields records. Break stops the generator naturally, so no leaks, no unsafe, no mutex.
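A Python sketch of that suggested design, with the Rust side reduced to one bounded `_async_poll`. All names here (`LogScannerSketch`, the demo end-of-stream condition) are illustrative assumptions; a real streaming log would keep polling on an empty batch:

```python
import asyncio

class LogScannerSketch:
    # _async_poll stands in for the single bounded Rust-side poll;
    # __aiter__ is a plain Python async generator, so breaking out of
    # `async for` closes the generator -- no background task keeps polling.
    def __init__(self, batches):
        self._batches = list(batches)

    async def _async_poll(self, timeout_ms):
        # One bounded poll against the "network", returning a list.
        await asyncio.sleep(0)
        return self._batches.pop(0) if self._batches else []

    async def __aiter__(self):
        while True:
            records = await self._async_poll(timeout_ms=100)
            if not records:
                return  # demo-only: an empty poll ends the stream
            for record in records:
                yield record

async def take(scanner, n):
    out = []
    async for record in scanner:
        out.append(record)
        if len(out) == n:
            break  # generator is closed here; no leaked poll loop
    return out

print(asyncio.run(take(LogScannerSketch([[1, 2, 3], [4]]), 2)))  # -> [1, 2]
```

Because the per-record loop lives in Python rather than in a `tokio::spawn`ed future, cancellation semantics come for free from the async-generator protocol.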
Hi @fresh-borzoni, thanks for the recommendations. I took a look and came up with the following changes, PTAL when available:
Purpose
Linked issue: close #424
This pull request completes Issue #424 by enabling the native Python `async for` built-in over the PyO3-wrapped `LogScanner` stream.

Brief change log
Previously, PyFluss developers had to orchestrate `while True` polling loops over the network boundary using `scanner.poll(timeout)`. This PR reworks the Python `LogScanner` iterator logic, implementing async traversal via a Rust `__anext__` polling binding and a Python async-generator `__aiter__` adapter:
- Moved the `ScannerKind` internals into a buffered `Arc<tokio::sync::Mutex<ScannerState>>`. This guarantees thread safety and satisfies Rust's lifetime constraints for state moved into the `python_async_runtimes` tokio closure.
- `__anext__` returns an awaitable that yields the next record without blocking the event loop or hardware threads.
- To satisfy `inspect.isasyncgen()` checks on Python 3.12+ engines (such as modern IPython/Jupyter servers), `__aiter__` dynamically generates a wrapping async generator via `py.run()`, masking the binding's iterator-type limitations.

Tests
`test_log_table.py::test_async_iterator`: a testcontainers-based integration test confirming that `async for record in scanner` iterates thousands of appended records in order, matching the existing data.

API and Format
Yes, this extends the API to allow `async for` loops. Existing user logic relying on explicit `.poll_arrow()` calls or other existing functions is untouched.

Documentation
Yes, the updated integration tests act as live documentation of the capability.