feat: add async 'for' loop support to LogScanner (#424) #438
qzyu999 wants to merge 6 commits into apache:main
Conversation
@qzyu999 Ty for the PR, but I checked this branch out and the Python integration tests still hang even when I run them locally. PTAL
Hi @fresh-borzoni, applied the fix. Ran … within a local scope in `to_arrow`.
Pull request overview
This PR adds Python `async for` support to the PyO3 `LogScanner` binding to address Issue #424, letting PyFluss users iterate scanner results via the native async-iterator protocol instead of manual polling loops.
Changes:
- Added a new Python integration test covering `async for record in scanner` on a record-based `LogScanner`.
- Refactored the Rust `LogScanner` binding to store scanner state behind `Arc<tokio::sync::Mutex<_>>` with a pending-records buffer for per-record yielding.
- Implemented `__aiter__`/`__anext__` in the Rust binding (via `future_into_py`) to produce awaitable next-items for async iteration.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| bindings/python/test/test_log_table.py | Adds an async-iterator integration test for LogScanner. |
| bindings/python/src/table.rs | Introduces async iterator support and refactors scanner state management for Python bindings. |
```rust
fn __aiter__<'py>(slf: PyRef<'py, Self>) -> PyResult<Bound<'py, PyAny>> {
    let py = slf.py();
    let code = pyo3::ffi::c_str!(
        r#"
async def _adapter(obj):
    while True:
        try:
            yield await obj.__anext__()
        except StopAsyncIteration:
            break
"#
    );
    let globals = pyo3::types::PyDict::new(py);
    py.run(code, Some(&globals), None)?;
    let adapter = globals.get_item("_adapter")?.unwrap();
    // Return _adapter(self)
    adapter.call1((slf.into_bound_py_any(py)?,))
}
```
`__aiter__` recompiles and executes Python source via `py.run()` on every iteration start. Consider caching the adapter function (e.g., in a `PyOnceLock`) or returning `self` directly as the async iterator if possible; this avoids repeated code compilation and reduces overhead per `async for` loop.
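For reference, a pure-Python sketch of what the adapter does, with the adapter defined once at module scope rather than recompiled per loop. `FakeScanner` is a hypothetical stand-in for the PyO3 binding, not the real class:

```python
import asyncio
import inspect

async def _adapter(obj):
    # Wrap an object exposing only __anext__ into a genuine async
    # generator, so inspect.isasyncgen() checks pass. Defined once at
    # import time -- the caching the review comment asks for.
    while True:
        try:
            yield await obj.__anext__()
        except StopAsyncIteration:
            break

class FakeScanner:
    # Hypothetical stand-in for the LogScanner binding.
    def __init__(self, items):
        self._items = iter(items)

    async def __anext__(self):
        try:
            return next(self._items)
        except StopIteration:
            raise StopAsyncIteration

    def __aiter__(self):
        # Reuse the module-level adapter: no per-loop compilation.
        return _adapter(self)

async def main():
    return [r async for r in FakeScanner([1, 2, 3])]

print(asyncio.run(main()))          # -> [1, 2, 3]
print(inspect.isasyncgen(FakeScanner([]).__aiter__()))  # -> True
```

The same caching effect is what storing the compiled `_adapter` in a `PyOnceLock` would achieve on the Rust side.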
bindings/python/src/table.rs
Outdated
```rust
let mut state = state_arc.lock().await;

// 1. If we already have buffered records, pop and return immediately
if let Some(record) = state.pending_records.pop_front() {
    return Ok(record.into_any());
}

// 2. Buffer is empty, we must poll the network for the next batch.
//    The underlying kind must be a Record-based scanner.
let scanner = match state.kind.as_record() {
    Ok(s) => s,
    Err(_) => {
        return Err(pyo3::exceptions::PyStopAsyncIteration::new_err(
            "Stream Ended",
        ));
    }
};

// Poll with a reasonable internal timeout before unblocking the event loop
let timeout = core::time::Duration::from_millis(5000);

let mut current_records = scanner
    .poll(timeout)
    .await
    .map_err(|e| FlussError::from_core_error(&e))?;

// If it's a real timeout with zero records, loop or throw StopAsyncIteration?
// Since it's a streaming log, we can yield None or block. Blocking requires a loop in the future.
while current_records.is_empty() {
    current_records = scanner
        .poll(timeout)
        .await
        .map_err(|e| FlussError::from_core_error(&e))?;
}
```
`__anext__` holds `state_arc.lock()` across `scanner.poll(timeout).await` (and the retry loop). This blocks all other methods needing the same mutex (e.g., `subscribe`/`unsubscribe`/`poll`/`to_arrow`) for the full network wait time and can lead to poor responsiveness or deadlock-like behavior under concurrent use. Consider narrowing the critical section (e.g., split locks for `kind` vs `pending_records`, or temporarily take/move the scanner out of the state while polling).
bindings/python/src/table.rs
Outdated
```rust
let scanner_ref =
    unsafe { &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>) };
let lock = TOKIO_RUNTIME.block_on(async { scanner_ref.lock().await });
```
Avoid the unsafe pointer cast when accessing self.state. You can lock the mutex directly via self.state.lock() (or clone the Arc first) without unsafe; the current cast is unnecessary and introduces unsoundness risk if the field type ever changes.
Suggested change:
```diff
-let scanner_ref =
-    unsafe { &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>) };
-let lock = TOKIO_RUNTIME.block_on(async { scanner_ref.lock().await });
+let lock = TOKIO_RUNTIME.block_on(async { self.state.lock().await });
```
bindings/python/src/table.rs
Outdated
```rust
let scanner_ref =
    unsafe { &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>) };
let lock = TOKIO_RUNTIME.block_on(async { scanner_ref.lock().await });
```
query_latest_offsets() uses the same unsafe cast pattern to lock self.state. Please replace this with a safe lock on self.state (or a cloned Arc) to avoid unnecessary unsafe in the Python bindings.
Suggested change:
```diff
-let scanner_ref =
-    unsafe { &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>) };
-let lock = TOKIO_RUNTIME.block_on(async { scanner_ref.lock().await });
+let lock = TOKIO_RUNTIME.block_on(async { self.state.lock().await });
```
bindings/python/src/table.rs
Outdated
```rust
return Err(pyo3::exceptions::PyStopAsyncIteration::new_err(
    "Stream Ended",
```
`__anext__` treats the batch-based scanner variant as end-of-stream (`StopAsyncIteration`). That will silently terminate `async for` on scanners created via `create_record_batch_log_scanner()`, and it also masks the helpful error message from `as_record()`. Either implement async iteration for the batch variant (yielding `RecordBatch`/Arrow), or raise a `TypeError` explaining that async iteration is only supported for record scanners.
Suggested change:
```diff
-return Err(pyo3::exceptions::PyStopAsyncIteration::new_err(
-    "Stream Ended",
+return Err(PyTypeError::new_err(
+    "Async iteration is only supported for record scanners; \
+     use create_record_log_scanner() instead.",
```
```python
scanner = await table.new_scan().create_log_scanner()
num_buckets = (await admin.get_table_info(table_path)).num_buckets
scanner.subscribe_buckets({i: fluss.EARLIEST_OFFSET for i in range(num_buckets)})
```
This test only covers `async for` on a record-based scanner (`create_log_scanner()`). Since `LogScanner` can also wrap the batch variant (`create_record_batch_log_scanner()`), consider adding a companion test for async iteration on the batch scanner (or explicitly asserting that async iteration is unsupported there) so the intended behavior is locked in by tests.
bindings/python/src/table.rs
Outdated
```rust
let scanner_ref =
    unsafe { &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>) };
let lock = TOKIO_RUNTIME.block_on(async { scanner_ref.lock().await });
let scanner = lock.kind.as_batch()?;
```
Same as in poll(): please remove the unsafe cast used to get scanner_ref. Lock self.state directly; keeping this unsafe here makes the method harder to reason about and can hide real lifetime/aliasing issues.
Suggested change:
```diff
-let scanner_ref =
-    unsafe { &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>) };
-let lock = TOKIO_RUNTIME.block_on(async { scanner_ref.lock().await });
-let scanner = lock.kind.as_batch()?;
+let lock = TOKIO_RUNTIME.block_on(async { self.state.lock().await });
+let scanner = lock.kind.as_batch()?;
```
bindings/python/src/table.rs
Outdated
```rust
let scanner_ref =
    unsafe { &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>) };
let lock = TOKIO_RUNTIME.block_on(async { scanner_ref.lock().await });
```
Remove the unsafe pointer cast when locking self.state in poll_arrow(). This can be expressed safely with self.state.lock().await (via TOKIO_RUNTIME.block_on) and avoids introducing UB hazards.
Suggested change:
```diff
-let scanner_ref =
-    unsafe { &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>) };
-let lock = TOKIO_RUNTIME.block_on(async { scanner_ref.lock().await });
+let lock = TOKIO_RUNTIME.block_on(async { self.state.lock().await });
```
bindings/python/src/table.rs
Outdated
```rust
let scanner_ref = unsafe {
    &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>)
};
```
to_arrow() also uses an unsafe cast to access self.state. This should be rewritten to safely clone/borrow self.state and lock it without unsafe to keep the bindings memory-safe.
Suggested change:
```diff
-let scanner_ref = unsafe {
-    &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>)
-};
+let scanner_ref = &self.state;
```
bindings/python/src/table.rs
Outdated
```rust
let scanner_ref =
    unsafe { &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>) };
```
poll_until_offsets() also relies on the unsafe cast to access self.state. This should be refactored to lock self.state safely; keeping unsafe here is especially risky because this method can run for a long time and is on a hot path for to_arrow().
Suggested change:
```diff
-let scanner_ref =
-    unsafe { &*(&self.state as *const std::sync::Arc<tokio::sync::Mutex<ScannerState>>) };
+let scanner_ref = &self.state;
```
@qzyu999 Ty, took a look at the approach, have some ideas. PTAL. The scanner is already thread-safe internally (`&self` on all methods), so the Mutex isn't needed; it just adds locking to every call and forces 5 unsafe pointer casts to work around borrow issues it created. The `__anext__` loop is also problematic: it runs inside `tokio::spawn`, so breaking out of `async for` leaves it polling forever in the background. Simpler idea: store the scanner in an `Arc` and keep existing methods as-is. Add `_async_poll(timeout_ms)` that does one bounded poll and returns a list. `__aiter__` returns a small Python async generator that calls `_async_poll` and yields records. Break stops the generator naturally, so no leaks, no unsafe, no mutex.
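A Python sketch of that suggested design, with the Rust side reduced to one bounded `_async_poll`. All names here (`LogScannerSketch`, the demo end-of-stream condition) are illustrative assumptions; a real streaming log would keep polling on an empty batch:

```python
import asyncio

class LogScannerSketch:
    # _async_poll stands in for the single bounded Rust-side poll;
    # __aiter__ is a plain Python async generator, so breaking out of
    # `async for` closes the generator -- no background task keeps polling.
    def __init__(self, batches):
        self._batches = list(batches)

    async def _async_poll(self, timeout_ms):
        # One bounded poll against the "network", returning a list.
        await asyncio.sleep(0)
        return self._batches.pop(0) if self._batches else []

    async def __aiter__(self):
        while True:
            records = await self._async_poll(timeout_ms=100)
            if not records:
                return  # demo-only: an empty poll ends the stream
            for record in records:
                yield record

async def take(scanner, n):
    out = []
    async for record in scanner:
        out.append(record)
        if len(out) == n:
            break  # generator is closed here; no leaked poll loop
    return out

print(asyncio.run(take(LogScannerSketch([[1, 2, 3], [4]]), 2)))  # -> [1, 2]
```

Because the per-record loop lives in Python rather than in a `tokio::spawn`ed future, cancellation semantics come for free from the async-generator protocol.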
Hi @fresh-borzoni, thanks for the recommendations. I took a look and came up with the following changes, PTAL when available:
Purpose
Linked issue: close #424
This pull request completes Issue #424 by enabling the native Python `async for` built-in over the PyO3-wrapped `LogScanner` stream.

Brief change log
Previously, PyFluss developers had to orchestrate `while True` polling loops over the network boundary using `scanner.poll(timeout)`. This PR reworks the Python `LogScanner` iterator logic, implementing async traversal via a Rust `__anext__` polling binding and a Python async-generator `__aiter__` adapter:
- Moved the `ScannerKind` internals into a buffered `Arc<tokio::sync::Mutex<ScannerState>>`. This guarantees thread safety and satisfies Rust's lifetime constraints for state moved into the `python_async_runtimes` tokio closure.
- `__anext__` returns an awaitable that yields the next record without blocking the event loop or hardware threads.
- To satisfy `inspect.isasyncgen()` checks on Python 3.12+ engines (such as modern IPython/Jupyter servers), `__aiter__` dynamically generates a wrapping async generator via `py.run()`, masking the binding's iterator-type limitations.

Tests
`test_log_table.py::test_async_iterator`: a testcontainers-based integration test confirming that `async for record in scanner` iterates thousands of appended records in order, matching the existing data.

API and Format
Yes, this extends the API to allow `async for` loops. Existing user logic relying on explicit `.poll_arrow()` calls or other existing functions is untouched.

Documentation
Yes, the updated integration tests act as live documentation of the capability.