Fill missing chunks #3748

Open
williamsnell wants to merge 8 commits into zarr-developers:main from williamsnell:fill-missing-chunks

@williamsnell williamsnell commented Mar 5, 2026

Add config options for whether a missing chunk should:

  • appear as a chunk filled with fill_value (current behaviour; retained as default)
  • raise a MissingChunkError
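
A minimal, library-independent sketch of the two behaviours listed above. It assumes a plain dict stands in for the store; `read_chunk`, the key names and this standalone `MissingChunkError` class are hypothetical stand-ins for illustration, not the code in this PR:

```python
class MissingChunkError(Exception):
    """Stand-in for the error proposed in this PR (hypothetical definition)."""


def read_chunk(store, key, chunk_size, fill_value, fill_missing_chunks=True):
    """Return the chunk stored under `key`, handling the absent-key case."""
    data = store.get(key)
    if data is not None:
        return list(data)
    if fill_missing_chunks:
        # Current default behaviour: an absent chunk reads as fill_value.
        return [fill_value] * chunk_size
    # Opt-in behaviour: surface the absence instead of masking it.
    raise MissingChunkError(f"chunk {key!r} is missing from the store")


store = {"c/0": [1, 2, 3, 4]}  # chunk "c/1" was never written (or was lost)

# Default: the missing chunk is indistinguishable from one full of fill_value.
default_read = read_chunk(store, "c/1", chunk_size=4, fill_value=0)

# Strict: the same read raises, so a lost chunk cannot pass silently.
try:
    read_chunk(store, "c/1", chunk_size=4, fill_value=0, fill_missing_chunks=False)
    strict_read_raised = False
except MissingChunkError:
    strict_read_raised = True
```

The default keeps existing code working unchanged, while the strict mode is meant for stores where a chunk can go missing for reasons other than "it was never written".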

This PR is entirely based on the work of @tomwhite in this issue. I've started this PR because this is an important feature that I'd like to see merged.
I've added a test (based on the demo in the issue) and a minor docs tweak.

Questions:

  • I've added an example to config.md - is this the right place?

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.md
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@github-actions bot added and then removed the "needs release notes" label (automatically applied to PRs which haven't added release notes) Mar 5, 2026
@tomwhite
Member

tomwhite commented Mar 5, 2026

Thanks for doing this @williamsnell!

@d-v-b
Contributor

d-v-b commented Mar 5, 2026

this is great @williamsnell! I'm wondering if this could be exposed as part of the general array configuration, which is where write_empty_chunks currently sits.

@williamsnell
Author

@d-v-b I've pushed a new commit moving this to ArrayConfig so we can look into the ergonomics. It does feel like a more natural fit, especially since users probably want to set write_empty_chunks and fill_missing_chunks as a pair.

One note: I'm not sure how this interacts with Sharding now - following the existing code I hardcoded fill_missing_chunks=True here - does this imply there's no way to use fill_missing_chunks alongside sharding? If so, I can update the docs.

@d-v-b
Contributor

d-v-b commented Mar 5, 2026

> One note: I'm not sure how this interacts with Sharding now - following the existing code I hardcoded fill_missing_chunks=True here - does this imply there's no way to use fill_missing_chunks alongside sharding? If so, I can update the docs.

I don't think we want this new configuration option to change the behavior of the sharding codec. A missing subchunk inside a shard is conveyed explicitly via the shard index, so from the sharding codec's POV you can't have a subchunk appear missing due to a network error.

@williamsnell
Author

williamsnell commented Mar 6, 2026

> I don't think we want this new configuration option to change the behavior of the sharding codec. A missing subchunk inside a shard is conveyed explicitly via the shard index, so from the sharding codec's POV you can't have a subchunk appear missing due to a network error.

If I've understood correctly, we'll want to make this tweak to ShardingCodec._get_chunk_spec:

def _get_chunk_spec(self, shard_spec: ArraySpec) -> ArraySpec:
+  # Because the shard index and inner chunks should be stored
+  # together, we detect missing data via the shard index.
+  # The inner chunks defined here are thus allowed to return
+  # None, even if fill_missing_chunks=False at the array level.
+  config = replace(shard_spec.config, fill_missing_chunks=True)
   return ArraySpec(
       shape=self.chunk_shape,
       dtype=shard_spec.dtype,
       fill_value=shard_spec.fill_value,
-      config=shard_spec.config,
+      config=config,
       prototype=shard_spec.prototype,
   )

With this change, I think my previous point was wrong - we would be able to use fill_missing_chunks=False with sharding, and we would only raise an error if an entire shard (specifically its shard index) was missing.
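
To make the intended semantics concrete, here is a toy model of that rule. It is only a sketch under assumptions: `ToyConfig` stands in for the array config, a nested dict stands in for a shard and its index, and `read_or_fill` / `read_inner_chunk` are hypothetical helpers rather than anything in the sharding codec. The one real call it relies on is `dataclasses.replace`, used the same way as in the diff above.

```python
from dataclasses import dataclass, replace


class MissingChunkError(Exception):
    """Stand-in for the proposed error (hypothetical definition)."""


@dataclass(frozen=True)
class ToyConfig:
    # Hypothetical stand-in for the array config; only one field is modelled.
    fill_missing_chunks: bool = True


def read_or_fill(mapping, key, chunk_size, fill_value, config):
    """Read `key` from `mapping`, filling or raising according to `config`."""
    data = mapping.get(key)
    if data is not None:
        return list(data)
    if config.fill_missing_chunks:
        return [fill_value] * chunk_size
    raise MissingChunkError(f"{key!r} is missing")


def read_inner_chunk(store, shard_key, inner_key, chunk_size, fill_value, config):
    shard = store.get(shard_key)
    if shard is None:
        # The whole shard (and therefore its index) is absent.
        if not config.fill_missing_chunks:
            raise MissingChunkError(f"shard {shard_key!r} is missing")
        return [fill_value] * chunk_size
    # Inside a shard, absence is recorded explicitly by the index, so inner
    # reads are always allowed to fill, whatever the array-level setting is.
    inner_config = replace(config, fill_missing_chunks=True)
    return read_or_fill(shard, inner_key, chunk_size, fill_value, inner_config)


strict = ToyConfig(fill_missing_chunks=False)
store = {"shard-0": {0: [1, 2]}}  # inner chunk 1 is empty; "shard-1" is absent

inner_absent = read_inner_chunk(store, "shard-0", 1, 2, 0, strict)
try:
    read_inner_chunk(store, "shard-1", 0, 2, 0, strict)
    shard_missing_raised = False
except MissingChunkError:
    shard_missing_raised = True
```

In this model an empty inner chunk still reads as fill_value under the strict setting, and only the absent shard raises.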

@williamsnell williamsnell force-pushed the fill-missing-chunks branch from 6db55a1 to de7afd8 Compare March 6, 2026 20:49
@williamsnell
Author

I've committed the change to _get_chunk_spec and rebased onto main.

I've also made two more changes:

  • Added some tests to check/codify expected behaviour around:
    • when a sharded array should or shouldn't raise MissingChunkError
    • how fill_missing_chunks is expected to interact with write_empty_chunks
  • Simplified the branching in codec_pipeline into if/elif/else.
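
For readers following along, one interaction worth spelling out is sketched below. This is an assumption about how the two options combine, not a copy of the tests in this PR: if write_empty_chunks=False means an all-fill chunk is never stored, then reading it back with fill_missing_chunks=False would look exactly like a lost chunk. The dict store and both helper functions are hypothetical; the read path uses the if/elif/else shape mentioned above.

```python
class MissingChunkError(Exception):
    """Stand-in for the proposed error (hypothetical definition)."""


def write_chunk(store, key, data, fill_value, write_empty_chunks):
    """Store a chunk, skipping all-fill chunks when write_empty_chunks=False."""
    if not write_empty_chunks and all(v == fill_value for v in data):
        # An "empty" chunk is not stored; any stale copy is dropped.
        store.pop(key, None)
    else:
        store[key] = list(data)


def read_chunk(store, key, chunk_size, fill_value, fill_missing_chunks):
    if key in store:
        return list(store[key])
    elif fill_missing_chunks:
        return [fill_value] * chunk_size
    else:
        raise MissingChunkError(f"chunk {key!r} is missing")


store = {}
write_chunk(store, "c/0", [0, 0], fill_value=0, write_empty_chunks=False)
stored_when_skipping = "c/0" in store

try:
    read_chunk(store, "c/0", 2, 0, fill_missing_chunks=False)
    strict_read_raised = False
except MissingChunkError:
    strict_read_raised = True

# Writing empty chunks explicitly makes the strict read succeed.
write_chunk(store, "c/0", [0, 0], fill_value=0, write_empty_chunks=True)
strict_read_after_write = read_chunk(store, "c/0", 2, 0, fill_missing_chunks=False)
```

Under these assumptions, fill_missing_chunks=False is only safe to use together with write_empty_chunks=True, which is one reason to configure the two as a pair.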
