Add block-transposed multi-vector representation for SIMD-friendly layouts#805
Add block-transposed multi-vector representation for SIMD-friendly layouts#805suri-kumkaran wants to merge 10 commits intomainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR introduces a new multi-vector matrix representation (BlockTransposed<T, GROUP, PACK>) intended to replace/extend the prior kmeans-specific BlockTranspose layout and enable SIMD-friendly access patterns (including optional packed column interleaving).
Changes:
- Added
multi_vector::BlockTransposedrepresentation with row proxy types, block views, constructors, and extensive tests. - Migrated kmeans (lloyds, kmeans++) and product-quantization training/pivot codepaths to use
Mat<BlockTransposed<...>>instead ofBlockTranspose. - Removed the old
BlockTransposeimplementation fromalgorithms::kmeans::commonand stopped re-exporting it fromalgorithms::kmeans.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| diskann-quantization/src/product/train.rs | Switches PQ training to build/use Mat<BlockTransposed<...>> for SIMD-friendly kmeans operations. |
| diskann-quantization/src/product/tables/transposed/pivots.rs | Updates pivot chunk storage to Mat<BlockTransposed<...>> and adjusts group-size/construction calls. |
| diskann-quantization/src/multi_vector/mod.rs | Adds the new block_transposed module and re-exports its public types. |
| diskann-quantization/src/multi_vector/matrix.rs | Makes some internals pub(crate) to support the new representation’s accessors/constructors. |
| diskann-quantization/src/multi_vector/block_transposed.rs | New core implementation of block-transposed (+ packed) representation with row proxies, block views, constructors, and tests. |
| diskann-quantization/src/algorithms/kmeans/plusplus.rs | Replaces BlockTranspose usage with Mat<BlockTransposed<...>> and updates microkernel wiring. |
| diskann-quantization/src/algorithms/kmeans/mod.rs | Removes pub use common::BlockTranspose; export. |
| diskann-quantization/src/algorithms/kmeans/lloyds.rs | Replaces BlockTranspose usage with Mat<BlockTransposed<...>>. |
| diskann-quantization/src/algorithms/kmeans/common.rs | Removes the old BlockTranspose implementation and its tests, leaving square_norm. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #805 +/- ##
==========================================
+ Coverage 88.96% 89.03% +0.07%
==========================================
Files 442 443 +1
Lines 81906 82690 +784
==========================================
+ Hits 72868 73626 +758
- Misses 9038 9064 +26
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
hildebrandmw
left a comment
There was a problem hiding this comment.
Thanks Suryansh! I've taken a look over and left a few stylistic comments, but there are few larger ones that I will group together here:
-
We should be very careful with the unchecked arithmetic, particularly in
compute_capacity. We honestly probably want two flavors of this function, one that verifies none of the intermediate steps overflow that gets used in the constructor, and one that assumes everything has already been checked (this would be the currentcompute_capacityfunction). I would also recommend leaning in tousize::next_multiple_ofinstead of doing the operation manually. -
I'm getting more convinced that we do not want to use
Mat/MatRef/MatMutin public APIs. This forces us into an awkward situation where we need to start adding a bunch of inherent methods to these types, which makes method discoverability a little harder and is not a pattern that can be replicated outside ofdiskann-quantization. Instead, I think it would be cleaner to have thin wrappers:struct BlockTransposed<T, const GROUP: usize, const PACK: usize> { data: Mat<Repr<T, GROUP, PACK>>, } struct BlockTransposedRef<'a, T, const GROUP: usize, const PACK: usize> { data: MatRef<'a, Repr<T, GROUP, PACK>>, } // etc.
Inherent methods can be added to these at will and the generated docs for these inherent methods will be clear (since they will not be grouped in with all the other inherent methods). This will also let you safely has
as_sliceas an inherent method to avoid the manual unsafe construction in tests.Also, I'll admit a small preference for using method delegation instead of stamping out a giant macro, something like:
impl<T, const GROUP: usize, const PACK: usize> BlockTransposed<T, GROUP, PACK> { fn block(&self, block: usize) -> MatrixView<'_, T> { self.reborrow().block(block) } }
But I think that is secondary.
-
In the tests - we should really have one that calls
collecton aMat::row_iter_mut()and writes to a bunch of rows in like athread::scopedor something. Basically, something to make sure that working on multiple rows concurrently is fine (and when using Miri - that will add another level of safety).
Thanks Mark, for the amazing suggestions. I have addressed (1) and (2) and will soon add tests to cover (3). Please help with a follow up review. 05-03-2026 Update: Added tests to cover 3 as well. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
diskann-quantization/src/utils.rs
Outdated
| /// pointer suitable for storing inside a [`Mat`](super::multi_vector::matrix::Mat). | ||
| /// | ||
| /// To reclaim the memory later, reconstruct the `Box` via | ||
| /// `Box::from_raw(slice_from_raw_parts_mut(ptr, len))`. |
There was a problem hiding this comment.
The doc uses slice_from_raw_parts_mut without a path, which can be confusing in rustdoc (it’s not in the prelude). Consider spelling it as std::ptr::slice_from_raw_parts_mut to make the recovery recipe copy/pasteable as-is.
| /// `Box::from_raw(slice_from_raw_parts_mut(ptr, len))`. | |
| /// `Box::from_raw(std::ptr::slice_from_raw_parts_mut(ptr, len))`. |
hildebrandmw
left a comment
There was a problem hiding this comment.
Thank you @suri-kumkaran for the hard work - we're almost there. I have a few more asks I came across during the review process:
Coverage
A lot of the repeated methods between BlockTransposeMut and BlockTranspose aren't covered by the test, including functions with non-trivial logic like BlockTransposedMut::remainder_block_mut. I've included techniques in this review for simplifying the delegation and deduplicating the mutable method implementations, which should address most of this naturally.
As a rule of thumb: if a method purely delegates to a tested method with no additional logic, it's fine to leave uncovered (though coverage is appreciated). If it contains arithmetic, unsafe, or any branching, it needs coverage.
You can run coverage analysis locally with
cargo llvm-cov nextest --html --package diskann-quantization --cargo-profile ci
Miri and Corner Case Handling
Please run Miri before submitting:
cargo +nightly miri nextest run --package diskann-quantization --cargo-profile ci multi_vector
You'll find that test_row_view_empty triggers undefined behavior because linear_index computes a non-zero offset when ncols == 0. Additionally, there are no empty-row tests for PACK > 1 or for the mutable row views.
Given the amount of unsafe in this change, thorough corner-case testing is essential for confidence. The parameterized tests (test_block_transposed_impl, test_packed_impl) are a good foundation; consider unifying them into a single function parameterized over <T, GROUP, PACK> that systematically exercises the full API surface (construction, row views, mutable row views, block views, element access, OOB returns, padding). That way, each new (T, GROUP, PACK, nrows, ncols) combination automatically gets complete coverage.
Thanks for the amazing suggestions. I have unified testing, increased coverage and completed local miri testing. Please help with review :) |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 12 out of 12 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
hildebrandmw
left a comment
There was a problem hiding this comment.
Thanks for the back and forth - sorry this was such a long review process. Let's send it!
|
|
||
| #[test] | ||
| fn test_api_pack1_group16() { | ||
| let row_range = if cfg!(miri) { 127..128 } else { 0..128 }; |
There was a problem hiding this comment.
The Miri ranges bug me, the only hit one arbitrary "nice" point. Can we make them hit layout boundaries instead? Like 0, 1, GROUP - 1, GROUP, GROPU + 1 rows (and a few multiples there-of) and 0, 1, PACK - 1, PACK, PACK + 1 columns?
It's exactly these kinds of transition zones that can accidentally exercise UB and that Miri will protect agains.
Summary
Introduces
BlockTransposed<T, GROUP, PACK>, a newReprimplementation that stores multi-vectors in a block-transposed layout optimized for SIMD processing. This enables faster multi-vector distance computations by ensuring data is arranged for efficient vectorized access patterns.Key changes
BlockTransposed<T, GROUP, PACK>: Generic block-transposed matrix representation where groups ofGROUProws are stored transposed, with an optionalPACKfactor for column interleaving (e.g.PACK=2forvpmaddwd-style instructions).BlockTransposedRow/BlockTransposedRowMutwith indexed access, iteration, andSend/Syncsupport.new_block_transposed,from_strided,from_matrix_view— generic over anyT: Copy + Default.block()/block_mut()/remainder_block()/remainder_block_mut()returningMatrixView/MutMatrixViewwith dimensions(padded_ncols/PACK, GROUP*PACK)for direct SIMD consumption.GROUP > 0,PACK > 0,GROUP % PACK == 0enforced at monomorphization.f32,i32,u8), bounds-checking panic tests, and Miri-compatible paths.