Skip to content

GH-343: Fix BaseVariableWidthVector and BaseLargeVariableWidthVector offset buffer serialization#989

Open
Yicong-Huang wants to merge 5 commits intoapache:mainfrom
Yicong-Huang:fix/343-base-variable-width-vector-offset-buffer
Open

GH-343: Fix BaseVariableWidthVector and BaseLargeVariableWidthVector offset buffer serialization#989
Yicong-Huang wants to merge 5 commits intoapache:mainfrom
Yicong-Huang:fix/343-base-variable-width-vector-offset-buffer

Conversation

@Yicong-Huang
Copy link
Contributor

@Yicong-Huang Yicong-Huang commented Jan 27, 2026

What's Changed

Fix BaseVariableWidthVector/BaseLargeVariableWidthVector IPC serialization when valueCount is 0.

Problem

When valueCount == 0, setReaderAndWriterIndex() was setting offsetBuffer.writerIndex(0), which means readableBytes() == 0. IPC serializer uses readableBytes() to determine buffer size, so 0 bytes were written to the IPC stream. This crashes IPC readers in other libraries because Arrow spec requires offset buffer to have at least one entry [0].

This is a follow-up to #967 which fixed the same issue in ListVector/LargeListVector.

Fix

Simplify setReaderAndWriterIndex() to always use (valueCount + 1) * OFFSET_WIDTH for offset buffer's writerIndex. When valueCount == 0, this correctly sets writerIndex to OFFSET_WIDTH, ensuring offset[0] is included in serialization.

Testing

Added tests for empty VarCharVector and LargeVarCharVector verifying offset buffer has correct readableBytes().

related to #343

@github-actions

This comment has been minimized.

@Yicong-Huang
Copy link
Contributor Author

cc @viirya @jbonofre

@lidavidm lidavidm added the bug-fix PRs that fix a big. label Jan 27, 2026
Copy link
Member

@jbonofre jbonofre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the similar fix as for valueBuffer afair. It sounds good to me.

@jbonofre jbonofre added this to the 19.0.0 milestone Jan 27, 2026
@jbonofre
Copy link
Member

@Yicong-Huang sorry to be a pain, can you please rebase ? Thanks !

@Yicong-Huang
Copy link
Contributor Author

Yicong-Huang commented Jan 27, 2026

@Yicong-Huang sorry to be a pain, can you please rebase ? Thanks !

Thanks @jbonofre . I did rebase but I think the root error in this change is BaseVariableWidthVector's offsetBuffer is allocated to be 0 capacity when it is empty. Writing the offset 0 to it would be invalid. It is also not easy to increase the allocation as all tests are written against that. Could you advise on this case?

@jbonofre
Copy link
Member

@Yicong-Huang can you please rebase again ? Sorry about that. Else I can do the rebase for you.

@Yicong-Huang
Copy link
Contributor Author

@Yicong-Huang can you please rebase again ? Sorry about that. Else I can do the rebase for you.

Thanks. Just merged the latest master back in.

@jbonofre
Copy link
Member

@Yicong-Huang thanks a lot ! I would like to include this fix on Arrow Java 19.0.0 release 😄

@jbonofre
Copy link
Member

Closes #343

@Yicong-Huang
Copy link
Contributor Author

@Yicong-Huang thanks a lot ! I would like to include this fix on Arrow Java 19.0.0 release 😄

Thanks that'd be nice! I can work with you closely on fixing this. However I see it is still failing CI, I think the original issue is still persist

I did rebase but I think the root error in this change is BaseVariableWidthVector's offsetBuffer is allocated to be 0 capacity when it is empty. Writing the offset 0 to it would be invalid. It is also not easy to increase the allocation as all tests are written against that.

Do you have any suggestions?

@jbonofre
Copy link
Member

I did a new pass on the PR and I don't think it's correct.

  1. TestValueVector tests are failing. The PR changes offset buffer behavior for empty vectors, now always writes OFFSET_WIDTH bytes. In the case of empty vectors, I would expect 0 (as before).
  2. There are some memory leaks (TestDenseUnionVector.testGetFieldTypeInfo, TestStructVector.testAddChildVectorsWithDuplicatedFieldNamesForConflictPolicyReplace, TestSplitAndTransfer.testWithEmptyVector, TestUnionVector.testGetFieldTypeInfo). This is a consequence of this change as the offset buffer now allocating extra bytes that aren't being properly freed in the test cleanup.

I think the problem is that the PR changes setReaderAndWriterIndex() to always use (valueCount + 1) + OFFSET_WIDTH for the offset buffer. The text expectations need to be adjusted by +4 bytes (one 32-bit offset) for the affected cases.

Copy link
Member

@jbonofre jbonofre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my previous comment about the tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug-fix PRs that fix a big.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants