Hello,
#458 broke arm64 compilation by including an x86-specific header, <immintrin.h>, which is used to implement parallel_memcpy. When I worked around this locally by making parallel_memcpy call memcpy directly, AllreduceNewTest.TestTimeout consistently segfaulted in ~AllreduceSharedMemoryData.
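For reference, here is a minimal sketch of the kind of guard I would expect around the include. It is not the actual parallel_memcpy: the name portable_copy and the SKETCH_HAVE_X86_SIMD macro are hypothetical, and I am assuming a plain memcpy fallback is acceptable on non-x86 targets.

```cpp
#include <cstddef>
#include <cstring>

// Only pull in the x86 intrinsics header when the target actually has SSE2.
// On arm64 neither macro is defined, so <immintrin.h> is never included.
#if defined(__SSE2__) || defined(_M_X64)
#include <immintrin.h>
#define SKETCH_HAVE_X86_SIMD 1
#else
#define SKETCH_HAVE_X86_SIMD 0
#endif

// Hypothetical stand-in for illustration, not the project's parallel_memcpy.
inline void portable_copy(void* dst, const void* src, std::size_t n) {
#if SKETCH_HAVE_X86_SIMD
  auto* d = static_cast<unsigned char*>(dst);
  const auto* s = static_cast<const unsigned char*>(src);
  std::size_t i = 0;
  // Copy 16 bytes at a time with unaligned SSE2 loads and stores.
  for (; i + 16 <= n; i += 16) {
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s + i));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(d + i), v);
  }
  // Copy the remaining tail (fewer than 16 bytes), if any.
  if (i < n) {
    std::memcpy(d + i, s + i, n - i);
  }
#else
  // Non-x86 targets (e.g. arm64): fall back to the standard library.
  std::memcpy(dst, src, n);
#endif
}
```

This only addresses the compilation failure. It does not explain the segfault in ~AllreduceSharedMemoryData, which I hit even with the memcpy fallback.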
This issue will likely also block the corresponding PyTorch PR, pytorch/pytorch#172297.