In anticipation of Icelake bringing AVX512 to consumer hardware, I've come up with an alternative to the pdep
method for encoding a 32-bit 2D coordinate into a 64-bit morton code which utilizes the vpshufbitqmb
instruction of the BITALG subset of AVX512.
vpshufbitqmb
allows building a new value bit by bit by indexing into a 64-bit lane(into an AVX512 k
-register mask).
It relies on interpreting two 32-bit X and Y coordinates as a continuous 64-bit value though. It can support much higher dimensions provided you python-zip
(as in, feed in and interleave the leading bytes or so of the coordinates vector elements into each 64-bit lane before selecting bits) certain bytes of the input coordinates.
There is no throughput or latency data at the moment for this instruction or any hardware to actually test this upon but the assembly looks very promising, compared to the currently fastest pdep
method. Especially for any bulk-processing where all the constants are loaded outside of the loop.
An illustrative example of the two methods:
#pragma pack(push, 1)
union Coord2D
{
std::uint32_t u32[2]; // x, y
std::uint64_t u64;
};
#pragma pack(pop)
std::uint64_t Morton2D_AVX512( const Coord2D& coord )
{
const __m512i CoordVec = _mm512_set1_epi64(coord.u64);
// Interleave bits from 32-bit X and Y coordinate
const __mmask64 Interleave = _mm512_bitshuffle_epi64_mask(
CoordVec,+
_mm512_set_epi16(
0x20'00 + 0x01'01 * 31, 0x20'00 + 0x01'01 * 30,
0x20'00 + 0x01'01 * 29, 0x20'00 + 0x01'01 * 28,
0x20'00 + 0x01'01 * 27, 0x20'00 + 0x01'01 * 26,
0x20'00 + 0x01'01 * 25, 0x20'00 + 0x01'01 * 24,
0x20'00 + 0x01'01 * 23, 0x20'00 + 0x01'01 * 22,
0x20'00 + 0x01'01 * 21, 0x20'00 + 0x01'01 * 20,
0x20'00 + 0x01'01 * 19, 0x20'00 + 0x01'01 * 18,
0x20'00 + 0x01'01 * 17, 0x20'00 + 0x01'01 * 16,
0x20'00 + 0x01'01 * 15, 0x20'00 + 0x01'01 * 14,
0x20'00 + 0x01'01 * 13, 0x20'00 + 0x01'01 * 12,
0x20'00 + 0x01'01 * 11, 0x20'00 + 0x01'01 * 10,
0x20'00 + 0x01'01 * 9, 0x20'00 + 0x01'01 * 8,
0x20'00 + 0x01'01 * 7, 0x20'00 + 0x01'01 * 6,
0x20'00 + 0x01'01 * 5, 0x20'00 + 0x01'01 * 4,
0x20'00 + 0x01'01 * 3, 0x20'00 + 0x01'01 * 2,
0x20'00 + 0x01'01 * 1, 0x20'00 + 0x01'01 * 0
)
);
return _cvtmask64_u64(Interleave);
}
std::uint64_t Morton2D_BMI(const Coord2D& coord)
{
return _pdep_u64(static_cast<std::uint64_t>(coord.u32[0]), 0x5555555555555555)
| _pdep_u64(static_cast<std::uint64_t>(coord.u32[1]), 0xAAAAAAAAAAAAAAAA);
}
The generated assembly using gcc 9.1
with -Ofast -march=icelake-client
:
Morton2D_AVX512(Coord2D const&):
vpbroadcastq zmm0, QWORD PTR [rdi]
vpshufbitqmb k0, zmm0, ZMMWORD PTR .LC0[rip]
kmovq rax, k0
vzeroupper
ret
Morton2D_BMI(Coord2D const&):
mov eax, DWORD PTR [rdi]
movabs rdx, 6148914691236517205
pdep rax, rax, rdx
mov edx, DWORD PTR [rdi+4]
movabs rcx, -6148914691236517206
pdep rdx, rdx, rcx
or rax, rdx
ret
I have verified the implementation with Intel's emulator and I'd love to update with some actual benchmarks once I get some Icelake hardware in my hands.
Top comments (0)