Morton encoding with AVX512

#cpp #c #performance #algorithms

In anticipation of Icelake bringing AVX512 to consumer hardware, I've come up with an alternative to the pdep method for encoding a 32-bit 2D coordinate into a 64-bit morton code which utilizes the vpshufbitqmb instruction of the BITALG subset of AVX512.
vpshufbitqmb allows building a new value bit by bit by indexing into a 64-bit lane(into an AVX512 k-register mask).

It relies on interpreting two 32-bit X and Y coordinates as a continuous 64-bit value though. It can support much higher dimensions provided you python-zip(as in, feed in and interleave the leading bytes or so of the coordinates vector elements into each 64-bit lane before selecting bits) certain bytes of the input coordinates.

There is no throughput or latency data at the moment for this instruction or any hardware to actually test this upon but the assembly looks very promising, compared to the currently fastest pdep method. Especially for any bulk-processing where all the constants are loaded outside of the loop.

An illustrative example of the two methods:

#pragma pack(push, 1)
union Coord2D
{
    std::uint32_t u32[2]; // x, y
    std::uint64_t u64;
};
#pragma pack(pop)

std::uint64_t Morton2D_AVX512( const Coord2D& coord )
{
    const __m512i CoordVec = _mm512_set1_epi64(coord.u64);
    // Interleave bits from 32-bit X and Y coordinate
    const __mmask64 Interleave = _mm512_bitshuffle_epi64_mask(
        CoordVec,+
        _mm512_set_epi16(
            0x20'00 + 0x01'01 * 31, 0x20'00 + 0x01'01 * 30,
            0x20'00 + 0x01'01 * 29, 0x20'00 + 0x01'01 * 28,
            0x20'00 + 0x01'01 * 27, 0x20'00 + 0x01'01 * 26,
            0x20'00 + 0x01'01 * 25, 0x20'00 + 0x01'01 * 24,
            0x20'00 + 0x01'01 * 23, 0x20'00 + 0x01'01 * 22,
            0x20'00 + 0x01'01 * 21, 0x20'00 + 0x01'01 * 20,
            0x20'00 + 0x01'01 * 19, 0x20'00 + 0x01'01 * 18,
            0x20'00 + 0x01'01 * 17, 0x20'00 + 0x01'01 * 16,
            0x20'00 + 0x01'01 * 15, 0x20'00 + 0x01'01 * 14,
            0x20'00 + 0x01'01 * 13, 0x20'00 + 0x01'01 * 12,
            0x20'00 + 0x01'01 * 11, 0x20'00 + 0x01'01 * 10,
            0x20'00 + 0x01'01 *  9, 0x20'00 + 0x01'01 *  8,
            0x20'00 + 0x01'01 *  7, 0x20'00 + 0x01'01 *  6,
            0x20'00 + 0x01'01 *  5, 0x20'00 + 0x01'01 *  4,
            0x20'00 + 0x01'01 *  3, 0x20'00 + 0x01'01 *  2,
            0x20'00 + 0x01'01 *  1, 0x20'00 + 0x01'01 *  0
        )
    );
    return _cvtmask64_u64(Interleave);
}

std::uint64_t Morton2D_BMI(const Coord2D& coord)
{
    return _pdep_u64(static_cast<std::uint64_t>(coord.u32[0]), 0x5555555555555555)
        | _pdep_u64(static_cast<std::uint64_t>(coord.u32[1]), 0xAAAAAAAAAAAAAAAA);
}

The generated assembly using gcc 9.1 with -Ofast -march=icelake-client:

Morton2D_AVX512(Coord2D const&):
        vpbroadcastq    zmm0, QWORD PTR [rdi]
        vpshufbitqmb    k0, zmm0, ZMMWORD PTR .LC0[rip]
        kmovq   rax, k0
        vzeroupper
        ret
Morton2D_BMI(Coord2D const&):
        mov     eax, DWORD PTR [rdi]
        movabs  rdx, 6148914691236517205
        pdep    rax, rax, rdx
        mov     edx, DWORD PTR [rdi+4]
        movabs  rcx, -6148914691236517206
        pdep    rdx, rdx, rcx
        or      rax, rdx
        ret

I have verified the implementation with Intel's emulator and I'd love to update with some actual benchmarks once I get some Icelake hardware in my hands.

DEV Community

Morton encoding with AVX512

Top comments (0)

Read next

Investigating the performance of np.einsum

Unlocking improved Flutter performance; a case for Keys

Algorithms Behind JavaScript Array Methods

Building a Simple Blockchain in Golang