Introduction
In the previous post, I introduced NumPy, the Python package for numerical computation, and the SIMD optimizations implemented in it. I have since looked at the following code, which exercises the Arm Neon (Advanced SIMD) intrinsics:
#ifdef _MSC_VER
    #include <Intrin.h>
#endif
#include <arm_neon.h>

int main(void)
{
    float32x4_t v1 = vdupq_n_f32(1.0f), v2 = vdupq_n_f32(2.0f);
    /* MAXMIN */
    int ret  = (int)vgetq_lane_f32(vmaxnmq_f32(v1, v2), 0);
        ret += (int)vgetq_lane_f32(vminnmq_f32(v1, v2), 0);
    /* ROUNDING */
    ret += (int)vgetq_lane_f32(vrndq_f32(v1), 0);
#ifdef __aarch64__
    {
        float64x2_t vd1 = vdupq_n_f64(1.0), vd2 = vdupq_n_f64(2.0);
        /* MAXMIN */
        ret += (int)vgetq_lane_f64(vmaxnmq_f64(vd1, vd2), 0);
        ret += (int)vgetq_lane_f64(vminnmq_f64(vd1, vd2), 0);
        /* ROUNDING */
        ret += (int)vgetq_lane_f64(vrndq_f64(vd1), 0);
    }
#endif
    return ret;
}
With ARMv9 approaching, I think NumPy should support the improved SIMD implementation that arrives with it, the Scalable Vector Extension version 2 (SVE2).
SVE2
It is important to know that the existing implementations target fixed-width SIMD, while SVE2 is variable-width. AArch64 code will therefore need to detect, either at compile time or at runtime, whether Advanced SIMD (Neon) or SVE2 instructions should be used.
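The two detection points mentioned above can be sketched in C as follows. This is a minimal sketch, not NumPy's actual dispatch code: the names `simd_level_t`, `simd_compile_time`, `simd_runtime`, and `simd_name` are hypothetical and chosen only for illustration. The compile-time part relies on the ACLE feature macros `__ARM_FEATURE_SVE2`, `__ARM_FEATURE_SVE`, and `__ARM_NEON`; the runtime part assumes Linux on AArch64, where `getauxval` reports the kernel's hardware capability bits. On any other platform the sketch simply falls back to the compile-time answer.

```c
#include <assert.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>   /* getauxval, AT_HWCAP, AT_HWCAP2 */
#include <asm/hwcap.h>  /* HWCAP_ASIMD, HWCAP_SVE, HWCAP2_SVE2 */
#endif

/* Hypothetical ordering of SIMD levels, weakest to strongest. */
typedef enum { SIMD_NONE = 0, SIMD_NEON, SIMD_SVE, SIMD_SVE2 } simd_level_t;

/* Compile time: what the compiler was allowed to emit (e.g. via -march). */
static simd_level_t simd_compile_time(void)
{
#if defined(__ARM_FEATURE_SVE2)
    return SIMD_SVE2;
#elif defined(__ARM_FEATURE_SVE)
    return SIMD_SVE;
#elif defined(__ARM_NEON)
    return SIMD_NEON;
#else
    return SIMD_NONE;
#endif
}

/* Runtime: what the CPU we are running on reports (Linux/AArch64 only). */
static simd_level_t simd_runtime(void)
{
#if defined(__linux__) && defined(__aarch64__)
    unsigned long hwcap  = getauxval(AT_HWCAP);
    unsigned long hwcap2 = getauxval(AT_HWCAP2);
    (void)hwcap;
    (void)hwcap2;
#ifdef HWCAP2_SVE2
    if (hwcap2 & HWCAP2_SVE2) return SIMD_SVE2;
#endif
#ifdef HWCAP_SVE
    if (hwcap & HWCAP_SVE) return SIMD_SVE;
#endif
#ifdef HWCAP_ASIMD
    if (hwcap & HWCAP_ASIMD) return SIMD_NEON;
#endif
    return SIMD_NONE;
#else
    /* Assumption: no runtime query available here, so reuse the
       compile-time answer. */
    return simd_compile_time();
#endif
}

/* Human-readable name for a level, for logging. */
static const char *simd_name(simd_level_t level)
{
    static const char *const names[] = { "none", "neon", "sve", "sve2" };
    assert(level >= SIMD_NONE && level <= SIMD_SVE2);
    return names[level];
}
```

A caller would typically print `simd_name(simd_runtime())` once at start-up and pick a kernel accordingly. Compile-time detection is cheaper but ties the binary to one target, while the runtime check lets a single binary choose between a Neon and an SVE2 code path.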
SVE2 follows on from the Neon architecture, whose instruction set has a fixed 128-bit vector length. SVE2 is an extension to AArch64 and a superset of SVE and Neon that allows for flexible vector length implementations. SVE improves the suitability of the architecture for High Performance Computing (HPC) applications, which process very large quantities of data. In particular, SVE2 can be implemented with vector lengths from 128 bits up to 2048 bits, in 128-bit increments.
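To make the vector-length range concrete, here is a small portable C sketch of the arithmetic involved. It uses no SVE intrinsics; the helper names `sve_vl_is_valid`, `sve_lanes`, and `sve_loop_iterations` are hypothetical and exist only for this illustration. It checks whether a length is one of the multiples of 128 bits described above, computes how many elements fit in one vector, and counts how many iterations a loop over `n` elements would need at that length.

```c
#include <assert.h>
#include <stddef.h>

/* 1 if vl_bits is a multiple of 128 in the range 128..2048, else 0. */
static int sve_vl_is_valid(unsigned vl_bits)
{
    return vl_bits >= 128 && vl_bits <= 2048 && vl_bits % 128 == 0;
}

/* Number of elements of elem_bits each that fit in one vector. */
static unsigned sve_lanes(unsigned vl_bits, unsigned elem_bits)
{
    assert(sve_vl_is_valid(vl_bits));
    assert(elem_bits == 8 || elem_bits == 16 ||
           elem_bits == 32 || elem_bits == 64);
    return vl_bits / elem_bits;
}

/* Iterations needed to cover n elements, one vector per iteration.
   The last iteration may be only partially filled. */
static size_t sve_loop_iterations(size_t n, unsigned vl_bits,
                                  unsigned elem_bits)
{
    size_t lanes = sve_lanes(vl_bits, elem_bits);
    return (n + lanes - 1) / lanes;   /* ceiling division */
}
```

For example, 1000 single-precision values take 250 iterations with 128-bit vectors (4 lanes) but only 16 iterations with 2048-bit vectors (64 lanes). The same source loop covers both cases, which is the point of writing code that does not hard-code the vector length.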
Code Example and Recommendation
The code examples below come from the official documentation.
In that documentation, the examples in Section B, Generic Vector and Matrix Operations, are very useful for updating NumPy's SIMD optimizations.
For example, it includes the following dot product of vectors with complex single-precision floating-point elements:
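#include <stdint.h>  /* int64_t */

/* Complex number with single-precision real and imaginary parts. */
typedef struct cplx_f32_t {
    float re;
    float im;
} cplx_f32_t;

/* Dot product of two complex vectors a and b of length n (no conjugation);
   the result is written to *c. */
void vecdot(int64_t n, cplx_f32_t* a, cplx_f32_t* b, cplx_f32_t* c) {
    cplx_f32_t acc;
    acc.re = 0;
    acc.im = 0;
    for (int64_t i = 0; i < n; ++i) {
        /* (a.re + i*a.im) * (b.re + i*b.im), accumulated per component */
        acc.re += a[i].re * b[i].re - a[i].im * b[i].im;
        acc.im += a[i].re * b[i].im + a[i].im * b[i].re;
    }
    c->re = acc.re;
    c->im = acc.im;
}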
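To show how a loop like this maps onto a variable-width vector unit, here is a portable C sketch that emulates the vector-length-agnostic pattern in scalar code. It is not the SVE code from the documentation and uses no SVE intrinsics: the function name `vecdot_vla_emulated`, the `lanes` parameter, and the `MAX_CPLX_LANES` constant are assumptions made for this illustration. Each outer iteration stands in for one vector operation over `lanes` complex elements, the `i + l >= n` test stands in for the predicate that masks off lanes past the end of the data, and the final loop stands in for a horizontal reduction.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 2048-bit vector / 64 bits per complex float pair. */
#define MAX_CPLX_LANES 32

typedef struct { float re; float im; } cplx_f32_t;

/* Emulated vector-length-agnostic complex dot product (no conjugation).
   lanes = number of complex elements per emulated vector. */
static void vecdot_vla_emulated(int64_t n, const cplx_f32_t *a,
                                const cplx_f32_t *b, cplx_f32_t *c,
                                int lanes)
{
    /* One accumulator per lane; unused entries stay zero. */
    cplx_f32_t acc[MAX_CPLX_LANES] = {{0.0f, 0.0f}};
    assert(lanes >= 1 && lanes <= MAX_CPLX_LANES);

    for (int64_t i = 0; i < n; i += lanes) {      /* one "vector" step */
        for (int l = 0; l < lanes; ++l) {
            if (i + l >= n)
                continue;                          /* inactive lane */
            acc[l].re += a[i + l].re * b[i + l].re
                       - a[i + l].im * b[i + l].im;
            acc[l].im += a[i + l].re * b[i + l].im
                       + a[i + l].im * b[i + l].re;
        }
    }

    /* Horizontal reduction of the per-lane partial sums. */
    c->re = 0.0f;
    c->im = 0.0f;
    for (int l = 0; l < lanes; ++l) {
        c->re += acc[l].re;
        c->im += acc[l].im;
    }
}
```

Calling it with `lanes = 2` (a 128-bit vector) or `lanes = 32` (a 2048-bit vector) walks the same data with a different stride and no separate tail loop. Because the partial sums are combined in a different order, results for general floating-point inputs can differ slightly from the scalar `vecdot` in the last bits.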