Foreword
.NET provides several types in the System.Runtime.Intrinsics namespace that let code use the CPU's SIMD instructions, which operate on several data elements in parallel.
using System.Runtime.Intrinsics;
Vector512<int> v512;
Vector256<int> v256;
Vector128<int> v128;
The number suffix (512, 256, 128) indicates the size in bits of the vector that the hardware can process in parallel.
This has a positive impact on operations that compute aggregates, especially in loops over large arrays.
To find out whether the hardware supports a given vector size, we can check the static read-only property IsHardwareAccelerated:
if (Vector256.IsHardwareAccelerated)
{
_is256 = true;
...
}
The code above checks whether 256-bit vector operations are hardware accelerated, that is, whether the JIT can compile them to SIMD instructions.
Exploring
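To make the relationship between vector width and element count concrete, here is a minimal sketch that prints, for each width, whether it is hardware accelerated and how many int lanes it holds. It assumes .NET 8, since Vector512 is not available in earlier versions; the variable name `widest` is only illustrative.

```csharp
using System;
using System.Runtime.Intrinsics;

// Count depends only on the vector width and the element type:
// 128 / 32 = 4, 256 / 32 = 8 and 512 / 32 = 16 lanes for int.
// IsHardwareAccelerated depends on the machine running the code.
Console.WriteLine($"Vector128: accelerated={Vector128.IsHardwareAccelerated}, int lanes={Vector128<int>.Count}");
Console.WriteLine($"Vector256: accelerated={Vector256.IsHardwareAccelerated}, int lanes={Vector256<int>.Count}");
Console.WriteLine($"Vector512: accelerated={Vector512.IsHardwareAccelerated}, int lanes={Vector512<int>.Count}");

// Pick the widest accelerated size, falling back to scalar code.
string widest =
    Vector512.IsHardwareAccelerated ? "Vector512" :
    Vector256.IsHardwareAccelerated ? "Vector256" :
    Vector128.IsHardwareAccelerated ? "Vector128" :
    "scalar fallback";
Console.WriteLine($"Widest accelerated path: {widest}");
```

The accelerated flags vary from machine to machine, so only the lane counts are predictable.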
Suppose we want to simultaneously calculate the maximum and minimum of a sequence of integers using Vector256.
The process consists of a loop that advances through the data in 256-bit chunks, updating the running maximum and minimum as it goes.
(T Min, T Max) MinMax256<T>(ReadOnlySpan<T> source)
where T : struct, INumber<T>
{
}
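First we initialize references to the current element, to the position just past the last element (last), and to the position one full vector before the end (to), which marks where the vectorized loop must stop.
ref T current = ref MemoryMarshal.GetReference(source);
// One past the final element of the span.
ref T last = ref Unsafe.Add(ref current, source.Length);
// One full vector before the end: the limit for the vectorized loop.
ref T to = ref Unsafe.Add(ref last, -Vector256<T>.Count);
// Note: this load requires source.Length >= Vector256<T>.Count;
// a shorter span would be read past its end.
Vector256<T> minElement = Vector256.LoadUnsafe(ref current);
Vector256<T> maxElement = minElement;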
Then we start the loop. Inside it, we load the data in 256-bit chunks by calling Vector256.LoadUnsafe:
while (Unsafe.IsAddressLessThan(ref current, ref to))
{
Vector256<T> tempElement = Vector256.LoadUnsafe(ref current);
minElement = Vector256.Min(minElement, tempElement);
maxElement = Vector256.Max(maxElement, tempElement);
current = ref Unsafe.Add(ref current, Vector256<T>.Count);
}
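We use the static Min and Max methods of Vector256 and store the results in minElement and maxElement. Both methods work lane by lane, so each lane keeps the smallest (or largest) value seen at that position.
Finally, we advance the current reference by Vector256<T>.Count elements, which is 256 bits.
Once fewer than a full vector's worth of elements remains, the loop ends. Each lane of minElement and maxElement now holds a partial result, so we reduce the lanes one by one to a single minimum and maximum: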
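To illustrate what Vector256.Min and Vector256.Max do lane by lane, here is a small sketch using two hand-made vectors of eight ints. The values and the names `a`, `b`, `lo` and `hi` are made up for illustration and are not part of the original code.

```csharp
using System;
using System.Runtime.Intrinsics;

// Two vectors with eight int lanes each (8 x 32 bits = 256 bits).
Vector256<int> a = Vector256.Create(1, 8, 3, 6, 5, 4, 7, 2);
Vector256<int> b = Vector256.Create(8, 1, 6, 3, 4, 5, 2, 7);

// Each lane of the result is the min (or max) of the matching lanes of a and b.
Vector256<int> lo = Vector256.Min(a, b); // lanes: 1, 1, 3, 3, 4, 4, 2, 2
Vector256<int> hi = Vector256.Max(a, b); // lanes: 8, 8, 6, 6, 5, 5, 7, 7

for (int i = 0; i < Vector256<int>.Count; i++)
{
    Console.WriteLine($"lane {i}: min={lo[i]}, max={hi[i]}");
}
```

Because the comparison happens per lane, the result is still a vector. That is why a final pass over the lanes is needed to obtain one scalar minimum and one scalar maximum.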
T min = minElement[0];
T max = maxElement[0];
for (int i = 1; i < Vector256<T>.Count; i++)
{
T tempMin = minElement[i];
if (tempMin < min)
{
min = tempMin;
}
T tempMax = maxElement[i];
if (tempMax > max)
{
max = tempMax;
}
}
After that we calculate the remaining elements if any:
while (Unsafe.IsAddressLessThan(ref current, ref last))
{
if (current < min)
{
min = current;
}
if (current > max)
{
max = current;
}
current = ref Unsafe.Add(ref current, 1);
}
And that's all, we return the results:
return (min, max);
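For reference, the fragments above can be assembled into one runnable sketch. The empty-input check, the scalar fallback for short spans or unsupported hardware, and the names `MinMaxSketch` and `MinMaxScalar` are additions made for this sketch and are assumptions, not part of the original code. It assumes .NET 7 or later.

```csharp
using System;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

int[] data = { 7, -3, 12, 0, 5, 42, -8, 9, 1, 4, 6 };
(int min, int max) = MinMaxSketch.MinMax256<int>(data);
Console.WriteLine($"min={min}, max={max}"); // prints "min=-8, max=42"

static class MinMaxSketch
{
    public static (T Min, T Max) MinMax256<T>(ReadOnlySpan<T> source)
        where T : struct, INumber<T>
    {
        if (source.IsEmpty)
            throw new ArgumentException("Sequence contains no elements.", nameof(source));

        // Added guard (assumption): use plain scalar code when a full
        // 256-bit load would not be accelerated, supported, or in bounds.
        if (!Vector256.IsHardwareAccelerated || !Vector256<T>.IsSupported
            || source.Length < Vector256<T>.Count)
        {
            return MinMaxScalar(source);
        }

        ref T current = ref MemoryMarshal.GetReference(source);
        ref T last = ref Unsafe.Add(ref current, source.Length);
        ref T to = ref Unsafe.Add(ref last, -Vector256<T>.Count);

        Vector256<T> minElement = Vector256.LoadUnsafe(ref current);
        Vector256<T> maxElement = minElement;

        // Vectorized loop: one full vector per iteration.
        while (Unsafe.IsAddressLessThan(ref current, ref to))
        {
            Vector256<T> tempElement = Vector256.LoadUnsafe(ref current);
            minElement = Vector256.Min(minElement, tempElement);
            maxElement = Vector256.Max(maxElement, tempElement);
            current = ref Unsafe.Add(ref current, Vector256<T>.Count);
        }

        // Reduce the per-lane results to single values.
        T min = minElement[0];
        T max = maxElement[0];
        for (int i = 1; i < Vector256<T>.Count; i++)
        {
            if (minElement[i] < min) min = minElement[i];
            if (maxElement[i] > max) max = maxElement[i];
        }

        // Scalar tail: the elements the vectorized loop did not cover.
        while (Unsafe.IsAddressLessThan(ref current, ref last))
        {
            if (current < min) min = current;
            if (current > max) max = current;
            current = ref Unsafe.Add(ref current, 1);
        }

        return (min, max);
    }

    // Hypothetical helper: straightforward single-pass scalar version.
    private static (T Min, T Max) MinMaxScalar<T>(ReadOnlySpan<T> source)
        where T : struct, INumber<T>
    {
        T min = source[0];
        T max = source[0];
        for (int i = 1; i < source.Length; i++)
        {
            if (source[i] < min) min = source[i];
            if (source[i] > max) max = source[i];
        }
        return (min, max);
    }
}
```

The fallback matters because the first LoadUnsafe call reads a full vector without any bounds check, so a span shorter than Vector256<T>.Count would be read past its end.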
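Benchmark
A quick test with BenchmarkDotNet, calculating the maximum and minimum of an array of 10,000 integers, shows the Vector256 version on .NET 8 running about 146 times faster than the LINQ version on .NET Framework 4.8. Against LINQ on .NET 8 itself (1,228.518 ns), the gap is about 1.5 times.
Ryzen 7 1700, 1 CPU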
.NET SDK=8.0.100-rc.1.23455.8
Method | Mean (ns) |
---|---|
MinMaxLinq .NET Framework 4.8 | 118,675.226 |
MinMaxSimd .NET 8.0 | 808.150 |
Farewell
All the code, along with a more elaborate example, is hosted on GitHub. Be happy and love your family.
NetDefender / SimdIteration
SIMD tests
Simd Iteration
Test SIMD 512, 256, 128 registers for fast aggregate calculations.
Unfortunately my hardware doesn't support Vector512.
Anyway, the performance improvement is mindblowing.
Important
net8 is about 146 times faster than net48 at calculating the Min and Max at the same time!
Results
- BenchmarkDotNet=v0.13.5, OS=Windows 10 (10.0.19044.3086/21H2/November2021Update)
- AMD Ryzen 7 1700, 1 CPU, 16 logical and 8 physical cores
- .NET SDK=8.0.100-rc.1.23455.8
- [Host] : .NET 8.0.0 (8.0.23.41904), X64 RyuJIT AVX2
- .NET 7.0 : .NET 7.0.11 (7.0.1123.42427), X64 RyuJIT AVX2
- .NET 8.0 : .NET 8.0.0 (8.0.23.41904), X64 RyuJIT AVX2
- .NET Framework 4.8 : .NET Framework 4.8 (4.8.4644.0), X64 RyuJIT VectorSize=256
Method | Runtime | Size | Mean | Allocated |
---|---|---|---|---|
MinMaxLinq | .NET Framework 4.8 | 10000 | 118,675.226 ns | 65 B |
MinMaxLinq | .NET 7.0 | 10000 | 2,350.046 ns | - |
MinMaxLinq | .NET 8.0 | 10000 | 1,228.518 ns | - |
MinMaxSimd | .NET 7.0 | 10000 | 834.291 ns | - |
MinMaxSimd | .NET 8.0 | 10000 | 808.150 ns | - |
References
System.Runtime.Intrinsics work planned for .NET 8 #79005
This is a work in progress as we develop our .NET 8 plans. This list is expected to change throughout the release cycle according to ongoing planning and discussions, with possible additions and subtractions to the scope.
Summary
During .NET 8, we will be focusing on AVX-512, an effort that includes the addition of a new intrinsic type Vector512 as well as Vector<T> improvements. Beyond that major theme, we will invest in quality, enhancements and new APIs. This is an ambitious set of work, so it's likely that several of the items below will be pushed out beyond .NET 8. It is also likely additional items will be added throughout the year.
Planned for .NET 8
AVX-512
- [ ] https://github.com/dotnet/runtime/issues/63331
- [x] https://github.com/dotnet/runtime/issues/73262
- [ ] https://github.com/dotnet/runtime/issues/73604
- [x] https://github.com/dotnet/runtime/issues/74613
- [x] https://github.com/dotnet/runtime/issues/74813
- [ ] https://github.com/dotnet/runtime/issues/76244
- [ ] https://github.com/dotnet/runtime/issues/76579
- [x] https://github.com/dotnet/runtime/issues/76593
Quality
- [ ] https://github.com/dotnet/runtime/issues/64409
- [ ] https://github.com/dotnet/runtime/issues/64451