Here is a quick summary of what happened last month. I will try to write a recap every month.
February Summary
We poured a lot of work into improving the encoding speed. Here are some details of the journey:
- Analyze our memory access patterns and improve the layout and the update strategy of a structure accessed a lot in our hottest code-path.
- Parallelize one of the remaining bottlenecks, improving the average thread usage and, with it, both speed and latency.
- Add the temporal rdo lookahead to our speed levels, measure its quality-vs-speed impact and retune the levels accordingly.
The benchmarks are prepared using speed-levels-rs.
The encoder is using the following settings:
--threads 16 --tiles 16 -l 100 <file> -o <encoded> -s <level>
The source file is Bosphorus from the ultravideo test sequences. The 1080p 10bit version is the 4k 10bit version scaled down, since a 1080p 10bit version is not available on the website.
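To make the settings above a bit more concrete, here is a minimal sketch of how the argument list could be generated for each speed level. It is not how speed-levels-rs actually does it: the `encoder_args` helper and the `input.y4m` and `encoded-s<level>.ivf` file names are made up for illustration, and the 0 to 10 range is assumed from the speed levels discussed in this post.

```rust
/// Build the encoder argument list for one speed level.
/// `input` and `output` are hypothetical paths standing in for
/// <file> and <encoded> in the settings line.
fn encoder_args(input: &str, output: &str, level: u8) -> Vec<String> {
    vec![
        "--threads".to_string(),
        "16".to_string(),
        "--tiles".to_string(),
        "16".to_string(),
        "-l".to_string(),
        "100".to_string(),
        input.to_string(),
        "-o".to_string(),
        output.to_string(),
        "-s".to_string(),
        level.to_string(),
    ]
}

fn main() {
    // Assumption: speed levels run from 0 to 10.
    for level in 0..=10u8 {
        // Hypothetical output name, one file per speed level.
        let output = format!("encoded-s{}.ivf", level);
        let args = encoder_args("input.y4m", &output, level);
        // Print the arguments instead of spawning the encoder,
        // so the sketch runs without any binary installed.
        println!("{}", args.join(" "));
    }
}
```

Only the speed level and the output name change between runs, so timing differences can be attributed to the level itself.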
Overall our aarch64 support is getting fairly good, but there is still a lot of room for improvement on 8bit.
On the other hand, it has 10bit optimizations that aren't yet available for x86_64. Help in improving our SIMD coverage is very welcome :)
Digging deeper
x86_64
As expected, the memory layout optimization that happened between p20210209 and p20210216 had the largest impact on speed levels 0 and 1, while optimizing and tuning the temporal rdo lookahead computation had the largest impact on speed levels 9 and 10.
The x86_64 10bit encoding is behaving similarly. Our SIMD support for it received a large boost in January and there is an ongoing effort to improve it even further in March.
Aarch64
The impact of the optimizations on aarch64 has been more radical, with a fairly large relative improvement on speed 10.
The 10bit boost is not as extreme, but still substantial.
I tested on a few different aarch64 systems to see whether their behavior differs much.
The Apple M1 is fairly different, but that's something I would expect. I will probably talk a bit more about it in other blog posts.
Coming next
We already landed additional SIMD for both x86_64 and aarch64, David Barr started working on improving the segment selection, and I have finally come up with an internal architecture that would give us better thread pool usage without impacting the overall latency much.
March is going to be exciting.