This is the second story of a specific optimization that gave a significant speed-up in rav1e. For the first one go here; for the next one go here.
For readers unfamiliar with AV1 and rav1e, we will start with some background. If you already read the first article you may skip ahead to the implementation details.
The Alliance for Open Media is an effort founded by Google, Cisco and Mozilla, with many other companies joining them over the past few years. The Alliance for Open Media AV1 video codec is the first to be released as a result of this effort. We will refer to it as AV1.
rav1e is an AV1 encoder written primarily in the Rust programming language; its goal is to be safe and fast.
Architecture of an AV1 encoder
AV1 has a conventional block transform video coding architecture with motion compensation. A video is a sequence of frames which are divided into one or more tiles. Tiles are divided into large squares called superblocks -- either 64 or 128 pixels in width and height for AV1. Superblocks are partitioned recursively into smaller square or rectangular blocks.
Each block may be predicted by neighboring blocks in the same tile or by a motion projection of regions in other frames. There are many modes to choose from for both of these cases. The core of the video encoder is deciding how to partition the blocks, which prediction mode and reference to use for each block and how to transform the difference. The following diagram illustrates the range of techniques available in AV1.
Rate-distortion optimization (RDO)
The process for guiding encoding choices is called rate-distortion optimization, commonly abbreviated as RDO. For each set of choices there is a trade-off between the number of bits needed to describe the decisions and how much the decoded frame is distorted compared to the original. The ratio of this trade-off can be estimated for a given quality range.
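As a minimal sketch of the idea (illustrative names, not rav1e's actual API): each candidate choice has a distortion D and a rate R, and the encoder picks the candidate minimizing the Lagrangian cost J = D + λ·R, where λ encodes the current quality/rate trade-off.

```rust
// Hypothetical sketch of a rate-distortion decision: pick the candidate
// minimizing J = D + lambda * R.
struct Candidate {
  name: &'static str,
  distortion: f64, // e.g. sum of squared errors vs. the source block
  rate: f64,       // estimated bits needed to signal this choice
}

fn best_mode(candidates: &[Candidate], lambda: f64) -> &Candidate {
  candidates
    .iter()
    .min_by(|a, b| {
      let ja = a.distortion + lambda * a.rate;
      let jb = b.distortion + lambda * b.rate;
      ja.partial_cmp(&jb).unwrap()
    })
    .unwrap()
}

fn main() {
  let candidates = [
    Candidate { name: "cheap_mode", distortion: 1200.0, rate: 8.0 },
    Candidate { name: "accurate_mode", distortion: 900.0, rate: 40.0 },
  ];
  // A low lambda (high target quality) favors the accurate mode...
  println!("{}", best_mode(&candidates, 4.0).name);
  // ...while a high lambda (low target bitrate) favors the cheap one.
  println!("{}", best_mode(&candidates, 100.0).name);
}
```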
Temporal RDO
The RDO logic is normally applied over a small set of frames within a larger group. We call the smaller group sub-GOP and the larger group GOP.
The temporal RDO analyzes a larger set of frames to find areas that recur beyond the single sub-GOP, and computes a bias that guides the normal RDO decisions to invest more bits in the blocks that will matter more in the future.
The idea of temporal RDO is to compress objects well-predicted in future frames with higher quality. For example, a car moving across the screen can be encoded finely on the first frame, which will improve its visual quality on all subsequent frames.
On the other hand, an object that soon disappears can be encoded coarsely to save bits. This can also apply to multi-view images: spend more bits on objects visible in multiple views and less bits on objects visible in only a single view.
-- Ivan Molodetskikh
The temporal RDO needs to see moderately far in the future in order to be effective.
Every time a decision is made for a block, it has to propagate -- both within the frame and over all the past ones.
This caused its implementation to be one of the largest bottlenecks in the encoder:
- As with any lookahead, it adds latency, potentially a lot of it.
- The update process is inherently serial over the frames.
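To see why the update is inherently serial over frames, consider this simplified sketch (hypothetical structure, not rav1e's actual code): a frame's importances must be final before they can be forwarded to the frame it references, so the frames have to be walked newest-to-oldest.

```rust
// Hypothetical sketch: `importances[f][b]` is the importance of block `b`
// in frame `f`, and `refs[f]` is the index of the frame that `f` predicts
// from. Each block forwards a share of its importance backwards along the
// reference chain, which forces a serial, reverse walk over the frames.
fn propagate(importances: &mut [Vec<f64>], refs: &[Option<usize>]) {
  for frame in (0..importances.len()).rev() {
    if let Some(r) = refs[frame] {
      for b in 0..importances[frame].len() {
        // Forward half of this block's importance to the co-located
        // block in its reference frame (the 0.5 share is arbitrary).
        let share = importances[frame][b] * 0.5;
        importances[r][b] += share;
      }
    }
  }
}

fn main() {
  // Three frames of two blocks each; frame 2 references frame 1,
  // which references frame 0.
  let mut importances = vec![vec![1.0; 2]; 3];
  propagate(&mut importances, &[None, Some(0), Some(1)]);
  // Frame 0 accumulates importance from the whole chain.
  assert!((importances[0][0] - 1.75).abs() < 1e-9);
}
```

The per-block analysis feeding this update has no such ordering constraint, which is the opening exploited below.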
Temporal RDO implementation
In rav1e the bulk of the code implementing it lives in compute_block_importances().
Because it is completely serial, once the encoding process runs on a machine with enough threads it takes a fairly large share of the overall encoding time.
When encoding with 16 threads and 16 tiles, it takes a little over 30% of the time spent in receive_packet at speed level 10.
While the rest of the encoding workload is spread nearly evenly per tile, the computation of the block importance bias is not.
Making it faster
The update process is fairly serial if we consider the frames and the per-block updates. But the analysis itself offers good parallelism opportunities, since it happens per block.
Enter rayon
rayon is a data-parallelism crate that makes it incredibly easy to take an iterator and execute it in parallel using a pool of workers. We use it extensively in rav1e.
After a refactor to split the update loop into one pass that computes the per-block costs and one that propagates the importance biases to the neighbors, all it took was an apparently small change:
```diff
diff --git a/src/api/internal.rs b/src/api/internal.rs
index 395a66e4..b2ab16cd 100644
--- a/src/api/internal.rs
+++ b/src/api/internal.rs
@@ -21,6 +21,7 @@ use crate::rate::{
   RCState, FRAME_NSUBTYPES, FRAME_SUBTYPE_I, FRAME_SUBTYPE_P,
   FRAME_SUBTYPE_SEF,
 };
+use crate::rayon::prelude::*;
 use crate::scenechange::SceneChangeDetector;
 use crate::stats::EncoderStats;
 use crate::tiling::Area;
@@ -809,14 +810,14 @@ impl<T: Pixel> ContextInner<T> {
     let plane_org = &frame.planes[0];
     let plane_ref = &reference_frame.planes[0];
     let lookahead_intra_costs_lines =
-      fi.lookahead_intra_costs.chunks_exact(fi.w_in_imp_b);
+      fi.lookahead_intra_costs.par_chunks_exact(fi.w_in_imp_b);
     let block_importances_lines =
-      fi.block_importances.chunks_exact(fi.w_in_imp_b);
+      fi.block_importances.par_chunks_exact(fi.w_in_imp_b);
     let costs: Vec<_> = lookahead_intra_costs_lines
       .zip(block_importances_lines)
       .enumerate()
-      .flat_map(|(y, (lookahead_intra_costs, block_importances))| {
+      .flat_map_iter(|(y, (lookahead_intra_costs, block_importances))| {
         lookahead_intra_costs
           .iter()
           .zip(block_importances.iter())
```
On speed level 10, the time spent in compute_block_importances() is reduced to nearly a third.
Across the speed levels the impact is the following:
As expected, making the temporal-RDO update parallel made our quicker speed levels noticeably faster.
Since the RDO rollback optimization had a larger impact on our slower speed levels, the two compound nicely.
After this optimization we retuned the speed levels a little, changing the default rdo-lookahead depth.