Preface
This article explains the meaning of Linux's sysctl parameters for the process scheduler, along with the background knowledge needed to understand them. I don't intend to explain every parameter here, only the essential ones.
For simplicity, the descriptions in this article do not consider the following aspects of process scheduling.
- nice value
- real-time priority
This article is based on Linux kernel v5.0.
Scheduling Classes
The Linux kernel has a concept called scheduling classes. Every process running on Linux belongs to one of the scheduling classes, and each scheduling class defines how the processes belonging to it are scheduled.
Processes belong to the fair scheduling class by default. In this article, I call these processes normal processes.
On the other hand, processes called real-time processes (see later) belong to the realtime scheduling class.
In the following sections, I'll describe the meaning of the sysctl parameters for these two scheduling classes, together with a brief explanation of each class.
The sysctl parameters about the fair scheduling class
Normal processes, which belong to the fair scheduling class, are scheduled by the Completely Fair Scheduler (CFS). What CFS does is explained in the next section.
kernel.sched_latency_ns parameter
If there are two or more runnable processes, CFS divides CPU time among them as fairly as possible. Here, fair means giving each process an equal share of CPU time.
CFS has a concept called the latency target.
CFS tries to give a timeslice to every runnable process once per latency target. The timeslice of each process is (latency target) / (the number of runnable processes). For example, if the latency target is 10ms and there are two runnable processes, each of them gets 5ms per 10ms. If there are four, each gets 2.5ms per 10ms.
kernel.sched_latency_ns defines the latency target of CFS in nanoseconds. If there are multiple CPUs in the system, the kernel scales the default latency target by (1 + log2(the number of CPUs)), so the value read from kernel.sched_latency_ns on such a system already includes this factor.
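The timeslice arithmetic above can be sketched in a few lines of Python. This is only an illustration of the formula in the text, not kernel code: the function name timeslice_ns is hypothetical, and the 10ms latency target is the example value from this article rather than a value read from a real system.

```python
def timeslice_ns(latency_target_ns, num_runnable):
    """Timeslice per process: the latency target split evenly among runnable processes."""
    if num_runnable < 1:
        raise ValueError("need at least one runnable process")
    return latency_target_ns / num_runnable


# 10ms is the example latency target used in the text (an assumption, not a system value).
LATENCY_TARGET_NS = 10_000_000

for n in (2, 4):
    ms = timeslice_ns(LATENCY_TARGET_NS, n) / 1_000_000
    print(f"{n} runnable processes: {ms} ms each")
# prints:
# 2 runnable processes: 5.0 ms each
# 4 runnable processes: 2.5 ms each
```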
kernel.sched_min_granularity_ns parameter
What happens when there are very many runnable processes? For example, if the latency target is 10ms and there are 100 runnable processes, does each process get a timeslice of just 100us? That would be too short, because the cost of context switching would become too high compared to the useful work done in each timeslice.
To prevent this problem, the timeslice is guaranteed to be equal to or longer than the value of the kernel.sched_min_granularity_ns parameter. The unit of this parameter is nanoseconds. Please note that when there are enough runnable processes for this guarantee to take effect, the latency target is stretched to kernel.sched_min_granularity_ns * (the number of runnable processes).
As with the latency target, if there are multiple CPUs in the system, the kernel scales the default guaranteed timeslice by (1 + log2(the number of CPUs)), so the value read from kernel.sched_min_granularity_ns on such a system already includes this factor.
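The interaction between the latency target and the minimum granularity can be sketched as follows. This is a simplified model of the rule described above, not the kernel's actual implementation (which uses integer arithmetic and also accounts for things this article ignores, such as nice values). The function names are hypothetical, and the 1ms minimum granularity is an assumed value chosen only to make the 100-process example concrete.

```python
def effective_period_ns(latency_ns, min_granularity_ns, num_runnable):
    """Scheduling period: the latency target, stretched if it cannot fit
    one minimum-length timeslice per runnable process."""
    return max(latency_ns, min_granularity_ns * num_runnable)


def timeslice_ns(latency_ns, min_granularity_ns, num_runnable):
    """Timeslice per process under the (possibly stretched) period."""
    if num_runnable < 1:
        raise ValueError("need at least one runnable process")
    period = effective_period_ns(latency_ns, min_granularity_ns, num_runnable)
    return period / num_runnable


LATENCY_NS = 10_000_000         # 10ms, the example value from the text
MIN_GRANULARITY_NS = 1_000_000  # 1ms, an assumed value for illustration

# With 4 processes the latency target is kept; with 100 it is stretched to 100ms
# so that each process still gets at least the minimum granularity.
for n in (4, 100):
    period = effective_period_ns(LATENCY_NS, MIN_GRANULARITY_NS, n)
    slice_ns = timeslice_ns(LATENCY_NS, MIN_GRANULARITY_NS, n)
    print(f"{n} runnable: period {period} ns, timeslice {slice_ns} ns")
```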
kernel.sched_wakeup_granularity_ns parameter
Processes that have just been woken up from a sleep state tend to go back to sleep after a short period. So, in many cases, it's efficient to give CPU time to a woken-up process as soon as possible.
A typical example is a terminal emulator, which interacts directly with the user through keyboard input. When the user types something, the terminal emulator is woken up and echoes the input back. If the echo takes too long, the user experience suffers.
CFS has special logic to shorten the latency of such interactive processes. Explaining this logic in detail is a bit difficult, so I'll only say that if you decrease the kernel.sched_wakeup_granularity_ns parameter, a woken-up process becomes more likely to preempt the running process. The system's interactivity would then get better.
However, please note that there is a tradeoff between interactivity and throughput. If you set a value smaller than the default, the number of context switches would increase and the throughput would get worse.
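Before changing any of these values, it helps to record the current ones. The following is a minimal sketch that reads the three CFS parameters discussed so far. It assumes they are exposed as files under /proc/sys/kernel, as on the kernel version this article is based on; on other kernels some of these files may be missing or located elsewhere, in which case the sketch reports None for them. The function name read_sched_params is hypothetical.

```python
from pathlib import Path

# The CFS parameters covered in this article (without the "kernel." prefix).
SCHED_PARAMS = (
    "sched_latency_ns",
    "sched_min_granularity_ns",
    "sched_wakeup_granularity_ns",
)


def read_sched_params(base_dir="/proc/sys/kernel", names=SCHED_PARAMS):
    """Return {name: value in ns, or None if the file is missing or unreadable}."""
    values = {}
    for name in names:
        try:
            values[name] = int((Path(base_dir) / name).read_text().strip())
        except (OSError, ValueError):
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_sched_params().items():
        print(f"kernel.{name} = {value}")
```

The base_dir argument exists only so that the function can be pointed at a different directory, for example when testing it without touching /proc.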
The sysctl parameters about the realtime scheduling class
The realtime scheduling class is for processes that must run in preference to any normal process, in other words, any process belonging to the fair scheduling class.
As I already described, the processes belonging to the realtime scheduling class are called real-time processes. A real-time process is defined as a process whose scheduling policy is SCHED_FIFO or SCHED_RR. We can set the scheduling policy of a process with the sched_setscheduler() system call.
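As a rough illustration, Python exposes this system call on Linux as os.sched_setscheduler(). The sketch below checks whether a process is a real-time process and tries to turn it into one. The helper names are hypothetical, and the priority value 1 is an arbitrary assumption, since this article does not cover real-time priorities. Changing to a real-time policy normally requires elevated privileges, so the sketch returns False instead of failing when permission is denied.

```python
import os

# The two policies that make a process a real-time process.
REALTIME_POLICIES = {os.SCHED_FIFO: "SCHED_FIFO", os.SCHED_RR: "SCHED_RR"}


def is_realtime(pid=0):
    """True if the process (0 means the caller) uses SCHED_FIFO or SCHED_RR."""
    return os.sched_getscheduler(pid) in REALTIME_POLICIES


def make_realtime(pid=0, policy=os.SCHED_FIFO, priority=1):
    """Try to switch a process to a real-time policy.

    priority=1 is an assumed value for illustration only.
    Returns False if the caller lacks the required privileges.
    """
    try:
        os.sched_setscheduler(pid, policy, os.sched_param(priority))
        return True
    except PermissionError:
        return False


if __name__ == "__main__":
    print("real-time before:", is_realtime())
    print("switch succeeded:", make_realtime(policy=os.SCHED_RR))
    print("real-time after:", is_realtime())
```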
Let's assume that a real-time process A becomes runnable on a CPU on which process B, which belongs to the fair scheduling class, is running. Here A can preempt B at any time by definition. So, what happens if B is also a real-time process? It depends on the scheduling policy of B.
If B's scheduling policy is SCHED_FIFO, A can't preempt B and can run on this CPU only when B exits or goes to sleep. However, if B's scheduling policy is SCHED_RR, B has a predefined timeslice and A can preempt B after B exhausts its timeslice. If A also uses SCHED_RR, A and B then get CPU time in a round-robin manner.
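The difference between the two policies can be seen in a toy simulation. This is only a model of the behavior described above, not how the kernel is implemented. It assumes one CPU, that all processes share the same policy and the same real-time priority, and that a process never sleeps before its work is done. The function name simulate, the amounts of work, and the 100ms timeslice are all values chosen for illustration.

```python
from collections import deque


def simulate(policy, work_ms, timeslice_ms=100):
    """Return the sequence of (name, run_ms) bursts on a single CPU.

    policy:  "SCHED_FIFO" or "SCHED_RR", applied to every process
    work_ms: dict of name -> CPU time needed, in run-queue order
    """
    queue = deque(work_ms.items())
    timeline = []
    while queue:
        name, remaining = queue.popleft()
        if policy == "SCHED_FIFO":
            run = remaining                     # runs until it finishes
        else:
            run = min(remaining, timeslice_ms)  # runs for at most one timeslice
        timeline.append((name, run))
        if remaining > run:
            queue.append((name, remaining - run))  # back to the tail of the queue
    return timeline


# B is already running when A becomes runnable, so B is first in the queue.
work = {"B": 250, "A": 250}
print(simulate("SCHED_FIFO", work))
# [('B', 250), ('A', 250)]
print(simulate("SCHED_RR", work))
# [('B', 100), ('A', 100), ('B', 100), ('A', 100), ('B', 50), ('A', 50)]
```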
kernel.sched_rr_timeslice_ms parameter
This parameter sets the timeslice of real-time processes whose scheduling policy is SCHED_RR. Its unit is milliseconds.
kernel.sched_rt_period_us parameter and kernel.sched_rt_runtime_us parameter
These parameters prevent out-of-control real-time processes from monopolizing the CPU.
If a real-time process keeps running for a long time without sleeping, normal processes can't get any CPU time during that period. This can cause serious problems such as hanging the whole system. For example, let's assume a system that has only one CPU and that a real-time process A is running on it. If A hangs, the system also hangs. In addition, we can't kill this problematic process, because even launching bash to do so is prevented by it.
To prevent this kind of problem, the process scheduler has logic to limit the running time of real-time processes. In short, the total CPU time consumed by real-time processes can't exceed kernel.sched_rt_runtime_us per kernel.sched_rt_period_us. The unit of both parameters is microseconds.
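The relationship between the two parameters can be expressed as a fraction of CPU time. The sketch below is only arithmetic on the two values and does not read them from a system. The function name rt_cpu_share is hypothetical, and the example values of 950000 and 1000000 are not taken from this article; they are the values I would expect as defaults on a typical system. The sketch also assumes that a negative runtime value means the limit is disabled.

```python
def rt_cpu_share(runtime_us, period_us):
    """Fraction of each period that real-time processes may consume (0.0 to 1.0)."""
    if period_us <= 0:
        raise ValueError("period must be positive")
    if runtime_us < 0:
        return 1.0  # assumed: a negative runtime disables the limit
    return min(runtime_us / period_us, 1.0)


# Example values (assumed, see above): 0.95s of every 1s may go to real-time
# processes, which leaves 5% of CPU time for normal processes.
share = rt_cpu_share(950_000, 1_000_000)
print(f"real-time share: {share:.0%}, left for normal processes: {1 - share:.0%}")
# prints: real-time share: 95%, left for normal processes: 5%
```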
Conclusion
This article described some of the sysctl parameters of the Linux process scheduler and the basic knowledge necessary to understand them. If you're interested in this topic, please modify these parameters and run your workload to verify whether the descriptions in this article are correct. For example, the following article would help you.