Context Switch
Switching the process running on a logical CPU is called a 'context switch.' The following figure shows a context switch when Process 0 and Process 1 are present.
Context switches happen mercilessly when a process's timeslice expires, regardless of what code the process is running at that moment. Without understanding this, it is easy to assume, for example, that a function bar() invoked right after foo() always starts running immediately after foo() returns. In reality, there is no such guarantee: if the timeslice expires right after foo() finishes, bar() may only run some time later.
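This can be demonstrated with a small sketch (foo() and bar() are hypothetical do-nothing functions, timed with the standard library): even though bar() is called immediately after foo() in the source code, the wall-clock gap between the two calls can occasionally be much larger than bar()'s own cost if the scheduler switches to another process in between.

```python
import time

def foo():  # hypothetical function
    pass

def bar():  # hypothetical function
    pass

t0 = time.monotonic()
foo()
t1 = time.monotonic()
# A context switch may occur at this point if the timeslice expires;
# in that case bar() starts only after other processes have run.
bar()
t2 = time.monotonic()
print(f"foo: {t1 - t0:.9f}s, bar plus any scheduling delay: {t2 - t1:.9f}s")
```

On an idle machine both intervals are tiny, but under heavy CPU load the second interval can spike whenever a context switch lands between the two calls.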
Understanding this can provide a different perspective when a certain operation takes longer than expected to complete. Rather than rashly concluding that there must be a problem with the operation itself, one could consider the possibility that a context switch occurred during the operation and another process was running.
Performance
Meeting the system's performance requirements is important. The following indicators, for instance, are used to evaluate performance:
- Turnaround time: the time from when the system is asked to do some processing until each process finishes.
- Throughput: the number of processes completed per unit of time.
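As a worked example, with assumed numbers (the 'real' times of four CPU-bound processes and the total elapsed time are made up for illustration), the two indicators are computed like this:

```python
# Assumed 'real' times (seconds) of four load processes, plus the
# assumed total elapsed ('real') time of the whole run.
real_times = [2.0, 2.1, 2.2, 2.3]
total_elapsed = 2.5

avg_turnaround = sum(real_times) / len(real_times)  # average turnaround time
throughput = len(real_times) / total_elapsed        # processes per second

print(f"{avg_turnaround:.2f}")  # 2.15
print(f"{throughput:.2f}")      # 1.60
```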
Let's measure these values. Here, we obtain the following performance information:
- Average turnaround time: the average of the 'real' values of all the load.py processes, which simply consume CPU time.
- Throughput: the number of processes divided by the 'real' value of the whole multiload.sh run.
To obtain this information, we use the cpuperf.sh program.
- cpuperf.sh
#!/bin/bash

usage() {
    exec >&2
    echo "Usage: $0 [-m] <maximum process count>
1. Save performance information to a file named 'cpuperf.data'
 * The number of entries is <maximum process count>
 * The format of each line is '<process count> <average turnaround time [seconds]> <throughput [processes/second]>'
2. Create a graph of the average turnaround time based on the performance information and save it as 'avg-tat.jpg'
3. Similarly, create a graph of the throughput and save it as 'throughput.jpg'
The -m option is passed directly to the multiload.sh program."
    exit 1
}

measure() {
    local nproc=$1
    local opt=$2
    bash -c "time ./multiload.sh $opt $nproc" 2>&1 | grep real | sed -n -e 's/^.*0m\([.0-9]*\)s$/\1/p' | awk -v nproc=$nproc '
    BEGIN {
        sum_tat=0
    }
    (NR<=nproc) {
        sum_tat+=$1
    }
    (NR==nproc+1) {
        total_real=$1
    }
    END {
        printf("%d\t%.3f\t%.3f\n", nproc, sum_tat/nproc, nproc/total_real)
    }'
}

while getopts "m" OPT ; do
    case $OPT in
        m)
            MEASURE_OPT="-m"
            ;;
        \?)
            usage
            ;;
    esac
done
shift $((OPTIND - 1))

if [ $# -lt 1 ]; then
    usage
fi

rm -f cpuperf.data
MAX_NPROC=$1
for ((i=1;i<=MAX_NPROC;i++)) ; do
    measure $i $MEASURE_OPT >>cpuperf.data
done
./plot-perf.py $MAX_NPROC
- plot-perf.py
#!/usr/bin/python3

import sys
import plot_sched

def usage():
    print("""usage: {} <max_nproc>
  * create graphs from cpuperf.data
  * "avg-tat.jpg": average turnaround time
  * "throughput.jpg": throughput""".format(progname), file=sys.stderr)
    sys.exit(1)

progname = sys.argv[0]
if len(sys.argv) < 2:
    usage()
max_nproc = int(sys.argv[1])
plot_sched.plot_avg_tat(max_nproc)
plot_sched.plot_throughput(max_nproc)
- plot_sched.py
#!/usr/bin/python3

import os
import numpy as np
from PIL import Image
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

plt.rcParams['font.family'] = "sans-serif"
plt.rcParams['font.sans-serif'] = "TakaoPGothic"

def plot_avg_tat(max_nproc):
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    x, y, _ = np.loadtxt("cpuperf.data", unpack=True)
    ax.scatter(x, y, s=1)
    ax.set_xlim([0, max_nproc+1])
    ax.set_xlabel("# of processes")
    ax.set_ylim(0)
    ax.set_ylabel("average turnaround time [s]")
    # Save as png and convert it to jpg to bypass the following bug:
    # https://bugs.launchpad.net/ubuntu/+source/matplotlib/+bug/1897283?comments=all
    pngfilename = "avg-tat.png"
    jpgfilename = "avg-tat.jpg"
    fig.savefig(pngfilename)
    Image.open(pngfilename).convert("RGB").save(jpgfilename)
    os.remove(pngfilename)

def plot_throughput(max_nproc):
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    x, _, y = np.loadtxt("cpuperf.data", unpack=True)
    ax.scatter(x, y, s=1)
    ax.set_xlim([0, max_nproc+1])
    ax.set_xlabel("# of processes")
    ax.set_ylim(0)
    ax.set_ylabel("throughput [process/s]")
    # Save as png and convert it to jpg to bypass the following bug:
    # https://bugs.launchpad.net/ubuntu/+source/matplotlib/+bug/1897283?comments=all
    pngfilename = "throughput.png"
    jpgfilename = "throughput.jpg"
    fig.savefig(pngfilename)
    Image.open(pngfilename).convert("RGB").save(jpgfilename)
    os.remove(pngfilename)
- multiload.sh
#!/bin/bash

MULTICPU=0
PROGNAME=$0
SCRIPT_DIR=$(cd $(dirname $0) && pwd)

usage() {
    exec >&2
    echo "usage: $PROGNAME [-m] <# of processes>
Run load-testing processes, which just consume CPU time, and show the elapsed time after waiting for all of them to finish.
These processes run on one CPU by default.
options:
  -m: allow the load-testing processes to run on arbitrary CPUs"
    exit 1
}

while getopts "m" OPT ; do
    case $OPT in
        m)
            MULTICPU=1
            ;;
        \?)
            usage
            ;;
    esac
done
shift $((OPTIND - 1))

if [ $# -lt 1 ] ; then
    usage
fi

CONCURRENCY=$1
if [ $MULTICPU -eq 0 ] ; then
    # Pin this shell (and therefore its children) to logical CPU 0.
    taskset -p -c 0 $$ >/dev/null
fi

for ((i=0;i<CONCURRENCY;i++)) ; do
    time "${SCRIPT_DIR}/load.py" &
done

for ((i=0;i<CONCURRENCY;i++)) ; do
    wait
done
- load.py
#!/usr/bin/python3

# Consume CPU time by simply looping.
NLOOP = 100000000
for _ in range(NLOOP):
    pass
First, we'll show the results of limiting the load processes to one logical CPU with the maximum process count set to 4, i.e., of executing ./cpuperf.sh 4.
We can see that increasing the number of processes beyond the number of available logical CPUs only lengthens the average turnaround time, and it does not improve the throughput. If you continue to increase the number of processes, the context switches caused by the scheduler gradually increase the average turnaround time further and decrease the throughput. From a performance perspective, once the CPU resources are fully utilized, simply adding more processes gains nothing.
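One way to observe this effect directly on Linux is to watch the system-wide context-switch counter while burning CPU time. This sketch reads the cumulative 'ctxt' value from /proc/stat (it assumes a Linux /proc filesystem; the loop length is arbitrary):

```python
def read_ctxt():
    # The 'ctxt' line in /proc/stat is the cumulative number of
    # context switches across all CPUs since boot.
    with open("/proc/stat") as f:
        for line in f:
            if line.startswith("ctxt"):
                return int(line.split()[1])
    raise RuntimeError("ctxt line not found in /proc/stat")

before = read_ctxt()
for _ in range(5_000_000):  # consume some CPU time
    pass
after = read_ctxt()
print(f"context switches during the loop: {after - before}")
```

When many CPU-bound processes share the same logical CPU, this counter climbs noticeably faster during the loop.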
Let's dig a little deeper into the turnaround time. Suppose you have a web application that does the following processing on your system:
- Receives requests from users over the Internet.
- Generates HTML files according to the request.
- Sends the results back to the users over the Internet.
If a new request arrives while the load on the logical CPUs is already high, its turnaround time gets longer and longer. From the user's point of view, this turnaround time is the response time of the web application, so a long turnaround time directly compromises the user experience. For a system that prioritizes response performance, it is therefore necessary to keep each machine's CPU usage lower than for a system that prioritizes throughput.
Next, we will collect data for the case where all logical CPUs can be used. The number of logical CPUs can be obtained with the grep -c processor /proc/cpuinfo command.
grep -c processor /proc/cpuinfo
8
In the author's environment, there are 8 logical CPUs because the CPU has 4 cores, each with 2 SMT threads.
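The same number can also be obtained from Python's standard library, which may be convenient inside scripts like the ones above:

```python
import os

# os.cpu_count() reports the number of logical CPUs, i.e. the same
# count as 'grep -c processor /proc/cpuinfo' on Linux.
print(os.cpu_count())
```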
In this experiment, if the system has SMT enabled, we will disable SMT, as shown below. The reason for disabling it will be discussed in another article.
cat /sys/devices/system/cpu/smt/control # If the output is 'on', SMT is enabled. If the file does not exist, the CPU does not support SMT in the first place.
on
echo off >/sys/devices/system/cpu/smt/control
cat /sys/devices/system/cpu/smt/control
off
grep -c processor /proc/cpuinfo
4
In this state, we collect performance information with the maximum number of processes set to 8, i.e., by executing ./cpuperf.sh -m 8. The results are shown in the following figure.
You can see that the average turnaround time gradually increases until the number of processes equals the number of logical CPUs (4 in this case). However, after that, it increases significantly.
Next, the throughput improves until the number of processes equals the number of logical CPUs, but then it plateaus. From this, we can say the following:
- Even if a machine has many logical CPUs, its throughput improves only if a sufficient number of runnable processes exist.
- Blindly increasing the number of processes will not increase the throughput.
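A toy model reproduces the plateau (assuming 4 logical CPUs and identical CPU-bound processes each needing 1 second of CPU time; both values are assumptions for illustration): ideal throughput grows linearly with the process count and then saturates at the number of logical CPUs.

```python
NCPU = 4  # assumed number of logical CPUs

def ideal_throughput(nproc):
    # With perfect scheduling and no context-switch overhead,
    # nproc one-second jobs finish in nproc / min(nproc, NCPU) seconds,
    # so the throughput equals min(nproc, NCPU).
    total_time = nproc / min(nproc, NCPU)
    return nproc / total_time

for n in (1, 2, 4, 8, 16):
    print(n, ideal_throughput(n))  # saturates at 4.0 once n >= NCPU
```

Real measurements fall slightly below this ideal curve because of scheduling and context-switch overhead.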
If SMT was enabled before the experiment, it should be re-enabled as follows:
# echo on >/sys/devices/system/cpu/smt/control
The importance of parallel execution of programs is increasing year by year. This is because the approach to improving CPU performance has changed. Once upon a time, with every new generation of CPUs, one could expect a dramatic improvement in the performance per logical CPU, referred to as single-threaded performance. In this case, the processing speed kept increasing without having to modify the program at all. However, the situation has changed over the past decade or so. It has become difficult to improve single-threaded performance due to various circumstances. As a result, when a CPU generation changes, single-threaded performance no longer improves as much as it used to. Instead, the trend has been to increase the total performance of the CPU by increasing the number of CPU cores.
The kernel has also been improving scalability when increasing the number of cores, which is in line with these changing times. As times change, common sense changes, and software changes to adapt to this new common sense.
NOTE
This article is based on my book written in Japanese. Please contact me via satoru.takeuchi@gmail.com if you're interested in publishing this book's English version.