PSI is useful for monitoring resource pressure, but its callbacks sit on
many common code paths, some of which are performance critical. The
hottest one, psi_group_change, is called by both psi_task_switch and
psi_task_change, which run as part of task switch, enqueue and dequeue.
The CPU overhead of PSI therefore matters.
We ran a common hackbench test under perf using the following command:
perf record --kernel-callchains -a -g hackbench -s 512 -P -g 10 -f 30 \
-l 1000 --pipe
On a machine with 8 cores and 16GB of memory across two NUMA nodes (8GB
per node), the flame graph generated from the perf data showed PSI using
4.3% of CPU time, which is enough to have an observable impact on real
workloads.
This patchset makes several improvements to the hot path, which slightly
reduces the overhead of PSI. On the same 8 cores + 16GB setup, the CPU
usage of PSI drops from 4.3% to 3.4%, a relative reduction of about 20%.
Future patches may go further, for example by adding switches for the
different types of PSI resources.
Patch Details:
========
* Patch 1 moves the check of cpu_curr(cpu)->in_memstall out of
psi_group_change to eliminate some repeated memory accesses.
* Patch 2 adds a one-bit flag, need_psi, that tells whether PSI
accounting is needed for the cgroup. It is placed, together with
psi_flags (which currently uses only 5 bits), next to the in_memstall
bitfield so that they share a cacheline.
* Patch 3 adds a prefetch before the parent cgroups are accessed, since
the parent cgroups are always accessed in the following step.
* Patch 4 only calls record_times when the state actually changes to
save some unnecessary accesses.
* Patch 5 adds a psi_group for the root cgroup to remove an unnecessary
if condition.
* Patch 6 replaces the psi_bug variable with printk_deferred_once and
moves the tasks[NR_RUNNING] check, which is the most likely one to
hit, to the front of the if condition.
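
To illustrate the ideas behind patches 1, 3 and 4 outside the kernel,
here is a minimal userspace sketch. It is not the code from the patches:
struct group, group_change(), task_change(), run_demo() and the
"memstall" bit are made-up names for illustration, record_times() here
only counts its invocations, and __builtin_prefetch (a GCC/Clang
builtin) stands in for the kernel's prefetch().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a per-cgroup PSI group; not the kernel struct. */
struct group {
	struct group *parent;		/* NULL at the root */
	unsigned int state_mask;	/* last recorded state */
	unsigned int record_calls;	/* how often the bookkeeping ran */
};

/* Stand-in for record_times(): only counts invocations. */
void record_times(struct group *g)
{
	g->record_calls++;
}

/*
 * Update one group. in_memstall is read once by the caller and passed
 * in (patch 1 idea) instead of being re-read at every level.
 */
void group_change(struct group *g, unsigned int new_mask, bool in_memstall)
{
	if (in_memstall)
		new_mask |= 1u << 31;	/* made-up "memstall" bit */

	/* Patch 4 idea: skip the bookkeeping when nothing changed. */
	if (new_mask == g->state_mask)
		return;

	record_times(g);
	g->state_mask = new_mask;
}

/* Walk from a leaf group up to the root, as a task state change would. */
void task_change(struct group *leaf, unsigned int new_mask, bool in_memstall)
{
	for (struct group *g = leaf; g; g = g->parent) {
		/* Patch 3 idea: start fetching the parent before it is needed. */
		if (g->parent)
			__builtin_prefetch(g->parent, 1 /* write */, 3);
		group_change(g, new_mask, in_memstall);
	}
}

/* Small demo: returns how many times the leaf ran record_times(). */
unsigned int run_demo(void)
{
	struct group root = { .parent = NULL };
	struct group mid = { .parent = &root };
	struct group leaf = { .parent = &mid };

	task_change(&leaf, 0x1, false);	/* state changes on every level */
	task_change(&leaf, 0x1, false);	/* same state: bookkeeping skipped */
	task_change(&leaf, 0x3, true);	/* state changes again */

	assert(mid.record_calls == 2 && root.record_calls == 2);
	return leaf.record_calls;
}
```

Calling run_demo() from a main() walks a three-level hierarchy three
times; the second, unchanged update does no bookkeeping on any level.
Whether skipping the bookkeeping is safe in the real code depends on how
time is accounted there, which this sketch does not model.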
Thanks for reading. Comments and suggestions are very welcome!
Signed-off-by: Luka Bai <lukabai@tencent.com>
---
Luka Bai (6):
psi: move curr_in_memstall out of psi_group_change
psi: reorganize the psi members for cacheline benifits
psi: use prefetch to preread the parent groupc
psi: do not call record_times when the state is not changed
psi: add psi group for the root cgroup
psi: remove psi_bug and moves checking of NR_RUNNING ahead.
include/linux/psi.h | 2 +-
include/linux/psi_types.h | 20 +------------
include/linux/sched.h | 29 ++++++++++++++++---
kernel/cgroup/cgroup.c | 3 ++
kernel/fork.c | 10 +++++++
kernel/sched/psi.c | 71 ++++++++++++++++++++++++++++++-----------------
6 files changed, 85 insertions(+), 50 deletions(-)
---
base-commit: 972c53e0ec3abfc6f5fe2cb503640710fb23cf95
change-id: 20260512-psi_impr-f543a199f39d
Best regards,
--
Luka Bai <lukabai@tencent.com>