This is a continuation of Valentin Schneider's work posted here:
Subject: [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry
https://lore.kernel.org/lkml/20240711130004.2157737-1-vschneid@redhat.com/
Valentin has described the problem very well in the above link. We also
hit hung task problems from time to time in our environment due to cfs
quota. It is mostly visible with rwsem: when a reader holding the rwsem
is throttled, a writer that comes in has to wait, and that writer in
turn blocks all subsequent readers, causing priority inversion or even a
whole-system hang.
Changes I've made since Valentin's v3:
- Use enqueue_task_fair() and dequeue_task_fair() in cfs_rq's throttle
  and unthrottle paths;
- Get rid of the irq_work: the task work that is supposed to throttle
  the task can figure out the current situation and act accordingly, so
  there is no need for an irq_work to cancel a no longer needed task work;
- Several fixes, like taking care of task group changes, sched class
  changes etc. for throttled tasks;
- A tasks_rcu fix for this task based throttle.
Tests:
- A basic test to verify functionality like limiting a cgroup's cpu time
  and changing task group, affinity etc.
- A script that tries to mimic a large cgroup setup is used to see how
  expensive it is to unthrottle cfs_rqs and enqueue back a large number
  of tasks in hrtimer context.
The test was done on a 2-socket/384-thread AMD CPU with the following
cgroup setup: 2 first level cgroups with quota set, each having 100
child cgroups, and each child cgroup having 10 leaf child cgroups, for a
total of 2000 cgroups. In each leaf child cgroup, 10 cpu hog tasks are
created. Below are the durations (in nanoseconds) of
distribute_cfs_runtime() during a 1 minute window:
@durations:
[8K, 16K)            274 |@@@@@@@@@@@@@@@@@@@@@                                |
[16K, 32K)           132 |@@@@@@@@@@                                           |
[32K, 64K)             6 |                                                     |
[64K, 128K)            0 |                                                     |
[128K, 256K)           2 |                                                     |
[256K, 512K)           0 |                                                     |
[512K, 1M)           117 |@@@@@@@@@                                            |
[1M, 2M)             665 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2M, 4M)              10 |                                                     |
So the biggest durations fall in the 2-4ms range in this hrtimer
context. How bad is this number? I think it is acceptable, but maybe the
setup I created is not complex enough?
In older kernels where async unthrottle is not available, the largest
durations can be about 100ms+.
Patches:
The patchset is arranged to get the basic functionality done first and
then deal with special cases. I hope this can make it easier to review.
Patch1 is preparation work;
Patch2-3 provide the main functionality.
Patch2 deals with the throttle path: when a cfs_rq is to be throttled, a
task work is added to each of its tasks so that when those tasks return
to user space, the task work can throttle them by dequeuing the task and
remember this by adding the task to its cfs_rq's limbo list;
Patch3 deals with the unthrottle path: when a cfs_rq is unthrottled, the
tasks on its limbo list are enqueued back (a rough sketch of both paths
follows below);
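For illustration, here is a rough sketch of the two paths (context:
kernel/sched/fair.c). It is only my minimal reading of the description
above; field and helper names like sched_throttle_work, throttle_node,
throttled_limbo_list, task_throttle_setup_work() and
unthrottle_limbo_tasks() are placeholders, not necessarily the
identifiers used in the actual patches, and locking/error handling is
omitted:

/* Runs on the task's return-to-user path via task_work. */
static void throttle_cfs_rq_work(struct callback_head *work)
{
	struct task_struct *p = container_of(work, struct task_struct,
					     sched_throttle_work);
	struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);

	/*
	 * The cfs_rq may have been unthrottled, or the task may have
	 * moved to an unthrottled cfs_rq since the work was queued, in
	 * which case there is nothing to do.
	 */
	if (!cfs_rq->throttle_count)
		return;

	/* Dequeue the task and remember it on the cfs_rq's limbo list. */
	dequeue_task_fair(task_rq(p), p, DEQUEUE_SLEEP);
	list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
}

/* Throttle path (Patch2): arm a task work instead of dequeuing the cfs_rq. */
static void task_throttle_setup_work(struct task_struct *p)
{
	init_task_work(&p->sched_throttle_work, throttle_cfs_rq_work);
	task_work_add(p, &p->sched_throttle_work, TWA_RESUME);
}

/* Unthrottle path (Patch3): enqueue the limbo tasks back. */
static void unthrottle_limbo_tasks(struct cfs_rq *cfs_rq)
{
	struct task_struct *p, *tmp;

	list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list,
				 throttle_node) {
		list_del_init(&p->throttle_node);
		enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP);
	}
}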
Patch4-5 deal with special cases.
Patch4 deals with task migration: if a task migrates to a throttled
cfs_rq, throttle work is set up for it. If, on the other hand, a task
that already has the task work added migrates to a cfs_rq that is not
throttled, its task work remains: the work handler will figure this out
and skip the throttle. This also covers setting up the throttle task
work for tasks that switch to the fair class, change task group etc.,
because all of these enqueue the task to the target cfs_rq;
Patch5 deals with the dequeue path when a task changes group, sched
class etc. A throttled task has been dequeued in fair, but task->on_rq
is still set, so when it changes task group or sched class, or its
affinity is changed, the core will first dequeue it. Since this task is
already dequeued in the fair class, this patch handles that situation
(see the sketch below).
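A sketch of the Patch5 idea (again with placeholder names):
dequeue_task_fair() would bail out early along these lines when the core
dequeues a task that the throttle work has already dequeued:

static bool dequeue_throttled_task(struct task_struct *p, int flags)
{
	/*
	 * p is off the runqueue already, but p->on_rq is still set, so a
	 * group/class/affinity change makes the core dequeue it again.
	 * All that is left to do is drop it from the limbo list so a
	 * later unthrottle does not touch a task that has moved away.
	 */
	list_del_init(&p->throttle_node);
	return true;
}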
Patch6-7 are two fixes found while testing. I can also fold them into
the earlier patches if that is preferred.
Patch6 makes CONFIG_TASKS_RCU happy. Throttled tasks get scheduled out
in task_work_run() by cond_resched(), but that is a preempt schedule and
doesn't mark a Tasks RCU quiescent state, so I add a schedule() call in
the throttle task work directly (sketched below).
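To show where that direct schedule() would sit, reusing the placeholder
handler name from the earlier sketch:

static void throttle_cfs_rq_work(struct callback_head *work)
{
	/* ... dequeue the current task and put it on the limbo list ... */

	/*
	 * The cond_resched() in task_work_run() is a preemption point and
	 * does not report a Tasks RCU quiescent state for this task; a
	 * direct schedule() is a voluntary context switch and does.
	 */
	schedule();
}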
Patch7 fixes a problem where the unthrottle path can cause throttling to
happen again while enqueuing tasks.
All the patch changelogs were written by me, so if the changelogs look
poor, it's my bad.
Comments are welcome. If you see any problems or issues with this
approach, please feel free to let me know, thanks.
Base commit: tip/sched/core, commit fd881d0a085f ("rseq: Fix segfault on
registration when rseq_cs is non-zero").
Known issues:
- !CONFIG_CFS_BANDWIDTH is not tested at all yet;
- task_is_throttled_fair() could probably be replaced with
  task_is_throttled() now, but I'll leave this to the next version;
- A cfs_rq's pelt clock is stopped on throttle while it can still have
  tasks running (e.g. some task still running in kernel space).
  It's also possible to keep its pelt clock running until its last task
  is throttled/dequeued, but that way the cfs_rq's load may decay too
  much since many of its tasks are throttled. For now, keep it simple
  and retain the current behavior.
Aaron Lu (4):
sched/fair: Take care of migrated task for task based throttle
sched/fair: Take care of group/affinity/sched_class change for
throttled task
sched/fair: fix tasks_rcu with task based throttle
sched/fair: Make sure cfs_rq has enough runtime_remaining on
unthrottle path
Valentin Schneider (3):
sched/fair: Add related data structure for task based throttle
sched/fair: Handle throttle path for task based throttle
sched/fair: Handle unthrottle path for task based throttle
include/linux/sched.h | 4 +
kernel/sched/core.c | 3 +
kernel/sched/fair.c | 380 +++++++++++++++++++++++-------------------
kernel/sched/sched.h | 3 +
4 files changed, 216 insertions(+), 174 deletions(-)
--
2.39.5
It appears this mail's message-id has changed and it became a separate
thread; I'll check what went wrong, sorry about this.
On Thu, Mar 13, 2025 at 02:20:59AM -0500, Aaron Lu wrote:
> Tests:
> - A basic test to verify functionality like limit cgroup cpu time and
> change task group, affinity etc.
Here is the basic test script:
pid=$$
CG_PATH1=/sys/fs/cgroup/1
CG_PATH2=/sys/fs/cgroup/2
[ -d $CG_PATH1 ] && sudo rmdir $CG_PATH1
[ -d $CG_PATH2 ] && sudo rmdir $CG_PATH2
sudo mkdir -p $CG_PATH1
sudo mkdir -p $CG_PATH2
sudo sh -c "echo $pid > $CG_PATH1/cgroup.procs"
echo "start nop"
~/src/misc/nop &
nop_pid=$!
cat /proc/$nop_pid/cgroup
pidstat -p $nop_pid 1 &
sleep 5
echo "limit $CG_PATH1 to 1/10"
sudo sh -c "echo 10000 100000 > $CG_PATH1/cpu.max"
sleep 5
echo "limit $CG_PATH1 to 5/10"
sudo sh -c "echo 50000 100000 > $CG_PATH1/cpu.max"
sleep 5
echo "move to $CG_PATH2"
sudo sh -c "echo $nop_pid > $CG_PATH2/cgroup.procs"
sleep 5
echo "limit $CG_PATH2 to 5/10"
sudo sh -c "echo 50000 100000 > $CG_PATH2/cpu.max"
sleep 5
echo "limit $CG_PATH2 to 1/10"
sudo sh -c "echo 10000 100000 > $CG_PATH2/cpu.max"
sleep 5
echo "set affinity to cpu3"
taskset -p 0x8 $nop_pid
sleep 5
echo "set affinity to cpu10"
taskset -p 0x400 $nop_pid
sleep 5
echo "unlimit $CG_PATH2"
sudo sh -c "echo max 100000 > $CG_PATH2/cpu.max"
sleep 5
echo "move to $CG_PATH1"
sudo sh -c "echo $nop_pid > $CG_PATH1/cgroup.procs"
sleep 5
echo "change to rr with priority 10"
sudo chrt -r -p 10 $nop_pid
sleep 5
echo "change to fifo with priority 10"
sudo chrt -f -p 10 $nop_pid
sleep 5
echo "change back to fair"
sudo chrt -o -p 0 $nop_pid
sleep 5
echo "unlimit $CG_PATH1"
sudo sh -c "echo max 100000 > $CG_PATH1/cpu.max"
sleep 5
kill $nop_pid
note: nop is a cpu hog that does: while (1) spin();
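For reference, a minimal version of that hog (my reconstruction, the
actual ~/src/misc/nop may differ); build with something like
"gcc -O0 -o nop nop.c" so the empty loop is not optimized away:

/* nop.c: trivial CPU hog used by the test scripts above. */
int main(void)
{
	for (;;)
		;	/* spin forever */
	return 0;
}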
> - A script that tried to mimic a large cgroup setup is used to see how
> bad it is to unthrottle cfs_rqs and enqueue back large number of tasks
> in hrtime context.
Here are the test scripts:
CG_ROOT=/sys/fs/cgroup
nr_level1=2
nr_level2=100
nr_level3=10
for i in `seq $nr_level1`; do
CG_LEVEL1=$CG_ROOT/$i
echo "cg_level1: $CG_LEVEL1"
[ -d $CG_LEVEL1 ] || sudo mkdir -p $CG_LEVEL1
sudo sh -c "echo +cpu > $CG_LEVEL1/cgroup.subtree_control"
for j in `seq $nr_level2`; do
CG_LEVEL2=$CG_LEVEL1/${i}_$j
echo "cg_level2: $CG_LEVEL2"
[ -d $CG_LEVEL2 ] || sudo mkdir -p $CG_LEVEL2
sudo sh -c "echo +cpu > $CG_LEVEL2/cgroup.subtree_control"
for k in `seq $nr_level3`; do
CG_LEVEL3=$CG_LEVEL2/${i}_${j}_$k
[ -d $CG_LEVEL3 ] || sudo mkdir -p $CG_LEVEL3
~/test/run_in_cg.sh $CG_LEVEL3
done
done
done
function set_quota()
{
quota=$1
for i in `seq $nr_level1`; do
CG_LEVEL1=$CG_ROOT/$i
sudo sh -c "echo $quota 100000 > $CG_LEVEL1/cpu.max"
echo "$CG_LEVEL1: `cat $CG_LEVEL1/cpu.max`"
done
}
while true; do
echo "sleep 20"
sleep 20
echo "set 20cpu quota to first level cgroups"
set_quota 2000000
echo "sleep 20"
sleep 20
echo "set 10cpu quota to first level cgroups"
set_quota 1000000
echo "sleep 20"
sleep 20
echo "set 5cpu quota to first level cgroups"
set_quota 500000
echo "sleep 20"
sleep 20
echo "unlimit first level cgroups"
set_quota max
done
run_in_cg.sh:
set -e
CG_PATH=$1
[ -z "$CG_PATH" ] && {
echo "need cgroup path"
exit
}
echo "CG_PATH: $CG_PATH"
sudo sh -c "echo $$ > $CG_PATH/cgroup.procs"
for i in `seq 10`; do
~/src/misc/nop &
done
> The test was done on a 2sockets/384threads AMD CPU with the following
> cgroup setup: 2 first level cgroups with quota setting, each has 100
> child cgroups and each child cgroup has 10 leaf child cgroups, with a
> total number of 2000 cgroups. In each leaf child cgroup, 10 cpu hog
> tasks are created there. Below is the durations of
> distribute_cfs_runtime() during a 1 minute window:
@durations:
[8K, 16K) 274 |@@@@@@@@@@@@@@@@@@@@@ |
[16K, 32K) 132 |@@@@@@@@@@ |
[32K, 64K) 6 | |
[64K, 128K) 0 | |
[128K, 256K) 2 | |
[256K, 512K) 0 | |
[512K, 1M) 117 |@@@@@@@@@ |
[1M, 2M) 665 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2M, 4M) 10 | |
The bpftrace script used to capture this:
kfunc:distribute_cfs_runtime
{
@start[args->cfs_b] = nsecs;
}
kretfunc:distribute_cfs_runtime
{
if (@start[args->cfs_b]) {
$duration = nsecs - @start[args->cfs_b];
@durations = hist($duration);
delete(@start[args->cfs_b]);
}
}
interval:s:60
{
exit();
}
> So the biggest duration is in 2-4ms range in this hrtime context. How
> bad is this number? I think it is acceptable but maybe the setup I
> created is not complex enough?
> In older kernels where async unthrottle is not available, the largest
> time range can be about 100ms+.