This is a continuation of Valentin Schneider's work posted here:
Subject: [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry
https://lore.kernel.org/lkml/20240711130004.2157737-1-vschneid@redhat.com/
Valentin has described the problem very well in the above link. We also
hit hung task problems from time to time in our environment due to cfs
quota. It is mostly visible with rwsem: when a reader holding the rwsem
is throttled, a writer that comes in has to wait, and that writer in
turn blocks all subsequent readers, causing priority inversion or even a
whole-system hang.
Changes I've made since Valentin's v3:
- Use enqueue_task_fair() and dequeue_task_fair() in cfs_rq's throttle
  and unthrottle paths;
- Get rid of the irq_work: the task work that is supposed to throttle
  the task can figure out the current situation and act accordingly, so
  there is no need for an irq_work to cancel a no longer needed task work;
- Several fixes, like taking care of task group changes, sched class
  changes etc. for throttled tasks;
- A tasks_rcu fix for this task based throttle.
Tests:
- A basic test to verify functionality like limiting a cgroup's cpu time
  and changing task group, affinity etc.
- A script that tries to mimic a large cgroup setup is used to see how
  expensive it is to unthrottle cfs_rqs and enqueue back a large number
  of tasks in hrtimer context.
The test was done on a 2-socket/384-thread AMD CPU with the following
cgroup setup: 2 first level cgroups with quota set, each having 100
child cgroups, and each child cgroup having 10 leaf child cgroups, for a
total of 2000 cgroups. In each leaf child cgroup, 10 cpu hog tasks are
created. Below are the durations (in nanoseconds) of
distribute_cfs_runtime() during a 1 minute window:
@durations:
[8K, 16K)            274 |@@@@@@@@@@@@@@@@@@@@@                                |
[16K, 32K)           132 |@@@@@@@@@@                                           |
[32K, 64K)             6 |                                                     |
[64K, 128K)            0 |                                                     |
[128K, 256K)           2 |                                                     |
[256K, 512K)           0 |                                                     |
[512K, 1M)           117 |@@@@@@@@@                                            |
[1M, 2M)             665 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2M, 4M)              10 |                                                     |
So the biggest durations fall in the 2-4ms range in this hrtimer
context. How bad is this number? I think it is acceptable, but maybe the
setup I created is not complex enough?
In older kernels where async unthrottle is not available, the largest
durations can be about 100ms+.
Patches:
The patchset is arranged to get the basic functionality done first and
then deal with special cases. I hope this can make it easier to review.
Patch1 is preparation work;
Patch2-3 provide the main functionality.
Patch2 deals with the throttle path: when a cfs_rq is to be throttled, a
task work is added to each of its tasks so that when those tasks return
to user space, the task work can throttle them by dequeuing the task and
remember this by adding the task to its cfs_rq's limbo list;
Patch3 deals with the unthrottle path: when a cfs_rq is unthrottled, the
tasks on its limbo list are enqueued back (a rough sketch of both paths
follows below);
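For illustration, here is a rough sketch of the two paths (context:
kernel/sched/fair.c). It is only my minimal reading of the description
above; field and helper names like sched_throttle_work, throttle_node,
throttled_limbo_list, task_throttle_setup_work() and
unthrottle_limbo_tasks() are placeholders, not necessarily the
identifiers used in the actual patches, and locking/error handling is
omitted:

/* Runs on the task's return-to-user path via task_work. */
static void throttle_cfs_rq_work(struct callback_head *work)
{
	struct task_struct *p = container_of(work, struct task_struct,
					     sched_throttle_work);
	struct cfs_rq *cfs_rq = cfs_rq_of(&p->se);

	/*
	 * The cfs_rq may have been unthrottled, or the task may have
	 * moved to an unthrottled cfs_rq since the work was queued, in
	 * which case there is nothing to do.
	 */
	if (!cfs_rq->throttle_count)
		return;

	/* Dequeue the task and remember it on the cfs_rq's limbo list. */
	dequeue_task_fair(task_rq(p), p, DEQUEUE_SLEEP);
	list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
}

/* Throttle path (Patch2): arm a task work instead of dequeuing the cfs_rq. */
static void task_throttle_setup_work(struct task_struct *p)
{
	init_task_work(&p->sched_throttle_work, throttle_cfs_rq_work);
	task_work_add(p, &p->sched_throttle_work, TWA_RESUME);
}

/* Unthrottle path (Patch3): enqueue the limbo tasks back. */
static void unthrottle_limbo_tasks(struct cfs_rq *cfs_rq)
{
	struct task_struct *p, *tmp;

	list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list,
				 throttle_node) {
		list_del_init(&p->throttle_node);
		enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP);
	}
}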
Patch4-5 deal with special cases.
Patch4 deals with task migration: if a task migrates to a throttled
cfs_rq, throttle work is set up for it. If, on the other hand, a task
that already has the task work added migrates to a cfs_rq that is not
throttled, its task work remains: the work handler will figure this out
and skip the throttle. This also covers setting up the throttle task
work for tasks that switch to the fair class, change task group etc.,
because all of these enqueue the task to the target cfs_rq;
Patch5 deals with the dequeue path when a task changes group, sched
class etc. A throttled task has been dequeued in fair, but task->on_rq
is still set, so when it changes task group or sched class, or its
affinity is changed, the core will first dequeue it. Since this task is
already dequeued in the fair class, this patch handles that situation
(see the sketch below).
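A sketch of the Patch5 idea (again with placeholder names):
dequeue_task_fair() would bail out early along these lines when the core
dequeues a task that the throttle work has already dequeued:

static bool dequeue_throttled_task(struct task_struct *p, int flags)
{
	/*
	 * p is off the runqueue already, but p->on_rq is still set, so a
	 * group/class/affinity change makes the core dequeue it again.
	 * All that is left to do is drop it from the limbo list so a
	 * later unthrottle does not touch a task that has moved away.
	 */
	list_del_init(&p->throttle_node);
	return true;
}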
Patch6-7 are two fixes found while testing. I can also fold them into
the earlier patches if that is preferred.
Patch6 makes CONFIG_TASKS_RCU happy. Throttled tasks get scheduled out
in task_work_run() by cond_resched(), but that is a preempt schedule and
doesn't mark a Tasks RCU quiescent state, so I add a schedule() call in
the throttle task work directly (sketched below).
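To show where that direct schedule() would sit, reusing the placeholder
handler name from the earlier sketch:

static void throttle_cfs_rq_work(struct callback_head *work)
{
	/* ... dequeue the current task and put it on the limbo list ... */

	/*
	 * The cond_resched() in task_work_run() is a preemption point and
	 * does not report a Tasks RCU quiescent state for this task; a
	 * direct schedule() is a voluntary context switch and does.
	 */
	schedule();
}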
Patch7 fixes a problem where the unthrottle path can cause throttling to
happen again while enqueuing tasks.
All the patch changelogs were written by me, so if the changelogs look
poor, it's my bad.
Comments are welcome. If you see any problems or issues with this
approach, please feel free to let me know, thanks.
Base commit: tip/sched/core, commit fd881d0a085f ("rseq: Fix segfault on
registration when rseq_cs is non-zero").
Known issues:
- !CONFIG_CFS_BANDWIDTH is not tested at all yet;
- task_is_throttled_fair() could probably be replaced with
  task_is_throttled() now, but I'll leave this to the next version;
- A cfs_rq's pelt clock is stopped on throttle while it can still have
  tasks running (e.g. some task still running in kernel space).
  It's also possible to keep its pelt clock running until its last task
  is throttled/dequeued, but that way the cfs_rq's load may decay too
  much since many of its tasks are throttled. For now, keep it simple
  and retain the current behavior.
Aaron Lu (4):
sched/fair: Take care of migrated task for task based throttle
sched/fair: Take care of group/affinity/sched_class change for
throttled task
sched/fair: fix tasks_rcu with task based throttle
sched/fair: Make sure cfs_rq has enough runtime_remaining on
unthrottle path
Valentin Schneider (3):
sched/fair: Add related data structure for task based throttle
sched/fair: Handle throttle path for task based throttle
sched/fair: Handle unthrottle path for task based throttle
include/linux/sched.h | 4 +
kernel/sched/core.c | 3 +
kernel/sched/fair.c | 380 +++++++++++++++++++++++-------------------
kernel/sched/sched.h | 3 +
4 files changed, 216 insertions(+), 174 deletions(-)
--
2.39.5
It appears this mail's message-id has changed and it became a separate
thread; I'll check what went wrong, sorry about this.
On Thu, Mar 13, 2025 at 02:20:59AM -0500, Aaron Lu wrote:
> Tests:
> - A basic test to verify functionality like limit cgroup cpu time and
> change task group, affinity etc.
Here is the basic test script:
pid=$$
CG_PATH1=/sys/fs/cgroup/1
CG_PATH2=/sys/fs/cgroup/2
[ -d $CG_PATH1 ] && sudo rmdir $CG_PATH1
[ -d $CG_PATH2 ] && sudo rmdir $CG_PATH2
sudo mkdir -p $CG_PATH1
sudo mkdir -p $CG_PATH2
sudo sh -c "echo $pid > $CG_PATH1/cgroup.procs"
echo "start nop"
~/src/misc/nop &
nop_pid=$!
cat /proc/$nop_pid/cgroup
pidstat -p $nop_pid 1 &
sleep 5
echo "limit $CG_PATH1 to 1/10"
sudo sh -c "echo 10000 100000 > $CG_PATH1/cpu.max"
sleep 5
echo "limit $CG_PATH1 to 5/10"
sudo sh -c "echo 50000 100000 > $CG_PATH1/cpu.max"
sleep 5
echo "move to $CG_PATH2"
sudo sh -c "echo $nop_pid > $CG_PATH2/cgroup.procs"
sleep 5
echo "limit $CG_PATH2 to 5/10"
sudo sh -c "echo 50000 100000 > $CG_PATH2/cpu.max"
sleep 5
echo "limit $CG_PATH2 to 1/10"
sudo sh -c "echo 10000 100000 > $CG_PATH2/cpu.max"
sleep 5
echo "set affinity to cpu3"
taskset -p 0x8 $nop_pid
sleep 5
echo "set affinity to cpu10"
taskset -p 0x400 $nop_pid
sleep 5
echo "unlimit $CG_PATH2"
sudo sh -c "echo max 100000 > $CG_PATH2/cpu.max"
sleep 5
echo "move to $CG_PATH1"
sudo sh -c "echo $nop_pid > $CG_PATH1/cgroup.procs"
sleep 5
echo "change to rr with priority 10"
sudo chrt -r -p 10 $nop_pid
sleep 5
echo "change to fifo with priority 10"
sudo chrt -f -p 10 $nop_pid
sleep 5
echo "change back to fair"
sudo chrt -o -p 0 $nop_pid
sleep 5
echo "unlimit $CG_PATH1"
sudo sh -c "echo max 100000 > $CG_PATH1/cpu.max"
sleep 5
kill $nop_pid
note: nop is a cpu hog that does: while (1) spin();
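For reference, a minimal version of that hog (my reconstruction, the
actual ~/src/misc/nop may differ); build with something like
"gcc -O0 -o nop nop.c" so the empty loop is not optimized away:

/* nop.c: trivial CPU hog used by the test scripts above. */
int main(void)
{
	for (;;)
		;	/* spin forever */
	return 0;
}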
> - A script that tried to mimic a large cgroup setup is used to see how
> bad it is to unthrottle cfs_rqs and enqueue back large number of tasks
> in hrtime context.
Here are the test scripts:
CG_ROOT=/sys/fs/cgroup
nr_level1=2
nr_level2=100
nr_level3=10
for i in `seq $nr_level1`; do
CG_LEVEL1=$CG_ROOT/$i
echo "cg_level1: $CG_LEVEL1"
[ -d $CG_LEVEL1 ] || sudo mkdir -p $CG_LEVEL1
sudo sh -c "echo +cpu > $CG_LEVEL1/cgroup.subtree_control"
for j in `seq $nr_level2`; do
CG_LEVEL2=$CG_LEVEL1/${i}_$j
echo "cg_level2: $CG_LEVEL2"
[ -d $CG_LEVEL2 ] || sudo mkdir -p $CG_LEVEL2
sudo sh -c "echo +cpu > $CG_LEVEL2/cgroup.subtree_control"
for k in `seq $nr_level3`; do
CG_LEVEL3=$CG_LEVEL2/${i}_${j}_$k
[ -d $CG_LEVEL3 ] || sudo mkdir -p $CG_LEVEL3
~/test/run_in_cg.sh $CG_LEVEL3
done
done
done
function set_quota()
{
quota=$1
for i in `seq $nr_level1`; do
CG_LEVEL1=$CG_ROOT/$i
sudo sh -c "echo $quota 100000 > $CG_LEVEL1/cpu.max"
echo "$CG_LEVEL1: `cat $CG_LEVEL1/cpu.max`"
done
}
while true; do
echo "sleep 20"
sleep 20
echo "set 20cpu quota to first level cgroups"
set_quota 2000000
echo "sleep 20"
sleep 20
echo "set 10cpu quota to first level cgroups"
set_quota 1000000
echo "sleep 20"
sleep 20
echo "set 5cpu quota to first level cgroups"
set_quota 500000
echo "sleep 20"
sleep 20
echo "unlimit first level cgroups"
set_quota max
done
run_in_cg.sh:
set -e
CG_PATH=$1
[ -z "$CG_PATH" ] && {
echo "need cgroup path"
exit
}
echo "CG_PATH: $CG_PATH"
sudo sh -c "echo $$ > $CG_PATH/cgroup.procs"
for i in `seq 10`; do
~/src/misc/nop &
done
> The test was done on a 2sockets/384threads AMD CPU with the following
> cgroup setup: 2 first level cgroups with quota setting, each has 100
> child cgroups and each child cgroup has 10 leaf child cgroups, with a
> total number of 2000 cgroups. In each leaf child cgroup, 10 cpu hog
> tasks are created there. Below is the durations of
> distribute_cfs_runtime() during a 1 minute window:
@durations:
[8K, 16K) 274 |@@@@@@@@@@@@@@@@@@@@@ |
[16K, 32K) 132 |@@@@@@@@@@ |
[32K, 64K) 6 | |
[64K, 128K) 0 | |
[128K, 256K) 2 | |
[256K, 512K) 0 | |
[512K, 1M) 117 |@@@@@@@@@ |
[1M, 2M) 665 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2M, 4M) 10 | |
The bpftrace script used to capture this:
kfunc:distribute_cfs_runtime
{
@start[args->cfs_b] = nsecs;
}
kretfunc:distribute_cfs_runtime
{
if (@start[args->cfs_b]) {
$duration = nsecs - @start[args->cfs_b];
@durations = hist($duration);
delete(@start[args->cfs_b]);
}
}
interval:s:60
{
exit();
}
> So the biggest duration is in 2-4ms range in this hrtime context. How
> bad is this number? I think it is acceptable but maybe the setup I
> created is not complex enough?
> In older kernels where async unthrottle is not available, the largest
> time range can be about 100ms+.