v4:
- Add cfs_bandwidth_used() in task_is_throttled() and remove the
  unlikely for task_is_throttled(), suggested by Valentin Schneider;
- Add a warning for a non-empty throttle_node in
  enqueue_throttled_task(), suggested by Valentin Schneider;
- Improve comments in enqueue_throttled_task(), by Valentin Schneider;
- Clear the throttled status for to-be-unthrottled tasks in
  tg_unthrottle_up();
- Change the throttled and pelt_clock_throttled fields in cfs_rq from
  int to bool, reported by LKP;
- Improve the changelog of patch 4, by Valentin Schneider.

Thanks a lot for all the reviews and tests. I hope I didn't miss any
of them, but if I did, please let me know. I've also run Jan's rt
reproducer and Songtang's stress test and didn't notice any problems.

Apply on top of sched/core, head commit 1b5f1454091e ("sched/idle:
Remove play_idle()").

v3:
- Keep a throttled cfs_rq's PELT clock running as long as it still has
  entities queued, suggested by Benjamin Segall; I've folded this
  change into patch 3;
- Rebased on top of tip/sched/core, commit 2885daf47081
  ("lib/smp_processor_id: Make migration check unconditional of SMP").

Hi Prateek, I've kept your Tested-by tag (thanks!) from v2 since I
believe this PELT clock change should not affect things much, but let
me know if you don't think that is appropriate.

Tests I've done:
- Jan's rt deadlock reproducer[1]. Without this series, I saw rcu
  stalls within 2 minutes; with this series, I did not see rcu stalls
  after 10 minutes.
- A stress test that creates a lot of pressure on the fork/exit path
  and cgroup_threadgroup_rwsem. Without this series, the test causes
  task hangs in about 5 minutes; with this series, no problem was
  found after several hours. Songtang wrote this test script and I've
  used it to verify the patches, thanks Songtang.

[1]: https://lore.kernel.org/all/7483d3ae-5846-4067-b9f7-390a614ba408@siemens.com/

Below are the previous changelogs:

v2:
- Reorganize the patchset to use a single patch for the throttle
  related changes, suggested by Chengming;
- Use check_cfs_rq_runtime()'s return value in pick_task_fair() to
  decide whether throttle task work is needed, instead of checking
  throttled_hierarchy(), suggested by Peter;
- Simplify the throttle_count check in tg_throttle_down() and
  tg_unthrottle_up(), suggested by Peter;
- Add enqueue_throttled_task() to speed up enqueueing a throttled task
  to a throttled cfs_rq, suggested by Peter;
- Address the missing detach_task_cfs_rq() for throttled tasks that
  get migrated to a new rq, pointed out by Chengming;
- Remove cond_resched_tasks_rcu_qs() in throttle_cfs_rq_work() as
  cond_resched*() is going away, pointed out by Peter.

I hope I didn't miss any comments and suggestions for v1, and if I
did, please kindly let me know, thanks!

Base: tip/sched/core commit dabe1be4e84c ("sched/smp: Use the SMP
version of double_rq_clock_clear_update()")

Cover letter of v1:

This is continued work based on Valentin Schneider's posting here:

Subject: [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry
https://lore.kernel.org/lkml/20240711130004.2157737-1-vschneid@redhat.com/

Valentin has described the problem very well in the above link and I
quote:

"
CFS tasks can end up throttled while holding locks that other,
non-throttled tasks are blocking on. For !PREEMPT_RT, this can be a
source of latency due to the throttling causing a resource
acquisition denial.
For PREEMPT_RT, this is worse and can lead to a deadlock:
o A CFS task p0 gets throttled while holding read_lock(&lock)
o A task p1 blocks on write_lock(&lock), making further readers enter
  the slowpath
o A ktimers or ksoftirqd task blocks on read_lock(&lock)

If the cfs_bandwidth.period_timer to replenish p0's runtime is
enqueued on the same CPU as one where ktimers/ksoftirqd is blocked on
read_lock(&lock), this creates a circular dependency.

This has been observed to happen with:
o fs/eventpoll.c::ep->lock
o net/netlink/af_netlink.c::nl_table_lock (after hand-fixing the above)
but can trigger with any rwlock that can be acquired in both process
and softirq contexts.

The linux-rt tree has had
  1ea50f9636f0 ("softirq: Use a dedicated thread for timer wakeups.")
which helped this scenario for non-rwlock locks by ensuring the
throttled task would get PI'd to FIFO1 (ktimers' default priority).
Unfortunately, rwlocks cannot sanely do PI as they allow multiple
readers.
"

Jan Kiszka has posted a reproducer for this PREEMPT_RT problem:
https://lore.kernel.org/r/7483d3ae-5846-4067-b9f7-390a614ba408@siemens.com/
and K Prateek Nayak has a detailed analysis of how the deadlock
happens:
https://lore.kernel.org/r/e65a32af-271b-4de6-937a-1a1049bbf511@amd.com/

To fix this issue for PREEMPT_RT and improve the latency situation for
!PREEMPT_RT, change the throttle model to be task based: when a cfs_rq
is throttled, mark its throttled status but do not remove it from the
cpu's rq. Instead, for tasks that belong to this cfs_rq, when they get
picked, add a task work to them so that when they return to user
space, they can be dequeued there. This way, throttled tasks do not
hold any kernel resources. When the cfs_rq gets unthrottled, enqueue
those throttled tasks back. (A toy sketch of this flow is included
after this cover text.)

This new throttle model has consequences, e.g. for a cfs_rq that has 3
tasks attached, when 2 tasks are throttled on their return-to-user
path while one task is still running in kernel mode, this cfs_rq is in
a partially throttled state:
- Should its PELT clock be frozen?
- Should this state be accounted into throttled_time?

For the PELT clock, I chose to keep the current behavior and freeze it
at the cfs_rq's throttle time. The assumption is that tasks running in
kernel mode should not last too long; freezing the cfs_rq's PELT clock
keeps its load and its corresponding sched_entity's weight. Hopefully,
this can result in a stable situation for the remaining running tasks
to quickly finish their jobs in kernel mode.

For throttle time accounting, according to the feedback on RFC v2,
rework throttle time accounting for a cfs_rq as follows:
- start accounting when the first task in its hierarchy gets
  throttled;
- stop accounting on unthrottle.

There was also a concern in RFC v1 about the increased duration of
(un)throttle operations. I've done some tests, and with a 2000
cgroups/20K runnable tasks setup on a 2-socket/384-CPU AMD server, the
longest duration of distribute_cfs_runtime() was in the 2ms-4ms range.
For details, please see:
https://lore.kernel.org/lkml/20250324085822.GA732629@bytedance/
For the throttle path, with Chengming's suggestion to move the task
work setup from throttle time to pick time, it's not an issue anymore.
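To make the model above concrete, here is a minimal, self-contained C
sketch of the flow: throttling is only armed at pick time and takes
effect on the return-to-user path. This is an illustration only, not
the code from these patches; the toy_* types, fields and functions are
hypothetical stand-ins for the real cfs_rq/task_struct machinery and
the task_work based implementation in kernel/sched/fair.c.

/*
 * Toy model of the task based throttle flow described above.
 * All toy_* names are illustrative only; the real patches operate on
 * cfs_rq/task_struct and arm a task_work that runs at user entry.
 */
#include <stdbool.h>
#include <stdio.h>

struct toy_cfs_rq {
	bool throttled;		/* quota exhausted, but NOT dequeued from rq */
};

struct toy_task {
	const char *name;
	bool throttle_pending;	/* "task work" armed at pick time */
	bool throttled;		/* dequeued on return to user */
};

/*
 * Pick time: instead of dequeuing the whole cfs_rq on throttle, arm
 * per-task work so the task dequeues itself when it leaves the kernel.
 */
static void pick_task(struct toy_cfs_rq *rq, struct toy_task *t)
{
	if (rq->throttled && !t->throttle_pending)
		t->throttle_pending = true;
}

/*
 * Return-to-user path: the armed work actually throttles the task, so
 * no kernel resources (e.g. rwlocks) are held while it sits throttled.
 */
static void return_to_user(struct toy_task *t)
{
	if (t->throttle_pending) {
		t->throttle_pending = false;
		t->throttled = true;
	}
}

/* Unthrottle: enqueue (here: simply un-mark) every throttled task. */
static void unthrottle(struct toy_cfs_rq *rq, struct toy_task *t)
{
	rq->throttled = false;
	t->throttled = false;
}

int main(void)
{
	struct toy_cfs_rq rq = { .throttled = true };
	struct toy_task p0 = { .name = "p0" };

	pick_task(&rq, &p0);	/* p0 keeps running in kernel mode */
	return_to_user(&p0);	/* throttle takes effect only here */
	printf("%s throttled=%d\n", p0.name, p0.throttled);
	unthrottle(&rq, &p0);
	printf("%s throttled=%d\n", p0.name, p0.throttled);
	return 0;
}

The point of the sketch is the ordering: nothing is dequeued at
throttle time, so a task holding a kernel lock keeps running until
user entry, which is what breaks the circular dependency described
above.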
Aaron Lu (2):
  sched/fair: Task based throttle time accounting
  sched/fair: Get rid of throttled_lb_pair()

Valentin Schneider (3):
  sched/fair: Add related data structure for task based throttle
  sched/fair: Implement throttle task work and related helpers
  sched/fair: Switch to task based throttle model

 include/linux/sched.h |   5 +
 kernel/sched/core.c   |   3 +
 kernel/sched/fair.c   | 458 ++++++++++++++++++++++++------------------
 kernel/sched/pelt.h   |   4 +-
 kernel/sched/sched.h  |   7 +-
 5 files changed, 280 insertions(+), 197 deletions(-)


base-commit: 1b5f1454091e9e9fb5c944b3161acf4ec0894d0d
--
2.39.5