v4:
- Add cfs_bandwidth_used() in task_is_throttled() and remove the
  unlikely for task_is_throttled(), suggested by Valentin Schneider;
- Add a warning for a non-empty throttle_node in
  enqueue_throttled_task(), suggested by Valentin Schneider;
- Improve comments in enqueue_throttled_task(), by Valentin Schneider;
- Clear the throttled status for to-be-unthrottled tasks in
  tg_unthrottle_up();
- Change the throttled and pelt_clock_throttled fields in cfs_rq from
  int to bool, reported by LKP;
- Improve the changelog of patch 4, by Valentin Schneider.

Thanks a lot for all the reviews and tests. I hope I didn't miss any
of them, but if I did, please let me know. I've also run Jan's rt
reproducer and Songtang's stress test and didn't notice any problems.

Apply on top of sched/core, head commit 1b5f1454091e ("sched/idle:
Remove play_idle()").

v3:
- Keep a throttled cfs_rq's PELT clock running as long as it still has
  entities queued, suggested by Benjamin Segall; I've folded this
  change into patch 3;
- Rebased on top of tip/sched/core, commit 2885daf47081
  ("lib/smp_processor_id: Make migration check unconditional of SMP").

Hi Prateek, I've kept your Tested-by tag (thanks!) from v2 since I
believe this PELT clock change should not affect things much, but let
me know if you don't think that is appropriate.

Tests I've done:
- Jan's rt deadlock reproducer[1]. Without this series, I saw rcu
  stalls within 2 minutes; with this series, I did not see rcu stalls
  after 10 minutes.
- A stress test that creates a lot of pressure on the fork/exit path
  and cgroup_threadgroup_rwsem. Without this series, the test causes
  task hangs in about 5 minutes; with this series, no problem was
  found after several hours. Songtang wrote this test script and I've
  used it to verify the patches, thanks Songtang.

[1]: https://lore.kernel.org/all/7483d3ae-5846-4067-b9f7-390a614ba408@siemens.com/

Below are the previous changelogs:

v2:
- Reorganize the patchset to use a single patch for the throttle
  related changes, suggested by Chengming;
- Use check_cfs_rq_runtime()'s return value in pick_task_fair() to
  decide whether throttle task work is needed, instead of checking
  throttled_hierarchy(), suggested by Peter;
- Simplify the throttle_count check in tg_throttle_down() and
  tg_unthrottle_up(), suggested by Peter;
- Add enqueue_throttled_task() to speed up enqueueing a throttled task
  to a throttled cfs_rq, suggested by Peter;
- Address the missing detach_task_cfs_rq() for throttled tasks that
  get migrated to a new rq, pointed out by Chengming;
- Remove cond_resched_tasks_rcu_qs() in throttle_cfs_rq_work() as
  cond_resched*() is going away, pointed out by Peter.

I hope I didn't miss any comments and suggestions for v1, and if I
did, please kindly let me know, thanks!

Base: tip/sched/core commit dabe1be4e84c ("sched/smp: Use the SMP
version of double_rq_clock_clear_update()")

Cover letter of v1:

This is continued work based on Valentin Schneider's posting here:

Subject: [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry
https://lore.kernel.org/lkml/20240711130004.2157737-1-vschneid@redhat.com/

Valentin has described the problem very well in the above link and I
quote:

"
CFS tasks can end up throttled while holding locks that other,
non-throttled tasks are blocking on. For !PREEMPT_RT, this can be a
source of latency due to the throttling causing a resource
acquisition denial.
For PREEMPT_RT, this is worse and can lead to a deadlock:
o A CFS task p0 gets throttled while holding read_lock(&lock)
o A task p1 blocks on write_lock(&lock), making further readers enter
  the slowpath
o A ktimers or ksoftirqd task blocks on read_lock(&lock)

If the cfs_bandwidth.period_timer to replenish p0's runtime is
enqueued on the same CPU as one where ktimers/ksoftirqd is blocked on
read_lock(&lock), this creates a circular dependency.

This has been observed to happen with:
o fs/eventpoll.c::ep->lock
o net/netlink/af_netlink.c::nl_table_lock (after hand-fixing the above)
but can trigger with any rwlock that can be acquired in both process
and softirq contexts.

The linux-rt tree has had
  1ea50f9636f0 ("softirq: Use a dedicated thread for timer wakeups.")
which helped this scenario for non-rwlock locks by ensuring the
throttled task would get PI'd to FIFO1 (ktimers' default priority).
Unfortunately, rwlocks cannot sanely do PI as they allow multiple
readers.
"

Jan Kiszka has posted a reproducer for this PREEMPT_RT problem:
https://lore.kernel.org/r/7483d3ae-5846-4067-b9f7-390a614ba408@siemens.com/
and K Prateek Nayak has a detailed analysis of how the deadlock
happens:
https://lore.kernel.org/r/e65a32af-271b-4de6-937a-1a1049bbf511@amd.com/

To fix this issue for PREEMPT_RT and improve the latency situation for
!PREEMPT_RT, change the throttle model to be task based: when a cfs_rq
is throttled, mark its throttled status but do not remove it from the
cpu's rq. Instead, for tasks that belong to this cfs_rq, when they get
picked, add a task work to them so that when they return to user
space, they can be dequeued there. This way, throttled tasks do not
hold any kernel resources. When the cfs_rq gets unthrottled, enqueue
those throttled tasks back. (A toy sketch of this flow is included
after this cover text.)

This new throttle model has consequences, e.g. for a cfs_rq that has 3
tasks attached, when 2 tasks are throttled on their return-to-user
path while one task is still running in kernel mode, this cfs_rq is in
a partially throttled state:
- Should its PELT clock be frozen?
- Should this state be accounted into throttled_time?

For the PELT clock, I chose to keep the current behavior and freeze it
at the cfs_rq's throttle time. The assumption is that tasks running in
kernel mode should not last too long; freezing the cfs_rq's PELT clock
keeps its load and its corresponding sched_entity's weight. Hopefully,
this can result in a stable situation for the remaining running tasks
to quickly finish their jobs in kernel mode.

For throttle time accounting, according to the feedback on RFC v2,
rework throttle time accounting for a cfs_rq as follows:
- start accounting when the first task in its hierarchy gets
  throttled;
- stop accounting on unthrottle.

There was also a concern in RFC v1 about the increased duration of
(un)throttle operations. I've done some tests, and with a 2000
cgroups/20K runnable tasks setup on a 2-socket/384-CPU AMD server, the
longest duration of distribute_cfs_runtime() was in the 2ms-4ms range.
For details, please see:
https://lore.kernel.org/lkml/20250324085822.GA732629@bytedance/
For the throttle path, with Chengming's suggestion to move the task
work setup from throttle time to pick time, it's not an issue anymore.
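To make the model above concrete, here is a minimal, self-contained C
sketch of the flow: throttling is only armed at pick time and takes
effect on the return-to-user path. This is an illustration only, not
the code from these patches; the toy_* types, fields and functions are
hypothetical stand-ins for the real cfs_rq/task_struct machinery and
the task_work based implementation in kernel/sched/fair.c.

/*
 * Toy model of the task based throttle flow described above.
 * All toy_* names are illustrative only; the real patches operate on
 * cfs_rq/task_struct and arm a task_work that runs at user entry.
 */
#include <stdbool.h>
#include <stdio.h>

struct toy_cfs_rq {
	bool throttled;		/* quota exhausted, but NOT dequeued from rq */
};

struct toy_task {
	const char *name;
	bool throttle_pending;	/* "task work" armed at pick time */
	bool throttled;		/* dequeued on return to user */
};

/*
 * Pick time: instead of dequeuing the whole cfs_rq on throttle, arm
 * per-task work so the task dequeues itself when it leaves the kernel.
 */
static void pick_task(struct toy_cfs_rq *rq, struct toy_task *t)
{
	if (rq->throttled && !t->throttle_pending)
		t->throttle_pending = true;
}

/*
 * Return-to-user path: the armed work actually throttles the task, so
 * no kernel resources (e.g. rwlocks) are held while it sits throttled.
 */
static void return_to_user(struct toy_task *t)
{
	if (t->throttle_pending) {
		t->throttle_pending = false;
		t->throttled = true;
	}
}

/* Unthrottle: enqueue (here: simply un-mark) every throttled task. */
static void unthrottle(struct toy_cfs_rq *rq, struct toy_task *t)
{
	rq->throttled = false;
	t->throttled = false;
}

int main(void)
{
	struct toy_cfs_rq rq = { .throttled = true };
	struct toy_task p0 = { .name = "p0" };

	pick_task(&rq, &p0);	/* p0 keeps running in kernel mode */
	return_to_user(&p0);	/* throttle takes effect only here */
	printf("%s throttled=%d\n", p0.name, p0.throttled);
	unthrottle(&rq, &p0);
	printf("%s throttled=%d\n", p0.name, p0.throttled);
	return 0;
}

The point of the sketch is the ordering: nothing is dequeued at
throttle time, so a task holding a kernel lock keeps running until
user entry, which is what breaks the circular dependency described
above.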
Aaron Lu (2):
  sched/fair: Task based throttle time accounting
  sched/fair: Get rid of throttled_lb_pair()

Valentin Schneider (3):
  sched/fair: Add related data structure for task based throttle
  sched/fair: Implement throttle task work and related helpers
  sched/fair: Switch to task based throttle model

 include/linux/sched.h |   5 +
 kernel/sched/core.c   |   3 +
 kernel/sched/fair.c   | 458 ++++++++++++++++++++++++------------------
 kernel/sched/pelt.h   |   4 +-
 kernel/sched/sched.h  |   7 +-
 5 files changed, 280 insertions(+), 197 deletions(-)


base-commit: 1b5f1454091e9e9fb5c944b3161acf4ec0894d0d
--
2.39.5