v2:
- Re-org the patchset to use a single patch to implement throttle
  related changes, suggested by Chengming;
- Use check_cfs_rq_runtime()'s return value in pick_task_fair() to
  decide if throttle task work is needed instead of checking
  throttled_hierarchy(), suggested by Peter;
- Simplify the throttle_count check in tg_throttle_down() and
  tg_unthrottle_up(), suggested by Peter;
- Add enqueue_throttled_task() to speed up enqueuing a throttled task to
  a throttled cfs_rq, suggested by Peter;
- Address the missing detach_task_cfs_rq() for throttled tasks that get
  migrated to a new rq, pointed out by Chengming;
- Remove cond_resched_tasks_rcu_qs() in throttle_cfs_rq_work() as
  cond_resched*() is going away, pointed out by Peter.

I hope I didn't miss any comments and suggestions for v1 and if I did,
please kindly let me know, thanks!

Base: tip/sched/core commit dabe1be4e84c ("sched/smp: Use the SMP version
of double_rq_clock_clear_update()")

cover letter of v1:

This is a continuation of Valentin Schneider's work posted here:
Subject: [RFC PATCH v3 00/10] sched/fair: Defer CFS throttle to user entry
https://lore.kernel.org/lkml/20240711130004.2157737-1-vschneid@redhat.com/

Valentin has described the problem very well in the above link and I
quote:
"
CFS tasks can end up throttled while holding locks that other,
non-throttled tasks are blocking on.

For !PREEMPT_RT, this can be a source of latency due to the throttling
causing a resource acquisition denial.

For PREEMPT_RT, this is worse and can lead to a deadlock:
o A CFS task p0 gets throttled while holding read_lock(&lock)
o A task p1 blocks on write_lock(&lock), making further readers enter
  the slowpath
o A ktimers or ksoftirqd task blocks on read_lock(&lock)

If the cfs_bandwidth.period_timer to replenish p0's runtime is enqueued
on the same CPU as one where ktimers/ksoftirqd is blocked on
read_lock(&lock), this creates a circular dependency.

This has been observed to happen with:
o fs/eventpoll.c::ep->lock
o net/netlink/af_netlink.c::nl_table_lock (after hand-fixing the above)
but can trigger with any rwlock that can be acquired in both process
and softirq contexts.

The linux-rt tree has had
  1ea50f9636f0 ("softirq: Use a dedicated thread for timer wakeups.")
which helped this scenario for non-rwlock locks by ensuring the throttled
task would get PI'd to FIFO1 (ktimers' default priority). Unfortunately,
rwlocks cannot sanely do PI as they allow multiple readers.
"

Jan Kiszka has posted a reproducer for this PREEMPT_RT problem:
https://lore.kernel.org/r/7483d3ae-5846-4067-b9f7-390a614ba408@siemens.com/
and K Prateek Nayak has a detailed analysis of how the deadlock happens:
https://lore.kernel.org/r/e65a32af-271b-4de6-937a-1a1049bbf511@amd.com/

To fix this issue for PREEMPT_RT and to improve the latency situation
for !PREEMPT_RT, change the throttle model to be task based: when a
cfs_rq is throttled, mark its throttled status but do not remove it from
the CPU's rq. Instead, for tasks that belong to this cfs_rq, add a task
work to them when they get picked, so that they dequeue themselves when
they return to user space. This way, throttled tasks do not hold any
kernel resources. When the cfs_rq gets unthrottled, enqueue those
throttled tasks back.
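
In rough pseudo code, the intended flow looks like this (a simplified
sketch only, not the actual patch code; check_cfs_rq_runtime(),
pick_task_fair(), throttle_cfs_rq_work(), tg_unthrottle_up() and
throttled_limbo_list are names used by this series, while the
task_struct work field name below is only illustrative):

	/*
	 * 1) pick time: pick_task_fair() notices the picked task p's cfs_rq
	 *    has run out of quota and arms a task work on p.
	 *    TWA_RESUME makes the work run on return to user space.
	 */
	if (check_cfs_rq_runtime(cfs_rq))
		task_work_add(p, &p->sched_throttle_work, TWA_RESUME);

	/*
	 * 2) return to user: the task work, throttle_cfs_rq_work(), runs in
	 *    p's own context, dequeues p and parks it on its cfs_rq's
	 *    throttled_limbo_list, so p holds no kernel resources while
	 *    throttled.
	 *
	 * 3) unthrottle: tg_unthrottle_up() walks throttled_limbo_list and
	 *    enqueues the parked tasks back onto the rq.
	 */
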
This new throttle model has consequences. For example, for a cfs_rq that
has 3 tasks attached, when 2 tasks are throttled on their return to user
path while one task is still running in kernel mode, this cfs_rq is in a
partially throttled state:
- Should its pelt clock be frozen?
- Should this state be accounted into throttled_time?

For the pelt clock, I chose to keep the current behavior and freeze it
at the cfs_rq's throttle time. The assumption is that tasks running in
kernel mode should not last too long; freezing the cfs_rq's pelt clock
keeps its load and its corresponding sched_entity's weight. Hopefully,
this results in a stable situation that lets the remaining running tasks
quickly finish their jobs in kernel mode.

For throttle time accounting, according to RFC v2's feedback, rework
throttle time accounting for a cfs_rq as follows:
- start accounting when the first task gets throttled in its hierarchy;
- stop accounting on unthrottle.

There was also the concern about the increased duration of (un)throttle
operations in RFC v1. I've done some tests and, with a 2000 cgroups/20K
runnable tasks setup on a 2 sockets/384 CPUs AMD server, the longest
duration of distribute_cfs_runtime() is in the 2ms-4ms range. For
details, please see:
https://lore.kernel.org/lkml/20250324085822.GA732629@bytedance/
For the throttle path, with Chengming's suggestion to move the "task
work setup" from throttle time to pick time, it's not an issue anymore.

Aaron Lu (2):
  sched/fair: Task based throttle time accounting
  sched/fair: Get rid of throttled_lb_pair()

Valentin Schneider (3):
  sched/fair: Add related data structure for task based throttle
  sched/fair: Implement throttle task work and related helpers
  sched/fair: Switch to task based throttle model

 include/linux/sched.h |   5 +
 kernel/sched/core.c   |   3 +
 kernel/sched/fair.c   | 445 +++++++++++++++++++++++-------------------
 kernel/sched/sched.h  |   4 +
 4 files changed, 253 insertions(+), 204 deletions(-)

--
2.39.5

Aaron Lu <ziqianlu@bytedance.com> writes:

> For pelt clock, I chose to keep the current behavior to freeze it on
> cfs_rq's throttle time. The assumption is that tasks running in kernel
> mode should not last too long, freezing the cfs_rq's pelt clock can keep
> its load and its corresponding sched_entity's weight. Hopefully, this can
> result in a stable situation for the remaining running tasks to quickly
> finish their jobs in kernel mode.

I suppose the way that this would go wrong would be CPU 1 using up all
of the quota, and then a task waking up on CPU 2 and trying to run in
the kernel for a while. I suspect pelt time needs to also keep running
until all the tasks are asleep (and that's what we have been running at
google with the version based on separate accounting, so we haven't
accidentally done a large scale test of letting it pause).

Otherwise it does look ok, so long as we're ok with increasing distribute
time again.

Hello Ben,

On 7/3/2025 3:30 AM, Benjamin Segall wrote:
> Aaron Lu <ziqianlu@bytedance.com> writes:
>
>> For pelt clock, I chose to keep the current behavior to freeze it on
>> cfs_rq's throttle time. The assumption is that tasks running in kernel
>> mode should not last too long, freezing the cfs_rq's pelt clock can keep
>> its load and its corresponding sched_entity's weight. Hopefully, this can
>> result in a stable situation for the remaining running tasks to quickly
>> finish their jobs in kernel mode.
>
> I suppose the way that this would go wrong would be CPU 1 using up all
> of the quota, and then a task waking up on CPU 2 and trying to run in
> the kernel for a while. I suspect pelt time needs to also keep running
> until all the tasks are asleep (and that's what we have been running at
> google with the version based on separate accounting, so we haven't
> accidentally done a large scale test of letting it pause).

Thinking out loud ...

One thing this can possibly do is create a lot of:

    throttled -> partially unthrottled -> throttled

transitions when tasks wake up on a throttled hierarchy, run for a while,
and then go back to sleep. If the PELT clocks aren't frozen, this either
means:

1. Do a full walk_tg_tree_from() placing all the leaf cfs_rq that have
   any load associated back onto the list and allow PELT to progress only
   to then remove them again once tasks go back to sleep. A great many of
   these transitions are possible theoretically, which is not ideal.

2. Propagate the delta time where PELT was not frozen during unthrottle
   and, if it isn't 0, do an update_load_avg() to sync PELT. This will
   increase the overhead of the tg_tree callback, which isn't ideal
   either. It can also complicate the enqueue path since the PELT of
   the cfs_rq hierarchy being enqueued may need correction before
   the task can be enqueued.

I know Josh hates both approaches since tg_tree walks are already very
expensive in your use cases and they have to be done in a non-preemptible
context holding the rq_lock, but which do you think is the lesser of two
evils? Or is there a better solution that I have completely missed?

>
> Otherwise it does look ok, so long as we're ok with increasing distribute
> time again.

--
Thanks and Regards,
Prateek

On Fri, Jul 04, 2025 at 10:04:13AM +0530, K Prateek Nayak wrote:
> Hello Ben,
>
> On 7/3/2025 3:30 AM, Benjamin Segall wrote:
> > Aaron Lu <ziqianlu@bytedance.com> writes:
> >
> > > For pelt clock, I chose to keep the current behavior to freeze it on
> > > cfs_rq's throttle time. The assumption is that tasks running in kernel
> > > mode should not last too long, freezing the cfs_rq's pelt clock can keep
> > > its load and its corresponding sched_entity's weight. Hopefully, this can
> > > result in a stable situation for the remaining running tasks to quickly
> > > finish their jobs in kernel mode.
> >
> > I suppose the way that this would go wrong would be CPU 1 using up all
> > of the quota, and then a task waking up on CPU 2 and trying to run in
> > the kernel for a while. I suspect pelt time needs to also keep running
> > until all the tasks are asleep (and that's what we have been running at
> > google with the version based on separate accounting, so we haven't
> > accidentally done a large scale test of letting it pause).
>
> Thinking out loud ...
>
> One thing this can possibly do is create a lot of:
>
>     throttled -> partially unthrottled -> throttled
>
> transitions when tasks wake up on a throttled hierarchy, run for a while,
> and then go back to sleep. If the PELT clocks aren't frozen, this either
> means:
>
> 1. Do a full walk_tg_tree_from() placing all the leaf cfs_rq that have
>    any load associated back onto the list and allow PELT to progress only
>    to then remove them again once tasks go back to sleep. A great many of
>    these transitions are possible theoretically, which is not ideal.
>

I'm going this route, because it is kind of already the case now :)

I mean throttled cfs_rqs are already added back to the leaf cfs_rq
list during enqueue time, to make the assert_list_leaf_cfs_rq(rq) at
the bottom of enqueue_task_fair() happy when a task is enqueued to a
throttled cfs_rq.

I'm sorry if this is not obvious in this series, I guess I put too many
things in patch3.

Below is the diff I cooked on top of this series to keep the pelt clock
running as long as there are tasks running in a throttled cfs_rq, does
it look sane?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d869c8b51c5a6..410b850df2a12 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5290,8 +5290,15 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	se->on_rq = 1;
 
 	if (cfs_rq->nr_queued == 1) {
+		struct rq *rq = rq_of(cfs_rq);
+
 		check_enqueue_throttle(cfs_rq);
 		list_add_leaf_cfs_rq(cfs_rq);
+		if (cfs_rq->pelt_clock_throttled) {
+			cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
+				cfs_rq->throttled_clock_pelt;
+			cfs_rq->pelt_clock_throttled = 0;
+		}
 	}
 }
 
@@ -5437,8 +5444,13 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 	if (cfs_rq->nr_queued == 0) {
 		update_idle_cfs_rq_clock_pelt(cfs_rq);
-		if (throttled_hierarchy(cfs_rq))
+		if (throttled_hierarchy(cfs_rq)) {
+			struct rq *rq = rq_of(cfs_rq);
+
+			cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
+			cfs_rq->pelt_clock_throttled = 1;
 			list_del_leaf_cfs_rq(cfs_rq);
+		}
 	}
 
 	return true;
@@ -5873,8 +5885,11 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
 	if (--cfs_rq->throttle_count)
 		return 0;
 
-	cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
-				cfs_rq->throttled_clock_pelt;
+	if (cfs_rq->pelt_clock_throttled) {
+		cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
+					cfs_rq->throttled_clock_pelt;
+		cfs_rq->pelt_clock_throttled = 0;
+	}
 
 	if (cfs_rq->throttled_clock_self) {
 		u64 delta = rq_clock(rq) - cfs_rq->throttled_clock_self;
@@ -5939,11 +5954,13 @@ static int tg_throttle_down(struct task_group *tg, void *data)
 	if (cfs_rq->throttle_count++)
 		return 0;
 
-	/* group is entering throttled state, stop time */
-	cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
 
-	if (!cfs_rq->nr_queued)
+	if (!cfs_rq->nr_queued) {
+		/* group is entering throttled state, stop time */
+		cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
+		cfs_rq->pelt_clock_throttled = 1;
 		list_del_leaf_cfs_rq(cfs_rq);
+	}
 
 	WARN_ON_ONCE(cfs_rq->throttled_clock_self);
 	WARN_ON_ONCE(!list_empty(&cfs_rq->throttled_limbo_list));
diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index 62c3fa543c0f2..f921302dc40fb 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -162,7 +162,7 @@ static inline void update_idle_cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
 {
 	u64 throttled;
 
-	if (unlikely(cfs_rq->throttle_count))
+	if (unlikely(cfs_rq->pelt_clock_throttled))
 		throttled = U64_MAX;
 	else
 		throttled = cfs_rq->throttled_clock_pelt_time;
@@ -173,7 +173,7 @@ static inline void update_idle_cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
 /* rq->task_clock normalized against any time this cfs_rq has spent throttled */
 static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
 {
-	if (unlikely(cfs_rq->throttle_count))
+	if (unlikely(cfs_rq->pelt_clock_throttled))
 		return cfs_rq->throttled_clock_pelt - cfs_rq->throttled_clock_pelt_time;
 
 	return rq_clock_pelt(rq_of(cfs_rq)) - cfs_rq->throttled_clock_pelt_time;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f2a07537d3c12..877e40ccd8cc1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -724,7 +724,8 @@ struct cfs_rq {
 	u64			throttled_clock_pelt_time;
 	u64			throttled_clock_self;
 	u64			throttled_clock_self_time;
-	int			throttled;
+	int			throttled:1;
+	int			pelt_clock_throttled:1;
 	int			throttle_count;
 	struct list_head	throttled_list;
 	struct list_head	throttled_csd_list;

Thanks,
Aaron

> 2. Propagate the delta time where PELT was not frozen during unthrottle
>    and, if it isn't 0, do an update_load_avg() to sync PELT.
>    This will increase the overhead of the tg_tree callback, which isn't
>    ideal either. It can also complicate the enqueue path since the PELT
>    of the cfs_rq hierarchy being enqueued may need correction before
>    the task can be enqueued.
>
> I know Josh hates both approaches since tg_tree walks are already very
> expensive in your use cases and they have to be done in a non-preemptible
> context holding the rq_lock, but which do you think is the lesser of two
> evils? Or is there a better solution that I have completely missed?
>
> >
> > Otherwise it does look ok, so long as we're ok with increasing distribute
> > time again.
>
> --
> Thanks and Regards,
> Prateek
>

Hello Aaron,

On 7/4/2025 1:24 PM, Aaron Lu wrote:
> On Fri, Jul 04, 2025 at 10:04:13AM +0530, K Prateek Nayak wrote:
>> Hello Ben,
>>
>> On 7/3/2025 3:30 AM, Benjamin Segall wrote:
>>> Aaron Lu <ziqianlu@bytedance.com> writes:
>>>
>>>> For pelt clock, I chose to keep the current behavior to freeze it on
>>>> cfs_rq's throttle time. The assumption is that tasks running in kernel
>>>> mode should not last too long, freezing the cfs_rq's pelt clock can keep
>>>> its load and its corresponding sched_entity's weight. Hopefully, this can
>>>> result in a stable situation for the remaining running tasks to quickly
>>>> finish their jobs in kernel mode.
>>>
>>> I suppose the way that this would go wrong would be CPU 1 using up all
>>> of the quota, and then a task waking up on CPU 2 and trying to run in
>>> the kernel for a while. I suspect pelt time needs to also keep running
>>> until all the tasks are asleep (and that's what we have been running at
>>> google with the version based on separate accounting, so we haven't
>>> accidentally done a large scale test of letting it pause).
>>
>> Thinking out loud ...
>>
>> One thing this can possibly do is create a lot of:
>>
>>     throttled -> partially unthrottled -> throttled
>>
>> transitions when tasks wake up on a throttled hierarchy, run for a while,
>> and then go back to sleep. If the PELT clocks aren't frozen, this either
>> means:
>>
>> 1. Do a full walk_tg_tree_from() placing all the leaf cfs_rq that have
>>    any load associated back onto the list and allow PELT to progress only
>>    to then remove them again once tasks go back to sleep. A great many of
>>    these transitions are possible theoretically, which is not ideal.
>>
>
> I'm going this route, because it is kind of already the case now :)
>
> I mean throttled cfs_rqs are already added back to the leaf cfs_rq
> list during enqueue time, to make the assert_list_leaf_cfs_rq(rq) at
> the bottom of enqueue_task_fair() happy when a task is enqueued to a
> throttled cfs_rq.
>
> I'm sorry if this is not obvious in this series, I guess I put too many
> things in patch3.
>
> Below is the diff I cooked on top of this series to keep the pelt clock
> running as long as there are tasks running in a throttled cfs_rq, does
> it look sane?
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d869c8b51c5a6..410b850df2a12 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5290,8 +5290,15 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  	se->on_rq = 1;
>  
>  	if (cfs_rq->nr_queued == 1) {
> +		struct rq *rq = rq_of(cfs_rq);
> +
>  		check_enqueue_throttle(cfs_rq);
>  		list_add_leaf_cfs_rq(cfs_rq);
> +		if (cfs_rq->pelt_clock_throttled) {
> +			cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
> +				cfs_rq->throttled_clock_pelt;
> +			cfs_rq->pelt_clock_throttled = 0;
> +		}

At this point we've already done an update_load_avg() above in
enqueue_entity() without unfreezing the PELT clock. Does it make
sense to do it at the beginning?

Overall idea seems sane to me. I was thinking if anything can go
wrong by only unfreezing the PELT for one part of the hierarchy but
I suppose the other cfs_rq can be considered individually throttled
and it should turn out fine.

>  	}
>  }
>  
> @@ -5437,8 +5444,13 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
>  
>  	if (cfs_rq->nr_queued == 0) {
>  		update_idle_cfs_rq_clock_pelt(cfs_rq);
> -		if (throttled_hierarchy(cfs_rq))
> +		if (throttled_hierarchy(cfs_rq)) {
> +			struct rq *rq = rq_of(cfs_rq);
> +
> +			cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
> +			cfs_rq->pelt_clock_throttled = 1;
>  			list_del_leaf_cfs_rq(cfs_rq);
> +		}
>  	}
>  
>  	return true;
> @@ -5873,8 +5885,11 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
>  	if (--cfs_rq->throttle_count)
>  		return 0;
>  
> -	cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
> -				cfs_rq->throttled_clock_pelt;
> +	if (cfs_rq->pelt_clock_throttled) {
> +		cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
> +					cfs_rq->throttled_clock_pelt;
> +		cfs_rq->pelt_clock_throttled = 0;
> +	}
>  
>  	if (cfs_rq->throttled_clock_self) {
>  		u64 delta = rq_clock(rq) - cfs_rq->throttled_clock_self;
> @@ -5939,11 +5954,13 @@ static int tg_throttle_down(struct task_group *tg, void *data)
>  	if (cfs_rq->throttle_count++)
>  		return 0;
>  
> -	/* group is entering throttled state, stop time */
> -	cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
>  
> -	if (!cfs_rq->nr_queued)
> +	if (!cfs_rq->nr_queued) {
> +		/* group is entering throttled state, stop time */
> +		cfs_rq->throttled_clock_pelt = rq_clock_pelt(rq);
> +		cfs_rq->pelt_clock_throttled = 1;
>  		list_del_leaf_cfs_rq(cfs_rq);
> +	}
>  
>  	WARN_ON_ONCE(cfs_rq->throttled_clock_self);
>  	WARN_ON_ONCE(!list_empty(&cfs_rq->throttled_limbo_list));
> diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
> index 62c3fa543c0f2..f921302dc40fb 100644
> --- a/kernel/sched/pelt.h
> +++ b/kernel/sched/pelt.h
> @@ -162,7 +162,7 @@ static inline void update_idle_cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
>  {
>  	u64 throttled;
>  
> -	if (unlikely(cfs_rq->throttle_count))
> +	if (unlikely(cfs_rq->pelt_clock_throttled))
>  		throttled = U64_MAX;
>  	else
>  		throttled = cfs_rq->throttled_clock_pelt_time;
> @@ -173,7 +173,7 @@ static inline void update_idle_cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
>  /* rq->task_clock normalized against any time this cfs_rq has spent throttled */
>  static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
>  {
> -	if (unlikely(cfs_rq->throttle_count))
> +	if (unlikely(cfs_rq->pelt_clock_throttled))
>  		return cfs_rq->throttled_clock_pelt - cfs_rq->throttled_clock_pelt_time;
>  
>  	return rq_clock_pelt(rq_of(cfs_rq)) - cfs_rq->throttled_clock_pelt_time;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index f2a07537d3c12..877e40ccd8cc1 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -724,7 +724,8 @@ struct cfs_rq {
>  	u64			throttled_clock_pelt_time;
>  	u64			throttled_clock_self;
>  	u64			throttled_clock_self_time;
> -	int			throttled;
> +	int			throttled:1;
> +	int			pelt_clock_throttled:1;
>  	int			throttle_count;
>  	struct list_head	throttled_list;
>  	struct list_head	throttled_csd_list;
>
>
> Thanks,
> Aaron
>
>> 2. Propagate the delta time where PELT was not frozen during unthrottle
>>    and, if it isn't 0, do an update_load_avg() to sync PELT. This will
>>    increase the overhead of the tg_tree callback, which isn't ideal
>>    either. It can also complicate the enqueue path since the PELT of
>>    the cfs_rq hierarchy being enqueued may need correction before
>>    the task can be enqueued.
>>
>> I know Josh hates both approaches since tg_tree walks are already very
>> expensive in your use cases and they have to be done in a non-preemptible
>> context holding the rq_lock, but which do you think is the lesser of two
>> evils? Or is there a better solution that I have completely missed?
>>
>>>
>>> Otherwise it does look ok, so long as we're ok with increasing distribute
>>> time again.
>>
>> --
>> Thanks and Regards,
>> Prateek
>>

--
Thanks and Regards,
Prateek

Hi Prateek,

On Fri, Jul 04, 2025 at 02:18:49PM +0530, K Prateek Nayak wrote:
> Hello Aaron,
> 
> On 7/4/2025 1:24 PM, Aaron Lu wrote:
> > On Fri, Jul 04, 2025 at 10:04:13AM +0530, K Prateek Nayak wrote:
> > > Hello Ben,
> > > 
> > > On 7/3/2025 3:30 AM, Benjamin Segall wrote:
> > > > Aaron Lu <ziqianlu@bytedance.com> writes:
> > > > 
> > > > > For pelt clock, I chose to keep the current behavior to freeze it on
> > > > > cfs_rq's throttle time. The assumption is that tasks running in kernel
> > > > > mode should not last too long, freezing the cfs_rq's pelt clock can keep
> > > > > its load and its corresponding sched_entity's weight. Hopefully, this can
> > > > > result in a stable situation for the remaining running tasks to quickly
> > > > > finish their jobs in kernel mode.
> > > > 
> > > > I suppose the way that this would go wrong would be CPU 1 using up all
> > > > of the quota, and then a task waking up on CPU 2 and trying to run in
> > > > the kernel for a while. I suspect pelt time needs to also keep running
> > > > until all the tasks are asleep (and that's what we have been running at
> > > > google with the version based on separate accounting, so we haven't
> > > > accidentally done a large scale test of letting it pause).
> > > 
> > > Thinking out loud ...
> > > 
> > > One thing this can possibly do is create a lot of:
> > > 
> > >     throttled -> partially unthrottled -> throttled
> > > 
> > > transitions when tasks wake up on a throttled hierarchy, run for a while,
> > > and then go back to sleep. If the PELT clocks aren't frozen, this either
> > > means:
> > > 
> > > 1. Do a full walk_tg_tree_from() placing all the leaf cfs_rq that have
> > >    any load associated back onto the list and allow PELT to progress only
> > >    to then remove them again once tasks go back to sleep. A great many of
> > >    these transitions are possible theoretically, which is not ideal.
> > > 
> > 
> > I'm going this route, because it is kind of already the case now :)
> > 
> > I mean throttled cfs_rqs are already added back to the leaf cfs_rq
> > list during enqueue time, to make the assert_list_leaf_cfs_rq(rq) at
> > the bottom of enqueue_task_fair() happy when a task is enqueued to a
> > throttled cfs_rq.
> > 
> > I'm sorry if this is not obvious in this series, I guess I put too many
> > things in patch3.
> > 
> > Below is the diff I cooked on top of this series to keep the pelt clock
> > running as long as there are tasks running in a throttled cfs_rq, does
> > it look sane?
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index d869c8b51c5a6..410b850df2a12 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5290,8 +5290,15 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> >  	se->on_rq = 1;
> >  
> >  	if (cfs_rq->nr_queued == 1) {
> > +		struct rq *rq = rq_of(cfs_rq);
> > +
> >  		check_enqueue_throttle(cfs_rq);
> >  		list_add_leaf_cfs_rq(cfs_rq);
> > +		if (cfs_rq->pelt_clock_throttled) {
> > +			cfs_rq->throttled_clock_pelt_time += rq_clock_pelt(rq) -
> > +				cfs_rq->throttled_clock_pelt;
> > +			cfs_rq->pelt_clock_throttled = 0;
> > +		}
> 
> At this point we've already done an update_load_avg() above in
> enqueue_entity() without unfreezing the PELT clock. Does it make
> sense to do it at the beginning?
> 

My thinking is: when the first entity is enqueued, the cfs_rq should
have been throttled, so the update_load_avg() done above should do
nothing since its pelt clock is frozen (so no decay should happen).
Then, after the entity is added, this throttled cfs_rq's pelt clock is
unfrozen and load can be updated accordingly.

Thinking of it another way: if we moved the unfreezing earlier, the
delta calculated in __update_load_sum() should be zero, because
cfs_rq->throttled_clock_pelt_time is adjusted and the "now" value
derived from cfs_rq_clock_pelt() should be the same as the last update
time.
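
Writing the arithmetic down, using the cfs_rq_clock_pelt() definitions
from the pelt.h hunk above (just to double-check my own understanding,
with "old"/"new" marking the value before/after the unfreeze):

    while the clock is frozen:
        cfs_rq_clock_pelt() == throttled_clock_pelt - throttled_clock_pelt_time(old)
        (a constant, and this is where last_update_time was left)

    unfreeze adjustment:
        throttled_clock_pelt_time(new) = throttled_clock_pelt_time(old)
                                       + rq_clock_pelt() - throttled_clock_pelt

    first cfs_rq_clock_pelt() after the adjustment:
        now = rq_clock_pelt() - throttled_clock_pelt_time(new)
            = rq_clock_pelt() - (throttled_clock_pelt_time(old)
                                 + rq_clock_pelt() - throttled_clock_pelt)
            = throttled_clock_pelt - throttled_clock_pelt_time(old)
        => now - last_update_time == 0, so __update_load_sum() sees no delta.
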
Does this make sense or do I misunderstand it?

> Overall idea seems sane to me. I was thinking if anything can go
> wrong by only unfreezing the PELT for one part of the hierarchy but
> I suppose the other cfs_rq can be considered individually throttled
> and it should turn out fine.
> 

Unfreezing the PELT for only one part seems problematic: I suppose we
would have to propagate the load all the way up, so that the root level
sched_entity has a correct load setting and gets the correct share when
it needs to compete for CPU resources. I assume this is why Ben thinks
it is better to keep the pelt clock running as long as there are tasks
running.

Note that the pelt clock is only frozen when the cfs_rq has no tasks
running, and gets unfrozen when the first entity is enqueued or at
unthrottle time. So when a new task is enqueued, at each level's
enqueue_entity():
- if it's the first entity of a throttled cfs_rq, then its PELT clock
  will be unfrozen;
- if it's not the first entity, that means the cfs_rq's pelt clock is
  not frozen yet.

So in the end, we should not have a partially unfrozen hierarchy.
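
In rough pseudo code (the real walk is the bottom-up
for_each_sched_entity() loop in enqueue_task_fair(), which handles more
cases than shown here; this only restates the two cases above):

	for_each_sched_entity(se) {	/* bottom up: child before parent */
		cfs_rq = cfs_rq_of(se);
		enqueue_entity(cfs_rq, se, flags);
		/*
		 * Either se is the first entity of this (throttled) cfs_rq
		 * and enqueue_entity() has just unfrozen its pelt clock, or
		 * the cfs_rq already had queued entities and its pelt clock
		 * was never frozen (freezing only happens when nr_queued
		 * drops to 0). Either way this level's pelt clock is running
		 * before we move on to the parent level.
		 */
	}
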
At least, that's my intention, please let me know if I messed up :)

Thanks for the quick review!

Best regards,
Aaron

Hi Benjamin,

Thanks for taking a look.

On Wed, Jul 02, 2025 at 03:00:56PM -0700, Benjamin Segall wrote:
> Aaron Lu <ziqianlu@bytedance.com> writes:
> 
> > For pelt clock, I chose to keep the current behavior to freeze it on
> > cfs_rq's throttle time. The assumption is that tasks running in kernel
> > mode should not last too long, freezing the cfs_rq's pelt clock can keep
> > its load and its corresponding sched_entity's weight. Hopefully, this can
> > result in a stable situation for the remaining running tasks to quickly
> > finish their jobs in kernel mode.
> 
> I suppose the way that this would go wrong would be CPU 1 using up all
> of the quota, and then a task waking up on CPU 2 and trying to run in
> the kernel for a while. I suspect pelt time needs to also keep running
> until all the tasks are asleep (and that's what we have been running at
> google with the version based on separate accounting, so we haven't
> accidentally done a large scale test of letting it pause).
> 

Got it, I'll rework this part to keep the pelt clock ticking and only
stop it when all tasks of a throttled cfs_rq are either throttled or
blocked. I will paste a diff on top of this series and, if the diff
looks OK, I'll fold it into patch3 in the next version.

> Otherwise it does look ok, so long as we're ok with increasing distribute
> time again.

Good to know this! About not strictly limiting quota, I suppose that is
a trade off, and having a system that is operating properly is better
than a system with hung tasks or even a deadlock.

Best regards,
Aaron

Hello Aaron,

On 6/18/2025 1:49 PM, Aaron Lu wrote:
> v2:
> - Re-org the patchset to use a single patch to implement throttle
>   related changes, suggested by Chengming;
> - Use check_cfs_rq_runtime()'s return value in pick_task_fair() to
>   decide if throttle task work is needed instead of checking
>   throttled_hierarchy(), suggested by Peter;
> - Simplify the throttle_count check in tg_throttle_down() and
>   tg_unthrottle_up(), suggested by Peter;
> - Add enqueue_throttled_task() to speed up enqueuing a throttled task to
>   a throttled cfs_rq, suggested by Peter;
> - Address the missing detach_task_cfs_rq() for throttled tasks that get
>   migrated to a new rq, pointed out by Chengming;
> - Remove cond_resched_tasks_rcu_qs() in throttle_cfs_rq_work() as
>   cond_resched*() is going away, pointed out by Peter.
> 
> I hope I didn't miss any comments and suggestions for v1 and if I did,
> please kindly let me know, thanks!
> 
> Base: tip/sched/core commit dabe1be4e84c ("sched/smp: Use the SMP version
> of double_rq_clock_clear_update()")

Sorry for the delay! I gave this a spin with my nested hierarchy stress
test with sched-messaging as well as with Jan's reproducer from [1] and
I didn't see anything unexpected.

A 2 vCPU VM running vanilla tip:sched/core (PREEMPT_RT) hangs within a
few seconds when the two tasks from Jan's reproducer are pinned to the
same CPU as the bandwidth timer.

I haven't seen any hangs / rcu-stalls with this series applied on top of
tip:sched/core. Feel free to include:

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

[1] https://lore.kernel.org/all/7483d3ae-5846-4067-b9f7-390a614ba408@siemens.com/

--
Thanks and Regards,
Prateek

Hi Prateek,

On Wed, Jul 02, 2025 at 09:55:19AM +0530, K Prateek Nayak wrote:
> Hello Aaron,
> 
> On 6/18/2025 1:49 PM, Aaron Lu wrote:
> > v2:
> > - Re-org the patchset to use a single patch to implement throttle
> >   related changes, suggested by Chengming;
> > - Use check_cfs_rq_runtime()'s return value in pick_task_fair() to
> >   decide if throttle task work is needed instead of checking
> >   throttled_hierarchy(), suggested by Peter;
> > - Simplify the throttle_count check in tg_throttle_down() and
> >   tg_unthrottle_up(), suggested by Peter;
> > - Add enqueue_throttled_task() to speed up enqueuing a throttled task to
> >   a throttled cfs_rq, suggested by Peter;
> > - Address the missing detach_task_cfs_rq() for throttled tasks that get
> >   migrated to a new rq, pointed out by Chengming;
> > - Remove cond_resched_tasks_rcu_qs() in throttle_cfs_rq_work() as
> >   cond_resched*() is going away, pointed out by Peter.
> > 
> > I hope I didn't miss any comments and suggestions for v1 and if I did,
> > please kindly let me know, thanks!
> > 
> > Base: tip/sched/core commit dabe1be4e84c ("sched/smp: Use the SMP version
> > of double_rq_clock_clear_update()")
> 
> Sorry for the delay! I gave this a spin with my nested hierarchy stress
> test with sched-messaging as well as with Jan's reproducer from [1] and
> I didn't see anything unexpected.
> 
> A 2 vCPU VM running vanilla tip:sched/core (PREEMPT_RT) hangs within a
> few seconds when the two tasks from Jan's reproducer are pinned to the
> same CPU as the bandwidth timer.
> 
> I haven't seen any hangs / rcu-stalls with this series applied on top of
> tip:sched/core. Feel free to include:
> 
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
> 
> [1] https://lore.kernel.org/all/7483d3ae-5846-4067-b9f7-390a614ba408@siemens.com/
> 

Thanks a lot Prateek, I really appreciate your time spent testing this.

Best regards,
Aaron

Hello,

On Wed, Jun 18, 2025 at 04:19:35PM +0800, Aaron Lu wrote:
> v2:
> - Re-org the patchset to use a single patch to implement throttle
>   related changes, suggested by Chengming;
> - Use check_cfs_rq_runtime()'s return value in pick_task_fair() to
>   decide if throttle task work is needed instead of checking
>   throttled_hierarchy(), suggested by Peter;
> - Simplify the throttle_count check in tg_throttle_down() and
>   tg_unthrottle_up(), suggested by Peter;
> - Add enqueue_throttled_task() to speed up enqueuing a throttled task to
>   a throttled cfs_rq, suggested by Peter;
> - Address the missing detach_task_cfs_rq() for throttled tasks that get
>   migrated to a new rq, pointed out by Chengming;
> - Remove cond_resched_tasks_rcu_qs() in throttle_cfs_rq_work() as
>   cond_resched*() is going away, pointed out by Peter.
> 
> I hope I didn't miss any comments and suggestions for v1 and if I did,
> please kindly let me know, thanks!
> 
> Base: tip/sched/core commit dabe1be4e84c ("sched/smp: Use the SMP version
> of double_rq_clock_clear_update()")

I wonder, are there any more comments about this series?
Is it going in the right direction?

Thanks.

On Tue, Jul 01, 2025 at 04:31:23PM +0800, Aaron Lu wrote:

> I wonder, are there any more comments about this series?
> Is it going in the right direction?

I had a quick look yesterday, and things seem more or less agreeable.
I see there has been some feedback from Ben that warrants a new version,
so I'll try and keep an eye out for that one.

Thanks!

On Thu, Jul 03, 2025 at 09:37:40AM +0200, Peter Zijlstra wrote:
> On Tue, Jul 01, 2025 at 04:31:23PM +0800, Aaron Lu wrote:
> 
> > I wonder, are there any more comments about this series?
> > Is it going in the right direction?
> 
> I had a quick look yesterday, and things seem more or less agreeable.
> I see there has been some feedback from Ben that warrants a new version,
> so I'll try and keep an eye out for that one.
> 
> Thanks!

Thank you Peter for the update.