When a cfs_rq is to be throttled, its limbo list should be empty and
that's why there is a warning in tg_throttle_down() for a non-empty
cfs_rq->throttled_limbo_list.
When running a test with the following hierarchy:

             root
            /    \
           A*    ...
         / | \ ...
           B
          / \
         C*

where both A and C have quota settings, that warning on a non-empty limbo
list is triggered for a cfs_rq of C, let's call it cfs_rq_c (and ignore
the cpu part of the cfs_rq for the sake of simpler representation).

Debugging showed it happened like this:
Task group C is created and quota is set, so in tg_set_cfs_bandwidth(),
cfs_rq_c is initialized with runtime_enabled set, runtime_remaining
equal to 0, and *unthrottled*. Before any tasks are enqueued to cfs_rq_c,
*multiple* throttled tasks can migrate to cfs_rq_c (e.g., due to task
group changes). When enqueue_task_fair(cfs_rq_c, throttled_task) is
called and cfs_rq_c is in a throttled hierarchy (e.g., A is throttled),
these throttled tasks are placed into cfs_rq_c's limbo list by
enqueue_throttled_task().
Later, when A is unthrottled, tg_unthrottle_up(cfs_rq_c) enqueues these
tasks. The first enqueue triggers check_enqueue_throttle(), and with zero
runtime_remaining, cfs_rq_c can be throttled in throttle_cfs_rq() if it
can't get more runtime and enters tg_throttle_down(), where the warning
is hit due to remaining tasks in the limbo list.
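
To make the ordering concrete, below is a minimal userspace C model of the
sequence described above. It is only a sketch: the struct fields and helper
names are simplified stand-ins for cfs_rq and the throttle functions named
above, not the kernel's actual code.

#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-in for the cfs_rq fields that matter here. */
struct cfs_rq_model {
	bool runtime_enabled;
	long runtime_remaining;
	bool throttled;
	int  limbo_tasks;	/* tasks parked on throttled_limbo_list */
};

/* Models tg_set_cfs_bandwidth() before the fix: quota on, left unthrottled. */
static void set_quota(struct cfs_rq_model *c)
{
	c->runtime_enabled = true;
	c->runtime_remaining = 0;
	/* c->throttled stays false */
}

/* Models the sanity check in tg_throttle_down(). */
static void throttle_down(struct cfs_rq_model *c)
{
	if (c->limbo_tasks)
		printf("WARN: %d task(s) still on the limbo list\n", c->limbo_tasks);
	c->throttled = true;
}

int main(void)
{
	struct cfs_rq_model cfs_rq_c = { 0 };

	set_quota(&cfs_rq_c);

	/* Throttled tasks migrate in while A is throttled: parked in limbo. */
	cfs_rq_c.limbo_tasks = 2;

	/* A is unthrottled; tg_unthrottle_up() enqueues the first limbo task. */
	cfs_rq_c.limbo_tasks--;

	/* check_enqueue_throttle(): no runtime left, so cfs_rq_c is throttled. */
	if (cfs_rq_c.runtime_enabled && cfs_rq_c.runtime_remaining <= 0)
		throttle_down(&cfs_rq_c);	/* fires with one task still in limbo */

	return 0;
}

With the fix below, set_quota() would also mark the cfs_rq throttled up
front, so the unthrottle path never runs it through tg_throttle_down() with
tasks still sitting on the limbo list.
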
Fix this by calling throttle_cfs_rq() in tg_set_cfs_bandwidth()
immediately after enabling bandwidth and setting runtime_remaining = 0.
This ensures cfs_rq_c is throttled upfront and cannot enter the enqueue
path in an unthrottled state with no runtime.
Also, update the outdated comment in unthrottle_cfs_rq(), since it can no
longer be called with zero runtime_remaining as a result of a quota change
in tg_set_cfs_bandwidth().

While at it, remove a redundant assignment to se in unthrottle_cfs_rq().

Fixes: e1fad12dcb66 ("sched/fair: Switch to task based throttle model")
Signed-off-by: Aaron Lu <ziqianlu@bytedance.com>
---
kernel/sched/core.c | 9 ++++++++-
kernel/sched/fair.c | 16 +++++++---------
kernel/sched/sched.h | 1 +
3 files changed, 16 insertions(+), 10 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7f1e5cb94c536..421166d431fa7 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9608,7 +9608,14 @@ static int tg_set_cfs_bandwidth(struct task_group *tg,
 		cfs_rq->runtime_enabled = runtime_enabled;
 		cfs_rq->runtime_remaining = 0;
 
-		if (cfs_rq->throttled)
+		/*
+		 * Throttle cfs_rq now or it can be unthrottled with zero
+		 * runtime_remaining and gets throttled on its unthrottle path.
+		 */
+		if (cfs_rq->runtime_enabled && !cfs_rq->throttled)
+			throttle_cfs_rq(cfs_rq);
+
+		if (!cfs_rq->runtime_enabled && cfs_rq->throttled)
 			unthrottle_cfs_rq(cfs_rq);
 	}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 22e6dd3af82fc..3ef11783369d7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5976,7 +5976,7 @@ static int tg_throttle_down(struct task_group *tg, void *data)
 	return 0;
 }
 
-static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
+bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
 {
 	struct rq *rq = rq_of(cfs_rq);
 	struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
@@ -6025,19 +6025,17 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 
 	/*
 	 * It's possible we are called with !runtime_remaining due to things
-	 * like user changed quota setting(see tg_set_cfs_bandwidth()) or async
-	 * unthrottled us with a positive runtime_remaining but other still
-	 * running entities consumed those runtime before we reached here.
+	 * like async unthrottled us with a positive runtime_remaining but
+	 * other still running entities consumed those runtime before we
+	 * reached here.
 	 *
-	 * Anyway, we can't unthrottle this cfs_rq without any runtime remaining
-	 * because any enqueue in tg_unthrottle_up() will immediately trigger a
-	 * throttle, which is not supposed to happen on unthrottle path.
+	 * We can't unthrottle this cfs_rq without any runtime remaining
+	 * because any enqueue in tg_unthrottle_up() will immediately trigger
+	 * a throttle, which is not supposed to happen on unthrottle path.
 	 */
 	if (cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0)
 		return;
 
-	se = cfs_rq->tg->se[cpu_of(rq)];
-
 	cfs_rq->throttled = 0;
 
 	update_rq_clock(rq);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b5367c514c143..359bb858cffd3 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -558,6 +558,7 @@ extern void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b, struct cfs_bandwidth
 
 extern void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b);
 extern void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b);
+extern bool throttle_cfs_rq(struct cfs_rq *cfs_rq);
 extern void unthrottle_cfs_rq(struct cfs_rq *cfs_rq);
 extern bool cfs_task_bw_constrained(struct task_struct *p);
--
2.39.5
Hello Aaron,

On 9/29/2025 1:16 PM, Aaron Lu wrote:
> ... ...
> +		/*
> +		 * Throttle cfs_rq now or it can be unthrottled with zero
> +		 * runtime_remaining and gets throttled on its unthrottle path.
> +		 */
> +		if (cfs_rq->runtime_enabled && !cfs_rq->throttled)
> +			throttle_cfs_rq(cfs_rq);

So one downside of this is throttle_cfs_rq() here can assign bandwidth
to an empty cfs_rq and a genuine enqueue later on another CPU might not
find bandwidth thus delaying its execution.

Can we instead do a check_enqueue_throttle() in enqueue_throttled_task()
if we find cfs_rq->throttled_limbo_list to be empty?
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 18a30ae35441..fd2d4dad9c27 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5872,6 +5872,8 @@ static bool enqueue_throttled_task(struct task_struct *p)
 	 */
 	if (throttled_hierarchy(cfs_rq) &&
 	    !task_current_donor(rq_of(cfs_rq), p)) {
+		if (list_empty(&cfs_rq->throttled_limbo_list))
+			check_enqueue_throttle(cfs_rq);
 		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
 		return true;
 	}
---

> ... ...
>  	if (cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0)
>  		return;
>  
> -	se = cfs_rq->tg->se[cpu_of(rq)];
> -

Ack on these bits!

> ... ...

--
Thanks and Regards,
Prateek
On Mon, Sep 29, 2025 at 03:04:03PM +0530, K Prateek Nayak wrote:
... ...
> Can we instead do a check_enqueue_throttle() in enqueue_throttled_task()
> if we find cfs_rq->throttled_limbo_list to be empty?
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 18a30ae35441..fd2d4dad9c27 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5872,6 +5872,8 @@ static bool enqueue_throttled_task(struct task_struct *p)
>  	 */
>  	if (throttled_hierarchy(cfs_rq) &&
>  	    !task_current_donor(rq_of(cfs_rq), p)) {

	/*
	 * Make sure to throttle this cfs_rq or it can be unthrottled
	 * with no runtime_remaining and gets throttled again on its
	 * unthrottle path.
	 */

> +		if (list_empty(&cfs_rq->throttled_limbo_list))
> +			check_enqueue_throttle(cfs_rq);

BTW, do you think a comment is needed? Something like the above, not
sure if it's too redundant though, feel free to let me know your
thoughts, thanks.

>  		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
>  		return true;
>  	}
> ---
Hello Aaron,

On 9/30/2025 1:26 PM, Aaron Lu wrote:
> On Mon, Sep 29, 2025 at 03:04:03PM +0530, K Prateek Nayak wrote:
> ... ...
>> +		if (list_empty(&cfs_rq->throttled_limbo_list))
>> +			check_enqueue_throttle(cfs_rq);
>
> BTW, do you think a comment is needed? Something like the above, not
> sure if it's too redundant though, feel free to let me know your
> thoughts, thanks.

Now that I'm looking at it again, I think we should actually do a:

	for_each_entity(se)
		check_enqueue_throttle(cfs_rq_of(se));

The reason being, we can have:

	root -> A (throttled) -> B -> C

Consider B has runtime_remaining = 0, and subsequently a throttled task
is queued onto C. Ideally, we should start the B/W timer for B at that
point but we bail out after queuing it on C. Thoughts?

Since we only catch up to the 0 runtime_remaining point, it should be
fine. The comment can perhaps be something like:

	/*
	 * If this is the first enqueue on throttled hierarchy,
	 * ensure bandwidth is available when the hierarchy is
	 * unthrottled. check_enqueue_throttle() will ensure
	 * either some bandwidth is available, or will throttle
	 * the cfs_rq and queue the bandwidth timer.
	 */

--
Thanks and Regards,
Prateek
On Tue, Sep 30, 2025 at 02:28:16PM +0530, K Prateek Nayak wrote:
> Hello Aaron,
>
> ... ...
>
> Now that I'm looking at it again, I think we should actually do a:
>
> 	for_each_entity(se)
> 		check_enqueue_throttle(cfs_rq_of(se));
>
> The reason being, we can have:
>
> 	root -> A (throttled) -> B -> C
>
> Consider B has runtime_remaining = 0, and subsequently a throttled task
> is queued onto C. Ideally, we should start the B/W timer for B at that
> point but we bail out after queuing it on C. Thoughts?

Yes agree the B/W timer should also be considered.

So in my original patch, cfs_rqs will (most likely) start with
runtime_remaining == 1 and unthrottled after calling throttle_cfs_rq(),
which will also start the B/W timer. The timer is not needed in this
case when no cfs_rqs are actually throttled but it doesn't hurt. Looks
like everything is OK, we do not need to do any special handling in
enqueue_throttled_task(). Thoughts?
Hello Aaron,

I'll merge the two replies in one.

On 9/30/2025 4:37 PM, Aaron Lu wrote:
> So in my original patch, cfs_rqs will (most likely) start with
> runtime_remaining == 1 and unthrottled after calling throttle_cfs_rq(),
> which will also start the B/W timer. The timer is not needed in this
> case when no cfs_rqs are actually throttled but it doesn't hurt. Looks
> like everything is OK, we do not need to do any special handling in
> enqueue_throttled_task(). Thoughts?

Now that I look at throttle_cfs_rq() properly, we'll only move the
runtime_remaining from 0 to 1 so a few usecs worth of bandwidth
distributed at max should be okay. Sorry for being overly cautious!

So your current approach should be good. Please feel free to include:

Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>

As for the other thread:

On 9/30/2025 6:09 PM, Aaron Lu wrote:
>>> root -> A (throttled) -> B -> C
>>>
>>> Consider B has runtime_remaining = 0, and subsequently a throttled task
>>> is queued onto C. Ideally, we should start the B/W timer for B at that
>>> point but we bail out after queuing it on C. Thoughts?
>>
>> Yes agree the B/W timer should also be considered.
>
> On another thought, do we really need to care about the B/W timer for B?
>
> I mean, when C is unthrottled and gets enqueued on B,
> check_enqueue_throttle() will do the right thing for B so I don't
> think we need to do this hierarchy check_enqueue_throttle() here.

So what I thought would happen here is that when A is unthrottled,
you'll enqueue the task and only then realize B doesn't have any
bandwidth and start the timer then; but had you identified it earlier,
distribution could have already added some bandwidth to B and then you
could run the task without adding any further latency.

--
Thanks and Regards,
Prateek
On Tue, Sep 30, 2025 at 07:08:20PM +0530, K Prateek Nayak wrote:
> Hello Aaron,
>
> I'll merge the two replies in one.
>
> ... ...
>
> Now that I look at throttle_cfs_rq() properly, we'll only move the
> runtime_remaining from 0 to 1 so a few usecs worth of bandwidth
> distributed at max should be okay. Sorry for being overly cautious!

Never mind.

> So your current approach should be good. Please feel free to include:
>
> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>

Thanks!
On Tue, Sep 30, 2025 at 07:07:17PM +0800, Aaron Lu wrote:
> On Tue, Sep 30, 2025 at 02:28:16PM +0530, K Prateek Nayak wrote:
> > ... ...
> > Now that I'm looking at it again, I think we should actually do a:
> >
> > 	for_each_entity(se)
> > 		check_enqueue_throttle(cfs_rq_of(se));
> >
> > The reason being, we can have:
> >
> > 	root -> A (throttled) -> B -> C
> >
> > Consider B has runtime_remaining = 0, and subsequently a throttled task
> > is queued onto C. Ideally, we should start the B/W timer for B at that
> > point but we bail out after queuing it on C. Thoughts?
>
> Yes agree the B/W timer should also be considered.

On another thought, do we really need to care about the B/W timer for B?

I mean, when C is unthrottled and gets enqueued on B,
check_enqueue_throttle() will do the right thing for B so I don't
think we need to do this hierarchy check_enqueue_throttle() here.

I think the only difference between your suggestion and my patch is:
with your suggestion, it's possible for a runtime_enabled cfs_rq to
reach tg_unthrottle_up() with runtime_remaining equal to 0, but since
it doesn't have any tasks in its limbo list, it will not do any enqueue
so won't possibly trigger a throttle there, so it's still fine. i.e. I
think your original suggestion works.
Hi Prateek,

On Tue, Sep 30, 2025 at 02:28:16PM +0530, K Prateek Nayak wrote:
> ... ...
> Now that I'm looking at it again, I think we should actually do a:
>
> 	for_each_entity(se)
> 		check_enqueue_throttle(cfs_rq_of(se));

Nice catch and sigh.

> The reason being, we can have:
>
> 	root -> A (throttled) -> B -> C
>
> Consider B has runtime_remaining = 0, and subsequently a throttled task
> is queued onto C. Ideally, we should start the B/W timer for B at that
> point but we bail out after queuing it on C. Thoughts?

If we want to make sure no cfs_rqs with runtime_enabled get unthrottled
with zero runtime_remaining, agree we will have to do that in a
hierarchical way.

I don't feel good about that for_each_entity(se) check_enqueue_throttle()
though, it made me feel we are duplicating enqueue_task_fair() somehow...

With this said, if we have to do that hierarchical check, I would prefer
to throttle it upfront in tg_set_cfs_bandwidth() :) The useless assign of
runtime is just 1ns, and it should only affect the first period, so
shouldn't matter much?
Hi Prateek,

Thanks for taking a look and the suggestion.

On Mon, Sep 29, 2025 at 03:04:03PM +0530, K Prateek Nayak wrote:
> Hello Aaron,
>
> On 9/29/2025 1:16 PM, Aaron Lu wrote:
> > ... ...
> > +		/*
> > +		 * Throttle cfs_rq now or it can be unthrottled with zero
> > +		 * runtime_remaining and gets throttled on its unthrottle path.
> > +		 */
> > +		if (cfs_rq->runtime_enabled && !cfs_rq->throttled)
> > +			throttle_cfs_rq(cfs_rq);
>
> So one downside of this is throttle_cfs_rq() here can assign bandwidth
> to an empty cfs_rq and a genuine enqueue later on another CPU might not
> find bandwidth thus delaying its execution.

Agree that assign doesn't make sense here.

> Can we instead do a check_enqueue_throttle() in enqueue_throttled_task()
> if we find cfs_rq->throttled_limbo_list to be empty?
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 18a30ae35441..fd2d4dad9c27 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5872,6 +5872,8 @@ static bool enqueue_throttled_task(struct task_struct *p)
>  	 */
>  	if (throttled_hierarchy(cfs_rq) &&
>  	    !task_current_donor(rq_of(cfs_rq), p)) {
> +		if (list_empty(&cfs_rq->throttled_limbo_list))
> +			check_enqueue_throttle(cfs_rq);
>  		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
>  		return true;
>  	}
> ---

Works for me, will follow your suggestion if no other comments, thanks!
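
For reference, the suggested check and the comment discussed upthread,
taken together, would read roughly like this inside
enqueue_throttled_task(). This is only an illustrative consolidation of
the snippets quoted above, not a posted patch:

	if (throttled_hierarchy(cfs_rq) &&
	    !task_current_donor(rq_of(cfs_rq), p)) {
		/*
		 * Make sure to throttle this cfs_rq or it can be unthrottled
		 * with no runtime_remaining and get throttled again on its
		 * unthrottle path.
		 */
		if (list_empty(&cfs_rq->throttled_limbo_list))
			check_enqueue_throttle(cfs_rq);
		list_add(&p->throttle_node, &cfs_rq->throttled_limbo_list);
		return true;
	}
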