[v2] sched/fair: Fix non-empty throttled_limbo_list warning in tg_throttle_down()

[PATCH v2] sched/fair: Fix non-empty throttled_limbo_list warning in tg_throttle_down()

Posted by Hao Jia 3 months, 1 week ago

From: Hao Jia <jiahao1@lixiang.com>

Aaron Lu and I hit a non-empty throttled_limbo_list warning in
tg_throttle_down() during testing.

WARNING: kernel/sched/fair.c:5975 at tg_throttle_down+0x2bd/0x2f0, CPU#20: swapper/20/0
Call Trace:
 ? __pfx_tg_nop+0x10/0x10
 walk_tg_tree_from+0x39/0xd0
 ? __pfx_tg_throttle_down+0x10/0x10
 throttle_cfs_rq+0x176/0x1f0
 enqueue_task_fair+0x4f5/0xd30
 ? unthrottle_cfs_rq+0x2f7/0x3a0
 tg_unthrottle_up+0x10c/0x2f0
 ? __pfx_tg_unthrottle_up+0x10/0x10
 walk_tg_tree_from+0x66/0xd0
 ? __pfx_tg_nop+0x10/0x10
 unthrottle_cfs_rq+0x16b/0x3a0
 __cfsb_csd_unthrottle+0x1f0/0x250
 ? __pfx___cfsb_csd_unthrottle+0x10/0x10
 __flush_smp_call_function_queue+0x104/0x440
 ? tick_nohz_account_idle_time+0x4c/0x80
 flush_smp_call_function_queue+0x3b/0x80
 do_idle+0x14f/0x240
 cpu_startup_entry+0x30/0x40
 start_secondary+0x128/0x160
 common_startup_64+0x13e/0x141

cgroup hierarchy:

          root
        /      \
        A*     ...
     /  |  \   ...
        B* ...

Debugging shows the following:
A and B are configured with relatively small quota and large period.

At some point, cfs_rq_A is throttled. Due to the throttling of cfs_rq_A,
the tasks on cfs_rq_B are added to cfs_rq_B's throttled_limbo_list.

Resetting task_group B quota will set cfs_rq_B runtime_remaining to 0 in
tg_set_cfs_bandwidth().
Since task_group A is throttled, Therefore, task on cfs_rq_B cannot run,
and runtime_remaining stays 0. With task_group B has a small quota,
tasks on other CPUs in task_group B quickly consume all of
cfs_b_B->runtime, causing cfs_b_B->runtime to be 0.

When cfs_rq_A is unthrottled later, tg_unthrottle_up(cfs_rq_B) will
re-queues task. However, because cfs_rq_B->runtime_remaining still 0,
and it cannot obtain runtime from cfs_b_B->runtime either. Therefore,
the task will be throttled in
enqueue_task_fair()->enqueue_entity()->check_enqueue_throttle(),
triggering a non-empty throttled_limbo_list warning in tg_throttle_down().

Root Cause:
In unthrottle_cfs_rq(), we only checked cfs_rq_A->runtime_remaining, but
enqueue_task_fair() requires that the runtime_remaining of each cfs_rq
level be greater than 0.

Solution:
One way to fix this warning is to add a runtime_remaining check for
each cfs_rq level of the task in unthrottle_cfs_rq(), but this makes code
strange and complicated.
Another straightforward and simple solution is to add a new enqueue flag
to ensure that enqueue in tg_unthrottle_up() will not immediately trigger
throttling. This may enqueue sched_entity with no remaining runtime, which
is not a big deal because the current kernel also has such situations [1].

We still retain the runtime_remaining check in unthrottle_cfs_rq() for
higher-level cfs_rq to avoid enqueuing many entities with
runtime_remaining < 0.

Also remove the redundant assignment to se in tg_throttle_down().

[1]: https://lore.kernel.org/all/20251015084045.GB35@bytedance

Fixes: e1fad12dcb66 ("sched/fair: Switch to task based throttle model")
Suggested-by: Aaron Lu <ziqianlu@bytedance.com>
Suggested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reported-by: Aaron Lu <ziqianlu@bytedance.com>
Closes: https://lore.kernel.org/all/20251016065438.GA32@bytedance
Signed-off-by: Hao Jia <jiahao1@lixiang.com>
---
v2:
  1.Fix the issue where enqueue flags were overwritten in enqueue_task_fair(),
  ensuring that the ENQUEUE_THROTTLE flag can be propagated through the hierarchy.

  2.As suggested by Aaron Lu, check ENQUEUE_THROTTLE in check_enqueue_throttle()
  to optimize for scenarios without quota set, and pair the
  ENQUEUE_THROTTLE / DEQUEUE_THROTTLE flags.

  3.Prateek suggested placing the ENQUEUE_THROTTLE check after account_cfs_rq_runtime()
  to prevent a possible missed start of the bandwidth timer.

 kernel/sched/fair.c  | 31 ++++++++++++++++++++++---------
 kernel/sched/sched.h |  7 ++++++-
 2 files changed, 28 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 273e2871b59e..67ce46c532e2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5229,7 +5229,7 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	se->deadline = se->vruntime + vslice;
 }
 
-static void check_enqueue_throttle(struct cfs_rq *cfs_rq);
+static void check_enqueue_throttle(struct cfs_rq *cfs_rq, int flags);
 static inline int cfs_rq_throttled(struct cfs_rq *cfs_rq);
 
 static void
@@ -5287,7 +5287,8 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 	se->on_rq = 1;
 
 	if (cfs_rq->nr_queued == 1) {
-		check_enqueue_throttle(cfs_rq);
+		check_enqueue_throttle(cfs_rq, flags);
+
 		list_add_leaf_cfs_rq(cfs_rq);
 #ifdef CONFIG_CFS_BANDWIDTH
 		if (cfs_rq->pelt_clock_throttled) {
@@ -5912,7 +5913,7 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
 	list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
 		list_del_init(&p->throttle_node);
 		p->throttled = false;
-		enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP);
+		enqueue_task_fair(rq_of(cfs_rq), p, ENQUEUE_WAKEUP | ENQUEUE_THROTTLE);
 	}
 
 	/* Add cfs_rq with load or one or more already running entities to the list */
@@ -6029,15 +6030,18 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 	 * unthrottled us with a positive runtime_remaining but other still
 	 * running entities consumed those runtime before we reached here.
 	 *
-	 * Anyway, we can't unthrottle this cfs_rq without any runtime remaining
+	 * We can't unthrottle this cfs_rq without any runtime remaining
 	 * because any enqueue in tg_unthrottle_up() will immediately trigger a
 	 * throttle, which is not supposed to happen on unthrottle path.
+	 *
+	 * Although the ENQUEUE_THROTTLE flag ensures that enqueues in
+	 * tg_unthrottle_up() do not trigger a throttle, we still retain the check
+	 * for cfs_rq->runtime_remaining. This prevents the enqueueing of many
+	 * entities whose runtime_remaining is less than 0.
 	 */
 	if (cfs_rq->runtime_enabled && cfs_rq->runtime_remaining <= 0)
 		return;
 
-	se = cfs_rq->tg->se[cpu_of(rq)];
-
 	cfs_rq->throttled = 0;
 
 	update_rq_clock(rq);
@@ -6403,7 +6407,7 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
  * expired/exceeded, otherwise it may be allowed to steal additional ticks of
  * runtime as update_curr() throttling can not trigger until it's on-rq.
  */
-static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
+static void check_enqueue_throttle(struct cfs_rq *cfs_rq, int flags)
 {
 	if (!cfs_bandwidth_used())
 		return;
@@ -6418,6 +6422,13 @@ static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
 
 	/* update runtime allocation */
 	account_cfs_rq_runtime(cfs_rq, 0);
+	/*
+	 * Do not attempt to throttle on the cfs_rq unthrottle path.
+	 * and it must be placed after account_cfs_rq_runtime() to
+	 * prevent a possible missed start of the bandwidth timer.
+	 */
+	if (flags & ENQUEUE_THROTTLE)
+		return;
 	if (cfs_rq->runtime_remaining <= 0)
 		throttle_cfs_rq(cfs_rq);
 }
@@ -6724,7 +6735,7 @@ static void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p)
 
 static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec) {}
 static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq) { return false; }
-static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
+static void check_enqueue_throttle(struct cfs_rq *cfs_rq, int flags) {}
 static inline void sync_throttle(struct task_group *tg, int cpu) {}
 static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
 static void task_throttle_setup_work(struct task_struct *p) {}
@@ -6926,6 +6937,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	int h_nr_runnable = 1;
 	int task_new = !(flags & ENQUEUE_WAKEUP);
 	int rq_h_nr_queued = rq->cfs.h_nr_queued;
+	int throttle_flag = flags & ENQUEUE_THROTTLE;
 	u64 slice = 0;
 
 	if (task_is_throttled(p) && enqueue_throttled_task(p))
@@ -6983,7 +6995,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		if (cfs_rq_is_idle(cfs_rq))
 			h_nr_idle = 1;
 
-		flags = ENQUEUE_WAKEUP;
+		/* Ensure ENQUEUE_THROTTLE flag can be propagated through the hierarchy */
+		flags = ENQUEUE_WAKEUP | throttle_flag;
 	}
 
 	for_each_sched_entity(se) {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e7718f12bc55..468013d860a6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2364,6 +2364,10 @@ extern const u32		sched_prio_to_wmult[40];
  * CLASS - going to update p->sched_class; makes sched_change call the
  *         various switch methods.
  *
+ * THROTTLE - invoke in throttle_cfs_rq_work() to ensure task dequeue
+ *            during throttling, and in tg_unthrottle_up() to ensure
+ *            task enqueue during unthrottling.
+ *
  * ENQUEUE_HEAD      - place at front of runqueue (tail if not specified)
  * ENQUEUE_REPLENISH - CBS (replenish runtime and postpone deadline)
  * ENQUEUE_MIGRATED  - the task was migrated during wakeup
@@ -2381,9 +2385,9 @@ extern const u32		sched_prio_to_wmult[40];
 #define DEQUEUE_MIGRATING	0x0010 /* Matches ENQUEUE_MIGRATING */
 #define DEQUEUE_DELAYED		0x0020 /* Matches ENQUEUE_DELAYED */
 #define DEQUEUE_CLASS		0x0040 /* Matches ENQUEUE_CLASS */
+#define DEQUEUE_THROTTLE	0x0080 /* Matches ENQUEUE_THROTTLE */
 
 #define DEQUEUE_SPECIAL		0x00010000
-#define DEQUEUE_THROTTLE	0x00020000
 
 #define ENQUEUE_WAKEUP		0x0001
 #define ENQUEUE_RESTORE		0x0002
@@ -2393,6 +2397,7 @@ extern const u32		sched_prio_to_wmult[40];
 #define ENQUEUE_MIGRATING	0x0010
 #define ENQUEUE_DELAYED		0x0020
 #define ENQUEUE_CLASS		0x0040
+#define ENQUEUE_THROTTLE	0x0080
 
 #define ENQUEUE_HEAD		0x00010000
 #define ENQUEUE_REPLENISH	0x00020000
-- 
2.34.1

Re: [PATCH v2] sched/fair: Fix non-empty throttled_limbo_list warning in tg_throttle_down()

Posted by Aaron Lu 3 months, 1 week ago

On Mon, Oct 27, 2025 at 05:05:34PM +0800, Hao Jia wrote:
> @@ -6403,7 +6407,7 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>   * expired/exceeded, otherwise it may be allowed to steal additional ticks of
>   * runtime as update_curr() throttling can not trigger until it's on-rq.
>   */
> -static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
> +static void check_enqueue_throttle(struct cfs_rq *cfs_rq, int flags)
>  {
>  	if (!cfs_bandwidth_used())
>  		return;
> @@ -6418,6 +6422,13 @@ static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
>  
>  	/* update runtime allocation */
>  	account_cfs_rq_runtime(cfs_rq, 0);
> +	/*
> +	 * Do not attempt to throttle on the cfs_rq unthrottle path.
> +	 * and it must be placed after account_cfs_rq_runtime() to
> +	 * prevent a possible missed start of the bandwidth timer.

Hi Prateek and Hao,

Does it matter to start the bw timer? If no cfs_rq gets throttled, the
timer doesn't look that useful. Also, account_cfs_rq_runtime() calls
assign_cfs_rq_runtime() and if assign failed, it will do resched_curr()
but since we do not throttle cfs_rq here, that resched would be useless.

> +	 */
> +	if (flags & ENQUEUE_THROTTLE)
> +		return;
>  	if (cfs_rq->runtime_remaining <= 0)
>  		throttle_cfs_rq(cfs_rq);
>  }
> @@ -6724,7 +6735,7 @@ static void sched_fair_update_stop_tick(struct rq *rq, struct task_struct *p)
>  
>  static void account_cfs_rq_runtime(struct cfs_rq *cfs_rq, u64 delta_exec) {}
>  static bool check_cfs_rq_runtime(struct cfs_rq *cfs_rq) { return false; }
> -static void check_enqueue_throttle(struct cfs_rq *cfs_rq) {}
> +static void check_enqueue_throttle(struct cfs_rq *cfs_rq, int flags) {}
>  static inline void sync_throttle(struct task_group *tg, int cpu) {}
>  static __always_inline void return_cfs_rq_runtime(struct cfs_rq *cfs_rq) {}
>  static void task_throttle_setup_work(struct task_struct *p) {}
> @@ -6926,6 +6937,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  	int h_nr_runnable = 1;
>  	int task_new = !(flags & ENQUEUE_WAKEUP);
>  	int rq_h_nr_queued = rq->cfs.h_nr_queued;
> +	int throttle_flag = flags & ENQUEUE_THROTTLE;
>  	u64 slice = 0;
>  
>  	if (task_is_throttled(p) && enqueue_throttled_task(p))
> @@ -6983,7 +6995,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  		if (cfs_rq_is_idle(cfs_rq))
>  			h_nr_idle = 1;
>  
> -		flags = ENQUEUE_WAKEUP;
> +		/* Ensure ENQUEUE_THROTTLE flag can be propagated through the hierarchy */
> +		flags = ENQUEUE_WAKEUP | throttle_flag;
>  	}
>  
>  	for_each_sched_entity(se) {
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index e7718f12bc55..468013d860a6 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2364,6 +2364,10 @@ extern const u32		sched_prio_to_wmult[40];
>   * CLASS - going to update p->sched_class; makes sched_change call the
>   *         various switch methods.
>   *
> + * THROTTLE - invoke in throttle_cfs_rq_work() to ensure task dequeue
> + *            during throttling, and in tg_unthrottle_up() to ensure
> + *            task enqueue during unthrottling.

Thanks for adding the comment for dequeue_throttle. One nit is:
dequeue_throttle is used for time accounting purpose so it would be
better if you can make that clear, maybe something like this:

invoked in throttle_cfs_rq_work() during throttle for accounting
purpose, and in tg_unthrottle_up() to avoid throttling when enqueuing
tasks.

> + *
>   * ENQUEUE_HEAD      - place at front of runqueue (tail if not specified)
>   * ENQUEUE_REPLENISH - CBS (replenish runtime and postpone deadline)
>   * ENQUEUE_MIGRATED  - the task was migrated during wakeup
> @@ -2381,9 +2385,9 @@ extern const u32		sched_prio_to_wmult[40];
>  #define DEQUEUE_MIGRATING	0x0010 /* Matches ENQUEUE_MIGRATING */
>  #define DEQUEUE_DELAYED		0x0020 /* Matches ENQUEUE_DELAYED */
>  #define DEQUEUE_CLASS		0x0040 /* Matches ENQUEUE_CLASS */
> +#define DEQUEUE_THROTTLE	0x0080 /* Matches ENQUEUE_THROTTLE */
>  
>  #define DEQUEUE_SPECIAL		0x00010000
> -#define DEQUEUE_THROTTLE	0x00020000
>  
>  #define ENQUEUE_WAKEUP		0x0001
>  #define ENQUEUE_RESTORE		0x0002
> @@ -2393,6 +2397,7 @@ extern const u32		sched_prio_to_wmult[40];
>  #define ENQUEUE_MIGRATING	0x0010
>  #define ENQUEUE_DELAYED		0x0020
>  #define ENQUEUE_CLASS		0x0040
> +#define ENQUEUE_THROTTLE	0x0080
>  
>  #define ENQUEUE_HEAD		0x00010000
>  #define ENQUEUE_REPLENISH	0x00020000
> -- 
> 2.34.1
>

Re: [PATCH v2] sched/fair: Fix non-empty throttled_limbo_list warning in tg_throttle_down()

Posted by Hao Jia 3 months, 1 week ago


On 2025/10/27 20:02, Aaron Lu wrote:

>>    *
>> + * THROTTLE - invoke in throttle_cfs_rq_work() to ensure task dequeue
>> + *            during throttling, and in tg_unthrottle_up() to ensure
>> + *            task enqueue during unthrottling.
> 
> Thanks for adding the comment for dequeue_throttle. One nit is:
> dequeue_throttle is used for time accounting purpose so it would be
> better if you can make that clear, maybe something like this:
> 
> invoked in throttle_cfs_rq_work() during throttle for accounting
> purpose, and in tg_unthrottle_up() to avoid throttling when enqueuing
> tasks.

Thanks, will do it.

Thanks,
Hao

Re: [PATCH v2] sched/fair: Fix non-empty throttled_limbo_list warning in tg_throttle_down()

Posted by K Prateek Nayak 3 months, 1 week ago

Hello Aaron,

On 10/27/2025 5:32 PM, Aaron Lu wrote:
> On Mon, Oct 27, 2025 at 05:05:34PM +0800, Hao Jia wrote:
>> @@ -6403,7 +6407,7 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
>>   * expired/exceeded, otherwise it may be allowed to steal additional ticks of
>>   * runtime as update_curr() throttling can not trigger until it's on-rq.
>>   */
>> -static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
>> +static void check_enqueue_throttle(struct cfs_rq *cfs_rq, int flags)
>>  {
>>  	if (!cfs_bandwidth_used())
>>  		return;
>> @@ -6418,6 +6422,13 @@ static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
>>  
>>  	/* update runtime allocation */
>>  	account_cfs_rq_runtime(cfs_rq, 0);
>> +	/*
>> +	 * Do not attempt to throttle on the cfs_rq unthrottle path.
>> +	 * and it must be placed after account_cfs_rq_runtime() to
>> +	 * prevent a possible missed start of the bandwidth timer.
> 
> Hi Prateek and Hao,
> 
> Does it matter to start the bw timer? If no cfs_rq gets throttled, the
> timer doesn't look that useful.

Ack! But if we've reached here with ENQUEUE_THROTTLE set, we are in fact
trying to enqueue a task during unthrottling and we only start the timer
if we don't have any bandwidth since __account_cfs_rq_runtime() would bail
out if "cfs_rq->runtime_remaining > 0"
> Also, account_cfs_rq_runtime() calls
> assign_cfs_rq_runtime() and if assign failed, it will do resched_curr()
> but since we do not throttle cfs_rq here, that resched would be useless.

resched only happens if "cfs_rq->curr" is set andcheck_enqueue_throttle() already bails out if "cfs_rq->curr" is set so
the resched is avoided.

-- 
Thanks and Regards,
Prateek

Re: [PATCH v2] sched/fair: Fix non-empty throttled_limbo_list warning in tg_throttle_down()

Posted by Aaron Lu 3 months, 1 week ago

On Tue, Oct 28, 2025 at 08:33:15AM +0530, K Prateek Nayak wrote:
> Hello Aaron,
> 
> On 10/27/2025 5:32 PM, Aaron Lu wrote:
> > On Mon, Oct 27, 2025 at 05:05:34PM +0800, Hao Jia wrote:
> >> @@ -6403,7 +6407,7 @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
> >>   * expired/exceeded, otherwise it may be allowed to steal additional ticks of
> >>   * runtime as update_curr() throttling can not trigger until it's on-rq.
> >>   */
> >> -static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
> >> +static void check_enqueue_throttle(struct cfs_rq *cfs_rq, int flags)
> >>  {
> >>  	if (!cfs_bandwidth_used())
> >>  		return;
> >> @@ -6418,6 +6422,13 @@ static void check_enqueue_throttle(struct cfs_rq *cfs_rq)
> >>  
> >>  	/* update runtime allocation */
> >>  	account_cfs_rq_runtime(cfs_rq, 0);
> >> +	/*
> >> +	 * Do not attempt to throttle on the cfs_rq unthrottle path.
> >> +	 * and it must be placed after account_cfs_rq_runtime() to
> >> +	 * prevent a possible missed start of the bandwidth timer.
> > 
> > Hi Prateek and Hao,
> > 
> > Does it matter to start the bw timer? If no cfs_rq gets throttled, the
> > timer doesn't look that useful.
> 
> Ack! But if we've reached here with ENQUEUE_THROTTLE set, we are in fact
> trying to enqueue a task during unthrottling and we only start the timer
> if we don't have any bandwidth since __account_cfs_rq_runtime() would bail
> out if "cfs_rq->runtime_remaining > 0"
> > Also, account_cfs_rq_runtime() calls
> > assign_cfs_rq_runtime() and if assign failed, it will do resched_curr()
> > but since we do not throttle cfs_rq here, that resched would be useless.
> 
> resched only happens if "cfs_rq->curr" is set andcheck_enqueue_throttle() already bails out if "cfs_rq->curr" is set so
> the resched is avoided.

Right, sorry I missed that.