[PATCH] sched: Re-evaluate scheduling when migrating queued tasks out of throttled cgroups
Posted by Zicheng Qu 2 weeks, 4 days ago
Consider the following sequence on a CPU configured with nohz_full:

1) A task P runs in cgroup A, and cgroup A becomes throttled due to CFS
   bandwidth control. The group scheduling entity (gse) of cgroup A, to
   which the task P is attached, is dequeued and the CPU switches to idle.

2) Before cgroup A is unthrottled, task P is migrated from cgroup A to
   another cgroup B (not throttled).

   During sched_move_task(), the task P is observed as queued but not
   running, and therefore no resched_curr() is triggered.

3) Since the CPU is nohz_full, it remains in do_idle() waiting for an
   explicit scheduling event, i.e., resched_curr().

4) Later, cgroup A is unthrottled. However, the task P has already been
   migrated out of cgroup A, so unthrottle_cfs_rq() may observe
   load.weight == 0 and return early without calling resched_curr().

At this point, the task P is runnable in cgroup B (not throttled), but
the CPU remains in do_idle() with no pending reschedule point. The
system stays in this state until an unrelated event that can trigger a
resched_curr() (e.g. a new task wakeup) breaks the nohz_full idle
state, and then the task P finally gets scheduled.

The root cause is that sched_move_task() may classify the task as only
queued, not running, and therefore fails to trigger a resched_curr(),
while the later unthrottling path no longer has visibility of the
migrated task.

Preserve the existing behavior for running tasks by issuing
resched_curr(), and explicitly invoke wakeup_preempt() for tasks
that were queued at the time of migration. This ensures that runnable
tasks are reconsidered for scheduling even when nohz_full suppresses
periodic ticks.

Fixes: 29f59db3a74b ("sched: group-scheduler core")
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
---
 kernel/sched/core.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 045f83ad261e..04271b77101c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9110,6 +9110,7 @@ static void sched_change_group(struct task_struct *tsk)
 void sched_move_task(struct task_struct *tsk, bool for_autogroup)
 {
 	unsigned int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE;
+	bool queued = false;
 	bool resched = false;
 	struct rq *rq;
 
@@ -9122,10 +9123,13 @@ void sched_move_task(struct task_struct *tsk, bool for_autogroup)
 			scx_cgroup_move_task(tsk);
 		if (scope->running)
 			resched = true;
+		queued = scope->queued;
 	}
 
 	if (resched)
 		resched_curr(rq);
+	else if (queued)
+		wakeup_preempt(rq, tsk, 0);
 
 	__balance_callbacks(rq, &rq_guard.rf);
 }
-- 
2.34.1
Re: [PATCH] sched: Re-evaluate scheduling when migrating queued tasks out of throttled cgroups
Posted by Aaron Lu 2 weeks, 3 days ago
On Tue, Jan 20, 2026 at 03:25:49AM +0000, Zicheng Qu wrote:
> Consider the following sequence on a CPU configured with nohz_full:
> 
> 1) A task P runs in cgroup A, and cgroup A becomes throttled due to CFS
>    bandwidth control. The group scheduling entity (gse) of cgroup A, to
>    which the task P is attached, is dequeued and the CPU switches to idle.
> 
> 2) Before cgroup A is unthrottled, task P is migrated from cgroup A to
>    another cgroup B (not throttled).
> 
>    During sched_move_task(), the task P is observed as queued but not
>    running, and therefore no resched_curr() is triggered.
> 
> 3) Since the CPU is nohz_full, it remains in do_idle() waiting for an
>    explicit scheduling event, i.e., resched_curr().
> 
> 4) Later, cgroup A is unthrottled. However, the task P has already been
>    migrated out of cgroup A, so unthrottle_cfs_rq() may observe
>    load.weight == 0 and return early without calling resched_curr().

I suppose this is only possible when the unthrottled cfs_rq has been
fully decayed, i.e. !cfs_rq->on_list is true? Because only in that case,
it will skip the resched_curr() at the bottom of unthrottle_cfs_rq() for
the scenario you have described.

Looking at this logic, I feel the early return due to
(!cfs_rq->load.weight) && (!cfs_rq->on_list) is strange, because the
resched at the bottom:

	/* Determine whether we need to wake up potentially idle CPU: */
	if (rq->curr == rq->idle && rq->cfs.nr_queued)
		resched_curr(rq);

should not depend on whether cfs_rq is fully decayed or not...

I think it should be something like this:
- complete the branch if no task enqueued but still on_list;
- only resched_curr() if task gets enqueued

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e71302282671c..e09da54a5d117 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6009,9 +6009,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 	/* update hierarchical throttle state */
 	walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq);
 
-	if (!cfs_rq->load.weight) {
-		if (!cfs_rq->on_list)
-			return;
+	if (!cfs_rq->load.weight && cfs_rq->on_list) {
 		/*
 		 * Nothing to run but something to decay (on_list)?
 		 * Complete the branch.
@@ -6025,7 +6023,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
 	assert_list_leaf_cfs_rq(rq);
 
 	/* Determine whether we need to wake up potentially idle CPU: */
-	if (rq->curr == rq->idle && rq->cfs.nr_queued)
+	if (rq->curr == rq->idle && cfs_rq->nr_queued)
 		resched_curr(rq);
 }
 

Thoughts?

> At this point, the task P is runnable in cgroup B (not throttled), but
> the CPU remains in do_idle() with no pending reschedule point. The
> system stays in this state until an unrelated event that can trigger a
> resched_curr() (e.g. a new task wakeup) breaks the nohz_full idle
> state, and then the task P finally gets scheduled.
> 
> The root cause is that sched_move_task() may classify the task as only
> queued, not running, and therefore fails to trigger a resched_curr(),
> while the later unthrottling path no longer has visibility of the
> migrated task.
> 
> Preserve the existing behavior for running tasks by issuing
> resched_curr(), and explicitly invoke wakeup_preempt() for tasks
> that were queued at the time of migration. This ensures that runnable
> tasks are reconsidered for scheduling even when nohz_full suppresses
> periodic ticks.
> 
> Fixes: 29f59db3a74b ("sched: group-scheduler core")
> Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>

I haven't been able to reproduce this but the change looks reasonable to
me, so:

Reviewed-by: Aaron Lu <ziqianlu@bytedance.com>
Re: [PATCH] sched: Re-evaluate scheduling when migrating queued tasks out of throttled cgroups
Posted by K Prateek Nayak 2 weeks, 3 days ago
Hello Aaron,

On 1/21/2026 9:19 AM, Aaron Lu wrote:
> On Tue, Jan 20, 2026 at 03:25:49AM +0000, Zicheng Qu wrote:
>> Consider the following sequence on a CPU configured with nohz_full:
>>
>> 1) A task P runs in cgroup A, and cgroup A becomes throttled due to CFS
>>    bandwidth control. The group scheduling entity (gse) of cgroup A, to
>>    which the task P is attached, is dequeued and the CPU switches to idle.
>>
>> 2) Before cgroup A is unthrottled, task P is migrated from cgroup A to
>>    another cgroup B (not throttled).
>>
>>    During sched_move_task(), the task P is observed as queued but not
>>    running, and therefore no resched_curr() is triggered.
>>
>> 3) Since the CPU is nohz_full, it remains in do_idle() waiting for an
>>    explicit scheduling event, i.e., resched_curr().
>>
>> 4) Later, cgroup A is unthrottled. However, the task P has already been
>>    migrated out of cgroup A, so unthrottle_cfs_rq() may observe
>>    load.weight == 0 and return early without calling resched_curr().
> 
> I suppose this is only possible when the unthrottled cfs_rq has been
> fully decayed, i.e. !cfs_rq->on_list is true?

Ack! Since we detach the task from the cfs_rq during
task_change_group_fair(), the cfs_rq_is_decayed() check during
tg_unthrottle_up() can return true, so we skip putting the cfs_rq on
the leaf_cfs_rq_list and unthrottle_cfs_rq() will skip the resched.
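
i.e., roughly like this (a paraphrased sketch of tg_unthrottle_up(),
not the exact mainline code):

	cfs_rq->throttle_count--;
	if (!cfs_rq->throttle_count) {
		/*
		 * A fully decayed cfs_rq (e.g. its only task was moved
		 * away while throttled) is not put back on the leaf
		 * list, so unthrottle_cfs_rq() later sees
		 * !cfs_rq->on_list and skips the resched.
		 */
		if (!cfs_rq_is_decayed(cfs_rq))
			list_add_leaf_cfs_rq(cfs_rq);
	}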

> Because only in that case,
> it will skip the resched_curr() at the bottom of unthrottle_cfs_rq() for
> the scenario you have described.

Indeed. Happy coincidence that we checked for "rq->cfs.nr_queued" and an
unrelated unthrottle could still force a resched for a missed one :-)

> 
> Looking at this logic, I feel the early return due to
> (!cfs_rq->load.weight) && (!cfs_rq->on_list) is strange, because the
> resched at the bottom:
> 
> 	/* Determine whether we need to wake up potentially idle CPU: */
> 	if (rq->curr == rq->idle && rq->cfs.nr_queued)
> 		resched_curr(rq);
> 
> should not depend on whether cfs_rq is fully decayed or not...

But if it is off the list, then it doesn't have any tasks to resched
anyway, and in Zicheng's scenario too, the cfs_rq won't have any
tasks at the time of unthrottle.

> 
> I think it should be something like this:
> - complete the branch if no task enqueued but still on_list;
> - only resched_curr() if task gets enqueued
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e71302282671c..e09da54a5d117 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6009,9 +6009,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>  	/* update hierarchical throttle state */
>  	walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq);
>  
> -	if (!cfs_rq->load.weight) {
> -		if (!cfs_rq->on_list)
> -			return;
> +	if (!cfs_rq->load.weight && cfs_rq->on_list) {
>  		/*
>  		 * Nothing to run but something to decay (on_list)?
>  		 * Complete the branch.
> @@ -6025,7 +6023,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>  	assert_list_leaf_cfs_rq(rq);
>  
>  	/* Determine whether we need to wake up potentially idle CPU: */
> -	if (rq->curr == rq->idle && rq->cfs.nr_queued)
> +	if (rq->curr == rq->idle && cfs_rq->nr_queued)
>  		resched_curr(rq);
>  }
>  
> 
> Thoughts?

Yes, checking for cfs_rq->nr_queued should indicate if new tasks were
woken on this unthrottled hierarchy. It should make it easier to spot
scenarios like the one that Zicheng experienced, if there are any more
of those lurking around.

-- 
Thanks and Regards,
Prateek
Re: [PATCH] sched: Re-evaluate scheduling when migrating queued tasks out of throttled cgroups
Posted by Aaron Lu 2 weeks, 3 days ago
On Wed, Jan 21, 2026 at 10:54:11AM +0530, K Prateek Nayak wrote:
> Hello Aaron,
> 
> On 1/21/2026 9:19 AM, Aaron Lu wrote:
> > On Tue, Jan 20, 2026 at 03:25:49AM +0000, Zicheng Qu wrote:
> >> Consider the following sequence on a CPU configured with nohz_full:
> >>
> >> 1) A task P runs in cgroup A, and cgroup A becomes throttled due to CFS
> >>    bandwidth control. The group scheduling entity (gse) of cgroup A, to
> >>    which the task P is attached, is dequeued and the CPU switches to idle.
> >>
> >> 2) Before cgroup A is unthrottled, task P is migrated from cgroup A to
> >>    another cgroup B (not throttled).
> >>
> >>    During sched_move_task(), the task P is observed as queued but not
> >>    running, and therefore no resched_curr() is triggered.
> >>
> >> 3) Since the CPU is nohz_full, it remains in do_idle() waiting for an
> >>    explicit scheduling event, i.e., resched_curr().
> >>
> >> 4) Later, cgroup A is unthrottled. However, the task P has already been
> >>    migrated out of cgroup A, so unthrottle_cfs_rq() may observe
> >>    load.weight == 0 and return early without calling resched_curr().
> > 
> > I suppose this is only possible when the unthrottled cfs_rq has been
> > fully decayed, i.e. !cfs_rq->on_list is true?
> 
> Ack! Since we detach the task from the cfs_rq during
> task_change_group_fair(), the cfs_rq_is_decayed() check during
> tg_unthrottle_up() can return true, so we skip putting the cfs_rq on
> the leaf_cfs_rq_list and unthrottle_cfs_rq() will skip the resched.
> 
> > Because only in that case,
> > it will skip the resched_curr() at the bottom of unthrottle_cfs_rq() for
> > the scenario you have described.
> 
> Indeed. Happy coincidence that we checked for "rq->cfs.nr_queued" and an
> unrelated unthrottle could still force a resched for a missed one :-)
>

Yes.

For cfs_rqs with no tasks enqueued during unthrottle, we probably should
just return, no matter if this cfs_rq is fully decayed or not. This can
potentially avoid some unnecessary rescheds.
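
Something like this, as a rough untested sketch (keeping the decay /
leaf-list handling above intact and only gating the wakeup):

	/* unthrottle_cfs_rq(), near the bottom: */
	if (!cfs_rq->nr_queued)
		return;	/* nothing runnable was added; skip the resched */

	/* Determine whether we need to wake up potentially idle CPU: */
	if (rq->curr == rq->idle)
		resched_curr(rq);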

> > 
> > Looking at this logic, I feel the early return due to
> > (!cfs_rq->load.weight) && (!cfs_rq->on_list) is strange, because the
> > resched at the bottom:
> > 
> > 	/* Determine whether we need to wake up potentially idle CPU: */
> > 	if (rq->curr == rq->idle && rq->cfs.nr_queued)
> > 		resched_curr(rq);
> > 
> > should not depend on whether cfs_rq is fully decayed or not...
> 
> But if it is off the list, then it doesn't have any tasks to resched
> anyway, and in Zicheng's scenario too, the cfs_rq won't have any
> tasks at the time of unthrottle.
>

Right.

What I wanted to say is, whether to do the resched or not should depend
on whether tasks were woken on this unthrottled hierarchy, but the
current logic only skips the resched for a fully decayed cfs_rq; for
cfs_rqs with no tasks queued and still on_list, it still attempts that
resched condition check, and that's what made me feel strange.

I think the current behavior was introduced by commit 2630cde26711
("sched/fair: Add ancestors of unthrottled undecayed cfs_rq"). The
original behavior was to simply return if no tasks were queued, per
commit 671fd9dabe52 ("sched: Add support for unthrottling group
entities"). Although in commit 671fd9dabe52, I don't see why
rq->cfs.nr_running is used instead of cfs_rq's nr_running.

> > 
> > I think it should be something like this:
> > - complete the branch if no task enqueued but still on_list;
> > - only resched_curr() if task gets enqueued
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index e71302282671c..e09da54a5d117 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6009,9 +6009,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> >  	/* update hierarchical throttle state */
> >  	walk_tg_tree_from(cfs_rq->tg, tg_nop, tg_unthrottle_up, (void *)rq);
> >  
> > -	if (!cfs_rq->load.weight) {
> > -		if (!cfs_rq->on_list)
> > -			return;
> > +	if (!cfs_rq->load.weight && cfs_rq->on_list) {
> >  		/*
> >  		 * Nothing to run but something to decay (on_list)?
> >  		 * Complete the branch.
> > @@ -6025,7 +6023,7 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
> >  	assert_list_leaf_cfs_rq(rq);
> >  
> >  	/* Determine whether we need to wake up potentially idle CPU: */
> > -	if (rq->curr == rq->idle && rq->cfs.nr_queued)
> > +	if (rq->curr == rq->idle && cfs_rq->nr_queued)
> >  		resched_curr(rq);
> >  }
> >  
> > 
> > Thoughts?
> 
> Yes, checking for cfs_rq->nr_queued should indicate if new tasks were
> woken on this unthrottled hierarchy. It should make it easier to spot
> scenarios like the one that Zicheng experienced, if there are any more
> of those lurking around.

Yes indeed :)
[PATCH] sched: Re-evaluate scheduling when migrating queued tasks out of throttled cgroups
Posted by Zicheng Qu 1 week, 1 day ago
Consider the following sequence on a CPU configured with nohz_full:

1) A task P runs in cgroup A, and cgroup A becomes throttled due to CFS
   bandwidth control. The group scheduling entity (gse) of cgroup A, to
   which the task P is attached, is dequeued and the CPU switches to idle.

2) Before cgroup A is unthrottled, task P is migrated from cgroup A to
   another cgroup B (not throttled).

   During sched_move_task(), the task P is observed as queued but not
   running, and therefore no resched_curr() is triggered.

3) Since the CPU is nohz_full, it remains in do_idle() waiting for an
   explicit scheduling event, i.e., resched_curr().

4) For kernel <= 5.10: Later, cgroup A is unthrottled. However, the task
   P has already been migrated out of cgroup A, so unthrottle_cfs_rq()
   may observe load.weight == 0 and return early without calling
   resched_curr(). For kernel >= 6.6: The unthrottling path normally
   triggers resched_curr() in almost all cases, even when no runnable
   tasks remain in the unthrottled cgroup, preventing the idle stall
   described above. However, if cgroup A is removed before it gets
   unthrottled, the unthrottling path for cgroup A is never executed.
   As a result, no resched_curr() can be called.

5) At this point, the task P is runnable in cgroup B (not throttled), but
   the CPU remains in do_idle() with no pending reschedule point. The
   system stays in this state until an unrelated event that can trigger
   a resched_curr() (e.g. a new task wakeup) breaks the nohz_full idle
   state, and then the task P finally gets scheduled.

The root cause is that sched_move_task() may classify the task as only
queued, not running, and therefore fails to trigger a resched_curr(),
while the later unthrottling path no longer has visibility of the
migrated task.

Preserve the existing behavior for running tasks by issuing
resched_curr(), and explicitly invoke wakeup_preempt() for tasks
that were queued at the time of migration. This ensures that runnable
tasks are reconsidered for scheduling even when nohz_full suppresses
periodic ticks.

Fixes: 29f59db3a74b ("sched: group-scheduler core")
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Aaron Lu <ziqianlu@bytedance.com>
---
 kernel/sched/core.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 045f83ad261e..04271b77101c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9110,6 +9110,7 @@ static void sched_change_group(struct task_struct *tsk)
 void sched_move_task(struct task_struct *tsk, bool for_autogroup)
 {
 	unsigned int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE;
+	bool queued = false;
 	bool resched = false;
 	struct rq *rq;
 
@@ -9122,10 +9123,13 @@ void sched_move_task(struct task_struct *tsk, bool for_autogroup)
 			scx_cgroup_move_task(tsk);
 		if (scope->running)
 			resched = true;
+		queued = scope->queued;
 	}
 
 	if (resched)
 		resched_curr(rq);
+	else if (queued)
+		wakeup_preempt(rq, tsk, 0);
 
 	__balance_callbacks(rq, &rq_guard.rf);
 }
-- 
2.34.1
Re: [PATCH] sched: Re-evaluate scheduling when migrating queued tasks out of throttled cgroups
Posted by Peter Zijlstra 5 days, 3 hours ago
On Fri, Jan 30, 2026 at 08:34:38AM +0000, Zicheng Qu wrote:
> Consider the following sequence on a CPU configured with nohz_full:
> 
> 1) A task P runs in cgroup A, and cgroup A becomes throttled due to CFS
>    bandwidth control. The group scheduling entity (gse) of cgroup A, to
>    which the task P is attached, is dequeued and the CPU switches to idle.
> 
> 2) Before cgroup A is unthrottled, task P is migrated from cgroup A to
>    another cgroup B (not throttled).
> 
>    During sched_move_task(), the task P is observed as queued but not
>    running, and therefore no resched_curr() is triggered.
> 
> 3) Since the CPU is nohz_full, it remains in do_idle() waiting for an
>    explicit scheduling event, i.e., resched_curr().
> 
> 4) For kernel <= 5.10: Later, cgroup A is unthrottled. However, the task
>    P has already been migrated out of cgroup A, so unthrottle_cfs_rq()
>    may observe load.weight == 0 and return early without calling
>    resched_curr(). For kernel >= 6.6: The unthrottling path normally
>    triggers resched_curr() in almost all cases, even when no runnable
>    tasks remain in the unthrottled cgroup, preventing the idle stall
>    described above. However, if cgroup A is removed before it gets
>    unthrottled, the unthrottling path for cgroup A is never executed.
>    As a result, no resched_curr() can be called.
> 
> 5) At this point, the task P is runnable in cgroup B (not throttled), but
>    the CPU remains in do_idle() with no pending reschedule point. The
>    system stays in this state until an unrelated event that can trigger
>    a resched_curr() (e.g. a new task wakeup) breaks the nohz_full idle
>    state, and then the task P finally gets scheduled.
> 
> The root cause is that sched_move_task() may classify the task as only
> queued, not running, and therefore fails to trigger a resched_curr(),
> while the later unthrottling path no longer has visibility of the
> migrated task.
> 
> Preserve the existing behavior for running tasks by issuing
> resched_curr(), and explicitly invoke wakeup_preempt() for tasks
> that were queued at the time of migration. This ensures that runnable
> tasks are reconsidered for scheduling even when nohz_full suppresses
> periodic ticks.
> 
> Fixes: 29f59db3a74b ("sched: group-scheduler core")
> Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Reviewed-by: Aaron Lu <ziqianlu@bytedance.com>

Yes, that makes sense.

Thanks!
Re: [PATCH] sched: Re-evaluate scheduling when migrating queued tasks out of throttled cgroups
Posted by Zicheng Qu 1 week, 1 day ago
On 1/30/2026 4:34 PM, Zicheng Qu wrote:

> 4) For kernel <= 5.10: Later, cgroup A is unthrottled. However, the task
>    P has already been migrated out of cgroup A, so unthrottle_cfs_rq()
>    may observe load.weight == 0 and return early without calling
>    resched_curr(). For kernel >= 6.6: The unthrottling path normally
>    triggers resched_curr() in almost all cases, even when no runnable
>    tasks remain in the unthrottled cgroup, preventing the idle stall
>    described above. However, if cgroup A is removed before it gets
>    unthrottled, the unthrottling path for cgroup A is never executed.
>    As a result, no resched_curr() can be called.
Hi Aaron,

Apologies for the confusion in my earlier description — the original
failure model was identified and analyzed on kernels based on LTS 5.10.

Later I realized that on v6.6 and mainline, the issue becomes much harder
to reproduce due to the additional (cfs_rq->on_list) condition introduced
in unthrottle_cfs_rq(), which effectively masks the original reproduction
path (the relevant guard is quoted below).
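
On mainline, unthrottle_cfs_rq() bails out like this (cf. the context
lines of Aaron's diff earlier in the thread):

	if (!cfs_rq->load.weight) {
		/* Fully decayed and off-list: the resched at the bottom is skipped. */
		if (!cfs_rq->on_list)
			return;
		/*
		 * Nothing to run but something to decay (on_list)?
		 * Complete the branch.
		 */
		...
	}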

As a result, I adjusted the reproducer accordingly. With the updated
reproducer, the issue can still be triggered on mainline by explicitly
bypassing the unthrottling reschedule path, as described in the commit
message.

The reproducer can be run directly via:

./make.sh

My local /proc/cmdline is:

systemd.unified_cgroup_hierarchy=0 nohz_full=2-15 rcu_nocbs=2-15

With this setup, the issue is reproducible on current mainline.

make.sh
```sh
#!/bin/bash

gcc -O2 heartbeat.c -o heartbeat

chmod +x ./run_test.sh && ./run_test.sh
```

heartbeat.c
```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <time.h>

static inline long long now_ns(void)
{
     struct timespec ts;
     clock_gettime(CLOCK_MONOTONIC, &ts);
     return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
     cpu_set_t set;
     CPU_ZERO(&set);
     CPU_SET(12, &set);  // CPU 12 is nohz_full
     sched_setaffinity(0, sizeof(set), &set);

     long long last = now_ns();
     unsigned long long iter = 0;

     while (1) {
         iter++;
         long long now = now_ns();
         if (now - last > 1000 * 1000 * 1000) { // 1000ms
             printf("[HB] sec=%lld pid=%d cpu=%d iter=%llu\n", now / 
1000000000LL, getpid(), sched_getcpu(), iter);
             fflush(stdout);
             last = now_ns();
         }
     }
}
```

```sh
#!/bin/bash
#
# run_test.sh
#
# Reproducer for a scheduling stall on nohz_full CPUs when migrating
# queued tasks out of throttled cgroups.
#
# Test outline:
#   1. Start a CPU-bound workload (heartbeat) that prints a heartbeat (HB)
#      once per second.
#   2. Migrate the task into a heavily throttled child cgroup.
#   3. Migrate the task back to the root cgroup (potential trigger point).
#   4. Immediately remove (destroy) the throttled cgroup before it gets
#      unthrottled.
#   5. Observe whether the heartbeat continues to advance.
#      - If HB advances: no stall, continue to next round.
#      - If HB stops advancing: scheduling stall detected, freeze the setup
#        for debugging.
#

set -e

########################
# Basic configuration
########################

ROOT_CG=/sys/fs/cgroup/cpu
THROTTLED_CG=$ROOT_CG/child_cgroup

mkdir -p "$ROOT_CG"

HB_LOG=heartbeat.log

# Throttle settings: 1ms runtime per 1s period
CFS_QUOTA_US=1000
CFS_PERIOD_US=1000000

# Timeout (in seconds) to consider the workload "stuck"
STUCK_TIMEOUT=10
CHECK_INTERVAL=0.2

########################
# Cleanup logic
########################

PID=

cleanup() {
     echo
     echo "[!] cleanup: stopping workload"

     if [[ -n "$PID" ]] && kill -0 "$PID" 2>/dev/null; then
         echo "[!] killing pid $PID"
         kill -TERM "$PID"
         wait "$PID" 2>/dev/null || true
     fi

     echo "[!] cleanup done"
}

trap cleanup INT TERM EXIT

########################
# Start workload
########################

echo "[+] starting heartbeat workload"

./heartbeat | tee "$HB_LOG" &
PID=$(($! - 1)) # temporary hack PID
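# Note: $! is the PID of the last process in the pipeline (tee); the
# heartbeat binary was forked just before it, so $! - 1 usually (but not
# always) matches the workload. Fragile, but good enough for this test.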

echo "[+] workload pid = $PID"
echo

########################
# Helper functions
########################

# Extract the last printed heartbeat second from the log
last_hb_sec() {
     tail -n 1 "$HB_LOG" 2>/dev/null | awk '{
         for (i = 1; i <= NF; i++) {
             if ($i ~ /^sec=/) {
                 split($i, a, "=");
                 print a[2];
                 exit;
             }
         }
     }'
}

verify_cgroup_location() {
     echo "  root cgroup:"
     cat "$ROOT_CG/tasks" | grep "$PID" || true
     echo "  throttled cgroup:"
     cat "$THROTTLED_CG/tasks" | grep "$PID" || true
}

########################
# Main test loop
########################

round=0

while true; do
     # Recreate the throttled cgroup for the next iteration
     mkdir -p "$THROTTLED_CG"
     echo $CFS_QUOTA_US  > "$THROTTLED_CG/cpu.cfs_quota_us"
     echo $CFS_PERIOD_US > "$THROTTLED_CG/cpu.cfs_period_us"

     round=$((round + 1))
     echo "========== ROUND $round =========="

     echo "[1] move task into throttled cgroup"
     echo "$PID" > "$THROTTLED_CG/tasks"

     echo "[1.1] verify cgroup placement"
     verify_cgroup_location

     # Give the task some time to consume its quota and become throttled
     sleep 0.2

     echo "[2] migrate task back to root cgroup (potential trigger)"
     echo "$PID" > "$ROOT_CG/tasks"

     echo "[2.1] verify cgroup placement"
     verify_cgroup_location

     #
     # IMPORTANT:
     # For kernels >= 6.6, unthrottling normally triggers resched_curr().
     # Removing the throttled cgroup before it gets unthrottled bypasses
     # the unthrottle path and is required to reproduce the stall.
     #
     echo "[2.2] remove throttled cgroup before unthrottling"
     rmdir "$THROTTLED_CG"

     # Observe heartbeat after migration back to root
     base_hb=$(last_hb_sec)
     [[ -z "$base_hb" ]] && base_hb=0

     echo "[3] observing heartbeat (base_hb_sec=$base_hb)"

     start_ts=$(date +%s)

     while true; do
         cur_hb=$(last_hb_sec)
         [[ -z "$cur_hb" ]] && cur_hb=0

         if (( cur_hb > base_hb )); then
             echo "[OK] heartbeat advanced: $base_hb -> $cur_hb"
             break
         fi

         now_ts=$(date +%s)
         if (( now_ts - start_ts >= STUCK_TIMEOUT )); then
             echo
             echo "[!!!] SCHEDULING STALL DETECTED AFTER MIGRATION !!!"
             echo "[!!!] base_hb_sec=$base_hb cur_hb_sec=$cur_hb"
             echo "[!!!] freezing setup for debugging for 20s"
             echo

             # Give some time to attach debuggers / tracing
             sleep 20

             echo "[!!!] workload still stuck, entering infinite sleep, 
and will continue to run now"

             taskset -c 12 sleep 1 # more than 1 task will break the nohz_full state

             while true; do
                 sleep 3600
             done
         fi

         sleep "$CHECK_INTERVAL"
     done

     echo "[4] wait before next round"
     sleep 1
done
```

Best regards,
Zicheng

Re: [PATCH] sched: Re-evaluate scheduling when migrating queued tasks out of throttled cgroups
Posted by Aaron Lu 5 days, 8 hours ago
On Fri, Jan 30, 2026 at 05:03:49PM +0800, Zicheng Qu wrote:
> On 1/30/2026 4:34 PM, Zicheng Qu wrote:
> 
> > 4) For kernel <= 5.10: Later, cgroup A is unthrottled. However, the task
> >    P has already been migrated out of cgroup A, so unthrottle_cfs_rq()
> >    may observe load.weight == 0 and return early without calling
> >    resched_curr(). For kernel >= 6.6: The unthrottling path normally
> >    triggers resched_curr() in almost all cases, even when no runnable
> >    tasks remain in the unthrottled cgroup, preventing the idle stall
> >    described above. However, if cgroup A is removed before it gets
> >    unthrottled, the unthrottling path for cgroup A is never executed.
> >    As a result, no resched_curr() can be called.

I think you are right.

> Hi Aaron,
> 
> Apologies for the confusion in my earlier description — the original
> failure model was identified and analyzed on kernels based on LTS 5.10.
> 
> Later I realized that on v6.6 and mainline, the issue becomes much harder
> to reproduce due to additional conditions introduced in the condition
> (cfs_rq->on_list) in unthrottle_cfs_rq(), which effectively mask the
> original reproduction path.
> 
> As a result, I adjusted the reproducer accordingly. With the updated
> reproducer, the issue can still be triggered on mainline by explicitly
> bypassing the unthrottling reschedule path, as described in the commit
> message.
>

I can reproduce the problem using your reproducer now and also verified
your patch fixed the problem, so feel free to add:

Tested-by: Aaron Lu <ziqianlu@bytedance.com>
[tip: sched/core] sched: Re-evaluate scheduling when migrating queued tasks out of throttled cgroups
Posted by tip-bot2 for Zicheng Qu 4 days, 4 hours ago
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     e34881c84c255bc300f24d9fe685324be20da3d1
Gitweb:        https://git.kernel.org/tip/e34881c84c255bc300f24d9fe685324be20da3d1
Author:        Zicheng Qu <quzicheng@huawei.com>
AuthorDate:    Fri, 30 Jan 2026 08:34:38 
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 03 Feb 2026 12:04:19 +01:00

sched: Re-evaluate scheduling when migrating queued tasks out of throttled cgroups

Consider the following sequence on a CPU configured with nohz_full:

1) A task P runs in cgroup A, and cgroup A becomes throttled due to CFS
   bandwidth control. The group scheduling entity (gse) of cgroup A, to
   which the task P is attached, is dequeued and the CPU switches to idle.

2) Before cgroup A is unthrottled, task P is migrated from cgroup A to
   another cgroup B (not throttled).

   During sched_move_task(), the task P is observed as queued but not
   running, and therefore no resched_curr() is triggered.

3) Since the CPU is nohz_full, it remains in do_idle() waiting for an
   explicit scheduling event, i.e., resched_curr().

4) For kernel <= 5.10: Later, cgroup A is unthrottled. However, the task
   P has already been migrated out of cgroup A, so unthrottle_cfs_rq()
   may observe load.weight == 0 and return early without calling
   resched_curr(). For kernel >= 6.6: The unthrottling path normally
   triggers resched_curr() in almost all cases, even when no runnable
   tasks remain in the unthrottled cgroup, preventing the idle stall
   described above. However, if cgroup A is removed before it gets
   unthrottled, the unthrottling path for cgroup A is never executed.
   As a result, no resched_curr() can be called.

5) At this point, the task P is runnable in cgroup B (not throttled), but
   the CPU remains in do_idle() with no pending reschedule point. The
   system stays in this state until an unrelated event that can trigger
   a resched_curr() (e.g. a new task wakeup) breaks the nohz_full idle
   state, and then the task P finally gets scheduled.

The root cause is that sched_move_task() may classify the task as only
queued, not running, and therefore fails to trigger a resched_curr(),
while the later unthrottling path no longer has visibility of the
migrated task.

Preserve the existing behavior for running tasks by issuing
resched_curr(), and explicitly invoke wakeup_preempt() for tasks
that were queued at the time of migration. This ensures that runnable
tasks are reconsidered for scheduling even when nohz_full suppresses
periodic ticks.

Fixes: 29f59db3a74b ("sched: group-scheduler core")
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Aaron Lu <ziqianlu@bytedance.com>
Tested-by: Aaron Lu <ziqianlu@bytedance.com>
Link: https://patch.msgid.link/20260130083438.1122457-1-quzicheng@huawei.com
---
 kernel/sched/core.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8f2dc0a..b411e4f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9126,6 +9126,7 @@ void sched_move_task(struct task_struct *tsk, bool for_autogroup)
 {
 	unsigned int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE;
 	bool resched = false;
+	bool queued = false;
 	struct rq *rq;
 
 	CLASS(task_rq_lock, rq_guard)(tsk);
@@ -9137,10 +9138,13 @@ void sched_move_task(struct task_struct *tsk, bool for_autogroup)
 			scx_cgroup_move_task(tsk);
 		if (scope->running)
 			resched = true;
+		queued = scope->queued;
 	}
 
 	if (resched)
 		resched_curr(rq);
+	else if (queued)
+		wakeup_preempt(rq, tsk, 0);
 
 	__balance_callbacks(rq, &rq_guard.rf);
 }