[PATCH] sched: Re-evaluate scheduling when migrating queued tasks out of throttled cgroups

Posted by Zicheng Qu 1 week ago
Consider the following sequence on a CPU configured with nohz_full:

1) A task P runs in cgroup A, and cgroup A becomes throttled due to CFS
   bandwidth control. The group sched_entity (gse) of cgroup A, to which
task P is attached, is dequeued and the CPU switches to idle.

2) Before cgroup A is unthrottled, task P is migrated from cgroup A to
   another cgroup B (not throttled).

   During sched_move_task(), the task P is observed as queued but not
running, and therefore no resched_curr() is triggered.

3) Since the CPU is nohz_full, it remains in do_idle() waiting for an
   explicit scheduling event, i.e., resched_curr().

4) For kernel <= 5.10: later, cgroup A is unthrottled. However, task
   P has already been migrated out of cgroup A, so unthrottle_cfs_rq()
may observe load_weight == 0 and return early without resched_curr()
being called. For kernel >= 6.6: the unthrottling path triggers
resched_curr() in almost all cases, even when no runnable tasks remain
in the unthrottled cgroup, preventing the idle stall described above.
However, if cgroup A is removed before it gets unthrottled, the
unthrottling path for cgroup A is never executed. As a result, no
resched_curr() is called.

5) At this point, task P is runnable in cgroup B (not throttled), but
the CPU remains in do_idle() with no pending reschedule point. The
system stays in this state until some unrelated event (e.g. a new
task wakeup) triggers a resched_curr() and breaks the nohz_full idle
state, at which point task P finally gets scheduled.

The root cause is that sched_move_task() may classify the task as only
queued, not running, and therefore fails to trigger a resched_curr(),
while the later unthrottling path no longer has visibility of the
migrated task.

Preserve the existing behavior for running tasks by issuing
resched_curr(), and explicitly invoke wakeup_preempt() for tasks
that were queued at the time of migration. This ensures that runnable
tasks are reconsidered for scheduling even when nohz_full suppresses
periodic ticks.

Fixes: 29f59db3a74b ("sched: group-scheduler core")
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Aaron Lu <ziqianlu@bytedance.com>
---
 kernel/sched/core.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 045f83ad261e..04271b77101c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9110,6 +9110,7 @@ static void sched_change_group(struct task_struct *tsk)
 void sched_move_task(struct task_struct *tsk, bool for_autogroup)
 {
 	unsigned int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE;
+	bool queued = false;
 	bool resched = false;
 	struct rq *rq;
 
@@ -9122,10 +9123,13 @@ void sched_move_task(struct task_struct *tsk, bool for_autogroup)
 			scx_cgroup_move_task(tsk);
 		if (scope->running)
 			resched = true;
+		queued = scope->queued;
 	}
 
 	if (resched)
 		resched_curr(rq);
+	else if (queued)
+		wakeup_preempt(rq, tsk, 0);
 
 	__balance_callbacks(rq, &rq_guard.rf);
 }
-- 
2.34.1
Re: [PATCH] sched: Re-evaluate scheduling when migrating queued tasks out of throttled cgroups
Posted by Peter Zijlstra 4 days, 19 hours ago
On Fri, Jan 30, 2026 at 08:34:38AM +0000, Zicheng Qu wrote:
> Consider the following sequence on a CPU configured with nohz_full:
> 
> 1) A task P runs in cgroup A, and cgroup A becomes throttled due to CFS
>    bandwidth control. The gse (cgroup A) where the task P attached is
> dequeued and the CPU switches to idle.
> 
> 2) Before cgroup A is unthrottled, task P is migrated from cgroup A to
>    another cgroup B (not throttled).
> 
>    During sched_move_task(), the task P is observed as queued but not
> running, and therefore no resched_curr() is triggered.
> 
> 3) Since the CPU is nohz_full, it remains in do_idle() waiting for an
>    explicit scheduling event, i.e., resched_curr().
> 
> 4) For kernel <= 5.10: later, cgroup A is unthrottled. However, task
>    P has already been migrated out of cgroup A, so unthrottle_cfs_rq()
> may observe load_weight == 0 and return early without resched_curr()
> being called. For kernel >= 6.6: the unthrottling path triggers
> resched_curr() in almost all cases, even when no runnable tasks remain
> in the unthrottled cgroup, preventing the idle stall described above.
> However, if cgroup A is removed before it gets unthrottled, the
> unthrottling path for cgroup A is never executed. As a result, no
> resched_curr() is called.
> 
> 5) At this point, the task P is runnable in cgroup B (not throttled), but
> the CPU remains in do_idle() with no pending reschedule point. The
> system stays in this state until an unrelated event (e.g. a new task
> wakeup or any cases) that can trigger a resched_curr() breaks the
> nohz_full idle state, and then the task P finally gets scheduled.
> 
> The root cause is that sched_move_task() may classify the task as only
> queued, not running, and therefore fails to trigger a resched_curr(),
> while the later unthrottling path no longer has visibility of the
> migrated task.
> 
> Preserve the existing behavior for running tasks by issuing
> resched_curr(), and explicitly invoke wakeup_preempt() for tasks
> that were queued at the time of migration. This ensures that runnable
> tasks are reconsidered for scheduling even when nohz_full suppresses
> periodic ticks.
> 
> Fixes: 29f59db3a74b ("sched: group-scheduler core")
> Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
> Reviewed-by: Aaron Lu <ziqianlu@bytedance.com>

Yes, that makes sense.

Thanks!
Re: [PATCH] sched: Re-evaluate scheduling when migrating queued tasks out of throttled cgroups
Posted by Zicheng Qu 1 week ago
On 1/30/2026 4:34 PM, Zicheng Qu wrote:

> 4) For kernel <= 5.10: later, cgroup A is unthrottled. However, task
>    P has already been migrated out of cgroup A, so unthrottle_cfs_rq()
> may observe load_weight == 0 and return early without resched_curr()
> being called. For kernel >= 6.6: the unthrottling path triggers
> resched_curr() in almost all cases, even when no runnable tasks remain
> in the unthrottled cgroup, preventing the idle stall described above.
> However, if cgroup A is removed before it gets unthrottled, the
> unthrottling path for cgroup A is never executed. As a result, no
> resched_curr() is called.
Hi Aaron,

Apologies for the confusion in my earlier description — the original
failure model was identified and analyzed on kernels based on LTS 5.10.

Later I realized that on v6.6 and mainline, the issue becomes much harder
to reproduce because of the additional (cfs_rq->on_list) check introduced
in unthrottle_cfs_rq(), which effectively masks the original reproduction
path.

As a result, I adjusted the reproducer accordingly. With the updated
reproducer, the issue can still be triggered on mainline by explicitly
bypassing the unthrottling reschedule path, as described in the commit
message.

The reproducer can be run directly via:

./make.sh

My local /proc/cmdline is:

systemd.unified_cgroup_hierarchy=0 nohz_full=2-15 rcu_nocbs=2-15

With this setup, the issue is reproducible on current mainline.
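Before running the reproducer it is worth confirming that the isolation
parameters actually took effect. A minimal check could look like the
sketch below; the sample cmdline string is an assumption mirroring my
local setup, and on a live system you would read /proc/cmdline instead:

```sh
#!/bin/sh
# Sanity-check that nohz_full= is present on the kernel command line.
# The hard-coded string stands in for the live value; on a real system:
#   cmdline=$(cat /proc/cmdline)
cmdline="systemd.unified_cgroup_hierarchy=0 nohz_full=2-15 rcu_nocbs=2-15"

case "$cmdline" in
*nohz_full=*)
    # Print just the nohz_full= token for inspection
    echo "$cmdline" | tr ' ' '\n' | grep '^nohz_full='
    ;;
*)
    echo "nohz_full missing: the stall will not reproduce"
    ;;
esac
```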

make.sh
```sh
#!/bin/bash

gcc -O2 heartbeat.c -o heartbeat

chmod +x ./run_test.sh && ./run_test.sh
```

heartbeat.c
```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <time.h>

static inline long long now_ns(void)
{
     struct timespec ts;
     clock_gettime(CLOCK_MONOTONIC, &ts);
     return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
     cpu_set_t set;
     CPU_ZERO(&set);
     CPU_SET(12, &set);  // CPU 12 is nohz_full
     sched_setaffinity(0, sizeof(set), &set);

     long long last = now_ns();
     unsigned long long iter = 0;

     while (1) {
         iter++;
         long long now = now_ns();
         if (now - last > 1000 * 1000 * 1000) { // 1000 ms
             printf("[HB] sec=%lld pid=%d cpu=%d iter=%llu\n",
                    now / 1000000000LL, getpid(), sched_getcpu(), iter);
             fflush(stdout);
             last = now_ns();
         }
     }
}
```

run_test.sh
```sh
#!/bin/bash
#
# run_test.sh
#
# Reproducer for a scheduling stall on nohz_full CPUs when migrating
# queued tasks out of throttled cgroups.
#
# Test outline:
#   1. Start a CPU-bound workload (heartbeat) that prints a heartbeat (HB)
#      once per second.
#   2. Migrate the task into a heavily throttled child cgroup.
#   3. Migrate the task back to the root cgroup (potential trigger point).
#   4. Immediately remove (destroy) the throttled cgroup before it gets
#      unthrottled.
#   5. Observe whether the heartbeat continues to advance.
#      - If HB advances: no stall, continue to next round.
#      - If HB stops advancing: scheduling stall detected, freeze the setup
#        for debugging.
#

set -e

########################
# Basic configuration
########################

ROOT_CG=/sys/fs/cgroup/cpu
THROTTLED_CG=$ROOT_CG/child_cgroup

mkdir -p "$ROOT_CG"

HB_LOG=heartbeat.log

# Throttle settings: 1ms runtime per 1s period
CFS_QUOTA_US=1000
CFS_PERIOD_US=1000000

# Timeout (in seconds) to consider the workload "stuck"
STUCK_TIMEOUT=10
CHECK_INTERVAL=0.2

########################
# Cleanup logic
########################

PID=

cleanup() {
     echo
     echo "[!] cleanup: stopping workload"

     if [[ -n "$PID" ]] && kill -0 "$PID" 2>/dev/null; then
         echo "[!] killing pid $PID"
         kill -TERM "$PID"
         wait "$PID" 2>/dev/null || true
     fi

     echo "[!] cleanup done"
}

trap cleanup INT TERM EXIT

########################
# Start workload
########################

echo "[+] starting heartbeat workload"

# Use process substitution so $! is heartbeat's own PID (with a plain
# pipeline, $! would be tee's PID and guessing "$! - 1" is fragile)
./heartbeat > >(tee "$HB_LOG") &
PID=$!

echo "[+] workload pid = $PID"
echo

########################
# Helper functions
########################

# Extract the last printed heartbeat second from the log
last_hb_sec() {
     tail -n 1 "$HB_LOG" 2>/dev/null | awk '{
         for (i = 1; i <= NF; i++) {
             if ($i ~ /^sec=/) {
                 split($i, a, "=");
                 print a[2];
                 exit;
             }
         }
     }'
}

verify_cgroup_location() {
     echo "  root cgroup:"
     cat "$ROOT_CG/tasks" | grep "$PID" || true
     echo "  throttled cgroup:"
     cat "$THROTTLED_CG/tasks" | grep "$PID" || true
}

########################
# Main test loop
########################

round=0

while true; do
     # Recreate the throttled cgroup for the next iteration
     mkdir -p "$THROTTLED_CG"
     echo $CFS_QUOTA_US  > "$THROTTLED_CG/cpu.cfs_quota_us"
     echo $CFS_PERIOD_US > "$THROTTLED_CG/cpu.cfs_period_us"

     round=$((round + 1))
     echo "========== ROUND $round =========="

     echo "[1] move task into throttled cgroup"
     echo "$PID" > "$THROTTLED_CG/tasks"

     echo "[1.1] verify cgroup placement"
     verify_cgroup_location

     # Give the task some time to consume its quota and become throttled
     sleep 0.2

     echo "[2] migrate task back to root cgroup (potential trigger)"
     echo "$PID" > "$ROOT_CG/tasks"

     echo "[2.1] verify cgroup placement"
     verify_cgroup_location

     #
     # IMPORTANT:
     # For kernels >= 6.6, unthrottling normally triggers resched_curr().
     # Removing the throttled cgroup before it gets unthrottled bypasses
     # the unthrottle path and is required to reproduce the stall.
     #
     echo "[2.2] remove throttled cgroup before unthrottling"
     rmdir "$THROTTLED_CG"

     # Observe heartbeat after migration back to root
     base_hb=$(last_hb_sec)
     [[ -z "$base_hb" ]] && base_hb=0

     echo "[3] observing heartbeat (base_hb_sec=$base_hb)"

     start_ts=$(date +%s)

     while true; do
         cur_hb=$(last_hb_sec)
         [[ -z "$cur_hb" ]] && cur_hb=0

         if (( cur_hb > base_hb )); then
             echo "[OK] heartbeat advanced: $base_hb -> $cur_hb"
             break
         fi

         now_ts=$(date +%s)
         if (( now_ts - start_ts >= STUCK_TIMEOUT )); then
             echo
             echo "[!!!] SCHEDULING STALL DETECTED AFTER MIGRATION !!!"
             echo "[!!!] base_hb_sec=$base_hb cur_hb_sec=$cur_hb"
             echo "[!!!] freezing setup for debugging for 20s"
             echo

             # Give some time to attach debuggers / tracing
             sleep 20

             echo "[!!!] workload still stuck; breaking the nohz_full state so it can run again"

             # Running a second task on CPU 12 breaks the nohz_full
             # idle state on that CPU
             taskset -c 12 sleep 1

             while true; do
                 sleep 3600
             done
         fi

         sleep "$CHECK_INTERVAL"
     done

     echo "[4] wait before next round"
     sleep 1
done
```

Best regards,
Zicheng

Re: [PATCH] sched: Re-evaluate scheduling when migrating queued tasks out of throttled cgroups
Posted by Aaron Lu 5 days, 1 hour ago
On Fri, Jan 30, 2026 at 05:03:49PM +0800, Zicheng Qu wrote:
> On 1/30/2026 4:34 PM, Zicheng Qu wrote:
> 
> > 4) For kernel <= 5.10: Later, cgroup A is unthrottled. However, the task
> >     P has already been migrated out of cgroup A, so unthrottle_cfs_rq()
> > may observe load_weight == 0 and return early without resched_curr()
> > called. For kernel >= 6.6: The unthrottling path normally triggers
> > `resched_curr()` almost cases even when no runnable tasks remain in the
> > unthrottled cgroup, preventing the idle stall described above. However,
> > if cgroup A is removed before it gets unthrottled, the unthrottling path
> > for cgroup A is never executed. In a result, no `resched_curr()` can be
> > called.

I think you are right.

> Hi Aaron,
> 
> Apologies for the confusion in my earlier description — the original
> failure model was identified and analyzed on kernels based on LTS 5.10.
> 
> Later I realized that on v6.6 and mainline, the issue becomes much harder
> to reproduce due to additional conditions introduced in the condition
> (cfs_rq->on_list) in unthrottle_cfs_rq(), which effectively mask the
> original reproduction path.
> 
> As a result, I adjusted the reproducer accordingly. With the updated
> reproducer, the issue can still be triggered on mainline by explicitly
> bypassing the unthrottling reschedule path, as described in the commit
> message.
>

I can reproduce the problem using your reproducer now and also verified
your patch fixed the problem, so feel free to add:

Tested-by: Aaron Lu <ziqianlu@bytedance.com>
[tip: sched/core] sched: Re-evaluate scheduling when migrating queued tasks out of throttled cgroups
Posted by tip-bot2 for Zicheng Qu 3 days, 20 hours ago
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     e34881c84c255bc300f24d9fe685324be20da3d1
Gitweb:        https://git.kernel.org/tip/e34881c84c255bc300f24d9fe685324be20da3d1
Author:        Zicheng Qu <quzicheng@huawei.com>
AuthorDate:    Fri, 30 Jan 2026 08:34:38 
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 03 Feb 2026 12:04:19 +01:00

sched: Re-evaluate scheduling when migrating queued tasks out of throttled cgroups

Consider the following sequence on a CPU configured with nohz_full:

1) A task P runs in cgroup A, and cgroup A becomes throttled due to CFS
   bandwidth control. The group sched_entity (gse) of cgroup A, to which
task P is attached, is dequeued and the CPU switches to idle.

2) Before cgroup A is unthrottled, task P is migrated from cgroup A to
   another cgroup B (not throttled).

   During sched_move_task(), the task P is observed as queued but not
running, and therefore no resched_curr() is triggered.

3) Since the CPU is nohz_full, it remains in do_idle() waiting for an
   explicit scheduling event, i.e., resched_curr().

4) For kernel <= 5.10: later, cgroup A is unthrottled. However, task
   P has already been migrated out of cgroup A, so unthrottle_cfs_rq()
may observe load_weight == 0 and return early without resched_curr()
being called. For kernel >= 6.6: the unthrottling path triggers
resched_curr() in almost all cases, even when no runnable tasks remain
in the unthrottled cgroup, preventing the idle stall described above.
However, if cgroup A is removed before it gets unthrottled, the
unthrottling path for cgroup A is never executed. As a result, no
resched_curr() is called.

5) At this point, task P is runnable in cgroup B (not throttled), but
the CPU remains in do_idle() with no pending reschedule point. The
system stays in this state until some unrelated event (e.g. a new
task wakeup) triggers a resched_curr() and breaks the nohz_full idle
state, at which point task P finally gets scheduled.

The root cause is that sched_move_task() may classify the task as only
queued, not running, and therefore fails to trigger a resched_curr(),
while the later unthrottling path no longer has visibility of the
migrated task.

Preserve the existing behavior for running tasks by issuing
resched_curr(), and explicitly invoke wakeup_preempt() for tasks
that were queued at the time of migration. This ensures that runnable
tasks are reconsidered for scheduling even when nohz_full suppresses
periodic ticks.

Fixes: 29f59db3a74b ("sched: group-scheduler core")
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Aaron Lu <ziqianlu@bytedance.com>
Tested-by: Aaron Lu <ziqianlu@bytedance.com>
Link: https://patch.msgid.link/20260130083438.1122457-1-quzicheng@huawei.com
---
 kernel/sched/core.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 8f2dc0a..b411e4f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9126,6 +9126,7 @@ void sched_move_task(struct task_struct *tsk, bool for_autogroup)
 {
 	unsigned int queue_flags = DEQUEUE_SAVE | DEQUEUE_MOVE;
 	bool resched = false;
+	bool queued = false;
 	struct rq *rq;
 
 	CLASS(task_rq_lock, rq_guard)(tsk);
@@ -9137,10 +9138,13 @@ void sched_move_task(struct task_struct *tsk, bool for_autogroup)
 			scx_cgroup_move_task(tsk);
 		if (scope->running)
 			resched = true;
+		queued = scope->queued;
 	}
 
 	if (resched)
 		resched_curr(rq);
+	else if (queued)
+		wakeup_preempt(rq, tsk, 0);
 
 	__balance_callbacks(rq, &rq_guard.rf);
 }