From: Matt Fleming <mfleming@cloudflare.com>
This reverts commit b7ca5743a2604156d6083b88cefacef983f3a3a6.
If we dequeue a task (task B) that was sched delayed then that task is
definitely no longer on the rq and not tracked in the rbtree.
Unfortunately, task_on_rq_queued(B) will still return true because
dequeue_task() doesn't update p->on_rq.
This inconsistency can lead to a task (task A) spinning indefinitely in
wait_task_inactive(), e.g. when delivering a fatal signal to a thread
group, because it thinks task B is still queued (it's not) and waits
forever for it to unschedule.
  Task A                                     Task B

  arch_do_signal_or_restart()
    get_signal()
      do_coredump()
        coredump_wait()
          zap_threads()                      arch_do_signal_or_restart()
            wait_task_inactive() <-- SPIN      get_signal()
                                                 do_group_exit()
                                                   do_exit()
                                                     coredump_task_exit()
                                                       schedule() <--- never comes back
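
In other words, task A ends up polling a flag that the dequeue path
never clears. The following toy userspace model (plain C, not kernel
code; the struct and helper names are invented for illustration) sketches
that interaction:

#include <stdbool.h>
#include <stdio.h>

/* Toy model only: 'on_rq' stands in for p->on_rq as seen by
 * task_on_rq_queued(), 'in_rbtree' for actual membership in the
 * runqueue's rbtree. */
struct fake_task {
	bool on_rq;
	bool in_rbtree;
};

/* Models the problematic dequeue: the task really leaves the tree,
 * but the flag the waiter looks at is left set. */
static void buggy_dequeue(struct fake_task *p)
{
	p->in_rbtree = false;
	/* p->on_rq intentionally left true to model the bug */
}

/* Rough stand-in for the wait_task_inactive() retry loop: as long as
 * the task still looks queued, wait a tick and try again. */
static bool wait_until_dequeued(struct fake_task *p, int max_retries)
{
	for (int i = 0; i < max_retries; i++) {
		if (!p->on_rq)
			return true;	/* task really unscheduled */
		/* the kernel would sleep for a tick here, then retry */
	}
	return false;	/* still looks queued: the real loop spins forever */
}

int main(void)
{
	struct fake_task b = { .on_rq = true, .in_rbtree = true };

	buggy_dequeue(&b);
	printf("waiter ever makes progress: %s\n",
	       wait_until_dequeued(&b, 1000) ? "yes" : "no");
	return 0;
}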
Not only will task A spin forever in wait_task_inactive(), but task B
will also trigger RCU stalls:
INFO: rcu_tasks detected stalls on tasks:
00000000a973a4d8: .. nvcsw: 2/2 holdout: 1 idle_cpu: -1/79
task:ffmpeg state:I stack:0 pid:665601 tgid:665155 ppid:668691 task_flags:0x400448 flags:0x00004006
Call Trace:
<TASK>
__schedule+0x4fb/0xbf0
? srso_return_thunk+0x5/0x5f
schedule+0x27/0xf0
do_exit+0xdd/0xaa0
? __pfx_futex_wake_mark+0x10/0x10
do_group_exit+0x30/0x80
get_signal+0x81e/0x860
? srso_return_thunk+0x5/0x5f
? futex_wake+0x177/0x1a0
arch_do_signal_or_restart+0x2e/0x1f0
? srso_return_thunk+0x5/0x5f
? srso_return_thunk+0x5/0x5f
? __x64_sys_futex+0x10c/0x1d0
syscall_exit_to_user_mode+0xa5/0x130
do_syscall_64+0x57/0x110
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f22d05b0f16
RSP: 002b:00007f2265761cf0 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
RAX: fffffffffffffe00 RBX: 0000000000000000 RCX: 00007f22d05b0f16
RDX: 0000000000000000 RSI: 0000000000000189 RDI: 00005629e320d97c
RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000ffffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 00005629e320d928
R13: 0000000000000000 R14: 0000000000000001 R15: 00005629e320d97c
</TASK>
Fixes: b7ca5743a260 ("sched/core: Tweak wait_task_inactive() to force dequeue sched_delayed tasks")
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: John Stultz <jstultz@google.com>
Cc: Chris Arges <carges@cloudflare.com>
Cc: stable@vger.kernel.org # v6.12
Signed-off-by: Matt Fleming <mfleming@cloudflare.com>
---
kernel/sched/core.c | 6 ------
1 file changed, 6 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ccba6fc3c3fe..2dfc3977920d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2293,12 +2293,6 @@ unsigned long wait_task_inactive(struct task_struct *p, unsigned int match_state
 		 * just go back and repeat.
 		 */
 		rq = task_rq_lock(p, &rf);
-		/*
-		 * If task is sched_delayed, force dequeue it, to avoid always
-		 * hitting the tick timeout in the queued case
-		 */
-		if (p->se.sched_delayed)
-			dequeue_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
 		trace_sched_wait_task(p);
 		running = task_on_cpu(rq, p);
 		queued = task_on_rq_queued(p);
--
2.34.1
On Thu, Sep 25, 2025 at 6:33 AM Matt Fleming <matt@readmodwrite.com> wrote:
>
> From: Matt Fleming <mfleming@cloudflare.com>
>
> This reverts commit b7ca5743a2604156d6083b88cefacef983f3a3a6.
>
> If we dequeue a task (task B) that was sched delayed then that task is
> definitely no longer on the rq and not tracked in the rbtree.
> Unfortunately, task_on_rq_queued(B) will still return true because
> dequeue_task() doesn't update p->on_rq.

Hey!
Sorry again my patch has been causing you trouble. Thanks for your
persistence in chasing this down!

It's confusing as this patch uses the similar logic as logic
pick_next_entity() uses when a sched_delayed task is picked to run, as
well as elsewhere in __sched_setscheduler() and in sched_ext, so I'd
fret that similar

And my impression was that dequeue_task() on a sched_delayed task
would update p->on_rq via calling __block_task() at the end of
dequeue_entities().

However, there are two spots where we might exit dequeue_entities()
early when cfs_rq_throttled(rq), so maybe that's what's catching us
here?

Peter: Those cfs_rq_throttled() exits in dequeue_entities() seem a
little odd, as the desired dequeue didn't really complete, but
dequeue_task_fair() will still return true indicating success - not
that too many places are checking the dequeue_task return. Is that
right?

thanks
-john
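
To make the hypothesis John raises concrete, here is a stripped-down
userspace model (not the real dequeue_entities(); the 'throttled' flag
and helper names are invented for illustration) of how an early return
can skip the step that clears on_rq:

#include <stdbool.h>
#include <stdio.h>

struct model_task {
	bool on_rq;
	bool sched_delayed;
};

/* Stand-in for __block_task(): the step that finally clears on_rq. */
static void block_task_model(struct model_task *p)
{
	p->on_rq = false;
}

/* Shape of the suspected buggy path: bail out as soon as a level of
 * the hierarchy is throttled, before reaching the tail that would
 * have called __block_task(). */
static int dequeue_entities_model(struct model_task *p, bool hierarchy_throttled)
{
	/* per-entity dequeue work would happen here */

	if (hierarchy_throttled)
		return 0;	/* early exit: on_rq is never cleared */

	if (p->sched_delayed)
		block_task_model(p);

	return 1;
}

int main(void)
{
	struct model_task b = { .on_rq = true, .sched_delayed = true };

	dequeue_entities_model(&b, /* hierarchy_throttled = */ true);
	printf("on_rq after dequeue on throttled hierarchy: %d\n", b.on_rq);
	return 0;
}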
Hello John, Matt,

On 9/26/2025 5:35 AM, John Stultz wrote:
> On Thu, Sep 25, 2025 at 6:33 AM Matt Fleming <matt@readmodwrite.com> wrote:
>>
>> From: Matt Fleming <mfleming@cloudflare.com>
>>
>> This reverts commit b7ca5743a2604156d6083b88cefacef983f3a3a6.
>>
>> If we dequeue a task (task B) that was sched delayed then that task is
>> definitely no longer on the rq and not tracked in the rbtree.
>> Unfortunately, task_on_rq_queued(B) will still return true because
>> dequeue_task() doesn't update p->on_rq.
>
> Hey!
> Sorry again my patch has been causing you trouble. Thanks for your
> persistence in chasing this down!
>
> It's confusing as this patch uses the similar logic as logic
> pick_next_entity() uses when a sched_delayed task is picked to run, as
> well as elsewhere in __sched_setscheduler() and in sched_ext, so I'd
> fret that similar
>
> And my impression was that dequeue_task() on a sched_delayed task
> would update p->on_rq via calling __block_task() at the end of
> dequeue_entities().
>
> However, there are two spots where we might exit dequeue_entities()
> early when cfs_rq_throttled(rq), so maybe that's what's catching us
> here?

That could very likely be it.

>
> Peter: Those cfs_rq_throttled() exits in dequeue_entities() seem a
> little odd, as the desired dequeue didn't really complete, but
> dequeue_task_fair() will still return true indicating success - not
> that too many places are checking the dequeue_task return. Is that
> right?

I think for most part until now it was harmless as we couldn't pick on
a throttled hierarchy and other calls to dequeue_task(DEQUEUE_DELAYED)
would later do a:

    queued = task_on_rq_queued(p);
    ...
    if (queued)
        enqueue_task(p)

which would either lead to spuriously running a blocked task and it
would block back again, or a wakeup would properly wakeup the queued
task via ttwu_runnable() but wait_task_inactive() is interesting as
it expects the dequeue will result in a block which never happens with
throttled hierarchies. I'm impressed double dequeue doesn't result in
any major splats!

Matt, if possible can you try the patch attached below to check if the
bailout for throttled hierarchy is indeed the root cause. Thanks in
advance.

P.S. the per-task throttle in tip:sched/core would get rid of all this
but it would be good to have a fix via tip:sched/urgent to get it
backported to v6.12 LTS and the newer stable kernels.

---
Thanks and Regards,
Prateek

(Prepared on top of tip:sched/urgent but should apply fine on v6.17-rc7)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8ce56a8d507f..f0a4d9d7424d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6969,6 +6969,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 	int h_nr_runnable = 0;
 	struct cfs_rq *cfs_rq;
 	u64 slice = 0;
+	int ret = 0; /* XXX: Do we care if ret is 0 vs 1 since we only check ret < 0? */
 
 	if (entity_is_task(se)) {
 		p = task_of(se);
@@ -6998,7 +6999,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
-			return 0;
+			goto out;
 
 		/* Don't dequeue parent if it has other entities besides us */
 		if (cfs_rq->load.weight) {
@@ -7039,7 +7040,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
-			return 0;
+			goto out;
 	}
 
 	sub_nr_running(rq, h_nr_queued);
@@ -7048,6 +7049,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 	if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
 		rq->next_balance = jiffies;
 
+	ret = 1;
+out:
 	if (p && task_delayed) {
 		WARN_ON_ONCE(!task_sleep);
 		WARN_ON_ONCE(p->on_rq != 1);
@@ -7063,7 +7066,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 		__block_task(rq, p);
 	}
 
-	return 1;
+	return ret;
 }
 
 /*
On Fri, Sep 26, 2025 at 08:13:09AM +0530, K Prateek Nayak wrote:
> > Peter: Those cfs_rq_throttled() exits in dequeue_entities() seem a
> > little odd, as the desired dequeue didn't really complete, but
> > dequeue_task_fair() will still return true indicating success - not
> > that too many places are checking the dequeue_task return. Is that
> > right?

Bah, i'm forever confused on the throttle cases there :/

> I think for most part until now it was harmless as we couldn't pick on
> a throttled hierarchy and other calls to dequeue_task(DEQUEUE_DELAYED)
> would later do a:
>
>     queued = task_on_rq_queued(p);
>     ...
>     if (queued)
>         enqueue_task(p)
>
> which would either lead to spuriously running a blocked task and it
> would block back again, or a wakeup would properly wakeup the queued
> task via ttwu_runnable() but wait_task_inactive() is interesting as
> it expects the dequeue will result in a block which never happens with
> throttled hierarchies. I'm impressed double dequeue doesn't result in
> any major splats!
>
> Matt, if possible can you try the patch attached below to check if the
> bailout for throttled hierarchy is indeed the root cause. Thanks in
> advance.
>
> P.S. the per-task throttle in tip:sched/core would get rid of all this
> but it would be good to have a fix via tip:sched/urgent to get it
> backported to v6.12 LTS and the newer stable kernels.

Yes, good riddance to that code :-)

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8ce56a8d507f..f0a4d9d7424d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6969,6 +6969,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>  	int h_nr_runnable = 0;
>  	struct cfs_rq *cfs_rq;
>  	u64 slice = 0;
> +	int ret = 0; /* XXX: Do we care if ret is 0 vs 1 since we only check ret < 0? */

Right, we don't appear to really need that.

>
>  	if (entity_is_task(se)) {
>  		p = task_of(se);
> @@ -6998,7 +6999,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>
>  		/* end evaluation on encountering a throttled cfs_rq */
>  		if (cfs_rq_throttled(cfs_rq))
> -			return 0;
> +			goto out;
>
>  		/* Don't dequeue parent if it has other entities besides us */
>  		if (cfs_rq->load.weight) {
> @@ -7039,7 +7040,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>
>  		/* end evaluation on encountering a throttled cfs_rq */
>  		if (cfs_rq_throttled(cfs_rq))
> -			return 0;
> +			goto out;
>  	}
>
>  	sub_nr_running(rq, h_nr_queued);
> @@ -7048,6 +7049,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>  	if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
>  		rq->next_balance = jiffies;
>
> +	ret = 1;
> +out:
>  	if (p && task_delayed) {
>  		WARN_ON_ONCE(!task_sleep);
>  		WARN_ON_ONCE(p->on_rq != 1);
> @@ -7063,7 +7066,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>  		__block_task(rq, p);
>  	}
>
> -	return 1;
> +	return ret;
> }

So the difference is that we also do __block_task() when we get
throttled somewhere in the hierarchy.

IIRC when I was looking at this, I thought it wouldn't matter since it
won't get picked anyway, on account of the cfs_rq being blocked/detached,
so who cares. But yeah, this makes sense.

Patch logistics are going to be a pain -- .17 is closed and merge window
is open, which means Linus will have per-task throttle and /urgent don't
work no more.

At this point best we can do is a patch to stable with a note that
upstream is no longer affected due to rework or something.
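
The effect of the proposed restructuring can be modelled in the same
stripped-down way as above: the throttled bail-outs jump to a common
label instead of returning, so the delayed-task tail -- and with it the
clearing of on_rq -- runs on every exit path. This is a toy userspace
sketch with invented names, not the upstream code:

#include <stdbool.h>
#include <stdio.h>

struct model_task {
	bool on_rq;
	bool sched_delayed;
};

/* Stand-in for __block_task(): the step that finally clears on_rq. */
static void block_task_model(struct model_task *p)
{
	p->on_rq = false;
}

/* Shape of the fix: bail-outs go through 'out' so the tail always runs. */
static int dequeue_entities_fixed_model(struct model_task *p, bool hierarchy_throttled)
{
	int ret = 0;

	if (hierarchy_throttled)
		goto out;	/* dequeue incomplete, but still reach the tail */

	ret = 1;
out:
	if (p->sched_delayed)
		block_task_model(p);	/* always reached now */

	return ret;
}

int main(void)
{
	struct model_task b = { .on_rq = true, .sched_delayed = true };

	dequeue_entities_fixed_model(&b, /* hierarchy_throttled = */ true);
	printf("on_rq after dequeue on throttled hierarchy: %d\n", b.on_rq);
	return 0;
}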
On Fri, Sep 26, 2025 at 3:43 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello John, Matt,
>
> On 9/26/2025 5:35 AM, John Stultz wrote:
> >
> > However, there are two spots where we might exit dequeue_entities()
> > early when cfs_rq_throttled(rq), so maybe that's what's catching us
> > here?
>
> That could very likely be it.

That tracks -- we're heavy users of cgroups and this particular issue
only appeared on our kubernetes nodes.

> Matt, if possible can you try the patch attached below to check if the
> bailout for throttled hierarchy is indeed the root cause. Thanks in
> advance.

I've been running our reproducer with this patch for the last few
hours without any issues, so the fix looks good to me.

Tested-by: Matt Fleming <mfleming@cloudflare.com>
Hello Matt,

On 9/26/2025 9:04 PM, Matt Fleming wrote:
> On Fri, Sep 26, 2025 at 3:43 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>>
>> Hello John, Matt,
>>
>> On 9/26/2025 5:35 AM, John Stultz wrote:
>>>
>>> However, there are two spots where we might exit dequeue_entities()
>>> early when cfs_rq_throttled(rq), so maybe that's what's catching us
>>> here?
>>
>> That could very likely be it.
>
> That tracks -- we're heavy users of cgroups and this particular issue
> only appeared on our kubernetes nodes.
>
>> Matt, if possible can you try the patch attached below to check if the
>> bailout for throttled hierarchy is indeed the root cause. Thanks in
>> advance.
>
> I've been running our reproducer with this patch for the last few
> hours without any issues, so the fix looks good to me.
>
> Tested-by: Matt Fleming <mfleming@cloudflare.com>

Thank you for testing the diff. I see Ingo has already posted the
scheduler pull for v6.18 which indirectly solves this by removing those
early returns. Once that lands, I'll attach a formal commit log and send
out a patch targeting the stable kernels >= 6.12.

--
Thanks and Regards,
Prateek
On Thu, Sep 25, 2025 at 5:05 PM John Stultz <jstultz@google.com> wrote:
> It's confusing as this patch uses the similar logic as logic
> pick_next_entity() uses when a sched_delayed task is picked to run, as
> well as elsewhere in __sched_setscheduler() and in sched_ext, so I'd
> fret that similar

Didn't finish my sentence there.
"...similar problems as the one you are seeing could still occur."

thanks
-john