From: Matt Fleming <mfleming@cloudflare.com>
This reverts commit b7ca5743a2604156d6083b88cefacef983f3a3a6.
If we dequeue a task (task B) that was sched delayed then that task is
definitely no longer on the rq and not tracked in the rbtree.
Unfortunately, task_on_rq_queued(B) will still return true because
dequeue_task() doesn't update p->on_rq.
This inconsistency can lead to a task (task A) spinning indefinitely in
wait_task_inactive(), e.g. when delivering a fatal signal to a thread
group, because it thinks task B is still queued (it's not) and waits
forever for it to unschedule.
  Task A                                     Task B

  arch_do_signal_or_restart()
    get_signal()
      do_coredump()
        coredump_wait()
          zap_threads()                      arch_do_signal_or_restart()
            wait_task_inactive() <-- SPIN      get_signal()
                                                 do_group_exit()
                                                   do_exit()
                                                     coredump_task_exit()
                                                       schedule() <--- never comes back
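
In other words, task A ends up polling a flag that the dequeue path
never clears. The following toy userspace model (plain C, not kernel
code; the struct and helper names are invented for illustration) sketches
that interaction:

#include <stdbool.h>
#include <stdio.h>

/* Toy model only: 'on_rq' stands in for p->on_rq as seen by
 * task_on_rq_queued(), 'in_rbtree' for actual membership in the
 * runqueue's rbtree. */
struct fake_task {
	bool on_rq;
	bool in_rbtree;
};

/* Models the problematic dequeue: the task really leaves the tree,
 * but the flag the waiter looks at is left set. */
static void buggy_dequeue(struct fake_task *p)
{
	p->in_rbtree = false;
	/* p->on_rq intentionally left true to model the bug */
}

/* Rough stand-in for the wait_task_inactive() retry loop: as long as
 * the task still looks queued, wait a tick and try again. */
static bool wait_until_dequeued(struct fake_task *p, int max_retries)
{
	for (int i = 0; i < max_retries; i++) {
		if (!p->on_rq)
			return true;	/* task really unscheduled */
		/* the kernel would sleep for a tick here, then retry */
	}
	return false;	/* still looks queued: the real loop spins forever */
}

int main(void)
{
	struct fake_task b = { .on_rq = true, .in_rbtree = true };

	buggy_dequeue(&b);
	printf("waiter ever makes progress: %s\n",
	       wait_until_dequeued(&b, 1000) ? "yes" : "no");
	return 0;
}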
Not only will task A spin forever in wait_task_inactive(), but task B
will also trigger RCU stalls:
INFO: rcu_tasks detected stalls on tasks:
00000000a973a4d8: .. nvcsw: 2/2 holdout: 1 idle_cpu: -1/79
task:ffmpeg state:I stack:0 pid:665601 tgid:665155 ppid:668691 task_flags:0x400448 flags:0x00004006
Call Trace:
<TASK>
__schedule+0x4fb/0xbf0
? srso_return_thunk+0x5/0x5f
schedule+0x27/0xf0
do_exit+0xdd/0xaa0
? __pfx_futex_wake_mark+0x10/0x10
do_group_exit+0x30/0x80
get_signal+0x81e/0x860
? srso_return_thunk+0x5/0x5f
? futex_wake+0x177/0x1a0
arch_do_signal_or_restart+0x2e/0x1f0
? srso_return_thunk+0x5/0x5f
? srso_return_thunk+0x5/0x5f
? __x64_sys_futex+0x10c/0x1d0
syscall_exit_to_user_mode+0xa5/0x130
do_syscall_64+0x57/0x110
entry_SYSCALL_64_after_hwframe+0x76/0x7e
RIP: 0033:0x7f22d05b0f16
RSP: 002b:00007f2265761cf0 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
RAX: fffffffffffffe00 RBX: 0000000000000000 RCX: 00007f22d05b0f16
RDX: 0000000000000000 RSI: 0000000000000189 RDI: 00005629e320d97c
RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000ffffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 00005629e320d928
R13: 0000000000000000 R14: 0000000000000001 R15: 00005629e320d97c
</TASK>
Fixes: b7ca5743a260 ("sched/core: Tweak wait_task_inactive() to force dequeue sched_delayed tasks")
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: John Stultz <jstultz@google.com>
Cc: Chris Arges <carges@cloudflare.com>
Cc: stable@vger.kernel.org # v6.12
Signed-off-by: Matt Fleming <mfleming@cloudflare.com>
---
kernel/sched/core.c | 6 ------
1 file changed, 6 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ccba6fc3c3fe..2dfc3977920d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2293,12 +2293,6 @@ unsigned long wait_task_inactive(struct task_struct *p, unsigned int match_state
 		 * just go back and repeat.
 		 */
 		rq = task_rq_lock(p, &rf);
-		/*
-		 * If task is sched_delayed, force dequeue it, to avoid always
-		 * hitting the tick timeout in the queued case
-		 */
-		if (p->se.sched_delayed)
-			dequeue_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
 		trace_sched_wait_task(p);
 		running = task_on_cpu(rq, p);
 		queued = task_on_rq_queued(p);
--
2.34.1
On Thu, Sep 25, 2025 at 6:33 AM Matt Fleming <matt@readmodwrite.com> wrote:
>
> From: Matt Fleming <mfleming@cloudflare.com>
>
> This reverts commit b7ca5743a2604156d6083b88cefacef983f3a3a6.
>
> If we dequeue a task (task B) that was sched delayed then that task is
> definitely no longer on the rq and not tracked in the rbtree.
> Unfortunately, task_on_rq_queued(B) will still return true because
> dequeue_task() doesn't update p->on_rq.

Hey!
Sorry again my patch has been causing you trouble. Thanks for your
persistence in chasing this down!

It's confusing as this patch uses the similar logic as logic
pick_next_entity() uses when a sched_delayed task is picked to run, as
well as elsewhere in __sched_setscheduler() and in sched_ext, so I'd
fret that similar

And my impression was that dequeue_task() on a sched_delayed task
would update p->on_rq via calling __block_task() at the end of
dequeue_entities().

However, there are two spots where we might exit dequeue_entities()
early when cfs_rq_throttled(rq), so maybe that's what's catching us
here?

Peter: Those cfs_rq_throttled() exits in dequeue_entities() seem a
little odd, as the desired dequeue didn't really complete, but
dequeue_task_fair() will still return true indicating success - not
that too many places are checking the dequeue_task return. Is that
right?

thanks
-john
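
To make the hypothesis John raises concrete, here is a stripped-down
userspace model (not the real dequeue_entities(); the 'throttled' flag
and helper names are invented for illustration) of how an early return
can skip the step that clears on_rq:

#include <stdbool.h>
#include <stdio.h>

struct model_task {
	bool on_rq;
	bool sched_delayed;
};

/* Stand-in for __block_task(): the step that finally clears on_rq. */
static void block_task_model(struct model_task *p)
{
	p->on_rq = false;
}

/* Shape of the suspected buggy path: bail out as soon as a level of
 * the hierarchy is throttled, before reaching the tail that would
 * have called __block_task(). */
static int dequeue_entities_model(struct model_task *p, bool hierarchy_throttled)
{
	/* per-entity dequeue work would happen here */

	if (hierarchy_throttled)
		return 0;	/* early exit: on_rq is never cleared */

	if (p->sched_delayed)
		block_task_model(p);

	return 1;
}

int main(void)
{
	struct model_task b = { .on_rq = true, .sched_delayed = true };

	dequeue_entities_model(&b, /* hierarchy_throttled = */ true);
	printf("on_rq after dequeue on throttled hierarchy: %d\n", b.on_rq);
	return 0;
}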
Hello John, Matt,

On 9/26/2025 5:35 AM, John Stultz wrote:
> On Thu, Sep 25, 2025 at 6:33 AM Matt Fleming <matt@readmodwrite.com> wrote:
>>
>> From: Matt Fleming <mfleming@cloudflare.com>
>>
>> This reverts commit b7ca5743a2604156d6083b88cefacef983f3a3a6.
>>
>> If we dequeue a task (task B) that was sched delayed then that task is
>> definitely no longer on the rq and not tracked in the rbtree.
>> Unfortunately, task_on_rq_queued(B) will still return true because
>> dequeue_task() doesn't update p->on_rq.
>
> Hey!
> Sorry again my patch has been causing you trouble. Thanks for your
> persistence in chasing this down!
>
> It's confusing as this patch uses the similar logic as logic
> pick_next_entity() uses when a sched_delayed task is picked to run, as
> well as elsewhere in __sched_setscheduler() and in sched_ext, so I'd
> fret that similar
>
> And my impression was that dequeue_task() on a sched_delayed task
> would update p->on_rq via calling __block_task() at the end of
> dequeue_entities().
>
> However, there are two spots where we might exit dequeue_entities()
> early when cfs_rq_throttled(rq), so maybe that's what's catching us
> here?

That could very likely be it.

>
> Peter: Those cfs_rq_throttled() exits in dequeue_entities() seem a
> little odd, as the desired dequeue didn't really complete, but
> dequeue_task_fair() will still return true indicating success - not
> that too many places are checking the dequeue_task return. Is that
> right?

I think for most part until now it was harmless as we couldn't pick on
a throttled hierarchy and other calls to dequeue_task(DEQUEUE_DELAYED)
would later do a:

    queued = task_on_rq_queued(p);
    ...
    if (queued)
        enqueue_task(p)

which would either lead to spuriously running a blocked task and it
would block back again, or a wakeup would properly wakeup the queued
task via ttwu_runnable() but wait_task_inactive() is interesting as
it expects the dequeue will result in a block which never happens with
throttled hierarchies. I'm impressed double dequeue doesn't result in
any major splats!

Matt, if possible can you try the patch attached below to check if the
bailout for throttled hierarchy is indeed the root cause. Thanks in
advance.

P.S. the per-task throttle in tip:sched/core would get rid of all this
but it would be good to have a fix via tip:sched/urgent to get it
backported to v6.12 LTS and the newer stable kernels.

---
Thanks and Regards,
Prateek

(Prepared on top of tip:sched/urgent but should apply fine on v6.17-rc7)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8ce56a8d507f..f0a4d9d7424d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6969,6 +6969,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 	int h_nr_runnable = 0;
 	struct cfs_rq *cfs_rq;
 	u64 slice = 0;
+	int ret = 0; /* XXX: Do we care if ret is 0 vs 1 since we only check ret < 0? */
 
 	if (entity_is_task(se)) {
 		p = task_of(se);
@@ -6998,7 +6999,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
-			return 0;
+			goto out;
 
 		/* Don't dequeue parent if it has other entities besides us */
 		if (cfs_rq->load.weight) {
@@ -7039,7 +7040,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 
 		/* end evaluation on encountering a throttled cfs_rq */
 		if (cfs_rq_throttled(cfs_rq))
-			return 0;
+			goto out;
 	}
 
 	sub_nr_running(rq, h_nr_queued);
@@ -7048,6 +7049,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 	if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
 		rq->next_balance = jiffies;
 
+	ret = 1;
+out:
 	if (p && task_delayed) {
 		WARN_ON_ONCE(!task_sleep);
 		WARN_ON_ONCE(p->on_rq != 1);
@@ -7063,7 +7066,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
 		__block_task(rq, p);
 	}
 
-	return 1;
+	return ret;
 }
 
 /*
On Fri, Sep 26, 2025 at 08:13:09AM +0530, K Prateek Nayak wrote:
> > Peter: Those cfs_rq_throttled() exits in dequeue_entities() seem a
> > little odd, as the desired dequeue didn't really complete, but
> > dequeue_task_fair() will still return true indicating success - not
> > that too many places are checking the dequeue_task return. Is that
> > right?

Bah, i'm forever confused on the throttle cases there :/

> I think for most part until now it was harmless as we couldn't pick on
> a throttled hierarchy and other calls to dequeue_task(DEQUEUE_DELAYED)
> would later do a:
>
>     queued = task_on_rq_queued(p);
>     ...
>     if (queued)
>         enqueue_task(p)
>
> which would either lead to spuriously running a blocked task and it
> would block back again, or a wakeup would properly wakeup the queued
> task via ttwu_runnable() but wait_task_inactive() is interesting as
> it expects the dequeue will result in a block which never happens with
> throttled hierarchies. I'm impressed double dequeue doesn't result in
> any major splats!
>
> Matt, if possible can you try the patch attached below to check if the
> bailout for throttled hierarchy is indeed the root cause. Thanks in
> advance.
>
> P.S. the per-task throttle in tip:sched/core would get rid of all this
> but it would be good to have a fix via tip:sched/urgent to get it
> backported to v6.12 LTS and the newer stable kernels.

Yes, good riddance to that code :-)

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8ce56a8d507f..f0a4d9d7424d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6969,6 +6969,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>  	int h_nr_runnable = 0;
>  	struct cfs_rq *cfs_rq;
>  	u64 slice = 0;
> +	int ret = 0; /* XXX: Do we care if ret is 0 vs 1 since we only check ret < 0? */

Right, we don't appear to really need that.

>
>  	if (entity_is_task(se)) {
>  		p = task_of(se);
> @@ -6998,7 +6999,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>
>  		/* end evaluation on encountering a throttled cfs_rq */
>  		if (cfs_rq_throttled(cfs_rq))
> -			return 0;
> +			goto out;
>
>  		/* Don't dequeue parent if it has other entities besides us */
>  		if (cfs_rq->load.weight) {
> @@ -7039,7 +7040,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>
>  		/* end evaluation on encountering a throttled cfs_rq */
>  		if (cfs_rq_throttled(cfs_rq))
> -			return 0;
> +			goto out;
>  	}
>
>  	sub_nr_running(rq, h_nr_queued);
> @@ -7048,6 +7049,8 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>  	if (unlikely(!was_sched_idle && sched_idle_rq(rq)))
>  		rq->next_balance = jiffies;
>
> +	ret = 1;
> +out:
>  	if (p && task_delayed) {
>  		WARN_ON_ONCE(!task_sleep);
>  		WARN_ON_ONCE(p->on_rq != 1);
> @@ -7063,7 +7066,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
>  		__block_task(rq, p);
>  	}
>
> -	return 1;
> +	return ret;
> }

So the difference is that we also do __block_task() when we get
throttled somewhere in the hierarchy.

IIRC when I was looking at this, I thought it wouldn't matter since it
won't get picked anyway, on account of the cfs_rq being blocked/detached,
so who cares. But yeah, this makes sense.

Patch logistics are going to be a pain -- .17 is closed and merge window
is open, which means Linus will have per-task throttle and /urgent don't
work no more.

At this point best we can do is a patch to stable with a note that
upstream is no longer affected due to rework or something.
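
The effect of the proposed restructuring can be modelled in the same
stripped-down way as above: the throttled bail-outs jump to a common
label instead of returning, so the delayed-task tail -- and with it the
clearing of on_rq -- runs on every exit path. This is a toy userspace
sketch with invented names, not the upstream code:

#include <stdbool.h>
#include <stdio.h>

struct model_task {
	bool on_rq;
	bool sched_delayed;
};

/* Stand-in for __block_task(): the step that finally clears on_rq. */
static void block_task_model(struct model_task *p)
{
	p->on_rq = false;
}

/* Shape of the fix: bail-outs go through 'out' so the tail always runs. */
static int dequeue_entities_fixed_model(struct model_task *p, bool hierarchy_throttled)
{
	int ret = 0;

	if (hierarchy_throttled)
		goto out;	/* dequeue incomplete, but still reach the tail */

	ret = 1;
out:
	if (p->sched_delayed)
		block_task_model(p);	/* always reached now */

	return ret;
}

int main(void)
{
	struct model_task b = { .on_rq = true, .sched_delayed = true };

	dequeue_entities_fixed_model(&b, /* hierarchy_throttled = */ true);
	printf("on_rq after dequeue on throttled hierarchy: %d\n", b.on_rq);
	return 0;
}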
On Fri, Sep 26, 2025 at 3:43 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello John, Matt,
>
> On 9/26/2025 5:35 AM, John Stultz wrote:
> >
> > However, there are two spots where we might exit dequeue_entities()
> > early when cfs_rq_throttled(rq), so maybe that's what's catching us
> > here?
>
> That could very likely be it.

That tracks -- we're heavy users of cgroups and this particular issue
only appeared on our kubernetes nodes.

> Matt, if possible can you try the patch attached below to check if the
> bailout for throttled hierarchy is indeed the root cause. Thanks in
> advance.

I've been running our reproducer with this patch for the last few
hours without any issues, so the fix looks good to me.

Tested-by: Matt Fleming <mfleming@cloudflare.com>
Hello Matt,

On 9/26/2025 9:04 PM, Matt Fleming wrote:
> On Fri, Sep 26, 2025 at 3:43 AM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>>
>> Hello John, Matt,
>>
>> On 9/26/2025 5:35 AM, John Stultz wrote:
>>>
>>> However, there are two spots where we might exit dequeue_entities()
>>> early when cfs_rq_throttled(rq), so maybe that's what's catching us
>>> here?
>>
>> That could very likely be it.
>
> That tracks -- we're heavy users of cgroups and this particular issue
> only appeared on our kubernetes nodes.
>
>> Matt, if possible can you try the patch attached below to check if the
>> bailout for throttled hierarchy is indeed the root cause. Thanks in
>> advance.
>
> I've been running our reproducer with this patch for the last few
> hours without any issues, so the fix looks good to me.
>
> Tested-by: Matt Fleming <mfleming@cloudflare.com>

Thank you for testing the diff. I see Ingo has already posted the
scheduler pull for v6.18 which indirectly solves this by removing those
early returns. Once that lands, I'll attach a formal commit log and send
out a patch targeting the stable kernels >= 6.12.

--
Thanks and Regards,
Prateek
On Thu, Sep 25, 2025 at 5:05 PM John Stultz <jstultz@google.com> wrote:
> It's confusing as this patch uses the similar logic as logic
> pick_next_entity() uses when a sched_delayed task is picked to run, as
> well as elsewhere in __sched_setscheduler() and in sched_ext, so I'd
> fret that similar

Didn't finish my sentence there.
"...similar problems as the one you are seeing could still occur."

thanks
-john