[RFC][PATCH] sched/core: Tweak wait_task_inactive() to force dequeue sched_delayed tasks

Posted by John Stultz 7 months, 3 weeks ago
It was reported that in 6.12, smpboot_create_threads() was
taking much longer than in 6.6.

I narrowed down the call path to:
 smpboot_create_threads()
 -> kthread_create_on_cpu()
    -> kthread_bind()
       -> __kthread_bind_mask()
           -> wait_task_inactive()

Where in wait_task_inactive() we were regularly hitting the
queued case, which sets a 1 tick timeout, which when called
multiple times in a row, accumulates quickly into a long
delay.

I noticed disabling the DELAY_DEQUEUE sched feature recovered
the performance, and it seems the newly created tasks are usually
sched_delayed and left on the runqueue.

So in wait_task_inactive(), when we see the task is queued
(task_on_rq_queued(p)) and p->se.sched_delayed is set, manually
dequeue the sched_delayed task with DEQUEUE_DELAYED, so we don't
have to constantly wait out a tick.

This seems to work, but I've only lightly tested it, so I'd love
close review and feedback in case I've missed something in
wait_task_inactive(), or if there is a simpler alternative
approach.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Ben Segall <bsegall@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: kernel-team@android.com
Signed-off-by: John Stultz <jstultz@google.com>
---
 kernel/sched/core.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c81cf642dba05..43f0931a3cd8a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2287,6 +2287,14 @@ unsigned long wait_task_inactive(struct task_struct *p, unsigned int match_state
 		running = task_on_cpu(rq, p);
 		queued = task_on_rq_queued(p);
 		ncsw = 0;
+		/*
+		 * If task is sched_delayed, force dequeue it, to avoid always
+		 * hitting the tick timeout in the queued case
+		 */
+		if (!running && queued && p->se.sched_delayed) {
+			dequeue_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
+			queued = task_on_rq_queued(p);
+		}
 		if ((match = __task_state_match(p, match_state))) {
 			/*
 			 * When matching on p->saved_state, consider this task
-- 
2.49.0.805.g082f7c87e0-goog
Re: [RFC][PATCH] sched/core: Tweak wait_task_inactive() to force dequeue sched_delayed tasks
Posted by Peter Zijlstra 7 months, 3 weeks ago
On Mon, Apr 21, 2025 at 09:43:45PM -0700, John Stultz wrote:
> It was reported that in 6.12, smpboot_create_threads() was
> taking much longer than in 6.6.
> 
> I narrowed down the call path to:
>  smpboot_create_threads()
>  -> kthread_create_on_cpu()
>     -> kthread_bind()
>        -> __kthread_bind_mask()
>           ->wait_task_inactive()
> 
> Where in wait_task_inactive() we were regularly hitting the
> queued case, which sets a 1 tick timeout, which when called
> multiple times in a row, accumulates quickly into a long
> delay.

Argh, this is all stupid :-(

The whole __kthread_bind_*() thing is a bit weird, but fundamentally it
tries to avoid a race vs current. Notably task_state::flags is only ever
modified by current, except here.

delayed_dequeue is fine, except wait_task_inactive() hasn't been
told about it (I hate that function, murder death kill etc.).

But more fundamentally, we've put so much crap into struct kthread and
kthread() itself by now, why not also pass down the whole per-cpu-ness
thing and simply do it there. Heck, Frederic already made it do affinity
crud.

On that, Frederic, *why* do you do that after started=1, that seems like
a weird place, should this not be done before complete() ?, like next to
sched_setscheduler_nocheck() or so?
Re: [RFC][PATCH] sched/core: Tweak wait_task_inactive() to force dequeue sched_delayed tasks
Posted by John Stultz 7 months, 3 weeks ago
On Tue, Apr 22, 2025 at 1:56 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Mon, Apr 21, 2025 at 09:43:45PM -0700, John Stultz wrote:
> > It was reported that in 6.12, smpboot_create_threads() was
> > taking much longer than in 6.6.
> >
> > I narrowed down the call path to:
> >  smpboot_create_threads()
> >  -> kthread_create_on_cpu()
> >     -> kthread_bind()
> >        -> __kthread_bind_mask()
> >           ->wait_task_inactive()
> >
> > Where in wait_task_inactive() we were regularly hitting the
> > queued case, which sets a 1 tick timeout, which when called
> > multiple times in a row, accumulates quickly into a long
> > delay.
>
> Argh, this is all stupid :-(
>
> The whole __kthread_bind_*() thing is a bit weird, but fundamentally it
> tries to avoid a race vs current. Notably task_state::flags is only ever
> modified by current, except here.
>
> delayed_dequeue is fine, except wait_task_inactive() hasn't been
> told about it (I hate that function, murder death kill etc.).

Hey Peter,
  So I hear your (sanguinary?) dissatisfaction with these functions,
but from your short discussion with Frederic it seems like
wait_task_inactive() and kthread_bind() are sticking around, so I'm
not sure what the next course of action should be.

Should I resend my patch (including Prateek's suggested tweaks to make
it a little nicer), so wait_task_inactive() is sched_delayed aware? Or
are you thinking we should solve this differently (ideally in some
form that can head to -stable to help folks hitting this in 6.12?)?

thanks
-john
Re: [RFC][PATCH] sched/core: Tweak wait_task_inactive() to force dequeue sched_delayed tasks
Posted by Frederic Weisbecker 7 months, 3 weeks ago
On Tue, Apr 22, 2025 at 10:56:28AM +0200, Peter Zijlstra wrote:
> On Mon, Apr 21, 2025 at 09:43:45PM -0700, John Stultz wrote:
> > It was reported that in 6.12, smpboot_create_threads() was
> > taking much longer than in 6.6.
> > 
> > I narrowed down the call path to:
> >  smpboot_create_threads()
> >  -> kthread_create_on_cpu()
> >     -> kthread_bind()
> >        -> __kthread_bind_mask()
> >           ->wait_task_inactive()
> > 
> > Where in wait_task_inactive() we were regularly hitting the
> > queued case, which sets a 1 tick timeout, which when called
> > multiple times in a row, accumulates quickly into a long
> > delay.
> 
> Argh, this is all stupid :-(
> 
> The whole __kthread_bind_*() thing is a bit weird, but fundamentally it
> tries to avoid a race vs current. Notably task_state::flags is only ever
> modified by current, except here.
> 
> delayed_dequeue is fine, except wait_task_inactive() hasn't been
> told about it (I hate that function, murder death kill etc.).
> 
> But more fundamentally, we've put so much crap into struct kthread and
> kthread() itself by now, why not also pass down the whole per-cpu-ness
> thing and simply do it there. Heck, Frederic already made it do affinity
> crud.
> 
> On that, Frederic, *why* do you do that after started=1, that seems like
> a weird place, should this not be done before complete() ?, like next to
> sched_setscheduler_nocheck() or so?

You mean the call to kthread_affine_node()? Because it is a default behaviour
that only happens if no call to kthread_bind() or kthread_affine_preferred()
has been issued before the first wake-up of the kthread.

If kthread_affine_node() was called before everything by default instead
then we would get its unconditional overhead for all started kthreads. Plus
kthread_bind() and kthread_affine_preferred() would need to undo
kthread_affine_node().

Thanks.

-- 
Frederic Weisbecker
SUSE Labs
Re: [RFC][PATCH] sched/core: Tweak wait_task_inactive() to force dequeue sched_delayed tasks
Posted by Peter Zijlstra 7 months, 3 weeks ago
On Tue, Apr 22, 2025 at 11:55:31AM +0200, Frederic Weisbecker wrote:
> On Tue, Apr 22, 2025 at 10:56:28AM +0200, Peter Zijlstra wrote:
> > On Mon, Apr 21, 2025 at 09:43:45PM -0700, John Stultz wrote:
> > > It was reported that in 6.12, smpboot_create_threads() was
> > > taking much longer than in 6.6.
> > > 
> > > I narrowed down the call path to:
> > >  smpboot_create_threads()
> > >  -> kthread_create_on_cpu()
> > >     -> kthread_bind()
> > >        -> __kthread_bind_mask()
> > >           ->wait_task_inactive()
> > > 
> > > Where in wait_task_inactive() we were regularly hitting the
> > > queued case, which sets a 1 tick timeout, which when called
> > > multiple times in a row, accumulates quickly into a long
> > > delay.
> > 
> > Argh, this is all stupid :-(
> > 
> > The whole __kthread_bind_*() thing is a bit weird, but fundamentally it
> > tries to avoid a race vs current. Notably task_state::flags is only ever
> > modified by current, except here.
> > 
> > delayed_dequeue is fine, except wait_task_inactive() hasn't been
> > told about it (I hate that function, murder death kill etc.).
> > 
> > But more fundamentally, we've put so much crap into struct kthread and
> > kthread() itself by now, why not also pass down the whole per-cpu-ness
> > thing and simply do it there. Heck, Frederic already made it do affinity
> > crud.
> > 
> > On that, Frederic, *why* do you do that after started=1, that seems like
> > a weird place, should this not be done before complete() ?, like next to
> > sched_setscheduler_nocheck() or so?
> 
> You mean the call to kthread_affine_node()? Because it is a default behaviour
> that only happens if no call to kthread_bind() or kthread_affine_preferred()
> has been issued before the first wake-up of the kthread.
> 
> If kthread_affine_node() was called before everything by default instead
> then we would get its unconditional overhead for all started kthreads. Plus
> kthread_bind() and kthread_affine_preferred() would need to undo
> kthread_affine_node().

Urgh, I see. Perhaps we should put a comment on, because I'm sure I'll
have this same question again next time (probably in another few years)
when I look at this code.
Re: [RFC][PATCH] sched/core: Tweak wait_task_inactive() to force dequeue sched_delayed tasks
Posted by K Prateek Nayak 7 months, 3 weeks ago
Hello John,

On 4/22/2025 10:13 AM, John Stultz wrote:
> It was reported that in 6.12, smpboot_create_threads() was
> taking much longer than in 6.6.
> 
> I narrowed down the call path to:
>   smpboot_create_threads()
>   -> kthread_create_on_cpu()
>      -> kthread_bind()
>         -> __kthread_bind_mask()
>            ->wait_task_inactive()
> 
> Where in wait_task_inactive() we were regularly hitting the
> queued case, which sets a 1 tick timeout, which when called
> multiple times in a row, accumulates quickly into a long
> delay.
> 
> I noticed disabling the DELAY_DEQUEUE sched feature recovered
> the performance, and it seems the newly created tasks are usually
> sched_delayed and left on the runqueue.
> 
> So in wait_task_inactive(), when we see the task is queued
> (task_on_rq_queued(p)) and p->se.sched_delayed is set, manually
> dequeue the sched_delayed task with DEQUEUE_DELAYED, so we don't
> have to constantly wait out a tick.
> 
> This seems to work, but I've only lightly tested it, so I'd love
> close review and feedback in case I've missed something in
> wait_task_inactive(), or if there is a simpler alternative
> approach.
> 
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Juri Lelli <juri.lelli@redhat.com>
> Cc: Vincent Guittot <vincent.guittot@linaro.org>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Valentin Schneider <vschneid@redhat.com>
> Cc: K Prateek Nayak <kprateek.nayak@amd.com>
> Cc: kernel-team@android.com
> Signed-off-by: John Stultz <jstultz@google.com>
> ---
>   kernel/sched/core.c | 8 ++++++++
>   1 file changed, 8 insertions(+)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c81cf642dba05..43f0931a3cd8a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2287,6 +2287,14 @@ unsigned long wait_task_inactive(struct task_struct *p, unsigned int match_state
>   		running = task_on_cpu(rq, p);
>   		queued = task_on_rq_queued(p);
>   		ncsw = 0;
> +		/*
> +		 * If task is sched_delayed, force dequeue it, to avoid always
> +		 * hitting the tick timeout in the queued case
> +		 */
> +		if (!running && queued && p->se.sched_delayed) {

I think an "update_rq_clock(rq)" is required here before dequeue but other
than that ...

> +			dequeue_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_DELAYED);

I think this makes sense since the comment above the hrtimeout considers
a "queued" task to be preempted and still runnable but a delayed task
doesn't fit that description.

You can perhaps move the dequeuing of the delayed task to just after
the task_rq_lock() bit, since once the rq lock and the pi_lock are
acquired, the task cannot be requeued and simply checking
"p->se.sched_delayed" should be enough. Something like:

     rq = task_rq_lock(p, &rf);
     trace_sched_wait_task(p);

     if (p->se.sched_delayed) {
             update_rq_clock(rq);
             dequeue_task(rq, p, DEQUEUE_SLEEP | DEQUEUE_DELAYED);
     }

     queued = task_on_rq_queued(p);
     ...

Similar pattern exists for __sched_setscheduler() and
rt_mutex_setprio().

Other than that, feel free to include:

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

> +			queued = task_on_rq_queued(p);
> +		}
>   		if ((match = __task_state_match(p, match_state))) {
>   			/*
>   			 * When matching on p->saved_state, consider this task

-- 
Thanks and Regards,
Prateek
Re: [RFC][PATCH] sched/core: Tweak wait_task_inactive() to force dequeue sched_delayed tasks
Posted by K Prateek Nayak 7 months, 3 weeks ago
On 4/22/2025 12:13 PM, K Prateek Nayak wrote:
> 
> I think an "update_rq_clock(rq)" is required here before dequeue but other
> than that ...

I clearly need more sleep! Please ignore this bit.

-- 
Thanks and Regards,
Prateek