[v1] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU

[PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU

Posted by Mike Galbraith 1 year, 3 months ago

On Thu, 2024-11-07 at 15:09 +0100, Peter Zijlstra wrote:
> On Thu, Nov 07, 2024 at 03:02:36PM +0100, Mike Galbraith wrote:
> > On Thu, 2024-11-07 at 10:46 +0100, Mike Galbraith wrote:
> > > On Thu, 2024-11-07 at 05:03 +0100, Mike Galbraith wrote:
> > > >
> > > > I built that patch out of curiosity, and yeah, set_next_task_fair()
> > > > finding a cfs_rq->curr ends play time pretty quickly.
> > >
> > > The below improved uptime, and trace_printk() says it's doing the
> > > intended, so I suppose I'll add a feature and see what falls out.
> >
> > From netperf, I got.. number tabulation practice.  Three runs of each
> > test with and without produced nothing but variance/noise.
>
> Make it go away then.
>
> If you could write a Changelog for you inspired bit and stick my cleaned
> up version under it, I'd be much obliged.

Salut, much obliged for eyeball relief.

---snip---

Phil Auld (Redhat) reported an fio benchmark regression having been found
to have been caused by addition of the DELAY_DEQUEUE feature, suggested it
may be related to wakees losing the ability to migrate, and confirmed that
restoration of same indeed did restore previous performance.

(de-uglified-a-lot-by)

Reported-by: Phil Auld <pauld@redhat.com>
Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
Link: https://lore.kernel.org/lkml/20241101124715.GA689589@pauld.westford.csb/
Signed-off-by: Mike Galbraith <efault@gmx.de>
---
 kernel/sched/core.c  |   48 +++++++++++++++++++++++++++++-------------------
 kernel/sched/sched.h |    5 +++++
 2 files changed, 34 insertions(+), 19 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3783,28 +3783,38 @@ ttwu_do_activate(struct rq *rq, struct t
  */
 static int ttwu_runnable(struct task_struct *p, int wake_flags)
 {
-	struct rq_flags rf;
-	struct rq *rq;
-	int ret = 0;
-
-	rq = __task_rq_lock(p, &rf);
-	if (task_on_rq_queued(p)) {
-		update_rq_clock(rq);
-		if (p->se.sched_delayed)
-			enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
-		if (!task_on_cpu(rq, p)) {
-			/*
-			 * When on_rq && !on_cpu the task is preempted, see if
-			 * it should preempt the task that is current now.
-			 */
-			wakeup_preempt(rq, p, wake_flags);
+	CLASS(__task_rq_lock, rq_guard)(p);
+	struct rq *rq = rq_guard.rq;
+
+	if (!task_on_rq_queued(p))
+		return 0;
+
+	update_rq_clock(rq);
+	if (p->se.sched_delayed) {
+		int queue_flags = ENQUEUE_DELAYED | ENQUEUE_NOCLOCK;
+
+		/*
+		 * Since sched_delayed means we cannot be current anywhere,
+		 * dequeue it here and have it fall through to the
+		 * select_task_rq() case further along the ttwu() path.
+		 */
+		if (rq->nr_running > 1 && p->nr_cpus_allowed > 1) {
+			dequeue_task(rq, p, DEQUEUE_SLEEP | queue_flags);
+			return 0;
 		}
-		ttwu_do_wakeup(p);
-		ret = 1;
+
+		enqueue_task(rq, p, queue_flags);
+	}
+	if (!task_on_cpu(rq, p)) {
+		/*
+		 * When on_rq && !on_cpu the task is preempted, see if
+		 * it should preempt the task that is current now.
+		 */
+		wakeup_preempt(rq, p, wake_flags);
 	}
-	__task_rq_unlock(rq, &rf);
+	ttwu_do_wakeup(p);

-	return ret;
+	return 1;
 }

 #ifdef CONFIG_SMP
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1779,6 +1779,11 @@ task_rq_unlock(struct rq *rq, struct tas
 	raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
 }

+DEFINE_LOCK_GUARD_1(__task_rq_lock, struct task_struct,
+		    _T->rq = __task_rq_lock(_T->lock, &_T->rf),
+		    __task_rq_unlock(_T->rq, &_T->rf),
+		    struct rq *rq; struct rq_flags rf)
+
 DEFINE_LOCK_GUARD_1(task_rq_lock, struct task_struct,
 		    _T->rq = task_rq_lock(_T->lock, &_T->rf),
 		    task_rq_unlock(_T->rq, _T->lock, &_T->rf),

Re: [PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU

Posted by Mike Galbraith 1 year, 2 months ago

On Fri, 2024-11-08 at 01:24 +0100, Mike Galbraith wrote:
> On Thu, 2024-11-07 at 15:09 +0100, Peter Zijlstra wrote:
> > On Thu, Nov 07, 2024 at 03:02:36PM +0100, Mike Galbraith wrote:
> > > On Thu, 2024-11-07 at 10:46 +0100, Mike Galbraith wrote:
> > > > On Thu, 2024-11-07 at 05:03 +0100, Mike Galbraith wrote:
> > > > > 
> > > > > I built that patch out of curiosity, and yeah, set_next_task_fair()
> > > > > finding a cfs_rq->curr ends play time pretty quickly.
> > > > 
> > > > The below improved uptime, and trace_printk() says it's doing the
> > > > intended, so I suppose I'll add a feature and see what falls out.
> > > 
> > > From netperf, I got.. number tabulation practice.  Three runs of each
> > > test with and without produced nothing but variance/noise.
> > 
> > Make it go away then.
> > 
> > If you could write a Changelog for you inspired bit and stick my cleaned
> > up version under it, I'd be much obliged.
> 
> Salut, much obliged for eyeball relief.

Unfortunate change log place holder below aside, I think this patch may
need to be yanked as trading one not readily repeatable regression for
at least one that definitely is, and likely multiple others.

(adds knob)

tbench 8

NO_MIGRATE_DELAYED    3613.49 MB/sec
MIGRATE_DELAYED       3145.59 MB/sec
NO_DELAY_DEQUEUE      3355.42 MB/sec

First line is DELAY_DEQUEUE restoring pre-EEVDF tbench throughput as
I've mentioned it doing, but $subject promptly did away with that and
then some.

I thought I might be able to do away with the reservation like side
effect of DELAY_DEQUEUE by borrowing h_nr_delayed from...

     sched/eevdf: More PELT vs DELAYED_DEQUEUE

...for cgroups free test config, but Q/D poke at idle_cpu() helped not
at all.

> ---snip---
> 
> Phil Auld (Redhat) reported an fio benchmark regression having been found
> to have been caused by addition of the DELAY_DEQUEUE feature, suggested it
> may be related to wakees losing the ability to migrate, and confirmed that
> restoration of same indeed did restore previous performance.
> 
> (de-uglified-a-lot-by)
> 
> Reported-by: Phil Auld <pauld@redhat.com>
> Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
> Link: https://lore.kernel.org/lkml/20241101124715.GA689589@pauld.westford.csb/
> Signed-off-by: Mike Galbraith <efault@gmx.de>
> ---
>  kernel/sched/core.c  |   48 +++++++++++++++++++++++++++++-------------------
>  kernel/sched/sched.h |    5 +++++
>  2 files changed, 34 insertions(+), 19 deletions(-)
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3783,28 +3783,38 @@ ttwu_do_activate(struct rq *rq, struct t
>   */
>  static int ttwu_runnable(struct task_struct *p, int wake_flags)
>  {
> -       struct rq_flags rf;
> -       struct rq *rq;
> -       int ret = 0;
> -
> -       rq = __task_rq_lock(p, &rf);
> -       if (task_on_rq_queued(p)) {
> -               update_rq_clock(rq);
> -               if (p->se.sched_delayed)
> -                       enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> -               if (!task_on_cpu(rq, p)) {
> -                       /*
> -                        * When on_rq && !on_cpu the task is preempted, see if
> -                        * it should preempt the task that is current now.
> -                        */
> -                       wakeup_preempt(rq, p, wake_flags);
> +       CLASS(__task_rq_lock, rq_guard)(p);
> +       struct rq *rq = rq_guard.rq;
> +
> +       if (!task_on_rq_queued(p))
> +               return 0;
> +
> +       update_rq_clock(rq);
> +       if (p->se.sched_delayed) {
> +               int queue_flags = ENQUEUE_DELAYED | ENQUEUE_NOCLOCK;
> +
> +               /*
> +                * Since sched_delayed means we cannot be current anywhere,
> +                * dequeue it here and have it fall through to the
> +                * select_task_rq() case further along the ttwu() path.
> +                */
> +               if (rq->nr_running > 1 && p->nr_cpus_allowed > 1) {
> +                       dequeue_task(rq, p, DEQUEUE_SLEEP | queue_flags);
> +                       return 0;
>                 }
> -               ttwu_do_wakeup(p);
> -               ret = 1;
> +
> +               enqueue_task(rq, p, queue_flags);
> +       }
> +       if (!task_on_cpu(rq, p)) {
> +               /*
> +                * When on_rq && !on_cpu the task is preempted, see if
> +                * it should preempt the task that is current now.
> +                */
> +               wakeup_preempt(rq, p, wake_flags);
>         }
> -       __task_rq_unlock(rq, &rf);
> +       ttwu_do_wakeup(p);
>  
> -       return ret;
> +       return 1;
>  }
>  
>  #ifdef CONFIG_SMP
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1779,6 +1779,11 @@ task_rq_unlock(struct rq *rq, struct tas
>         raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
>  }
>  
> +DEFINE_LOCK_GUARD_1(__task_rq_lock, struct task_struct,
> +                   _T->rq = __task_rq_lock(_T->lock, &_T->rf),
> +                   __task_rq_unlock(_T->rq, &_T->rf),
> +                   struct rq *rq; struct rq_flags rf)
> +
>  DEFINE_LOCK_GUARD_1(task_rq_lock, struct task_struct,
>                     _T->rq = task_rq_lock(_T->lock, &_T->rf),
>                     task_rq_unlock(_T->rq, _T->lock, &_T->rf),
> 
>

Re: [PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU

Posted by Phil Auld 1 year, 2 months ago

On Tue, Nov 12, 2024 at 08:05:04AM +0100 Mike Galbraith wrote:
> On Fri, 2024-11-08 at 01:24 +0100, Mike Galbraith wrote:
> > On Thu, 2024-11-07 at 15:09 +0100, Peter Zijlstra wrote:
> > > On Thu, Nov 07, 2024 at 03:02:36PM +0100, Mike Galbraith wrote:
> > > > On Thu, 2024-11-07 at 10:46 +0100, Mike Galbraith wrote:
> > > > > On Thu, 2024-11-07 at 05:03 +0100, Mike Galbraith wrote:
> > > > > > 
> > > > > > I built that patch out of curiosity, and yeah, set_next_task_fair()
> > > > > > finding a cfs_rq->curr ends play time pretty quickly.
> > > > > 
> > > > > The below improved uptime, and trace_printk() says it's doing the
> > > > > intended, so I suppose I'll add a feature and see what falls out.
> > > > 
> > > > From netperf, I got.. number tabulation practice.  Three runs of each
> > > > test with and without produced nothing but variance/noise.
> > > 
> > > Make it go away then.
> > > 
> > > If you could write a Changelog for you inspired bit and stick my cleaned
> > > up version under it, I'd be much obliged.
> > 
> > Salut, much obliged for eyeball relief.
> 
> Unfortunate change log place holder below aside, I think this patch may
> need to be yanked as trading one not readily repeatable regression for
> at least one that definitely is, and likely multiple others.
> 
> (adds knob)
>

Yes, I ws just coming here to reply. I have the results from the first
version of the patch (I don't think the later one fundemtally changed
enough that it will matter but those results are still pending).

Not entirely surprisingly we've traded a ~10% rand write regression for
5-10% rand read regression. This makes sense to me since the reads are
more likely to be synchronous and thus be more buddy-like and benefit
from flipping back and forth on the same cpu.  

I'd probably have to take the reads over the writes in such a trade off :)

> tbench 8
> 
> NO_MIGRATE_DELAYED    3613.49 MB/sec
> MIGRATE_DELAYED       3145.59 MB/sec
> NO_DELAY_DEQUEUE      3355.42 MB/sec
> 
> First line is DELAY_DEQUEUE restoring pre-EEVDF tbench throughput as
> I've mentioned it doing, but $subject promptly did away with that and
> then some.
>

Yep, that's not pretty. 

> I thought I might be able to do away with the reservation like side
> effect of DELAY_DEQUEUE by borrowing h_nr_delayed from...
> 
>      sched/eevdf: More PELT vs DELAYED_DEQUEUE
> 
> ...for cgroups free test config, but Q/D poke at idle_cpu() helped not
> at all.
>

I wonder if the last_wakee stuff could be leveraged here (an idle thought,
so to speak). Haven't looked closely enough. 


Cheers,
Phil

> > ---snip---
> > 
> > Phil Auld (Redhat) reported an fio benchmark regression having been found
> > to have been caused by addition of the DELAY_DEQUEUE feature, suggested it
> > may be related to wakees losing the ability to migrate, and confirmed that
> > restoration of same indeed did restore previous performance.
> > 
> > (de-uglified-a-lot-by)
> > 
> > Reported-by: Phil Auld <pauld@redhat.com>
> > Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
> > Link: https://lore.kernel.org/lkml/20241101124715.GA689589@pauld.westford.csb/
> > Signed-off-by: Mike Galbraith <efault@gmx.de>
> > ---
> >  kernel/sched/core.c  |   48 +++++++++++++++++++++++++++++-------------------
> >  kernel/sched/sched.h |    5 +++++
> >  2 files changed, 34 insertions(+), 19 deletions(-)
> > 
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3783,28 +3783,38 @@ ttwu_do_activate(struct rq *rq, struct t
> >   */
> >  static int ttwu_runnable(struct task_struct *p, int wake_flags)
> >  {
> > -       struct rq_flags rf;
> > -       struct rq *rq;
> > -       int ret = 0;
> > -
> > -       rq = __task_rq_lock(p, &rf);
> > -       if (task_on_rq_queued(p)) {
> > -               update_rq_clock(rq);
> > -               if (p->se.sched_delayed)
> > -                       enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> > -               if (!task_on_cpu(rq, p)) {
> > -                       /*
> > -                        * When on_rq && !on_cpu the task is preempted, see if
> > -                        * it should preempt the task that is current now.
> > -                        */
> > -                       wakeup_preempt(rq, p, wake_flags);
> > +       CLASS(__task_rq_lock, rq_guard)(p);
> > +       struct rq *rq = rq_guard.rq;
> > +
> > +       if (!task_on_rq_queued(p))
> > +               return 0;
> > +
> > +       update_rq_clock(rq);
> > +       if (p->se.sched_delayed) {
> > +               int queue_flags = ENQUEUE_DELAYED | ENQUEUE_NOCLOCK;
> > +
> > +               /*
> > +                * Since sched_delayed means we cannot be current anywhere,
> > +                * dequeue it here and have it fall through to the
> > +                * select_task_rq() case further along the ttwu() path.
> > +                */
> > +               if (rq->nr_running > 1 && p->nr_cpus_allowed > 1) {
> > +                       dequeue_task(rq, p, DEQUEUE_SLEEP | queue_flags);
> > +                       return 0;
> >                 }
> > -               ttwu_do_wakeup(p);
> > -               ret = 1;
> > +
> > +               enqueue_task(rq, p, queue_flags);
> > +       }
> > +       if (!task_on_cpu(rq, p)) {
> > +               /*
> > +                * When on_rq && !on_cpu the task is preempted, see if
> > +                * it should preempt the task that is current now.
> > +                */
> > +               wakeup_preempt(rq, p, wake_flags);
> >         }
> > -       __task_rq_unlock(rq, &rf);
> > +       ttwu_do_wakeup(p);
> >  
> > -       return ret;
> > +       return 1;
> >  }
> >  
> >  #ifdef CONFIG_SMP
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -1779,6 +1779,11 @@ task_rq_unlock(struct rq *rq, struct tas
> >         raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
> >  }
> >  
> > +DEFINE_LOCK_GUARD_1(__task_rq_lock, struct task_struct,
> > +                   _T->rq = __task_rq_lock(_T->lock, &_T->rf),
> > +                   __task_rq_unlock(_T->rq, &_T->rf),
> > +                   struct rq *rq; struct rq_flags rf)
> > +
> >  DEFINE_LOCK_GUARD_1(task_rq_lock, struct task_struct,
> >                     _T->rq = task_rq_lock(_T->lock, &_T->rf),
> >                     task_rq_unlock(_T->rq, _T->lock, &_T->rf),
> > 
> > 
> 

--

Re: [PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU

Posted by Peter Zijlstra 1 year, 2 months ago

On Tue, Nov 12, 2024 at 07:41:17AM -0500, Phil Auld wrote:
> On Tue, Nov 12, 2024 at 08:05:04AM +0100 Mike Galbraith wrote:
> > On Fri, 2024-11-08 at 01:24 +0100, Mike Galbraith wrote:
> > > On Thu, 2024-11-07 at 15:09 +0100, Peter Zijlstra wrote:
> > > > On Thu, Nov 07, 2024 at 03:02:36PM +0100, Mike Galbraith wrote:
> > > > > On Thu, 2024-11-07 at 10:46 +0100, Mike Galbraith wrote:
> > > > > > On Thu, 2024-11-07 at 05:03 +0100, Mike Galbraith wrote:
> > > > > > > 
> > > > > > > I built that patch out of curiosity, and yeah, set_next_task_fair()
> > > > > > > finding a cfs_rq->curr ends play time pretty quickly.
> > > > > > 
> > > > > > The below improved uptime, and trace_printk() says it's doing the
> > > > > > intended, so I suppose I'll add a feature and see what falls out.
> > > > > 
> > > > > From netperf, I got.. number tabulation practice.  Three runs of each
> > > > > test with and without produced nothing but variance/noise.
> > > > 
> > > > Make it go away then.
> > > > 
> > > > If you could write a Changelog for you inspired bit and stick my cleaned
> > > > up version under it, I'd be much obliged.
> > > 
> > > Salut, much obliged for eyeball relief.
> > 
> > Unfortunate change log place holder below aside, I think this patch may
> > need to be yanked as trading one not readily repeatable regression for
> > at least one that definitely is, and likely multiple others.
> > 
> > (adds knob)
> >
> 
> Yes, I ws just coming here to reply. I have the results from the first
> version of the patch (I don't think the later one fundemtally changed
> enough that it will matter but those results are still pending).
> 
> Not entirely surprisingly we've traded a ~10% rand write regression for
> 5-10% rand read regression. This makes sense to me since the reads are
> more likely to be synchronous and thus be more buddy-like and benefit
> from flipping back and forth on the same cpu.  

OK, so I'm going to make this commit disappear.

Re: [PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU

Posted by Mike Galbraith 1 year, 2 months ago

On Tue, 2024-11-12 at 07:41 -0500, Phil Auld wrote:
> On Tue, Nov 12, 2024 at 08:05:04AM +0100 Mike Galbraith wrote:
> >
> > Unfortunate change log place holder below aside, I think this patch may
> > need to be yanked as trading one not readily repeatable regression for
> > at least one that definitely is, and likely multiple others.
> >
> > (adds knob)
> >
>
> Yes, I ws just coming here to reply. I have the results from the first
> version of the patch (I don't think the later one fundemtally changed
> enough that it will matter but those results are still pending).
>
> Not entirely surprisingly we've traded a ~10% rand write regression for
> 5-10% rand read regression. This makes sense to me since the reads are
> more likely to be synchronous and thus be more buddy-like and benefit
> from flipping back and forth on the same cpu. 

Ok, that would seem to second "shoot it".

> I'd probably have to take the reads over the writes in such a trade off :)
>
> > tbench 8
> >
> > NO_MIGRATE_DELAYED    3613.49 MB/sec
> > MIGRATE_DELAYED       3145.59 MB/sec
> > NO_DELAY_DEQUEUE      3355.42 MB/sec
> >
> > First line is DELAY_DEQUEUE restoring pre-EEVDF tbench throughput as
> > I've mentioned it doing, but $subject promptly did away with that and
> > then some.
> >
>
> Yep, that's not pretty.

Yeah, not to mention annoying.

I get the "adds bounce cache pain" aspect, but not why pre-EEVDF
wouldn't be just as heavily affected, it having nothing blocking high
frequency migration (the eternal scheduler boogieman:).

Bottom line would appear to be that these survivors should be left
where they ended up, either due to LB or more likely bog standard
prev_cpu locality, for they are part and parcel of a progression.

> > I thought I might be able to do away with the reservation like side
> > effect of DELAY_DEQUEUE by borrowing h_nr_delayed from...
> >
> >      sched/eevdf: More PELT vs DELAYED_DEQUEUE
> >
> > ...for cgroups free test config, but Q/D poke at idle_cpu() helped not
> > at all.

We don't however have to let sched_delayed block SIS though.  Rendering
them transparent in idle_cpu() did NOT wreck the progression, so
maaaybe could help your regression.

> I wonder if the last_wakee stuff could be leveraged here (an idle thought,
> so to speak). Haven't looked closely enough.

If you mean heuristics, the less of those we have, the better off we
are.. they _always_ find a way to embed their teeth in your backside.

	-Mike

Re: [PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU

Posted by Phil Auld 1 year, 2 months ago

On Tue, Nov 12, 2024 at 03:23:38PM +0100 Mike Galbraith wrote:
> On Tue, 2024-11-12 at 07:41 -0500, Phil Auld wrote:
> > On Tue, Nov 12, 2024 at 08:05:04AM +0100 Mike Galbraith wrote:
> > >
> > > Unfortunate change log place holder below aside, I think this patch may
> > > need to be yanked as trading one not readily repeatable regression for
> > > at least one that definitely is, and likely multiple others.
> > >
> > > (adds knob)
> > >
> >
> > Yes, I ws just coming here to reply. I have the results from the first
> > version of the patch (I don't think the later one fundemtally changed
> > enough that it will matter but those results are still pending).
> >
> > Not entirely surprisingly we've traded a ~10% rand write regression for
> > 5-10% rand read regression. This makes sense to me since the reads are
> > more likely to be synchronous and thus be more buddy-like and benefit
> > from flipping back and forth on the same cpu. 
> 
> Ok, that would seem to second "shoot it".
>

Yes, drop it please, I think. Thanks!


> > I'd probably have to take the reads over the writes in such a trade off :)
> >
> > > tbench 8
> > >
> > > NO_MIGRATE_DELAYED    3613.49 MB/sec
> > > MIGRATE_DELAYED       3145.59 MB/sec
> > > NO_DELAY_DEQUEUE      3355.42 MB/sec
> > >
> > > First line is DELAY_DEQUEUE restoring pre-EEVDF tbench throughput as
> > > I've mentioned it doing, but $subject promptly did away with that and
> > > then some.
> > >
> >
> > Yep, that's not pretty.
> 
> Yeah, not to mention annoying.
> 
> I get the "adds bounce cache pain" aspect, but not why pre-EEVDF
> wouldn't be just as heavily affected, it having nothing blocking high
> frequency migration (the eternal scheduler boogieman:).
> 
> Bottom line would appear to be that these survivors should be left
> where they ended up, either due to LB or more likely bog standard
> prev_cpu locality, for they are part and parcel of a progression.
> 
> > > I thought I might be able to do away with the reservation like side
> > > effect of DELAY_DEQUEUE by borrowing h_nr_delayed from...
> > >
> > >      sched/eevdf: More PELT vs DELAYED_DEQUEUE
> > >
> > > ...for cgroups free test config, but Q/D poke at idle_cpu() helped not
> > > at all.
> 
> We don't however have to let sched_delayed block SIS though.  Rendering
> them transparent in idle_cpu() did NOT wreck the progression, so
> maaaybe could help your regression.
>

You mean something like:

if (rq->nr_running > rq->h_nr_delayed)
       return 0;

in idle_cpu() instead of the straight rq->nr_running check? I don't
have the h_nr_delayed stuff yet but can look for it.  I'm not sure
that will help with the delayees being sticky.  But I can try to try
that if I'm understanding you right.

I'll try to dig into it some more regardless.

> > I wonder if the last_wakee stuff could be leveraged here (an idle thought,
> > so to speak). Haven't looked closely enough.
> 
> If you mean heuristics, the less of those we have, the better off we
> are.. they _always_ find a way to embed their teeth in your backside.
>

Sure, I get that. But when you have a trade-off like this being "smarter"
about when to do the dequeue might help. But yes, that could go wrong.

I'm not a fan of knobs either but we could do your patch with the feat
and default it off.


Cheers,
Phil

> 	-Mike
> 

--

Re: [PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU

Posted by Mike Galbraith 1 year, 2 months ago

On Tue, 2024-11-12 at 10:41 -0500, Phil Auld wrote:
> On Tue, Nov 12, 2024 at 03:23:38PM +0100 Mike Galbraith wrote:
>
> >
> > We don't however have to let sched_delayed block SIS though.  Rendering
> > them transparent in idle_cpu() did NOT wreck the progression, so
> > maaaybe could help your regression.
> >
>
> You mean something like:
>
> if (rq->nr_running > rq->h_nr_delayed)
>        return 0;
>
> in idle_cpu() instead of the straight rq->nr_running check?

Yeah, close enough.


	-Mike

Re: [PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU

Posted by Mike Galbraith 1 year, 2 months ago

On Tue, 2024-11-12 at 17:15 +0100, Mike Galbraith wrote:
> On Tue, 2024-11-12 at 10:41 -0500, Phil Auld wrote:
> > On Tue, Nov 12, 2024 at 03:23:38PM +0100 Mike Galbraith wrote:
> >
> > >
> > > We don't however have to let sched_delayed block SIS though.  Rendering
> > > them transparent in idle_cpu() did NOT wreck the progression, so
> > > maaaybe could help your regression.
> > >
> >
> > You mean something like:
> >
> > if (rq->nr_running > rq->h_nr_delayed)
> >        return 0;
> >
> > in idle_cpu() instead of the straight rq->nr_running check?
>
> Yeah, close enough.

The below is all you need.

Watching blockage rate during part of a netperf scaling run without, a
bit over 2/sec was the highest it got, but with, that drops to the same
zero as turning off the feature, so... relevance highly unlikely but
not quite impossible?

---
 kernel/sched/fair.c |    4 ++++
 1 file changed, 4 insertions(+)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9454,11 +9454,15 @@ int can_migrate_task(struct task_struct

 	/*
 	 * We do not migrate tasks that are:
+	 * 0) not runnable (not useful here/now, but are annoying), or
 	 * 1) throttled_lb_pair, or
 	 * 2) cannot be migrated to this CPU due to cpus_ptr, or
 	 * 3) running (obviously), or
 	 * 4) are cache-hot on their current CPU.
 	 */
+	if (p->se.sched_delayed)
+		return 0;
+
 	if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
 		return 0;

Re: [PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU

Posted by Phil Auld 1 year, 2 months ago

On Thu, Nov 14, 2024 at 12:07:03PM +0100 Mike Galbraith wrote:
> On Tue, 2024-11-12 at 17:15 +0100, Mike Galbraith wrote:
> > On Tue, 2024-11-12 at 10:41 -0500, Phil Auld wrote:
> > > On Tue, Nov 12, 2024 at 03:23:38PM +0100 Mike Galbraith wrote:
> > >
> > > >
> > > > We don't however have to let sched_delayed block SIS though.  Rendering
> > > > them transparent in idle_cpu() did NOT wreck the progression, so
> > > > maaaybe could help your regression.
> > > >
> > >
> > > You mean something like:
> > >
> > > if (rq->nr_running > rq->h_nr_delayed)
> > >        return 0;
> > >
> > > in idle_cpu() instead of the straight rq->nr_running check?
> >
> > Yeah, close enough.
> 
> The below is all you need.
> 
> Watching blockage rate during part of a netperf scaling run without, a
> bit over 2/sec was the highest it got, but with, that drops to the same
> zero as turning off the feature, so... relevance highly unlikely but
> not quite impossible?
>

I'll give this a try on my issue. This'll be simpler than the other way.

Thanks!



Cheers,
Phil


> ---
>  kernel/sched/fair.c |    4 ++++
>  1 file changed, 4 insertions(+)
> 
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9454,11 +9454,15 @@ int can_migrate_task(struct task_struct
> 
>  	/*
>  	 * We do not migrate tasks that are:
> +	 * 0) not runnable (not useful here/now, but are annoying), or
>  	 * 1) throttled_lb_pair, or
>  	 * 2) cannot be migrated to this CPU due to cpus_ptr, or
>  	 * 3) running (obviously), or
>  	 * 4) are cache-hot on their current CPU.
>  	 */
> +	if (p->se.sched_delayed)
> +		return 0;
> +
>  	if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
>  		return 0;
> 
> 

--

Re: [PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU

Posted by Phil Auld 1 year, 2 months ago

On Thu, Nov 14, 2024 at 06:28:54AM -0500 Phil Auld wrote:
> On Thu, Nov 14, 2024 at 12:07:03PM +0100 Mike Galbraith wrote:
> > On Tue, 2024-11-12 at 17:15 +0100, Mike Galbraith wrote:
> > > On Tue, 2024-11-12 at 10:41 -0500, Phil Auld wrote:
> > > > On Tue, Nov 12, 2024 at 03:23:38PM +0100 Mike Galbraith wrote:
> > > >
> > > > >
> > > > > We don't however have to let sched_delayed block SIS though.  Rendering
> > > > > them transparent in idle_cpu() did NOT wreck the progression, so
> > > > > maaaybe could help your regression.
> > > > >
> > > >
> > > > You mean something like:
> > > >
> > > > if (rq->nr_running > rq->h_nr_delayed)
> > > >        return 0;
> > > >
> > > > in idle_cpu() instead of the straight rq->nr_running check?
> > >
> > > Yeah, close enough.
> > 
> > The below is all you need.
> > 
> > Watching blockage rate during part of a netperf scaling run without, a
> > bit over 2/sec was the highest it got, but with, that drops to the same
> > zero as turning off the feature, so... relevance highly unlikely but
> > not quite impossible?
> >
> 
> I'll give this a try on my issue. This'll be simpler than the other way.
>

This, below, by itself, did not do help and caused a small slowdown on some
other tests.  Did this need to be on top of the wakeup change? 


Cheers,
Phil

> Thanks!
> 
> 
> 
> Cheers,
> Phil
> 
> 
> > ---
> >  kernel/sched/fair.c |    4 ++++
> >  1 file changed, 4 insertions(+)
> > 
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -9454,11 +9454,15 @@ int can_migrate_task(struct task_struct
> > 
> >  	/*
> >  	 * We do not migrate tasks that are:
> > +	 * 0) not runnable (not useful here/now, but are annoying), or
> >  	 * 1) throttled_lb_pair, or
> >  	 * 2) cannot be migrated to this CPU due to cpus_ptr, or
> >  	 * 3) running (obviously), or
> >  	 * 4) are cache-hot on their current CPU.
> >  	 */
> > +	if (p->se.sched_delayed)
> > +		return 0;
> > +
> >  	if (throttled_lb_pair(task_group(p), env->src_cpu, env->dst_cpu))
> >  		return 0;
> > 
> > 
> 
> -- 
> 
> 

--

Re: [PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU

Posted by Mike Galbraith 1 year, 2 months ago

On Tue, 2024-11-19 at 06:30 -0500, Phil Auld wrote:
>
> This, below, by itself, did not do help and caused a small slowdown on some
> other tests.  Did this need to be on top of the wakeup change?

No, that made a mess.  The numbers said it was quite a reach, no
surprise it fell flat.

	-Mike

Re: [PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU

Posted by Mike Galbraith 1 year, 2 months ago

On Tue, 2024-11-19 at 12:51 +0100, Mike Galbraith wrote:
> On Tue, 2024-11-19 at 06:30 -0500, Phil Auld wrote:
> >
> > This, below, by itself, did not do help and caused a small slowdown on some
> > other tests.  Did this need to be on top of the wakeup change?
>
> No, that made a mess.

Rashly speculating that turning mobile kthread component loose is what
helped your write regression...

You could try adding (p->flags & PF_KTHREAD) to the wakeup patch to
only turn hard working kthreads loose to try to dodge service latency.
Seems unlikely wakeup frequency * instances would combine to shred fio
the way turning tbench loose did.

	-Mike

Re: [PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU

Posted by Phil Auld 1 year, 2 months ago

On Wed, Nov 20, 2024 at 07:37:39PM +0100 Mike Galbraith wrote:
> On Tue, 2024-11-19 at 12:51 +0100, Mike Galbraith wrote:
> > On Tue, 2024-11-19 at 06:30 -0500, Phil Auld wrote:
> > >
> > > This, below, by itself, did not do help and caused a small slowdown on some
> > > other tests.  Did this need to be on top of the wakeup change?
> >
> > No, that made a mess.
> 
> Rashly speculating that turning mobile kthread component loose is what
> helped your write regression...
> 
> You could try adding (p->flags & PF_KTHREAD) to the wakeup patch to
> only turn hard working kthreads loose to try to dodge service latency.
> Seems unlikely wakeup frequency * instances would combine to shred fio
> the way turning tbench loose did.
>

Thanks, I'll try that. 


Cheers,
Phil



> 	-Mike
> 

--

Re: [PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU

Posted by Phil Auld 1 year, 2 months ago

On Thu, Nov 21, 2024 at 06:56:28AM -0500 Phil Auld wrote:
> On Wed, Nov 20, 2024 at 07:37:39PM +0100 Mike Galbraith wrote:
> > On Tue, 2024-11-19 at 12:51 +0100, Mike Galbraith wrote:
> > > On Tue, 2024-11-19 at 06:30 -0500, Phil Auld wrote:
> > > >
> > > > This, below, by itself, did not do help and caused a small slowdown on some
> > > > other tests.  Did this need to be on top of the wakeup change?
> > >
> > > No, that made a mess.
> > 
> > Rashly speculating that turning mobile kthread component loose is what
> > helped your write regression...
> > 
> > You could try adding (p->flags & PF_KTHREAD) to the wakeup patch to
> > only turn hard working kthreads loose to try to dodge service latency.
> > Seems unlikely wakeup frequency * instances would combine to shred fio
> > the way turning tbench loose did.
> >
> 
> Thanks, I'll try that. 
>

Also, fwiw, I think there is another report here

https://lore.kernel.org/lkml/392209D9-5AC6-4FDE-8D84-FB8A82AD9AEF@oracle.com/

which seems to be the same thing but mis-bisected. I've asked them to try
with NO_DELAY_DEQUEUE just to be sure.  But it looks like a duck.


Cheers,
Phil

--

Re: [PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU

Posted by Phil Auld 1 year, 2 months ago

On Thu, Nov 21, 2024 at 07:07:04AM -0500 Phil Auld wrote:
> On Thu, Nov 21, 2024 at 06:56:28AM -0500 Phil Auld wrote:
> > On Wed, Nov 20, 2024 at 07:37:39PM +0100 Mike Galbraith wrote:
> > > On Tue, 2024-11-19 at 12:51 +0100, Mike Galbraith wrote:
> > > > On Tue, 2024-11-19 at 06:30 -0500, Phil Auld wrote:
> > > > >
> > > > > This, below, by itself, did not do help and caused a small slowdown on some
> > > > > other tests.  Did this need to be on top of the wakeup change?
> > > >
> > > > No, that made a mess.
> > > 
> > > Rashly speculating that turning mobile kthread component loose is what
> > > helped your write regression...
> > > 
> > > You could try adding (p->flags & PF_KTHREAD) to the wakeup patch to
> > > only turn hard working kthreads loose to try to dodge service latency.
> > > Seems unlikely wakeup frequency * instances would combine to shred fio
> > > the way turning tbench loose did.
> > >
> > 
> > Thanks, I'll try that. 
> >
> 
> Also, fwiw, I think there is another report here
> 
> https://lore.kernel.org/lkml/392209D9-5AC6-4FDE-8D84-FB8A82AD9AEF@oracle.com/
> 
> which seems to be the same thing but mis-bisected. I've asked them to try
> with NO_DELAY_DEQUEUE just to be sure.  But it looks like a duck.
>

But it does not quack like one.  Their issue did not go away with
NO_DELAY_DEQUEUE so something different is causing that one.


> 
> Cheers,
> Phil
> 
> -- 
> 
> 

--

[PATCH V2] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU

Posted by Mike Galbraith 1 year, 2 months ago

On Thu, 2024-11-21 at 06:56 -0500, Phil Auld wrote:
> On Wed, Nov 20, 2024 at 07:37:39PM +0100 Mike Galbraith wrote:
> > On Tue, 2024-11-19 at 12:51 +0100, Mike Galbraith wrote:
> > > On Tue, 2024-11-19 at 06:30 -0500, Phil Auld wrote:
> > > >
> > > > This, below, by itself, did not do help and caused a small slowdown on some
> > > > other tests.  Did this need to be on top of the wakeup change?
> > >
> > > No, that made a mess.
> >
> > Rashly speculating that turning mobile kthread component loose is what
> > helped your write regression...
> >
> > You could try adding (p->flags & PF_KTHREAD) to the wakeup patch to
> > only turn hard working kthreads loose to try to dodge service latency.
> > Seems unlikely wakeup frequency * instances would combine to shred fio
> > the way turning tbench loose did.
> >
>
> Thanks, I'll try that.

You may still want to try that, but my box says probably not.  Playing
with your write command line, the players I see are pinned kworkers and
mobile fio instances.

Maybe try the below instead. The changelog is obsolete BS unless you
say otherwise, but while twiddled V2 will still migrate tbench a bit,
and per trace_printk() does still let all kinds of stuff wander off to
roll the SIS dice, it does not even scratch the paint of the formerly
obliterated tbench progression.

Question: did wiping off the evil leave any meaningful goodness behind?

---

sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU

Phil Auld (Redhat) reported an fio benchmark regression having been found
to have been caused by addition of the DELAY_DEQUEUE feature, suggested it
may be related to wakees losing the ability to migrate, and confirmed that
restoration of same indeed did restore previous performance.

V2: do not rip buddies apart, convenient on/off switch

Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
Signed-off-by: Mike Galbraith <efault@gmx.de>
---
 kernel/sched/core.c     |   51 ++++++++++++++++++++++++++++++------------------
 kernel/sched/features.h |    5 ++++
 kernel/sched/sched.h    |    5 ++++
 3 files changed, 42 insertions(+), 19 deletions(-)

--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -3783,28 +3783,41 @@ ttwu_do_activate(struct rq *rq, struct t
  */
 static int ttwu_runnable(struct task_struct *p, int wake_flags)
 {
-	struct rq_flags rf;
-	struct rq *rq;
-	int ret = 0;
-
-	rq = __task_rq_lock(p, &rf);
-	if (task_on_rq_queued(p)) {
-		update_rq_clock(rq);
-		if (p->se.sched_delayed)
-			enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
-		if (!task_on_cpu(rq, p)) {
-			/*
-			 * When on_rq && !on_cpu the task is preempted, see if
-			 * it should preempt the task that is current now.
-			 */
-			wakeup_preempt(rq, p, wake_flags);
+	CLASS(__task_rq_lock, rq_guard)(p);
+	struct rq *rq = rq_guard.rq;
+
+	if (!task_on_rq_queued(p))
+		return 0;
+
+	update_rq_clock(rq);
+	if (p->se.sched_delayed) {
+		int queue_flags = ENQUEUE_DELAYED | ENQUEUE_NOCLOCK;
+		int dequeue = sched_feat(DEQUEUE_DELAYED);
+
+		/*
+		 * Since sched_delayed means we cannot be current anywhere,
+		 * dequeue it here and have it fall through to the
+		 * select_task_rq() case further along in ttwu() path.
+		 * Note: Do not rip buddies apart else chaos follows.
+		 */
+		if (dequeue && rq->nr_running > 1 && p->nr_cpus_allowed > 1 &&
+		    !(rq->curr->last_wakee == p || p->last_wakee == rq->curr)) {
+			dequeue_task(rq, p, DEQUEUE_SLEEP | queue_flags);
+			return 0;
 		}
-		ttwu_do_wakeup(p);
-		ret = 1;
+
+		enqueue_task(rq, p, queue_flags);
+	}
+	if (!task_on_cpu(rq, p)) {
+		/*
+		 * When on_rq && !on_cpu the task is preempted, see if
+		 * it should preempt the task that is current now.
+		 */
+		wakeup_preempt(rq, p, wake_flags);
 	}
-	__task_rq_unlock(rq, &rf);
+	ttwu_do_wakeup(p);

-	return ret;
+	return 1;
 }

 #ifdef CONFIG_SMP
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -47,6 +47,11 @@ SCHED_FEAT(CACHE_HOT_BUDDY, true)
  * DELAY_ZERO clips the lag on dequeue (or wakeup) to 0.
  */
 SCHED_FEAT(DELAY_DEQUEUE, true)
+/*
+ * Free ONLY non-buddy delayed tasks to wakeup-migrate to avoid taking.
+ * an unnecessary latency hit.  Rending buddies asunder inflicts harm.
+ */
+SCHED_FEAT(DEQUEUE_DELAYED, true)
 SCHED_FEAT(DELAY_ZERO, true)

 /*
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1783,6 +1783,11 @@ task_rq_unlock(struct rq *rq, struct tas
 	raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
 }

+DEFINE_LOCK_GUARD_1(__task_rq_lock, struct task_struct,
+		    _T->rq = __task_rq_lock(_T->lock, &_T->rf),
+		    __task_rq_unlock(_T->rq, &_T->rf),
+		    struct rq *rq; struct rq_flags rf)
+
 DEFINE_LOCK_GUARD_1(task_rq_lock, struct task_struct,
 		    _T->rq = task_rq_lock(_T->lock, &_T->rf),
 		    task_rq_unlock(_T->rq, _T->lock, &_T->rf),

Re: [PATCH V2] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU

Posted by Phil Auld 1 year, 2 months ago

On Sat, Nov 23, 2024 at 09:44:40AM +0100 Mike Galbraith wrote:
> On Thu, 2024-11-21 at 06:56 -0500, Phil Auld wrote:
> > On Wed, Nov 20, 2024 at 07:37:39PM +0100 Mike Galbraith wrote:
> > > On Tue, 2024-11-19 at 12:51 +0100, Mike Galbraith wrote:
> > > > On Tue, 2024-11-19 at 06:30 -0500, Phil Auld wrote:
> > > > >
> > > > > This, below, by itself, did not do help and caused a small slowdown on some
> > > > > other tests.  Did this need to be on top of the wakeup change?
> > > >
> > > > No, that made a mess.
> > >
> > > Rashly speculating that turning mobile kthread component loose is what
> > > helped your write regression...
> > >
> > > You could try adding (p->flags & PF_KTHREAD) to the wakeup patch to
> > > only turn hard working kthreads loose to try to dodge service latency.
> > > Seems unlikely wakeup frequency * instances would combine to shred fio
> > > the way turning tbench loose did.
> > >
> >
> > Thanks, I'll try that.
> 
> You may still want to try that, but my box says probably not.  Playing
> with your write command line, the players I see are pinned kworkers and
> mobile fio instances.
>

Yep. The PF_KTHREAD thing did not help. 


> Maybe try the below instead. The changelog is obsolete BS unless you
> say otherwise, but while twiddled V2 will still migrate tbench a bit,
> and per trace_printk() does still let all kinds of stuff wander off to
> roll the SIS dice, it does not even scratch the paint of the formerly
> obliterated tbench progression.
>

Will give this one a try when I get caught up from being off all week for
US turkey day. 


Thanks!

> Question: did wiping off the evil leave any meaningful goodness behind?


Is that for this patch?  

If you mean for the original patch (which subsequently broke the reads) then
no. It was more or less even for all the other tests. It fixed the randwrite
issue by moving it to randread. Everything else we run regularly was about
the same. So no extra goodness to help decide :)


Cheers,
Phil

> 
> ---
> 
> sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU
> 
> Phil Auld (Redhat) reported an fio benchmark regression having been found
> to have been caused by addition of the DELAY_DEQUEUE feature, suggested it
> may be related to wakees losing the ability to migrate, and confirmed that
> restoration of same indeed did restore previous performance.
> 
> V2: do not rip buddies apart, convenient on/off switch
> 
> Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
> Signed-off-by: Mike Galbraith <efault@gmx.de>
> ---
>  kernel/sched/core.c     |   51 ++++++++++++++++++++++++++++++------------------
>  kernel/sched/features.h |    5 ++++
>  kernel/sched/sched.h    |    5 ++++
>  3 files changed, 42 insertions(+), 19 deletions(-)
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3783,28 +3783,41 @@ ttwu_do_activate(struct rq *rq, struct t
>   */
>  static int ttwu_runnable(struct task_struct *p, int wake_flags)
>  {
> -	struct rq_flags rf;
> -	struct rq *rq;
> -	int ret = 0;
> -
> -	rq = __task_rq_lock(p, &rf);
> -	if (task_on_rq_queued(p)) {
> -		update_rq_clock(rq);
> -		if (p->se.sched_delayed)
> -			enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> -		if (!task_on_cpu(rq, p)) {
> -			/*
> -			 * When on_rq && !on_cpu the task is preempted, see if
> -			 * it should preempt the task that is current now.
> -			 */
> -			wakeup_preempt(rq, p, wake_flags);
> +	CLASS(__task_rq_lock, rq_guard)(p);
> +	struct rq *rq = rq_guard.rq;
> +
> +	if (!task_on_rq_queued(p))
> +		return 0;
> +
> +	update_rq_clock(rq);
> +	if (p->se.sched_delayed) {
> +		int queue_flags = ENQUEUE_DELAYED | ENQUEUE_NOCLOCK;
> +		int dequeue = sched_feat(DEQUEUE_DELAYED);
> +
> +		/*
> +		 * Since sched_delayed means we cannot be current anywhere,
> +		 * dequeue it here and have it fall through to the
> +		 * select_task_rq() case further along in ttwu() path.
> +		 * Note: Do not rip buddies apart else chaos follows.
> +		 */
> +		if (dequeue && rq->nr_running > 1 && p->nr_cpus_allowed > 1 &&
> +		    !(rq->curr->last_wakee == p || p->last_wakee == rq->curr)) {
> +			dequeue_task(rq, p, DEQUEUE_SLEEP | queue_flags);
> +			return 0;
>  		}
> -		ttwu_do_wakeup(p);
> -		ret = 1;
> +
> +		enqueue_task(rq, p, queue_flags);
> +	}
> +	if (!task_on_cpu(rq, p)) {
> +		/*
> +		 * When on_rq && !on_cpu the task is preempted, see if
> +		 * it should preempt the task that is current now.
> +		 */
> +		wakeup_preempt(rq, p, wake_flags);
>  	}
> -	__task_rq_unlock(rq, &rf);
> +	ttwu_do_wakeup(p);
> 
> -	return ret;
> +	return 1;
>  }
> 
>  #ifdef CONFIG_SMP
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -47,6 +47,11 @@ SCHED_FEAT(CACHE_HOT_BUDDY, true)
>   * DELAY_ZERO clips the lag on dequeue (or wakeup) to 0.
>   */
>  SCHED_FEAT(DELAY_DEQUEUE, true)
> +/*
> + * Free ONLY non-buddy delayed tasks to wakeup-migrate to avoid taking.
> + * an unnecessary latency hit.  Rending buddies asunder inflicts harm.
> + */
> +SCHED_FEAT(DEQUEUE_DELAYED, true)
>  SCHED_FEAT(DELAY_ZERO, true)
> 
>  /*
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1783,6 +1783,11 @@ task_rq_unlock(struct rq *rq, struct tas
>  	raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
>  }
> 
> +DEFINE_LOCK_GUARD_1(__task_rq_lock, struct task_struct,
> +		    _T->rq = __task_rq_lock(_T->lock, &_T->rf),
> +		    __task_rq_unlock(_T->rq, &_T->rf),
> +		    struct rq *rq; struct rq_flags rf)
> +
>  DEFINE_LOCK_GUARD_1(task_rq_lock, struct task_struct,
>  		    _T->rq = task_rq_lock(_T->lock, &_T->rf),
>  		    task_rq_unlock(_T->rq, _T->lock, &_T->rf),
> 

--

Re: [PATCH V2] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU

Posted by Mike Galbraith 1 year, 2 months ago

On Mon, 2024-12-02 at 11:24 -0500, Phil Auld wrote:
> On Sat, Nov 23, 2024 at 09:44:40AM +0100 Mike Galbraith wrote:
>
>
> > Question: did wiping off the evil leave any meaningful goodness behind?
>
> Is that for this patch?

Yeah.  Trying it on my box with your write command line didn't improve
the confidence level either.  My box has one CPU handling IRQs and
waking pinned workers to service 8 fio instances.  Patch was useless
for that.

	-Mike

Re: [PATCH V2] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU

Posted by Phil Auld 1 year, 2 months ago

On Mon, Dec 02, 2024 at 05:55:28PM +0100 Mike Galbraith wrote:
> On Mon, 2024-12-02 at 11:24 -0500, Phil Auld wrote:
> > On Sat, Nov 23, 2024 at 09:44:40AM +0100 Mike Galbraith wrote:
> >
> >
> > > Question: did wiping off the evil leave any meaningful goodness behind?
> >
> > Is that for this patch?
> 
> Yeah.  Trying it on my box with your write command line didn't improve
> the confidence level either.  My box has one CPU handling IRQs and
> waking pinned workers to service 8 fio instances.  Patch was useless
> for that.
>

I'll give it a try. Our "box" is multiple different boxes but the results
vary somewhat.  The one I sent info about earlier in this thread is just
one of the more egregious and is the one the perf team lent me for a while.


Cheers,
Phil


> 	-Mike
> 
> 

--

Re: [PATCH V2] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU

Posted by Phil Auld 1 year, 2 months ago

Hi Mike et al.,

On Mon, Dec 02, 2024 at 02:12:52PM -0500 Phil Auld wrote:
> On Mon, Dec 02, 2024 at 05:55:28PM +0100 Mike Galbraith wrote:
> > On Mon, 2024-12-02 at 11:24 -0500, Phil Auld wrote:
> > > On Sat, Nov 23, 2024 at 09:44:40AM +0100 Mike Galbraith wrote:
> > >
> > >
> > > > Question: did wiping off the evil leave any meaningful goodness behind?
> > >
> > > Is that for this patch?
> > 
> > Yeah.  Trying it on my box with your write command line didn't improve
> > the confidence level either.  My box has one CPU handling IRQs and
> > waking pinned workers to service 8 fio instances.  Patch was useless
> > for that.
> >
> 
> I'll give it a try. Our "box" is multiple different boxes but the results
> vary somewhat.  The one I sent info about earlier in this thread is just
> one of the more egregious and is the one the perf team lent me for a while.
>

In our testing this has the same effect as the original dequeue-when-delayed
fix.  It solves the randwrite issue and introduces the ~10-15% randread
regression. 

Seems to be a real trade-off here. The same guys who benefit from spreading
in one case benefit from staying put in the other... 


Cheers,
Phil

--

Re: [PATCH V2] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU

Posted by Mike Galbraith 1 year, 2 months ago

On Mon, 2024-12-09 at 08:11 -0500, Phil Auld wrote:
>
> Hi Mike et al.,
>
> On Mon, Dec 02, 2024 at 02:12:52PM -0500 Phil Auld wrote:
> > On Mon, Dec 02, 2024 at 05:55:28PM +0100 Mike Galbraith wrote:
> > > On Mon, 2024-12-02 at 11:24 -0500, Phil Auld wrote:
> > > > On Sat, Nov 23, 2024 at 09:44:40AM +0100 Mike Galbraith wrote:
> > > >
> > > >
> > > > > Question: did wiping off the evil leave any meaningful goodness behind?
> > > >
> > > > Is that for this patch?
> > >
> > > Yeah.  Trying it on my box with your write command line didn't improve
> > > the confidence level either.  My box has one CPU handling IRQs and
> > > waking pinned workers to service 8 fio instances.  Patch was useless
> > > for that.
> > >
> >
> > I'll give it a try. Our "box" is multiple different boxes but the results
> > vary somewhat.  The one I sent info about earlier in this thread is just
> > one of the more egregious and is the one the perf team lent me for a while.
> >
>
> In our testing this has the same effect as the original dequeue-when-delayed
> fix.  It solves the randwrite issue and introduces the ~10-15% randread
> regression.
>
> Seems to be a real trade-off here. The same guys who benefit from spreading
> in one case benefit from staying put in the other...

Does as much harm as it does good isn't the mark of a keeper.  Oh well.

	-Mike

Re: [PATCH V2] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU

Posted by K Prateek Nayak 1 year, 2 months ago

Hello Mike,

On 11/23/2024 2:14 PM, Mike Galbraith wrote:
> [..snip..]
> 
> Question: did wiping off the evil leave any meaningful goodness behind?
> 
> ---
> 
> sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU
> 
> Phil Auld (Redhat) reported an fio benchmark regression having been found
> to have been caused by addition of the DELAY_DEQUEUE feature, suggested it
> may be related to wakees losing the ability to migrate, and confirmed that
> restoration of same indeed did restore previous performance.
> 
> V2: do not rip buddies apart, convenient on/off switch
> 
> Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
> Signed-off-by: Mike Galbraith <efault@gmx.de>
> ---
>   kernel/sched/core.c     |   51 ++++++++++++++++++++++++++++++------------------
>   kernel/sched/features.h |    5 ++++
>   kernel/sched/sched.h    |    5 ++++
>   3 files changed, 42 insertions(+), 19 deletions(-)
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3783,28 +3783,41 @@ ttwu_do_activate(struct rq *rq, struct t
>    */
>   static int ttwu_runnable(struct task_struct *p, int wake_flags)
>   {
> -	struct rq_flags rf;
> -	struct rq *rq;
> -	int ret = 0;
> -
> -	rq = __task_rq_lock(p, &rf);
> -	if (task_on_rq_queued(p)) {
> -		update_rq_clock(rq);
> -		if (p->se.sched_delayed)
> -			enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> -		if (!task_on_cpu(rq, p)) {
> -			/*
> -			 * When on_rq && !on_cpu the task is preempted, see if
> -			 * it should preempt the task that is current now.
> -			 */
> -			wakeup_preempt(rq, p, wake_flags);
> +	CLASS(__task_rq_lock, rq_guard)(p);
> +	struct rq *rq = rq_guard.rq;
> +
> +	if (!task_on_rq_queued(p))
> +		return 0;
> +
> +	update_rq_clock(rq);
> +	if (p->se.sched_delayed) {
> +		int queue_flags = ENQUEUE_DELAYED | ENQUEUE_NOCLOCK;
> +		int dequeue = sched_feat(DEQUEUE_DELAYED);
> +
> +		/*
> +		 * Since sched_delayed means we cannot be current anywhere,
> +		 * dequeue it here and have it fall through to the
> +		 * select_task_rq() case further along in ttwu() path.
> +		 * Note: Do not rip buddies apart else chaos follows.
> +		 */
> +		if (dequeue && rq->nr_running > 1 && p->nr_cpus_allowed > 1 &&

Do we really care if DEQUEUE_DELAYED is enabled / disabled here when we
encounter a delayed task?

> +		    !(rq->curr->last_wakee == p || p->last_wakee == rq->curr)) {

Technically, we are still looking at the last wakeup here since
record_wakee() is only called later. If we care about 1:1 buddies,
should we just see "current == p->last_wakee", otherwise, there is a
good chance "p" has a m:n waker-wakee relationship in which case
perhaps a "want_affine" like heuristic can help?

For science, I was wondering if the decision to dequeue + migrate or
requeue the delayed task can be put off until after the whole
select_task_rq() target selection (note: without the h_nr_delayed
stuff, some of that wake_affine_idle() logic falls apart). Hackbench
(which saw some regression with EEVDF Complete) seem to like it
somewhat, but it still falls behind NO_DELAY_DEQUEUE.

    ==================================================================
     Test          : hackbench
     Units         : Normalized time in seconds
     Interpretation: Lower is better
     Statistic     : AMean
     ==================================================================
     NO_DELAY_DEQUEUE	     Mike's v2	  Full ttwu + requeue/migrate
     5.76	           5.72  (  1% )        5.82  ( -1% )
     6.53	           6.56  (  0% )        6.65  ( -2% )
     6.79	           7.04  ( -4% )        7.02  ( -3% )
     6.91	           7.04  ( -2% )        7.03  ( -2% )
     7.63	           8.05  ( -6% )        7.88  ( -3% )

Only subtle changes in IBS profiles; there aren't any obvious shift
in hotspots with hackbench at least. Not sure if it is just the act of
needing to do a dequeue + enqueue from the wakeup context that adds to
the overall regression.

-- 
Thanks and Regards,
Prateek

> +			dequeue_task(rq, p, DEQUEUE_SLEEP | queue_flags);
> +			return 0;
>   		}
> -		ttwu_do_wakeup(p);
> -		ret = 1;
> +
> +		enqueue_task(rq, p, queue_flags);
> +	}
> +	if (!task_on_cpu(rq, p)) {
> +		/*
> +		 * When on_rq && !on_cpu the task is preempted, see if
> +		 * it should preempt the task that is current now.
> +		 */
> +		wakeup_preempt(rq, p, wake_flags);
>   	}
> -	__task_rq_unlock(rq, &rf);
> +	ttwu_do_wakeup(p);
> 
> -	return ret;
> +	return 1;
>   }
> 
>   #ifdef CONFIG_SMP
> [..snip..]

Re: [PATCH V2] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU

Posted by Mike Galbraith 1 year, 2 months ago

On Tue, 2024-11-26 at 11:02 +0530, K Prateek Nayak wrote:
> Hello Mike,
> 
> On 11/23/2024 2:14 PM, Mike Galbraith wrote:
> > [..snip..]
> > 
> > Question: did wiping off the evil leave any meaningful goodness behind?
> > 
> > ---
> > 
> > sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU
> > 
> > Phil Auld (Redhat) reported an fio benchmark regression having been found
> > to have been caused by addition of the DELAY_DEQUEUE feature, suggested it
> > may be related to wakees losing the ability to migrate, and confirmed that
> > restoration of same indeed did restore previous performance.
> > 
> > V2: do not rip buddies apart, convenient on/off switch
> > 
> > Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
> > Signed-off-by: Mike Galbraith <efault@gmx.de>
> > ---
> >   kernel/sched/core.c     |   51 ++++++++++++++++++++++++++++++------------------
> >   kernel/sched/features.h |    5 ++++
> >   kernel/sched/sched.h    |    5 ++++
> >   3 files changed, 42 insertions(+), 19 deletions(-)
> > 
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3783,28 +3783,41 @@ ttwu_do_activate(struct rq *rq, struct t
> >    */
> >   static int ttwu_runnable(struct task_struct *p, int wake_flags)
> >   {
> > -       struct rq_flags rf;
> > -       struct rq *rq;
> > -       int ret = 0;
> > -
> > -       rq = __task_rq_lock(p, &rf);
> > -       if (task_on_rq_queued(p)) {
> > -               update_rq_clock(rq);
> > -               if (p->se.sched_delayed)
> > -                       enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> > -               if (!task_on_cpu(rq, p)) {
> > -                       /*
> > -                        * When on_rq && !on_cpu the task is preempted, see if
> > -                        * it should preempt the task that is current now.
> > -                        */
> > -                       wakeup_preempt(rq, p, wake_flags);
> > +       CLASS(__task_rq_lock, rq_guard)(p);
> > +       struct rq *rq = rq_guard.rq;
> > +
> > +       if (!task_on_rq_queued(p))
> > +               return 0;
> > +
> > +       update_rq_clock(rq);
> > +       if (p->se.sched_delayed) {
> > +               int queue_flags = ENQUEUE_DELAYED | ENQUEUE_NOCLOCK;
> > +               int dequeue = sched_feat(DEQUEUE_DELAYED);
> > +
> > +               /*
> > +                * Since sched_delayed means we cannot be current anywhere,
> > +                * dequeue it here and have it fall through to the
> > +                * select_task_rq() case further along in ttwu() path.
> > +                * Note: Do not rip buddies apart else chaos follows.
> > +                */
> > +               if (dequeue && rq->nr_running > 1 && p->nr_cpus_allowed > 1 &&
> 
> Do we really care if DEQUEUE_DELAYED is enabled / disabled here when we
> encounter a delayed task?

The switch is for test convenience.

> > +                   !(rq->curr->last_wakee == p || p->last_wakee == rq->curr)) {
> 
> Technically, we are still looking at the last wakeup here since
> record_wakee() is only called later. If we care about 1:1 buddies,
> should we just see "current == p->last_wakee", otherwise, there is a
> good chance "p" has a m:n waker-wakee relationship in which case
> perhaps a "want_affine" like heuristic can help?

The intent is to blunt the instrument a bit. Paul should have a highly
active interrupt source, which will give wakeup credit to whatever is
sitting on that CPU, breaking 1:1 connections.. a little bit.   That's
why it still migrates tbench buddies, but NOT at a rate that turns a
tbench progression into a new low regression.  The hope is that the
load shift caused by that active interrupt source is enough to give
Paul's regression some of the help it demonstrated wanting, without the
collateral damage.  It might now be so weak as to not meet the
"meaningful" in my question, in which case it lands on the ginormous
pile of meh, sorta works, but why would anyone care.

> For science, I was wondering if the decision to dequeue + migrate or
> requeue the delayed task can be put off until after the whole
> select_task_rq() target selection (note: without the h_nr_delayed
> stuff, some of that wake_affine_idle() logic falls apart). Hackbench
> (which saw some regression with EEVDF Complete) seem to like it
> somewhat, but it still falls behind NO_DELAY_DEQUEUE.

You can, with a few more fast path cycles and some duplication, none of
which looks very desirable.
 
>     ==================================================================
>      Test          : hackbench
>      Units         : Normalized time in seconds
>      Interpretation: Lower is better
>      Statistic     : AMean
>      ==================================================================
>      NO_DELAY_DEQUEUE        Mike's v2    Full ttwu + requeue/migrate
>      5.76                  5.72  (  1% )        5.82  ( -1% )
>      6.53                  6.56  (  0% )        6.65  ( -2% )
>      6.79                  7.04  ( -4% )        7.02  ( -3% )
>      6.91                  7.04  ( -2% )        7.03  ( -2% )
>      7.63                  8.05  ( -6% )        7.88  ( -3% )
> 
> Only subtle changes in IBS profiles; there aren't any obvious shift
> in hotspots with hackbench at least. Not sure if it is just the act of
> needing to do a dequeue + enqueue from the wakeup context that adds to
> the overall regression.

Those numbers say say to me that hackbench doesn't care deeply.  That
works for me, because I don't care deeply about nutty fork bombs ;-)

	-Mike

Re: [PATCH V2] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU

Posted by Mike Galbraith 1 year, 2 months ago

On Tue, 2024-11-26 at 07:30 +0100, Mike Galbraith wrote:
>
> The intent is to blunt the instrument a bit. Paul should have a highly
> active interrupt source, which will give wakeup credit to whatever is
> sitting on that CPU, breaking 1:1 connections.. a little bit.   That's
> why it still migrates tbench buddies, but NOT at a rate that turns a
> tbench progression into a new low regression.

BTW, the reason for the tbench wreckage being so bad is that when the
box is near saturation, not only are a portion of the surviving
sched_delayed tasks affine wakeups (always the optimal configuration
for this fast mover cliebt/server pair in an L3 equipped box), they are
exclusively affine wakeups.  That is most definitely gonna hurt.

When saturating that becomes the best option for a lot of client/server
pairs, even those with a lot of concurrency.  Turning them loose to
migrate at that time is far more likely than not to hurt a LOT, so V1
was doomed.

> The hope is that the
> load shift caused by that active interrupt source is enough to give
> Paul's regression some of the help it demonstrated wanting, without the
> collateral damage.  It might now be so weak as to not meet the
> "meaningful" in my question, in which case it lands on the ginormous
> pile of meh, sorta works, but why would anyone care.

In my IO challenged box, patch is useless to fio, nothing can help a
load where all of the IO action, and wimpy action at that, is nailed to
one CPU.  I can see it helping other latency sensitive stuff, like say
1:N mother of all work and/or control threads (and ilk), but if Phil's
problematic box looks anything like this box.. nah, it's a long reach.

	-Mike

Re: [PATCH V2] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU

Posted by Mike Galbraith 1 year, 2 months ago

On Tue, 2024-11-26 at 07:30 +0100, Mike Galbraith wrote:
>
> The intent is to blunt the instrument a bit. Paul should have

Yeah I did... ahem, I meant of course Phil.

	-Mike

Re: [PATCH V2] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU

Posted by Phil Auld 1 year, 2 months ago

On Tue, Nov 26, 2024 at 10:42:37AM +0100 Mike Galbraith wrote:
> On Tue, 2024-11-26 at 07:30 +0100, Mike Galbraith wrote:
> >
> > The intent is to blunt the instrument a bit. Paul should have
> 
> Yeah I did... ahem, I meant of course Phil.
>

Heh, you are not alone, Mike  :)


> 	-Mike
> 

--

Re: [PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU

Posted by Xuewen Yan 1 year, 2 months ago

Hi Mike and Peter,

On Fri, Nov 8, 2024 at 8:28 AM Mike Galbraith <efault@gmx.de> wrote:
>
> On Thu, 2024-11-07 at 15:09 +0100, Peter Zijlstra wrote:
> > On Thu, Nov 07, 2024 at 03:02:36PM +0100, Mike Galbraith wrote:
> > > On Thu, 2024-11-07 at 10:46 +0100, Mike Galbraith wrote:
> > > > On Thu, 2024-11-07 at 05:03 +0100, Mike Galbraith wrote:
> > > > >
> > > > > I built that patch out of curiosity, and yeah, set_next_task_fair()
> > > > > finding a cfs_rq->curr ends play time pretty quickly.
> > > >
> > > > The below improved uptime, and trace_printk() says it's doing the
> > > > intended, so I suppose I'll add a feature and see what falls out.
> > >
> > > From netperf, I got.. number tabulation practice.  Three runs of each
> > > test with and without produced nothing but variance/noise.
> >
> > Make it go away then.
> >
> > If you could write a Changelog for you inspired bit and stick my cleaned
> > up version under it, I'd be much obliged.
>
> Salut, much obliged for eyeball relief.
>
> ---snip---
>
> Phil Auld (Redhat) reported an fio benchmark regression having been found
> to have been caused by addition of the DELAY_DEQUEUE feature, suggested it
> may be related to wakees losing the ability to migrate, and confirmed that
> restoration of same indeed did restore previous performance.
>
> (de-uglified-a-lot-by)
>
> Reported-by: Phil Auld <pauld@redhat.com>
> Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
> Link: https://lore.kernel.org/lkml/20241101124715.GA689589@pauld.westford.csb/
> Signed-off-by: Mike Galbraith <efault@gmx.de>
> ---
>  kernel/sched/core.c  |   48 +++++++++++++++++++++++++++++-------------------
>  kernel/sched/sched.h |    5 +++++
>  2 files changed, 34 insertions(+), 19 deletions(-)
>
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3783,28 +3783,38 @@ ttwu_do_activate(struct rq *rq, struct t
>   */
>  static int ttwu_runnable(struct task_struct *p, int wake_flags)
>  {
> -       struct rq_flags rf;
> -       struct rq *rq;
> -       int ret = 0;
> -
> -       rq = __task_rq_lock(p, &rf);
> -       if (task_on_rq_queued(p)) {
> -               update_rq_clock(rq);
> -               if (p->se.sched_delayed)
> -                       enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> -               if (!task_on_cpu(rq, p)) {
> -                       /*
> -                        * When on_rq && !on_cpu the task is preempted, see if
> -                        * it should preempt the task that is current now.
> -                        */
> -                       wakeup_preempt(rq, p, wake_flags);
> +       CLASS(__task_rq_lock, rq_guard)(p);
> +       struct rq *rq = rq_guard.rq;
> +
> +       if (!task_on_rq_queued(p))
> +               return 0;
> +
> +       update_rq_clock(rq);
> +       if (p->se.sched_delayed) {
> +               int queue_flags = ENQUEUE_DELAYED | ENQUEUE_NOCLOCK;
> +
> +               /*
> +                * Since sched_delayed means we cannot be current anywhere,
> +                * dequeue it here and have it fall through to the
> +                * select_task_rq() case further along the ttwu() path.
> +                */
> +               if (rq->nr_running > 1 && p->nr_cpus_allowed > 1) {

For sched_asym_cpucapacity system, need we consider the
task_fits_cpu_capacity there?


> +                       dequeue_task(rq, p, DEQUEUE_SLEEP | queue_flags);
> +                       return 0;
>                 }
> -               ttwu_do_wakeup(p);
> -               ret = 1;
> +
> +               enqueue_task(rq, p, queue_flags);
> +       }
> +       if (!task_on_cpu(rq, p)) {
> +               /*
> +                * When on_rq && !on_cpu the task is preempted, see if
> +                * it should preempt the task that is current now.
> +                */
> +               wakeup_preempt(rq, p, wake_flags);
>         }
> -       __task_rq_unlock(rq, &rf);
> +       ttwu_do_wakeup(p);
>
> -       return ret;
> +       return 1;
>  }
>
>  #ifdef CONFIG_SMP
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1779,6 +1779,11 @@ task_rq_unlock(struct rq *rq, struct tas
>         raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
>  }
>
> +DEFINE_LOCK_GUARD_1(__task_rq_lock, struct task_struct,
> +                   _T->rq = __task_rq_lock(_T->lock, &_T->rf),
> +                   __task_rq_unlock(_T->rq, &_T->rf),
> +                   struct rq *rq; struct rq_flags rf)
> +
>  DEFINE_LOCK_GUARD_1(task_rq_lock, struct task_struct,
>                     _T->rq = task_rq_lock(_T->lock, &_T->rf),
>                     task_rq_unlock(_T->rq, _T->lock, &_T->rf),
>
>
>

Re: [PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU

Posted by Mike Galbraith 1 year, 2 months ago

On Mon, 2024-11-11 at 10:46 +0800, Xuewen Yan wrote:
> > 
> > +       if (p->se.sched_delayed) {
> > +               int queue_flags = ENQUEUE_DELAYED | ENQUEUE_NOCLOCK;
> > +
> > +               /*
> > +                * Since sched_delayed means we cannot be current anywhere,
> > +                * dequeue it here and have it fall through to the
> > +                * select_task_rq() case further along the ttwu() path.
> > +                */
> > +               if (rq->nr_running > 1 && p->nr_cpus_allowed > 1) {
> 
> For sched_asym_cpucapacity system, need we consider the
> task_fits_cpu_capacity there?

I don't think so.  Wakeup placement logic is what we're deflecting the
wakee toward, this is not the right spot to add any complexity.


	-Mike

Re: [PATCH] sched/fair: Dequeue sched_delayed tasks when waking to a busy CPU

Posted by Phil Auld 1 year, 3 months ago

On Fri, Nov 08, 2024 at 01:24:35AM +0100 Mike Galbraith wrote:
> On Thu, 2024-11-07 at 15:09 +0100, Peter Zijlstra wrote:
> > On Thu, Nov 07, 2024 at 03:02:36PM +0100, Mike Galbraith wrote:
> > > On Thu, 2024-11-07 at 10:46 +0100, Mike Galbraith wrote:
> > > > On Thu, 2024-11-07 at 05:03 +0100, Mike Galbraith wrote:
> > > > >
> > > > > I built that patch out of curiosity, and yeah, set_next_task_fair()
> > > > > finding a cfs_rq->curr ends play time pretty quickly.
> > > >
> > > > The below improved uptime, and trace_printk() says it's doing the
> > > > intended, so I suppose I'll add a feature and see what falls out.
> > >
> > > From netperf, I got.. number tabulation practice.  Three runs of each
> > > test with and without produced nothing but variance/noise.
> >
> > Make it go away then.
> >
> > If you could write a Changelog for you inspired bit and stick my cleaned
> > up version under it, I'd be much obliged.
> 
> Salut, much obliged for eyeball relief.
>

Thanks Mike (and Peter).  We have our full perf tests running on Mike's
original verion of this patch. Results probably Monday (there's a long
queue). We'll see if this blows up anything else then.  I'll queue up a
build with this cleaned up version as well but the results will be late
next week, probably.

At that point maybe some or all of these:

Suggested-by: Phil Auld <pauld@redhat.com> 
Reviewed-by: Phil Auld <pauld@redhat.com>
Tested-by: Jirka Hladky <jhladky@redhat.com>


Cheers,
Phil


> ---snip---
> 
> Phil Auld (Redhat) reported an fio benchmark regression having been found
> to have been caused by addition of the DELAY_DEQUEUE feature, suggested it
> may be related to wakees losing the ability to migrate, and confirmed that
> restoration of same indeed did restore previous performance.
> 
> (de-uglified-a-lot-by)
> 
> Reported-by: Phil Auld <pauld@redhat.com>
> Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue")
> Link: https://lore.kernel.org/lkml/20241101124715.GA689589@pauld.westford.csb/
> Signed-off-by: Mike Galbraith <efault@gmx.de>
> ---
>  kernel/sched/core.c  |   48 +++++++++++++++++++++++++++++-------------------
>  kernel/sched/sched.h |    5 +++++
>  2 files changed, 34 insertions(+), 19 deletions(-)
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3783,28 +3783,38 @@ ttwu_do_activate(struct rq *rq, struct t
>   */
>  static int ttwu_runnable(struct task_struct *p, int wake_flags)
>  {
> -	struct rq_flags rf;
> -	struct rq *rq;
> -	int ret = 0;
> -
> -	rq = __task_rq_lock(p, &rf);
> -	if (task_on_rq_queued(p)) {
> -		update_rq_clock(rq);
> -		if (p->se.sched_delayed)
> -			enqueue_task(rq, p, ENQUEUE_NOCLOCK | ENQUEUE_DELAYED);
> -		if (!task_on_cpu(rq, p)) {
> -			/*
> -			 * When on_rq && !on_cpu the task is preempted, see if
> -			 * it should preempt the task that is current now.
> -			 */
> -			wakeup_preempt(rq, p, wake_flags);
> +	CLASS(__task_rq_lock, rq_guard)(p);
> +	struct rq *rq = rq_guard.rq;
> +
> +	if (!task_on_rq_queued(p))
> +		return 0;
> +
> +	update_rq_clock(rq);
> +	if (p->se.sched_delayed) {
> +		int queue_flags = ENQUEUE_DELAYED | ENQUEUE_NOCLOCK;
> +
> +		/*
> +		 * Since sched_delayed means we cannot be current anywhere,
> +		 * dequeue it here and have it fall through to the
> +		 * select_task_rq() case further along the ttwu() path.
> +		 */
> +		if (rq->nr_running > 1 && p->nr_cpus_allowed > 1) {
> +			dequeue_task(rq, p, DEQUEUE_SLEEP | queue_flags);
> +			return 0;
>  		}
> -		ttwu_do_wakeup(p);
> -		ret = 1;
> +
> +		enqueue_task(rq, p, queue_flags);
> +	}
> +	if (!task_on_cpu(rq, p)) {
> +		/*
> +		 * When on_rq && !on_cpu the task is preempted, see if
> +		 * it should preempt the task that is current now.
> +		 */
> +		wakeup_preempt(rq, p, wake_flags);
>  	}
> -	__task_rq_unlock(rq, &rf);
> +	ttwu_do_wakeup(p);
> 
> -	return ret;
> +	return 1;
>  }
> 
>  #ifdef CONFIG_SMP
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1779,6 +1779,11 @@ task_rq_unlock(struct rq *rq, struct tas
>  	raw_spin_unlock_irqrestore(&p->pi_lock, rf->flags);
>  }
> 
> +DEFINE_LOCK_GUARD_1(__task_rq_lock, struct task_struct,
> +		    _T->rq = __task_rq_lock(_T->lock, &_T->rf),
> +		    __task_rq_unlock(_T->rq, &_T->rf),
> +		    struct rq *rq; struct rq_flags rf)
> +
>  DEFINE_LOCK_GUARD_1(task_rq_lock, struct task_struct,
>  		    _T->rq = task_rq_lock(_T->lock, &_T->rf),
>  		    task_rq_unlock(_T->rq, _T->lock, &_T->rf),
> 
> 

--