[PATCH] sched/eevdf: Fix wakeup-preempt by checking cfs_rq->nr_running

Chen Yu posted 1 patch 2 months ago
There is a newer version of this series
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
[PATCH] sched/eevdf: Fix wakeup-preempt by checking cfs_rq->nr_running
Posted by Chen Yu 2 months ago
Commit 85e511df3cec ("sched/eevdf: Allow shorter slices to wakeup-preempt")
introduced a mechanism that a wakee with shorter slice could preempt
the current running task. It also lower the bar for the current task
to be preempted, by checking the rq->nr_running instead of cfs_rq->nr_running
when the current task has ran out of time slice. Say, if there is 1 cfs
task and 1 rt task, before 85e511df3cec, update_deadline() will
not trigger a reschedule, and after 85e511df3cec, since rq->nr_running
is 2 and resched is true, a resched_curr() would happen.

Some workloads (like the hackbench reported by lkp) do not like
over-scheduling. We can see that the preemption rate has been
increased by 2.2%:

1.654e+08            +2.2%   1.69e+08        hackbench.time.involuntary_context_switches

Restore its previous check criterion.

Fixes: 85e511df3cec ("sched/eevdf: Allow shorter slices to wakeup-preempt")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202409231416.9403c2e9-oliver.sang@intel.com
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 225b31aaee55..2859fc7e2da2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1025,7 +1025,7 @@ static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	/*
 	 * The task has consumed its request, reschedule.
 	 */
-	return true;
+	return (cfs_rq->nr_running > 1);
 }
 
 #include "pelt.h"
-- 
2.25.1
Re: [PATCH] sched/eevdf: Fix wakeup-preempt by checking cfs_rq->nr_running
Posted by K Prateek Nayak 2 months ago
Hello Chenyu,

On 9/23/2024 12:51 PM, Chen Yu wrote:
> Commit 85e511df3cec ("sched/eevdf: Allow shorter slices to wakeup-preempt")
> introduced a mechanism that a wakee with shorter slice could preempt
> the current running task. It also lower the bar for the current task
> to be preempted, by checking the rq->nr_running instead of cfs_rq->nr_running
> when the current task has ran out of time slice. Say, if there is 1 cfs
> task and 1 rt task, before 85e511df3cec, update_deadline() will
> not trigger a reschedule, and after 85e511df3cec, since rq->nr_running
> is 2 and resched is true, a resched_curr() would happen.
> 
> Some workloads (like the hackbench reported by lkp) do not like
> over-scheduling. We can see that the preemption rate has been
> increased by 2.2%:
> 
> 1.654e+08            +2.2%   1.69e+08        hackbench.time.involuntary_context_switches
> 
> Restore its previous check criterion.
> 
> Fixes: 85e511df3cec ("sched/eevdf: Allow shorter slices to wakeup-preempt")
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Closes: https://lore.kernel.org/oe-lkp/202409231416.9403c2e9-oliver.sang@intel.com
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>

Gave it a spin on my dual socket 3rd Generation EPYC System and I do not
as big a jump in hackbench numbers as Oliver reported, most likely
because I couldn't emulate the exact scenario where a fair task is
running in presence of an RT task queued. Following are numbers from my
testing:

==================================================================
Test          : hackbench
Units         : Normalized time in seconds
Interpretation: Lower is better
Statistic     : AMean
==================================================================
Case:           tip[pct imp](CV)    preempt-fix[pct imp](CV)
  1-groups     1.00 [ -0.00]( 2.60)     1.00 [  0.17]( 2.12)
  2-groups     1.00 [ -0.00]( 1.21)     0.98 [  2.05]( 0.95)
  4-groups     1.00 [ -0.00]( 1.63)     0.97 [  2.65]( 1.53)
  8-groups     1.00 [ -0.00]( 1.34)     0.99 [  0.81]( 1.33)
16-groups     1.00 [ -0.00]( 2.07)     0.98 [  2.31]( 1.09)
--

Feel free to include:

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

> ---
>   kernel/sched/fair.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 225b31aaee55..2859fc7e2da2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1025,7 +1025,7 @@ static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
>   	/*
>   	 * The task has consumed its request, reschedule.
>   	 */
> -	return true;
> +	return (cfs_rq->nr_running > 1);

Was there a strong reason why Peter decided to use "rq->nr_running"
instead of "cfs_rq->nr_running" with PREEMPT_SHORT in update_curr()?

I wonder if it was to force a pick_next_task() cycle to dequeue a
possibly delayed entity but AFAICT, "cfs_rq->nr_running" should
account for the delayed entity still on the cfs_rq and perhaps the
early return in update_curr() can just be changed to use
"cfs_rq->nr_running". Not sure if I'm missing something trivial.

>   }
>   
>   #include "pelt.h"

-- 
Thanks and Regards,
Prateek
Re: [PATCH] sched/eevdf: Fix wakeup-preempt by checking cfs_rq->nr_running
Posted by Chen Yu 2 months ago
Hello Prateek,

On Tue, Sep 24, 2024 at 6:28 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello Chenyu,
>
> On 9/23/2024 12:51 PM, Chen Yu wrote:
> > Commit 85e511df3cec ("sched/eevdf: Allow shorter slices to wakeup-preempt")
> > introduced a mechanism that a wakee with shorter slice could preempt
> > the current running task. It also lower the bar for the current task
> > to be preempted, by checking the rq->nr_running instead of cfs_rq->nr_running
> > when the current task has ran out of time slice. Say, if there is 1 cfs
> > task and 1 rt task, before 85e511df3cec, update_deadline() will
> > not trigger a reschedule, and after 85e511df3cec, since rq->nr_running
> > is 2 and resched is true, a resched_curr() would happen.
> >
> > Some workloads (like the hackbench reported by lkp) do not like
> > over-scheduling. We can see that the preemption rate has been
> > increased by 2.2%:
> >
> > 1.654e+08            +2.2%   1.69e+08        hackbench.time.involuntary_context_switches
> >
> > Restore its previous check criterion.
> >
> > Fixes: 85e511df3cec ("sched/eevdf: Allow shorter slices to wakeup-preempt")
> > Reported-by: kernel test robot <oliver.sang@intel.com>
> > Closes: https://lore.kernel.org/oe-lkp/202409231416.9403c2e9-oliver.sang@intel.com
> > Signed-off-by: Chen Yu <yu.c.chen@intel.com>
>
> Gave it a spin on my dual socket 3rd Generation EPYC System and I do not
> as big a jump in hackbench numbers as Oliver reported, most likely
> because I couldn't emulate the exact scenario where a fair task is
> running in presence of an RT task queued.

Actually I did not dig into Oliver's test scenario. While looking at the commit
I thought that there could be scenario where a rt and a cfs task together might
make a difference in preemption which caused this problem.

> Following are numbers from my
> testing:
>
> ==================================================================
> Test          : hackbench
> Units         : Normalized time in seconds
> Interpretation: Lower is better
> Statistic     : AMean
> ==================================================================
> Case:           tip[pct imp](CV)    preempt-fix[pct imp](CV)
>   1-groups     1.00 [ -0.00]( 2.60)     1.00 [  0.17]( 2.12)
>   2-groups     1.00 [ -0.00]( 1.21)     0.98 [  2.05]( 0.95)
>   4-groups     1.00 [ -0.00]( 1.63)     0.97 [  2.65]( 1.53)
>   8-groups     1.00 [ -0.00]( 1.34)     0.99 [  0.81]( 1.33)
> 16-groups     1.00 [ -0.00]( 2.07)     0.98 [  2.31]( 1.09)
> --
>
> Feel free to include:
>
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
>

Thanks for the test!

> > ---
> >   kernel/sched/fair.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 225b31aaee55..2859fc7e2da2 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -1025,7 +1025,7 @@ static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
> >       /*
> >        * The task has consumed its request, reschedule.
> >        */
> > -     return true;
> > +     return (cfs_rq->nr_running > 1);
>
> Was there a strong reason why Peter decided to use "rq->nr_running"
> instead of "cfs_rq->nr_running" with PREEMPT_SHORT in update_curr()?
>
> I wonder if it was to force a pick_next_task() cycle to dequeue a
> possibly delayed entity
>  but AFAICT, "cfs_rq->nr_running" should
> account for the delayed entity still on the cfs_rq and perhaps the
> early return in update_curr() can just be changed to use
> "cfs_rq->nr_running". Not sure if I'm missing something trivial.
>
85e511df3cec changes
if (cfs_rq->nr_running > 1)  resched
to
if (rq->nr_running == 1) not_resched
which does lower the bar to trigger resched

Yes, I think your proposal make sense, the resched should only
be triggered between 2 cfs tasks,
and the restore to update_deadline() is not needed, something like below
in update_curr() could also work:

if (cfs_rq->nr_running == 1)
      return;

thanks,
Chenyu
Re: [PATCH] sched/eevdf: Fix wakeup-preempt by checking cfs_rq->nr_running
Posted by K Prateek Nayak 2 months ago
Hello Chenyu,

On 9/24/2024 6:40 PM, Chen Yu wrote:
> [..snip..]
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 225b31aaee55..2859fc7e2da2 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -1025,7 +1025,7 @@ static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>>        /*
>>>         * The task has consumed its request, reschedule.
>>>         */
>>> -     return true;
>>> +     return (cfs_rq->nr_running > 1);
>>
>> Was there a strong reason why Peter decided to use "rq->nr_running"
>> instead of "cfs_rq->nr_running" with PREEMPT_SHORT in update_curr()?
>>
>> I wonder if it was to force a pick_next_task() cycle to dequeue a
>> possibly delayed entity
>>   but AFAICT, "cfs_rq->nr_running" should
>> account for the delayed entity still on the cfs_rq and perhaps the
>> early return in update_curr() can just be changed to use
>> "cfs_rq->nr_running". Not sure if I'm missing something trivial.
>>
> 85e511df3cec changes
> if (cfs_rq->nr_running > 1)  resched
> to
> if (rq->nr_running == 1) not_resched
> which does lower the bar to trigger resched
> 
> Yes, I think your proposal make sense, the resched should only
> be triggered between 2 cfs tasks,
> and the restore to update_deadline() is not needed, something like below
> in update_curr() could also work:
> 
> if (cfs_rq->nr_running == 1)
>        return;

That seems better IMO unless there was a strong reason for the original
change to use rq->nr_running :)

> 
> thanks,
> Chenyu

-- 
Thanks and Regards,
Prateek
Re: [PATCH] sched/eevdf: Fix wakeup-preempt by checking cfs_rq->nr_running
Posted by Oliver Sang 2 months ago
hi, Chenyu,

On Mon, Sep 23, 2024 at 03:21:56PM +0800, Chen Yu wrote:
> Commit 85e511df3cec ("sched/eevdf: Allow shorter slices to wakeup-preempt")
> introduced a mechanism that a wakee with shorter slice could preempt
> the current running task. It also lower the bar for the current task
> to be preempted, by checking the rq->nr_running instead of cfs_rq->nr_running
> when the current task has ran out of time slice. Say, if there is 1 cfs
> task and 1 rt task, before 85e511df3cec, update_deadline() will
> not trigger a reschedule, and after 85e511df3cec, since rq->nr_running
> is 2 and resched is true, a resched_curr() would happen.
> 
> Some workloads (like the hackbench reported by lkp) do not like
> over-scheduling. We can see that the preemption rate has been
> increased by 2.2%:
> 
> 1.654e+08            +2.2%   1.69e+08        hackbench.time.involuntary_context_switches
> 
> Restore its previous check criterion.
> 
> Fixes: 85e511df3cec ("sched/eevdf: Allow shorter slices to wakeup-preempt")

we applied this patch upon 85e511df3cec, confirmed the regression is fixed.

Tested-by: kernel test robot <oliver.sang@intel.com>


=========================================================================================
compiler/cpufreq_governor/ipc/iterations/kconfig/mode/nr_threads/rootfs/tbox_group/testcase:
  gcc-12/performance/socket/4/x86_64-rhel-8.3/process/50%/debian-12-x86_64-20240206.cgz/lkp-spr-r02/hackbench

commit:
  82e9d0456e06 ("sched/fair: Avoid re-setting virtual deadline on 'migrations'")
  85e511df3cec ("sched/eevdf: Allow shorter slices to wakeup-preempt")
  1c3386188ad9 ("sched/eevdf: Fix wakeup-preempt by checking cfs_rq->nr_running")

82e9d0456e06cebe 85e511df3cec46021024176672a 1c3386188ad9007c9975470c817
---------------- --------------------------- ---------------------------
         %stddev     %change         %stddev     %change         %stddev
             \          |                \          |                \
    623219           -13.1%     541887            +0.3%     625154        hackbench.throughput


> Reported-by: kernel test robot <oliver.sang@intel.com>
> Closes: https://lore.kernel.org/oe-lkp/202409231416.9403c2e9-oliver.sang@intel.com
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 225b31aaee55..2859fc7e2da2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1025,7 +1025,7 @@ static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  	/*
>  	 * The task has consumed its request, reschedule.
>  	 */
> -	return true;
> +	return (cfs_rq->nr_running > 1);
>  }
>  
>  #include "pelt.h"
> -- 
> 2.25.1
>