Commit 85e511df3cec ("sched/eevdf: Allow shorter slices to wakeup-preempt")
introduced a mechanism by which a wakee with a shorter slice can preempt
the currently running task. It also lowered the bar for the current task
to be preempted, by checking rq->nr_running instead of cfs_rq->nr_running
when the current task has run out of its time slice. For example, with 1 cfs
task and 1 rt task on the runqueue, before 85e511df3cec update_deadline()
would not trigger a reschedule; after 85e511df3cec, since rq->nr_running
is 2 and resched is true, resched_curr() is called.
Some workloads (like the hackbench benchmark reported by lkp) do not
like over-scheduling. We can see that the preemption rate increased
by 2.2%:
1.654e+08 +2.2% 1.69e+08 hackbench.time.involuntary_context_switches
Restore its previous check criterion.
Fixes: 85e511df3cec ("sched/eevdf: Allow shorter slices to wakeup-preempt")
Reported-by: kernel test robot <oliver.sang@intel.com>
Closes: https://lore.kernel.org/oe-lkp/202409231416.9403c2e9-oliver.sang@intel.com
Signed-off-by: Chen Yu <yu.c.chen@intel.com>
---
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 225b31aaee55..2859fc7e2da2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1025,7 +1025,7 @@ static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
/*
* The task has consumed its request, reschedule.
*/
- return true;
+ return (cfs_rq->nr_running > 1);
}
#include "pelt.h"
--
2.25.1
Hello Chenyu,

On 9/23/2024 12:51 PM, Chen Yu wrote:
> Commit 85e511df3cec ("sched/eevdf: Allow shorter slices to wakeup-preempt")
> introduced a mechanism by which a wakee with a shorter slice can preempt
> the currently running task.
> [..snip..]
>
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>

Gave it a spin on my dual socket 3rd Generation EPYC system and I do not
see as big a jump in hackbench numbers as Oliver reported, most likely
because I couldn't emulate the exact scenario where a fair task is
running in the presence of a queued RT task.

Following are numbers from my testing:

==================================================================
Test          : hackbench
Units         : Normalized time in seconds
Interpretation: Lower is better
Statistic     : AMean
==================================================================
Case:          tip[pct imp](CV)        preempt-fix[pct imp](CV)
 1-groups     1.00 [ -0.00]( 2.60)     1.00 [  0.17]( 2.12)
 2-groups     1.00 [ -0.00]( 1.21)     0.98 [  2.05]( 0.95)
 4-groups     1.00 [ -0.00]( 1.63)     0.97 [  2.65]( 1.53)
 8-groups     1.00 [ -0.00]( 1.34)     0.99 [  0.81]( 1.33)
16-groups     1.00 [ -0.00]( 2.07)     0.98 [  2.31]( 1.09)
--

Feel free to include:

Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

> [..snip..]
>
> @@ -1025,7 +1025,7 @@ static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  	/*
>  	 * The task has consumed its request, reschedule.
>  	 */
> -	return true;
> +	return (cfs_rq->nr_running > 1);

Was there a strong reason why Peter decided to use "rq->nr_running"
instead of "cfs_rq->nr_running" with PREEMPT_SHORT in update_curr()?

I wonder if it was to force a pick_next_task() cycle to dequeue a
possibly delayed entity, but AFAICT "cfs_rq->nr_running" should
account for the delayed entity still on the cfs_rq, and perhaps the
early return in update_curr() can just be changed to use
"cfs_rq->nr_running". Not sure if I'm missing something trivial.

> }
>
> #include "pelt.h"

--
Thanks and Regards,
Prateek
Hello Prateek,

On Tue, Sep 24, 2024 at 6:28 PM K Prateek Nayak <kprateek.nayak@amd.com> wrote:
>
> Hello Chenyu,
>
> [..snip..]
>
> Gave it a spin on my dual socket 3rd Generation EPYC system and I do not
> see as big a jump in hackbench numbers as Oliver reported, most likely
> because I couldn't emulate the exact scenario where a fair task is
> running in the presence of a queued RT task.

Actually I did not dig into Oliver's test scenario. While looking at the
commit I thought that there could be a scenario where an rt and a cfs
task together might make a difference in preemption, which caused this
problem.

> Following are numbers from my testing:
>
> [..snip..]
>
> Feel free to include:
>
> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>

Thanks for the test!

> Was there a strong reason why Peter decided to use "rq->nr_running"
> instead of "cfs_rq->nr_running" with PREEMPT_SHORT in update_curr()?
>
> I wonder if it was to force a pick_next_task() cycle to dequeue a
> possibly delayed entity, but AFAICT "cfs_rq->nr_running" should
> account for the delayed entity still on the cfs_rq, and perhaps the
> early return in update_curr() can just be changed to use
> "cfs_rq->nr_running". Not sure if I'm missing something trivial.

85e511df3cec changes
if (cfs_rq->nr_running > 1) resched
to
if (rq->nr_running == 1) not_resched
which does lower the bar to trigger resched.

Yes, I think your proposal makes sense: the resched should only be
triggered between 2 cfs tasks, and restoring the check in
update_deadline() is not needed. Something like the below in
update_curr() could also work:

if (cfs_rq->nr_running == 1)
	return;

thanks,
Chenyu
Hello Chenyu,

On 9/24/2024 6:40 PM, Chen Yu wrote:
> [..snip..]
>>> @@ -1025,7 +1025,7 @@ static bool update_deadline(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>>  	/*
>>>  	 * The task has consumed its request, reschedule.
>>>  	 */
>>> -	return true;
>>> +	return (cfs_rq->nr_running > 1);
>>
>> Was there a strong reason why Peter decided to use "rq->nr_running"
>> instead of "cfs_rq->nr_running" with PREEMPT_SHORT in update_curr()?
>>
>> I wonder if it was to force a pick_next_task() cycle to dequeue a
>> possibly delayed entity, but AFAICT "cfs_rq->nr_running" should
>> account for the delayed entity still on the cfs_rq, and perhaps the
>> early return in update_curr() can just be changed to use
>> "cfs_rq->nr_running". Not sure if I'm missing something trivial.
>>
> 85e511df3cec changes
> if (cfs_rq->nr_running > 1) resched
> to
> if (rq->nr_running == 1) not_resched
> which does lower the bar to trigger resched.
>
> Yes, I think your proposal makes sense: the resched should only be
> triggered between 2 cfs tasks, and restoring the check in
> update_deadline() is not needed. Something like the below in
> update_curr() could also work:
>
> if (cfs_rq->nr_running == 1)
> 	return;

That seems better IMO unless there was a strong reason for the
original change to use rq->nr_running :)

> thanks,
> Chenyu

--
Thanks and Regards,
Prateek
hi, Chenyu,

On Mon, Sep 23, 2024 at 03:21:56PM +0800, Chen Yu wrote:
> Commit 85e511df3cec ("sched/eevdf: Allow shorter slices to wakeup-preempt")
> introduced a mechanism by which a wakee with a shorter slice can preempt
> the currently running task.
> [..snip..]
>
> Fixes: 85e511df3cec ("sched/eevdf: Allow shorter slices to wakeup-preempt")

we applied this patch upon 85e511df3cec, confirmed the regression is fixed.

Tested-by: kernel test robot <oliver.sang@intel.com>

=========================================================================================
compiler/cpufreq_governor/ipc/iterations/kconfig/mode/nr_threads/rootfs/tbox_group/testcase:
  gcc-12/performance/socket/4/x86_64-rhel-8.3/process/50%/debian-12-x86_64-20240206.cgz/lkp-spr-r02/hackbench

commit:
  82e9d0456e06 ("sched/fair: Avoid re-setting virtual deadline on 'migrations'")
  85e511df3cec ("sched/eevdf: Allow shorter slices to wakeup-preempt")
  1c3386188ad9 ("sched/eevdf: Fix wakeup-preempt by checking cfs_rq->nr_running")

82e9d0456e06cebe 85e511df3cec46021024176672a 1c3386188ad9007c9975470c817
---------------- --------------------------- ---------------------------
         %stddev     %change     %stddev     %change     %stddev
             \          |           \           |           \
    623219          -13.1%     541887          +0.3%     625154        hackbench.throughput

> Reported-by: kernel test robot <oliver.sang@intel.com>
> Closes: https://lore.kernel.org/oe-lkp/202409231416.9403c2e9-oliver.sang@intel.com
> Signed-off-by: Chen Yu <yu.c.chen@intel.com>
> [..snip..]
> --
> 2.25.1
>