In the presence of a lot of small-weight tasks like sched_idle tasks, normal
or high-weight tasks can see their ideal runtime (sched_slice) increase to
hundreds of milliseconds, whereas it normally stays below sysctl_sched_latency.
2 normal tasks running on a CPU will have a max sched_slice of 12ms
(half of the sched_period). This means that they will make progress
every sysctl_sched_latency period.
If we now add 1000 idle tasks on the CPU, the sched_period becomes
3006 ms and the ideal runtime of the normal tasks becomes 609 ms.
It even becomes 1500 ms if the idle tasks belong to an idle cgroup.
This means that the scheduler will only look for picking another waiting
task after 609 ms (respectively 1500 ms) of running time. The idle tasks
significantly change the way the 2 normal tasks interleave their running
time slots, whereas they should have only a small impact.
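For reference, the 609 ms figure can be reproduced with the sched_period()/
sched_slice() arithmetic, assuming the scaled defaults used above
(sysctl_sched_latency = 24 ms, sysctl_sched_min_granularity = 3 ms, hence
sched_nr_latency = 8) and the standard weights (NICE_0 = 1024, SCHED_IDLE = 3):

	nr_running = 1002                                (2 normal + 1000 idle)
	period     = 1002 * 3 ms = 3006 ms               (nr_running > 8)
	slice      = 3006 ms * 1024 / (2*1024 + 1000*3) ~= 609 ms

With only the 2 normal tasks: period = 24 ms and slice = 24 ms * 1024/2048 = 12 ms.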
Such a long sched_slice can significantly delay the release of resources,
as a task can wait hundreds of milliseconds before its next running slot,
just because idle tasks are queued on the rq.
Cap the ideal_runtime to the weighted version of sysctl_sched_latency when
comparing it with the vruntime of the next waiting task, to make sure that
tasks regularly make progress and are not significantly impacted by
idle/background tasks queued on the rq.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
I have kept the test if (delta < 0) as calc_delta_fair() can't handle negative
values.
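For context on the "weighted version": calc_delta_fair() scales a walltime
delta into curr's vruntime scale, so the capped value can be compared directly
with the vruntime delta below. Roughly (ignoring the fixed-point inverse-weight
handling done in __calc_delta()):

	/* calc_delta_fair(delta, se) is approximately: */
	if (se->load.weight == NICE_0_LOAD)
		return delta;
	return delta * NICE_0_LOAD / se->load.weight;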
Changes since v1:
- the first 3 patches have already been queued
- use the weight of curr to scale sysctl_sched_latency before capping
  the ideal_runtime so we can compare vruntime values.
kernel/sched/fair.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5ffec4370602..ba451bb25929 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4610,6 +4610,8 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
if (delta < 0)
return;
+ ideal_runtime = min_t(u64, ideal_runtime,
+ calc_delta_fair(sysctl_sched_latency, curr));
if (delta > ideal_runtime)
resched_curr(rq_of(cfs_rq));
}
--
2.17.1
On Fri, Sep 16, 2022 at 03:15:38PM +0200, Vincent Guittot wrote:
> In presence of a lot of small weight tasks like sched_idle tasks, normal
> or high weight tasks can see their ideal runtime (sched_slice) to increase
> to hundreds ms whereas it normally stays below sysctl_sched_latency.
>
> 2 normal tasks running on a CPU will have a max sched_slice of 12ms
> (half of the sched_period). This means that they will make progress
> every sysctl_sched_latency period.
>
> If we now add 1000 idle tasks on the CPU, the sched_period becomes
> 3006 ms and the ideal runtime of the normal tasks becomes 609 ms.
> It will even become 1500ms if the idle tasks belongs to an idle cgroup.
> This means that the scheduler will look for picking another waiting task
> after 609ms running time (1500ms respectively). The idle tasks change
> significantly the way the 2 normal tasks interleave their running time
> slot whereas they should have a small impact.
>
> Such long sched_slice can delay significantly the release of resources
> as the tasks can wait hundreds of ms before the next running slot just
> because of idle tasks queued on the rq.
>
> Cap the ideal_runtime to the weighted version of sysctl_sched_latency when
> comparing with the vruntime of the next waiting task to make sure that
> tasks will regularly make progress and will not be significantly impacted
> by idle/background tasks queued on the rq.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>
> I have kept the test if (delta < 0) as calc_delta_fair() can't handle negative
> value.
>
> Change since v1:
> - the first 3 patches have been already queued
> - use the weight of curr to scale sysctl_sched_latency before capping
> the ideal_runtime so we can compare vruntime values.
>
> kernel/sched/fair.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5ffec4370602..ba451bb25929 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4610,6 +4610,8 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> if (delta < 0)
> return;
>
> + ideal_runtime = min_t(u64, ideal_runtime,
> + calc_delta_fair(sysctl_sched_latency, curr));
> if (delta > ideal_runtime)
> resched_curr(rq_of(cfs_rq));
> }
Since I'm suffering from a cold and constant interruptions, I had to
write down my thinking and ended up with the below.
Does that make sense, or did I go sideways somewhere (entirely possible)?
---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5ffec4370602..2b218167fadf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4575,17 +4575,33 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
}
/*
- * Preempt the current task with a newly woken task if needed:
+ * Tick driven preemption; preempt the task if it has ran long enough.
+ * Allows other tasks to have a go.
*/
static void
check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
- unsigned long ideal_runtime, delta_exec;
struct sched_entity *se;
- s64 delta;
+ s64 delta, delta_exec;
+ u64 ideal_runtime;
- ideal_runtime = sched_slice(cfs_rq, curr);
+ /* How long has this task been on the CPU for [walltime]. */
delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
+
+ /*
+ * Ensure that a task that missed wakeup preemption by a
+ * narrow margin doesn't have to wait for a full slice.
+ * This also mitigates buddy induced latencies under load.
+ */
+ if (delta_exec < sysctl_sched_min_granularity)
+ return;
+
+ /*
+ * When many tasks blow up the sched_period; it is possible that
+ * sched_slice() reports unusually large results (when many tasks are
+ * very light for example). Therefore impose a maximum.
+ */
+ ideal_runtime = min_t(u64, sched_slice(cfs_rq, curr), sysctl_sched_latency);
if (delta_exec > ideal_runtime) {
resched_curr(rq_of(cfs_rq));
/*
@@ -4597,19 +4613,24 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
}
/*
- * Ensure that a task that missed wakeup preemption by a
- * narrow margin doesn't have to wait for a full slice.
- * This also mitigates buddy induced latencies under load.
+ * Strictly speaking the above is sufficient; however due to
+ * imperfections it is possible to have a leftmost task left of
+ * min_vruntime.
+ *
+ * Also impose limits on the delta in vtime.
*/
- if (delta_exec < sysctl_sched_min_granularity)
- return;
se = __pick_first_entity(cfs_rq);
delta = curr->vruntime - se->vruntime;
-
if (delta < 0)
return;
+ /*
+ * Compare @delta [vtime] to @ideal_runtime [walltime]. This means that
+ * heavy tasks (for which vtime goes slower) get relatively more time
+ * before preemption, while light tasks (for which vtime goes faster)
+ * get relatively less time. IOW, heavy task get to run longer.
+ */
if (delta > ideal_runtime)
resched_curr(rq_of(cfs_rq));
}
On Thu, 29 Sept 2022 at 18:14, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Sep 16, 2022 at 03:15:38PM +0200, Vincent Guittot wrote:
> > In presence of a lot of small weight tasks like sched_idle tasks, normal
> > or high weight tasks can see their ideal runtime (sched_slice) to increase
> > to hundreds ms whereas it normally stays below sysctl_sched_latency.
> >
> > 2 normal tasks running on a CPU will have a max sched_slice of 12ms
> > (half of the sched_period). This means that they will make progress
> > every sysctl_sched_latency period.
> >
> > If we now add 1000 idle tasks on the CPU, the sched_period becomes
> > 3006 ms and the ideal runtime of the normal tasks becomes 609 ms.
> > It will even become 1500ms if the idle tasks belongs to an idle cgroup.
> > This means that the scheduler will look for picking another waiting task
> > after 609ms running time (1500ms respectively). The idle tasks change
> > significantly the way the 2 normal tasks interleave their running time
> > slot whereas they should have a small impact.
> >
> > Such long sched_slice can delay significantly the release of resources
> > as the tasks can wait hundreds of ms before the next running slot just
> > because of idle tasks queued on the rq.
> >
> > Cap the ideal_runtime to the weighted version of sysctl_sched_latency when
> > comparing with the vruntime of the next waiting task to make sure that
> > tasks will regularly make progress and will not be significantly impacted
> > by idle/background tasks queued on the rq.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >
> > I have kept the test if (delta < 0) as calc_delta_fair() can't handle negative
> > value.
> >
> > Change since v1:
> > - the first 3 patches have been already queued
> > - use the weight of curr to scale sysctl_sched_latency before capping
> > the ideal_runtime so we can compare vruntime values.
> >
> > kernel/sched/fair.c | 2 ++
> > 1 file changed, 2 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 5ffec4370602..ba451bb25929 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4610,6 +4610,8 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> > if (delta < 0)
> > return;
> >
> > + ideal_runtime = min_t(u64, ideal_runtime,
> > + calc_delta_fair(sysctl_sched_latency, curr));
> > if (delta > ideal_runtime)
> > resched_curr(rq_of(cfs_rq));
> > }
>
> Since I'm suffering from a cold and constant interruptions I had to
> write down my thinking and ended up with the below.
>
> Does that make sense or did I go sideways somewhere (entirely possible).
>
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 5ffec4370602..2b218167fadf 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4575,17 +4575,33 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> }
>
> /*
> - * Preempt the current task with a newly woken task if needed:
> + * Tick driven preemption; preempt the task if it has ran long enough.
> + * Allows other tasks to have a go.
> */
> static void
> check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> {
> - unsigned long ideal_runtime, delta_exec;
> struct sched_entity *se;
> - s64 delta;
> + s64 delta, delta_exec;
> + u64 ideal_runtime;
>
> - ideal_runtime = sched_slice(cfs_rq, curr);
> + /* How long has this task been on the CPU for [walltime]. */
> delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
> +
> + /*
> + * Ensure that a task that missed wakeup preemption by a
> + * narrow margin doesn't have to wait for a full slice.
> + * This also mitigates buddy induced latencies under load.
> + */
> + if (delta_exec < sysctl_sched_min_granularity)
> + return;
ideal_runtime can be lower than sysctl_sched_min_granularity. It can
be as low as sysctl_sched_idle_min_granularity for an idle task. In this
case, we want to resched even if (delta_exec <
sysctl_sched_min_granularity). That's why the first test was still done
before.
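A simplified sketch of the current ordering (the ~3 ms / ~0.75 ms figures
assume the usual scaled defaults and are only illustrative):

	ideal_runtime = sched_slice(cfs_rq, curr);     /* ~0.75 ms for an idle curr */
	if (delta_exec > ideal_runtime)                /* e.g. 1 ms > 0.75 ms */
		resched_curr(rq_of(cfs_rq));           /* must still happen here */
	if (delta_exec < sysctl_sched_min_granularity) /* 1 ms < ~3 ms */
		return;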
> +
> + /*
> + * When many tasks blow up the sched_period; it is possible that
> + * sched_slice() reports unusually large results (when many tasks are
> + * very light for example). Therefore impose a maximum.
> + */
> + ideal_runtime = min_t(u64, sched_slice(cfs_rq, curr), sysctl_sched_latency);
I didn't cap ideal_runtime before this test because there are situations
where a large ideal_runtime is fine, for example a single normal thread
running with hundreds of idle threads: is it acceptable to trigger a
useless resched in this case? That's why I instead computed the virtual
time generated by the capped version of ideal_runtime.
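For that single-normal-task case the numbers look like this (same assumptions
as in the changelog):

	period = 1001 * 3 ms = 3003 ms
	slice  = 3003 ms * 1024 / (1024 + 1000*3) ~= 764 ms

so a plain walltime cap at sysctl_sched_latency would fire a resched roughly
every 24 ms even though curr would most likely be picked again immediately.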
> if (delta_exec > ideal_runtime) {
> resched_curr(rq_of(cfs_rq));
> /*
> @@ -4597,19 +4613,24 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> }
>
> /*
> - * Ensure that a task that missed wakeup preemption by a
> - * narrow margin doesn't have to wait for a full slice.
> - * This also mitigates buddy induced latencies under load.
> + * Strictly speaking the above is sufficient; however due to
> + * imperfections it is possible to have a leftmost task left of
> + * min_vruntime.
> + *
> + * Also impose limits on the delta in vtime.
> */
> - if (delta_exec < sysctl_sched_min_granularity)
> - return;
>
> se = __pick_first_entity(cfs_rq);
> delta = curr->vruntime - se->vruntime;
> -
> if (delta < 0)
> return;
>
> + /*
> + * Compare @delta [vtime] to @ideal_runtime [walltime]. This means that
> + * heavy tasks (for which vtime goes slower) get relatively more time
> + * before preemption, while light tasks (for which vtime goes faster)
> + * get relatively less time. IOW, heavy task get to run longer.
> + */
After your comment on v1, I looked more deeply into this, and comparing
[vtime] with [walltime] can create a large unfairness. The vtime of a
nice -20 task increases by ~250us for 24ms of walltime, which means that
the nice -20 task will have to run for a long time before reaching this
walltime delta (assuming the vruntimes were similar at the beginning).
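The arithmetic behind that figure, using the standard nice -20 weight of 88761:

	24 ms * 1024 / 88761 ~= 0.28 ms of vruntime per 24 ms of walltime

so accumulating a 24 ms vruntime lead over the leftmost task would take on the
order of 24 ms * 88761 / 1024 ~= 2 s of walltime.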
> if (delta > ideal_runtime)
> resched_curr(rq_of(cfs_rq));
> }
On Thu, Sep 29, 2022 at 07:15:42PM +0200, Vincent Guittot wrote:
> On Thu, 29 Sept 2022 at 18:14, Peter Zijlstra <peterz@infradead.org> wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 5ffec4370602..2b218167fadf 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4575,17 +4575,33 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > }
> >
> > /*
> > - * Preempt the current task with a newly woken task if needed:
> > + * Tick driven preemption; preempt the task if it has ran long enough.
> > + * Allows other tasks to have a go.
> > */
> > static void
> > check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> > {
> > - unsigned long ideal_runtime, delta_exec;
> > struct sched_entity *se;
> > - s64 delta;
> > + s64 delta, delta_exec;
> > + u64 ideal_runtime;
> >
> > - ideal_runtime = sched_slice(cfs_rq, curr);
> > + /* How long has this task been on the CPU for [walltime]. */
> > delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
> > +
> > + /*
> > + * Ensure that a task that missed wakeup preemption by a
> > + * narrow margin doesn't have to wait for a full slice.
> > + * This also mitigates buddy induced latencies under load.
> > + */
> > + if (delta_exec < sysctl_sched_min_granularity)
> > + return;
>
> ideal_runtime can be lower than sysctl_sched_min_granularity. It can
> be as low as sysctl_sched_idle_min_granularity for idle task. In this
> case, we want to resched even if(delta_exec <
> sysctl_sched_min_granularity). That's why the 1st test was still done
> before
Duh, indeed.
> > +
> > + /*
> > + * When many tasks blow up the sched_period; it is possible that
> > + * sched_slice() reports unusually large results (when many tasks are
> > + * very light for example). Therefore impose a maximum.
> > + */
> > + ideal_runtime = min_t(u64, sched_slice(cfs_rq, curr), sysctl_sched_latency);
>
> I didn't cap ideal_runtime before this test because we have situations
> where large ideal_runtime is ok: If there is only one normal thread
> with hundreds of idle threads as an example: Is it acceptable to
> trigger a useless resched in this case ? That's why I have computed
> the virtual time generated by the capped version of ideal_runtime.
Yeah; I think that should be fine. It's an edge case, and sched_latency
is fairly large already.
> > se = __pick_first_entity(cfs_rq);
> > delta = curr->vruntime - se->vruntime;
> > -
> > if (delta < 0)
> > return;
> >
> > + /*
> > + * Compare @delta [vtime] to @ideal_runtime [walltime]. This means that
> > + * heavy tasks (for which vtime goes slower) get relatively more time
> > + * before preemption, while light tasks (for which vtime goes faster)
> > + * get relatively less time. IOW, heavy task get to run longer.
> > + */
>
> After your comment on v1, I looked more deeply on this and the
> comparison of [vtime] with [walltime] can create a large unfairness.
> vtime of nice-20 increases by ~250us for 24ms of walltime which means
> that the nice-20 will have to run for a long time before reaching this
> walltime delta (assuming the vruntime were similar at the beg)
As I wrote, strictly speaking we should do without this. The entire
vtime thing is a band-aid.
I'm sure I've tried taking it out at least once; but sadly I seem to
have forgotten everything relevant :-( That is, I can't tell you why
this code exists.
On Thu, 29 Sept 2022 at 21:19, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Sep 29, 2022 at 07:15:42PM +0200, Vincent Guittot wrote:
> > On Thu, 29 Sept 2022 at 18:14, Peter Zijlstra <peterz@infradead.org> wrote:
>
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 5ffec4370602..2b218167fadf 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -4575,17 +4575,33 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> > > }
> > >
> > > /*
> > > - * Preempt the current task with a newly woken task if needed:
> > > + * Tick driven preemption; preempt the task if it has ran long enough.
> > > + * Allows other tasks to have a go.
> > > */
> > > static void
> > > check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> > > {
> > > - unsigned long ideal_runtime, delta_exec;
> > > struct sched_entity *se;
> > > - s64 delta;
> > > + s64 delta, delta_exec;
> > > + u64 ideal_runtime;
> > >
> > > - ideal_runtime = sched_slice(cfs_rq, curr);
> > > + /* How long has this task been on the CPU for [walltime]. */
> > > delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
> > > +
> > > + /*
> > > + * Ensure that a task that missed wakeup preemption by a
> > > + * narrow margin doesn't have to wait for a full slice.
> > > + * This also mitigates buddy induced latencies under load.
> > > + */
> > > + if (delta_exec < sysctl_sched_min_granularity)
> > > + return;
> >
> > ideal_runtime can be lower than sysctl_sched_min_granularity. It can
> > be as low as sysctl_sched_idle_min_granularity for idle task. In this
> > case, we want to resched even if(delta_exec <
> > sysctl_sched_min_granularity). That's why the 1st test was still done
> > before
>
> Duh, indeed.
>
> > > +
> > > + /*
> > > + * When many tasks blow up the sched_period; it is possible that
> > > + * sched_slice() reports unusually large results (when many tasks are
> > > + * very light for example). Therefore impose a maximum.
> > > + */
> > > + ideal_runtime = min_t(u64, sched_slice(cfs_rq, curr), sysctl_sched_latency);
> >
> > I didn't cap ideal_runtime before this test because we have situations
> > where large ideal_runtime is ok: If there is only one normal thread
> > with hundreds of idle threads as an example: Is it acceptable to
> > trigger a useless resched in this case ? That's why I have computed
> > the virtual time generated by the capped version of ideal_runtime.
>
> Yeah; I think that should be fine. It's an edge case, and sched_latency
> is fairly large already.
In this case, capping the ideal_runtime from the beginning is OK and will
ensure a maximum duration before checking for a resched.
>
> > > se = __pick_first_entity(cfs_rq);
> > > delta = curr->vruntime - se->vruntime;
> > > -
> > > if (delta < 0)
> > > return;
> > >
> > > + /*
> > > + * Compare @delta [vtime] to @ideal_runtime [walltime]. This means that
> > > + * heavy tasks (for which vtime goes slower) get relatively more time
> > > + * before preemption, while light tasks (for which vtime goes faster)
> > > + * get relatively less time. IOW, heavy task get to run longer.
> > > + */
> >
> > After your comment on v1, I looked more deeply on this and the
> > comparison of [vtime] with [walltime] can create a large unfairness.
> > vtime of nice-20 increases by ~250us for 24ms of walltime which means
> > that the nice-20 will have to run for a long time before reaching this
> > walltime delta (assuming the vruntime were similar at the beg)
>
> As I wrote, strictly speaking we should do without this. The entire
> vtime thing is a band-aid.
>
> I'm sure I've tried taking it out at least once; but sadly I seem to
> have forgotten everything relevant :-( That is, I can't tell you why
> this code exists.
This test can trigger a resched before the end of ideal_runtime. I can
only see a benefit when the curr task has a nice prio > 0, where vruntime
increases faster than walltime and will therefore trigger a resched before
the end of the [walltime] ideal runtime.
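For example, at nice 19 (standard weight 15) vruntime advances much faster
than walltime:

	1 ms of walltime ~= 1 ms * 1024 / 15 ~= 68 ms of vruntime

so a vruntime delta equivalent to a 24 ms ideal_runtime can accrue after only
~0.35 ms of walltime.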