This follows the attempt to better track the maximum lag of tasks in the
presence of different slice durations:
[1] https://lore.kernel.org/all/20250418151225.3006867-1-vincent.guittot@linaro.org/

Since v1, tracking of the max slice has been removed from the patchset
because we now ensure that the lag of an entity remains in the range:

  [-(slice + tick) : (slice + tick)] with run_to_parity
and
  [max(-slice, -(0.7ms + tick)) : max(slice, 0.7ms + tick)] without run_to_parity

As a result, there is no need to track the max slice of enqueued entities
anymore.

Patch 1 is a simple cleanup to ease the following changes.

Patch 2 fixes the lag for NO_RUN_TO_PARITY. It has been put first because
of its simplicity. The running task gets a minimum protection of 0.7ms
before EEVDF looks for another task.

Patch 3 ensures that the protection is canceled only if the waking task
will be selected by pick_task_fair(). This case was mentioned by Peter
while reviewing v1.

Patch 4 modifies the duration of the protection to take into account the
shortest slice of the enqueued tasks instead of the slice of the running
task.

Patch 5 fixes the case of tasks that are not eligible at wakeup or after
migrating but have a shorter slice. We need to update the duration of the
protection so that it does not exceed the lag.

Patch 6 fixes the case of tasks that are still eligible after the
protected period but where others must run so as not to exceed the lag
limit. This has been highlighted in a test with entities being dequeued
with a positive lag larger than their slice, but it can happen for
delayed-dequeue entities too.

The patchset has been tested with rt-app on 37 different use cases; some
are simple and should never trigger any problem, but they have been kept
to increase the test coverage. The tests have been run on a Dragonboard
RB5 with affinity to the biggest cores. The lag has been checked when we
update the entity's lag at dequeue and every time we check whether an
entity is eligible.

              RUN_TO_PARITY    NO_RUN_TO_PARITY
              lag error        lag error
 mainline     14/37            14/37
 + patch 1-2  14/37             0/37
 + patch 3-5   1/37             0/37
 + patch 6     0/37             0/37

Vincent Guittot (6):
  sched/fair: Use protect_slice() instead of direct comparison
  sched/fair: Fix NO_RUN_TO_PARITY case
  sched/fair: Remove spurious shorter slice preemption
  sched/fair: Limit run to parity to the min slice of enqueued entities
  sched/fair: Fix entity's lag with run to parity
  sched/fair: Always trigger resched at the end of a protected period

 kernel/sched/fair.c | 94 ++++++++++++++++++++++++---------------------
 1 file changed, 50 insertions(+), 44 deletions(-)

--
2.43.0
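To make the two ranges concrete, the small standalone check below encodes
the bounds claimed above. It is not part of the series; the 1ms tick and
the nanosecond values are illustrative assumptions.

/* Standalone illustration (not kernel code) of the lag bounds above. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define TICK_NS        1000000LL   /* assumed 1ms tick (HZ=1000) */
#define BASE_SLICE_NS   700000LL   /* the 0.7ms minimum protection */

/* RUN_TO_PARITY: lag must stay within [-(slice + tick), slice + tick]. */
static int lag_in_bounds_rtp(int64_t vlag, int64_t slice)
{
	return vlag >= -(slice + TICK_NS) && vlag <= slice + TICK_NS;
}

/* NO_RUN_TO_PARITY: the bound is max(slice, 0.7ms + tick) on each side. */
static int lag_in_bounds_nrtp(int64_t vlag, int64_t slice)
{
	int64_t bound = slice > BASE_SLICE_NS + TICK_NS ?
			slice : BASE_SLICE_NS + TICK_NS;

	return vlag >= -bound && vlag <= bound;
}

int main(void)
{
	/* 2ms of lag on a 3ms-slice task is fine in both modes. */
	assert(lag_in_bounds_rtp(2000000, 3000000));
	assert(lag_in_bounds_nrtp(2000000, 3000000));
	/* 2ms of lag on a 0.5ms-slice task violates the RUN_TO_PARITY bound. */
	assert(!lag_in_bounds_rtp(2000000, 500000));
	printf("lag bound checks passed\n");
	return 0;
}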
On Fri, Jul 04, 2025 at 04:36:06PM +0200, Vincent Guittot wrote:
> This follows the attempt to better track the maximum lag of tasks in the
> presence of different slice durations:
> [1] https://lore.kernel.org/all/20250418151225.3006867-1-vincent.guittot@linaro.org/
>
> Since v1, tracking of the max slice has been removed from the patchset
> because we now ensure that the lag of an entity remains in the range:
>
>   [-(slice + tick) : (slice + tick)] with run_to_parity
> and
>   [max(-slice, -(0.7ms + tick)) : max(slice, 0.7ms + tick)] without run_to_parity
>
> As a result, there is no need to track the max slice of enqueued entities
> anymore.
>
> Patch 1 is a simple cleanup to ease the following changes.
>
> Patch 2 fixes the lag for NO_RUN_TO_PARITY. It has been put first because
> of its simplicity. The running task gets a minimum protection of 0.7ms
> before EEVDF looks for another task.

The usage of min() on vruntimes is broken; it doesn't work right in the
face of wrapping; use min_vruntime().

Also, perhaps it is time to better document this vlag abuse.

> Patch 3 ensures that the protection is canceled only if the waking task
> will be selected by pick_task_fair(). This case was mentioned by Peter
> while reviewing v1.
>
> Patch 4 modifies the duration of the protection to take into account the
> shortest slice of the enqueued tasks instead of the slice of the running
> task.
>
> Patch 5 fixes the case of tasks that are not eligible at wakeup or after
> migrating but have a shorter slice. We need to update the duration of the
> protection so that it does not exceed the lag.

This has issues with non-determinism; specifically,
update_protect_slice() will use the current ->vruntime, and as such
can unduly push forward the protection window.

> Patch 6 fixes the case of tasks that are still eligible after the
> protected period but where others must run so as not to exceed the lag
> limit. This has been highlighted in a test with entities being dequeued
> with a positive lag larger than their slice, but it can happen for
> delayed-dequeue entities too.

At this point resched_next_quantum() becomes !protect_slice() and can be
removed.

How about something like so? I've probably wrecked the whole
!RUN_TO_PARITY thing -- so that needs to be put back in.

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -579,7 +579,15 @@ struct sched_entity {
 	u64				sum_exec_runtime;
 	u64				prev_sum_exec_runtime;
 	u64				vruntime;
-	s64				vlag;
+	union {
+		/*
+		 * When !@on_rq this field is vlag.
+		 * When cfs_rq->curr == se (which implies @on_rq)
+		 * this field is vprot. See protect_slice().
+		 */
+		s64			vlag;
+		u64			vprot;
+	};
 	u64				slice;
 
 	u64				nr_migrations;
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -882,45 +882,36 @@ struct sched_entity *__pick_first_entity
 }
 
 /*
- * HACK, Set the vruntime up to which an entity can run before picking another
- * one, in vlag, which isn't used until dequeue.
- * In case of run to parity, we use the shortest slice of the enqueued entities
- * to define the longest runtime.
- * When run to parity is disabled, we give a minimum quantum to the running
- * entity to ensure progress.
+ * Take a snapshot of the vruntime at the point the task gets scheduled.
 */
 static inline void set_protect_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	u64 quantum;
-
-	if (sched_feat(RUN_TO_PARITY))
-		quantum = cfs_rq_min_slice(cfs_rq);
-	else
-		quantum = normalized_sysctl_sched_base_slice;
-	quantum = min(quantum, se->slice);
-
-	if (quantum != se->slice)
-		se->vlag = min(se->deadline, se->vruntime + calc_delta_fair(quantum, se));
-	else
-		se->vlag = se->deadline;
+	se->vprot = se->vruntime;
 }
 
-static inline void update_protect_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
+/*
+ * Should we still run @se? It is allowed to run until either se->deadline or
+ * until se->vprot + min_vslice, whichever comes first.
+ */
+static inline bool protect_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	u64 quantum = cfs_rq_min_slice(cfs_rq);
+	u64 min_vslice, deadline = se->deadline;
+	u64 min_slice = cfs_rq_min_slice(cfs_rq);
 
-	se->vlag = min(se->vlag, (s64)(se->vruntime + calc_delta_fair(quantum, se)));
-}
+	if (min_slice != se->slice) {
+		min_vslice = calc_delta_fair(min_slice, se);
+		deadline = min_vruntime(se->deadline, se->vprot + min_vslice);
+	}
 
-static inline bool protect_slice(struct sched_entity *se)
-{
-	return ((s64)(se->vlag - se->vruntime) > 0);
+	WARN_ON_ONCE(!se->on_rq);
+
+	return ((s64)(deadline - se->vruntime) > 0);
 }
 
-static inline void cancel_protect_slice(struct sched_entity *se)
+static inline void cancel_protect_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	if (protect_slice(se))
-		se->vlag = se->vruntime;
+	if (protect_slice(cfs_rq, se))
+		se->vprot = se->vruntime - calc_delta_fair(NSEC_PER_SEC, se);
 }
 
 /*
@@ -959,7 +950,7 @@ static struct sched_entity *__pick_eevdf
 	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
 		curr = NULL;
 
-	if (curr && protect && protect_slice(curr))
+	if (curr && protect && protect_slice(cfs_rq, curr))
 		return curr;
 
 	/* Pick the leftmost entity if it's eligible */
@@ -1183,14 +1174,6 @@ static inline void update_curr_task(stru
 	cgroup_account_cputime(p, delta_exec);
 }
 
-static inline bool resched_next_quantum(struct cfs_rq *cfs_rq, struct sched_entity *curr)
-{
-	if (protect_slice(curr))
-		return false;
-
-	return true;
-}
-
 /*
  * Used by other classes to account runtime.
  */
@@ -1251,7 +1234,7 @@ static void update_curr(struct cfs_rq *c
 	if (cfs_rq->nr_queued == 1)
 		return;
 
-	if (resched || resched_next_quantum(cfs_rq, curr)) {
+	if (resched || !protect_slice(cfs_rq, curr)) {
 		resched_curr_lazy(rq);
 		clear_buddies(cfs_rq, curr);
 	}
@@ -8729,7 +8712,7 @@ static void check_preempt_wakeup_fair(st
 	 * If @p has a shorter slice than current and @p is eligible, override
 	 * current's slice protection in order to allow preemption.
 	 */
-	do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice);
+	do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice);
 
 	/*
 	 * If @p has become the most eligible task, force preemption.
@@ -8737,14 +8720,11 @@ static void check_preempt_wakeup_fair(st
 	if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse)
 		goto preempt;
 
-	if (sched_feat(RUN_TO_PARITY) && do_preempt_short)
-		update_protect_slice(cfs_rq, se);
-
 	return;
 
 preempt:
 	if (do_preempt_short)
-		cancel_protect_slice(se);
+		cancel_protect_slice(cfs_rq, se);
 
 	resched_curr_lazy(rq);
 }
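The wrapping problem Peter points at is worth spelling out: vruntime is an
unsigned 64-bit quantity that is compared modulo 2^64, so a plain min()
picks the wrong value once one operand has wrapped past U64_MAX. The demo
below is a standalone sketch, not kernel code, but the helper mirrors
min_vruntime() as found in kernel/sched/fair.c.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Mirrors min_vruntime() in kernel/sched/fair.c: the signed delta
 * orders two vruntimes correctly even across u64 wraparound. */
static inline uint64_t min_vruntime(uint64_t min_vr, uint64_t vruntime)
{
	int64_t delta = (int64_t)(vruntime - min_vr);

	if (delta < 0)
		min_vr = vruntime;

	return min_vr;
}

int main(void)
{
	uint64_t a = UINT64_MAX - 100;	/* just before the wrap */
	uint64_t b = a + 200;		/* 200ns later, wrapped to 99 */

	/* Plain min() claims the wrapped value 99 is earlier: wrong. */
	printf("plain min   : %" PRIu64 "\n", a < b ? a : b);
	/* The signed delta sees b as 200ns ahead, so a is kept: right. */
	printf("min_vruntime: %" PRIu64 "\n", min_vruntime(a, b));
	return 0;
}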
On Mon, 7 Jul 2025 at 16:14, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Jul 04, 2025 at 04:36:06PM +0200, Vincent Guittot wrote:
> > This follows the attempt to better track the maximum lag of tasks in the
> > presence of different slice durations:
> > [1] https://lore.kernel.org/all/20250418151225.3006867-1-vincent.guittot@linaro.org/
> >
> > Since v1, tracking of the max slice has been removed from the patchset
> > because we now ensure that the lag of an entity remains in the range:
> >
> >   [-(slice + tick) : (slice + tick)] with run_to_parity
> > and
> >   [max(-slice, -(0.7ms + tick)) : max(slice, 0.7ms + tick)] without run_to_parity
> >
> > As a result, there is no need to track the max slice of enqueued entities
> > anymore.
> >
> > Patch 1 is a simple cleanup to ease the following changes.
> >
> > Patch 2 fixes the lag for NO_RUN_TO_PARITY. It has been put first because
> > of its simplicity. The running task gets a minimum protection of 0.7ms
> > before EEVDF looks for another task.
>
> The usage of min() on vruntimes is broken; it doesn't work right in the
> face of wrapping; use min_vruntime().

Good point. Don't know why I didn't use it.

> Also, perhaps it is time to better document this vlag abuse.

Will add something.

> > Patch 3 ensures that the protection is canceled only if the waking task
> > will be selected by pick_task_fair(). This case was mentioned by Peter
> > while reviewing v1.
> >
> > Patch 4 modifies the duration of the protection to take into account the
> > shortest slice of the enqueued tasks instead of the slice of the running
> > task.
> >
> > Patch 5 fixes the case of tasks that are not eligible at wakeup or after
> > migrating but have a shorter slice. We need to update the duration of the
> > protection so that it does not exceed the lag.
>
> This has issues with non-determinism; specifically,
> update_protect_slice() will use the current ->vruntime, and as such
> can unduly push forward the protection window.

se->vprot = min_vruntime(se->vprot, (se->vruntime +
calc_delta_fair(quantum, se)));

The min_vruntime() (previously min()) with the current vprot (previously
vlag) should prevent pushing the protection forward. We can only reduce
the vlag further, never increase it.

> > Patch 6 fixes the case of tasks that are still eligible after the
> > protected period but where others must run so as not to exceed the lag
> > limit. This has been highlighted in a test with entities being dequeued
> > with a positive lag larger than their slice, but it can happen for
> > delayed-dequeue entities too.
>
> At this point resched_next_quantum() becomes !protect_slice() and can be
> removed.

Yes, it was the last remaining fix and I forgot to simplify.

> How about something like so? I've probably wrecked the whole
> !RUN_TO_PARITY thing -- so that needs to be put back in.
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -579,7 +579,15 @@ struct sched_entity {
>  	u64				sum_exec_runtime;
>  	u64				prev_sum_exec_runtime;
>  	u64				vruntime;
> -	s64				vlag;
> +	union {
> +		/*
> +		 * When !@on_rq this field is vlag.
> +		 * When cfs_rq->curr == se (which implies @on_rq)
> +		 * this field is vprot. See protect_slice().
> +		 */
> +		s64			vlag;
> +		u64			vprot;
> +	};
>  	u64				slice;
> 
>  	u64				nr_migrations;
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -882,45 +882,36 @@ struct sched_entity *__pick_first_entity
>  }
> 
>  /*
> - * HACK, Set the vruntime up to which an entity can run before picking another
> - * one, in vlag, which isn't used until dequeue.
> - * In case of run to parity, we use the shortest slice of the enqueued entities
> - * to define the longest runtime.
> - * When run to parity is disabled, we give a minimum quantum to the running
> - * entity to ensure progress.
> + * Take a snapshot of the vruntime at the point the task gets scheduled.
>  */
>  static inline void set_protect_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> -	u64 quantum;
> -
> -	if (sched_feat(RUN_TO_PARITY))
> -		quantum = cfs_rq_min_slice(cfs_rq);
> -	else
> -		quantum = normalized_sysctl_sched_base_slice;
> -	quantum = min(quantum, se->slice);
> -
> -	if (quantum != se->slice)
> -		se->vlag = min(se->deadline, se->vruntime + calc_delta_fair(quantum, se));
> -	else
> -		se->vlag = se->deadline;
> +	se->vprot = se->vruntime;
>  }
> 
> -static inline void update_protect_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
> +/*
> + * Should we still run @se? It is allowed to run until either se->deadline or
> + * until se->vprot + min_vslice, whichever comes first.
> + */
> +static inline bool protect_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> -	u64 quantum = cfs_rq_min_slice(cfs_rq);
> +	u64 min_vslice, deadline = se->deadline;
> +	u64 min_slice = cfs_rq_min_slice(cfs_rq);
> 
> -	se->vlag = min(se->vlag, (s64)(se->vruntime + calc_delta_fair(quantum, se)));
> -}
> +	if (min_slice != se->slice) {
> +		min_vslice = calc_delta_fair(min_slice, se);
> +		deadline = min_vruntime(se->deadline, se->vprot + min_vslice);

I didn't go into that direction because protect_slice() is called far
more often than set_protect_slice() or update_protect_slice().

> +	}
> 
> -static inline bool protect_slice(struct sched_entity *se)
> -{
> -	return ((s64)(se->vlag - se->vruntime) > 0);
> +	WARN_ON_ONCE(!se->on_rq);
> +
> +	return ((s64)(deadline - se->vruntime) > 0);
>  }
> 
> -static inline void cancel_protect_slice(struct sched_entity *se)
> +static inline void cancel_protect_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> -	if (protect_slice(se))
> -		se->vlag = se->vruntime;
> +	if (protect_slice(cfs_rq, se))
> +		se->vprot = se->vruntime - calc_delta_fair(NSEC_PER_SEC, se);
>  }
> 
>  /*
> @@ -959,7 +950,7 @@ static struct sched_entity *__pick_eevdf
>  	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
>  		curr = NULL;
> 
> -	if (curr && protect && protect_slice(curr))
> +	if (curr && protect && protect_slice(cfs_rq, curr))
>  		return curr;
> 
>  	/* Pick the leftmost entity if it's eligible */
> @@ -1183,14 +1174,6 @@ static inline void update_curr_task(stru
>  	cgroup_account_cputime(p, delta_exec);
>  }
> 
> -static inline bool resched_next_quantum(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> -{
> -	if (protect_slice(curr))
> -		return false;
> -
> -	return true;
> -}
> -
>  /*
>   * Used by other classes to account runtime.
>   */
> @@ -1251,7 +1234,7 @@ static void update_curr(struct cfs_rq *c
>  	if (cfs_rq->nr_queued == 1)
>  		return;
> 
> -	if (resched || resched_next_quantum(cfs_rq, curr)) {
> +	if (resched || !protect_slice(cfs_rq, curr)) {
>  		resched_curr_lazy(rq);
>  		clear_buddies(cfs_rq, curr);
>  	}
> @@ -8729,7 +8712,7 @@ static void check_preempt_wakeup_fair(st
>  	 * If @p has a shorter slice than current and @p is eligible, override
>  	 * current's slice protection in order to allow preemption.
>  	 */
> -	do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice);
> +	do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice);
> 
>  	/*
>  	 * If @p has become the most eligible task, force preemption.
> @@ -8737,14 +8720,11 @@ static void check_preempt_wakeup_fair(st
>  	if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse)
>  		goto preempt;
> 
> -	if (sched_feat(RUN_TO_PARITY) && do_preempt_short)
> -		update_protect_slice(cfs_rq, se);
> -
>  	return;
> 
>  preempt:
>  	if (do_preempt_short)
> -		cancel_protect_slice(se);
> +		cancel_protect_slice(cfs_rq, se);
> 
>  	resched_curr_lazy(rq);
>  }
On Mon, Jul 07, 2025 at 05:49:05PM +0200, Vincent Guittot wrote:

> > > Patch 5 fixes the case of tasks that are not eligible at wakeup or after
> > > migrating but have a shorter slice. We need to update the duration of the
> > > protection so that it does not exceed the lag.
> >
> > This has issues with non-determinism; specifically,
> > update_protect_slice() will use the current ->vruntime, and as such
> > can unduly push forward the protection window.
> 
> se->vprot = min_vruntime(se->vprot, (se->vruntime +
> calc_delta_fair(quantum, se)));
> 
> The min_vruntime() (previously min()) with the current vprot (previously
> vlag) should prevent pushing the protection forward. We can only reduce
> the vlag further, never increase it.

Fair enough.

> > +/*
> > + * Should we still run @se? It is allowed to run until either se->deadline or
> > + * until se->vprot + min_vslice, whichever comes first.
> > + */
> > +static inline bool protect_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > {
> > +	u64 min_vslice, deadline = se->deadline;
> > +	u64 min_slice = cfs_rq_min_slice(cfs_rq);
> >
> > +	if (min_slice != se->slice) {
> > +		min_vslice = calc_delta_fair(min_slice, se);
> > +		deadline = min_vruntime(se->deadline, se->vprot + min_vslice);
> 
> I didn't go into that direction because protect_slice() is called far
> more often than set_protect_slice() or update_protect_slice().

Right.

> > +	}
> >
> > +	WARN_ON_ONCE(!se->on_rq);
> > +
> > +	return ((s64)(deadline - se->vruntime) > 0);
> > }

Anyway, I see you posted a new version. Let me go have a look.
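For reference, the clamp Vincent describes expands to something like the
sketch below. The names (se->vprot, cfs_rq_min_slice(), calc_delta_fair())
are taken from the thread; treat this as an illustration of the v2
approach under discussion, not the final merged code.

static inline void update_protect_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	u64 quantum = cfs_rq_min_slice(cfs_rq);

	/*
	 * min_vruntime() keeps whichever bound is earlier in virtual time,
	 * so repeated calls can only shrink the protected window, never
	 * extend it -- which is why the clamp avoids the non-determinism
	 * Peter raised about reading the current ->vruntime.
	 */
	se->vprot = min_vruntime(se->vprot,
				 se->vruntime + calc_delta_fair(quantum, se));
}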