This follows the attempt to better track the maximum lag of tasks in the
presence of different slice durations:
[1] https://lore.kernel.org/all/20250418151225.3006867-1-vincent.guittot@linaro.org/

Since v1, tracking of the max slice has been removed from the patchset
because we now ensure that the lag of an entity remains in the range:

  [-(slice + tick) : (slice + tick)] with run_to_parity
and
  [max(-slice, -(0.7ms + tick)) : max(slice, 0.7ms + tick)] without run_to_parity

As a result, there is no need to track the max slice of enqueued entities
anymore.

Patch 1 is a simple cleanup to ease the following changes.

Patch 2 fixes the lag for NO_RUN_TO_PARITY. It has been put first because
of its simplicity. The running task gets a minimum protection of 0.7ms
before EEVDF looks for another task.

Patch 3 ensures that the protection is canceled only if the waking task
will be selected by pick_task_fair(). This case was mentioned by Peter
while reviewing v1.

Patch 4 modifies the duration of the protection to take into account the
shortest slice of the enqueued tasks instead of the slice of the running
task.

Patch 5 fixes the case of tasks that are not eligible at wakeup or after
migrating but have a shorter slice. We need to update the duration of the
protection so that it does not exceed the lag.

Patch 6 fixes the case of tasks that are still eligible after the
protected period but where others must run so as not to exceed the lag
limit. This has been highlighted in a test with entities being dequeued
with a positive lag larger than their slice, but it can happen for
delayed-dequeue entities too.

The patchset has been tested with rt-app on 37 different use cases; some
are simple and should never trigger any problem, but they have been kept
to increase the test coverage. The tests have been run on a Dragonboard
RB5 with affinity to the biggest cores. The lag has been checked when we
update the entity's lag at dequeue and every time we check whether an
entity is eligible.

              RUN_TO_PARITY    NO_RUN_TO_PARITY
              lag error        lag error
 mainline     14/37            14/37
 + patch 1-2  14/37             0/37
 + patch 3-5   1/37             0/37
 + patch 6     0/37             0/37

Vincent Guittot (6):
  sched/fair: Use protect_slice() instead of direct comparison
  sched/fair: Fix NO_RUN_TO_PARITY case
  sched/fair: Remove spurious shorter slice preemption
  sched/fair: Limit run to parity to the min slice of enqueued entities
  sched/fair: Fix entity's lag with run to parity
  sched/fair: Always trigger resched at the end of a protected period

 kernel/sched/fair.c | 94 ++++++++++++++++++++++++---------------------
 1 file changed, 50 insertions(+), 44 deletions(-)

--
2.43.0
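To make the two ranges concrete, the small standalone check below encodes
the bounds claimed above. It is not part of the series; the 1ms tick and
the nanosecond values are illustrative assumptions.

/* Standalone illustration (not kernel code) of the lag bounds above. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define TICK_NS        1000000LL   /* assumed 1ms tick (HZ=1000) */
#define BASE_SLICE_NS   700000LL   /* the 0.7ms minimum protection */

/* RUN_TO_PARITY: lag must stay within [-(slice + tick), slice + tick]. */
static int lag_in_bounds_rtp(int64_t vlag, int64_t slice)
{
	return vlag >= -(slice + TICK_NS) && vlag <= slice + TICK_NS;
}

/* NO_RUN_TO_PARITY: the bound is max(slice, 0.7ms + tick) on each side. */
static int lag_in_bounds_nrtp(int64_t vlag, int64_t slice)
{
	int64_t bound = slice > BASE_SLICE_NS + TICK_NS ?
			slice : BASE_SLICE_NS + TICK_NS;

	return vlag >= -bound && vlag <= bound;
}

int main(void)
{
	/* 2ms of lag on a 3ms-slice task is fine in both modes. */
	assert(lag_in_bounds_rtp(2000000, 3000000));
	assert(lag_in_bounds_nrtp(2000000, 3000000));
	/* 2ms of lag on a 0.5ms-slice task violates the RUN_TO_PARITY bound. */
	assert(!lag_in_bounds_rtp(2000000, 500000));
	printf("lag bound checks passed\n");
	return 0;
}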
On Fri, Jul 04, 2025 at 04:36:06PM +0200, Vincent Guittot wrote:
> This follows the attempt to better track the maximum lag of tasks in the
> presence of different slice durations:
> [1] https://lore.kernel.org/all/20250418151225.3006867-1-vincent.guittot@linaro.org/
>
> Since v1, tracking of the max slice has been removed from the patchset
> because we now ensure that the lag of an entity remains in the range:
>
>   [-(slice + tick) : (slice + tick)] with run_to_parity
> and
>   [max(-slice, -(0.7ms + tick)) : max(slice, 0.7ms + tick)] without run_to_parity
>
> As a result, there is no need to track the max slice of enqueued entities
> anymore.
>
> Patch 1 is a simple cleanup to ease the following changes.
>
> Patch 2 fixes the lag for NO_RUN_TO_PARITY. It has been put first because
> of its simplicity. The running task gets a minimum protection of 0.7ms
> before EEVDF looks for another task.

The usage of min() on vruntimes is broken; it doesn't work right in the
face of wrapping; use min_vruntime().

Also, perhaps it is time to better document this vlag abuse.

> Patch 3 ensures that the protection is canceled only if the waking task
> will be selected by pick_task_fair(). This case was mentioned by Peter
> while reviewing v1.
>
> Patch 4 modifies the duration of the protection to take into account the
> shortest slice of the enqueued tasks instead of the slice of the running
> task.
>
> Patch 5 fixes the case of tasks that are not eligible at wakeup or after
> migrating but have a shorter slice. We need to update the duration of the
> protection so that it does not exceed the lag.

This has issues with non-determinism; specifically,
update_protect_slice() will use the current ->vruntime, and as such
can unduly push forward the protection window.

> Patch 6 fixes the case of tasks that are still eligible after the
> protected period but where others must run so as not to exceed the lag
> limit. This has been highlighted in a test with entities being dequeued
> with a positive lag larger than their slice, but it can happen for
> delayed-dequeue entities too.

At this point resched_next_quantum() becomes !protect_slice() and can be
removed.

How about something like so? I've probably wrecked the whole
!RUN_TO_PARITY thing -- so that needs to be put back in.

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -579,7 +579,15 @@ struct sched_entity {
 	u64				sum_exec_runtime;
 	u64				prev_sum_exec_runtime;
 	u64				vruntime;
-	s64				vlag;
+	union {
+		/*
+		 * When !@on_rq this field is vlag.
+		 * When cfs_rq->curr == se (which implies @on_rq)
+		 * this field is vprot. See protect_slice().
+		 */
+		s64			vlag;
+		u64			vprot;
+	};
 	u64				slice;
 
 	u64				nr_migrations;
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -882,45 +882,36 @@ struct sched_entity *__pick_first_entity
 }
 
 /*
- * HACK, Set the vruntime up to which an entity can run before picking another
- * one, in vlag, which isn't used until dequeue.
- * In case of run to parity, we use the shortest slice of the enqueued entities
- * to define the longest runtime.
- * When run to parity is disabled, we give a minimum quantum to the running
- * entity to ensure progress.
+ * Take a snapshot of the vruntime at the point the task gets scheduled.
 */
 static inline void set_protect_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	u64 quantum;
-
-	if (sched_feat(RUN_TO_PARITY))
-		quantum = cfs_rq_min_slice(cfs_rq);
-	else
-		quantum = normalized_sysctl_sched_base_slice;
-	quantum = min(quantum, se->slice);
-
-	if (quantum != se->slice)
-		se->vlag = min(se->deadline, se->vruntime + calc_delta_fair(quantum, se));
-	else
-		se->vlag = se->deadline;
+	se->vprot = se->vruntime;
 }
 
-static inline void update_protect_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
+/*
+ * Should we still run @se? It is allowed to run until either se->deadline or
+ * until se->vprot + min_vslice, whichever comes first.
+ */
+static inline bool protect_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	u64 quantum = cfs_rq_min_slice(cfs_rq);
+	u64 min_vslice, deadline = se->deadline;
+	u64 min_slice = cfs_rq_min_slice(cfs_rq);
 
-	se->vlag = min(se->vlag, (s64)(se->vruntime + calc_delta_fair(quantum, se)));
-}
+	if (min_slice != se->slice) {
+		min_vslice = calc_delta_fair(min_slice, se);
+		deadline = min_vruntime(se->deadline, se->vprot + min_vslice);
+	}
 
-static inline bool protect_slice(struct sched_entity *se)
-{
-	return ((s64)(se->vlag - se->vruntime) > 0);
+	WARN_ON_ONCE(!se->on_rq);
+
+	return ((s64)(deadline - se->vruntime) > 0);
 }
 
-static inline void cancel_protect_slice(struct sched_entity *se)
+static inline void cancel_protect_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
-	if (protect_slice(se))
-		se->vlag = se->vruntime;
+	if (protect_slice(cfs_rq, se))
+		se->vprot = se->vruntime - calc_delta_fair(NSEC_PER_SEC, se);
 }
 
 /*
@@ -959,7 +950,7 @@ static struct sched_entity *__pick_eevdf
 	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
 		curr = NULL;
 
-	if (curr && protect && protect_slice(curr))
+	if (curr && protect && protect_slice(cfs_rq, curr))
 		return curr;
 
 	/* Pick the leftmost entity if it's eligible */
@@ -1183,14 +1174,6 @@ static inline void update_curr_task(stru
 	cgroup_account_cputime(p, delta_exec);
 }
 
-static inline bool resched_next_quantum(struct cfs_rq *cfs_rq, struct sched_entity *curr)
-{
-	if (protect_slice(curr))
-		return false;
-
-	return true;
-}
-
 /*
  * Used by other classes to account runtime.
  */
@@ -1251,7 +1234,7 @@ static void update_curr(struct cfs_rq *c
 	if (cfs_rq->nr_queued == 1)
 		return;
 
-	if (resched || resched_next_quantum(cfs_rq, curr)) {
+	if (resched || !protect_slice(cfs_rq, curr)) {
 		resched_curr_lazy(rq);
 		clear_buddies(cfs_rq, curr);
 	}
@@ -8729,7 +8712,7 @@ static void check_preempt_wakeup_fair(st
 	 * If @p has a shorter slice than current and @p is eligible, override
 	 * current's slice protection in order to allow preemption.
 	 */
-	do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice);
+	do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice);
 
 	/*
 	 * If @p has become the most eligible task, force preemption.
@@ -8737,14 +8720,11 @@ static void check_preempt_wakeup_fair(st
 	if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse)
 		goto preempt;
 
-	if (sched_feat(RUN_TO_PARITY) && do_preempt_short)
-		update_protect_slice(cfs_rq, se);
-
 	return;
 
 preempt:
 	if (do_preempt_short)
-		cancel_protect_slice(se);
+		cancel_protect_slice(cfs_rq, se);
 
 	resched_curr_lazy(rq);
 }
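The wrapping problem Peter points at is worth spelling out: vruntime is an
unsigned 64-bit quantity that is compared modulo 2^64, so a plain min()
picks the wrong value once one operand has wrapped past U64_MAX. The demo
below is a standalone sketch, not kernel code, but the helper mirrors
min_vruntime() as found in kernel/sched/fair.c.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Mirrors min_vruntime() in kernel/sched/fair.c: the signed delta
 * orders two vruntimes correctly even across u64 wraparound. */
static inline uint64_t min_vruntime(uint64_t min_vr, uint64_t vruntime)
{
	int64_t delta = (int64_t)(vruntime - min_vr);

	if (delta < 0)
		min_vr = vruntime;

	return min_vr;
}

int main(void)
{
	uint64_t a = UINT64_MAX - 100;	/* just before the wrap */
	uint64_t b = a + 200;		/* 200ns later, wrapped to 99 */

	/* Plain min() claims the wrapped value 99 is earlier: wrong. */
	printf("plain min   : %" PRIu64 "\n", a < b ? a : b);
	/* The signed delta sees b as 200ns ahead, so a is kept: right. */
	printf("min_vruntime: %" PRIu64 "\n", min_vruntime(a, b));
	return 0;
}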
On Mon, 7 Jul 2025 at 16:14, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Jul 04, 2025 at 04:36:06PM +0200, Vincent Guittot wrote:
> > This follows the attempt to better track the maximum lag of tasks in the
> > presence of different slice durations:
> > [1] https://lore.kernel.org/all/20250418151225.3006867-1-vincent.guittot@linaro.org/
> >
> > Since v1, tracking of the max slice has been removed from the patchset
> > because we now ensure that the lag of an entity remains in the range:
> >
> >   [-(slice + tick) : (slice + tick)] with run_to_parity
> > and
> >   [max(-slice, -(0.7ms + tick)) : max(slice, 0.7ms + tick)] without run_to_parity
> >
> > As a result, there is no need to track the max slice of enqueued entities
> > anymore.
> >
> > Patch 1 is a simple cleanup to ease the following changes.
> >
> > Patch 2 fixes the lag for NO_RUN_TO_PARITY. It has been put first because
> > of its simplicity. The running task gets a minimum protection of 0.7ms
> > before EEVDF looks for another task.
>
> The usage of min() on vruntimes is broken; it doesn't work right in the
> face of wrapping; use min_vruntime().

Good point. Don't know why I didn't use it.

> Also, perhaps it is time to better document this vlag abuse.

Will add something.

> > Patch 3 ensures that the protection is canceled only if the waking task
> > will be selected by pick_task_fair(). This case was mentioned by Peter
> > while reviewing v1.
> >
> > Patch 4 modifies the duration of the protection to take into account the
> > shortest slice of the enqueued tasks instead of the slice of the running
> > task.
> >
> > Patch 5 fixes the case of tasks that are not eligible at wakeup or after
> > migrating but have a shorter slice. We need to update the duration of the
> > protection so that it does not exceed the lag.
>
> This has issues with non-determinism; specifically,
> update_protect_slice() will use the current ->vruntime, and as such
> can unduly push forward the protection window.

se->vprot = min_vruntime(se->vprot, (se->vruntime +
calc_delta_fair(quantum, se)));

The min_vruntime() (previously min()) with the current vprot (previously
vlag) should prevent pushing the protection forward. We can only reduce
the vlag further, never increase it.

> > Patch 6 fixes the case of tasks that are still eligible after the
> > protected period but where others must run so as not to exceed the lag
> > limit. This has been highlighted in a test with entities being dequeued
> > with a positive lag larger than their slice, but it can happen for
> > delayed-dequeue entities too.
>
> At this point resched_next_quantum() becomes !protect_slice() and can be
> removed.

Yes, it was the last remaining fix and I forgot to simplify.

> How about something like so? I've probably wrecked the whole
> !RUN_TO_PARITY thing -- so that needs to be put back in.
>
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -579,7 +579,15 @@ struct sched_entity {
>  	u64				sum_exec_runtime;
>  	u64				prev_sum_exec_runtime;
>  	u64				vruntime;
> -	s64				vlag;
> +	union {
> +		/*
> +		 * When !@on_rq this field is vlag.
> +		 * When cfs_rq->curr == se (which implies @on_rq)
> +		 * this field is vprot. See protect_slice().
> +		 */
> +		s64			vlag;
> +		u64			vprot;
> +	};
>  	u64				slice;
> 
>  	u64				nr_migrations;
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -882,45 +882,36 @@ struct sched_entity *__pick_first_entity
>  }
> 
>  /*
> - * HACK, Set the vruntime up to which an entity can run before picking another
> - * one, in vlag, which isn't used until dequeue.
> - * In case of run to parity, we use the shortest slice of the enqueued entities
> - * to define the longest runtime.
> - * When run to parity is disabled, we give a minimum quantum to the running
> - * entity to ensure progress.
> + * Take a snapshot of the vruntime at the point the task gets scheduled.
>  */
>  static inline void set_protect_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> -	u64 quantum;
> -
> -	if (sched_feat(RUN_TO_PARITY))
> -		quantum = cfs_rq_min_slice(cfs_rq);
> -	else
> -		quantum = normalized_sysctl_sched_base_slice;
> -	quantum = min(quantum, se->slice);
> -
> -	if (quantum != se->slice)
> -		se->vlag = min(se->deadline, se->vruntime + calc_delta_fair(quantum, se));
> -	else
> -		se->vlag = se->deadline;
> +	se->vprot = se->vruntime;
>  }
> 
> -static inline void update_protect_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
> +/*
> + * Should we still run @se? It is allowed to run until either se->deadline or
> + * until se->vprot + min_vslice, whichever comes first.
> + */
> +static inline bool protect_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> -	u64 quantum = cfs_rq_min_slice(cfs_rq);
> +	u64 min_vslice, deadline = se->deadline;
> +	u64 min_slice = cfs_rq_min_slice(cfs_rq);
> 
> -	se->vlag = min(se->vlag, (s64)(se->vruntime + calc_delta_fair(quantum, se)));
> -}
> +	if (min_slice != se->slice) {
> +		min_vslice = calc_delta_fair(min_slice, se);
> +		deadline = min_vruntime(se->deadline, se->vprot + min_vslice);

I didn't go into that direction because protect_slice() is called far
more often than set_protect_slice() or update_protect_slice().

> +	}
> 
> -static inline bool protect_slice(struct sched_entity *se)
> -{
> -	return ((s64)(se->vlag - se->vruntime) > 0);
> +	WARN_ON_ONCE(!se->on_rq);
> +
> +	return ((s64)(deadline - se->vruntime) > 0);
>  }
> 
> -static inline void cancel_protect_slice(struct sched_entity *se)
> +static inline void cancel_protect_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> -	if (protect_slice(se))
> -		se->vlag = se->vruntime;
> +	if (protect_slice(cfs_rq, se))
> +		se->vprot = se->vruntime - calc_delta_fair(NSEC_PER_SEC, se);
>  }
> 
>  /*
> @@ -959,7 +950,7 @@ static struct sched_entity *__pick_eevdf
>  	if (curr && (!curr->on_rq || !entity_eligible(cfs_rq, curr)))
>  		curr = NULL;
> 
> -	if (curr && protect && protect_slice(curr))
> +	if (curr && protect && protect_slice(cfs_rq, curr))
>  		return curr;
> 
>  	/* Pick the leftmost entity if it's eligible */
> @@ -1183,14 +1174,6 @@ static inline void update_curr_task(stru
>  	cgroup_account_cputime(p, delta_exec);
>  }
> 
> -static inline bool resched_next_quantum(struct cfs_rq *cfs_rq, struct sched_entity *curr)
> -{
> -	if (protect_slice(curr))
> -		return false;
> -
> -	return true;
> -}
> -
>  /*
>   * Used by other classes to account runtime.
>   */
> @@ -1251,7 +1234,7 @@ static void update_curr(struct cfs_rq *c
>  	if (cfs_rq->nr_queued == 1)
>  		return;
> 
> -	if (resched || resched_next_quantum(cfs_rq, curr)) {
> +	if (resched || !protect_slice(cfs_rq, curr)) {
>  		resched_curr_lazy(rq);
>  		clear_buddies(cfs_rq, curr);
>  	}
> @@ -8729,7 +8712,7 @@ static void check_preempt_wakeup_fair(st
>  	 * If @p has a shorter slice than current and @p is eligible, override
>  	 * current's slice protection in order to allow preemption.
>  	 */
> -	do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice);
> +	do_preempt_short = sched_feat(PREEMPT_SHORT) && (pse->slice < se->slice);
> 
>  	/*
>  	 * If @p has become the most eligible task, force preemption.
> @@ -8737,14 +8720,11 @@ static void check_preempt_wakeup_fair(st
>  	if (__pick_eevdf(cfs_rq, !do_preempt_short) == pse)
>  		goto preempt;
> 
> -	if (sched_feat(RUN_TO_PARITY) && do_preempt_short)
> -		update_protect_slice(cfs_rq, se);
> -
>  	return;
> 
>  preempt:
>  	if (do_preempt_short)
> -		cancel_protect_slice(se);
> +		cancel_protect_slice(cfs_rq, se);
> 
>  	resched_curr_lazy(rq);
>  }
On Mon, Jul 07, 2025 at 05:49:05PM +0200, Vincent Guittot wrote:

> > > Patch 5 fixes the case of tasks that are not eligible at wakeup or after
> > > migrating but have a shorter slice. We need to update the duration of the
> > > protection so that it does not exceed the lag.
> >
> > This has issues with non-determinism; specifically,
> > update_protect_slice() will use the current ->vruntime, and as such
> > can unduly push forward the protection window.
> 
> se->vprot = min_vruntime(se->vprot, (se->vruntime +
> calc_delta_fair(quantum, se)));
> 
> The min_vruntime() (previously min()) with the current vprot (previously
> vlag) should prevent pushing the protection forward. We can only reduce
> the vlag further, never increase it.

Fair enough.

> > +/*
> > + * Should we still run @se? It is allowed to run until either se->deadline or
> > + * until se->vprot + min_vslice, whichever comes first.
> > + */
> > +static inline bool protect_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
> > {
> > +	u64 min_vslice, deadline = se->deadline;
> > +	u64 min_slice = cfs_rq_min_slice(cfs_rq);
> >
> > +	if (min_slice != se->slice) {
> > +		min_vslice = calc_delta_fair(min_slice, se);
> > +		deadline = min_vruntime(se->deadline, se->vprot + min_vslice);
> 
> I didn't go into that direction because protect_slice() is called far
> more often than set_protect_slice() or update_protect_slice().

Right.

> > +	}
> >
> > +	WARN_ON_ONCE(!se->on_rq);
> > +
> > +	return ((s64)(deadline - se->vruntime) > 0);
> > }

Anyway, I see you posted a new version. Let me go have a look.
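For reference, the clamp Vincent describes expands to something like the
sketch below. The names (se->vprot, cfs_rq_min_slice(), calc_delta_fair())
are taken from the thread; treat this as an illustration of the v2
approach under discussion, not the final merged code.

static inline void update_protect_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	u64 quantum = cfs_rq_min_slice(cfs_rq);

	/*
	 * min_vruntime() keeps whichever bound is earlier in virtual time,
	 * so repeated calls can only shrink the protected window, never
	 * extend it -- which is why the clamp avoids the non-determinism
	 * Peter raised about reading the current ->vruntime.
	 */
	se->vprot = min_vruntime(se->vprot,
				 se->vruntime + calc_delta_fair(quantum, se));
}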