Run to parity ensures that current will get a chance to run its full
slice in one go, but this can create large latency and/or lag for
entities with shorter slices that have exhausted their previous slice
and are waiting to run their next one.
Clamp the run to parity to the shortest slice of all enqueued entities.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
kernel/sched/fair.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7e82b357763a..85238f2e026a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -884,16 +884,20 @@ struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
/*
* Set the vruntime up to which an entity can run before looking
* for another entity to pick.
- * In case of run to parity, we protect the entity up to its deadline.
+ * In case of run to parity, we use the shortest slice of the enqueued
+ * entities to set the protected period.
* When run to parity is disabled, we give a minimum quantum to the running
* entity to ensure progress.
*/
static inline void set_protect_slice(struct sched_entity *se)
{
-        u64 quantum = se->slice;
+        u64 quantum;
-        if (!sched_feat(RUN_TO_PARITY))
-                quantum = min(quantum, normalized_sysctl_sched_base_slice);
+        if (sched_feat(RUN_TO_PARITY))
+                quantum = cfs_rq_min_slice(cfs_rq_of(se));
+        else
+                quantum = normalized_sysctl_sched_base_slice;
+        quantum = min(quantum, se->slice);
         if (quantum != se->slice)
                 se->vprot = min_vruntime(se->deadline, se->vruntime + calc_delta_fair(quantum, se));
--
2.43.0
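As a minimal userspace illustration of the new protection rule (not kernel code; the 3 ms base slice and the task slices below are just assumed example values): under RUN_TO_PARITY the window current is protected for becomes the shortest enqueued slice, otherwise the base slice, and in both cases never more than its own slice.

/* Userspace sketch of the protection-window rule introduced by this patch.
 * All values are plain nanoseconds; weight scaling (calc_delta_fair) is
 * ignored for simplicity.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t min_u64(uint64_t a, uint64_t b)
{
        return a < b ? a : b;
}

/* Protection window as computed by the patched set_protect_slice(). */
static uint64_t protect_window(bool run_to_parity, uint64_t own_slice,
                               uint64_t min_enqueued_slice, uint64_t base_slice)
{
        uint64_t quantum = run_to_parity ? min_enqueued_slice : base_slice;

        return min_u64(quantum, own_slice);        /* never beyond own slice */
}

int main(void)
{
        const uint64_t ms = 1000ULL * 1000ULL;     /* 1 ms in nanoseconds */
        uint64_t own = 10 * ms;                    /* current's custom slice */
        uint64_t shortest = 1 * ms;                /* shortest enqueued slice */
        uint64_t base = 3 * ms;                    /* assumed base slice */

        printf("RUN_TO_PARITY:    %llu ns\n",
               (unsigned long long)protect_window(true, own, shortest, base));
        printf("NO_RUN_TO_PARITY: %llu ns\n",
               (unsigned long long)protect_window(false, own, shortest, base));
        return 0;
}

With these example numbers, a 10 ms task sharing a runqueue with a 1 ms task is only protected for 1 ms at a time under RUN_TO_PARITY.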
Hi Vincent,

On 08/07/25 22:26, Vincent Guittot wrote:
> Run to parity ensures that current will get a chance to run its full
> slice in one go, but this can create large latency and/or lag for
> entities with shorter slices that have exhausted their previous slice
> and are waiting to run their next one.
>
> Clamp the run to parity to the shortest slice of all enqueued entities.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
[...]
>  /*
>   * Set the vruntime up to which an entity can run before looking
>   * for another entity to pick.
> - * In case of run to parity, we protect the entity up to its deadline.
> + * In case of run to parity, we use the shortest slice of the enqueued
> + * entities to set the protected period.
>   * When run to parity is disabled, we give a minimum quantum to the running
>   * entity to ensure progress.
>   */

If I set my task’s custom slice to a larger value but another task has a
smaller slice, this change will cap my protected window to the smaller
slice. Does that mean my custom slice is no longer honored?

Thanks,
Madadi Vineeth Reddy
On Thu, 10 Jul 2025 at 09:00, Madadi Vineeth Reddy <vineethr@linux.ibm.com> wrote:
>
> Hi Vincent,
>
> If I set my task’s custom slice to a larger value but another task has a
> smaller slice, this change will cap my protected window to the smaller
> slice. Does that mean my custom slice is no longer honored?

What do you mean by honored? EEVDF never mandates that a request of
size slice will be done in one go. Slice mainly defines the deadline
and orders the entities, but it does not guarantee that your slice will
always run in one go. Run to parity tries to minimize the number of
context switches between runnable tasks but must not break fairness and
the lag theorem. So if your task A has a slice of 10ms and task B wakes
up with a slice of 1ms, B will preempt A because its deadline is
earlier. If task B still wants to run after its slice is exhausted, it
will not be eligible and task A will run until task B becomes eligible
again, which takes as long as task B's slice.

> Thanks,
> Madadi Vineeth Reddy
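The A/B example above can be checked with a small equal-weight sketch (illustration only: real vruntime is weight-scaled via calc_delta_fair(), but with equal weights runtime maps 1:1 to vruntime, and eligibility reduces to v_i <= V, the average vruntime).

/* Worked example: A has a 10 ms slice, B a 1 ms slice, equal weights.
 * After B runs its slice, it stays ineligible until A has run as much,
 * so A's uninterrupted run is bounded by B's slice.
 */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
        const uint64_t ms = 1000ULL * 1000ULL;    /* 1 ms in nanoseconds */
        uint64_t v_a = 0, v_b = 0;

        /* B wakes up, preempts A (earlier deadline) and runs its 1 ms slice. */
        v_b += 1 * ms;

        /* A now runs; B becomes eligible again once v_b <= V, where V is the
         * equal-weight average vruntime (v_a + v_b) / 2.
         */
        uint64_t ran_a = 0;
        const uint64_t step = ms / 10;            /* 0.1 ms granularity */

        while (v_b > (v_a + v_b) / 2) {
                v_a += step;
                ran_a += step;
        }

        printf("A ran %.1f ms before B became eligible again\n",
               (double)ran_a / ms);
        return 0;
}

It prints that A runs 1.0 ms, i.e. exactly task B's slice, before B is eligible again.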
> > If I set my task’s custom slice to a larger value but another task has a
> > smaller slice, this change will cap my protected window to the smaller
> > slice. Does that mean my custom slice is no longer honored?
>
> What do you mean by honored? EEVDF never mandates that a request of
> size slice will be done in one go. Slice mainly defines the deadline
> and orders the entities, but it does not guarantee that your slice will
> always run in one go.

Right. And if you don't want wakeup preemption, we've got SCHED_BATCH
for you.
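For reference, opting a task into SCHED_BATCH is plain sched_setscheduler(2); a minimal sketch:

/* Switch the calling thread to SCHED_BATCH so it is treated as CPU-bound
 * and is not favoured by wakeup preemption. Standard Linux API as documented
 * in sched_setscheduler(2); SCHED_BATCH requires priority 0.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
        struct sched_param param = { .sched_priority = 0 };

        if (sched_setscheduler(0, SCHED_BATCH, &param) == -1) {
                perror("sched_setscheduler(SCHED_BATCH)");
                return 1;
        }

        printf("now running under SCHED_BATCH\n");
        /* ... batch/throughput-oriented work would go here ... */
        return 0;
}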
Hi Vincent, Peter,

On 10/07/25 18:04, Peter Zijlstra wrote:
>
> Right. And if you don't want wakeup preemption, we've got SCHED_BATCH
> for you.

Thanks for the explanation. I now understand that the slice is only used
for deadline calculation and ordering of eligible tasks.

Before your patch, I observed that each task ran for its full custom slice
before preemption, which led me to assume that slice directly controlled
uninterrupted runtime.

With the patch series applied and RUN_TO_PARITY=true, I now see the expected
behavior:
- Default slice (~2.8 ms): tasks run ~3 ms each.
- Increasing one task's slice doesn't extend its single-run duration; it
  remains ~3 ms.
- Decreasing one task's slice shortens everyone's run to that new minimum.

With this patch series and NO_RUN_TO_PARITY, I see runtimes near 1 ms
(CONFIG_HZ=1000).

However, without your patches, I was still seeing ~3 ms runs even with
NO_RUN_TO_PARITY, which confused me because I expected runtime to drop to
~1 ms (preempt at every tick) rather than run up to the default slice.

Without your patches, RUN_TO_PARITY behaves as expected: a task runs up to
its slice when eligible.

I ran these tests with 16 stress-ng threads pinned to one CPU.

Please let me know if my understanding is incorrect, and why I was still
seeing ~3 ms runtimes with NO_RUN_TO_PARITY before this patch series.

Thanks,
Madadi Vineeth Reddy
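For anyone reproducing the experiment above, a sketch of how a per-task custom slice can be requested, assuming a kernel where sched_setattr(2) takes the slice hint in sched_attr::sched_runtime for SCHED_OTHER tasks. There is no glibc wrapper, so the raw syscall is used; the struct below is a local copy of the older uapi layout, named sched_attr_compat here to avoid clashing with system headers.

/* Hedged sketch: request a 1 ms slice for the calling SCHED_OTHER task via
 * sched_setattr(2), assuming sched_runtime is interpreted as the slice hint.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

struct sched_attr_compat {              /* mirrors the older uapi layout */
        uint32_t size;
        uint32_t sched_policy;
        uint64_t sched_flags;
        int32_t  sched_nice;
        uint32_t sched_priority;
        uint64_t sched_runtime;         /* slice hint, in nanoseconds */
        uint64_t sched_deadline;
        uint64_t sched_period;
};

int main(void)
{
        struct sched_attr_compat attr;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.sched_policy = SCHED_OTHER;        /* stay a fair task */
        attr.sched_runtime = 1 * 1000 * 1000;   /* request a 1 ms slice */

        if (syscall(SYS_sched_setattr, 0, &attr, 0) == -1) {
                perror("sched_setattr");
                return 1;
        }

        printf("requested a 1 ms slice for this task\n");
        return 0;
}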
Hi Madadi,

Sorry for the late reply but I have limited network access at the moment.

On Sun, 13 Jul 2025 at 20:17, Madadi Vineeth Reddy <vineethr@linux.ibm.com> wrote:
>
> However, without your patches, I was still seeing ~3 ms runs even with
> NO_RUN_TO_PARITY, which confused me because I expected runtime to drop to
> ~1 ms (preempt at every tick) rather than run up to the default slice.
>
> Please let me know if my understanding is incorrect, and why I was still
> seeing ~3 ms runtimes with NO_RUN_TO_PARITY before this patch series.

Before my patchset, both NO_RUN_TO_PARITY and RUN_TO_PARITY were wrong.
Patch 2 fixes NO_RUN_TO_PARITY and the other patches fix RUN_TO_PARITY.

> Thanks,
> Madadi Vineeth Reddy
On 13/07/25 23:47, Madadi Vineeth Reddy wrote:
> With the patch series applied and RUN_TO_PARITY=true, I now see the expected
> behavior:
> - Default slice (~2.8 ms): tasks run ~3 ms each.
> - Increasing one task's slice doesn't extend its single-run duration; it
>   remains ~3 ms.
> - Decreasing one task's slice shortens everyone's run to that new minimum.

Hi Vincent,

Just following up on my earlier question: with the patch applied (and
RUN_TO_PARITY=true), reducing one task's slice now clamps the runtime of
all tasks on that runqueue to the new minimum. (By "runtime" I mean the
continuous time a task runs before preemption.) Could this negatively
impact throughput-oriented workloads where the remaining threads need a
longer run time before preemption?

I understand that slice is only used for ordering of deadlines, but I am
curious about its effect in scenarios like this.

Thanks,
Madadi Vineeth Reddy
On Sun, 20 Jul 2025 at 12:57, Madadi Vineeth Reddy <vineethr@linux.ibm.com> wrote:
>
> Just following up on my earlier question: with the patch applied (and
> RUN_TO_PARITY=true), reducing one task's slice now clamps the runtime of
> all tasks on that runqueue to the new minimum. (By "runtime" I mean the
> continuous time a task runs before preemption.) Could this negatively
> impact throughput-oriented workloads where the remaining threads need a
> longer run time before preemption?

Probably, but it is also expected that tasks which have shorter slices
don't want to run forever. The shorter runtime will only apply while
the task is runnable, and such a task should run first (or nearly
first) and go back to sleep, so its impact should be small.

I agree that if you have an always-running task which sets its slice
to 1ms, it will increase the number of context switches for the other
tasks, which haven't asked for a shorter slice, but we can't do much
against that.

>
> I understand that slice is only used for ordering of deadlines, but I am
> curious about its effect in scenarios like this.
Hi Vincent,

On 21/07/25 14:41, Vincent Guittot wrote:
> Probably, but it is also expected that tasks which have shorter slices
> don't want to run forever. The shorter runtime will only apply while
> the task is runnable, and such a task should run first (or nearly
> first) and go back to sleep, so its impact should be small.
>
> I agree that if you have an always-running task which sets its slice
> to 1ms, it will increase the number of context switches for the other
> tasks, which haven't asked for a shorter slice, but we can't do much
> against that.

Understood, thank you for the clarification. Since fairness is the first
priority, I see that there's not much that can be done in the
"always running" case.

Thanks again for the detailed explanation.

Thanks,
Madadi Vineeth Reddy
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 052c3d87c82ea4ee83232b747512847b4e8c9976
Gitweb: https://git.kernel.org/tip/052c3d87c82ea4ee83232b747512847b4e8c9976
Author: Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Tue, 08 Jul 2025 18:56:28 +02:00
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 09 Jul 2025 13:40:23 +02:00
sched/fair: Limit run to parity to the min slice of enqueued entities
Run to parity ensures that current will get a chance to run its full
slice in one go, but this can create large latency and/or lag for
entities with shorter slices that have exhausted their previous slice
and are waiting to run their next one.
Clamp the run to parity to the shortest slice of all enqueued entities.
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20250708165630.1948751-5-vincent.guittot@linaro.org
---
kernel/sched/fair.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 96718b3..45e057f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -884,18 +884,20 @@ struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
/*
* Set the vruntime up to which an entity can run before looking
* for another entity to pick.
- * In case of run to parity, we protect the entity up to its deadline.
+ * In case of run to parity, we use the shortest slice of the enqueued
+ * entities to set the protected period.
* When run to parity is disabled, we give a minimum quantum to the running
* entity to ensure progress.
*/
static inline void set_protect_slice(struct sched_entity *se)
{
-        u64 slice = se->slice;
+        u64 slice = normalized_sysctl_sched_base_slice;
         u64 vprot = se->deadline;
-        if (!sched_feat(RUN_TO_PARITY))
-                slice = min(slice, normalized_sysctl_sched_base_slice);
+        if (sched_feat(RUN_TO_PARITY))
+                slice = cfs_rq_min_slice(cfs_rq_of(se));
+        slice = min(slice, se->slice);
         if (slice != se->slice)
                 vprot = min_vruntime(vprot, se->vruntime + calc_delta_fair(slice, se));
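Assembling the merged hunk, set_protect_slice() ends up roughly as the sketch below. The helpers (cfs_rq_of(), cfs_rq_min_slice(), calc_delta_fair(), min_vruntime()) are kernel-internal, and the trailing assignment of se->vprot lies outside the hunk shown above, so it is an assumption here.

/* Sketch of set_protect_slice() with the merged hunk applied; context lines
 * outside the hunk, notably "se->vprot = vprot;", are assumed.
 */
static inline void set_protect_slice(struct sched_entity *se)
{
        u64 slice = normalized_sysctl_sched_base_slice;
        u64 vprot = se->deadline;

        if (sched_feat(RUN_TO_PARITY))
                /* Protect current only up to the shortest enqueued slice. */
                slice = cfs_rq_min_slice(cfs_rq_of(se));

        /* Never protect for longer than the entity's own slice. */
        slice = min(slice, se->slice);

        if (slice != se->slice)
                vprot = min_vruntime(vprot, se->vruntime + calc_delta_fair(slice, se));

        se->vprot = vprot;
}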