This follows the attempt to better track the maximum lag of tasks in the
presence of different slice durations:
[1] https://lore.kernel.org/all/20250418151225.3006867-1-vincent.guittot@linaro.org/

Patch 1 is a simple cleanup to ease the following changes.

Patch 2 uses Peter's proposal made in [1] to track the max slice of
enqueued tasks and use it to clamp the lag of dequeued tasks.

Patch 3 modifies the protection of the current task's slice to take into
account the case where tasks with a shorter slice are waiting to run on
the CPU.

Patch 4 extends the slice protection mechanism to the NO_RUN_TO_PARITY
case to ensure that a resched is set regularly. Currently, the resched is
set only when curr has elapsed its slice, which is partly similar to
RUN_TO_PARITY. Now the running task has a minimum time quantum (0.7ms)
before EEVDF looks for another task to run, with the exception of the
PREEMPT_SHORT case, which remains valid.

Peter Zijlstra (1):
  sched/fair: Increase max lag clamping

Vincent Guittot (3):
  sched/fair: Use protect_slice() instead of direct comparison
  sched/fair: Limit run to parity to the min slice of enqueued entities
  sched/fair: Improve NO_RUN_TO_PARITY

 include/linux/sched.h |  1 +
 kernel/sched/fair.c   | 76 ++++++++++++++++++++++++++++++++++---------
 2 files changed, 61 insertions(+), 16 deletions(-)

--
2.43.0
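To make the patch 2 idea concrete, here is a rough standalone model of
clamping a dequeued task's lag against the largest slice seen among the
entities it ran with. The field and helper names are made up for
illustration and the numbers are arbitrary; this is not the kernel
implementation:

/* Toy model of clamping a dequeued entity's lag to the largest slice
 * seen among enqueued entities. Standalone C, illustrative only. */
#include <stdio.h>

struct entity {
	long long vlag;   /* virtual lag at dequeue, in ns */
	long long slice;  /* requested slice, in ns */
};

/* runqueue-wide maximum slice among currently enqueued entities */
static long long rq_max_slice;

static void enqueue(struct entity *se)
{
	if (se->slice > rq_max_slice)
		rq_max_slice = se->slice;
}

static long long clamp_ll(long long v, long long lo, long long hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/* On dequeue, bound the stored lag by the max slice of the peers the
 * entity actually competed with, instead of only its own slice. */
static void dequeue(struct entity *se)
{
	long long max = se->slice > rq_max_slice ? se->slice : rq_max_slice;
	long long limit = 2 * max;

	se->vlag = clamp_ll(se->vlag, -limit, limit);
}

int main(void)
{
	struct entity big   = { .vlag = 0,        .slice = 3000000 }; /* 3ms */
	struct entity small = { .vlag = -9000000, .slice = 700000  }; /* 0.7ms */

	enqueue(&big);
	enqueue(&small);

	dequeue(&small);
	/* -6000000 == -(2 * 3ms); clamping only by small's own slice
	 * would have given -(2 * 0.7ms) instead */
	printf("clamped vlag: %lld\n", small.vlag);
	return 0;
}

The point is that a short-slice task that built up lag while competing
with a long-slice task keeps a lag bounded by the long slice, not only
by its own.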
On Fri, Jun 13, 2025 at 04:05:10PM +0200, Vincent Guittot wrote:
> Vincent Guittot (3):
>   sched/fair: Use protect_slice() instead of direct comparison
>   sched/fair: Limit run to parity to the min slice of enqueued entities
>   sched/fair: Improve NO_RUN_TO_PARITY

Ah. I wrote these here patches and then totally forgot about them :/.
They take a different approach.

The approach I took was to move the decision to stick with curr to after
the pick, instead of before it. That way we can evaluate the tree at the
time of preemption.
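A minimal sketch of the ordering described above, assuming a simplified
deadline-only comparison: the best candidate is picked first, and only
then do we decide whether curr keeps its protected slice. Names and the
comparison are illustrative, not the actual kernel code:

/* Standalone toy of the "decide after pick" ordering. */
#include <stdio.h>
#include <stddef.h>

struct entity {
	const char *name;
	long long deadline;        /* virtual deadline */
	long long protected_until; /* end of slice protection */
};

static struct entity *pick_best(struct entity *q, int n)
{
	struct entity *best = NULL;

	for (int i = 0; i < n; i++)
		if (!best || q[i].deadline < best->deadline)
			best = &q[i];
	return best;
}

static struct entity *pick_next(struct entity *curr, long long now,
				struct entity *q, int n)
{
	struct entity *best = pick_best(q, n);

	/* Keep curr only if it is still protected AND the best candidate,
	 * which includes any freshly woken task, does not have an earlier
	 * deadline. The "before pick" ordering would return curr without
	 * ever looking at the queue. */
	if (curr && now < curr->protected_until &&
	    (!best || curr->deadline <= best->deadline))
		return curr;

	return best ? best : curr;
}

int main(void)
{
	struct entity curr    = { "A", 4000, 3000 };
	struct entity woken[] = { { "B", 1500, 0 } }; /* short-slice wakeup */

	struct entity *next = pick_next(&curr, 1000, woken, 1);
	printf("next: %s\n", next->name); /* "B": the wakeup wins despite protection */
	return 0;
}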
On Tue, Jun 17, 2025 at 11:22:08AM +0200, Peter Zijlstra wrote:
> On Fri, Jun 13, 2025 at 04:05:10PM +0200, Vincent Guittot wrote:
> > Vincent Guittot (3):
> >   sched/fair: Use protect_slice() instead of direct comparison
> >   sched/fair: Limit run to parity to the min slice of enqueued entities
> >   sched/fair: Improve NO_RUN_TO_PARITY
>
> Ah. I wrote these here patches and then totally forgot about them :/.
> They take a different approach.

Mind sticking them on a git tree somewhere please?

Dhaval
On Tue, 17 Jun 2025 at 11:22, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Jun 13, 2025 at 04:05:10PM +0200, Vincent Guittot wrote:
> > Vincent Guittot (3):
> >   sched/fair: Use protect_slice() instead of direct comparison
> >   sched/fair: Limit run to parity to the min slice of enqueued entities
> >   sched/fair: Improve NO_RUN_TO_PARITY
>
> Ah. I wrote these here patches and then totally forgot about them :/.
> They take a different approach.
>
> The approach I took was to move the decision to stick with curr to after
> the pick, instead of before it. That way we can evaluate the tree at the
> time of preemption.

Let me have a look at your patches
On Wed, 18 Jun 2025 at 09:03, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>
> On Tue, 17 Jun 2025 at 11:22, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Fri, Jun 13, 2025 at 04:05:10PM +0200, Vincent Guittot wrote:
> > > Vincent Guittot (3):
> > >   sched/fair: Use protect_slice() instead of direct comparison
> > >   sched/fair: Limit run to parity to the min slice of enqueued entities
> > >   sched/fair: Improve NO_RUN_TO_PARITY
> >
> > Ah. I wrote these here patches and then totally forgot about them :/.
> > They take a different approach.
> >
> > The approach I took was to move the decision to stick with curr to after
> > the pick, instead of before it. That way we can evaluate the tree at the
> > time of preemption.
>
> Let me have a look at your patches

I have looked at and tested your patches, but they don't solve the lag
and run to parity issues; not sure what is going wrong.

Also, my patchset takes into account the NO_RUN_TO_PARITY case by adding
a notion of quantum execution time, which was missing until now.

Regarding the "fix delayed requeue", I already get an update of current
before requeueing a delayed task. Do you have a use case in mind?
On Thu, Jun 19, 2025 at 02:27:43PM +0200, Vincent Guittot wrote:
> On Wed, 18 Jun 2025 at 09:03, Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
> >
> > On Tue, 17 Jun 2025 at 11:22, Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Fri, Jun 13, 2025 at 04:05:10PM +0200, Vincent Guittot wrote:
> > > > Vincent Guittot (3):
> > > >   sched/fair: Use protect_slice() instead of direct comparison
> > > >   sched/fair: Limit run to parity to the min slice of enqueued entities
> > > >   sched/fair: Improve NO_RUN_TO_PARITY
> > >
> > > Ah. I wrote these here patches and then totally forgot about them :/.
> > > They take a different approach.
> > >
> > > The approach I took was to move the decision to stick with curr to after
> > > the pick, instead of before it. That way we can evaluate the tree at the
> > > time of preemption.
> >
> > Let me have a look at your patches
>
> I have looked at and tested your patches, but they don't solve the lag
> and run to parity issues; not sure what is going wrong.

Humm.. So what you do in patch 3, setting the protection to min_slice
instead of the deadline, only takes into account the tasks present at
the point we schedule.

Which is why I approached it by moving the protection to after the pick;
because then we can directly compare the task we're running to the best
pick -- which includes the tasks that got woken. This gives
check_preempt_wakeup_fair() better chances.

To be fair, I did not get around to testing the patches much beyond
booting them, so quite possibly they're buggered :-/

> Also, my patchset takes into account the NO_RUN_TO_PARITY case by adding
> a notion of quantum execution time, which was missing until now.

Right; not ideal, but I suppose for the people that disable
RUN_TO_PARITY it might make sense. But perhaps there should be a little
more justification for why we bother tweaking a non-default option.

The problem with the usage of normalized_sysctl_ values is that you then
get behavioural differences between 1 and 8 CPUs or so. Also, perhaps
it's time to just nuke that whole scaling thing (I'm sure someone
mentioned that a short while ago).

> Regarding the "fix delayed requeue", I already get an update of current
> before requeueing a delayed task. Do you have a use case in mind?

Ah, it was just from reading code, clearly I missed something. Happy to
forget about that patch :-)
On Fri, 20 Jun 2025 at 10:42, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Jun 19, 2025 at 02:27:43PM +0200, Vincent Guittot wrote:
> > On Wed, 18 Jun 2025 at 09:03, Vincent Guittot
> > <vincent.guittot@linaro.org> wrote:
> > >
> > > On Tue, 17 Jun 2025 at 11:22, Peter Zijlstra <peterz@infradead.org> wrote:
> > > >
> > > > On Fri, Jun 13, 2025 at 04:05:10PM +0200, Vincent Guittot wrote:
> > > > > Vincent Guittot (3):
> > > > >   sched/fair: Use protect_slice() instead of direct comparison
> > > > >   sched/fair: Limit run to parity to the min slice of enqueued entities
> > > > >   sched/fair: Improve NO_RUN_TO_PARITY
> > > >
> > > > Ah. I wrote these here patches and then totally forgot about them :/.
> > > > They take a different approach.
> > > >
> > > > The approach I took was to move the decision to stick with curr to after
> > > > the pick, instead of before it. That way we can evaluate the tree at the
> > > > time of preemption.
> > >
> > > Let me have a look at your patches
> >
> > I have looked at and tested your patches, but they don't solve the lag
> > and run to parity issues; not sure what is going wrong.
>
> Humm.. So what you do in patch 3, setting the protection to min_slice
> instead of the deadline, only takes into account the tasks present at
> the point we schedule.

yes, but at this point any waking up task is either the next running
task or enqueued in the rb tree

> Which is why I approached it by moving the protection to after the pick;
> because then we can directly compare the task we're running to the best
> pick -- which includes the tasks that got woken. This gives
> check_preempt_wakeup_fair() better chances.

we don't always want to break the run to parity, but only when a task
wakes up and should preempt current or decrease the run to parity
period. Otherwise, the protection applies for a duration that is short
enough to stay fair for others.

I will see if check_preempt_wakeup_fair can be smarter when deciding
to cancel the protection.

> To be fair, I did not get around to testing the patches much beyond
> booting them, so quite possibly they're buggered :-/
>
> > Also, my patchset takes into account the NO_RUN_TO_PARITY case by adding
> > a notion of quantum execution time, which was missing until now.
>
> Right; not ideal, but I suppose for the people that disable
> RUN_TO_PARITY it might make sense. But perhaps there should be a little
> more justification for why we bother tweaking a non-default option.

Otherwise, disabling RUN_TO_PARITY to check if it's the root cause of a
regression or a problem becomes pointless, because the behavior without
the feature is wrong.

And some might not want to run to parity but behave closer to the white
paper, with a pick after each quantum, quantum being something in the
range [0.7ms:2*tick).

> The problem with the usage of normalized_sysctl_ values is that you then
> get behavioural differences between 1 and 8 CPUs or so. Also, perhaps
> it's time to just nuke that whole scaling thing (I'm sure someone
> mentioned that a short while ago).

normalized_sysctl_ values don't scale with the number of CPUs. In this
case, it's always 0.7ms, which is short enough compared to the 1ms tick
period to prevent default irq accounting from keeping current for
another tick.

> > Regarding the "fix delayed requeue", I already get an update of current
> > before requeueing a delayed task. Do you have a use case in mind?
>
> Ah, it was just from reading code, clearly I missed something. Happy to
> forget about that patch :-)
On Fri, Jun 20, 2025 at 12:29:27PM +0200, Vincent Guittot wrote:

> yes, but at this point any waking up task is either the next running
> task or enqueued in the rb tree

The scenario I was thinking of was something like:

  A (long slice)
  B (short slice)
  C (short slice)

A wakes up and goes running

Since A is the only task around, it gets normal protection

B wakes up and doesn't win

So now we have A running with long protection and a short task on-rq

C wakes up ...

Whereas what we would've wanted to end up with for C is A running with
short protection.

> > Which is why I approached it by moving the protection to after the pick;
> > because then we can directly compare the task we're running to the best
> > pick -- which includes the tasks that got woken. This gives
> > check_preempt_wakeup_fair() better chances.
>
> we don't always want to break the run to parity, but only when a task
> wakes up and should preempt current or decrease the run to parity
> period. Otherwise, the protection applies for a duration that is short
> enough to stay fair for others.
>
> I will see if check_preempt_wakeup_fair can be smarter when deciding
> to cancel the protection.

Thanks. In the above scenario, B getting selected when C wakes up would
be a clue I suppose :-)

> > To be fair, I did not get around to testing the patches much beyond
> > booting them, so quite possibly they're buggered :-/
> >
> > > Also, my patchset takes into account the NO_RUN_TO_PARITY case by adding
> > > a notion of quantum execution time, which was missing until now.
> >
> > Right; not ideal, but I suppose for the people that disable
> > RUN_TO_PARITY it might make sense. But perhaps there should be a little
> > more justification for why we bother tweaking a non-default option.
>
> Otherwise, disabling RUN_TO_PARITY to check if it's the root cause of a
> regression or a problem becomes pointless, because the behavior without
> the feature is wrong.

Fair enough.

> And some might not want to run to parity but behave closer to the white
> paper, with a pick after each quantum, quantum being something in the
> range [0.7ms:2*tick).
>
> > The problem with the usage of normalized_sysctl_ values is that you then
> > get behavioural differences between 1 and 8 CPUs or so.
>
> normalized_sysctl_ values don't scale with the number of CPUs. In this
> case, it's always 0.7ms, which is short enough compared to the 1ms tick
> period to prevent default irq accounting from keeping current for
> another tick.

Right; but it not scaling means it is the full slice on UP, half the
slice on SMP-4 and a third for SMP-8 and up, or somesuch.

It probably doesn't matter much, but it's weird.
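To put numbers on this scenario, a small standalone sketch of what
shrinking A's protection to the new minimum slice when B is enqueued
would buy C. All names, values and helpers are illustrative, not the
kernel code:

/* Toy timeline of the A/B/C scenario: A runs with protection derived
 * from the slices on the runqueue; the question is whether that window
 * shrinks when short-slice B is enqueued without preempting. */
#include <stdio.h>

#define LONG_SLICE  3000000LL /* 3ms */
#define SHORT_SLICE  700000LL /* 0.7ms */

static long long protected_until;
static long long min_on_rq_slice = LONG_SLICE;

static void set_protection(long long now)
{
	protected_until = now + min_on_rq_slice;
}

static void enqueue_short(long long now, int resize_protection)
{
	if (SHORT_SLICE < min_on_rq_slice)
		min_on_rq_slice = SHORT_SLICE;
	/* shrink A's protection to the new min slice, even without preemption */
	if (resize_protection && now + min_on_rq_slice < protected_until)
		set_protection(now);
}

int main(void)
{
	long long now = 0;

	set_protection(now);     /* A starts alone: long protection */
	now = 100000;
	enqueue_short(now, 1);   /* B wakes, loses the preemption check */
	now = 200000;            /* C wakes here */

	/* 600000 ns with resizing vs 2800000 ns (~2.8ms) without */
	printf("C waits at most %lld ns for A's protection to expire\n",
	       protected_until - now);
	return 0;
}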
On Mon, 23 Jun 2025 at 13:16, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Jun 20, 2025 at 12:29:27PM +0200, Vincent Guittot wrote:
>
> > yes, but at this point any waking up task is either the next running
> > task or enqueued in the rb tree
>
> The scenario I was thinking of was something like:
>
>   A (long slice)
>   B (short slice)
>   C (short slice)
>
> A wakes up and goes running
>
> Since A is the only task around, it gets normal protection
>
> B wakes up and doesn't win
>
> So now we have A running with long protection and a short task on-rq
>
> C wakes up ...
>
> Whereas what we would've wanted to end up with for C is A running with
> short protection.

I will look at this case more deeply. We might want to update the slice
protection with the new min slice even if B doesn't preempt A. That's
part of a smarter check_preempt_wakeup_fair that I mentioned below.

In case of B's deadline not being before A's, we don't need to update
the protection, as the remaining protection duration is already shorter
than the new slice.

In case of B not being eligible and already on the cpu, B is already
enqueued (delayed dequeue), so its short slice is already accounted in
the set protection. I still have to look at B being migrated from
another CPU with negative lag.

> > > Which is why I approached it by moving the protection to after the pick;
> > > because then we can directly compare the task we're running to the best
> > > pick -- which includes the tasks that got woken. This gives
> > > check_preempt_wakeup_fair() better chances.
> >
> > we don't always want to break the run to parity, but only when a task
> > wakes up and should preempt current or decrease the run to parity
> > period. Otherwise, the protection applies for a duration that is short
> > enough to stay fair for others.
> >
> > I will see if check_preempt_wakeup_fair can be smarter when deciding
> > to cancel the protection.
>
> Thanks. In the above scenario, B getting selected when C wakes up would
> be a clue I suppose :-)

yes, fixing the comment:

 * Note that even if @p does not turn out to be the most eligible
 * task at this moment, current's slice protection will be lost.

> > > To be fair, I did not get around to testing the patches much beyond
> > > booting them, so quite possibly they're buggered :-/
> > >
> > > > Also, my patchset takes into account the NO_RUN_TO_PARITY case by adding
> > > > a notion of quantum execution time, which was missing until now.
> > >
> > > Right; not ideal, but I suppose for the people that disable
> > > RUN_TO_PARITY it might make sense. But perhaps there should be a little
> > > more justification for why we bother tweaking a non-default option.
> >
> > Otherwise, disabling RUN_TO_PARITY to check if it's the root cause of a
> > regression or a problem becomes pointless, because the behavior without
> > the feature is wrong.
>
> Fair enough.
>
> > And some might not want to run to parity but behave closer to the white
> > paper, with a pick after each quantum, quantum being something in the
> > range [0.7ms:2*tick).
> >
> > > The problem with the usage of normalized_sysctl_ values is that you then
> > > get behavioural differences between 1 and 8 CPUs or so.
> >
> > normalized_sysctl_ values don't scale with the number of CPUs. In this
> > case, it's always 0.7ms, which is short enough compared to the 1ms tick
> > period to prevent default irq accounting from keeping current for
> > another tick.
>
> Right; but it not scaling means it is the full slice on UP, half the
> slice on SMP-4 and a third for SMP-8 and up, or somesuch.

Yes, the goal is to implement a kind of quantum of time which doesn't
scale with the number of CPUs, unlike the default slice duration.

> It probably doesn't matter much, but it's weird.
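For reference, a standalone illustration of the scaling point: assuming
the usual log2-style factor applied to the sysctl_ values, the fixed
normalized_ quantum ends up being a different fraction of the scaled
slice depending on the CPU count. This only models the behaviour, it is
not the kernel code:

/* Compare a fixed 0.7ms quantum against a slice scaled by a
 * 1 + log2(ncpus) factor (assumed here to mirror log-style tunable
 * scaling; illustrative only). */
#include <stdio.h>

static unsigned int scale_factor(unsigned int ncpus)
{
	unsigned int factor = 1;

	/* 1 + ilog2(ncpus) */
	while (ncpus >>= 1)
		factor++;
	return factor;
}

int main(void)
{
	const unsigned long long normalized_slice = 700000; /* 0.7ms */
	unsigned int cpus[] = { 1, 4, 8, 64 };

	for (unsigned int i = 0; i < sizeof(cpus) / sizeof(cpus[0]); i++) {
		unsigned int f = scale_factor(cpus[i]);

		printf("%3u CPUs: scaled slice %llu ns, fixed quantum %llu ns (1/%u of slice)\n",
		       cpus[i], normalized_slice * f, normalized_slice, f);
	}
	return 0;
}

On UP the quantum equals the base slice, while on larger machines it is
a progressively smaller fraction of it, which is the asymmetry mentioned
above.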