This follows the attempt to better track the maximum lag of tasks in the
presence of different slice durations:
[1] https://lore.kernel.org/all/20250418151225.3006867-1-vincent.guittot@linaro.org/

Patch 1 is a simple cleanup to ease the following changes.

Patch 2 uses Peter's proposal made in [1] to track the max slice of
enqueued tasks and use it to clamp the lag of dequeued tasks.

Patch 3 modifies the protection of the current task's slice to take into
account the case where tasks with a shorter slice are waiting to run on
the CPU.

Patch 4 extends the slice protection mechanism to the NO_RUN_TO_PARITY
case to ensure that a resched is set regularly. Currently, the resched is
set only when curr has elapsed its slice, which is partly similar to
RUN_TO_PARITY. Now the running task has a minimum time quantum (0.7ms)
before EEVDF looks for another task to run, with the exception of the
PREEMPT_SHORT case, which remains valid.

Peter Zijlstra (1):
  sched/fair: Increase max lag clamping

Vincent Guittot (3):
  sched/fair: Use protect_slice() instead of direct comparison
  sched/fair: Limit run to parity to the min slice of enqueued entities
  sched/fair: Improve NO_RUN_TO_PARITY

 include/linux/sched.h |  1 +
 kernel/sched/fair.c   | 76 ++++++++++++++++++++++++++++++++++---------
 2 files changed, 61 insertions(+), 16 deletions(-)

--
2.43.0
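To make the patch 2 idea concrete, here is a rough standalone model of
clamping a dequeued task's lag against the largest slice seen among the
entities it ran with. The field and helper names are made up for
illustration and the numbers are arbitrary; this is not the kernel
implementation:

/* Toy model of clamping a dequeued entity's lag to the largest slice
 * seen among enqueued entities. Standalone C, illustrative only. */
#include <stdio.h>

struct entity {
	long long vlag;   /* virtual lag at dequeue, in ns */
	long long slice;  /* requested slice, in ns */
};

/* runqueue-wide maximum slice among currently enqueued entities */
static long long rq_max_slice;

static void enqueue(struct entity *se)
{
	if (se->slice > rq_max_slice)
		rq_max_slice = se->slice;
}

static long long clamp_ll(long long v, long long lo, long long hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/* On dequeue, bound the stored lag by the max slice of the peers the
 * entity actually competed with, instead of only its own slice. */
static void dequeue(struct entity *se)
{
	long long max = se->slice > rq_max_slice ? se->slice : rq_max_slice;
	long long limit = 2 * max;

	se->vlag = clamp_ll(se->vlag, -limit, limit);
}

int main(void)
{
	struct entity big   = { .vlag = 0,        .slice = 3000000 }; /* 3ms */
	struct entity small = { .vlag = -9000000, .slice = 700000  }; /* 0.7ms */

	enqueue(&big);
	enqueue(&small);

	dequeue(&small);
	/* -6000000 == -(2 * 3ms); clamping only by small's own slice
	 * would have given -(2 * 0.7ms) instead */
	printf("clamped vlag: %lld\n", small.vlag);
	return 0;
}

The point is that a short-slice task that built up lag while competing
with a long-slice task keeps a lag bounded by the long slice, not only
by its own.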
On Fri, Jun 13, 2025 at 04:05:10PM +0200, Vincent Guittot wrote:
> Vincent Guittot (3):
>   sched/fair: Use protect_slice() instead of direct comparison
>   sched/fair: Limit run to parity to the min slice of enqueued entities
>   sched/fair: Improve NO_RUN_TO_PARITY

Ah. I wrote these here patches and then totally forgot about them :/.
They take a different approach.

The approach I took was to move the decision to stick with curr to after
the pick, instead of before it. That way we can evaluate the tree at the
time of preemption.
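A minimal sketch of the ordering described above, assuming a simplified
deadline-only comparison: the best candidate is picked first, and only
then do we decide whether curr keeps its protected slice. Names and the
comparison are illustrative, not the actual kernel code:

/* Standalone toy of the "decide after pick" ordering. */
#include <stdio.h>
#include <stddef.h>

struct entity {
	const char *name;
	long long deadline;        /* virtual deadline */
	long long protected_until; /* end of slice protection */
};

static struct entity *pick_best(struct entity *q, int n)
{
	struct entity *best = NULL;

	for (int i = 0; i < n; i++)
		if (!best || q[i].deadline < best->deadline)
			best = &q[i];
	return best;
}

static struct entity *pick_next(struct entity *curr, long long now,
				struct entity *q, int n)
{
	struct entity *best = pick_best(q, n);

	/* Keep curr only if it is still protected AND the best candidate,
	 * which includes any freshly woken task, does not have an earlier
	 * deadline. The "before pick" ordering would return curr without
	 * ever looking at the queue. */
	if (curr && now < curr->protected_until &&
	    (!best || curr->deadline <= best->deadline))
		return curr;

	return best ? best : curr;
}

int main(void)
{
	struct entity curr    = { "A", 4000, 3000 };
	struct entity woken[] = { { "B", 1500, 0 } }; /* short-slice wakeup */

	struct entity *next = pick_next(&curr, 1000, woken, 1);
	printf("next: %s\n", next->name); /* "B": the wakeup wins despite protection */
	return 0;
}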
On Tue, Jun 17, 2025 at 11:22:08AM +0200, Peter Zijlstra wrote:
> On Fri, Jun 13, 2025 at 04:05:10PM +0200, Vincent Guittot wrote:
> > Vincent Guittot (3):
> >   sched/fair: Use protect_slice() instead of direct comparison
> >   sched/fair: Limit run to parity to the min slice of enqueued entities
> >   sched/fair: Improve NO_RUN_TO_PARITY
>
> Ah. I wrote these here patches and then totally forgot about them :/.
> They take a different approach.

Mind sticking them on a git tree somewhere please?

Dhaval
On Tue, 17 Jun 2025 at 11:22, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Jun 13, 2025 at 04:05:10PM +0200, Vincent Guittot wrote:
> > Vincent Guittot (3):
> >   sched/fair: Use protect_slice() instead of direct comparison
> >   sched/fair: Limit run to parity to the min slice of enqueued entities
> >   sched/fair: Improve NO_RUN_TO_PARITY
>
> Ah. I wrote these here patches and then totally forgot about them :/.
> They take a different approach.
>
> The approach I took was to move the decision to stick with curr to after
> the pick, instead of before it. That way we can evaluate the tree at the
> time of preemption.

Let me have a look at your patches
On Wed, 18 Jun 2025 at 09:03, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>
> On Tue, 17 Jun 2025 at 11:22, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Fri, Jun 13, 2025 at 04:05:10PM +0200, Vincent Guittot wrote:
> > > Vincent Guittot (3):
> > >   sched/fair: Use protect_slice() instead of direct comparison
> > >   sched/fair: Limit run to parity to the min slice of enqueued entities
> > >   sched/fair: Improve NO_RUN_TO_PARITY
> >
> > Ah. I wrote these here patches and then totally forgot about them :/.
> > They take a different approach.
> >
> > The approach I took was to move the decision to stick with curr to after
> > the pick, instead of before it. That way we can evaluate the tree at the
> > time of preemption.
>
> Let me have a look at your patches

I have looked at and tested your patches, but they don't solve the lag
and run to parity issues; not sure what is going wrong.

Also, my patchset takes into account the NO_RUN_TO_PARITY case by adding
a notion of quantum execution time, which was missing until now.

Regarding the "fix delayed requeue", I already get an update of current
before requeueing a delayed task. Do you have a use case in mind?
On Thu, Jun 19, 2025 at 02:27:43PM +0200, Vincent Guittot wrote:
> On Wed, 18 Jun 2025 at 09:03, Vincent Guittot
> <vincent.guittot@linaro.org> wrote:
> >
> > On Tue, 17 Jun 2025 at 11:22, Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > > On Fri, Jun 13, 2025 at 04:05:10PM +0200, Vincent Guittot wrote:
> > > > Vincent Guittot (3):
> > > >   sched/fair: Use protect_slice() instead of direct comparison
> > > >   sched/fair: Limit run to parity to the min slice of enqueued entities
> > > >   sched/fair: Improve NO_RUN_TO_PARITY
> > >
> > > Ah. I wrote these here patches and then totally forgot about them :/.
> > > They take a different approach.
> > >
> > > The approach I took was to move the decision to stick with curr to after
> > > the pick, instead of before it. That way we can evaluate the tree at the
> > > time of preemption.
> >
> > Let me have a look at your patches
>
> I have looked at and tested your patches, but they don't solve the lag
> and run to parity issues; not sure what is going wrong.

Humm.. So what you do in patch 3, setting the protection to min_slice
instead of the deadline, only takes into account the tasks present at
the point we schedule.

Which is why I approached it by moving the protection to after the pick;
because then we can directly compare the task we're running to the best
pick -- which includes the tasks that got woken. This gives
check_preempt_wakeup_fair() better chances.

To be fair, I did not get around to testing the patches much beyond
booting them, so quite possibly they're buggered :-/

> Also, my patchset takes into account the NO_RUN_TO_PARITY case by adding
> a notion of quantum execution time, which was missing until now.

Right; not ideal, but I suppose for the people that disable
RUN_TO_PARITY it might make sense. But perhaps there should be a little
more justification for why we bother tweaking a non-default option.

The problem with the usage of normalized_sysctl_ values is that you then
get behavioural differences between 1 and 8 CPUs or so. Also, perhaps
it's time to just nuke that whole scaling thing (I'm sure someone
mentioned that a short while ago).

> Regarding the "fix delayed requeue", I already get an update of current
> before requeueing a delayed task. Do you have a use case in mind?

Ah, it was just from reading code, clearly I missed something. Happy to
forget about that patch :-)
On Fri, 20 Jun 2025 at 10:42, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Thu, Jun 19, 2025 at 02:27:43PM +0200, Vincent Guittot wrote:
> > On Wed, 18 Jun 2025 at 09:03, Vincent Guittot
> > <vincent.guittot@linaro.org> wrote:
> > >
> > > On Tue, 17 Jun 2025 at 11:22, Peter Zijlstra <peterz@infradead.org> wrote:
> > > >
> > > > On Fri, Jun 13, 2025 at 04:05:10PM +0200, Vincent Guittot wrote:
> > > > > Vincent Guittot (3):
> > > > >   sched/fair: Use protect_slice() instead of direct comparison
> > > > >   sched/fair: Limit run to parity to the min slice of enqueued entities
> > > > >   sched/fair: Improve NO_RUN_TO_PARITY
> > > >
> > > > Ah. I wrote these here patches and then totally forgot about them :/.
> > > > They take a different approach.
> > > >
> > > > The approach I took was to move the decision to stick with curr to after
> > > > the pick, instead of before it. That way we can evaluate the tree at the
> > > > time of preemption.
> > >
> > > Let me have a look at your patches
> >
> > I have looked at and tested your patches, but they don't solve the lag
> > and run to parity issues; not sure what is going wrong.
>
> Humm.. So what you do in patch 3, setting the protection to min_slice
> instead of the deadline, only takes into account the tasks present at
> the point we schedule.

yes, but at this point any waking up task is either the next running
task or enqueued in the rb tree

> Which is why I approached it by moving the protection to after the pick;
> because then we can directly compare the task we're running to the best
> pick -- which includes the tasks that got woken. This gives
> check_preempt_wakeup_fair() better chances.

we don't always want to break the run to parity, but only when a task
wakes up and should preempt current or decrease the run to parity
period. Otherwise, the protection applies for a duration that is short
enough to stay fair for others.

I will see if check_preempt_wakeup_fair can be smarter when deciding
to cancel the protection.

> To be fair, I did not get around to testing the patches much beyond
> booting them, so quite possibly they're buggered :-/
>
> > Also, my patchset takes into account the NO_RUN_TO_PARITY case by adding
> > a notion of quantum execution time, which was missing until now.
>
> Right; not ideal, but I suppose for the people that disable
> RUN_TO_PARITY it might make sense. But perhaps there should be a little
> more justification for why we bother tweaking a non-default option.

Otherwise, disabling RUN_TO_PARITY to check if it's the root cause of a
regression or a problem becomes pointless, because the behavior without
the feature is wrong.

And some might not want to run to parity but behave closer to the white
paper, with a pick after each quantum, quantum being something in the
range [0.7ms:2*tick).

> The problem with the usage of normalized_sysctl_ values is that you then
> get behavioural differences between 1 and 8 CPUs or so. Also, perhaps
> it's time to just nuke that whole scaling thing (I'm sure someone
> mentioned that a short while ago).

normalized_sysctl_ values don't scale with the number of CPUs. In this
case, it's always 0.7ms, which is short enough compared to the 1ms tick
period to prevent default irq accounting from keeping current for
another tick.

> > Regarding the "fix delayed requeue", I already get an update of current
> > before requeueing a delayed task. Do you have a use case in mind?
>
> Ah, it was just from reading code, clearly I missed something. Happy to
> forget about that patch :-)
On Fri, Jun 20, 2025 at 12:29:27PM +0200, Vincent Guittot wrote:

> yes, but at this point any waking up task is either the next running
> task or enqueued in the rb tree

The scenario I was thinking of was something like:

  A (long slice)
  B (short slice)
  C (short slice)

A wakes up and goes running

Since A is the only task around, it gets normal protection

B wakes up and doesn't win

So now we have A running with long protection and a short task on-rq

C wakes up ...

Whereas what we would've wanted to end up with for C is A running with
short protection.

> > Which is why I approached it by moving the protection to after the pick;
> > because then we can directly compare the task we're running to the best
> > pick -- which includes the tasks that got woken. This gives
> > check_preempt_wakeup_fair() better chances.
>
> we don't always want to break the run to parity, but only when a task
> wakes up and should preempt current or decrease the run to parity
> period. Otherwise, the protection applies for a duration that is short
> enough to stay fair for others.
>
> I will see if check_preempt_wakeup_fair can be smarter when deciding
> to cancel the protection.

Thanks. In the above scenario, B getting selected when C wakes up would
be a clue I suppose :-)

> > To be fair, I did not get around to testing the patches much beyond
> > booting them, so quite possibly they're buggered :-/
> >
> > > Also, my patchset takes into account the NO_RUN_TO_PARITY case by adding
> > > a notion of quantum execution time, which was missing until now.
> >
> > Right; not ideal, but I suppose for the people that disable
> > RUN_TO_PARITY it might make sense. But perhaps there should be a little
> > more justification for why we bother tweaking a non-default option.
>
> Otherwise, disabling RUN_TO_PARITY to check if it's the root cause of a
> regression or a problem becomes pointless, because the behavior without
> the feature is wrong.

Fair enough.

> And some might not want to run to parity but behave closer to the white
> paper, with a pick after each quantum, quantum being something in the
> range [0.7ms:2*tick).
>
> > The problem with the usage of normalized_sysctl_ values is that you then
> > get behavioural differences between 1 and 8 CPUs or so.
>
> normalized_sysctl_ values don't scale with the number of CPUs. In this
> case, it's always 0.7ms, which is short enough compared to the 1ms tick
> period to prevent default irq accounting from keeping current for
> another tick.

Right; but it not scaling means it is the full slice on UP, half the
slice on SMP-4 and a third for SMP-8 and up, or somesuch.

It probably doesn't matter much, but it's weird.
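To put numbers on this scenario, a small standalone sketch of what
shrinking A's protection to the new minimum slice when B is enqueued
would buy C. All names, values and helpers are illustrative, not the
kernel code:

/* Toy timeline of the A/B/C scenario: A runs with protection derived
 * from the slices on the runqueue; the question is whether that window
 * shrinks when short-slice B is enqueued without preempting. */
#include <stdio.h>

#define LONG_SLICE  3000000LL /* 3ms */
#define SHORT_SLICE  700000LL /* 0.7ms */

static long long protected_until;
static long long min_on_rq_slice = LONG_SLICE;

static void set_protection(long long now)
{
	protected_until = now + min_on_rq_slice;
}

static void enqueue_short(long long now, int resize_protection)
{
	if (SHORT_SLICE < min_on_rq_slice)
		min_on_rq_slice = SHORT_SLICE;
	/* shrink A's protection to the new min slice, even without preemption */
	if (resize_protection && now + min_on_rq_slice < protected_until)
		set_protection(now);
}

int main(void)
{
	long long now = 0;

	set_protection(now);     /* A starts alone: long protection */
	now = 100000;
	enqueue_short(now, 1);   /* B wakes, loses the preemption check */
	now = 200000;            /* C wakes here */

	/* 600000 ns with resizing vs 2800000 ns (~2.8ms) without */
	printf("C waits at most %lld ns for A's protection to expire\n",
	       protected_until - now);
	return 0;
}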
On Mon, 23 Jun 2025 at 13:16, Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, Jun 20, 2025 at 12:29:27PM +0200, Vincent Guittot wrote:
>
> > yes, but at this point any waking up task is either the next running
> > task or enqueued in the rb tree
>
> The scenario I was thinking of was something like:
>
>   A (long slice)
>   B (short slice)
>   C (short slice)
>
> A wakes up and goes running
>
> Since A is the only task around, it gets normal protection
>
> B wakes up and doesn't win
>
> So now we have A running with long protection and a short task on-rq
>
> C wakes up ...
>
> Whereas what we would've wanted to end up with for C is A running with
> short protection.

I will look at this case more deeply. We might want to update the slice
protection with the new min slice even if B doesn't preempt A. That's
part of a smarter check_preempt_wakeup_fair that I mentioned below.

In case of B's deadline not being before A's, we don't need to update
the protection, as the remaining protection duration is already shorter
than the new slice.

In case of B not being eligible and already on the cpu, B is already
enqueued (delayed dequeue), so its short slice is already accounted in
the set protection. I still have to look at B being migrated from
another CPU with negative lag.

> > > Which is why I approached it by moving the protection to after the pick;
> > > because then we can directly compare the task we're running to the best
> > > pick -- which includes the tasks that got woken. This gives
> > > check_preempt_wakeup_fair() better chances.
> >
> > we don't always want to break the run to parity, but only when a task
> > wakes up and should preempt current or decrease the run to parity
> > period. Otherwise, the protection applies for a duration that is short
> > enough to stay fair for others.
> >
> > I will see if check_preempt_wakeup_fair can be smarter when deciding
> > to cancel the protection.
>
> Thanks. In the above scenario, B getting selected when C wakes up would
> be a clue I suppose :-)

yes, fixing the comment:

 * Note that even if @p does not turn out to be the most eligible
 * task at this moment, current's slice protection will be lost.

> > > To be fair, I did not get around to testing the patches much beyond
> > > booting them, so quite possibly they're buggered :-/
> > >
> > > > Also, my patchset takes into account the NO_RUN_TO_PARITY case by adding
> > > > a notion of quantum execution time, which was missing until now.
> > >
> > > Right; not ideal, but I suppose for the people that disable
> > > RUN_TO_PARITY it might make sense. But perhaps there should be a little
> > > more justification for why we bother tweaking a non-default option.
> >
> > Otherwise, disabling RUN_TO_PARITY to check if it's the root cause of a
> > regression or a problem becomes pointless, because the behavior without
> > the feature is wrong.
>
> Fair enough.
>
> > And some might not want to run to parity but behave closer to the white
> > paper, with a pick after each quantum, quantum being something in the
> > range [0.7ms:2*tick).
> >
> > > The problem with the usage of normalized_sysctl_ values is that you then
> > > get behavioural differences between 1 and 8 CPUs or so.
> >
> > normalized_sysctl_ values don't scale with the number of CPUs. In this
> > case, it's always 0.7ms, which is short enough compared to the 1ms tick
> > period to prevent default irq accounting from keeping current for
> > another tick.
>
> Right; but it not scaling means it is the full slice on UP, half the
> slice on SMP-4 and a third for SMP-8 and up, or somesuch.

Yes, the goal is to implement a kind of quantum of time which doesn't
scale with the number of CPUs, unlike the default slice duration.

> It probably doesn't matter much, but it's weird.
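For reference, a standalone illustration of the scaling point: assuming
the usual log2-style factor applied to the sysctl_ values, the fixed
normalized_ quantum ends up being a different fraction of the scaled
slice depending on the CPU count. This only models the behaviour, it is
not the kernel code:

/* Compare a fixed 0.7ms quantum against a slice scaled by a
 * 1 + log2(ncpus) factor (assumed here to mirror log-style tunable
 * scaling; illustrative only). */
#include <stdio.h>

static unsigned int scale_factor(unsigned int ncpus)
{
	unsigned int factor = 1;

	/* 1 + ilog2(ncpus) */
	while (ncpus >>= 1)
		factor++;
	return factor;
}

int main(void)
{
	const unsigned long long normalized_slice = 700000; /* 0.7ms */
	unsigned int cpus[] = { 1, 4, 8, 64 };

	for (unsigned int i = 0; i < sizeof(cpus) / sizeof(cpus[0]); i++) {
		unsigned int f = scale_factor(cpus[i]);

		printf("%3u CPUs: scaled slice %llu ns, fixed quantum %llu ns (1/%u of slice)\n",
		       cpus[i], normalized_slice * f, normalized_slice, f);
	}
	return 0;
}

On UP the quantum equals the base slice, while on larger machines it is
a progressively smaller fraction of it, which is the asymmetry mentioned
above.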