The previously used CFS scheduler gave tasks that were woken up an
enhanced chance to see runtime immediately by deducting a certain value
from their vruntime on runqueue placement during wakeup.
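For reference, the pre-EEVDF wakeup placement did roughly the following
(abridged sketch from older kernels, e.g. v6.5; exact details vary by
version):

	static void
	place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
	{
		u64 vruntime = cfs_rq->min_vruntime;

		/* sleeps up to a single latency don't count */
		if (!initial) {
			unsigned long thresh = sysctl_sched_latency;

			/* halve the effect for a gentler sleeper bonus */
			if (sched_feat(GENTLE_FAIR_SLEEPERS))
				thresh >>= 1;

			vruntime -= thresh;
		}

		/* ensure we never gain time by being placed backwards */
		se->vruntime = max_vruntime(se->vruntime, vruntime);
	}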
This property was used by some, at least vhost, to ensure that certain
kworkers are scheduled immediately after being woken up. The EEVDF
scheduler does not support this so far. Instead, if such a woken up
entity carries a negative lag from its previous execution, it will have
to wait for the current time slice to finish, which negatively affects
the performance of the process expecting the immediate execution.
To address this issue, implement EEVDF strategy #2 for rejoining
entities, which dismisses the lag from previous execution and allows
the woken up task to run immediately (if no other entities are deemed
to be preferred for scheduling by EEVDF).
The vruntime is decremented by an additional value of 1 to make sure
that the woken up task gets to actually run. This is of course not
following strategy #2 in an exact manner but guarantees the expected
behavior for the scenario described above. Without the additional
decrement, the performance goes south even more. So there are some
side effects I could not get my head around yet.
Questions:
1. The kworker getting its negative lag occurs in the following scenario
- kworker and a cgroup are supposed to execute on the same CPU
- one task within the cgroup is executing and wakes up the kworker
- kworker with 0 lag, gets picked immediately and finishes its
execution within ~5000ns
- on dequeue, kworker gets assigned a negative lag
Is this expected behavior? With this short execution time, I would
expect the kworker to be fine.
For a more detailed discussion on this symptom, please see:
https://lore.kernel.org/netdev/ZWbapeL34Z8AMR5f@DESKTOP-2CCOB1S./T/
2. The proposed code change of course only addresses the symptom. Am I
assuming correctly that this is in general the expected behavior and
that the task waking up the kworker should rather do an explicit
reschedule of itself to grant the kworker time to execute?
In the vhost case, this is currently attempted through a cond_resched
which is not doing anything because the need_resched flag is not set.
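To illustrate the point (hypothetical code, not the actual vhost worker
loop; worker_task is made up here): cond_resched() only reschedules if
need_resched is already set, so it does not help if the woken kworker
was not preferred by EEVDF in the first place.

	/* hypothetical waker side, for illustration only */
	wake_up_process(worker_task);

	/*
	 * cond_resched() is a no-op here unless TIF_NEED_RESCHED is set;
	 * if the kworker was not picked, the waker just keeps running.
	 * An unconditional schedule() would force a new pick, at the
	 * price of always giving up the CPU.
	 */
	if (!cond_resched()) {
		/* nothing happened, the kworker is still waiting */
	}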
Feedback and opinions would be highly appreciated.
Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
---
kernel/sched/fair.c | 5 +++++
kernel/sched/features.h | 1 +
2 files changed, 6 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 533547e3c90a..c20ae6d62961 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5239,6 +5239,11 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
lag = div_s64(lag, load);
}
+ if (sched_feat(NOLAG_WAKEUP) && (flags & ENQUEUE_WAKEUP)) {
+ se->vlag = 0;
+ lag = 1;
+ }
+
se->vruntime = vruntime - lag;
/*
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 143f55df890b..d3118e7568b4 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -7,6 +7,7 @@
SCHED_FEAT(PLACE_LAG, true)
SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
SCHED_FEAT(RUN_TO_PARITY, true)
+SCHED_FEAT(NOLAG_WAKEUP, true)
/*
* Prefer to schedule the task we woke last (assuming it failed
--
2.34.1
Hi Tobias,

On 2/28/24 16:10, Tobias Huschle wrote:
> Questions:
> 1. The kworker getting its negative lag occurs in the following scenario
>    - kworker and a cgroup are supposed to execute on the same CPU
>    - one task within the cgroup is executing and wakes up the kworker
>    - kworker with 0 lag, gets picked immediately and finishes its
>      execution within ~5000ns
>    - on dequeue, kworker gets assigned a negative lag
>    Is this expected behavior? With this short execution time, I would
>    expect the kworker to be fine.

That strikes me as a bit odd as well. Have you been able to determine how
a negative lag is assigned to the kworker after such a short runtime?

I was looking at a different thread
(https://lore.kernel.org/lkml/20240226082349.302363-1-yu.c.chen@intel.com/)
that uncovers a potential overflow in the eligibility calculation. Though
I don't think that is the case for this particular vhost problem.
On Fri, Mar 08, 2024 at 03:11:38PM +0000, Luis Machado wrote:
> On 2/28/24 16:10, Tobias Huschle wrote:
> >
> > Questions:
> > 1. The kworker getting its negative lag occurs in the following scenario
> > - kworker and a cgroup are supposed to execute on the same CPU
> > - one task within the cgroup is executing and wakes up the kworker
> > - kworker with 0 lag, gets picked immediately and finishes its
> > execution within ~5000ns
> > - on dequeue, kworker gets assigned a negative lag
> > Is this expected behavior? With this short execution time, I would
> > expect the kworker to be fine.
>
> That strikes me as a bit odd as well. Have you been able to determine how a negative lag
> is assigned to the kworker after such a short runtime?
>
I did some more trace reading though and found something.
What I observed if everything runs regularly:
- vhost and kworker run alternating on the same CPU
- if the kworker is done, it leaves the runqueue
- vhost wakes up the kworker if it needs it
--> this means:
- vhost starts alone on an otherwise empty runqueue
- it seems like it never gets dequeued
(unless another unrelated task joins or migration hits)
- if vhost wakes up the kworker, the kworker gets selected
- vhost runtime > kworker runtime
--> kworker gets positive lag and gets selected immediately next time
What happens if it does go wrong:
From what I gather, there seem to be occasions where the vhost either
executes surprisingly quickly, or the kworker surprisingly slowly. If these
outliers reach critical values, it can happen that
  vhost runtime < kworker runtime
which now causes the kworker to get the negative lag.
In this case it seems that the vhost is very fast in waking up
the kworker, and coincidentally, the kworker takes more time than usual
to finish. We speak of 4-digit to low 5-digit nanoseconds.
So, for these outliers, the scheduler extrapolates that the kworker
out-consumes the vhost and should be slowed down, although in the majority
of other cases this does not happen.
Therefore this particular use case would profit from being able to ignore
such outliers, or being able to ignore a certain amount of difference in the
lag values, i.e. introduce some grace value around the average runtime for
which lag is not accounted. But I am not sure if I like that idea.
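Just to illustrate what such a grace value could look like on top of
place_entity() (purely hypothetical sketch; NOLAG_GRACE and LAG_GRACE_NS
are invented here, and picking that cutoff is exactly the ugly part):

	#define LAG_GRACE_NS	10000	/* magic cutoff, made up */

	if (sched_feat(NOLAG_GRACE) && (flags & ENQUEUE_WAKEUP) &&
	    abs(lag) < (s64)calc_delta_fair(LAG_GRACE_NS, se)) {
		/* treat small leftover lag as noise from outliers */
		se->vlag = 0;
		lag = 0;
	}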
So the negative lag can be somewhat justified, but for this particular case
it leads to a problem where one outlier can cause havoc. As mentioned in the
vhost discussion, it could also be argued that the vhost should not rely on
the fact that the kworker always gets scheduled on wake up, since these
timing issues can always happen.
Hence, the two options:
- offer the alternative strategy which dismisses lag on wake up for workloads
where we know that a task usually finishes faster than others but should
not be punished by rare outliers (if that is predictable, I don't know)
- require vhost to address this issue on their side (if possible without
creating an armada of side effects)
(plus the third one mentioned above, but that requires a magic cutoff value, meh)
> I was looking at a different thread (https://lore.kernel.org/lkml/20240226082349.302363-1-yu.c.chen@intel.com/) that
> uncovers a potential overflow in the eligibility calculation. Though I don't think that is the case for this particular
> vhost problem.
Yea, the numbers I see do not look very overflowy.
On 3/14/24 13:45, Tobias Huschle wrote:
> So, for these outliers, the scheduler extrapolates that the kworker
> out-consumes the vhost and should be slowed down, although in the majority
> of other cases this does not happen.

Thanks for providing the above details Tobias. It does seem like EEVDF is
strict about the eligibility checks and making tasks wait when their lags
are negative, even if just a little bit as in the case of the kworker.

There was a patch to disable the eligibility checks
(https://lore.kernel.org/lkml/20231013030213.2472697-1-youssefesmat@chromium.org/),
which would make EEVDF more like EVDF, though the deadline comparison would
probably still favor the vhost task instead of the kworker with the
negative lag.

I'm not sure if you tried it, but I thought I'd mention it.
On 2024-03-18 15:45, Luis Machado wrote:
> There was a patch to disable the eligibility checks
> (https://lore.kernel.org/lkml/20231013030213.2472697-1-youssefesmat@chromium.org/),
> which would make EEVDF more like EVDF, though the deadline comparison
> would probably still favor the vhost task instead of the kworker with
> the negative lag.
>
> I'm not sure if you tried it, but I thought I'd mention it.

Haven't seen that one yet. Unfortunately, it does not help to ignore the
eligibility.

I'm inclined to rather propose a documentation change, which describes
that tasks should not rely on woken up tasks being scheduled immediately.

Changing things in the code to address the specific scenario I'm seeing
seems to mostly create unwanted side effects and/or would require the
definition of some magic cut-off values.
On Tue, 19 Mar 2024 at 10:08, Tobias Huschle <huschle@linux.ibm.com> wrote:
> I'm inclined to rather propose a documentation change, which describes
> that tasks should not rely on woken up tasks being scheduled immediately.

Where do you see such an assumption? Even before eevdf, there was nothing
that ensured such behavior. When using CFS (legacy or eevdf), you can't
know if the newly woken up task will run 1st or not.
On Tue, Mar 19, 2024 at 02:41:14PM +0100, Vincent Guittot wrote:
> Where do you see such an assumption? Even before eevdf, there was nothing
> that ensured such behavior. When using CFS (legacy or eevdf), you can't
> know if the newly woken up task will run 1st or not.

There was no guarantee of course. place_entity was reducing the vruntime of
woken up tasks though, giving them a slight boost, right? For the scenario
that I observed, that boost was enough to make sure that the woken up task
gets scheduled consistently. This might still not be true for all scenarios,
but in general EEVDF seems to be stricter with woken up tasks.

Dismissing the lag on wakeup also does obviously not guarantee getting
scheduled, as other tasks might still be involved.

The question would be if it should be explicitly mentioned somewhere that,
at this point, woken up tasks are not getting any special treatment and
no one should rely on that boost for woken up tasks.
On 3/20/24 07:04, Tobias Huschle wrote:
> There was no guarantee of course. place_entity was reducing the vruntime of
> woken up tasks though, giving them a slight boost, right? For the scenario
> that I observed, that boost was enough to make sure that the woken up task
> gets scheduled consistently. This might still not be true for all scenarios,
> but in general EEVDF seems to be stricter with woken up tasks.

It seems that way, as EEVDF will do eligibility and deadline checks before
scheduling a task, so a task would have to satisfy both of those checks.

I think we have some special treatment for when a task initially joins the
competition, in which case we halve its slice. But I don't think there is
any special treatment for woken tasks anymore.

There was also a fix (63304558ba5dcaaff9e052ee43cfdcc7f9c29e85) to try to
reduce the number of wake up preemptions under some conditions, under the
RUN_TO_PARITY feature.
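For reference, the eligibility check in question looks roughly like this
(slightly abridged from the 6.6-era EEVDF code; details may differ in
later kernels):

	static int entity_eligible(struct cfs_rq *cfs_rq, struct sched_entity *se)
	{
		struct sched_entity *curr = cfs_rq->curr;
		s64 avg = cfs_rq->avg_vruntime;
		long load = cfs_rq->avg_load;

		if (curr && curr->on_rq) {
			unsigned long weight = scale_load_down(curr->load.weight);

			/* fold the currently running entity into the average */
			avg += entity_key(cfs_rq, curr) * weight;
			load += weight;
		}

		/* eligible iff the entity's vruntime is at or before the weighted average */
		return avg >= entity_key(cfs_rq, se) * load;
	}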
(+ Xuewen Yan, Ke Wang)
Hello Tobias,
On 2/28/2024 9:40 PM, Tobias Huschle wrote:
> The previously used CFS scheduler gave tasks that were woken up an
> enhanced chance to see runtime immediately by deducting a certain value
> from its vruntime on runqueue placement during wakeup.
>
> This property was used by some, at least vhost, to ensure, that certain
> kworkers are scheduled immediately after being woken up. The EEVDF
> scheduler, does not support this so far. Instead, if such a woken up
> entity carries a negative lag from its previous execution, it will have
> to wait for the current time slice to finish, which affects the
> performance of the process expecting the immediate execution negatively.
>
> To address this issue, implement EEVDF strategy #2 for rejoining
> entities, which dismisses the lag from previous execution and allows
> the woken up task to run immediately (if no other entities are deemed
> to be preferred for scheduling by EEVDF).
>
> The vruntime is decremented by an additional value of 1 to make sure
> that the woken up task gets to actually run. This is of course not
> following strategy #2 in an exact manner but guarantees the expected
> behavior for the scenario described above. Without the additional
> decrement, the performance goes south even more. So there are some
> side effects I could not get my head around yet.
>
> Questions:
> 1. The kworker getting its negative lag occurs in the following scenario
> - kworker and a cgroup are supposed to execute on the same CPU
> - one task within the cgroup is executing and wakes up the kworker
> - kworker with 0 lag, gets picked immediately and finishes its
> execution within ~5000ns
> - on dequeue, kworker gets assigned a negative lag
> Is this expected behavior? With this short execution time, I would
> expect the kworker to be fine.
> For a more detailed discussion on this symptom, please see:
> https://lore.kernel.org/netdev/ZWbapeL34Z8AMR5f@DESKTOP-2CCOB1S./T/
Does the lag clamping path from Xuewen Yan [1] work for the vhost case
mentioned in the thread? Instead of placing the task just behind the
0-lag point, clamping the lag seems to be a more principled approach since
EEVDF already does it in update_entity_lag() (rough sketch below).
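For context, that clamping happens when the entity is dequeued and looks
roughly like this (abridged, kernel version dependent):

	static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
	{
		s64 lag, limit;

		lag = avg_vruntime(cfs_rq) - se->vruntime;

		/* limit the stored lag to about two slices worth of vruntime */
		limit = calc_delta_fair(max_t(u64, 2 * se->slice, TICK_NSEC), se);
		se->vlag = clamp(lag, -limit, limit);
	}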
If the lag is still too large, maybe the above coupled with Peter's
delayed dequeue patch can help [2] (Note: tree is prone to force
updates)
[1] https://lore.kernel.org/lkml/20240130080643.1828-1-xuewen.yan@unisoc.com/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf&id=e62ef63a888c97188a977daddb72b61548da8417
> 2. The proposed code change of course only addresses the symptom. Am I
> assuming correctly that this is in general the expected behavior and
> that the task waking up the kworker should rather do an explicit
> reschedule of itself to grant the kworker time to execute?
> In the vhost case, this is currently attempted through a cond_resched
> which is not doing anything because the need_resched flag is not set.
>
> Feedback and opinions would be highly appreciated.
>
> Signed-off-by: Tobias Huschle <huschle@linux.ibm.com>
> ---
> kernel/sched/fair.c | 5 +++++
> kernel/sched/features.h | 1 +
> 2 files changed, 6 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 533547e3c90a..c20ae6d62961 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5239,6 +5239,11 @@ place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> lag = div_s64(lag, load);
> }
>
> + if (sched_feat(NOLAG_WAKEUP) && (flags & ENQUEUE_WAKEUP)) {
> + se->vlag = 0;
> + lag = 1;
> + }
> +
> se->vruntime = vruntime - lag;
>
> /*
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 143f55df890b..d3118e7568b4 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -7,6 +7,7 @@
> SCHED_FEAT(PLACE_LAG, true)
> SCHED_FEAT(PLACE_DEADLINE_INITIAL, true)
> SCHED_FEAT(RUN_TO_PARITY, true)
> +SCHED_FEAT(NOLAG_WAKEUP, true)
>
> /*
> * Prefer to schedule the task we woke last (assuming it failed
--
Thanks and Regards,
Prateek
On Thu, Feb 29, 2024 at 09:06:16AM +0530, K Prateek Nayak wrote:
> Does the lag clamping path from Xuewen Yan [1] work for the vhost case
> mentioned in the thread? Instead of placing the task just behind the
> 0-lag point, clamping the lag seems to be a more principled approach since
> EEVDF already does it in update_entity_lag().
>
> If the lag is still too large, maybe the above coupled with Peter's
> delayed dequeue patch can help [2] (Note: tree is prone to force
> updates)
>
> [1] https://lore.kernel.org/lkml/20240130080643.1828-1-xuewen.yan@unisoc.com/
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=sched/eevdf&id=e62ef63a888c97188a977daddb72b61548da8417

I tried Peter's patches a while ago. Unfortunately, reducing the lag is not
sufficient in this particular case. The calling entity expects the woken up
kworker to run instantly. In order to have a chance that the woken up
kworker is scheduled right away, the kworker must not have any negative
lag. To guarantee it being scheduled, it should even have a positive lag
which allows it to pass all other entities on the queue. Therefore I
proposed to just wipe the negative lag in these cases, which seems to map
to strategy #2 of the underlying paper.

The other way to think about this would be: The assumption that woken up
tasks get a high probability to run is no longer valid. In that case, the
entity triggering the wake up has to explicitly give up the CPU. If there
are no other tasks apart from the 2 involved so far, the woken up kworker
has good chances of being scheduled. If the runqueue is busy, other tasks
might intervene.

I keep playing around with these options, but potential side effects are
worrying me.