[v1] sched/fair: Only increment deadline once on yield

[PATCH] sched/fair: Only increment deadline once on yield

Posted by Fernand Sieber 10 months, 1 week ago

If a task yields, the scheduler may decide to pick it again. The task in
turn may decide to yield immediately or shortly after, leading to a tight
loop of yields.

If there's another runnable task as this point, the deadline will be
increased by the slice at each loop. This can cause the deadline to runaway
pretty quickly, and subsequent elevated run delays later on as the task
doesn't get picked again. The reason the scheduler can pick the same task
again and again despite its deadline increasing is because it may be the
only eligible task at that point.

Fix this by updating the deadline only to one slice ahead.

Note, we might want to consider iterating on the implementation of yield as
follow up:
* the yielding task could be forfeiting its remaining slice by
  incrementing its vruntime correspondingly
* in case of yield_to the yielding task could be donating its remaining
  slice to the target task

Signed-off-by: Fernand Sieber <sieberf@amazon.com>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e43993a4e580..c1eff68d8ffc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9024,7 +9024,7 @@ static void yield_task_fair(struct rq *rq)
 	 */
 	rq_clock_skip_update(rq);

-	se->deadline += calc_delta_fair(se->slice, se);
+	se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
 }

 static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
--
2.47.1




Amazon Development Centre (South Africa) (Proprietary) Limited
29 Gogosoa Street, Observatory, Cape Town, Western Cape, 7925, South Africa
Registration Number: 2004 / 034463 / 07

Re: [PATCH] sched/fair: Only increment deadline once on yield

Posted by Alexander Graf 10 months ago

On 01.04.25 14:36, Fernand Sieber wrote:
> If a task yields, the scheduler may decide to pick it again. The task in
> turn may decide to yield immediately or shortly after, leading to a tight
> loop of yields.
>
> If there's another runnable task as this point, the deadline will be
> increased by the slice at each loop. This can cause the deadline to runaway
> pretty quickly, and subsequent elevated run delays later on as the task
> doesn't get picked again. The reason the scheduler can pick the same task
> again and again despite its deadline increasing is because it may be the
> only eligible task at that point.
>
> Fix this by updating the deadline only to one slice ahead.
>
> Note, we might want to consider iterating on the implementation of yield as
> follow up:
> * the yielding task could be forfeiting its remaining slice by
>    incrementing its vruntime correspondingly
> * in case of yield_to the yielding task could be donating its remaining
>    slice to the target task
>
> Signed-off-by: Fernand Sieber <sieberf@amazon.com>


IMHO it's worth noting that this is not a theoretical issue. We have 
seen this in real life: A KVM virtual machine's vCPU which runs into a 
busy guest spin lock calls kvm_vcpu_yield_to() which eventually ends up 
in the yield_task_fair() function. We have seen such spin locks due to 
guest contention rather than host overcommit, which means we go into a 
loop of vCPU execution and spin loop exit, which results in an 
undesirable increase in the vCPU thread's deadline.

Given this impacts real workloads and is a bug present since the 
introduction of EEVDF, I would say it warrants a

Fixes: 147f3efaa24182 ("sched/fair: Implement an EEVDF-like scheduling 
policy")

tag.


Alex

Re: [PATCH] sched/fair: Only increment deadline once on yield

Posted by Wang Tao 5 months ago

Picking up this dead thread again.

Has this patch been applied to the mainline or other branch? 

Thanks.

Re: [PATCH] sched/fair: Only increment deadline once on yield

Posted by Fernand Sieber 4 months, 4 weeks ago

Hi Tao,

The patch hasn't been merged yet.
I have resent this patch with fixes tag and additional maintainers in Cc.
Please review on the following thread:

Link: https://lore.kernel.org/lkml/20250911095113.203439-1-sieberf@amazon.com

Thanks,
Fernand



Amazon Development Centre (South Africa) (Proprietary) Limited
29 Gogosoa Street, Observatory, Cape Town, Western Cape, 7925, South Africa
Registration Number: 2004 / 034463 / 07

[PATCH] Re: [PATCH] sched/fair: Only increment deadline once on yield

Posted by Wang Tao 6 months, 2 weeks ago

>> On 01/04/25 18:06, Fernand Sieber wrote:
>> If a task yields, the scheduler may decide to pick it again. The task in
>> turn may decide to yield immediately or shortly after, leading to a tight
>> loop of yields.
>>
>> If there's another runnable task as this point, the deadline will be
>> increased by the slice at each loop. This can cause the deadline to runaway
>> pretty quickly, and subsequent elevated run delays later on as the task
>> doesn't get picked again. The reason the scheduler can pick the same task
>> again and again despite its deadline increasing is because it may be the
>> only eligible task at that point.
>>
>> Fix this by updating the deadline only to one slice ahead.
>>
>> Note, we might want to consider iterating on the implementation of yield as
>> follow up:
>> * the yielding task could be forfeiting its remaining slice by
>>    incrementing its vruntime correspondingly
>> * in case of yield_to the yielding task could be donating its remaining
>>    slice to the target task
>>
>> Signed-off-by: Fernand Sieber <sieberf@amazon.com>


>IMHO it's worth noting that this is not a theoretical issue. We have 
>seen this in real life: A KVM virtual machine's vCPU which runs into a 
>busy guest spin lock calls kvm_vcpu_yield_to() which eventually ends up 
>in the yield_task_fair() function. We have seen such spin locks due to 
>guest contention rather than host overcommit, which means we go into a 
>loop of vCPU execution and spin loop exit, which results in an 
>undesirable increase in the vCPU thread's deadline.

>Given this impacts real workloads and is a bug present since the 
>introduction of EEVDF, I would say it warrants a

>Fixes: 147f3efaa24182 ("sched/fair: Implement an EEVDF-like scheduling 
>policy")

>tag.


>Alex

Actually, as Alex described, we encountered the same issue in this 
testing scenario: starting qemu, binding cores to the cpuset group, 
setting cpuset.cpus=1-3 for stress testing in qemu, 
running taskset -c 1-3 ./stress-ng -c 20, and then encountering an error where qemu freezes, 
reporting a soft lockup issue in qemu. After applying this patch, the problem was resolved.
Do we have plans to merge this patch into the mainline?

Re: [PATCH] sched/fair: Only increment deadline once on yield

Posted by Madadi Vineeth Reddy 10 months ago

Hi Fernand,

On 01/04/25 18:06, Fernand Sieber wrote:
> If a task yields, the scheduler may decide to pick it again. The task in
> turn may decide to yield immediately or shortly after, leading to a tight
> loop of yields.
> 
> If there's another runnable task as this point, the deadline will be
> increased by the slice at each loop. This can cause the deadline to runaway
> pretty quickly, and subsequent elevated run delays later on as the task
> doesn't get picked again. The reason the scheduler can pick the same task
> again and again despite its deadline increasing is because it may be the
> only eligible task at that point.
> 
> Fix this by updating the deadline only to one slice ahead.
> 
> Note, we might want to consider iterating on the implementation of yield as
> follow up:
> * the yielding task could be forfeiting its remaining slice by
>   incrementing its vruntime correspondingly
> * in case of yield_to the yielding task could be donating its remaining
>   slice to the target task
> 
> Signed-off-by: Fernand Sieber <sieberf@amazon.com>
> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e43993a4e580..c1eff68d8ffc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9024,7 +9024,7 @@ static void yield_task_fair(struct rq *rq)
>  	 */
>  	rq_clock_skip_update(rq);
> 
> -	se->deadline += calc_delta_fair(se->slice, se);
> +	se->deadline = se->vruntime + calc_delta_fair(se->slice, se);
>  }

Makes sense. Deadline would be rapidly incremented in the scenario of
having other tasks not eligible and the current task calls yield
repeatedly for a very short run time.

Reviewed-by: Madadi Vineeth Reddy <vineethr@linux.ibm.com>

Thanks,
Madadi Vineeth Reddy

> 
>  static bool yield_to_task_fair(struct rq *rq, struct task_struct *p)
> --
> 2.47.1
> 
> 
> 
> 
> Amazon Development Centre (South Africa) (Proprietary) Limited
> 29 Gogosoa Street, Observatory, Cape Town, Western Cape, 7925, South Africa
> Registration Number: 2004 / 034463 / 07
>