[Patch] select_idle_sibling v.s. DELAYED_DEQUEUE

Tingjia Cao posted 1 patch 1 week, 1 day ago
kernel/sched/fair.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Posted by Tingjia Cao 1 week, 1 day ago
Recently, we encountered an issue where a sync wakeup from a per-CPU
kthread did not select the current CPU even though the waker was the only
runnable task. It is caused by a conflict between the delayed dequeue
feature and the select_idle_sibling function.

With the DELAYED_DEQUEUE mechanism enabled, a task that goes to sleep may
not be removed from the runqueue immediately. As a result, nr_running may
overcount the number of runnable tasks. Inside select_idle_sibling, there
is a special case for sync wakeup:

if (is_per_cpu_kthread(current) &&
    in_task() &&
    prev == smp_processor_id() &&
    this_rq()->nr_running <= 1 &&
    asym_fits_cpu(...)) {
    return prev;
}

For "this_rq()->nr_running <= 1": we should use the number of tasks that
are actually runnable on the rq, excluding delayed-dequeue tasks, to decide
whether to place the woken task on the current CPU.

To fix this (patch attached), we can use the true number of runnable tasks
by subtracting the delayed-dequeue count:

        this_rq()->nr_running - cfs_h_nr_delayed(this_rq()) <= 1
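
To make the difference between the two conditions concrete, here is a
minimal user-space sketch. It is only a toy model: struct toy_rq, its
fields, and both helper functions are hypothetical names made up for this
illustration and are not the kernel's struct rq or cfs_h_nr_delayed().

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a runqueue; NOT the kernel's struct rq. */
struct toy_rq {
	unsigned int nr_running;  /* all queued tasks, delayed ones included */
	unsigned int nr_delayed;  /* queued tasks that have already gone to sleep */
};

/* Original condition: delayed tasks are counted as if they were runnable. */
static bool only_waker_runnable_old(const struct toy_rq *rq)
{
	return rq->nr_running <= 1;
}

/* Proposed condition: discount delayed tasks before comparing. */
static bool only_waker_runnable_new(const struct toy_rq *rq)
{
	/* Both counters are unsigned, so guard against wrap-around. */
	assert(rq->nr_delayed <= rq->nr_running);
	return rq->nr_running - rq->nr_delayed <= 1;
}
```

With nr_running = 2 and nr_delayed = 1 (the waker plus one sleeper that is
still queued), the old check fails while the new one passes.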


Best,
Tingjia
From 2540ac815e9cfa47e984a828139526e290f4f459 Mon Sep 17 00:00:00 2001
From: Tingjia-0v0 <tjcao980311@gmail.com>
Date: Sat, 22 Nov 2025 21:42:00 -0600
Subject: [PATCH] fix select_idle_sibling vs DELAYED_DEQUEUE

---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5b752324270b..d60a3f5ebeca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7869,7 +7869,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
 	if (is_per_cpu_kthread(current) &&
 	    in_task() &&
 	    prev == smp_processor_id() &&
-	    this_rq()->nr_running <= 1 &&
+	    this_rq()->nr_running - cfs_h_nr_delayed(this_rq()) <= 1 &&
 	    asym_fits_cpu(task_util, util_min, util_max, prev)) {
 		return prev;
 	}
-- 
2.43.0

Re: [Patch] select_idle_sibling v.s. DELAYED_DEQUEUE
Posted by K Prateek Nayak 1 week ago
Hello Tingjia,

On 11/23/2025 9:34 AM, Tingjia Cao wrote:
> Recently, we encountered an issue that sync wakeup kthread didn't choose the current CPU though the waker is the only runnable task. It is caused by a conflict between delayed dequeue feature and select_idle_sibling function.
> 
> With the DELAYED_DEQUEUE mechanism enabled, a task that goes to sleep may not be removed from the runqueue immediately. As a result, nr_running may overcount the number of runnable tasks. Inside select_idle_sibling, there is a special case for sync wakeup:
> 
> if (is_per_cpu_kthread(current) &&
>     in_task() &&
>     prev == smp_processor_id() &&
>     this_rq()->nr_running <= 1 &&
>     asym_fits_cpu(...)) {
>     return prev;
> }
> 
> For "this_rq()->nr_running <= 1": we should use the real running-tasks rq to check whether to place the wake-up task to the current cpu.
> 
> To fix this (patch attached), we can use the true number of runnable tasks by subtracting the delayed-dequeue count:
> 
>         this_rq()->nr_running - cfs_h_nr_delayed(this_rq()) <= 1

This is a very transient state: tasks cannot be delayed unless other
runnable tasks are present at the time of dequeue, and soon after the
last runnable task is dequeued, all the pending delayed tasks get
dequeued as well. The window is actually very small. Does this make a
difference in your workload performance?
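
The transient state described above can be sketched with a toy model. This
models only the behaviour as stated in this reply, not the scheduler's
actual bookkeeping: struct toy_rq and toy_sleep() are hypothetical names,
every eligible sleeper is assumed to be delayed, and the flush of delayed
tasks is modelled as immediate rather than "soon after".

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a runqueue; NOT the kernel's struct rq. */
struct toy_rq {
	unsigned int nr_running;  /* queued tasks, delayed ones included */
	unsigned int nr_delayed;  /* queued tasks that are already asleep */
};

/* One runnable task goes to sleep. Returns true if its dequeue was delayed. */
static bool toy_sleep(struct toy_rq *rq)
{
	unsigned int runnable = rq->nr_running - rq->nr_delayed;

	assert(runnable >= 1);
	if (runnable > 1) {
		/* Other runnable tasks exist, so the sleeper stays queued. */
		rq->nr_delayed++;
		return true;
	}
	/* Last runnable task: it leaves and the delayed tasks are flushed. */
	rq->nr_running -= 1 + rq->nr_delayed;
	rq->nr_delayed = 0;
	return false;
}
```

Starting from two runnable tasks, the first sleep leaves nr_running at 2
with one delayed task (the window in question), and the second sleep
empties the toy runqueue.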

Once all tasks are dequeued, the newidle balance should run on the CPU
going idle to help reduce any imbalance.

-- 
Thanks and Regards,
Prateek