The sched_idle_cpu() considerations in select_task_rq_fair() are based
on the assumption that the wakee task can pick a cpu running a sched_idle
task and preempt it to start running faster than it could by picking an
idle cpu and preempting the idle task.
This assumption is correct, but it also brings some problems:

1. work conservation: sched_idle tasks often end up picking a cpu that is
already running a sched_idle task instead of utilizing a really idle cpu,
so work conservation is somewhat broken.
2. sched_idle group: sched_idle_cpu() is simply not correct when a
sched_idle group is running. Look at the simple example below.
root
/ \
kubepods system
/ \
burstable besteffort
(cpu.idle == 1)
When a cpu is only running tasks from the besteffort group, sched_idle_cpu()
returns true, but this cpu pick is bad for a wakee task from the system
group: because the system group has a lower weight than kubepods, work
conservation is somewhat broken here too.
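
For reference, the check is purely per-rq, so it cannot see which cgroup the
queued tasks or the waker belong to. Roughly (paraphrased here, and the exact
field name has varied across kernel versions), it looks like:

/*
 * Paraphrase of the existing helpers, not the exact tree code (the idle
 * counter has been named idle_h_nr_running / h_nr_idle over time): a cpu
 * counts as "sched_idle" whenever everything queued on it is in the idle
 * class, no matter which group made it so or who is waking up.
 */
static int sched_idle_rq(struct rq *rq)
{
	return unlikely(rq->nr_running == rq->cfs.idle_h_nr_running &&
			rq->nr_running);
}

static int sched_idle_cpu(int cpu)
{
	return sched_idle_rq(cpu_rq(cpu));
}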
In a nutshell, sched_idle_cpu() should consider the wakee task group's
relationship with the sched_idle tasks running on the cpu.

Obviously, that is hard to do. This patch takes the simple approach of
removing all sched_idle_cpu() considerations from select_task_rq_fair() to
bring back work conservation in these cases.
Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev>
---
kernel/sched/fair.c | 23 ++++++++---------------
1 file changed, 8 insertions(+), 15 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b173a059315c..d9de63cdc314 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7355,9 +7355,6 @@ sched_balance_find_dst_group_cpu(struct sched_group *group, struct task_struct *
if (!sched_core_cookie_match(rq, p))
continue;
- if (sched_idle_cpu(i))
- return i;
-
if (available_idle_cpu(i)) {
struct cpuidle_state *idle = idle_get_state(rq);
if (idle && idle->exit_latency < min_exit_latency) {
@@ -7446,7 +7443,7 @@ static inline int sched_balance_find_dst_cpu(struct sched_domain *sd, struct tas
static inline int __select_idle_cpu(int cpu, struct task_struct *p)
{
- if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
+ if (available_idle_cpu(cpu) &&
sched_cpu_cookie_match(cpu_rq(cpu), p))
return cpu;
@@ -7519,13 +7516,9 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
for_each_cpu(cpu, cpu_smt_mask(core)) {
if (!available_idle_cpu(cpu)) {
idle = false;
- if (*idle_cpu == -1) {
- if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, cpus)) {
- *idle_cpu = cpu;
- break;
- }
+ if (*idle_cpu == -1)
continue;
- }
+
break;
}
if (*idle_cpu == -1 && cpumask_test_cpu(cpu, cpus))
@@ -7555,7 +7548,7 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
*/
if (!cpumask_test_cpu(cpu, sched_domain_span(sd)))
continue;
- if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
+ if (available_idle_cpu(cpu))
return cpu;
}
@@ -7677,7 +7670,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
for_each_cpu_wrap(cpu, cpus, target) {
unsigned long cpu_cap = capacity_of(cpu);
- if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
+ if (!available_idle_cpu(cpu))
continue;
fits = util_fits_cpu(task_util, util_min, util_max, cpu);
@@ -7748,7 +7741,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
*/
lockdep_assert_irqs_disabled();
- if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
+ if (available_idle_cpu(target) &&
asym_fits_cpu(task_util, util_min, util_max, target))
return target;
@@ -7756,7 +7749,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
* If the previous CPU is cache affine and idle, don't be stupid:
*/
if (prev != target && cpus_share_cache(prev, target) &&
- (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
+ available_idle_cpu(prev) &&
asym_fits_cpu(task_util, util_min, util_max, prev)) {
if (!static_branch_unlikely(&sched_cluster_active) ||
@@ -7788,7 +7781,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
if (recent_used_cpu != prev &&
recent_used_cpu != target &&
cpus_share_cache(recent_used_cpu, target) &&
- (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
+ available_idle_cpu(recent_used_cpu) &&
cpumask_test_cpu(recent_used_cpu, p->cpus_ptr) &&
asym_fits_cpu(task_util, util_min, util_max, recent_used_cpu)) {
--
2.20.1
On Mon, 18 Aug 2025 at 14:47, Chengming Zhou <chengming.zhou@linux.dev> wrote:
>
> [...]
>
> 1. work conservation: sched_idle tasks often end up picking a cpu that is
> already running a sched_idle task instead of utilizing a really idle cpu,
> so work conservation is somewhat broken.

But this will be fixed by the next idle load balancing, won't it? Keep in
mind that a sched_idle task has a particular meaning: it does not preempt
other tasks and it gets a very low share.

> 2. sched_idle group: sched_idle_cpu() is simply not correct when a
> sched_idle group is running. [...]
>
> When a cpu is only running tasks from the besteffort group, sched_idle_cpu()
> returns true, but this cpu pick is bad for a wakee task from the system
> group: because the system group has a lower weight than kubepods, work
> conservation is somewhat broken here too.

sched_idle tasks in a non-sched_idle group have always been a problem,
because people want those tasks to be preempted immediately but they also
give them more weight than plain sched_idle would, through the weight of
their parent group.

> In a nutshell, sched_idle_cpu() should consider the wakee task group's
> relationship with the sched_idle tasks running on the cpu.
>
> Obviously, that is hard to do. This patch takes the simple approach of
> removing all sched_idle_cpu() considerations from select_task_rq_fair() to
> bring back work conservation in these cases.

No, a CPU running only sched_idle tasks should be considered an idle CPU
from a scheduling PoV. You need to find another way to fix it. Why not use
SCHED_BATCH instead of sched_idle, if those sched_idle tasks are ultimately
more important than just using the remaining cycles on CPUs?
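
For readers unfamiliar with the suggestion above, a minimal user-space sketch
of moving a task to SCHED_BATCH instead of SCHED_IDLE could look roughly like
the following; pid 0 means the calling thread, and the nice value 19 is only
an example of keeping the weight low without entering the idle class:

/* Sketch: switch the calling task to SCHED_BATCH with a high nice value.
 * SCHED_BATCH keeps ordinary weight semantics (no idle-class special
 * casing) while hinting that the task is CPU-bound. */
#define _GNU_SOURCE
#include <sched.h>
#include <sys/resource.h>
#include <stdio.h>

int main(void)
{
	struct sched_param sp = { .sched_priority = 0 };

	if (sched_setscheduler(0 /* self */, SCHED_BATCH, &sp))
		perror("sched_setscheduler");

	/* Shrink the weight as far as possible without going SCHED_IDLE. */
	if (setpriority(PRIO_PROCESS, 0, 19))
		perror("setpriority");

	return 0;
}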
On 8/18/25 13:47, Chengming Zhou wrote:
> [...]
>
> Obviously, that is hard to do. This patch takes the simple approach of
> removing all sched_idle_cpu() considerations from select_task_rq_fair() to
> bring back work conservation in these cases.

OTOH sched_idle_cpu() CPUs are guaranteed to not be in an idle state and
potentially already have DVFS on some higher level...
On 8/18/2025 9:24 PM, Christian Loehle wrote:
> On 8/18/25 13:47, Chengming Zhou wrote:
>> [...]
>
> OTOH sched_idle_cpu() CPUs are guaranteed to not be in an idle state and
> potentially already have DVFS on some higher level...

Is it because the schedutil governor considers the utilization of SCHED_IDLE
tasks, thus causing schedutil to request a higher frequency?

The commit 3c29e651e16d ("sched/fair: Fall back to sched-idle CPU if an idle
CPU isn't found") mentions that choosing a CPU running a SCHED_IDLE task can
avoid waking a CPU from a deep sleep state.

If this is the case, can we say that if an administrator sets the cpufreq
governor to "performance" and disables deep idle states, an idle CPU would be
preferable to a CPU running a SCHED_IDLE task? On the other hand, if
per_cpu(cpufreq_update_util_data, cpu) is NULL and only shallow idle states
are enabled in idle_get_state(), should we skip SCHED_IDLE CPUs to achieve
work conservation?

thanks,
Chenyu
On 8/19/25 16:32, Chen, Yu C wrote:
> On 8/18/2025 9:24 PM, Christian Loehle wrote:
>> [...]
>
> Is it because the schedutil governor considers the utilization of SCHED_IDLE
> tasks, thus causing schedutil to request a higher frequency?

For intel_pstate active (HWP and !HWP) the same issue should persist, no?

> [...]
>
> If this is the case, can we say that if an administrator sets the cpufreq
> governor to "performance" and disables deep idle states, an idle CPU would be
> preferable to a CPU running a SCHED_IDLE task? On the other hand, if
> per_cpu(cpufreq_update_util_data, cpu) is NULL and only shallow idle states
> are enabled in idle_get_state(), should we skip SCHED_IDLE CPUs to achieve
> work conservation?

That's probably getting the most out of it. That being said, strictly
speaking the SCHED_IDLE CPU and the shallow-idle CPU may still share a power
and thermal budget, which may make preempting the sched_idle task on the
SCHED_IDLE CPU the better choice.
+cc Josh and Viresh, I forgot to cc you, sorry!

On 2025/8/20 21:53, Christian Loehle wrote:
> On 8/19/25 16:32, Chen, Yu C wrote:
>> [...]
>
> That's probably getting the most out of it. That being said, strictly
> speaking the SCHED_IDLE CPU and the shallow-idle CPU may still share a power
> and thermal budget, which may make preempting the sched_idle task on the
> SCHED_IDLE CPU the better choice.

I admit these sched_idle_cpu() considerations are good motivation, at least
for !sched_idle tasks, to preempt a sched_idle task instead of waking a real
idle cpu. But they oversimplify the complicated situations we have, such as
the cases above, which make the scheduler's work-conservation behaviour
really bad.

Thanks.
On Wed, Aug 20, 2025 at 6:53 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
>
> +cc Josh and Viresh, I forgot to cc you, sorry!

Thanks, missed this previously :)

> [...]
>>>>>> 2. sched_idle group: sched_idle_cpu() is simply not correct when a
>>>>>> sched_idle group is running. [...]

Thanks for bringing attention to this scenario, it's been a case I've worried
about but haven't had a good idea about fixing. Ideally we could
find_matching_se(), but we want to do these checks locklessly and quickly, so
that's out of the question. Agree on it being a hard problem.

One idea is that we at least handle the (what I think is fairly typical)
scenario of a root-level sched_idle group well (a root-level sched_idle group
is trivially idle with respect to anything else in the system that is not
also nested under a root-level sched_idle group). It would be fairly easy to
track a nr_idle_queued cfs_rq field, as well as cache on task enqueue whether
it nests under a sched_idle group.

Best,
Josh
On 2025/8/22 02:13, Josh Don wrote:
> [...]
>
> Thanks for bringing attention to this scenario, it's been a case I've worried
> about but haven't had a good idea about fixing. Ideally we could
> find_matching_se(), but we want to do these checks locklessly and quickly, so
> that's out of the question. Agree on it being a hard problem.

Yeah, right, we don't want to use find_matching_se() here.

> One idea is that we at least handle the (what I think is fairly typical)
> scenario of a root-level sched_idle group well (a root-level

You mean the /kubepods and /system groups in this case, right? Both of them
are not sched_idle here.

> sched_idle group is trivially idle with respect to anything else in the
> system that is not also nested under a root-level sched_idle group). It
> would be fairly easy to track a nr_idle_queued cfs_rq field, as well as
> cache on task enqueue whether it nests under a sched_idle group.

Ok, we can track whether a task nests under a sched_idle group. But tasks
from /system and /kubepods/burstable are not under any sched_idle group, and
there seems to be no way to distinguish them except using find_matching_se().

Thanks!
On Sun, Aug 24, 2025 at 11:58 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> [...]
>
> You mean the /kubepods and /system groups in this case, right? Both of them
> are not sched_idle here.

Correct.

> [...]
>
> Ok, we can track whether a task nests under a sched_idle group. But tasks
> from /system and /kubepods/burstable are not under any sched_idle group, and
> there seems to be no way to distinguish them except using find_matching_se().

nr_idle_queued on the cfs_rq seems like the way to do it, but I agree it is a
tricky problem.
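
To make the idea discussed above concrete, here is a rough user-space model
of that bookkeeping, with hypothetical names (rq_model, nr_idle_queued,
under_root_idle); it only illustrates the kind of lockless check a cached
per-task flag plus a per-rq counter would enable, and is not kernel code:

#include <stdbool.h>
#include <stdio.h>

/*
 * Model: each runqueue counts how many of its queued tasks nest under a
 * root-level sched_idle group, and each task caches that property at
 * enqueue time.  A wakeup can then decide, without walking the cgroup
 * hierarchy, whether every queued task is "idle" with respect to a waker
 * that is not itself under a root-level idle group.
 */
struct rq_model {
	unsigned int nr_running;
	unsigned int nr_idle_queued;
};

struct task_model {
	bool under_root_idle;	/* cached when the task is enqueued */
};

static void enqueue(struct rq_model *rq, const struct task_model *p)
{
	rq->nr_running++;
	rq->nr_idle_queued += p->under_root_idle;
}

static bool idle_wrt_waker(const struct rq_model *rq, const struct task_model *waker)
{
	if (waker->under_root_idle)
		return false;	/* same class: no cheap-preemption benefit */
	return rq->nr_running && rq->nr_running == rq->nr_idle_queued;
}

int main(void)
{
	struct rq_model rq = { 0, 0 };
	struct task_model idle_task = { .under_root_idle = true };
	struct task_model normal_task = { .under_root_idle = false };

	enqueue(&rq, &idle_task);
	printf("idle wrt normal waker: %d\n", idle_wrt_waker(&rq, &normal_task));

	enqueue(&rq, &normal_task);
	printf("idle wrt normal waker: %d\n", idle_wrt_waker(&rq, &normal_task));
	return 0;
}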