The sched_idle_cpu() considerations in select_task_rq_fair() are based
on the assumption that the wakee task can pick a cpu running a sched_idle
task and preempt it to start running faster than it could by picking an
idle cpu and preempting the idle task.
This assumption is correct, but it also brings some problems:

1. work conservation: sched_idle tasks often end up picking a cpu that is
already running a sched_idle task instead of utilizing a really idle cpu,
so work conservation is somewhat broken.
2. sched_idle group: sched_idle_cpu() is simply not correct when a
sched_idle group is running. Look at the simple example below.
root
/ \
kubepods system
/ \
burstable besteffort
(cpu.idle == 1)
When a cpu is only running tasks from the besteffort group, sched_idle_cpu()
returns true, but this cpu pick is bad for a wakee task from the system
group: because the system group has a lower weight than kubepods, work
conservation is somewhat broken here too.
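
For reference, the check is purely per-rq, so it cannot see which cgroup the
queued tasks or the waker belong to. Roughly (paraphrased here, and the exact
field name has varied across kernel versions), it looks like:

/*
 * Paraphrase of the existing helpers, not the exact tree code (the idle
 * counter has been named idle_h_nr_running / h_nr_idle over time): a cpu
 * counts as "sched_idle" whenever everything queued on it is in the idle
 * class, no matter which group made it so or who is waking up.
 */
static int sched_idle_rq(struct rq *rq)
{
	return unlikely(rq->nr_running == rq->cfs.idle_h_nr_running &&
			rq->nr_running);
}

static int sched_idle_cpu(int cpu)
{
	return sched_idle_rq(cpu_rq(cpu));
}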
In a nutshell, sched_idle_cpu() should consider the wakee task group's
relationship with the sched_idle tasks running on the cpu.

Obviously, that is hard to do. This patch takes the simple approach of
removing all sched_idle_cpu() considerations from select_task_rq_fair() to
bring back work conservation in these cases.
Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev>
---
kernel/sched/fair.c | 23 ++++++++---------------
1 file changed, 8 insertions(+), 15 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b173a059315c..d9de63cdc314 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7355,9 +7355,6 @@ sched_balance_find_dst_group_cpu(struct sched_group *group, struct task_struct *
if (!sched_core_cookie_match(rq, p))
continue;
- if (sched_idle_cpu(i))
- return i;
-
if (available_idle_cpu(i)) {
struct cpuidle_state *idle = idle_get_state(rq);
if (idle && idle->exit_latency < min_exit_latency) {
@@ -7446,7 +7443,7 @@ static inline int sched_balance_find_dst_cpu(struct sched_domain *sd, struct tas
static inline int __select_idle_cpu(int cpu, struct task_struct *p)
{
- if ((available_idle_cpu(cpu) || sched_idle_cpu(cpu)) &&
+ if (available_idle_cpu(cpu) &&
sched_cpu_cookie_match(cpu_rq(cpu), p))
return cpu;
@@ -7519,13 +7516,9 @@ static int select_idle_core(struct task_struct *p, int core, struct cpumask *cpu
for_each_cpu(cpu, cpu_smt_mask(core)) {
if (!available_idle_cpu(cpu)) {
idle = false;
- if (*idle_cpu == -1) {
- if (sched_idle_cpu(cpu) && cpumask_test_cpu(cpu, cpus)) {
- *idle_cpu = cpu;
- break;
- }
+ if (*idle_cpu == -1)
continue;
- }
+
break;
}
if (*idle_cpu == -1 && cpumask_test_cpu(cpu, cpus))
@@ -7555,7 +7548,7 @@ static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int t
*/
if (!cpumask_test_cpu(cpu, sched_domain_span(sd)))
continue;
- if (available_idle_cpu(cpu) || sched_idle_cpu(cpu))
+ if (available_idle_cpu(cpu))
return cpu;
}
@@ -7677,7 +7670,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
for_each_cpu_wrap(cpu, cpus, target) {
unsigned long cpu_cap = capacity_of(cpu);
- if (!available_idle_cpu(cpu) && !sched_idle_cpu(cpu))
+ if (!available_idle_cpu(cpu))
continue;
fits = util_fits_cpu(task_util, util_min, util_max, cpu);
@@ -7748,7 +7741,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
*/
lockdep_assert_irqs_disabled();
- if ((available_idle_cpu(target) || sched_idle_cpu(target)) &&
+ if (available_idle_cpu(target) &&
asym_fits_cpu(task_util, util_min, util_max, target))
return target;
@@ -7756,7 +7749,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
* If the previous CPU is cache affine and idle, don't be stupid:
*/
if (prev != target && cpus_share_cache(prev, target) &&
- (available_idle_cpu(prev) || sched_idle_cpu(prev)) &&
+ available_idle_cpu(prev) &&
asym_fits_cpu(task_util, util_min, util_max, prev)) {
if (!static_branch_unlikely(&sched_cluster_active) ||
@@ -7788,7 +7781,7 @@ static int select_idle_sibling(struct task_struct *p, int prev, int target)
if (recent_used_cpu != prev &&
recent_used_cpu != target &&
cpus_share_cache(recent_used_cpu, target) &&
- (available_idle_cpu(recent_used_cpu) || sched_idle_cpu(recent_used_cpu)) &&
+ available_idle_cpu(recent_used_cpu) &&
cpumask_test_cpu(recent_used_cpu, p->cpus_ptr) &&
asym_fits_cpu(task_util, util_min, util_max, recent_used_cpu)) {
--
2.20.1
On Mon, 18 Aug 2025 at 14:47, Chengming Zhou <chengming.zhou@linux.dev> wrote:
>
> [...]
>
> 1. work conservation: sched_idle tasks often end up picking a cpu that is
> already running a sched_idle task instead of utilizing a really idle cpu,
> so work conservation is somewhat broken.

But this will be fixed by the next idle load balancing, won't it? Keep in
mind that a sched_idle task has a particular meaning: it does not preempt
other tasks and it gets a very low share.

> 2. sched_idle group: sched_idle_cpu() is simply not correct when a
> sched_idle group is running. [...]
>
> When a cpu is only running tasks from the besteffort group, sched_idle_cpu()
> returns true, but this cpu pick is bad for a wakee task from the system
> group: because the system group has a lower weight than kubepods, work
> conservation is somewhat broken here too.

sched_idle tasks in a non-sched_idle group have always been a problem,
because people want those tasks to be preempted immediately but they also
give them more weight than plain sched_idle would, through the weight of
their parent group.

> In a nutshell, sched_idle_cpu() should consider the wakee task group's
> relationship with the sched_idle tasks running on the cpu.
>
> Obviously, that is hard to do. This patch takes the simple approach of
> removing all sched_idle_cpu() considerations from select_task_rq_fair() to
> bring back work conservation in these cases.

No, a CPU running only sched_idle tasks should be considered an idle CPU
from a scheduling PoV. You need to find another way to fix it. Why not use
SCHED_BATCH instead of sched_idle, if those sched_idle tasks are ultimately
more important than just using the remaining cycles on CPUs?
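
For readers unfamiliar with the suggestion above, a minimal user-space sketch
of moving a task to SCHED_BATCH instead of SCHED_IDLE could look roughly like
the following; pid 0 means the calling thread, and the nice value 19 is only
an example of keeping the weight low without entering the idle class:

/* Sketch: switch the calling task to SCHED_BATCH with a high nice value.
 * SCHED_BATCH keeps ordinary weight semantics (no idle-class special
 * casing) while hinting that the task is CPU-bound. */
#define _GNU_SOURCE
#include <sched.h>
#include <sys/resource.h>
#include <stdio.h>

int main(void)
{
	struct sched_param sp = { .sched_priority = 0 };

	if (sched_setscheduler(0 /* self */, SCHED_BATCH, &sp))
		perror("sched_setscheduler");

	/* Shrink the weight as far as possible without going SCHED_IDLE. */
	if (setpriority(PRIO_PROCESS, 0, 19))
		perror("setpriority");

	return 0;
}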
On 8/18/25 13:47, Chengming Zhou wrote:
> [...]
>
> Obviously, that is hard to do. This patch takes the simple approach of
> removing all sched_idle_cpu() considerations from select_task_rq_fair() to
> bring back work conservation in these cases.

OTOH sched_idle_cpu() CPUs are guaranteed to not be in an idle state and
potentially already have DVFS on some higher level...
On 8/18/2025 9:24 PM, Christian Loehle wrote:
> On 8/18/25 13:47, Chengming Zhou wrote:
>> [...]
>
> OTOH sched_idle_cpu() CPUs are guaranteed to not be in an idle state and
> potentially already have DVFS on some higher level...

Is it because the schedutil governor considers the utilization of SCHED_IDLE
tasks, thus causing schedutil to request a higher frequency?

The commit 3c29e651e16d ("sched/fair: Fall back to sched-idle CPU if an idle
CPU isn't found") mentions that choosing a CPU running a SCHED_IDLE task can
avoid waking a CPU from a deep sleep state.

If this is the case, can we say that if an administrator sets the cpufreq
governor to "performance" and disables deep idle states, an idle CPU would be
preferable to a CPU running a SCHED_IDLE task? On the other hand, if
per_cpu(cpufreq_update_util_data, cpu) is NULL and only shallow idle states
are enabled in idle_get_state(), should we skip SCHED_IDLE CPUs to achieve
work conservation?

thanks,
Chenyu
On 8/19/25 16:32, Chen, Yu C wrote:
> On 8/18/2025 9:24 PM, Christian Loehle wrote:
>> [...]
>
> Is it because the schedutil governor considers the utilization of SCHED_IDLE
> tasks, thus causing schedutil to request a higher frequency?

For intel_pstate active (HWP and !HWP) the same issue should persist, no?

> [...]
>
> If this is the case, can we say that if an administrator sets the cpufreq
> governor to "performance" and disables deep idle states, an idle CPU would be
> preferable to a CPU running a SCHED_IDLE task? On the other hand, if
> per_cpu(cpufreq_update_util_data, cpu) is NULL and only shallow idle states
> are enabled in idle_get_state(), should we skip SCHED_IDLE CPUs to achieve
> work conservation?

That's probably getting the most out of it. That being said, strictly
speaking the SCHED_IDLE CPU and the shallow-idle CPU may still share a power
and thermal budget, which may make preempting the sched_idle task on the
SCHED_IDLE CPU the better choice.
+cc Josh and Viresh, I forgot to cc you, sorry!

On 2025/8/20 21:53, Christian Loehle wrote:
> On 8/19/25 16:32, Chen, Yu C wrote:
>> [...]
>
> That's probably getting the most out of it. That being said, strictly
> speaking the SCHED_IDLE CPU and the shallow-idle CPU may still share a power
> and thermal budget, which may make preempting the sched_idle task on the
> SCHED_IDLE CPU the better choice.

I admit these sched_idle_cpu() considerations are good motivation, at least
for !sched_idle tasks, to preempt a sched_idle task instead of waking a real
idle cpu. But they oversimplify the complicated situations we have, such as
the cases above, which make the scheduler's work-conservation behaviour
really bad.

Thanks.
On Wed, Aug 20, 2025 at 6:53 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
>
> +cc Josh and Viresh, I forgot to cc you, sorry!

Thanks, missed this previously :)

> [...]
>>>>>> 2. sched_idle group: sched_idle_cpu() is simply not correct when a
>>>>>> sched_idle group is running. [...]

Thanks for bringing attention to this scenario, it's been a case I've worried
about but haven't had a good idea about fixing. Ideally we could
find_matching_se(), but we want to do these checks locklessly and quickly, so
that's out of the question. Agree on it being a hard problem.

One idea is that we at least handle the (what I think is fairly typical)
scenario of a root-level sched_idle group well (a root-level sched_idle group
is trivially idle with respect to anything else in the system that is not
also nested under a root-level sched_idle group). It would be fairly easy to
track a nr_idle_queued cfs_rq field, as well as cache on task enqueue whether
it nests under a sched_idle group.

Best,
Josh
On 2025/8/22 02:13, Josh Don wrote:
> [...]
>
> Thanks for bringing attention to this scenario, it's been a case I've worried
> about but haven't had a good idea about fixing. Ideally we could
> find_matching_se(), but we want to do these checks locklessly and quickly, so
> that's out of the question. Agree on it being a hard problem.

Yeah, right, we don't want to use find_matching_se() here.

> One idea is that we at least handle the (what I think is fairly typical)
> scenario of a root-level sched_idle group well (a root-level

You mean the /kubepods and /system groups in this case, right? Both of them
are not sched_idle here.

> sched_idle group is trivially idle with respect to anything else in the
> system that is not also nested under a root-level sched_idle group). It
> would be fairly easy to track a nr_idle_queued cfs_rq field, as well as
> cache on task enqueue whether it nests under a sched_idle group.

Ok, we can track whether a task nests under a sched_idle group. But tasks
from /system and /kubepods/burstable are not under any sched_idle group, and
there seems to be no way to distinguish them except using find_matching_se().

Thanks!
On Sun, Aug 24, 2025 at 11:58 PM Chengming Zhou <chengming.zhou@linux.dev> wrote:
> [...]
>
> You mean the /kubepods and /system groups in this case, right? Both of them
> are not sched_idle here.

Correct.

> [...]
>
> Ok, we can track whether a task nests under a sched_idle group. But tasks
> from /system and /kubepods/burstable are not under any sched_idle group, and
> there seems to be no way to distinguish them except using find_matching_se().

nr_idle_queued on the cfs_rq seems like the way to do it, but I agree it is a
tricky problem.
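
To make the idea discussed above concrete, here is a rough user-space model
of that bookkeeping, with hypothetical names (rq_model, nr_idle_queued,
under_root_idle); it only illustrates the kind of lockless check a cached
per-task flag plus a per-rq counter would enable, and is not kernel code:

#include <stdbool.h>
#include <stdio.h>

/*
 * Model: each runqueue counts how many of its queued tasks nest under a
 * root-level sched_idle group, and each task caches that property at
 * enqueue time.  A wakeup can then decide, without walking the cgroup
 * hierarchy, whether every queued task is "idle" with respect to a waker
 * that is not itself under a root-level idle group.
 */
struct rq_model {
	unsigned int nr_running;
	unsigned int nr_idle_queued;
};

struct task_model {
	bool under_root_idle;	/* cached when the task is enqueued */
};

static void enqueue(struct rq_model *rq, const struct task_model *p)
{
	rq->nr_running++;
	rq->nr_idle_queued += p->under_root_idle;
}

static bool idle_wrt_waker(const struct rq_model *rq, const struct task_model *waker)
{
	if (waker->under_root_idle)
		return false;	/* same class: no cheap-preemption benefit */
	return rq->nr_running && rq->nr_running == rq->nr_idle_queued;
}

int main(void)
{
	struct rq_model rq = { 0, 0 };
	struct task_model idle_task = { .under_root_idle = true };
	struct task_model normal_task = { .under_root_idle = false };

	enqueue(&rq, &idle_task);
	printf("idle wrt normal waker: %d\n", idle_wrt_waker(&rq, &normal_task));

	enqueue(&rq, &normal_task);
	printf("idle wrt normal waker: %d\n", idle_wrt_waker(&rq, &normal_task));
	return 0;
}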