SCHED_IDLE cgroups, that is the cgroups whose cpu.idle equals 1, only
mean something relative to their siblings due to the cgroup hierarchical
behavior. So a SCHED_IDLE cpu does NOT necessarily imply any of the
following:

 - It is a less loaded cpu (since the parent of its topmost idle
   ancestor could be a 'giant' entity with a large cpu.weight).

 - It can be expected to be preempted soon enough by a newly woken
   task (which actually depends on which of their ancestors share a
   common parent).

Since a less loaded cpu is probably better able to serve the newly
woken task, and the same reasoning also applies among SCHED_IDLE cpus
(a less loaded SCHED_IDLE cpu is likely to be preempted more easily and
quickly), let's not special-case SCHED_IDLE cpus, at least in the
slowpath of selecting the destination cpu.
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
---
kernel/sched/fair.c | 21 +++++++++------------
1 file changed, 9 insertions(+), 12 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 379764bd2795..769505cf519b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7446,7 +7446,7 @@ sched_balance_find_dst_group_cpu(struct sched_group *group, struct task_struct *
 	unsigned int min_exit_latency = UINT_MAX;
 	u64 latest_idle_timestamp = 0;
 	int least_loaded_cpu = this_cpu;
-	int shallowest_idle_cpu = -1, si_cpu = -1;
+	int shallowest_idle_cpu = -1;
 	int i;
 
 	/* Check if we have any choice: */
@@ -7481,12 +7481,13 @@ sched_balance_find_dst_group_cpu(struct sched_group *group, struct task_struct *
 				latest_idle_timestamp = rq->idle_stamp;
 				shallowest_idle_cpu = i;
 			}
-		} else if (shallowest_idle_cpu == -1 && si_cpu == -1) {
-			if (sched_idle_cpu(i)) {
-				si_cpu = i;
-				continue;
-			}
-
+		} else if (shallowest_idle_cpu == -1) {
+			/*
+			 * The SCHED_IDLE cpus do not necessarily mean anything
+			 * to @p due to the cgroup hierarchical behavior. But it
+			 * is almost certain that the wakee will get better served
+			 * if the cpu is less loaded.
+			 */
 			load = cpu_load(cpu_rq(i));
 			if (load < min_load) {
 				min_load = load;
@@ -7495,11 +7496,7 @@ sched_balance_find_dst_group_cpu(struct sched_group *group, struct task_struct *
 		}
 	}
 
-	if (shallowest_idle_cpu != -1)
-		return shallowest_idle_cpu;
-	if (si_cpu != -1)
-		return si_cpu;
-	return least_loaded_cpu;
+	return shallowest_idle_cpu != -1 ? shallowest_idle_cpu : least_loaded_cpu;
 }
 
 static inline int sched_balance_find_dst_cpu(struct sched_domain *sd, struct task_struct *p,
--
2.37.3
Thanks Abel,
> @@ -7481,12 +7481,13 @@ sched_balance_find_dst_group_cpu(struct sched_group *group, struct task_struct *
>  				latest_idle_timestamp = rq->idle_stamp;
>  				shallowest_idle_cpu = i;
>  			}
> -		} else if (shallowest_idle_cpu == -1 && si_cpu == -1) {
> -			if (sched_idle_cpu(i)) {
> -				si_cpu = i;
> -				continue;
> -			}
> -
> +		} else if (shallowest_idle_cpu == -1) {
> +			/*
> +			 * The SCHED_IDLE cpus do not necessarily mean anything
> +			 * to @p due to the cgroup hierarchical behavior. But it
> +			 * is almost certain that the wakee will get better served
> +			 * if the cpu is less loaded.
> +			 */
>  			load = cpu_load(cpu_rq(i));
>  			if (load < min_load) {
>  				min_load = load;
This seems reasonable due to the case you describe. However, I'm
wondering if you considered any heuristics here to help identify when
a target cpu should really be considered sched_idle from the
perspective of the incoming task. For example, in your cgroup
hierarchy, if you have a cpu currently only running tasks in your
besteffort container (and all cpus in the system are busy running
something), then that cpu should be considered as a good target for a
waking task in the "guaranteed" container, and not a good target for a
waking task in the "containerd" container. A simple way to do this
would be to do a find_matching_se on the incoming task and the current
on_cpu task. That does have a drawback of cgroup pointer chasing in a
hot wakeup path, so I don't love it.
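
To make that concrete, a rough sketch of such a helper might look like
the below (purely illustrative, not something I'm proposing here;
sched_idle_cpu_for() is a made-up name and the locking is hand-waved):

	/*
	 * Hypothetical sketch only: consider @cpu sched_idle *from @p's
	 * point of view* if the task currently running there belongs to
	 * a hierarchy that is idle relative to @p's at their first
	 * common level. It does the cgroup pointer chasing mentioned
	 * above and peeks at rq->curr under RCU without taking the rq
	 * lock, so it is just meant to show the shape of the heuristic.
	 */
	static int sched_idle_cpu_for(int cpu, struct task_struct *p)
	{
		struct rq *rq = cpu_rq(cpu);
		struct task_struct *curr = rcu_dereference(rq->curr);
		struct sched_entity *se = &p->se, *cse;

		if (curr == rq->idle || curr->sched_class != &fair_sched_class)
			return 0;

		cse = &curr->se;

		/* Walk both entities up until they share a cfs_rq... */
		find_matching_se(&se, &cse);

		/* ...and check whether only the running side is idle there. */
		return cse && se_is_idle(cse) && !se_is_idle(se);
	}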
In any case, I'm fine with the change as-is, mostly curious if you
gave any additional thought to the case mentioned above.
Best,
Josh
Hi Josh,
On 3/11/25 6:38 AM, Josh Don wrote:
> Thanks Abel,
>
>> @@ -7481,12 +7481,13 @@ sched_balance_find_dst_group_cpu(struct sched_group *group, struct task_struct *
>>  				latest_idle_timestamp = rq->idle_stamp;
>>  				shallowest_idle_cpu = i;
>>  			}
>> -		} else if (shallowest_idle_cpu == -1 && si_cpu == -1) {
>> -			if (sched_idle_cpu(i)) {
>> -				si_cpu = i;
>> -				continue;
>> -			}
>> -
>> +		} else if (shallowest_idle_cpu == -1) {
>> +			/*
>> +			 * The SCHED_IDLE cpus do not necessarily mean anything
>> +			 * to @p due to the cgroup hierarchical behavior. But it
>> +			 * is almost certain that the wakee will get better served
>> +			 * if the cpu is less loaded.
>> +			 */
>>  			load = cpu_load(cpu_rq(i));
>>  			if (load < min_load) {
>>  				min_load = load;
>
> This seems reasonable due to the case you describe. However, I'm
> wondering if you considered any heuristics here to help identify when
> a target cpu should really be considered sched_idle from the
> perspective of the incoming task. For example, in your cgroup
> hierarchy, if you have a cpu currently only running tasks in your
> besteffort container (and all cpus in the system are busy running
> something), then that cpu should be considered as a good target for a
> waking task in the "guaranteed" container, and not a good target for a
> waking task in the "containerd" container. A simple way to do this
Yes, it actually did cost me several days trying hard to figure out
whether there is a proper way to achieve what you're suggesting.
Solution A
----------
Mark all the hierarchically idle task_groups by assigning each a unique
prime, and define a tree walk that starts from @n and goes up to the
root, covering all the preemptible nodes.
	struct task_group {
		/*
		 * Set to a unique prime if at least 1 ancestor is idle,
		 * otherwise set to 1.
		 */
		u64	prime;
		u64	mul;
	};

	/* Called by sched_group_set_idle() */
	static void calculate_mul(struct task_group *tg)
	{
		struct task_group *curr, *iter;

		lockdep_assert_held(&shares_mutex);

		for (curr = tg; curr->parent; curr = curr->parent) {
			list_for_each_entry(iter, &curr->parent->children, siblings) {
				/* Can't preempt non-idle siblings */
				if (!iter->idle)
					continue;

				/*
				 * For each node in the subtree rooted at @iter do:
				 *	tg->mul *= node->prime;
				 */
				traverse_subtree(tg, iter);
			}
		}
	}
	int sched_idle_cpu(int cpu, struct task_struct *p)
	{
		/* rq::cached_mul caches rq->donor's tg::mul */
		return p->sched_task_group->mul % cpu_rq(cpu)->cached_mul == 0;
	}
With this we could even drop find_matching_se(), since it becomes quite
easy to know whether or not one sched_entity can preempt another. But
sadly task_group::mul will quickly overflow.
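
To put a number on "quickly": a u64 only holds the product of the first
15 primes, so the scheme caps out at around 15 marked task_groups. A
quick userspace illustration (plain C, nothing kernel-specific):

	/*
	 * Multiply successive primes into a u64 and report when the
	 * product overflows; it happens at the 16th prime (53).
	 */
	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		const uint64_t primes[] = { 2, 3, 5, 7, 11, 13, 17, 19, 23,
					    29, 31, 37, 41, 43, 47, 53 };
		uint64_t mul = 1;

		for (unsigned int i = 0; i < sizeof(primes) / sizeof(primes[0]); i++) {
			uint64_t next;

			if (__builtin_mul_overflow(mul, primes[i], &next)) {
				printf("overflow after %u primes\n", i);
				return 0;
			}
			mul = next;
		}
		printf("no overflow\n");
		return 0;
	}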
Solution B
----------
Since we do not require 100% accuracy, solution A can be adapted to use
a bloom filter.
	struct task_group {
		u64	key;	/* A random number used for hashing */
		u64	filter;
	};

	/* Called by sched_group_set_idle() */
	static void calculate_filter(struct task_group *tg)
	{
		struct task_group *curr, *iter;

		lockdep_assert_held(&shares_mutex);

		for (curr = tg; curr->parent; curr = curr->parent) {
			list_for_each_entry(iter, &curr->parent->children, siblings) {
				/* Can't preempt non-idle siblings */
				if (!iter->idle)
					continue;

				/*
				 * For each node in the subtree rooted at @iter do:
				 *	set_bit(bloom_hash(node->key), &tg->filter);
				 */
				traverse_subtree(tg, iter);
			}
		}
	}
	int sched_idle_cpu(int cpu, struct task_struct *p)
	{
		u64 filter = p->sched_task_group->filter;
		u64 cached = cpu_rq(cpu)->cached_filter;

		return (filter & cached) == cached;
	}
False positives are possible, but their probability can be reduced by
tuning the bloom filter setup.
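
For the hashing itself, simply folding the per-group random key onto
one of the 64 filter bits with hash_64() would probably be enough. The
helpers below are only an illustration of that idea (bloom_set() and
bloom_maybe_contains() are made-up names, and the single-hash setup is
just one possible choice):

	#include <linux/bits.h>
	#include <linux/hash.h>
	#include <linux/log2.h>

	#define BLOOM_FILTER_BITS	64

	/* Fold the per-group random key onto one of the filter bits. */
	static inline unsigned int bloom_hash(u64 key)
	{
		return hash_64(key, ilog2(BLOOM_FILTER_BITS));
	}

	static inline void bloom_set(u64 *filter, u64 key)
	{
		*filter |= BIT_ULL(bloom_hash(key));
	}

	/* May return false positives, never false negatives. */
	static inline bool bloom_maybe_contains(u64 filter, u64 key)
	{
		return filter & BIT_ULL(bloom_hash(key));
	}

With a single hash, the false-positive rate grows roughly linearly with
the number of marked groups, so a wider filter or more hash functions
would be the obvious tuning knobs.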
I chose the simplest way for now to work around the issue we encountered,
while I am still trying to find a way to get rid of sched_idle_cpu().
Thoughts?
> would be to do a find_matching_se on the incoming task and the current
> on_cpu task. That does have a drawback of cgroup pointer chasing in a
> hot wakeup path, so I don't love it.
Neither do I. But if we go through find_matching_se(), I think we need
this rq locked.
>
> In any case, I'm fine with the change as-is, mostly curious if you
> gave any additional thought to the case mentioned above.
Thanks!
Abel
On Tue, Mar 11, 2025 at 9:43 AM Abel Wu <wuyun.abel@bytedance.com> wrote:
[snip]
> False positives are possible, but their probability can be reduced by
> tuning the bloom filter setup.

An interesting approach, thanks for sharing. Not that it matters (given
that we're not pursuing this now), but just to call out that this has
poor scaling with large cgroup hierarchies and updates to cgroup idle
state, so in an actual implementation it would be ideal to do the
updates asynchronously from sched_group_set_idle() (i.e. via a kworker).

We could also greatly simplify this if we assume certain contrived
setups, for example if we primarily care about sched_idle cpu preemption
against only root-level sched_idle cgroups (as everything inside a
root-level sched_idle cgroup is trivially preemptible by a task from
another hierarchy). But obviously your cgroup setup doesn't fall under
this category, so it is not very useful here.

> I chose the simplest way for now to work around the issue we encountered,
> while I am still trying to find a way to get rid of sched_idle_cpu().
> Thoughts?

That sounds reasonable to me.

Reviewed-by: Josh Don <joshdon@google.com>
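
For illustration, the deferred update could be as simple as the sketch
below (recompute_all_filters() is a placeholder for redoing the
calculate_filter() walk above; the names are made up):

	#include <linux/workqueue.h>

	static void idle_filter_update_fn(struct work_struct *work);
	static DECLARE_WORK(idle_filter_update_work, idle_filter_update_fn);

	static void idle_filter_update_fn(struct work_struct *work)
	{
		mutex_lock(&shares_mutex);
		/* Placeholder: redo the calculate_filter() walk for all groups. */
		recompute_all_filters();
		mutex_unlock(&shares_mutex);
	}

	/*
	 * Called from sched_group_set_idle() after flipping tg->idle;
	 * repeated calls before the work runs are naturally coalesced
	 * into a single update.
	 */
	static void queue_idle_filter_update(void)
	{
		schedule_work(&idle_filter_update_work);
	}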
On 3/13/25 3:24 AM, Josh Don wrote:
> On Tue, Mar 11, 2025 at 9:43 AM Abel Wu <wuyun.abel@bytedance.com> wrote:
> [snip]
>> False positives are possible, but their probability can be reduced by
>> tuning the bloom filter setup.
>
> An interesting approach, thanks for sharing. Not that it matters
> (given that we're not pursuing this now), but just to call out that
> this has poor scaling with large cgroup hierarchies and updates to
> cgroup idle state, so in an actual implementation it would be ideal to
> do the updates asynchronously from sched_group_set_idle() (i.e. via a
> kworker).

Good idea. An async update makes sense since sched_idle_cpu() is not
required to be 100% correct, and with this we can also coalesce several
updates into one.

>
> We could also greatly simplify this if we assume certain contrived
> setups, for example if we primarily care about sched_idle cpu
> preemption against only root-level sched_idle cgroups (as everything
> inside a root-level sched_idle cgroup is trivially preemptible by a
> task from another hierarchy). But obviously your cgroup setup doesn't
> fall under this category, so it is not very useful here.
>
>> I chose the simplest way for now to work around the issue we encountered,
>> while I am still trying to find a way to get rid of sched_idle_cpu().
>> Thoughts?
>
> That sounds reasonable to me.
>
> Reviewed-by: Josh Don <joshdon@google.com>

Thanks!