[PATCH RESEND] sched/fair: Only update stats for allowed CPUs when looking for dst group

Posted by Adam Li 2 months, 1 week ago
Load imbalance is observed when the workload frequently forks new threads.
Due to CPU affinity, the workload can run on CPUs 0-7 in the first
group, but only on CPUs 8-11 in the second group. CPUs 12-15 are always idle.

{ 0 1 2 3 4 5 6 7 } {8 9 10 11 12 13 14 15}
  * * * * * * * *    * * *  *

When looking for a dst group for newly forked threads, update_sg_wakeup_stats()
often reports that the second group has more idle CPUs than the first group.
The scheduler therefore considers the second group less busy and selects the
least busy CPU among CPUs 8-11. As a result, CPUs 8-11 can become crowded with
newly forked threads while CPUs 0-7 sit idle.

A task may not be able to use all the CPUs in a sched group due to CPU
affinity. Only update sched group statistics for the allowed CPUs.

Signed-off-by: Adam Li <adamli@os.amperecomputing.com>
---
Resending this patch from the patchset:
https://lore.kernel.org/lkml/20250717062036.432243-2-adamli@os.amperecomputing.com/

Only the commit message changed. The single patch may be easier to review.
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bc0b7ce8a65d..d5ec15050ebc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10671,7 +10671,7 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,
 	if (sd->flags & SD_ASYM_CPUCAPACITY)
 		sgs->group_misfit_task_load = 1;
 
-	for_each_cpu(i, sched_group_span(group)) {
+	for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
 		struct rq *rq = cpu_rq(i);
 		unsigned int local;
 
-- 
2.34.1
Re: [PATCH RESEND] sched/fair: Only update stats for allowed CPUs when looking for dst group
Posted by Peter Zijlstra 2 months ago
On Sat, Oct 11, 2025 at 06:43:22AM +0000, Adam Li wrote:
> Load imbalance is observed when the workload frequently forks new threads.
> Due to CPU affinity, the workload can run on CPUs 0-7 in the first
> group, but only on CPUs 8-11 in the second group. CPUs 12-15 are always idle.
> 
> { 0 1 2 3 4 5 6 7 } {8 9 10 11 12 13 14 15}
>   * * * * * * * *    * * *  *
> 
> When looking for a dst group for newly forked threads, update_sg_wakeup_stats()
> often reports that the second group has more idle CPUs than the first group.
> The scheduler therefore considers the second group less busy and selects the
> least busy CPU among CPUs 8-11. As a result, CPUs 8-11 can become crowded with
> newly forked threads while CPUs 0-7 sit idle.
> 
> A task may not be able to use all the CPUs in a sched group due to CPU
> affinity. Only update sched group statistics for the allowed CPUs.
> 
> Signed-off-by: Adam Li <adamli@os.amperecomputing.com>
> ---
> Resending this patch from the patchset:
> https://lore.kernel.org/lkml/20250717062036.432243-2-adamli@os.amperecomputing.com/
> 

Right, let's start with this then ;-)

No need to do the cpumask_and() thing, that's just more changes vs
update_sg_lb_stats().
[tip: sched/core] sched/fair: Only update stats for allowed CPUs when looking for dst group
Posted by tip-bot2 for Adam Li 2 months ago
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     82d6e01a0699800efd8b048eb584c907ccb47b7a
Gitweb:        https://git.kernel.org/tip/82d6e01a0699800efd8b048eb584c907ccb47b7a
Author:        Adam Li <adamli@os.amperecomputing.com>
AuthorDate:    Sat, 11 Oct 2025 06:43:22 +00:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 16 Oct 2025 11:13:50 +02:00

sched/fair: Only update stats for allowed CPUs when looking for dst group

Load imbalance is observed when the workload frequently forks new threads.
Due to CPU affinity, the workload can run on CPUs 0-7 in the first
group, but only on CPUs 8-11 in the second group. CPUs 12-15 are always idle.

{ 0 1 2 3 4 5 6 7 } {8 9 10 11 12 13 14 15}
  * * * * * * * *    * * *  *

When looking for a dst group for newly forked threads, update_sg_wakeup_stats()
often reports that the second group has more idle CPUs than the first group.
The scheduler therefore considers the second group less busy and selects the
least busy CPU among CPUs 8-11. As a result, CPUs 8-11 can become crowded with
newly forked threads while CPUs 0-7 sit idle.

A task may not be able to use all the CPUs in a sched group due to CPU
affinity. Only update sched group statistics for the allowed CPUs.

Signed-off-by: Adam Li <adamli@os.amperecomputing.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 00f9d6c..ac881df 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10683,7 +10683,7 @@ static inline void update_sg_wakeup_stats(struct sched_domain *sd,
 	if (sd->flags & SD_ASYM_CPUCAPACITY)
 		sgs->group_misfit_task_load = 1;
 
-	for_each_cpu(i, sched_group_span(group)) {
+	for_each_cpu_and(i, sched_group_span(group), p->cpus_ptr) {
 		struct rq *rq = cpu_rq(i);
 		unsigned int local;