The efficiency gains from co-locating communicating tasks within the same
LLC are well-established. However, in multi-LLC NUMA systems, the load
balancer unintentionally sabotages this optimization.
Observe this pattern: On a NUMA node with 4 LLCs, the iperf3 client first
wakes the server within its initial LLC (e.g., LLC_0). The load balancer
subsequently migrates the client to a different LLC (e.g., LLC_1). When
the client next wakes the server, it now steers the server’s placement
to LLC_1 (the client’s new location). The server then migrates to LLC_1,
but the load balancer may reallocate the client to another
LLC (e.g., LLC_2) later. This cycle repeats, causing both tasks to
perpetually chase each other across all four LLCs — a sustained
cross-LLC ping-pong within the NUMA node.
Our solution: Permit controlled load imbalance between LLCs on the same
NUMA node, prioritizing communication affinity over strict balance.
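As a rough illustration, the tolerance applied by this change scales with
the LLC size as follows (a standalone sketch of the arithmetic in the hunk
below; llc_imb_tolerance() is only an illustrative name, and child_weight
is the number of CPUs in one LLC):

    /* Illustrative only: mirrors the tolerance computed in the hunk below. */
    static int llc_imb_tolerance(int child_weight)
    {
            int min = child_weight >= 4 ? 2 : 1;
            int imb = child_weight >> 2;    /* a quarter of the LLC */

            return imb > min ? imb : min;
    }

    /*
     * child_weight =  2 -> tolerance 1
     * child_weight =  4 -> tolerance 2
     * child_weight =  8 -> tolerance 2
     * child_weight = 16 -> tolerance 4
     */

If the imbalance computed by the regular path does not exceed this
tolerance, it is cleared and no task is pulled to another LLC.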
Impact: In a virtual machine with one socket, multiple NUMA nodes (each
with 4 LLCs), unpatched systems suffered 3,000+ LLC migrations in 200
seconds as tasks cycled through all four LLCs. With the patch, migrations
stabilize at ≤10 instances, largely suppressing the NUMA-local LLC
thrashing.
Signed-off-by: Jianyong Wu <wujianyong@hygon.cn>
---
kernel/sched/fair.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0fb9bf995a47..749210e6316b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11203,6 +11203,22 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
}
#endif
+ /* Allow imbalance between LLCs within a single NUMA node */
+ if (env->sd->child && env->sd->child->flags & SD_SHARE_LLC && env->sd->parent
+ && env->sd->parent->flags & SD_NUMA) {
+ int child_weight = env->sd->child->span_weight;
+ int llc_nr = env->sd->span_weight / child_weight;
+ int imb_nr, min;
+
+ if (llc_nr > 1) {
+ /* Let the imbalance not be greater than half of child_weight */
+ min = child_weight >= 4 ? 2 : 1;
+ imb_nr = max_t(int, min, child_weight >> 2);
+ if (imb_nr >= env->imbalance)
+ env->imbalance = 0;
+ }
+ }
+
/* Number of tasks to move to restore balance */
env->imbalance >>= 1;
--
2.43.0
On 5/28/2025 12:39 PM, Jianyong Wu wrote:
> The efficiency gains from co-locating communicating tasks within the same
> LLC are well-established. However, in multi-LLC NUMA systems, the load
> balancer unintentionally sabotages this optimization.
>
> Observe this pattern: On a NUMA node with 4 LLCs, the iperf3 client first
> wakes the server within its initial LLC (e.g., LLC_0). The load balancer
> subsequently migrates the client to a different LLC (e.g., LLC_1). When
> the client next wakes the server, it now steers the server’s placement
> to LLC_1 (the client’s new location). The server then migrates to LLC_1,
> but the load balancer may reallocate the client to another
> LLC (e.g., LLC_2) later. This cycle repeats, causing both tasks to
> perpetually chase each other across all four LLCs — a sustained
> cross-LLC ping-pong within the NUMA node.
Migration only happens if the CPU is overloaded, right? I've only seen
this happen when noise such as a kworker comes in. What exactly is
causing these migrations in your case, and is it actually that bad
for iperf?
>
> Our solution: Permit controlled load imbalance between LLCs on the same
> NUMA node, prioritizing communication affinity over strict balance.
>
> Impact: In a virtual machine with one socket, multiple NUMA nodes (each
> with 4 LLCs), unpatched systems suffered 3,000+ LLC migrations in 200
> seconds as tasks cycled through all four LLCs. With the patch, migrations
> stabilize at ≤10 instances, largely suppressing the NUMA-local LLC
> thrashing.
Is there any improvement in iperf numbers with these changes?
>
> Signed-off-by: Jianyong Wu <wujianyong@hygon.cn>
> ---
> kernel/sched/fair.c | 16 ++++++++++++++++
> 1 file changed, 16 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0fb9bf995a47..749210e6316b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11203,6 +11203,22 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> }
> #endif
>
> + /* Allow imbalance between LLCs within a single NUMA node */
> + if (env->sd->child && env->sd->child->flags & SD_SHARE_LLC && env->sd->parent
> + && env->sd->parent->flags & SD_NUMA) {
This does not imply multiple LLCs in the package. SD_SHARE_LLC is
SDF_SHARED_CHILD and will be set from the SMT domain onwards. This condition
will be true on Intel with SNC enabled despite there not being multiple
LLCs, and llc_nr will be the number of cores there.
Perhaps multiple LLCs can be detected using:
(sd->child->flags ^ sd->flags) & SD_SHARE_LLC
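For example, something along these lines (untested, and
sd_spans_multiple_llcs() is just an illustrative name):

	/*
	 * True when @sd is the first domain above the LLC, i.e. @sd spans
	 * multiple LLCs: its child still has SD_SHARE_LLC while @sd itself
	 * does not (SD_SHARE_LLC is SDF_SHARED_CHILD, so the reverse cannot
	 * happen).
	 */
	static inline bool sd_spans_multiple_llcs(struct sched_domain *sd)
	{
		return sd->child &&
		       ((sd->child->flags ^ sd->flags) & SD_SHARE_LLC);
	}

The condition in the patch could then be written as
sd_spans_multiple_llcs(env->sd) && env->sd->parent &&
(env->sd->parent->flags & SD_NUMA), which stays false in the SNC case
above.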
> + int child_weight = env->sd->child->span_weight;
> + int llc_nr = env->sd->span_weight / child_weight;
> + int imb_nr, min;
> +
> + if (llc_nr > 1) {
> + /* Let the imbalance not be greater than half of child_weight */
> + min = child_weight >= 4 ? 2 : 1;
> + imb_nr = max_t(int, min, child_weight >> 2);
Isn't this just max_t(int, child_weight >> 2, 1)?
> + if (imb_nr >= env->imbalance)
> + env->imbalance = 0;
At this point, we are trying to even out the number of idle CPUs between
the destination and the busiest LLC. sched_balance_find_src_rq() will return
NULL if it doesn't find an overloaded rq. Is waiting behind a task
more beneficial than migrating to an idler LLC?
> + }
> + }
> +
> /* Number of tasks to move to restore balance */
> env->imbalance >>= 1;
>
--
Thanks and Regards,
Prateek