The efficiency gains from co-locating communicating tasks within the same
LLC are well-established. However, in multi-LLC NUMA systems, the load
balancer unintentionally sabotages this optimization.
Observe this pattern: On a NUMA node with 4 LLCs, the iperf3 client first
wakes the server within its initial LLC (e.g., LLC_0). The load balancer
subsequently migrates the client to a different LLC (e.g., LLC_1). When
the client next wakes the server, it now steers the server’s placement
to LLC_1 (the client’s new location). The server then migrates to LLC_1,
but the load balancer may reallocate the client to another
LLC (e.g., LLC_2) later. This cycle repeats, causing both tasks to
perpetually chase each other across all four LLCs — a sustained
cross-LLC ping-pong within the NUMA node.
Our solution: Permit controlled load imbalance between LLCs on the same
NUMA node, prioritizing communication affinity over strict balance.
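As a rough illustration, the tolerance applied by this change scales with
the LLC size as follows (a standalone sketch of the arithmetic in the hunk
below; llc_imb_tolerance() is only an illustrative name, and child_weight
is the number of CPUs in one LLC):

    /* Illustrative only: mirrors the tolerance computed in the hunk below. */
    static int llc_imb_tolerance(int child_weight)
    {
            int min = child_weight >= 4 ? 2 : 1;
            int imb = child_weight >> 2;    /* a quarter of the LLC */

            return imb > min ? imb : min;
    }

    /*
     * child_weight =  2 -> tolerance 1
     * child_weight =  4 -> tolerance 2
     * child_weight =  8 -> tolerance 2
     * child_weight = 16 -> tolerance 4
     */

If the imbalance computed by the regular path does not exceed this
tolerance, it is cleared and no task is pulled to another LLC.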
Impact: In a virtual machine with one socket, multiple NUMA nodes (each
with 4 LLCs), unpatched systems suffered 3,000+ LLC migrations in 200
seconds as tasks cycled through all four LLCs. With the patch, migrations
stabilize at ≤10 instances, largely suppressing the NUMA-local LLC
thrashing.
Signed-off-by: Jianyong Wu <wujianyong@hygon.cn>
---
kernel/sched/fair.c | 16 ++++++++++++++++
1 file changed, 16 insertions(+)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0fb9bf995a47..749210e6316b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11203,6 +11203,22 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
}
#endif
+ /* Allow imbalance between LLCs within a single NUMA node */
+ if (env->sd->child && env->sd->child->flags & SD_SHARE_LLC && env->sd->parent
+ && env->sd->parent->flags & SD_NUMA) {
+ int child_weight = env->sd->child->span_weight;
+ int llc_nr = env->sd->span_weight / child_weight;
+ int imb_nr, min;
+
+ if (llc_nr > 1) {
+ /* Let the imbalance not be greater than half of child_weight */
+ min = child_weight >= 4 ? 2 : 1;
+ imb_nr = max_t(int, min, child_weight >> 2);
+ if (imb_nr >= env->imbalance)
+ env->imbalance = 0;
+ }
+ }
+
/* Number of tasks to move to restore balance */
env->imbalance >>= 1;
--
2.43.0
On 5/28/2025 12:39 PM, Jianyong Wu wrote:
> The efficiency gains from co-locating communicating tasks within the same
> LLC are well-established. However, in multi-LLC NUMA systems, the load
> balancer unintentionally sabotages this optimization.
>
> Observe this pattern: On a NUMA node with 4 LLCs, the iperf3 client first
> wakes the server within its initial LLC (e.g., LLC_0). The load balancer
> subsequently migrates the client to a different LLC (e.g., LLC_1). When
> the client next wakes the server, it now steers the server’s placement
> to LLC_1 (the client’s new location). The server then migrates to LLC_1,
> but the load balancer may reallocate the client to another
> LLC (e.g., LLC_2) later. This cycle repeats, causing both tasks to
> perpetually chase each other across all four LLCs — a sustained
> cross-LLC ping-pong within the NUMA node.
Migration only happens if the CPU is overloaded, right? I've only seen
this happen when noise such as a kworker comes in. What exactly is
causing these migrations in your case, and is it actually that bad
for iperf?
>
> Our solution: Permit controlled load imbalance between LLCs on the same
> NUMA node, prioritizing communication affinity over strict balance.
>
> Impact: In a virtual machine with one socket, multiple NUMA nodes (each
> with 4 LLCs), unpatched systems suffered 3,000+ LLC migrations in 200
> seconds as tasks cycled through all four LLCs. With the patch, migrations
> stabilize at ≤10 instances, largely suppressing the NUMA-local LLC
> thrashing.
Is there any improvement in iperf numbers with these changes?
>
> Signed-off-by: Jianyong Wu <wujianyong@hygon.cn>
> ---
> kernel/sched/fair.c | 16 ++++++++++++++++
> 1 file changed, 16 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0fb9bf995a47..749210e6316b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11203,6 +11203,22 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> }
> #endif
>
> + /* Allow imbalance between LLCs within a single NUMA node */
> + if (env->sd->child && env->sd->child->flags & SD_SHARE_LLC && env->sd->parent
> + && env->sd->parent->flags & SD_NUMA) {
This does not imply multiple LLCs in the package. SD_SHARE_LLC is
SDF_SHARED_CHILD and will be set from the SMT domain onwards. This condition
will be true on Intel with SNC enabled despite there not being multiple
LLCs, and llc_nr will be the number of cores there.
Perhaps multiple LLCs can be detected using:
(sd->child->flags ^ sd->flags) & SD_SHARE_LLC
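For example, something along these lines (untested, and
sd_spans_multiple_llcs() is just an illustrative name):

	/*
	 * True when @sd is the first domain above the LLC, i.e. @sd spans
	 * multiple LLCs: its child still has SD_SHARE_LLC while @sd itself
	 * does not (SD_SHARE_LLC is SDF_SHARED_CHILD, so the reverse cannot
	 * happen).
	 */
	static inline bool sd_spans_multiple_llcs(struct sched_domain *sd)
	{
		return sd->child &&
		       ((sd->child->flags ^ sd->flags) & SD_SHARE_LLC);
	}

The condition in the patch could then be written as
sd_spans_multiple_llcs(env->sd) && env->sd->parent &&
(env->sd->parent->flags & SD_NUMA), which stays false in the SNC case
above.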
> + int child_weight = env->sd->child->span_weight;
> + int llc_nr = env->sd->span_weight / child_weight;
> + int imb_nr, min;
> +
> + if (llc_nr > 1) {
> + /* Let the imbalance not be greater than half of child_weight */
> + min = child_weight >= 4 ? 2 : 1;
> + imb_nr = max_t(int, min, child_weight >> 2);
Isn't this just max_t(int, child_weight >> 2, 1)?
> + if (imb_nr >= env->imbalance)
> + env->imbalance = 0;
At this point, we are trying to even out the number of idle CPUs between
the destination and the busiest LLC. sched_balance_find_src_rq() will return
NULL if it doesn't find an overloaded rq. Is waiting behind a task
more beneficial than migrating to an idler LLC?
> + }
> + }
> +
> /* Number of tasks to move to restore balance */
> env->imbalance >>= 1;
>
--
Thanks and Regards,
Prateek