Repost comments:
There have been past discussions about avoiding serialization in load
balancing, but no objections were raised to this patch itself during
its last posting:
https://lore.kernel.org/lkml/20250416035823.1846307-1-tim.c.chen@linux.intel.com/
Vincent and Chen Yu have already provided their Reviewed-by tags.
We recently encountered this issue again on a 2-socket, 240-core
Clearwater Forest server running SPECjbb. In this case, 14% of CPU
cycles were wasted on unnecessary acquisitions of
sched_balance_running. This reinforces the need for the change, and we
hope it can be merged.
Tim
---
During load balancing, balancing at the LLC level and above must be
serialized. The scheduler currently checks the atomic
`sched_balance_running` flag before verifying whether a balance is
actually due. This causes high contention, as multiple CPUs may attempt
to acquire the flag concurrently.
On a 2-socket Granite Rapids system with sub-NUMA clustering enabled
and running OLTP workloads, 7.6% of CPU cycles were spent on cmpxchg
operations for `sched_balance_running`. In most cases, the attempt
aborts immediately after acquisition because the load balance time is
not yet due.
Fix this by checking whether a balance is due *before* trying to
acquire `sched_balance_running`. This avoids many wasted acquisitions
and reduces the cmpxchg overhead in `sched_balance_domain()` from 7.6%
to 0.05%. As a result, OLTP throughput improves by 11%.
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
---
kernel/sched/fair.c | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8ce56a8d507f..bedd785c4a39 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12126,13 +12126,13 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
interval = get_sd_balance_interval(sd, busy);
- need_serialize = sd->flags & SD_SERIALIZE;
- if (need_serialize) {
- if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
- goto out;
- }
-
if (time_after_eq(jiffies, sd->last_balance + interval)) {
+ need_serialize = sd->flags & SD_SERIALIZE;
+ if (need_serialize) {
+ if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
+ goto out;
+ }
+
if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
/*
* The LBF_DST_PINNED logic could have changed
@@ -12144,9 +12144,9 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
}
sd->last_balance = jiffies;
interval = get_sd_balance_interval(sd, busy);
+ if (need_serialize)
+ atomic_set_release(&sched_balance_running, 0);
}
- if (need_serialize)
- atomic_set_release(&sched_balance_running, 0);
out:
if (time_after(next_balance, sd->last_balance + interval)) {
next_balance = sd->last_balance + interval;
--
2.32.0
On Thu, Oct 02, 2025 at 04:00:12PM -0700, Tim Chen wrote:
> During load balancing, balancing at the LLC level and above must be
> serialized.
I would argue the wording here, there is no *must*, they *are*. Per the
current rules SD_NUMA and up get SD_SERIALIZE.
This is a *very* old thing, done by Christoph Lameter back when he was
at SGI. I'm not sure this default is still valid or not. Someone would
have to investigate. An alternative would be moving it into
node_reclaim_distance or somesuch.
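For reference, the rule in question lives in sd_init() in kernel/sched/topology.c.
Roughly, and paraphrased rather than quoted (not the exact upstream code), the
current behaviour and the alternative look like this:

	/* Paraphrased sketch of sd_init() for NUMA levels: today every
	 * SD_NUMA domain gets SD_SERIALIZE, regardless of distance. */
	if (sd->flags & SD_NUMA) {
		sd->flags |= SD_SERIALIZE;	/* current rule */

		if (sched_domains_numa_distance[tl->numa_level] >
		    node_reclaim_distance) {
			sd->flags &= ~(SD_BALANCE_EXEC |
				       SD_BALANCE_FORK |
				       SD_WAKE_AFFINE);
			/*
			 * The alternative mentioned above would set
			 * SD_SERIALIZE only here, i.e. only for levels
			 * beyond node_reclaim_distance.
			 */
		}
	}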
> The scheduler currently checks the atomic
> `sched_balance_running` flag before verifying whether a balance is
> actually due. This causes high contention, as multiple CPUs may attempt
> to acquire the flag concurrently.
Right.
> On a 2-socket Granite Rapids system with sub-NUMA clustering enabled
> and running OLTP workloads, 7.6% of CPU cycles were spent on cmpxchg
> operations for `sched_balance_running`. In most cases, the attempt
> aborts immediately after acquisition because the load balance time is
> not yet due.
So I'm not sure I understand the situation, @continue_balancing should
limit this concurrency to however many groups are on this domain -- your
granite thing with SNC on would have something like 6 groups?
> Fix this by checking whether a balance is due *before* trying to
> acquire `sched_balance_running`. This avoids many wasted acquisitions
> and reduces the cmpxchg overhead in `sched_balance_domain()` from 7.6%
> to 0.05%. As a result, OLTP throughput improves by 11%.
Yeah, I see no harm flipping this, but the Changelog needs help.
> Reviewed-by: Chen Yu <yu.c.chen@intel.com>
> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---
> kernel/sched/fair.c | 16 ++++++++--------
> 1 file changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8ce56a8d507f..bedd785c4a39 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12126,13 +12126,13 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
>
> interval = get_sd_balance_interval(sd, busy);
>
> - need_serialize = sd->flags & SD_SERIALIZE;
> - if (need_serialize) {
> - if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
> - goto out;
> - }
> -
> if (time_after_eq(jiffies, sd->last_balance + interval)) {
> + need_serialize = sd->flags & SD_SERIALIZE;
> + if (need_serialize) {
> + if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
> + goto out;
> + }
> +
> if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
> /*
> * The LBF_DST_PINNED logic could have changed
> @@ -12144,9 +12144,9 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
> }
> sd->last_balance = jiffies;
> interval = get_sd_balance_interval(sd, busy);
> + if (need_serialize)
> + atomic_set_release(&sched_balance_running, 0);
> }
> - if (need_serialize)
> - atomic_set_release(&sched_balance_running, 0);
> out:
> if (time_after(next_balance, sd->last_balance + interval)) {
> next_balance = sd->last_balance + interval;
Instead of making the indenting worse, could we make it better?
---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e743d9d0576c..6318834ff42a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12215,6 +12215,8 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
}
interval = get_sd_balance_interval(sd, busy);
+ if (time_before(jiffies, sd->last_balance + interval))
+ goto out;
need_serialize = sd->flags & SD_SERIALIZE;
if (need_serialize) {
@@ -12222,19 +12224,18 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
goto out;
}
- if (time_after_eq(jiffies, sd->last_balance + interval)) {
- if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
- /*
- * The LBF_DST_PINNED logic could have changed
- * env->dst_cpu, so we can't know our idle
- * state even if we migrated tasks. Update it.
- */
- idle = idle_cpu(cpu);
- busy = !idle && !sched_idle_cpu(cpu);
- }
- sd->last_balance = jiffies;
- interval = get_sd_balance_interval(sd, busy);
+ if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
+ /*
+ * The LBF_DST_PINNED logic could have changed
+ * env->dst_cpu, so we can't know our idle
+ * state even if we migrated tasks. Update it.
+ */
+ idle = idle_cpu(cpu);
+ busy = !idle && !sched_idle_cpu(cpu);
}
+ sd->last_balance = jiffies;
+ interval = get_sd_balance_interval(sd, busy);
+
if (need_serialize)
atomic_set_release(&sched_balance_running, 0);
out:
On Mon, 2025-10-13 at 16:26 +0200, Peter Zijlstra wrote:
> On Thu, Oct 02, 2025 at 04:00:12PM -0700, Tim Chen wrote:
>
> > During load balancing, balancing at the LLC level and above must be
> > serialized.
>
> I would argue the wording here, there is no *must*, they *are*. Per the
> current rules SD_NUMA and up get SD_SERIALIZE.
>
> This is a *very* old thing, done by Christoph Lameter back when he was
> at SGI. I'm not sure this default is still valid or not. Someone would
> have to investigate. An alternative would be moving it into
> node_reclaim_distance or somesuch.
>
> > The scheduler currently checks the atomic
> > `sched_balance_running` flag before verifying whether a balance is
> > actually due. This causes high contention, as multiple CPUs may attempt
> > to acquire the flag concurrently.
>
> Right.
>
> > On a 2-socket Granite Rapids system with sub-NUMA clustering enabled
> > and running OLTP workloads, 7.6% of CPU cycles were spent on cmpxchg
> > operations for `sched_balance_running`. In most cases, the attempt
> > aborts immediately after acquisition because the load balance time is
> > not yet due.
>
> So I'm not sure I understand the situation, @continue_balancing should
> limit this concurrency to however many groups are on this domain -- your
> granite thing with SNC on would have something like 6 groups?
That's a good point. But I think the contention is worse than
6 CPUs.
The hierarchy would be
SMT
NUMA-level1
NUMA-level2
NUMA-level3
NUMA-level4
There would be multiple CPUs that are first in their SMT group,
with continue_balancing=1, going up the hierarchy and
attempting the cmpxchg at the first NUMA domain level,
before calling should_we_balance() and finding that they are
not the first in the NUMA domain, setting continue_balancing=0
and aborting. Those CPUs are in the same L3.
But at the same time, there could be CPUs in other sockets
doing cmpxchg on sched_balance_running.
In our experiment, we had not applied the
NUMA hierarchy fix [1]. The system can have up to 4 NUMA domains,
with 3 NUMA levels spanning different sockets.
That means there can be up to three concurrent cmpxchg attempts
per level from CPUs in separate sockets, which makes
things worse.
[1] https://lore.kernel.org/lkml/cover.1759515405.git.tim.c.chen@linux.intel.com/
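To make the ordering concrete, here is the pre-patch sched_balance_domains()
path, condensed and paraphrased from the diff above (a sketch, not the exact
code): the global cmpxchg is reached before either of the checks that usually
abort the pass.

	for_each_domain(cpu, sd) {
		...
		interval = get_sd_balance_interval(sd, busy);

		if (sd->flags & SD_SERIALIZE) {
			/* Every CPU reaching a SD_SERIALIZE level bounces
			 * this global cacheline, even if it has nothing
			 * to do. */
			if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
				goto out;
		}

		if (time_after_eq(jiffies, sd->last_balance + interval)) {
			/* should_we_balance() runs in here and may still
			 * decide this CPU is not the designated balancer. */
			sched_balance_rq(cpu, rq, sd, idle, &continue_balancing);
			...
		}
		...
	}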
>
> > Fix this by checking whether a balance is due *before* trying to
> > acquire `sched_balance_running`. This avoids many wasted acquisitions
> > and reduces the cmpxchg overhead in `sched_balance_domain()` from 7.6%
> > to 0.05%. As a result, OLTP throughput improves by 11%.
>
> Yeah, I see no harm flipping this, but the Changelog needs help.
>
> > Reviewed-by: Chen Yu <yu.c.chen@intel.com>
> > Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
> > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> > ---
> > kernel/sched/fair.c | 16 ++++++++--------
> > 1 file changed, 8 insertions(+), 8 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 8ce56a8d507f..bedd785c4a39 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -12126,13 +12126,13 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
> >
> > interval = get_sd_balance_interval(sd, busy);
> >
> > - need_serialize = sd->flags & SD_SERIALIZE;
> > - if (need_serialize) {
> > - if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
> > - goto out;
> > - }
> > -
> > if (time_after_eq(jiffies, sd->last_balance + interval)) {
> > + need_serialize = sd->flags & SD_SERIALIZE;
> > + if (need_serialize) {
> > + if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
> > + goto out;
> > + }
> > +
> > if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
> > /*
> > * The LBF_DST_PINNED logic could have changed
> > @@ -12144,9 +12144,9 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
> > }
> > sd->last_balance = jiffies;
> > interval = get_sd_balance_interval(sd, busy);
> > + if (need_serialize)
> > + atomic_set_release(&sched_balance_running, 0);
> > }
> > - if (need_serialize)
> > - atomic_set_release(&sched_balance_running, 0);
> > out:
> > if (time_after(next_balance, sd->last_balance + interval)) {
> > next_balance = sd->last_balance + interval;
>
> Instead of making the indenting worse, could we make it better?
Yes, this is much better. Thanks.
Tim
>
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e743d9d0576c..6318834ff42a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12215,6 +12215,8 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
> }
>
> interval = get_sd_balance_interval(sd, busy);
> + if (time_before(jiffies, sd->last_balance + interval))
> + goto out;
>
> need_serialize = sd->flags & SD_SERIALIZE;
> if (need_serialize) {
> @@ -12222,19 +12224,18 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
> goto out;
> }
>
> - if (time_after_eq(jiffies, sd->last_balance + interval)) {
> - if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
> - /*
> - * The LBF_DST_PINNED logic could have changed
> - * env->dst_cpu, so we can't know our idle
> - * state even if we migrated tasks. Update it.
> - */
> - idle = idle_cpu(cpu);
> - busy = !idle && !sched_idle_cpu(cpu);
> - }
> - sd->last_balance = jiffies;
> - interval = get_sd_balance_interval(sd, busy);
> + if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
> + /*
> + * The LBF_DST_PINNED logic could have changed
> + * env->dst_cpu, so we can't know our idle
> + * state even if we migrated tasks. Update it.
> + */
> + idle = idle_cpu(cpu);
> + busy = !idle && !sched_idle_cpu(cpu);
> }
> + sd->last_balance = jiffies;
> + interval = get_sd_balance_interval(sd, busy);
> +
> if (need_serialize)
> atomic_set_release(&sched_balance_running, 0);
> out:
On Mon, Oct 13, 2025 at 02:54:19PM -0700, Tim Chen wrote:
> > So I'm not sure I understand the situation, @continue_balancing should
> > limit this concurrency to however many groups are on this domain -- your
> > granite thing with SNC on would have something like 6 groups?
>
> That's a good point. But I think the contention is worse than
> 6 CPUs.
>
> The hierarchy would be
>
> SMT
> NUMA-level1
> NUMA-level2
> NUMA-level3
> NUMA-level4
Aren't you missing the LLC/NODE domain here? We should have at least one
!SD_NUMA domain above SMT.
> There would be multiple CPUs that are first in their SMT group,
> with continue_balancing=1, going up the hierarchy and
> attempting the cmpxchg at the first NUMA domain level,
> before calling should_we_balance() and finding that they are
> not the first in the NUMA domain, setting continue_balancing=0
> and aborting. Those CPUs are in the same L3.
> But at the same time, there could be CPUs in other sockets
> doing cmpxchg on sched_balance_running.
Right, Yu Chen said something like that as well, should_we_balance() is
too late.
Should we instead move the whole serialize thing inside
sched_balance_rq() like so:
---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bc0b7ce8a65d..e9f719ba17e1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11722,6 +11722,22 @@ static void update_lb_imbalance_stat(struct lb_env *env, struct sched_domain *sd
}
}
+
+/*
+ * This flag serializes load-balancing passes over large domains
+ * (above the NODE topology level) - only one load-balancing instance
+ * may run at a time, to reduce overhead on very large systems with
+ * lots of CPUs and large NUMA distances.
+ *
+ * - Note that load-balancing passes triggered while another one
+ * is executing are skipped and not re-tried.
+ *
+ * - Also note that this does not serialize rebalance_domains()
+ * execution, as non-SD_SERIALIZE domains will still be
+ * load-balanced in parallel.
+ */
+static atomic_t sched_balance_running = ATOMIC_INIT(0);
+
/*
* Check this_cpu to ensure it is balanced within domain. Attempt to move
* tasks if there is an imbalance.
@@ -11747,6 +11763,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
.fbq_type = all,
.tasks = LIST_HEAD_INIT(env.tasks),
};
+ int need_unlock = false;
cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
@@ -11758,6 +11775,12 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
goto out_balanced;
}
+ if (idle != CPU_NEWLY_IDLE && (sd->flags & SD_SERIALIZE)) {
+ if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
+ goto out_balanced;
+ need_unlock = true;
+ }
+
group = sched_balance_find_src_group(&env);
if (!group) {
schedstat_inc(sd->lb_nobusyg[idle]);
@@ -11998,6 +12021,9 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
sd->balance_interval < sd->max_interval)
sd->balance_interval *= 2;
out:
+ if (need_unlock)
+ atomic_set_release(&sched_balance_running, 0);
+
return ld_moved;
}
@@ -12122,21 +12148,6 @@ static int active_load_balance_cpu_stop(void *data)
return 0;
}
-/*
- * This flag serializes load-balancing passes over large domains
- * (above the NODE topology level) - only one load-balancing instance
- * may run at a time, to reduce overhead on very large systems with
- * lots of CPUs and large NUMA distances.
- *
- * - Note that load-balancing passes triggered while another one
- * is executing are skipped and not re-tried.
- *
- * - Also note that this does not serialize rebalance_domains()
- * execution, as non-SD_SERIALIZE domains will still be
- * load-balanced in parallel.
- */
-static atomic_t sched_balance_running = ATOMIC_INIT(0);
-
/*
* Scale the max sched_balance_rq interval with the number of CPUs in the system.
* This trades load-balance latency on larger machines for less cross talk.
@@ -12192,7 +12203,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
/* Earliest time when we have to do rebalance again */
unsigned long next_balance = jiffies + 60*HZ;
int update_next_balance = 0;
- int need_serialize, need_decay = 0;
+ int need_decay = 0;
u64 max_cost = 0;
rcu_read_lock();
@@ -12216,13 +12227,6 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
}
interval = get_sd_balance_interval(sd, busy);
-
- need_serialize = sd->flags & SD_SERIALIZE;
- if (need_serialize) {
- if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
- goto out;
- }
-
if (time_after_eq(jiffies, sd->last_balance + interval)) {
if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
/*
@@ -12236,9 +12240,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
sd->last_balance = jiffies;
interval = get_sd_balance_interval(sd, busy);
}
- if (need_serialize)
- atomic_set_release(&sched_balance_running, 0);
-out:
+
if (time_after(next_balance, sd->last_balance + interval)) {
next_balance = sd->last_balance + interval;
update_next_balance = 1;
On 10/14/25 2:54 PM, Peter Zijlstra wrote:
>
> Should we instead move the whole serialize thing inside
> sched_balance_rq() like so:
>
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bc0b7ce8a65d..e9f719ba17e1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11722,6 +11722,22 @@ static void update_lb_imbalance_stat(struct lb_env *env, struct sched_domain *sd
> }
> }
>
> +
> +/*
> + * This flag serializes load-balancing passes over large domains
> + * (above the NODE topology level) - only one load-balancing instance
> + * may run at a time, to reduce overhead on very large systems with
> + * lots of CPUs and large NUMA distances.
> + *
> + * - Note that load-balancing passes triggered while another one
> + * is executing are skipped and not re-tried.
> + *
> + * - Also note that this does not serialize rebalance_domains()
> + * execution, as non-SD_SERIALIZE domains will still be
> + * load-balanced in parallel.
> + */
> +static atomic_t sched_balance_running = ATOMIC_INIT(0);
> +
> /*
> * Check this_cpu to ensure it is balanced within domain. Attempt to move
> * tasks if there is an imbalance.
> @@ -11747,6 +11763,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> .fbq_type = all,
> .tasks = LIST_HEAD_INIT(env.tasks),
> };
> + int need_unlock = false;
>
> cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
>
> @@ -11758,6 +11775,12 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> goto out_balanced;
> }
>
> + if (idle != CPU_NEWLY_IDLE && (sd->flags & SD_SERIALIZE)) {
> + if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
> + goto out_balanced;
> + need_unlock = true;
> + }
> +
> group = sched_balance_find_src_group(&env);
> if (!group) {
> schedstat_inc(sd->lb_nobusyg[idle]);
> @@ -11998,6 +12021,9 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> sd->balance_interval < sd->max_interval)
> sd->balance_interval *= 2;
> out:
> + if (need_unlock)
> + atomic_set_release(&sched_balance_running, 0);
> +
> return ld_moved;
> }
>
> @@ -12122,21 +12148,6 @@ static int active_load_balance_cpu_stop(void *data)
> return 0;
> }
>
> -/*
> - * This flag serializes load-balancing passes over large domains
> - * (above the NODE topology level) - only one load-balancing instance
> - * may run at a time, to reduce overhead on very large systems with
> - * lots of CPUs and large NUMA distances.
> - *
> - * - Note that load-balancing passes triggered while another one
> - * is executing are skipped and not re-tried.
> - *
> - * - Also note that this does not serialize rebalance_domains()
> - * execution, as non-SD_SERIALIZE domains will still be
> - * load-balanced in parallel.
> - */
> -static atomic_t sched_balance_running = ATOMIC_INIT(0);
> -
> /*
> * Scale the max sched_balance_rq interval with the number of CPUs in the system.
> * This trades load-balance latency on larger machines for less cross talk.
> @@ -12192,7 +12203,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
> /* Earliest time when we have to do rebalance again */
> unsigned long next_balance = jiffies + 60*HZ;
> int update_next_balance = 0;
> - int need_serialize, need_decay = 0;
> + int need_decay = 0;
> u64 max_cost = 0;
>
> rcu_read_lock();
> @@ -12216,13 +12227,6 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
> }
>
> interval = get_sd_balance_interval(sd, busy);
> -
> - need_serialize = sd->flags & SD_SERIALIZE;
> - if (need_serialize) {
> - if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
> - goto out;
> - }
> -
> if (time_after_eq(jiffies, sd->last_balance + interval)) {
> if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
> /*
> @@ -12236,9 +12240,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
> sd->last_balance = jiffies;
> interval = get_sd_balance_interval(sd, busy);
> }
> - if (need_serialize)
> - atomic_set_release(&sched_balance_running, 0);
> -out:
> +
> if (time_after(next_balance, sd->last_balance + interval)) {
> next_balance = sd->last_balance + interval;
> update_next_balance = 1;
One thing is missing: we need to reset it to 0 for the redo logic to work.
@@ -11882,6 +11905,8 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
if (!cpumask_subset(cpus, env.dst_grpmask)) {
env.loop = 0;
env.loop_break = SCHED_NR_MIGRATE_BREAK;
+ if (need_unlock)
+ atomic_set_release(&sched_balance_running, 0);
goto redo;
* Peter Zijlstra <peterz@infradead.org> [2025-10-14 11:24:36]:
> On Mon, Oct 13, 2025 at 02:54:19PM -0700, Tim Chen wrote:
>
>
> Right, Yu Chen said something like that as well, should_we_balance() is
> too late.
>
> Should we instead move the whole serialize thing inside
> sched_balance_rq() like so:
>
> @@ -12122,21 +12148,6 @@ static int active_load_balance_cpu_stop(void *data)
> return 0;
> }
>
> -/*
> - * This flag serializes load-balancing passes over large domains
> - * (above the NODE topology level) - only one load-balancing instance
> - * may run at a time, to reduce overhead on very large systems with
> - * lots of CPUs and large NUMA distances.
> - *
> - * - Note that load-balancing passes triggered while another one
> - * is executing are skipped and not re-tried.
> - *
> - * - Also note that this does not serialize rebalance_domains()
> - * execution, as non-SD_SERIALIZE domains will still be
> - * load-balanced in parallel.
> - */
> -static atomic_t sched_balance_running = ATOMIC_INIT(0);
> -
> /*
> * Scale the max sched_balance_rq interval with the number of CPUs in the system.
> * This trades load-balance latency on larger machines for less cross talk.
> @@ -12192,7 +12203,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
> /* Earliest time when we have to do rebalance again */
> unsigned long next_balance = jiffies + 60*HZ;
> int update_next_balance = 0;
> - int need_serialize, need_decay = 0;
> + int need_decay = 0;
> u64 max_cost = 0;
>
> rcu_read_lock();
> @@ -12216,13 +12227,6 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
> }
>
> interval = get_sd_balance_interval(sd, busy);
> -
> - need_serialize = sd->flags & SD_SERIALIZE;
> - if (need_serialize) {
> - if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
> - goto out;
> - }
> -
> if (time_after_eq(jiffies, sd->last_balance + interval)) {
> if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
> /*
> @@ -12236,9 +12240,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
> sd->last_balance = jiffies;
> interval = get_sd_balance_interval(sd, busy);
> }
> - if (need_serialize)
> - atomic_set_release(&sched_balance_running, 0);
> -out:
> +
> if (time_after(next_balance, sd->last_balance + interval)) {
> next_balance = sd->last_balance + interval;
> update_next_balance = 1;
I think this is better, since previously a CPU which was not supposed to
do the balancing may increment the atomic variable. If the CPU that was
supposed to do the balance then tries, it may fail since the variable has
not yet been decremented.
--
Thanks and Regards
Srikar Dronamraju
On Tue, Oct 14, 2025 at 07:20:35PM +0530, Srikar Dronamraju wrote:
> * Peter Zijlstra <peterz@infradead.org> [2025-10-14 11:24:36]:
>
> > On Mon, Oct 13, 2025 at 02:54:19PM -0700, Tim Chen wrote:
> >
> >
> > Right, Yu Chen said something like that as well, should_we_balance() is
> > too late.
> >
> > Should we instead move the whole serialize thing inside
> > sched_balance_rq() like so:
> >
> > @@ -12122,21 +12148,6 @@ static int active_load_balance_cpu_stop(void *data)
> > return 0;
> > }
> >
> > -/*
> > - * This flag serializes load-balancing passes over large domains
> > - * (above the NODE topology level) - only one load-balancing instance
> > - * may run at a time, to reduce overhead on very large systems with
> > - * lots of CPUs and large NUMA distances.
> > - *
> > - * - Note that load-balancing passes triggered while another one
> > - * is executing are skipped and not re-tried.
> > - *
> > - * - Also note that this does not serialize rebalance_domains()
> > - * execution, as non-SD_SERIALIZE domains will still be
> > - * load-balanced in parallel.
> > - */
> > -static atomic_t sched_balance_running = ATOMIC_INIT(0);
> > -
> > /*
> > * Scale the max sched_balance_rq interval with the number of CPUs in the system.
> > * This trades load-balance latency on larger machines for less cross talk.
> > @@ -12192,7 +12203,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
> > /* Earliest time when we have to do rebalance again */
> > unsigned long next_balance = jiffies + 60*HZ;
> > int update_next_balance = 0;
> > - int need_serialize, need_decay = 0;
> > + int need_decay = 0;
> > u64 max_cost = 0;
> >
> > rcu_read_lock();
> > @@ -12216,13 +12227,6 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
> > }
> >
> > interval = get_sd_balance_interval(sd, busy);
> > -
> > - need_serialize = sd->flags & SD_SERIALIZE;
> > - if (need_serialize) {
> > - if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
> > - goto out;
> > - }
> > -
> > if (time_after_eq(jiffies, sd->last_balance + interval)) {
> > if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
> > /*
> > @@ -12236,9 +12240,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
> > sd->last_balance = jiffies;
> > interval = get_sd_balance_interval(sd, busy);
> > }
> > - if (need_serialize)
> > - atomic_set_release(&sched_balance_running, 0);
> > -out:
> > +
> > if (time_after(next_balance, sd->last_balance + interval)) {
> > next_balance = sd->last_balance + interval;
> > update_next_balance = 1;
>
> I think this is better, since previously a CPU which was not supposed to
> do the balancing may increment the atomic variable. If the CPU that was
> supposed to do the balance then tries, it may fail since the variable has
> not yet been decremented.
Right, it would do that acquire and then still have at least 2 ways to
not actually balance, which is a waste.
On 10/14/25 2:54 PM, Peter Zijlstra wrote:
> On Mon, Oct 13, 2025 at 02:54:19PM -0700, Tim Chen wrote:
>
>>> So I'm not sure I understand the situation, @continue_balancing should
>>> limit this concurrency to however many groups are on this domain -- your
>>> granite thing with SNC on would have something like 6 groups?
>>
>> That's a good point. But I think the contention is worse than
>> 6 CPUs.
>>
>> The hierarchy would be
>>
>> SMT
>> NUMA-level1
>> NUMA-level2
>> NUMA-level3
>> NUMA-level4
>
> Aren't you missing the LLC/NODE domain here? We should have at least one
> !SD_NUMA domain above SMT.
>
>> There would be multiple CPUs that are first in their SMT group,
>> with continue_balancing=1, going up the hierarchy and
>> attempting the cmpxchg at the first NUMA domain level,
>> before calling should_we_balance() and finding that they are
>> not the first in the NUMA domain, setting continue_balancing=0
>> and aborting. Those CPUs are in the same L3.
>> But at the same time, there could be CPUs in other sockets
>> doing cmpxchg on sched_balance_running.
>
> Right, Yu Chen said something like that as well, should_we_balance() is
> too late.
>
> Should we instead move the whole serialize thing inside
> sched_balance_rq() like so:
>
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bc0b7ce8a65d..e9f719ba17e1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11722,6 +11722,22 @@ static void update_lb_imbalance_stat(struct lb_env *env, struct sched_domain *sd
> }
> }
>
> +
> +/*
> + * This flag serializes load-balancing passes over large domains
> + * (above the NODE topology level) - only one load-balancing instance
> + * may run at a time, to reduce overhead on very large systems with
> + * lots of CPUs and large NUMA distances.
> + *
> + * - Note that load-balancing passes triggered while another one
> + * is executing are skipped and not re-tried.
> + *
> + * - Also note that this does not serialize rebalance_domains()
> + * execution, as non-SD_SERIALIZE domains will still be
> + * load-balanced in parallel.
> + */
> +static atomic_t sched_balance_running = ATOMIC_INIT(0);
> +
> /*
> * Check this_cpu to ensure it is balanced within domain. Attempt to move
> * tasks if there is an imbalance.
> @@ -11747,6 +11763,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> .fbq_type = all,
> .tasks = LIST_HEAD_INIT(env.tasks),
> };
> + int need_unlock = false;
>
> cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
>
> @@ -11758,6 +11775,12 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> goto out_balanced;
> }
>
> + if (idle != CPU_NEWLY_IDLE && (sd->flags & SD_SERIALIZE)) {
> + if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
> + goto out_balanced;
Maybe goto out instead of out_balanced ?
> + need_unlock = true;
> + }
> +
> group = sched_balance_find_src_group(&env);
> if (!group) {
> schedstat_inc(sd->lb_nobusyg[idle]);
> @@ -11998,6 +12021,9 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> sd->balance_interval < sd->max_interval)
> sd->balance_interval *= 2;
> out:
> + if (need_unlock)
> + atomic_set_release(&sched_balance_running, 0);
> +
> return ld_moved;
> }
>
> @@ -12122,21 +12148,6 @@ static int active_load_balance_cpu_stop(void *data)
> return 0;
> }
>
> -/*
> - * This flag serializes load-balancing passes over large domains
> - * (above the NODE topology level) - only one load-balancing instance
> - * may run at a time, to reduce overhead on very large systems with
> - * lots of CPUs and large NUMA distances.
> - *
> - * - Note that load-balancing passes triggered while another one
> - * is executing are skipped and not re-tried.
> - *
> - * - Also note that this does not serialize rebalance_domains()
> - * execution, as non-SD_SERIALIZE domains will still be
> - * load-balanced in parallel.
> - */
> -static atomic_t sched_balance_running = ATOMIC_INIT(0);
> -
> /*
> * Scale the max sched_balance_rq interval with the number of CPUs in the system.
> * This trades load-balance latency on larger machines for less cross talk.
> @@ -12192,7 +12203,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
> /* Earliest time when we have to do rebalance again */
> unsigned long next_balance = jiffies + 60*HZ;
> int update_next_balance = 0;
> - int need_serialize, need_decay = 0;
> + int need_decay = 0;
> u64 max_cost = 0;
>
> rcu_read_lock();
> @@ -12216,13 +12227,6 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
> }
>
> interval = get_sd_balance_interval(sd, busy);
> -
> - need_serialize = sd->flags & SD_SERIALIZE;
> - if (need_serialize) {
> - if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
> - goto out;
> - }
> -
> if (time_after_eq(jiffies, sd->last_balance + interval)) {
> if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
> /*
> @@ -12236,9 +12240,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
> sd->last_balance = jiffies;
> interval = get_sd_balance_interval(sd, busy);
> }
> - if (need_serialize)
> - atomic_set_release(&sched_balance_running, 0);
> -out:
> +
> if (time_after(next_balance, sd->last_balance + interval)) {
> next_balance = sd->last_balance + interval;
> update_next_balance = 1;
On Tue, Oct 14, 2025 at 03:03:41PM +0530, Shrikanth Hegde wrote:
> > @@ -11758,6 +11775,12 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> > goto out_balanced;
> > }
> > + if (idle != CPU_NEWLY_IDLE && (sd->flags & SD_SERIALIZE)) {
> > + if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
> > + goto out_balanced;
>
> Maybe goto out instead of out_balanced ?
That would be inconsistent with the !should_we_balance() goto
out_balanced right above this, no?
> > + need_unlock = true;
> > + }
> > +
> > group = sched_balance_find_src_group(&env);
> > if (!group) {
> > schedstat_inc(sd->lb_nobusyg[idle]);
> > @@ -11998,6 +12021,9 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> > sd->balance_interval < sd->max_interval)
> > sd->balance_interval *= 2;
> > out:
> > + if (need_unlock)
> > + atomic_set_release(&sched_balance_running, 0);
> > +
> > return ld_moved;
> > }
> > @@ -12122,21 +12148,6 @@ static int active_load_balance_cpu_stop(void *data)
> > return 0;
> > }
> > -/*
> > - * This flag serializes load-balancing passes over large domains
> > - * (above the NODE topology level) - only one load-balancing instance
> > - * may run at a time, to reduce overhead on very large systems with
> > - * lots of CPUs and large NUMA distances.
> > - *
> > - * - Note that load-balancing passes triggered while another one
> > - * is executing are skipped and not re-tried.
> > - *
> > - * - Also note that this does not serialize rebalance_domains()
> > - * execution, as non-SD_SERIALIZE domains will still be
> > - * load-balanced in parallel.
> > - */
> > -static atomic_t sched_balance_running = ATOMIC_INIT(0);
> > -
> > /*
> > * Scale the max sched_balance_rq interval with the number of CPUs in the system.
> > * This trades load-balance latency on larger machines for less cross talk.
> > @@ -12192,7 +12203,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
> > /* Earliest time when we have to do rebalance again */
> > unsigned long next_balance = jiffies + 60*HZ;
> > int update_next_balance = 0;
> > - int need_serialize, need_decay = 0;
> > + int need_decay = 0;
> > u64 max_cost = 0;
> > rcu_read_lock();
> > @@ -12216,13 +12227,6 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
> > }
> > interval = get_sd_balance_interval(sd, busy);
> > -
> > - need_serialize = sd->flags & SD_SERIALIZE;
> > - if (need_serialize) {
> > - if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
> > - goto out;
> > - }
> > -
> > if (time_after_eq(jiffies, sd->last_balance + interval)) {
> > if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
> > /*
> > @@ -12236,9 +12240,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
> > sd->last_balance = jiffies;
> > interval = get_sd_balance_interval(sd, busy);
> > }
> > - if (need_serialize)
> > - atomic_set_release(&sched_balance_running, 0);
> > -out:
> > +
> > if (time_after(next_balance, sd->last_balance + interval)) {
> > next_balance = sd->last_balance + interval;
> > update_next_balance = 1;
>
On 10/14/25 3:12 PM, Peter Zijlstra wrote:
> On Tue, Oct 14, 2025 at 03:03:41PM +0530, Shrikanth Hegde wrote:
>
>>> @@ -11758,6 +11775,12 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>>> goto out_balanced;
>>> }
>>> + if (idle != CPU_NEWLY_IDLE && (sd->flags & SD_SERIALIZE)) {
>>> + if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
>>> + goto out_balanced;
>>
>> Maybe goto out instead of out_balanced ?
>
> That would be inconsistent with the !should_we_balance() goto
> out_balanced right above this, no?
>
Hi Peter.
I collected similar probe-point numbers for this. Even the patch is quite
similar to what was suggested there a while ago:
https://lore.kernel.org/all/41e11090-a100-48a7-a0dd-c989772822d7@linux.ibm.com/
A 480-CPU system with 6 NUMA nodes (a different system than last time).
tl;dr
- The number of times sched_balance_running is taken is much lower after the swb check (which is great).
- The number of times it fails to be set is very low after swb, so out_balanced vs out may not make a significant difference.
- The patch is at the end. It is this patch + the redo fix + the ref variable instrumentation (ignore that part).
--- detailed log----
++++++++++++ probe points +++++++++++++++
(added a ref("crap") so i could put a probe where i want )
0 static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
...
20 max_cost += sd->max_newidle_lb_cost;
/*
* Stop the load balance at this level. There is another
* CPU in our sched group which is doing load balancing more
* actively.
*/
if (!continue_balancing) {
if (need_decay)
continue;
break;
}
33 if (sd->flags & SD_SERIALIZE)
34 ref = ref + 5;
<sched_balance_rq@/home/shrikanth/sched_tip/kernel/sched/fair.c:0>
0 static int sched_balance_rq(int this_cpu, struct rq *this_rq,
struct sched_domain *sd, enum cpu_idle_type idle,
int *continue_balancing)
...
int need_unlock = false;
cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
25 schedstat_inc(sd->lb_count[idle]);
...
34 if (idle != CPU_NEWLY_IDLE && (sd->flags & SD_SERIALIZE)) {
35 if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1)) {
36 ref = ref+1;
37 goto out_balanced;
}
39 ref = ref + 2;
40 need_unlock = true;
...
env.loop_break = SCHED_NR_MIGRATE_BREAK;
167 if (need_unlock) {
168 ref = ref+3;
169 atomic_set_release(&sched_balance_running, 0);
}
goto redo;
...
out:
287 if (need_unlock) {
288 ref = ref +4;
289 atomic_set_release(&sched_balance_running, 0);
}
return ld_moved;
probe:sched_balance_domains_L34 (on sched_balance_domains:34@kernel/sched/fair.c)
probe:sched_balance_rq_L168 (on sched_balance_rq:168@kernel/sched/fair.c)
probe:sched_balance_rq_L21 (on sched_balance_rq+312@kernel/sched/fair.c)
probe:sched_balance_rq_L288 (on sched_balance_rq+312@kernel/sched/fair.c)
probe:sched_balance_rq_L35 (on sched_balance_rq+312@kernel/sched/fair.c)
probe:sched_balance_rq_L36 (on sched_balance_rq+312@kernel/sched/fair.c)
probe:sched_balance_rq_L39 (on sched_balance_rq+312@kernel/sched/fair.c)
+++++++++++ Data on various load points ++++++++++++++++++++++++
--- idle ---
perf stat -a -e probe:* sleep 10
6,123 probe:sched_balance_domains_L34
10,378 probe:sched_balance_rq_L21
79 probe:sched_balance_rq_L35
17 probe:sched_balance_rq_L36
62 probe:sched_balance_rq_L39
0 probe:sched_balance_rq_L168
62 probe:sched_balance_rq_L288
--- 25% load ---
perf stat -a -e probe:* stress-ng --cpu=480 -l 25 -t 10
510,551 probe:sched_balance_domains_L34
303,892 probe:sched_balance_rq_L21
442 probe:sched_balance_rq_L35
3 probe:sched_balance_rq_L36
439 probe:sched_balance_rq_L39
0 probe:sched_balance_rq_L168
439 probe:sched_balance_rq_L288
--- 50% load ---
248,969 probe:sched_balance_domains_L34
187,864 probe:sched_balance_rq_L21
926 probe:sched_balance_rq_L35
6 probe:sched_balance_rq_L36
920 probe:sched_balance_rq_L39
0 probe:sched_balance_rq_L168
920 probe:sched_balance_rq_L288
--- 75% load ---
110,294 probe:sched_balance_domains_L34
71,568 probe:sched_balance_rq_L21
861 probe:sched_balance_rq_L35
6 probe:sched_balance_rq_L36
855 probe:sched_balance_rq_L39
0 probe:sched_balance_rq_L168
855 probe:sched_balance_rq_L288
--- 100% load ---
85,960 probe:sched_balance_domains_L34
48,169 probe:sched_balance_rq_L21
71 probe:sched_balance_rq_L35
4 probe:sched_balance_rq_L36
67 probe:sched_balance_rq_L39
0 probe:sched_balance_rq_L168
67 probe:sched_balance_rq_L288
++++++++++++++++++ patch ++++++++++++++++++++++++++++++++++++
(ignore ref crap)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cee1793e8277..832104705500 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11722,10 +11722,29 @@ static void update_lb_imbalance_stat(struct lb_env *env, struct sched_domain *sd
}
}
+
+/*
+ * This flag serializes load-balancing passes over large domains
+ * (above the NODE topology level) - only one load-balancing instance
+ * may run at a time, to reduce overhead on very large systems with
+ * lots of CPUs and large NUMA distances.
+ *
+ * - Note that load-balancing passes triggered while another one
+ * is executing are skipped and not re-tried.
+ *
+ * - Also note that this does not serialize rebalance_domains()
+ * execution, as non-SD_SERIALIZE domains will still be
+ * load-balanced in parallel.
+ */
+static atomic_t sched_balance_running = ATOMIC_INIT(0);
+
/*
* Check this_cpu to ensure it is balanced within domain. Attempt to move
* tasks if there is an imbalance.
*/
+
+int ref = 0;
+
static int sched_balance_rq(int this_cpu, struct rq *this_rq,
struct sched_domain *sd, enum cpu_idle_type idle,
int *continue_balancing)
@@ -11747,10 +11766,12 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
.fbq_type = all,
.tasks = LIST_HEAD_INIT(env.tasks),
};
+ int need_unlock = false;
cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
schedstat_inc(sd->lb_count[idle]);
+ ref = 1;
redo:
if (!should_we_balance(&env)) {
@@ -11758,6 +11779,15 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
goto out_balanced;
}
+ if (idle != CPU_NEWLY_IDLE && (sd->flags & SD_SERIALIZE)) {
+ if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1)) {
+ ref = ref+1;
+ goto out_balanced;
+ }
+ ref = ref + 2;
+ need_unlock = true;
+ }
+
group = sched_balance_find_src_group(&env);
if (!group) {
schedstat_inc(sd->lb_nobusyg[idle]);
@@ -11882,6 +11912,10 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
if (!cpumask_subset(cpus, env.dst_grpmask)) {
env.loop = 0;
env.loop_break = SCHED_NR_MIGRATE_BREAK;
+ if (need_unlock) {
+ ref = ref+3;
+ atomic_set_release(&sched_balance_running, 0);
+ }
goto redo;
}
goto out_all_pinned;
@@ -11998,6 +12032,11 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
sd->balance_interval < sd->max_interval)
sd->balance_interval *= 2;
out:
+ if (need_unlock) {
+ ref = ref +4;
+ atomic_set_release(&sched_balance_running, 0);
+ }
+
return ld_moved;
}
@@ -12122,21 +12161,6 @@ static int active_load_balance_cpu_stop(void *data)
return 0;
}
-/*
- * This flag serializes load-balancing passes over large domains
- * (above the NODE topology level) - only one load-balancing instance
- * may run at a time, to reduce overhead on very large systems with
- * lots of CPUs and large NUMA distances.
- *
- * - Note that load-balancing passes triggered while another one
- * is executing are skipped and not re-tried.
- *
- * - Also note that this does not serialize rebalance_domains()
- * execution, as non-SD_SERIALIZE domains will still be
- * load-balanced in parallel.
- */
-static atomic_t sched_balance_running = ATOMIC_INIT(0);
-
/*
* Scale the max sched_balance_rq interval with the number of CPUs in the system.
* This trades load-balance latency on larger machines for less cross talk.
@@ -12192,7 +12216,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
/* Earliest time when we have to do rebalance again */
unsigned long next_balance = jiffies + 60*HZ;
int update_next_balance = 0;
- int need_serialize, need_decay = 0;
+ int need_decay = 0;
u64 max_cost = 0;
rcu_read_lock();
@@ -12215,14 +12239,10 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
break;
}
- interval = get_sd_balance_interval(sd, busy);
-
- need_serialize = sd->flags & SD_SERIALIZE;
- if (need_serialize) {
- if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
- goto out;
- }
+ if (sd->flags & SD_SERIALIZE)
+ ref = ref + 5;
+ interval = get_sd_balance_interval(sd, busy);
if (time_after_eq(jiffies, sd->last_balance + interval)) {
if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
/*
@@ -12236,9 +12256,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
sd->last_balance = jiffies;
interval = get_sd_balance_interval(sd, busy);
}
- if (need_serialize)
- atomic_set_release(&sched_balance_running, 0);
-out:
+
if (time_after(next_balance, sd->last_balance + interval)) {
next_balance = sd->last_balance + interval;
update_next_balance = 1;
On 10/16/25 7:33 PM, Shrikanth Hegde wrote:
>
>
> On 10/14/25 3:12 PM, Peter Zijlstra wrote:
>> On Tue, Oct 14, 2025 at 03:03:41PM +0530, Shrikanth Hegde wrote:
>>
>>>> @@ -11758,6 +11775,12 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>>>> goto out_balanced;
>>>> }
>>>> + if (idle != CPU_NEWLY_IDLE && (sd->flags & SD_SERIALIZE)) {
>>>> + if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
>>>> + goto out_balanced;
>>>
>>> Maybe goto out instead of out_balanced ?
>>
>> That would be inconsistent with the !should_we_balance() goto
>> out_balanced right above this, no?
>>
> Hi Peter.
>
> I collected similar probe-point numbers for this. Even the patch is quite
> similar to what was suggested there a while ago:
> https://lore.kernel.org/all/41e11090-a100-48a7-a0dd-c989772822d7@linux.ibm.com/
>
> A 480-CPU system with 6 NUMA nodes (a different system than last time).
>
Hi Peter, Tim.
How are we proceeding here? Should I send the below patch with a changelog,
or is it being worked out by either of you?
It is really beneficial for large systems to avoid unnecessary cache bouncing.
> tl;dr
>
> - The number of times sched_balance_running is taken is much lower
> after the swb check (which is great).
> - The number of times it fails to be set is very low after swb, so
> out_balanced vs out may not make a significant difference.
> - The patch is at the end. It is this patch + the redo fix + the ref
> variable instrumentation (ignore that part).
>
>
On 10/14/25 3:12 PM, Peter Zijlstra wrote:
> On Tue, Oct 14, 2025 at 03:03:41PM +0530, Shrikanth Hegde wrote:
>
>>> @@ -11758,6 +11775,12 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>>> goto out_balanced;
>>> }
>>> + if (idle != CPU_NEWLY_IDLE && (sd->flags & SD_SERIALIZE)) {
>>> + if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
>>> + goto out_balanced;
>>
>> Maybe goto out instead of out_balanced ?
>
> That would be inconsistent with the !should_we_balance() goto
> out_balanced right above this, no?
Yes. But what's the reason for going to out_balanced for !should_we_balance()?
Load balance wasn't even attempted there, right? Could this be updating it wrongly?
At least the comments around out_all_pinned don't make sense if we came here via !swb:
schedstat_inc(sd->lb_balanced[idle]);
sd->nr_balance_failed = 0;
>
>>> + need_unlock = true;
>>> + }
>>> +
>>> group = sched_balance_find_src_group(&env);
>>> if (!group) {
>>> schedstat_inc(sd->lb_nobusyg[idle]);
>>> @@ -11998,6 +12021,9 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>>> sd->balance_interval < sd->max_interval)
>>> sd->balance_interval *= 2;
>>> out:
>>> + if (need_unlock)
>>> + atomic_set_release(&sched_balance_running, 0);
>>> +
>>> return ld_moved;
>>> }
>>> @@ -12122,21 +12148,6 @@ static int active_load_balance_cpu_stop(void *data)
>>> return 0;
>>> }
>>> -/*
>>> - * This flag serializes load-balancing passes over large domains
>>> - * (above the NODE topology level) - only one load-balancing instance
>>> - * may run at a time, to reduce overhead on very large systems with
>>> - * lots of CPUs and large NUMA distances.
>>> - *
>>> - * - Note that load-balancing passes triggered while another one
>>> - * is executing are skipped and not re-tried.
>>> - *
>>> - * - Also note that this does not serialize rebalance_domains()
>>> - * execution, as non-SD_SERIALIZE domains will still be
>>> - * load-balanced in parallel.
>>> - */
>>> -static atomic_t sched_balance_running = ATOMIC_INIT(0);
>>> -
>>> /*
>>> * Scale the max sched_balance_rq interval with the number of CPUs in the system.
>>> * This trades load-balance latency on larger machines for less cross talk.
>>> @@ -12192,7 +12203,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
>>> /* Earliest time when we have to do rebalance again */
>>> unsigned long next_balance = jiffies + 60*HZ;
>>> int update_next_balance = 0;
>>> - int need_serialize, need_decay = 0;
>>> + int need_decay = 0;
>>> u64 max_cost = 0;
>>> rcu_read_lock();
>>> @@ -12216,13 +12227,6 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
>>> }
>>> interval = get_sd_balance_interval(sd, busy);
>>> -
>>> - need_serialize = sd->flags & SD_SERIALIZE;
>>> - if (need_serialize) {
>>> - if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
>>> - goto out;
>>> - }
>>> -
>>> if (time_after_eq(jiffies, sd->last_balance + interval)) {
>>> if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
>>> /*
>>> @@ -12236,9 +12240,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
>>> sd->last_balance = jiffies;
>>> interval = get_sd_balance_interval(sd, busy);
>>> }
>>> - if (need_serialize)
>>> - atomic_set_release(&sched_balance_running, 0);
>>> -out:
>>> +
>>> if (time_after(next_balance, sd->last_balance + interval)) {
>>> next_balance = sd->last_balance + interval;
>>> update_next_balance = 1;
>>
On 10/13/2025 10:26 PM, Peter Zijlstra wrote:
> On Thu, Oct 02, 2025 at 04:00:12PM -0700, Tim Chen wrote:
>
>> During load balancing, balancing at the LLC level and above must be
>> serialized.
>
> I would argue the wording here, there is no *must*, they *are*. Per the
> current rules SD_NUMA and up get SD_SERIALIZE.
>
> This is a *very* old thing, done by Christoph Lameter back when he was
> at SGI. I'm not sure this default is still valid or not. Someone would
> have to investigate. An alternative would be moving it into
> node_reclaim_distance or somesuch.
>
Do you mean the following:
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 444bdfdab731..436c899d8da2 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1697,11 +1697,16 @@ sd_init(struct sched_domain_topology_level *tl,
sd->cache_nice_tries = 2;
sd->flags &= ~SD_PREFER_SIBLING;
- sd->flags |= SD_SERIALIZE;
if (sched_domains_numa_distance[tl->numa_level] >
node_reclaim_distance) {
sd->flags &= ~(SD_BALANCE_EXEC |
SD_BALANCE_FORK |
SD_WAKE_AFFINE);
+ /*
+ * Nodes that are far away need to be serialized to
+ * reduce the overhead of long-distance task migration
+ * caused by load balancing.
+ */
+ sd->flags |= SD_SERIALIZE;
}
We can launch some tests to see if removing SD_SERIALIZE would
bring any impact.
>> On a 2-socket Granite Rapids system with sub-NUMA clustering enabled
>> and running OLTP workloads, 7.6% of CPU cycles were spent on cmpxchg
>> operations for `sched_balance_running`. In most cases, the attempt
>> aborts immediately after acquisition because the load balance time is
>> not yet due.
>
> So I'm not sure I understand the situation, @continue_balancing should
> limit this concurrency to however many groups are on this domain -- your
> granite thing with SNC on would have something like 6 groups?
>
My understanding is that continue_balancing is set to false after
atomic_cmpxchg(sched_balance_running), so if sched_balance_domains()
is invoked by many CPUs in parallel, the atomic operations still compete?
thanks,
Chenyu
On Tue, Oct 14, 2025 at 12:32:57AM +0800, Chen, Yu C wrote:
> On 10/13/2025 10:26 PM, Peter Zijlstra wrote:
> > On Thu, Oct 02, 2025 at 04:00:12PM -0700, Tim Chen wrote:
> >
> > > During load balancing, balancing at the LLC level and above must be
> > > serialized.
> >
> > I would argue the wording here, there is no *must*, they *are*. Per the
> > current rules SD_NUMA and up get SD_SERIALIZE.
> >
> > This is a *very* old thing, done by Christoph Lameter back when he was
> > at SGI. I'm not sure this default is still valid or not. Someone would
> > have to investigate. An alternative would be moving it into
> > node_reclaim_distance or somesuch.
> >
>
> Do you mean the following:
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 444bdfdab731..436c899d8da2 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1697,11 +1697,16 @@ sd_init(struct sched_domain_topology_level *tl,
> sd->cache_nice_tries = 2;
>
> sd->flags &= ~SD_PREFER_SIBLING;
> - sd->flags |= SD_SERIALIZE;
> if (sched_domains_numa_distance[tl->numa_level] >
> node_reclaim_distance) {
> sd->flags &= ~(SD_BALANCE_EXEC |
> SD_BALANCE_FORK |
> SD_WAKE_AFFINE);
> + /*
> + * Nodes that are far away need to be serialized to
> + * reduce the overhead of long-distance task migration
> + * caused by load balancing.
> + */
> + sd->flags |= SD_SERIALIZE;
> }
>
> We can launch some tests to see if removing SD_SERIALIZE would
> bring any impact.
Yeah, something like that. But lets first get this other thing sorted. I
agree that the SD_SERIALIZE thing is in the wrong place.
On 10/13/25 10:02 PM, Chen, Yu C wrote:
> On 10/13/2025 10:26 PM, Peter Zijlstra wrote:
>> On Thu, Oct 02, 2025 at 04:00:12PM -0700, Tim Chen wrote:
>>
>>> During load balancing, balancing at the LLC level and above must be
>>> serialized.
>>
>> I would argue the wording here, there is no *must*, they *are*. Per the
>> current rules SD_NUMA and up get SD_SERIALIZE.
>>
>> This is a *very* old thing, done by Christoph Lameter back when he was
>> at SGI. I'm not sure this default is still valid or not. Someone would
>> have to investigate. An alternative would be moving it into
>> node_reclaim_distance or somesuch.
>>
>
> Do you mean the following:
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 444bdfdab731..436c899d8da2 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1697,11 +1697,16 @@ sd_init(struct sched_domain_topology_level *tl,
> sd->cache_nice_tries = 2;
>
> sd->flags &= ~SD_PREFER_SIBLING;
> - sd->flags |= SD_SERIALIZE;
> if (sched_domains_numa_distance[tl->numa_level] >
> node_reclaim_distance) {
> sd->flags &= ~(SD_BALANCE_EXEC |
> SD_BALANCE_FORK |
> SD_WAKE_AFFINE);
> + /*
> + * Nodes that are far away need to be serialized to
> + * reduce the overhead of long-distance task migration
> + * caused by load balancing.
> + */
> + sd->flags |= SD_SERIALIZE;
> }
>
> We can launch some tests to see if removing SD_SERIALIZE would
> bring any impact.
>
>>> On a 2-socket Granite Rapids system with sub-NUMA clustering enabled
>>> and running OLTP workloads, 7.6% of CPU cycles were spent on cmpxchg
>>> operations for `sched_balance_running`. In most cases, the attempt
>>> aborts immediately after acquisition because the load balance time is
>>> not yet due.
>>
>> So I'm not sure I understand the situation, @continue_balancing should
>> limit this concurrency to however many groups are on this domain -- your
>> granite thing with SNC on would have something like 6 groups?
>>
>
> My understanding is that continue_balancing is set to false after
> atomic_cmpxchg(sched_balance_running), so if sched_balance_domains()
> is invoked by many CPUs in parallel, the atomic operations still compete?
>
From what I could remember, this almost always happens at SMT, after
which continue_balancing = 0. Since multiple CPUs end up calling it
(especially on a busy system), it causes a lot of cacheline bouncing
and ends up showing in the perf profile.
On 10/14/2025 12:41 AM, Shrikanth Hegde wrote:
>
>
> On 10/13/25 10:02 PM, Chen, Yu C wrote:
>> On 10/13/2025 10:26 PM, Peter Zijlstra wrote:
>>> On Thu, Oct 02, 2025 at 04:00:12PM -0700, Tim Chen wrote:
>>>
>>>> During load balancing, balancing at the LLC level and above must be
>>>> serialized.
>>>
>>> I would argue the wording here, there is no *must*, they *are*. Per the
>>> current rules SD_NUMA and up get SD_SERIALIZE.
>>>
>>> This is a *very* old thing, done by Christoph Lameter back when he was
>>> at SGI. I'm not sure this default is still valid or not. Someone would
>>> have to investigate. An alternative would be moving it into
>>> node_reclaim_distance or somesuch.
>>>
>>
>> Do you mean the following:
>>
>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index 444bdfdab731..436c899d8da2 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -1697,11 +1697,16 @@ sd_init(struct sched_domain_topology_level *tl,
>> sd->cache_nice_tries = 2;
>>
>> sd->flags &= ~SD_PREFER_SIBLING;
>> - sd->flags |= SD_SERIALIZE;
>> if (sched_domains_numa_distance[tl->numa_level] >
>> node_reclaim_distance) {
>> sd->flags &= ~(SD_BALANCE_EXEC |
>> SD_BALANCE_FORK |
>> SD_WAKE_AFFINE);
>> + /*
>> + * Nodes that are far away need to be serialized to
>> + * reduce the overhead of long-distance task migration
>> + * caused by load balancing.
>> + */
>> + sd->flags |= SD_SERIALIZE;
>> }
>>
>> We can launch some tests to see if removing SD_SERIALIZE would
>> bring any impact.
>>
>>>> On a 2-socket Granite Rapids system with sub-NUMA clustering enabled
>>>> and running OLTP workloads, 7.6% of CPU cycles were spent on cmpxchg
>>>> operations for `sched_balance_running`. In most cases, the attempt
>>>> aborts immediately after acquisition because the load balance time is
>>>> not yet due.
>>>
>>> So I'm not sure I understand the situation, @continue_balancing should
>>> limit this concurrency to however many groups are on this domain -- your
>>> granite thing with SNC on would have something like 6 groups?
>>>
>>
>> My understanding is that continue_balancing is set to false after
>> atomic_cmpxchg(sched_balance_running), so if sched_balance_domains()
>> is invoked by many CPUs in parallel, the atomic operations still compete?
>>
>
> From what I could remember, this almost always happens at SMT, after
> which continue_balancing = 0. Since multiple CPUs end up calling it
> (especially on a busy system), it causes a lot of cacheline bouncing
> and ends up showing in the perf profile.
>
I see, by the time we reach the NUMA domain, continue_balancing should
already be set to 0. Thanks for pointing it out.
thanks,
Chenyu
On 10/3/25 4:30 AM, Tim Chen wrote:
> Repost comments:
>
> There have been past discussions about avoiding serialization in load
> balancing, but no objections were raised to this patch itself during
> its last posting:
> https://lore.kernel.org/lkml/20250416035823.1846307-1-tim.c.chen@linux.intel.com/
>
> Vincent and Chen Yu have already provided their Reviewed-by tags.
>
> We recently encountered this issue again on a 2-socket, 240-core
> Clearwater Forest server running SPECjbb. In this case, 14% of CPU
> cycles were wasted on unnecessary acquisitions of
> sched_balance_running. This reinforces the need for the change, and we
> hope it can be merged.
>
> Tim
>
> ---
>
> During load balancing, balancing at the LLC level and above must be
> serialized. The scheduler currently checks the atomic
> `sched_balance_running` flag before verifying whether a balance is
> actually due. This causes high contention, as multiple CPUs may attempt
> to acquire the flag concurrently.
>
> On a 2-socket Granite Rapids system with sub-NUMA clustering enabled
> and running OLTP workloads, 7.6% of CPU cycles were spent on cmpxchg
> operations for `sched_balance_running`. In most cases, the attempt
> aborts immediately after acquisition because the load balance time is
> not yet due.
>
> Fix this by checking whether a balance is due *before* trying to
> acquire `sched_balance_running`. This avoids many wasted acquisitions
> and reduces the cmpxchg overhead in `sched_balance_domain()` from 7.6%
> to 0.05%. As a result, OLTP throughput improves by 11%.
>
> Reviewed-by: Chen Yu <yu.c.chen@intel.com>
> Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
> Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> ---

Hi Tim.

Fine by me. Unnecessary atomic operations do hurt on large systems.
The further optimization that I pointed out can come in later, I guess;
that would only help further. This should be good to begin with.

With that.
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
On Fri, 2025-10-03 at 10:53 +0530, Shrikanth Hegde wrote:
>
> On 10/3/25 4:30 AM, Tim Chen wrote:
> > Repost comments:
> >
[...]
> > Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
> > ---
>
> Hi Tim.
>
> Fine by me. Unnecessary atomic operations do hurt on large systems.
> The further optimization that I pointed out can come in later, I guess;
> that would only help further. This should be good to begin with.

Thanks for your review and your past comments. We'll look into further
optimization if we find that this becomes a hot path again. For now,
this change seems to be good enough.

Tim

>
> With that.
> Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
>