[PATCH v4] sched/fair: Skip sched_balance_running cmpxchg when balance is not due

Posted by Tim Chen 3 weeks ago
The NUMA sched domain sets the SD_SERIALIZE flag by default, allowing
only one NUMA load balancing operation to run system-wide at a time.

Currently, each sched group leader directly under the NUMA domain attempts
to acquire the global sched_balance_running flag via cmpxchg() before
checking whether load balancing is due or whether it is the designated
load balancer for that NUMA domain. On systems with a large number
of cores, this causes significant cache contention on the shared
sched_balance_running flag.

This patch avoids unnecessary cmpxchg() operations by first checking,
via should_we_balance(), that the CPU is the designated balancer for
the NUMA domain, and that the balance interval has expired, before
trying to acquire sched_balance_running to load balance that NUMA
domain.
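
The reordering follows the usual pattern of doing the cheap, local
checks before touching a contended cache line. A standalone C11 sketch
of that pattern is below; the helper names and the trivial checks are
made up for illustration and are not the kernel code:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Plays the role of sched_balance_running (illustration only). */
static atomic_int balance_running;

/* Stand-ins for should_we_balance() and the balance-interval check. */
static bool is_designated_balancer(int cpu) { return cpu == 0; }
static bool balance_interval_expired(void) { return true; }

static bool try_serialized_balance(int cpu)
{
	int expected = 0;

	/* Cheap, per-CPU checks first: no shared cache line is touched. */
	if (!is_designated_balancer(cpu) || !balance_interval_expired())
		return false;

	/* Only the designated, due balancer contends on the shared flag. */
	if (!atomic_compare_exchange_strong_explicit(&balance_running,
						     &expected, 1,
						     memory_order_acquire,
						     memory_order_relaxed))
		return false;

	/* ... the serialized load-balancing pass would run here ... */

	atomic_store_explicit(&balance_running, 0, memory_order_release);
	return true;
}

int main(void)
{
	printf("cpu0: %d\n", try_serialized_balance(0)); /* balances */
	printf("cpu1: %d\n", try_serialized_balance(1)); /* never touches the flag */
	return 0;
}

In the patch itself, should_we_balance() and the existing
time_after_eq() interval check in sched_balance_domains() are the cheap
checks that now run first.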

On a 2-socket Granite Rapids system with sub-NUMA clustering enabled,
running an OLTP workload, 7.8% of total CPU cycles were spent in
sched_balance_domains() contending on sched_balance_running before
this change.

         : 104              static __always_inline int arch_atomic_cmpxchg(atomic_t *v, int old, int new)
         : 105              {
         : 106              return arch_cmpxchg(&v->counter, old, new);
    0.00 :   ffffffff81326e6c:       xor    %eax,%eax
    0.00 :   ffffffff81326e6e:       mov    $0x1,%ecx
    0.00 :   ffffffff81326e73:       lock cmpxchg %ecx,0x2394195(%rip)        # ffffffff836bb010 <sched_balance_running>
         : 110              sched_balance_domains():
         : 12234            if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
   99.39 :   ffffffff81326e7b:       test   %eax,%eax
    0.00 :   ffffffff81326e7d:       jne    ffffffff81326e99 <sched_balance_domains+0x209>
         : 12238            if (time_after_eq(jiffies, sd->last_balance + interval)) {
    0.00 :   ffffffff81326e7f:       mov    0x14e2b3a(%rip),%rax        # ffffffff828099c0 <jiffies_64>
    0.00 :   ffffffff81326e86:       sub    0x48(%r14),%rax
    0.00 :   ffffffff81326e8a:       cmp    %rdx,%rax

After applying this fix, sched_balance_domains() is gone from the profile
and there is a 5% throughput improvement.

Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Tested-by: Mohini Narkhede <mohini.narkhede@intel.com>
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>

---
v4:
1. Allow NEWLY_IDLE balance in an SD_SERIALIZE domain to be serialized.
2. Make need_unlock a bool and re-initialize it for the redo case.
link to v3: https://lore.kernel.org/lkml/52fcd1e8582a6b014a70f0ce7795ce0d88cd63a8.1762470554.git.tim.c.chen@linux.intel.com/

v3:
1. Move the balance-interval check to after should_we_balance()
link to v2: https://lore.kernel.org/lkml/248b775fc9030989c829d4061f6f85ae33dabe45.1761682932.git.tim.c.chen@linux.intel.com/

v2:
1. Rearrange the patch to get rid of an indent level per Peter's
   suggestion.
2. Update the data from a new run by the OLTP team.

link to v1: https://lore.kernel.org/lkml/e27d5dcb724fe46acc24ff44670bc4bb5be21d98.1759445926.git.tim.c.chen@linux.intel.com/

---
 kernel/sched/fair.c | 57 ++++++++++++++++++++++++---------------------
 1 file changed, 31 insertions(+), 26 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 25970dbbb279..43c5ec039633 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11732,6 +11732,21 @@ static void update_lb_imbalance_stat(struct lb_env *env, struct sched_domain *sd
 	}
 }
 
+/*
+ * This flag serializes load-balancing passes over large domains
+ * (above the NODE topology level) - only one load-balancing instance
+ * may run at a time, to reduce overhead on very large systems with
+ * lots of CPUs and large NUMA distances.
+ *
+ * - Note that load-balancing passes triggered while another one
+ *   is executing are skipped and not re-tried.
+ *
+ * - Also note that this does not serialize rebalance_domains()
+ *   execution, as non-SD_SERIALIZE domains will still be
+ *   load-balanced in parallel.
+ */
+static atomic_t sched_balance_running = ATOMIC_INIT(0);
+
 /*
  * Check this_cpu to ensure it is balanced within domain. Attempt to move
  * tasks if there is an imbalance.
@@ -11757,17 +11772,26 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
 		.fbq_type	= all,
 		.tasks		= LIST_HEAD_INIT(env.tasks),
 	};
+	bool need_unlock;
 
 	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
 
 	schedstat_inc(sd->lb_count[idle]);
 
 redo:
+	need_unlock = false;
 	if (!should_we_balance(&env)) {
 		*continue_balancing = 0;
 		goto out_balanced;
 	}
 
+	if (sd->flags & SD_SERIALIZE) {
+		if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1)) {
+			goto out_balanced;
+		}
+		need_unlock = true;
+	}
+
 	group = sched_balance_find_src_group(&env);
 	if (!group) {
 		schedstat_inc(sd->lb_nobusyg[idle]);
@@ -11892,6 +11916,9 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
 			if (!cpumask_subset(cpus, env.dst_grpmask)) {
 				env.loop = 0;
 				env.loop_break = SCHED_NR_MIGRATE_BREAK;
+				if (need_unlock)
+					atomic_set_release(&sched_balance_running, 0);
+
 				goto redo;
 			}
 			goto out_all_pinned;
@@ -12008,6 +12035,9 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
 	    sd->balance_interval < sd->max_interval)
 		sd->balance_interval *= 2;
 out:
+	if (need_unlock)
+		atomic_set_release(&sched_balance_running, 0);
+
 	return ld_moved;
 }
 
@@ -12132,21 +12162,6 @@ static int active_load_balance_cpu_stop(void *data)
 	return 0;
 }
 
-/*
- * This flag serializes load-balancing passes over large domains
- * (above the NODE topology level) - only one load-balancing instance
- * may run at a time, to reduce overhead on very large systems with
- * lots of CPUs and large NUMA distances.
- *
- * - Note that load-balancing passes triggered while another one
- *   is executing are skipped and not re-tried.
- *
- * - Also note that this does not serialize rebalance_domains()
- *   execution, as non-SD_SERIALIZE domains will still be
- *   load-balanced in parallel.
- */
-static atomic_t sched_balance_running = ATOMIC_INIT(0);
-
 /*
  * Scale the max sched_balance_rq interval with the number of CPUs in the system.
  * This trades load-balance latency on larger machines for less cross talk.
@@ -12202,7 +12217,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
 	/* Earliest time when we have to do rebalance again */
 	unsigned long next_balance = jiffies + 60*HZ;
 	int update_next_balance = 0;
-	int need_serialize, need_decay = 0;
+	int need_decay = 0;
 	u64 max_cost = 0;
 
 	rcu_read_lock();
@@ -12226,13 +12241,6 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
 		}
 
 		interval = get_sd_balance_interval(sd, busy);
-
-		need_serialize = sd->flags & SD_SERIALIZE;
-		if (need_serialize) {
-			if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
-				goto out;
-		}
-
 		if (time_after_eq(jiffies, sd->last_balance + interval)) {
 			if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
 				/*
@@ -12246,9 +12254,6 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
 			sd->last_balance = jiffies;
 			interval = get_sd_balance_interval(sd, busy);
 		}
-		if (need_serialize)
-			atomic_set_release(&sched_balance_running, 0);
-out:
 		if (time_after(next_balance, sd->last_balance + interval)) {
 			next_balance = sd->last_balance + interval;
 			update_next_balance = 1;
-- 
2.32.0
Re: [PATCH v4] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
Posted by Srikar Dronamraju 2 weeks, 5 days ago
* Tim Chen <tim.c.chen@linux.intel.com> [2025-11-10 10:47:35]:

> The NUMA sched domain sets the SD_SERIALIZE flag by default, allowing
> only one NUMA load balancing operation to run system-wide at a time.
> 
> @@ -11757,17 +11772,26 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>  		.fbq_type	= all,
>  		.tasks		= LIST_HEAD_INIT(env.tasks),
>  	};
> +	bool need_unlock;
>  
>  	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
>  
>  	schedstat_inc(sd->lb_count[idle]);
>  
>  redo:
> +	need_unlock = false;
>  	if (!should_we_balance(&env)) {
>  		*continue_balancing = 0;
>  		goto out_balanced;
>  	}
>  
> +	if (sd->flags & SD_SERIALIZE) {
> +		if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1)) {
> +			goto out_balanced;
> +		}
> +		need_unlock = true;
> +	}
> +

Moving the serialize check to sched_balance_rq is better since we only
take it when it's really needed. Previously we could have skipped the
balancing for the appropriate CPU.

>  	group = sched_balance_find_src_group(&env);
>  	if (!group) {
>  		schedstat_inc(sd->lb_nobusyg[idle]);
> @@ -11892,6 +11916,9 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>  			if (!cpumask_subset(cpus, env.dst_grpmask)) {
>  				env.loop = 0;
>  				env.loop_break = SCHED_NR_MIGRATE_BREAK;
> +				if (need_unlock)
> +					atomic_set_release(&sched_balance_running, 0);
> +

One nit:
While the current code is good, would conditionally resetting
need_unlock just after resetting the atomic variable be better than the
unconditional reset that we do now?

>  				goto redo;

Otherwise looks good to me.

Reviewed-by: Srikar Dronamraju <srikar@linux.ibm.com>

-- 
Thanks and Regards
Srikar Dronamraju
Re: [PATCH v4] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
Posted by Peter Zijlstra 2 weeks, 5 days ago
On Wed, Nov 12, 2025 at 01:32:23PM +0530, Srikar Dronamraju wrote:
> >  	group = sched_balance_find_src_group(&env);
> >  	if (!group) {
> >  		schedstat_inc(sd->lb_nobusyg[idle]);
> > @@ -11892,6 +11916,9 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> >  			if (!cpumask_subset(cpus, env.dst_grpmask)) {
> >  				env.loop = 0;
> >  				env.loop_break = SCHED_NR_MIGRATE_BREAK;
> > +				if (need_unlock)
> > +					atomic_set_release(&sched_balance_running, 0);
> > +
> 
> One nit:
> While the current code is good, would conditionally resetting the
> need_unlock just after resetting the atomic variable better than
> unconditional reset that we do now?

Right, I had the same thought when I grabbed the patch yesterday, but
ignored it.

But perhaps something like so?

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11717,14 +11717,20 @@ static int sched_balance_rq(int this_cpu
 		.fbq_type	= all,
 		.tasks		= LIST_HEAD_INIT(env.tasks),
 	};
-	bool need_unlock;
+	bool need_unlock = false;
 
 	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
 
 	schedstat_inc(sd->lb_count[idle]);
 
+	if (0) {
 redo:
-	need_unlock = false;
+		if (need_unlock) {
+			atomic_set_release(&sched_balance_running, 0);
+			need_unlock = false;
+		}
+	}
+
 	if (!should_we_balance(&env)) {
 		*continue_balancing = 0;
 		goto out_balanced;
@@ -11861,9 +11867,6 @@ static int sched_balance_rq(int this_cpu
 			if (!cpumask_subset(cpus, env.dst_grpmask)) {
 				env.loop = 0;
 				env.loop_break = SCHED_NR_MIGRATE_BREAK;
-				if (need_unlock)
-					atomic_set_release(&sched_balance_running, 0);
-
 				goto redo;
 			}
 			goto out_all_pinned;


---

The other option is something like this, but Linus hated on this pattern
when we started with things.

---
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11692,6 +11692,8 @@ static void update_lb_imbalance_stat(str
  */
 static atomic_t sched_balance_running = ATOMIC_INIT(0);
 
+DEFINE_FREE(balance_unlock, atomic_t *, if (_T) atomic_set_release(_T, 0));
+
 /*
  * Check this_cpu to ensure it is balanced within domain. Attempt to move
  * tasks if there is an imbalance.
@@ -11717,24 +11719,23 @@ static int sched_balance_rq(int this_cpu
 		.fbq_type	= all,
 		.tasks		= LIST_HEAD_INIT(env.tasks),
 	};
-	bool need_unlock;
 
 	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
 
 	schedstat_inc(sd->lb_count[idle]);
 
 redo:
-	need_unlock = false;
 	if (!should_we_balance(&env)) {
 		*continue_balancing = 0;
-		goto out_balanced;
+		return 0;
 	}
 
+	atomic_t *lock __free(balance_unlock) = NULL;
 	if (sd->flags & SD_SERIALIZE) {
-		if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1)) {
-			goto out_balanced;
-		}
-		need_unlock = true;
+		if (!atomic_try_cmpxchg_acquire(&sched_balance_running, 0, 1))
+			return 0;
+
+		lock = &sched_balance_running;
 	}
 
 	group = sched_balance_find_src_group(&env);
@@ -11861,9 +11862,6 @@ static int sched_balance_rq(int this_cpu
 			if (!cpumask_subset(cpus, env.dst_grpmask)) {
 				env.loop = 0;
 				env.loop_break = SCHED_NR_MIGRATE_BREAK;
-				if (need_unlock)
-					atomic_set_release(&sched_balance_running, 0);
-
 				goto redo;
 			}
 			goto out_all_pinned;
@@ -11980,9 +11978,6 @@ static int sched_balance_rq(int this_cpu
 	    sd->balance_interval < sd->max_interval)
 		sd->balance_interval *= 2;
 out:
-	if (need_unlock)
-		atomic_set_release(&sched_balance_running, 0);
-
 	return ld_moved;
 }
Re: [PATCH v4] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
Posted by Shrikanth Hegde 2 weeks, 5 days ago

On 11/12/25 4:07 PM, Peter Zijlstra wrote:
> On Wed, Nov 12, 2025 at 01:32:23PM +0530, Srikar Dronamraju wrote:
>>>   	group = sched_balance_find_src_group(&env);
>>>   	if (!group) {
>>>   		schedstat_inc(sd->lb_nobusyg[idle]);
>>> @@ -11892,6 +11916,9 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>>>   			if (!cpumask_subset(cpus, env.dst_grpmask)) {
>>>   				env.loop = 0;
>>>   				env.loop_break = SCHED_NR_MIGRATE_BREAK;
>>> +				if (need_unlock)
>>> +					atomic_set_release(&sched_balance_running, 0);
>>> +
>>
>> One nit:
>> While the current code is good, would conditionally resetting the
>> need_unlock just after resetting the atomic variable better than
>> unconditional reset that we do now?
> 
> Right, I had the same thought when grabbed the patch yesterday, but
> ignored it.
> 
> But perhaps something like so?
> 
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11717,14 +11717,20 @@ static int sched_balance_rq(int this_cpu
>   		.fbq_type	= all,
>   		.tasks		= LIST_HEAD_INIT(env.tasks),
>   	};
> -	bool need_unlock;
> +	bool need_unlock = false;
>   
>   	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
>   
>   	schedstat_inc(sd->lb_count[idle]);
>   
> +	if (0) {
>   redo:
> -	need_unlock = false;
> +		if (need_unlock) {
> +			atomic_set_release(&sched_balance_running, 0);
> +			need_unlock = false;
> +		}
> +	}
> +
>   	if (!should_we_balance(&env)) {
>   		*continue_balancing = 0;
>   		goto out_balanced;
> @@ -11861,9 +11867,6 @@ static int sched_balance_rq(int this_cpu
>   			if (!cpumask_subset(cpus, env.dst_grpmask)) {
>   				env.loop = 0;
>   				env.loop_break = SCHED_NR_MIGRATE_BREAK;
> -				if (need_unlock)
> -					atomic_set_release(&sched_balance_running, 0);
> -
>   				goto redo;
>   			}
>   			goto out_all_pinned;
> 
> 
> ---
> 
> The other option is something like this, but Linus hated on this pattern
> when we started with things.
> 
> ---
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11692,6 +11692,8 @@ static void update_lb_imbalance_stat(str
>    */
>   static atomic_t sched_balance_running = ATOMIC_INIT(0);
>   
> +DEFINE_FREE(balance_unlock, atomic_t *, if (_T) atomic_set_release(_T, 0));
> +
>   /*
>    * Check this_cpu to ensure it is balanced within domain. Attempt to move
>    * tasks if there is an imbalance.
> @@ -11717,24 +11719,23 @@ static int sched_balance_rq(int this_cpu
>   		.fbq_type	= all,
>   		.tasks		= LIST_HEAD_INIT(env.tasks),
>   	};
> -	bool need_unlock;
>   
>   	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
>   
>   	schedstat_inc(sd->lb_count[idle]);
>   
>   redo:
> -	need_unlock = false;
>   	if (!should_we_balance(&env)) {
>   		*continue_balancing = 0;
> -		goto out_balanced;
> +		return 0;
>   	}
>   
> +	atomic_t *lock __free(balance_unlock) = NULL;
>   	if (sd->flags & SD_SERIALIZE) {
> -		if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1)) {
> -			goto out_balanced;
> -		}
> -		need_unlock = true;
> +		if (!atomic_try_cmpxchg_acquire(&sched_balance_running, 0, 1))
> +			return 0;
> +
> +		lock = &sched_balance_running;
>   	}
>   
>   	group = sched_balance_find_src_group(&env);
> @@ -11861,9 +11862,6 @@ static int sched_balance_rq(int this_cpu
>   			if (!cpumask_subset(cpus, env.dst_grpmask)) {
>   				env.loop = 0;
>   				env.loop_break = SCHED_NR_MIGRATE_BREAK;
> -				if (need_unlock)
> -					atomic_set_release(&sched_balance_running, 0);
> -
>   				goto redo;
>   			}
>   			goto out_all_pinned;
> @@ -11980,9 +11978,6 @@ static int sched_balance_rq(int this_cpu
>   	    sd->balance_interval < sd->max_interval)
>   		sd->balance_interval *= 2;
>   out:
> -	if (need_unlock)
> -		atomic_set_release(&sched_balance_running, 0);
> -
>   	return ld_moved;
>   }
>   

The second one is difficult to understand.


Is this an option too?
---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f8e1df9c5199..64e2aeacd65e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11720,14 +11720,13 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
                 .fbq_type       = all,
                 .tasks          = LIST_HEAD_INIT(env.tasks),
         };
-       bool need_unlock;
+       bool need_unlock = false;
  
         cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
  
         schedstat_inc(sd->lb_count[idle]);
  
  redo:
-       need_unlock = false;
         if (!should_we_balance(&env)) {
                 *continue_balancing = 0;
                 goto out_balanced;
@@ -11864,8 +11863,10 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
                         if (!cpumask_subset(cpus, env.dst_grpmask)) {
                                 env.loop = 0;
                                 env.loop_break = SCHED_NR_MIGRATE_BREAK;
-                               if (need_unlock)
+                               if (need_unlock) {
                                         atomic_set_release(&sched_balance_running, 0);
+                                       need_unlock = false;
+                               }
  
                                 goto redo;
                         }
Re: [PATCH v4] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
Posted by Peter Zijlstra 2 weeks, 5 days ago
On Wed, Nov 12, 2025 at 11:37:40AM +0100, Peter Zijlstra wrote:
> On Wed, Nov 12, 2025 at 01:32:23PM +0530, Srikar Dronamraju wrote:
> > >  	group = sched_balance_find_src_group(&env);
> > >  	if (!group) {
> > >  		schedstat_inc(sd->lb_nobusyg[idle]);
> > > @@ -11892,6 +11916,9 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> > >  			if (!cpumask_subset(cpus, env.dst_grpmask)) {
> > >  				env.loop = 0;
> > >  				env.loop_break = SCHED_NR_MIGRATE_BREAK;
> > > +				if (need_unlock)
> > > +					atomic_set_release(&sched_balance_running, 0);
> > > +
> > 
> > One nit:
> > While the current code is good, would conditionally resetting the
> > need_unlock just after resetting the atomic variable better than
> > unconditional reset that we do now?
> 
> Right, I had the same thought when grabbed the patch yesterday, but
> ignored it.
> 

Hmm, should we not redo while keeping the lock? Doesn't make much sense
to drop and try to reacquire things here.

So perhaps this is the better option -- or did I overlook something with
should_we_balance? It doesn't look like that will make a different
decision on the retry.

---
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11717,26 +11717,25 @@ static int sched_balance_rq(int this_cpu
 		.fbq_type	= all,
 		.tasks		= LIST_HEAD_INIT(env.tasks),
 	};
-	bool need_unlock;
+	bool need_unlock = false;
 
 	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
 
 	schedstat_inc(sd->lb_count[idle]);
 
-redo:
-	need_unlock = false;
 	if (!should_we_balance(&env)) {
 		*continue_balancing = 0;
 		goto out_balanced;
 	}
 
 	if (sd->flags & SD_SERIALIZE) {
-		if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1)) {
+		if (!atomic_try_cmpxchg_acquire(&sched_balance_running, 0, 1))
 			goto out_balanced;
-		}
+
 		need_unlock = true;
 	}
 
+redo:
 	group = sched_balance_find_src_group(&env);
 	if (!group) {
 		schedstat_inc(sd->lb_nobusyg[idle]);
@@ -11861,9 +11860,6 @@ static int sched_balance_rq(int this_cpu
 			if (!cpumask_subset(cpus, env.dst_grpmask)) {
 				env.loop = 0;
 				env.loop_break = SCHED_NR_MIGRATE_BREAK;
-				if (need_unlock)
-					atomic_set_release(&sched_balance_running, 0);
-
 				goto redo;
 			}
 			goto out_all_pinned;
Re: [PATCH v4] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
Posted by Srikar Dronamraju 2 weeks, 5 days ago
* Peter Zijlstra <peterz@infradead.org> [2025-11-12 11:45:55]:

> > 
> > Right, I had the same thought when grabbed the patch yesterday, but
> > ignored it.
> > 
> 
> Hmm, should we not redo while keeping the lock? Doesn't make much sense
> to drop and try to reacquire things here.
> 
> So perhaps this is the better option -- or did I overlook something with
> should_we_balance? It doesn't look like that will make a different
> decision on the retry.
> 
> ---
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11717,26 +11717,25 @@ static int sched_balance_rq(int this_cpu
>  		.fbq_type	= all,
>  		.tasks		= LIST_HEAD_INIT(env.tasks),
>  	};
> -	bool need_unlock;
> +	bool need_unlock = false;
>  
>  	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
>  
>  	schedstat_inc(sd->lb_count[idle]);
>  
> -redo:
> -	need_unlock = false;
>  	if (!should_we_balance(&env)) {
>  		*continue_balancing = 0;
>  		goto out_balanced;
>  	}
>  
>  	if (sd->flags & SD_SERIALIZE) {
> -		if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1)) {
> +		if (!atomic_try_cmpxchg_acquire(&sched_balance_running, 0, 1))
>  			goto out_balanced;
> -		}
> +
>  		need_unlock = true;
>  	}
>  
> +redo:
>  	group = sched_balance_find_src_group(&env);
>  	if (!group) {
>  		schedstat_inc(sd->lb_nobusyg[idle]);
> @@ -11861,9 +11860,6 @@ static int sched_balance_rq(int this_cpu
>  			if (!cpumask_subset(cpus, env.dst_grpmask)) {
>  				env.loop = 0;
>  				env.loop_break = SCHED_NR_MIGRATE_BREAK;
> -				if (need_unlock)
> -					atomic_set_release(&sched_balance_running, 0);
> -
>  				goto redo;
>  			}
>  			goto out_all_pinned;

If the CPU that was doing the balance was not the first CPU of the domain
span, but was doing the balance because the first CPU was busy, and the
first CPU now happens to be idle at redo, the scheduler would have chosen the
first CPU to do the balance. However, it will now choose the CPU that holds the atomic.

I think this is better because
- The first CPU may have tried just before this CPU dropped the atomic and
  hence we may miss the balance opportunity.
- The first CPU and the other CPU may not share a cache and hence there
  may be a cache miss, which we avoid by doing this.
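
Roughly, the designated-balancer choice referred to above boils down
to: prefer the first idle CPU of the group, otherwise fall back to the
group's first CPU. A standalone paraphrase is below; the array/bitmap
modelling and the helper name are made up, and the real
should_we_balance() differs in detail (SMT handling, NEWLY_IDLE case):

#include <stdbool.h>
#include <stdio.h>

/* Return true if @this_cpu should run the balance pass for this group. */
static bool should_we_balance_sketch(const int *group_cpus, int nr,
				     const bool *cpu_idle, int this_cpu)
{
	/* Prefer the first idle CPU of the group... */
	for (int i = 0; i < nr; i++) {
		if (cpu_idle[group_cpus[i]])
			return group_cpus[i] == this_cpu;
	}
	/* ...otherwise the group's first CPU does the balancing. */
	return group_cpus[0] == this_cpu;
}

int main(void)
{
	int cpus[] = { 0, 1, 2, 3 };
	bool idle[] = { false, false, true, false }; /* CPU2 is idle */

	printf("cpu0: %d\n", should_we_balance_sketch(cpus, 4, idle, 0)); /* 0 */
	printf("cpu2: %d\n", should_we_balance_sketch(cpus, 4, idle, 2)); /* 1 */
	return 0;
}

So if the first CPU goes idle between the initial check and the redo,
re-running the check would hand the balancing duty to a CPU other than
the one already holding sched_balance_running.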

-- 
Thanks and Regards
Srikar Dronamraju
Re: [PATCH v4] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
Posted by Peter Zijlstra 2 weeks, 5 days ago
On Wed, Nov 12, 2025 at 04:55:48PM +0530, Srikar Dronamraju wrote:

> If the CPU that was doing the balance was not the first CPU of the domain
> span, but it was doing the balance since the first CPU was busy, and the
> first CPU now happens to be idle at redo, the scheduler would have chosen the
> first CPU to do the balance. However it will now choose the CPU that had the atomic..
> 
> I think this is better because 
> - The first CPU may have tried just before this CPU dropped the atomic and
>   hence we may miss the balance opportunity.
> - The first CPU and the other CPU may not be sharing cache and hence there
>   may be a cache-miss, which we are avoiding by doing this.

I'm not sure I understand what you're arguing for. Are you saying it
would be better to retain the lock where possible?


--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11717,23 +11717,22 @@ static int sched_balance_rq(int this_cpu
 		.fbq_type	= all,
 		.tasks		= LIST_HEAD_INIT(env.tasks),
 	};
-	bool need_unlock;
+	bool need_unlock = false;
 
 	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
 
 	schedstat_inc(sd->lb_count[idle]);
 
 redo:
-	need_unlock = false;
 	if (!should_we_balance(&env)) {
 		*continue_balancing = 0;
 		goto out_balanced;
 	}
 
-	if (sd->flags & SD_SERIALIZE) {
-		if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1)) {
+	if (!need_unlock && sd->flags & SD_SERIALIZE) {
+		if (!atomic_try_cmpxchg_acquire(&sched_balance_running, 0, 1))
 			goto out_balanced;
-		}
+
 		need_unlock = true;
 	}
 
@@ -11861,9 +11860,6 @@ static int sched_balance_rq(int this_cpu
 			if (!cpumask_subset(cpus, env.dst_grpmask)) {
 				env.loop = 0;
 				env.loop_break = SCHED_NR_MIGRATE_BREAK;
-				if (need_unlock)
-					atomic_set_release(&sched_balance_running, 0);
-
 				goto redo;
 			}
 			goto out_all_pinned;
Re: [PATCH v4] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
Posted by Srikar Dronamraju 2 weeks, 5 days ago
* Peter Zijlstra <peterz@infradead.org> [2025-11-12 14:39:37]:

> On Wed, Nov 12, 2025 at 04:55:48PM +0530, Srikar Dronamraju wrote:
> 
> > If the CPU that was doing the balance was not the first CPU of the domain
> > span, but it was doing the balance since the first CPU was busy, and the
> > first CPU now happens to be idle at redo, the scheduler would have chosen the
> > first CPU to do the balance. However it will now choose the CPU that had the atomic..
> > 
> > I think this is better because 
> > - The first CPU may have tried just before this CPU dropped the atomic and
> >   hence we may miss the balance opportunity.
> > - The first CPU and the other CPU may not be sharing cache and hence there
> >   may be a cache-miss, which we are avoiding by doing this.
> 
> I'm not sure I understand what you're arguing for. Are you saying it
> would be better to retain the lock where possible?
> 

Yes, I was arguing for keeping the lock and not re-checking should_we_balance()
with the lock held.

Let's say CPU2 enters sched_balance_rq(), should_we_balance() succeeds, and CPU2 takes
the lock. It hits redo, and this time should_we_balance() may not succeed for
CPU2 (since CPU0/1 is idle). However, CPU0 may have already raced with CPU2,
tried to take the lock before CPU2 released it, and bailed out. So we miss a
balancing opportunity.

> 

-- 
Thanks and Regards
Srikar Dronamraju
Re: [PATCH v4] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
Posted by Peter Zijlstra 2 weeks, 5 days ago
On Wed, Nov 12, 2025 at 02:39:37PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 12, 2025 at 04:55:48PM +0530, Srikar Dronamraju wrote:
> 
> > If the CPU that was doing the balance was not the first CPU of the domain
> > span, but it was doing the balance since the first CPU was busy, and the
> > first CPU now happens to be idle at redo, the scheduler would have chosen the
> > first CPU to do the balance. However it will now choose the CPU that had the atomic..
> > 
> > I think this is better because 
> > - The first CPU may have tried just before this CPU dropped the atomic and
> >   hence we may miss the balance opportunity.
> > - The first CPU and the other CPU may not be sharing cache and hence there
> >   may be a cache-miss, which we are avoiding by doing this.
> 
> I'm not sure I understand what you're arguing for. Are you saying it
> would be better to retain the lock where possible?

FWIW this is also the behaviour we had before this patch: with the lock
taken in the caller, it would have covered the whole redo case as
well.

> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11717,23 +11717,22 @@ static int sched_balance_rq(int this_cpu
>  		.fbq_type	= all,
>  		.tasks		= LIST_HEAD_INIT(env.tasks),
>  	};
> -	bool need_unlock;
> +	bool need_unlock = false;
>  
>  	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
>  
>  	schedstat_inc(sd->lb_count[idle]);
>  
>  redo:
> -	need_unlock = false;
>  	if (!should_we_balance(&env)) {
>  		*continue_balancing = 0;
>  		goto out_balanced;
>  	}
>  
> -	if (sd->flags & SD_SERIALIZE) {
> -		if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1)) {
> +	if (!need_unlock && sd->flags & SD_SERIALIZE) {
> +		if (!atomic_try_cmpxchg_acquire(&sched_balance_running, 0, 1))
>  			goto out_balanced;
> -		}
> +
>  		need_unlock = true;
>  	}
>  
> @@ -11861,9 +11860,6 @@ static int sched_balance_rq(int this_cpu
>  			if (!cpumask_subset(cpus, env.dst_grpmask)) {
>  				env.loop = 0;
>  				env.loop_break = SCHED_NR_MIGRATE_BREAK;
> -				if (need_unlock)
> -					atomic_set_release(&sched_balance_running, 0);
> -
>  				goto redo;
>  			}
>  			goto out_all_pinned;
Re: [PATCH v4] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
Posted by Shrikanth Hegde 2 weeks, 5 days ago

On 11/12/25 4:15 PM, Peter Zijlstra wrote:
> On Wed, Nov 12, 2025 at 11:37:40AM +0100, Peter Zijlstra wrote:
>> On Wed, Nov 12, 2025 at 01:32:23PM +0530, Srikar Dronamraju wrote:
>>>>   	group = sched_balance_find_src_group(&env);
>>>>   	if (!group) {
>>>>   		schedstat_inc(sd->lb_nobusyg[idle]);
>>>> @@ -11892,6 +11916,9 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>>>>   			if (!cpumask_subset(cpus, env.dst_grpmask)) {
>>>>   				env.loop = 0;
>>>>   				env.loop_break = SCHED_NR_MIGRATE_BREAK;
>>>> +				if (need_unlock)
>>>> +					atomic_set_release(&sched_balance_running, 0);
>>>> +
>>>
>>> One nit:
>>> While the current code is good, would conditionally resetting the
>>> need_unlock just after resetting the atomic variable better than
>>> unconditional reset that we do now?
>>
>> Right, I had the same thought when grabbed the patch yesterday, but
>> ignored it.
>>
> 
> Hmm, should we not redo while keeping the lock? Doesn't make much sense
> to drop and try to reacquire things here.
> 
> So perhaps this is the better option -- or did I overlook something with
> should_we_balance? It doesn't look like that will make a different
> decision on the retry.
> 

I think for newidle balance, these checks are there in swb to bail out of load balancing.
The redo logic catches it, right?

The env->dst_rq lock is taken only in attach_tasks; meanwhile, if a wakeup happened,
pending would be set. Are irqs enabled, or can a remote CPU set ttwu_pending on this rq?


         if (env->idle == CPU_NEWLY_IDLE) {
                 if (env->dst_rq->nr_running > 0 || env->dst_rq->ttwu_pending)
                         return 0;
                 return 1;
         }
Re: [PATCH v4] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
Posted by Peter Zijlstra 2 weeks, 5 days ago
On Wed, Nov 12, 2025 at 04:39:43PM +0530, Shrikanth Hegde wrote:
> 
> 

> > So perhaps this is the better option -- or did I overlook something with
> > should_we_balance? It doesn't look like that will make a different
> > decision on the retry.
> > 
> 
> I think in newidle balance, these checks are there in swb to bail of load balance.
> redo logic catches it right?

Urgh, my brain still thinks we're not serializing on newidle. Perhaps I
should make this two patches, one moving the serialization and one adding it
to newidle.

> env->dst_rq lock is taken only in attach_tasks, meanwhile, if the wakeup happened,
> pending would be set. is irq enabled or remote CPU can set ttwu_pending on this rq?
> 
>         if (env->idle == CPU_NEWLY_IDLE) {
>                 if (env->dst_rq->nr_running > 0 || env->dst_rq->ttwu_pending)
>                         return 0;
>                 return 1;
>         }

Right, that could get tickled.
Re: [PATCH v4] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
Posted by Tim Chen 2 weeks, 5 days ago
On Wed, 2025-11-12 at 12:21 +0100, Peter Zijlstra wrote:
> On Wed, Nov 12, 2025 at 04:39:43PM +0530, Shrikanth Hegde wrote:
> > 
> > 
> 
> > > So perhaps this is the better option -- or did I overlook something with
> > > should_we_balance? It doesn't look like that will make a different
> > > decision on the retry.
> > > 
> > 
> > I think in newidle balance, these checks are there in swb to bail of load balance.
> > redo logic catches it right?
> 
> Urgh, my brain still thinks we're not serializing on newidle. Perhaps I
> should make this 2 patches, one moving the serializing and one adding it
> to newidle.
> 
> > env->dst_rq lock is taken only in attach_tasks, meanwhile, if the wakeup happened,
> > pending would be set. is irq enabled or remote CPU can set ttwu_pending on this rq?
> > 
> >         if (env->idle == CPU_NEWLY_IDLE) {
> >                 if (env->dst_rq->nr_running > 0 || env->dst_rq->ttwu_pending)
> >                         return 0;
> >                 return 1;
> >         }
> 
> Right, that could get tickled.

How about something like the following on top of the v4 patch?
This will avoid releasing the lock and take care of the NEWLY_IDLE case.

Tim

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 43c5ec039633..26179f4b77f6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11772,14 +11772,13 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
 		.fbq_type	= all,
 		.tasks		= LIST_HEAD_INIT(env.tasks),
 	};
-	bool need_unlock;
+	bool need_unlock = false;
 
 	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
 
 	schedstat_inc(sd->lb_count[idle]);
 
 redo:
-	need_unlock = false;
 	if (!should_we_balance(&env)) {
 		*continue_balancing = 0;
 		goto out_balanced;
@@ -11916,9 +11915,9 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
 			if (!cpumask_subset(cpus, env.dst_grpmask)) {
 				env.loop = 0;
 				env.loop_break = SCHED_NR_MIGRATE_BREAK;
-				if (need_unlock)
-					atomic_set_release(&sched_balance_running, 0);
-
+				if (env.idle == CPU_NEWLY_IDLE &&
+				    (env.dst_rq->nr_running > 0 || env.dst_rq->ttwu_pending))
+					goto out;
 				goto redo;
 			}
 			goto out_all_pinned;
Re: [PATCH v4] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
Posted by Shrikanth Hegde 2 weeks, 4 days ago

On 11/13/25 2:40 AM, Tim Chen wrote:
> On Wed, 2025-11-12 at 12:21 +0100, Peter Zijlstra wrote:
>> On Wed, Nov 12, 2025 at 04:39:43PM +0530, Shrikanth Hegde wrote:
>>>
>>>
>>
>>>> So perhaps this is the better option -- or did I overlook something with
>>>> should_we_balance? It doesn't look like that will make a different
>>>> decision on the retry.
>>>>
>>>
>>> I think in newidle balance, these checks are there in swb to bail of load balance.
>>> redo logic catches it right?
>>
>> Urgh, my brain still thinks we're not serializing on newidle. Perhaps I
>> should make this 2 patches, one moving the serializing and one adding it
>> to newidle.
>>
>>> env->dst_rq lock is taken only in attach_tasks, meanwhile, if the wakeup happened,
>>> pending would be set. is irq enabled or remote CPU can set ttwu_pending on this rq?
>>>
>>>          if (env->idle == CPU_NEWLY_IDLE) {
>>>                  if (env->dst_rq->nr_running > 0 || env->dst_rq->ttwu_pending)
>>>                          return 0;
>>>                  return 1;
>>>          }
>>
>> Right, that could get tickled.
> 
> How about something like the following on top of v4 patch?
> This will avoid releasing the lock and take care of the NEWLY_IDLE case.
> 
> Tim
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 43c5ec039633..26179f4b77f6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11772,14 +11772,13 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>   		.fbq_type	= all,
>   		.tasks		= LIST_HEAD_INIT(env.tasks),
>   	};
> -	bool need_unlock;
> +	bool need_unlock = false;
>   
>   	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
>   
>   	schedstat_inc(sd->lb_count[idle]);
>   
>   redo:
> -	need_unlock = false;
>   	if (!should_we_balance(&env)) {
>   		*continue_balancing = 0;
>   		goto out_balanced;
> @@ -11916,9 +11915,9 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>   			if (!cpumask_subset(cpus, env.dst_grpmask)) {
>   				env.loop = 0;
>   				env.loop_break = SCHED_NR_MIGRATE_BREAK;
> -				if (need_unlock)
> -					atomic_set_release(&sched_balance_running, 0);
> -
> +				if (env.idle == CPU_NEWLY_IDLE &&
> +				    (env.dst_rq->nr_running > 0 || env.dst_rq->ttwu_pending))
> +					goto out;

IIUC, if we come here, it means a busiest CPU was found, but due to
affinity restrictions none of its tasks can come to this CPU.

So a redo is done, excluding that busiest CPU, if there are CPUs other
than the group_mask of this CPU. So doing a redo does make sense, especially
for newidle.

So bailing out might be wrong.

>   				goto redo;
>   			}
>   			goto out_all_pinned;
Re: [PATCH v4] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
Posted by Tim Chen 2 weeks, 4 days ago
On Thu, 2025-11-13 at 09:55 +0530, Shrikanth Hegde wrote:
> 
> On 11/13/25 2:40 AM, Tim Chen wrote:
> > On Wed, 2025-11-12 at 12:21 +0100, Peter Zijlstra wrote:
> > > On Wed, Nov 12, 2025 at 04:39:43PM +0530, Shrikanth Hegde wrote:
> > > > 
> > > > 
> > > 
> > > > > So perhaps this is the better option -- or did I overlook something with
> > > > > should_we_balance? It doesn't look like that will make a different
> > > > > decision on the retry.
> > > > > 
> > > > 
> > > > I think in newidle balance, these checks are there in swb to bail of load balance.
> > > > redo logic catches it right?
> > > 
> > > Urgh, my brain still thinks we're not serializing on newidle. Perhaps I
> > > should make this 2 patches, one moving the serializing and one adding it
> > > to newidle.
> > > 
> > > > env->dst_rq lock is taken only in attach_tasks, meanwhile, if the wakeup happened,
> > > > pending would be set. is irq enabled or remote CPU can set ttwu_pending on this rq?
> > > > 
> > > >          if (env->idle == CPU_NEWLY_IDLE) {
> > > >                  if (env->dst_rq->nr_running > 0 || env->dst_rq->ttwu_pending)
> > > >                          return 0;
> > > >                  return 1;
> > > >          }
> > > 
> > > Right, that could get tickled.
> > 
> > How about something like the following on top of v4 patch?
> > This will avoid releasing the lock and take care of the NEWLY_IDLE case.
> > 
> > Tim
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 43c5ec039633..26179f4b77f6 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -11772,14 +11772,13 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> >   		.fbq_type	= all,
> >   		.tasks		= LIST_HEAD_INIT(env.tasks),
> >   	};
> > -	bool need_unlock;
> > +	bool need_unlock = false;
> >   
> >   	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
> >   
> >   	schedstat_inc(sd->lb_count[idle]);
> >   
> >   redo:
> > -	need_unlock = false;
> >   	if (!should_we_balance(&env)) {
> >   		*continue_balancing = 0;
> >   		goto out_balanced;
> > @@ -11916,9 +11915,9 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> >   			if (!cpumask_subset(cpus, env.dst_grpmask)) {
> >   				env.loop = 0;
> >   				env.loop_break = SCHED_NR_MIGRATE_BREAK;
> > -				if (need_unlock)
> > -					atomic_set_release(&sched_balance_running, 0);
> > -
> > +				if (env->idle == CPU_NEWLY_IDLE &&
> > +				    (env->dst_running > 0 || env->dst_rq->ttwu_pending))
> > +					goto out;
> 
> IIUC, we come here, it means busiest cpu was found, but due to
> affinity restrictions none of the tasks can come to this cpu.
> 
> So a redo is done excluding that busiest cpu if there are cpus other
> than the group_mask of this cpu. So doing a redo does make sense specially
> for newidle.
> 
> So doing bailing out might be wrong.

My understanding is that the reason for the idle balancing is that the
dst_rq has become idle.  If we see that the dst_rq already has something to run,
why add latency searching for more tasks to pull, as the dst_rq is likely
the current CPU's runqueue.

Tim

> 
> >   				goto redo;
> >   			}
> >   			goto out_all_pinned;
Re: [PATCH v4] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
Posted by Shrikanth Hegde 2 weeks, 6 days ago
Hi Tim,

On 11/11/25 12:17 AM, Tim Chen wrote:
> The NUMA sched domain sets the SD_SERIALIZE flag by default, allowing
> only one NUMA load balancing operation to run system-wide at a time.
> 
> Currently, each sched group leader directly under the NUMA domain attempts
> to acquire the global sched_balance_running flag via cmpxchg() before
> checking whether load balancing is due or whether it is the designated
> load balancer for that NUMA domain. On systems with a large number
> of cores, this causes significant cache contention on the shared
> sched_balance_running flag.
> 
> This patch avoids unnecessary cmpxchg() operations by first checking,
> via should_we_balance(), that the CPU is the designated balancer for
> the NUMA domain, and that the balance interval has expired, before
> trying to acquire sched_balance_running to load balance that NUMA
> domain.
> 
> On a 2-socket Granite Rapids system with sub-NUMA clustering enabled,
> running an OLTP workload, 7.8% of total CPU cycles were spent in
> sched_balance_domains() contending on sched_balance_running before
> this change.
Looks good to me. Thanks for getting this into its current shape.

I see hackbench improving slightly across its variations. So,
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>