[tip: sched/core] sched/fair: Skip sched_balance_running cmpxchg when balance is not due

tip-bot2 for Tim Chen posted 1 patch 2 months, 3 weeks ago
There is a newer version of this series
kernel/sched/fair.c | 53 ++++++++++++++++++++++----------------------
1 file changed, 27 insertions(+), 26 deletions(-)
[tip: sched/core] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
Posted by tip-bot2 for Tim Chen 2 months, 3 weeks ago
The following commit has been merged into the sched/core branch of tip:

Commit-ID:     2265c5d4deeff3bfe4580d9ffe718fd80a414cac
Gitweb:        https://git.kernel.org/tip/2265c5d4deeff3bfe4580d9ffe718fd80a414cac
Author:        Tim Chen <tim.c.chen@linux.intel.com>
AuthorDate:    Mon, 10 Nov 2025 10:47:35 -08:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Fri, 14 Nov 2025 13:03:05 +01:00

sched/fair: Skip sched_balance_running cmpxchg when balance is not due

The NUMA sched domain sets the SD_SERIALIZE flag by default, allowing
only one NUMA load balancing operation to run system-wide at a time.

Currently, each sched group leader directly under NUMA domain attempts
to acquire the global sched_balance_running flag via cmpxchg() before
checking whether load balancing is due or whether it is the designated
load balancer for that NUMA domain. On systems with a large number
of cores, this causes significant cache contention on the shared
sched_balance_running flag.

This patch reduces unnecessary cmpxchg() operations by first checking
that the balancer is the designated leader for a NUMA domain from
should_we_balance(), and the balance interval has expired before
trying to acquire sched_balance_running to load balance a NUMA
domain.

On a 2-socket Granite Rapids system with sub-NUMA clustering enabled,
running an OLTP workload, 7.8% of total CPU cycles were previously spent
in sched_balance_domain() contending on sched_balance_running before
this change.

         : 104              static __always_inline int arch_atomic_cmpxchg(atomic_t *v, int old, int new)
         : 105              {
         : 106              return arch_cmpxchg(&v->counter, old, new);
    0.00 :   ffffffff81326e6c:       xor    %eax,%eax
    0.00 :   ffffffff81326e6e:       mov    $0x1,%ecx
    0.00 :   ffffffff81326e73:       lock cmpxchg %ecx,0x2394195(%rip)        # ffffffff836bb010 <sched_balance_running>
         : 110              sched_balance_domains():
         : 12234            if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
   99.39 :   ffffffff81326e7b:       test   %eax,%eax
    0.00 :   ffffffff81326e7d:       jne    ffffffff81326e99 <sched_balance_domains+0x209>
         : 12238            if (time_after_eq(jiffies, sd->last_balance + interval)) {
    0.00 :   ffffffff81326e7f:       mov    0x14e2b3a(%rip),%rax        # ffffffff828099c0 <jiffies_64>
    0.00 :   ffffffff81326e86:       sub    0x48(%r14),%rax
    0.00 :   ffffffff81326e8a:       cmp    %rdx,%rax

After applying this fix, sched_balance_domain() is gone from the profile
and there is a 5% throughput improvement.

[peterz: made it so that redo retains the 'lock' and split out the
         CPU_NEWLY_IDLE change to a separate patch]
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Chen Yu <yu.c.chen@intel.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com>
Reviewed-by: Srikar Dronamraju <srikar@linux.ibm.com>
Tested-by: Mohini Narkhede <mohini.narkhede@intel.com>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
Link: https://patch.msgid.link/6fed119b723c71552943bfe5798c93851b30a361.1762800251.git.tim.c.chen@linux.intel.com
---
 kernel/sched/fair.c | 53 ++++++++++++++++++++++----------------------
 1 file changed, 27 insertions(+), 26 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b4617d6..5feb5a6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11681,6 +11681,21 @@ static void update_lb_imbalance_stat(struct lb_env *env, struct sched_domain *sd
 }
 
 /*
+ * This flag serializes load-balancing passes over large domains
+ * (above the NODE topology level) - only one load-balancing instance
+ * may run at a time, to reduce overhead on very large systems with
+ * lots of CPUs and large NUMA distances.
+ *
+ * - Note that load-balancing passes triggered while another one
+ *   is executing are skipped and not re-tried.
+ *
+ * - Also note that this does not serialize rebalance_domains()
+ *   execution, as non-SD_SERIALIZE domains will still be
+ *   load-balanced in parallel.
+ */
+static atomic_t sched_balance_running = ATOMIC_INIT(0);
+
+/*
  * Check this_cpu to ensure it is balanced within domain. Attempt to move
  * tasks if there is an imbalance.
  */
@@ -11705,6 +11720,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
 		.fbq_type	= all,
 		.tasks		= LIST_HEAD_INIT(env.tasks),
 	};
+	bool need_unlock = false;
 
 	cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
 
@@ -11716,6 +11732,13 @@ redo:
 		goto out_balanced;
 	}
 
+	if (!need_unlock && (sd->flags & SD_SERIALIZE) && idle != CPU_NEWLY_IDLE) {
+		if (!atomic_try_cmpxchg_acquire(&sched_balance_running, 0, 1))
+			goto out_balanced;
+
+		need_unlock = true;
+	}
+
 	group = sched_balance_find_src_group(&env);
 	if (!group) {
 		schedstat_inc(sd->lb_nobusyg[idle]);
@@ -11956,6 +11979,9 @@ out_one_pinned:
 	    sd->balance_interval < sd->max_interval)
 		sd->balance_interval *= 2;
 out:
+	if (need_unlock)
+		atomic_set_release(&sched_balance_running, 0);
+
 	return ld_moved;
 }
 
@@ -12081,21 +12107,6 @@ out_unlock:
 }
 
 /*
- * This flag serializes load-balancing passes over large domains
- * (above the NODE topology level) - only one load-balancing instance
- * may run at a time, to reduce overhead on very large systems with
- * lots of CPUs and large NUMA distances.
- *
- * - Note that load-balancing passes triggered while another one
- *   is executing are skipped and not re-tried.
- *
- * - Also note that this does not serialize rebalance_domains()
- *   execution, as non-SD_SERIALIZE domains will still be
- *   load-balanced in parallel.
- */
-static atomic_t sched_balance_running = ATOMIC_INIT(0);
-
-/*
  * Scale the max sched_balance_rq interval with the number of CPUs in the system.
  * This trades load-balance latency on larger machines for less cross talk.
  */
@@ -12150,7 +12161,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
 	/* Earliest time when we have to do rebalance again */
 	unsigned long next_balance = jiffies + 60*HZ;
 	int update_next_balance = 0;
-	int need_serialize, need_decay = 0;
+	int need_decay = 0;
 	u64 max_cost = 0;
 
 	rcu_read_lock();
@@ -12174,13 +12185,6 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
 		}
 
 		interval = get_sd_balance_interval(sd, busy);
-
-		need_serialize = sd->flags & SD_SERIALIZE;
-		if (need_serialize) {
-			if (atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
-				goto out;
-		}
-
 		if (time_after_eq(jiffies, sd->last_balance + interval)) {
 			if (sched_balance_rq(cpu, rq, sd, idle, &continue_balancing)) {
 				/*
@@ -12194,9 +12198,6 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
 			sd->last_balance = jiffies;
 			interval = get_sd_balance_interval(sd, busy);
 		}
-		if (need_serialize)
-			atomic_set_release(&sched_balance_running, 0);
-out:
 		if (time_after(next_balance, sd->last_balance + interval)) {
 			next_balance = sd->last_balance + interval;
 			update_next_balance = 1;
Re: [tip: sched/core] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
Posted by Shrikanth Hegde 2 months, 3 weeks ago
Hi Peter.

On 11/14/25 5:49 PM, tip-bot2 for Tim Chen wrote:
> The following commit has been merged into the sched/core branch of tip:
> 
> Commit-ID:     2265c5d4deeff3bfe4580d9ffe718fd80a414cac
> Gitweb:        https://git.kernel.org/tip/2265c5d4deeff3bfe4580d9ffe718fd80a414cac
> Author:        Tim Chen <tim.c.chen@linux.intel.com>
> AuthorDate:    Mon, 10 Nov 2025 10:47:35 -08:00
> Committer:     Peter Zijlstra <peterz@infradead.org>
> CommitterDate: Fri, 14 Nov 2025 13:03:05 +01:00
> 
> sched/fair: Skip sched_balance_running cmpxchg when balance is not due
> 
> 

   
> +	if (!need_unlock && (sd->flags & SD_SERIALIZE) && idle != CPU_NEWLY_IDLE) {
> +		if (!atomic_try_cmpxchg_acquire(&sched_balance_running, 0, 1))

This should be atomic_cmpxchg_acquire?

I booted the system with latest sched/core and it crashes at the boot.

BUG: Kernel NULL pointer dereference on read at 0x00000000
Faulting instruction address: 0xc0000000001db57c
Oops: Kernel access of bad area, sig: 7 [#1]
LE PAGE_SIZE=64K MMU=Radix  SMP NR_CPUS=8192 NUMA pSeries
Modules linked in:
CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Not tainted 6.18.0-rc3+ #242 PREEMPT(lazy)
NIP [c0000000001db57c] sched_balance_rq+0x560/0x92c
LR [c0000000001db198] sched_balance_rq+0x17c/0x92c
Call Trace:
[c00000111ffdfd10] [c0000000001db198] sched_balance_rq+0x17c/0x92c (unreliable)
[c00000111ffdfe50] [c0000000001dc598] sched_balance_domains+0x2c4/0x3d0
[c00000111ffdff00] [c000000000168958] handle_softirqs+0x138/0x414
[c00000111ffdffe0] [c000000000017d80] do_softirq_own_stack+0x3c/0x50
[c000000008a57a60] [c000000000168048] __irq_exit_rcu+0x18c/0x1b4
[c000000008a57a90] [c0000000001691a8] irq_exit+0x20/0x38
[c000000008a57ab0] [c000000000028c18] timer_interrupt+0x174/0x394
[c000000008a57b10] [c000000000009f8c] decrementer_common_virt+0x28c/0x290


Bisect pointed to:
git bisect bad 2265c5d4deeff3bfe4580d9ffe718fd80a414cac
# first bad commit: [2265c5d4deeff3bfe4580d9ffe718fd80a414cac] sched/fair: Skip sched_balance_running cmpxchg when balance is not due


I wondered what is really different since the tim's v4 boots fine.
There is try instead in the tip, i think that is messing it since likely
we are dereferencing 0?


With this diff it boots fine.

---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aaa47ece6a8e..01814b10b833 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11841,7 +11841,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
         }
  
         if (!need_unlock && (sd->flags & SD_SERIALIZE)) {
-               if (!atomic_try_cmpxchg_acquire(&sched_balance_running, 0, 1))
+               if (!atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
                         goto out_balanced;
  
                 need_unlock = true;
Re: [tip: sched/core] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
Posted by Borislav Petkov 2 months, 3 weeks ago
On Sun, Nov 16, 2025 at 02:26:13AM +0530, Shrikanth Hegde wrote:
> 
> Hi Peter.
> 
> On 11/14/25 5:49 PM, tip-bot2 for Tim Chen wrote:
> > The following commit has been merged into the sched/core branch of tip:
> > 
> > Commit-ID:     2265c5d4deeff3bfe4580d9ffe718fd80a414cac
> > Gitweb:        https://git.kernel.org/tip/2265c5d4deeff3bfe4580d9ffe718fd80a414cac
> > Author:        Tim Chen <tim.c.chen@linux.intel.com>
> > AuthorDate:    Mon, 10 Nov 2025 10:47:35 -08:00
> > Committer:     Peter Zijlstra <peterz@infradead.org>
> > CommitterDate: Fri, 14 Nov 2025 13:03:05 +01:00
> > 
> > sched/fair: Skip sched_balance_running cmpxchg when balance is not due
> > 
> > 
> 
> > +	if (!need_unlock && (sd->flags & SD_SERIALIZE) && idle != CPU_NEWLY_IDLE) {
> > +		if (!atomic_try_cmpxchg_acquire(&sched_balance_running, 0, 1))
> 
> This should be atomic_cmpxchg_acquire?
> 
> I booted the system with latest sched/core and it crashes at the boot.
> 
> BUG: Kernel NULL pointer dereference on read at 0x00000000
> Faulting instruction address: 0xc0000000001db57c
> Oops: Kernel access of bad area, sig: 7 [#1]
> LE PAGE_SIZE=64K MMU=Radix  SMP NR_CPUS=8192 NUMA pSeries
> Modules linked in:
> CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Not tainted 6.18.0-rc3+ #242 PREEMPT(lazy)
> NIP [c0000000001db57c] sched_balance_rq+0x560/0x92c
> LR [c0000000001db198] sched_balance_rq+0x17c/0x92c
> Call Trace:
> [c00000111ffdfd10] [c0000000001db198] sched_balance_rq+0x17c/0x92c (unreliable)
> [c00000111ffdfe50] [c0000000001dc598] sched_balance_domains+0x2c4/0x3d0
> [c00000111ffdff00] [c000000000168958] handle_softirqs+0x138/0x414
> [c00000111ffdffe0] [c000000000017d80] do_softirq_own_stack+0x3c/0x50
> [c000000008a57a60] [c000000000168048] __irq_exit_rcu+0x18c/0x1b4
> [c000000008a57a90] [c0000000001691a8] irq_exit+0x20/0x38
> [c000000008a57ab0] [c000000000028c18] timer_interrupt+0x174/0x394
> [c000000008a57b10] [c000000000009f8c] decrementer_common_virt+0x28c/0x290
> 
> 
> Bisect pointed to:
> git bisect bad 2265c5d4deeff3bfe4580d9ffe718fd80a414cac
> # first bad commit: [2265c5d4deeff3bfe4580d9ffe718fd80a414cac] sched/fair: Skip sched_balance_running cmpxchg when balance is not due

Dammit, I spent a whole day bisecting exactly the same issue and I missed your
mail.

Oh well, it is fixed now... should pay more attention next time.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
Re: [tip: sched/core] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
Posted by Tim Chen 2 months, 3 weeks ago
On Sun, 2025-11-16 at 02:26 +0530, Shrikanth Hegde wrote:
> Hi Peter.
> 
> On 11/14/25 5:49 PM, tip-bot2 for Tim Chen wrote:
> > The following commit has been merged into the sched/core branch of tip:
> > 
> > Commit-ID:     2265c5d4deeff3bfe4580d9ffe718fd80a414cac
> > Gitweb:        https://git.kernel.org/tip/2265c5d4deeff3bfe4580d9ffe718fd80a414cac
> > Author:        Tim Chen <tim.c.chen@linux.intel.com>
> > AuthorDate:    Mon, 10 Nov 2025 10:47:35 -08:00
> > Committer:     Peter Zijlstra <peterz@infradead.org>
> > CommitterDate: Fri, 14 Nov 2025 13:03:05 +01:00
> > 
> > sched/fair: Skip sched_balance_running cmpxchg when balance is not due
> > 
> > 
> 
>    
> > +	if (!need_unlock && (sd->flags & SD_SERIALIZE) && idle != CPU_NEWLY_IDLE) {
> > +		if (!atomic_try_cmpxchg_acquire(&sched_balance_running, 0, 1))
> 
> This should be atomic_cmpxchg_acquire?
> 
> I booted the system with latest sched/core and it crashes at the boot.
> 
> BUG: Kernel NULL pointer dereference on read at 0x00000000
> Faulting instruction address: 0xc0000000001db57c
> Oops: Kernel access of bad area, sig: 7 [#1]
> LE PAGE_SIZE=64K MMU=Radix  SMP NR_CPUS=8192 NUMA pSeries
> Modules linked in:
> CPU: 1 UID: 0 PID: 0 Comm: swapper/1 Not tainted 6.18.0-rc3+ #242 PREEMPT(lazy)
> NIP [c0000000001db57c] sched_balance_rq+0x560/0x92c
> LR [c0000000001db198] sched_balance_rq+0x17c/0x92c
> Call Trace:
> [c00000111ffdfd10] [c0000000001db198] sched_balance_rq+0x17c/0x92c (unreliable)
> [c00000111ffdfe50] [c0000000001dc598] sched_balance_domains+0x2c4/0x3d0
> [c00000111ffdff00] [c000000000168958] handle_softirqs+0x138/0x414
> [c00000111ffdffe0] [c000000000017d80] do_softirq_own_stack+0x3c/0x50
> [c000000008a57a60] [c000000000168048] __irq_exit_rcu+0x18c/0x1b4
> [c000000008a57a90] [c0000000001691a8] irq_exit+0x20/0x38
> [c000000008a57ab0] [c000000000028c18] timer_interrupt+0x174/0x394
> [c000000008a57b10] [c000000000009f8c] decrementer_common_virt+0x28c/0x290
> 
> 
> Bisect pointed to:
> git bisect bad 2265c5d4deeff3bfe4580d9ffe718fd80a414cac
> # first bad commit: [2265c5d4deeff3bfe4580d9ffe718fd80a414cac] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
> 
> 
> I wondered what is really different since the tim's v4 boots fine.
> There is try instead in the tip, i think that is messing it since likely
> we are dereferencing 0?
> 
> 
> With this diff it boots fine.
> 
> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index aaa47ece6a8e..01814b10b833 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -11841,7 +11841,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>          }
>   
>          if (!need_unlock && (sd->flags & SD_SERIALIZE)) {
> -               if (!atomic_try_cmpxchg_acquire(&sched_balance_running, 0, 1))

The second argument of atomic_try_cmpxchg_acquire is "int *old" while that of atomic_cmpxchg_acquire
is "int old". So the above check would result in NULL pointer access.  Probably have
to do something like the following to use atomic_try_cmpxchg_acquire()

		int zero = 0;
		if (!atomic_try_cmpxchg_acquire(&sched_balance_running, &zero, 1))
		
Otherwise we should do atomic_cmpxchg_acquire() as below

> +               if (!atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))

Tim

>                          goto out_balanced;
>   
>                  need_unlock = true;
> 
> 
Re: [tip: sched/core] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
Posted by Peter Zijlstra 2 months, 3 weeks ago
On Mon, Nov 17, 2025 at 10:55:07AM -0800, Tim Chen wrote:

> >          if (!need_unlock && (sd->flags & SD_SERIALIZE)) {
> > -               if (!atomic_try_cmpxchg_acquire(&sched_balance_running, 0, 1))
> 
> The second argument of atomic_try_cmpxchg_acquire is "int *old" while that of atomic_cmpxchg_acquire
> is "int old". So the above check would result in NULL pointer access.  Probably have
> to do something like the following to use atomic_try_cmpxchg_acquire()
> 
> 		int zero = 0;
> 		if (!atomic_try_cmpxchg_acquire(&sched_balance_running, &zero, 1))
> 		
> Otherwise we should do atomic_cmpxchg_acquire() as below

Yes, and I'm all mightily miffed all the compilers accept 0 (which is
int) for an 'int *' argument without so much as a warning :/

Nathan, you looked into this a bit yesterday, afaict there is:

  -Wzero-as-null-pointer-constant

which is supposed to issue a warn here, but I can't get clang-22 to
object :/ (GCC doesn't take that warning for C mode, only C++, perhaps
that's the problem?).

And then there is modernize-use-nullptr, but that objects to using NULL,
although I suppose we could do:

  #define NULL nullptr

to get around that. Except I also cannot get clangd to report on the
issue.

Help?
Re: [tip: sched/core] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
Posted by Nathan Chancellor 2 months, 2 weeks ago
On Tue, Nov 18, 2025 at 10:54:32AM +0100, Peter Zijlstra wrote:
> On Mon, Nov 17, 2025 at 10:55:07AM -0800, Tim Chen wrote:
> 
> > >          if (!need_unlock && (sd->flags & SD_SERIALIZE)) {
> > > -               if (!atomic_try_cmpxchg_acquire(&sched_balance_running, 0, 1))
> > 
> > The second argument of atomic_try_cmpxchg_acquire is "int *old" while that of atomic_cmpxchg_acquire
> > is "int old". So the above check would result in NULL pointer access.  Probably have
> > to do something like the following to use atomic_try_cmpxchg_acquire()
> > 
> > 		int zero = 0;
> > 		if (!atomic_try_cmpxchg_acquire(&sched_balance_running, &zero, 1))
> > 		
> > Otherwise we should do atomic_cmpxchg_acquire() as below
> 
> Yes, and I'm all mightily miffed all the compilers accept 0 (which is
> int) for an 'int *' argument without so much as a warning :/

The C11 standard says in 6.3.2.3p3

  An integer constant expression with the value 0, or such an expression
  cast to type void *, is called a null pointer constant.

which seems to indicate to me that

  int *foo = 0;

and

  #define NULL (void *)0
  int *foo = NULL;

have to be treated the same way :/ I think that is a big part of the
motivation to bring nullptr into C in C23:

  https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3042.htm

> Nathan, you looked into this a bit yesterday, afaict there is:
> 
>   -Wzero-as-null-pointer-constant
> 
> which is supposed to issue a warn here, but I can't get clang-22 to
> object :/ (GCC doesn't take that warning for C mode, only C++, perhaps
> that's the problem?).

Right, it appears to be the same case for clang, notice the comment in
diagnoseZeroToNullptrConversion():

  https://github.com/llvm/llvm-project/commit/d7ba86b6bf54740dd4007e65a927151cb9f510b4

That warning should probably be updated to work for C23 but that does
not really help us now because nullptr is not available in older
standards (and I think the support for C23 is only solid in really
recent compilers IIUC).

> Help?

Maybe we could have something like -Wnon-literal-null-conversion-strict
in clang that would behave like -Wnon-literal-null-conversion but warn
even in the literal zero conversion case (i.e., require a 'void *'
cast)... That does not really help GCC though since it does not warn on
any case of implicit conversion to NULL:

https://godbolt.org/z/M5WE5covz

Cheers,
Nathan
Re: [tip: sched/core] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
Posted by Peter Zijlstra 2 months, 2 weeks ago
On Thu, Nov 20, 2025 at 11:26:00PM -0700, Nathan Chancellor wrote:
> On Tue, Nov 18, 2025 at 10:54:32AM +0100, Peter Zijlstra wrote:
> > On Mon, Nov 17, 2025 at 10:55:07AM -0800, Tim Chen wrote:
> > 
> > > >          if (!need_unlock && (sd->flags & SD_SERIALIZE)) {
> > > > -               if (!atomic_try_cmpxchg_acquire(&sched_balance_running, 0, 1))
> > > 
> > > The second argument of atomic_try_cmpxchg_acquire is "int *old" while that of atomic_cmpxchg_acquire
> > > is "int old". So the above check would result in NULL pointer access.  Probably have
> > > to do something like the following to use atomic_try_cmpxchg_acquire()
> > > 
> > > 		int zero = 0;
> > > 		if (!atomic_try_cmpxchg_acquire(&sched_balance_running, &zero, 1))
> > > 		
> > > Otherwise we should do atomic_cmpxchg_acquire() as below
> > 
> > Yes, and I'm all mightily miffed all the compilers accept 0 (which is
> > int) for an 'int *' argument without so much as a warning :/
> 
> The C11 standard says in 6.3.2.3p3
> 
>   An integer constant expression with the value 0, or such an expression
>   cast to type void *, is called a null pointer constant.

That's just bloody ludicrous :-(, I mean, the explicit cast to void*
sure, but the implicit conversion is just idiotic. I realize there's
legacy here, but urgh.

> which seems to indicate to me that
> 
>   int *foo = 0;
> 
> and
> 
>   #define NULL (void *)0
>   int *foo = NULL;
> 
> have to be treated the same way :/ I think that is a big part of the
> motivation to bring nullptr into C in C23:
> 
>   https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3042.htm

Even without that, just dropping the implicit conversion is a giant step
forward.

> > Nathan, you looked into this a bit yesterday, afaict there is:
> > 
> >   -Wzero-as-null-pointer-constant
> > 
> > which is supposed to issue a warn here, but I can't get clang-22 to
> > object :/ (GCC doesn't take that warning for C mode, only C++, perhaps
> > that's the problem?).
> 
> Right, it appears to be the same case for clang, notice the comment in
> diagnoseZeroToNullptrConversion():
> 
>   https://github.com/llvm/llvm-project/commit/d7ba86b6bf54740dd4007e65a927151cb9f510b4
> 
> That warning should probably be updated to work for C23 but that does
> not really help us now because nullptr is not available in older
> standards (and I think the support for C23 is only solid in really
> recent compilers IIUC).

So personally I really don't see a problem with '(void *)0', what if
anything does nullptr actually bring over that?

> > Help?
> 
> Maybe we could have something like -Wnon-literal-null-conversion-strict
> in clang that would behave like -Wnon-literal-null-conversion but warn
> even in the literal zero conversion case (i.e., require a 'void *'
> cast)... That does not really help GCC though since it does not warn on
> any case of implicit conversion to NULL:

Yes, that makes sense. Perhaps we can even convince GCC folks to also
implement this ;-)

Just having it in clang would mean clangd will have the warning and thus
all the LSP enabled editors will provide the warn, which is a win.
Re: [tip: sched/core] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
Posted by Peter Zijlstra 2 months, 3 weeks ago
And now with Nathan as a recipient..

On Tue, Nov 18, 2025 at 10:54:33AM +0100, Peter Zijlstra wrote:
> On Mon, Nov 17, 2025 at 10:55:07AM -0800, Tim Chen wrote:
> 
> > >          if (!need_unlock && (sd->flags & SD_SERIALIZE)) {
> > > -               if (!atomic_try_cmpxchg_acquire(&sched_balance_running, 0, 1))
> > 
> > The second argument of atomic_try_cmpxchg_acquire is "int *old" while that of atomic_cmpxchg_acquire
> > is "int old". So the above check would result in NULL pointer access.  Probably have
> > to do something like the following to use atomic_try_cmpxchg_acquire()
> > 
> > 		int zero = 0;
> > 		if (!atomic_try_cmpxchg_acquire(&sched_balance_running, &zero, 1))
> > 		
> > Otherwise we should do atomic_cmpxchg_acquire() as below
> 
> Yes, and I'm all mightily miffed all the compilers accept 0 (which is
> int) for an 'int *' argument without so much as a warning :/
> 
> Nathan, you looked into this a bit yesterday, afaict there is:
> 
>   -Wzero-as-null-pointer-constant
> 
> which is supposed to issue a warn here, but I can't get clang-22 to
> object :/ (GCC doesn't take that warning for C mode, only C++, perhaps
> that's the problem?).
> 
> And then there is modernize-use-nullptr, but that objects to using NULL,
> although I suppose we could do:
> 
>   #define NULL nullptr
> 
> to get around that. Except I also cannot get clangd to report on the
> issue.
> 
> Help?
Re: [tip: sched/core] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
Posted by K Prateek Nayak 2 months, 3 weeks ago
On 11/18/2025 12:25 AM, Tim Chen wrote:
>> I wondered what is really different since the tim's v4 boots fine.
>> There is try instead in the tip, i think that is messing it since likely
>> we are dereferencing 0?
>>
>>
>> With this diff it boots fine.
>>
>> ---
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index aaa47ece6a8e..01814b10b833 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -11841,7 +11841,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
>>          }
>>   
>>          if (!need_unlock && (sd->flags & SD_SERIALIZE)) {
>> -               if (!atomic_try_cmpxchg_acquire(&sched_balance_running, 0, 1))
> 
> The second argument of atomic_try_cmpxchg_acquire is "int *old" while that of atomic_cmpxchg_acquire
> is "int old". So the above check would result in NULL pointer access.  Probably have
> to do something like the following to use atomic_try_cmpxchg_acquire()
> 
> 		int zero = 0;
> 		if (!atomic_try_cmpxchg_acquire(&sched_balance_running, &zero, 1))

Peter seems to have refreshed tip:sched/core with above but is
there any advantage of using atomic_try_cmpxchg_acquire() as
opposed to plain old atomic_cmpxchg_acquire() and then checking
the old value it returns?

That zero variable serves no other purpose and is a bit of an
eyesore IMO.

> 		
> Otherwise we should do atomic_cmpxchg_acquire() as below
> 
>> +               if (!atomic_cmpxchg_acquire(&sched_balance_running, 0, 1))
> 
-- 
Thanks and Regards,
Prateek
Re: [tip: sched/core] sched/fair: Skip sched_balance_running cmpxchg when balance is not due
Posted by Peter Zijlstra 2 months, 1 week ago
On Tue, Nov 18, 2025 at 12:30:36AM +0530, K Prateek Nayak wrote:
> On 11/18/2025 12:25 AM, Tim Chen wrote:
> >> I wondered what is really different since the tim's v4 boots fine.
> >> There is try instead in the tip, i think that is messing it since likely
> >> we are dereferencing 0?
> >>
> >>
> >> With this diff it boots fine.
> >>
> >> ---
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index aaa47ece6a8e..01814b10b833 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -11841,7 +11841,7 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> >>          }
> >>   
> >>          if (!need_unlock && (sd->flags & SD_SERIALIZE)) {
> >> -               if (!atomic_try_cmpxchg_acquire(&sched_balance_running, 0, 1))
> > 
> > The second argument of atomic_try_cmpxchg_acquire is "int *old" while that of atomic_cmpxchg_acquire
> > is "int old". So the above check would result in NULL pointer access.  Probably have
> > to do something like the following to use atomic_try_cmpxchg_acquire()
> > 
> > 		int zero = 0;
> > 		if (!atomic_try_cmpxchg_acquire(&sched_balance_running, &zero, 1))
> 
> Peter seems to have refreshed tip:sched/core with above but is
> there any advantage of using atomic_try_cmpxchg_acquire() as
> opposed to plain old atomic_cmpxchg_acquire() and then checking
> the old value it returns?
> 
> That zero variable serves no other purpose and is a bit of an
> eyesore IMO.

Yeah, its not ideal. It should generate slightly saner code, since the
compiler can now use the condition codes set by cmpxchg instead of
having to do an extra compare.