Use atomic_try_cmpxchg() instead of atomic_cmpxchg(*ptr, old, new) == old
in native_stop_other_cpus(). On x86 the CMPXCHG instruction returns success
in the ZF flag, so this change saves a compare after CMPXCHG. Together
with a small code reorder, the generated asm code improves from:
74: 8b 05 00 00 00 00 mov 0x0(%rip),%eax
7a: 41 54 push %r12
7c: 55 push %rbp
7d: 65 8b 2d 00 00 00 00 mov %gs:0x0(%rip),%ebp
84: 53 push %rbx
85: 85 c0 test %eax,%eax
87: 75 71 jne fa <native_stop_other_cpus+0x8a>
89: b8 ff ff ff ff mov $0xffffffff,%eax
8e: f0 0f b1 2d 00 00 00 lock cmpxchg %ebp,0x0(%rip)
95: 00
96: 83 f8 ff cmp $0xffffffff,%eax
99: 75 5f jne fa <native_stop_other_cpus+0x8a>
to:
74: 8b 05 00 00 00 00 mov 0x0(%rip),%eax
7a: 85 c0 test %eax,%eax
7c: 0f 85 84 00 00 00 jne 106 <native_stop_other_cpus+0x96>
82: 41 54 push %r12
84: b8 ff ff ff ff mov $0xffffffff,%eax
89: 55 push %rbp
8a: 53 push %rbx
8b: 65 8b 1d 00 00 00 00 mov %gs:0x0(%rip),%ebx
92: f0 0f b1 1d 00 00 00 lock cmpxchg %ebx,0x0(%rip)
99: 00
9a: 75 5e jne fa <native_stop_other_cpus+0x8a>
Please note early exit and lack of CMP after CMPXCHG.
No functional change intended.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
---
arch/x86/kernel/smp.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 96a771f9f930..2908e063d7d8 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -148,14 +148,16 @@ static int register_stop_handler(void)
static void native_stop_other_cpus(int wait)
{
- unsigned int cpu = smp_processor_id();
+ unsigned int old_cpu, this_cpu;
unsigned long flags, timeout;
if (reboot_force)
return;
/* Only proceed if this is the first CPU to reach this code */
- if (atomic_cmpxchg(&stopping_cpu, -1, cpu) != -1)
+ old_cpu = -1;
+ this_cpu = smp_processor_id();
+ if (!atomic_try_cmpxchg(&stopping_cpu, &old_cpu, this_cpu))
return;
/* For kexec, ensure that offline CPUs are out of MWAIT and in HLT */
@@ -186,7 +188,7 @@ static void native_stop_other_cpus(int wait)
* NMIs.
*/
cpumask_copy(&cpus_stop_mask, cpu_online_mask);
- cpumask_clear_cpu(cpu, &cpus_stop_mask);
+ cpumask_clear_cpu(this_cpu, &cpus_stop_mask);
if (!cpumask_empty(&cpus_stop_mask)) {
apic_send_IPI_allbutself(REBOOT_VECTOR);
@@ -210,6 +212,8 @@ static void native_stop_other_cpus(int wait)
* CPUs to stop.
*/
if (!smp_no_nmi_ipi && !register_stop_handler()) {
+ unsigned int cpu;
+
pr_emerg("Shutting down cpus with NMI\n");
for_each_cpu(cpu, &cpus_stop_mask)
--
2.41.0
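For readers who haven't met the primitive: atomic_try_cmpxchg(v, &old, new)
attempts the exchange and, on failure, writes the value it actually observed
back through 'old'. Below is a minimal sketch of the generic fallback
semantics in kernel C (illustrative only; the real x86 implementation
instead tests the ZF flag that CMPXCHG sets, which is exactly what saves
the extra CMP shown in the changelog above):

static __always_inline bool
try_cmpxchg_sketch(atomic_t *v, int *old, int new)
{
	int r, o = *old;

	r = atomic_cmpxchg(v, o, new);
	if (unlikely(r != o))
		*old = r;	/* failed: report the value we observed */
	return likely(r == o);
}

The caller can then retry with the updated 'old' in a loop, or, as in this
patch, simply bail out after losing the race.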
On 11/14/23 08:43, Uros Bizjak wrote:
> Use atomic_try_cmpxchg() instead of atomic_cmpxchg(*ptr, old, new) == old
> in native_stop_other_cpus(). On x86 the CMPXCHG instruction returns success
> in the ZF flag, so this change saves a compare after CMPXCHG. Together
> with a small code reorder, the generated asm code improves from:
>
> [ ... asm listings trimmed; quoted in full in the patch above ... ]
>
> Please note early exit and lack of CMP after CMPXCHG.

Uros, I really do appreciate that you are trying to optimize these
paths. But the thing we have to balance is the _need_ for optimization
with the chance that this will break something.

This is about as much of a slow path as we have in the kernel. It's
only used at shutdown, right? That means this is one of the places in
the kernel that least needs optimization. It can only possibly get run
once per boot.

So, the benefit is that it might make this code a few cycles faster. In
practice, it might not even be measurably faster.

On the other hand, this is relatively untested and also makes the C
code more complicated.

Is there some other side benefit that I'm missing here? Applying this
patch doesn't seem to have a great risk/reward ratio.
On Fri, Nov 17, 2023 at 5:18 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 11/14/23 08:43, Uros Bizjak wrote:
> > [ ... ]
>
> Uros, I really do appreciate that you are trying to optimize these
> paths. But the thing we have to balance is the _need_ for optimization
> with the chance that this will break something.
>
> [ ... ]
>
> Is there some other side benefit that I'm missing here? Applying this
> patch doesn't seem to have a great risk/reward ratio.

Yes, in addition to better asm code, I think that the use of the magic
constant (-1) is not descriptive at all. I tried to make this code look
like nmi_panic() from kernel/panic.c, which has similar functionality,
and to show that this constant belongs to old_cpu (the same as in
nmi_panic()).

Also, from converting many cmpxchg uses to try_cmpxchg, it has become
evident that in cases like this (usage in "if" clauses) the correct
locking primitive is try_cmpxchg. Additionally, in this particular case
it is not the speed, but a small code-size saving that can be achieved
with the same functionality.

Thanks,
Uros.
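For reference, the nmi_panic() pattern Uros mentions looks roughly like
this in kernel/panic.c around the time of this thread (an abridged
sketch, not a verbatim copy); note how PANIC_CPU_INVALID gives -1 a
symbolic name and is reused in the atomic_t initializer, which is the
point Dave picks up on below:

#define PANIC_CPU_INVALID	-1

atomic_t panic_cpu = ATOMIC_INIT(PANIC_CPU_INVALID);

void nmi_panic(struct pt_regs *regs, const char *msg)
{
	int old_cpu, this_cpu;

	old_cpu = PANIC_CPU_INVALID;
	this_cpu = raw_smp_processor_id();

	/* Only the first CPU to get here actually panics. */
	if (atomic_try_cmpxchg(&panic_cpu, &old_cpu, this_cpu))
		panic("%s", msg);
	else if (old_cpu != this_cpu)
		nmi_panic_self_stop(regs);
}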
On 11/17/23 08:37, Uros Bizjak wrote:
> On Fri, Nov 17, 2023 at 5:18 PM Dave Hansen <dave.hansen@intel.com> wrote:
>> Is there some other side benefit that I'm missing here? Applying this
>> patch doesn't seem to have a great risk/reward ratio.
>
> Yes, in addition to better asm code, I think that the use of the magic
> constant (-1) is not descriptive at all. I tried to make this code look
> like nmi_panic() from kernel/panic.c, which has similar functionality,
> and to show that this constant belongs to old_cpu (the same as in
> nmi_panic()).

I guess it's a step in the direction of nmi_panic(). But honestly, it
doesn't go far enough, for me at least. The nice part about nmi_panic()
is that it gives -1 a nice symbolic name and uses that name in the
definition of the atomic_t.

> Also, from converting many cmpxchg uses to try_cmpxchg, it has become
> evident that in cases like this (usage in "if" clauses) the correct
> locking primitive is try_cmpxchg. Additionally, in this particular case
> it is not the speed, but a small code-size saving that can be achieved
> with the same functionality.

I think I understand what you're trying to say: using try_cmpxchg can
result in better code generation in general than plain cmpxchg. Thus,
it's more "correct" to use try_cmpxchg in any case where plain cmpxchg
is in use. Right?

I honestly don't think cmpxchg is more or less "correct" than
try_cmpxchg. If you're going to be sending patches like these, I'd
remove this kind of language from your changelogs and discussions,
because it holds zero weight.

Here's what I'm afraid of: this patch is not tested enough. We apply it
and then start getting reports of reboot breaking on some server.
Someone spends two hours bisecting to this patch. We'll wonder after
the fact: was this patch worth it?

I don't think what you have here is more descriptive than what was
there before. It still has a -1 and still doesn't logically connect it
to the 'stopping_cpu' definition. I have no idea how much this was
tested. It doesn't _clearly_ move the needle enough on making the code
better to apply it.

We shouldn't be blindly converting cmpxchg => try_cmpxchg and then
trying to justify it after the fact. I _do_ agree that try_cmpxchg()
leads to better-looking C code *AND* generated code. So, for new x86
code, it seems like the best thing to do. But for old (working) code, I
think it should mostly be left alone.

Maybe you could keep an eye on:

  https://lore.kernel.org/lkml/?q=dfb%3Aatomic_cmpxchg+-dfb%3Aatomic_try_cmpxchg

and remind folks what they should be using, rather than expending
efforts on trying to move existing code over.
On Fri, Nov 17, 2023 at 6:23 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 11/17/23 08:37, Uros Bizjak wrote:
> > [ ... ]
>
> I think I understand what you're trying to say: using try_cmpxchg can
> result in better code generation in general than plain cmpxchg. Thus,
> it's more "correct" to use try_cmpxchg in any case where plain cmpxchg
> is in use. Right?

Yes, when we have the condition "cmpxchg(*ptr, old, new) == old", using
try_cmpxchg not only generates better code (note also how the compiler
creates a fast path through the code due to the likely/unlikely
annotations), but is also *less* error prone. E.g., there were a couple
of instances where the result of cmpxchg was compared to the wrong
variable.

> I honestly don't think cmpxchg is more or less "correct" than
> try_cmpxchg. [ ... ]
>
> Here's what I'm afraid of: this patch is not tested enough. We apply it
> and then start getting reports of reboot breaking on some server.
> Someone spends two hours bisecting to this patch. We'll wonder after
> the fact: was this patch worth it?

Let me just say that after some 50 conversions of code of various
complexity to try_cmpxchg (fixing inconsistencies and plain logic bugs
along the way, removing superfluous memory reads from the loops,
eyeballing the generated asm code and persuading the relevant
maintainers of the benefit of the conversion), I can confidently say
that this particular conversion won't cause trouble.

Even without the proposed conversion, the call to smp_processor_id()
should be moved after the early exit where reboot_force is checked.

> I don't think what you have here is more descriptive than what was
> there before. It still has a -1 and still doesn't logically connect it
> to the 'stopping_cpu' definition. I have no idea how much this was
> tested. It doesn't _clearly_ move the needle enough on making the code
> better to apply it.

I should state that I test my patches by bootstrapping the image in
qemu, and from time to time I put together a bunch of patches, build a
real Fedora kernel rpm and run it for some time as my main kernel. This
last part gives me much confidence when a patch is discussed with the
maintainer. In this particular case I didn't pay much attention to
reboot functionality, but the patched kernel definitely reboots without
problems.
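To illustrate the bug class Uros mentions above (the cmpxchg result
compared against the wrong variable), here is a hypothetical example;
'v', 'old', 'new' and do_something() are made-up names:

	/* Open-coded check: the extra comparison is easy to get wrong. */
	old = atomic_read(&v);
	if (atomic_cmpxchg(&v, old, new) == new)	/* bug: must be 'old' */
		do_something();

	/* try_cmpxchg form: no separate comparison left to get wrong. */
	old = atomic_read(&v);
	if (atomic_try_cmpxchg(&v, &old, new))
		do_something();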
Regarding "-1", I was thinking of introducing a CPU_INVALID #define instead of -1, but at the end, this #define should be eventually introduced as a target-independent patch that puts the new define into the generic part of the kernel source, since it looks like other targets could use this define as well. > We shouldn't be blindly converting cmpxchg=>try_cmpxchg, and then trying > to justify it after the fact. I _do_ agree that try_cmpxchg() leads to > better looking C code *AND* generated code. So, for new x86 code, it > seems like the best thing to do. But for old (working) code, I think it > should mostly be left alone. I'm not here to discuss policies, but the "don't fix it if it ain't broke" policy is a slippery slope, as it can lead to stagnation in the long term. Please read what happened here [1]. [1] https://lwn.net/Articles/488847/ > Maybe you could keep an eye on: > > > https://lore.kernel.org/lkml/?q=dfb%3Aatomic_cmpxchg+-dfb%3Aatomic_try_cmpxchg > > and remind folks what they should be using, rather than expending > efforts on trying to move existing code over. Yes, I'm doing my best to point out optimization opportunities involving try_cmpxchg when I come across one. But this one slipped below my radar... Thanks and BR, Uros.
On 11/22/23 12:18, Uros Bizjak wrote:
> I'm not here to discuss policies, but the "don't fix it if it ain't
> broke" policy is a slippery slope, as it can lead to stagnation in the
> long term. Please read what happened here [1].
>
> [1] https://lwn.net/Articles/488847/

Even though I'm not dying to apply this patch, I will take a look at it
again if it improves in how its changelog is phrased, actually makes
the code more readable, and comes with some kind of statement about how
thoroughly it was tested.