Use atomic_try_cmpxchg() instead of atomic_cmpxchg(*ptr, old, new) == old
in native_stop_other_cpus(). On x86 the CMPXCHG instruction returns success
in the ZF flag, so this change saves a compare after CMPXCHG. Together
with a small code reorder, the generated asm code improves from:
74: 8b 05 00 00 00 00 mov 0x0(%rip),%eax
7a: 41 54 push %r12
7c: 55 push %rbp
7d: 65 8b 2d 00 00 00 00 mov %gs:0x0(%rip),%ebp
84: 53 push %rbx
85: 85 c0 test %eax,%eax
87: 75 71 jne fa <native_stop_other_cpus+0x8a>
89: b8 ff ff ff ff mov $0xffffffff,%eax
8e: f0 0f b1 2d 00 00 00 lock cmpxchg %ebp,0x0(%rip)
95: 00
96: 83 f8 ff cmp $0xffffffff,%eax
99: 75 5f jne fa <native_stop_other_cpus+0x8a>
to:
74: 8b 05 00 00 00 00 mov 0x0(%rip),%eax
7a: 85 c0 test %eax,%eax
7c: 0f 85 84 00 00 00 jne 106 <native_stop_other_cpus+0x96>
82: 41 54 push %r12
84: b8 ff ff ff ff mov $0xffffffff,%eax
89: 55 push %rbp
8a: 53 push %rbx
8b: 65 8b 1d 00 00 00 00 mov %gs:0x0(%rip),%ebx
92: f0 0f b1 1d 00 00 00 lock cmpxchg %ebx,0x0(%rip)
99: 00
9a: 75 5e jne fa <native_stop_other_cpus+0x8a>
Please note early exit and lack of CMP after CMPXCHG.
No functional change intended.
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
---
arch/x86/kernel/smp.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 96a771f9f930..2908e063d7d8 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -148,14 +148,16 @@ static int register_stop_handler(void)
static void native_stop_other_cpus(int wait)
{
- unsigned int cpu = smp_processor_id();
+ unsigned int old_cpu, this_cpu;
unsigned long flags, timeout;
if (reboot_force)
return;
/* Only proceed if this is the first CPU to reach this code */
- if (atomic_cmpxchg(&stopping_cpu, -1, cpu) != -1)
+ old_cpu = -1;
+ this_cpu = smp_processor_id();
+ if (!atomic_try_cmpxchg(&stopping_cpu, &old_cpu, this_cpu))
return;
/* For kexec, ensure that offline CPUs are out of MWAIT and in HLT */
@@ -186,7 +188,7 @@ static void native_stop_other_cpus(int wait)
* NMIs.
*/
cpumask_copy(&cpus_stop_mask, cpu_online_mask);
- cpumask_clear_cpu(cpu, &cpus_stop_mask);
+ cpumask_clear_cpu(this_cpu, &cpus_stop_mask);
if (!cpumask_empty(&cpus_stop_mask)) {
apic_send_IPI_allbutself(REBOOT_VECTOR);
@@ -210,6 +212,8 @@ static void native_stop_other_cpus(int wait)
* CPUs to stop.
*/
if (!smp_no_nmi_ipi && !register_stop_handler()) {
+ unsigned int cpu;
+
pr_emerg("Shutting down cpus with NMI\n");
for_each_cpu(cpu, &cpus_stop_mask)
--
2.41.0
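For readers who haven't met the primitive: atomic_try_cmpxchg(v, &old, new)
attempts the exchange and, on failure, writes the value it actually observed
back through 'old'. Below is a minimal sketch of the generic fallback
semantics in kernel C (illustrative only; the real x86 implementation
instead tests the ZF flag that CMPXCHG sets, which is exactly what saves
the extra CMP shown in the changelog above):

static __always_inline bool
try_cmpxchg_sketch(atomic_t *v, int *old, int new)
{
	int r, o = *old;

	r = atomic_cmpxchg(v, o, new);
	if (unlikely(r != o))
		*old = r;	/* failed: report the value we observed */
	return likely(r == o);
}

The caller can then retry with the updated 'old' in a loop, or, as in this
patch, simply bail out after losing the race.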
On 11/14/23 08:43, Uros Bizjak wrote:
> Use atomic_try_cmpxchg() instead of atomic_cmpxchg(*ptr, old, new) == old
> in native_stop_other_cpus(). On x86 the CMPXCHG instruction returns success
> in the ZF flag, so this change saves a compare after CMPXCHG. Together
> with a small code reorder, the generated asm code improves from:
>
> [ ... asm listings trimmed; quoted in full in the patch above ... ]
>
> Please note early exit and lack of CMP after CMPXCHG.

Uros, I really do appreciate that you are trying to optimize these
paths. But the thing we have to balance is the _need_ for optimization
with the chance that this will break something.

This is about as much of a slow path as we have in the kernel. It's
only used at shutdown, right? That means this is one of the places in
the kernel that least needs optimization. It can only possibly get run
once per boot.

So, the benefit is that it might make this code a few cycles faster. In
practice, it might not even be measurably faster.

On the other hand, this is relatively untested and also makes the C
code more complicated.

Is there some other side benefit that I'm missing here? Applying this
patch doesn't seem to have a great risk/reward ratio.
On Fri, Nov 17, 2023 at 5:18 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 11/14/23 08:43, Uros Bizjak wrote:
> > [ ... ]
>
> Uros, I really do appreciate that you are trying to optimize these
> paths. But the thing we have to balance is the _need_ for optimization
> with the chance that this will break something.
>
> [ ... ]
>
> Is there some other side benefit that I'm missing here? Applying this
> patch doesn't seem to have a great risk/reward ratio.

Yes, in addition to better asm code, I think that the use of the magic
constant (-1) is not descriptive at all. I tried to make this code look
like nmi_panic() from kernel/panic.c, which has similar functionality,
and to show that this constant belongs to old_cpu (the same as in
nmi_panic()).

Also, from converting many cmpxchg uses to try_cmpxchg, it has become
evident that in cases like this (usage in "if" clauses) the correct
locking primitive is try_cmpxchg. Additionally, in this particular case
it is not the speed, but a small code-size saving that can be achieved
with the same functionality.

Thanks,
Uros.
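For reference, the nmi_panic() pattern Uros mentions looks roughly like
this in kernel/panic.c around the time of this thread (an abridged
sketch, not a verbatim copy); note how PANIC_CPU_INVALID gives -1 a
symbolic name and is reused in the atomic_t initializer, which is the
point Dave picks up on below:

#define PANIC_CPU_INVALID	-1

atomic_t panic_cpu = ATOMIC_INIT(PANIC_CPU_INVALID);

void nmi_panic(struct pt_regs *regs, const char *msg)
{
	int old_cpu, this_cpu;

	old_cpu = PANIC_CPU_INVALID;
	this_cpu = raw_smp_processor_id();

	/* Only the first CPU to get here actually panics. */
	if (atomic_try_cmpxchg(&panic_cpu, &old_cpu, this_cpu))
		panic("%s", msg);
	else if (old_cpu != this_cpu)
		nmi_panic_self_stop(regs);
}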
On 11/17/23 08:37, Uros Bizjak wrote:
> On Fri, Nov 17, 2023 at 5:18 PM Dave Hansen <dave.hansen@intel.com> wrote:
>> Is there some other side benefit that I'm missing here? Applying this
>> patch doesn't seem to have a great risk/reward ratio.
>
> Yes, in addition to better asm code, I think that the use of the magic
> constant (-1) is not descriptive at all. I tried to make this code look
> like nmi_panic() from kernel/panic.c, which has similar functionality,
> and to show that this constant belongs to old_cpu (the same as in
> nmi_panic()).

I guess it's a step in the direction of nmi_panic(). But honestly, it
doesn't go far enough, for me at least. The nice part about nmi_panic()
is that it gives -1 a nice symbolic name and uses that name in the
definition of the atomic_t.

> Also, from converting many cmpxchg uses to try_cmpxchg, it has become
> evident that in cases like this (usage in "if" clauses) the correct
> locking primitive is try_cmpxchg. Additionally, in this particular case
> it is not the speed, but a small code-size saving that can be achieved
> with the same functionality.

I think I understand what you're trying to say: using try_cmpxchg can
result in better code generation in general than plain cmpxchg. Thus,
it's more "correct" to use try_cmpxchg in any case where plain cmpxchg
is in use. Right?

I honestly don't think cmpxchg is more or less "correct" than
try_cmpxchg. If you're going to be sending patches like these, I'd
remove this kind of language from your changelogs and discussions,
because it holds zero weight.

Here's what I'm afraid of: this patch is not tested enough. We apply it
and then start getting reports of reboot breaking on some server.
Someone spends two hours bisecting to this patch. We'll wonder after
the fact: was this patch worth it?

I don't think what you have here is more descriptive than what was
there before. It still has a -1 and still doesn't logically connect it
to the 'stopping_cpu' definition. I have no idea how much this was
tested. It doesn't _clearly_ move the needle enough on making the code
better to apply it.

We shouldn't be blindly converting cmpxchg => try_cmpxchg and then
trying to justify it after the fact. I _do_ agree that try_cmpxchg()
leads to better-looking C code *AND* generated code. So, for new x86
code, it seems like the best thing to do. But for old (working) code, I
think it should mostly be left alone.

Maybe you could keep an eye on:

  https://lore.kernel.org/lkml/?q=dfb%3Aatomic_cmpxchg+-dfb%3Aatomic_try_cmpxchg

and remind folks what they should be using, rather than expending
efforts on trying to move existing code over.
On Fri, Nov 17, 2023 at 6:23 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 11/17/23 08:37, Uros Bizjak wrote:
> > [ ... ]
>
> I think I understand what you're trying to say: using try_cmpxchg can
> result in better code generation in general than plain cmpxchg. Thus,
> it's more "correct" to use try_cmpxchg in any case where plain cmpxchg
> is in use. Right?

Yes, when we have the condition "cmpxchg(*ptr, old, new) == old", using
try_cmpxchg not only generates better code (note also how the compiler
creates a fast path through the code due to the likely/unlikely
annotations), but is also *less* error prone. E.g., there were a couple
of instances where the result of cmpxchg was compared to the wrong
variable.

> I honestly don't think cmpxchg is more or less "correct" than
> try_cmpxchg. [ ... ]
>
> Here's what I'm afraid of: this patch is not tested enough. We apply it
> and then start getting reports of reboot breaking on some server.
> Someone spends two hours bisecting to this patch. We'll wonder after
> the fact: was this patch worth it?

Let me just say that after some 50 conversions of code of various
complexity to try_cmpxchg (fixing inconsistencies and plain logic bugs
along the way, removing superfluous memory reads from the loops,
eyeballing the generated asm code and persuading the relevant
maintainers of the benefit of the conversion), I can confidently say
that this particular conversion won't cause trouble.

Even without the proposed conversion, the call to smp_processor_id()
should be moved after the early exit where reboot_force is checked.

> I don't think what you have here is more descriptive than what was
> there before. It still has a -1 and still doesn't logically connect it
> to the 'stopping_cpu' definition. I have no idea how much this was
> tested. It doesn't _clearly_ move the needle enough on making the code
> better to apply it.

I should state that I test my patches by bootstrapping the image in
qemu, and from time to time I put together a bunch of patches, build a
real Fedora kernel rpm and run it for some time as my main kernel. This
last part gives me much confidence when a patch is discussed with the
maintainer. In this particular case I didn't pay much attention to
reboot functionality, but the patched kernel definitely reboots without
problems.
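To illustrate the bug class Uros mentions above (the cmpxchg result
compared against the wrong variable), here is a hypothetical example;
'v', 'old', 'new' and do_something() are made-up names:

	/* Open-coded check: the extra comparison is easy to get wrong. */
	old = atomic_read(&v);
	if (atomic_cmpxchg(&v, old, new) == new)	/* bug: must be 'old' */
		do_something();

	/* try_cmpxchg form: no separate comparison left to get wrong. */
	old = atomic_read(&v);
	if (atomic_try_cmpxchg(&v, &old, new))
		do_something();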
Regarding "-1", I was thinking of introducing a CPU_INVALID #define instead of -1, but at the end, this #define should be eventually introduced as a target-independent patch that puts the new define into the generic part of the kernel source, since it looks like other targets could use this define as well. > We shouldn't be blindly converting cmpxchg=>try_cmpxchg, and then trying > to justify it after the fact. I _do_ agree that try_cmpxchg() leads to > better looking C code *AND* generated code. So, for new x86 code, it > seems like the best thing to do. But for old (working) code, I think it > should mostly be left alone. I'm not here to discuss policies, but the "don't fix it if it ain't broke" policy is a slippery slope, as it can lead to stagnation in the long term. Please read what happened here [1]. [1] https://lwn.net/Articles/488847/ > Maybe you could keep an eye on: > > > https://lore.kernel.org/lkml/?q=dfb%3Aatomic_cmpxchg+-dfb%3Aatomic_try_cmpxchg > > and remind folks what they should be using, rather than expending > efforts on trying to move existing code over. Yes, I'm doing my best to point out optimization opportunities involving try_cmpxchg when I come across one. But this one slipped below my radar... Thanks and BR, Uros.
On 11/22/23 12:18, Uros Bizjak wrote:
> I'm not here to discuss policies, but the "don't fix it if it ain't
> broke" policy is a slippery slope, as it can lead to stagnation in the
> long term. Please read what happened here [1].
>
> [1] https://lwn.net/Articles/488847/

Even though I'm not dying to apply this patch, I will take a look at it
again if it improves in how its changelog is phrased, actually makes
the code more readable, and comes with some kind of statement about how
thoroughly it was tested.