The kernel test robot reported that clang no longer compiles the 32-bit
x86 kernel in some configurations due to commit 95ece48165c1
("locking/atomic/x86: Rewrite x86_32 arch_atomic64_{,fetch}_{and,or,xor}()
functions").
The build fails with
arch/x86/include/asm/cmpxchg_32.h:149:9: error: inline assembly requires more registers than available
and the reason seems to be that not only does the cmpxchg8b instruction
need four fixed registers (EDX:EAX and ECX:EBX), with the emulation
fallback the inline asm also wants a fifth fixed register for the
address (it uses %esi for that, but that's just a software convention
with cmpxchg8b_emu).
Avoiding using another pointer input to the asm (and just forcing it to
use the "0(%esi)" addressing that we end up requiring for the sw
fallback) seems to fix the issue.
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202406230912.F6XFIyA6-lkp@intel.com/
Fixes: 95ece48165c1 ("locking/atomic/x86: Rewrite x86_32 arch_atomic64_{,fetch}_{and,or,xor}() functions")
Link: https://lore.kernel.org/all/202406230912.F6XFIyA6-lkp@intel.com/
Suggested-by: Uros Bizjak <ubizjak@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
Added commit message, and updated the asm to use '%a[ptr]' instead of
writing out the addressing by hand.
Still doing the 'oldp' writeback unconditionally. The code generation
for the case I checked was the same for both clang and gcc, but until
Uros hits me with the big clue-hammer, I think it's the simpler code
that leaves room for potentially better optimizations too.
This falls solidly in the "looks ok to me, but still untested" category
for me. It fixes the clang build issue in my build testing, but I no
longer have a 32-bit test environment, so no actual runtime testing.
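As a stand-alone illustration of what the new constraints amount to (a
hypothetical user-space sketch, 32-bit only, not the kernel macro itself):
the 64-bit old value is tied to EDX:EAX via "+A", the new value goes in
EBX/ECX via "b"/"c", and the single "S" input doubles as the memory
operand through the "%a" modifier, so the compiler never has to find yet
another register for a separate "+m" operand.

  /* Build with -m32: the "A" constraint means the EDX:EAX pair only there.
   * This shows only the native cmpxchg8b path, not the cmpxchg8b_emu
   * fallback that the kernel ALTERNATIVE() patches in. */
  #include <stdint.h>

  static inline uint64_t cmpxchg8b_sketch(volatile uint64_t *ptr,
                                          uint64_t old, uint64_t new_val)
  {
          asm volatile("lock cmpxchg8b %a[ptr]"
                       : "+A" (old)                       /* expected value in EDX:EAX, observed value out */
                       : "b" ((uint32_t)new_val),         /* low half of the new value */
                         "c" ((uint32_t)(new_val >> 32)), /* high half of the new value */
                         [ptr] "S" (ptr)                  /* address in ESI, printed as "(%esi)" */
                       : "memory");
          return old;                                     /* the value observed at *ptr */
  }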
arch/x86/include/asm/cmpxchg_32.h | 28 ++++++++++++----------------
1 file changed, 12 insertions(+), 16 deletions(-)
diff --git a/arch/x86/include/asm/cmpxchg_32.h b/arch/x86/include/asm/cmpxchg_32.h
index ed2797f132ce..4444a8292c7a 100644
--- a/arch/x86/include/asm/cmpxchg_32.h
+++ b/arch/x86/include/asm/cmpxchg_32.h
@@ -88,18 +88,17 @@ static __always_inline bool __try_cmpxchg64_local(volatile u64 *ptr, u64 *oldp,
#define __arch_cmpxchg64_emu(_ptr, _old, _new, _lock_loc, _lock) \
({ \
- union __u64_halves o = { .full = (_old), }, \
- n = { .full = (_new), }; \
+ __u64 o = (_old); \
+ union __u64_halves n = { .full = (_new), }; \
\
asm volatile(ALTERNATIVE(_lock_loc \
"call cmpxchg8b_emu", \
- _lock "cmpxchg8b %[ptr]", X86_FEATURE_CX8) \
- : [ptr] "+m" (*(_ptr)), \
- "+a" (o.low), "+d" (o.high) \
- : "b" (n.low), "c" (n.high), "S" (_ptr) \
+ _lock "cmpxchg8b %a[ptr]", X86_FEATURE_CX8) \
+ : "+A" (o) \
+ : "b" (n.low), "c" (n.high), [ptr] "S" (_ptr) \
: "memory"); \
\
- o.full; \
+ o; \
})
static __always_inline u64 arch_cmpxchg64(volatile u64 *ptr, u64 old, u64 new)
@@ -116,22 +115,19 @@ static __always_inline u64 arch_cmpxchg64_local(volatile u64 *ptr, u64 old, u64
#define __arch_try_cmpxchg64_emu(_ptr, _oldp, _new, _lock_loc, _lock) \
({ \
- union __u64_halves o = { .full = *(_oldp), }, \
- n = { .full = (_new), }; \
+ __u64 o = *(_oldp); \
+ union __u64_halves n = { .full = (_new), }; \
bool ret; \
\
asm volatile(ALTERNATIVE(_lock_loc \
"call cmpxchg8b_emu", \
- _lock "cmpxchg8b %[ptr]", X86_FEATURE_CX8) \
+ _lock "cmpxchg8b %a[ptr]", X86_FEATURE_CX8) \
CC_SET(e) \
- : CC_OUT(e) (ret), \
- [ptr] "+m" (*(_ptr)), \
- "+a" (o.low), "+d" (o.high) \
- : "b" (n.low), "c" (n.high), "S" (_ptr) \
+ : CC_OUT(e) (ret), "+A" (o) \
+ : "b" (n.low), "c" (n.high), [ptr] "S" (_ptr) \
: "memory"); \
\
- if (unlikely(!ret)) \
- *(_oldp) = o.full; \
+ *(_oldp) = o; \
\
likely(ret); \
})
--
2.45.1.209.gc6f12300df
On Wed, Jun 26, 2024 at 3:13 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> The kernel test robot reported that clang no longer compiles the 32-bit
> x86 kernel in some configurations due to commit 95ece48165c1
> ("locking/atomic/x86: Rewrite x86_32 arch_atomic64_{,fetch}_{and,or,xor}()
> functions").
>
> The build fails with
>
> arch/x86/include/asm/cmpxchg_32.h:149:9: error: inline assembly requires more registers than available
>
> and the reason seems to be that not only does the cmpxchg8b instruction
> need four fixed registers (EDX:EAX and ECX:EBX), with the emulation
> fallback the inline asm also wants a fifth fixed register for the
> address (it uses %esi for that, but that's just a software convention
> with cmpxchg8b_emu).
>
> Avoiding using another pointer input to the asm (and just forcing it to
> use the "0(%esi)" addressing that we end up requiring for the sw
> fallback) seems to fix the issue.
>
> Reported-by: kernel test robot <lkp@intel.com>
> Closes: https://lore.kernel.org/oe-kbuild-all/202406230912.F6XFIyA6-lkp@intel.com/
> Fixes: 95ece48165c1 ("locking/atomic/x86: Rewrite x86_32 arch_atomic64_{,fetch}_{and,or,xor}() functions")
> Link: https://lore.kernel.org/all/202406230912.F6XFIyA6-lkp@intel.com/
> Suggested-by: Uros Bizjak <ubizjak@gmail.com>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> ---
>
> Added commit message, and updated the asm to use '%a[ptr]' instead of
> writing out the addressing by hand.
>
> Still doing the 'oldp' writeback unconditionally. The code generation
> for the case I checked was the same for both clang and gcc, but until
> Uros hits me with the big clue-hammer, I think it's the simpler code
> that leaves room for potentially better optimizations too.
You probably want to look at 44fe84459faf1 ("locking/atomic: Fix
atomic_try_cmpxchg() semantics") [1] and the long LKML discussion at
[2].
--quote--
This code is broken with the current implementation, the problem is
with unconditional update of *__po.
In case of success it writes the same value back into *__po, but in
case of cmpxchg success we might have lost ownership of some memory
locations, potentially including what __po has pointed to. The same
holds for the re-read of *__po.
--/quote--
[1] https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=44fe84459faf1a7781595b7c64cd36daf2f2827d
[2] https://lore.kernel.org/lkml/CACT4Y+bG+a0w6j6v1AmBE7fqqMSPyPEm4QimCzCouicmHT8FqA@mail.gmail.com/
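For reference, the pattern that discussion revolved around looks roughly
like this (a reconstructed sketch, not code taken from the thread or the
tree): 'oldp' points into a node that becomes shared the instant the
cmpxchg succeeds, so any later store to, or re-read of, *oldp can touch
memory the caller no longer owns.

  struct node {
          struct node *next;
          /* payload ... */
  };

  /* "Classical lock-free stack push", reconstructed from the description
   * in [2]; try_cmpxchg() is the generic kernel helper, 'top' is the
   * shared stack head. */
  static void stack_push_sketch(struct node **top, struct node *n)
  {
          n->next = READ_ONCE(*top);
          /* '&n->next' is the 'oldp' argument.  Once the push succeeds, 'n'
           * is visible to everybody and may already have been popped and
           * freed, so an unconditional write-back through 'oldp' after a
           * successful exchange would store to memory the pusher no longer
           * owns; that is what 44fe84459faf1 guards against. */
          while (!try_cmpxchg(top, &n->next, n))
                  ;       /* on failure, *oldp was refreshed with the current top */
  }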
Uros.
>
> This falls solidly in the "looks ok to me, but still untested" category
> for me. It fixes the clang build issue in my build testing, but I no
> longer have a 32-bit test environment, so no actual runtime testing.
>
> arch/x86/include/asm/cmpxchg_32.h | 28 ++++++++++++----------------
> 1 file changed, 12 insertions(+), 16 deletions(-)
>
> diff --git a/arch/x86/include/asm/cmpxchg_32.h b/arch/x86/include/asm/cmpxchg_32.h
> index ed2797f132ce..4444a8292c7a 100644
> --- a/arch/x86/include/asm/cmpxchg_32.h
> +++ b/arch/x86/include/asm/cmpxchg_32.h
> @@ -88,18 +88,17 @@ static __always_inline bool __try_cmpxchg64_local(volatile u64 *ptr, u64 *oldp,
>
> #define __arch_cmpxchg64_emu(_ptr, _old, _new, _lock_loc, _lock) \
> ({ \
> - union __u64_halves o = { .full = (_old), }, \
> - n = { .full = (_new), }; \
> + __u64 o = (_old); \
> + union __u64_halves n = { .full = (_new), }; \
> \
> asm volatile(ALTERNATIVE(_lock_loc \
> "call cmpxchg8b_emu", \
> - _lock "cmpxchg8b %[ptr]", X86_FEATURE_CX8) \
> - : [ptr] "+m" (*(_ptr)), \
> - "+a" (o.low), "+d" (o.high) \
> - : "b" (n.low), "c" (n.high), "S" (_ptr) \
> + _lock "cmpxchg8b %a[ptr]", X86_FEATURE_CX8) \
> + : "+A" (o) \
> + : "b" (n.low), "c" (n.high), [ptr] "S" (_ptr) \
> : "memory"); \
> \
> - o.full; \
> + o; \
> })
>
> static __always_inline u64 arch_cmpxchg64(volatile u64 *ptr, u64 old, u64 new)
> @@ -116,22 +115,19 @@ static __always_inline u64 arch_cmpxchg64_local(volatile u64 *ptr, u64 old, u64
>
> #define __arch_try_cmpxchg64_emu(_ptr, _oldp, _new, _lock_loc, _lock) \
> ({ \
> - union __u64_halves o = { .full = *(_oldp), }, \
> - n = { .full = (_new), }; \
> + __u64 o = *(_oldp); \
> + union __u64_halves n = { .full = (_new), }; \
> bool ret; \
> \
> asm volatile(ALTERNATIVE(_lock_loc \
> "call cmpxchg8b_emu", \
> - _lock "cmpxchg8b %[ptr]", X86_FEATURE_CX8) \
> + _lock "cmpxchg8b %a[ptr]", X86_FEATURE_CX8) \
> CC_SET(e) \
> - : CC_OUT(e) (ret), \
> - [ptr] "+m" (*(_ptr)), \
> - "+a" (o.low), "+d" (o.high) \
> - : "b" (n.low), "c" (n.high), "S" (_ptr) \
> + : CC_OUT(e) (ret), "+A" (o) \
> + : "b" (n.low), "c" (n.high), [ptr] "S" (_ptr) \
> : "memory"); \
> \
> - if (unlikely(!ret)) \
> - *(_oldp) = o.full; \
> + *(_oldp) = o; \
> \
> likely(ret); \
> })
> --
> 2.45.1.209.gc6f12300df
>
On Wed, 26 Jun 2024 at 00:39, Uros Bizjak <ubizjak@gmail.com> wrote:
>
> >
> > Still doing the 'oldp' writeback unconditionally. The code generation
> > for the case I checked was the same for both clang and gcc, but until
> > Uros hits me with the big clue-hammer, I think it's the simpler code
> > that leaves room for potentially better optimizations too.
>
> You probably want to look at 44fe84459faf1 ("locking/atomic: Fix
> atomic_try_cmpxchg() semantics") [1] and the long LKML discussion at
> [2].
Christ. That use should be invalid.
The only _atomic_ pointer is "_ptr", not "old". Anybody who gives
something that can change during the operation in "old" is basically
already doing random things.
> --quote--
> This code is broken with the current implementation, the problem is
> with unconditional update of *__po.
I think the only thing broken is that quote, and the crazy expectation
that "old" can change.
But obviously, I had completely forgotten that whole discussion from
seven years ago.
I don't actually find a single use of that invalid code sequence where
somebody would pass a non-private pointer as "oldp". So I really think
that part of the whole discussion was bogus to begin with, and
presumably from some other code base.
IOW, I think that example of a "classical lock-free stack push" is just broken.
That said, I can't find a case where it would matter for code
generation (every use will always do a conditional branch based on the
result, so the conditional assignment is practically speaking always
"static" anyway by the time you do branch following).
So I'll just send out a minimal patch with *only* the %esi changes.
Linus
On Wed, Jun 26, 2024 at 9:39 AM Uros Bizjak <ubizjak@gmail.com> wrote:
>
> On Wed, Jun 26, 2024 at 3:13 AM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > The kernel test robot reported that clang no longer compiles the 32-bit
> > x86 kernel in some configurations due to commit 95ece48165c1
> > ("locking/atomic/x86: Rewrite x86_32 arch_atomic64_{,fetch}_{and,or,xor}()
> > functions").
> >
> > The build fails with
> >
> > arch/x86/include/asm/cmpxchg_32.h:149:9: error: inline assembly requires more registers than available
> >
> > and the reason seems to be that not only does the cmpxchg8b instruction
> > need four fixed registers (EDX:EAX and ECX:EBX), with the emulation
> > fallback the inline asm also wants a fifth fixed register for the
> > address (it uses %esi for that, but that's just a software convention
> > with cmpxchg8b_emu).
> >
> > Avoiding using another pointer input to the asm (and just forcing it to
> > use the "0(%esi)" addressing that we end up requiring for the sw
A nit: offset 0 is required only for %ebp, so the above should read "(%esi)".
> > fallback) seems to fix the issue.
> >
> > Reported-by: kernel test robot <lkp@intel.com>
> > Closes: https://lore.kernel.org/oe-kbuild-all/202406230912.F6XFIyA6-lkp@intel.com/
> > Fixes: 95ece48165c1 ("locking/atomic/x86: Rewrite x86_32 arch_atomic64_{,fetch}_{and,or,xor}() functions")
> > Link: https://lore.kernel.org/all/202406230912.F6XFIyA6-lkp@intel.com/
> > Suggested-by: Uros Bizjak <ubizjak@gmail.com>
> > Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> > ---
> >
> > Added commit message, and updated the asm to use '%a[ptr]' instead of
> > writing out the addressing by hand.
> >
> > Still doing the 'oldp' writeback unconditionally. The code generation
> > for the case I checked was the same for both clang and gcc, but until
> > Uros hits me with the big clue-hammer, I think it's the simpler code
> > that leaves room for potentially better optimizations too.
>
> You probably want to look at 44fe84459faf1 ("locking/atomic: Fix
> atomic_try_cmpxchg() semantics") [1] and the long LKML discussion at
> [2].
>
> --quote--
> This code is broken with the current implementation, the problem is
> with unconditional update of *__po.
>
> In case of success it writes the same value back into *__po, but in
> case of cmpxchg success we might have lost ownership of some memory
> locations, potentially including what __po has pointed to. The same
> holds for the re-read of *__po.
> --/quote--
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?id=44fe84459faf1a7781595b7c64cd36daf2f2827d
> [2] https://lore.kernel.org/lkml/CACT4Y+bG+a0w6j6v1AmBE7fqqMSPyPEm4QimCzCouicmHT8FqA@mail.gmail.com/
>
> Uros.
>
> >
> > This falls solidly in the "looks ok to me, but still untested" category
> > for me. It fixes the clang build issue in my build testing, but I no
> > longer have a 32-bit test environment, so no actual runtime testing.
...
> > \
> > - if (unlikely(!ret)) \
> > - *(_oldp) = o.full; \
> > + *(_oldp) = o; \
With the above part changed to:
if (unlikely(!ret)) \
- *(_oldp) = o.full; \
+ *(_oldp) = o; \
Reviewed-and-Tested-by: Uros Bizjak <ubizjak@gmail.com>
Runtime tested with the .config provided by the test robot on qemu-i386
with both the clang and GCC compilers:
LKP: ttyS0: 229: Kernel tests: Boot OK!
Thanks,
Uros.