The bare asm usage does not work well with LTO.
It leads to errors:
/tmp/ccIHTYT6.s: Assembler messages:
/tmp/ccIHTYT6.s:36: Error: symbol `memmove' is already defined
/tmp/ccIHTYT6.s:37: Error: symbol `memcpy' is already defined
/tmp/ccIHTYT6.s:46: Error: symbol `.Lbackward_copy' is already defined
/tmp/ccIHTYT6.s:54: Error: symbol `memset' is already defined
Wrap the asm in naked functions, which leads to the same object code but
avoids the errors.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
---
tools/include/nolibc/arch-x86_64.h | 80 +++++++++++++++++++++-----------------
1 file changed, 44 insertions(+), 36 deletions(-)
diff --git a/tools/include/nolibc/arch-x86_64.h b/tools/include/nolibc/arch-x86_64.h
index 3c3b703d9b0c..efbea173fb74 100644
--- a/tools/include/nolibc/arch-x86_64.h
+++ b/tools/include/nolibc/arch-x86_64.h
@@ -174,45 +174,53 @@ void __attribute__((weak, noreturn)) __nolibc_naked __no_stack_protector _start(
}
#define NOLIBC_ARCH_HAS_MEMMOVE
-void *memmove(void *dst, const void *src, size_t len);
+__attribute__((weak,unused,section(".text.nolibc_memmove")))
+__nolibc_naked __no_stack_protector
+void *memmove(void *dst __attribute__((unused)),
+ const void *src __attribute__((unused)),
+ size_t len __attribute__((unused)))
+{
+ __asm__ volatile (
+ "movq %rdx, %rcx\n\t"
+ "movq %rdi, %rax\n\t"
+ "movq %rdi, %rdx\n\t"
+ "subq %rsi, %rdx\n\t"
+ "cmpq %rcx, %rdx\n\t"
+ "jb .Lbackward_copy\n\t"
+ "rep movsb\n\t"
+ "retq\n"
+ ".Lbackward_copy:"
+ "leaq -1(%rdi, %rcx, 1), %rdi\n\t"
+ "leaq -1(%rsi, %rcx, 1), %rsi\n\t"
+ "std\n\t"
+ "rep movsb\n\t"
+ "cld\n\t"
+ "retq\n"
+ );
+ __nolibc_naked_epilogue();
+}
#define NOLIBC_ARCH_HAS_MEMCPY
-void *memcpy(void *dst, const void *src, size_t len);
+static __inline__ void *memcpy(void *dst, const void *src, size_t len)
+{
+ return memmove(dst, src, len);
+}
#define NOLIBC_ARCH_HAS_MEMSET
-void *memset(void *dst, int c, size_t len);
-
-__asm__ (
-".section .text.nolibc_memmove_memcpy\n"
-".weak memmove\n"
-".weak memcpy\n"
-"memmove:\n"
-"memcpy:\n"
- "movq %rdx, %rcx\n\t"
- "movq %rdi, %rax\n\t"
- "movq %rdi, %rdx\n\t"
- "subq %rsi, %rdx\n\t"
- "cmpq %rcx, %rdx\n\t"
- "jb .Lbackward_copy\n\t"
- "rep movsb\n\t"
- "retq\n"
-".Lbackward_copy:"
- "leaq -1(%rdi, %rcx, 1), %rdi\n\t"
- "leaq -1(%rsi, %rcx, 1), %rsi\n\t"
- "std\n\t"
- "rep movsb\n\t"
- "cld\n\t"
- "retq\n"
-
-".section .text.nolibc_memset\n"
-".weak memset\n"
-"memset:\n"
- "xchgl %eax, %esi\n\t"
- "movq %rdx, %rcx\n\t"
- "pushq %rdi\n\t"
- "rep stosb\n\t"
- "popq %rax\n\t"
- "retq\n"
-);
+__attribute__((weak,unused,section(".text.nolibc_memset")))
+__nolibc_naked __no_stack_protector
+void *memset(void *dst __attribute__((unused)), int c __attribute__((unused)),
+ size_t len __attribute__((unused)))
+{
+ __asm__ volatile (
+ "xchgl %eax, %esi\n\t"
+ "movq %rdx, %rcx\n\t"
+ "pushq %rdi\n\t"
+ "rep stosb\n\t"
+ "popq %rax\n\t"
+ "retq\n"
+ );
+ __nolibc_naked_epilogue();
+}
#endif /* _NOLIBC_ARCH_X86_64_H */
--
2.46.0
On Sat, Aug 10, 2024 at 12:54:46PM +0200, Thomas Weißschuh wrote:
> +__attribute__((weak,unused,section(".text.nolibc_memmove")))
> +__nolibc_naked __no_stack_protector
> +void *memmove(void *dst __attribute__((unused)),
> + const void *src __attribute__((unused)),
> + size_t len __attribute__((unused)))
> +{
> + __asm__ volatile (
> + "movq %rdx, %rcx\n\t"
> + "movq %rdi, %rax\n\t"
> + "movq %rdi, %rdx\n\t"
> + "subq %rsi, %rdx\n\t"
> + "cmpq %rcx, %rdx\n\t"
> + "jb .Lbackward_copy\n\t"
> + "rep movsb\n\t"
> + "retq\n"
> + ".Lbackward_copy:"
> + "leaq -1(%rdi, %rcx, 1), %rdi\n\t"
> + "leaq -1(%rsi, %rcx, 1), %rsi\n\t"
> + "std\n\t"
> + "rep movsb\n\t"
> + "cld\n\t"
> + "retq\n"
> + );
> + __nolibc_naked_epilogue();
> +}
NAK for this patch.
This approach appears highly dangerous, particularly when the compiler
inlines the call. When using inline assembly within a function, it's
crucial to define proper constraints and a clobber list to ensure the
arguments are correctly bound to the inline assembly.
Moreover, as it stands, there is no indication to the compiler that the
inline assembly modifies registers such as %rax, %rdi, %rsi, %rdx, %rcx,
or "memory", which could lead to unpredictable behavior.
Unfortunately, I can't spend more time on this right now as I'm
currently traveling. I'll get back to it later when I'm in transit.
--
Ammar Faizi
On 2024-08-10 19:12:25+0000, Ammar Faizi wrote:
> On Sat, Aug 10, 2024 at 12:54:46PM +0200, Thomas Weißschuh wrote:
> > +__attribute__((weak,unused,section(".text.nolibc_memmove")))
> > +__nolibc_naked __no_stack_protector
> > +void *memmove(void *dst __attribute__((unused)),
> > + const void *src __attribute__((unused)),
> > + size_t len __attribute__((unused)))
> > +{
> > + __asm__ volatile (
> > + "movq %rdx, %rcx\n\t"
> > + "movq %rdi, %rax\n\t"
> > + "movq %rdi, %rdx\n\t"
> > + "subq %rsi, %rdx\n\t"
> > + "cmpq %rcx, %rdx\n\t"
> > + "jb .Lbackward_copy\n\t"
> > + "rep movsb\n\t"
> > + "retq\n"
> > + ".Lbackward_copy:"
> > + "leaq -1(%rdi, %rcx, 1), %rdi\n\t"
> > + "leaq -1(%rsi, %rcx, 1), %rsi\n\t"
> > + "std\n\t"
> > + "rep movsb\n\t"
> > + "cld\n\t"
> > + "retq\n"
> > + );
> > + __nolibc_naked_epilogue();
> > +}
>
> NAK for this patch.
Thanks for the feedback!
(I'm not an assembler programmer, so regard my notes with a grain of salt)
> This approach appears highly dangerous, particularly when the compiler
> inlines the call. When using inline assembly within a function, it's
> crucial to define proper constraints and a clobber list to ensure the
> arguments are correctly bound to the inline assembly.
Aren't the constraints a feature of Extended Asm?
This is a Basic Asm block.
Indeed naked functions only support Basic Asm, so there is no way to
explicitly bind arguments to their registers.
Looking at the object code for various configurations on both gcc and clang
always shows the same object code.
(Although GCC adds a "ud2" after the "ret")
> Moreover, as it stands, there is no indication to the compiler that the
> inline assembly modifies registers such as %rax, %rdi, %rsi, %rdx, %rcx,
> or "memory", which could lead to unpredictable behavior.
> Unfortunately, I can't spend more time on this right now as I'm
> currently traveling. I'll get back to it later when I'm in transit.
There is no urgency on this, I'll wait on your further feedback.
Thanks again,
Thomas
Hi Thomas,
On Sat, Aug 10, 2024 at 02:37:19PM +0200, Thomas Weißschuh wrote:
> On 2024-08-10 19:12:25+0000, Ammar Faizi wrote:
> > On Sat, Aug 10, 2024 at 12:54:46PM +0200, Thomas Weißschuh wrote:
> > > +__attribute__((weak,unused,section(".text.nolibc_memmove")))
> > > +__nolibc_naked __no_stack_protector
> > > +void *memmove(void *dst __attribute__((unused)),
> > > + const void *src __attribute__((unused)),
> > > + size_t len __attribute__((unused)))
> > > +{
> > > + __asm__ volatile (
> > > + "movq %rdx, %rcx\n\t"
> > > + "movq %rdi, %rax\n\t"
> > > + "movq %rdi, %rdx\n\t"
> > > + "subq %rsi, %rdx\n\t"
> > > + "cmpq %rcx, %rdx\n\t"
> > > + "jb .Lbackward_copy\n\t"
> > > + "rep movsb\n\t"
> > > + "retq\n"
> > > + ".Lbackward_copy:"
> > > + "leaq -1(%rdi, %rcx, 1), %rdi\n\t"
> > > + "leaq -1(%rsi, %rcx, 1), %rsi\n\t"
> > > + "std\n\t"
> > > + "rep movsb\n\t"
> > > + "cld\n\t"
> > > + "retq\n"
> > > + );
> > > + __nolibc_naked_epilogue();
> > > +}
> >
> > NAK for this patch.
>
> Thanks for the feedback!
>
> (I'm not an assembler programmer, so regard my notes with a grain of salt)
>
> > This approach appears highly dangerous, particularly when the compiler
> > inlines the call. When using inline assembly within a function, it's
> > crucial to define proper constraints and a clobber list to ensure the
> > arguments are correctly bound to the inline assembly.
>
> Aren't the constraints a feature of Extended Asm?
> This is a Basic Asm block.
> Indeed naked functions only support Basic Asm, so there is no way to
> explicitly bind arguments to their registers.
That's indeed what is said in GNU docs, but a quick test with gcc 9 and 11
shows me that both accept both naked and extended asm. However clang doesn't
seem happy about it.
The problem here is twofold:
- first, the naked attribute can be emulated if not supported by the
compiler and we'll only have optimize(-Os,-fomit-frame-pointer), and
in this case the compiler does not respect well the registers (I'm
seeing bad code being emitted in callers if I mark the function
static for example).
- second, nothing prevents the compiler from inlining that function
and in this case, as Ammar says, we have no control over what
registers arguments will be placed into since the main purpose of
inlining precisely is to build optimal code that limits register
moves. However, in my tests, it appears that when marked naked,
the compiler never inlines it and even emits a warning if I try
to mark it inline, saying it was already noinline. So maybe naked
implies noinline.
In any case, depending on the availability of naked, we have two clearly
different layouts to deal with :-/
We could imagine not marking it naked and keeping the optimize(Os...)
stuff only, then using extended asm like in other functions maybe.
Otherwise this could require two versions, which is less fun.
> Looking at the object code for various configurations on both gcc and clang
> always shows the same object code.
> (Although GCC adds a "ud2" after the "ret")
It *seems* so as well for me but in the not-really-naked case, you
really have zero control over it. Also one thing it does to me in the
not-naked-case is that registers are not saved by the caller and are
clobbered by the asm statement:
void *memmove2(void *dst __attribute__((unused)),
void *src __attribute__((unused)),
size_t len __attribute__((unused)))
{
memmove(dst, src, len);
memmove(src, dst, len);
return 0;
}
Gives me this with the naked attribute:
0000000000000025 <memmove2>:
25: 55 push %rbp
26: 48 89 f5 mov %rsi,%rbp
29: 50 push %rax
2a: 48 89 14 24 mov %rdx,(%rsp)
2e: e8 00 00 00 00 callq 33 <memmove2+0xe>
33: 48 8b 14 24 mov (%rsp),%rdx
37: 48 89 ef mov %rbp,%rdi
3a: 48 89 c6 mov %rax,%rsi
3d: e8 00 00 00 00 callq 42 <memmove2+0x1d>
42: 31 c0 xor %eax,%eax
44: 5a pop %rdx
45: 5d pop %rbp
46: c3 retq
But the alternate form (optimize(-Os...)):
0000000000000024 <memmove2>:
24: 49 89 f0 mov %rsi,%r8
27: e8 00 00 00 00 callq 2c <memmove2+0x8>
2c: 48 89 fe mov %rdi,%rsi
2f: 4c 89 c7 mov %r8,%rdi
32: e8 00 00 00 00 callq 37 <memmove2+0x13>
37: 31 c0 xor %eax,%eax
39: c3 retq
See how rdi and rdx were expected to be preserved after the first call
but were not? This was with gcc-9.5 (which supports naked but it's for
illustration purposes of the risk of leaving unconstrained asm like this).
I also managed to get clang to complain about the .Lbackward_copy label
being already defined, but I don't know how I managed to do it. I think
we should not leave it as a global label like this and instead just use
the regular numbered labels if the asm is inlined.
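For illustration, GNU as numbered local labels can be defined any number of times; `1f` refers to the nearest following `1:` and `1b` to the nearest preceding one, so duplicated copies of the asm no longer clash. A sketch of that change to the original snippet (untested):

```
	"cmpq %rcx, %rdx\n\t"
	"jb   1f\n\t"        /* 1f = nearest following '1:' label */
	"rep movsb\n\t"
	"retq\n"
	"1:\n\t"             /* numbered local labels may be redefined */
	"leaq -1(%rdi, %rcx, 1), %rdi\n\t"
	"leaq -1(%rsi, %rcx, 1), %rsi\n\t"
	"std\n\t"
	"rep movsb\n\t"
	"cld\n\t"
	"retq\n"
```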
Also I'm wondering why there are errors about memcpy and memmove being
already defined with -flto, because these ones are marked "weak". Maybe
we need to add something else, that gcc emits with the functions when
using lto ?
Just my two cents,
Willy
On 2024-08-10 16:35:56+0000, Willy Tarreau wrote:
> On Sat, Aug 10, 2024 at 02:37:19PM +0200, Thomas Weißschuh wrote:
> > On 2024-08-10 19:12:25+0000, Ammar Faizi wrote:
> > > On Sat, Aug 10, 2024 at 12:54:46PM +0200, Thomas Weißschuh wrote:
> > > > +__attribute__((weak,unused,section(".text.nolibc_memmove")))
> > > > +__nolibc_naked __no_stack_protector
> > > > +void *memmove(void *dst __attribute__((unused)),
> > > > + const void *src __attribute__((unused)),
> > > > + size_t len __attribute__((unused)))
> > > > +{
> > > > + __asm__ volatile (
> > > > + "movq %rdx, %rcx\n\t"
> > > > + "movq %rdi, %rax\n\t"
> > > > + "movq %rdi, %rdx\n\t"
> > > > + "subq %rsi, %rdx\n\t"
> > > > + "cmpq %rcx, %rdx\n\t"
> > > > + "jb .Lbackward_copy\n\t"
> > > > + "rep movsb\n\t"
> > > > + "retq\n"
> > > > + ".Lbackward_copy:"
> > > > + "leaq -1(%rdi, %rcx, 1), %rdi\n\t"
> > > > + "leaq -1(%rsi, %rcx, 1), %rsi\n\t"
> > > > + "std\n\t"
> > > > + "rep movsb\n\t"
> > > > + "cld\n\t"
> > > > + "retq\n"
> > > > + );
> > > > + __nolibc_naked_epilogue();
> > > > +}
> > >
> > > NAK for this patch.
> >
> > Thanks for the feedback!
> >
> > (I'm not an assembler programmer, so regard my notes with a grain of salt)
> >
> > > This approach appears highly dangerous, particularly when the compiler
> > > inlines the call. When using inline assembly within a function, it's
> > > crucial to define proper constraints and a clobber list to ensure the
> > > arguments are correctly bound to the inline assembly.
> >
> > Aren't the constraints a feature of Extended Asm?
> > This is a Basic Asm block.
> > Indeed naked functions only support Basic Asm, so there is no way to
> > explicitly bind arguments to their registers.
>
> That's indeed what is said in GNU docs, but a quick test with gcc 9 and 11
> shows me that both accept both naked and extended asm. However clang doesn't
> seem happy about it.
>
> The problem here is twofold:
> - first, the naked attribute can be emulated if not supported by the
> compiler and we'll only have optimize(-Os,-fomit-frame-pointer), and
> in this case the compiler does not respect well the registers (I'm
> seeing bad code being emitted in callers if I mark the function
> static for example).
Ack.
Would it help to mark the function as non-inlineable?
> - second, nothing prevents the compiler from inlining that function
> and in this case, as Ammar says, we have no control over what
> registers arguments will be placed into since the main purpose of
> inlining precisely is to build optimal code that limits register
> moves. However, in my tests, it appears that when marked naked,
> the compiler never inlines it and even emits a warning if I try
> to mark it inline, saying it was already noinline. So maybe naked
> implies noinline.
I did not manage yet to get it inlined.
With __attribute__((always_inline)) gcc even gives:
sysroot/x86_64/include/arch.h:214:1: error: ignoring attribute 'always_inline' because it conflicts with attribute 'noinline' [-Werror=attributes]
> In any case, depending on the availability of naked, we have two clearly
> different layouts to deal with :-/
Indeed...
> We could imagine not marking it naked and keeping the optimize(Os...)
> stuff only, then using extended asm like in other functions maybe.
> Otherwise this could require two versions, which is less fun.
Note: __attribute__((naked)) on gcc x86_64 is supported since gcc 8.1.
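(For reference, such an availability check can be expressed with the standard `__has_attribute` preprocessor operator, available on gcc >= 5 and clang; the macro names below are an illustrative sketch, not nolibc's actual ones:)

```c
/* Fall back gracefully on compilers that predate __has_attribute. */
#ifndef __has_attribute
# define __has_attribute(x) 0
#endif

#if __has_attribute(naked)
# define DEMO_HAS_NAKED 1
#else
# define DEMO_HAS_NAKED 0
#endif

/* Expose the compile-time result so it can be inspected at runtime. */
int demo_has_naked(void)
{
	return DEMO_HAS_NAKED;
}
```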
> > Looking at the object code for various configurations on both gcc and clang
> > always shows the same object code.
> > (Although GCC adds a "ud2" after the "ret")
>
> It *seems* so as well for me but in the not-really-naked case, you
> really have zero control over it. Also one thing it does to me in the
> not-naked-case is that registers are not saved by the caller and are
> clobbered by the asm statement:
>
> void *memmove2(void *dst __attribute__((unused)),
> void *src __attribute__((unused)),
> size_t len __attribute__((unused)))
> {
> memmove(dst, src, len);
> memmove(src, dst, len);
> return 0;
> }
>
> Gives me this with the naked attribute:
> 0000000000000025 <memmove2>:
> 25: 55 push %rbp
> 26: 48 89 f5 mov %rsi,%rbp
> 29: 50 push %rax
> 2a: 48 89 14 24 mov %rdx,(%rsp)
> 2e: e8 00 00 00 00 callq 33 <memmove2+0xe>
> 33: 48 8b 14 24 mov (%rsp),%rdx
> 37: 48 89 ef mov %rbp,%rdi
> 3a: 48 89 c6 mov %rax,%rsi
> 3d: e8 00 00 00 00 callq 42 <memmove2+0x1d>
> 42: 31 c0 xor %eax,%eax
> 44: 5a pop %rdx
> 45: 5d pop %rbp
> 46: c3 retq
>
> But the alternate form (optimize(-Os...)):
> 0000000000000024 <memmove2>:
> 24: 49 89 f0 mov %rsi,%r8
> 27: e8 00 00 00 00 callq 2c <memmove2+0x8>
> 2c: 48 89 fe mov %rdi,%rsi
> 2f: 4c 89 c7 mov %r8,%rdi
> 32: e8 00 00 00 00 callq 37 <memmove2+0x13>
> 37: 31 c0 xor %eax,%eax
> 39: c3 retq
>
> See how rdi and rdx were expected to be preserved after the first call
> but were not? This was with gcc-9.5 (which supports naked but it's for
> illustration purposes of the risk of leaving unconstrained asm like this).
To be honest, I don't see it. Not enough asm experience I guess, but
I'll look at it some more.
> I also managed to get clang to complain about the .Lbackward_copy label
> being already defined, but I don't know how I managed to do it. I think
> we should not leave it as a global label like this and instead just use
> the regular numbered labels if the asm is inlined.
Ack, this was easy to fix.
> Also I'm wondering why there are errors about memcpy and memmove being
> already defined with -flto, because these ones are marked "weak". Maybe
> we need to add something else, that gcc emits with the functions when
> using lto ?
I think normally .weak would only be resolved by the linker.
What happens here is that the assembler is already presented with
duplicate definitions which it is not prepared to handle.
(See below for an example)
It works on gcc without -flto and on clang with and without -flto.
It seems like a compiler bug to me. If you agree I'll open a ticket
against GCC.
Then we can fix only the label to make it work on clang and wait for a
fixed GCC.
Example:
$ cat a.c
#include "func.h"
int main(void)
{
return 0;
}
$ cat b.c
#include "func.h"
$ cat func.h
__asm__(
".weak foo\n"
"foo:\n"
"retq\n"
);
$ gcc -flto -save-temps a.c b.c
./a.ltrans0.ltrans.s: Assembler messages:
./a.ltrans0.ltrans.s:28: Error: symbol `foo' is already defined
lto-wrapper: fatal error: gcc returned 1 exit status
compilation terminated.
/usr/bin/ld: error: lto-wrapper failed
collect2: error: ld returned 1 exit status
$ cat ./a.ltrans0.ltrans.s
.file "<artificial>"
.text
#APP
.weak foo
foo:
retq
#NO_APP
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl $0, %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
#APP
.weak foo
foo:
retq
.ident "GCC: (GNU) 14.2.1 20240805"
.section .note.GNU-stack,"",@progbits
On Sat, Aug 10, 2024 at 06:04:36PM +0200, Thomas Weißschuh wrote:
> On 2024-08-10 16:35:56+0000, Willy Tarreau wrote:
> > See how rdi and rdx were expected to be preserved after the first call
> > but were not? This was with gcc-9.5 (which supports naked but it's for
> > illustration purposes of the risk of leaving unconstrained asm like this).
>
> To be honest, I don't see it. Not enough asm experience I guess, but
> I'll look at it some more.
Let me shed some light on what is going wrong in that form.
First, let's revisit the System V ABI x86-64 calling convention:
- %rdi holds the 1st argument
- %rsi holds the 2nd argument
- %rdx holds the 3rd argument
- %rax is used for the return value
Take a look at the assembly code that Willy sent. The second memmove()
call is receiving incorrect arguments from the previous memmove() call.
The issue arises because your memmove() function doesn't inform the
compiler that it modifies %rdi and %rdx. As far as the compiler is
concerned, your memmove() function:
Does not alter any registers.
Consequently, the compiler assumes that the values in %rdi and %rdx
remain unchanged after the memmove() function returns. With this
assumption, and since memmove() is within the same translation unit as
the caller, the compiler optimizes the register moves without preserving
the values in %rdi and %rdx.
To properly inform the compiler that certain registers are being
modified, you need to use constraints and a clobber list. Here's an
example (note: this is untested, only compiled on Godbolt):
Link: https://godbolt.org/z/WTz1nvn1h
```
void *memmove(void *dst, const void *src, size_t len)
{
void *rax;
__asm__ goto volatile (
"movq %%rdx, %%rcx\n\t"
"movq %%rdi, %%rax\n\t"
"movq %%rdi, %%rdx\n\t"
"subq %%rsi, %%rdx\n\t"
"cmpq %%rcx, %%rdx\n\t"
"jb .Lbackward_copy\n\t"
"rep movsb\n\t"
"jmp %[out]\n\t"
".Lbackward_copy:\n\t"
"leaq -1(%%rdi, %%rcx, 1), %%rdi\n\t"
"leaq -1(%%rsi, %%rcx, 1), %%rsi\n\t"
"std\n\t"
"rep movsb\n\t"
"cld"
: "+D"(dst), "+S"(src), "+c"(len), "=a"(rax)
:
: "rdx", "memory", "cc"
: out
);
out:
return rax;
}
```
This approach helps the compiler correctly handle the register
modifications, ensuring that the generated code behaves as expected.
Notes:
constraints (binding C variables):
- "+D": Read and Write %rdi
- "+S": Read and Write %rsi
- "+c": Read and Write %rcx
- "=a": Write %rax
clobber list:
- "rdx": Write %rdx
- "memory": Read and Write memory (pointed to by dst and src)
- "cc": Write %rflags (This might be optional for x86-64)
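For what it's worth, the forward/backward selection the asm performs can be cross-checked against a plain-C reference (an illustrative sketch under a hypothetical name, not the proposed nolibc code): copy forward unless the destination starts inside the source's first `len` bytes, in which case copy backward so already-copied bytes are not overwritten.

```c
#include <stddef.h>

static void *memmove_ref(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	/* Same unsigned comparison as the asm's subq/cmpq/jb sequence:
	 * (dst - src) < len means a destructive forward overlap. */
	if ((size_t)(d - s) >= len) {
		while (len--)
			*d++ = *s++;	/* forward copy (rep movsb) */
	} else {
		d += len;
		s += len;
		while (len--)
			*--d = *--s;	/* backward copy (std; rep movsb; cld) */
	}
	return dst;
}
```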
I believe we can do better code than that. I can't pull Linux git tree
as well for now. I'll be fully available for work on Tuesday morning
(13 August 2024).
--
Ammar Faizi
Hi Ammar,
On Sun, Aug 11, 2024 at 02:16:00AM +0700, Ammar Faizi wrote:
> On Sat, Aug 10, 2024 at 06:04:36PM +0200, Thomas Weißschuh wrote:
> > On 2024-08-10 16:35:56+0000, Willy Tarreau wrote:
> > > See how rdi and rdx were expected to be preserved after the first call
> > > but were not? This was with gcc-9.5 (which supports naked but it's for
> > > illustration purposes of the risk of leaving unconstrained asm like this).
> >
> > To be honest, I don't see it. Not enough asm experience I guess, but
> > I'll look at it some more.
>
> Let me shed some light on what is going wrong in that form.
>
> First, let's revisit the System V ABI x86-64 calling convention:
>
> - %rdi holds the 1st argument
> - %rsi holds the 2nd argument
> - %rdx holds the 3rd argument
> - %rax is used for the return value
>
> Take a look at the assembly code that Willy sent. The second memmove()
> call is receiving incorrect arguments from the previous memmove() call.
>
> The issue arises because your memmove() function doesn't inform the
> compiler that it modifies %rdi and %rdx. As far as the compiler is
> concerned, your memmove() function:
>
> Does not alter any registers.
>
> Consequently, the compiler assumes that the values in %rdi and %rdx
> remain unchanged after the memmove() function returns. With this
> assumption, and since memmove() is within the same translation unit as
> the caller, the compiler optimizes the register moves without preserving
> the values in %rdi and %rdx.
>
> To properly inform the compiler that certain registers are being
> modified, you need to use constraints and a clobber list. Here's an
> example (note: this is untested, only compiled on Godbolt):
>
> Link: https://godbolt.org/z/WTz1nvn1h
>
> ```
> void *memmove(void *dst, const void *src, size_t len)
> {
> void *rax;
>
> __asm__ goto volatile (
> "movq %%rdx, %%rcx\n\t"
> "movq %%rdi, %%rax\n\t"
> "movq %%rdi, %%rdx\n\t"
> "subq %%rsi, %%rdx\n\t"
> "cmpq %%rcx, %%rdx\n\t"
> "jb .Lbackward_copy\n\t"
> "rep movsb\n\t"
> "jmp %[out]\n\t"
> ".Lbackward_copy:\n\t"
> "leaq -1(%%rdi, %%rcx, 1), %%rdi\n\t"
> "leaq -1(%%rsi, %%rcx, 1), %%rsi\n\t"
> "std\n\t"
> "rep movsb\n\t"
> "cld"
> : "+D"(dst), "+S"(src), "+c"(len), "=a"(rax)
> :
> : "rdx", "memory", "cc"
> : out
> );
>
> out:
> return rax;
> }
> ```
>
> This approach helps the compiler correctly handle the register
> modifications, ensuring that the generated code behaves as expected.
>
> Notes:
>
> constraints (binding C variables):
> - "+D": Read and Write %rdi
> - "+S": Read and Write %rsi
> - "+c": Read and Write %rcx
> - "=a": Write %rax
>
> clobber list:
> - "rdx": Write %rdx
> - "memory": Read and Write memory (pointed to by dst and src)
> - "cc": Write %rflags (This might be optional for x86-64)
>
> I believe we can do better code than that. I can't pull Linux git tree
> as well for now. I'll be fully available for work on Tuesday morning
> (13 August 2024).
The constraints are trivial, the problem is that they're not supposed to
be used in a naked function. I tried, as you can guess. gcc accepted
them without complaining but clang not at all. However what's interesting
is that the compiler being aware of our inability to inform it about the
clobber list, it did consider everything clobbered and saved the registers
in the caller in this case. That does make sense, otherwise it would be
impossible to use asm in naked functions. However we need to restrict
this use case to true naked functions, not the ones we were doing ourselves
before the existence of the naked attribute.
Willy
On Sat, Aug 10, 2024 at 09:45:45PM +0200, Willy Tarreau wrote:
> The constraints are trivial, the problem is that they're not supposed to
> be used in a naked function. I tried, as you can guess. gcc accepted
> them without complaining but clang not at all. However what's interesting
> is that the compiler being aware of our inability to inform it about the
> clobber list, it did consider everything clobbered and saved the registers
> in the caller in this case. That does make sense, otherwise it would be
> impossible to use asm in naked functions. However we need to restrict
> this use case to true naked functions, not the ones we were doing ourselves
> before the existence of the naked attribute.
Ah, I get it now. Apparently, my ignorance of naked functions led to,
well, not knowing about naked functions. I didn't know about the naked
attribute.
Yeah, clang indeed throws errors while GCC does not:
```
<source>:24:10: error: parameter references not allowed in naked functions
24 | : "+D"(dst), "+S"(src), "+c"(len), "=a"(rax)
| ^
<source>:4:16: note: attribute is here
4 | __attribute__((naked))
| ^
<source>:7:2: error: non-ASM statement in naked function is not supported
7 | void *rax;
| ^
<source>:4:16: note: attribute is here
4 | __attribute__((naked))
| ^
2 errors generated.
Compiler returned: 1
```
For now, I'll wait for your discussion with Thomas to resolve this
issue. And yes, I agree that we should find a solution that doesn't
require maintaining two different versions.
--
Ammar Faizi
On 2024-08-10 18:04:36+0000, Thomas Weißschuh wrote:
> On 2024-08-10 16:35:56+0000, Willy Tarreau wrote:
<snip>
> > Also I'm wondering why there are errors about memcpy and memmove being
> > already defined with -flto, because these ones are marked "weak". Maybe
> > we need to add something else, that gcc emits with the functions when
> > using lto ?
>
> I think normally .weak would only be resolved by the linker.
> What happens here is that the assembler is already presented with
> duplicate definitions which it is not prepared to handle.
> (See below for an example)
>
> It works on gcc without -flto and on clang with and without -flto.
> It seems like a compiler bug to me. If you agree I'll open a ticket
> against GCC.
> Then we can fix only the label to make it work on clang and wait for a
> fixed GCC.
If we really want to support it, we could use naked where available
and fall back to toplevel asm otherwise.
This should work on newer compilers and older ones without -flto.
It looks horrible though.
#define NOLIBC_ARCH_HAS_MEMSET
void *memset(void *dst, int c, size_t len);
#if __nolibc_has_attribute(naked)
__attribute__((weak,naked))
void *memset(void *dst __attribute__((unused)), int c __attribute__((unused)), size_t len __attribute__((unused))) {
#else
__asm__ (
".section .text.nolibc_memset\n"
".weak memset\n"
"memset:\n"
);
#endif
__asm__ (
"xchgl %eax, %esi\n\t"
"movq %rdx, %rcx\n\t"
"pushq %rdi\n\t"
"rep stosb\n\t"
"popq %rax\n\t"
"retq\n"
);
#if __nolibc_has_attribute(naked)
}
#endif
(Or some impenetrable macro wrapper abstraction thereof)
The memcpy / memmove combination could be split up into one real
function and one C inline wrapper and then the same pattern would apply.
But to be honest I'd be fine with not supporting -flto on GCC.
Thomas
On Sat, Aug 10, 2024 at 06:45:19PM +0200, Thomas Weißschuh wrote:
> If we really want to support it, we could use naked where available
> and fall back to toplevel asm otherwise.
> This should work on newer compilers and older ones without -flto.
> It looks horrible though.
>
> #define NOLIBC_ARCH_HAS_MEMSET
> void *memset(void *dst, int c, size_t len);
>
> #if __nolibc_has_attribute(naked)
>
> __attribute__((weak,naked))
> void *memset(void *dst __attribute__((unused)), int c __attribute__((unused)), size_t len __attribute__((unused))) {
>
> #else
>
> __asm__ (
> ".section .text.nolibc_memset\n"
> ".weak memset\n"
> "memset:\n"
> );
>
> #endif
>
> __asm__ (
> "xchgl %eax, %esi\n\t"
> "movq %rdx, %rcx\n\t"
> "pushq %rdi\n\t"
> "rep stosb\n\t"
> "popq %rax\n\t"
> "retq\n"
> );
>
> #if __nolibc_has_attribute(naked)
> }
> #endif
>
> (Or some impenetrable macro wrapper abstraction thereof)
One dangerous part above is that the compiler can reorder toplevel asm
statements, so having a label in one and the code in another may result
in random bugs.
> The memcpy / memmove combination could be split up into one real
> function and one C inline wrapper and then the same pattern would apply.
>
> But to be honest I'd be fine with not supporting -flto on GCC.
That could also be a reasonable solution. The primary goal of nolibc
is to make it easy for developers to develop tests, and for those who
want to create pre-boot code to do so. By nature this code is prone to
bugs. If it becomes totally unreadable for very unlikely cases, it will
cause issues that are hard to debug by the users themselves. It's sure
that supporting a variety of compilers and setups is great, but we should
keep in mind the maintainability goal when thinking about this. I think
that LTO will mostly be used for testing, and in this case I think it's
totally reasonable to restrict the choice of compatible compilers.
Willy
On 2024-08-10 19:00:30+0000, Willy Tarreau wrote:
> On Sat, Aug 10, 2024 at 06:45:19PM +0200, Thomas Weißschuh wrote:
> > If we really want to support it, we could use naked where available
> > and fall back to toplevel asm otherwise.
> > This should work on newer compilers and older ones without -flto.
> > It looks horrible though.
> >
> > #define NOLIBC_ARCH_HAS_MEMSET
> > void *memset(void *dst, int c, size_t len);
> >
> > #if __nolibc_has_attribute(naked)
> >
> > __attribute__((weak,naked))
> > void *memset(void *dst __attribute__((unused)), int c __attribute__((unused)), size_t len __attribute__((unused))) {
> >
> > #else
> >
> > __asm__ (
> > ".section .text.nolibc_memset\n"
> > ".weak memset\n"
> > "memset:\n"
> > );
> >
> > #endif
> >
> > __asm__ (
> > "xchgl %eax, %esi\n\t"
> > "movq %rdx, %rcx\n\t"
> > "pushq %rdi\n\t"
> > "rep stosb\n\t"
> > "popq %rax\n\t"
> > "retq\n"
> > );
> >
> > #if __nolibc_has_attribute(naked)
> > }
> > #endif
> >
> > (Or some impenetrable macro wrapper abstraction thereof)
>
> One dangerous part above is that the compiler can reorder toplevel asm
> statements, so having a label in one and the code in another may result
> in random bugs.
Indeed, let's scratch that then.
> > The memcpy / memmove combination could be split up into one real
> > function and one C inline wrapper and then the same pattern would apply.
> >
> > But to be honest I'd be fine with not supporting -flto on GCC.
>
> That could also be a reasonable solution. The primary goal of nolibc
> is to make it easy for developers to develop tests, and for those who
> want to create pre-boot code to do so. By nature this code is prone to
> bugs. If it becomes totally unreadable for very unlikely cases, it will
> cause issues that are hard to debug by the users themselves. It's sure
> that supporting a variety of compilers and setups is great, but we should
> keep in mind the maintainability goal when thinking about this. I think
> that LTO will mostly be used for testing, and in this case I think it's
> totally reasonable to restrict the choice of compatible compilers.
IMO LTO is useful for nolibc to reduce the size of the generated binaries.
For example, the custom function sections that can be combined with
--gc-sections to get rid of unused functions are implicit in LTO.
And on top of that it can perform other size optimizations.
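The two routes to dead-code elimination can be sketched as build recipes (the demo source file and output paths are made up for illustration, and the actual size savings depend on the toolchain):

```shell
# demo translation unit with one function that nothing references
cat > /tmp/nolibc_lto_demo.c <<'EOF'
int unused_fn(void) { return 42; }
int main(void) { return 0; }
EOF

# explicit route: per-function sections plus linker garbage collection
gcc -Os -ffunction-sections -fdata-sections -Wl,--gc-sections \
    -o /tmp/demo_gc /tmp/nolibc_lto_demo.c

# implicit route: LTO prunes the unused function at link time on its own
gcc -Os -flto -o /tmp/demo_lto /tmp/nolibc_lto_demo.c
```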
But the complications for GCC are not worth it.
And there seem to be general problems with LTO on GCC anyway, as evidenced
by the constructor and duplicate-asm bugs.
After this discussion I think we should also change the __nolibc_naked
attribute back to __nolibc_entrypoint as it was before.
It is indeed not a replacement for a proper "naked".
What do you think about me rewriting the existing commits on nolibc-next
to fix this up?
Thomas
Hi Thomas,

On Mon, Aug 12, 2024 at 07:01:26PM +0200, Thomas Weißschuh wrote:
> > > The memcpy / memmove combination could be split up into one real
> > > function and one C inline wrapper and then the same pattern would apply.
> > >
> > > But to be honest I'd be fine with not supporting -flto on GCC.
> >
> > That could also be a reasonable solution. The primary goal of nolibc
> > is to make it easy for developers to develop tests, and for those who
> > want to create pre-boot code to do so. By nature this code is prone to
> > bugs. If it becomes totally unreadable for very unlikely cases, it will
> > cause issues that are hard to debug by the users themselves. It's sure
> > that supporting a variety of compilers and setups is great, but we should
> > keep in mind the maintainability goal when thinking about this. I think
> > that LTO will mostly be used for testing, and in this case I think it's
> > totally reasonable to restrict the choice of compatible compilers.
>
> IMO LTO is useful for nolibc to reduce the size of the generated binaries.
> For example the custom function sections that then can be combined with
> --gc-sections to get rid of unused functions is all implicit in LTO.
> And on top of that it can perform other size optimizations.

Well, probably, but my experience of it till now has been quite negative,
between stuff that doesn't build and stuff that takes ages to link on a
single core, for most of the time tiny to no gain at all :-/ But I admit
that in our case here at least we shouldn't witness long link times.

> But the complications for GCC are not worth it.
> And there seem general with LTO on GCC anyways as evidenced by the
> constructor and duplicate-asm bugs.

Yes, and the trouble it will cause will result in a lot of head-scratching
for those affected.

> After this discussion I think we should also change the __nolibc_naked
> attribute back to __nolibc_entrypoint as it was before.
> It is indeed not a replacement for a proper "naked".

Indeed, it sort of emulates it but does not have the exact same properties
as we've seen.

> What do you think about me rewriting the existing commits on nolibc-next
> to fix this up?

I'm obviously fine with this :-)

Thank you!
Willy