[RFC PATCH v1 3/5] tools/nolibc: x86-64: Use `rep cmpsb` for `memcmp()`

[RFC PATCH v1 3/5] tools/nolibc: x86-64: Use `rep cmpsb` for `memcmp()`
Posted by Ammar Faizi 2 years, 3 months ago
Simplify memcmp() on the x86-64 arch.

The x86-64 arch has a 'rep cmpsb' instruction, which can be used to
implement the memcmp() function.

    %rdi = source 1
    %rsi = source 2
    %rcx = length

Signed-off-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
---
 tools/include/nolibc/arch-x86_64.h | 19 +++++++++++++++++++
 tools/include/nolibc/string.h      |  2 ++
 2 files changed, 21 insertions(+)

diff --git a/tools/include/nolibc/arch-x86_64.h b/tools/include/nolibc/arch-x86_64.h
index 42f2674ad1ecdd64..6c1b54ba9f774e7b 100644
--- a/tools/include/nolibc/arch-x86_64.h
+++ b/tools/include/nolibc/arch-x86_64.h
@@ -214,4 +214,23 @@ __asm__ (
 	"retq\n"
 );
 
+#define NOLIBC_ARCH_HAS_MEMCMP
+static int memcmp(const void *s1, const void *s2, size_t n)
+{
+	const unsigned char *p1 = s1;
+	const unsigned char *p2 = s2;
+
+	if (!n)
+		return 0;
+
+	__asm__ volatile (
+		"rep cmpsb"
+		: "+D"(p2), "+S"(p1), "+c"(n)
+		: "m"(*(const unsigned char (*)[n])s1),
+		  "m"(*(const unsigned char (*)[n])s2)
+	);
+
+	return p1[-1] - p2[-1];
+}
+
 #endif /* _NOLIBC_ARCH_X86_64_H */
diff --git a/tools/include/nolibc/string.h b/tools/include/nolibc/string.h
index 1bad6121ef8c4ab5..3c941289d5dd0985 100644
--- a/tools/include/nolibc/string.h
+++ b/tools/include/nolibc/string.h
@@ -15,6 +15,7 @@ static void *malloc(size_t len);
  * As much as possible, please keep functions alphabetically sorted.
  */
 
+#ifndef NOLIBC_ARCH_HAS_MEMCMP
 static __attribute__((unused))
 int memcmp(const void *s1, const void *s2, size_t n)
 {
@@ -26,6 +27,7 @@ int memcmp(const void *s1, const void *s2, size_t n)
 	}
 	return c1;
 }
+#endif /* #ifndef NOLIBC_ARCH_HAS_MEMCMP */
 
 static __attribute__((unused))
 void *_nolibc_memcpy_up(void *dst, const void *src, size_t len)
-- 
Ammar Faizi
Re: [RFC PATCH v1 3/5] tools/nolibc: x86-64: Use `rep cmpsb` for `memcmp()`
Posted by Willy Tarreau 2 years, 3 months ago
On Wed, Aug 30, 2023 at 08:57:24PM +0700, Ammar Faizi wrote:
> Simplify memcmp() on the x86-64 arch.
> 
> The x86-64 arch has a 'rep cmpsb' instruction, which can be used to
> implement the memcmp() function.
> 
>     %rdi = source 1
>     %rsi = source 2
>     %rcx = length
> 
> Signed-off-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
> ---
>  tools/include/nolibc/arch-x86_64.h | 19 +++++++++++++++++++
>  tools/include/nolibc/string.h      |  2 ++
>  2 files changed, 21 insertions(+)
> 
> diff --git a/tools/include/nolibc/arch-x86_64.h b/tools/include/nolibc/arch-x86_64.h
> index 42f2674ad1ecdd64..6c1b54ba9f774e7b 100644
> --- a/tools/include/nolibc/arch-x86_64.h
> +++ b/tools/include/nolibc/arch-x86_64.h
> @@ -214,4 +214,23 @@ __asm__ (
>  	"retq\n"
>  );
>  
> +#define NOLIBC_ARCH_HAS_MEMCMP
> +static int memcmp(const void *s1, const void *s2, size_t n)
> +{
> +	const unsigned char *p1 = s1;
> +	const unsigned char *p2 = s2;
> +
> +	if (!n)
> +		return 0;
> +
> +	__asm__ volatile (
> +		"rep cmpsb"
> +		: "+D"(p2), "+S"(p1), "+c"(n)
> +		: "m"(*(const unsigned char (*)[n])s1),
> +		  "m"(*(const unsigned char (*)[n])s2)
> +	);
> +
> +	return p1[-1] - p2[-1];
> +}

Out of curiosity, given that you implemented the 3 other ones directly
in an asm statement, is there a particular reason this one mixes a bit
of C and asm ? It would probably be something around this, in the same
vein:

  memcmp:
    xchg  %esi,%eax   // source1
    mov   %rdx,%rcx   // count
    rep   cmpsb       // source2 in rdi; sets ZF on equal, CF if src1<src2
    seta  %al         // 0 if src2 <= src1, 1 if src2 > src1
    sbb   $0, %al     // 0 if src2 == src1, -1 if src2 < src1, 1 if src2 > src1
    movsx %al, %eax   // sign extend to %eax
    ret

Note that the output logic might need to be revisited; I'm not certain, but
at first glance it looks valid.

Regards,
Willy
RE: [RFC PATCH v1 3/5] tools/nolibc: x86-64: Use `rep cmpsb` for `memcmp()`
Posted by David Laight 2 years, 3 months ago
From: Willy Tarreau
> Sent: 30 August 2023 22:27
> 
> On Wed, Aug 30, 2023 at 08:57:24PM +0700, Ammar Faizi wrote:
> > Simplify memcmp() on the x86-64 arch.
> >
> > The x86-64 arch has a 'rep cmpsb' instruction, which can be used to
> > implement the memcmp() function.
> >
> >     %rdi = source 1
> >     %rsi = source 2
> >     %rcx = length
> >
> > Signed-off-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
> > ---
> >  tools/include/nolibc/arch-x86_64.h | 19 +++++++++++++++++++
> >  tools/include/nolibc/string.h      |  2 ++
> >  2 files changed, 21 insertions(+)
> >
> > diff --git a/tools/include/nolibc/arch-x86_64.h b/tools/include/nolibc/arch-x86_64.h
> > index 42f2674ad1ecdd64..6c1b54ba9f774e7b 100644
> > --- a/tools/include/nolibc/arch-x86_64.h
> > +++ b/tools/include/nolibc/arch-x86_64.h
> > @@ -214,4 +214,23 @@ __asm__ (
> >  	"retq\n"
> >  );
> >
> > +#define NOLIBC_ARCH_HAS_MEMCMP
> > +static int memcmp(const void *s1, const void *s2, size_t n)
> > +{
> > +	const unsigned char *p1 = s1;
> > +	const unsigned char *p2 = s2;
> > +
> > +	if (!n)
> > +		return 0;
> > +
> > +	__asm__ volatile (
> > +		"rep cmpsb"
> > +		: "+D"(p2), "+S"(p1), "+c"(n)
> > +		: "m"(*(const unsigned char (*)[n])s1),
> > +		  "m"(*(const unsigned char (*)[n])s2)
> > +	);
> > +
> > +	return p1[-1] - p2[-1];
> > +}
> 
> Out of curiosity, given that you implemented the 3 other ones directly
> in an asm statement, is there a particular reason this one mixes a bit
> of C and asm ? It would probably be something around this, in the same
> vein:
> 
>   memcmp:
>     xchg  %esi,%eax   // source1

Aren't the arguments in %rdi, %rsi and %rdx so you only
need to move the count (below) ?
(Looks like you copied memchr())

	David

>     mov   %rdx,%rcx   // count
>     rep   cmpsb       // source2 in rdi; sets ZF on equal, CF if src1<src2
>     seta  %al         // 0 if src2 <= src1, 1 if src2 > src1
>     sbb   $0, %al     // 0 if src2 == src1, -1 if src2 < src1, 1 if src2 > src1
>     movsx %al, %eax   // sign extend to %eax
>     ret
> 
> Note that the output logic might need to be revisited; I'm not certain, but
> at first glance it looks valid.
> 
> Regards,
> Willy
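
For reference, an untested sketch applying that remark, written as a
top-level __asm__ block in the same style as the other helpers (the
.text/.weak placement below is only a placeholder, not the actual
nolibc convention): s1 and s2 already arrive in %rdi and %rsi per the
SysV ABI, so only the count needs moving. Swapping the two pointers
keeps `rep cmpsb` (which compares (%rsi) with (%rdi)) on the same flag
convention as the seta/sbb sequence above, and the leading xor makes a
zero length return 0:

```
__asm__ (
".text\n"
".weak memcmp\n"                /* placement/visibility is an assumption */
"memcmp:\n"
"	xorl   %eax, %eax\n"    /* eax = 0; ZF=1/CF=0 so n == 0 returns 0 */
"	xchgq  %rdi, %rsi\n"    /* s1 -> %rsi, s2 -> %rdi for cmpsb */
"	movq   %rdx, %rcx\n"    /* count */
"	rep    cmpsb\n"         /* ZF on equal, CF if s1 < s2 */
"	seta   %al\n"           /* 1 if s1 > s2 */
"	sbb    $0, %al\n"       /* -1 if s1 < s2, unchanged otherwise */
"	movsbl %al, %eax\n"     /* sign-extend to the int return value */
"	retq\n"
);
```

Whether handling the n == 0 case in asm like this is preferable to the
C-side check in the patch is a separate trade-off.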

Re: [RFC PATCH v1 3/5] tools/nolibc: x86-64: Use `rep cmpsb` for `memcmp()`
Posted by Ammar Faizi 2 years, 3 months ago
On Wed, Aug 30, 2023 at 11:26:57PM +0200, Willy Tarreau wrote:
> Out of curiosity, given that you implemented the 3 other ones directly
> in an asm statement, is there a particular reason this one mixes a bit
> of C and asm ?

Because this one may be unused. The others are explicitly exported.

> It would probably be something around this, in the same vein:
> 
>   memcmp:
>     xchg  %esi,%eax   // source1
>     mov   %rdx,%rcx   // count
>     rep   cmpsb       // source2 in rdi; sets ZF on equal, CF if src1<src2
>     seta  %al         // 0 if src2 <= src1, 1 if src2 > src1
>     sbb   $0, %al     // 0 if src2 == src1, -1 if src2 < src1, 1 if src2 > src1
>     movsx %al, %eax   // sign extend to %eax
>     ret
> 
> Note that the output logic might need to be revisited; I'm not certain, but
> at first glance it looks valid.

After thinking about this more, I think I'll drop the memcmp() patch
because it will prevent optimization when comparing a small value.

For example, without __asm__:

    memcmp(var, "abcd", 4);

may compile to:

    cmpl $0x64636261, %reg
    ...something...

But with __asm__, the compiler can't do that. Thus, it's not worth
optimizing the memcmp() in this case.

-- 
Ammar Faizi
Re: [RFC PATCH v1 3/5] tools/nolibc: x86-64: Use `rep cmpsb` for `memcmp()`
Posted by Willy Tarreau 2 years, 3 months ago
On Fri, Sep 01, 2023 at 10:24:42AM +0700, Ammar Faizi wrote:
> On Wed, Aug 30, 2023 at 11:26:57PM +0200, Willy Tarreau wrote:
> > Out of curiosity, given that you implemented the 3 other ones directly
> > in an asm statement, is there a particular reason this one mixes a bit
> > of C and asm ?
> 
> Because this one may be unused. The others are explicitly exported.

Makes sense, indeed.

> > It would probably be something around this, in the same vein:
> > 
> >   memcmp:
> >     xchg  %esi,%eax   // source1
> >     mov   %rdx,%rcx   // count
> >     rep   cmpsb       // source2 in rdi; sets ZF on equal, CF if src1<src2
> >     seta  %al         // 0 if src2 <= src1, 1 if src2 > src1
> >     sbb   $0, %al     // 0 if src2 == src1, -1 if src2 < src1, 1 if src2 > src1
> >     movsx %al, %eax   // sign extend to %eax
> >     ret
> > 
> > Note that the output logic might need to be revisited; I'm not certain, but
> > at first glance it looks valid.
> 
> After thinking about this more, I think I'll drop the memcmp() patch
> because it will prevent optimization when comparing a small value.
> 
> For example, without __asm__:
> 
>     memcmp(var, "abcd", 4);
> 
> may compile to:
> 
>     cmpl $0x64636261, %reg
>     ...something...
> 
> But with __asm__, the compiler can't do that. Thus, it's not worth
> optimizing the memcmp() in this case.

Ah you're totally right!

Willy
Re: [RFC PATCH v1 3/5] tools/nolibc: x86-64: Use `rep cmpsb` for `memcmp()`
Posted by Ammar Faizi 2 years, 3 months ago
On Fri, Sep 01, 2023 at 05:35:08AM +0200, Willy Tarreau wrote:
> On Fri, Sep 01, 2023 at 10:24:42AM +0700, Ammar Faizi wrote:
> > After thinking about this more, I think I'll drop the memcmp() patch
> > because it will prevent optimization when comparing a small value.
> > 
> > For example, without __asm__:
> > 
> >     memcmp(var, "abcd", 4);
> > 
> > may compile to:
> > 
> >     cmpl $0x64636261, %reg
> >     ...something...
> > 
> > But with __asm__, the compiler can't do that. Thus, it's not worth
> > optimizing the memcmp() in this case.
> 
> Ah you're totally right!

So, it turns out that this assumption is wrong. The compiler cannot
optimize the current memcmp() into that. I just posted a question on SO:

   https://stackoverflow.com/questions/77020562/what-prevents-the-compiler-from-optimizing-a-hand-written-memcmp

Given:
```
  bool test_data(void *data)
  {
          return memcmp(data, "abcd", 4) == 0;
  }
```

The result when using the default <string.h> memcmp() (good):
```
  test_data:
      cmpl    $1684234849, (%rdi)
      sete    %al
      ret
```

The result when using nolibc memcmp() (bad):
```
  test_data:
      cmpb    $97, (%rdi)
      jne     .L5
      cmpb    $98, 1(%rdi)
      jne     .L5
      cmpb    $99, 2(%rdi)
      jne     .L5
      cmpb    $100, 3(%rdi)
      sete    %al
      ret
  .L5:
      xorl    %eax, %eax
      ret
```

Link: https://godbolt.org/z/TT94r3bvf

This is because, besides being bounded by the input length, the current
nolibc `memcmp()` must stop comparing at the first non-matching byte.
Imagine what happens if we call:

```
  char xstr[] = {'a', 'b', 'x'};
  test_data(xstr);
```

In that case, a dword cmp would read past the end of xstr, which can
also lead to a segfault in particular circumstances.

What the current nolibc memcmp() does from the C language view:

  1) Compare one byte at a time.
  2) Must stop comparing the next byte if it finds a non-match byte.

Because of point (2), the compiler is not allowed to optimize nolibc
memcmp() into a wider load; otherwise, it might hit a segfault. That
also means it cannot vectorize the memcmp() loop.

On the other hand, memcpy() and memset() don't have such a restriction,
so they can be vectorized.

The real memcmp() assumes that both sources are at least `n` bytes in
size, allowing for a wider load. The current nolibc memcmp()
implementation doesn't express that assumption in the C code.

IOW, with the real built-in memcmp(), this code is undefined behavior:
```
    char x = 'q';
    return memcmp(&x, "abcd", 4);
```
but the current nolibc memcmp() is well-defined (well, it must be, given
what the C code expresses).

We can improve nolibc memcmp() by casting the sources to a wider type
like ulong, uint, or ushort. But that's another story for another RFC
patchset.
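
Purely as an illustration of that wider-load idea (this is not the
proposed patch; the helper name memcmp_wide, the 8-byte chunk size and
the use of memcpy() for unaligned loads are all placeholder choices),
such a version might look roughly like:

```
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Sketch only: rely on memcmp()'s contract that both buffers hold at
 * least n readable bytes, so 8-byte loads inside the range are fine.
 * The bswap makes the word comparison match the byte-wise result on a
 * little-endian machine such as x86-64.
 */
static int memcmp_wide(const void *s1, const void *s2, size_t n)
{
	const unsigned char *p1 = s1, *p2 = s2;

	while (n >= sizeof(uint64_t)) {
		uint64_t a, b;

		memcpy(&a, p1, sizeof(a));	/* unaligned-safe loads */
		memcpy(&b, p2, sizeof(b));
		if (a != b) {
			a = __builtin_bswap64(a);
			b = __builtin_bswap64(b);
			return a > b ? 1 : -1;
		}
		p1 += sizeof(a);
		p2 += sizeof(b);
		n  -= sizeof(a);
	}
	while (n--) {
		if (*p1 != *p2)
			return *p1 - *p2;
		p1++;
		p2++;
	}
	return 0;
}
```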

-- 
Ammar Faizi