[RFC PATCH v3 0/4] nolibc x86-64 string functions

Ammar Faizi posted 4 patches 2 years, 3 months ago
tools/include/nolibc/arch-x86_64.h | 42 ++++++++++++++++++++++++++++++
tools/include/nolibc/string.h      | 36 +++++++++----------------
2 files changed, 55 insertions(+), 23 deletions(-)
[RFC PATCH v3 0/4] nolibc x86-64 string functions
Posted by Ammar Faizi 2 years, 3 months ago
Hi Willy,

This is an RFC patchset v3 for nolibc x86-64 string functions.

There are 4 patches in this series:

## Patch 1-2: Use `rep movsb`, `rep stosb` for:
    - memcpy() and memmove()
    - memset()
respectively. They can simplify the generated ASM code.

Patch 3 and 4 are not related, just a small cleanup.

## Patch 3: Remove the `_nolibc_memcpy_down()` function
This nolibc internal function is not used. Delete it. It was probably
supposed to handle memmove(), but today the memmove() has its own
implementation.

## Patch 4: Remove the `_nolibc_memcpy_up()` function
This function is only called by memcpy(), there is no real reason to
have this wrapper. Delete this function and move the code to memcpy()
directly.

Before this series:
```
  00000000004013aa <memmove>:
    4013aa: f3 0f 1e fa           endbr64
    4013ae: 48 39 f7              cmpq   %rsi,%rdi
    4013b1: 48 c7 c1 ff ff ff ff  movq   $0xffffffffffffffff,%rcx
    4013b8: 48 89 f8              movq   %rdi,%rax
    4013bb: 48 0f 43 ca           cmovaeq %rdx,%rcx
    4013bf: 48 19 ff              sbbq   %rdi,%rdi
    4013c2: 83 e7 02              andl   $0x2,%edi
    4013c5: 48 ff cf              decq   %rdi
    4013c8: 48 85 d2              testq  %rdx,%rdx
    4013cb: 74 10                 je     4013dd <memmove+0x33>
    4013cd: 48 01 f9              addq   %rdi,%rcx
    4013d0: 48 ff ca              decq   %rdx
    4013d3: 44 8a 04 0e           movb   (%rsi,%rcx,1),%r8b
    4013d7: 44 88 04 08           movb   %r8b,(%rax,%rcx,1)
    4013db: eb eb                 jmp    4013c8 <memmove+0x1e>
    4013dd: c3                    retq

  00000000004013de <memcpy>:
    4013de: f3 0f 1e fa           endbr64
    4013e2: 48 89 f8              movq   %rdi,%rax
    4013e5: 31 c9                 xorl   %ecx,%ecx
    4013e7: 48 39 ca              cmpq   %rcx,%rdx
    4013ea: 74 0d                 je     4013f9 <memcpy+0x1b>
    4013ec: 40 8a 3c 0e           movb   (%rsi,%rcx,1),%dil
    4013f0: 40 88 3c 08           movb   %dil,(%rax,%rcx,1)
    4013f4: 48 ff c1              incq   %rcx
    4013f7: eb ee                 jmp    4013e7 <memcpy+0x9>
    4013f9: c3                    retq

  00000000004013fa <memset>:
    4013fa: f3 0f 1e fa           endbr64
    4013fe: 48 89 f8              movq   %rdi,%rax
    401401: 31 c9                 xorl   %ecx,%ecx
    401403: 48 39 ca              cmpq   %rcx,%rdx
    401406: 74 09                 je     401411 <memset+0x17>
    401408: 40 88 34 08           movb   %sil,(%rax,%rcx,1)
    40140c: 48 ff c1              incq   %rcx
    40140f: eb f2                 jmp    401403 <memset+0x9>
    401411: c3                    retq
```

After this series:
```
  // `memmove` is an alias for `memcpy`
  000000000040149c <memcpy>:
    40149c: 48 89 d1              movq   %rdx,%rcx
    40149f: 48 89 f8              movq   %rdi,%rax
    4014a2: 48 89 fa              movq   %rdi,%rdx
    4014a5: 48 29 f2              subq   %rsi,%rdx
    4014a8: 48 39 ca              cmpq   %rcx,%rdx
    4014ab: 72 03                 jb     4014b0 <memcpy+0x14>
    4014ad: f3 a4                 rep movsb %ds:(%rsi),%es:(%rdi)
    4014af: c3                    retq
    4014b0: 48 8d 7c 0f ff        leaq   -0x1(%rdi,%rcx,1),%rdi
    4014b5: 48 8d 74 0e ff        leaq   -0x1(%rsi,%rcx,1),%rsi
    4014ba: fd                    std
    4014bb: f3 a4                 rep movsb %ds:(%rsi),%es:(%rdi)
    4014bd: fc                    cld
    4014be: c3                    retq

  00000000004014bf <memset>:
    4014bf: 96                    xchgl  %eax,%esi
    4014c0: 48 89 d1              movq   %rdx,%rcx
    4014c3: 57                    pushq  %rdi
    4014c4: f3 aa                 rep stosb %al,%es:(%rdi)
    4014c6: 58                    popq   %rax
    4014c7: c3                    retq
```

## Changelog

Changes in v3:
  - Make memmove as an alias for memcpy (Willy).
  - Make the forward copy the likely case (Alviro).

Changes in v2:
  - Shrink the memset code size:
      - Use pushq %rdi / popq %rax (Alviro).
      - Use xchg %eax, %esi (Willy).
  - Drop the memcmp patch (need more pondering).
  - Fix the broken memmove implementation (David).

Signed-off-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
---

Ammar Faizi (4):
  tools/nolibc: x86-64: Use `rep movsb` for `memcpy()` and `memmove()`
  tools/nolibc: x86-64: Use `rep stosb` for `memset()`
  tools/nolibc: string: Remove the `_nolibc_memcpy_down()` function
  tools/nolibc: string: Remove the `_nolibc_memcpy_up()` function

 tools/include/nolibc/arch-x86_64.h | 42 ++++++++++++++++++++++++++++++
 tools/include/nolibc/string.h      | 36 +++++++++----------------
 2 files changed, 55 insertions(+), 23 deletions(-)


base-commit: 3c9b7c4a228bf8cca2f92abb65575cdd54065302
-- 
Ammar Faizi
RE: [RFC PATCH v3 0/4] nolibc x86-64 string functions
Posted by David Laight 2 years, 3 months ago
From: Ammar Faizi
> Sent: 02 September 2023 14:35
> 
> This is an RFC patchset v3 for nolibc x86-64 string functions.
> 
> There are 4 patches in this series:
> 
> ## Patch 1-2: Use `rep movsb`, `rep stosb` for:
>     - memcpy() and memmove()
>     - memset()
> respectively. They can simplify the generated ASM code.

It is worth pointing out that while the code size for 'rep xxxb'
is smaller, the performance is terrible.
The only time it is ever good is for the optimised forwards
copies on cpu that support it.

reverse, stos and scas are always horrid.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Re: [RFC PATCH v3 0/4] nolibc x86-64 string functions
Posted by Willy Tarreau 2 years, 3 months ago
On Sun, Sep 03, 2023 at 08:38:22PM +0000, David Laight wrote:
> From: Ammar Faizi
> > Sent: 02 September 2023 14:35
> > 
> > This is an RFC patchset v3 for nolibc x86-64 string functions.
> > 
> > There are 4 patches in this series:
> > 
> > ## Patch 1-2: Use `rep movsb`, `rep stosb` for:
> >     - memcpy() and memmove()
> >     - memset()
> > respectively. They can simplify the generated ASM code.
> 
> It is worth pointing out that while the code size for 'rep xxxb'
> is smaller, the performance is terrible.
> The only time it is ever good is for the optimised forwards
> copies on cpu that support it.
> 
> reverse, stos and scas are always horrid.

It's terrible compared to other approaches but not *that* bad. Also we
absolutely don't care about performance here, rather about correctness
and compact size.

Regards,
Willy