tools/include/nolibc/arch-x86_64.h | 60 ++++++++++++++++++++++++++++++ tools/include/nolibc/string.h | 38 ++++++++----------- 2 files changed, 75 insertions(+), 23 deletions(-)
Hi Willy,
This is an RFC patchset for nolibc x86-64 string functions. There are 5
patches in this series.
## Patch 1-3: Use `rep movsb`, `rep stosb`, and `rep cmpsb` for:
- memcpy() and memmove()
- memset()
- memcmp()
respectively. They can simplify the generated ASM code.
Patch 4 and 5 are not related, just a small cleanup.
## Patch 4: Remove the `_nolibc_memcpy_down()` function
This nolibc internal function is not used. Delete it. It was probably
supposed to handle memmove(), but today the memmove() has its own
implementation.
## Patch 5: Remove the `_nolibc_memcpy_up()` function
This function is only called by memcpy(), there is no real reason to
have this wrapper. Delete this function and move the code to memcpy()
directly.
Before this series:
```
0000000000001479 <memmove>:
1479: f3 0f 1e fa endbr64
147d: 48 39 f7 cmp %rsi,%rdi
1480: 48 c7 c1 ff ff ff ff mov $0xffffffffffffffff,%rcx
1487: 48 89 f8 mov %rdi,%rax
148a: 48 0f 43 ca cmovae %rdx,%rcx
148e: 48 19 ff sbb %rdi,%rdi
1491: 83 e7 02 and $0x2,%edi
1494: 48 ff cf dec %rdi
1497: 48 85 d2 test %rdx,%rdx
149a: 74 10 je 14ac <memmove+0x33>
149c: 48 01 f9 add %rdi,%rcx
149f: 48 ff ca dec %rdx
14a2: 44 8a 04 0e mov (%rsi,%rcx,1),%r8b
14a6: 44 88 04 08 mov %r8b,(%rax,%rcx,1)
14aa: eb eb jmp 1497 <memmove+0x1e>
14ac: c3 ret
00000000000014ad <memcpy>:
14ad: f3 0f 1e fa endbr64
14b1: 48 89 f8 mov %rdi,%rax
14b4: 31 c9 xor %ecx,%ecx
14b6: 48 39 ca cmp %rcx,%rdx
14b9: 74 0d je 14c8 <memcpy+0x1b>
14bb: 40 8a 3c 0e mov (%rsi,%rcx,1),%dil
14bf: 40 88 3c 08 mov %dil,(%rax,%rcx,1)
14c3: 48 ff c1 inc %rcx
14c6: eb ee jmp 14b6 <memcpy+0x9>
14c8: c3 ret
00000000000014c9 <memset>:
14c9: f3 0f 1e fa endbr64
14cd: 48 89 f8 mov %rdi,%rax
14d0: 31 c9 xor %ecx,%ecx
14d2: 48 39 ca cmp %rcx,%rdx
14d5: 74 09 je 14e0 <memset+0x17>
14d7: 40 88 34 08 mov %sil,(%rax,%rcx,1)
14db: 48 ff c1 inc %rcx
14de: eb f2 jmp 14d2 <memset+0x9>
14e0: c3 ret
```
After this series:
```
000000000000140a <memmove>:
140a: 48 89 f8 mov %rdi,%rax
140d: 48 89 d1 mov %rdx,%rcx
1410: 48 8d 7c 0f ff lea -0x1(%rdi,%rcx,1),%rdi
1415: 48 8d 74 0e ff lea -0x1(%rsi,%rcx,1),%rsi
141a: fd std
141b: f3 a4 rep movsb %ds:(%rsi),%es:(%rdi)
141d: fc cld
141e: c3 ret
000000000000141f <memcpy>:
141f: 48 89 f8 mov %rdi,%rax
1422: 48 89 d1 mov %rdx,%rcx
1425: f3 a4 rep movsb %ds:(%rsi),%es:(%rdi)
1427: c3 ret
0000000000001428 <memset>:
1428: 48 89 f0 mov %rsi,%rax
142b: 48 89 d1 mov %rdx,%rcx
142e: 48 89 fa mov %rdi,%rdx
1431: f3 aa rep stos %al,%es:(%rdi)
1433: 48 89 d0 mov %rdx,%rax
1436: c3 ret
```
Signed-off-by: Ammar Faizi <ammarfaizi2@gnuweeb.org>
---
Ammar Faizi (5):
tools/nolibc: x86-64: Use `rep movsb` for `memcpy()` and `memmove()`
tools/nolibc: x86-64: Use `rep stosb` for `memset()`
tools/nolibc: x86-64: Use `rep cmpsb` for `memcmp()`
tools/nolibc: string: Remove the `_nolibc_memcpy_down()` function
tools/nolibc: string: Remove the `_nolibc_memcpy_up()` function
tools/include/nolibc/arch-x86_64.h | 60 ++++++++++++++++++++++++++++++
tools/include/nolibc/string.h | 38 ++++++++-----------
2 files changed, 75 insertions(+), 23 deletions(-)
base-commit: 3c9b7c4a228bf8cca2f92abb65575cdd54065302
--
Ammar Faizi
Hi Ammar, On Wed, Aug 30, 2023 at 08:57:21PM +0700, Ammar Faizi wrote: > Hi Willy, > > This is an RFC patchset for nolibc x86-64 string functions. There are 5 > patches in this series. > > ## Patch 1-3: Use `rep movsb`, `rep stosb`, and `rep cmpsb` for: > - memcpy() and memmove() > - memset() > - memcmp() > respectively. They can simplify the generated ASM code. > > Patch 4 and 5 are not related, just a small cleanup. So overall I'm fine with this, I think it's reasonable. As you said we're not trying to chase the very last byte, but for such functions it's also nice if they can remain small. Some of them might even benefit from being inlined by the way (in this case they'd rather move to C functions with an asm() statement), because the call instruction and the register moves or spilling code will generally be larger than the functions themselves. That might be worth checking. Ah no, we cannot because some of them are called from libgcc and friends. Or we may need to mark them inline and weak without static, I'm not sure how well that works. Please just let me know if you intend to change a few things based on previous comments, and also this memcmp() stuff that's both C and asm. Thanks! Willy
From: Ammar Faizi <ammarfaizi2@gnuweeb.org> > Sent: 30 August 2023 14:57 > > This is an RFC patchset for nolibc x86-64 string functions. There are 5 > patches in this series. > > ## Patch 1-3: Use `rep movsb`, `rep stosb`, and `rep cmpsb` for: > - memcpy() and memmove() > - memset() > - memcmp() > respectively. They can simplify the generated ASM code. > ... > After this series: > ``` > 000000000000140a <memmove>: > 140a: 48 89 f8 mov %rdi,%rax > 140d: 48 89 d1 mov %rdx,%rcx > 1410: 48 8d 7c 0f ff lea -0x1(%rdi,%rcx,1),%rdi > 1415: 48 8d 74 0e ff lea -0x1(%rsi,%rcx,1),%rsi > 141a: fd std > 141b: f3 a4 rep movsb %ds:(%rsi),%es:(%rdi) > 141d: fc cld > 141e: c3 ret Isn't that completely broken? You need to select between forwards and backwards moves. Since forwards moves are preferred it is best to do if (dst - src < len) backards_copy() else formwards_copy() David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Fri, Sep 01, 2023 at 11:34:18AM +0000, David Laight wrote: > From: Ammar Faizi <ammarfaizi2@gnuweeb.org> > > Sent: 30 August 2023 14:57 > > > > This is an RFC patchset for nolibc x86-64 string functions. There are 5 > > patches in this series. > > > > ## Patch 1-3: Use `rep movsb`, `rep stosb`, and `rep cmpsb` for: > > - memcpy() and memmove() > > - memset() > > - memcmp() > > respectively. They can simplify the generated ASM code. > > > ... > > After this series: > > ``` > > 000000000000140a <memmove>: > > 140a: 48 89 f8 mov %rdi,%rax > > 140d: 48 89 d1 mov %rdx,%rcx > > 1410: 48 8d 7c 0f ff lea -0x1(%rdi,%rcx,1),%rdi > > 1415: 48 8d 74 0e ff lea -0x1(%rsi,%rcx,1),%rsi > > 141a: fd std > > 141b: f3 a4 rep movsb %ds:(%rsi),%es:(%rdi) > > 141d: fc cld > > 141e: c3 ret > > Isn't that completely broken? > > You need to select between forwards and backwards moves. > Since forwards moves are preferred it is best to do > if (dst - src < len) > backards_copy() > else > formwards_copy() > > David You're completely right indeed, reminds me about the copy_up/copy_down that were not used anymore :-) Willy
On Fri, Sep 01, 2023 at 01:46:44PM +0200, Willy Tarreau wrote: > On Fri, Sep 01, 2023 at 11:34:18AM +0000, David Laight wrote: > > Isn't that completely broken? > > > > You need to select between forwards and backwards moves. > > Since forwards moves are preferred it is best to do > > if (dst - src < len) > > backards_copy() > > else > > formwards_copy() > > > > David > > You're completely right indeed, reminds me about the copy_up/copy_down > that were not used anymore :-) I'm an idiot, will fix that. Another attempt as suggested below: __asm__ ( ".section .text.nolibc_memmove\n" ".weak memmove\n" "memmove:\n" " movq %rdx, %rcx\n" " movq %rdi, %rdx\n" " movq %rdi, %rax\n" " subq %rsi, %rdx\n" " cmpq %rcx, %rdx\n" " jnb .Lforward_copy\n" " leaq -1(%rdi, %rcx, 1), %rdi\n" " leaq -1(%rsi, %rcx, 1), %rsi\n" " std\n" ".Lforward_copy:\n" " rep movsb\n" " cld\n" " ret\n" ); -- Ammar Faizi
From: Ammar Faizi > Sent: 01 September 2023 14:06 ... > > You're completely right indeed, reminds me about the copy_up/copy_down > > that were not used anymore :-) > > I'm an idiot, will fix that. Another attempt as suggested below: > > __asm__ ( > ".section .text.nolibc_memmove\n" > ".weak memmove\n" > "memmove:\n" > " movq %rdx, %rcx\n" > " movq %rdi, %rdx\n" > " movq %rdi, %rax\n" You seem to have confused yourself about whether you are using %eax or %edx. > " subq %rsi, %rdx\n" > " cmpq %rcx, %rdx\n" > " jnb .Lforward_copy\n" I think I'd fall through to the forwards copy and not worry about replicating the 'reps movsb' and 'ret'. IIRC 'cld' can be slow as well. > " leaq -1(%rdi, %rcx, 1), %rdi\n" > " leaq -1(%rsi, %rcx, 1), %rsi\n" > " std\n" > ".Lforward_copy:\n" > " rep movsb\n" > " cld\n" > " ret\n" > ); > > -- > Ammar Faizi David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Fri, Sep 01, 2023 at 02:23:28PM +0000, David Laight wrote: > From: Ammar Faizi > > Sent: 01 September 2023 14:06 ... > > __asm__ ( > > ".section .text.nolibc_memmove\n" > > ".weak memmove\n" > > "memmove:\n" > > " movq %rdx, %rcx\n" > > " movq %rdi, %rdx\n" > > " movq %rdi, %rax\n" > > You seem to have confused yourself about whether you are using %eax or %edx. What do you mean? They're all 64-bit pointers. What I know is that the %rdx will be clobbered by "subq %rsi, %rdx" below and the %rax should be return value. That's why I copy the %rdi twice. memmove() returns the dst pointer. Did I miss something? > > " subq %rsi, %rdx\n" > > " cmpq %rcx, %rdx\n" > > " jnb .Lforward_copy\n" > > I think I'd fall through to the forwards copy > and not worry about replicating the 'reps movsb' and 'ret'. > IIRC 'cld' can be slow as well. Alright, I will avoid cld for the forward copy. > > " leaq -1(%rdi, %rcx, 1), %rdi\n" > > " leaq -1(%rsi, %rcx, 1), %rsi\n" > > " std\n" > > ".Lforward_copy:\n" > > " rep movsb\n" > > " cld\n" > > " ret\n" > > ); -- Ammar Faizi
From: Ammar Faizi > Sent: 01 September 2023 15:42 ... > > > " movq %rdx, %rcx\n" > > > " movq %rdi, %rdx\n" > > > " movq %rdi, %rax\n" > > > > You seem to have confused yourself about whether you are using %eax or %edx. > > What do you mean? They're all 64-bit pointers. %ax, %eax, %rax - what is the difference :-) > What I know is that the %rdx will be clobbered by "subq %rsi, %rdx" > below and the %rax should be return value. That's why I copy the %rdi > twice. memmove() returns the dst pointer. Did I miss something? I'd forgotten about the (stupid) return value. I'm pretty sure it is an accident from the original pdp-11 implementation from the days before C had an explicit 'return' statement. (The pdp-11 I used ran RSX/11M - so had a Fortran compiler not a C one.) David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
On Fri, Sep 01, 2023 at 02:54:56PM +0000, David Laight wrote: > I'd forgotten about the (stupid) return value. > > I'm pretty sure it is an accident from the original pdp-11 > implementation from the days before C had an explicit 'return' > statement. > (The pdp-11 I used ran RSX/11M - so had a Fortran compiler > not a C one.) You're old. I did not exist in that era. And my parents were still young :-) -- Ammar Faizi
© 2016 - 2025 Red Hat, Inc.