Move the function to its own assembly file. Having it in C just for the
entire body to be an asm() isn't really helpful. Then have two flavors:
A "basic" version using qword steps for the bulk of the operation, and an
ERMS version for modern hardware, to be substituted in via alternatives
patching.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
We may want to consider branching over the REP STOSQ as well, if the
number of qwords turns out to be zero.
We may also want to consider using non-REP STOS{L,W,B} for the tail.
---
v3: Re-base.
--- a/xen/arch/x86/Makefile
+++ b/xen/arch/x86/Makefile
@@ -48,6 +48,7 @@ obj-$(CONFIG_INDIRECT_THUNK) += indirect
 obj-$(CONFIG_PV) += ioport_emulate.o
 obj-y += irq.o
 obj-$(CONFIG_KEXEC) += machine_kexec.o
+obj-y += memset.o
 obj-y += mm.o x86_64/mm.o
 obj-$(CONFIG_HVM) += monitor.o
 obj-y += mpparse.o
--- /dev/null
+++ b/xen/arch/x86/memset.S
@@ -0,0 +1,30 @@
+#include <asm/asm_defns.h>
+
+.macro memset
+        and     $7, %edx
+        shr     $3, %rcx
+        movzbl  %sil, %esi
+        mov     $0x0101010101010101, %rax
+        imul    %rsi, %rax
+        mov     %rdi, %rsi
+        rep stosq
+        or      %edx, %ecx
+        jz      0f
+        rep stosb
+0:
+        mov     %rsi, %rax
+        ret
+.endm
+
+.macro memset_erms
+        mov     %esi, %eax
+        mov     %rdi, %rsi
+        rep stosb
+        mov     %rsi, %rax
+        ret
+.endm
+
+FUNC(memset)
+        mov     %rdx, %rcx
+        ALTERNATIVE memset, memset_erms, X86_FEATURE_ERMS
+END(memset)
--- a/xen/arch/x86/string.c
+++ b/xen/arch/x86/string.c
@@ -22,19 +22,6 @@ void *(memcpy)(void *dest, const void *s
     return dest;
 }
 
-void *(memset)(void *s, int c, size_t n)
-{
- long d0, d1;
-
- asm volatile (
- "rep stosb"
- : "=&c" (d0), "=&D" (d1)
- : "a" (c), "1" (s), "0" (n)
- : "memory");
-
- return s;
-}
-
 void *(memmove)(void *dest, const void *src, size_t n)
 {
     long d0, d1, d2;
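For readers less fluent in AT&T assembly, the "basic" flavour above can be modelled in C roughly as follows. `memset_model` is a hypothetical name used for illustration only; it mirrors the byte-replication multiply and the qword/tail split of the patch, not the patched code itself:

```c
#include <stdint.h>
#include <string.h>

/* C model of the "basic" flavour: replicate the fill byte into all
 * eight lanes of a qword (movzbl + imul by 0x0101010101010101),
 * store whole qwords for the bulk (rep stosq), then finish the
 * 0..7 byte tail (rep stosb). */
static void *memset_model(void *s, int c, size_t n)
{
    unsigned char *p = s;
    uint64_t pattern = (unsigned char)c * 0x0101010101010101ULL;
    size_t qwords = n >> 3;   /* shr $3, %rcx */
    size_t tail = n & 7;      /* and $7, %edx */

    while ( qwords-- )
    {
        memcpy(p, &pattern, sizeof(pattern));  /* one STOSQ step */
        p += sizeof(pattern);
    }
    while ( tail-- )
        *p++ = (unsigned char)c;               /* one STOSB step */

    return s;  /* the asm stashes the original %rdi in %rsi for the return */
}
```

Note the asm version relies on REP STOS leaving the count register zero and only advancing %rdi, which is why a single `or %edx, %ecx` suffices to seed the tail count.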
On 25/11/2024 2:28 pm, Jan Beulich wrote:
> Move the function to its own assembly file. Having it in C just for the
> entire body to be an asm() isn't really helpful. Then have two flavors:
> A "basic" version using qword steps for the bulk of the operation, and an
> ERMS version for modern hardware, to be substituted in via alternatives
> patching.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

This is far nicer than previous versions with nested alternatives.

> ---
> We may want to consider branching over the REP STOSQ as well, if the
> number of qwords turns out to be zero.

Until FSR{S,M} (Fast Short Rep {STO,MOV}SB), which is far newer than
ERMS, passing 0 into any REP instruction is expensive.

I wonder how often we memset with a size less than 8.

> We may also want to consider using non-REP STOS{L,W,B} for the tail.

Probably, yes.  We use this form in non-ERMS cases, where we're advised
to stay away from STOSB entirely.

Interestingly, Linux doesn't have a STOSQ case at all.  Or rather, it
was deleted by Linus in 20f3337d350c last year.  It was also identified
as causing a performance regression:

https://lore.kernel.org/lkml/CANn89iKUbyrJ=r2+_kK+sb2ZSSHifFZ7QkPLDpAtkJ8v4WUumA@mail.gmail.com/T/#u

although the memset() path was not reverted as part of the fix
(47ee3f1dd93bcb eventually).

Yet ca96b162bfd2 shows that REP MOVSQ is still definitely a win on Rome
CPUs.

I expect we probably do want some non-rep forms in here.

Do you have any benchmarks with this series?

> ---
> v3: Re-base.
>
> --- a/xen/arch/x86/Makefile
> +++ b/xen/arch/x86/Makefile
> @@ -48,6 +48,7 @@ obj-$(CONFIG_INDIRECT_THUNK) += indirect
>  obj-$(CONFIG_PV) += ioport_emulate.o
>  obj-y += irq.o
>  obj-$(CONFIG_KEXEC) += machine_kexec.o
> +obj-y += memset.o
>  obj-y += mm.o x86_64/mm.o
>  obj-$(CONFIG_HVM) += monitor.o
>  obj-y += mpparse.o
> --- /dev/null
> +++ b/xen/arch/x86/memset.S
> @@ -0,0 +1,30 @@
> +#include <asm/asm_defns.h>
> +
> +.macro memset
> +        and     $7, %edx
> +        shr     $3, %rcx
> +        movzbl  %sil, %esi
> +        mov     $0x0101010101010101, %rax
> +        imul    %rsi, %rax
> +        mov     %rdi, %rsi
> +        rep stosq
> +        or      %edx, %ecx
> +        jz      0f
> +        rep stosb
> +0:
> +        mov     %rsi, %rax

Could you use %r8/9/etc instead of %rsi please?  This is deceptively
close to looking like a bug, and it took me a while to figure out it's
only correct because STOSB only edits %rdi.

Otherwise, I suspect this can go in.  It should be an improvement on
plain REP STOSB on non-ERMS systems, even if there are other
improvements to come.  I specifically wouldn't suggest blocking it until
patch 1 is resolved.

~Andrew
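The non-REP STOS{L,W,B} tail suggested above would, in C terms, replace the byte loop with at most three fixed-size stores. A sketch under that assumption (`set_tail` and its parameters are hypothetical names, not from the patch):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the suggested non-REP tail: a 0..7 byte remainder needs at
 * most one 4-byte, one 2-byte and one 1-byte store (STOSL/STOSW/STOSB),
 * avoiding a REP instruction with a small or zero count.  'pattern' is
 * the fill byte already replicated across all eight qword lanes. */
static void set_tail(unsigned char *p, uint64_t pattern, size_t tail)
{
    if ( tail & 4 )
    {
        memcpy(p, &pattern, 4);  /* STOSL */
        p += 4;
    }
    if ( tail & 2 )
    {
        memcpy(p, &pattern, 2);  /* STOSW */
        p += 2;
    }
    if ( tail & 1 )
        *p = (unsigned char)pattern;  /* STOSB */
}
```

Since every byte of the replicated pattern is identical, the partial copies are endian-independent; in asm this shape costs up to three conditional branches, which is the trade-off questioned later in the thread.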
On 26.11.2024 18:13, Andrew Cooper wrote:
> On 25/11/2024 2:28 pm, Jan Beulich wrote:
>> ---
>> We may want to consider branching over the REP STOSQ as well, if the
>> number of qwords turns out to be zero.
>
> Until FSR{S,M} (Fast Short Rep {STO,MOV}SB), which is far newer than
> ERMS, passing 0 into any REP instruction is expensive.

Is this a request to add such a conditional branch then, perhaps patched
out when FSRS is available? And then perhaps also for the JZ that's
already there?

> I wonder how often we memset with a size less than 8.

Hence why I raised the point, rather than putting the jump there
directly.

>> We may also want to consider using non-REP STOS{L,W,B} for the tail.
>
> Probably, yes.  We use this form in non-ERMS cases, where we're advised
> to stay away from STOSB entirely.

Yet then we'll end up with three conditional branches - do we really
want that?

> Interestingly, Linux doesn't have a STOSQ case at all.  Or rather, it
> was deleted by Linus in 20f3337d350c last year.  It was also identified
> as causing a performance regression.
> https://lore.kernel.org/lkml/CANn89iKUbyrJ=r2+_kK+sb2ZSSHifFZ7QkPLDpAtkJ8v4WUumA@mail.gmail.com/T/#u
> although the memset() path was not reverted as part of the fix
> (47ee3f1dd93bcb eventually).
>
> Yet ca96b162bfd2 shows that REP MOVSQ is still definitely a win on Rome
> CPUs.
>
> I expect we probably do want some non-rep forms in here.
>
> Do you have any benchmarks with this series?

What I specifically measured were the clear_page() variants. I didn't do
any measurements for memset() (or memcpy()), first and foremost because
any selection of inputs is going to be arbitrary rather than
representative.
>> --- /dev/null
>> +++ b/xen/arch/x86/memset.S
>> @@ -0,0 +1,30 @@
>> +#include <asm/asm_defns.h>
>> +
>> +.macro memset
>> +        and     $7, %edx
>> +        shr     $3, %rcx
>> +        movzbl  %sil, %esi
>> +        mov     $0x0101010101010101, %rax
>> +        imul    %rsi, %rax
>> +        mov     %rdi, %rsi
>> +        rep stosq
>> +        or      %edx, %ecx
>> +        jz      0f
>> +        rep stosb
>> +0:
>> +        mov     %rsi, %rax
>
> Could you use %r8/9/etc instead of %rsi please?  This is deceptively
> close to looking like a bug, and it took me a while to figure out it's
> only correct because STOSB only edits %rdi.

Well, I can certainly switch (the number of REX prefixes will remain the
same as it looks), but the fact that STOS, unlike MOVS, doesn't touch
%rsi is a pretty basic one. Does your request extend to all uses of %rsi
(and %esi), or merely the latter two (across the REP STOS)?

Jan