[PATCH v3 2/7] x86: re-work memset()

Posted by Jan Beulich 4 days, 9 hours ago
Move the function to its own assembly file. Having it in C just for the
entire body to be an asm() isn't really helpful. Then have two flavors:
A "basic" version using qword steps for the bulk of the operation, and an
ERMS version for modern hardware, to be substituted in via alternatives
patching.
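
For reference, the "basic" flavour again with explanatory comments added
(same logic as the code below; only the comments are new):

        and     $7, %edx        # %edx = n & 7, the 0..7 byte tail count
        shr     $3, %rcx        # %rcx = n / 8, the qword count (%rcx was loaded with n)
        movzbl  %sil, %esi      # isolate the fill byte
        mov     $0x0101010101010101, %rax
        imul    %rsi, %rax      # broadcast the fill byte into all 8 bytes of %rax
        mov     %rdi, %rsi      # save the return value; STOS advances only %rdi
        rep stosq               # bulk fill, 8 bytes at a time
        or      %edx, %ecx      # %ecx is 0 after the REP, so this yields the tail count
        jz      0f
        rep stosb               # fill the 0..7 trailing bytes
0:
        mov     %rsi, %rax      # return the original pointer
        ret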

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
We may want to consider branching over the REP STOSQ as well, if the
number of qwords turns out to be zero.
We may also want to consider using non-REP STOS{L,W,B} for the tail.
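
By way of illustration, combining both ideas might look something like this
(untested sketch; it assumes, as in the patch, that %edx holds n & 7 and %rax
the broadcast fill value):

        shr     $3, %rcx
        jz      0f              # skip the REP STOSQ when there are no full qwords
        rep stosq
0:
        test    $4, %edx        # non-REP tail: at most one STOSL, STOSW and STOSB each
        jz      1f
        stosl
1:
        test    $2, %edx
        jz      2f
        stosw
2:
        test    $1, %edx
        jz      3f
        stosb
3: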
---
v3: Re-base.

--- a/xen/arch/x86/Makefile
+++ b/xen/arch/x86/Makefile
@@ -48,6 +48,7 @@ obj-$(CONFIG_INDIRECT_THUNK) += indirect
 obj-$(CONFIG_PV) += ioport_emulate.o
 obj-y += irq.o
 obj-$(CONFIG_KEXEC) += machine_kexec.o
+obj-y += memset.o
 obj-y += mm.o x86_64/mm.o
 obj-$(CONFIG_HVM) += monitor.o
 obj-y += mpparse.o
--- /dev/null
+++ b/xen/arch/x86/memset.S
@@ -0,0 +1,30 @@
+#include <asm/asm_defns.h>
+
+.macro memset
+        and     $7, %edx
+        shr     $3, %rcx
+        movzbl  %sil, %esi
+        mov     $0x0101010101010101, %rax
+        imul    %rsi, %rax
+        mov     %rdi, %rsi
+        rep stosq
+        or      %edx, %ecx
+        jz      0f
+        rep stosb
+0:
+        mov     %rsi, %rax
+        ret
+.endm
+
+.macro memset_erms
+        mov     %esi, %eax
+        mov     %rdi, %rsi
+        rep stosb
+        mov     %rsi, %rax
+        ret
+.endm
+
+FUNC(memset)
+        mov     %rdx, %rcx
+        ALTERNATIVE memset, memset_erms, X86_FEATURE_ERMS
+END(memset)
--- a/xen/arch/x86/string.c
+++ b/xen/arch/x86/string.c
@@ -22,19 +22,6 @@ void *(memcpy)(void *dest, const void *s
     return dest;
 }
 
-void *(memset)(void *s, int c, size_t n)
-{
-    long d0, d1;
-
-    asm volatile (
-        "rep stosb"
-        : "=&c" (d0), "=&D" (d1)
-        : "a" (c), "1" (s), "0" (n)
-        : "memory");
-
-    return s;
-}
-
 void *(memmove)(void *dest, const void *src, size_t n)
 {
     long d0, d1, d2;
Re: [PATCH v3 2/7] x86: re-work memset()
Posted by Andrew Cooper 3 days, 6 hours ago
On 25/11/2024 2:28 pm, Jan Beulich wrote:
> Move the function to its own assembly file. Having it in C just for the
> entire body to be an asm() isn't really helpful. Then have two flavors:
> A "basic" version using qword steps for the bulk of the operation, and an
> ERMS version for modern hardware, to be substituted in via alternatives
> patching.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

This is far nicer than previous versions with nested alternatives.

> ---
> We may want to consider branching over the REP STOSQ as well, if the
> number of qwords turns out to be zero.

Until FSR{S,M} (Fast Short Rep {STO,MOV}SB), which is far newer than
ERMS, passing 0 into any REP instruction is expensive.

I wonder how often we memset with a size less than 8.

> We may also want to consider using non-REP STOS{L,W,B} for the tail.

Probably, yes.  We use this form in non-ERMS cases, where we're advised
to stay away from STOSB entirely.

Interestingly, Linux doesn't have a STOSQ case at all.  Or rather, it
was deleted by Linus in 20f3337d350c last year.  That change was also
identified as causing a performance regression
(https://lore.kernel.org/lkml/CANn89iKUbyrJ=r2+_kK+sb2ZSSHifFZ7QkPLDpAtkJ8v4WUumA@mail.gmail.com/T/#u),
although the memset() path was not reverted as part of the fix
(47ee3f1dd93bcb eventually).

Yet ca96b162bfd2 shows that REP MOVSQ is still definitely a win on Rome
CPUs.

I expect we probably do want some non-rep forms in here.

Do you have any benchmarks with this series?

> ---
> v3: Re-base.
>
> --- a/xen/arch/x86/Makefile
> +++ b/xen/arch/x86/Makefile
> @@ -48,6 +48,7 @@ obj-$(CONFIG_INDIRECT_THUNK) += indirect
>  obj-$(CONFIG_PV) += ioport_emulate.o
>  obj-y += irq.o
>  obj-$(CONFIG_KEXEC) += machine_kexec.o
> +obj-y += memset.o
>  obj-y += mm.o x86_64/mm.o
>  obj-$(CONFIG_HVM) += monitor.o
>  obj-y += mpparse.o
> --- /dev/null
> +++ b/xen/arch/x86/memset.S
> @@ -0,0 +1,30 @@
> +#include <asm/asm_defns.h>
> +
> +.macro memset
> +        and     $7, %edx
> +        shr     $3, %rcx
> +        movzbl  %sil, %esi
> +        mov     $0x0101010101010101, %rax
> +        imul    %rsi, %rax
> +        mov     %rdi, %rsi
> +        rep stosq
> +        or      %edx, %ecx
> +        jz      0f
> +        rep stosb
> +0:
> +        mov     %rsi, %rax

Could you use %r8/9/etc instead of %rsi please?  This is deceptively
close to looking like a bug, and it took me a while to figure out it's
only correct because STOSB only edits %rdi.
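
Concretely, the suggestion amounts to just replacing the two affected
instructions (sketch):

        mov     %rdi, %r8       # rather than: mov %rdi, %rsi
        mov     %r8, %rax       # rather than: mov %rsi, %rax

with the rest of the flavour left as is.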

Otherwise, I suspect this can go in.  It should be an improvement over
plain REP STOSB on non-ERMS systems, even if there are other
improvements to come.  I specifically wouldn't suggest blocking it on
the resolution of patch 1.

~Andrew

Re: [PATCH v3 2/7] x86: re-work memset()
Posted by Jan Beulich 2 days, 14 hours ago
On 26.11.2024 18:13, Andrew Cooper wrote:
> On 25/11/2024 2:28 pm, Jan Beulich wrote:
>> ---
>> We may want to consider branching over the REP STOSQ as well, if the
>> number of qwords turns out to be zero.
> 
> Until FSR{S,M} (Fast Short Rep {STO,MOV}SB), which is far newer than
> ERMS, passing 0 into any REP instruction is expensive.

Is this a request to add such a conditional branch then, perhaps patched
out when FSRS is available? And then perhaps also for the JZ that's
already there?
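
Something along these lines is what that could look like (rough, untested
sketch; the stosq_* macro names are made up here, and an X86_FEATURE_FSRS
flag would need introducing first):

.macro stosq_guarded            # default: branch over the REP STOSQ when there are no qwords
        test    %rcx, %rcx
        jz      0f
        rep stosq
0:
.endm

.macro stosq_plain              # with FSRS a zero count is cheap, so no branch is needed
        rep stosq
.endm

        ALTERNATIVE stosq_guarded, stosq_plain, X86_FEATURE_FSRS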

> I wonder how often we memset with a size less than 8.

Hence why I raised the point, rather than putting the jump there directly.

>> We may also want to consider using non-REP STOS{L,W,B} for the tail.
> 
> Probably, yes.  We use this form in non-ERMS cases, where we're advised
> to stay away from STOSB entirely.

Yet then we'll end up with three conditional branches - do we really want
that?

> Interestingly, Linux doesn't have a STOSQ case at all.  Or rather, it
> was deleted by Linus in 20f3337d350c last year.  That change was also
> identified as causing a performance regression
> (https://lore.kernel.org/lkml/CANn89iKUbyrJ=r2+_kK+sb2ZSSHifFZ7QkPLDpAtkJ8v4WUumA@mail.gmail.com/T/#u),
> although the memset() path was not reverted as part of the fix
> (47ee3f1dd93bcb eventually).
> 
> Yet ca96b162bfd2 shows that REP MOVSQ is still definitely a win on Rome
> CPUs.
> 
> I expect we probably do want some non-rep forms in here.
> 
> Do you have any benchmarks with this series?

What I specifically measured were the clear_page() variants. I didn't do
any measurements for memset() (or memcpy()), first and foremost because
any selection of inputs is going to be arbitrary rather than representative.

>> --- /dev/null
>> +++ b/xen/arch/x86/memset.S
>> @@ -0,0 +1,30 @@
>> +#include <asm/asm_defns.h>
>> +
>> +.macro memset
>> +        and     $7, %edx
>> +        shr     $3, %rcx
>> +        movzbl  %sil, %esi
>> +        mov     $0x0101010101010101, %rax
>> +        imul    %rsi, %rax
>> +        mov     %rdi, %rsi
>> +        rep stosq
>> +        or      %edx, %ecx
>> +        jz      0f
>> +        rep stosb
>> +0:
>> +        mov     %rsi, %rax
> 
> Could you use %r8/9/etc instead of %rsi please?  This is deceptively
> close to looking like a bug, and it took me a while to figure out it's
> only correct because STOSB only edits %rdi.

Well, I can certainly switch (the number of REX prefixes will remain the
same, by the looks of it), but the fact that STOS, unlike MOVS, doesn't
touch %rsi is a pretty basic one.
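
For reference, the encodings involved (sketch):

        48 89 fe                mov     %rdi, %rsi
        49 89 f8                mov     %rdi, %r8

i.e. three bytes and a single REX prefix either way.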

Does your request extend to all uses of %rsi (and %esi), or merely the
latter two (across the REP STOS)?

Jan