Not a real submission yet as I would like results from other people.
tl;dr when benchmarking compilation of a hello-world program I'm getting
a 1.7% increase in throughput on Sapphire Rapids when convincing the
compiler to only use regular stores for inlined memset and memcpy.
Note this uarch does have FSRM and still benefits from not using it for
some cases.
I am not in a position to bench this on other CPUs; it would be nice if
someone did it on AMD.
Onto the business:
The kernel is chock full of inlined rep movsq and rep stosq, including
in hot paths, and these are known to be detrimental to performance below
certain sizes.
Most notably in sync_regs:
<+0>: endbr64
<+4>: mov %gs:0x22ca5d4(%rip),%rax # 0xffffffff8450f010 <cpu_current_top_of_stack>
<+12>: mov %rdi,%rsi
<+15>: sub $0xa8,%rax
<+21>: cmp %rdi,%rax
<+24>: je 0xffffffff82244a55 <sync_regs+37>
<+26>: mov $0x15,%ecx
<+31>: mov %rax,%rdi
<+34>: rep movsq %ds:(%rsi),%es:(%rdi)
<+37>: jmp 0xffffffff82256ba0 <__x86_return_thunk>
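For reference, this is just a fixed-size copy that gcc expands inline. A
standalone toy example along these lines (the struct below is sized to match
the 0xa8 bytes seen above; it is not the kernel's struct pt_regs) produces the
same kind of "mov $0x15,%ecx; rep movsq" sequence with gcc's default x86-64
tuning:

/*
 * Illustrative only: a 168-byte (0xa8) known-size copy.  With the default
 * memcpy strategy gcc is expected to expand this into a rep movsq of 21
 * quadwords rather than a call to memcpy.
 */
struct fake_regs {
	unsigned long r[21];		/* 21 * 8 == 0xa8 bytes */
};

void copy_regs(struct fake_regs *dst, const struct fake_regs *src)
{
	__builtin_memcpy(dst, src, sizeof(*dst));
}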
When issuing hello-world compiles in a loop this is over 1% of total CPU
time as reported by perf. With the kernel recompiled to instead do a
copy with regular stores this drops to 0.13%.
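The profile numbers come from system-wide sampling while the benchmark loops;
an invocation along these lines (not necessarily the exact one used) is enough
to see sync_regs show up:

perf record -a -g -- sleep 10
perf report --no-children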
Recompiled with regular stores, sync_regs() looks like this:
<+0>: endbr64
<+4>: mov %gs:0x22b9f44(%rip),%rax # 0xffffffff8450f010 <cpu_current_top_of_stack>
<+12>: sub $0xa8,%rax
<+18>: cmp %rdi,%rax
<+21>: je 0xffffffff82255114 <sync_regs+84>
<+23>: xor %ecx,%ecx
<+25>: mov %ecx,%edx
<+27>: add $0x20,%ecx
<+30>: mov (%rdi,%rdx,1),%r10
<+34>: mov 0x8(%rdi,%rdx,1),%r9
<+39>: mov 0x10(%rdi,%rdx,1),%r8
<+44>: mov 0x18(%rdi,%rdx,1),%rsi
<+49>: mov %r10,(%rax,%rdx,1)
<+53>: mov %r9,0x8(%rax,%rdx,1)
<+58>: mov %r8,0x10(%rax,%rdx,1)
<+63>: mov %rsi,0x18(%rax,%rdx,1)
<+68>: cmp $0xa0,%ecx
<+74>: jb 0xffffffff822550d9 <sync_regs+25>
<+76>: mov (%rdi,%rcx,1),%rdx
<+80>: mov %rdx,(%rax,%rcx,1)
<+84>: jmp 0xffffffff822673e0 <__x86_return_thunk>
bloat-o-meter says:
Total: Before=30021301, After=30089151, chg +0.23%
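That is the standard comparison of the two builds with the in-tree script,
i.e. something like:

./scripts/bloat-o-meter vmlinux.old vmlinux.new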
There are of course other spots which are modified and they also see a
reduction in time spent.
Bench results are the number of compilations completed in a 10-second
period, with /tmp backed by tmpfs:
before:
978 ops (97 ops/s)
979 ops (97 ops/s)
978 ops (97 ops/s)
979 ops (97 ops/s)
979 ops (97 ops/s)
979 ops (97 ops/s)
979 ops (97 ops/s)
979 ops (97 ops/s)
979 ops (97 ops/s)
979 ops (97 ops/s)
after:
997 ops (99 ops/s)
997 ops (99 ops/s)
997 ops (99 ops/s)
997 ops (99 ops/s)
997 ops (99 ops/s)
997 ops (99 ops/s)
997 ops (99 ops/s)
997 ops (99 ops/s)
997 ops (99 ops/s)
996 ops (99 ops/s)
I'm running this with debian 12 userspace (gcc 12.2.0).
I asked the LKP folks to bench this but have not received a response yet:
https://lore.kernel.org/oe-lkp/CAGudoHHd8TkyA1kOQ2KtZdZJ2VxUW=2mP-JR0t_oR07TfrwN8w@mail.gmail.com/
Repro instructions:
for i in $(seq 1 10); do taskset --cpu-list 1 ./ccbench 10; done
taskset is important as otherwise processes roam around the box big
time.
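To double check that the rebuilt kernel actually lost the rep movsq, the
listings above can be reproduced with something like (matching the gdb
disassembly format used here):

gdb -batch -ex 'disassemble sync_regs' vmlinux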
Attached files are:
- cc.c for will-it-scale if someone wants to profile the thing while it
loops indefinitely
- src0.c -- hello world for reference, plop into /src/src0.c
- ccbench.c is the bench; compile with cc -O2 -o ccbench ccbench.c
It spawns gcc through system(), forcing it to go through the shell, which
mimics what happens when compiling with make.
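
ccbench.c itself is attached rather than inlined; a minimal sketch of the same
idea (names and details below are illustrative, not the attached file) looks
like this:

/*
 * Rough sketch of a ccbench-style loop: count how many hello-world compiles
 * finish within the given number of seconds.  Illustration only -- see the
 * attached ccbench.c for the real thing.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv)
{
	int secs = argc > 1 ? atoi(argv[1]) : 10;
	time_t end;
	long ops = 0;

	if (secs <= 0)
		secs = 10;
	end = time(NULL) + secs;

	while (time(NULL) < end) {
		/* system() goes through the shell, like make spawning cc */
		if (system("cc -o /tmp/a.out /src/src0.c"))
			exit(1);
		ops++;
	}
	printf("%ld ops (%ld ops/s)\n", ops, ops / secs);
	return 0;
}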
arch/x86/Makefile | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 9b76e77ff7f7..1a1afcc3041f 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -198,6 +198,29 @@ ifeq ($(CONFIG_STACKPROTECTOR),y)
endif
endif
+ifdef CONFIG_CC_IS_GCC
+#
+# Inline memcpy and memset handling policy for gcc.
+#
+# For ops of sizes known at compilation time it quickly resorts to issuing rep
+# movsq and stosq. On most uarchs rep-prefixed ops have a significant startup
+# latency and it is faster to issue regular stores (even if in loops) to handle
+# small buffers.
+#
+# This of course comes at an expense in terms of i-cache footprint. bloat-o-meter
+# reported 0.23% increase for enabling these.
+#
+# We inline up to 256 bytes, which in the best case issues few movs, in the
+# worst case creates a 4 * 8 store loop.
+#
+# The upper limit was chosen semi-arbitrarily -- uarchs wildly differ between a
+# threshold past which a rep-prefixed op becomes faster, 256 being the lowest
+# common denominator. Someone(tm) should revisit this from time to time.
+#
+KBUILD_CFLAGS += -mmemcpy-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
+KBUILD_CFLAGS += -mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
+endif
+
#
# If the function graph tracer is used with mcount instead of fentry,
# '-maccumulate-outgoing-args' is needed to prevent a GCC bug
--
2.43.0
On Wed, 2 Apr 2025 15:42:40 +0200 Mateusz Guzik <mjguzik@gmail.com> wrote:
> Not a real submission yet as I would like results from other people.
>
> tl;dr when benchmarking compilation of a hello-world program I'm getting
> a 1.7% increase in throughput on Sapphire Rapids when convincing the
> compiler to only use regular stores for inlined memset and memcpy.
>
> Note this uarch does have FSRM and still benefits from not using it for
> some cases.
>
> I am not in a position to bench this on other CPUs; it would be nice if
> someone did it on AMD.

I did some benchmarking of 'rep movsb' on a zen 5.
Test is: mfence; rdpmc; mfence; test_code; mfence; rdpmc; mfence.

For large copies you get 64 bytes/clock.
Short copies (less than 128 bytes) are usually very cheap - maybe 5 clocks,
but it then jumps to 38 clocks.
And the 'elephant in the room' is when (dest - src) % 4096 is between 1 and 63.
In that case short copies jump to 55 clocks.
Otherwise alignment doesn't make much difference.

If those values are right you want to use 'rep movsb' for short copies,
but probably not for ones between 128 and 256 bytes!

I might need to run with an inner loop. The overhead for an empty test
(an asm block with "nop" instead of "rep movsb") is 180 clocks (and
subtracted from the above clock counts). But I've used the same scheme
for 'normal' instructions (testing ipcsum) and got sane results.

	David
On Thu, Apr 3, 2025 at 12:29 AM David Laight <david.laight.linux@gmail.com> wrote:
>
> On Wed, 2 Apr 2025 15:42:40 +0200
> Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> > Not a real submission yet as I would like results from other people.
> >
> > tl;dr when benchmarking compilation of a hello-world program I'm getting
> > a 1.7% increase in throughput on Sapphire Rapids when convincing the
> > compiler to only use regular stores for inlined memset and memcpy.
> >
> > Note this uarch does have FSRM and still benefits from not using it for
> > some cases.
> >
> > I am not in a position to bench this on other CPUs; it would be nice if
> > someone did it on AMD.
>
> I did some benchmarking of 'rep movsb' on a zen 5.
> Test is: mfence; rdpmc; mfence; test_code; mfence; rdpmc; mfence.
> For large copies you get 64 bytes/clock.
> Short copies (less than 128 bytes) are usually very cheap - maybe 5 clocks
> But it then jumps to 38 clocks.
> And the 'elephant in the room' is when (dest - src) % 4096 is between 1 and 63.
> In that case short copies jump to 55 clocks.
> Otherwise alignment doesn't make much difference.
>

I think this roughly follows the advice on how to do benchmarks, but at
the same time I think it has too much potential to distort differences
when it comes to these routines.

The fence forces the CPU to get rid of the state accumulated prior and
probably prevents it from speculatively messing with instructions after.
But what if uarchs tolerate a mov loop better than rep mov? (up to a
point of course)

Based on my tests with running the compiler, Sapphire Rapids does prefer
the loop approach at least till 256 bytes, despite Fast Short REP MOV.
This can be seen in sync_regs() for example.

If you are up to it, I would appreciate if you ran the actual bench as
described in my opening mail. It is not hard to set up, but it does
require rebuilding the kernel. Perhaps you can do it in a vm, it's not a
scalability bench.

--
Mateusz Guzik <mjguzik gmail.com>
On Wed, 2 Apr 2025 at 06:42, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> +ifdef CONFIG_CC_IS_GCC
> +#
> +# Inline memcpy and memset handling policy for gcc.
> +#
> +# For ops of sizes known at compilation time it quickly resorts to issuing rep
> +# movsq and stosq. On most uarchs rep-prefixed ops have a significant startup
> +# latency and it is faster to issue regular stores (even if in loops) to handle
> +# small buffers.
> +#
> +# This of course comes at an expense in terms of i-cache footprint. bloat-o-meter
> +# reported 0.23% increase for enabling these.
> +#
> +# We inline up to 256 bytes, which in the best case issues few movs, in the
> +# worst case creates a 4 * 8 store loop.
> +#
> +# The upper limit was chosen semi-arbitrarily -- uarchs wildly differ between a
> +# threshold past which a rep-prefixed op becomes faster, 256 being the lowest
> +# common denominator. Someone(tm) should revisit this from time to time.
> +#
> +KBUILD_CFLAGS += -mmemcpy-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
> +KBUILD_CFLAGS += -mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
> +endif

Please make this a gcc bug-report instead - I really don't want to
have random compiler-specific tuning options in the kernel.

Because that whole memcpy-strategy thing is something that gets tuned
by a lot of other compiler options (ie -march and different versions).

             Linus
On Wed, Apr 2, 2025 at 6:22 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Wed, 2 Apr 2025 at 06:42, Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > +ifdef CONFIG_CC_IS_GCC
> > +#
> > +# Inline memcpy and memset handling policy for gcc.
> > +#
> > +# For ops of sizes known at compilation time it quickly resorts to issuing rep
> > +# movsq and stosq. On most uarchs rep-prefixed ops have a significant startup
> > +# latency and it is faster to issue regular stores (even if in loops) to handle
> > +# small buffers.
> > +#
> > +# This of course comes at an expense in terms of i-cache footprint. bloat-o-meter
> > +# reported 0.23% increase for enabling these.
> > +#
> > +# We inline up to 256 bytes, which in the best case issues few movs, in the
> > +# worst case creates a 4 * 8 store loop.
> > +#
> > +# The upper limit was chosen semi-arbitrarily -- uarchs wildly differ between a
> > +# threshold past which a rep-prefixed op becomes faster, 256 being the lowest
> > +# common denominator. Someone(tm) should revisit this from time to time.
> > +#
> > +KBUILD_CFLAGS += -mmemcpy-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
> > +KBUILD_CFLAGS += -mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
> > +endif
>
> Please make this a gcc bug-report instead - I really don't want to
> have random compiler-specific tuning options in the kernel.
>
> Because that whole memcpy-strategy thing is something that gets tuned
> by a lot of other compiler options (ie -march and different versions).
>

Ok.

--
Mateusz Guzik <mjguzik gmail.com>
> Please make this a gcc bug-report instead - I really don't want to
> have random compiler-specific tuning options in the kernel. Because
> that whole memcpy-strategy thing is something that gets tuned by a lot
> of other compiler options (ie -march and different versions).

I've discussed this with PeterZ in the past, although I can't for the
life of me find the bugzilla ticket I thought I opened on the matter.
(Maybe I never got that far).

The behaviour wanted is:

1) Convert to plain accesses (so they can be merged/combined/etc), or
2) Emit a library call

because we do provide forms that are better than the GCC-chosen "REP
MOVSQ with manual alignment" in the general case.

Taking a leaf out of the repoline book, the ideal library call(s) would be:

    CALL __x86_thunk_rep_{mov,stos}sb

using the REP ABI (parameters in %rcx/%rdi/etc), rather than the SYSV ABI.

For current/future processors, which have fast reps of all short/zero
flavours, we can even inline the REP {MOV,STO}S instruction to avoid the
call.

For older microarchitectures, they can reuse the existing memcpy/memset
implementations, just with marginally less parameter shuffling.

How does this sound?

~Andrew
On Wed, 2 Apr 2025 at 11:17, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> Taking a leaf out of the repoline book, the ideal library call(s) would be:
>
> CALL __x86_thunk_rep_{mov,stos}sb
>
> using the REP ABI (parameters in %rcx/%rdi/etc), rather than the SYSV ABI.

Yes. That's basically what 'rep_movs_alternative' does so that we can
basically do a

        ALTERNATIVE("rep movsb",
                    "call rep_movs_alternative",
                    ALT_NOT(X86_FEATURE_FSRM))

but we only do this for user space copies exactly because we don't
have a good way to do it for compiler-generated ones.

If gcc just did an out-of-line call, but used the 'rep movs' "calling
convention", we would be able to basically do the rewriting
dynamically, replacing the call with an inlined "rep movsb" where
appropriate.

             Linus
On 02/04/2025 7:29 pm, Linus Torvalds wrote:
> On Wed, 2 Apr 2025 at 11:17, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> Taking a leaf out of the repoline book, the ideal library call(s) would be:
>>
>> CALL __x86_thunk_rep_{mov,stos}sb
>>
>> using the REP ABI (parameters in %rcx/%rdi/etc), rather than the SYSV ABI.
> Yes. That's basically what 'rep_movs_alternative' does so that we can
> basically do a
>
>         ALTERNATIVE("rep movsb",
>                     "call rep_movs_alternative",
>                     ALT_NOT(X86_FEATURE_FSRM))
>
> but we only do this for user space copies exactly because we don't
> have a good way to do it for compiler-generated ones.
>
> If gcc just did an out-of-line call, but used the 'rep movs' "calling
> convention", we would be able to basically do the rewriting
> dynamically, replacing the call with an inlined "rep movsb" where
> appropriate.

You still want the compiler to be able to do a first-pass optimisation
over __builtin_mem*(), for elimination/merging/etc, but if it could stop
half way through what it currently does and just emit the library call,
that would be excellent.

~Andrew
On Wed, 2 Apr 2025 at 11:40, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> You still want the compiler to be able to do a first-pass optimisation
> over __builtin_mem*(), for elimination/merging/etc

Absolutely. That's literally why the kernel ends up using
__builtin_memcpy() in the first place - so that the compiler can turn
the (very common) small fixed-size cases into just regular moves.

The function call would only be for actual big moves and/or unknown sizes.

That said, it can be advantageous to have more than one function call.

For the user copy case (where we control the horizontal and the
vertical in the kernel) at one point I did have a patch that did
something fancier than just a single call-out.

Because even when the compiler doesn't know the exact length, it's
often known that it's a multiple of four, for example. It's not all
that uncommon to see code like this:

        memcpy(image->segment, segments, nr_segments * sizeof(*segments));

and an even more common case is "I'm doing an out-of-line call not
because the size is unknown, but because the size is big".

So I noted the size alignment of the size in the function name, so
that I could have different versions for the most obvious
straight-forward cases (notably the "just do word-at-a-time copies").

Admittedly my main use case for that was the big fixed-size case in
the stat code, which just does

        return copy_to_user(statbuf,&tmp,sizeof(tmp)) ? -EFAULT : 0;

and shows up like a sore thumb on some benchmarks.

But I've always hated the patch because it turns out that the real fix
is to get rid of the temporary buffer on the stack, and just copy the
fields one-by-one with 'unsafe_put_user()' and friends, but every time
I do that - and I've done it several times - I end up throwing the
patch away because it ends up being problematic on non-x86
architectures

(Reason: INIT_STRUCT_STAT64_PADDING(). Ugh).

             Linus
On Wed, Apr 2, 2025 at 8:56 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> Admittedly my main use case for that was the big fixed-size case in
> the stat code, which just does
>
>         return copy_to_user(statbuf,&tmp,sizeof(tmp)) ? -EFAULT : 0;
>
> and shows up like a sore thumb on some benchmarks.
>
> But I've always hated the patch because it turns out that the real fix
> is to get rid of the temporary buffer on the stack, and just copy the
> fields one-by-one with 'unsafe_put_user()' and friends, but every time
> I do that - and I've done it several times - I end up throwing the
> patch away because it ends up being problematic on non-x86
> architectures
>
> (Reason: INIT_STRUCT_STAT64_PADDING(). Ugh).

Well, one could do ugliness as follows, opt-in for interested uarchs,
pseudocode-wise:

__stat_fill_padding_start(user);
unsafe_put_user(user->somefield0, stat->somefield0);
__stat_fill_padding0(user);
unsafe_put_user(user->somefield1, stat->somefield1);
__stat_fill_padding1(user);
.....
__stat_fill_padding_end(user);

Where archs would provide padding macros as needed, expanding to
nothing or zeroing something.

I don't know if everyone has fields in the same order, it may be the
stores would not be clean here. I guess the two archs which really
matter are amd64 and arm64? Archs not interested still get
copy_to_user.

Anyhow, for stat specifically, I think there is a different way out and
I suspect it will be faster.

Things start with memsetting struct kstat and populating some of it,
and only some of the populated stuff is used to begin with. There are
some getattr calls to further fill out kstat. Finally the kstat thing
gets converted into userspace stat and copied out.

While you can coalesce kstat -> stat conversion and copyout into one
thing, there is still the cost paid for doing work the syscall is not
looking at.

Instead the vfs layer could fill out an on-stack stat buffer
immediately and copy that out. This should be less work (no kstat
memset for example) and not require anyone to mess with
unsafe_put_user.

I don't know if this is worth pursuing, maybe I'll hack it up out of
curiosity.

--
Mateusz Guzik <mjguzik gmail.com>
On Wed, Apr 02, 2025 at 07:17:03PM +0100, Andrew Cooper wrote:
> > Please make this a gcc bug-report instead - I really don't want to
> > have random compiler-specific tuning options in the kernel. Because
> > that whole memcpy-strategy thing is something that gets tuned by a lot
> > of other compiler options (ie -march and different versions).
>
> I've discussed this with PeterZ in the past, although I can't for the
> life of me find the bugzilla ticket I thought I opened on the matter.
> (Maybe I never got that far).
>
> The behaviour wanted is:
>
> 1) Convert to plain accesses (so they can be merged/combined/etc), or
> 2) Emit a library call
>
> because we do provide forms that are better than the GCC-chosen "REP
> MOVSQ with manual alignment" in the general case.
>
> Taking a leaf out of the repoline book, the ideal library call(s) would be:
>
> CALL __x86_thunk_rep_{mov,stos}sb
>
> using the REP ABI (parameters in %rcx/%rdi/etc), rather than the SYSV ABI.
>
> For current/future processors, which have fast reps of all short/zero
> flavours, we can even inline the REP {MOV,STO}S instruction to avoid the
> call.
>
> For older microarchitectures, they can reuse the existing memcpy/memset
> implementations, just with marginally less parameter shuffling.
>
> How does this sound?

Right, vague memories indeed. We do something like this manually for
copy_user_generic().

But it would indeed be very nice if the compiler were to emit such thunk
calls instead of doing rep whatever and then we can objtool collect the
locations and patch at runtime to be 'rep movs' or not, depending on CPU
flags etc.