Not a real submission yet as I would like results from other people.
tl;dr when benchmarking compilation of a hello-world program I'm getting
a 1.7% increase in throughput on Sapphire Rapids when convincing the
compiler to only use regular stores for inlined memset and memcpy.
Note this uarch does have FSRM and still benefits from not using it for
some cases.
I am not in a position to bench this on other CPUs; it would be nice if
someone did it on AMD.
Onto the business:
The kernel is chock full of inlined rep movsq and rep stosq, including
in hot paths, and these are known to be detrimental to performance below
certain sizes.
Most notably in sync_regs:
<+0>: endbr64
<+4>: mov %gs:0x22ca5d4(%rip),%rax # 0xffffffff8450f010 <cpu_current_top_of_stack>
<+12>: mov %rdi,%rsi
<+15>: sub $0xa8,%rax
<+21>: cmp %rdi,%rax
<+24>: je 0xffffffff82244a55 <sync_regs+37>
<+26>: mov $0x15,%ecx
<+31>: mov %rax,%rdi
<+34>: rep movsq %ds:(%rsi),%es:(%rdi)
<+37>: jmp 0xffffffff82256ba0 <__x86_return_thunk>
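For reference, this is just a fixed-size copy that gcc expands inline. A
standalone toy example along these lines (the struct below is sized to match
the 0xa8 bytes seen above; it is not the kernel's struct pt_regs) produces the
same kind of "mov $0x15,%ecx; rep movsq" sequence with gcc's default x86-64
tuning:

/*
 * Illustrative only: a 168-byte (0xa8) known-size copy.  With the default
 * memcpy strategy gcc is expected to expand this into a rep movsq of 21
 * quadwords rather than a call to memcpy.
 */
struct fake_regs {
	unsigned long r[21];		/* 21 * 8 == 0xa8 bytes */
};

void copy_regs(struct fake_regs *dst, const struct fake_regs *src)
{
	__builtin_memcpy(dst, src, sizeof(*dst));
}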
When issuing hello-world compiles in a loop this is over 1% of total CPU
time as reported by perf. With the kernel recompiled to instead do a
copy with regular stores this drops to 0.13%.
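The profile numbers come from system-wide sampling while the benchmark loops;
an invocation along these lines (not necessarily the exact one used) is enough
to see sync_regs show up:

perf record -a -g -- sleep 10
perf report --no-children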
Recompiled with regular stores, sync_regs() looks like this:
<+0>: endbr64
<+4>: mov %gs:0x22b9f44(%rip),%rax # 0xffffffff8450f010 <cpu_current_top_of_stack>
<+12>: sub $0xa8,%rax
<+18>: cmp %rdi,%rax
<+21>: je 0xffffffff82255114 <sync_regs+84>
<+23>: xor %ecx,%ecx
<+25>: mov %ecx,%edx
<+27>: add $0x20,%ecx
<+30>: mov (%rdi,%rdx,1),%r10
<+34>: mov 0x8(%rdi,%rdx,1),%r9
<+39>: mov 0x10(%rdi,%rdx,1),%r8
<+44>: mov 0x18(%rdi,%rdx,1),%rsi
<+49>: mov %r10,(%rax,%rdx,1)
<+53>: mov %r9,0x8(%rax,%rdx,1)
<+58>: mov %r8,0x10(%rax,%rdx,1)
<+63>: mov %rsi,0x18(%rax,%rdx,1)
<+68>: cmp $0xa0,%ecx
<+74>: jb 0xffffffff822550d9 <sync_regs+25>
<+76>: mov (%rdi,%rcx,1),%rdx
<+80>: mov %rdx,(%rax,%rcx,1)
<+84>: jmp 0xffffffff822673e0 <__x86_return_thunk>
bloat-o-meter says:
Total: Before=30021301, After=30089151, chg +0.23%
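That is the standard comparison of the two builds with the in-tree script,
i.e. something like:

./scripts/bloat-o-meter vmlinux.old vmlinux.new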
There are of course other spots which are modified and they also see a
reduction in time spent.
Bench results are the number of compilations completed in a 10-second
period, with /tmp backed by tmpfs:
before:
978 ops (97 ops/s)
979 ops (97 ops/s)
978 ops (97 ops/s)
979 ops (97 ops/s)
979 ops (97 ops/s)
979 ops (97 ops/s)
979 ops (97 ops/s)
979 ops (97 ops/s)
979 ops (97 ops/s)
979 ops (97 ops/s)
after:
997 ops (99 ops/s)
997 ops (99 ops/s)
997 ops (99 ops/s)
997 ops (99 ops/s)
997 ops (99 ops/s)
997 ops (99 ops/s)
997 ops (99 ops/s)
997 ops (99 ops/s)
997 ops (99 ops/s)
996 ops (99 ops/s)
I'm running this with debian 12 userspace (gcc 12.2.0).
I asked the LKP folks to bench this but have not received a response yet:
https://lore.kernel.org/oe-lkp/CAGudoHHd8TkyA1kOQ2KtZdZJ2VxUW=2mP-JR0t_oR07TfrwN8w@mail.gmail.com/
Repro instructions:
for i in $(seq 1 10); do taskset --cpu-list 1 ./ccbench 10; done
taskset is important as otherwise processes roam around the box big
time.
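To double check that the rebuilt kernel actually lost the rep movsq, the
listings above can be reproduced with something like (matching the gdb
disassembly format used here):

gdb -batch -ex 'disassemble sync_regs' vmlinux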
Attached files are:
- cc.c for will-it-scale if someone wants to profile the thing while it
loops indefinitely
- src0.c -- hello world for reference, plop into /src/src0.c
- ccbench.c is the bench; compile with cc -O2 -o ccbench ccbench.c
It spawns gcc through system(), forcing it to go through the shell, which
mimics what happens when compiling with make.
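
ccbench.c itself is attached rather than inlined; a minimal sketch of the same
idea (names and details below are illustrative, not the attached file) looks
like this:

/*
 * Rough sketch of a ccbench-style loop: count how many hello-world compiles
 * finish within the given number of seconds.  Illustration only -- see the
 * attached ccbench.c for the real thing.
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv)
{
	int secs = argc > 1 ? atoi(argv[1]) : 10;
	time_t end;
	long ops = 0;

	if (secs <= 0)
		secs = 10;
	end = time(NULL) + secs;

	while (time(NULL) < end) {
		/* system() goes through the shell, like make spawning cc */
		if (system("cc -o /tmp/a.out /src/src0.c"))
			exit(1);
		ops++;
	}
	printf("%ld ops (%ld ops/s)\n", ops, ops / secs);
	return 0;
}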
arch/x86/Makefile | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 9b76e77ff7f7..1a1afcc3041f 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -198,6 +198,29 @@ ifeq ($(CONFIG_STACKPROTECTOR),y)
endif
endif
+ifdef CONFIG_CC_IS_GCC
+#
+# Inline memcpy and memset handling policy for gcc.
+#
+# For ops of sizes known at compilation time it quickly resorts to issuing rep
+# movsq and stosq. On most uarchs rep-prefixed ops have a significant startup
+# latency and it is faster to issue regular stores (even if in loops) to handle
+# small buffers.
+#
+# This of course comes at an expense in terms of i-cache footprint. bloat-o-meter
+# reported 0.23% increase for enabling these.
+#
+# We inline up to 256 bytes, which in the best case issues few movs, in the
+# worst case creates a 4 * 8 store loop.
+#
+# The upper limit was chosen semi-arbitrarily -- uarchs wildly differ between a
+# threshold past which a rep-prefixed op becomes faster, 256 being the lowest
+# common denominator. Someone(tm) should revisit this from time to time.
+#
+KBUILD_CFLAGS += -mmemcpy-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
+KBUILD_CFLAGS += -mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
+endif
+
#
# If the function graph tracer is used with mcount instead of fentry,
# '-maccumulate-outgoing-args' is needed to prevent a GCC bug
--
2.43.0
On Wed, 2 Apr 2025 15:42:40 +0200 Mateusz Guzik <mjguzik@gmail.com> wrote:
> Not a real submission yet as I would like results from other people.
>
> tl;dr when benchmarking compilation of a hello-world program I'm getting
> a 1.7% increase in throughput on Sapphire Rapids when convincing the
> compiler to only use regular stores for inlined memset and memcpy.
>
> Note this uarch does have FSRM and still benefits from not using it for
> some cases.
>
> I am not in a position to bench this on other CPUs; it would be nice if
> someone did it on AMD.

I did some benchmarking of 'rep movsb' on a zen 5.
Test is: mfence; rdpmc; mfence; test_code; mfence; rdpmc; mfence.

For large copies you get 64 bytes/clock.
Short copies (less than 128 bytes) are usually very cheap - maybe 5 clocks,
but it then jumps to 38 clocks.
And the 'elephant in the room' is when (dest - src) % 4096 is between 1 and 63.
In that case short copies jump to 55 clocks.
Otherwise alignment doesn't make much difference.

If those values are right you want to use 'rep movsb' for short copies,
but probably not for ones between 128 and 256 bytes!

I might need to run with an inner loop. The overhead for an empty test
(an asm block with "nop" instead of "rep movsb") is 180 clocks (and
subtracted from the above clock counts). But I've used the same scheme
for 'normal' instructions (testing ipcsum) and got sane results.

	David
On Thu, Apr 3, 2025 at 12:29 AM David Laight <david.laight.linux@gmail.com> wrote:
>
> On Wed, 2 Apr 2025 15:42:40 +0200
> Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> > Not a real submission yet as I would like results from other people.
> >
> > tl;dr when benchmarking compilation of a hello-world program I'm getting
> > a 1.7% increase in throughput on Sapphire Rapids when convincing the
> > compiler to only use regular stores for inlined memset and memcpy.
> >
> > Note this uarch does have FSRM and still benefits from not using it for
> > some cases.
> >
> > I am not in a position to bench this on other CPUs; it would be nice if
> > someone did it on AMD.
>
> I did some benchmarking of 'rep movsb' on a zen 5.
> Test is: mfence; rdpmc; mfence; test_code; mfence; rdpmc; mfence.
> For large copies you get 64 bytes/clock.
> Short copies (less than 128 bytes) are usually very cheap - maybe 5 clocks
> But it then jumps to 38 clocks.
> And the 'elephant in the room' is when (dest - src) % 4096 is between 1 and 63.
> In that case short copies jump to 55 clocks.
> Otherwise alignment doesn't make much difference.
>

I think this roughly follows the advice on how to do benchmarks, but at
the same time I think it has too much potential to distort differences
when it comes to these routines.

The fence forces the CPU to get rid of the state accumulated prior and
probably prevents it from speculatively messing with instructions after.
But what if uarchs tolerate a mov loop better than rep mov? (up to a
point of course)

Based on my tests with running the compiler, Sapphire Rapids does prefer
the loop approach at least till 256 bytes, despite Fast Short REP MOV.
This can be seen in sync_regs() for example.

If you are up to it, I would appreciate if you ran the actual bench as
described in my opening mail. It is not hard to set up, but it does
require rebuilding the kernel. Perhaps you can do it in a vm, it's not a
scalability bench.

--
Mateusz Guzik <mjguzik gmail.com>
On Wed, 2 Apr 2025 at 06:42, Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> +ifdef CONFIG_CC_IS_GCC
> +#
> +# Inline memcpy and memset handling policy for gcc.
> +#
> +# For ops of sizes known at compilation time it quickly resorts to issuing rep
> +# movsq and stosq. On most uarchs rep-prefixed ops have a significant startup
> +# latency and it is faster to issue regular stores (even if in loops) to handle
> +# small buffers.
> +#
> +# This of course comes at an expense in terms of i-cache footprint. bloat-o-meter
> +# reported 0.23% increase for enabling these.
> +#
> +# We inline up to 256 bytes, which in the best case issues few movs, in the
> +# worst case creates a 4 * 8 store loop.
> +#
> +# The upper limit was chosen semi-arbitrarily -- uarchs wildly differ between a
> +# threshold past which a rep-prefixed op becomes faster, 256 being the lowest
> +# common denominator. Someone(tm) should revisit this from time to time.
> +#
> +KBUILD_CFLAGS += -mmemcpy-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
> +KBUILD_CFLAGS += -mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
> +endif

Please make this a gcc bug-report instead - I really don't want to
have random compiler-specific tuning options in the kernel.

Because that whole memcpy-strategy thing is something that gets tuned
by a lot of other compiler options (ie -march and different versions).

             Linus
On Wed, Apr 2, 2025 at 6:22 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Wed, 2 Apr 2025 at 06:42, Mateusz Guzik <mjguzik@gmail.com> wrote:
> >
> > +ifdef CONFIG_CC_IS_GCC
> > +#
> > +# Inline memcpy and memset handling policy for gcc.
> > +#
> > +# For ops of sizes known at compilation time it quickly resorts to issuing rep
> > +# movsq and stosq. On most uarchs rep-prefixed ops have a significant startup
> > +# latency and it is faster to issue regular stores (even if in loops) to handle
> > +# small buffers.
> > +#
> > +# This of course comes at an expense in terms of i-cache footprint. bloat-o-meter
> > +# reported 0.23% increase for enabling these.
> > +#
> > +# We inline up to 256 bytes, which in the best case issues few movs, in the
> > +# worst case creates a 4 * 8 store loop.
> > +#
> > +# The upper limit was chosen semi-arbitrarily -- uarchs wildly differ between a
> > +# threshold past which a rep-prefixed op becomes faster, 256 being the lowest
> > +# common denominator. Someone(tm) should revisit this from time to time.
> > +#
> > +KBUILD_CFLAGS += -mmemcpy-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
> > +KBUILD_CFLAGS += -mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
> > +endif
>
> Please make this a gcc bug-report instead - I really don't want to
> have random compiler-specific tuning options in the kernel.
>
> Because that whole memcpy-strategy thing is something that gets tuned
> by a lot of other compiler options (ie -march and different versions).
>

Ok.

--
Mateusz Guzik <mjguzik gmail.com>
> Please make this a gcc bug-report instead - I really don't want to
> have random compiler-specific tuning options in the kernel. Because
> that whole memcpy-strategy thing is something that gets tuned by a lot
> of other compiler options (ie -march and different versions).

I've discussed this with PeterZ in the past, although I can't for the
life of me find the bugzilla ticket I thought I opened on the matter.
(Maybe I never got that far).

The behaviour wanted is:

1) Convert to plain accesses (so they can be merged/combined/etc), or
2) Emit a library call

because we do provide forms that are better than the GCC-chosen "REP
MOVSQ with manual alignment" in the general case.

Taking a leaf out of the repoline book, the ideal library call(s) would be:

    CALL __x86_thunk_rep_{mov,stos}sb

using the REP ABI (parameters in %rcx/%rdi/etc), rather than the SYSV ABI.

For current/future processors, which have fast reps of all short/zero
flavours, we can even inline the REP {MOV,STO}S instruction to avoid the
call.

For older microarchitectures, they can reuse the existing memcpy/memset
implementations, just with marginally less parameter shuffling.

How does this sound?

~Andrew
On Wed, 2 Apr 2025 at 11:17, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> Taking a leaf out of the repoline book, the ideal library call(s) would be:
>
> CALL __x86_thunk_rep_{mov,stos}sb
>
> using the REP ABI (parameters in %rcx/%rdi/etc), rather than the SYSV ABI.

Yes. That's basically what 'rep_movs_alternative' does so that we can
basically do a

        ALTERNATIVE("rep movsb",
                    "call rep_movs_alternative",
                    ALT_NOT(X86_FEATURE_FSRM))

but we only do this for user space copies exactly because we don't
have a good way to do it for compiler-generated ones.

If gcc just did an out-of-line call, but used the 'rep movs' "calling
convention", we would be able to basically do the rewriting
dynamically, replacing the call with an inlined "rep movsb" where
appropriate.

             Linus
On 02/04/2025 7:29 pm, Linus Torvalds wrote:
> On Wed, 2 Apr 2025 at 11:17, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>> Taking a leaf out of the repoline book, the ideal library call(s) would be:
>>
>> CALL __x86_thunk_rep_{mov,stos}sb
>>
>> using the REP ABI (parameters in %rcx/%rdi/etc), rather than the SYSV ABI.
> Yes. That's basically what 'rep_movs_alternative' does so that we can
> basically do a
>
>         ALTERNATIVE("rep movsb",
>                     "call rep_movs_alternative",
>                     ALT_NOT(X86_FEATURE_FSRM))
>
> but we only do this for user space copies exactly because we don't
> have a good way to do it for compiler-generated ones.
>
> If gcc just did an out-of-line call, but used the 'rep movs' "calling
> convention", we would be able to basically do the rewriting
> dynamically, replacing the call with an inlined "rep movsb" where
> appropriate.

You still want the compiler to be able to do a first-pass optimisation
over __builtin_mem*(), for elimination/merging/etc, but if it could stop
half way through what it currently does and just emit the library call,
that would be excellent.

~Andrew
On Wed, 2 Apr 2025 at 11:40, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>
> You still want the compiler to be able to do a first-pass optimisation
> over __builtin_mem*(), for elimination/merging/etc

Absolutely. That's literally why the kernel ends up using
__builtin_memcpy() in the first place - so that the compiler can turn
the (very common) small fixed-size cases into just regular moves.

The function call would only be for actual big moves and/or unknown sizes.

That said, it can be advantageous to have more than one function call.

For the user copy case (where we control the horizontal and the
vertical in the kernel) at one point I did have a patch that did
something fancier than just a single call-out.

Because even when the compiler doesn't know the exact length, it's
often known that it's a multiple of four, for example. It's not all
that uncommon to see code like this:

        memcpy(image->segment, segments, nr_segments * sizeof(*segments));

and an even more common case is "I'm doing an out-of-line call not
because the size is unknown, but because the size is big".

So I noted the size alignment of the size in the function name, so
that I could have different versions for the most obvious
straight-forward cases (notably the "just do word-at-a-time copies").

Admittedly my main use case for that was the big fixed-size case in
the stat code, which just does

        return copy_to_user(statbuf,&tmp,sizeof(tmp)) ? -EFAULT : 0;

and shows up like a sore thumb on some benchmarks.

But I've always hated the patch because it turns out that the real fix
is to get rid of the temporary buffer on the stack, and just copy the
fields one-by-one with 'unsafe_put_user()' and friends, but every time
I do that - and I've done it several times - I end up throwing the
patch away because it ends up being problematic on non-x86
architectures

(Reason: INIT_STRUCT_STAT64_PADDING(). Ugh).

             Linus
On Wed, Apr 2, 2025 at 8:56 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> Admittedly my main use case for that was the big fixed-size case in
> the stat code, which just does
>
>         return copy_to_user(statbuf,&tmp,sizeof(tmp)) ? -EFAULT : 0;
>
> and shows up like a sore thumb on some benchmarks.
>
> But I've always hated the patch because it turns out that the real fix
> is to get rid of the temporary buffer on the stack, and just copy the
> fields one-by-one with 'unsafe_put_user()' and friends, but every time
> I do that - and I've done it several times - I end up throwing the
> patch away because it ends up being problematic on non-x86
> architectures
>
> (Reason: INIT_STRUCT_STAT64_PADDING(). Ugh).

Well, one could do ugliness as follows, opt-in for interested uarchs,
pseudocode-wise:

__stat_fill_padding_start(user);
unsafe_put_user(user->somefield0, stat->somefield0);
__stat_fill_padding0(user);
unsafe_put_user(user->somefield1, stat->somefield1);
__stat_fill_padding1(user);
.....
__stat_fill_padding_end(user);

Where archs would provide padding macros as needed, expanding to
nothing or zeroing something.

I don't know if everyone has fields in the same order, it may be the
stores would not be clean here. I guess the two archs which really
matter are amd64 and arm64? Archs not interested still get
copy_to_user.

Anyhow, for stat specifically, I think there is a different way out and
I suspect it will be faster.

Things start with memsetting struct kstat and populating some of it,
and only some of the populated stuff is used to begin with. There are
some getattr calls to further fill out kstat. Finally the kstat thing
gets converted into userspace stat and copied out.

While you can coalesce kstat -> stat conversion and copyout into one
thing, there is still the cost paid for doing work the syscall is not
looking at.

Instead the vfs layer could fill out an on-stack stat buffer
immediately and copy that out. This should be less work (no kstat
memset for example) and not require anyone to mess with
unsafe_put_user.

I don't know if this is worth pursuing, maybe I'll hack it up out of
curiosity.

--
Mateusz Guzik <mjguzik gmail.com>
On Wed, Apr 02, 2025 at 07:17:03PM +0100, Andrew Cooper wrote:
> > Please make this a gcc bug-report instead - I really don't want to
> > have random compiler-specific tuning options in the kernel. Because
> > that whole memcpy-strategy thing is something that gets tuned by a lot
> > of other compiler options (ie -march and different versions).
>
> I've discussed this with PeterZ in the past, although I can't for the
> life of me find the bugzilla ticket I thought I opened on the matter.
> (Maybe I never got that far).
>
> The behaviour wanted is:
>
> 1) Convert to plain accesses (so they can be merged/combined/etc), or
> 2) Emit a library call
>
> because we do provide forms that are better than the GCC-chosen "REP
> MOVSQ with manual alignment" in the general case.
>
> Taking a leaf out of the repoline book, the ideal library call(s) would be:
>
> CALL __x86_thunk_rep_{mov,stos}sb
>
> using the REP ABI (parameters in %rcx/%rdi/etc), rather than the SYSV ABI.
>
> For current/future processors, which have fast reps of all short/zero
> flavours, we can even inline the REP {MOV,STO}S instruction to avoid the
> call.
>
> For older microarchitectures, they can reuse the existing memcpy/memset
> implementations, just with marginally less parameter shuffling.
>
> How does this sound?

Right, vague memories indeed. We do something like this manually for
copy_user_generic().

But it would indeed be very nice if the compiler were to emit such thunk
calls instead of doing rep whatever and then we can objtool collect the
locations and patch at runtime to be 'rep movs' or not, depending on CPU
flags etc.