Sapphire Rapids has both ERMS (of course) and FSRM.

sync_regs() runs into a corner case where both rep movsq and rep movsb
suffer a massive penalty when used to copy 168 bytes, a penalty which
clears itself when the data is copied with a bunch of movq instead.

I verified the issue is not present on AMD EPYC 9454; I don't know
about other Intel CPUs.
Details:
When benchmarking page faults (page_fault1 from will-it-scale),
sync_regs() shows up very high in the profile, performing a 168-byte
copy with rep movsq.
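For context, 168 bytes is sizeof(struct pt_regs) on x86-64: 21 saved
8-byte register slots. A standalone sketch of that arithmetic, with the
layout paraphrased (fake_pt_regs is just an illustration, not the real
struct from arch/x86/include/asm/ptrace.h):

#include <stdint.h>

/* Illustration only: same number and size of slots as x86-64 pt_regs. */
struct fake_pt_regs {
	uint64_t r15, r14, r13, r12, bp, bx;
	uint64_t r11, r10, r9, r8, ax, cx, dx, si, di;
	uint64_t orig_ax, ip, cs, flags, sp, ss;
};

_Static_assert(sizeof(struct fake_pt_regs) == 168, "21 regs * 8 bytes");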
I figured movsq still sucks on the uarch, so I patched the kernel to use
movsb instead, but performance barely budged.
However, forcing the copy to go through regular stores in memcpy_orig
(32 bytes per loop iteration plus an 8-byte tail) unclogs it.
Check this out (ops/s):
rep movsb in ___pi_memcpy:
min:1293689 max:1293689 total:1293689
min:1293969 max:1293969 total:1293969
min:1293845 max:1293845 total:1293845
min:1293436 max:1293436 total:1293436
hand-rolled mov loop in memcpy_orig:
min:1498050 max:1498050 total:1498050
min:1499041 max:1499041 total:1499041
min:1498283 max:1498283 total:1498283
min:1499701 max:1499701 total:1499701
... or just shy of 16% faster.
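For reference, here is a minimal C sketch of the kind of loop
memcpy_orig boils down to (four 8-byte loads/stores per iteration, then
an 8-byte tail). copy32 is a made-up name; the real memcpy_orig in
arch/x86/lib/memcpy_64.S is hand-written asm with its own small-size
handling:

#include <stddef.h>
#include <stdint.h>

/* Sketch: assumes n is a multiple of 8, which holds for the 168-byte
 * pt_regs copy (5 full iterations plus one 8-byte tail store).
 */
static void *copy32(void *dst, const void *src, size_t n)
{
	uint64_t *d = dst;
	const uint64_t *s = src;

	while (n >= 32) {
		d[0] = s[0];
		d[1] = s[1];
		d[2] = s[2];
		d[3] = s[3];
		d += 4;
		s += 4;
		n -= 32;
	}
	while (n >= 8) {
		*d++ = *s++;
		n -= 8;
	}
	return dst;
}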
I patched the kernel with a runtime-togglable tunable to select which
memcpy variant sync_regs() uses. The results reliably flip as I change
it at runtime.
perf top says:
rep movsb in ___pi_memcpy:
25.20% [kernel] [k] asm_exc_page_fault
14.60% [kernel] [k] __pi_memcpy
11.78% page_fault1_processes [.] testcase
4.71% [kernel] [k] _raw_spin_lock
2.36% [kernel] [k] __handle_mm_fault
2.00% [kernel] [k] clear_page_erms
hand-rolled mov loop in memcpy_orig:
27.99% [kernel] [k] asm_exc_page_fault
13.42% page_fault1_processes [.] testcase
5.46% [kernel] [k] _raw_spin_lock
2.72% [kernel] [k] __handle_mm_fault
2.48% [kernel] [k] clear_page_erms
[..]
0.59% [kernel] [k] memcpy_orig
0.04% [kernel] [k] __pi_memcpy
As you can see, the difference is staggering, and this has to be a
deficiency at least in this uarch.
When it comes to sync_regs() specifically, I think it makes some sense
to instead recode it in asm and perhaps issue the movs "by hand", which
would work around the immediate problem and shave off a function call
per page fault.
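A C-level sketch of that idea, for illustration only (the suggestion is
really to do this in asm; copy_pt_regs_by_hand is a made-up helper, and
an optimizing compiler might well turn the loop back into a memcpy
call):

/* Kernel context assumed: struct pt_regs from <asm/ptrace.h>,
 * BUILD_BUG_ON from <linux/build_bug.h>.  Copies one unsigned long at
 * a time so that, ideally, plain mov loads/stores get emitted instead
 * of a rep movs.
 */
static __always_inline void copy_pt_regs_by_hand(struct pt_regs *dst,
						 const struct pt_regs *src)
{
	unsigned long *d = (unsigned long *)dst;
	const unsigned long *s = (const unsigned long *)src;
	size_t i;

	BUILD_BUG_ON(sizeof(struct pt_regs) % sizeof(unsigned long));

	for (i = 0; i < sizeof(struct pt_regs) / sizeof(unsigned long); i++)
		d[i] = s[i];
}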
However, per the profile results above, there is at least one case
where rep movsb-based memcpy can grossly underperform, and someone(tm)
should investigate what's going on there. Also note the kernel inlines
plain rep movsb for copy to/from user if FSRM is present, so that path
may be susceptible to the same problem.
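For reference, that user-copy path is essentially an inline rep movsb
selected via an alternative, roughly of the shape below (a paraphrased
sketch from memory, not the exact source; see copy_user_generic() in
arch/x86/include/asm/uaccess_64.h, and note the real code also handles
STAC/CLAC, exception fixups and extra constraints/clobbers):

static __always_inline unsigned long
fsrm_copy_sketch(void *to, const void *from, unsigned long len)
{
	/* With FSRM: plain rep movsb; otherwise call the fallback. */
	asm volatile(
		ALTERNATIVE("rep movsb",
			    "call rep_movs_alternative",
			    ALT_NOT(X86_FEATURE_FSRM))
		: "+c" (len), "+D" (to), "+S" (from)
		: : "memory");
	return len;
}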
Maybe this is a matter of misalignment of the target or some other
bullshit; I have not tested and I don't have the time to dig into it.
I would expect someone better clued in on this area to figure it out in
less time than I would need, hence I'm throwing it out there.
The tunable is usable as follows:
sysctl fs.magic_tunable=0 # rep movsb
sysctl fs.magic_tunable=1 # regular movs
hack:
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 6b22611e69cc..f5fd69b2dc5b 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -915,6 +915,9 @@ DEFINE_IDTENTRY_RAW(exc_int3)
}
#ifdef CONFIG_X86_64
+extern unsigned long magic_tunable;
+void *memcpy_orig(void *dest, const void *src, size_t n);
+
/*
* Help handler running on a per-cpu (IST or entry trampoline) stack
* to switch to the normal thread stack if the interrupted code was in
@@ -923,8 +926,10 @@ DEFINE_IDTENTRY_RAW(exc_int3)
asmlinkage __visible noinstr struct pt_regs *sync_regs(struct pt_regs *eregs)
{
struct pt_regs *regs = (struct pt_regs *)current_top_of_stack() - 1;
- if (regs != eregs)
- *regs = *eregs;
+ if (!magic_tunable)
+ __memcpy(regs, eregs, sizeof(struct pt_regs));
+ else
+ memcpy_orig(regs, eregs, sizeof(struct pt_regs));
return regs;
}
diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
index 12a23fa7c44c..0f67378625b4 100644
--- a/arch/x86/lib/memcpy_64.S
+++ b/arch/x86/lib/memcpy_64.S
@@ -31,8 +31,6 @@
* which the compiler could/should do much better anyway.
*/
SYM_TYPED_FUNC_START(__memcpy)
- ALTERNATIVE "jmp memcpy_orig", "", X86_FEATURE_FSRM
-
movq %rdi, %rax
movq %rdx, %rcx
rep movsb
@@ -44,7 +42,7 @@ SYM_FUNC_ALIAS_MEMFUNC(memcpy, __memcpy)
SYM_PIC_ALIAS(memcpy)
EXPORT_SYMBOL(memcpy)
-SYM_FUNC_START_LOCAL(memcpy_orig)
+SYM_TYPED_FUNC_START(memcpy_orig)
movq %rdi, %rax
cmpq $0x20, %rdx
diff --git a/fs/file_table.c b/fs/file_table.c
index cd4a3db4659a..de1ef700d144 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -109,6 +109,8 @@ static int proc_nr_files(const struct ctl_table *table, int write, void *buffer,
return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
}
+unsigned long magic_tunable;
+
static const struct ctl_table fs_stat_sysctls[] = {
{
.procname = "file-nr",
@@ -126,6 +128,16 @@ static const struct ctl_table fs_stat_sysctls[] = {
.extra1 = SYSCTL_LONG_ZERO,
.extra2 = SYSCTL_LONG_MAX,
},
+ {
+ .procname = "magic_tunable",
+ .data = &magic_tunable,
+ .maxlen = sizeof(magic_tunable),
+ .mode = 0644,
+ .proc_handler = proc_doulongvec_minmax,
+ .extra1 = SYSCTL_LONG_ZERO,
+ .extra2 = SYSCTL_LONG_MAX,
+ },
+
{
.procname = "nr_open",
.data = &sysctl_nr_open,
On Thu, 27 Nov 2025 07:55:27 +0100 Mateusz Guzik <mjguzik@gmail.com> wrote:
> Sapphire Rapids has both ERMS (of course) and FSRM.
>
> sync_regs() runs into a corner case where both rep movsq and rep movsb
> suffer a massive penalty when used to copy 168 bytes, a penalty which
> clears itself when the data is copied with a bunch of movq instead.
>
> I verified the issue is not present on AMD EPYC 9454; I don't know
> about other Intel CPUs.

On pretty much all Intel CPUs 'rep movsb' and 'rep movsq' seem to be
implemented in the same hardware, so the length in the 'q' case is just
multiplied by 8. (That goes all the way back to Sandy Bridge.)

I'm guessing all the copies are at the same page alignment?

I found some strange alignment-related issues on a Zen 5 CPU. Mostly
neither the source nor destination alignment made much difference
(apart from, IIRC, 64-byte aligning the destination doubling
throughput), but some copies were horribly slow. It was something like
copies where the page offset of the destination was less than 64 bytes
from the page offset of the source and the source wasn't on a page
boundary (the byte alignment wasn't relevant). I wonder if Sapphire
Rapids has some similar perversion?

Or is that one of the big/little CPUs where most of the cores are
actually Atom ones, which may not have either ERMS or FSRM?

I need to rerun those tests using data dependencies instead of lfence
and get a much better estimation of the instruction setup time, but I
am lacking old AMD and new Intel hardware.

	David
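A crude userspace probe of that page-offset theory could look like the
sketch below (illustration only: rep_movsb168 is a local inline-asm
helper, timing is raw rdtsc with no frequency or affinity control, and
the 64-byte offset stride may be too coarse to catch every case):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>

#define LEN   168		/* same size as the pt_regs copy */
#define ITERS 100000

/* Copy LEN bytes with a bare rep movsb. */
static void rep_movsb168(void *dst, const void *src)
{
	size_t n = LEN;

	asm volatile("rep movsb"
		     : "+D" (dst), "+S" (src), "+c" (n)
		     : : "memory");
}

int main(void)
{
	char *s, *d;

	/* Two pages each so offset + LEN never runs off the buffer. */
	if (posix_memalign((void **)&s, 4096, 2 * 4096) ||
	    posix_memalign((void **)&d, 4096, 2 * 4096))
		return 1;

	for (int soff = 0; soff < 4096; soff += 64) {
		for (int doff = 0; doff < 4096; doff += 64) {
			uint64_t t0 = __rdtsc();
			for (int i = 0; i < ITERS; i++)
				rep_movsb168(d + doff, s + soff);
			uint64_t dt = __rdtsc() - t0;
			printf("soff=%4d doff=%4d cycles/copy=%.1f\n",
			       soff, doff, (double)dt / ITERS);
		}
	}
	return 0;
}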