Sapphire Rapids has both ERMS (of course) and FSRM.

sync_regs() runs into a corner case where both rep movsq and rep movsb
suffer a massive penalty when used to copy 168 bytes, a penalty which
clears itself when the data is copied with a bunch of movq instead.

I verified the issue is not present on AMD EPYC 9454; I don't know
about other Intel CPUs.
Details:
When benchmarking page faults (page_fault1 from will-it-scale),
sync_regs() shows up very high in the profile, performing a 168-byte
copy with rep movsq.
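For context, 168 bytes is sizeof(struct pt_regs) on x86-64: 21 saved
8-byte register slots. A standalone sketch of that arithmetic, with the
layout paraphrased (fake_pt_regs is just an illustration, not the real
struct from arch/x86/include/asm/ptrace.h):

#include <stdint.h>

/* Illustration only: same number and size of slots as x86-64 pt_regs. */
struct fake_pt_regs {
	uint64_t r15, r14, r13, r12, bp, bx;
	uint64_t r11, r10, r9, r8, ax, cx, dx, si, di;
	uint64_t orig_ax, ip, cs, flags, sp, ss;
};

_Static_assert(sizeof(struct fake_pt_regs) == 168, "21 regs * 8 bytes");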
I figured movsq still sucks on the uarch, so I patched the kernel to use
movsb instead, but performance barely budged.
However, forcing the copy to go through regular stores in memcpy_orig
(32 bytes per loop iteration plus an 8-byte tail) unclogs it.
Check this out (ops/s):
rep movsb in ___pi_memcpy:
min:1293689 max:1293689 total:1293689
min:1293969 max:1293969 total:1293969
min:1293845 max:1293845 total:1293845
min:1293436 max:1293436 total:1293436
hand-rolled mov loop in memcpy_orig:
min:1498050 max:1498050 total:1498050
min:1499041 max:1499041 total:1499041
min:1498283 max:1498283 total:1498283
min:1499701 max:1499701 total:1499701
... or just shy of 16% faster.
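For reference, here is a minimal C sketch of the kind of loop
memcpy_orig boils down to (four 8-byte loads/stores per iteration, then
an 8-byte tail). copy32 is a made-up name; the real memcpy_orig in
arch/x86/lib/memcpy_64.S is hand-written asm with its own small-size
handling:

#include <stddef.h>
#include <stdint.h>

/* Sketch: assumes n is a multiple of 8, which holds for the 168-byte
 * pt_regs copy (5 full iterations plus one 8-byte tail store).
 */
static void *copy32(void *dst, const void *src, size_t n)
{
	uint64_t *d = dst;
	const uint64_t *s = src;

	while (n >= 32) {
		d[0] = s[0];
		d[1] = s[1];
		d[2] = s[2];
		d[3] = s[3];
		d += 4;
		s += 4;
		n -= 32;
	}
	while (n >= 8) {
		*d++ = *s++;
		n -= 8;
	}
	return dst;
}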
I patched the kernel with a runtime-togglable tunable to select which
memcpy variant sync_regs() uses. The results reliably flip as I change
it at runtime.
perf top says:
rep movsb in ___pi_memcpy:
25.20% [kernel] [k] asm_exc_page_fault
14.60% [kernel] [k] __pi_memcpy
11.78% page_fault1_processes [.] testcase
4.71% [kernel] [k] _raw_spin_lock
2.36% [kernel] [k] __handle_mm_fault
2.00% [kernel] [k] clear_page_erms
hand-rolled mov loop in memcpy_orig:
27.99% [kernel] [k] asm_exc_page_fault
13.42% page_fault1_processes [.] testcase
5.46% [kernel] [k] _raw_spin_lock
2.72% [kernel] [k] __handle_mm_fault
2.48% [kernel] [k] clear_page_erms
[..]
0.59% [kernel] [k] memcpy_orig
0.04% [kernel] [k] __pi_memcpy
As you can see, the difference is staggering, and this has to be a
deficiency at least in this uarch.
When it comes to sync_regs() specifically, I think it makes some sense
to instead recode it in asm and perhaps issue the movs "by hand", which
would work around the immediate problem and shave off a function call
per page fault.
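A C-level sketch of that idea, for illustration only (the suggestion is
really to do this in asm; copy_pt_regs_by_hand is a made-up helper, and
an optimizing compiler might well turn the loop back into a memcpy
call):

/* Kernel context assumed: struct pt_regs from <asm/ptrace.h>,
 * BUILD_BUG_ON from <linux/build_bug.h>.  Copies one unsigned long at
 * a time so that, ideally, plain mov loads/stores get emitted instead
 * of a rep movs.
 */
static __always_inline void copy_pt_regs_by_hand(struct pt_regs *dst,
						 const struct pt_regs *src)
{
	unsigned long *d = (unsigned long *)dst;
	const unsigned long *s = (const unsigned long *)src;
	size_t i;

	BUILD_BUG_ON(sizeof(struct pt_regs) % sizeof(unsigned long));

	for (i = 0; i < sizeof(struct pt_regs) / sizeof(unsigned long); i++)
		d[i] = s[i];
}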
However, per the profile results above, there is at least one case
where rep movsb-based memcpy can grossly underperform, and someone(tm)
should investigate what's going on there. Also note the kernel inlines
plain rep movsb for copy to/from user if FSRM is present, so that path
may be susceptible to the same problem.
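For reference, that user-copy path is essentially an inline rep movsb
selected via an alternative, roughly of the shape below (a paraphrased
sketch from memory, not the exact source; see copy_user_generic() in
arch/x86/include/asm/uaccess_64.h, and note the real code also handles
STAC/CLAC, exception fixups and extra constraints/clobbers):

static __always_inline unsigned long
fsrm_copy_sketch(void *to, const void *from, unsigned long len)
{
	/* With FSRM: plain rep movsb; otherwise call the fallback. */
	asm volatile(
		ALTERNATIVE("rep movsb",
			    "call rep_movs_alternative",
			    ALT_NOT(X86_FEATURE_FSRM))
		: "+c" (len), "+D" (to), "+S" (from)
		: : "memory");
	return len;
}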
Maybe this is a matter of misalignment of the target or some other
bullshit; I have not tested and I don't have the time to dig into it.
I would expect someone better clued in on this area to figure it out in
less time than I would need, hence I'm throwing it out there.
The tunable is usable as follows:
sysctl fs.magic_tunable=0 # rep movsb
sysctl fs.magic_tunable=1 # regular movs
hack:
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 6b22611e69cc..f5fd69b2dc5b 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -915,6 +915,9 @@ DEFINE_IDTENTRY_RAW(exc_int3)
}
#ifdef CONFIG_X86_64
+extern unsigned long magic_tunable;
+void *memcpy_orig(void *dest, const void *src, size_t n);
+
/*
* Help handler running on a per-cpu (IST or entry trampoline) stack
* to switch to the normal thread stack if the interrupted code was in
@@ -923,8 +926,10 @@ DEFINE_IDTENTRY_RAW(exc_int3)
asmlinkage __visible noinstr struct pt_regs *sync_regs(struct pt_regs *eregs)
{
struct pt_regs *regs = (struct pt_regs *)current_top_of_stack() - 1;
- if (regs != eregs)
- *regs = *eregs;
+ if (!magic_tunable)
+ __memcpy(regs, eregs, sizeof(struct pt_regs));
+ else
+ memcpy_orig(regs, eregs, sizeof(struct pt_regs));
return regs;
}
diff --git a/arch/x86/lib/memcpy_64.S b/arch/x86/lib/memcpy_64.S
index 12a23fa7c44c..0f67378625b4 100644
--- a/arch/x86/lib/memcpy_64.S
+++ b/arch/x86/lib/memcpy_64.S
@@ -31,8 +31,6 @@
* which the compiler could/should do much better anyway.
*/
SYM_TYPED_FUNC_START(__memcpy)
- ALTERNATIVE "jmp memcpy_orig", "", X86_FEATURE_FSRM
-
movq %rdi, %rax
movq %rdx, %rcx
rep movsb
@@ -44,7 +42,7 @@ SYM_FUNC_ALIAS_MEMFUNC(memcpy, __memcpy)
SYM_PIC_ALIAS(memcpy)
EXPORT_SYMBOL(memcpy)
-SYM_FUNC_START_LOCAL(memcpy_orig)
+SYM_TYPED_FUNC_START(memcpy_orig)
movq %rdi, %rax
cmpq $0x20, %rdx
diff --git a/fs/file_table.c b/fs/file_table.c
index cd4a3db4659a..de1ef700d144 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -109,6 +109,8 @@ static int proc_nr_files(const struct ctl_table *table, int write, void *buffer,
return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
}
+unsigned long magic_tunable;
+
static const struct ctl_table fs_stat_sysctls[] = {
{
.procname = "file-nr",
@@ -126,6 +128,16 @@ static const struct ctl_table fs_stat_sysctls[] = {
.extra1 = SYSCTL_LONG_ZERO,
.extra2 = SYSCTL_LONG_MAX,
},
+ {
+ .procname = "magic_tunable",
+ .data = &magic_tunable,
+ .maxlen = sizeof(magic_tunable),
+ .mode = 0644,
+ .proc_handler = proc_doulongvec_minmax,
+ .extra1 = SYSCTL_LONG_ZERO,
+ .extra2 = SYSCTL_LONG_MAX,
+ },
+
{
.procname = "nr_open",
.data = &sysctl_nr_open,
On Thu, 27 Nov 2025 07:55:27 +0100 Mateusz Guzik <mjguzik@gmail.com> wrote:
> Sapphire Rapids has both ERMS (of course) and FSRM.
>
> sync_regs() runs into a corner case where both rep movsq and rep movsb
> suffer a massive penalty when used to copy 168 bytes, a penalty which
> clears itself when the data is copied with a bunch of movq instead.
>
> I verified the issue is not present on AMD EPYC 9454; I don't know
> about other Intel CPUs.

On pretty much all Intel CPUs 'rep movsb' and 'rep movsq' seem to be
implemented in the same hardware, so the length in the 'q' case is just
multiplied by 8. (That goes all the way back to Sandy Bridge.)

I'm guessing all the copies are at the same page alignment?

I found some strange alignment-related issues on a Zen 5 CPU. Mostly
neither the source nor destination alignment made much difference
(apart from, IIRC, 64-byte aligning the destination doubling
throughput), but some copies were horribly slow. It was something like
copies where the page offset of the destination was less than 64 bytes
from the page offset of the source and the source wasn't on a page
boundary (the byte alignment wasn't relevant). I wonder if Sapphire
Rapids has some similar perversion?

Or is that one of the big/little CPUs where most of the cores are
actually Atom ones, which may not have either ERMS or FSRM?

I need to rerun those tests using data dependencies instead of lfence
and get a much better estimation of the instruction setup time, but I
am lacking old AMD and new Intel hardware.

	David
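A crude userspace probe of that page-offset theory could look like the
sketch below (illustration only: rep_movsb168 is a local inline-asm
helper, timing is raw rdtsc with no frequency or affinity control, and
the 64-byte offset stride may be too coarse to catch every case):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>

#define LEN   168		/* same size as the pt_regs copy */
#define ITERS 100000

/* Copy LEN bytes with a bare rep movsb. */
static void rep_movsb168(void *dst, const void *src)
{
	size_t n = LEN;

	asm volatile("rep movsb"
		     : "+D" (dst), "+S" (src), "+c" (n)
		     : : "memory");
}

int main(void)
{
	char *s, *d;

	/* Two pages each so offset + LEN never runs off the buffer. */
	if (posix_memalign((void **)&s, 4096, 2 * 4096) ||
	    posix_memalign((void **)&d, 4096, 2 * 4096))
		return 1;

	for (int soff = 0; soff < 4096; soff += 64) {
		for (int doff = 0; doff < 4096; doff += 64) {
			uint64_t t0 = __rdtsc();
			for (int i = 0; i < ITERS; i++)
				rep_movsb168(d + doff, s + soff);
			uint64_t dt = __rdtsc() - t0;
			printf("soff=%4d doff=%4d cycles/copy=%.1f\n",
			       soff, doff, (double)dt / ITERS);
		}
	}
	return 0;
}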