clear_page_rep() and clear_page_erms() are wrappers around "REP; STOS"
variations. Inlining gets rid of the costly call/ret (for cases with
speculative execution related mitigations.)
Also, fix up and rename clear_page_orig() to adapt to the changed calling
convention.
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
arch/x86/include/asm/page_64.h | 19 ++++++++---------
arch/x86/lib/clear_page_64.S | 39 +++++++---------------------------
2 files changed, 17 insertions(+), 41 deletions(-)
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index 015d23f3e01f..596333bd0c73 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -40,23 +40,22 @@ extern unsigned long __phys_addr_symbol(unsigned long);
#define __phys_reloc_hide(x) (x)
-void clear_page_orig(void *page);
-void clear_page_rep(void *page);
-void clear_page_erms(void *page);
+void memzero_page_aligned_unrolled(void *addr, u64 len);
static inline void clear_page(void *page)
{
+ u64 len = PAGE_SIZE;
/*
* Clean up KMSAN metadata for the page being cleared. The assembly call
* below clobbers @page, so we perform unpoisoning before it.
*/
- kmsan_unpoison_memory(page, PAGE_SIZE);
- alternative_call_2(clear_page_orig,
- clear_page_rep, X86_FEATURE_REP_GOOD,
- clear_page_erms, X86_FEATURE_ERMS,
- "=D" (page),
- "D" (page),
- "cc", "memory", "rax", "rcx");
+ kmsan_unpoison_memory(page, len);
+ asm volatile(ALTERNATIVE_2("call memzero_page_aligned_unrolled",
+ "shrq $3, %%rcx; rep stosq", X86_FEATURE_REP_GOOD,
+ "rep stosb", X86_FEATURE_ERMS)
+ : "+c" (len), "+D" (page), ASM_CALL_CONSTRAINT
+ : "a" (0)
+ : "cc", "memory");
}
void copy_page(void *to, void *from);
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index a508e4a8c66a..27debe0c018c 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -6,30 +6,15 @@
#include <asm/asm.h>
/*
- * Most CPUs support enhanced REP MOVSB/STOSB instructions. It is
- * recommended to use this when possible and we do use them by default.
- * If enhanced REP MOVSB/STOSB is not available, try to use fast string.
- * Otherwise, use original.
+ * Zero page aligned region.
+ * %rdi - dest
+ * %rcx - length
*/
-
-/*
- * Zero a page.
- * %rdi - page
- */
-SYM_TYPED_FUNC_START(clear_page_rep)
- movl $4096/8,%ecx
- xorl %eax,%eax
- rep stosq
- RET
-SYM_FUNC_END(clear_page_rep)
-EXPORT_SYMBOL_GPL(clear_page_rep)
-
-SYM_TYPED_FUNC_START(clear_page_orig)
- xorl %eax,%eax
- movl $4096/64,%ecx
+SYM_TYPED_FUNC_START(memzero_page_aligned_unrolled)
+ shrq $6, %rcx
.p2align 4
.Lloop:
- decl %ecx
+ decq %rcx
#define PUT(x) movq %rax,x*8(%rdi)
movq %rax,(%rdi)
PUT(1)
@@ -43,16 +28,8 @@ SYM_TYPED_FUNC_START(clear_page_orig)
jnz .Lloop
nop
RET
-SYM_FUNC_END(clear_page_orig)
-EXPORT_SYMBOL_GPL(clear_page_orig)
-
-SYM_TYPED_FUNC_START(clear_page_erms)
- movl $4096,%ecx
- xorl %eax,%eax
- rep stosb
- RET
-SYM_FUNC_END(clear_page_erms)
-EXPORT_SYMBOL_GPL(clear_page_erms)
+SYM_FUNC_END(memzero_page_aligned_unrolled)
+EXPORT_SYMBOL_GPL(memzero_page_aligned_unrolled)
/*
* Default clear user-space.
--
2.31.1
Hi Ankur,

kernel test robot noticed the following build errors:

[auto build test ERROR on perf-tools-next/perf-tools-next]
[also build test ERROR on tip/perf/core perf-tools/perf-tools linus/master v6.16-rc2 next-20250616]
[cannot apply to acme/perf/core]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Ankur-Arora/perf-bench-mem-Remove-repetition-around-time-measurement/20250616-132651
base:   https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git perf-tools-next
patch link:    https://lore.kernel.org/r/20250616052223.723982-11-ankur.a.arora%40oracle.com
patch subject: [PATCH v4 10/13] x86/mm: Simplify clear_page_*
config: x86_64-buildonly-randconfig-004-20250616 (https://download.01.org/0day-ci/archive/20250617/202506170010.r7INoexI-lkp@intel.com/config)
compiler: clang version 20.1.2 (https://github.com/llvm/llvm-project 58df0ef89dd64126512e4ee27b4ac3fd8ddf6247)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250617/202506170010.r7INoexI-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202506170010.r7INoexI-lkp@intel.com/

All errors (new ones prefixed by >>):

>> ld.lld: error: undefined symbol: __kcfi_typeid_memzero_page_aligned_unrolled
   >>> referenced by usercopy_64.c
   >>>               vmlinux.o:(__cfi_memzero_page_aligned_unrolled)

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
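[The undefined __kcfi_typeid_* symbol is the kCFI (CONFIG_CFI_CLANG) type hash
that the compiler only emits for C declarations whose address is taken;
SYM_TYPED_FUNC_START references it from the asm side via the __cfi_* preamble.
With alternative_call_2() gone, nothing in C takes the helper's address any
more, so the reference dangles. One possible follow-up, sketched here and not
part of the posted series, assuming the helper is only ever reached via the
direct call in the new inline asm, would be to drop the typed annotation:

	--- a/arch/x86/lib/clear_page_64.S
	+++ b/arch/x86/lib/clear_page_64.S
	-SYM_TYPED_FUNC_START(memzero_page_aligned_unrolled)
	+SYM_FUNC_START(memzero_page_aligned_unrolled)

SYM_FUNC_START() skips the kCFI preamble, which is only needed for functions
that may be called indirectly through a C function pointer.]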
On 6/15/25 22:22, Ankur Arora wrote: > clear_page_rep() and clear_page_erms() are wrappers around "REP; STOS" > variations. Inlining gets rid of the costly call/ret (for cases with > speculative execution related mitigations.) Could you elaborate a bit on which "speculative execution related mitigations" are so costly with these direct calls? > - kmsan_unpoison_memory(page, PAGE_SIZE); > - alternative_call_2(clear_page_orig, > - clear_page_rep, X86_FEATURE_REP_GOOD, > - clear_page_erms, X86_FEATURE_ERMS, > - "=D" (page), > - "D" (page), > - "cc", "memory", "rax", "rcx"); I've got to say, I don't dislike the old code. It's utterly clear from that code what's going on. It's arguable that it's not clear that the rep/erms variants are just using stosb vs. stosq, but the high level concept of "use a feature flag to switch between three implementations of clear page" is crystal clear. > + kmsan_unpoison_memory(page, len); > + asm volatile(ALTERNATIVE_2("call memzero_page_aligned_unrolled", > + "shrq $3, %%rcx; rep stosq", X86_FEATURE_REP_GOOD, > + "rep stosb", X86_FEATURE_ERMS) > + : "+c" (len), "+D" (page), ASM_CALL_CONSTRAINT > + : "a" (0) > + : "cc", "memory"); > } This is substantially less clear. It also doesn't even add comments to make up for the decreased clarity. > void copy_page(void *to, void *from); > diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S > index a508e4a8c66a..27debe0c018c 100644 > --- a/arch/x86/lib/clear_page_64.S > +++ b/arch/x86/lib/clear_page_64.S > @@ -6,30 +6,15 @@ > #include <asm/asm.h> > > /* > - * Most CPUs support enhanced REP MOVSB/STOSB instructions. It is > - * recommended to use this when possible and we do use them by default. > - * If enhanced REP MOVSB/STOSB is not available, try to use fast string. > - * Otherwise, use original. > + * Zero page aligned region. > + * %rdi - dest > + * %rcx - length > */ That comment was pretty useful, IMNHO. How about we add something like this above it? I think it explains the whole landscape, including the fact that X86_FEATURE_REP_GOOD is synthetic and X86_FEATURE_ERMS is not: Switch between three implementation of page clearing based on CPU capabilities: 1. memzero_page_aligned_unrolled(): the oldest, slowest and universally supported method. Uses a for loop (in assembly) to write a 64-byte cacheline on each loop. Each loop iteration writes to memory using 8x 8-byte MOV instructions. 2. "rep stosq": Really old CPUs had crummy REP implementations. Vendor CPU setup code sets 'REP_GOOD' on CPUs where REP can be trusted. The instruction writes 8 bytes per REP iteration but CPUs internally batch these together and do larger writes. 3. "rep stosb": CPUs that enumerate 'ERMS' have an improved STOS implementation that is less picky about alignment and where STOSB (1 byte at a time) is actually faster than STOSQ (8 bytes at a time).
Dave Hansen <dave.hansen@intel.com> writes:

> On 6/15/25 22:22, Ankur Arora wrote:
>> clear_page_rep() and clear_page_erms() are wrappers around "REP; STOS"
>> variations. Inlining gets rid of the costly call/ret (for cases with
>> speculative execution related mitigations.)
>
> Could you elaborate a bit on which "speculative execution related
> mitigations" are so costly with these direct calls?

I can specify that we would mispredict on the RET if you use RETHUNK.

>> -	kmsan_unpoison_memory(page, PAGE_SIZE);
>> -	alternative_call_2(clear_page_orig,
>> -			   clear_page_rep, X86_FEATURE_REP_GOOD,
>> -			   clear_page_erms, X86_FEATURE_ERMS,
>> -			   "=D" (page),
>> -			   "D" (page),
>> -			   "cc", "memory", "rax", "rcx");
>
> I've got to say, I don't dislike the old code. It's utterly clear from
> that code what's going on. It's arguable that it's not clear that the
> rep/erms variants are just using stosb vs. stosq, but the high level
> concept of "use a feature flag to switch between three implementations
> of clear page" is crystal clear.
>
>> +	kmsan_unpoison_memory(page, len);
>> +	asm volatile(ALTERNATIVE_2("call memzero_page_aligned_unrolled",
>> +				   "shrq $3, %%rcx; rep stosq", X86_FEATURE_REP_GOOD,
>> +				   "rep stosb", X86_FEATURE_ERMS)
>> +		     : "+c" (len), "+D" (page), ASM_CALL_CONSTRAINT
>> +		     : "a" (0)
>> +		     : "cc", "memory");
>>  }
>
> This is substantially less clear. It also doesn't even add comments to
> make up for the decreased clarity.
>
>>  void copy_page(void *to, void *from);
>> diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
>> index a508e4a8c66a..27debe0c018c 100644
>> --- a/arch/x86/lib/clear_page_64.S
>> +++ b/arch/x86/lib/clear_page_64.S
>> @@ -6,30 +6,15 @@
>>  #include <asm/asm.h>
>>
>>  /*
>> - * Most CPUs support enhanced REP MOVSB/STOSB instructions. It is
>> - * recommended to use this when possible and we do use them by default.
>> - * If enhanced REP MOVSB/STOSB is not available, try to use fast string.
>> - * Otherwise, use original.
>> + * Zero page aligned region.
>> + * %rdi	- dest
>> + * %rcx	- length
>>   */
>
> That comment was pretty useful, IMNHO.
>
> How about we add something like this above it? I think it explains the
> whole landscape, including the fact that X86_FEATURE_REP_GOOD is
> synthetic and X86_FEATURE_ERMS is not:
>
>   Switch between three implementations of page clearing based on CPU
>   capabilities:
>
>   1. memzero_page_aligned_unrolled(): the oldest, slowest and universally
>      supported method. Uses a for loop (in assembly) to write a 64-byte
>      cacheline on each loop. Each loop iteration writes to memory using
>      8x 8-byte MOV instructions.
>   2. "rep stosq": Really old CPUs had crummy REP implementations.
>      Vendor CPU setup code sets 'REP_GOOD' on CPUs where REP can be
>      trusted. The instruction writes 8 bytes per REP iteration but CPUs
>      internally batch these together and do larger writes.
>   3. "rep stosb": CPUs that enumerate 'ERMS' have an improved STOS
>      implementation that is less picky about alignment and where STOSB
>      (1 byte at a time) is actually faster than STOSQ (8 bytes at a
>      time).

Yeah this seems good to add.

And, sorry, I should have fleshed that comment out in the new location
instead of just getting rid of it.

--
ankur
On Mon, Jun 16, 2025 at 07:35:48AM -0700, Dave Hansen wrote:
> On 6/15/25 22:22, Ankur Arora wrote:
> > clear_page_rep() and clear_page_erms() are wrappers around "REP; STOS"
> > variations. Inlining gets rid of the costly call/ret (for cases with
> > speculative execution related mitigations.)
>
> Could you elaborate a bit on which "speculative execution related
> mitigations" are so costly with these direct calls?

Pretty much everything with RETHUNK set I would imagine.
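[For anyone wondering where that per-call cost comes from: when return thunks
are enabled, the RET at the end of these small helpers is not a plain return
but a jump into __x86_return_thunk, which the various retbleed/SRSO
mitigations then fill in. Roughly, simplified from the asm side of
arch/x86/include/asm/linkage.h (the real definition also handles SLS and
untraining):

	#ifdef CONFIG_MITIGATION_RETHUNK
	#define RET	jmp __x86_return_thunk	/* every return funnels through the thunk */
	#else
	#define RET	ret			/* plain near return */
	#endif

Inlining the REP STOS variants avoids both the CALL and that thunked return on
every page clear.]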