clear_page_rep() and clear_page_erms() are wrappers around "REP; STOS"
variations. Inlining gets rid of the costly call/ret (for cases with
speculative execution related mitigations.)
Also, fixup and rename clear_page_orig() to adapt to the changed calling
convention.
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
arch/x86/include/asm/page_64.h | 19 ++++++++---------
arch/x86/lib/clear_page_64.S | 39 +++++++---------------------------
2 files changed, 17 insertions(+), 41 deletions(-)
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index 015d23f3e01f..596333bd0c73 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -40,23 +40,22 @@ extern unsigned long __phys_addr_symbol(unsigned long);
#define __phys_reloc_hide(x) (x)
-void clear_page_orig(void *page);
-void clear_page_rep(void *page);
-void clear_page_erms(void *page);
+void memzero_page_aligned_unrolled(void *addr, u64 len);
static inline void clear_page(void *page)
{
+ u64 len = PAGE_SIZE;
/*
* Clean up KMSAN metadata for the page being cleared. The assembly call
* below clobbers @page, so we perform unpoisoning before it.
*/
- kmsan_unpoison_memory(page, PAGE_SIZE);
- alternative_call_2(clear_page_orig,
- clear_page_rep, X86_FEATURE_REP_GOOD,
- clear_page_erms, X86_FEATURE_ERMS,
- "=D" (page),
- "D" (page),
- "cc", "memory", "rax", "rcx");
+ kmsan_unpoison_memory(page, len);
+ asm volatile(ALTERNATIVE_2("call memzero_page_aligned_unrolled",
+ "shrq $3, %%rcx; rep stosq", X86_FEATURE_REP_GOOD,
+ "rep stosb", X86_FEATURE_ERMS)
+ : "+c" (len), "+D" (page), ASM_CALL_CONSTRAINT
+ : "a" (0)
+ : "cc", "memory");
}
void copy_page(void *to, void *from);
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index a508e4a8c66a..27debe0c018c 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -6,30 +6,15 @@
#include <asm/asm.h>
/*
- * Most CPUs support enhanced REP MOVSB/STOSB instructions. It is
- * recommended to use this when possible and we do use them by default.
- * If enhanced REP MOVSB/STOSB is not available, try to use fast string.
- * Otherwise, use original.
+ * Zero page aligned region.
+ * %rdi - dest
+ * %rcx - length
*/
-
-/*
- * Zero a page.
- * %rdi - page
- */
-SYM_TYPED_FUNC_START(clear_page_rep)
- movl $4096/8,%ecx
- xorl %eax,%eax
- rep stosq
- RET
-SYM_FUNC_END(clear_page_rep)
-EXPORT_SYMBOL_GPL(clear_page_rep)
-
-SYM_TYPED_FUNC_START(clear_page_orig)
- xorl %eax,%eax
- movl $4096/64,%ecx
+SYM_TYPED_FUNC_START(memzero_page_aligned_unrolled)
+ shrq $6, %rcx
.p2align 4
.Lloop:
- decl %ecx
+ decq %rcx
#define PUT(x) movq %rax,x*8(%rdi)
movq %rax,(%rdi)
PUT(1)
@@ -43,16 +28,8 @@ SYM_TYPED_FUNC_START(clear_page_orig)
jnz .Lloop
nop
RET
-SYM_FUNC_END(clear_page_orig)
-EXPORT_SYMBOL_GPL(clear_page_orig)
-
-SYM_TYPED_FUNC_START(clear_page_erms)
- movl $4096,%ecx
- xorl %eax,%eax
- rep stosb
- RET
-SYM_FUNC_END(clear_page_erms)
-EXPORT_SYMBOL_GPL(clear_page_erms)
+SYM_FUNC_END(memzero_page_aligned_unrolled)
+EXPORT_SYMBOL_GPL(memzero_page_aligned_unrolled)
/*
* Default clear user-space.
--
2.31.1
Hi Ankur,

kernel test robot noticed the following build errors:

[auto build test ERROR on perf-tools-next/perf-tools-next]
[also build test ERROR on tip/perf/core perf-tools/perf-tools linus/master v6.16-rc2 next-20250616]
[cannot apply to acme/perf/core]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Ankur-Arora/perf-bench-mem-Remove-repetition-around-time-measurement/20250616-132651
base:   https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git perf-tools-next
patch link:    https://lore.kernel.org/r/20250616052223.723982-11-ankur.a.arora%40oracle.com
patch subject: [PATCH v4 10/13] x86/mm: Simplify clear_page_*
config: x86_64-buildonly-randconfig-004-20250616 (https://download.01.org/0day-ci/archive/20250617/202506170010.r7INoexI-lkp@intel.com/config)
compiler: clang version 20.1.2 (https://github.com/llvm/llvm-project 58df0ef89dd64126512e4ee27b4ac3fd8ddf6247)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250617/202506170010.r7INoexI-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202506170010.r7INoexI-lkp@intel.com/

All errors (new ones prefixed by >>):

>> ld.lld: error: undefined symbol: __kcfi_typeid_memzero_page_aligned_unrolled
   >>> referenced by usercopy_64.c
   >>>               vmlinux.o:(__cfi_memzero_page_aligned_unrolled)

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
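One plausible reading of this error (an inference, not confirmed in the thread):
with kernel CFI (CONFIG_CFI_CLANG), SYM_TYPED_FUNC_START() emits a __cfi_<sym>
prologue that references __kcfi_typeid_<sym>, and the compiler only generates
that typeid when it sees the function's address taken from C through a matching
prototype. The old helpers were address-taken via alternative_call_2(), but
memzero_page_aligned_unrolled() is now only reached through a direct "call" in
inline asm, so no typeid is ever emitted and the link fails. A sketch of one
possible fix, assuming the symbol is never called indirectly:

    -SYM_TYPED_FUNC_START(memzero_page_aligned_unrolled)
    +SYM_FUNC_START(memzero_page_aligned_unrolled)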
On 6/15/25 22:22, Ankur Arora wrote:
> clear_page_rep() and clear_page_erms() are wrappers around "REP; STOS"
> variations. Inlining gets rid of the costly call/ret (for cases with
> speculative execution related mitigations.)
Could you elaborate a bit on which "speculative execution related
mitigations" are so costly with these direct calls?
> - kmsan_unpoison_memory(page, PAGE_SIZE);
> - alternative_call_2(clear_page_orig,
> - clear_page_rep, X86_FEATURE_REP_GOOD,
> - clear_page_erms, X86_FEATURE_ERMS,
> - "=D" (page),
> - "D" (page),
> - "cc", "memory", "rax", "rcx");
I've got to say, I don't dislike the old code. It's utterly clear from
that code what's going on. It's arguable that it's not clear that the
rep/erms variants are just using stosb vs. stosq, but the high level
concept of "use a feature flag to switch between three implementations
of clear page" is crystal clear.
> + kmsan_unpoison_memory(page, len);
> + asm volatile(ALTERNATIVE_2("call memzero_page_aligned_unrolled",
> + "shrq $3, %%rcx; rep stosq", X86_FEATURE_REP_GOOD,
> + "rep stosb", X86_FEATURE_ERMS)
> + : "+c" (len), "+D" (page), ASM_CALL_CONSTRAINT
> + : "a" (0)
> + : "cc", "memory");
> }
This is substantially less clear. It also doesn't even add comments to
make up for the decreased clarity.
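For illustration only (this is not something proposed in the thread), comments
along these lines could make up some of that ground:

    /*
     * Zero PAGE_SIZE bytes at @page. Default: call the unrolled store
     * loop. On REP_GOOD: convert the byte count to qwords and use
     * "rep stosq". On ERMS: use "rep stosb" directly.
     * %rax holds the store pattern (0), %rcx the length, %rdi the dest.
     */
    asm volatile(ALTERNATIVE_2("call memzero_page_aligned_unrolled",
                               "shrq $3, %%rcx; rep stosq", X86_FEATURE_REP_GOOD,
                               "rep stosb", X86_FEATURE_ERMS)
                 : "+c" (len), "+D" (page), ASM_CALL_CONSTRAINT
                 : "a" (0)
                 : "cc", "memory");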
> void copy_page(void *to, void *from);
> diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
> index a508e4a8c66a..27debe0c018c 100644
> --- a/arch/x86/lib/clear_page_64.S
> +++ b/arch/x86/lib/clear_page_64.S
> @@ -6,30 +6,15 @@
> #include <asm/asm.h>
>
> /*
> - * Most CPUs support enhanced REP MOVSB/STOSB instructions. It is
> - * recommended to use this when possible and we do use them by default.
> - * If enhanced REP MOVSB/STOSB is not available, try to use fast string.
> - * Otherwise, use original.
> + * Zero page aligned region.
> + * %rdi - dest
> + * %rcx - length
> */
That comment was pretty useful, IMNHO.
How about we add something like this above it? I think it explains the
whole landscape, including the fact that X86_FEATURE_REP_GOOD is
synthetic and X86_FEATURE_ERMS is not:
Switch between three implementations of page clearing based on CPU
capabilities:
1. memzero_page_aligned_unrolled(): the oldest, slowest and universally
supported method. Uses a for loop (in assembly) to write a 64-byte
cacheline on each loop. Each loop iteration writes to memory using
8x 8-byte MOV instructions.
2. "rep stosq": Really old CPUs had crummy REP implementations.
Vendor CPU setup code sets 'REP_GOOD' on CPUs where REP can be
trusted. The instruction writes 8 bytes per REP iteration but CPUs
internally batch these together and do larger writes.
3. "rep stosb": CPUs that enumerate 'ERMS' have an improved STOS
implementation that is less picky about alignment and where STOSB
(1 byte at a time) is actually faster than STOSQ (8 bytes at a
time).
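As a rough illustration (an editorial sketch, not taken from the thread): on an
ERMS-capable CPU, after alternatives patching the clear_page() call site boils
down to approximately:

    movq    $4096, %rcx             /* len = PAGE_SIZE, %rdi = page */
    xorl    %eax, %eax              /* store pattern = 0 */
    rep stosb                       /* patched-in ERMS body: write %rcx zero bytes at (%rdi) */

so for the common REP_GOOD/ERMS cases the dispatch and the call/ret disappear
entirely, which is the point of inlining.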
Dave Hansen <dave.hansen@intel.com> writes:
> On 6/15/25 22:22, Ankur Arora wrote:
>> clear_page_rep() and clear_page_erms() are wrappers around "REP; STOS"
>> variations. Inlining gets rid of the costly call/ret (for cases with
>> speculative execution related mitigations.)
>
> Could you elaborate a bit on which "speculative execution related
> mitigations" are so costly with these direct calls?
I can specify that we would mispredict on the RET if you use RETHUNK.
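Roughly, with the rethunk mitigation enabled every near return is emitted as a
jump to the return thunk rather than a bare RET (a sketch of the generated
code, not taken from the thread):

    /* rethunk disabled */
    ret

    /* rethunk enabled: returns are routed through the thunk */
    jmp     __x86_return_thunk

so an out-of-line clear_page_rep()/clear_page_erms() pays both the CALL and a
potentially mispredicted thunked return, which the inlined "rep stos" avoids.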
>> - kmsan_unpoison_memory(page, PAGE_SIZE);
>> - alternative_call_2(clear_page_orig,
>> - clear_page_rep, X86_FEATURE_REP_GOOD,
>> - clear_page_erms, X86_FEATURE_ERMS,
>> - "=D" (page),
>> - "D" (page),
>> - "cc", "memory", "rax", "rcx");
>
> I've got to say, I don't dislike the old code. It's utterly clear from
> that code what's going on. It's arguable that it's not clear that the
> rep/erms variants are just using stosb vs. stosq, but the high level
> concept of "use a feature flag to switch between three implementations
> of clear page" is crystal clear.
>
>> + kmsan_unpoison_memory(page, len);
>> + asm volatile(ALTERNATIVE_2("call memzero_page_aligned_unrolled",
>> + "shrq $3, %%rcx; rep stosq", X86_FEATURE_REP_GOOD,
>> + "rep stosb", X86_FEATURE_ERMS)
>> + : "+c" (len), "+D" (page), ASM_CALL_CONSTRAINT
>> + : "a" (0)
>> + : "cc", "memory");
>> }
>
> This is substantially less clear. It also doesn't even add comments to
> make up for the decreased clarity.
>
>> void copy_page(void *to, void *from);
>> diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
>> index a508e4a8c66a..27debe0c018c 100644
>> --- a/arch/x86/lib/clear_page_64.S
>> +++ b/arch/x86/lib/clear_page_64.S
>> @@ -6,30 +6,15 @@
>> #include <asm/asm.h>
>>
>> /*
>> - * Most CPUs support enhanced REP MOVSB/STOSB instructions. It is
>> - * recommended to use this when possible and we do use them by default.
>> - * If enhanced REP MOVSB/STOSB is not available, try to use fast string.
>> - * Otherwise, use original.
>> + * Zero page aligned region.
>> + * %rdi - dest
>> + * %rcx - length
>> */
>
> That comment was pretty useful, IMNHO.
>
> How about we add something like this above it? I think it explains the
> whole landscape, including the fact that X86_FEATURE_REP_GOOD is
> synthetic and X86_FEATURE_ERMS is not:
>
> Switch between three implementations of page clearing based on CPU
> capabilities:
>
> 1. memzero_page_aligned_unrolled(): the oldest, slowest and universally
> supported method. Uses a for loop (in assembly) to write a 64-byte
> cacheline on each loop. Each loop iteration writes to memory using
> 8x 8-byte MOV instructions.
> 2. "rep stosq": Really old CPUs had crummy REP implementations.
> Vendor CPU setup code sets 'REP_GOOD' on CPUs where REP can be
> trusted. The instruction writes 8 bytes per REP iteration but CPUs
> internally batch these together and do larger writes.
> 3. "rep stosb": CPUs that enumerate 'ERMS' have an improved STOS
> implementation that is less picky about alignment and where STOSB
> (1 byte at a time) is actually faster than STOSQ (8 bytes at a
> time).
Yeah, this seems good to add. And, sorry, I should have fleshed that comment
out in the new location instead of just getting rid of it.
--
ankur
On Mon, Jun 16, 2025 at 07:35:48AM -0700, Dave Hansen wrote:
> On 6/15/25 22:22, Ankur Arora wrote:
> > clear_page_rep() and clear_page_erms() are wrappers around "REP; STOS"
> > variations. Inlining gets rid of the costly call/ret (for cases with
> > speculative execution related mitigations.)
>
> Could you elaborate a bit on which "speculative execution related
> mitigations" are so costly with these direct calls?

Pretty much everything with RETHUNK set I would imagine.