[PATCH v4 10/13] x86/mm: Simplify clear_page_*
Posted by Ankur Arora 3 months, 3 weeks ago
clear_page_rep() and clear_page_erms() are wrappers around "REP; STOS"
variations. Inlining gets rid of the costly call/ret (for cases with
speculative execution related mitigations.)

Also, fix up and rename clear_page_orig() to adapt to the changed calling
convention.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
 arch/x86/include/asm/page_64.h | 19 ++++++++---------
 arch/x86/lib/clear_page_64.S   | 39 +++++++---------------------------
 2 files changed, 17 insertions(+), 41 deletions(-)

diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index 015d23f3e01f..596333bd0c73 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -40,23 +40,22 @@ extern unsigned long __phys_addr_symbol(unsigned long);
 
 #define __phys_reloc_hide(x)	(x)
 
-void clear_page_orig(void *page);
-void clear_page_rep(void *page);
-void clear_page_erms(void *page);
+void memzero_page_aligned_unrolled(void *addr, u64 len);
 
 static inline void clear_page(void *page)
 {
+	u64 len = PAGE_SIZE;
 	/*
 	 * Clean up KMSAN metadata for the page being cleared. The assembly call
 	 * below clobbers @page, so we perform unpoisoning before it.
 	 */
-	kmsan_unpoison_memory(page, PAGE_SIZE);
-	alternative_call_2(clear_page_orig,
-			   clear_page_rep, X86_FEATURE_REP_GOOD,
-			   clear_page_erms, X86_FEATURE_ERMS,
-			   "=D" (page),
-			   "D" (page),
-			   "cc", "memory", "rax", "rcx");
+	kmsan_unpoison_memory(page, len);
+	asm volatile(ALTERNATIVE_2("call memzero_page_aligned_unrolled",
+				   "shrq $3, %%rcx; rep stosq", X86_FEATURE_REP_GOOD,
+				   "rep stosb", X86_FEATURE_ERMS)
+			: "+c" (len), "+D" (page), ASM_CALL_CONSTRAINT
+			: "a" (0)
+			: "cc", "memory");
 }
 
 void copy_page(void *to, void *from);
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index a508e4a8c66a..27debe0c018c 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -6,30 +6,15 @@
 #include <asm/asm.h>
 
 /*
- * Most CPUs support enhanced REP MOVSB/STOSB instructions. It is
- * recommended to use this when possible and we do use them by default.
- * If enhanced REP MOVSB/STOSB is not available, try to use fast string.
- * Otherwise, use original.
+ * Zero page aligned region.
+ * %rdi	- dest
+ * %rcx	- length
  */
-
-/*
- * Zero a page.
- * %rdi	- page
- */
-SYM_TYPED_FUNC_START(clear_page_rep)
-	movl $4096/8,%ecx
-	xorl %eax,%eax
-	rep stosq
-	RET
-SYM_FUNC_END(clear_page_rep)
-EXPORT_SYMBOL_GPL(clear_page_rep)
-
-SYM_TYPED_FUNC_START(clear_page_orig)
-	xorl   %eax,%eax
-	movl   $4096/64,%ecx
+SYM_TYPED_FUNC_START(memzero_page_aligned_unrolled)
+	shrq   $6, %rcx
 	.p2align 4
 .Lloop:
-	decl	%ecx
+	decq	%rcx
 #define PUT(x) movq %rax,x*8(%rdi)
 	movq %rax,(%rdi)
 	PUT(1)
@@ -43,16 +28,8 @@ SYM_TYPED_FUNC_START(clear_page_orig)
 	jnz	.Lloop
 	nop
 	RET
-SYM_FUNC_END(clear_page_orig)
-EXPORT_SYMBOL_GPL(clear_page_orig)
-
-SYM_TYPED_FUNC_START(clear_page_erms)
-	movl $4096,%ecx
-	xorl %eax,%eax
-	rep stosb
-	RET
-SYM_FUNC_END(clear_page_erms)
-EXPORT_SYMBOL_GPL(clear_page_erms)
+SYM_FUNC_END(memzero_page_aligned_unrolled)
+EXPORT_SYMBOL_GPL(memzero_page_aligned_unrolled)
 
 /*
  * Default clear user-space.
-- 
2.31.1
Re: [PATCH v4 10/13] x86/mm: Simplify clear_page_*
Posted by kernel test robot 3 months, 3 weeks ago
Hi Ankur,

kernel test robot noticed the following build errors:

[auto build test ERROR on perf-tools-next/perf-tools-next]
[also build test ERROR on tip/perf/core perf-tools/perf-tools linus/master v6.16-rc2 next-20250616]
[cannot apply to acme/perf/core]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Ankur-Arora/perf-bench-mem-Remove-repetition-around-time-measurement/20250616-132651
base:   https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git perf-tools-next
patch link:    https://lore.kernel.org/r/20250616052223.723982-11-ankur.a.arora%40oracle.com
patch subject: [PATCH v4 10/13] x86/mm: Simplify clear_page_*
config: x86_64-buildonly-randconfig-004-20250616 (https://download.01.org/0day-ci/archive/20250617/202506170010.r7INoexI-lkp@intel.com/config)
compiler: clang version 20.1.2 (https://github.com/llvm/llvm-project 58df0ef89dd64126512e4ee27b4ac3fd8ddf6247)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250617/202506170010.r7INoexI-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202506170010.r7INoexI-lkp@intel.com/

All errors (new ones prefixed by >>):

>> ld.lld: error: undefined symbol: __kcfi_typeid_memzero_page_aligned_unrolled
   >>> referenced by usercopy_64.c
   >>>               vmlinux.o:(__cfi_memzero_page_aligned_unrolled)

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
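
A guess at the cause (not part of the robot report): with CONFIG_CFI_CLANG,
SYM_TYPED_FUNC_START emits a reference to the compiler-generated
__kcfi_typeid_memzero_page_aligned_unrolled, but now that the helper is only
reached through a literal "call" inside inline asm, no C code takes its
address, so clang never emits that typeid symbol. A minimal sketch of one
possible fix, assuming the helper is never called indirectly, is to drop the
typed annotation in arch/x86/lib/clear_page_64.S:

-SYM_TYPED_FUNC_START(memzero_page_aligned_unrolled)
+SYM_FUNC_START(memzero_page_aligned_unrolled)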
Re: [PATCH v4 10/13] x86/mm: Simplify clear_page_*
Posted by Dave Hansen 3 months, 3 weeks ago
On 6/15/25 22:22, Ankur Arora wrote:
> clear_page_rep() and clear_page_erms() are wrappers around "REP; STOS"
> variations. Inlining gets rid of the costly call/ret (for cases with
> speculative execution related mitigations.)

Could you elaborate a bit on which "speculative execution related
mitigations" are so costly with these direct calls?


> -	kmsan_unpoison_memory(page, PAGE_SIZE);
> -	alternative_call_2(clear_page_orig,
> -			   clear_page_rep, X86_FEATURE_REP_GOOD,
> -			   clear_page_erms, X86_FEATURE_ERMS,
> -			   "=D" (page),
> -			   "D" (page),
> -			   "cc", "memory", "rax", "rcx");

I've got to say, I don't dislike the old code. It's utterly clear from
that code what's going on. It's arguable that it's not clear that the
rep/erms variants are just using stosb vs. stosq, but the high level
concept of "use a feature flag to switch between three implementations
of clear page" is crystal clear.

> +	kmsan_unpoison_memory(page, len);
> +	asm volatile(ALTERNATIVE_2("call memzero_page_aligned_unrolled",
> +				   "shrq $3, %%rcx; rep stosq", X86_FEATURE_REP_GOOD,
> +				   "rep stosb", X86_FEATURE_ERMS)
> +			: "+c" (len), "+D" (page), ASM_CALL_CONSTRAINT
> +			: "a" (0)
> +			: "cc", "memory");
>  }

This is substantially less clear. It also doesn't even add comments to
make up for the decreased clarity.

>  void copy_page(void *to, void *from);
> diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
> index a508e4a8c66a..27debe0c018c 100644
> --- a/arch/x86/lib/clear_page_64.S
> +++ b/arch/x86/lib/clear_page_64.S
> @@ -6,30 +6,15 @@
>  #include <asm/asm.h>
>  
>  /*
> - * Most CPUs support enhanced REP MOVSB/STOSB instructions. It is
> - * recommended to use this when possible and we do use them by default.
> - * If enhanced REP MOVSB/STOSB is not available, try to use fast string.
> - * Otherwise, use original.
> + * Zero page aligned region.
> + * %rdi	- dest
> + * %rcx	- length
>   */

That comment was pretty useful, IMNHO.

How about we add something like this above it? I think it explains the
whole landscape, including the fact that X86_FEATURE_REP_GOOD is
synthetic and X86_FEATURE_ERMS is not:

Switch between three implementation of page clearing based on CPU
capabilities:

 1. memzero_page_aligned_unrolled(): the oldest, slowest and universally
    supported method. Uses a for loop (in assembly) to write a 64-byte
    cacheline on each loop. Each loop iteration writes to memory using
    8x 8-byte MOV instructions.
 2. "rep stosq": Really old CPUs had crummy REP implementations.
    Vendor CPU setup code sets 'REP_GOOD' on CPUs where REP can be
    trusted. The instruction writes 8 bytes per REP iteration but CPUs
    internally batch these together and do larger writes.
 3. "rep stosb": CPUs that enumerate 'ERMS' have an improved STOS
    implementation that is less picky about alignment and where STOSB
    (1 byte at a time) is actually faster than STOSQ (8 bytes at a
    time).
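
FWIW, at the C level the dispatch amounts to roughly the sketch below
(illustrative only, not from the patch: boot_cpu_has() stands in for the
boot-time alternatives patching, and the _sketch name is made up). "rep stosb"
wins over "rep stosq" because ERMS is the last alternative listed:

/* Rough C equivalent of the proposed clear_page() dispatch. */
static inline void clear_page_sketch(void *page)
{
	u64 len = PAGE_SIZE;

	if (boot_cpu_has(X86_FEATURE_ERMS)) {
		/* 3. byte-granular REP STOSB */
		asm volatile("rep stosb"
			     : "+c" (len), "+D" (page) : "a" (0) : "memory");
	} else if (boot_cpu_has(X86_FEATURE_REP_GOOD)) {
		/* 2. qword REP STOSQ; convert the length from bytes to qwords */
		len >>= 3;
		asm volatile("rep stosq"
			     : "+c" (len), "+D" (page) : "a" (0) : "memory");
	} else {
		/* 1. open-coded loop writing one 64-byte cacheline per iteration */
		memzero_page_aligned_unrolled(page, len);
	}
}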
Re: [PATCH v4 10/13] x86/mm: Simplify clear_page_*
Posted by Ankur Arora 3 months, 3 weeks ago
Dave Hansen <dave.hansen@intel.com> writes:

> On 6/15/25 22:22, Ankur Arora wrote:
>> clear_page_rep() and clear_page_erms() are wrappers around "REP; STOS"
>> variations. Inlining gets rid of the costly call/ret (for cases with
>> speculative execution related mitigations.)
>
> Could you elaborate a bit on which "speculative execution related
> mitigations" are so costly with these direct calls?

I can specify that we would mispredict on the RET when RETHUNK is in use.
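
Roughly, as a sketch (not the exact generated code): with
CONFIG_MITIGATION_RETHUNK the RET in the old out-of-line helpers is emitted
as a jump through the return thunk, so every clear_page() paid for a CALL
plus a thunked return, which the inlined "rep stos" avoids:

	call	clear_page_erms		# old out-of-line path
	...
clear_page_erms:
	movl	$4096, %ecx
	xorl	%eax, %eax
	rep stosb
	jmp	__x86_return_thunk	# what RET expands to under RETHUNK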

>> -	kmsan_unpoison_memory(page, PAGE_SIZE);
>> -	alternative_call_2(clear_page_orig,
>> -			   clear_page_rep, X86_FEATURE_REP_GOOD,
>> -			   clear_page_erms, X86_FEATURE_ERMS,
>> -			   "=D" (page),
>> -			   "D" (page),
>> -			   "cc", "memory", "rax", "rcx");
>
> I've got to say, I don't dislike the old code. It's utterly clear from
> that code what's going on. It's arguable that it's not clear that the
> rep/erms variants are just using stosb vs. stosq, but the high level
> concept of "use a feature flag to switch between three implementations
> of clear page" is crystal clear.
>
>> +	kmsan_unpoison_memory(page, len);
>> +	asm volatile(ALTERNATIVE_2("call memzero_page_aligned_unrolled",
>> +				   "shrq $3, %%rcx; rep stosq", X86_FEATURE_REP_GOOD,
>> +				   "rep stosb", X86_FEATURE_ERMS)
>> +			: "+c" (len), "+D" (page), ASM_CALL_CONSTRAINT
>> +			: "a" (0)
>> +			: "cc", "memory");
>>  }
>
> This is substantially less clear. It also doesn't even add comments to
> make up for the decreased clarity.
>
>>  void copy_page(void *to, void *from);
>> diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
>> index a508e4a8c66a..27debe0c018c 100644
>> --- a/arch/x86/lib/clear_page_64.S
>> +++ b/arch/x86/lib/clear_page_64.S
>> @@ -6,30 +6,15 @@
>>  #include <asm/asm.h>
>>
>>  /*
>> - * Most CPUs support enhanced REP MOVSB/STOSB instructions. It is
>> - * recommended to use this when possible and we do use them by default.
>> - * If enhanced REP MOVSB/STOSB is not available, try to use fast string.
>> - * Otherwise, use original.
>> + * Zero page aligned region.
>> + * %rdi	- dest
>> + * %rcx	- length
>>   */
>
> That comment was pretty useful, IMNHO.
>
> How about we add something like this above it? I think it explains the
> whole landscape, including the fact that X86_FEATURE_REP_GOOD is
> synthetic and X86_FEATURE_ERMS is not:
>
> Switch between three implementations of page clearing based on CPU
> capabilities:
>
>  1. memzero_page_aligned_unrolled(): the oldest, slowest and universally
>     supported method. Uses a for loop (in assembly) to write a 64-byte
>     cacheline on each loop. Each loop iteration writes to memory using
>     8x 8-byte MOV instructions.
>  2. "rep stosq": Really old CPUs had crummy REP implementations.
>     Vendor CPU setup code sets 'REP_GOOD' on CPUs where REP can be
>     trusted. The instruction writes 8 bytes per REP iteration but CPUs
>     internally batch these together and do larger writes.
>  3. "rep stosb": CPUs that enumerate 'ERMS' have an improved STOS
>     implementation that is less picky about alignment and where STOSB
>     (1 byte at a time) is actually faster than STOSQ (8 bytes at a
>     time).

Yeah, this seems good to add. And sorry, I should have fleshed that comment
out in the new location instead of just getting rid of it.

--
ankur
Re: [PATCH v4 10/13] x86/mm: Simplify clear_page_*
Posted by Peter Zijlstra 3 months, 3 weeks ago
On Mon, Jun 16, 2025 at 07:35:48AM -0700, Dave Hansen wrote:
> On 6/15/25 22:22, Ankur Arora wrote:
> > clear_page_rep() and clear_page_erms() are wrappers around "REP; STOS"
> > variations. Inlining gets rid of the costly call/ret (for cases with
> > speculative execution related mitigations.)
> 
> Could you elaborate a bit on which "speculative execution related
> mitigations" are so costly with these direct calls?

Pretty much everything with RETHUNK set I would imagine.