This series adds clearing of contiguous page ranges for hugepages,
improving on the current page-at-a-time approach in two ways:

 - amortizes the per-page setup cost over a larger extent

 - when using string instructions, exposes the real region size
   to the processor.

A processor could use knowledge of the extent to optimize the
clearing. AMD Zen uarchs, as an example, elide allocation of
cachelines for regions larger than L3-size.

Demand faulting a 64GB region shows performance improvements:

 $ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5

              mm/folio_zero_user    x86/folio_zero_user      change
               (GB/s +- %stdev)      (GB/s +- %stdev)

 pg-sz=2MB     11.82 +- 0.67%        16.48 +-  0.30%        + 39.4%   preempt=*

 pg-sz=1GB     17.14 +- 1.39%        17.42 +-  0.98%  [#]   +  1.6%   preempt=none|voluntary
 pg-sz=1GB     17.51 +- 1.19%        43.23 +-  5.22%        +146.8%   preempt=full|lazy

 [#] Milan uses a threshold of LLC-size (~32MB) for eliding cacheline
 allocation, which is higher than the maximum extent used on x86
 (ARCH_CONTIG_PAGE_NR=8MB), so preempt=none|voluntary sees no
 improvement with pg-sz=1GB.

Raghavendra also tested v3/v4 on AMD Genoa and sees similar
improvements [1].

Changelog:

v7:
 - interface cleanups, comments for clear_user_highpages(),
   clear_user_pages(), clear_pages().
 - fixed build errors flagged by the kernel test robot
 - moved all x86 patches to the tail end

v6:
 - perf bench mem: update man pages and other cleanups (Namhyung Kim)
 - unify folio_zero_user() for the HIGHMEM and !HIGHMEM options instead
   of working through a new config option (David Hildenbrand).
 - cleanups and simplification around that.
 (https://lore.kernel.org/lkml/20250902080816.3715913-1-ankur.a.arora@oracle.com/)

v5:
 - move the non-HIGHMEM implementation of folio_zero_user() from x86
   to common code (Dave Hansen)
 - minor naming cleanups, commit messages etc.
 (https://lore.kernel.org/lkml/20250710005926.1159009-1-ankur.a.arora@oracle.com/)

v4:
 - add perf bench workloads to exercise mmap() populate/demand-fault (Ingo)
 - inline stosb etc. (PeterZ)
 - handle cooperative preemption models (Ingo)
 - interface and other cleanups all over (Ingo)
 (https://lore.kernel.org/lkml/20250616052223.723982-1-ankur.a.arora@oracle.com/)

v3:
 - get rid of the preemption dependency (TIF_ALLOW_RESCHED); that version
   was limited to preempt=full|lazy.
 - override folio_zero_user() (Linus)
 (https://lore.kernel.org/lkml/20250414034607.762653-1-ankur.a.arora@oracle.com/)

v2:
 - addressed review comments from peterz, tglx.
 - removed clear_user_pages(), and CONFIG_X86_32:clear_pages()
 - general code cleanup
 (https://lore.kernel.org/lkml/20230830184958.2333078-1-ankur.a.arora@oracle.com/)

Comments appreciated!
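To make the two bullets above concrete, here is a minimal illustrative
sketch (userspace C with hypothetical names; not the series' actual
code). The win comes from handing the processor one large extent
instead of a loop of page-sized ones:

/*
 * Illustrative sketch only -- the helper names and userspace memset()
 * stand in for the kernel's clearing primitives. A single large extent
 * exposes the full region size to the CPU (letting a uarch elide
 * cacheline allocation, say); a page-at-a-time loop never exceeds 4KB.
 */
#include <string.h>

#define PAGE_SZ	4096UL

static void zero_page_at_a_time(char *addr, unsigned long npages)
{
	unsigned long i;

	for (i = 0; i < npages; i++)
		memset(addr + i * PAGE_SZ, 0, PAGE_SZ);	/* 4KB extent each */
}

static void zero_extent(char *addr, unsigned long npages)
{
	memset(addr, 0, npages * PAGE_SZ);	/* one npages * 4KB extent */
}

Note that under preempt=none|voluntary the series still caps each
extent (ARCH_CONTIG_PAGE_NR, 8MB on x86) so that it can reschedule
between chunks; that cap is what keeps the pg-sz=1GB case below
Milan's ~32MB no-allocate threshold in the table above.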
Also at:
  github.com/terminus/linux clear-pages.v7

[1] https://lore.kernel.org/lkml/fffd4dad-2cb9-4bc9-8a80-a70be687fd54@amd.com/

Ankur Arora (16):
  perf bench mem: Remove repetition around time measurement
  perf bench mem: Defer type munging of size to float
  perf bench mem: Move mem op parameters into a structure
  perf bench mem: Pull out init/fini logic
  perf bench mem: Switch from zalloc() to mmap()
  perf bench mem: Allow mapping of hugepages
  perf bench mem: Allow chunking on a memory region
  perf bench mem: Refactor mem_options
  perf bench mem: Add mmap() workloads
  mm: define clear_pages(), clear_user_pages()
  mm/highmem: introduce clear_user_highpages()
  arm: mm: define clear_user_highpages()
  mm: memory: support clearing page ranges
  x86/mm: Simplify clear_page_*
  x86/clear_page: Introduce clear_pages()
  x86/clear_pages: Support clearing of page-extents

 arch/arm/include/asm/page.h                  |   7 +
 arch/x86/include/asm/page_32.h               |   6 +
 arch/x86/include/asm/page_64.h               |  72 +++-
 arch/x86/lib/clear_page_64.S                 |  39 +-
 include/linux/highmem.h                      |  18 +
 include/linux/mm.h                           |  44 +++
 mm/memory.c                                  |  82 +++-
 tools/perf/Documentation/perf-bench.txt      |  58 ++-
 tools/perf/bench/bench.h                     |   1 +
 tools/perf/bench/mem-functions.c             | 390 ++++++++++++++-----
 tools/perf/bench/mem-memcpy-arch.h           |   2 +-
 tools/perf/bench/mem-memcpy-x86-64-asm-def.h |   4 +
 tools/perf/bench/mem-memset-arch.h           |   2 +-
 tools/perf/bench/mem-memset-x86-64-asm-def.h |   4 +
 tools/perf/builtin-bench.c                   |   1 +
 15 files changed, 560 insertions(+), 170 deletions(-)

--
2.43.5
On 9/17/2025 8:54 PM, Ankur Arora wrote:
> This series adds clearing of contiguous page ranges for hugepages,
> improving on the current page-at-a-time approach in two ways:
>
> - amortizes the per-page setup cost over a larger extent
>
> - when using string instructions, exposes the real region size
>   to the processor.
>
> A processor could use knowledge of the extent to optimize the
> clearing. AMD Zen uarchs, as an example, elide allocation of
> cachelines for regions larger than L3-size.
[...]

Hello,

Feel free to add

Tested-by: Raghavendra K T <raghavendra.kt@amd.com>

for the whole series.

[ I do understand that there may be minor tweaks to the clear-page
patches to convert nth_page once David's changes are in. ]

SUT: AMD Zen5

I also did a quick hack to unconditionally use CLZERO/MOVNT on top of
Ankur's series, to test how much additional benefit architectural
enhancements can bring. [ In line with the second part of Ankur's old
series, before the preempt-lazy changes. ]

Please note that this is only for testing: ideally, for smaller sizes
we would want rep stosb only, and the threshold at which we switch to
a non-temporal clear should perhaps be a function of L3 and/or L2 size
(a sketch of this threshold idea follows the patch below).

Results:

base      : 6.17-rc6 + perf bench patches
clearpage : 6.17-rc6 + whole series from Ankur
clzero    : 6.17-rc6 + Ankur's series + clzero (below patch)
movnt     : 6.17-rc6 + Ankur's series + movnt (below patch)

Command run:
./perf bench mem mmap -p 2MB -f demand -s 64GB -l 10

Higher = better

            preempt = lazy (GB/sec)   preempt = voluntary (GB/sec)
base              20.655559                   19.712500
clearpage         35.060572                   34.533414
clzero            66.948422                   66.067265
movnt             51.593506                   51.403765

The CLZERO/MOVNT experimental patch follows. Hope I have not missed
anything here :)

-- >8 --

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 52c8910ba2ef..26cef2b187b9 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -3170,6 +3170,8 @@ config HAVE_ATOMIC_IOMAP
 	def_bool y
 	depends on X86_32
 
+source "arch/x86/Kconfig.cpy"
+
 source "arch/x86/kvm/Kconfig"
 
 source "arch/x86/Kconfig.cpufeatures"
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index 2361066d175e..aa2e62bbfa62 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -84,11 +84,23 @@ static inline void clear_pages(void *addr, unsigned int npages)
 	 */
 	kmsan_unpoison_memory(addr, len);
 	asm volatile(ALTERNATIVE_2("call memzero_page_aligned_unrolled",
-		     "shrq $3, %%rcx; rep stosq", X86_FEATURE_REP_GOOD,
-		     "rep stosb", X86_FEATURE_ERMS)
-		     : "+c" (len), "+D" (addr), ASM_CALL_CONSTRAINT
-		     : "a" (0)
-		     : "cc", "memory");
+		     "shrq $3, %%rcx; rep stosq", X86_FEATURE_REP_GOOD,
+#if defined(CONFIG_CLEARPAGE_CLZERO)
+		     "call clear_pages_clzero", X86_FEATURE_CLZERO)
+		     : "+c" (len), "+D" (addr), ASM_CALL_CONSTRAINT
+		     : "a" (0)
+		     : "cc", "memory");
+#elif defined(CONFIG_CLEARPAGE_MOVNT)
+		     "call clear_pages_movnt", X86_FEATURE_XMM2)
+		     : "+c" (len), "+D" (addr), ASM_CALL_CONSTRAINT
+		     : "a" (0)
+		     : "cc", "memory");
+#else
+		     "rep stosb", X86_FEATURE_ERMS)
+		     : "+c" (len), "+D" (addr), ASM_CALL_CONSTRAINT
+		     : "a" (0)
+		     : "cc", "memory");
+#endif
 }
 
 #define clear_pages clear_pages
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index 27debe0c018c..0848287446dd 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -4,6 +4,7 @@
 #include <linux/cfi_types.h>
 #include <linux/objtool.h>
 #include <asm/asm.h>
+#include <asm/page_types.h>
 
 /*
  * Zero page aligned region.
@@ -119,3 +120,40 @@ SYM_FUNC_START(rep_stos_alternative)
 	_ASM_EXTABLE_UA(17b, .Lclear_user_tail)
 SYM_FUNC_END(rep_stos_alternative)
 EXPORT_SYMBOL(rep_stos_alternative)
+
+SYM_FUNC_START(clear_pages_movnt)
+	.p2align 4
+.Lstart:
+	movnti	%rax, 0x00(%rdi)
+	movnti	%rax, 0x08(%rdi)
+	movnti	%rax, 0x10(%rdi)
+	movnti	%rax, 0x18(%rdi)
+	movnti	%rax, 0x20(%rdi)
+	movnti	%rax, 0x28(%rdi)
+	movnti	%rax, 0x30(%rdi)
+	movnti	%rax, 0x38(%rdi)
+	addq	$0x40, %rdi
+	subl	$0x40, %ecx
+	ja	.Lstart
+	RET
+SYM_FUNC_END(clear_pages_movnt)
+EXPORT_SYMBOL_GPL(clear_pages_movnt)
+
+/*
+ * Zero a region using clzero (on AMD, with X86_FEATURE_CLZERO).
+ *
+ * Issues the terminating sfence itself.
+ */
+
+SYM_FUNC_START(clear_pages_clzero)
+	movq	%rdi,%rax
+	.p2align 4
+.Liter:
+	clzero
+	addq	$0x40, %rax
+	subl	$0x40, %ecx
+	ja	.Liter
+	sfence
+	RET
+SYM_FUNC_END(clear_pages_clzero)
+EXPORT_SYMBOL_GPL(clear_pages_clzero)
--
2.43.0
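Following up on the threshold point made before the patch: a rough
sketch of how extent size might pick the clearing strategy
(hypothetical helper name and scaling factor; nothing like this exists
in the patch above):

#include <stdbool.h>

/*
 * Hypothetical sketch: choose a clearing strategy from the extent
 * size. The llc_size parameter and the 3/4 factor are illustrative
 * assumptions, not values from this thread.
 */
static bool prefer_nontemporal_clear(unsigned long len, unsigned long llc_size)
{
	/*
	 * Small extents likely stay cache-resident, so a cached
	 * 'rep stosb' wins; extents approaching LLC size would mostly
	 * evict useful lines, so non-temporal stores win there.
	 */
	return len >= llc_size / 4 * 3;
}

Also worth noting: unlike the clzero helper, clear_pages_movnt above
does not end with an sfence, so callers would need to order its
weakly-ordered movnti stores themselves before publishing the zeroed
pages.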
On Wed, Sep 17, 2025 at 08:24:02AM -0700, Ankur Arora wrote:
> This series adds clearing of contiguous page ranges for hugepages,
> improving on the current page-at-a-time approach in two ways:
>
> - amortizes the per-page setup cost over a larger extent
>
> - when using string instructions, exposes the real region size
>   to the processor.
>
> A processor could use knowledge of the extent to optimize the
> clearing. AMD Zen uarchs, as an example, elide allocation of
> cachelines for regions larger than L3-size.
>
> Demand faulting a 64GB region shows performance improvements:
>
>  $ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5
>
>               mm/folio_zero_user    x86/folio_zero_user      change
>                (GB/s +- %stdev)      (GB/s +- %stdev)
>
>  pg-sz=2MB     11.82 +- 0.67%        16.48 +-  0.30%        + 39.4%   preempt=*
>
>  pg-sz=1GB     17.14 +- 1.39%        17.42 +-  0.98%  [#]   +  1.6%   preempt=none|voluntary
>  pg-sz=1GB     17.51 +- 1.19%        43.23 +-  5.22%        +146.8%   preempt=full|lazy
>
>  [#] Milan uses a threshold of LLC-size (~32MB) for eliding cacheline
>  allocation, which is higher than the maximum extent used on x86
>  (ARCH_CONTIG_PAGE_NR=8MB), so preempt=none|voluntary sees no
>  improvement with pg-sz=1GB.

I'm picking up the tools/perf part for perf-tools-next (v6.18), already
almost 100% reviewed by Namhyung.

Thanks,

- Arnaldo

> Raghavendra also tested v3/v4 on AMD Genoa and sees similar
> improvements [1].
>
> Changelog:
>
> v7:
>  - interface cleanups, comments for clear_user_highpages(),
>    clear_user_pages(), clear_pages().
>  - fixed build errors flagged by the kernel test robot
>  - moved all x86 patches to the tail end
>
> v6:
>  - perf bench mem: update man pages and other cleanups (Namhyung Kim)
>  - unify folio_zero_user() for the HIGHMEM and !HIGHMEM options instead
>    of working through a new config option (David Hildenbrand).
>  - cleanups and simplification around that.
>  (https://lore.kernel.org/lkml/20250902080816.3715913-1-ankur.a.arora@oracle.com/)
>
> v5:
>  - move the non-HIGHMEM implementation of folio_zero_user() from x86
>    to common code (Dave Hansen)
>  - minor naming cleanups, commit messages etc.
>  (https://lore.kernel.org/lkml/20250710005926.1159009-1-ankur.a.arora@oracle.com/)
>
> v4:
>  - add perf bench workloads to exercise mmap() populate/demand-fault (Ingo)
>  - inline stosb etc. (PeterZ)
>  - handle cooperative preemption models (Ingo)
>  - interface and other cleanups all over (Ingo)
>  (https://lore.kernel.org/lkml/20250616052223.723982-1-ankur.a.arora@oracle.com/)
>
> v3:
>  - get rid of the preemption dependency (TIF_ALLOW_RESCHED); that version
>    was limited to preempt=full|lazy.
>  - override folio_zero_user() (Linus)
>  (https://lore.kernel.org/lkml/20250414034607.762653-1-ankur.a.arora@oracle.com/)
>
> v2:
>  - addressed review comments from peterz, tglx.
>  - removed clear_user_pages(), and CONFIG_X86_32:clear_pages()
>  - general code cleanup
>  (https://lore.kernel.org/lkml/20230830184958.2333078-1-ankur.a.arora@oracle.com/)
>
> Comments appreciated!
>
> Also at:
>   github.com/terminus/linux clear-pages.v7
>
> [1] https://lore.kernel.org/lkml/fffd4dad-2cb9-4bc9-8a80-a70be687fd54@amd.com/
>
> Ankur Arora (16):
>   perf bench mem: Remove repetition around time measurement
>   perf bench mem: Defer type munging of size to float
>   perf bench mem: Move mem op parameters into a structure
>   perf bench mem: Pull out init/fini logic
>   perf bench mem: Switch from zalloc() to mmap()
>   perf bench mem: Allow mapping of hugepages
>   perf bench mem: Allow chunking on a memory region
>   perf bench mem: Refactor mem_options
>   perf bench mem: Add mmap() workloads
>   mm: define clear_pages(), clear_user_pages()
>   mm/highmem: introduce clear_user_highpages()
>   arm: mm: define clear_user_highpages()
>   mm: memory: support clearing page ranges
>   x86/mm: Simplify clear_page_*
>   x86/clear_page: Introduce clear_pages()
>   x86/clear_pages: Support clearing of page-extents
>
>  arch/arm/include/asm/page.h                  |   7 +
>  arch/x86/include/asm/page_32.h               |   6 +
>  arch/x86/include/asm/page_64.h               |  72 +++-
>  arch/x86/lib/clear_page_64.S                 |  39 +-
>  include/linux/highmem.h                      |  18 +
>  include/linux/mm.h                           |  44 +++
>  mm/memory.c                                  |  82 +++-
>  tools/perf/Documentation/perf-bench.txt      |  58 ++-
>  tools/perf/bench/bench.h                     |   1 +
>  tools/perf/bench/mem-functions.c             | 390 ++++++++++++++-----
>  tools/perf/bench/mem-memcpy-arch.h           |   2 +-
>  tools/perf/bench/mem-memcpy-x86-64-asm-def.h |   4 +
>  tools/perf/bench/mem-memset-arch.h           |   2 +-
>  tools/perf/bench/mem-memset-x86-64-asm-def.h |   4 +
>  tools/perf/builtin-bench.c                   |   1 +
>  15 files changed, 560 insertions(+), 170 deletions(-)
>
> --
> 2.43.5
Arnaldo Carvalho de Melo <acme@kernel.org> writes:

> On Wed, Sep 17, 2025 at 08:24:02AM -0700, Ankur Arora wrote:
>> This series adds clearing of contiguous page ranges for hugepages,
>> improving on the current page-at-a-time approach in two ways:
>>
>> - amortizes the per-page setup cost over a larger extent
>>
>> - when using string instructions, exposes the real region size
>>   to the processor.
>>
>> A processor could use knowledge of the extent to optimize the
>> clearing. AMD Zen uarchs, as an example, elide allocation of
>> cachelines for regions larger than L3-size.
>>
>> Demand faulting a 64GB region shows performance improvements:
>>
>>  $ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5
>>
>>               mm/folio_zero_user    x86/folio_zero_user      change
>>                (GB/s +- %stdev)      (GB/s +- %stdev)
>>
>>  pg-sz=2MB     11.82 +- 0.67%        16.48 +-  0.30%        + 39.4%   preempt=*
>>
>>  pg-sz=1GB     17.14 +- 1.39%        17.42 +-  0.98%  [#]   +  1.6%   preempt=none|voluntary
>>  pg-sz=1GB     17.51 +- 1.19%        43.23 +-  5.22%        +146.8%   preempt=full|lazy
>>
>>  [#] Milan uses a threshold of LLC-size (~32MB) for eliding cacheline
>>  allocation, which is higher than the maximum extent used on x86
>>  (ARCH_CONTIG_PAGE_NR=8MB), so preempt=none|voluntary sees no
>>  improvement with pg-sz=1GB.
>
> I'm picking up the tools/perf part for perf-tools-next (v6.18), already
> almost 100% reviewed by Namhyung.

Thanks!

--
ankur