This series adds clearing of contiguous page ranges for hugepages,
improving on the current page-at-a-time approach in two ways:
- amortizes the per-page setup cost over a larger extent
- when using string instructions, exposes the real region size
to the processor.
A processor can use knowledge of the full extent to optimize the
clearing better than if it sees only a single page-sized extent at
a time. AMD Zen uarchs, for example, elide cacheline allocation
for regions larger than the LLC size.
Demand faulting a 64GB region shows performance improvement:
$ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5
                 baseline             +series               change
             (GB/s +- %stdev)    (GB/s +- %stdev)

 pg-sz=2MB    12.92 +- 2.55%      17.03 +- 0.70%          + 31.8%  preempt=*
 pg-sz=1GB    17.14 +- 2.27%      18.04 +- 1.05%          +  5.2%  preempt=none|voluntary
 pg-sz=1GB    17.26 +- 1.24%      42.17 +- 4.21% [#]      +144.3%  preempt=full|lazy
[#] Notice that we perform much better with preempt=full|lazy. That's
because preemptible models don't need explicit invocations of
cond_resched() to ensure reasonable preemption latency, which
allows us to clear the full extent (1GB) in a single unit.
In comparison the maximum extent used for preempt=none|voluntary is
PROCESS_PAGES_NON_PREEMPT_BATCH (8MB).
The larger extent allows the processor to elide cacheline
allocation (on Milan the threshold is the LLC size, 32MB).
(The hope is that eventually, in the fullness of time, the lazy
preemption model will be able to do the same job that none or
voluntary models are used for, allowing cond_resched() to go away.)
The anon-w-seq test in the vm-scalability benchmark, however, does show
worse performance with utime increasing by ~9%:
                  stime                    utime
 baseline    1654.63 ( +- 3.84% )    811.00 ( +- 3.84% )
 +series     1630.32 ( +- 2.73% )    886.37 ( +- 5.19% )
In part this is because anon-w-seq runs with 384 processes zeroing
anonymously mapped memory which they then access sequentially. This
is likely an uncommon pattern, one where memory bandwidth is
saturated while we are also cache limited because the workload
accesses the entire region.
Raghavendra also tested a previous version of the series on AMD
Genoa and sees improvement [1] with preempt=lazy.
(The pg-sz=2MB improvement is much higher on Genoa than I see on
Milan):
$ perf bench mem mmap -p $page-size -f populate -s 64GB -l 10
               base               patched            change
 pg-sz=2MB     12.731939 GB/sec   26.304263 GB/sec   106.6%
 pg-sz=1GB     26.232423 GB/sec   61.174836 GB/sec   133.2%
Changelog:
v9:
- Define PROCESS_PAGES_NON_PREEMPT_BATCH in common code (instead of
inheriting ARCH_PAGE_CONTIG_NR.)
- Also document this in much greater detail since clearing pages
  needing a constant dependent on the preemption model is
  facially quite odd.
(Suggested by David Hildenbrand, Andrew Morton, Borislav Petkov.)
- Switch architectural markers from __HAVE_ARCH_CLEAR_USER_PAGE (and
similar) to clear_user_page etc. (Suggested by David Hildenbrand)
- s/memzero_page_aligned_unrolled/__clear_pages_unrolled/
(Suggested by Borislav Petkov.)
- style, comment fixes
(https://lore.kernel.org/lkml/20251027202109.678022-1-ankur.a.arora@oracle.com/)
v8:
- make clear_user_highpages(), clear_user_pages() and clear_pages()
more robust across architectures. (Thanks David!)
- split up folio_zero_user() changes into ones for clearing contiguous
regions and those for maintaining temporal locality since they have
different performance profiles (Suggested by Andrew Morton.)
- added Raghavendra's Reviewed-by, Tested-by.
- get rid of nth_page()
- perf related patches have been pulled already. Remove them.
v7:
- interface cleanups, comments for clear_user_highpages(), clear_user_pages(),
clear_pages().
- fixed build errors flagged by kernel test robot
(https://lore.kernel.org/lkml/20250917152418.4077386-1-ankur.a.arora@oracle.com/)
v6:
- perf bench mem: update man pages and other cleanups (Namhyung Kim)
- unify folio_zero_user() for HIGHMEM, !HIGHMEM options instead of
working through a new config option (David Hildenbrand).
- cleanups and simplification around that.
(https://lore.kernel.org/lkml/20250902080816.3715913-1-ankur.a.arora@oracle.com/)
v5:
- move the non HIGHMEM implementation of folio_zero_user() from x86
to common code (Dave Hansen)
- Minor naming cleanups, commit messages etc
(https://lore.kernel.org/lkml/20250710005926.1159009-1-ankur.a.arora@oracle.com/)
v4:
- adds perf bench workloads to exercise mmap() populate/demand-fault (Ingo)
- inline stosb etc (PeterZ)
- handle cooperative preemption models (Ingo)
- interface and other cleanups all over (Ingo)
(https://lore.kernel.org/lkml/20250616052223.723982-1-ankur.a.arora@oracle.com/)
v3:
- get rid of preemption dependency (TIF_ALLOW_RESCHED); this version
was limited to preempt=full|lazy.
- override folio_zero_user() (Linus)
(https://lore.kernel.org/lkml/20250414034607.762653-1-ankur.a.arora@oracle.com/)
v2:
- addressed review comments from peterz, tglx.
- Removed clear_user_pages(), and CONFIG_X86_32:clear_pages()
- General code cleanup
(https://lore.kernel.org/lkml/20230830184958.2333078-1-ankur.a.arora@oracle.com/)
Comments appreciated!
Also at:
github.com/terminus/linux clear-pages.v7
[1] https://lore.kernel.org/lkml/fffd4dad-2cb9-4bc9-8a80-a70be687fd54@amd.com/
Ankur Arora (6):
mm: introduce clear_pages() and clear_user_pages()
mm/highmem: introduce clear_user_highpages()
x86/mm: Simplify clear_page_*
x86/clear_page: Introduce clear_pages()
mm, folio_zero_user: support clearing page ranges
mm: folio_zero_user: cache neighbouring pages
David Hildenbrand (1):
treewide: provide a generic clear_user_page() variant
arch/alpha/include/asm/page.h | 1 -
arch/arc/include/asm/page.h | 2 +
arch/arm/include/asm/page-nommu.h | 1 -
arch/arm64/include/asm/page.h | 1 -
arch/csky/abiv1/inc/abi/page.h | 1 +
arch/csky/abiv2/inc/abi/page.h | 7 ---
arch/hexagon/include/asm/page.h | 1 -
arch/loongarch/include/asm/page.h | 1 -
arch/m68k/include/asm/page_no.h | 1 -
arch/microblaze/include/asm/page.h | 1 -
arch/mips/include/asm/page.h | 1 +
arch/nios2/include/asm/page.h | 1 +
arch/openrisc/include/asm/page.h | 1 -
arch/parisc/include/asm/page.h | 1 -
arch/powerpc/include/asm/page.h | 1 +
arch/riscv/include/asm/page.h | 1 -
arch/s390/include/asm/page.h | 1 -
arch/sparc/include/asm/page_64.h | 1 +
arch/um/include/asm/page.h | 1 -
arch/x86/include/asm/page.h | 6 --
arch/x86/include/asm/page_32.h | 6 ++
arch/x86/include/asm/page_64.h | 59 +++++++++++++-----
arch/x86/lib/clear_page_64.S | 39 +++---------
arch/xtensa/include/asm/page.h | 1 -
include/linux/highmem.h | 29 +++++++++
include/linux/mm.h | 98 ++++++++++++++++++++++++++++++
mm/memory.c | 86 +++++++++++++++++++-------
mm/util.c | 13 ++++
28 files changed, 269 insertions(+), 94 deletions(-)
--
2.31.1