Change folio_zero_user() to clear contiguous page ranges instead of
clearing using the current page-at-a-time approach. Exposing the largest
feasible length can be useful in enabling processors to optimize based
on extent.

However, clearing in large chunks can have two problems:

- cache locality when clearing small folios (< MAX_ORDER_NR_PAGES)
  (larger folios don't have any expectation of cache locality).

- preemption latency when clearing large folios.

Handle the first by splitting the clearing in three parts: the
faulting page and its immediate locality, its left and right
regions; with the local neighbourhood cleared last.
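
For illustration only, a minimal userspace sketch of how the three
regions are carved out (the folio size, fault index and neighbourhood
width below are hypothetical values, not taken from this patch):

    #include <stdio.h>

    static long clamp(long v, long lo, long hi)
    {
            return v < lo ? lo : v > hi ? hi : v;
    }

    int main(void)
    {
            const long start = 0, end = 511;    /* 512-page (2MB) folio */
            const long fault = 200, width = 2;  /* faulting page index  */
            long r[3][2];

            /* Local neighbourhood around the fault; cleared last. */
            r[2][0] = clamp(fault - width, start, end);
            r[2][1] = clamp(fault + width, start, end);
            /* Region to the left of the fault. */
            r[1][0] = start;
            r[1][1] = clamp(r[2][0] - 1, start - 1, r[2][0]);
            /* Region to the right of the fault. */
            r[0][0] = clamp(r[2][1] + 1, r[2][1], end + 1);
            r[0][1] = end;

            /* Prints [203,511], [0,197], [198,202]: right, left, local. */
            for (int i = 0; i <= 2; i++)
                    printf("r[%d] = [%ld, %ld]\n", i, r[i][0], r[i][1]);
            return 0;
    }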

The second problem becomes relevant when running under cooperative
preemption models. Limit the worst-case preemption latency by clearing
in architecture-specified PAGE_CONTIG_NR units, with a default of a
single page where the architecture does not specify one.
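
As a hypothetical illustration (the 8MB figure and the header location
are assumptions, not taken from this patch), an architecture could opt
in to larger units with something like:

    /* <asm/page.h>: clear in 8MB chunks between cond_resched() calls. */
    #define ARCH_PAGE_CONTIG_NR	(SZ_8M >> PAGE_SHIFT)

Assuming a clearing throughput on the order of 30 GB/s, an 8MB unit
bounds the stretch between cond_resched() calls to roughly a quarter of
a millisecond.
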
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
include/linux/mm.h | 6 ++++
mm/memory.c | 82 ++++++++++++++++++++++++++++++++++------------
2 files changed, 67 insertions(+), 21 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0cde9b01da5e..29b2a8bf7b4f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3768,6 +3768,12 @@ static inline void clear_page_guard(struct zone *zone, struct page *page,
unsigned int order) {}
#endif /* CONFIG_DEBUG_PAGEALLOC */
+#ifndef ARCH_PAGE_CONTIG_NR
+#define PAGE_CONTIG_NR 1
+#else
+#define PAGE_CONTIG_NR ARCH_PAGE_CONTIG_NR
+#endif
+
#ifndef clear_pages
/**
* clear_pages() - clear a page range using a kernel virtual address.
diff --git a/mm/memory.c b/mm/memory.c
index 0ba4f6b71847..0f5b1900b480 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -7021,40 +7021,80 @@ static inline int process_huge_page(
return 0;
}
-static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint,
- unsigned int nr_pages)
+/*
+ * Clear contiguous pages chunking them up when running under
+ * non-preemptible models.
+ */
+static void clear_contig_highpages(struct page *page, unsigned long addr,
+ unsigned int npages)
{
- unsigned long addr = ALIGN_DOWN(addr_hint, folio_size(folio));
- int i;
+ unsigned int i, count, unit;
- might_sleep();
- for (i = 0; i < nr_pages; i++) {
+ unit = preempt_model_preemptible() ? npages : PAGE_CONTIG_NR;
+
+ for (i = 0; i < npages; ) {
+ count = min(unit, npages - i);
+ clear_user_highpages(nth_page(page, i),
+ addr + i * PAGE_SIZE, count);
+ i += count;
cond_resched();
- clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE);
}
}
-static int clear_subpage(unsigned long addr, int idx, void *arg)
-{
- struct folio *folio = arg;
-
- clear_user_highpage(folio_page(folio, idx), addr);
- return 0;
-}
-
/**
* folio_zero_user - Zero a folio which will be mapped to userspace.
* @folio: The folio to zero.
- * @addr_hint: The address will be accessed or the base address if uncelar.
+ * @addr_hint: The address accessed by the user or the base address.
+ *
+ * Uses architectural support for clear_pages() to zero page extents
+ * instead of clearing page-at-a-time.
+ *
+ * Clearing of small folios (< MAX_ORDER_NR_PAGES) is split in three parts:
+ * pages in the immediate locality of the faulting page, and its left, right
+ * regions; the local neighbourhood cleared last in order to keep cache
+ * lines of the target region hot.
+ *
+ * For larger folios we assume that there is no expectation of cache locality
+ * and just do a straight zero.
*/
void folio_zero_user(struct folio *folio, unsigned long addr_hint)
{
- unsigned int nr_pages = folio_nr_pages(folio);
+ unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
+ const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE;
+ const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1);
+ const int width = 2; /* number of pages cleared last on either side */
+ struct range r[3];
+ int i;
- if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
- clear_gigantic_page(folio, addr_hint, nr_pages);
- else
- process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
+ if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) {
+ clear_contig_highpages(folio_page(folio, 0),
+ base_addr, folio_nr_pages(folio));
+ return;
+ }
+
+ /*
+ * Faulting page and its immediate neighbourhood. Cleared at the end to
+ * ensure it sticks around in the cache.
+ */
+ r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end),
+ clamp_t(s64, fault_idx + width, pg.start, pg.end));
+
+ /* Region to the left of the fault */
+ r[1] = DEFINE_RANGE(pg.start,
+ clamp_t(s64, r[2].start-1, pg.start-1, r[2].start));
+
+ /* Region to the right of the fault: always valid for the common fault_idx=0 case. */
+ r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1),
+ pg.end);
+
+ for (i = 0; i <= 2; i++) {
+ unsigned int npages = range_len(&r[i]);
+ struct page *page = folio_page(folio, r[i].start);
+ unsigned long addr = base_addr + folio_page_idx(folio, page) * PAGE_SIZE;
+
+ if (npages > 0)
+ clear_contig_highpages(page, addr, npages);
+ }
}
static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
--
2.43.5
Hi Ankur,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Ankur-Arora/perf-bench-mem-Remove-repetition-around-time-measurement/20250917-233045
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20250917152418.4077386-14-ankur.a.arora%40oracle.com
patch subject: [PATCH v7 13/16] mm: memory: support clearing page ranges
config: arm-randconfig-001-20250919 (https://download.01.org/0day-ci/archive/20250919/202509191916.a0oRRfua-lkp@intel.com/config)
compiler: arm-linux-gnueabi-gcc (GCC) 12.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250919/202509191916.a0oRRfua-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202509191916.a0oRRfua-lkp@intel.com/

All error/warnings (new ones prefixed by >>):

   In file included from arch/arm/include/asm/thread_info.h:14,
                    from include/linux/thread_info.h:60,
                    from include/asm-generic/preempt.h:5,
                    from ./arch/arm/include/generated/asm/preempt.h:1,
                    from include/linux/preempt.h:79,
                    from include/linux/smp.h:116,
                    from include/linux/kernel_stat.h:5,
                    from mm/memory.c:42:
   mm/memory.c: In function 'clear_contig_highpages':
   mm/memory.c:7199:38: error: implicit declaration of function 'nth_page'; did you mean 'pte_page'? [-Werror=implicit-function-declaration]
    7199 |                 clear_user_highpages(nth_page(page, i),
         |                                      ^~~~~~~~
   arch/arm/include/asm/page.h:152:36: note: in definition of macro 'clear_user_highpage'
     152 |         __cpu_clear_user_highpage(page, vaddr)
         |                                   ^~~~
   mm/memory.c:7199:17: note: in expansion of macro 'clear_user_highpages'
    7199 |                 clear_user_highpages(nth_page(page, i),
         |                 ^~~~~~~~~~~~~~~~~~~~
>> mm/memory.c:7199:38: warning: passing argument 1 of 'cpu_user.cpu_clear_user_highpage' makes pointer from integer without a cast [-Wint-conversion]
    7199 |                 clear_user_highpages(nth_page(page, i),
         |                                      ^~~~~~~~~~~~~~~~~
         |                                      |
         |                                      int
   arch/arm/include/asm/page.h:152:36: note: in definition of macro 'clear_user_highpage'
     152 |         __cpu_clear_user_highpage(page, vaddr)
         |                                   ^~~~
   mm/memory.c:7199:17: note: in expansion of macro 'clear_user_highpages'
    7199 |                 clear_user_highpages(nth_page(page, i),
         |                 ^~~~~~~~~~~~~~~~~~~~
   mm/memory.c:7199:38: note: expected 'struct page *' but argument is of type 'int'
    7199 |                 clear_user_highpages(nth_page(page, i),
         |                                      ^~~~~~~~~~~~~~~~~
   arch/arm/include/asm/page.h:152:36: note: in definition of macro 'clear_user_highpage'
     152 |         __cpu_clear_user_highpage(page, vaddr)
         |                                   ^~~~
   mm/memory.c:7199:17: note: in expansion of macro 'clear_user_highpages'
    7199 |                 clear_user_highpages(nth_page(page, i),
         |                 ^~~~~~~~~~~~~~~~~~~~
>> arch/arm/include/asm/page.h:157:15: error: lvalue required as left operand of assignment
     157 |         vaddr += PAGE_SIZE;                     \
         |               ^~
   mm/memory.c:7199:17: note: in expansion of macro 'clear_user_highpages'
    7199 |                 clear_user_highpages(nth_page(page, i),
         |                 ^~~~~~~~~~~~~~~~~~~~
>> arch/arm/include/asm/page.h:158:13: error: lvalue required as increment operand
     158 |         page++;                                 \
         |             ^~
   mm/memory.c:7199:17: note: in expansion of macro 'clear_user_highpages'
    7199 |                 clear_user_highpages(nth_page(page, i),
         |                 ^~~~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors


vim +7199 mm/memory.c

  7185	
  7186	/*
  7187	 * Clear contiguous pages chunking them up when running under
  7188	 * non-preemptible models.
  7189	 */
  7190	static void clear_contig_highpages(struct page *page, unsigned long addr,
  7191	                                   unsigned int npages)
  7192	{
  7193	        unsigned int i, count, unit;
  7194	
  7195	        unit = preempt_model_preemptible() ? npages : PAGE_CONTIG_NR;
  7196	
  7197	        for (i = 0; i < npages; ) {
  7198	                count = min(unit, npages - i);
> 7199	                clear_user_highpages(nth_page(page, i),
  7200	                                     addr + i * PAGE_SIZE, count);
  7201	                i += count;
  7202	                cond_resched();
  7203	        }
  7204	}
  7205	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
On Wed, 17 Sep 2025 08:24:15 -0700 Ankur Arora <ankur.a.arora@oracle.com> wrote:

> Change folio_zero_user() to clear contiguous page ranges instead of
> clearing using the current page-at-a-time approach. Exposing the largest
> feasible length can be useful in enabling processors to optimize based
> on extent.

This patch is something which MM developers might care to take a closer
look at.

> However, clearing in large chunks can have two problems:
>
> - cache locality when clearing small folios (< MAX_ORDER_NR_PAGES)
>   (larger folios don't have any expectation of cache locality).
>
> - preemption latency when clearing large folios.
>
> Handle the first by splitting the clearing in three parts: the
> faulting page and its immediate locality, its left and right
> regions; with the local neighbourhood cleared last.

Has this optimization been shown to be beneficial?

If so, are you able to share some measurements?

If not, maybe it should be removed?

> ...
>
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -7021,40 +7021,80 @@ static inline int process_huge_page(
>  	return 0;
>  }
>  
> -static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint,
> -				unsigned int nr_pages)
> +/*
> + * Clear contiguous pages chunking them up when running under
> + * non-preemptible models.
> + */
> +static void clear_contig_highpages(struct page *page, unsigned long addr,
> +				   unsigned int npages)

Called "_highpages" because it wraps clear_user_highpages().  It really
should be called clear_contig_user_highpages() ;) (Not serious)

>  {
> -	unsigned long addr = ALIGN_DOWN(addr_hint, folio_size(folio));
> -	int i;
> +	unsigned int i, count, unit;
>  
> -	might_sleep();
> -	for (i = 0; i < nr_pages; i++) {
> +	unit = preempt_model_preemptible() ? npages : PAGE_CONTIG_NR;

Almost nothing uses preempt_model_preemptible() and I'm not usefully
familiar with it.  Will this check avoid all softlockup/rcu/etc
detections in all situations (ie, configs)?

> +	for (i = 0; i < npages; ) {
> +		count = min(unit, npages - i);
> +		clear_user_highpages(nth_page(page, i),
> +				     addr + i * PAGE_SIZE, count);
> +		i += count;
>  		cond_resched();
> -		clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE);
>  	}
>  }
On 9/18/2025 3:14 AM, Andrew Morton wrote:
> On Wed, 17 Sep 2025 08:24:15 -0700 Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> Change folio_zero_user() to clear contiguous page ranges instead of
>> clearing using the current page-at-a-time approach. Exposing the largest
>> feasible length can be useful in enabling processors to optimize based
>> on extent.
>
> This patch is something which MM developers might care to take a closer
> look at.
>
>> However, clearing in large chunks can have two problems:
>>
>> - cache locality when clearing small folios (< MAX_ORDER_NR_PAGES)
>>   (larger folios don't have any expectation of cache locality).
>>
>> - preemption latency when clearing large folios.
>>
>> Handle the first by splitting the clearing in three parts: the
>> faulting page and its immediate locality, its left and right
>> regions; with the local neighbourhood cleared last.
>
> Has this optimization been shown to be beneficial?
>
> If so, are you able to share some measurements?
>
> If not, maybe it should be removed?
>

I reverted the effect of this patch by hard coding

#define PAGE_CONTIG_NR 1

I see that benefit for voluntary kernel is lost without this change
(for rep stosb)

with PAGE_CONTIG_NR equivalent to 8MB

Preempt mode: voluntary

# Running 'mem/mmap' benchmark:
# function 'demand' (Demand loaded mmap())
# Copying 64GB bytes ...

34.533414 GB/sec

with PAGE_CONTIG_NR equivalent to 4KB

# Running 'mem/mmap' benchmark:
# function 'demand' (Demand loaded mmap())
# Copying 64GB bytes ...

20.766059 GB/sec

For now (barring David's recommendations), feel free to add

Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com>
On 9/23/2025 2:06 PM, Raghavendra K T wrote:
>
> On 9/18/2025 3:14 AM, Andrew Morton wrote:
>> On Wed, 17 Sep 2025 08:24:15 -0700 Ankur Arora
>> <ankur.a.arora@oracle.com> wrote:
>>
>>> Change folio_zero_user() to clear contiguous page ranges instead of
>>> clearing using the current page-at-a-time approach. Exposing the largest
>>> feasible length can be useful in enabling processors to optimize based
>>> on extent.
>>
>> This patch is something which MM developers might care to take a closer
>> look at.
>>
>>> However, clearing in large chunks can have two problems:
>>>
>>> - cache locality when clearing small folios (< MAX_ORDER_NR_PAGES)
>>>   (larger folios don't have any expectation of cache locality).
>>>
>>> - preemption latency when clearing large folios.
>>>
>>> Handle the first by splitting the clearing in three parts: the
>>> faulting page and its immediate locality, its left and right
>>> regions; with the local neighbourhood cleared last.
>>
>> Has this optimization been shown to be beneficial?
>>
>> If so, are you able to share some measurements?
>>
>> If not, maybe it should be removed?
>>
>
> I reverted the effect of this patch by hard coding
>
> #define PAGE_CONTIG_NR 1
>
> I see that benefit for voluntary kernel is lost without this change
> (for rep stosb)
>
> with PAGE_CONTIG_NR equivalent to 8MB
>
> Preempt mode: voluntary
>
> # Running 'mem/mmap' benchmark:
> # function 'demand' (Demand loaded mmap())
> # Copying 64GB bytes ...
>
> 34.533414 GB/sec
>
> with PAGE_CONTIG_NR equivalent to 4KB
>
> # Running 'mem/mmap' benchmark:
> # function 'demand' (Demand loaded mmap())
> # Copying 64GB bytes ...
>
> 20.766059 GB/sec
>
> For now (barring David's recommendations), feel free to add
>
> Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com>
>

My reply was more towards the benefits of clearing multiple pages for
non-preempt and voluntary kernel, than the effect of clearing
neighborhood range at the end. Sorry if that was confusing :)
On 17.09.25 23:44, Andrew Morton wrote:
> On Wed, 17 Sep 2025 08:24:15 -0700 Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> Change folio_zero_user() to clear contiguous page ranges instead of
>> clearing using the current page-at-a-time approach. Exposing the largest
>> feasible length can be useful in enabling processors to optimize based
>> on extent.
>
> This patch is something which MM developers might care to take a closer
> look at.

I took a look at various revisions of this series, I'm only lagging
behind on reviewing the latest series :)

>
>> However, clearing in large chunks can have two problems:
>>
>> - cache locality when clearing small folios (< MAX_ORDER_NR_PAGES)
>>   (larger folios don't have any expectation of cache locality).
>>
>> - preemption latency when clearing large folios.
>>
>> Handle the first by splitting the clearing in three parts: the
>> faulting page and its immediate locality, its left and right
>> regions; with the local neighbourhood cleared last.
>
> Has this optimization been shown to be beneficial?
>
> If so, are you able to share some measurements?
>
> If not, maybe it should be removed?
>
>> ...
>>
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -7021,40 +7021,80 @@ static inline int process_huge_page(
>>  	return 0;
>>  }
>>  
>> -static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint,
>> -				unsigned int nr_pages)
>> +/*
>> + * Clear contiguous pages chunking them up when running under
>> + * non-preemptible models.
>> + */
>> +static void clear_contig_highpages(struct page *page, unsigned long addr,
>> +				   unsigned int npages)
>
> Called "_highpages" because it wraps clear_user_highpages().  It really
> should be called clear_contig_user_highpages() ;) (Not serious)

You have a point there, though :)

Fortunately this is only an internal helper.

-- 
Cheers

David / dhildenb
[ Added Paul McKenney. ]

Andrew Morton <akpm@linux-foundation.org> writes:

> On Wed, 17 Sep 2025 08:24:15 -0700 Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> Change folio_zero_user() to clear contiguous page ranges instead of
>> clearing using the current page-at-a-time approach. Exposing the largest
>> feasible length can be useful in enabling processors to optimize based
>> on extent.
>
> This patch is something which MM developers might care to take a closer
> look at.
>
>> However, clearing in large chunks can have two problems:
>>
>> - cache locality when clearing small folios (< MAX_ORDER_NR_PAGES)
>>   (larger folios don't have any expectation of cache locality).
>>
>> - preemption latency when clearing large folios.
>>
>> Handle the first by splitting the clearing in three parts: the
>> faulting page and its immediate locality, its left and right
>> regions; with the local neighbourhood cleared last.
>
> Has this optimization been shown to be beneficial?

So, this was mostly meant to be defensive. The current code does a
rather extensive left-right dance around the faulting page via
c6ddfb6c58 ("mm, clear_huge_page: move order algorithm into a separate
function") and I wanted to keep the cache hot property for the region
closest to the address touched by the user.

But, no I haven't run any tests showing that it helps.

> If so, are you able to share some measurements?

From some quick kernel builds (with THP) I do see a consistent
difference of a few seconds (1% worse) if I remove this optimization.

(I'm not sure right now why it is worse -- my expectation was that we
would have higher cache misses, but I see pretty similar cache numbers.)

But let me do a more careful test and report back.

> If not, maybe it should be removed?
>
>> ...
>>
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -7021,40 +7021,80 @@ static inline int process_huge_page(
>>  	return 0;
>>  }
>>  
>> -static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint,
>> -				unsigned int nr_pages)
>> +/*
>> + * Clear contiguous pages chunking them up when running under
>> + * non-preemptible models.
>> + */
>> +static void clear_contig_highpages(struct page *page, unsigned long addr,
>> +				   unsigned int npages)
>
> Called "_highpages" because it wraps clear_user_highpages().  It really
> should be called clear_contig_user_highpages() ;) (Not serious)

Or maybe clear_user_contig_highpages(), so when we get rid of HUGEMEM,
the _highpages could just be chopped off :D.

>>  {
>> -	unsigned long addr = ALIGN_DOWN(addr_hint, folio_size(folio));
>> -	int i;
>> +	unsigned int i, count, unit;
>>  
>> -	might_sleep();
>> -	for (i = 0; i < nr_pages; i++) {
>> +	unit = preempt_model_preemptible() ? npages : PAGE_CONTIG_NR;
>
> Almost nothing uses preempt_model_preemptible() and I'm not usefully
> familiar with it.  Will this check avoid all softlockup/rcu/etc
> detections in all situations (ie, configs)?

IMO, yes. The code invoked under preempt_model_preemptible() will boil
down to a single interruptible REP STOSB which might execute over an
extent of 1GB (with the last patch). From prior experiments, I know
that irqs are able to interrupt this. And, I /think/ that is a
sufficient condition for avoiding RCU stalls/softlockups etc.

Also, when we were discussing lazy preemption (which Thomas had
suggested as a way to handle scenarios like this or long running Xen
hypercalls etc) this seemed like a scenario that didn't need any extra
handling for CONFIG_PREEMPT.

We did need 83b28cfe79 ("rcu: handle quiescent states for
PREEMPT_RCU=n, PREEMPT_COUNT=y") for CONFIG_PREEMPT_LAZY but AFAICS
this should be safe.

Anyway let me think about your all configs point (though only ones
which can have some flavour for hugetlb.)

Also, I would like x86 folks opinion on this. And, maybe Paul McKenney
just to make sure I'm not missing something on RCU side.

Thanks for the comments.

-- 
ankur