Change folio_zero_user() to clear contiguous page ranges instead of
clearing using the current page-at-a-time approach. Exposing the largest
feasible length can be useful in enabling processors to optimize based
on extent.

However, clearing in large chunks can have two problems:

 - cache locality when clearing small folios (< MAX_ORDER_NR_PAGES)
   (larger folios don't have any expectation of cache locality).

 - preemption latency when clearing large folios.

Handle the first by splitting the clearing in three parts: the
faulting page and its immediate locality, its left and right
regions; with the local neighbourhood cleared last.

The second problem becomes relevant when running under cooperative
preemption models. Limit the worst-case preemption latency by clearing
in architecture-specified PAGE_CONTIG_NR units, using a default value
of 1 where not specified.

Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
include/linux/mm.h | 6 ++++
mm/memory.c | 82 ++++++++++++++++++++++++++++++++++------------
2 files changed, 67 insertions(+), 21 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0cde9b01da5e..29b2a8bf7b4f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3768,6 +3768,12 @@ static inline void clear_page_guard(struct zone *zone, struct page *page,
unsigned int order) {}
#endif /* CONFIG_DEBUG_PAGEALLOC */
+#ifndef ARCH_PAGE_CONTIG_NR
+#define PAGE_CONTIG_NR 1
+#else
+#define PAGE_CONTIG_NR ARCH_PAGE_CONTIG_NR
+#endif
+
#ifndef clear_pages
/**
* clear_pages() - clear a page range using a kernel virtual address.
diff --git a/mm/memory.c b/mm/memory.c
index 0ba4f6b71847..0f5b1900b480 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -7021,40 +7021,80 @@ static inline int process_huge_page(
return 0;
}
-static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint,
- unsigned int nr_pages)
+/*
+ * Clear contiguous pages chunking them up when running under
+ * non-preemptible models.
+ */
+static void clear_contig_highpages(struct page *page, unsigned long addr,
+ unsigned int npages)
{
- unsigned long addr = ALIGN_DOWN(addr_hint, folio_size(folio));
- int i;
+ unsigned int i, count, unit;
- might_sleep();
- for (i = 0; i < nr_pages; i++) {
+ unit = preempt_model_preemptible() ? npages : PAGE_CONTIG_NR;
+
+ for (i = 0; i < npages; ) {
+ count = min(unit, npages - i);
+ clear_user_highpages(nth_page(page, i),
+ addr + i * PAGE_SIZE, count);
+ i += count;
cond_resched();
- clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE);
}
}
-static int clear_subpage(unsigned long addr, int idx, void *arg)
-{
- struct folio *folio = arg;
-
- clear_user_highpage(folio_page(folio, idx), addr);
- return 0;
-}
-
/**
* folio_zero_user - Zero a folio which will be mapped to userspace.
* @folio: The folio to zero.
- * @addr_hint: The address will be accessed or the base address if uncelar.
+ * @addr_hint: The address accessed by the user or the base address.
+ *
+ * Uses architectural support for clear_pages() to zero page extents
+ * instead of clearing page-at-a-time.
+ *
+ * Clearing of small folios (< MAX_ORDER_NR_PAGES) is split in three parts:
+ * pages in the immediate locality of the faulting page, and its left, right
+ * regions; the local neighbourhood cleared last in order to keep cache
+ * lines of the target region hot.
+ *
+ * For larger folios we assume that there is no expectation of cache locality
+ * and just do a straight zero.
*/
void folio_zero_user(struct folio *folio, unsigned long addr_hint)
{
- unsigned int nr_pages = folio_nr_pages(folio);
+ unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
+ const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE;
+ const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1);
+ const int width = 2; /* number of pages cleared last on either side */
+ struct range r[3];
+ int i;
- if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
- clear_gigantic_page(folio, addr_hint, nr_pages);
- else
- process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
+ if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) {
+ clear_contig_highpages(folio_page(folio, 0),
+ base_addr, folio_nr_pages(folio));
+ return;
+ }
+
+ /*
+ * Faulting page and its immediate neighbourhood. Cleared at the end to
+ * ensure it sticks around in the cache.
+ */
+ r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end),
+ clamp_t(s64, fault_idx + width, pg.start, pg.end));
+
+ /* Region to the left of the fault */
+ r[1] = DEFINE_RANGE(pg.start,
+ clamp_t(s64, r[2].start-1, pg.start-1, r[2].start));
+
+ /* Region to the right of the fault: always valid for the common fault_idx=0 case. */
+ r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1),
+ pg.end);
+
+ for (i = 0; i <= 2; i++) {
+ unsigned int npages = range_len(&r[i]);
+ struct page *page = folio_page(folio, r[i].start);
+ unsigned long addr = base_addr + folio_page_idx(folio, page) * PAGE_SIZE;
+
+ if (npages > 0)
+ clear_contig_highpages(page, addr, npages);
+ }
}
static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
--
2.43.5
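
As an illustration of the three-way split in folio_zero_user() above, here is a
minimal userspace-only sketch (not kernel code): plain clamp() and a local
struct range stand in for the kernel's clamp_t()/DEFINE_RANGE(), and the
512-page folio size and fault index 100 are assumed values chosen for the
example.

#include <stdio.h>

struct range { long start, end; };

static long clamp(long v, long lo, long hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

int main(void)
{
	const long nr_pages = 512;	/* e.g. one 2MB THP of 4K pages (assumed) */
	const long fault_idx = 100;	/* page index touched by the user (assumed) */
	const int width = 2;		/* pages cleared last on either side */
	struct range pg = { 0, nr_pages - 1 };
	struct range r[3];

	/* faulting page and its immediate neighbourhood, cleared last */
	r[2].start = clamp(fault_idx - width, pg.start, pg.end);
	r[2].end   = clamp(fault_idx + width, pg.start, pg.end);

	/* region to the left of the fault */
	r[1].start = pg.start;
	r[1].end   = clamp(r[2].start - 1, pg.start - 1, r[2].start);

	/* region to the right of the fault */
	r[0].start = clamp(r[2].end + 1, r[2].end, pg.end + 1);
	r[0].end   = pg.end;

	/* prints r[0] = [103, 511], r[1] = [0, 97], r[2] = [98, 102];
	 * an empty region shows up with end < start and would be skipped. */
	for (int i = 0; i <= 2; i++)
		printf("r[%d] = [%ld, %ld]\n", i, r[i].start, r[i].end);

	return 0;
}

The clearing order in the patch (right region, then left region, then the local
neighbourhood last) is what keeps the cache lines around the faulting address
hot.
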
Hi Ankur,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
url: https://github.com/intel-lab-lkp/linux/commits/Ankur-Arora/perf-bench-mem-Remove-repetition-around-time-measurement/20250917-233045
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/20250917152418.4077386-14-ankur.a.arora%40oracle.com
patch subject: [PATCH v7 13/16] mm: memory: support clearing page ranges
config: arm-randconfig-001-20250919 (https://download.01.org/0day-ci/archive/20250919/202509191916.a0oRRfua-lkp@intel.com/config)
compiler: arm-linux-gnueabi-gcc (GCC) 12.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250919/202509191916.a0oRRfua-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202509191916.a0oRRfua-lkp@intel.com/
All error/warnings (new ones prefixed by >>):
In file included from arch/arm/include/asm/thread_info.h:14,
from include/linux/thread_info.h:60,
from include/asm-generic/preempt.h:5,
from ./arch/arm/include/generated/asm/preempt.h:1,
from include/linux/preempt.h:79,
from include/linux/smp.h:116,
from include/linux/kernel_stat.h:5,
from mm/memory.c:42:
mm/memory.c: In function 'clear_contig_highpages':
mm/memory.c:7199:38: error: implicit declaration of function 'nth_page'; did you mean 'pte_page'? [-Werror=implicit-function-declaration]
7199 | clear_user_highpages(nth_page(page, i),
| ^~~~~~~~
arch/arm/include/asm/page.h:152:36: note: in definition of macro 'clear_user_highpage'
152 | __cpu_clear_user_highpage(page, vaddr)
| ^~~~
mm/memory.c:7199:17: note: in expansion of macro 'clear_user_highpages'
7199 | clear_user_highpages(nth_page(page, i),
| ^~~~~~~~~~~~~~~~~~~~
>> mm/memory.c:7199:38: warning: passing argument 1 of 'cpu_user.cpu_clear_user_highpage' makes pointer from integer without a cast [-Wint-conversion]
7199 | clear_user_highpages(nth_page(page, i),
| ^~~~~~~~~~~~~~~~~
| |
| int
arch/arm/include/asm/page.h:152:36: note: in definition of macro 'clear_user_highpage'
152 | __cpu_clear_user_highpage(page, vaddr)
| ^~~~
mm/memory.c:7199:17: note: in expansion of macro 'clear_user_highpages'
7199 | clear_user_highpages(nth_page(page, i),
| ^~~~~~~~~~~~~~~~~~~~
mm/memory.c:7199:38: note: expected 'struct page *' but argument is of type 'int'
7199 | clear_user_highpages(nth_page(page, i),
| ^~~~~~~~~~~~~~~~~
arch/arm/include/asm/page.h:152:36: note: in definition of macro 'clear_user_highpage'
152 | __cpu_clear_user_highpage(page, vaddr)
| ^~~~
mm/memory.c:7199:17: note: in expansion of macro 'clear_user_highpages'
7199 | clear_user_highpages(nth_page(page, i),
| ^~~~~~~~~~~~~~~~~~~~
>> arch/arm/include/asm/page.h:157:15: error: lvalue required as left operand of assignment
157 | vaddr += PAGE_SIZE; \
| ^~
mm/memory.c:7199:17: note: in expansion of macro 'clear_user_highpages'
7199 | clear_user_highpages(nth_page(page, i),
| ^~~~~~~~~~~~~~~~~~~~
>> arch/arm/include/asm/page.h:158:13: error: lvalue required as increment operand
158 | page++; \
| ^~
mm/memory.c:7199:17: note: in expansion of macro 'clear_user_highpages'
7199 | clear_user_highpages(nth_page(page, i),
| ^~~~~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
vim +7199 mm/memory.c
7185
7186 /*
7187 * Clear contiguous pages chunking them up when running under
7188 * non-preemptible models.
7189 */
7190 static void clear_contig_highpages(struct page *page, unsigned long addr,
7191 unsigned int npages)
7192 {
7193 unsigned int i, count, unit;
7194
7195 unit = preempt_model_preemptible() ? npages : PAGE_CONTIG_NR;
7196
7197 for (i = 0; i < npages; ) {
7198 count = min(unit, npages - i);
> 7199 clear_user_highpages(nth_page(page, i),
7200 addr + i * PAGE_SIZE, count);
7201 i += count;
7202 cond_resched();
7203 }
7204 }
7205
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
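
One possible way around the lvalue errors above (a sketch only, not the
series' actual fix): if the generic clear_user_highpages() fallback is a macro
that increments its page/vaddr arguments, making it a static inline turns
those parameters into real local variables. The (page, vaddr, npages)
signature below is inferred from the call site in clear_contig_highpages(),
and page + i assumes the folio's memmap is contiguous for the range being
cleared.

#ifndef clear_user_highpages
static inline void clear_user_highpages(struct page *page, unsigned long vaddr,
					unsigned int npages)
{
	unsigned int i;

	/* per-page clears via the existing clear_user_highpage() helper */
	for (i = 0; i < npages; i++)
		clear_user_highpage(page + i, vaddr + i * PAGE_SIZE);
}
#define clear_user_highpages clear_user_highpages
#endif
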
On Wed, 17 Sep 2025 08:24:15 -0700 Ankur Arora <ankur.a.arora@oracle.com> wrote:
> Change folio_zero_user() to clear contiguous page ranges instead of
> clearing using the current page-at-a-time approach. Exposing the largest
> feasible length can be useful in enabling processors to optimize based
> on extent.
This patch is something which MM developers might care to take a closer
look at.
> However, clearing in large chunks can have two problems:
>
> - cache locality when clearing small folios (< MAX_ORDER_NR_PAGES)
> (larger folios don't have any expectation of cache locality).
>
> - preemption latency when clearing large folios.
>
> Handle the first by splitting the clearing in three parts: the
> faulting page and its immediate locality, its left and right
> regions; with the local neighbourhood cleared last.
Has this optimization been shown to be beneficial?
If so, are you able to share some measurements?
If not, maybe it should be removed?
> ...
>
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -7021,40 +7021,80 @@ static inline int process_huge_page(
> return 0;
> }
>
> -static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint,
> - unsigned int nr_pages)
> +/*
> + * Clear contiguous pages chunking them up when running under
> + * non-preemptible models.
> + */
> +static void clear_contig_highpages(struct page *page, unsigned long addr,
> + unsigned int npages)
Called "_highpages" because it wraps clear_user_highpages(). It really
should be called clear_contig_user_highpages() ;) (Not serious)
> {
> - unsigned long addr = ALIGN_DOWN(addr_hint, folio_size(folio));
> - int i;
> + unsigned int i, count, unit;
>
> - might_sleep();
> - for (i = 0; i < nr_pages; i++) {
> + unit = preempt_model_preemptible() ? npages : PAGE_CONTIG_NR;
Almost nothing uses preempt_model_preemptible() and I'm not usefully
familiar with it. Will this check avoid all softlockup/rcu/etc
detections in all situations (ie, configs)?
> + for (i = 0; i < npages; ) {
> + count = min(unit, npages - i);
> + clear_user_highpages(nth_page(page, i),
> + addr + i * PAGE_SIZE, count);
> + i += count;
> cond_resched();
> - clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE);
> }
> }
On 9/18/2025 3:14 AM, Andrew Morton wrote:
> On Wed, 17 Sep 2025 08:24:15 -0700 Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> Change folio_zero_user() to clear contiguous page ranges instead of
>> clearing using the current page-at-a-time approach. Exposing the largest
>> feasible length can be useful in enabling processors to optimize based
>> on extent.
>
> This patch is something which MM developers might care to take a closer
> look at.
>
>> However, clearing in large chunks can have two problems:
>>
>> - cache locality when clearing small folios (< MAX_ORDER_NR_PAGES)
>> (larger folios don't have any expectation of cache locality).
>>
>> - preemption latency when clearing large folios.
>>
>> Handle the first by splitting the clearing in three parts: the
>> faulting page and its immediate locality, its left and right
>> regions; with the local neighbourhood cleared last.
>
> Has this optimization been shown to be beneficial?
>
> If so, are you able to share some measurements?
>
> If not, maybe it should be removed?
>
I reverted the effect of this patch by hard coding
#define PAGE_CONTIG_NR 1
I see that benefit for voluntary kernel is lost without this change
(for rep stosb)
with PAGE_CONTIG_NR equivalent to 8MB
Preempt mode: voluntary
# Running 'mem/mmap' benchmark:
# function 'demand' (Demand loaded mmap())
# Copying 64GB bytes ...
34.533414 GB/sec
with PAGE_CONTIG_NR equivalent to 4KB
# Running 'mem/mmap' benchmark:
# function 'demand' (Demand loaded mmap())
# Copying 64GB bytes ...
20.766059 GB/sec
For now (barring David's recommendations),
feel free to add
Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com>
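
A back-of-envelope reading of those numbers (assuming 4K base pages): with a
PAGE_CONTIG_NR chunk equivalent to 8MB and the measured ~34.5 GB/s, each chunk
takes roughly 8MB / 34.5 GB/s ~= 0.23 ms, which bounds the stretch between
successive cond_resched() calls under the voluntary model; with a 4KB unit the
chunks are tiny but the extent information is lost to the CPU, hence the drop
to ~20.8 GB/s. As a purely hypothetical illustration (the series' actual x86
definition is not shown in this thread), an 8MB override could look like:

/* hypothetical arch override: clear in chunks of 8MB worth of base pages */
#define ARCH_PAGE_CONTIG_NR	(SZ_8M >> PAGE_SHIFT)
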
Raghavendra K T <rkodsara@amd.com> writes:

> On 9/18/2025 3:14 AM, Andrew Morton wrote:
>> On Wed, 17 Sep 2025 08:24:15 -0700 Ankur Arora <ankur.a.arora@oracle.com> wrote:
>>
>>> Change folio_zero_user() to clear contiguous page ranges instead of
>>> clearing using the current page-at-a-time approach. Exposing the largest
>>> feasible length can be useful in enabling processors to optimize based
>>> on extent.
>> This patch is something which MM developers might care to take a closer
>> look at.
>>
>>> However, clearing in large chunks can have two problems:
>>>
>>> - cache locality when clearing small folios (< MAX_ORDER_NR_PAGES)
>>>   (larger folios don't have any expectation of cache locality).
>>>
>>> - preemption latency when clearing large folios.
>>>
>>> Handle the first by splitting the clearing in three parts: the
>>> faulting page and its immediate locality, its left and right
>>> regions; with the local neighbourhood cleared last.
>> Has this optimization been shown to be beneficial?
>> If so, are you able to share some measurements?
>> If not, maybe it should be removed?
>>
>
> I reverted the effect of this patch by hard coding
>
> #define PAGE_CONTIG_NR 1
>
> I see that benefit for voluntary kernel is lost without this change
> (for rep stosb)
>
> with PAGE_CONTIG_NR equivalent to 8MB
>
> Preempt mode: voluntary
>
> # Running 'mem/mmap' benchmark:
> # function 'demand' (Demand loaded mmap())
> # Copying 64GB bytes ...
>
> 34.533414 GB/sec
>
> with PAGE_CONTIG_NR equivalent to 4KB
>
> # Running 'mem/mmap' benchmark:
> # function 'demand' (Demand loaded mmap())
> # Copying 64GB bytes ...
>
> 20.766059 GB/sec
>
> For now (barring David's recommendations),
> feel free to add
>
> Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com>

Thanks Raghu.

--
ankur
On 9/23/2025 2:06 PM, Raghavendra K T wrote:
>
> On 9/18/2025 3:14 AM, Andrew Morton wrote:
>> On Wed, 17 Sep 2025 08:24:15 -0700 Ankur Arora
>> <ankur.a.arora@oracle.com> wrote:
>>
>>> Change folio_zero_user() to clear contiguous page ranges instead of
>>> clearing using the current page-at-a-time approach. Exposing the largest
>>> feasible length can be useful in enabling processors to optimize based
>>> on extent.
>>
>> This patch is something which MM developers might care to take a closer
>> look at.
>>
>>> However, clearing in large chunks can have two problems:
>>>
>>> - cache locality when clearing small folios (< MAX_ORDER_NR_PAGES)
>>>   (larger folios don't have any expectation of cache locality).
>>>
>>> - preemption latency when clearing large folios.
>>>
>>> Handle the first by splitting the clearing in three parts: the
>>> faulting page and its immediate locality, its left and right
>>> regions; with the local neighbourhood cleared last.
>>
>> Has this optimization been shown to be beneficial?
>>
>> If so, are you able to share some measurements?
>>
>> If not, maybe it should be removed?
>>
>
> I reverted the effect of this patch by hard coding
>
> #define PAGE_CONTIG_NR 1
>
> I see that benefit for voluntary kernel is lost without this change
> (for rep stosb)
>
> with PAGE_CONTIG_NR equivalent to 8MB
>
> Preempt mode: voluntary
>
> # Running 'mem/mmap' benchmark:
> # function 'demand' (Demand loaded mmap())
> # Copying 64GB bytes ...
>
> 34.533414 GB/sec
>
> with PAGE_CONTIG_NR equivalent to 4KB
>
> # Running 'mem/mmap' benchmark:
> # function 'demand' (Demand loaded mmap())
> # Copying 64GB bytes ...
>
> 20.766059 GB/sec
>
> For now (barring David's recommendations),
> feel free to add
>
> Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com>
>

My reply was more towards the benefits of clearing multiple pages for
non-preempt and voluntary kernel, than the effect of clearing
neighborhood range at the end.

Sorry if that was confusing :)
On 17.09.25 23:44, Andrew Morton wrote:
> On Wed, 17 Sep 2025 08:24:15 -0700 Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> Change folio_zero_user() to clear contiguous page ranges instead of
>> clearing using the current page-at-a-time approach. Exposing the largest
>> feasible length can be useful in enabling processors to optimize based
>> on extent.
>
> This patch is something which MM developers might care to take a closer
> look at.

I took a look at various revisions of this series, I'm only lagging
behind on reviewing the latest series :)

>
>> However, clearing in large chunks can have two problems:
>>
>> - cache locality when clearing small folios (< MAX_ORDER_NR_PAGES)
>>   (larger folios don't have any expectation of cache locality).
>>
>> - preemption latency when clearing large folios.
>>
>> Handle the first by splitting the clearing in three parts: the
>> faulting page and its immediate locality, its left and right
>> regions; with the local neighbourhood cleared last.
>
> Has this optimization been shown to be beneficial?
>
> If so, are you able to share some measurements?
>
> If not, maybe it should be removed?
>
>> ...
>>
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -7021,40 +7021,80 @@ static inline int process_huge_page(
>> return 0;
>> }
>>
>> -static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint,
>> - unsigned int nr_pages)
>> +/*
>> + * Clear contiguous pages chunking them up when running under
>> + * non-preemptible models.
>> + */
>> +static void clear_contig_highpages(struct page *page, unsigned long addr,
>> + unsigned int npages)
>
> Called "_highpages" because it wraps clear_user_highpages(). It really
> should be called clear_contig_user_highpages() ;) (Not serious)

You have a point there, though :)

Fortunately this is only an internal helper.

--
Cheers

David / dhildenb
[ Added Paul McKenney. ]
Andrew Morton <akpm@linux-foundation.org> writes:
> On Wed, 17 Sep 2025 08:24:15 -0700 Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> Change folio_zero_user() to clear contiguous page ranges instead of
>> clearing using the current page-at-a-time approach. Exposing the largest
>> feasible length can be useful in enabling processors to optimize based
>> on extent.
>
> This patch is something which MM developers might care to take a closer
> look at.
>
>> However, clearing in large chunks can have two problems:
>>
>> - cache locality when clearing small folios (< MAX_ORDER_NR_PAGES)
>> (larger folios don't have any expectation of cache locality).
>>
>> - preemption latency when clearing large folios.
>>
>> Handle the first by splitting the clearing in three parts: the
>> faulting page and its immediate locality, its left and right
>> regions; with the local neighbourhood cleared last.
>
> Has this optimization been shown to be beneficial?
So, this was mostly meant to be defensive. The current code does a
rather extensive left-right dance around the faulting page via
c6ddfb6c58 ("mm, clear_huge_page: move order algorithm into a separate
function") and I wanted to keep the cache hot property for the region
closest to the address touched by the user.
But, no I haven't run any tests showing that it helps.
> If so, are you able to share some measurements?
From some quick kernel builds (with THP) I do see a consistent
difference of a few seconds (1% worse) if I remove this optimization.
(I'm not sure right now why it is worse -- my expectation was that we
would have higher cache misses, but I see pretty similar cache numbers.)
But let me do a more careful test and report back.
> If not, maybe it should be removed?
>
>> ...
>>
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -7021,40 +7021,80 @@ static inline int process_huge_page(
>> return 0;
>> }
>>
>> -static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint,
>> - unsigned int nr_pages)
>> +/*
>> + * Clear contiguous pages chunking them up when running under
>> + * non-preemptible models.
>> + */
>> +static void clear_contig_highpages(struct page *page, unsigned long addr,
>> + unsigned int npages)
>
> Called "_highpages" because it wraps clear_user_highpages(). It really
> should be called clear_contig_user_highpages() ;) (Not serious)
Or maybe clear_user_contig_highpages(), so when we get rid of HIGHMEM,
the _highpages could just be chopped off :D.
>> {
>> - unsigned long addr = ALIGN_DOWN(addr_hint, folio_size(folio));
>> - int i;
>> + unsigned int i, count, unit;
>>
>> - might_sleep();
>> - for (i = 0; i < nr_pages; i++) {
>> + unit = preempt_model_preemptible() ? npages : PAGE_CONTIG_NR;
>
> Almost nothing uses preempt_model_preemptible() and I'm not usefully
> familiar with it. Will this check avoid all softlockup/rcu/etc
> detections in all situations (ie, configs)?
IMO, yes. The code invoked under preempt_model_preemptible() will boil
down to a single interruptible REP STOSB which might execute over
an extent of 1GB (with the last patch). From prior experiments, I know
that irqs are able to interrupt this. And, I /think/ that is a sufficient
condition for avoiding RCU stalls/softlockups etc.
Also, when we were discussing lazy preemption (which Thomas had
suggested as a way to handle scenarios like this or long running Xen
hypercalls etc) this seemed like a scenario that didn't need any extra
handling for CONFIG_PREEMPT.
We did need 83b28cfe79 ("rcu: handle quiescent states for PREEMPT_RCU=n,
PREEMPT_COUNT=y") for CONFIG_PREEMPT_LAZY but AFAICS this should be safe.
Anyway, let me think about your all-configs point (though only the ones
which can have some flavour of hugetlb).
Also, I would like x86 folks opinion on this. And, maybe Paul McKenney
just to make sure I'm not missing something on RCU side.
Thanks for the comments.
--
ankur