Change folio_zero_user() to clear contiguous page ranges instead of
clearing using the current page-at-a-time approach. Exposing the largest
feasible length can be useful in enabling processors to optimize based
on extent.

However, clearing in large chunks can have two problems:

- cache locality when clearing small folios (< MAX_ORDER_NR_PAGES)
  (larger folios don't have any expectation of cache locality).

- preemption latency when clearing large folios.

Handle the first by splitting the clearing in three parts: the
faulting page and its immediate locality, its left and right
regions; with the local neighbourhood cleared last.
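
For illustration only, a minimal userspace sketch of how the three
regions are carved out (the folio size, fault index and neighbourhood
width below are hypothetical values, not taken from this patch):

    #include <stdio.h>

    static long clamp(long v, long lo, long hi)
    {
            return v < lo ? lo : v > hi ? hi : v;
    }

    int main(void)
    {
            const long start = 0, end = 511;    /* 512-page (2MB) folio */
            const long fault = 200, width = 2;  /* faulting page index  */
            long r[3][2];

            /* Local neighbourhood around the fault; cleared last. */
            r[2][0] = clamp(fault - width, start, end);
            r[2][1] = clamp(fault + width, start, end);
            /* Region to the left of the fault. */
            r[1][0] = start;
            r[1][1] = clamp(r[2][0] - 1, start - 1, r[2][0]);
            /* Region to the right of the fault. */
            r[0][0] = clamp(r[2][1] + 1, r[2][1], end + 1);
            r[0][1] = end;

            /* Prints [203,511], [0,197], [198,202]: right, left, local. */
            for (int i = 0; i <= 2; i++)
                    printf("r[%d] = [%ld, %ld]\n", i, r[i][0], r[i][1]);
            return 0;
    }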

The second problem becomes relevant when running under cooperative
preemption models. Limit the worst-case preemption latency by clearing
in architecture-specified PAGE_CONTIG_NR units, with a default of a
single page where the architecture does not specify one.
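
As a hypothetical illustration (the 8MB figure and the header location
are assumptions, not taken from this patch), an architecture could opt
in to larger units with something like:

    /* <asm/page.h>: clear in 8MB chunks between cond_resched() calls. */
    #define ARCH_PAGE_CONTIG_NR	(SZ_8M >> PAGE_SHIFT)

Assuming a clearing throughput on the order of 30 GB/s, an 8MB unit
bounds the stretch between cond_resched() calls to roughly a quarter of
a millisecond.
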
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
---
include/linux/mm.h | 6 ++++
mm/memory.c | 82 ++++++++++++++++++++++++++++++++++------------
2 files changed, 67 insertions(+), 21 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0cde9b01da5e..29b2a8bf7b4f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3768,6 +3768,12 @@ static inline void clear_page_guard(struct zone *zone, struct page *page,
unsigned int order) {}
#endif /* CONFIG_DEBUG_PAGEALLOC */
+#ifndef ARCH_PAGE_CONTIG_NR
+#define PAGE_CONTIG_NR 1
+#else
+#define PAGE_CONTIG_NR ARCH_PAGE_CONTIG_NR
+#endif
+
#ifndef clear_pages
/**
* clear_pages() - clear a page range using a kernel virtual address.
diff --git a/mm/memory.c b/mm/memory.c
index 0ba4f6b71847..0f5b1900b480 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -7021,40 +7021,80 @@ static inline int process_huge_page(
return 0;
}
-static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint,
- unsigned int nr_pages)
+/*
+ * Clear contiguous pages chunking them up when running under
+ * non-preemptible models.
+ */
+static void clear_contig_highpages(struct page *page, unsigned long addr,
+ unsigned int npages)
{
- unsigned long addr = ALIGN_DOWN(addr_hint, folio_size(folio));
- int i;
+ unsigned int i, count, unit;
- might_sleep();
- for (i = 0; i < nr_pages; i++) {
+ unit = preempt_model_preemptible() ? npages : PAGE_CONTIG_NR;
+
+ for (i = 0; i < npages; ) {
+ count = min(unit, npages - i);
+ clear_user_highpages(nth_page(page, i),
+ addr + i * PAGE_SIZE, count);
+ i += count;
cond_resched();
- clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE);
}
}
-static int clear_subpage(unsigned long addr, int idx, void *arg)
-{
- struct folio *folio = arg;
-
- clear_user_highpage(folio_page(folio, idx), addr);
- return 0;
-}
-
/**
* folio_zero_user - Zero a folio which will be mapped to userspace.
* @folio: The folio to zero.
- * @addr_hint: The address will be accessed or the base address if uncelar.
+ * @addr_hint: The address accessed by the user or the base address.
+ *
+ * Uses architectural support for clear_pages() to zero page extents
+ * instead of clearing page-at-a-time.
+ *
+ * Clearing of small folios (< MAX_ORDER_NR_PAGES) is split in three parts:
+ * pages in the immediate locality of the faulting page, and its left, right
+ * regions; the local neighbourhood cleared last in order to keep cache
+ * lines of the target region hot.
+ *
+ * For larger folios we assume that there is no expectation of cache locality
+ * and just do a straight zero.
*/
void folio_zero_user(struct folio *folio, unsigned long addr_hint)
{
- unsigned int nr_pages = folio_nr_pages(folio);
+ unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
+ const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE;
+ const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1);
+ const int width = 2; /* number of pages cleared last on either side */
+ struct range r[3];
+ int i;
- if (unlikely(nr_pages > MAX_ORDER_NR_PAGES))
- clear_gigantic_page(folio, addr_hint, nr_pages);
- else
- process_huge_page(addr_hint, nr_pages, clear_subpage, folio);
+ if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) {
+ clear_contig_highpages(folio_page(folio, 0),
+ base_addr, folio_nr_pages(folio));
+ return;
+ }
+
+ /*
+ * Faulting page and its immediate neighbourhood. Cleared at the end to
+ * ensure it sticks around in the cache.
+ */
+ r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end),
+ clamp_t(s64, fault_idx + width, pg.start, pg.end));
+
+ /* Region to the left of the fault */
+ r[1] = DEFINE_RANGE(pg.start,
+ clamp_t(s64, r[2].start-1, pg.start-1, r[2].start));
+
+ /* Region to the right of the fault: always valid for the common fault_idx=0 case. */
+ r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1),
+ pg.end);
+
+ for (i = 0; i <= 2; i++) {
+ unsigned int npages = range_len(&r[i]);
+ struct page *page = folio_page(folio, r[i].start);
+ unsigned long addr = base_addr + folio_page_idx(folio, page) * PAGE_SIZE;
+
+ if (npages > 0)
+ clear_contig_highpages(page, addr, npages);
+ }
}
static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
--
2.43.5
Hi Ankur,

kernel test robot noticed the following build errors:

[auto build test ERROR on akpm-mm/mm-everything]

url:    https://github.com/intel-lab-lkp/linux/commits/Ankur-Arora/perf-bench-mem-Remove-repetition-around-time-measurement/20250917-233045
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20250917152418.4077386-14-ankur.a.arora%40oracle.com
patch subject: [PATCH v7 13/16] mm: memory: support clearing page ranges
config: arm-randconfig-001-20250919 (https://download.01.org/0day-ci/archive/20250919/202509191916.a0oRRfua-lkp@intel.com/config)
compiler: arm-linux-gnueabi-gcc (GCC) 12.5.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250919/202509191916.a0oRRfua-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202509191916.a0oRRfua-lkp@intel.com/

All error/warnings (new ones prefixed by >>):

   In file included from arch/arm/include/asm/thread_info.h:14,
                    from include/linux/thread_info.h:60,
                    from include/asm-generic/preempt.h:5,
                    from ./arch/arm/include/generated/asm/preempt.h:1,
                    from include/linux/preempt.h:79,
                    from include/linux/smp.h:116,
                    from include/linux/kernel_stat.h:5,
                    from mm/memory.c:42:
   mm/memory.c: In function 'clear_contig_highpages':
   mm/memory.c:7199:38: error: implicit declaration of function 'nth_page'; did you mean 'pte_page'? [-Werror=implicit-function-declaration]
    7199 |                 clear_user_highpages(nth_page(page, i),
         |                                      ^~~~~~~~
   arch/arm/include/asm/page.h:152:36: note: in definition of macro 'clear_user_highpage'
     152 |         __cpu_clear_user_highpage(page, vaddr)
         |                                   ^~~~
   mm/memory.c:7199:17: note: in expansion of macro 'clear_user_highpages'
    7199 |                 clear_user_highpages(nth_page(page, i),
         |                 ^~~~~~~~~~~~~~~~~~~~
>> mm/memory.c:7199:38: warning: passing argument 1 of 'cpu_user.cpu_clear_user_highpage' makes pointer from integer without a cast [-Wint-conversion]
    7199 |                 clear_user_highpages(nth_page(page, i),
         |                                      ^~~~~~~~~~~~~~~~~
         |                                      |
         |                                      int
   arch/arm/include/asm/page.h:152:36: note: in definition of macro 'clear_user_highpage'
     152 |         __cpu_clear_user_highpage(page, vaddr)
         |                                   ^~~~
   mm/memory.c:7199:17: note: in expansion of macro 'clear_user_highpages'
    7199 |                 clear_user_highpages(nth_page(page, i),
         |                 ^~~~~~~~~~~~~~~~~~~~
   mm/memory.c:7199:38: note: expected 'struct page *' but argument is of type 'int'
    7199 |                 clear_user_highpages(nth_page(page, i),
         |                                      ^~~~~~~~~~~~~~~~~
   arch/arm/include/asm/page.h:152:36: note: in definition of macro 'clear_user_highpage'
     152 |         __cpu_clear_user_highpage(page, vaddr)
         |                                   ^~~~
   mm/memory.c:7199:17: note: in expansion of macro 'clear_user_highpages'
    7199 |                 clear_user_highpages(nth_page(page, i),
         |                 ^~~~~~~~~~~~~~~~~~~~
>> arch/arm/include/asm/page.h:157:15: error: lvalue required as left operand of assignment
     157 |         vaddr += PAGE_SIZE;                     \
         |               ^~
   mm/memory.c:7199:17: note: in expansion of macro 'clear_user_highpages'
    7199 |                 clear_user_highpages(nth_page(page, i),
         |                 ^~~~~~~~~~~~~~~~~~~~
>> arch/arm/include/asm/page.h:158:13: error: lvalue required as increment operand
     158 |         page++;                                 \
         |             ^~
   mm/memory.c:7199:17: note: in expansion of macro 'clear_user_highpages'
    7199 |                 clear_user_highpages(nth_page(page, i),
         |                 ^~~~~~~~~~~~~~~~~~~~
   cc1: some warnings being treated as errors


vim +7199 mm/memory.c

  7185	
  7186	/*
  7187	 * Clear contiguous pages chunking them up when running under
  7188	 * non-preemptible models.
  7189	 */
  7190	static void clear_contig_highpages(struct page *page, unsigned long addr,
  7191	                                   unsigned int npages)
  7192	{
  7193	        unsigned int i, count, unit;
  7194	
  7195	        unit = preempt_model_preemptible() ? npages : PAGE_CONTIG_NR;
  7196	
  7197	        for (i = 0; i < npages; ) {
  7198	                count = min(unit, npages - i);
> 7199	                clear_user_highpages(nth_page(page, i),
  7200	                                     addr + i * PAGE_SIZE, count);
  7201	                i += count;
  7202	                cond_resched();
  7203	        }
  7204	}
  7205	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
On Wed, 17 Sep 2025 08:24:15 -0700 Ankur Arora <ankur.a.arora@oracle.com> wrote:

> Change folio_zero_user() to clear contiguous page ranges instead of
> clearing using the current page-at-a-time approach. Exposing the largest
> feasible length can be useful in enabling processors to optimize based
> on extent.

This patch is something which MM developers might care to take a closer
look at.

> However, clearing in large chunks can have two problems:
>
> - cache locality when clearing small folios (< MAX_ORDER_NR_PAGES)
>   (larger folios don't have any expectation of cache locality).
>
> - preemption latency when clearing large folios.
>
> Handle the first by splitting the clearing in three parts: the
> faulting page and its immediate locality, its left and right
> regions; with the local neighbourhood cleared last.

Has this optimization been shown to be beneficial?

If so, are you able to share some measurements?

If not, maybe it should be removed?

> ...
>
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -7021,40 +7021,80 @@ static inline int process_huge_page(
>  	return 0;
>  }
>  
> -static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint,
> -				unsigned int nr_pages)
> +/*
> + * Clear contiguous pages chunking them up when running under
> + * non-preemptible models.
> + */
> +static void clear_contig_highpages(struct page *page, unsigned long addr,
> +				   unsigned int npages)

Called "_highpages" because it wraps clear_user_highpages().  It really
should be called clear_contig_user_highpages() ;) (Not serious)

>  {
> -	unsigned long addr = ALIGN_DOWN(addr_hint, folio_size(folio));
> -	int i;
> +	unsigned int i, count, unit;
>  
> -	might_sleep();
> -	for (i = 0; i < nr_pages; i++) {
> +	unit = preempt_model_preemptible() ? npages : PAGE_CONTIG_NR;

Almost nothing uses preempt_model_preemptible() and I'm not usefully
familiar with it.  Will this check avoid all softlockup/rcu/etc
detections in all situations (ie, configs)?

> +	for (i = 0; i < npages; ) {
> +		count = min(unit, npages - i);
> +		clear_user_highpages(nth_page(page, i),
> +				     addr + i * PAGE_SIZE, count);
> +		i += count;
>  		cond_resched();
> -		clear_user_highpage(folio_page(folio, i), addr + i * PAGE_SIZE);
>  	}
>  }
On 9/18/2025 3:14 AM, Andrew Morton wrote:
> On Wed, 17 Sep 2025 08:24:15 -0700 Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> Change folio_zero_user() to clear contiguous page ranges instead of
>> clearing using the current page-at-a-time approach. Exposing the largest
>> feasible length can be useful in enabling processors to optimize based
>> on extent.
>
> This patch is something which MM developers might care to take a closer
> look at.
>
>> However, clearing in large chunks can have two problems:
>>
>> - cache locality when clearing small folios (< MAX_ORDER_NR_PAGES)
>>   (larger folios don't have any expectation of cache locality).
>>
>> - preemption latency when clearing large folios.
>>
>> Handle the first by splitting the clearing in three parts: the
>> faulting page and its immediate locality, its left and right
>> regions; with the local neighbourhood cleared last.
>
> Has this optimization been shown to be beneficial?
>
> If so, are you able to share some measurements?
>
> If not, maybe it should be removed?
>

I reverted the effect of this patch by hard coding

#define PAGE_CONTIG_NR 1

I see that benefit for voluntary kernel is lost without this change
(for rep stosb)

with PAGE_CONTIG_NR equivalent to 8MB

Preempt mode: voluntary

# Running 'mem/mmap' benchmark:
# function 'demand' (Demand loaded mmap())
# Copying 64GB bytes ...

34.533414 GB/sec

with PAGE_CONTIG_NR equivalent to 4KB

# Running 'mem/mmap' benchmark:
# function 'demand' (Demand loaded mmap())
# Copying 64GB bytes ...

20.766059 GB/sec

For now (barring David's recommendations), feel free to add

Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com>
On 9/23/2025 2:06 PM, Raghavendra K T wrote:
>
> On 9/18/2025 3:14 AM, Andrew Morton wrote:
>> On Wed, 17 Sep 2025 08:24:15 -0700 Ankur Arora
>> <ankur.a.arora@oracle.com> wrote:
>>
>>> Change folio_zero_user() to clear contiguous page ranges instead of
>>> clearing using the current page-at-a-time approach. Exposing the largest
>>> feasible length can be useful in enabling processors to optimize based
>>> on extent.
>>
>> This patch is something which MM developers might care to take a closer
>> look at.
>>
>>> However, clearing in large chunks can have two problems:
>>>
>>> - cache locality when clearing small folios (< MAX_ORDER_NR_PAGES)
>>>   (larger folios don't have any expectation of cache locality).
>>>
>>> - preemption latency when clearing large folios.
>>>
>>> Handle the first by splitting the clearing in three parts: the
>>> faulting page and its immediate locality, its left and right
>>> regions; with the local neighbourhood cleared last.
>>
>> Has this optimization been shown to be beneficial?
>>
>> If so, are you able to share some measurements?
>>
>> If not, maybe it should be removed?
>>
>
> I reverted the effect of this patch by hard coding
>
> #define PAGE_CONTIG_NR 1
>
> I see that benefit for voluntary kernel is lost without this change
> (for rep stosb)
>
> with PAGE_CONTIG_NR equivalent to 8MB
>
> Preempt mode: voluntary
>
> # Running 'mem/mmap' benchmark:
> # function 'demand' (Demand loaded mmap())
> # Copying 64GB bytes ...
>
> 34.533414 GB/sec
>
> with PAGE_CONTIG_NR equivalent to 4KB
>
> # Running 'mem/mmap' benchmark:
> # function 'demand' (Demand loaded mmap())
> # Copying 64GB bytes ...
>
> 20.766059 GB/sec
>
> For now (barring David's recommendations), feel free to add
>
> Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com>
>

My reply was more towards the benefits of clearing multiple pages for
non-preempt and voluntary kernel, than the effect of clearing
neighborhood range at the end. Sorry if that was confusing :)
On 17.09.25 23:44, Andrew Morton wrote:
> On Wed, 17 Sep 2025 08:24:15 -0700 Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> Change folio_zero_user() to clear contiguous page ranges instead of
>> clearing using the current page-at-a-time approach. Exposing the largest
>> feasible length can be useful in enabling processors to optimize based
>> on extent.
>
> This patch is something which MM developers might care to take a closer
> look at.

I took a look at various revisions of this series, I'm only lagging
behind on reviewing the latest series :)

>
>> However, clearing in large chunks can have two problems:
>>
>> - cache locality when clearing small folios (< MAX_ORDER_NR_PAGES)
>>   (larger folios don't have any expectation of cache locality).
>>
>> - preemption latency when clearing large folios.
>>
>> Handle the first by splitting the clearing in three parts: the
>> faulting page and its immediate locality, its left and right
>> regions; with the local neighbourhood cleared last.
>
> Has this optimization been shown to be beneficial?
>
> If so, are you able to share some measurements?
>
> If not, maybe it should be removed?
>
>> ...
>>
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -7021,40 +7021,80 @@ static inline int process_huge_page(
>>  	return 0;
>>  }
>>  
>> -static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint,
>> -				unsigned int nr_pages)
>> +/*
>> + * Clear contiguous pages chunking them up when running under
>> + * non-preemptible models.
>> + */
>> +static void clear_contig_highpages(struct page *page, unsigned long addr,
>> +				   unsigned int npages)
>
> Called "_highpages" because it wraps clear_user_highpages().  It really
> should be called clear_contig_user_highpages() ;) (Not serious)

You have a point there, though :)

Fortunately this is only an internal helper.

-- 
Cheers

David / dhildenb
[ Added Paul McKenney. ]

Andrew Morton <akpm@linux-foundation.org> writes:

> On Wed, 17 Sep 2025 08:24:15 -0700 Ankur Arora <ankur.a.arora@oracle.com> wrote:
>
>> Change folio_zero_user() to clear contiguous page ranges instead of
>> clearing using the current page-at-a-time approach. Exposing the largest
>> feasible length can be useful in enabling processors to optimize based
>> on extent.
>
> This patch is something which MM developers might care to take a closer
> look at.
>
>> However, clearing in large chunks can have two problems:
>>
>> - cache locality when clearing small folios (< MAX_ORDER_NR_PAGES)
>>   (larger folios don't have any expectation of cache locality).
>>
>> - preemption latency when clearing large folios.
>>
>> Handle the first by splitting the clearing in three parts: the
>> faulting page and its immediate locality, its left and right
>> regions; with the local neighbourhood cleared last.
>
> Has this optimization been shown to be beneficial?

So, this was mostly meant to be defensive. The current code does a
rather extensive left-right dance around the faulting page via
c6ddfb6c58 ("mm, clear_huge_page: move order algorithm into a separate
function") and I wanted to keep the cache hot property for the region
closest to the address touched by the user.

But, no I haven't run any tests showing that it helps.

> If so, are you able to share some measurements?

From some quick kernel builds (with THP) I do see a consistent
difference of a few seconds (1% worse) if I remove this optimization.

(I'm not sure right now why it is worse -- my expectation was that we
would have higher cache misses, but I see pretty similar cache numbers.)

But let me do a more careful test and report back.

> If not, maybe it should be removed?
>
>> ...
>>
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -7021,40 +7021,80 @@ static inline int process_huge_page(
>>  	return 0;
>>  }
>>  
>> -static void clear_gigantic_page(struct folio *folio, unsigned long addr_hint,
>> -				unsigned int nr_pages)
>> +/*
>> + * Clear contiguous pages chunking them up when running under
>> + * non-preemptible models.
>> + */
>> +static void clear_contig_highpages(struct page *page, unsigned long addr,
>> +				   unsigned int npages)
>
> Called "_highpages" because it wraps clear_user_highpages().  It really
> should be called clear_contig_user_highpages() ;) (Not serious)

Or maybe clear_user_contig_highpages(), so when we get rid of HUGEMEM,
the _highpages could just be chopped off :D.

>>  {
>> -	unsigned long addr = ALIGN_DOWN(addr_hint, folio_size(folio));
>> -	int i;
>> +	unsigned int i, count, unit;
>>  
>> -	might_sleep();
>> -	for (i = 0; i < nr_pages; i++) {
>> +	unit = preempt_model_preemptible() ? npages : PAGE_CONTIG_NR;
>
> Almost nothing uses preempt_model_preemptible() and I'm not usefully
> familiar with it.  Will this check avoid all softlockup/rcu/etc
> detections in all situations (ie, configs)?

IMO, yes. The code invoked under preempt_model_preemptible() will boil
down to a single interruptible REP STOSB which might execute over an
extent of 1GB (with the last patch). From prior experiments, I know
that irqs are able to interrupt this. And, I /think/ that is a
sufficient condition for avoiding RCU stalls/softlockups etc.

Also, when we were discussing lazy preemption (which Thomas had
suggested as a way to handle scenarios like this or long running Xen
hypercalls etc) this seemed like a scenario that didn't need any extra
handling for CONFIG_PREEMPT.

We did need 83b28cfe79 ("rcu: handle quiescent states for
PREEMPT_RCU=n, PREEMPT_COUNT=y") for CONFIG_PREEMPT_LAZY but AFAICS
this should be safe.

Anyway let me think about your all configs point (though only ones
which can have some flavour for hugetlb.)

Also, I would like x86 folks opinion on this. And, maybe Paul McKenney
just to make sure I'm not missing something on RCU side.

Thanks for the comments.

-- 
ankur