folio_zero_user() does straight zeroing without caring about
temporal locality for caches.
This replaced the approach of commit c6ddfb6c5890 ("mm, clear_huge_page:
move order algorithm into a separate function"), where we cleared one
page at a time, converging on the faulting page from the left and the
right.
To retain limited temporal locality, split the clearing into three
parts: the faulting page and its immediate neighbourhood, and the
remaining regions to its left and right. The local neighbourhood is
cleared last.
Do this only when zeroing small folios (< MAX_ORDER_NR_PAGES) since
there isn't much expectation of cache locality for large folios.
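For example, with a 2MB THP (512 pages) faulting at page index 300 and
a neighbourhood of 2 pages on either side, the clearing order is pages
[303, 511], then [0, 297], and finally [298, 302].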
Performance
===
AMD Genoa (EPYC 9J14, cpus=2 sockets * 96 cores * 2 threads,
memory=2.2 TB, L1d=16K/thread, L2=512K/thread, L3=2MB/thread)
anon-w-seq (vm-scalability):
                        stime                   utime

page-at-a-time          1654.63 ( +- 3.84% )    811.00 ( +- 3.84% )
contiguous clearing     1602.86 ( +- 3.00% )    970.75 ( +- 4.68% )
neighbourhood-last      1630.32 ( +- 2.73% )    886.37 ( +- 5.19% )
Both stime and utime respond in expected ways. stime drops for both
contiguous clearing (-3.14%) and neighbourhood-last (-1.46%)
approaches. However, utime increases for both contiguous clearing
(+19.7%) and neighbourhood-last (+9.28%).
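(All deltas are relative to the page-at-a-time baseline; for instance,
contiguous clearing utime: (970.75 - 811.00) / 811.00 ~= +19.7%.)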
In part this is because anon-w-seq runs with 384 processes zeroing
anonymously mapped memory which they then access sequentially. As
such this is likely an uncommon pattern where the memory bandwidth
is saturated while also being cache limited because we access the
entire region.
Kernel make workload (make -j 12 bzImage):
                        stime                   utime

page-at-a-time          138.16 ( +- 0.31% )     1015.11 ( +- 0.05% )
contiguous clearing     133.42 ( +- 0.90% )     1013.49 ( +- 0.05% )
neighbourhood-last      131.20 ( +- 0.76% )     1011.36 ( +- 0.07% )
For make, utime stays relatively flat, while stime improves by up to
4.9%.
Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com>
Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
---
mm/memory.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 42 insertions(+), 2 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 974c48db6089..d22348b95227 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -7268,13 +7268,53 @@ static void clear_contig_highpages(struct page *page, unsigned long addr,
* @addr_hint: The address accessed by the user or the base address.
*
* Uses architectural support to clear page ranges.
+ *
+ * Clearing of small folios (< MAX_ORDER_NR_PAGES) is split in three parts:
+ * pages in the immediate locality of the faulting page, and its left, right
+ * regions; the local neighbourhood is cleared last in order to keep cache
+ * lines of the faulting region hot.
+ *
+ * For larger folios we assume that there is no expectation of cache locality
+ * and just do a straight zero.
*/
void folio_zero_user(struct folio *folio, unsigned long addr_hint)
{
unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
+ const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE;
+ const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1);
+ const int width = 2; /* number of pages cleared last on either side */
+ struct range r[3];
+ int i;
- clear_contig_highpages(folio_page(folio, 0),
- base_addr, folio_nr_pages(folio));
+ if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) {
+ clear_contig_highpages(folio_page(folio, 0),
+ base_addr, folio_nr_pages(folio));
+ return;
+ }
+
+ /*
+ * Faulting page and its immediate neighbourhood. Cleared at the end to
+ * ensure it sticks around in the cache.
+ */
+ r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end),
+ clamp_t(s64, fault_idx + width, pg.start, pg.end));
+
+ /* Region to the left of the fault */
+ r[1] = DEFINE_RANGE(pg.start,
+ clamp_t(s64, r[2].start-1, pg.start-1, r[2].start));
+
+ /* Region to the right of the fault: always valid for the common fault_idx=0 case. */
+ r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1),
+ pg.end);
+
+ for (i = 0; i <= 2; i++) {
+ unsigned int npages = range_len(&r[i]);
+ struct page *page = folio_page(folio, r[i].start);
+ unsigned long addr = base_addr + folio_page_idx(folio, page) * PAGE_SIZE;
+
+ if (npages > 0)
+ clear_contig_highpages(page, addr, npages);
+ }
}
static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
--
2.31.1
On 12/15/25 21:49, Ankur Arora wrote:
> folio_zero_user() does straight zeroing without caring about
> temporal locality for caches.
>
> This replaced commit c6ddfb6c5890 ("mm, clear_huge_page: move order
> algorithm into a separate function") where we cleared a page at a
> time converging to the faulting page from the left and the right.
>
> To retain limited temporal locality, split the clearing in three
> parts: the faulting page and its immediate neighbourhood, and, the
> remaining regions on the left and the right. The local neighbourhood
> will be cleared last.
> Do this only when zeroing small folios (< MAX_ORDER_NR_PAGES) since
> there isn't much expectation of cache locality for large folios.
>
> Performance
> ===
>
> AMD Genoa (EPYC 9J14, cpus=2 sockets * 96 cores * 2 threads,
> memory=2.2 TB, L1d= 16K/thread, L2=512K/thread, L3=2MB/thread)
>
> anon-w-seq (vm-scalability):
> stime utime
>
> page-at-a-time 1654.63 ( +- 3.84% ) 811.00 ( +- 3.84% )
> contiguous clearing 1602.86 ( +- 3.00% ) 970.75 ( +- 4.68% )
> neighbourhood-last 1630.32 ( +- 2.73% ) 886.37 ( +- 5.19% )
>
> Both stime and utime respond in expected ways. stime drops for both
> contiguous clearing (-3.14%) and neighbourhood-last (-1.46%)
> approaches. However, utime increases for both contiguous clearing
> (+19.7%) and neighbourhood-last (+9.28%).
>
> In part this is because anon-w-seq runs with 384 processes zeroing
> anonymously mapped memory which they then access sequentially. As
> such this is likely an uncommon pattern where the memory bandwidth
> is saturated while also being cache limited because we access the
> entire region.
>
> Kernel make workload (make -j 12 bzImage):
>
> stime utime
>
> page-at-a-time 138.16 ( +- 0.31% ) 1015.11 ( +- 0.05% )
> contiguous clearing 133.42 ( +- 0.90% ) 1013.49 ( +- 0.05% )
> neighbourhood-last 131.20 ( +- 0.76% ) 1011.36 ( +- 0.07% )
>
> For make the utime stays relatively flat with an up to 4.9% improvement
> in the stime.
Nice evaluation!
>
> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
> Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com>
> Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
> ---
> mm/memory.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 42 insertions(+), 2 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 974c48db6089..d22348b95227 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -7268,13 +7268,53 @@ static void clear_contig_highpages(struct page *page, unsigned long addr,
> * @addr_hint: The address accessed by the user or the base address.
> *
> * Uses architectural support to clear page ranges.
> + *
> + * Clearing of small folios (< MAX_ORDER_NR_PAGES) is split in three parts:
> + * pages in the immediate locality of the faulting page, and its left, right
> + * regions; the local neighbourhood is cleared last in order to keep cache
> + * lines of the faulting region hot.
> + *
> + * For larger folios we assume that there is no expectation of cache locality
> + * and just do a straight zero.
Just wondering: why not do the same thing here as well? Probably
shouldn't hurt and would get rid of some code?
> */
> void folio_zero_user(struct folio *folio, unsigned long addr_hint)
> {
> unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
While at it you could turn that const as well.
> + const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE;
> + const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1);
> + const int width = 2; /* number of pages cleared last on either side */
Is "width" really the right terminology? (the way you describe it, it's
more like diameter?)
Wondering whether we should turn that into a define to make it clearer
that we are dealing with a magic value.
Speaking of magic values, why 2 and not 3? :)
> + struct range r[3];
> + int i;
>
> - clear_contig_highpages(folio_page(folio, 0),
> - base_addr, folio_nr_pages(folio));
> + if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) {
> + clear_contig_highpages(folio_page(folio, 0),
> + base_addr, folio_nr_pages(folio));
> + return;
> + }
> +
> + /*
> + * Faulting page and its immediate neighbourhood. Cleared at the end to
> + * ensure it sticks around in the cache.
> + */
> + r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end),
> + clamp_t(s64, fault_idx + width, pg.start, pg.end));
> +
> + /* Region to the left of the fault */
> + r[1] = DEFINE_RANGE(pg.start,
> + clamp_t(s64, r[2].start-1, pg.start-1, r[2].start));
"start-1" -> "start - 1" etc.
> +
> + /* Region to the right of the fault: always valid for the common fault_idx=0 case. */
> + r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1),
> + pg.end);
Same here.
> +
> + for (i = 0; i <= 2; i++) {
Can we use ARRAY_SIZE instead of "2" ?
> + unsigned int npages = range_len(&r[i]);
> + struct page *page = folio_page(folio, r[i].start);
> + unsigned long addr = base_addr + folio_page_idx(folio, page) * PAGE_SIZE;
Can't you compute that from r[i].start instead? The folio_page_idx()
seems avoidable unless I am missing something.
Could make npages and addr const.
const unsigned long addr = base_addr + r[i].start * PAGE_SIZE;
const unsigned int npages = range_len(&r[i]);
struct page *page = folio_page(folio, r[i].start);
> +
> + if (npages > 0)
> + clear_contig_highpages(page, addr, npages);
> + }
> }
>
> static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
--
Cheers
David
David Hildenbrand (Red Hat) <david@kernel.org> writes:
> On 12/15/25 21:49, Ankur Arora wrote:
>> folio_zero_user() does straight zeroing without caring about
>> temporal locality for caches.
>> This replaced commit c6ddfb6c5890 ("mm, clear_huge_page: move order
>> algorithm into a separate function") where we cleared a page at a
>> time converging to the faulting page from the left and the right.
>> To retain limited temporal locality, split the clearing in three
>> parts: the faulting page and its immediate neighbourhood, and, the
>> remaining regions on the left and the right. The local neighbourhood
>> will be cleared last.
>> Do this only when zeroing small folios (< MAX_ORDER_NR_PAGES) since
>> there isn't much expectation of cache locality for large folios.
>> Performance
>> ===
>> AMD Genoa (EPYC 9J14, cpus=2 sockets * 96 cores * 2 threads,
>> memory=2.2 TB, L1d= 16K/thread, L2=512K/thread, L3=2MB/thread)
>> anon-w-seq (vm-scalability):
>> stime utime
>> page-at-a-time 1654.63 ( +- 3.84% ) 811.00 ( +- 3.84% )
>> contiguous clearing 1602.86 ( +- 3.00% ) 970.75 ( +- 4.68% )
>> neighbourhood-last 1630.32 ( +- 2.73% ) 886.37 ( +- 5.19% )
>> Both stime and utime respond in expected ways. stime drops for both
>> contiguous clearing (-3.14%) and neighbourhood-last (-1.46%)
>> approaches. However, utime increases for both contiguous clearing
>> (+19.7%) and neighbourhood-last (+9.28%).
>> In part this is because anon-w-seq runs with 384 processes zeroing
>> anonymously mapped memory which they then access sequentially. As
>> such this is likely an uncommon pattern where the memory bandwidth
>> is saturated while also being cache limited because we access the
>> entire region.
>> Kernel make workload (make -j 12 bzImage):
>> stime utime
>> page-at-a-time 138.16 ( +- 0.31% ) 1015.11 ( +- 0.05% )
>> contiguous clearing 133.42 ( +- 0.90% ) 1013.49 ( +- 0.05% )
>> neighbourhood-last 131.20 ( +- 0.76% ) 1011.36 ( +- 0.07% )
>> For make the utime stays relatively flat with an up to 4.9% improvement
>> in the stime.
>
> Nice evaluation!
>
>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>> Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com>
>> Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
>> ---
>> mm/memory.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
>> 1 file changed, 42 insertions(+), 2 deletions(-)
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 974c48db6089..d22348b95227 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -7268,13 +7268,53 @@ static void clear_contig_highpages(struct page *page, unsigned long addr,
>> * @addr_hint: The address accessed by the user or the base address.
>> *
>> * Uses architectural support to clear page ranges.
>> + *
>> + * Clearing of small folios (< MAX_ORDER_NR_PAGES) is split in three parts:
>> + * pages in the immediate locality of the faulting page, and its left, right
>> + * regions; the local neighbourhood is cleared last in order to keep cache
>> + * lines of the faulting region hot.
>> + *
>> + * For larger folios we assume that there is no expectation of cache locality
>> + * and just do a straight zero.
>
> Just wondering: why not do the same thing here as well? Probably shouldn't hurt
> and would get rid of some code?
That's a good point. With only a three way split, there's no reason to
treat large folios specially.
>> */
>> void folio_zero_user(struct folio *folio, unsigned long addr_hint)
>> {
>> unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
>
> While at it you could turn that const as well.
Ack.
>> + const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE;
>> + const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1);
>> + const int width = 2; /* number of pages cleared last on either side */
>
> Is "width" really the right terminology? (the way you describe it, it's more
> like diameter?)
I like diameter. Will make that a define.
> Wondering whether we should turn that into a define to make it clearer that we
> are dealing with a magic value.
>
> Speaking of magic values, why 2 and not 3? :)
I think I had tried both :). The performance was pretty much the same.
But also, this is probably a function of the benchmark used. And I'm
not sure I have a good one (unless kernel build with THP counts which
isn't very responsive).
>> + struct range r[3];
>> + int i;
>> - clear_contig_highpages(folio_page(folio, 0),
>> - base_addr, folio_nr_pages(folio));
>> + if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) {
>> + clear_contig_highpages(folio_page(folio, 0),
>> + base_addr, folio_nr_pages(folio));
>> + return;
>> + }
>> +
>> + /*
>> + * Faulting page and its immediate neighbourhood. Cleared at the end to
>> + * ensure it sticks around in the cache.
>> + */
>> + r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end),
>> + clamp_t(s64, fault_idx + width, pg.start, pg.end));
>> +
>> + /* Region to the left of the fault */
>> + r[1] = DEFINE_RANGE(pg.start,
>> + clamp_t(s64, r[2].start-1, pg.start-1, r[2].start));
>
> "start-1" -> "start - 1" etc.
>
>> +
>> + /* Region to the right of the fault: always valid for the common fault_idx=0 case. */
>> + r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1),
>> + pg.end);
>
> Same here.
>
>> +
>> + for (i = 0; i <= 2; i++) {
>
> Can we use ARRAY_SIZE instead of "2" ?
Ack to all of these.
>> + unsigned int npages = range_len(&r[i]);
>> + struct page *page = folio_page(folio, r[i].start);
>> + unsigned long addr = base_addr + folio_page_idx(folio, page) * PAGE_SIZE;
>
> Can't you compute that from r[i].start) instead? The folio_page_idx() seems
> avoidable unless I am missing something.
>
> Could make npages and addr const.
>
> const unsigned long addr = base_addr + r[i].start * PAGE_SIZE;
> const unsigned int npages = range_len(&r[i]);
> struct page *page = folio_page(folio, r[i].start);
Thanks. Yeah, all of this makes sense.
Will fix.
>> +
>> + if (npages > 0)
>> + clear_contig_highpages(page, addr, npages);
>> + }
>> }
>> static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
Thanks for the review!
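FWIW, with those changes folded in (and the MAX_ORDER_NR_PAGES special
case dropped), the function would look roughly like this untested
sketch (keeping "width" as a local until I settle on the define/name):

void folio_zero_user(struct folio *folio, unsigned long addr_hint)
{
        const unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
        const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE;
        const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1);
        const int width = 2;    /* pages cleared last on either side of the fault */
        struct range r[3];
        int i;

        /*
         * Faulting page and its immediate neighbourhood. Cleared at the end
         * to ensure it sticks around in the cache.
         */
        r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end),
                            clamp_t(s64, fault_idx + width, pg.start, pg.end));

        /* Region to the left of the fault */
        r[1] = DEFINE_RANGE(pg.start,
                            clamp_t(s64, r[2].start - 1, pg.start - 1, r[2].start));

        /* Region to the right of the fault */
        r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end + 1, r[2].end, pg.end + 1),
                            pg.end);

        for (i = 0; i < ARRAY_SIZE(r); i++) {
                const unsigned long addr = base_addr + r[i].start * PAGE_SIZE;
                const unsigned int npages = range_len(&r[i]);
                struct page *page = folio_page(folio, r[i].start);

                if (npages > 0)
                        clear_contig_highpages(page, addr, npages);
        }
}

The empty left/right ranges naturally fall out as npages == 0, so the
common fault_idx == 0 case still does just two clear_contig_highpages()
calls.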
--
ankur
Ankur Arora <ankur.a.arora@oracle.com> writes:
> David Hildenbrand (Red Hat) <david@kernel.org> writes:
>
>> On 12/15/25 21:49, Ankur Arora wrote:
>>> folio_zero_user() does straight zeroing without caring about
>>> temporal locality for caches.
>>> This replaced commit c6ddfb6c5890 ("mm, clear_huge_page: move order
>>> algorithm into a separate function") where we cleared a page at a
>>> time converging to the faulting page from the left and the right.
>>> To retain limited temporal locality, split the clearing in three
>>> parts: the faulting page and its immediate neighbourhood, and, the
>>> remaining regions on the left and the right. The local neighbourhood
>>> will be cleared last.
>>> Do this only when zeroing small folios (< MAX_ORDER_NR_PAGES) since
>>> there isn't much expectation of cache locality for large folios.
>>> Performance
>>> ===
>>> AMD Genoa (EPYC 9J14, cpus=2 sockets * 96 cores * 2 threads,
>>> memory=2.2 TB, L1d= 16K/thread, L2=512K/thread, L3=2MB/thread)
>>> anon-w-seq (vm-scalability):
>>> stime utime
>>> page-at-a-time 1654.63 ( +- 3.84% ) 811.00 ( +- 3.84% )
>>> contiguous clearing 1602.86 ( +- 3.00% ) 970.75 ( +- 4.68% )
>>> neighbourhood-last 1630.32 ( +- 2.73% ) 886.37 ( +- 5.19% )
>>> Both stime and utime respond in expected ways. stime drops for both
>>> contiguous clearing (-3.14%) and neighbourhood-last (-1.46%)
>>> approaches. However, utime increases for both contiguous clearing
>>> (+19.7%) and neighbourhood-last (+9.28%).
>>> In part this is because anon-w-seq runs with 384 processes zeroing
>>> anonymously mapped memory which they then access sequentially. As
>>> such this is likely an uncommon pattern where the memory bandwidth
>>> is saturated while also being cache limited because we access the
>>> entire region.
>>> Kernel make workload (make -j 12 bzImage):
>>> stime utime
>>> page-at-a-time 138.16 ( +- 0.31% ) 1015.11 ( +- 0.05% )
>>> contiguous clearing 133.42 ( +- 0.90% ) 1013.49 ( +- 0.05% )
>>> neighbourhood-last 131.20 ( +- 0.76% ) 1011.36 ( +- 0.07% )
>>> For make the utime stays relatively flat with an up to 4.9% improvement
>>> in the stime.
>>
>> Nice evaluation!
>>
>>> Signed-off-by: Ankur Arora <ankur.a.arora@oracle.com>
>>> Reviewed-by: Raghavendra K T <raghavendra.kt@amd.com>
>>> Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
>>> ---
>>> mm/memory.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
>>> 1 file changed, 42 insertions(+), 2 deletions(-)
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 974c48db6089..d22348b95227 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -7268,13 +7268,53 @@ static void clear_contig_highpages(struct page *page, unsigned long addr,
>>> * @addr_hint: The address accessed by the user or the base address.
>>> *
>>> * Uses architectural support to clear page ranges.
>>> + *
>>> + * Clearing of small folios (< MAX_ORDER_NR_PAGES) is split in three parts:
>>> + * pages in the immediate locality of the faulting page, and its left, right
>>> + * regions; the local neighbourhood is cleared last in order to keep cache
>>> + * lines of the faulting region hot.
>>> + *
>>> + * For larger folios we assume that there is no expectation of cache locality
>>> + * and just do a straight zero.
>>
>> Just wondering: why not do the same thing here as well? Probably shouldn't hurt
>> and would get rid of some code?
>
> That's a good point. With only a three way split, there's no reason to
> treat large folios specially.
A bit more on this: the change makes sense, but I'll retain the current
split between patches 7 and 8, where patch 7 is used to justify
contiguous clearing (the choice of value for
PROCESS_PAGES_NON_PREEMPT_BATCH, the clearing unit based on the
preemption model, etc.) and patch 8 covers the neighbourhood
optimization.
>>> */
>>> void folio_zero_user(struct folio *folio, unsigned long addr_hint)
>>> {
>>> unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
>>
>> While at it you could turn that const as well.
>
> Ack.
>
>>> + const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE;
>>> + const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1);
>>> + const int width = 2; /* number of pages cleared last on either side */
>>
>> Is "width" really the right terminology? (the way you describe it, it's more
>> like diameter?)
>
> I like diameter. Will make that a define.
I'll make that radius since that's how I'm using it.
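Something along these lines (the exact name is still a placeholder):

        /* Pages cleared last on either side of the faulting page. */
        #define FOLIO_ZERO_FAULT_RADIUS        2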
Thanks
Ankur
>> Wondering whether we should turn that into a define to make it clearer that we
>> are dealing with a magic value.
>>
>> Speaking of magic values, why 2 and not 3? :)
>
> I think I had tried both :). The performance was pretty much the same.
>
> But also, this is probably a function of the benchmark used. And I'm
> not sure I have a good one (unless kernel build with THP counts which
> isn't very responsive).
>
>>> + struct range r[3];
>>> + int i;
>>> - clear_contig_highpages(folio_page(folio, 0),
>>> - base_addr, folio_nr_pages(folio));
>>> + if (folio_nr_pages(folio) > MAX_ORDER_NR_PAGES) {
>>> + clear_contig_highpages(folio_page(folio, 0),
>>> + base_addr, folio_nr_pages(folio));
>>> + return;
>>> + }
>>> +
>>> + /*
>>> + * Faulting page and its immediate neighbourhood. Cleared at the end to
>>> + * ensure it sticks around in the cache.
>>> + */
>>> + r[2] = DEFINE_RANGE(clamp_t(s64, fault_idx - width, pg.start, pg.end),
>>> + clamp_t(s64, fault_idx + width, pg.start, pg.end));
>>> +
>>> + /* Region to the left of the fault */
>>> + r[1] = DEFINE_RANGE(pg.start,
>>> + clamp_t(s64, r[2].start-1, pg.start-1, r[2].start));
>>
>> "start-1" -> "start - 1" etc.
>>
>>> +
>>> + /* Region to the right of the fault: always valid for the common fault_idx=0 case. */
>>> + r[0] = DEFINE_RANGE(clamp_t(s64, r[2].end+1, r[2].end, pg.end+1),
>>> + pg.end);
>>
>> Same here.
>>
>>> +
>>> + for (i = 0; i <= 2; i++) {
>>
>> Can we use ARRAY_SIZE instead of "2" ?
>
> Ack to all of the these.
>
>>> + unsigned int npages = range_len(&r[i]);
>>> + struct page *page = folio_page(folio, r[i].start);
>>> + unsigned long addr = base_addr + folio_page_idx(folio, page) * PAGE_SIZE;
>>
>> Can't you compute that from r[i].start) instead? The folio_page_idx() seems
>> avoidable unless I am missing something.
>>
>> Could make npages and addr const.
>>
>> const unsigned long addr = base_addr + r[i].start * PAGE_SIZE;
>> const unsigned int npages = range_len(&r[i]);
>> struct page *page = folio_page(folio, r[i].start);
>
> Thanks. Yeah, all of this makes sense.
>
> Will fix.
>
>>> +
>>> + if (npages > 0)
>>> + clear_contig_highpages(page, addr, npages);
>>> + }
>>> }
>>> static int copy_user_gigantic_page(struct folio *dst, struct folio *src,
>
> Thanks for the review!
--
ankur