Currently, contpte_ptep_test_and_clear_young() and contpte_ptep_clear_flush_young()
only clear the young flag and flush TLBs for PTEs within the contiguous range.
To support batch PTE operations on large folios of other sizes in the following
patches, add a new parameter to specify the number of PTEs.
While we are at it, rename the functions to maintain consistency with other
contpte_*() functions.
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
arch/arm64/include/asm/pgtable.h | 12 ++++-----
arch/arm64/mm/contpte.c | 44 ++++++++++++++++++++++----------
2 files changed, 37 insertions(+), 19 deletions(-)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 0944e296dd4a..e03034683156 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1679,10 +1679,10 @@ extern void contpte_clear_full_ptes(struct mm_struct *mm, unsigned long addr,
extern pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
unsigned long addr, pte_t *ptep,
unsigned int nr, int full);
-extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
- unsigned long addr, pte_t *ptep);
-extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
- unsigned long addr, pte_t *ptep);
+extern int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep, unsigned int nr);
+extern int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep, unsigned int nr);
extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, unsigned int nr);
extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
@@ -1854,7 +1854,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
if (likely(!pte_valid_cont(orig_pte)))
return __ptep_test_and_clear_young(vma, addr, ptep);
- return contpte_ptep_test_and_clear_young(vma, addr, ptep);
+ return contpte_test_and_clear_young_ptes(vma, addr, ptep, CONT_PTES);
}
#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
@@ -1866,7 +1866,7 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
if (likely(!pte_valid_cont(orig_pte)))
return __ptep_clear_flush_young(vma, addr, ptep);
- return contpte_ptep_clear_flush_young(vma, addr, ptep);
+ return contpte_clear_flush_young_ptes(vma, addr, ptep, CONT_PTES);
}
#define wrprotect_ptes wrprotect_ptes
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index c0557945939c..19b122441be3 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -488,8 +488,9 @@ pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
}
EXPORT_SYMBOL_GPL(contpte_get_and_clear_full_ptes);
-int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
- unsigned long addr, pte_t *ptep)
+int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep,
+ unsigned int nr)
{
/*
* ptep_clear_flush_young() technically requires us to clear the access
@@ -500,39 +501,56 @@ int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
* having to unfold.
*/
+ unsigned long start = addr;
+ unsigned long end = start + nr * PAGE_SIZE;
int young = 0;
int i;
- ptep = contpte_align_down(ptep);
- addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
+ if (pte_cont(__ptep_get(ptep + nr - 1)))
+ end = ALIGN(end, CONT_PTE_SIZE);
- for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
- young |= __ptep_test_and_clear_young(vma, addr, ptep);
+ if (pte_cont(__ptep_get(ptep))) {
+ start = ALIGN_DOWN(start, CONT_PTE_SIZE);
+ ptep = contpte_align_down(ptep);
+ }
+
+ nr = (end - start) / PAGE_SIZE;
+ for (i = 0; i < nr; i++, ptep++, start += PAGE_SIZE)
+ young |= __ptep_test_and_clear_young(vma, start, ptep);
return young;
}
-EXPORT_SYMBOL_GPL(contpte_ptep_test_and_clear_young);
+EXPORT_SYMBOL_GPL(contpte_test_and_clear_young_ptes);
-int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
- unsigned long addr, pte_t *ptep)
+int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep,
+ unsigned int nr)
{
int young;
- young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
+ young = contpte_test_and_clear_young_ptes(vma, addr, ptep, nr);
if (young) {
+ unsigned long start = addr;
+ unsigned long end = start + nr * PAGE_SIZE;
+
+ if (pte_cont(__ptep_get(ptep + nr - 1)))
+ end = ALIGN(end, CONT_PTE_SIZE);
+
+ if (pte_cont(__ptep_get(ptep)))
+ start = ALIGN_DOWN(start, CONT_PTE_SIZE);
+
/*
* See comment in __ptep_clear_flush_young(); same rationale for
* eliding the trailing DSB applies here.
*/
- addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
- __flush_tlb_range_nosync(vma->vm_mm, addr, addr + CONT_PTE_SIZE,
+ __flush_tlb_range_nosync(vma->vm_mm, start, end,
PAGE_SIZE, true, 3);
}
return young;
}
-EXPORT_SYMBOL_GPL(contpte_ptep_clear_flush_young);
+EXPORT_SYMBOL_GPL(contpte_clear_flush_young_ptes);
void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
pte_t *ptep, unsigned int nr)
--
2.47.3
Sorry I'm a bit late to the party...
On 11/12/2025 08:16, Baolin Wang wrote:
> Currently, contpte_ptep_test_and_clear_young() and contpte_ptep_clear_flush_young()
> only clear the young flag and flush TLBs for PTEs within the contiguous range.
> To support batch PTE operations for other sized large folios in the following
> patches, adding a new parameter to specify the number of PTEs.
>
> While we are at it, rename the functions to maintain consistency with other
> contpte_*() functions.
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
> arch/arm64/include/asm/pgtable.h | 12 ++++-----
> arch/arm64/mm/contpte.c | 44 ++++++++++++++++++++++----------
> 2 files changed, 37 insertions(+), 19 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 0944e296dd4a..e03034683156 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1679,10 +1679,10 @@ extern void contpte_clear_full_ptes(struct mm_struct *mm, unsigned long addr,
> extern pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
> unsigned long addr, pte_t *ptep,
> unsigned int nr, int full);
> -extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> - unsigned long addr, pte_t *ptep);
> -extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> - unsigned long addr, pte_t *ptep);
> +extern int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep, unsigned int nr);
> +extern int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep, unsigned int nr);
The "contpte_" functions are intended to be private to the arm64 arch and should
be exposed via the generic APIs. But I don't see any generic batched API for
this, so you're only actually able to pass CONT_PTES as nr. Perhaps you're
planning to define "test_and_clear_young_ptes()" and "clear_flush_young_ptes()"
in later patches?
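If so, I'd guess the generic fallback (presumably in include/linux/pgtable.h)
would end up looking something like the below - names and placement are pure
guesswork on my part about what patch 2 might add:

static inline int test_and_clear_young_ptes(struct vm_area_struct *vma,
		unsigned long addr, pte_t *ptep, unsigned int nr)
{
	int young = 0;

	/* Generic fallback: just operate on each pte individually. */
	for (; nr; nr--, ptep++, addr += PAGE_SIZE)
		young |= ptep_test_and_clear_young(vma, addr, ptep);

	return young;
}

with arm64 then overriding it to call contpte_test_and_clear_young_ptes().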
> extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> pte_t *ptep, unsigned int nr);
> extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
> @@ -1854,7 +1854,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
> if (likely(!pte_valid_cont(orig_pte)))
> return __ptep_test_and_clear_young(vma, addr, ptep);
>
> - return contpte_ptep_test_and_clear_young(vma, addr, ptep);
> + return contpte_test_and_clear_young_ptes(vma, addr, ptep, CONT_PTES);
> }
>
> #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
> @@ -1866,7 +1866,7 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
> if (likely(!pte_valid_cont(orig_pte)))
> return __ptep_clear_flush_young(vma, addr, ptep);
>
> - return contpte_ptep_clear_flush_young(vma, addr, ptep);
> + return contpte_clear_flush_young_ptes(vma, addr, ptep, CONT_PTES);
> }
>
> #define wrprotect_ptes wrprotect_ptes
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index c0557945939c..19b122441be3 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -488,8 +488,9 @@ pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
> }
> EXPORT_SYMBOL_GPL(contpte_get_and_clear_full_ptes);
>
> -int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> - unsigned long addr, pte_t *ptep)
> +int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep,
> + unsigned int nr)
> {
> /*
> * ptep_clear_flush_young() technically requires us to clear the access
> @@ -500,39 +501,56 @@ int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> * having to unfold.
> */
>
> + unsigned long start = addr;
Personally I wouldn't bother defining start - just reuse addr. You're
incrementing start in the below loop, so it's more appropriate to call it addr
anyway.
> + unsigned long end = start + nr * PAGE_SIZE;
> int young = 0;
> int i;
>
> - ptep = contpte_align_down(ptep);
> - addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> + if (pte_cont(__ptep_get(ptep + nr - 1)))
> + end = ALIGN(end, CONT_PTE_SIZE);
>
> - for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
> - young |= __ptep_test_and_clear_young(vma, addr, ptep);
> + if (pte_cont(__ptep_get(ptep))) {
> + start = ALIGN_DOWN(start, CONT_PTE_SIZE);
> + ptep = contpte_align_down(ptep);
> + }
> +
> + nr = (end - start) / PAGE_SIZE;
> + for (i = 0; i < nr; i++, ptep++, start += PAGE_SIZE)
Given you're now defining end, perhaps we don't need nr?
for (; addr != end; ptep++, addr += PAGE_SIZE)
young |= __ptep_test_and_clear_young(vma, addr, ptep);
> + young |= __ptep_test_and_clear_young(vma, start, ptep);
>
> return young;
> }
> -EXPORT_SYMBOL_GPL(contpte_ptep_test_and_clear_young);
> +EXPORT_SYMBOL_GPL(contpte_test_and_clear_young_ptes);
>
> -int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> - unsigned long addr, pte_t *ptep)
> +int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep,
> + unsigned int nr)
> {
> int young;
>
> - young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
> + young = contpte_test_and_clear_young_ptes(vma, addr, ptep, nr);
>
> if (young) {
> + unsigned long start = addr;
> + unsigned long end = start + nr * PAGE_SIZE;
> +
> + if (pte_cont(__ptep_get(ptep + nr - 1)))
> + end = ALIGN(end, CONT_PTE_SIZE);
> +
> + if (pte_cont(__ptep_get(ptep)))
> + start = ALIGN_DOWN(start, CONT_PTE_SIZE);
> +
We now have this pattern of expanding contpte blocks up and down in 3 places.
Perhaps create a helper?
Thanks,
Ryan
> /*
> * See comment in __ptep_clear_flush_young(); same rationale for
> * eliding the trailing DSB applies here.
> */
> - addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> - __flush_tlb_range_nosync(vma->vm_mm, addr, addr + CONT_PTE_SIZE,
> + __flush_tlb_range_nosync(vma->vm_mm, start, end,
> PAGE_SIZE, true, 3);
> }
>
> return young;
> }
> -EXPORT_SYMBOL_GPL(contpte_ptep_clear_flush_young);
> +EXPORT_SYMBOL_GPL(contpte_clear_flush_young_ptes);
>
> void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> pte_t *ptep, unsigned int nr)
On 2025/12/17 23:43, Ryan Roberts wrote:
> Sorry I'm a bit late to the party...
Never mind. It's not late and comments are always welcome :)
> On 11/12/2025 08:16, Baolin Wang wrote:
>> Currently, contpte_ptep_test_and_clear_young() and contpte_ptep_clear_flush_young()
>> only clear the young flag and flush TLBs for PTEs within the contiguous range.
>> To support batch PTE operations for other sized large folios in the following
>> patches, adding a new parameter to specify the number of PTEs.
>>
>> While we are at it, rename the functions to maintain consistency with other
>> contpte_*() functions.
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>> arch/arm64/include/asm/pgtable.h | 12 ++++-----
>> arch/arm64/mm/contpte.c | 44 ++++++++++++++++++++++----------
>> 2 files changed, 37 insertions(+), 19 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 0944e296dd4a..e03034683156 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -1679,10 +1679,10 @@ extern void contpte_clear_full_ptes(struct mm_struct *mm, unsigned long addr,
>> extern pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
>> unsigned long addr, pte_t *ptep,
>> unsigned int nr, int full);
>> -extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>> - unsigned long addr, pte_t *ptep);
>> -extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>> - unsigned long addr, pte_t *ptep);
>> +extern int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
>> + unsigned long addr, pte_t *ptep, unsigned int nr);
>> +extern int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
>> + unsigned long addr, pte_t *ptep, unsigned int nr);
>
> The "contpte_" functions are intended to be private to the arm64 arch and should
> be exposed via the generic APIs. But I don't see any generic batched API for
> this, so you're only actually able to pass CONT_PTES as nr. Perhaps you're
> planning to define "test_and_clear_young_ptes()" and "clear_flush_young_ptes()"
> in later patches?
Right. This is a preparation patch, and will be used in patch 2.
>> extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>> pte_t *ptep, unsigned int nr);
>> extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>> @@ -1854,7 +1854,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>> if (likely(!pte_valid_cont(orig_pte)))
>> return __ptep_test_and_clear_young(vma, addr, ptep);
>>
>> - return contpte_ptep_test_and_clear_young(vma, addr, ptep);
>> + return contpte_test_and_clear_young_ptes(vma, addr, ptep, CONT_PTES);
>> }
>>
>> #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
>> @@ -1866,7 +1866,7 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>> if (likely(!pte_valid_cont(orig_pte)))
>> return __ptep_clear_flush_young(vma, addr, ptep);
>>
>> - return contpte_ptep_clear_flush_young(vma, addr, ptep);
>> + return contpte_clear_flush_young_ptes(vma, addr, ptep, CONT_PTES);
>> }
>>
>> #define wrprotect_ptes wrprotect_ptes
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> index c0557945939c..19b122441be3 100644
>> --- a/arch/arm64/mm/contpte.c
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -488,8 +488,9 @@ pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
>> }
>> EXPORT_SYMBOL_GPL(contpte_get_and_clear_full_ptes);
>>
>> -int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>> - unsigned long addr, pte_t *ptep)
>> +int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
>> + unsigned long addr, pte_t *ptep,
>> + unsigned int nr)
>> {
>> /*
>> * ptep_clear_flush_young() technically requires us to clear the access
>> @@ -500,39 +501,56 @@ int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>> * having to unfold.
>> */
>>
>> + unsigned long start = addr;
>
> Personally I wouldn't bother defining start - just reuse addr. You're
> incrementing start in the below loop, so it's more appropriate to call it addr
> anyway.
OK.
>> + unsigned long end = start + nr * PAGE_SIZE;
>> int young = 0;
>> int i;
>>
>> - ptep = contpte_align_down(ptep);
>> - addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> + if (pte_cont(__ptep_get(ptep + nr - 1)))
>> + end = ALIGN(end, CONT_PTE_SIZE);
>>
>> - for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>> - young |= __ptep_test_and_clear_young(vma, addr, ptep);
>> + if (pte_cont(__ptep_get(ptep))) {
>> + start = ALIGN_DOWN(start, CONT_PTE_SIZE);
>> + ptep = contpte_align_down(ptep);
>> + }
>> +
>> + nr = (end - start) / PAGE_SIZE;
>> + for (i = 0; i < nr; i++, ptep++, start += PAGE_SIZE)
>
> Given you're now defining end, perhaps we don't need nr?
>
> for (; addr != end; ptep++, addr += PAGE_SIZE)
> young |= __ptep_test_and_clear_young(vma, addr, ptep);
Yes, good point.
>> + young |= __ptep_test_and_clear_young(vma, start, ptep);
>>
>> return young;
>> }
>> -EXPORT_SYMBOL_GPL(contpte_ptep_test_and_clear_young);
>> +EXPORT_SYMBOL_GPL(contpte_test_and_clear_young_ptes);
>>
>> -int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>> - unsigned long addr, pte_t *ptep)
>> +int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
>> + unsigned long addr, pte_t *ptep,
>> + unsigned int nr)
>> {
>> int young;
>>
>> - young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
>> + young = contpte_test_and_clear_young_ptes(vma, addr, ptep, nr);
>>
>> if (young) {
>> + unsigned long start = addr;
>> + unsigned long end = start + nr * PAGE_SIZE;
>> +
>> + if (pte_cont(__ptep_get(ptep + nr - 1)))
>> + end = ALIGN(end, CONT_PTE_SIZE);
>> +
>> + if (pte_cont(__ptep_get(ptep)))
>> + start = ALIGN_DOWN(start, CONT_PTE_SIZE);
>> +
>
> We now have this pattern of expanding contpte blocks up and down in 3 places.
> Perhaps create a helper?
Sounds reasonable. How about the following helper?
static pte_t *contpte_align_addr_ptep(unsigned long *start, unsigned long *end,
                                      pte_t *ptep, unsigned int nr)
{
        unsigned long end_addr = *start + nr * PAGE_SIZE;

        if (pte_cont(__ptep_get(ptep + nr - 1)))
                *end = ALIGN(end_addr, CONT_PTE_SIZE);

        if (pte_cont(__ptep_get(ptep))) {
                *start = ALIGN_DOWN(*start, CONT_PTE_SIZE);
                ptep = contpte_align_down(ptep);
        }

        return ptep;
}
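With that, the callers would become something like (untested, just to show the
shape; the caller pre-initialises 'end' to the unexpanded value, since the
helper only updates it for the contpte case):

int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
                                      unsigned long addr, pte_t *ptep,
                                      unsigned int nr)
{
        /* Unexpanded end; the helper widens it if the last pte is contpte. */
        unsigned long end = addr + nr * PAGE_SIZE;
        int young = 0;

        ptep = contpte_align_addr_ptep(&addr, &end, ptep, nr);

        for (; addr != end; ptep++, addr += PAGE_SIZE)
                young |= __ptep_test_and_clear_young(vma, addr, ptep);

        return young;
}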
On 18/12/2025 07:15, Baolin Wang wrote:
>
>
> On 2025/12/17 23:43, Ryan Roberts wrote:
>> Sorry I'm a bit late to the party...
>
> Never mind. It's not late and comments are always welcome :)
>
>> On 11/12/2025 08:16, Baolin Wang wrote:
>>> Currently, contpte_ptep_test_and_clear_young() and
>>> contpte_ptep_clear_flush_young()
>>> only clear the young flag and flush TLBs for PTEs within the contiguous range.
>>> To support batch PTE operations for other sized large folios in the following
>>> patches, adding a new parameter to specify the number of PTEs.
>>>
>>> While we are at it, rename the functions to maintain consistency with other
>>> contpte_*() functions.
>>>
>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> ---
>>> arch/arm64/include/asm/pgtable.h | 12 ++++-----
>>> arch/arm64/mm/contpte.c | 44 ++++++++++++++++++++++----------
>>> 2 files changed, 37 insertions(+), 19 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>> index 0944e296dd4a..e03034683156 100644
>>> --- a/arch/arm64/include/asm/pgtable.h
>>> +++ b/arch/arm64/include/asm/pgtable.h
>>> @@ -1679,10 +1679,10 @@ extern void contpte_clear_full_ptes(struct mm_struct
>>> *mm, unsigned long addr,
>>> extern pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
>>> unsigned long addr, pte_t *ptep,
>>> unsigned int nr, int full);
>>> -extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>> - unsigned long addr, pte_t *ptep);
>>> -extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>>> - unsigned long addr, pte_t *ptep);
>>> +extern int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
>>> + unsigned long addr, pte_t *ptep, unsigned int nr);
>>> +extern int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
>>> + unsigned long addr, pte_t *ptep, unsigned int nr);
>>
>> The "contpte_" functions are intended to be private to the arm64 arch and should
>> be exposed via the generic APIs. But I don't see any generic batched API for
>> this, so you're only actually able to pass CONT_PTES as nr. Perhaps you're
>> planning to define "test_and_clear_young_ptes()" and "clear_flush_young_ptes()"
>> in later patches?
>
> Right. This is a preparation patch, and will be used in patch 2.
>
>>> extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>>> pte_t *ptep, unsigned int nr);
>>> extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>>> @@ -1854,7 +1854,7 @@ static inline int ptep_test_and_clear_young(struct
>>> vm_area_struct *vma,
>>> if (likely(!pte_valid_cont(orig_pte)))
>>> return __ptep_test_and_clear_young(vma, addr, ptep);
>>> - return contpte_ptep_test_and_clear_young(vma, addr, ptep);
>>> + return contpte_test_and_clear_young_ptes(vma, addr, ptep, CONT_PTES);
>>> }
>>> #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
>>> @@ -1866,7 +1866,7 @@ static inline int ptep_clear_flush_young(struct
>>> vm_area_struct *vma,
>>> if (likely(!pte_valid_cont(orig_pte)))
>>> return __ptep_clear_flush_young(vma, addr, ptep);
>>> - return contpte_ptep_clear_flush_young(vma, addr, ptep);
>>> + return contpte_clear_flush_young_ptes(vma, addr, ptep, CONT_PTES);
>>> }
>>> #define wrprotect_ptes wrprotect_ptes
>>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>>> index c0557945939c..19b122441be3 100644
>>> --- a/arch/arm64/mm/contpte.c
>>> +++ b/arch/arm64/mm/contpte.c
>>> @@ -488,8 +488,9 @@ pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
>>> }
>>> EXPORT_SYMBOL_GPL(contpte_get_and_clear_full_ptes);
>>> -int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>> - unsigned long addr, pte_t *ptep)
>>> +int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
>>> + unsigned long addr, pte_t *ptep,
>>> + unsigned int nr)
>>> {
>>> /*
>>> * ptep_clear_flush_young() technically requires us to clear the access
>>> @@ -500,39 +501,56 @@ int contpte_ptep_test_and_clear_young(struct
>>> vm_area_struct *vma,
>>> * having to unfold.
>>> */
>>> + unsigned long start = addr;
>>
>> Personally I wouldn't bother defining start - just reuse addr. You're
>> incrementing start in the below loop, so it's more appropriate to call it addr
>> anyway.
>
> OK.
>
>>> + unsigned long end = start + nr * PAGE_SIZE;
>>> int young = 0;
>>> int i;
>>> - ptep = contpte_align_down(ptep);
>>> - addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>> + if (pte_cont(__ptep_get(ptep + nr - 1)))
>>> + end = ALIGN(end, CONT_PTE_SIZE);
>>> - for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>>> - young |= __ptep_test_and_clear_young(vma, addr, ptep);
>>> + if (pte_cont(__ptep_get(ptep))) {
>>> + start = ALIGN_DOWN(start, CONT_PTE_SIZE);
>>> + ptep = contpte_align_down(ptep);
>>> + }
>>> +
>>> + nr = (end - start) / PAGE_SIZE;
>>> + for (i = 0; i < nr; i++, ptep++, start += PAGE_SIZE)
>>
>> Given you're now defining end, perhaps we don't need nr?
>>
>> for (; addr != end; ptep++, addr += PAGE_SIZE)
>> young |= __ptep_test_and_clear_young(vma, addr, ptep);
>
> Yes, good point.
>
>>> + young |= __ptep_test_and_clear_young(vma, start, ptep);
>>> return young;
>>> }
>>> -EXPORT_SYMBOL_GPL(contpte_ptep_test_and_clear_young);
>>> +EXPORT_SYMBOL_GPL(contpte_test_and_clear_young_ptes);
>>> -int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>>> - unsigned long addr, pte_t *ptep)
>>> +int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
>>> + unsigned long addr, pte_t *ptep,
>>> + unsigned int nr)
>>> {
>>> int young;
>>> - young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
>>> + young = contpte_test_and_clear_young_ptes(vma, addr, ptep, nr);
>>> if (young) {
>>> + unsigned long start = addr;
>>> + unsigned long end = start + nr * PAGE_SIZE;
>>> +
>>> + if (pte_cont(__ptep_get(ptep + nr - 1)))
>>> + end = ALIGN(end, CONT_PTE_SIZE);
>>> +
>>> + if (pte_cont(__ptep_get(ptep)))
>>> + start = ALIGN_DOWN(start, CONT_PTE_SIZE);
>>> +
>>
>> We now have this pattern of expanding contpte blocks up and down in 3 places.
>> Perhaps create a helper?
>
> Sounds reasonable. How about the following helper?
>
> static pte_t *contpte_align_addr_ptep(unsigned long *start, unsigned long *end,
> pte_t *ptep, unsigned int nr)
> {
> unsigned long end_addr = *start + nr * PAGE_SIZE;
>
> if (pte_cont(__ptep_get(ptep + nr - 1)))
I think this is safe, but calling it out to check: you're not checking that the
pte is valid, so theoretically you could have a swap entry here with the bit
that overlays the contiguous bit set, and then you would incorrectly extend the range.
But I think it is safe because the expectation is that core-mm has already
checked that the whole range is present?
> *end = ALIGN(end_addr, CONT_PTE_SIZE);
>
> if (pte_cont(__ptep_get(ptep))) {
> *start = ALIGN_DOWN(*start, CONT_PTE_SIZE);
> ptep = contpte_align_down(ptep);
> }
>
> return ptep;
> }
Looks good.
On 2025/12/18 20:20, Ryan Roberts wrote:
> On 18/12/2025 07:15, Baolin Wang wrote:
>>
>>
>> On 2025/12/17 23:43, Ryan Roberts wrote:
>>> Sorry I'm a bit late to the party...
>>
>> Never mind. It's not late and comments are always welcome :)
>>
>>> On 11/12/2025 08:16, Baolin Wang wrote:
>>>> Currently, contpte_ptep_test_and_clear_young() and
>>>> contpte_ptep_clear_flush_young()
>>>> only clear the young flag and flush TLBs for PTEs within the contiguous range.
>>>> To support batch PTE operations for other sized large folios in the following
>>>> patches, adding a new parameter to specify the number of PTEs.
>>>>
>>>> While we are at it, rename the functions to maintain consistency with other
>>>> contpte_*() functions.
>>>>
>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>> ---
>>>> arch/arm64/include/asm/pgtable.h | 12 ++++-----
>>>> arch/arm64/mm/contpte.c | 44 ++++++++++++++++++++++----------
>>>> 2 files changed, 37 insertions(+), 19 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>>> index 0944e296dd4a..e03034683156 100644
>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>> @@ -1679,10 +1679,10 @@ extern void contpte_clear_full_ptes(struct mm_struct
>>>> *mm, unsigned long addr,
>>>> extern pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
>>>> unsigned long addr, pte_t *ptep,
>>>> unsigned int nr, int full);
>>>> -extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>> - unsigned long addr, pte_t *ptep);
>>>> -extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>>>> - unsigned long addr, pte_t *ptep);
>>>> +extern int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
>>>> + unsigned long addr, pte_t *ptep, unsigned int nr);
>>>> +extern int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
>>>> + unsigned long addr, pte_t *ptep, unsigned int nr);
>>>
>>> The "contpte_" functions are intended to be private to the arm64 arch and should
>>> be exposed via the generic APIs. But I don't see any generic batched API for
>>> this, so you're only actually able to pass CONT_PTES as nr. Perhaps you're
>>> planning to define "test_and_clear_young_ptes()" and "clear_flush_young_ptes()"
>>> in later patches?
>>
>> Right. This is a preparation patch, and will be used in patch 2.
>>
>>>> extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>>>> pte_t *ptep, unsigned int nr);
>>>> extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>>>> @@ -1854,7 +1854,7 @@ static inline int ptep_test_and_clear_young(struct
>>>> vm_area_struct *vma,
>>>> if (likely(!pte_valid_cont(orig_pte)))
>>>> return __ptep_test_and_clear_young(vma, addr, ptep);
>>>> - return contpte_ptep_test_and_clear_young(vma, addr, ptep);
>>>> + return contpte_test_and_clear_young_ptes(vma, addr, ptep, CONT_PTES);
>>>> }
>>>> #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
>>>> @@ -1866,7 +1866,7 @@ static inline int ptep_clear_flush_young(struct
>>>> vm_area_struct *vma,
>>>> if (likely(!pte_valid_cont(orig_pte)))
>>>> return __ptep_clear_flush_young(vma, addr, ptep);
>>>> - return contpte_ptep_clear_flush_young(vma, addr, ptep);
>>>> + return contpte_clear_flush_young_ptes(vma, addr, ptep, CONT_PTES);
>>>> }
>>>> #define wrprotect_ptes wrprotect_ptes
>>>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>>>> index c0557945939c..19b122441be3 100644
>>>> --- a/arch/arm64/mm/contpte.c
>>>> +++ b/arch/arm64/mm/contpte.c
>>>> @@ -488,8 +488,9 @@ pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
>>>> }
>>>> EXPORT_SYMBOL_GPL(contpte_get_and_clear_full_ptes);
>>>> -int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>> - unsigned long addr, pte_t *ptep)
>>>> +int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
>>>> + unsigned long addr, pte_t *ptep,
>>>> + unsigned int nr)
>>>> {
>>>> /*
>>>> * ptep_clear_flush_young() technically requires us to clear the access
>>>> @@ -500,39 +501,56 @@ int contpte_ptep_test_and_clear_young(struct
>>>> vm_area_struct *vma,
>>>> * having to unfold.
>>>> */
>>>> + unsigned long start = addr;
>>>
>>> Personally I wouldn't bother defining start - just reuse addr. You're
>>> incrementing start in the below loop, so it's more appropriate to call it addr
>>> anyway.
>>
>> OK.
>>
>>>> + unsigned long end = start + nr * PAGE_SIZE;
>>>> int young = 0;
>>>> int i;
>>>> - ptep = contpte_align_down(ptep);
>>>> - addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>>> + if (pte_cont(__ptep_get(ptep + nr - 1)))
>>>> + end = ALIGN(end, CONT_PTE_SIZE);
>>>> - for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>>>> - young |= __ptep_test_and_clear_young(vma, addr, ptep);
>>>> + if (pte_cont(__ptep_get(ptep))) {
>>>> + start = ALIGN_DOWN(start, CONT_PTE_SIZE);
>>>> + ptep = contpte_align_down(ptep);
>>>> + }
>>>> +
>>>> + nr = (end - start) / PAGE_SIZE;
>>>> + for (i = 0; i < nr; i++, ptep++, start += PAGE_SIZE)
>>>
>>> Given you're now defining end, perhaps we don't need nr?
>>>
>>> for (; addr != end; ptep++, addr += PAGE_SIZE)
>>> young |= __ptep_test_and_clear_young(vma, addr, ptep);
>>
>> Yes, good point.
>>
>>>> + young |= __ptep_test_and_clear_young(vma, start, ptep);
>>>> return young;
>>>> }
>>>> -EXPORT_SYMBOL_GPL(contpte_ptep_test_and_clear_young);
>>>> +EXPORT_SYMBOL_GPL(contpte_test_and_clear_young_ptes);
>>>> -int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>>>> - unsigned long addr, pte_t *ptep)
>>>> +int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
>>>> + unsigned long addr, pte_t *ptep,
>>>> + unsigned int nr)
>>>> {
>>>> int young;
>>>> - young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
>>>> + young = contpte_test_and_clear_young_ptes(vma, addr, ptep, nr);
>>>> if (young) {
>>>> + unsigned long start = addr;
>>>> + unsigned long end = start + nr * PAGE_SIZE;
>>>> +
>>>> + if (pte_cont(__ptep_get(ptep + nr - 1)))
>>>> + end = ALIGN(end, CONT_PTE_SIZE);
>>>> +
>>>> + if (pte_cont(__ptep_get(ptep)))
>>>> + start = ALIGN_DOWN(start, CONT_PTE_SIZE);
>>>> +
>>>
>>> We now have this pattern of expanding contpte blocks up and down in 3 places.
>>> Perhaps create a helper?
>>
>> Sounds reasonable. How about the following helper?
>>
>> static pte_t *contpte_align_addr_ptep(unsigned long *start, unsigned long *end,
>> pte_t *ptep, unsigned int nr)
>> {
>> unsigned long end_addr = *start + nr * PAGE_SIZE;
>>
>> if (pte_cont(__ptep_get(ptep + nr - 1)))
>
> I think this is safe but calling it out to check; you're not checking that the
> pte is valid, so theoretically you could have a swap-entry here with whatever
> overlays the contiguous bit set. So then you would incorrectly extend.
>
> But I think it is safe because the expectation is that core-mm has already
> checked that the whole range is present?
Yes. They must be present PTEs that map consecutive pages of the same
large folio within a single VMA and a single page table. I will add some
comments to make this clear.
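Something like this, perhaps:

/*
 * Callers must ensure that [addr, addr + nr * PAGE_SIZE) covers present
 * PTEs that map consecutive pages of a single large folio, within a
 * single VMA and a single page table.
 */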
>> *end = ALIGN(end_addr, CONT_PTE_SIZE);
>>
>> if (pte_cont(__ptep_get(ptep))) {
>> *start = ALIGN_DOWN(*start, CONT_PTE_SIZE);
>> ptep = contpte_align_down(ptep);
>> }
>>
>> return ptep;
>> }
>
> Looks good.
Thanks for reviewing.
On Thu, Dec 11, 2025 at 04:16:54PM +0800, Baolin Wang wrote:
> Currently, contpte_ptep_test_and_clear_young() and contpte_ptep_clear_flush_young()
> only clear the young flag and flush TLBs for PTEs within the contiguous range.
> To support batch PTE operations for other sized large folios in the following
> patches, adding a new parameter to specify the number of PTEs.
>
> While we are at it, rename the functions to maintain consistency with other
> contpte_*() functions.
>
> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> ---
> arch/arm64/include/asm/pgtable.h | 12 ++++-----
> arch/arm64/mm/contpte.c | 44 ++++++++++++++++++++++----------
> 2 files changed, 37 insertions(+), 19 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 0944e296dd4a..e03034683156 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1679,10 +1679,10 @@ extern void contpte_clear_full_ptes(struct mm_struct *mm, unsigned long addr,
> extern pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
> unsigned long addr, pte_t *ptep,
> unsigned int nr, int full);
> -extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> - unsigned long addr, pte_t *ptep);
> -extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> - unsigned long addr, pte_t *ptep);
> +extern int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep, unsigned int nr);
> +extern int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep, unsigned int nr);
In core mm at least, as a convention, we strip 'extern' from header declarations
as we go as they're not necessary.
I wonder about this naming convention also. The contpte_xxx_() prefix
obviously implies we are operating upon PTEs in the contiguous range, but now
we are doing something... different, and 'nr' is a bit vague.
Is it the nr of contiguous ranges? Well in reality it's nr_pages right? But
now we're not saying they're necessarily contiguous in the sense of contpte...
I wonder whether we'd be better off introducing a new function that
explicitly has 'batch' or 'batched' in the name, and have the usual
contpte_***() stuff and callers that need batching using a new underlying helper?
Obviously I defer to Ryan on all this, just my thoughts...
> extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> pte_t *ptep, unsigned int nr);
> extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
> @@ -1854,7 +1854,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
> if (likely(!pte_valid_cont(orig_pte)))
> return __ptep_test_and_clear_young(vma, addr, ptep);
>
> - return contpte_ptep_test_and_clear_young(vma, addr, ptep);
> + return contpte_test_and_clear_young_ptes(vma, addr, ptep, CONT_PTES);
> }
>
> #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
> @@ -1866,7 +1866,7 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
> if (likely(!pte_valid_cont(orig_pte)))
> return __ptep_clear_flush_young(vma, addr, ptep);
>
> - return contpte_ptep_clear_flush_young(vma, addr, ptep);
> + return contpte_clear_flush_young_ptes(vma, addr, ptep, CONT_PTES);
> }
>
> #define wrprotect_ptes wrprotect_ptes
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index c0557945939c..19b122441be3 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -488,8 +488,9 @@ pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
> }
> EXPORT_SYMBOL_GPL(contpte_get_and_clear_full_ptes);
>
> -int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> - unsigned long addr, pte_t *ptep)
> +int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep,
> + unsigned int nr)
> {
> /*
> * ptep_clear_flush_young() technically requires us to clear the access
> @@ -500,39 +501,56 @@ int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> * having to unfold.
> */
Hmm, don't you need to update this comment now that 'nr' is a thing? E.g.:
"And since we only create a contig range when the range is covered by a single
folio, we can get away with clearing young for the whole contig range here, so
we avoid having to unfold."
However now you're allowing for large folios that are not contpte?
Overall I'm feeling pretty iffy about implementing it this way.
>
> + unsigned long start = addr;
> + unsigned long end = start + nr * PAGE_SIZE;
So now [addr, addr + nr * PAGE_SIZE) must for sure be within a single PTE
table?
> int young = 0;
> int i;
>
> - ptep = contpte_align_down(ptep);
> - addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> + if (pte_cont(__ptep_get(ptep + nr - 1)))
> + end = ALIGN(end, CONT_PTE_SIZE);
OK so I guess for PTE_CONT to be set, it must be aligned to CONT_PTE_SIZE,
with other PTE entries present there not having PTE_CONT set (Ryan - do
correct me if I'm wrong!)
I guess then no worry about running off the end of the PTE table here.
I wonder about using pte_cont_addr_end() here, but I guess you are not
intending to limit to a range here but rather capture the entire contpte
range of the end.
I don't love that now a function prefixed with contpte_... can have a
condition where part of the input range is _not_ a contpte entry, or is
specified partially...
>
> - for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
> - young |= __ptep_test_and_clear_young(vma, addr, ptep);
> + if (pte_cont(__ptep_get(ptep))) {
> + start = ALIGN_DOWN(start, CONT_PTE_SIZE);
> + ptep = contpte_align_down(ptep);
> + }
> +
> + nr = (end - start) / PAGE_SIZE;
Hm don't love this reuse of input param.
> + for (i = 0; i < nr; i++, ptep++, start += PAGE_SIZE)
> + young |= __ptep_test_and_clear_young(vma, start, ptep);
Again, overall find it a bit iffy that we are essentially overloading the
code with non-contpte behaviour.
>
> return young;
> }
> -EXPORT_SYMBOL_GPL(contpte_ptep_test_and_clear_young);
> +EXPORT_SYMBOL_GPL(contpte_test_and_clear_young_ptes);
>
> -int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> - unsigned long addr, pte_t *ptep)
> +int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep,
> + unsigned int nr)
> {
> int young;
>
> - young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
> + young = contpte_test_and_clear_young_ptes(vma, addr, ptep, nr);
>
> if (young) {
> + unsigned long start = addr;
> + unsigned long end = start + nr * PAGE_SIZE;
> +
> + if (pte_cont(__ptep_get(ptep + nr - 1)))
> + end = ALIGN(end, CONT_PTE_SIZE);
> +
> + if (pte_cont(__ptep_get(ptep)))
> + start = ALIGN_DOWN(start, CONT_PTE_SIZE);
> +
> /*
> * See comment in __ptep_clear_flush_young(); same rationale for
> * eliding the trailing DSB applies here.
> */
> - addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> - __flush_tlb_range_nosync(vma->vm_mm, addr, addr + CONT_PTE_SIZE,
> + __flush_tlb_range_nosync(vma->vm_mm, start, end,
> PAGE_SIZE, true, 3);
Similar comments to above.
Am curious as to Ryan's opinion.
> }
>
> return young;
> }
> -EXPORT_SYMBOL_GPL(contpte_ptep_clear_flush_young);
> +EXPORT_SYMBOL_GPL(contpte_clear_flush_young_ptes);
>
> void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> pte_t *ptep, unsigned int nr)
> --
> 2.47.3
>
On 2025/12/15 19:36, Lorenzo Stoakes wrote:
> On Thu, Dec 11, 2025 at 04:16:54PM +0800, Baolin Wang wrote:
>> Currently, contpte_ptep_test_and_clear_young() and contpte_ptep_clear_flush_young()
>> only clear the young flag and flush TLBs for PTEs within the contiguous range.
>> To support batch PTE operations for other sized large folios in the following
>> patches, adding a new parameter to specify the number of PTEs.
>>
>> While we are at it, rename the functions to maintain consistency with other
>> contpte_*() functions.
>>
>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>> ---
>> arch/arm64/include/asm/pgtable.h | 12 ++++-----
>> arch/arm64/mm/contpte.c | 44 ++++++++++++++++++++++----------
>> 2 files changed, 37 insertions(+), 19 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 0944e296dd4a..e03034683156 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -1679,10 +1679,10 @@ extern void contpte_clear_full_ptes(struct mm_struct *mm, unsigned long addr,
>> extern pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
>> unsigned long addr, pte_t *ptep,
>> unsigned int nr, int full);
>> -extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>> - unsigned long addr, pte_t *ptep);
>> -extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>> - unsigned long addr, pte_t *ptep);
>> +extern int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
>> + unsigned long addr, pte_t *ptep, unsigned int nr);
>> +extern int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
>> + unsigned long addr, pte_t *ptep, unsigned int nr);
>
> In core mm at least, as a convention, we strip 'extern' from header declarations
> as we go as they're not necessary.
Sure. I can remove the 'extern'.
> I wonder about this naming convention also. The contpte_xxx_() prefix
> obviously implies we are operating upon PTEs in the contiguous range, now
> we are doing something... different and 'nr' is a bit vague.
>
> Is it the nr of contiguous ranges? Well in reality it's nr_pages right? But
Yes, 'nr' here means nr_pages, which follows the convention of the
parameters in contpte_xxx_() functions.
> now we're not saying they're necessarily contiguous in the sense of contpte...
>
> I wonder whether we'd be better off introducing a new function that
> explicitly has 'batch' or 'batched' in the name, and have the usual
> contpte_***() stuff and callers that need batching using a new underlying helper?
OK. I get your point. However, this is outside the scope of my patchset.
Perhaps we can clean up all the contpte_xxx_() functions if everyone
agrees that this is confusing.
> Obviously I defer to Ryan on all this, just my thoughts...
>
>> extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>> pte_t *ptep, unsigned int nr);
>> extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>> @@ -1854,7 +1854,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>> if (likely(!pte_valid_cont(orig_pte)))
>> return __ptep_test_and_clear_young(vma, addr, ptep);
>>
>> - return contpte_ptep_test_and_clear_young(vma, addr, ptep);
>> + return contpte_test_and_clear_young_ptes(vma, addr, ptep, CONT_PTES);
>> }
>>
>> #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
>> @@ -1866,7 +1866,7 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>> if (likely(!pte_valid_cont(orig_pte)))
>> return __ptep_clear_flush_young(vma, addr, ptep);
>>
>> - return contpte_ptep_clear_flush_young(vma, addr, ptep);
>> + return contpte_clear_flush_young_ptes(vma, addr, ptep, CONT_PTES);
>> }
>>
>> #define wrprotect_ptes wrprotect_ptes
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> index c0557945939c..19b122441be3 100644
>> --- a/arch/arm64/mm/contpte.c
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -488,8 +488,9 @@ pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
>> }
>> EXPORT_SYMBOL_GPL(contpte_get_and_clear_full_ptes);
>>
>> -int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>> - unsigned long addr, pte_t *ptep)
>> +int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
>> + unsigned long addr, pte_t *ptep,
>> + unsigned int nr)
>> {
>> /*
>> * ptep_clear_flush_young() technically requires us to clear the access
>> @@ -500,39 +501,56 @@ int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>> * having to unfold.
>> */
>
> Hmm shouldn't you need to update this comment now 'nr' is a thing? E.g.:
>
> "And since we only create a contig range when the range is covered by a single
> folio, we can get away with clearing young for the whole contig range here, so
> we avoid having to unfold."
I think the original comments are still clear:
"
/*
* ptep_clear_flush_young() technically requires us to clear the access
* flag for a _single_ pte. However, the core-mm code actually tracks
* access/dirty per folio, not per page. And since we only create a
* contig range when the range is covered by a single folio, we can get
* away with clearing young for the whole contig range here, so we avoid
* having to unfold.
*/
"
> However now you're allowing for large folios that are not contpte?
>
> Overall feeling pretty iffy about implementing this this way.
Please see the above comments, which explain why we should do that.
>> + unsigned long start = addr;
>> + unsigned long end = start + nr * PAGE_SIZE;
>
> So now [addr, addr + nr * PAGE_SIZE) must for sure be within a single PTE
> table?
The caller has made sure of that (see folio_referenced_one()).
>> int young = 0;
>> int i;
>>
>> - ptep = contpte_align_down(ptep);
>> - addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> + if (pte_cont(__ptep_get(ptep + nr - 1)))
>> + end = ALIGN(end, CONT_PTE_SIZE);
>
> OK so I guess for PTE_CONT to be set, it must be aligned to CONT_PTE_SIZE,
> with other PTE entries present there not having PTE_CONT set (Ryan - do
> correct me if I'm wrong!)
>
> I guess then no worry about running off the end of the PTE table here.
>
> I wonder about using pte_cont_addr_end() here, but I guess you are not
> intending to limit to a range here but rather capture the entire contpte
> range of the end.
Right.
> I don't love that now a function prefixed with contpte_... can have a
> condition where part of the input range is _not_ a contpte entry, or is
> specified partially...
Like I said above, this is outside the scope of my patchset. If Ryan
also thinks we need to do some cleanups for all the contpte_xxx()
functions in the contpte.c file, we can send a follow-up patchset to
rename all the contpte_xxx() functions to make it clear.
>> - for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>> - young |= __ptep_test_and_clear_young(vma, addr, ptep);
>> + if (pte_cont(__ptep_get(ptep))) {
>> + start = ALIGN_DOWN(start, CONT_PTE_SIZE);
>> + ptep = contpte_align_down(ptep);
>> + }
>> +
>> + nr = (end - start) / PAGE_SIZE;
>
> Hm don't love this reuse of input param.
OK. Will use another local variable instead.
>
>> + for (i = 0; i < nr; i++, ptep++, start += PAGE_SIZE)
>> + young |= __ptep_test_and_clear_young(vma, start, ptep);
>
> Again, overall find it a bit iffy that we are essentially overloading the
> code with non-contpte behaviour.
Let me clarify further: this is already the handling logic of the contpte_xxx()
functions in contpte.c. For example, contpte_set_ptes() has a parameter 'nr'
that may be greater than CONT_PTES, meaning it can handle a large folio that
spans multiple CONT_PTES-sized ranges.
It might be a bit confusing, and I understand that is why you want to change
the 'contpte_' prefix to a 'batch_' prefix. Hopefully Ryan can provide some
input on this.
Thanks for reviewing.
On Tue, Dec 16, 2025 at 11:32:58AM +0800, Baolin Wang wrote:
>
>
> On 2025/12/15 19:36, Lorenzo Stoakes wrote:
> > On Thu, Dec 11, 2025 at 04:16:54PM +0800, Baolin Wang wrote:
> > > Currently, contpte_ptep_test_and_clear_young() and contpte_ptep_clear_flush_young()
> > > only clear the young flag and flush TLBs for PTEs within the contiguous range.
> > > To support batch PTE operations for other sized large folios in the following
> > > patches, adding a new parameter to specify the number of PTEs.
> > >
> > > While we are at it, rename the functions to maintain consistency with other
> > > contpte_*() functions.
> > >
> > > Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > > ---
> > > arch/arm64/include/asm/pgtable.h | 12 ++++-----
> > > arch/arm64/mm/contpte.c | 44 ++++++++++++++++++++++----------
> > > 2 files changed, 37 insertions(+), 19 deletions(-)
> > >
> > > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > > index 0944e296dd4a..e03034683156 100644
> > > --- a/arch/arm64/include/asm/pgtable.h
> > > +++ b/arch/arm64/include/asm/pgtable.h
> > > @@ -1679,10 +1679,10 @@ extern void contpte_clear_full_ptes(struct mm_struct *mm, unsigned long addr,
> > > extern pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
> > > unsigned long addr, pte_t *ptep,
> > > unsigned int nr, int full);
> > > -extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> > > - unsigned long addr, pte_t *ptep);
> > > -extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> > > - unsigned long addr, pte_t *ptep);
> > > +extern int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
> > > + unsigned long addr, pte_t *ptep, unsigned int nr);
> > > +extern int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
> > > + unsigned long addr, pte_t *ptep, unsigned int nr);
> >
> > In core mm at least, as a convention, we strip 'extern' from header declarations
> > as we go as they're not necessary.
>
> Sure. I can remove the 'extern'.
Thanks.
>
> > I wonder about this naming convention also. The contpte_xxx_() prefix
> > obviously implies we are operating upon PTEs in the contiguous range, now
> > we are doing something... different and 'nr' is a bit vague.
> >
> > Is it the nr of contiguous ranges? Well in reality it's nr_pages right? But
>
> Yes, 'nr' here means nr_pages, which follows the convention of the
> parameters in contpte_xxx_() functions.
Except you're violating the convention already by doing non-contpte stuff in
contpte functions?
The confusion is between nr of contpte ranges and nr of pages, so I think we
should be explicit.
>
> > now we're not saying they're necessarily contiguous in the sense of contpte...
> >
> > I wonder whether we'd be better off introducing a new function that
> > explicitly has 'batch' or 'batched' in the name, and have the usual
> > contpte_***() stuff and callers that need batching using a new underlying helper?
>
> OK. I get your point. However, this is outside the scope of my patchset.
Well, no, this is review feedback that needs to be addressed, I really don't
like this patch as-is.
So for this to be upstreamable you really need to separate this out.
>
> Perhaps we can clean up all the contpte_xxx_() functions if everyone agrees
> that this is confusing.
I mean, unless Ryan explicitly makes a case for this being fine, I'm inclined to
reject the patch unless we do this.
The contpte logic is already very confusing, and this is only adding to it.
>
> > Obviously I defer to Ryan on all this, just my thoughts...
> >
> > > extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> > > pte_t *ptep, unsigned int nr);
> > > extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
> > > @@ -1854,7 +1854,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
> > > if (likely(!pte_valid_cont(orig_pte)))
> > > return __ptep_test_and_clear_young(vma, addr, ptep);
> > >
> > > - return contpte_ptep_test_and_clear_young(vma, addr, ptep);
> > > + return contpte_test_and_clear_young_ptes(vma, addr, ptep, CONT_PTES);
> > > }
> > >
> > > #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
> > > @@ -1866,7 +1866,7 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
> > > if (likely(!pte_valid_cont(orig_pte)))
> > > return __ptep_clear_flush_young(vma, addr, ptep);
> > >
> > > - return contpte_ptep_clear_flush_young(vma, addr, ptep);
> > > + return contpte_clear_flush_young_ptes(vma, addr, ptep, CONT_PTES);
> > > }
> > >
> > > #define wrprotect_ptes wrprotect_ptes
> > > diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> > > index c0557945939c..19b122441be3 100644
> > > --- a/arch/arm64/mm/contpte.c
> > > +++ b/arch/arm64/mm/contpte.c
> > > @@ -488,8 +488,9 @@ pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
> > > }
> > > EXPORT_SYMBOL_GPL(contpte_get_and_clear_full_ptes);
> > >
> > > -int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> > > - unsigned long addr, pte_t *ptep)
> > > +int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
> > > + unsigned long addr, pte_t *ptep,
> > > + unsigned int nr)
> > > {
> > > /*
> > > * ptep_clear_flush_young() technically requires us to clear the access
> > > @@ -500,39 +501,56 @@ int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> > > * having to unfold.
> > > */
> >
> > Hmm shouldn't you need to update this comment now 'nr' is a thing? E.g.:
> >
> > "And since we only create a contig range when the range is covered by a single
> > folio, we can get away with clearing young for the whole contig range here, so
> > we avoid having to unfold."
>
> I think the original comments are still clear:
> "
> /*
> * ptep_clear_flush_young() technically requires us to clear the access
> * flag for a _single_ pte. However, the core-mm code actually tracks
> * access/dirty per folio, not per page. And since we only create a
> * contig range when the range is covered by a single folio, we can get
> * away with clearing young for the whole contig range here, so we avoid
> * having to unfold.
> */
> "
Except nr means you can span more than a single contig pte range? So this
comment is no longer applicable as-is?
You've changed the function, the comment needs to change too.
>
> > However now you're allowing for large folios that are not contpte?
> >
> > Overall feeling pretty iffy about implementing this this way.
>
> Please see the above comments, which explain why we should do that.
You haven't explained anything? Maybe I missed it?
Can you explain clearly what nr might actually be in relation to contpte's? I'm
confused even as to the utility here.
Is it batched ranges of contptes? But then why are you trying to see if end is a
contpte + expand the range if so?
>
> > > + unsigned long start = addr;
> > > + unsigned long end = start + nr * PAGE_SIZE;
> >
> > So now [addr, addr + nr * PAGE_SIZE) must for sure be within a single PTE
> > table?
>
> Caller has made sure that (see folio_referenced_one()).
You should comment this somehow, I hate this nested assumptions stuff 'function
X which calls function Y which calls function Z makes sure that parameter A
fulfils requirements B but we don't mention it anywhere'.
>
> > > int young = 0;
> > > int i;
> > >
> > > - ptep = contpte_align_down(ptep);
> > > - addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> > > + if (pte_cont(__ptep_get(ptep + nr - 1)))
> > > + end = ALIGN(end, CONT_PTE_SIZE);
> >
> > OK so I guess for PTE_CONT to be set, it must be aligned to CONT_PTE_SIZE,
> > with other PTE entries present there not having PTE_CONT set (Ryan - do
> > correct me if I'm wrong!)
> >
> > I guess then no worry about running off the end of the PTE table here.
> >
> > I wonder about using pte_cont_addr_end() here, but I guess you are not
> > intending to limit to a range here but rather capture the entire contpte
> > range of the end.
>
> Right.
>
> > I don't love that now a function prefixed with contpte_... can have a
> > condition where part of the input range is _not_ a contpte entry, or is
> > specified partially...
>
> Like I said above, this is outside the scope of my patchset. If Ryan also
> thinks we need to do some cleanups for all the contpte_xxx() functions in
> the contpte.c file, we can send a follow-up patchset to rename all the
> contpte_xxx() functions to make it clear.
It's really not outside of the scope of the patch set, this is review feedback
you have to address :) trying not to be a grumpy kernel dev here :>)
In general in the kernel we don't accept confusing patches then defer making
them not-confusing until later.
We review until the series is at the correct standard to be accepted before we
merge.
As always I'm happy to be convinced that I'm mistaken or something's not
necessary, however!
>
> > > - for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
> > > - young |= __ptep_test_and_clear_young(vma, addr, ptep);
> > > + if (pte_cont(__ptep_get(ptep))) {
> > > + start = ALIGN_DOWN(start, CONT_PTE_SIZE);
> > > + ptep = contpte_align_down(ptep);
> > > + }
> > > +
> > > + nr = (end - start) / PAGE_SIZE;
> >
> > Hm don't love this reuse of input param.
>
> OK. Will use another local variable instead.
Thanks.
>
> >
> > > + for (i = 0; i < nr; i++, ptep++, start += PAGE_SIZE)
> > > + young |= __ptep_test_and_clear_young(vma, start, ptep);
> >
> > Again, overall find it a bit iffy that we are essentially overloading the
> > code with non-contpte behaviour.
>
> Let me clarify further: this is the handling logic of the contpte_xxx()
> functions in the contpte.c file. For example, the function
> contpte_set_ptes() has a parameter 'nr', which may be greater than
> CONT_PTES, meaning that it can handle multiple CONT_PTES range sizes of a
> large folio.
>
> It might be a bit confusing, and I understand this is why you want to change
> the 'contpte_' prefix to the 'batch_' prefix. Of course, we can hope Ryan
> can provide some input on this.
>
> Thanks for reviewing.
OK so we have some reeeeal confusion here. And this speaks to the patch needing
work.
This seems to answer what I asked above - basically - are we doing _batches_ of
_contpte_ entries, or are we just accepting arbitrary large folios here and
batching up operations on them, including non-contpte entries.
Presumably from the above you're saying - this is dealing with batches of
contpte entries.
In which case this has to be made _super_ clear. Not just adding an arbitrary
'nr' parameter and letting people guess.
The contpte code is _already_ confusing and somewhat surprising if you're not
aware of it, we need to be going in the other direction.
Thanks, Lorenzo
On 2025/12/16 19:11, Lorenzo Stoakes wrote:
> On Tue, Dec 16, 2025 at 11:32:58AM +0800, Baolin Wang wrote:
>>
>>
>> On 2025/12/15 19:36, Lorenzo Stoakes wrote:
>>> On Thu, Dec 11, 2025 at 04:16:54PM +0800, Baolin Wang wrote:
>>>> Currently, contpte_ptep_test_and_clear_young() and contpte_ptep_clear_flush_young()
>>>> only clear the young flag and flush TLBs for PTEs within the contiguous range.
>>>> To support batch PTE operations for other sized large folios in the following
>>>> patches, adding a new parameter to specify the number of PTEs.
>>>>
>>>> While we are at it, rename the functions to maintain consistency with other
>>>> contpte_*() functions.
>>>>
>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>> ---
>>>> arch/arm64/include/asm/pgtable.h | 12 ++++-----
>>>> arch/arm64/mm/contpte.c | 44 ++++++++++++++++++++++----------
>>>> 2 files changed, 37 insertions(+), 19 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>>> index 0944e296dd4a..e03034683156 100644
>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>> @@ -1679,10 +1679,10 @@ extern void contpte_clear_full_ptes(struct mm_struct *mm, unsigned long addr,
>>>> extern pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
>>>> unsigned long addr, pte_t *ptep,
>>>> unsigned int nr, int full);
>>>> -extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>> - unsigned long addr, pte_t *ptep);
>>>> -extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>>>> - unsigned long addr, pte_t *ptep);
>>>> +extern int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
>>>> + unsigned long addr, pte_t *ptep, unsigned int nr);
>>>> +extern int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
>>>> + unsigned long addr, pte_t *ptep, unsigned int nr);
>>>
>>> In core mm at least, as a convention, we strip 'extern' from header declarations
>>> as we go as they're not necessary.
>>
>> Sure. I can remove the 'extern'.
>
> Thanks.
>
>>
>>> I wonder about this naming convention also. The contpte_xxx_() prefix
>>> obviously implies we are operating upon PTEs in the contiguous range, now
>>> we are doing something... different and 'nr' is a bit vague.
>>>
>>> Is it the nr of contiguous ranges? Well in reality it's nr_pages right? But
>>
>> Yes, 'nr' here means nr_pages, which follows the convention of the
>> parameters in contpte_xxx_() functions.
>
> Except you're violating the convention already by doing non-contpte stuff in
> contpte functions?
Nope. Sorry for my poor English, which might have caused some confusion.
I followed the contpte convention; let me try to explain it again below.
> The confusion is between nr of contpte ranges and nr of pages, so I think we
> should be explicit.
>
>>
>>> now we're not saying they're necessarily contiguous in the sense of contpte...
>>>
>>> I wonder whether we'd be better off introducing a new function that
>>> explicitly has 'batch' or 'batched' in the name, and have the usual
>>> contpte_***() stuff and callers that need batching using a new underlying helper?
>>
>> OK. I get your point. However, this is outside the scope of my patchset.
>
> Well, no, this is review feedback that needs to be addressed, I really don't
> like this patch as-is.
>
> So for this to be upstreamable you really need to separate this out.
>
>>
>> Perhaps we can clean up all the contpte_xxx_() functions if everyone agrees
>> that this is confusing.
>
> I mean, unless Ryan explicitly makes a case for this being fine, I'm inclined to
> reject the patch unless we do this.
>
> The contpte logic is already very confusing, and this is only adding to it.
>
>>
>>> Obviously I defer to Ryan on all this, just my thoughts...
>>>
>>>> extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>>>> pte_t *ptep, unsigned int nr);
>>>> extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>>>> @@ -1854,7 +1854,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>> if (likely(!pte_valid_cont(orig_pte)))
>>>> return __ptep_test_and_clear_young(vma, addr, ptep);
>>>>
>>>> - return contpte_ptep_test_and_clear_young(vma, addr, ptep);
>>>> + return contpte_test_and_clear_young_ptes(vma, addr, ptep, CONT_PTES);
>>>> }
>>>>
>>>> #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
>>>> @@ -1866,7 +1866,7 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>>>> if (likely(!pte_valid_cont(orig_pte)))
>>>> return __ptep_clear_flush_young(vma, addr, ptep);
>>>>
>>>> - return contpte_ptep_clear_flush_young(vma, addr, ptep);
>>>> + return contpte_clear_flush_young_ptes(vma, addr, ptep, CONT_PTES);
>>>> }
>>>>
>>>> #define wrprotect_ptes wrprotect_ptes
>>>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>>>> index c0557945939c..19b122441be3 100644
>>>> --- a/arch/arm64/mm/contpte.c
>>>> +++ b/arch/arm64/mm/contpte.c
>>>> @@ -488,8 +488,9 @@ pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
>>>> }
>>>> EXPORT_SYMBOL_GPL(contpte_get_and_clear_full_ptes);
>>>>
>>>> -int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>> - unsigned long addr, pte_t *ptep)
>>>> +int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
>>>> + unsigned long addr, pte_t *ptep,
>>>> + unsigned int nr)
>>>> {
>>>> /*
>>>> * ptep_clear_flush_young() technically requires us to clear the access
>>>> @@ -500,39 +501,56 @@ int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>> * having to unfold.
>>>> */
>>>
>>> Hmm shouldn't you need to update this comment now 'nr' is a thing? E.g.:
>>>
>>> "And since we only create a contig range when the range is covered by a single
>>> folio, we can get away with clearing young for the whole contig range here, so
>>> we avoid having to unfold."
>>
>> I think the original comments are still clear:
>> "
>> /*
>> * ptep_clear_flush_young() technically requires us to clear the access
>> * flag for a _single_ pte. However, the core-mm code actually tracks
>> * access/dirty per folio, not per page. And since we only create a
>> * contig range when the range is covered by a single folio, we can get
>> * away with clearing young for the whole contig range here, so we avoid
>> * having to unfold.
>> */
>> "
>
> Except nr means you can span more than a single contig pte range? So this
> comment is no longer applicable as-is?
The nr means the number of consecutive (present) PTEs that map consecutive
pages of the same large folio in a single VMA and a single page table.
I think this is a prerequisite. If you think it's necessary to explain
it, I can add it to the comments.
> You've changed the function, the comment needs to change too.
>
>>
>>> However now you're allowing for large folios that are not contpte?
>>>
>>> Overall feeling pretty iffy about implementing this this way.
>>
>> Please see the above comments, which explain why we should do that.
>
> You haven't explained anything? Maybe I missed it?
>
> Can you explain clearly what nr might actually be in relation to contpte's? I'm
> confused even as to the utility here.
>
> Is it batched ranges of contptes? But then why are you trying to see if end is a
> contpte + expand the range if so?
See below.
>>>> + unsigned long start = addr;
>>>> + unsigned long end = start + nr * PAGE_SIZE;
>>>
>>> So now [addr, addr + nr * PAGE_SIZE) must for sure be within a single PTE
>>> table?
>>
>> Caller has made sure that (see folio_referenced_one()).
>
> You should comment this somehow, I hate this nested assumptions stuff 'function
> X which calls function Y which calls function Z makes sure that parameter A
> fulfils requirements B but we don't mention it anywhere'.
Sure.
>>>> int young = 0;
>>>> int i;
>>>>
>>>> - ptep = contpte_align_down(ptep);
>>>> - addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>>> + if (pte_cont(__ptep_get(ptep + nr - 1)))
>>>> + end = ALIGN(end, CONT_PTE_SIZE);
>>>
>>> OK so I guess for PTE_CONT to be set, it must be aligned to CONT_PTE_SIZE,
>>> with other PTE entries present there not having PTE_CONT set (Ryan - do
>>> correct me if I'm wrong!)
>>>
>>> I guess then no worry about running off the end of the PTE table here.
>>>
>>> I wonder about using pte_cont_addr_end() here, but I guess you are not
>>> intending to limit to a range here but rather capture the entire contpte
>>> range of the end.
>>
>> Right.
>>
>>> I don't love that now a function prefixed with contpte_... can have a
>>> condition where part of the input range is _not_ a contpte entry, or is
>>> specified partially...
>>
>> Like I said above, this is outside the scope of my patchset. If Ryan also
>> thinks we need to do some cleanups for all the contpte_xxx() functions in
>> the contpte.c file, we can send a follow-up patchset to rename all the
>> contpte_xxx() functions to make it clear.
>
> It's really not outside of the scope of the patch set, this is review feedback
> you have to address :) trying not to be a grumpy kernel dev here :>)
>
> In general in the kernel we don't accept confusing patches then defer making
> them not-confusing until later.
>
> We review until the series is at the correct standard to be accepted before we
> merge.
>
> As always I'm happy to be convinced that I'm mistaken or something's not
> necesary, however!
Yes, let me try to explain it again. See below.
>>>> - for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>>>> - young |= __ptep_test_and_clear_young(vma, addr, ptep);
>>>> + if (pte_cont(__ptep_get(ptep))) {
>>>> + start = ALIGN_DOWN(start, CONT_PTE_SIZE);
>>>> + ptep = contpte_align_down(ptep);
>>>> + }
>>>> +
>>>> + nr = (end - start) / PAGE_SIZE;
>>>
>>> Hm don't love this reuse of input param.
>>
>> OK. Will use another local variable instead.
>
> Thanks.
>
>>
>>>
>>>> + for (i = 0; i < nr; i++, ptep++, start += PAGE_SIZE)
>>>> + young |= __ptep_test_and_clear_young(vma, start, ptep);
>>>
>>> Again, overall find it a bit iffy that we are essentially overloading the
>>> code with non-contpte behaviour.
>>
>> Let me clarify further: this is the handling logic of the contpte_xxx()
>> functions in the contpte.c file. For example, the function
>> contpte_set_ptes() has a parameter 'nr', which may be greater than
>> CONT_PTES, meaning that it can handle multiple CONT_PTES range sizes of a
>> large folio.
>>
>> It might be a bit confusing, and I understand this is why you want to change
>> the 'contpte_' prefix to the 'batch_' prefix. Of course, we can hope Ryan
>> can provide some input on this.
>>
>> Thanks for reviewing.
>
> OK so we have some reeeeal confusion here. And this speaks to the patch needing
> work.
>
> This seems to answer what I asked above - basically - are we doing _batches_ of
> _contpte_ entries, or are we just accepting arbitrary large folios here and
> batching up operations on them, including non-contpte entries.
>
> Presumably from the above you're saying - this is dealing with batches of
> contpte entries.
>
> In which case this has to be made _super_ clear. Not just adding an arbitrary
> 'nr' parameter and letting people guess.
>
> The contpte code is _already_ confusing and somewhat surprising if you're not
> aware of it, we need to be going in the other direction.
Sure. Sorry for causing confusion. Let me try to explain it again to make it
clear.
First, it is necessary to explain how the contpte implementation in
Arm64 is exposed, using set_ptes() as an example.
Arm64 defines an arch-specific implementation of set_ptes():
"
static __always_inline void set_ptes(struct mm_struct *mm, unsigned long addr,
                                     pte_t *ptep, pte_t pte, unsigned int nr)
{
        pte = pte_mknoncont(pte);

        if (likely(nr == 1)) {
                contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
                __set_ptes(mm, addr, ptep, pte, 1);
                contpte_try_fold(mm, addr, ptep, pte);
        } else {
                contpte_set_ptes(mm, addr, ptep, pte, nr);
        }
}
"
Here, 'nr' refers to the number of consecutive (present) PTEs that map
consecutive pages of the same large folio within a single VMA and a single
page table. Based on 'nr', the contpte code determines whether the PTE_CONT
attribute should be set or not.
Second, the underlying implementation of contpte_set_ptes() also
calculates 'end' based on 'nr' and checks whether the PTE_CONT attribute
should be set. For example:
For a large folio with order = 3 (nr = 8), the PTE_CONT attribute will
not be set since the 'next' is not aligned with CONT_PTE_MASK.
For a large folio with order = 4 (nr = 16), it aligns perfectly, allowing
the PTE_CONT attribute to be set for all of the PTE entries of the large
folio.
For a large folio with order = 5 (nr = 32), this involves two ranges of
CONT_PTE_SIZE (16 PTEs in each range), requiring the PTE_CONT attribute
to be set separately for each range.
"
void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
                      pte_t *ptep, pte_t pte, unsigned int nr)
{
        .......

        end = addr + (nr << PAGE_SHIFT);
        pfn = pte_pfn(pte);
        prot = pte_pgprot(pte);

        do {
                next = pte_cont_addr_end(addr, end);
                nr = (next - addr) >> PAGE_SHIFT;
                pte = pfn_pte(pfn, prot);

                if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
                        pte = pte_mkcont(pte);
                else
                        pte = pte_mknoncont(pte);

                __set_ptes(mm, addr, ptep, pte, nr);

                addr = next;
                ptep += nr;
                pfn += nr;

        } while (addr != end);
}
"
Third, regarding the implementation of my patch, I have followed the
contpte logic here by adding the 'nr' parameter to
contpte_ptep_test_and_clear_young() for batch processing consecutive
(present) PTEs. Moreover, the semantics of 'nr' remain consistent with
the original contpte functions, and I have not broken the existing
contpte implementation logic.
On Wed, Dec 17, 2025 at 11:53:31AM +0800, Baolin Wang wrote:
>
>
> On 2025/12/16 19:11, Lorenzo Stoakes wrote:
> > On Tue, Dec 16, 2025 at 11:32:58AM +0800, Baolin Wang wrote:
> > >
> > >
> > > On 2025/12/15 19:36, Lorenzo Stoakes wrote:
> > > > On Thu, Dec 11, 2025 at 04:16:54PM +0800, Baolin Wang wrote:
> > > > > Currently, contpte_ptep_test_and_clear_young() and contpte_ptep_clear_flush_young()
> > > > > only clear the young flag and flush TLBs for PTEs within the contiguous range.
> > > > > To support batch PTE operations for other sized large folios in the following
> > > > > patches, adding a new parameter to specify the number of PTEs.
> > > > >
> > > > > While we are at it, rename the functions to maintain consistency with other
> > > > > contpte_*() functions.
> > > > >
> > > > > Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
> > > > > ---
> > > > > arch/arm64/include/asm/pgtable.h | 12 ++++-----
> > > > > arch/arm64/mm/contpte.c | 44 ++++++++++++++++++++++----------
> > > > > 2 files changed, 37 insertions(+), 19 deletions(-)
> > > > >
> > > > > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > > > > index 0944e296dd4a..e03034683156 100644
> > > > > --- a/arch/arm64/include/asm/pgtable.h
> > > > > +++ b/arch/arm64/include/asm/pgtable.h
> > > > > @@ -1679,10 +1679,10 @@ extern void contpte_clear_full_ptes(struct mm_struct *mm, unsigned long addr,
> > > > > extern pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
> > > > > unsigned long addr, pte_t *ptep,
> > > > > unsigned int nr, int full);
> > > > > -extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> > > > > - unsigned long addr, pte_t *ptep);
> > > > > -extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> > > > > - unsigned long addr, pte_t *ptep);
> > > > > +extern int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
> > > > > + unsigned long addr, pte_t *ptep, unsigned int nr);
> > > > > +extern int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
> > > > > + unsigned long addr, pte_t *ptep, unsigned int nr);
> > > >
> > > > In core mm at least, as a convention, we strip 'extern' from header declarations
> > > > as we go as they're not necessary.
> > >
> > > Sure. I can remove the 'extern'.
> >
> > Thanks.
> >
> > >
> > > > I wonder about this naming convention also. The contpte_xxx_() prefix
> > > > obviously implies we are operating upon PTEs in the contiguous range, now
> > > > we are doing something... different and 'nr' is a bit vague.
> > > >
> > > > Is it the nr of contiguous ranges? Well in reality it's nr_pages right? But
> > >
> > > Yes, 'nr' here means nr_pages, which follows the convention of the
> > > parameters in contpte_xxx_() functions.
> >
> > Except you're violating the convention already by doing non-contpte stuff in
> > contpte functions?
>
> Nope. Sorry for my poor English, which might have caused some confusion. I
> followed the contpte's convention, and let me try to explain it again below.
>
> > The confusion is between nr of contpte ranges and nr of pages, so I think we
> > should be explicit.
> >
> > >
> > > > now we're not saying they're necessarily contiguous in the sense of contpte...
> > > >
> > > > I wonder whether we'd be better off introducing a new function that
> > > > explicitly has 'batch' or 'batched' in the name, and have the usual
> > > > contpte_***() stuff and callers that need batching using a new underlying helper?
> > >
> > > OK. I get your point. However, this is outside the scope of my patchset.
> >
> > Well, no, this is review feedback that needs to be addressed, I really don't
> > like this patch as-is.
> >
> > So for this to be upstreamable you really need to separate this out.
> >
> > >
> > > Perhaps we can clean up all the contpte_xxx_() functions if everyone agrees
> > > that this is confusing.
> >
> > I mean, unless Ryan explicitly makes a case for this being fine, I'm inclined to
> > reject the patch unless we do this.
> >
> > The contpte logic is already very confusing, and this is only adding to it.
> >
> > >
> > > > Obviously I defer to Ryan on all this, just my thoughts...
> > > >
> > > > > extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> > > > > pte_t *ptep, unsigned int nr);
> > > > > extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
> > > > > @@ -1854,7 +1854,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
> > > > > if (likely(!pte_valid_cont(orig_pte)))
> > > > > return __ptep_test_and_clear_young(vma, addr, ptep);
> > > > >
> > > > > - return contpte_ptep_test_and_clear_young(vma, addr, ptep);
> > > > > + return contpte_test_and_clear_young_ptes(vma, addr, ptep, CONT_PTES);
> > > > > }
> > > > >
> > > > > #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
> > > > > @@ -1866,7 +1866,7 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
> > > > > if (likely(!pte_valid_cont(orig_pte)))
> > > > > return __ptep_clear_flush_young(vma, addr, ptep);
> > > > >
> > > > > - return contpte_ptep_clear_flush_young(vma, addr, ptep);
> > > > > + return contpte_clear_flush_young_ptes(vma, addr, ptep, CONT_PTES);
> > > > > }
> > > > >
> > > > > #define wrprotect_ptes wrprotect_ptes
> > > > > diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> > > > > index c0557945939c..19b122441be3 100644
> > > > > --- a/arch/arm64/mm/contpte.c
> > > > > +++ b/arch/arm64/mm/contpte.c
> > > > > @@ -488,8 +488,9 @@ pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
> > > > > }
> > > > > EXPORT_SYMBOL_GPL(contpte_get_and_clear_full_ptes);
> > > > >
> > > > > -int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> > > > > - unsigned long addr, pte_t *ptep)
> > > > > +int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
> > > > > + unsigned long addr, pte_t *ptep,
> > > > > + unsigned int nr)
> > > > > {
> > > > > /*
> > > > > * ptep_clear_flush_young() technically requires us to clear the access
> > > > > @@ -500,39 +501,56 @@ int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> > > > > * having to unfold.
> > > > > */
> > > >
> > > > Hmm shouldn't you need to update this comment now 'nr' is a thing? E.g.:
> > > >
> > > > "And since we only create a contig range when the range is covered by a single
> > > > folio, we can get away with clearing young for the whole contig range here, so
> > > > we avoid having to unfold."
> > >
> > > I think the original comments are still clear:
> > > "
> > > /*
> > > * ptep_clear_flush_young() technically requires us to clear the access
> > > * flag for a _single_ pte. However, the core-mm code actually tracks
> > > * access/dirty per folio, not per page. And since we only create a
> > > * contig range when the range is covered by a single folio, we can get
> > > * away with clearing young for the whole contig range here, so we avoid
> > > * having to unfold.
> > > */
> > > "
> >
> > Except nr means you can span more than a single contig pte range? So this
> > comment is no longer applicable as-is?
>
> The nr means consecutive (present) PTEs that map consecutive pages of the
> same large folio in a single VMA and a single page table.
>
> I think this is a prerequisite. If you think it's necessary to explain it, I
> can add it to the comments.
>
> > You've changed the function, the comment needs to change too.
> >
> > >
> > > > However now you're allowing for large folios that are not contpte?
> > > >
> > > > Overall feeling pretty iffy about implementing this this way.
> > >
> > > Please see the above comments, which explain why we should do that.
> >
> > You haven't explained anything? Maybe I missed it?
> >
> > Can you explain clearly what nr might actually be in relation to contpte's? I'm
> > confused even as to the utility here.
> >
> > Is it batched ranges of contptes? But then why are you trying to see if end is a
> > contpte + expand the range if so?
>
> See below.
>
> > > > > + unsigned long start = addr;
> > > > > + unsigned long end = start + nr * PAGE_SIZE;
> > > >
> > > > So now [addr, addr + nr * PAGE_SIZE) must for sure be within a single PTE
> > > > table?
> > >
> > > Caller has made sure that (see folio_referenced_one()).
> >
> > You should comment this somehow, I hate this nested assumptions stuff 'function
> > X which calls function Y which calls function Z makes sure that parameter A
> > fulfils requirements B but we don't mention it anywhere'.
>
> Sure.
>
> > > > > int young = 0;
> > > > > int i;
> > > > >
> > > > > - ptep = contpte_align_down(ptep);
> > > > > - addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> > > > > + if (pte_cont(__ptep_get(ptep + nr - 1)))
> > > > > + end = ALIGN(end, CONT_PTE_SIZE);
> > > >
> > > > OK so I guess for PTE_CONT to be set, it must be aligned to CONT_PTE_SIZE,
> > > > with other PTE entries present there not having PTE_CONT set (Ryan - do
> > > > correct me if I'm wrong!)
> > > >
> > > > I guess then no worry about running off the end of the PTE table here.
> > > >
> > > > I wonder about using pte_cont_addr_end() here, but I guess you are not
> > > > intending to limit to a range here but rather capture the entire contpte
> > > > range of the end.
> > >
> > > Right.
> > >
> > > > I don't love that now a function prefixed with contpte_... can have a
> > > > condition where part of the input range is _not_ a contpte entry, or is
> > > > specified partially...
> > >
> > > Like I said above, this is outside the scope of my patchset. If Ryan also
> > > thinks we need to do some cleanups for all the contpte_xxx() functions in
> > > the contpte.c file, we can send a follow-up patchset to rename all the
> > > contpte_xxx() functions to make it clear.
> >
> > It's really not outside of the scope of the patch set, this is review feedback
> > you have to address :) trying not to be a grumpy kernel dev here :>)
> >
> > In general in the kernel we don't accept confusing patches then defer making
> > them not-confusing until later.
> >
> > We review until the series is at the correct standard to be accepted before we
> > merge.
> >
> > As always I'm happy to be convinced that I'm mistaken or something's not
> > necesary, however!
>
> Yes, let me try to explain it again. See below.
>
> > > > > - for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
> > > > > - young |= __ptep_test_and_clear_young(vma, addr, ptep);
> > > > > + if (pte_cont(__ptep_get(ptep))) {
> > > > > + start = ALIGN_DOWN(start, CONT_PTE_SIZE);
> > > > > + ptep = contpte_align_down(ptep);
> > > > > + }
> > > > > +
> > > > > + nr = (end - start) / PAGE_SIZE;
> > > >
> > > > Hm don't love this reuse of input param.
> > >
> > > OK. Will use another local variable instead.
> >
> > Thanks.
> >
> > >
> > > >
> > > > > + for (i = 0; i < nr; i++, ptep++, start += PAGE_SIZE)
> > > > > + young |= __ptep_test_and_clear_young(vma, start, ptep);
> > > >
> > > > Again, overall find it a bit iffy that we are essentially overloading the
> > > > code with non-contpte behaviour.
> > >
> > > Let me clarify further: this is the handling logic of the contpte_xxx()
> > > functions in the contpte.c file. For example, the function
> > > contpte_set_ptes() has a parameter 'nr', which may be greater than
> > > CONT_PTES, meaning that it can handle multiple CONT_PTES range sizes of a
> > > large folio.
> > >
> > > It might be a bit confusing, and I understand this is why you want to change
> > > the 'contpte_' prefix to the 'batch_' prefix. Of course, we can hope Ryan
> > > can provide some input on this.
> > >
> > > Thanks for reviewing.
> >
> > OK so we have some reeeeal confusion here. And this speaks to the patch needing
> > work.
> >
> > This seems to answer what I asked above - basically - are we doing _batches_ of
> > _contpte_ entries, or are we just accepting arbitrary large folios here and
> > batching up operations on them, including non-contpte entries.
> >
> > Presumably from the above you're saying - this is dealing with batches of
> > contpte entries.
> >
> > In which case this has to be made _super_ clear. Not just adding an arbitrary
> > 'nr' parameter and letting people guess.
> >
> > The contpte code is _already_ confusing and somewhat surprising if you're not
> > aware of it, we need to be going in the other direction.
>
> Sure. Sorry for causing confusion. I try to explain it again to make it
> clear.
>
> First, it is necessary to explain how the contpte implementation in Arm64 is
> exposed, using set_ptes() as an example.
>
> Arm64 defines an arch-specific implementation of set_ptes():
> "
> static __always_inline void set_ptes(struct mm_struct *mm, unsigned long
> addr,
> pte_t *ptep, pte_t pte, unsigned int nr)
> {
> pte = pte_mknoncont(pte);
>
> if (likely(nr == 1)) {
> contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> __set_ptes(mm, addr, ptep, pte, 1);
> contpte_try_fold(mm, addr, ptep, pte);
> } else {
> contpte_set_ptes(mm, addr, ptep, pte, nr);
> }
> }
> "
>
> Here, 'nr' refers to consecutive (present) PTEs that map consecutive pages
> of the same large folio within a single VMA and a single page table. Based
> on 'nr', it determines whether the PTE_CONT attribute should be set or not.
>
> Second, the underlying implementation of contpte_set_ptes() also calculates
> 'end' based on 'nr' and checks whether the PTE_CONT attribute should be set.
> For example:
>
> For a large folio with order = 3 (nr = 8), the PTE_CONT attribute will not
> be set since the 'next' is not aligned with CONT_PTE_MASK.
>
> For a large folio with order = 4 (nr = 16), it aligns perfectly, allowing
> the PTE_CONT attribute to be set for the entire PTE entries of the large
> folio.
>
> For a large folio with order = 5 (nr = 32), this involves two ranges of
> CONT_PTE_SIZE (16 PTEs in each range), requiring the PTE_CONT attribute to
> be set separately for each range.
>
> "
> void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
> pte_t *ptep, pte_t pte, unsigned int nr)
> {
> .......
>
> end = addr + (nr << PAGE_SHIFT);
> pfn = pte_pfn(pte);
> prot = pte_pgprot(pte);
>
> do {
> next = pte_cont_addr_end(addr, end);
> nr = (next - addr) >> PAGE_SHIFT;
> pte = pfn_pte(pfn, prot);
>
> if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
> pte = pte_mkcont(pte);
> else
> pte = pte_mknoncont(pte);
>
> __set_ptes(mm, addr, ptep, pte, nr);
>
> addr = next;
> ptep += nr;
> pfn += nr;
>
> } while (addr != end);
> }
> "
>
> Third, regarding the implementation of my patch, I have followed the contpte
> logic here by adding the 'nr' parameter to
> contpte_ptep_test_and_clear_young() for batch processing consecutive
> (present) PTEs. Moreover, the semantics of 'nr' remain consistent with the
> original contpte functions, and I have not broken the existing contpte
> implementation logic.
OK that helps a lot thanks.
So the nr will be the number of PTEs that are consecutive, belong to the same folio,
and have the same PTE bits set, excluding accessed, writable, dirty and soft-dirty.
But since we are retrieving accessed bit state (young), isn't it a problem we're
ignoring this bit when considering what goes into the batch?
I mean now it's going to return true even if some do not have the accessed flag
set?
I am more convinced things are at least correct in style done this way since, as
you say, this is the approach taken in set_ptes().
Thanks, Lorenzo
On 17/12/2025 14:50, Lorenzo Stoakes wrote:
> On Wed, Dec 17, 2025 at 11:53:31AM +0800, Baolin Wang wrote:
>>
>>
>> On 2025/12/16 19:11, Lorenzo Stoakes wrote:
>>> On Tue, Dec 16, 2025 at 11:32:58AM +0800, Baolin Wang wrote:
>>>>
>>>>
>>>> On 2025/12/15 19:36, Lorenzo Stoakes wrote:
>>>>> On Thu, Dec 11, 2025 at 04:16:54PM +0800, Baolin Wang wrote:
>>>>>> Currently, contpte_ptep_test_and_clear_young() and contpte_ptep_clear_flush_young()
>>>>>> only clear the young flag and flush TLBs for PTEs within the contiguous range.
>>>>>> To support batch PTE operations for other sized large folios in the following
>>>>>> patches, adding a new parameter to specify the number of PTEs.
>>>>>>
>>>>>> While we are at it, rename the functions to maintain consistency with other
>>>>>> contpte_*() functions.
>>>>>>
>>>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>> ---
>>>>>> arch/arm64/include/asm/pgtable.h | 12 ++++-----
>>>>>> arch/arm64/mm/contpte.c | 44 ++++++++++++++++++++++----------
>>>>>> 2 files changed, 37 insertions(+), 19 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>>>>> index 0944e296dd4a..e03034683156 100644
>>>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>>>> @@ -1679,10 +1679,10 @@ extern void contpte_clear_full_ptes(struct mm_struct *mm, unsigned long addr,
>>>>>> extern pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
>>>>>> unsigned long addr, pte_t *ptep,
>>>>>> unsigned int nr, int full);
>>>>>> -extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>>>> - unsigned long addr, pte_t *ptep);
>>>>>> -extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>>>>>> - unsigned long addr, pte_t *ptep);
>>>>>> +extern int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
>>>>>> + unsigned long addr, pte_t *ptep, unsigned int nr);
>>>>>> +extern int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
>>>>>> + unsigned long addr, pte_t *ptep, unsigned int nr);
>>>>>
>>>>> In core mm at least, as a convention, we strip 'extern' from header declarations
>>>>> as we go as they're not necessary.
>>>>
>>>> Sure. I can remove the 'extern'.
>>>
>>> Thanks.
>>>
>>>>
>>>>> I wonder about this naming convention also. The contpte_xxx_() prefix
>>>>> obviously implies we are operating upon PTEs in the contiguous range, now
>>>>> we are doing something... different and 'nr' is a bit vague.
>>>>>
>>>>> Is it the nr of contiguous ranges? Well in reality it's nr_pages right? But
>>>>
>>>> Yes, 'nr' here means nr_pages, which follows the convention of the
>>>> parameters in contpte_xxx_() functions.
>>>
>>> Except you're violating the convention already by doing non-contpte stuff in
>>> contpte functions?
>>
>> Nope. Sorry for my poor English, which might have caused some confusion. I
>> followed the contpte's convention, and let me try to explain it again below.
>>
>>> The confusion is between nr of contpte ranges and nr of pages, so I think we
>>> should be explicit.
>>>
>>>>
>>>>> now we're not saying they're necessarily contiguous in the sense of contpte...
>>>>>
>>>>> I wonder whether we'd be better off introducing a new function that
>>>>> explicitly has 'batch' or 'batched' in the name, and have the usual
>>>>> contpte_***() stuff and callers that need batching using a new underlying helper?
>>>>
>>>> OK. I get your point. However, this is outside the scope of my patchset.
>>>
>>> Well, no, this is review feedback that needs to be addressed, I really don't
>>> like this patch as-is.
>>>
>>> So for this to be upstreamable you really need to separate this out.
>>>
>>>>
>>>> Perhaps we can clean up all the contpte_xxx_() functions if everyone agrees
>>>> that this is confusing.
>>>
>>> I mean, unless Ryan explicitly makes a case for this being fine, I'm inclined to
>>> reject the patch unless we do this.
>>>
>>> The contpte logic is already very confusing, and this is only adding to it.
>>>
>>>>
>>>>> Obviously I defer to Ryan on all this, just my thoughts...
>>>>>
>>>>>> extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>>>>>> pte_t *ptep, unsigned int nr);
>>>>>> extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>>>>>> @@ -1854,7 +1854,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>>>> if (likely(!pte_valid_cont(orig_pte)))
>>>>>> return __ptep_test_and_clear_young(vma, addr, ptep);
>>>>>>
>>>>>> - return contpte_ptep_test_and_clear_young(vma, addr, ptep);
>>>>>> + return contpte_test_and_clear_young_ptes(vma, addr, ptep, CONT_PTES);
>>>>>> }
>>>>>>
>>>>>> #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
>>>>>> @@ -1866,7 +1866,7 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>>>>>> if (likely(!pte_valid_cont(orig_pte)))
>>>>>> return __ptep_clear_flush_young(vma, addr, ptep);
>>>>>>
>>>>>> - return contpte_ptep_clear_flush_young(vma, addr, ptep);
>>>>>> + return contpte_clear_flush_young_ptes(vma, addr, ptep, CONT_PTES);
>>>>>> }
>>>>>>
>>>>>> #define wrprotect_ptes wrprotect_ptes
>>>>>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>>>>>> index c0557945939c..19b122441be3 100644
>>>>>> --- a/arch/arm64/mm/contpte.c
>>>>>> +++ b/arch/arm64/mm/contpte.c
>>>>>> @@ -488,8 +488,9 @@ pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
>>>>>> }
>>>>>> EXPORT_SYMBOL_GPL(contpte_get_and_clear_full_ptes);
>>>>>>
>>>>>> -int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>>>> - unsigned long addr, pte_t *ptep)
>>>>>> +int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
>>>>>> + unsigned long addr, pte_t *ptep,
>>>>>> + unsigned int nr)
>>>>>> {
>>>>>> /*
>>>>>> * ptep_clear_flush_young() technically requires us to clear the access
>>>>>> @@ -500,39 +501,56 @@ int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>>>> * having to unfold.
>>>>>> */
>>>>>
>>>>> Hmm shouldn't you need to update this comment now 'nr' is a thing? E.g.:
>>>>>
>>>>> "And since we only create a contig range when the range is covered by a single
>>>>> folio, we can get away with clearing young for the whole contig range here, so
>>>>> we avoid having to unfold."
>>>>
>>>> I think the original comments are still clear:
>>>> "
>>>> /*
>>>> * ptep_clear_flush_young() technically requires us to clear the access
>>>> * flag for a _single_ pte. However, the core-mm code actually tracks
>>>> * access/dirty per folio, not per page. And since we only create a
>>>> * contig range when the range is covered by a single folio, we can get
>>>> * away with clearing young for the whole contig range here, so we avoid
>>>> * having to unfold.
>>>> */
>>>> "
>>>
>>> Except nr means you can span more than a single contig pte range? So this
>>> comment is no longer applicable as-is?
>>
>> The nr means consecutive (present) PTEs that map consecutive pages of the
>> same large folio in a single VMA and a single page table.
>>
>> I think this is a prerequisite. If you think it's necessary to explain it, I
>> can add it to the comments.
>>
>>> You've changed the function, the comment needs to change too.
>>>
>>>>
>>>>> However now you're allowing for large folios that are not contpte?
>>>>>
>>>>> Overall feeling pretty iffy about implementing this this way.
>>>>
>>>> Please see the above comments, which explain why we should do that.
>>>
>>> You haven't explained anything? Maybe I missed it?
>>>
>>> Can you explain clearly what nr might actually be in relation to contpte's? I'm
>>> confused even as to the utility here.
>>>
>>> Is it batched ranges of contptes? But then why are you trying to see if end is a
>>> contpte + expand the range if so?
>>
>> See below.
>>
>>>>>> + unsigned long start = addr;
>>>>>> + unsigned long end = start + nr * PAGE_SIZE;
>>>>>
>>>>> So now [addr, addr + nr * PAGE_SIZE) must for sure be within a single PTE
>>>>> table?
>>>>
>>>> Caller has made sure that (see folio_referenced_one()).
>>>
>>> You should comment this somehow, I hate this nested assumptions stuff 'function
>>> X which calls function Y which calls function Z makes sure that parameter A
>>> fulfils requirements B but we don't mention it anywhere'.
>>
>> Sure.
>>
>>>>>> int young = 0;
>>>>>> int i;
>>>>>>
>>>>>> - ptep = contpte_align_down(ptep);
>>>>>> - addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>>>>> + if (pte_cont(__ptep_get(ptep + nr - 1)))
>>>>>> + end = ALIGN(end, CONT_PTE_SIZE);
>>>>>
>>>>> OK so I guess for PTE_CONT to be set, it must be aligned to CONT_PTE_SIZE,
>>>>> with other PTE entries present there not having PTE_CONT set (Ryan - do
>>>>> correct me if I'm wrong!)
>>>>>
>>>>> I guess then no worry about running off the end of the PTE table here.
>>>>>
>>>>> I wonder about using pte_cont_addr_end() here, but I guess you are not
>>>>> intending to limit to a range here but rather capture the entire contpte
>>>>> range of the end.
>>>>
>>>> Right.
>>>>
>>>>> I don't love that now a function prefixed with contpte_... can have a
>>>>> condition where part of the input range is _not_ a contpte entry, or is
>>>>> specified partially...
>>>>
>>>> Like I said above, this is outside the scope of my patchset. If Ryan also
>>>> thinks we need to do some cleanups for all the contpte_xxx() functions in
>>>> the contpte.c file, we can send a follow-up patchset to rename all the
>>>> contpte_xxx() functions to make it clear.
>>>
>>> It's really not outside of the scope of the patch set, this is review feedback
>>> you have to address :) trying not to be a grumpy kernel dev here :>)
>>>
>>> In general in the kernel we don't accept confusing patches then defer making
>>> them not-confusing until later.
>>>
>>> We review until the series is at the correct standard to be accepted before we
>>> merge.
>>>
>>> As always I'm happy to be convinced that I'm mistaken or something's not
>>> necesary, however!
>>
>> Yes, let me try to explain it again. See below.
>>
>>>>>> - for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>>>>>> - young |= __ptep_test_and_clear_young(vma, addr, ptep);
>>>>>> + if (pte_cont(__ptep_get(ptep))) {
>>>>>> + start = ALIGN_DOWN(start, CONT_PTE_SIZE);
>>>>>> + ptep = contpte_align_down(ptep);
>>>>>> + }
>>>>>> +
>>>>>> + nr = (end - start) / PAGE_SIZE;
>>>>>
>>>>> Hm don't love this reuse of input param.
>>>>
>>>> OK. Will use another local variable instead.
>>>
>>> Thanks.
>>>
>>>>
>>>>>
>>>>>> + for (i = 0; i < nr; i++, ptep++, start += PAGE_SIZE)
>>>>>> + young |= __ptep_test_and_clear_young(vma, start, ptep);
>>>>>
>>>>> Again, overall find it a bit iffy that we are essentially overloading the
>>>>> code with non-contpte behaviour.
>>>>
>>>> Let me clarify further: this is the handling logic of the contpte_xxx()
>>>> functions in the contpte.c file. For example, the function
>>>> contpte_set_ptes() has a parameter 'nr', which may be greater than
>>>> CONT_PTES, meaning that it can handle multiple CONT_PTES range sizes of a
>>>> large folio.
>>>>
>>>> It might be a bit confusing, and I understand this is why you want to change
>>>> the 'contpte_' prefix to the 'batch_' prefix. Of course, we can hope Ryan
>>>> can provide some input on this.
>>>>
>>>> Thanks for reviewing.
>>>
>>> OK so we have some reeeeal confusion here. And this speaks to the patch needing
>>> work.
>>>
>>> This seems to answer what I asked above - basically - are we doing _batches_ of
>>> _contpte_ entries, or are we just accepting arbitrary large folios here and
>>> batching up operations on them, including non-contpte entries.
>>>
>>> Presumably from the above you're saying - this is dealing with batches of
>>> contpte entries.
>>>
>>> In which case this has to be made _super_ clear. Not just adding an arbitrary
>>> 'nr' parameter and letting people guess.
>>>
>>> The contpte code is _already_ confusing and somewhat surprising if you're not
>>> aware of it, we need to be going in the other direction.
>>
>> Sure. Sorry for causing confusion. I try to explain it again to make it
>> clear.
>>
>> First, it is necessary to explain how the contpte implementation in Arm64 is
>> exposed, using set_ptes() as an example.
>>
>> Arm64 defines an arch-specific implementation of set_ptes():
>> "
>> static __always_inline void set_ptes(struct mm_struct *mm, unsigned long
>> addr,
>> pte_t *ptep, pte_t pte, unsigned int nr)
>> {
>> pte = pte_mknoncont(pte);
>>
>> if (likely(nr == 1)) {
>> contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> __set_ptes(mm, addr, ptep, pte, 1);
>> contpte_try_fold(mm, addr, ptep, pte);
>> } else {
>> contpte_set_ptes(mm, addr, ptep, pte, nr);
>> }
>> }
>> "
>>
>> Here, 'nr' refers to consecutive (present) PTEs that map consecutive pages
>> of the same large folio within a single VMA and a single page table. Based
>> on 'nr', it determines whether the PTE_CONT attribute should be set or not.
>>
>> Second, the underlying implementation of contpte_set_ptes() also calculates
>> 'end' based on 'nr' and checks whether the PTE_CONT attribute should be set.
>> For example:
>>
>> For a large folio with order = 3 (nr = 8), the PTE_CONT attribute will not
>> be set since the 'next' is not aligned with CONT_PTE_MASK.
>>
>> For a large folio with order = 4 (nr = 16), it aligns perfectly, allowing
>> the PTE_CONT attribute to be set for the entire PTE entries of the large
>> folio.
>>
>> For a large folio with order = 5 (nr = 32), this involves two ranges of
>> CONT_PTE_SIZE (16 PTEs in each range), requiring the PTE_CONT attribute to
>> be set separately for each range.
>>
>> "
>> void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>> pte_t *ptep, pte_t pte, unsigned int nr)
>> {
>> .......
>>
>> end = addr + (nr << PAGE_SHIFT);
>> pfn = pte_pfn(pte);
>> prot = pte_pgprot(pte);
>>
>> do {
>> next = pte_cont_addr_end(addr, end);
>> nr = (next - addr) >> PAGE_SHIFT;
>> pte = pfn_pte(pfn, prot);
>>
>> if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
>> pte = pte_mkcont(pte);
>> else
>> pte = pte_mknoncont(pte);
>>
>> __set_ptes(mm, addr, ptep, pte, nr);
>>
>> addr = next;
>> ptep += nr;
>> pfn += nr;
>>
>> } while (addr != end);
>> }
>> "
>>
>> Third, regarding the implementation of my patch, I have followed the contpte
>> logic here by adding the 'nr' parameter to
>> contpte_ptep_test_and_clear_young() for batch processing consecutive
>> (present) PTEs. Moreover, the semantics of 'nr' remain consistent with the
>> original contpte functions, and I have not broken the existing contpte
>> implementation logic.
>
> OK that helps a lot thanks.
>
> So the nr will be number of PTEs that are consecutive, belong to the same folio,
> and have the same PTE bit set excluding accessed, writable, dirty, soft-dirty.
I'm not sure test_and_clear_young_ptes() would technically require all of these
constraints. I think it would be sufficient to just ensure that all the PTEs in
the batch are pte_present(). And I assume the return value is intended to
indicate if *any* of the PTEs in the batch were previously young?
The usual order for adding new batched PTE helpers is to define the generic
function along with its spec in a comment block and provide a default
implementation as a loop around the existing non-batched
version of the interface. Then you can convert the core-mm to use that API. And
finally you can opt arm64 into overriding the API with a more efficient
implementation.
If you were to do it in that order, it tells a better story and is easier to
follow, I think.
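To illustrate, a minimal sketch of that ordering (the helper name
test_and_clear_young_ptes() and the #ifndef override mechanism are assumptions
for illustration, not taken from the series): the generic default just loops
over the existing non-batched ptep_test_and_clear_young(), and arm64 could then
opt in with a contpte-aware override, as suggested above.
"
#ifndef test_and_clear_young_ptes
/*
 * Test and clear the young bit across 'nr' consecutive present PTEs that map
 * consecutive pages of the same large folio in one VMA and one page table.
 * Returns non-zero if *any* of the PTEs was young.
 */
static inline int test_and_clear_young_ptes(struct vm_area_struct *vma,
                                            unsigned long addr, pte_t *ptep,
                                            unsigned int nr)
{
        int young = 0;
        unsigned int i;

        for (i = 0; i < nr; i++, ptep++, addr += PAGE_SIZE)
                young |= ptep_test_and_clear_young(vma, addr, ptep);

        return young;
}
#endif
"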
>
> But since we are retrieving accessed bit state (young), isn't it a problem we're
> ignoring this bit when considering what goes into the batch?
>
> I mean now it's going to return true even if some do not have the accessed flag
> set?
test_and_clear_young_ptes() is exploiting the fact that the kernel tracks young
(and dirty) per *folio*. So having a young and dirty bit per PTE is providing
more information than the kernel needs if a large folio is mapped across
multiple PTEs. Which is why returning the logical OR of the access bits is ok here.
>
> I am more convinced things are correct in style at least then this way as, as
> you say, this is the approach taken in set_ptes().
I'm personally ok with the style and I believe the implementation is correct.
But it would help to reformulate the series in the order I suggest above since
then we can review against the spec of what the core-mm helper is supposed to do.
Thanks,
Ryan
>
> Thanks, Lorenzo
On 2025/12/18 00:06, Ryan Roberts wrote:
> On 17/12/2025 14:50, Lorenzo Stoakes wrote:
>> On Wed, Dec 17, 2025 at 11:53:31AM +0800, Baolin Wang wrote:
>>>
>>>
>>> On 2025/12/16 19:11, Lorenzo Stoakes wrote:
>>>> On Tue, Dec 16, 2025 at 11:32:58AM +0800, Baolin Wang wrote:
>>>>>
>>>>>
>>>>> On 2025/12/15 19:36, Lorenzo Stoakes wrote:
>>>>>> On Thu, Dec 11, 2025 at 04:16:54PM +0800, Baolin Wang wrote:
>>>>>>> Currently, contpte_ptep_test_and_clear_young() and contpte_ptep_clear_flush_young()
>>>>>>> only clear the young flag and flush TLBs for PTEs within the contiguous range.
>>>>>>> To support batch PTE operations for other sized large folios in the following
>>>>>>> patches, adding a new parameter to specify the number of PTEs.
>>>>>>>
>>>>>>> While we are at it, rename the functions to maintain consistency with other
>>>>>>> contpte_*() functions.
>>>>>>>
>>>>>>> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>> ---
>>>>>>> arch/arm64/include/asm/pgtable.h | 12 ++++-----
>>>>>>> arch/arm64/mm/contpte.c | 44 ++++++++++++++++++++++----------
>>>>>>> 2 files changed, 37 insertions(+), 19 deletions(-)
>>>>>>>
>>>>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>>>>>> index 0944e296dd4a..e03034683156 100644
>>>>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>>>>> @@ -1679,10 +1679,10 @@ extern void contpte_clear_full_ptes(struct mm_struct *mm, unsigned long addr,
>>>>>>> extern pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
>>>>>>> unsigned long addr, pte_t *ptep,
>>>>>>> unsigned int nr, int full);
>>>>>>> -extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>>>>> - unsigned long addr, pte_t *ptep);
>>>>>>> -extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>>>>>>> - unsigned long addr, pte_t *ptep);
>>>>>>> +extern int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
>>>>>>> + unsigned long addr, pte_t *ptep, unsigned int nr);
>>>>>>> +extern int contpte_clear_flush_young_ptes(struct vm_area_struct *vma,
>>>>>>> + unsigned long addr, pte_t *ptep, unsigned int nr);
>>>>>>
>>>>>> In core mm at least, as a convention, we strip 'extern' from header declarations
>>>>>> as we go as they're not necessary.
>>>>>
>>>>> Sure. I can remove the 'extern'.
>>>>
>>>> Thanks.
>>>>
>>>>>
>>>>>> I wonder about this naming convention also. The contpte_xxx_() prefix
>>>>>> obviously implies we are operating upon PTEs in the contiguous range, now
>>>>>> we are doing something... different and 'nr' is a bit vague.
>>>>>>
>>>>>> Is it the nr of contiguous ranges? Well in reality it's nr_pages right? But
>>>>>
>>>>> Yes, 'nr' here means nr_pages, which follows the convention of the
>>>>> parameters in contpte_xxx_() functions.
>>>>
>>>> Except you're violating the convention already by doing non-contpte stuff in
>>>> contpte functions?
>>>
>>> Nope. Sorry for my poor English, which might have caused some confusion. I
>>> followed the contpte's convention, and let me try to explain it again below.
>>>
>>>> The confusion is between nr of contpte ranges and nr of pages, so I think we
>>>> should be explicit.
>>>>
>>>>>
>>>>>> now we're not saying they're necessarily contiguous in the sense of contpte...
>>>>>>
>>>>>> I wonder whether we'd be better off introducing a new function that
>>>>>> explicitly has 'batch' or 'batched' in the name, and have the usual
>>>>>> contpte_***() stuff and callers that need batching using a new underlying helper?
>>>>>
>>>>> OK. I get your point. However, this is outside the scope of my patchset.
>>>>
>>>> Well, no, this is review feedback that needs to be addressed, I really don't
>>>> like this patch as-is.
>>>>
>>>> So for this to be upstreamable you really need to separate this out.
>>>>
>>>>>
>>>>> Perhaps we can clean up all the contpte_xxx_() functions if everyone agrees
>>>>> that this is confusing.
>>>>
>>>> I mean, unless Ryan explicitly makes a case for this being fine, I'm inclined to
>>>> reject the patch unless we do this.
>>>>
>>>> The contpte logic is already very confusing, and this is only adding to it.
>>>>
>>>>>
>>>>>> Obviously I defer to Ryan on all this, just my thoughts...
>>>>>>
>>>>>>> extern void contpte_wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>>>>>>> pte_t *ptep, unsigned int nr);
>>>>>>> extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>>>>>>> @@ -1854,7 +1854,7 @@ static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>>>>> if (likely(!pte_valid_cont(orig_pte)))
>>>>>>> return __ptep_test_and_clear_young(vma, addr, ptep);
>>>>>>>
>>>>>>> - return contpte_ptep_test_and_clear_young(vma, addr, ptep);
>>>>>>> + return contpte_test_and_clear_young_ptes(vma, addr, ptep, CONT_PTES);
>>>>>>> }
>>>>>>>
>>>>>>> #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
>>>>>>> @@ -1866,7 +1866,7 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>>>>>>> if (likely(!pte_valid_cont(orig_pte)))
>>>>>>> return __ptep_clear_flush_young(vma, addr, ptep);
>>>>>>>
>>>>>>> - return contpte_ptep_clear_flush_young(vma, addr, ptep);
>>>>>>> + return contpte_clear_flush_young_ptes(vma, addr, ptep, CONT_PTES);
>>>>>>> }
>>>>>>>
>>>>>>> #define wrprotect_ptes wrprotect_ptes
>>>>>>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>>>>>>> index c0557945939c..19b122441be3 100644
>>>>>>> --- a/arch/arm64/mm/contpte.c
>>>>>>> +++ b/arch/arm64/mm/contpte.c
>>>>>>> @@ -488,8 +488,9 @@ pte_t contpte_get_and_clear_full_ptes(struct mm_struct *mm,
>>>>>>> }
>>>>>>> EXPORT_SYMBOL_GPL(contpte_get_and_clear_full_ptes);
>>>>>>>
>>>>>>> -int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>>>>> - unsigned long addr, pte_t *ptep)
>>>>>>> +int contpte_test_and_clear_young_ptes(struct vm_area_struct *vma,
>>>>>>> + unsigned long addr, pte_t *ptep,
>>>>>>> + unsigned int nr)
>>>>>>> {
>>>>>>> /*
>>>>>>> * ptep_clear_flush_young() technically requires us to clear the access
>>>>>>> @@ -500,39 +501,56 @@ int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>>>>> * having to unfold.
>>>>>>> */
>>>>>>
>>>>>> Hmm shouldn't you need to update this comment now 'nr' is a thing? E.g.:
>>>>>>
>>>>>> "And since we only create a contig range when the range is covered by a single
>>>>>> folio, we can get away with clearing young for the whole contig range here, so
>>>>>> we avoid having to unfold."
>>>>>
>>>>> I think the original comments are still clear:
>>>>> "
>>>>> /*
>>>>> * ptep_clear_flush_young() technically requires us to clear the access
>>>>> * flag for a _single_ pte. However, the core-mm code actually tracks
>>>>> * access/dirty per folio, not per page. And since we only create a
>>>>> * contig range when the range is covered by a single folio, we can get
>>>>> * away with clearing young for the whole contig range here, so we avoid
>>>>> * having to unfold.
>>>>> */
>>>>> "
>>>>
>>>> Except nr means you can span more than a single contig pte range? So this
>>>> comment is no longer applicable as-is?
>>>
>>> The nr means consecutive (present) PTEs that map consecutive pages of the
>>> same large folio in a single VMA and a single page table.
>>>
>>> I think this is a prerequisite. If you think it's necessary to explain it, I
>>> can add it to the comments.
>>>
>>>> You've changed the function, the comment needs to change too.
>>>>
>>>>>
>>>>>> However now you're allowing for large folios that are not contpte?
>>>>>>
>>>>>> Overall feeling pretty iffy about implementing this this way.
>>>>>
>>>>> Please see the above comments, which explain why we should do that.
>>>>
>>>> You haven't explained anything? Maybe I missed it?
>>>>
>>>> Can you explain clearly what nr might actually be in relation to contpte's? I'm
>>>> confused even as to the utility here.
>>>>
>>>> Is it batched ranges of contptes? But then why are you trying to see if end is a
>>>> contpte + expand the range if so?
>>>
>>> See below.
>>>
>>>>>>> + unsigned long start = addr;
>>>>>>> + unsigned long end = start + nr * PAGE_SIZE;
>>>>>>
>>>>>> So now [addr, addr + nr * PAGE_SIZE) must for sure be within a single PTE
>>>>>> table?
>>>>>
>>>>> Caller has made sure that (see folio_referenced_one()).
>>>>
>>>> You should comment this somehow, I hate this nested assumptions stuff 'function
>>>> X which calls function Y which calls function Z makes sure that parameter A
>>>> fulfils requirements B but we don't mention it anywhere'.
>>>
>>> Sure.
>>>
>>>>>>> int young = 0;
>>>>>>> int i;
>>>>>>>
>>>>>>> - ptep = contpte_align_down(ptep);
>>>>>>> - addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>>>>>> + if (pte_cont(__ptep_get(ptep + nr - 1)))
>>>>>>> + end = ALIGN(end, CONT_PTE_SIZE);
>>>>>>
>>>>>> OK so I guess for PTE_CONT to be set, it must be aligned to CONT_PTE_SIZE,
>>>>>> with other PTE entries present there not having PTE_CONT set (Ryan - do
>>>>>> correct me if I'm wrong!)
>>>>>>
>>>>>> I guess then no worry about running off the end of the PTE table here.
>>>>>>
>>>>>> I wonder about using pte_cont_addr_end() here, but I guess you are not
>>>>>> intending to limit to a range here but rather capture the entire contpte
>>>>>> range of the end.
>>>>>
>>>>> Right.
>>>>>
>>>>>> I don't love that now a function prefixed with contpte_... can have a
>>>>>> condition where part of the input range is _not_ a contpte entry, or is
>>>>>> specified partially...
>>>>>
>>>>> Like I said above, this is outside the scope of my patchset. If Ryan also
>>>>> thinks we need to do some cleanups for all the contpte_xxx() functions in
>>>>> the contpte.c file, we can send a follow-up patchset to rename all the
>>>>> contpte_xxx() functions to make it clear.
>>>>
>>>> It's really not outside of the scope of the patch set, this is review feedback
>>>> you have to address :) trying not to be a grumpy kernel dev here :>)
>>>>
>>>> In general in the kernel we don't accept confusing patches then defer making
>>>> them not-confusing until later.
>>>>
>>>> We review until the series is at the correct standard to be accepted before we
>>>> merge.
>>>>
>>>> As always I'm happy to be convinced that I'm mistaken or something's not
>>>> necesary, however!
>>>
>>> Yes, let me try to explain it again. See below.
>>>
>>>>>>> - for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>>>>>>> - young |= __ptep_test_and_clear_young(vma, addr, ptep);
>>>>>>> + if (pte_cont(__ptep_get(ptep))) {
>>>>>>> + start = ALIGN_DOWN(start, CONT_PTE_SIZE);
>>>>>>> + ptep = contpte_align_down(ptep);
>>>>>>> + }
>>>>>>> +
>>>>>>> + nr = (end - start) / PAGE_SIZE;
>>>>>>
>>>>>> Hm don't love this reuse of input param.
>>>>>
>>>>> OK. Will use another local variable instead.
>>>>
>>>> Thanks.
>>>>
>>>>>
>>>>>>
>>>>>>> + for (i = 0; i < nr; i++, ptep++, start += PAGE_SIZE)
>>>>>>> + young |= __ptep_test_and_clear_young(vma, start, ptep);
>>>>>>
>>>>>> Again, overall find it a bit iffy that we are essentially overloading the
>>>>>> code with non-contpte behaviour.
>>>>>
>>>>> Let me clarify further: this follows the existing handling logic of the
>>>>> contpte_xxx() functions in contpte.c. For example, contpte_set_ptes() has a
>>>>> parameter 'nr', which may be greater than CONT_PTES, meaning that it can
>>>>> handle multiple CONT_PTE_SIZE ranges of a large folio.
>>>>>
>>>>> It might be a bit confusing, and I understand this is why you want to change
>>>>> the 'contpte_' prefix to the 'batch_' prefix. Of course, we can hope Ryan
>>>>> can provide some input on this.
>>>>>
>>>>> Thanks for reviewing.
>>>>
>>>> OK so we have some reeeeal confusion here. And this speaks to the patch needing
>>>> work.
>>>>
>>>> This seems to answer what I asked above - basically - are we doing _batches_ of
>>>> _contpte_ entries, or are we just accepting arbitrary large folios here and
>>>> batching up operations on them, including non-contpte entries.
>>>>
>>>> Presumably from the above you're saying - this is dealing with batches of
>>>> contpte entries.
>>>>
>>>> In which case this has to be made _super_ clear. Not just adding an arbitrary
>>>> 'nr' parameter and letting people guess.
>>>>
>>>> The contpte code is _already_ confusing and somewhat surprising if you're not
>>>> aware of it, we need to be going in the other direction.
>>>
>>> Sure. Sorry for causing confusion. I'll try to explain it again to make it
>>> clear.
>>>
>>> First, it is necessary to explain how the contpte implementation in Arm64 is
>>> exposed, using set_ptes() as an example.
>>>
>>> Arm64 defines an arch-specific implementation of set_ptes():
>>> "
>>> static __always_inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>>> 				pte_t *ptep, pte_t pte, unsigned int nr)
>>> {
>>> 	pte = pte_mknoncont(pte);
>>>
>>> 	if (likely(nr == 1)) {
>>> 		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>> 		__set_ptes(mm, addr, ptep, pte, 1);
>>> 		contpte_try_fold(mm, addr, ptep, pte);
>>> 	} else {
>>> 		contpte_set_ptes(mm, addr, ptep, pte, nr);
>>> 	}
>>> }
>>> "
>>>
>>> Here, 'nr' refers to the number of consecutive (present) PTEs that map
>>> consecutive pages of the same large folio within a single VMA and a single
>>> page table. Based on 'nr', set_ptes() determines whether the PTE_CONT
>>> attribute should be set or not.
>>>
>>> Second, the underlying implementation of contpte_set_ptes() also calculates
>>> 'end' based on 'nr' and checks whether the PTE_CONT attribute should be set.
>>> For example:
>>>
>>> For a large folio with order = 3 (nr = 8), the PTE_CONT attribute will not
>>> be set since the 'next' is not aligned with CONT_PTE_MASK.
>>>
>>> For a large folio with order = 4 (nr = 16), it aligns perfectly, allowing
>>> the PTE_CONT attribute to be set for all the PTE entries of the large
>>> folio.
>>>
>>> For a large folio with order = 5 (nr = 32), this involves two ranges of
>>> CONT_PTE_SIZE (16 PTEs in each range), requiring the PTE_CONT attribute to
>>> be set separately for each range.
>>>
>>> "
>>> void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>>> 			pte_t *ptep, pte_t pte, unsigned int nr)
>>> {
>>> 	.......
>>>
>>> 	end = addr + (nr << PAGE_SHIFT);
>>> 	pfn = pte_pfn(pte);
>>> 	prot = pte_pgprot(pte);
>>>
>>> 	do {
>>> 		next = pte_cont_addr_end(addr, end);
>>> 		nr = (next - addr) >> PAGE_SHIFT;
>>> 		pte = pfn_pte(pfn, prot);
>>>
>>> 		if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
>>> 			pte = pte_mkcont(pte);
>>> 		else
>>> 			pte = pte_mknoncont(pte);
>>>
>>> 		__set_ptes(mm, addr, ptep, pte, nr);
>>>
>>> 		addr = next;
>>> 		ptep += nr;
>>> 		pfn += nr;
>>>
>>> 	} while (addr != end);
>>> }
>>> "
>>>
>>> Third, regarding the implementation of my patch, I have followed the contpte
>>> logic here by adding the 'nr' parameter to
>>> contpte_ptep_test_and_clear_young() for batch processing consecutive
>>> (present) PTEs. Moreover, the semantics of 'nr' remain consistent with the
>>> original contpte functions, and I have not broken the existing contpte
>>> implementation logic.
>>
>> OK that helps a lot thanks.
>>
>> So the nr will be the number of PTEs that are consecutive, belong to the same
>> folio, and have the same PTE bits set, excluding accessed, writable, dirty and
>> soft-dirty.
>
> I'm not sure test_and_clear_young_ptes() would technically require all of these
> constraints. I think it would be sufficient to just ensure that all the PTEs in
> the batch are pte_present(). And I assume the return value is intended to
> indicate if *any* of the PTEs in the batch were previously young?
Agree.
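To be concrete, the spec I have in mind for the generic helper is roughly the
draft comment block below (just a sketch; the exact wording, and whether the
helper ends up named test_and_clear_young_ptes(), are still open):

"
/*
 * test_and_clear_young_ptes - test and clear the young bits of a batch of PTEs
 * @vma: the VMA the PTEs belong to.
 * @addr: the address of the first PTE in the batch.
 * @ptep: pointer to the first PTE in the batch.
 * @nr: the number of PTEs in the batch.
 *
 * All PTEs in the batch must be present, map consecutive pages of the same
 * large folio, and lie within a single VMA and a single page table.
 *
 * Return: non-zero if any PTE in the batch was young; the young bit is
 * cleared in all of them.
 */
"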
> The usual order for adding new batched PTE helpers is to define the generic
> function along with its spec in a comment block and provide a default
> implementation that is simply a loop around the existing non-batched
> version of the interface. Then you can convert the core-mm to use that API. And
> finally you can opt arm64 into overriding the API with a more efficient
> implementation.
>
> If you were to do it in that order, it tells a better story and is easier to
> follow, I think.
Good point. Let me rearrange the patch set.
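As a first pass, I'd expect the generic default to just loop over the existing
non-batched helper, something like the untested sketch below (assuming it sits
next to the other generic helpers in include/linux/pgtable.h, with arm64
overriding it via the usual #ifndef pattern):

"
#ifndef test_and_clear_young_ptes
/*
 * Fallback: call the existing per-PTE helper for each entry and OR the
 * results, so the return value reports whether *any* PTE in the batch was
 * young. Architectures can override this with a more efficient version.
 */
static inline int test_and_clear_young_ptes(struct vm_area_struct *vma,
					    unsigned long addr, pte_t *ptep,
					    unsigned int nr)
{
	int young = 0;

	for (; nr; nr--, ptep++, addr += PAGE_SIZE)
		young |= ptep_test_and_clear_young(vma, addr, ptep);

	return young;
}
#endif
"

Then core-mm callers (e.g. folio_referenced_one()) can call it once per batch,
and arm64 can override it with the contpte-aware version from this patch.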
>
>>
>> But since we are retrieving accessed bit state (young), isn't it a problem we're
>> ignoring this bit when considering what goes into the batch?
>>
>> I mean now it's going to return true even if some do not have the accessed flag
>> set?
>
> test_and_clear_young_ptes() is exploiting the fact that the kernel tracks young
> (and dirty) per *folio*. So having a young and dirty bit per PTE is providing
> more information than the kernel needs if a large folio is mapped across
> multiple PTEs. Which is why returning the logical OR of the access bits is ok here.
Right. Agree.
>> I am more convinced things are correct in style at least then this way as, as
>> you say, this is the approach taken in set_ptes().
>
> I'm personally ok with the style and I believe the implementation is correct.
> But it would help to reformulate the series in the order I suggest above since
> then we can review against the spec of what the core-mm helper is supposed to do.
Sure. Thanks for reviewing.