[PATCH v5 4/5] arm64: mm: implement the architecture-specific clear_flush_young_ptes()

Baolin Wang posted 5 patches 1 month, 1 week ago
Posted by Baolin Wang 1 month, 1 week ago
Implement the Arm64 architecture-specific clear_flush_young_ptes() to enable
batched checking of young flags and TLB flushing, improving performance during
large folio reclamation.

Performance testing:
Allocate 10G of clean file-backed folios via mmap() in a memory cgroup, then
try to reclaim 8G of them via the memory.reclaim interface. I observe a 33%
performance improvement on my Arm64 32-core server (and a 10%+ improvement
on my X86 machine). Meanwhile, the folio_check_references() hotspot dropped
from approximately 35% to around 5%.

W/o patchset:
real	0m1.518s
user	0m0.000s
sys	0m1.518s

W/ patchset:
real	0m1.018s
user	0m0.000s
sys	0m1.018s

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
---
 arch/arm64/include/asm/pgtable.h | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 5e9ff16146c3..aa8f642f1260 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1838,6 +1838,17 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
 	return contpte_clear_flush_young_ptes(vma, addr, ptep, 1);
 }
 
+#define clear_flush_young_ptes clear_flush_young_ptes
+static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
+					 unsigned long addr, pte_t *ptep,
+					 unsigned int nr)
+{
+	if (likely(nr == 1 && !pte_cont(__ptep_get(ptep))))
+		return __ptep_clear_flush_young(vma, addr, ptep);
+
+	return contpte_clear_flush_young_ptes(vma, addr, ptep, nr);
+}
+
 #define wrprotect_ptes wrprotect_ptes
 static __always_inline void wrprotect_ptes(struct mm_struct *mm,
 				unsigned long addr, pte_t *ptep, unsigned int nr)
-- 
2.47.3
Re: [PATCH v5 4/5] arm64: mm: implement the architecture-specific clear_flush_young_ptes()
Posted by Chris Mason 1 week, 2 days ago
Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
> Implement the Arm64 architecture-specific clear_flush_young_ptes() to enable
> batched checking of young flags and TLB flushing, improving performance during
> large folio reclamation.
> 
> Performance testing:
> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
> 33% performance improvement on my Arm64 32-core server (and 10%+ improvement
> on my X86 machine). Meanwhile, the hotspot folio_check_references() dropped
> from approximately 35% to around 5%.

Hi everyone, I ran mm-new through my AI review prompts and this one was
flagged.  AI review below:

> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1838,6 +1838,17 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>  	return contpte_clear_flush_young_ptes(vma, addr, ptep, 1);
>  }
>
> +#define clear_flush_young_ptes clear_flush_young_ptes
> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
> +					 unsigned long addr, pte_t *ptep,
> +					 unsigned int nr)
> +{
> +	if (likely(nr == 1 && !pte_cont(__ptep_get(ptep))))
> +		return __ptep_clear_flush_young(vma, addr, ptep);

Should this be checking !pte_valid_cont() instead of !pte_cont()?

The existing ptep_clear_flush_young() above uses !pte_valid_cont() to
determine when to take the fast path. The new function only checks
!pte_cont(), which differs when handling non-present PTEs.

Non-present PTEs (device-private, device-exclusive) can reach
clear_flush_young_ptes() through folio_referenced_one()->
clear_flush_young_ptes_notify(). These entries may have bit 52 set as
part of their encoding, but they aren't valid contiguous mappings.

With the current check, wouldn't such entries incorrectly trigger the
contpte path and potentially cause contpte_clear_flush_young_ptes() to
process additional unrelated PTEs beyond the intended single entry?
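
To make the difference between the two predicates concrete, here is a minimal
user-space sketch. It is a model, not kernel code: the bit positions (PTE_VALID
as bit 0, PTE_CONT as bit 52) mirror the arm64 definitions, but the helpers
fast_path_old() and fast_path_fixed() and the sample non-present encoding are
hypothetical names and values chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's pte_t. */
typedef struct { uint64_t pte; } pte_t;

#define PTE_VALID (UINT64_C(1) << 0)   /* entry is a valid mapping */
#define PTE_CONT  (UINT64_C(1) << 52)  /* contiguous hint, meaningful only if valid */

static inline bool pte_valid(pte_t pte) { return (pte.pte & PTE_VALID) != 0; }
static inline bool pte_cont(pte_t pte)  { return (pte.pte & PTE_CONT) != 0; }

/* Only a valid entry can be part of a contiguous mapping. */
static inline bool pte_valid_cont(pte_t pte)
{
	return pte_valid(pte) && pte_cont(pte);
}

/* Hypothetical helper: fast-path decision as written in the patch. */
static inline bool fast_path_old(pte_t pte, unsigned int nr)
{
	return nr == 1 && !pte_cont(pte);
}

/* Hypothetical helper: fast-path decision using pte_valid_cont(). */
static inline bool fast_path_fixed(pte_t pte, unsigned int nr)
{
	return nr == 1 && !pte_valid_cont(pte);
}

/* Walks through the case raised in the review. */
void demo(void)
{
	/* Assumed encoding: a non-present entry that happens to have bit 52 set. */
	pte_t nonpresent = { PTE_CONT | (UINT64_C(0x1234) << 12) };
	/* A genuine contpte entry: valid and contiguous. */
	pte_t contpte = { PTE_VALID | PTE_CONT };

	/* Bit 52 alone sends the non-present entry down the contpte path. */
	assert(!fast_path_old(nonpresent, 1));
	/* Requiring the valid bit keeps it on the single-entry fast path. */
	assert(fast_path_fixed(nonpresent, 1));
	/* A real contpte entry still takes the contpte path either way. */
	assert(!fast_path_old(contpte, 1));
	assert(!fast_path_fixed(contpte, 1));
}
```

In this model the two checks disagree only for entries that have bit 52 set
without the valid bit, which is the non-present case described above.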

> +
> +	return contpte_clear_flush_young_ptes(vma, addr, ptep, nr);
> +}
> +
>  #define wrprotect_ptes wrprotect_ptes
>  static __always_inline void wrprotect_ptes(struct mm_struct *mm,
>  				unsigned long addr, pte_t *ptep, unsigned int nr)
Re: [PATCH v5 4/5] arm64: mm: implement the architecture-specific clear_flush_young_ptes()
Posted by Baolin Wang 1 week, 2 days ago

On 1/28/26 7:47 PM, Chris Mason wrote:
> Baolin Wang <baolin.wang@linux.alibaba.com> wrote:
>> Implement the Arm64 architecture-specific clear_flush_young_ptes() to enable
>> batched checking of young flags and TLB flushing, improving performance during
>> large folio reclamation.
>>
>> Performance testing:
>> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
>> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
>> 33% performance improvement on my Arm64 32-core server (and 10%+ improvement
>> on my X86 machine). Meanwhile, the hotspot folio_check_references() dropped
>> from approximately 35% to around 5%.
> 
> Hi everyone, I ran mm-new through my AI review prompts and this one was
> flagged.  AI review below:
> 
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -1838,6 +1838,17 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>>   	return contpte_clear_flush_young_ptes(vma, addr, ptep, 1);
>>   }
>>
>> +#define clear_flush_young_ptes clear_flush_young_ptes
>> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
>> +					 unsigned long addr, pte_t *ptep,
>> +					 unsigned int nr)
>> +{
>> +	if (likely(nr == 1 && !pte_cont(__ptep_get(ptep))))
>> +		return __ptep_clear_flush_young(vma, addr, ptep);
> 
> Should this be checking !pte_valid_cont() instead of !pte_cont()?
> 
> The existing ptep_clear_flush_young() above uses !pte_valid_cont() to
> determine when to take the fast path. The new function only checks
> !pte_cont(), which differs when handling non-present PTEs.
> 
> Non-present PTEs (device-private, device-exclusive) can reach
> clear_flush_young_ptes() through folio_referenced_one()->
> clear_flush_young_ptes_notify(). These entries may have bit 52 set as
> part of their encoding, but they aren't valid contiguous mappings.
> 
> With the current check, wouldn't such entries incorrectly trigger the
> contpte path and potentially cause contpte_clear_flush_young_ptes() to
> process additional unrelated PTEs beyond the intended single entry?

Indeed. I previously discussed with Ryan whether using pte_cont() was 
enough, and we believed that invalid PTEs wouldn’t have the PTE_CONT bit 
set. But we clearly missed the device-folio cases. Thanks for reporting.

Andrew, could you please squash the following fix into this patch? If 
you prefer a new version, please let me know. Thanks.

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index a17eb8a76788..dc16591c4241 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1843,7 +1843,7 @@ static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
                                          unsigned long addr, pte_t *ptep,
                                          unsigned int nr)
  {
-       if (likely(nr == 1 && !pte_cont(__ptep_get(ptep))))
+       if (likely(nr == 1 && !pte_valid_cont(__ptep_get(ptep))))
                 return __ptep_clear_flush_young(vma, addr, ptep);

         return contpte_clear_flush_young_ptes(vma, addr, ptep, nr);