[PATCH v3 2/2] mm: Optimize mremap() by PTE batching

Dev Jain posted 2 patches 6 months, 3 weeks ago
There is a newer version of this series
[PATCH v3 2/2] mm: Optimize mremap() by PTE batching
Posted by Dev Jain 6 months, 3 weeks ago
Use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes
are painted with the contig bit, then ptep_get() will iterate through all 16
entries to collect a/d bits. Hence this optimization will result in a 16x
reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
will eventually call contpte_try_unfold() on every contig block, thus
flushing the TLB for the complete large folio range. Instead, use
get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only
do them on the starting and ending contig block.

Signed-off-by: Dev Jain <dev.jain@arm.com>
---
 mm/mremap.c | 40 +++++++++++++++++++++++++++++++++-------
 1 file changed, 33 insertions(+), 7 deletions(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 0163e02e5aa8..580b41f8d169 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -170,6 +170,24 @@ static pte_t move_soft_dirty_pte(pte_t pte)
 	return pte;
 }
 
+/* mremap a batch of PTEs mapping the same large folio */
+static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
+		pte_t *ptep, pte_t pte, int max_nr)
+{
+	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
+	struct folio *folio;
+
+	if (max_nr == 1)
+		return 1;
+
+	folio = vm_normal_folio(vma, addr, pte);
+	if (!folio || !folio_test_large(folio))
+		return 1;
+
+	return folio_pte_batch(folio, addr, ptep, pte, max_nr, flags, NULL,
+			       NULL, NULL);
+}
+
 static int move_ptes(struct pagetable_move_control *pmc,
 		unsigned long extent, pmd_t *old_pmd, pmd_t *new_pmd)
 {
@@ -177,7 +195,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
 	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
 	struct mm_struct *mm = vma->vm_mm;
 	pte_t *old_ptep, *new_ptep;
-	pte_t pte;
+	pte_t old_pte, pte;
 	pmd_t dummy_pmdval;
 	spinlock_t *old_ptl, *new_ptl;
 	bool force_flush = false;
@@ -185,6 +203,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
 	unsigned long new_addr = pmc->new_addr;
 	unsigned long old_end = old_addr + extent;
 	unsigned long len = old_end - old_addr;
+	int max_nr_ptes;
+	int nr_ptes;
 	int err = 0;
 
 	/*
@@ -236,12 +256,14 @@ static int move_ptes(struct pagetable_move_control *pmc,
 	flush_tlb_batched_pending(vma->vm_mm);
 	arch_enter_lazy_mmu_mode();
 
-	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
-				   new_ptep++, new_addr += PAGE_SIZE) {
-		if (pte_none(ptep_get(old_ptep)))
+	for (; old_addr < old_end; old_ptep += nr_ptes, old_addr += nr_ptes * PAGE_SIZE,
+		new_ptep += nr_ptes, new_addr += nr_ptes * PAGE_SIZE) {
+		nr_ptes = 1;
+		max_nr_ptes = (old_end - old_addr) >> PAGE_SHIFT;
+		old_pte = ptep_get(old_ptep);
+		if (pte_none(old_pte))
 			continue;
 
-		pte = ptep_get_and_clear(mm, old_addr, old_ptep);
 		/*
 		 * If we are remapping a valid PTE, make sure
 		 * to flush TLB before we drop the PTL for the
@@ -253,8 +275,12 @@ static int move_ptes(struct pagetable_move_control *pmc,
 		 * the TLB entry for the old mapping has been
 		 * flushed.
 		 */
-		if (pte_present(pte))
+		if (pte_present(old_pte)) {
+			nr_ptes = mremap_folio_pte_batch(vma, old_addr, old_ptep,
+							 old_pte, max_nr_ptes);
 			force_flush = true;
+		}
+		pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr_ptes, 0);
 		pte = move_pte(pte, old_addr, new_addr);
 		pte = move_soft_dirty_pte(pte);
 
@@ -267,7 +293,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
 				else if (is_swap_pte(pte))
 					pte = pte_swp_clear_uffd_wp(pte);
 			}
-			set_pte_at(mm, new_addr, new_ptep, pte);
+			set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
 		}
 	}
 
-- 
2.30.2
Re: [PATCH v3 2/2] mm: Optimize mremap() by PTE batching
Posted by Lorenzo Stoakes 6 months, 3 weeks ago
On Tue, May 27, 2025 at 01:20:49PM +0530, Dev Jain wrote:
> Use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes
> are painted with the contig bit, then ptep_get() will iterate through all 16
> entries to collect a/d bits. Hence this optimization will result in a 16x
> reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
> will eventually call contpte_try_unfold() on every contig block, thus
> flushing the TLB for the complete large folio range. Instead, use
> get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only
> do them on the starting and ending contig block.

But you're also making this applicable to non-contpte cases?

See below, but the commit message should clearly point out this is general
for page table split large folios (unless I've missed something of course!
:)

>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
>  mm/mremap.c | 40 +++++++++++++++++++++++++++++++++-------
>  1 file changed, 33 insertions(+), 7 deletions(-)
>
> diff --git a/mm/mremap.c b/mm/mremap.c
> index 0163e02e5aa8..580b41f8d169 100644
> --- a/mm/mremap.c
> +++ b/mm/mremap.c
> @@ -170,6 +170,24 @@ static pte_t move_soft_dirty_pte(pte_t pte)
>  	return pte;
>  }
>
> +/* mremap a batch of PTEs mapping the same large folio */

I think this comment is fairly useless, it basically spells out the function
name.

I'd prefer something like 'determine if a PTE contains physically contiguous
entries which map the same large folio'.

> +static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
> +		pte_t *ptep, pte_t pte, int max_nr)
> +{
> +	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> +	struct folio *folio;
> +
> +	if (max_nr == 1)
> +		return 1;
> +
> +	folio = vm_normal_folio(vma, addr, pte);
> +	if (!folio || !folio_test_large(folio))
> +		return 1;
> +
> +	return folio_pte_batch(folio, addr, ptep, pte, max_nr, flags, NULL,
> +			       NULL, NULL);
> +}

The code is much better however! :)

> +
>  static int move_ptes(struct pagetable_move_control *pmc,
>  		unsigned long extent, pmd_t *old_pmd, pmd_t *new_pmd)
>  {
> @@ -177,7 +195,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
>  	struct mm_struct *mm = vma->vm_mm;
>  	pte_t *old_ptep, *new_ptep;
> -	pte_t pte;
> +	pte_t old_pte, pte;
>  	pmd_t dummy_pmdval;
>  	spinlock_t *old_ptl, *new_ptl;
>  	bool force_flush = false;
> @@ -185,6 +203,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  	unsigned long new_addr = pmc->new_addr;
>  	unsigned long old_end = old_addr + extent;
>  	unsigned long len = old_end - old_addr;
> +	int max_nr_ptes;
> +	int nr_ptes;
>  	int err = 0;
>
>  	/*
> @@ -236,12 +256,14 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  	flush_tlb_batched_pending(vma->vm_mm);
>  	arch_enter_lazy_mmu_mode();
>
> -	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
> -				   new_ptep++, new_addr += PAGE_SIZE) {
> -		if (pte_none(ptep_get(old_ptep)))
> +	for (; old_addr < old_end; old_ptep += nr_ptes, old_addr += nr_ptes * PAGE_SIZE,
> +		new_ptep += nr_ptes, new_addr += nr_ptes * PAGE_SIZE) {
> +		nr_ptes = 1;
> +		max_nr_ptes = (old_end - old_addr) >> PAGE_SHIFT;
> +		old_pte = ptep_get(old_ptep);
> +		if (pte_none(old_pte))
>  			continue;
>
> -		pte = ptep_get_and_clear(mm, old_addr, old_ptep);
>  		/*
>  		 * If we are remapping a valid PTE, make sure
>  		 * to flush TLB before we drop the PTL for the
> @@ -253,8 +275,12 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  		 * the TLB entry for the old mapping has been
>  		 * flushed.
>  		 */
> -		if (pte_present(pte))
> +		if (pte_present(old_pte)) {
> +			nr_ptes = mremap_folio_pte_batch(vma, old_addr, old_ptep,
> +							 old_pte, max_nr_ptes);
>  			force_flush = true;
> +		}
> +		pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr_ptes, 0);

Just to clarify, in the previous revision you said:

"Split THPs won't be batched; you can use pte_batch() (from David's refactoring)
and figure the split THP batch out, but then get_and_clear_full_ptes() will be
gathering a/d bits and smearing them across the batch, which will be incorrect."

But... this will be triggered for page table split large folio no?

So is there something wrong here or not?

>  		pte = move_pte(pte, old_addr, new_addr);
>  		pte = move_soft_dirty_pte(pte);
>
> @@ -267,7 +293,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>  				else if (is_swap_pte(pte))
>  					pte = pte_swp_clear_uffd_wp(pte);
>  			}
> -			set_pte_at(mm, new_addr, new_ptep, pte);
> +			set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);

The code looks much better here after refactoring, however!

>  		}
>  	}
>
> --
> 2.30.2
>
Re: [PATCH v3 2/2] mm: Optimize mremap() by PTE batching
Posted by Dev Jain 6 months, 3 weeks ago
On 27/05/25 4:15 pm, Lorenzo Stoakes wrote:
> On Tue, May 27, 2025 at 01:20:49PM +0530, Dev Jain wrote:
>> Use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes
>> are painted with the contig bit, then ptep_get() will iterate through all 16
>> entries to collect a/d bits. Hence this optimization will result in a 16x
>> reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
>> will eventually call contpte_try_unfold() on every contig block, thus
>> flushing the TLB for the complete large folio range. Instead, use
>> get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only
>> do them on the starting and ending contig block.
> But you're also making this applicable to non-contpte cases?
>
> See below, but the commit message should clearly point out this is general
> for page table split large folios (unless I've missed something of course!
> :)
>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>>   mm/mremap.c | 40 +++++++++++++++++++++++++++++++++-------
>>   1 file changed, 33 insertions(+), 7 deletions(-)
>>
>> diff --git a/mm/mremap.c b/mm/mremap.c
>> index 0163e02e5aa8..580b41f8d169 100644
>> --- a/mm/mremap.c
>> +++ b/mm/mremap.c
>> @@ -170,6 +170,24 @@ static pte_t move_soft_dirty_pte(pte_t pte)
>>   	return pte;
>>   }
>>
>> +/* mremap a batch of PTEs mapping the same large folio */
> I think this comment is fairly useless, it basically spells out the function
> name.
>
> I'd prefer something like 'determine if a PTE contains physically contiguous
> entries which map the same large folio'.

I'd rather prefer dropping the comment altogether, the function is fairly obvious : )


>> +static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
>> +		pte_t *ptep, pte_t pte, int max_nr)
>> +{
>> +	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>> +	struct folio *folio;
>> +
>> +	if (max_nr == 1)
>> +		return 1;
>> +
>> +	folio = vm_normal_folio(vma, addr, pte);
>> +	if (!folio || !folio_test_large(folio))
>> +		return 1;
>> +
>> +	return folio_pte_batch(folio, addr, ptep, pte, max_nr, flags, NULL,
>> +			       NULL, NULL);
>> +}
> The code is much better however! :)
>
>> +
>>   static int move_ptes(struct pagetable_move_control *pmc,
>>   		unsigned long extent, pmd_t *old_pmd, pmd_t *new_pmd)
>>   {
>> @@ -177,7 +195,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>   	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
>>   	struct mm_struct *mm = vma->vm_mm;
>>   	pte_t *old_ptep, *new_ptep;
>> -	pte_t pte;
>> +	pte_t old_pte, pte;
>>   	pmd_t dummy_pmdval;
>>   	spinlock_t *old_ptl, *new_ptl;
>>   	bool force_flush = false;
>> @@ -185,6 +203,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>   	unsigned long new_addr = pmc->new_addr;
>>   	unsigned long old_end = old_addr + extent;
>>   	unsigned long len = old_end - old_addr;
>> +	int max_nr_ptes;
>> +	int nr_ptes;
>>   	int err = 0;
>>
>>   	/*
>> @@ -236,12 +256,14 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>   	flush_tlb_batched_pending(vma->vm_mm);
>>   	arch_enter_lazy_mmu_mode();
>>
>> -	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
>> -				   new_ptep++, new_addr += PAGE_SIZE) {
>> -		if (pte_none(ptep_get(old_ptep)))
>> +	for (; old_addr < old_end; old_ptep += nr_ptes, old_addr += nr_ptes * PAGE_SIZE,
>> +		new_ptep += nr_ptes, new_addr += nr_ptes * PAGE_SIZE) {
>> +		nr_ptes = 1;
>> +		max_nr_ptes = (old_end - old_addr) >> PAGE_SHIFT;
>> +		old_pte = ptep_get(old_ptep);
>> +		if (pte_none(old_pte))
>>   			continue;
>>
>> -		pte = ptep_get_and_clear(mm, old_addr, old_ptep);
>>   		/*
>>   		 * If we are remapping a valid PTE, make sure
>>   		 * to flush TLB before we drop the PTL for the
>> @@ -253,8 +275,12 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>   		 * the TLB entry for the old mapping has been
>>   		 * flushed.
>>   		 */
>> -		if (pte_present(pte))
>> +		if (pte_present(old_pte)) {
>> +			nr_ptes = mremap_folio_pte_batch(vma, old_addr, old_ptep,
>> +							 old_pte, max_nr_ptes);
>>   			force_flush = true;
>> +		}
>> +		pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr_ptes, 0);
> Just to clarify, in the previous revision you said:
>
> "Split THPs won't be batched; you can use pte_batch() (from David's refactoring)
> and figure the split THP batch out, but then get_and_clear_full_ptes() will be
> gathering a/d bits and smearing them across the batch, which will be incorrect."
>
> But... this will be triggered for page table split large folio no?
>
> So is there something wrong here or not?

Since I am using folio_pte_batch (and not the hypothetical pte_batch() I was
saying in the other email), the batch must belong to the same folio. Since split
THP means a small folio, nr_ptes will be 1.



>
>>   		pte = move_pte(pte, old_addr, new_addr);
>>   		pte = move_soft_dirty_pte(pte);
>>
>> @@ -267,7 +293,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>   				else if (is_swap_pte(pte))
>>   					pte = pte_swp_clear_uffd_wp(pte);
>>   			}
>> -			set_pte_at(mm, new_addr, new_ptep, pte);
>> +			set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
> The code looks much better here after refactoring, however!
>
>>   		}
>>   	}
>>
>> --
>> 2.30.2
>>
Re: [PATCH v3 2/2] mm: Optimize mremap() by PTE batching
Posted by Lorenzo Stoakes 6 months, 3 weeks ago
On Tue, May 27, 2025 at 09:52:47PM +0530, Dev Jain wrote:
>
> On 27/05/25 4:15 pm, Lorenzo Stoakes wrote:
> > On Tue, May 27, 2025 at 01:20:49PM +0530, Dev Jain wrote:
> > > Use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes
> > > are painted with the contig bit, then ptep_get() will iterate through all 16
> > > entries to collect a/d bits. Hence this optimization will result in a 16x
> > > reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
> > > will eventually call contpte_try_unfold() on every contig block, thus
> > > flushing the TLB for the complete large folio range. Instead, use
> > > get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only
> > > do them on the starting and ending contig block.
> > But you're also making this applicable to non-contpte cases?
> >
> > See below, but the commit message should clearly point out this is general
> > for page table split large folios (unless I've missed something of course!
> > :)
> >
> > > Signed-off-by: Dev Jain <dev.jain@arm.com>
> > > ---
> > >   mm/mremap.c | 40 +++++++++++++++++++++++++++++++++-------
> > >   1 file changed, 33 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/mm/mremap.c b/mm/mremap.c
> > > index 0163e02e5aa8..580b41f8d169 100644
> > > --- a/mm/mremap.c
> > > +++ b/mm/mremap.c
> > > @@ -170,6 +170,24 @@ static pte_t move_soft_dirty_pte(pte_t pte)
> > >   	return pte;
> > >   }
> > >
> > > +/* mremap a batch of PTEs mapping the same large folio */
> > I think this comment is fairly useless, it basically spells out the function
> > name.
> >
> > I'd prefer something like 'determine if a PTE contains physically contiguous
> > entries which map the same large folio'.
>
> I'd rather prefer dropping the comment altogether, the function is fairly obvious : )

Sure fine.

>
>
> > > +static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
> > > +		pte_t *ptep, pte_t pte, int max_nr)
> > > +{
> > > +	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> > > +	struct folio *folio;
> > > +
> > > +	if (max_nr == 1)
> > > +		return 1;
> > > +
> > > +	folio = vm_normal_folio(vma, addr, pte);
> > > +	if (!folio || !folio_test_large(folio))
> > > +		return 1;
> > > +
> > > +	return folio_pte_batch(folio, addr, ptep, pte, max_nr, flags, NULL,
> > > +			       NULL, NULL);
> > > +}
> > The code is much better however! :)
> >
> > > +
> > >   static int move_ptes(struct pagetable_move_control *pmc,
> > >   		unsigned long extent, pmd_t *old_pmd, pmd_t *new_pmd)
> > >   {
> > > @@ -177,7 +195,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
> > >   	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
> > >   	struct mm_struct *mm = vma->vm_mm;
> > >   	pte_t *old_ptep, *new_ptep;
> > > -	pte_t pte;
> > > +	pte_t old_pte, pte;
> > >   	pmd_t dummy_pmdval;
> > >   	spinlock_t *old_ptl, *new_ptl;
> > >   	bool force_flush = false;
> > > @@ -185,6 +203,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
> > >   	unsigned long new_addr = pmc->new_addr;
> > >   	unsigned long old_end = old_addr + extent;
> > >   	unsigned long len = old_end - old_addr;
> > > +	int max_nr_ptes;
> > > +	int nr_ptes;
> > >   	int err = 0;
> > >
> > >   	/*
> > > @@ -236,12 +256,14 @@ static int move_ptes(struct pagetable_move_control *pmc,
> > >   	flush_tlb_batched_pending(vma->vm_mm);
> > >   	arch_enter_lazy_mmu_mode();
> > >
> > > -	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
> > > -				   new_ptep++, new_addr += PAGE_SIZE) {
> > > -		if (pte_none(ptep_get(old_ptep)))
> > > +	for (; old_addr < old_end; old_ptep += nr_ptes, old_addr += nr_ptes * PAGE_SIZE,
> > > +		new_ptep += nr_ptes, new_addr += nr_ptes * PAGE_SIZE) {
> > > +		nr_ptes = 1;
> > > +		max_nr_ptes = (old_end - old_addr) >> PAGE_SHIFT;
> > > +		old_pte = ptep_get(old_ptep);
> > > +		if (pte_none(old_pte))
> > >   			continue;
> > >
> > > -		pte = ptep_get_and_clear(mm, old_addr, old_ptep);
> > >   		/*
> > >   		 * If we are remapping a valid PTE, make sure
> > >   		 * to flush TLB before we drop the PTL for the
> > > @@ -253,8 +275,12 @@ static int move_ptes(struct pagetable_move_control *pmc,
> > >   		 * the TLB entry for the old mapping has been
> > >   		 * flushed.
> > >   		 */
> > > -		if (pte_present(pte))
> > > +		if (pte_present(old_pte)) {
> > > +			nr_ptes = mremap_folio_pte_batch(vma, old_addr, old_ptep,
> > > +							 old_pte, max_nr_ptes);
> > >   			force_flush = true;
> > > +		}
> > > +		pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr_ptes, 0);
> > Just to clarify, in the previous revision you said:
> >
> > "Split THPs won't be batched; you can use pte_batch() (from David's refactoring)
> > and figure the split THP batch out, but then get_and_clear_full_ptes() will be
> > gathering a/d bits and smearing them across the batch, which will be incorrect."
> >
> > But... this will be triggered for page table split large folio no?
> >
> > So is there something wrong here or not?
>
> Since I am using folio_pte_batch (and not the hypothetical pte_batch() I was
> saying in the other email), the batch must belong to the same folio. Since split
> THP means a small folio, nr_ptes will be 1.

I'm not sure I follow - keep in mind there's two kinds of splitting - folio
splitting and page table splitting.

If I invoke split_huge_pmd(), I end up with a bunch of PTEs mapping the same
large folio. The folio itself is not split, so nr_ptes surely will be equal to
something >1 here right?

I hit this in my MREMAP_RELOCATE_ANON work - where I had to take special care to
differentiate between these cases.

And the comment for folio_pte_batch() states 'Detect a PTE batch: consecutive
(present) PTEs that map consecutive pages of the same large folio.' - so I don't
see why this would not hit this case?

I may be missing something however!
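
Concretely, what I have in mind (a toy sketch with a hypothetical helper name,
just to illustrate the two cases, not something I'm proposing for the patch):

	/*
	 * After split_huge_pmd() the PMD is gone but the folio is still
	 * large, so a batch > 1 is possible; only once the folio itself is
	 * split do we end up with order-0 folios and a batch of 1.
	 */
	static int toy_batch_upper_bound(struct vm_area_struct *vma,
					 unsigned long addr, pte_t *ptep)
	{
		pte_t pte = ptep_get(ptep);
		struct folio *folio = vm_normal_folio(vma, addr, pte);

		if (!folio || !folio_test_large(folio))
			return 1;			/* folio split case */

		return folio_nr_pages(folio);		/* page table split case */
	}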

>
>
>
> >
> > >   		pte = move_pte(pte, old_addr, new_addr);
> > >   		pte = move_soft_dirty_pte(pte);
> > >
> > > @@ -267,7 +293,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
> > >   				else if (is_swap_pte(pte))
> > >   					pte = pte_swp_clear_uffd_wp(pte);
> > >   			}
> > > -			set_pte_at(mm, new_addr, new_ptep, pte);
> > > +			set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
> > The code looks much better here after refactoring, however!
> >
> > >   		}
> > >   	}
> > >
> > > --
> > > 2.30.2
> > >
Re: [PATCH v3 2/2] mm: Optimize mremap() by PTE batching
Posted by Dev Jain 6 months, 3 weeks ago
On 27/05/25 9:59 pm, Lorenzo Stoakes wrote:
> On Tue, May 27, 2025 at 09:52:47PM +0530, Dev Jain wrote:
>> On 27/05/25 4:15 pm, Lorenzo Stoakes wrote:
>>> On Tue, May 27, 2025 at 01:20:49PM +0530, Dev Jain wrote:
>>>> Use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes
>>>> are painted with the contig bit, then ptep_get() will iterate through all 16
>>>> entries to collect a/d bits. Hence this optimization will result in a 16x
>>>> reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
>>>> will eventually call contpte_try_unfold() on every contig block, thus
>>>> flushing the TLB for the complete large folio range. Instead, use
>>>> get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only
>>>> do them on the starting and ending contig block.
>>> But you're also making this applicable to non-contpte cases?
>>>
>>> See below, but the commit message should clearly point out this is general
>>> for page table split large folios (unless I've missed something of course!
>>> :)
>>>
>>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>>> ---
>>>>    mm/mremap.c | 40 +++++++++++++++++++++++++++++++++-------
>>>>    1 file changed, 33 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/mm/mremap.c b/mm/mremap.c
>>>> index 0163e02e5aa8..580b41f8d169 100644
>>>> --- a/mm/mremap.c
>>>> +++ b/mm/mremap.c
>>>> @@ -170,6 +170,24 @@ static pte_t move_soft_dirty_pte(pte_t pte)
>>>>    	return pte;
>>>>    }
>>>>
>>>> +/* mremap a batch of PTEs mapping the same large folio */
>>> I think this comment is fairly useless, it basically spells out the function
>>> name.
>>>
>>> I'd prefer something like 'determine if a PTE contains physically contiguous
>>> entries which map the same large folio'.
>> I'd rather prefer dropping the comment altogether, the function is fairly obvious : )
> Sure fine.
>
>>
>>>> +static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
>>>> +		pte_t *ptep, pte_t pte, int max_nr)
>>>> +{
>>>> +	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
>>>> +	struct folio *folio;
>>>> +
>>>> +	if (max_nr == 1)
>>>> +		return 1;
>>>> +
>>>> +	folio = vm_normal_folio(vma, addr, pte);
>>>> +	if (!folio || !folio_test_large(folio))
>>>> +		return 1;
>>>> +
>>>> +	return folio_pte_batch(folio, addr, ptep, pte, max_nr, flags, NULL,
>>>> +			       NULL, NULL);
>>>> +}
>>> The code is much better however! :)
>>>
>>>> +
>>>>    static int move_ptes(struct pagetable_move_control *pmc,
>>>>    		unsigned long extent, pmd_t *old_pmd, pmd_t *new_pmd)
>>>>    {
>>>> @@ -177,7 +195,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>>>    	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
>>>>    	struct mm_struct *mm = vma->vm_mm;
>>>>    	pte_t *old_ptep, *new_ptep;
>>>> -	pte_t pte;
>>>> +	pte_t old_pte, pte;
>>>>    	pmd_t dummy_pmdval;
>>>>    	spinlock_t *old_ptl, *new_ptl;
>>>>    	bool force_flush = false;
>>>> @@ -185,6 +203,8 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>>>    	unsigned long new_addr = pmc->new_addr;
>>>>    	unsigned long old_end = old_addr + extent;
>>>>    	unsigned long len = old_end - old_addr;
>>>> +	int max_nr_ptes;
>>>> +	int nr_ptes;
>>>>    	int err = 0;
>>>>
>>>>    	/*
>>>> @@ -236,12 +256,14 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>>>    	flush_tlb_batched_pending(vma->vm_mm);
>>>>    	arch_enter_lazy_mmu_mode();
>>>>
>>>> -	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
>>>> -				   new_ptep++, new_addr += PAGE_SIZE) {
>>>> -		if (pte_none(ptep_get(old_ptep)))
>>>> +	for (; old_addr < old_end; old_ptep += nr_ptes, old_addr += nr_ptes * PAGE_SIZE,
>>>> +		new_ptep += nr_ptes, new_addr += nr_ptes * PAGE_SIZE) {
>>>> +		nr_ptes = 1;
>>>> +		max_nr_ptes = (old_end - old_addr) >> PAGE_SHIFT;
>>>> +		old_pte = ptep_get(old_ptep);
>>>> +		if (pte_none(old_pte))
>>>>    			continue;
>>>>
>>>> -		pte = ptep_get_and_clear(mm, old_addr, old_ptep);
>>>>    		/*
>>>>    		 * If we are remapping a valid PTE, make sure
>>>>    		 * to flush TLB before we drop the PTL for the
>>>> @@ -253,8 +275,12 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>>>    		 * the TLB entry for the old mapping has been
>>>>    		 * flushed.
>>>>    		 */
>>>> -		if (pte_present(pte))
>>>> +		if (pte_present(old_pte)) {
>>>> +			nr_ptes = mremap_folio_pte_batch(vma, old_addr, old_ptep,
>>>> +							 old_pte, max_nr_ptes);
>>>>    			force_flush = true;
>>>> +		}
>>>> +		pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr_ptes, 0);
>>> Just to clarify, in the previous revision you said:
>>>
>>> "Split THPs won't be batched; you can use pte_batch() (from David's refactoring)
>>> and figure the split THP batch out, but then get_and_clear_full_ptes() will be
>>> gathering a/d bits and smearing them across the batch, which will be incorrect."
>>>
>>> But... this will be triggered for page table split large folio no?
>>>
>>> So is there something wrong here or not?
>> Since I am using folio_pte_batch (and not the hypothetical pte_batch() I was
>> saying in the other email), the batch must belong to the same folio. Since split
>> THP means a small folio, nr_ptes will be 1.
> I'm not sure I follow - keep in mind there's two kinds of splitting - folio
> splitting and page table splitting.
>
> If I invoke split_huge_pmd(), I end up with a bunch of PTEs mapping the same
> large folio. The folio itself is not split, so nr_ptes surely will be equal to
> something >1 here right?


Thanks for elaborating.

So,

Case 1: folio splitting => nr_ptes = 1 => the question of a/d bit smearing
disappears.

Case 2: page table splitting => consec PTEs point to the same large folio
=> nr_ptes > 1 => get_and_clear_full_ptes() will smear a/d bits on the
new ptes, which is correct because we are still pointing to the same large
folio.
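
To spell the smearing out: the generic (non-contpte) helper clears every PTE of
the batch and folds any young/dirty bit it finds into the single pte it returns.
Roughly (paraphrased from memory of include/linux/pgtable.h, so treat this as a
sketch rather than the exact upstream code):

	static inline pte_t get_and_clear_full_ptes(struct mm_struct *mm,
			unsigned long addr, pte_t *ptep, unsigned int nr, int full)
	{
		pte_t pte, tmp_pte;

		pte = ptep_get_and_clear_full(mm, addr, ptep, full);
		while (--nr) {
			ptep++;
			addr += PAGE_SIZE;
			tmp_pte = ptep_get_and_clear_full(mm, addr, ptep, full);
			/* Fold a/d bits from every entry into the returned pte. */
			if (pte_dirty(tmp_pte))
				pte = pte_mkdirty(pte);
			if (pte_young(tmp_pte))
				pte = pte_mkyoung(pte);
		}
		return pte;
	}

move_ptes() then hands that single pte to set_ptes(), so all new PTEs of the
batch inherit the folio-wide a/d state, which is fine precisely because they
all map the same large folio.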


>
> I hit this in my MREMAP_RELOCATE_ANON work - where I had to take special care to
> differentiate between these cases.
>
> And the comment for folio_pte_batch() states 'Detect a PTE batch: consecutive
> (present) PTEs that map consecutive pages of the same large folio.' - so I don't
> see why this would not hit this case?
>
> I may be missing something however!
>
>>
>>
>>>>    		pte = move_pte(pte, old_addr, new_addr);
>>>>    		pte = move_soft_dirty_pte(pte);
>>>>
>>>> @@ -267,7 +293,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
>>>>    				else if (is_swap_pte(pte))
>>>>    					pte = pte_swp_clear_uffd_wp(pte);
>>>>    			}
>>>> -			set_pte_at(mm, new_addr, new_ptep, pte);
>>>> +			set_ptes(mm, new_addr, new_ptep, pte, nr_ptes);
>>> The code looks much better here after refactoring, however!
>>>
>>>>    		}
>>>>    	}
>>>>
>>>> --
>>>> 2.30.2
>>>>
Re: [PATCH v3 2/2] mm: Optimize mremap() by PTE batching
Posted by Lorenzo Stoakes 6 months, 3 weeks ago
On Tue, May 27, 2025 at 10:08:59PM +0530, Dev Jain wrote:
>
> On 27/05/25 9:59 pm, Lorenzo Stoakes wrote:
[snip]
> > If I invoke split_huge_pmd(), I end up with a bunch of PTEs mapping the same
> > large folio. The folio itself is not split, so nr_ptes surely will be equal to
> > something >1 here right?
>
>
> Thanks for elaborating.
>
> So,
>
> Case 1: folio splitting => nr_ptes = 1 => the question of a/d bit smearing
> disappears.
>
> Case 2: page table splitting => consec PTEs point to the same large folio
> => nr_ptes > 1 => get_and_clear_full_ptes() will smear a/d bits on the
> new ptes, which is correct because we are still pointing to the same large
> folio.
>

OK awesome, I thought as much, just wanted to make sure :) we are good then.

The accessed/dirty bits really matter at a folio granularity (and especially
with respect to reclaim/writeback which both operate at folio level) so the
smearing as you say is fine.

This patch therefore looks fine, only the trivial comment fixup.

I ran the series on my x86-64 setup (fwiw) with no build/mm selftest errors.

Sorry to be a pain but could you respin with the commit message for this patch
updated to explicitly mention that the logic applies for the non-contPTE split
PTE case (and therefore also helps performance there)? That and the trivial
thing of dropping that comment.

Then we should be good for a tag unless somebody else spots something
egregious :)

Thanks for this! Good improvement.

[snip]

Cheers, Lorenzo
Re: [PATCH v3 2/2] mm: Optimize mremap() by PTE batching
Posted by Dev Jain 6 months, 3 weeks ago
On 27/05/25 10:16 pm, Lorenzo Stoakes wrote:
> On Tue, May 27, 2025 at 10:08:59PM +0530, Dev Jain wrote:
>> On 27/05/25 9:59 pm, Lorenzo Stoakes wrote:
> [snip]
>>> If I invoke split_huge_pmd(), I end up with a bunch of PTEs mapping the same
>>> large folio. The folio itself is not split, so nr_ptes surely will be equal to
>>> something >1 here right?
>>
>> Thanks for elaborating.
>>
>> So,
>>
>> Case 1: folio splitting => nr_ptes = 1 => the question of a/d bit smearing
>> disappears.
>>
>> Case 2: page table splitting => consec PTEs point to the same large folio
>> => nr_ptes > 1 => get_and_clear_full_ptes() will smear a/d bits on the
>> new ptes, which is correct because we are still pointing to the same large
>> folio.
>>
> OK awesome, I thought as much, just wanted to make sure :) we are good then.
>
> The accessed/dirty bits really matter at a folio granularity (and especially
> with respect to reclaim/writeback which both operate at folio level) so the
> smearing as you say is fine.
>
> This patch therefore looks fine, only the trivial comment fixup.
>
> I ran the series on my x86-64 setup (fwiw) with no build/mm selftest errors.

Thanks!


>
> Sorry to be a pain but could you respin with the commit message for this patch
> updated to explicitly mention that the logic applies for the non-contPTE split
> PTE case (and therefore also helps performance there)? That and the trivial
> thing of dropping that comment.

What do you mean by the non-contpte case? In that case the PTEs do not point
to the same folio or are misaligned, and there will be no optimization. This
patch is optimizing two things: 1) ptep_get() READ_ONCE accesses 2) reduction
in number of TLBIs for contig blocks, both of which happen in the contpte case.

In general, the patch should have a minor improvement on other arches because
we are detecting a batch and processing it together, thus saving on a few
function calls, but the main benefit is for arm64.
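
For reference, the arm64 side looks roughly like this (a paraphrased sketch of
contpte_ptep_get() in arch/arm64/mm/contpte.c, not the exact code): every
ptep_get() on a contpte-painted PTE has to walk all 16 entries of the contig
block to gather a/d bits, and that per-PTE walk is what the batching amortises.

	pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
	{
		pte_t pte;
		int i;

		/* Start from the first entry of the 16-PTE contig block. */
		ptep = contpte_align_down(ptep);

		for (i = 0; i < CONT_PTES; i++, ptep++) {
			pte = __ptep_get(ptep);

			/* Hardware may have set a/d bits on any entry. */
			if (pte_dirty(pte))
				orig_pte = pte_mkdirty(orig_pte);
			if (pte_young(pte))
				orig_pte = pte_mkyoung(orig_pte);
		}

		return orig_pte;
	}

Doing the above once per batch instead of once per PTE is where the 16x
reduction in ptep_get() work comes from; the TLBI saving comes from
get_and_clear_full_ptes() only unfolding the contig blocks at the two ends of
the range.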

>
> Then we should be good for a tag unless somebody else spots something
> egregious :)
>
> Thanks for this! Good improvement.
>
> [snip]
>
> Cheers, Lorenzo
Re: [PATCH v3 2/2] mm: Optimize mremap() by PTE batching
Posted by Lorenzo Stoakes 6 months, 3 weeks ago
On Wed, May 28, 2025 at 09:02:26AM +0530, Dev Jain wrote:
>
> On 27/05/25 10:16 pm, Lorenzo Stoakes wrote:
> > On Tue, May 27, 2025 at 10:08:59PM +0530, Dev Jain wrote:
> > > On 27/05/25 9:59 pm, Lorenzo Stoakes wrote:
> > [snip]
> > > > If I invoke split_huge_pmd(), I end up with a bunch of PTEs mapping the same
> > > > large folio. The folio itself is not split, so nr_ptes surely will be equal to
> > > > something >1 here right?
> > >
> > > Thanks for elaborating.
> > >
> > > So,
> > >
> > > Case 1: folio splitting => nr_ptes = 1 => the question of a/d bit smearing
> > > disappears.
> > >
> > > Case 2: page table splitting => consec PTEs point to the same large folio
> > > => nr_ptes > 1 => get_and_clear_full_ptes() will smear a/d bits on the
> > > new ptes, which is correct because we are still pointing to the same large
> > > folio.
> > >
> > OK awesome, I thought as much, just wanted to make sure :) we are good then.
> >
> > The accessed/dirty bits really matter at a folio granularity (and especially
> > with respect to reclaim/writeback which both operate at folio level) so the
> > smearing as you say is fine.
> >
> > This patch therefore looks fine, only the trivial comment fixup.
> >
> > I ran the series on my x86-64 setup (fwiw) with no build/mm selftest errors.
>
> Thanks!
>
>
> >
> > Sorry to be a pain but could you respin with the commit message for this patch
> > updated to explicitly mention that the logic applies for the non-contPTE split
> > PTE case (and therefore also helps performance there)? That and the trivial
> > thing of dropping that comment.
>
> What do you mean by the non-contpte case? In that case the PTEs do not point
> to the same folio or are misaligned, and there will be no optimization. This

Split page table large folio.

> patch is optimizing two things: 1) ptep_get() READ_ONCE accesses 2) reduction
> in number of TLBIs for contig blocks, both of which happen in the contpte case.
>

But it impacts split huge pages. Your code changes this behaviour. We need to
make this clear :)

> In general, the patch should have a minor improvement on other arches because
> we are detecting a batch and processing it together, thus saving on a few
> function calls, but the main benefit is for arm64.

Ack, but you are changing this behaviour. The commit message doesn't make this
clear and seems to imply this only impacts contPTE cases. Or at least isn't
clear enough

A simple additional paragraph like:

'Transparent huge pages which have been split into PTEs will also be impacted,
however the performance gain in this case is expected to be modest'

Will sort this out.

Thanks!

>
> >
> > Then we should be good for a tag unless somebody else spots something
> > egregious :)
> >
> > Thanks for this! Good improvement.
> >
> > [snip]
> >
> > Cheers, Lorenzo
Re: [PATCH v3 2/2] mm: Optimize mremap() by PTE batching
Posted by Dev Jain 6 months, 3 weeks ago
On 28/05/25 10:19 am, Lorenzo Stoakes wrote:
> On Wed, May 28, 2025 at 09:02:26AM +0530, Dev Jain wrote:
>> On 27/05/25 10:16 pm, Lorenzo Stoakes wrote:
>>> On Tue, May 27, 2025 at 10:08:59PM +0530, Dev Jain wrote:
>>>> On 27/05/25 9:59 pm, Lorenzo Stoakes wrote:
>>> [snip]
>>>>> If I invoke split_huge_pmd(), I end up with a bunch of PTEs mapping the same
>>>>> large folio. The folio itself is not split, so nr_ptes surely will be equal to
>>>>> something >1 here right?
>>>> Thanks for elaborating.
>>>>
>>>> So,
>>>>
>>>> Case 1: folio splitting => nr_ptes = 1 => the question of a/d bit smearing
>>>> disappears.
>>>>
>>>> Case 2: page table splitting => consec PTEs point to the same large folio
>>>> => nr_ptes > 1 => get_and_clear_full_ptes() will smear a/d bits on the
>>>> new ptes, which is correct because we are still pointing to the same large
>>>> folio.
>>>>
>>> OK awesome, I thought as much, just wanted to make sure :) we are good then.
>>>
>>> The accessed/dirty bits really matter at a folio granularity (and especially
>>> with respect to reclaim/writeback which both operate at folio level) so the
>>> smearing as you say is fine.
>>>
>>> This patch therefore looks fine, only the trivial comment fixup.
>>>
>>> I ran the series on my x86-64 setup (fwiw) with no build/mm selftest errors.
>> Thanks!
>>
>>
>>> Sorry to be a pain but could you respin with the commit message for this patch
>>> updated to explicitly mention that the logic applies for the non-contPTE split
>>> PTE case (and therefore also helps performance there)? That and the trivial
>>> thing of dropping that comment.
>> What do you mean by the non-contpte case? In that case the PTEs do not point
>> to the same folio or are misaligned, and there will be no optimization. This
> Split page table large folio.
>
>> patch is optimizing two things: 1) ptep_get() READ_ONCE accesses 2) reduction
>> in number of TLBIs for contig blocks, both of which happen in the contpte case.
>>
> But it impacts split huge pages. Your code changes this behaviour. We need to
> make this clear :)
>
>> In general, the patch should have a minor improvement on other arches because
>> we are detecting a batch and processing it together, thus saving on a few
>> function calls, but the main benefit is for arm64.
> Ack, but you are changing this behaviour. The commit message doesn't make this
> clear and seems to imply this only impacts contPTE cases. Or at least isn't
> clear enough
>
> A simple additional paragraph like:
>
> 'Transparent huge pages which have been split into PTEs will also be impacted,
> however the performance gain in this case is expected to be modest'

Ah okay, sounds good.

>
> Will sort this out.
>
> Thanks!
>
>>> Then we should be good for a tag unless somebody else spots something
>>> egregious :)
>>>
>>> Thanks for this! Good improvement.
>>>
>>> [snip]
>>>
>>> Cheers, Lorenzo