The PTEs in a batch are all alike w.r.t. belonging to the same type of VMA, and
are either all marked with uffd-wp or all unmarked. Therefore we can batch-set
uffd-wp markers through install_uffd_wp_ptes_if_needed(), and enable
batched unmapping of folios belonging to uffd-wp VMAs by dropping that
condition from folio_unmap_pte_batch().
It may happen that we don't batch over the entire folio in one go, in which
case we must skip over the current batch. Add a helper to do that:
page_vma_mapped_walk_jump() increments the relevant fields of pvmw
by nr pages.
I think that we can get away with just incrementing pvmw->pte
and pvmw->address, since looking at the code in page_vma_mapped.c,
pvmw->pfn and pvmw->nr_pages are used in conjunction, and pvmw->pgoff
and pvmw->nr_pages (in vma_address_end()) are used in conjunction,
cancelling out the increment and decrement in the respective fields. But
let us not rely on the pvmw implementation and keep this simple.
Export this function to rmap.h to enable future reuse.
Signed-off-by: Dev Jain <dev.jain@arm.com>
---
include/linux/rmap.h | 10 ++++++++++
mm/rmap.c | 8 +++-----
2 files changed, 13 insertions(+), 5 deletions(-)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 8dc0871e5f001..1b7720c66ac87 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -892,6 +892,16 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
 	spin_unlock(pvmw->ptl);
 }
 
+static inline void page_vma_mapped_walk_jump(struct page_vma_mapped_walk *pvmw,
+					     unsigned int nr)
+{
+	pvmw->pfn += nr;
+	pvmw->nr_pages -= nr;
+	pvmw->pgoff += nr;
+	pvmw->pte += nr;
+	pvmw->address += nr * PAGE_SIZE;
+}
+
 /**
  * page_vma_mapped_walk_restart - Restart the page table walk.
  * @pvmw: Pointer to struct page_vma_mapped_walk.
diff --git a/mm/rmap.c b/mm/rmap.c
index a7570cd037344..dd638429c963e 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1953,9 +1953,6 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
 	if (pte_unused(pte))
 		return 1;
 
-	if (userfaultfd_wp(vma))
-		return 1;
-
 	/*
 	 * If unmap fails, we need to restore the ptes. To avoid accidentally
 	 * upgrading write permissions for ptes that were not originally
@@ -2235,7 +2232,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 	 * we may want to replace a none pte with a marker pte if
 	 * it's file-backed, so we don't lose the tracking info.
 	 */
-	install_uffd_wp_ptes_if_needed(vma, address, pvmw.pte, pteval, 1);
+	install_uffd_wp_ptes_if_needed(vma, address, pvmw.pte, pteval, nr_pages);
 
 	/* Update high watermark before we lower rss */
 	update_hiwater_rss(mm);
@@ -2359,8 +2356,9 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 	 * If we are sure that we batched the entire folio and cleared
 	 * all PTEs, we can just optimize and stop right here.
 	 */
-	if (nr_pages == folio_nr_pages(folio))
+	if (likely(nr_pages == folio_nr_pages(folio)))
 		goto walk_done;
+	page_vma_mapped_walk_jump(&pvmw, nr_pages - 1);
 	continue;
 walk_abort:
 	ret = false;
--
2.34.1
On Tue, Mar 10, 2026 at 01:00:09PM +0530, Dev Jain wrote:
> The ptes are all the same w.r.t belonging to the same type of VMA, and
> being marked with uffd-wp or all being not marked. Therefore we can batch
> set uffd-wp markers through install_uffd_wp_ptes_if_needed, and enable
> batched unmapping of folios belonging to uffd-wp VMAs by dropping that
> condition from folio_unmap_pte_batch.
>
> It may happen that we don't batch over the entire folio in one go, in which
> case, we must skip over the current batch. Add a helper to do that -
> page_vma_mapped_walk_jump() will increment the relevant fields of pvmw
> by nr pages.
>
> I think that we can get away with just incrementing pvmw->pte
> and pvmw->address, since looking at the code in page_vma_mapped.c,
> pvmw->pfn and pvmw->nr_pages are used in conjunction, and pvmw->pgoff
> and pvmw->nr_pages (in vma_address_end()) are used in conjunction,
> cancelling out the increment and decrement in the respective fields. But
> let us not rely on the pvmw implementation and keep this simple.
This isn't simple...
>
> Export this function to rmap.h to enable future reuse.
>
> Signed-off-by: Dev Jain <dev.jain@arm.com>
> ---
> include/linux/rmap.h | 10 ++++++++++
> mm/rmap.c | 8 +++-----
> 2 files changed, 13 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 8dc0871e5f001..1b7720c66ac87 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -892,6 +892,16 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
> spin_unlock(pvmw->ptl);
> }
>
> +static inline void page_vma_mapped_walk_jump(struct page_vma_mapped_walk *pvmw,
> + unsigned int nr)
unsigned long nr_pages... 'nr' is meaningless and you're mixing + matching types
for no reason.
> +{
> + pvmw->pfn += nr;
> + pvmw->nr_pages -= nr;
> + pvmw->pgoff += nr;
> + pvmw->pte += nr;
> + pvmw->address += nr * PAGE_SIZE;
> +}
I absolutely hate this. It's extremely confusing, especially since you're now
going from looking at 1 page to nr_pages - 1, jump doesn't really mean anything
here, you're losing sight of the batch size and exposing a silly detail to the
caller, and I really don't want to 'export' this at this time.
If we must have this, can you please make it static in rmap.c at least for the
time being.
Or perhaps instead, have a batched variant of page_vma_mapped_walk(), like
page_vma_mapped_walk_batch()?
I think that makes a lot more sense...
I mean I kind of hate the pvmw interface in general, this is a hack to handle
batching clamped on to the side of it, let's figure out how to do this sensibly
and do what's needed rather than adding yet more hacks-on-hacks please.
> +
> /**
> * page_vma_mapped_walk_restart - Restart the page table walk.
> * @pvmw: Pointer to struct page_vma_mapped_walk.
> diff --git a/mm/rmap.c b/mm/rmap.c
> index a7570cd037344..dd638429c963e 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1953,9 +1953,6 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
> if (pte_unused(pte))
> return 1;
>
> - if (userfaultfd_wp(vma))
> - return 1;
> -
> /*
> * If unmap fails, we need to restore the ptes. To avoid accidentally
> * upgrading write permissions for ptes that were not originally
> @@ -2235,7 +2232,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> * we may want to replace a none pte with a marker pte if
> * it's file-backed, so we don't lose the tracking info.
> */
> - install_uffd_wp_ptes_if_needed(vma, address, pvmw.pte, pteval, 1);
> + install_uffd_wp_ptes_if_needed(vma, address, pvmw.pte, pteval, nr_pages);
>
> /* Update high watermark before we lower rss */
> update_hiwater_rss(mm);
> @@ -2359,8 +2356,9 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> * If we are sure that we batched the entire folio and cleared
> * all PTEs, we can just optimize and stop right here.
> */
> - if (nr_pages == folio_nr_pages(folio))
> + if (likely(nr_pages == folio_nr_pages(folio)))
Please don't add random likely()'s based on what you think is likely(). This
kind of thing should only be done based on profiling.
> goto walk_done;
> + page_vma_mapped_walk_jump(&pvmw, nr_pages - 1);
(You're now passing a signed long to an unsigned int...!)
> continue;
> walk_abort:
> ret = false;
> --
> 2.34.1
>
Thanks, Lorenzo
On 10/03/26 2:04 pm, Lorenzo Stoakes (Oracle) wrote:
> On Tue, Mar 10, 2026 at 01:00:09PM +0530, Dev Jain wrote:
>> The ptes are all the same w.r.t belonging to the same type of VMA, and
>> being marked with uffd-wp or all being not marked. Therefore we can batch
>> set uffd-wp markers through install_uffd_wp_ptes_if_needed, and enable
>> batched unmapping of folios belonging to uffd-wp VMAs by dropping that
>> condition from folio_unmap_pte_batch.
>>
>> It may happen that we don't batch over the entire folio in one go, in which
>> case, we must skip over the current batch. Add a helper to do that -
>> page_vma_mapped_walk_jump() will increment the relevant fields of pvmw
>> by nr pages.
>>
>> I think that we can get away with just incrementing pvmw->pte
>> and pvmw->address, since looking at the code in page_vma_mapped.c,
>> pvmw->pfn and pvmw->nr_pages are used in conjunction, and pvmw->pgoff
>> and pvmw->nr_pages (in vma_address_end()) are used in conjunction,
>> cancelling out the increment and decrement in the respective fields. But
>> let us not rely on the pvmw implementation and keep this simple.
>
> This isn't simple...
>
>>
>> Export this function to rmap.h to enable future reuse.
>>
>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>> ---
>> include/linux/rmap.h | 10 ++++++++++
>> mm/rmap.c | 8 +++-----
>> 2 files changed, 13 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>> index 8dc0871e5f001..1b7720c66ac87 100644
>> --- a/include/linux/rmap.h
>> +++ b/include/linux/rmap.h
>> @@ -892,6 +892,16 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
>> spin_unlock(pvmw->ptl);
>> }
>>
>> +static inline void page_vma_mapped_walk_jump(struct page_vma_mapped_walk *pvmw,
>> + unsigned int nr)
>
> unsigned long nr_pages... 'nr' is meaningless and you're mixing + matching types
> for no reason.
>
>> +{
>> + pvmw->pfn += nr;
>> + pvmw->nr_pages -= nr;
>> + pvmw->pgoff += nr;
>> + pvmw->pte += nr;
>> + pvmw->address += nr * PAGE_SIZE;
>> +}
>
> I absolutely hate this. It's extremely confusing, especially since you're now
> going from looking at 1 page to nr_pages - 1, jump doesn't really mean anything
> here, you're losing sight of the batch size and exposing a silly detail to the
> caller, and I really don't want to 'export' this at this time.
>
> If we must have this, can you please make it static in rmap.c at least for the
> time being.
>
> Or perhaps instead, have a batched variant of page_vma_mapped_walk(), like
> page_vma_mapped_walk_batch()?
>
> I think that makes a lot more sense...
>
> I mean I kind of hate the pvmw interface in general, this is a hack to handle
> batching clamped on to the side of it, let's figure out how to do this sensibly
> and do what's needed rather than adding yet more hacks-on-hacks please.
>
>> +
>> /**
>> * page_vma_mapped_walk_restart - Restart the page table walk.
>> * @pvmw: Pointer to struct page_vma_mapped_walk.
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index a7570cd037344..dd638429c963e 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1953,9 +1953,6 @@ static inline unsigned int folio_unmap_pte_batch(struct folio *folio,
>> if (pte_unused(pte))
>> return 1;
>>
>> - if (userfaultfd_wp(vma))
>> - return 1;
>> -
>> /*
>> * If unmap fails, we need to restore the ptes. To avoid accidentally
>> * upgrading write permissions for ptes that were not originally
>> @@ -2235,7 +2232,7 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>> * we may want to replace a none pte with a marker pte if
>> * it's file-backed, so we don't lose the tracking info.
>> */
>> - install_uffd_wp_ptes_if_needed(vma, address, pvmw.pte, pteval, 1);
>> + install_uffd_wp_ptes_if_needed(vma, address, pvmw.pte, pteval, nr_pages);
>>
>> /* Update high watermark before we lower rss */
>> update_hiwater_rss(mm);
>> @@ -2359,8 +2356,9 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>> * If we are sure that we batched the entire folio and cleared
>> * all PTEs, we can just optimize and stop right here.
>> */
>> - if (nr_pages == folio_nr_pages(folio))
>> + if (likely(nr_pages == folio_nr_pages(folio)))
>
> Please don't add random likely()'s based on what you think is likely(). This
> kind of thing should only be done based on profiling.
Okay.
>
>> goto walk_done;
>> + page_vma_mapped_walk_jump(&pvmw, nr_pages - 1);
>
> (You're now passing a signed long to an unsigned int...!)
Will fix all instances of nr_pages to unsigned long.
>
>
>> continue;
>> walk_abort:
>> ret = false;
>> --
>> 2.34.1
>>
>
> Thanks, Lorenzo
On Tue, Mar 10, 2026 at 4:34 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
>
> On Tue, Mar 10, 2026 at 01:00:09PM +0530, Dev Jain wrote:
> > The ptes are all the same w.r.t belonging to the same type of VMA, and
> > being marked with uffd-wp or all being not marked. Therefore we can batch
> > set uffd-wp markers through install_uffd_wp_ptes_if_needed, and enable
> > batched unmapping of folios belonging to uffd-wp VMAs by dropping that
> > condition from folio_unmap_pte_batch.
> >
> > It may happen that we don't batch over the entire folio in one go, in which
> > case, we must skip over the current batch. Add a helper to do that -
> > page_vma_mapped_walk_jump() will increment the relevant fields of pvmw
> > by nr pages.
> >
> > I think that we can get away with just incrementing pvmw->pte
> > and pvmw->address, since looking at the code in page_vma_mapped.c,
> > pvmw->pfn and pvmw->nr_pages are used in conjunction, and pvmw->pgoff
> > and pvmw->nr_pages (in vma_address_end()) are used in conjunction,
> > cancelling out the increment and decrement in the respective fields. But
> > let us not rely on the pvmw implementation and keep this simple.
>
> This isn't simple...
>
> >
> > Export this function to rmap.h to enable future reuse.
> >
> > Signed-off-by: Dev Jain <dev.jain@arm.com>
> > ---
> > include/linux/rmap.h | 10 ++++++++++
> > mm/rmap.c | 8 +++-----
> > 2 files changed, 13 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> > index 8dc0871e5f001..1b7720c66ac87 100644
> > --- a/include/linux/rmap.h
> > +++ b/include/linux/rmap.h
> > @@ -892,6 +892,16 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
> > spin_unlock(pvmw->ptl);
> > }
> >
> > +static inline void page_vma_mapped_walk_jump(struct page_vma_mapped_walk *pvmw,
> > + unsigned int nr)
>
> unsigned long nr_pages... 'nr' is meaningless and you're mixing + matching types
> for no reason.
>
> > +{
> > + pvmw->pfn += nr;
> > + pvmw->nr_pages -= nr;
> > + pvmw->pgoff += nr;
> > + pvmw->pte += nr;
> > + pvmw->address += nr * PAGE_SIZE;
> > +}
>
> I absolutely hate this. It's extremely confusing, especially since you're now
> going from looking at 1 page to nr_pages - 1, jump doesn't really mean anything
> here, you're losing sight of the batch size and exposing a silly detail to the
> caller, and I really don't want to 'export' this at this time.
I’m fairly sure I raised the same concern when Dev first suggested this,
but somehow it seems my comment was completely overlooked. :-)
>
> If we must have this, can you please make it static in rmap.c at least for the
> time being.
>
> Or perhaps instead, have a batched variant of page_vma_mapped_walk(), like
> page_vma_mapped_walk_batch()?
Right now, for non-anon pages we face the same issues, but
page_vma_mapped_walk() can skip those PTEs once it finds that
nr - 1 PTEs are none.
next_pte:
	do {
		pvmw->address += PAGE_SIZE;
		if (pvmw->address >= end)
			return not_found(pvmw);
		/* Did we cross page table boundary? */
		if ((pvmw->address & (PMD_SIZE - PAGE_SIZE)) == 0) {
			if (pvmw->ptl) {
				spin_unlock(pvmw->ptl);
				pvmw->ptl = NULL;
			}
			pte_unmap(pvmw->pte);
			pvmw->pte = NULL;
			pvmw->flags |= PVMW_PGTABLE_CROSSED;
			goto restart;
		}
		pvmw->pte++;
	} while (pte_none(ptep_get(pvmw->pte)));
The difference now is that swap entries cannot be skipped.
If we're trying to find `page_vma_mapped_walk_batch()`, I suppose
it could be like this?
bool page_vma_mapped_walk_batch(struct page_vma_mapped_walk *pvmw,
				unsigned long nr)
{
	...
}

static inline bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
{
	return page_vma_mapped_walk_batch(pvmw, 1);
}
Thanks
barry
On Wed, Mar 11, 2026 at 7:32 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Mar 10, 2026 at 4:34 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
> >
> > On Tue, Mar 10, 2026 at 01:00:09PM +0530, Dev Jain wrote:
> > > The ptes are all the same w.r.t belonging to the same type of VMA, and
> > > being marked with uffd-wp or all being not marked. Therefore we can batch
> > > set uffd-wp markers through install_uffd_wp_ptes_if_needed, and enable
> > > batched unmapping of folios belonging to uffd-wp VMAs by dropping that
> > > condition from folio_unmap_pte_batch.
> > >
> > > It may happen that we don't batch over the entire folio in one go, in which
> > > case, we must skip over the current batch. Add a helper to do that -
> > > page_vma_mapped_walk_jump() will increment the relevant fields of pvmw
> > > by nr pages.
> > >
> > > I think that we can get away with just incrementing pvmw->pte
> > > and pvmw->address, since looking at the code in page_vma_mapped.c,
> > > pvmw->pfn and pvmw->nr_pages are used in conjunction, and pvmw->pgoff
> > > and pvmw->nr_pages (in vma_address_end()) are used in conjunction,
> > > cancelling out the increment and decrement in the respective fields. But
> > > let us not rely on the pvmw implementation and keep this simple.
> >
> > This isn't simple...
> >
> > >
> > > Export this function to rmap.h to enable future reuse.
> > >
> > > Signed-off-by: Dev Jain <dev.jain@arm.com>
> > > ---
> > > include/linux/rmap.h | 10 ++++++++++
> > > mm/rmap.c | 8 +++-----
> > > 2 files changed, 13 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> > > index 8dc0871e5f001..1b7720c66ac87 100644
> > > --- a/include/linux/rmap.h
> > > +++ b/include/linux/rmap.h
> > > @@ -892,6 +892,16 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
> > > spin_unlock(pvmw->ptl);
> > > }
> > >
> > > +static inline void page_vma_mapped_walk_jump(struct page_vma_mapped_walk *pvmw,
> > > + unsigned int nr)
> >
> > unsigned long nr_pages... 'nr' is meaningless and you're mixing + matching types
> > for no reason.
> >
> > > +{
> > > + pvmw->pfn += nr;
> > > + pvmw->nr_pages -= nr;
> > > + pvmw->pgoff += nr;
> > > + pvmw->pte += nr;
> > > + pvmw->address += nr * PAGE_SIZE;
> > > +}
> >
> > I absolutely hate this. It's extremely confusing, especially since you're now
> > going from looking at 1 page to nr_pages - 1, jump doesn't really mean anything
> > here, you're losing sight of the batch size and exposing a silly detail to the
> > caller, and I really don't want to 'export' this at this time.
>
> I’m fairly sure I raised the same concern when Dev first suggested this,
> but somehow it seems my comment was completely overlooked. :-)
>
> >
> > If we must have this, can you please make it static in rmap.c at least for the
> > time being.
> >
> > Or perhaps instead, have a batched variant of page_vma_mapped_walk(), like
> > page_vma_mapped_walk_batch()?
>
> Right now, for non-anon pages we face the same issues, but
> page_vma_mapped_walk() can skip those PTEs once it finds that
> nr - 1 PTEs are none.
>
> next_pte:
> 	do {
> 		pvmw->address += PAGE_SIZE;
> 		if (pvmw->address >= end)
> 			return not_found(pvmw);
> 		/* Did we cross page table boundary? */
> 		if ((pvmw->address & (PMD_SIZE - PAGE_SIZE)) == 0) {
> 			if (pvmw->ptl) {
> 				spin_unlock(pvmw->ptl);
> 				pvmw->ptl = NULL;
> 			}
> 			pte_unmap(pvmw->pte);
> 			pvmw->pte = NULL;
> 			pvmw->flags |= PVMW_PGTABLE_CROSSED;
> 			goto restart;
> 		}
> 		pvmw->pte++;
> 	} while (pte_none(ptep_get(pvmw->pte)));
>
> The difference now is that swap entries cannot be skipped.
>
> If we're trying to find `page_vma_mapped_walk_batch()`, I suppose
> it could be like this?
>
> bool page_vma_mapped_walk_batch(struct page_vma_mapped_walk *pvmw,
> 				unsigned long nr)
> {
> 	...
> }
>
> static inline bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> {
> 	return page_vma_mapped_walk_batch(pvmw, 1);
> }
Another approach might be to introduce a flag so that
page_vma_mapped_walk() knows we are doing batched unmaps
and can skip nr - 1 swap entries.
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 8dc0871e5f00..bf03ae006366 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -856,6 +856,9 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
 /* Look for migration entries rather than present PTEs */
 #define PVMW_MIGRATION (1 << 1)
 
+/* Batched unmap: skip swap entries. */
+#define PVMW_BATCH_UNMAP (1 << 2)
+
 /* Result flags */
 
 /* The page is mapped across page table boundary */
Thanks
Barry
On 11/03/26 9:44 am, Barry Song wrote:
> On Wed, Mar 11, 2026 at 7:32 AM Barry Song <21cnbao@gmail.com> wrote:
>>
>> On Tue, Mar 10, 2026 at 4:34 PM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
>>>
>>> On Tue, Mar 10, 2026 at 01:00:09PM +0530, Dev Jain wrote:
>>>> The ptes are all the same w.r.t belonging to the same type of VMA, and
>>>> being marked with uffd-wp or all being not marked. Therefore we can batch
>>>> set uffd-wp markers through install_uffd_wp_ptes_if_needed, and enable
>>>> batched unmapping of folios belonging to uffd-wp VMAs by dropping that
>>>> condition from folio_unmap_pte_batch.
>>>>
>>>> It may happen that we don't batch over the entire folio in one go, in which
>>>> case, we must skip over the current batch. Add a helper to do that -
>>>> page_vma_mapped_walk_jump() will increment the relevant fields of pvmw
>>>> by nr pages.
>>>>
>>>> I think that we can get away with just incrementing pvmw->pte
>>>> and pvmw->address, since looking at the code in page_vma_mapped.c,
>>>> pvmw->pfn and pvmw->nr_pages are used in conjunction, and pvmw->pgoff
>>>> and pvmw->nr_pages (in vma_address_end()) are used in conjunction,
>>>> cancelling out the increment and decrement in the respective fields. But
>>>> let us not rely on the pvmw implementation and keep this simple.
>>>
>>> This isn't simple...
>>>
>>>>
>>>> Export this function to rmap.h to enable future reuse.
>>>>
>>>> Signed-off-by: Dev Jain <dev.jain@arm.com>
>>>> ---
>>>> include/linux/rmap.h | 10 ++++++++++
>>>> mm/rmap.c | 8 +++-----
>>>> 2 files changed, 13 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>>>> index 8dc0871e5f001..1b7720c66ac87 100644
>>>> --- a/include/linux/rmap.h
>>>> +++ b/include/linux/rmap.h
>>>> @@ -892,6 +892,16 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
>>>> spin_unlock(pvmw->ptl);
>>>> }
>>>>
>>>> +static inline void page_vma_mapped_walk_jump(struct page_vma_mapped_walk *pvmw,
>>>> + unsigned int nr)
>>>
>>> unsigned long nr_pages... 'nr' is meaningless and you're mixing + matching types
>>> for no reason.
>>>
>>>> +{
>>>> + pvmw->pfn += nr;
>>>> + pvmw->nr_pages -= nr;
>>>> + pvmw->pgoff += nr;
>>>> + pvmw->pte += nr;
>>>> + pvmw->address += nr * PAGE_SIZE;
>>>> +}
>>>
>>> I absolutely hate this. It's extremely confusing, especially since you're now
>>> going from looking at 1 page to nr_pages - 1, jump doesn't really mean anything
>>> here, you're losing sight of the batch size and exposing a silly detail to the
>>> caller, and I really don't want to 'export' this at this time.
>>
>> I’m fairly sure I raised the same concern when Dev first suggested this,
>> but somehow it seems my comment was completely overlooked. :-)
I do recall, perhaps I was lazy to look at the pvmw code :) But I should
have just looked at this earlier, it's fairly simple. See below.
>>
>>>
>>> If we must have this, can you please make it static in rmap.c at least for the
>>> time being.
>>>
>>> Or perhaps instead, have a batched variant of page_vma_mapped_walk(), like
>>> page_vma_mapped_walk_batch()?
>>
>> Right now, for non-anon pages we face the same issues, but
>> page_vma_mapped_walk() can skip those PTEs once it finds that
>> nr - 1 PTEs are none.
>>
>> next_pte:
>> 	do {
>> 		pvmw->address += PAGE_SIZE;
>> 		if (pvmw->address >= end)
>> 			return not_found(pvmw);
>> 		/* Did we cross page table boundary? */
>> 		if ((pvmw->address & (PMD_SIZE - PAGE_SIZE)) == 0) {
>> 			if (pvmw->ptl) {
>> 				spin_unlock(pvmw->ptl);
>> 				pvmw->ptl = NULL;
>> 			}
>> 			pte_unmap(pvmw->pte);
>> 			pvmw->pte = NULL;
>> 			pvmw->flags |= PVMW_PGTABLE_CROSSED;
>> 			goto restart;
>> 		}
>> 		pvmw->pte++;
>> 	} while (pte_none(ptep_get(pvmw->pte)));
>>
>> The difference now is that swap entries cannot be skipped.
>>
>> If we're trying to find `page_vma_mapped_walk_batch()`, I suppose
>> it could be like this?
>>
>> bool page_vma_mapped_walk_batch(struct page_vma_mapped_walk *pvmw,
>> 				unsigned long nr)
>> {
>> 	...
>> }
>>
>> static inline bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>> {
>> 	return page_vma_mapped_walk_batch(pvmw, 1);
>> }
>
> Another approach might be to introduce a flag so that
> page_vma_mapped_walk() knows we are doing batched unmaps
> and can skip nr - 1 swap entries.
>
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 8dc0871e5f00..bf03ae006366 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -856,6 +856,9 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
>  /* Look for migration entries rather than present PTEs */
>  #define PVMW_MIGRATION (1 << 1)
> 
> +/* Batched unmap: skip swap entries. */
> +#define PVMW_BATCH_UNMAP (1 << 2)
Exactly, I just also came up with the same solution and saw your reply :)
We can just name this PVMW_BATCH_PRESENT, with the comment saying
"Look for present entries", and fix the comment above PVMW_MIGRATION to
drop the "rather than present PTEs" part, because that is wrong: by default
we also look for softleaf entries.
> +
> /* Result flags */
>
> /* The page is mapped across page table boundary */
>
>
> Thanks
> Barry