Now that we have established the VM_MAYBE_GUARD flag and added the capacity
to set it atomically, do so upon MADV_GUARD_INSTALL.
The places where this flag is currently used, and where it matters, are:
* VMA merge - performed under mmap/VMA write lock, therefore excluding
racing writes.
* /proc/$pid/smaps - can race the write; however, this isn't meaningful
as the flag write is performed at the point of the guard region being
established, and thus an smaps reader can't reasonably expect to avoid
races. Due to atomicity, a reader will observe either the flag being
set or not. Therefore consistency will be maintained.
In all other cases the flag being set is irrelevant and atomicity
guarantees other flags will be read correctly.
Note that non-atomic updates of unrelated flags do not cause an issue with
this flag being set atomically, as writes of other flags are performed
under mmap/VMA write lock, and these atomic writes are performed under
mmap/VMA read lock, which excludes the write, avoiding RMW races.
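Schematically, the atomic setter amounts to something like the following
(a sketch only - the real helper is introduced earlier in this series, and
the flags field name here is illustrative):

static inline void vma_flag_set_atomic(struct vm_area_struct *vma, int bit)
{
        /*
         * set_bit() is an atomic RMW on the flags word, so it cannot lose
         * a concurrent atomic update of another bit. Non-atomic flag
         * writers hold the mmap/VMA write lock, which this read-locked
         * caller excludes, so no torn RMW can occur either.
         */
        set_bit(bit, &vma->__vm_flags);
}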
Note that we do not encounter issues with KCSAN by adjusting this flag
atomically, as we are only updating a single bit in the flag bitmap and
therefore we do not need to annotate these changes.
We intentionally set this flag in advance of actually updating the page
tables, to ensure that any racing atomic read of this flag will only
return false prior to page tables being updated, to allow for
serialisation via page table locks.
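In outline (a simplified sketch of the madvise.c change below - the marker
installation helper name is illustrative):

        /* Publish the flag before any page table modification... */
        vma_flag_set_atomic(vma, VM_MAYBE_GUARD_BIT);

        /*
         * ...then install the guard markers; the PTE walk takes the page
         * table lock, so anyone who later takes the same PTL and re-checks
         * the flag cannot miss it while markers may be present.
         */
        err = install_guard_markers(vma, start, end); /* illustrative name */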
Note that we set vma->anon_vma for anonymous mappings. This is because
the expectation for anonymous mappings is that an anon_vma is established
should they possess any page table mappings. This is also consistent with
what we were doing prior to this patch (unconditionally setting anon_vma
on guard region installation).
We also need to update retract_page_tables() to ensure that madvise(...,
MADV_COLLAPSE) doesn't incorrectly collapse file-backed ranges containing
guard regions.
This was previously guarded by anon_vma being set to catch MAP_PRIVATE
cases, but the introduction of VM_MAYBE_GUARD necessitates that we check
this flag instead.
We utilise vma_flag_test_atomic() to do so - we first perform an
optimistic check, then, once the PTE page table lock is held, we can check
again safely, as upon guard marker install the flag is set atomically
prior to the page table lock being taken to actually apply it.
So if the initial check does not observe the flag being set, then either:
* Page table retraction acquires page table lock prior to VM_MAYBE_GUARD
being set - guard marker installation will be blocked until page table
retraction is complete.
OR:
* Guard marker installation acquires page table lock after setting
VM_MAYBE_GUARD, which raced and didn't pick this up in the initial
optimistic check, blocking page table retraction until the guard regions
are installed - the second VM_MAYBE_GUARD check will prevent page table
retraction.
Either way we're safe.
We refactor the retraction checks into a single
file_backed_vma_is_retractable(); there doesn't seem to be any reason for
the checks to have been separated as before.
Note that VM_MAYBE_GUARD being set atomically remains correct as
vma_needs_copy() is invoked with the mmap and VMA write locks held,
excluding any race with madvise_guard_install().
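For reference, the fork-side consumer would look something along these lines
(a sketch based on the current vma_needs_copy() - the actual VM_MAYBE_GUARD
wiring happens in another patch in this series):

static bool
vma_needs_copy(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
{
        /* uffd-wp ptes/markers must be preserved across fork. */
        if (userfaultfd_wp(dst_vma))
                return true;

        if (src_vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
                return true;

        /*
         * Guard regions must be copied. A plain read is safe here, as fork
         * holds the mmap/VMA write locks, excluding the atomic setters.
         */
        if (src_vma->vm_flags & VM_MAYBE_GUARD)
                return true;

        if (src_vma->anon_vma)
                return true;

        return false;
}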
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
---
mm/khugepaged.c | 71 ++++++++++++++++++++++++++++++++-----------------
mm/madvise.c | 22 +++++++++------
2 files changed, 61 insertions(+), 32 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index f6ed1072ed6e..af1c162c9a94 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1710,6 +1710,43 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
return result;
}
+/* Can we retract page tables for this file-backed VMA? */
+static bool file_backed_vma_is_retractable(struct vm_area_struct *vma)
+{
+ /*
+ * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
+ * got written to. These VMAs are likely not worth removing
+ * page tables from, as PMD-mapping is likely to be split later.
+ */
+ if (READ_ONCE(vma->anon_vma))
+ return false;
+
+ /*
+ * When a vma is registered with uffd-wp, we cannot recycle
+ * the page table because there may be pte markers installed.
+ * Other vmas can still have the same file mapped hugely, but
+ * skip this one: it will always be mapped in small page size
+ * for uffd-wp registered ranges.
+ */
+ if (userfaultfd_wp(vma))
+ return false;
+
+ /*
+ * If the VMA contains guard regions then we can't collapse it.
+ *
+ * This is set atomically on guard marker installation under mmap/VMA
+ * read lock, and here we may not hold any VMA or mmap lock at all.
+ *
+ * This is therefore serialised on the PTE page table lock, which is
+ * obtained on guard region installation after the flag is set, so this
+ * check being performed under this lock excludes races.
+ */
+ if (vma_flag_test_atomic(vma, VM_MAYBE_GUARD_BIT))
+ return false;
+
+ return true;
+}
+
static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
{
struct vm_area_struct *vma;
@@ -1724,14 +1761,6 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
spinlock_t *ptl;
bool success = false;
- /*
- * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
- * got written to. These VMAs are likely not worth removing
- * page tables from, as PMD-mapping is likely to be split later.
- */
- if (READ_ONCE(vma->anon_vma))
- continue;
-
addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
if (addr & ~HPAGE_PMD_MASK ||
vma->vm_end < addr + HPAGE_PMD_SIZE)
@@ -1743,14 +1772,8 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
if (hpage_collapse_test_exit(mm))
continue;
- /*
- * When a vma is registered with uffd-wp, we cannot recycle
- * the page table because there may be pte markers installed.
- * Other vmas can still have the same file mapped hugely, but
- * skip this one: it will always be mapped in small page size
- * for uffd-wp registered ranges.
- */
- if (userfaultfd_wp(vma))
+
+ if (!file_backed_vma_is_retractable(vma))
continue;
/* PTEs were notified when unmapped; but now for the PMD? */
@@ -1777,15 +1800,15 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
/*
- * Huge page lock is still held, so normally the page table
- * must remain empty; and we have already skipped anon_vma
- * and userfaultfd_wp() vmas. But since the mmap_lock is not
- * held, it is still possible for a racing userfaultfd_ioctl()
- * to have inserted ptes or markers. Now that we hold ptlock,
- * repeating the anon_vma check protects from one category,
- * and repeating the userfaultfd_wp() check from another.
+ * Huge page lock is still held, so normally the page table must
+ * remain empty; and we have already skipped anon_vma and
+ * userfaultfd_wp() vmas. But since the mmap_lock is not held,
+ * it is still possible for a racing userfaultfd_ioctl() or
+ * madvise() to have inserted ptes or markers. Now that we hold
+ * ptlock, repeating the retractable checks protects us from
+ * races against the prior checks.
*/
- if (likely(!vma->anon_vma && !userfaultfd_wp(vma))) {
+ if (likely(file_backed_vma_is_retractable(vma))) {
pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
pmdp_get_lockless_sync();
success = true;
diff --git a/mm/madvise.c b/mm/madvise.c
index 0b3280752bfb..5dbe40be7c65 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -1141,15 +1141,21 @@ static long madvise_guard_install(struct madvise_behavior *madv_behavior)
return -EINVAL;
/*
- * If we install guard markers, then the range is no longer
- * empty from a page table perspective and therefore it's
- * appropriate to have an anon_vma.
- *
- * This ensures that on fork, we copy page tables correctly.
+ * Set atomically under read lock. All pertinent readers will need to
+ * acquire an mmap/VMA write lock to read it. All remaining readers may
+ * or may not see the flag set, but we don't care.
+ */
+ vma_flag_set_atomic(vma, VM_MAYBE_GUARD_BIT);
+
+ /*
+ * If anonymous and we are establishing page tables the VMA ought to
+ * have an anon_vma associated with it.
*/
- err = anon_vma_prepare(vma);
- if (err)
- return err;
+ if (vma_is_anonymous(vma)) {
+ err = anon_vma_prepare(vma);
+ if (err)
+ return err;
+ }
/*
* Optimistically try to install the guard marker pages first. If any
--
2.51.2
>
> +/* Can we retract page tables for this file-backed VMA? */
> +static bool file_backed_vma_is_retractable(struct vm_area_struct *vma)
It's not really the VMA that is retractable :)
Given that the function this is called from is called
"retract_page_tables" (and not file_backed_...) I guess I would just
have called this
"page_tables_are_retractable"
"page_tables_support_retract"
Or sth. along those lines.
> +{
> + /*
> + * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
> + * got written to. These VMAs are likely not worth removing
> + * page tables from, as PMD-mapping is likely to be split later.
> + */
> + if (READ_ONCE(vma->anon_vma))
> + return false;
> +
> + /*
> + * When a vma is registered with uffd-wp, we cannot recycle
> + * the page table because there may be pte markers installed.
> + * Other vmas can still have the same file mapped hugely, but
> + * skip this one: it will always be mapped in small page size
> + * for uffd-wp registered ranges.
> + */
> + if (userfaultfd_wp(vma))
> + return false;
> +
> + /*
> + * If the VMA contains guard regions then we can't collapse it.
> + *
> + * This is set atomically on guard marker installation under mmap/VMA
> + * read lock, and here we may not hold any VMA or mmap lock at all.
> + *
> + * This is therefore serialised on the PTE page table lock, which is
> + * obtained on guard region installation after the flag is set, so this
> + * check being performed under this lock excludes races.
> + */
> + if (vma_flag_test_atomic(vma, VM_MAYBE_GUARD_BIT))
> + return false;
> +
> + return true;
> +}
> +
> static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> {
> struct vm_area_struct *vma;
> @@ -1724,14 +1761,6 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> spinlock_t *ptl;
> bool success = false;
>
> - /*
> - * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
> - * got written to. These VMAs are likely not worth removing
> - * page tables from, as PMD-mapping is likely to be split later.
> - */
> - if (READ_ONCE(vma->anon_vma))
> - continue;
> -
> addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> if (addr & ~HPAGE_PMD_MASK ||
> vma->vm_end < addr + HPAGE_PMD_SIZE)
> @@ -1743,14 +1772,8 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
>
> if (hpage_collapse_test_exit(mm))
> continue;
> - /*
> - * When a vma is registered with uffd-wp, we cannot recycle
> - * the page table because there may be pte markers installed.
> - * Other vmas can still have the same file mapped hugely, but
> - * skip this one: it will always be mapped in small page size
> - * for uffd-wp registered ranges.
> - */
> - if (userfaultfd_wp(vma))
> +
> + if (!file_backed_vma_is_retractable(vma))
> continue;
>
> /* PTEs were notified when unmapped; but now for the PMD? */
> @@ -1777,15 +1800,15 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>
> /*
> - * Huge page lock is still held, so normally the page table
> - * must remain empty; and we have already skipped anon_vma
> - * and userfaultfd_wp() vmas. But since the mmap_lock is not
> - * held, it is still possible for a racing userfaultfd_ioctl()
> - * to have inserted ptes or markers. Now that we hold ptlock,
> - * repeating the anon_vma check protects from one category,
> - * and repeating the userfaultfd_wp() check from another.
> + * Huge page lock is still held, so normally the page table must
> + * remain empty; and we have already skipped anon_vma and
> + * userfaultfd_wp() vmas. But since the mmap_lock is not held,
> + * it is still possible for a racing userfaultfd_ioctl() or
> + * madvise() to have inserted ptes or markers. Now that we hold
> + * ptlock, repeating the retractable checks protects us from
> + * races against the prior checks.
> */
> - if (likely(!vma->anon_vma && !userfaultfd_wp(vma))) {
> + if (likely(file_backed_vma_is_retractable(vma))) {
> pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
> pmdp_get_lockless_sync();
> success = true;
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 0b3280752bfb..5dbe40be7c65 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -1141,15 +1141,21 @@ static long madvise_guard_install(struct madvise_behavior *madv_behavior)
> return -EINVAL;
>
> /*
> - * If we install guard markers, then the range is no longer
> - * empty from a page table perspective and therefore it's
> - * appropriate to have an anon_vma.
> - *
> - * This ensures that on fork, we copy page tables correctly.
> + * Set atomically under read lock. All pertinent readers will need to
> + * acquire an mmap/VMA write lock to read it. All remaining readers may
> + * or may not see the flag set, but we don't care.
> + */
> + vma_flag_set_atomic(vma, VM_MAYBE_GUARD_BIT);
> +
In general LGTM.
> + /*
> + * If anonymous and we are establishing page tables the VMA ought to
> + * have an anon_vma associated with it.
Do you know why? I know that as soon as we have anon folios in there we
need it, but is it still required for guard regions? Patch #5 should
handle the fork case I guess.
Which other code depends on that?
--
Cheers
David
On Wed, Nov 19, 2025 at 10:16:14AM +0100, David Hildenbrand (Red Hat) wrote:
> > +/* Can we retract page tables for this file-backed VMA? */
> > +static bool file_backed_vma_is_retractable(struct vm_area_struct *vma)
>
> It's not really the VMA that is retractable :)
>
> Given that the function this is called from is called
> "retract_page_tables" (and not file_backed_...) I guess I would just have
> called this
>
> "page_tables_are_retractable"
>
> "page_tables_support_retract"
>
> Or sth. along those lines.
Well, it's specific to the VMA and it starts getting messy; this is the problem
with naming, you can_end_up_with_way_too_specific_names :)
Also you'd need to say file-backed for clarity really, and that's getting far too
long...
I think this is fine as-is given it's a static function; a user thinking 'what
does retractable mean?' can see right away in the comment immediately above the
function name.
>
> > +{
> > + /*
> > + * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
> > + * got written to. These VMAs are likely not worth removing
> > + * page tables from, as PMD-mapping is likely to be split later.
> > + */
> > + if (READ_ONCE(vma->anon_vma))
> > + return false;
> > +
> > + /*
> > + * When a vma is registered with uffd-wp, we cannot recycle
> > + * the page table because there may be pte markers installed.
> > + * Other vmas can still have the same file mapped hugely, but
> > + * skip this one: it will always be mapped in small page size
> > + * for uffd-wp registered ranges.
> > + */
> > + if (userfaultfd_wp(vma))
> > + return false;
> > +
> > + /*
> > + * If the VMA contains guard regions then we can't collapse it.
> > + *
> > + * This is set atomically on guard marker installation under mmap/VMA
> > + * read lock, and here we may not hold any VMA or mmap lock at all.
> > + *
> > + * This is therefore serialised on the PTE page table lock, which is
> > + * obtained on guard region installation after the flag is set, so this
> > + * check being performed under this lock excludes races.
> > + */
> > + if (vma_flag_test_atomic(vma, VM_MAYBE_GUARD_BIT))
> > + return false;
> > +
> > + return true;
> > +}
> > +
> > static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> > {
> > struct vm_area_struct *vma;
> > @@ -1724,14 +1761,6 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> > spinlock_t *ptl;
> > bool success = false;
> > - /*
> > - * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
> > - * got written to. These VMAs are likely not worth removing
> > - * page tables from, as PMD-mapping is likely to be split later.
> > - */
> > - if (READ_ONCE(vma->anon_vma))
> > - continue;
> > -
> > addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
> > if (addr & ~HPAGE_PMD_MASK ||
> > vma->vm_end < addr + HPAGE_PMD_SIZE)
> > @@ -1743,14 +1772,8 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> > if (hpage_collapse_test_exit(mm))
> > continue;
> > - /*
> > - * When a vma is registered with uffd-wp, we cannot recycle
> > - * the page table because there may be pte markers installed.
> > - * Other vmas can still have the same file mapped hugely, but
> > - * skip this one: it will always be mapped in small page size
> > - * for uffd-wp registered ranges.
> > - */
> > - if (userfaultfd_wp(vma))
> > +
> > + if (!file_backed_vma_is_retractable(vma))
> > continue;
> > /* PTEs were notified when unmapped; but now for the PMD? */
> > @@ -1777,15 +1800,15 @@ static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
> > spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
> > /*
> > - * Huge page lock is still held, so normally the page table
> > - * must remain empty; and we have already skipped anon_vma
> > - * and userfaultfd_wp() vmas. But since the mmap_lock is not
> > - * held, it is still possible for a racing userfaultfd_ioctl()
> > - * to have inserted ptes or markers. Now that we hold ptlock,
> > - * repeating the anon_vma check protects from one category,
> > - * and repeating the userfaultfd_wp() check from another.
> > + * Huge page lock is still held, so normally the page table must
> > + * remain empty; and we have already skipped anon_vma and
> > + * userfaultfd_wp() vmas. But since the mmap_lock is not held,
> > + * it is still possible for a racing userfaultfd_ioctl() or
> > + * madvise() to have inserted ptes or markers. Now that we hold
> > + * ptlock, repeating the retractable checks protects us from
> > + * races against the prior checks.
> > */
> > - if (likely(!vma->anon_vma && !userfaultfd_wp(vma))) {
> > + if (likely(file_backed_vma_is_retractable(vma))) {
> > pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
> > pmdp_get_lockless_sync();
> > success = true;
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 0b3280752bfb..5dbe40be7c65 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -1141,15 +1141,21 @@ static long madvise_guard_install(struct madvise_behavior *madv_behavior)
> > return -EINVAL;
> > /*
> > - * If we install guard markers, then the range is no longer
> > - * empty from a page table perspective and therefore it's
> > - * appropriate to have an anon_vma.
> > - *
> > - * This ensures that on fork, we copy page tables correctly.
> > + * Set atomically under read lock. All pertinent readers will need to
> > + * acquire an mmap/VMA write lock to read it. All remaining readers may
> > + * or may not see the flag set, but we don't care.
> > + */
> > + vma_flag_set_atomic(vma, VM_MAYBE_GUARD_BIT);
> > +
>
> In general LGTM.
Thanks
>
> > + /*
> > + * If anonymous and we are establishing page tables the VMA ought to
> > + * have an anon_vma associated with it.
>
> Do you know why? I know that as soon as we have anon folios in there we need
> it, but is it still required for guard regions? Patch #5 should handle the
> fork case I guess.
>
> Which other code depends on that?
There seems to be a general convention of people seeing 'vma->anon_vma' as
meaning it has page tables, and vice-versa, for anon VMAs.
Obviously we change fork behaviour for this now with the flag, and _perhaps_
it's not necessary, but I'd rather keep this consistent for now (this is what we
were doing before) and come back to it, rather than audit the code base for
assumptions.
I'd probably like to do a patch adding vma_has_page_tables() or something that
EXPLICITLY spells this out for cases that need it.
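Purely as a hypothetical sketch of the shape such a helper might take (the
name and the file-backed side are entirely TBD):

/*
 * Hypothetical, not part of this series: make explicit the convention that
 * an anonymous VMA with an anon_vma is assumed to have page table mappings.
 */
static inline bool vma_has_page_tables(struct vm_area_struct *vma)
{
        if (vma_is_anonymous(vma))
                return READ_ONCE(vma->anon_vma) != NULL;

        /* File-backed VMAs would need some other signal, e.g. a flag. */
        return true;
}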
And it's not really overhead, as there'd not be much point in guard regions if
you didn't fault in the memory (running off the end of an empty range doesn't
really make sense).
The key change here is that file-backed guard regions no longer do the horrible
thing of having your file-backed VMA act as if it were MAP_PRIVATE with its own
pointless anon_vma just to get correct fork behaviour... :)
>
> --
> Cheers
>
> David
Thanks, Lorenzo