[PATCH v4] mm/userfaultfd: detect VMA replacement after copy retry in mfill_copy_folio_retry()

David Carlier posted 1 patch 18 hours ago
Posted by David Carlier 18 hours ago
In mfill_copy_folio_retry(), all locks are dropped to retry
copy_from_user() with page faults enabled. During this window, the VMA
can be replaced entirely (e.g. munmap + mmap + UFFDIO_REGISTER by
another thread), but the caller proceeds with a folio allocated from the
original VMA's backing store.

Checking ops alone is insufficient: the replacement VMA could be the
same type (e.g. shmem -> shmem) with identical flags but a different
backing inode. Take a snapshot of the VMA's file and flags before
dropping locks, and compare after re-acquiring them. If anything
changed, bail out with -EINVAL.

Use get_file()/fput() rather than ihold()/iput() to hold the file
reference across the lock-dropped window, avoiding potential deadlocks
from filesystem eviction under mmap_lock.

Fixes: 56a3706fd7f9 ("shmem, userfaultfd: implement shmem uffd operations using vm_uffd_ops")
Suggested-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Carlier <devnexen@gmail.com>
---
 mm/userfaultfd.c | 63 ++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 56 insertions(+), 7 deletions(-)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 481ec7eb4442..93d6a954e659 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -443,33 +443,82 @@ static int mfill_copy_folio_locked(struct folio *folio, unsigned long src_addr)
 	return ret;
 }
 
+struct vma_snapshot {
+	struct file *file;
+	vma_flags_t flags;
+};
+
+static void vma_snapshot_take(struct vm_area_struct *vma,
+			      struct vma_snapshot *s)
+{
+	memcpy(&s->flags, &vma->flags, sizeof(s->flags));
+	if (vma->vm_file)
+		s->file = get_file(vma->vm_file);
+	else
+		s->file = NULL;
+}
+
+static bool vma_snapshot_changed(struct vm_area_struct *vma,
+				 struct vma_snapshot *s)
+{
+	if (memcmp(&s->flags, &vma->flags, sizeof(s->flags)))
+		return true;
+
+	if (s->file && (!vma->vm_file ||
+	    vma->vm_file->f_inode != s->file->f_inode))
+		return true;
+
+	if (!s->file && !vma_is_anonymous(vma))
+		return true;
+
+	return false;
+}
+
+static void vma_snapshot_release(struct vma_snapshot *s)
+{
+	if (s->file) {
+		fput(s->file);
+		s->file = NULL;
+	}
+}
+
 static int mfill_copy_folio_retry(struct mfill_state *state, struct folio *folio)
 {
 	unsigned long src_addr = state->src_addr;
+	struct vma_snapshot s;
 	void *kaddr;
 	int err;
 
+	/* Take a quick snapshot of the current vma */
+	vma_snapshot_take(state->vma, &s);
+
 	/* retry copying with mm_lock dropped */
 	mfill_put_vma(state);
 
 	kaddr = kmap_local_folio(folio, 0);
 	err = copy_from_user(kaddr, (const void __user *) src_addr, PAGE_SIZE);
 	kunmap_local(kaddr);
-	if (unlikely(err))
-		return -EFAULT;
+	if (unlikely(err)) {
+		err = -EFAULT;
+		goto out;
+	}
 
 	flush_dcache_folio(folio);
 
 	/* reget VMA and PMD, they could change underneath us */
 	err = mfill_get_vma(state);
 	if (err)
-		return err;
+		goto out;
 
-	err = mfill_establish_pmd(state);
-	if (err)
-		return err;
+	if (vma_snapshot_changed(state->vma, &s)) {
+		err = -EINVAL;
+		goto out;
+	}
 
-	return 0;
+	err = mfill_establish_pmd(state);
+out:
+	vma_snapshot_release(&s);
+	return err;
 }
 
 static int __mfill_atomic_pte(struct mfill_state *state,
-- 
2.53.0
Re: [PATCH v4] mm/userfaultfd: detect VMA replacement after copy retry in mfill_copy_folio_retry()
Posted by Andrew Morton 5 hours ago
On Tue, 31 Mar 2026 14:41:58 +0100 David Carlier <devnexen@gmail.com> wrote:

> In mfill_copy_folio_retry(), all locks are dropped to retry
> copy_from_user() with page faults enabled. During this window, the VMA
> can be replaced entirely (e.g. munmap + mmap + UFFDIO_REGISTER by
> another thread), but the caller proceeds with a folio allocated from the
> original VMA's backing store.
> 
> Checking ops alone is insufficient: the replacement VMA could be the
> same type (e.g. shmem -> shmem) with identical flags but a different
> backing inode. Take a snapshot of the VMA's file and flags before
> dropping locks, and compare after re-acquiring them. If anything
> changed, bail out with -EINVAL.
> 
> Use get_file()/fput() rather than ihold()/iput() to hold the file
> reference across the lock-dropped window, avoiding potential deadlocks
> from filesystem eviction under mmap_lock.

Thanks, I've queued this as a squashable fix against mm-unstable's
"shmem, userfaultfd: implement shmem uffd operations using vm_uffd_ops
ongoing".

I've fumbled the ball on your [2/2] unlikely() fix ;).  Please resend that
after -rc1.
Re: [PATCH v4] mm/userfaultfd: detect VMA replacement after copy retry in mfill_copy_folio_retry()
Posted by Mike Rapoport 33 minutes ago
Hi Andrew,

On Tue, Mar 31, 2026 at 08:01:48PM -0700, Andrew Morton wrote:
> On Tue, 31 Mar 2026 14:41:58 +0100 David Carlier <devnexen@gmail.com> wrote:
> 
> > In mfill_copy_folio_retry(), all locks are dropped to retry
> > copy_from_user() with page faults enabled. During this window, the VMA
> > can be replaced entirely (e.g. munmap + mmap + UFFDIO_REGISTER by
> > another thread), but the caller proceeds with a folio allocated from the
> > original VMA's backing store.

What does "folio allocated from the original VMA's backing store" exactly
mean? Why is this a problem?
 
> > Checking ops alone is insufficient: the replacement VMA could be the
> > same type (e.g. shmem -> shmem) with identical flags but a different
> > backing inode. Take a snapshot of the VMA's file and flags before
> > dropping locks, and compare after re-acquiring them. If anything
> > changed, bail out with -EINVAL.
> > 
> > Use get_file()/fput() rather than ihold()/iput() to hold the file
> > reference across the lock-dropped window, avoiding potential deadlocks
> > from filesystem eviction under mmap_lock.
> 
> Thanks, I've queued this as a squashable fix against mm-unstable's
> "shmem, userfaultfd: implement shmem uffd operations using vm_uffd_ops
> ongoing".

First, this is a pre-existing and TBH quite theoretical bug that has been
there since the very beginning, so it should not be added as a fixup for the
uffd+guestmemfd series.

Second, I have reservations about vma_snapshot implementation. What
invariant does it exactly enforce? 
 
> I've fumbled the ball on your [2/2] unlikely() fix ;).  Please resend that
> after -rc1.

This one should go the same route IMO. 

-- 
Sincerely yours,
Mike.
Re: [PATCH v4] mm/userfaultfd: detect VMA replacement after copy retry in mfill_copy_folio_retry()
Posted by David CARLIER 15 minutes ago
Hi Mike,

  On Tue, Apr 01, 2026 at 08:49:00AM +0300, Mike Rapoport wrote:
  > What does "folio allocated from the original VMA's backing store" exactly
  > mean? Why is this a problem?

  Fair point; the commit message was vague here. What I meant is:

  mfill_atomic_pte_copy() captures ops = vma_uffd_ops(state->vma) and
  passes it to __mfill_atomic_pte(). There, ops->alloc_folio() allocates
  a folio for the original VMA's inode (e.g. a shmem folio for that
  specific shmem inode). Then mfill_copy_folio_retry() drops all locks for
  the copy_from_user retry. After mfill_get_vma() re-acquires them,
  state->vma may now point to a replacement VMA, but ops is still the
  stale pointer from before the drop.

  The code then calls ops->filemap_add(folio, state->vma, ...), which
  would insert a folio allocated for the old inode into the new VMA's
  backing store. If the VMA changed type entirely (e.g. shmem -> anon),
  ops->filemap_add could be operating on a VMA that has no business
  receiving this folio.
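A minimal userspace model of the problem described above may help. It is only a sketch: s_inode, s_vma, s_folio, s_alloc_folio() and s_folio_fits() are hypothetical stand-ins, not kernel types or helpers. The folio is tied to the inode of the VMA seen before the lock drop, and nothing rechecks that against the VMA found afterwards.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins, NOT kernel types. */
struct s_inode { int ino; };
struct s_vma   { const struct s_inode *inode; };  /* NULL: anonymous VMA */
struct s_folio { const struct s_inode *owner; };  /* inode it was allocated for */

/* Models ops->alloc_folio(): the new folio is tied to the VMA's inode. */
static struct s_folio s_alloc_folio(const struct s_vma *vma)
{
	assert(vma != NULL);
	struct s_folio folio = { .owner = vma->inode };
	return folio;
}

/* Would inserting this folio into this VMA's backing store be consistent? */
static bool s_folio_fits(const struct s_folio *folio, const struct s_vma *vma)
{
	return folio->owner == vma->inode;
}

/*
 * One copy attempt without a recheck: the folio is allocated for @before,
 * locks are dropped for the copy_from_user() retry, and @after is whatever
 * VMA is found once the locks are re-taken. Returns whether installing the
 * folio into @after would be consistent.
 */
static bool s_mfill_unchecked(const struct s_vma *before,
			      const struct s_vma *after)
{
	struct s_folio folio = s_alloc_folio(before);

	/* ... locks dropped, copy retried, locks re-acquired ... */
	return s_folio_fits(&folio, after);
}
```

With an unchanged VMA the install is consistent; a replacement shmem VMA on another inode, or an anonymous VMA, is not.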

  > First, this is a pre-existing and TBH quite theoretical bug that has been
  > there since the very beginning, so it should not be added as a fixup for the
  > uffd+guestmemfd series.

  You're right. The race window (VMA replacement during the lock-dropped
  copy retry) existed in the original mcopy_atomic_pte() code long before
  the vm_uffd_ops refactoring. The Fixes tag pointing at 56a3706fd7f9 was
  wrong. I'll drop it and resend as a standalone fix against the original
  retry logic.

  > Second, I have reservations about vma_snapshot implementation. What
  > invariant does it exactly enforce?

  The invariant I was going for: "the folio we allocated is still
  compatible with the VMA we're about to install it into." Since
  alloc_folio() allocates for the VMA's backing file (inode), checking
  that vm_file still refers to the same inode after re-acquiring the locks
  ensures the folio matches that inode. The flags comparison was a
  secondary guard against permission or type changes during the window.
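The v4 check can be modelled in userspace to see which replacements it catches. This is a sketch with hypothetical stand-in types (m_inode, m_file, m_vma, m_snapshot are not kernel structures). As a simplification it treats "no file" as anonymous, whereas the patch uses vma_is_anonymous(), which tests vm_ops rather than vm_file.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins, NOT the kernel's struct file/vma. */
struct m_inode    { int ino; };
struct m_file     { struct m_inode *inode; int refcount; };
struct m_vma      { struct m_file *file; unsigned long flags; };
struct m_snapshot { struct m_file *file; unsigned long flags; };

static void m_get_file(struct m_file *f) { f->refcount++; }

static void m_fput(struct m_file *f)
{
	assert(f->refcount > 0);
	f->refcount--;
}

/* Mirrors vma_snapshot_take(): copy the flags and pin the file. */
static void m_snapshot_take(const struct m_vma *vma, struct m_snapshot *s)
{
	s->flags = vma->flags;
	s->file = vma->file;
	if (s->file)
		m_get_file(s->file);
}

/*
 * Mirrors vma_snapshot_changed(): true if the flags differ, the backing
 * inode differs, or an anonymous VMA (modelled here as "no file") is no
 * longer anonymous.
 */
static bool m_snapshot_changed(const struct m_vma *vma,
			       const struct m_snapshot *s)
{
	if (s->flags != vma->flags)
		return true;
	if (s->file)
		return !vma->file || vma->file->inode != s->file->inode;
	return vma->file != NULL;
}

/* Mirrors vma_snapshot_release(): drop the pinned file, if any. */
static void m_snapshot_release(struct m_snapshot *s)
{
	if (s->file) {
		m_fput(s->file);
		s->file = NULL;
	}
}
```

Because the comparison is on the inode, a different struct file opened on the same inode is accepted, while a different inode or changed flags are rejected.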

  That said, I can see the vma_snapshot abstraction is doing too much for
  what's really needed. Would a simpler approach work better: just saving
  vm_file (with get_file/fput) before the drop and comparing it directly
  after re-acquiring the locks? That makes the invariant explicit: "same
  backing file means the folio is valid for this VMA."
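The simpler proposal could look roughly like the userspace sketch below. All names (f_file, f_vma, f_get_file(), f_fput(), f_copy_retry()) are hypothetical stand-ins, and the race callback only simulates another thread replacing the VMA while the locks are dropped; this is not a proposed kernel diff.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins, NOT kernel types. */
struct f_file { int refcount; };
struct f_vma  { struct f_file *file; };  /* NULL: anonymous VMA */

static struct f_file *f_get_file(struct f_file *f)
{
	if (f)
		f->refcount++;
	return f;
}

static void f_fput(struct f_file *f)
{
	if (!f)
		return;
	assert(f->refcount > 0);
	f->refcount--;
}

/* Simulates whatever another thread does while the locks are dropped. */
typedef void (*f_race_fn)(struct f_vma *vma, void *arg);

static void f_swap_file(struct f_vma *vma, void *arg)
{
	vma->file = arg;  /* e.g. munmap + mmap of a different file */
}

/*
 * Pin vm_file before dropping the locks; after re-acquiring them, require
 * the VMA to point at that exact struct file. Returns 0 or -EINVAL.
 */
static int f_copy_retry(struct f_vma *vma, f_race_fn race, void *arg)
{
	struct f_file *saved = f_get_file(vma->file);
	int err = 0;

	/* locks dropped: copy_from_user() retry happens here */
	if (race)
		race(vma, arg);
	/* locks re-acquired */

	if (vma->file != saved)
		err = -EINVAL;

	f_fput(saved);
	return err;
}
```

Holding the reference across the window matters even for a plain pointer comparison: it keeps the old struct file alive, so its address cannot be reused by a replacement file and fool the check. This variant is stricter than the inode comparison in v4 (a new struct file on the same inode is rejected) and drops the flags check entirely.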

  Happy to rework along those lines, or if you have a different approach
  in mind I'm open to suggestions.

  > > I've fumbled the ball on your [2/2] unlikely() fix ;).  Please resend that
  > > after -rc1.
  >
  > This one should go the same route IMO.

  Agreed, I'll resend both after -rc1.