[PATCH v2] mm/userfaultfd: detect VMA replacement after copy retry in mfill_copy_folio_retry()

There is a newer version of this series
[PATCH v2] mm/userfaultfd: detect VMA replacement after copy retry in mfill_copy_folio_retry()
Posted by David Carlier 1 day, 11 hours ago
In mfill_copy_folio_retry(), all locks are dropped to retry
copy_from_user() with page faults enabled. During this window, the VMA
can be replaced entirely (e.g. munmap + mmap + UFFDIO_REGISTER by
another thread), but the caller proceeds with a folio allocated from the
original VMA's backing store.

Checking ops alone is insufficient: the replacement VMA could be the
same type (e.g. shmem -> shmem) with identical flags but a different
backing inode. Take a snapshot of the VMA's inode and flags before
dropping locks, and compare after re-acquiring them. If anything
changed, bail out with -EAGAIN.

Suggested-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Carlier <devnexen@gmail.com>
---
 mm/userfaultfd.c | 64 ++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 57 insertions(+), 7 deletions(-)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 481ec7eb4442..22e82f953eb5 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -443,33 +443,83 @@ static int mfill_copy_folio_locked(struct folio *folio, unsigned long src_addr)
 	return ret;
 }
 
+struct vma_snapshot {
+	struct inode *inode;
+	vm_flags_t flags;
+};
+
+static void vma_snapshot_take(struct vm_area_struct *vma,
+			      struct vma_snapshot *s)
+{
+	s->flags = vma->vm_flags;
+	if (vma->vm_file) {
+		s->inode = vma->vm_file->f_inode;
+		ihold(s->inode);
+	} else {
+		s->inode = NULL;
+	}
+}
+
+static bool vma_snapshot_changed(struct vm_area_struct *vma,
+				 struct vma_snapshot *s)
+{
+	if (s->flags != vma->vm_flags)
+		return true;
+
+	if (s->inode && vma->vm_file->f_inode != s->inode)
+		return true;
+
+	if (!s->inode && !vma_is_anonymous(vma))
+		return true;
+
+	return false;
+}
+
+static void vma_snapshot_release(struct vma_snapshot *s)
+{
+	if (s->inode) {
+		iput(s->inode);
+		s->inode = NULL;
+	}
+}
+
 static int mfill_copy_folio_retry(struct mfill_state *state, struct folio *folio)
 {
 	unsigned long src_addr = state->src_addr;
+	struct vma_snapshot s;
 	void *kaddr;
 	int err;
 
+	/* Take a quick snapshot of the current vma */
+	vma_snapshot_take(state->vma, &s);
+
 	/* retry copying with mm_lock dropped */
 	mfill_put_vma(state);
 
 	kaddr = kmap_local_folio(folio, 0);
 	err = copy_from_user(kaddr, (const void __user *) src_addr, PAGE_SIZE);
 	kunmap_local(kaddr);
-	if (unlikely(err))
-		return -EFAULT;
+	if (unlikely(err)) {
+		err = -EFAULT;
+		goto out;
+	}
 
 	flush_dcache_folio(folio);
 
 	/* reget VMA and PMD, they could change underneath us */
 	err = mfill_get_vma(state);
 	if (err)
-		return err;
+		goto out;
 
-	err = mfill_establish_pmd(state);
-	if (err)
-		return err;
+	if (vma_snapshot_changed(state->vma, &s)) {
+		err = -EAGAIN;
+		goto out;
+	}
 
-	return 0;
+	err = mfill_establish_pmd(state);
+out:
+	vma_snapshot_release(&s);
+	return err;
 }
 
 static int __mfill_atomic_pte(struct mfill_state *state,
-- 
2.53.0
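
The hunk above replaces the early returns in mfill_copy_folio_retry() with a single out: label so that the reference taken by the snapshot is dropped on every path. A minimal userspace sketch of that shape follows; it is not the kernel code. The names ref_model, retry_model, copy_fails, vma_changed and the MODEL_* values are all hypothetical and exist only for illustration.

```c
/* Hypothetical refcounted object standing in for the pinned inode. */
struct ref_model { int count; };

static void ref_get(struct ref_model *r) { r->count++; }
static void ref_put(struct ref_model *r) { r->count--; }

/* Illustrative error values only; not the kernel's errno constants. */
enum { MODEL_OK = 0, MODEL_EFAULT = -14, MODEL_ECHANGED = -1 };

/*
 * copy_fails and vma_changed are hypothetical switches that simulate a
 * failed copy_from_user() and a snapshot mismatch after relocking.
 */
static int retry_model(struct ref_model *pinned, int copy_fails, int vma_changed)
{
	int err = MODEL_OK;

	ref_get(pinned);	/* snapshot taken while the locks are held */

	/* ... locks dropped, copy retried here ... */
	if (copy_fails) {
		err = MODEL_EFAULT;
		goto out;
	}

	/* ... locks re-acquired, snapshot compared here ... */
	if (vma_changed) {
		err = MODEL_ECHANGED;
		goto out;
	}

out:
	ref_put(pinned);	/* released on success and on every error path */
	return err;
}
```

Whatever the return value, the reference count is back at its starting value once retry_model() returns, which is the property the single exit is meant to guarantee.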
Re: [PATCH v2] mm/userfaultfd: detect VMA replacement after copy retry in mfill_copy_folio_retry()
Posted by Mike Rapoport 20 hours ago
(added VMA folks)

Hi,

On Mon, Mar 30, 2026 at 09:29:09PM +0100, David Carlier wrote:
> In mfill_copy_folio_retry(), all locks are dropped to retry
> copy_from_user() with page faults enabled. During this window, the VMA
> can be replaced entirely (e.g. munmap + mmap + UFFDIO_REGISTER by
> another thread), but the caller proceeds with a folio allocated from the
> original VMA's backing store.

Is it possible at all that, after all that dance, the vma pointer will
remain the same?
 
> Checking ops alone is insufficient: the replacement VMA could be the
> same type (e.g. shmem -> shmem) with identical flags but a different
> backing inode. Take a snapshot of the VMA's inode and flags before
> dropping locks, and compare after re-acquiring them. If anything
> changed, bail out with -EAGAIN.
> 
> Suggested-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: David Carlier <devnexen@gmail.com>

Sashiko has comments and they seem quite relevant to me:
https://sashiko.dev/#/patchset/20260330214948.148349-1-devnexen%40gmail.com

> ---
>  mm/userfaultfd.c | 64 ++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 57 insertions(+), 7 deletions(-)

...

> +	if (vma_snapshot_changed(state->vma, &s)) {
> +		err = -EAGAIN;

Whatever we do to verify the VMA, this should not be EAGAIN. EINVAL or
ENOENT, like mfill_get_vma() returns, seem more appropriate.

> +		goto out;
> +	}

-- 
Sincerely yours,
Mike.
Re: [PATCH v2] mm/userfaultfd: detect VMA replacement after copy retry in mfill_copy_folio_retry()
Posted by David CARLIER 20 hours ago
Hi Mike,

> Is it possible at all that, after all that dance, the vma pointer will
> remain the same?

Yes, VMA structs are slab-allocated, so after munmap frees the old VMA
and mmap allocates a new one, SLUB can hand back the same address. The
pointer matches but it's a different VMA, which is exactly why the
snapshot is needed.

> This isn't a bug, but struct vm_area_struct uses vm_flags, not flags.
> Will this cause a compilation error?

This is a false positive from Sashiko. vm_area_struct has a union at
include/linux/mm_types.h:956-960:

  union {
      const vm_flags_t vm_flags;
      vma_flags_t flags;
  };

Peter explicitly asked to use vma_flags_t / vma->flags since vm_flags_t
is being deprecated (see vma_flags_to_legacy()).

> If the original VMA was file-backed (s->inode is non-NULL), but is
> concurrently replaced by an anonymous VMA during the lock-dropped
> window, vma->vm_file will be NULL. Does accessing
> vma->vm_file->f_inode here cause a NULL pointer dereference?

Good catch, this is a real bug. Will fix with a vm_file NULL guard.

> Filesystem eviction paths often acquire locks (like i_rwsem) that
> invert with the mmap lock. Can this cause an AB-BA deadlock? Should
> this take a reference to the struct file via get_file() and release it
> with fput() instead, which defers destruction safely?

Valid concern. I'll switch to get_file()/fput(), which defers the
destruction safely.

> Whatever we do to verify the VMA, this should not be EAGAIN. EINVAL or
> ENOENT, like mfill_get_vma() returns, seem more appropriate.

Agreed, will change to -EINVAL to match mfill_get_vma()'s validation
failures.

Will send a v2 with all three fixes later on.

Cheers!
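
The NULL guard discussed above can be sketched in userspace as follows. This is a model under stated assumptions, not the follow-up patch: the *_model types are hypothetical stand-ins for the kernel structures, and a NULL vm_file is used here as the marker for an anonymous mapping, whereas the posted patch calls vma_is_anonymous().

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins; these are not the real kernel layouts. */
struct inode_model { int ino; };
struct file_model { struct inode_model *f_inode; };
struct vma_model {
	unsigned long vm_flags;
	struct file_model *vm_file;	/* NULL models an anonymous mapping */
};

struct snapshot_model {
	struct inode_model *inode;	/* NULL if the original VMA was anonymous */
	unsigned long flags;
};

static void snapshot_take(const struct vma_model *vma, struct snapshot_model *s)
{
	s->flags = vma->vm_flags;
	s->inode = vma->vm_file ? vma->vm_file->f_inode : NULL;
}

/* Returns true if the VMA no longer matches the snapshot. */
static bool snapshot_changed(const struct vma_model *vma,
			     const struct snapshot_model *s)
{
	/* Resolve the current inode without dereferencing a NULL vm_file. */
	struct inode_model *cur = vma->vm_file ? vma->vm_file->f_inode : NULL;

	if (s->flags != vma->vm_flags)
		return true;

	/* Covers file -> anon, anon -> file, and file -> different file. */
	return cur != s->inode;
}
```

Resolving the current inode once, with the guard, collapses the two inode branches of the posted vma_snapshot_changed() into a single pointer comparison, so the file-to-anonymous case is reported as a change instead of faulting.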
Re: [PATCH v2] mm/userfaultfd: detect VMA replacement after copy retry in mfill_copy_folio_retry()
Posted by Peter Xu 1 day, 11 hours ago
On Mon, Mar 30, 2026 at 09:29:09PM +0100, David Carlier wrote:
> +struct vma_snapshot {
> +	struct inode *inode;
> +	vm_flags_t flags;

Note that I used vma_flags_t / memcpy() / memcmp(), explicitly because IIUC
we're moving towards deprecating vm_flags_t.

Please see vma_flags_to_legacy().

> +};

-- 
Peter Xu
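
As a rough illustration of the memcpy()/memcmp() approach mentioned above, the flags can be treated as an opaque blob instead of an integer. The sketch below is a userspace model only: flags_model_t and the two helper names are hypothetical, and the real layout of vma_flags_t is not assumed.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical opaque flag bitmap; not the real vma_flags_t definition. */
#define FLAGS_MODEL_WORDS 2
typedef struct {
	unsigned long bits[FLAGS_MODEL_WORDS];
} flags_model_t;

/* Copy the flags as an opaque blob rather than assigning an integer. */
static void flags_snapshot_take(flags_model_t *snap, const flags_model_t *cur)
{
	memcpy(snap, cur, sizeof(*snap));
}

/* Byte-wise comparison; valid here because the struct has no padding. */
static bool flags_snapshot_changed(const flags_model_t *snap,
				   const flags_model_t *cur)
{
	return memcmp(snap, cur, sizeof(*snap)) != 0;
}
```

Because neither helper depends on the width of the flag type, the snapshot code would not need to change if the bitmap grew beyond a single word.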
Re: [PATCH v2] mm/userfaultfd: detect VMA replacement after copy retry in mfill_copy_folio_retry()
Posted by Andrew Morton 1 day, 11 hours ago
On Mon, 30 Mar 2026 21:29:09 +0100 David Carlier <devnexen@gmail.com> wrote:

> In mfill_copy_folio_retry(), all locks are dropped to retry
> copy_from_user() with page faults enabled. During this window, the VMA
> can be replaced entirely (e.g. munmap + mmap + UFFDIO_REGISTER by
> another thread), but the caller proceeds with a folio allocated from the
> original VMA's backing store.
> 
> Checking ops alone is insufficient: the replacement VMA could be the
> same type (e.g. shmem -> shmem) with identical flags but a different
> backing inode. Take a snapshot of the VMA's inode and flags before
> dropping locks, and compare after re-acquiring them. If anything
> changed, bail out with -EAGAIN.

Thanks.  What are the userspace-visible runtime effects of the bug?

If they're serious we might be looking at a cc:stable and a
Fixes: tag?
Re: [PATCH v2] mm/userfaultfd: detect VMA replacement after copy retry in mfill_copy_folio_retry()
Posted by David CARLIER 1 day, 10 hours ago
To "mitigate" my previous answer after further digging ...

The userspace-visible effect is a kernel NULL pointer dereference. When
a shared shmem VMA gets replaced by an anonymous VMA during the retry
window, the stale ops->filemap_add() ends up calling
shmem_mfill_filemap_add(), which dereferences vma->vm_file via
file_inode(). Since vm_file is NULL for anonymous mappings, this is a
straight kernel oops.

The window is particularly wide when copy_from_user() blocks on slow
backing stores (FUSE, NFS), as it runs with page faults enabled.

The Fixes target would be 56a3706fd7f9 ("shmem, userfaultfd: implement
shmem uffd operations using vm_uffd_ops"), but that's mm-unstable only,
so no Cc: stable for now.

Re: [PATCH v2] mm/userfaultfd: detect VMA replacement after copy retry in mfill_copy_folio_retry()
Posted by Andrew Morton 1 day, 8 hours ago
On Mon, 30 Mar 2026 22:32:58 +0100 David CARLIER <devnexen@gmail.com> wrote:

> The userspace-visible effect is a kernel NULL pointer dereference. When
> a shared shmem VMA gets replaced by an anonymous VMA during the retry
> window, the stale ops->filemap_add() ends up calling
> shmem_mfill_filemap_add(), which dereferences vma->vm_file via
> file_inode(). Since vm_file is NULL for anonymous mappings, this is a
> straight kernel oops.
>
> The window is particularly wide when copy_from_user() blocks on slow
> backing stores (FUSE, NFS), as it runs with page faults enabled.
>
> The Fixes target would be 56a3706fd7f9 ("shmem, userfaultfd: implement
> shmem uffd operations using vm_uffd_ops"), but that's mm-unstable only,
> so no Cc: stable for now.

Ah, OK, thanks.  I'll add a note to "shmem, userfaultfd: implement
shmem uffd operations using vm_uffd_ops" for now, let's see what Mike
thinks.
Re: [PATCH v2] mm/userfaultfd: detect VMA replacement after copy retry in mfill_copy_folio_retry()
Posted by David CARLIER 1 day, 10 hours ago
Hi

On Mon, 30 Mar 2026 at 21:40, Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Mon, 30 Mar 2026 21:29:09 +0100 David Carlier <devnexen@gmail.com> wrote:
>
> > In mfill_copy_folio_retry(), all locks are dropped to retry
> > copy_from_user() with page faults enabled. During this window, the VMA
> > can be replaced entirely (e.g. munmap + mmap + UFFDIO_REGISTER by
> > another thread), but the caller proceeds with a folio allocated from the
> > original VMA's backing store.
> >
> > Checking ops alone is insufficient: the replacement VMA could be the
> > same type (e.g. shmem -> shmem) with identical flags but a different
> > backing inode. Take a snapshot of the VMA's inode and flags before
> > dropping locks, and compare after re-acquiring them. If anything
> > changed, bail out with -EAGAIN.
>
> Thanks.  What are the userspace-visible runtime effects of the bug?
>
> If they're serious we might be looking at a cc:stable and a
> Fixes: tag?
>

The bug manifests as a NULL pointer dereference in
shmem_mfill_filemap_add() via file_inode(vma->vm_file) when vm_file is
NULL (anonymous VMA). This is a kernel oops/panic, so it's definitely
serious enough for Cc: stable and Fixes:.

Cheers