In mfill_copy_folio_retry(), all locks are dropped to retry
copy_from_user() with page faults enabled. During this window, the VMA
can be replaced entirely (e.g. munmap + mmap + UFFDIO_REGISTER by
another thread), but the caller proceeds with a folio allocated from the
original VMA's backing store.

Checking ops alone is insufficient: the replacement VMA could be the
same type (e.g. shmem -> shmem) with identical flags but a different
backing inode. Take a snapshot of the VMA's file and flags before
dropping locks, and compare after re-acquiring them. If anything
changed, bail out with -EINVAL.

Use get_file()/fput() rather than ihold()/iput() to hold the file
reference across the lock-dropped window, avoiding potential deadlocks
from filesystem eviction under mmap_lock.

Fixes: 56a3706fd7f9 ("shmem, userfaultfd: implement shmem uffd operations using vm_uffd_ops")
Suggested-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Carlier <devnexen@gmail.com>
---
 mm/userfaultfd.c | 63 ++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 56 insertions(+), 7 deletions(-)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 481ec7eb4442..93d6a954e659 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -443,33 +443,82 @@ static int mfill_copy_folio_locked(struct folio *folio, unsigned long src_addr)
 	return ret;
 }
 
+struct vma_snapshot {
+	struct file *file;
+	vma_flags_t flags;
+};
+
+static void vma_snapshot_take(struct vm_area_struct *vma,
+			      struct vma_snapshot *s)
+{
+	memcpy(&s->flags, &vma->flags, sizeof(s->flags));
+	if (vma->vm_file)
+		s->file = get_file(vma->vm_file);
+	else
+		s->file = NULL;
+}
+
+static bool vma_snapshot_changed(struct vm_area_struct *vma,
+				 struct vma_snapshot *s)
+{
+	if (memcmp(&s->flags, &vma->flags, sizeof(s->flags)))
+		return true;
+
+	if (s->file && (!vma->vm_file ||
+			vma->vm_file->f_inode != s->file->f_inode))
+		return true;
+
+	if (!s->file && !vma_is_anonymous(vma))
+		return true;
+
+	return false;
+}
+
+static void vma_snapshot_release(struct vma_snapshot *s)
+{
+	if (s->file) {
+		fput(s->file);
+		s->file = NULL;
+	}
+}
+
 static int mfill_copy_folio_retry(struct mfill_state *state, struct folio *folio)
 {
 	unsigned long src_addr = state->src_addr;
+	struct vma_snapshot s;
 	void *kaddr;
 	int err;
 
+	/* Take a quick snapshot of the current vma */
+	vma_snapshot_take(state->vma, &s);
+
 	/* retry copying with mm_lock dropped */
 	mfill_put_vma(state);
 
 	kaddr = kmap_local_folio(folio, 0);
 	err = copy_from_user(kaddr, (const void __user *) src_addr, PAGE_SIZE);
 	kunmap_local(kaddr);
-	if (unlikely(err))
-		return -EFAULT;
+	if (unlikely(err)) {
+		err = -EFAULT;
+		goto out;
+	}
 
 	flush_dcache_folio(folio);
 
 	/* reget VMA and PMD, they could change underneath us */
 	err = mfill_get_vma(state);
 	if (err)
-		return err;
+		goto out;
 
-	err = mfill_establish_pmd(state);
-	if (err)
-		return err;
+	if (vma_snapshot_changed(state->vma, &s)) {
+		err = -EINVAL;
+		goto out;
+	}
 
-	return 0;
+	err = mfill_establish_pmd(state);
+out:
+	vma_snapshot_release(&s);
+	return err;
 }
 
 static int __mfill_atomic_pte(struct mfill_state *state,
--
2.53.0

On Tue, 31 Mar 2026 14:41:58 +0100 David Carlier <devnexen@gmail.com> wrote:

> In mfill_copy_folio_retry(), all locks are dropped to retry
> copy_from_user() with page faults enabled. During this window, the VMA
> can be replaced entirely (e.g. munmap + mmap + UFFDIO_REGISTER by
> another thread), but the caller proceeds with a folio allocated from the
> original VMA's backing store.
>
> Checking ops alone is insufficient: the replacement VMA could be the
> same type (e.g. shmem -> shmem) with identical flags but a different
> backing inode. Take a snapshot of the VMA's file and flags before
> dropping locks, and compare after re-acquiring them. If anything
> changed, bail out with -EINVAL.
>
> Use get_file()/fput() rather than ihold()/iput() to hold the file
> reference across the lock-dropped window, avoiding potential deadlocks
> from filesystem eviction under mmap_lock.

Thanks, I've queued this as a squashable fix against mm-unstable's
"shmem, userfaultfd: implement shmem uffd operations using vm_uffd_ops
ongoing".

I've fumbled the ball on your [2/2] unlikely() fix ;). Please resend that
after -rc1.

Hi Andrew,

On Tue, Mar 31, 2026 at 08:01:48PM -0700, Andrew Morton wrote:
> On Tue, 31 Mar 2026 14:41:58 +0100 David Carlier <devnexen@gmail.com> wrote:
>
> > In mfill_copy_folio_retry(), all locks are dropped to retry
> > copy_from_user() with page faults enabled. During this window, the VMA
> > can be replaced entirely (e.g. munmap + mmap + UFFDIO_REGISTER by
> > another thread), but the caller proceeds with a folio allocated from the
> > original VMA's backing store.

What does "folio allocated from the original VMA's backing store" exactly
mean? Why is this a problem?

> > Checking ops alone is insufficient: the replacement VMA could be the
> > same type (e.g. shmem -> shmem) with identical flags but a different
> > backing inode. Take a snapshot of the VMA's file and flags before
> > dropping locks, and compare after re-acquiring them. If anything
> > changed, bail out with -EINVAL.
> >
> > Use get_file()/fput() rather than ihold()/iput() to hold the file
> > reference across the lock-dropped window, avoiding potential deadlocks
> > from filesystem eviction under mmap_lock.
>
> Thanks, I've queued this as a squashable fix against mm-unstable's
> "shmem, userfaultfd: implement shmem uffd operations using vm_uffd_ops
> ongoing".

First, this is a pre-existing and TBH quite theoretical bug and it was
there since the very beginning, so it should not be added as a fixup for
the uffd+guestmemfd series.

Second, I have reservations about the vma_snapshot implementation. What
invariant does it exactly enforce?

> I've fumbled the ball on your [2/2] unlikely() fix ;). Please resend that
> after -rc1.

This one should go the same route IMO.

-- 
Sincerely yours,
Mike.

Hi Mike,

On Wed, Apr 01, 2026 at 08:49:00AM +0300, Mike Rapoport wrote:
> What does "folio allocated from the original VMA's backing store" exactly
> mean? Why is this a problem?

Fair point, the commit message was vague here. What I meant is:

mfill_atomic_pte_copy() captures ops = vma_uffd_ops(state->vma) and passes
it to __mfill_atomic_pte(). There, ops->alloc_folio() allocates a folio
for the original VMA's inode (e.g. a shmem folio for that specific shmem
inode). Then mfill_copy_folio_retry() drops all locks for the
copy_from_user retry. After mfill_get_vma() re-acquires them, state->vma
may now point to a replacement VMA, but ops is still the stale pointer
from before the drop.

The code then calls ops->filemap_add(folio, state->vma, ...), which would
insert a folio allocated for the old inode into the new VMA's backing
store. If the VMA changed type entirely (e.g. shmem -> anon),
ops->filemap_add could be operating on a VMA that has no business
receiving this folio.

> First, this is a pre-existing and TBH quite theoretical bug and it was
> there since the very beginning, so it should not be added as a fixup for
> the uffd+guestmemfd series.

You're right. The race window (VMA replacement during the lock-dropped
copy retry) existed in the original mcopy_atomic_pte() code long before
the vm_uffd_ops refactoring. The Fixes tag pointing at 56a3706fd7f9 was
wrong. I'll drop it and resend as a standalone fix against the original
retry logic.

> Second, I have reservations about vma_snapshot implementation. What
> invariant does it exactly enforce?

The invariant I was going for: "the folio we allocated is still compatible
with the VMA we're about to install it into." Since alloc_folio()
allocates from the VMA's backing file (inode), checking that vm_file is
still the same after re-acquiring locks ensures the folio matches the
inode. The vm_flags comparison was a secondary guard against
permission/type changes during the window.

That said, I can see the vma_snapshot abstraction is doing too much for
what's really needed. Would a simpler approach work better: just saving
vm_file (with get_file/fput) before the drop and comparing it directly
after re-acquiring? That makes the invariant explicit: "same backing file
means the folio is valid for this VMA."

Happy to rework along those lines, or if you have a different approach in
mind I'm open to suggestions.

> > I've fumbled the ball on your [2/2] unlikely() fix ;). Please resend that
> > after -rc1.
>
> This one should go the same route IMO.

Agreed, I'll resend both after -rc1.
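
For illustration only, here is a minimal userspace sketch of the simpler
check floated in the reply above: pin the backing file before the locks
are dropped, then compare the pinned pointer with the VMA found after
re-acquiring them. Everything in it is a hypothetical stand-in, not
kernel code and not the eventual patch: toy_file and toy_vma only model
struct file and struct vm_area_struct, and toy_get_file()/toy_fput() only
model the reference counting of get_file()/fput().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct file and struct vm_area_struct. */
struct toy_file {
	int refcount;
};

struct toy_vma {
	struct toy_file *vm_file;	/* NULL models an anonymous mapping */
};

/* Models get_file(): take a reference and return the same pointer. */
static struct toy_file *toy_get_file(struct toy_file *f)
{
	assert(f && f->refcount > 0);
	f->refcount++;
	return f;
}

/* Models fput(): drop a reference taken by toy_get_file(). */
static void toy_fput(struct toy_file *f)
{
	assert(f && f->refcount > 0);
	f->refcount--;
}

/* Pin the backing file (if any) before the locks are dropped. */
static struct toy_file *toy_pin_vm_file(const struct toy_vma *vma)
{
	return vma->vm_file ? toy_get_file(vma->vm_file) : NULL;
}

/*
 * After re-acquiring the locks: compare, then drop the pin.
 * Returns 0 if the VMA is still backed by the pinned file (or is still
 * anonymous), -EINVAL otherwise.
 */
static int toy_recheck_vm_file(const struct toy_vma *vma,
			       struct toy_file *pinned)
{
	int err = (vma->vm_file == pinned) ? 0 : -EINVAL;

	if (pinned)
		toy_fput(pinned);
	return err;
}
```

In this model the pointer comparison is meaningful because the pinned
reference keeps the old file object alive for the whole window, so its
address cannot be recycled for a different file. Whether comparing
vm_file alone covers every case raised in the thread is left open here.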