In mfill_copy_folio_retry(), all locks are dropped to retry
copy_from_user() with page faults enabled. During this window, the VMA
can be replaced entirely (e.g. munmap + mmap + UFFDIO_REGISTER by
another thread), but the caller proceeds with a folio allocated from the
original VMA's backing store.
Checking ops alone is insufficient: the replacement VMA could be the
same type (e.g. shmem -> shmem) with identical flags but a different
backing inode. Take a snapshot of the VMA's inode and flags before
dropping locks, and compare after re-acquiring them. If anything
changed, bail out with -EAGAIN.
Suggested-by: Peter Xu <peterx@redhat.com>
Signed-off-by: David Carlier <devnexen@gmail.com>
---
mm/userfaultfd.c | 64 ++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 57 insertions(+), 7 deletions(-)
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 481ec7eb4442..22e82f953eb5 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -443,33 +443,83 @@ static int mfill_copy_folio_locked(struct folio *folio, unsigned long src_addr)
 	return ret;
 }
 
+struct vma_snapshot {
+	struct inode *inode;
+	vm_flags_t flags;
+};
+
+static void vma_snapshot_take(struct vm_area_struct *vma,
+			      struct vma_snapshot *s)
+{
+	s->flags = vma->vm_flags;
+	if (vma->vm_file) {
+		s->inode = vma->vm_file->f_inode;
+		ihold(s->inode);
+	} else {
+		s->inode = NULL;
+	}
+}
+
+static bool vma_snapshot_changed(struct vm_area_struct *vma,
+				 struct vma_snapshot *s)
+{
+	if (s->flags != vma->vm_flags)
+		return true;
+
+	if (s->inode && vma->vm_file->f_inode != s->inode)
+		return true;
+
+	if (!s->inode && !vma_is_anonymous(vma))
+		return true;
+
+	return false;
+}
+
+static void vma_snapshot_release(struct vma_snapshot *s)
+{
+	if (s->inode) {
+		iput(s->inode);
+		s->inode = NULL;
+	}
+}
+
 static int mfill_copy_folio_retry(struct mfill_state *state, struct folio *folio)
 {
 	unsigned long src_addr = state->src_addr;
+	struct vma_snapshot s;
 	void *kaddr;
 	int err;
 
+	/* Take a quick snapshot of the current vma */
+	vma_snapshot_take(state->vma, &s);
+
 	/* retry copying with mm_lock dropped */
 	mfill_put_vma(state);
 
 	kaddr = kmap_local_folio(folio, 0);
 	err = copy_from_user(kaddr, (const void __user *) src_addr, PAGE_SIZE);
 	kunmap_local(kaddr);
-	if (unlikely(err))
-		return -EFAULT;
+	if (unlikely(err)) {
+		err = -EFAULT;
+		goto out;
+	}
 
 	flush_dcache_folio(folio);
 
 	/* reget VMA and PMD, they could change underneath us */
 	err = mfill_get_vma(state);
 	if (err)
-		return err;
+		goto out;
 
-	err = mfill_establish_pmd(state);
-	if (err)
-		return err;
+	if (vma_snapshot_changed(state->vma, &s)) {
+		err = -EAGAIN;
+		goto out;
+	}
 
-	return 0;
+	err = mfill_establish_pmd(state);
+out:
+	vma_snapshot_release(&s);
+	return err;
 }
 
 static int __mfill_atomic_pte(struct mfill_state *state,
--
2.53.0
(added VMA folks)
Hi,
On Mon, Mar 30, 2026 at 09:29:09PM +0100, David Carlier wrote:
> In mfill_copy_folio_retry(), all locks are dropped to retry
> copy_from_user() with page faults enabled. During this window, the VMA
> can be replaced entirely (e.g. munmap + mmap + UFFDIO_REGISTER by
> another thread), but the caller proceeds with a folio allocated from the
> original VMA's backing store.
Is it possible at all that after all that dance vma pointer will remain the
same?
> Checking ops alone is insufficient: the replacement VMA could be the
> same type (e.g. shmem -> shmem) with identical flags but a different
> backing inode. Take a snapshot of the VMA's inode and flags before
> dropping locks, and compare after re-acquiring them. If anything
> changed, bail out with -EAGAIN.
>
> Suggested-by: Peter Xu <peterx@redhat.com>
> Signed-off-by: David Carlier <devnexen@gmail.com>
Sashiko has comments and they seem quite relevant to me:
https://sashiko.dev/#/patchset/20260330214948.148349-1-devnexen%40gmail.com
> ---
> mm/userfaultfd.c | 64 ++++++++++++++++++++++++++++++++++++++++++------
> 1 file changed, 57 insertions(+), 7 deletions(-)
...
> + if (vma_snapshot_changed(state->vma, &s)) {
> + err = -EAGAIN;
Whatever we do to verify the VMA, this should not be EAGAIN. EINVAL or
ENOENT, like mfill_get_vma() returns, seem more appropriate.
> + goto out;
> + }
--
Sincerely yours,
Mike.
Hi Mike,
> Is it possible at all that after all that dance vma pointer will remain the
> same?
Yes, VMA structs are slab-allocated so after munmap frees the old VMA
and mmap allocates a new one, SLUB can hand back the same address. The
pointer matches but it's a different VMA, which is exactly why the
snapshot is needed.
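To illustrate the address-reuse point, here is a minimal userspace sketch. It is not kernel code: `fake_vma`, the one-slot allocator, and the `fake_snapshot_*` helpers are all hypothetical stand-ins, with the single static slot modelling an allocator that hands the just-freed address straight back.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for vm_area_struct: only the identity we compare. */
struct fake_vma {
	unsigned long inode_id;	/* 0 would mean "anonymous" */
	unsigned long flags;
};

/* One-slot "slab": free followed by alloc always returns the same address. */
static struct fake_vma slot;
static bool slot_in_use;

static struct fake_vma *fake_vma_alloc(unsigned long inode_id,
				       unsigned long flags)
{
	assert(!slot_in_use);
	slot_in_use = true;
	slot.inode_id = inode_id;
	slot.flags = flags;
	return &slot;
}

static void fake_vma_free(struct fake_vma *vma)
{
	assert(vma == &slot && slot_in_use);
	slot_in_use = false;
}

struct fake_snapshot {
	unsigned long inode_id;
	unsigned long flags;
};

static void fake_snapshot_take(const struct fake_vma *vma,
			       struct fake_snapshot *s)
{
	s->inode_id = vma->inode_id;
	s->flags = vma->flags;
}

static bool fake_snapshot_changed(const struct fake_vma *vma,
				  const struct fake_snapshot *s)
{
	return vma->inode_id != s->inode_id || vma->flags != s->flags;
}

/* True when a pointer comparison is fooled but the snapshot is not. */
static bool pointer_check_is_fooled(void)
{
	struct fake_snapshot snap;
	struct fake_vma *old = fake_vma_alloc(1, 0x1);
	struct fake_vma *saved = old;
	struct fake_vma *replacement;
	bool fooled;

	fake_snapshot_take(old, &snap);
	fake_vma_free(old);			/* models munmap */
	replacement = fake_vma_alloc(2, 0x1);	/* models mmap of another file */

	fooled = (replacement == saved) &&
		 fake_snapshot_changed(replacement, &snap);
	fake_vma_free(replacement);
	return fooled;
}
```

In this model the saved pointer compares equal to the replacement, so only the content comparison notices that the mapping now refers to a different inode.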
> This isn't a bug, but struct vm_area_struct uses vm_flags, not flags.
> Will this cause a compilation error?
This is a false positive from Sashiko. vm_area_struct has a union at
include/linux/mm_types.h:956-960:
	union {
		const vm_flags_t vm_flags;
		vma_flags_t flags;
	};
Peter explicitly asked to use vma_flags_t / vma->flags since vm_flags_t
is being deprecated (see vma_flags_to_legacy()).
> If the original VMA was file-backed (s->inode is non-NULL), but is
> concurrently replaced by an anonymous VMA during the lock-dropped
> window, vma->vm_file will be NULL. Does accessing
> vma->vm_file->f_inode here cause a NULL pointer dereference?
Good catch, this is a real bug. Will fix with a vm_file NULL guard.
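For reference, a guarded comparison could look roughly like the sketch below. This is a userspace model with mock types (`mock_inode`, `mock_file`, `mock_vma`, `mock_snapshot` are all hypothetical), not the v2 patch. It also simplifies the anonymous case to a `vm_file` check, where the posted patch uses vma_is_anonymous().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for the kernel types. */
struct mock_inode { int id; };
struct mock_file { struct mock_inode *f_inode; };

struct mock_vma {
	struct mock_file *vm_file;	/* NULL for anonymous mappings */
	unsigned long vm_flags;
};

struct mock_snapshot {
	struct mock_inode *inode;	/* NULL if the original was anonymous */
	unsigned long flags;
};

static bool mock_snapshot_changed(const struct mock_vma *vma,
				  const struct mock_snapshot *s)
{
	assert(vma && s);

	if (s->flags != vma->vm_flags)
		return true;

	/*
	 * Originally file-backed: the current VMA must still have a file,
	 * and that file must point at the same inode. Checking vm_file
	 * first is the guard that avoids the NULL dereference.
	 */
	if (s->inode)
		return !vma->vm_file || vma->vm_file->f_inode != s->inode;

	/* Originally anonymous: any file now means the VMA was replaced. */
	return vma->vm_file != NULL;
}
```

With the guard in place, a file-backed snapshot compared against an anonymous replacement reports a change instead of dereferencing a NULL vm_file.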
> Filesystem eviction paths often acquire locks (like i_rwsem) that
> invert with the mmap lock. Can this cause an AB-BA deadlock? Should
> this take a reference to the struct file via get_file() and release it
> with fput() instead, which defers destruction safely?
Valid concern. I'll switch to get_file()/fput() which defers the
destruction safely.
> Whatever we do to verify the VMA, this should not be EAGAIN. EINVAL or
> ENOENT, like mfill_get_vma() returns, seem more appropriate.
Agreed, will change to -EINVAL to match mfill_get_vma()'s validation
failures.
Will send a v2 with all three fixes later on.
Cheers!
On Mon, Mar 30, 2026 at 09:29:09PM +0100, David Carlier wrote:
> +struct vma_snapshot {
> + struct inode *inode;
> + vm_flags_t flags;
Note that I used vma_flags_t / memcpy() / memcmp(), explicitly because IIUC
we're moving towards deprecating vm_flags_t.
Please see vma_flags_to_legacy().
> +};
--
Peter Xu
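A rough userspace sketch of the memcpy()/memcmp() approach mentioned above, assuming the flags type is a struct-wrapped bitmap rather than a scalar. `demo_flags_t` and both helpers are hypothetical stand-ins, not the kernel's vma_flags_t API.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define DEMO_FLAG_WORDS 2

/* Hypothetical stand-in for a bitmap-style flags type. */
typedef struct {
	unsigned long bits[DEMO_FLAG_WORDS];
} demo_flags_t;

/* Copy the whole bitmap; a plain scalar assignment would not cover it. */
static void demo_flags_snapshot(demo_flags_t *dst, const demo_flags_t *src)
{
	assert(dst && src);
	memcpy(dst, src, sizeof(*dst));
}

/*
 * Byte-wise comparison of the two bitmaps. This is safe here because the
 * struct holds only an array of unsigned long, so it has no padding bytes.
 */
static bool demo_flags_changed(const demo_flags_t *snap,
			       const demo_flags_t *cur)
{
	assert(snap && cur);
	return memcmp(snap, cur, sizeof(*snap)) != 0;
}
```

The point of the pattern is that the snapshot keeps working if the flags grow past one machine word, since neither helper assumes the type fits in an integer.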
On Mon, 30 Mar 2026 21:29:09 +0100 David Carlier <devnexen@gmail.com> wrote:
> In mfill_copy_folio_retry(), all locks are dropped to retry
> copy_from_user() with page faults enabled. During this window, the VMA
> can be replaced entirely (e.g. munmap + mmap + UFFDIO_REGISTER by
> another thread), but the caller proceeds with a folio allocated from the
> original VMA's backing store.
>
> Checking ops alone is insufficient: the replacement VMA could be the
> same type (e.g. shmem -> shmem) with identical flags but a different
> backing inode. Take a snapshot of the VMA's inode and flags before
> dropping locks, and compare after re-acquiring them. If anything
> changed, bail out with -EAGAIN.
Thanks. What are the userspace-visible runtime effects of the bug?
If they're serious we might be looking at a cc:stable and a
Fixes: tag?
To "mitigate" my previous answer after further digging ...
The userspace-visible effect is a kernel NULL pointer dereference. When
a shared shmem VMA gets replaced by an anonymous VMA during the retry
window, the stale ops->filemap_add() ends up calling
shmem_mfill_filemap_add() which dereferences vma->vm_file via
file_inode(). Since vm_file is NULL for anonymous mappings, this is a
straight kernel oops.
The window is particularly wide when copy_from_user() blocks on slow
backing stores (FUSE, NFS) as it runs with page faults enabled.
The Fixes target would be 56a3706fd7f9 ("shmem, userfaultfd: implement
shmem uffd operations using vm_uffd_ops") but that's mm-unstable only,
so no Cc: stable for now.
On Mon, 30 Mar 2026 at 21:40, Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Mon, 30 Mar 2026 21:29:09 +0100 David Carlier <devnexen@gmail.com> wrote:
>
> > In mfill_copy_folio_retry(), all locks are dropped to retry
> > copy_from_user() with page faults enabled. During this window, the VMA
> > can be replaced entirely (e.g. munmap + mmap + UFFDIO_REGISTER by
> > another thread), but the caller proceeds with a folio allocated from the
> > original VMA's backing store.
> >
> > Checking ops alone is insufficient: the replacement VMA could be the
> > same type (e.g. shmem -> shmem) with identical flags but a different
> > backing inode. Take a snapshot of the VMA's inode and flags before
> > dropping locks, and compare after re-acquiring them. If anything
> > changed, bail out with -EAGAIN.
>
> Thanks. What are the userspace-visible runtime effects of the bug?
>
> If they're serious we might be looking at a cc:stable and a
> Fixes: tag?
>
On Mon, 30 Mar 2026 22:32:58 +0100 David CARLIER <devnexen@gmail.com> wrote:
> The userspace-visible effect is a kernel NULL pointer dereference. When
> a shared shmem VMA gets replaced by an anonymous VMA during the retry
> window, the stale ops->filemap_add() ends up calling
> shmem_mfill_filemap_add() which dereferences vma->vm_file via
> file_inode(). Since vm_file is NULL for anonymous mappings, this is a
> straight kernel oops.
>
> The window is particularly wide when copy_from_user() blocks on slow
> backing stores (FUSE, NFS) as it runs with page faults enabled.
>
> The Fixes target would be 56a3706fd7f9 ("shmem, userfaultfd: implement
> shmem uffd operations using vm_uffd_ops") but that's mm-unstable only,
> so no Cc: stable for now.
Ah, OK, thanks. I'll add a note to "shmem, userfaultfd: implement
shmem uffd operations using vm_uffd_ops" for now, let's see what Mike
thinks.
Hi
On Mon, 30 Mar 2026 at 21:40, Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Mon, 30 Mar 2026 21:29:09 +0100 David Carlier <devnexen@gmail.com> wrote:
>
> > In mfill_copy_folio_retry(), all locks are dropped to retry
> > copy_from_user() with page faults enabled. During this window, the VMA
> > can be replaced entirely (e.g. munmap + mmap + UFFDIO_REGISTER by
> > another thread), but the caller proceeds with a folio allocated from the
> > original VMA's backing store.
> >
> > Checking ops alone is insufficient: the replacement VMA could be the
> > same type (e.g. shmem -> shmem) with identical flags but a different
> > backing inode. Take a snapshot of the VMA's inode and flags before
> > dropping locks, and compare after re-acquiring them. If anything
> > changed, bail out with -EAGAIN.
>
> Thanks. What are the userspace-visible runtime effects of the bug?
>
> If they're serious we might be looking at a cc:stable and a
> Fixes: tag?
>
The bug manifests as a NULL pointer dereference in
shmem_mfill_filemap_add() via file_inode(vma->vm_file) when vm_file is
NULL (anonymous VMA). This is a kernel oops/panic, so it's definitely
serious enough for Cc: stable and Fixes:.
Cheers