From: Mike Rapoport
To: Andrew Morton
Cc: Andrea Arcangeli, Axel Rasmussen, Baolin Wang, David Hildenbrand, Hugh Dickins, James Houghton, "Liam R. Howlett", Lorenzo Stoakes, "Matthew Wilcox (Oracle)", Michal Hocko, Mike Rapoport, Muchun Song, Nikita Kalyazin, Oscar Salvador, Paolo Bonzini, Peter Xu, Sean Christopherson, Shuah Khan, Suren Baghdasaryan, Vlastimil Babka, kvm@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-mm@kvack.org
Subject: [PATCH v2 01/15] userfaultfd: introduce mfill_copy_folio_locked() helper
Date: Fri, 6 Mar 2026 19:18:01 +0200
Message-ID: <20260306171815.3160826-2-rppt@kernel.org>
In-Reply-To: <20260306171815.3160826-1-rppt@kernel.org>
References: <20260306171815.3160826-1-rppt@kernel.org>

From: "Mike Rapoport (Microsoft)"

Split the copying of data while locks are held out of
mfill_atomic_pte_copy() into a helper function,
mfill_copy_folio_locked(). This improves code readability and makes
the complex mfill_atomic_pte_copy() function easier to comprehend.

No functional change.
Acked-by: Peter Xu
Signed-off-by: Mike Rapoport (Microsoft)
Reviewed-by: David Hildenbrand (Arm)
---
 mm/userfaultfd.c | 59 ++++++++++++++++++++++++++++--------------------
 1 file changed, 35 insertions(+), 24 deletions(-)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 927086bb4a3c..32637d557c95 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -238,6 +238,40 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd,
 	return ret;
 }
 
+static int mfill_copy_folio_locked(struct folio *folio, unsigned long src_addr)
+{
+	void *kaddr;
+	int ret;
+
+	kaddr = kmap_local_folio(folio, 0);
+	/*
+	 * The read mmap_lock is held here. Despite the
+	 * mmap_lock being read recursive a deadlock is still
+	 * possible if a writer has taken a lock. For example:
+	 *
+	 * process A thread 1 takes read lock on own mmap_lock
+	 * process A thread 2 calls mmap, blocks taking write lock
+	 * process B thread 1 takes page fault, read lock on own mmap lock
+	 * process B thread 2 calls mmap, blocks taking write lock
+	 * process A thread 1 blocks taking read lock on process B
+	 * process B thread 1 blocks taking read lock on process A
+	 *
+	 * Disable page faults to prevent potential deadlock
+	 * and retry the copy outside the mmap_lock.
+	 */
+	pagefault_disable();
+	ret = copy_from_user(kaddr, (const void __user *) src_addr,
+			     PAGE_SIZE);
+	pagefault_enable();
+	kunmap_local(kaddr);
+
+	if (ret)
+		return -EFAULT;
+
+	flush_dcache_folio(folio);
+	return ret;
+}
+
 static int mfill_atomic_pte_copy(pmd_t *dst_pmd,
 				 struct vm_area_struct *dst_vma,
 				 unsigned long dst_addr,
@@ -245,7 +279,6 @@ static int mfill_atomic_pte_copy(pmd_t *dst_pmd,
 				 uffd_flags_t flags,
 				 struct folio **foliop)
 {
-	void *kaddr;
 	int ret;
 	struct folio *folio;
 
@@ -256,27 +289,7 @@ static int mfill_atomic_pte_copy(pmd_t *dst_pmd,
 		if (!folio)
 			goto out;
 
-		kaddr = kmap_local_folio(folio, 0);
-		/*
-		 * The read mmap_lock is held here. Despite the
-		 * mmap_lock being read recursive a deadlock is still
-		 * possible if a writer has taken a lock. For example:
-		 *
-		 * process A thread 1 takes read lock on own mmap_lock
-		 * process A thread 2 calls mmap, blocks taking write lock
-		 * process B thread 1 takes page fault, read lock on own mmap lock
-		 * process B thread 2 calls mmap, blocks taking write lock
-		 * process A thread 1 blocks taking read lock on process B
-		 * process B thread 1 blocks taking read lock on process A
-		 *
-		 * Disable page faults to prevent potential deadlock
-		 * and retry the copy outside the mmap_lock.
-		 */
-		pagefault_disable();
-		ret = copy_from_user(kaddr, (const void __user *) src_addr,
-				     PAGE_SIZE);
-		pagefault_enable();
-		kunmap_local(kaddr);
+		ret = mfill_copy_folio_locked(folio, src_addr);
 
 		/* fallback to copy_from_user outside mmap_lock */
 		if (unlikely(ret)) {
@@ -285,8 +298,6 @@ static int mfill_atomic_pte_copy(pmd_t *dst_pmd,
 			/* don't free the page */
 			goto out;
 		}
-
-		flush_dcache_folio(folio);
 	} else {
 		folio = *foliop;
 		*foliop = NULL;
-- 
2.51.0
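The locking rule the comment above describes can be sketched in userspace: attempt the copy while a read lock is held with faulting forbidden, and if that attempt bails out, drop the lock and retry where faulting is allowed. This is a minimal illustrative model only; `copy_nofault()`, `fill_page()`, and the `resident` flag are invented stand-ins, not kernel APIs.

```c
#include <assert.h>
#include <string.h>

/*
 * Userspace sketch of the mfill_copy_folio_locked() pattern: try the
 * copy with the (read) lock held and faulting disabled; the caller
 * retries outside the lock if the copy would have faulted.
 */

static int lock_held;	/* models the read-held mmap_lock */

/* Simulated no-fault copy: bails out instead of taking a fault. */
static int copy_nofault(char *dst, const char *src, int resident)
{
	assert(lock_held);	/* only legal with the lock held */
	if (!resident)
		return -14;	/* -EFAULT: would have faulted */
	strcpy(dst, src);
	return 0;
}

/* Mirrors the caller: try under the lock, retry outside it on failure. */
static int fill_page(char *dst, const char *src, int resident)
{
	int ret;

	lock_held = 1;
	ret = copy_nofault(dst, src, resident);
	lock_held = 0;		/* drop the lock before a faulting retry */
	if (ret == 0)
		return 0;

	strcpy(dst, src);	/* faulting is safe now: no locks held */
	return 1;		/* report that a retry was needed */
}
```

The retry path never runs with the lock held, which is exactly what rules out the cross-process read/write deadlock described in the comment.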
From: Mike Rapoport
To: Andrew Morton
Subject: [PATCH v2 02/15] userfaultfd: introduce struct mfill_state
Date: Fri, 6 Mar 2026 19:18:02 +0200
Message-ID: <20260306171815.3160826-3-rppt@kernel.org>
In-Reply-To: <20260306171815.3160826-1-rppt@kernel.org>
References: <20260306171815.3160826-1-rppt@kernel.org>

From: "Mike Rapoport (Microsoft)"

mfill_atomic() passes a lot of parameters down to its callees.

Aggregate them all into an mfill_state structure and pass this
structure to the functions that implement the various UFFDIO_ commands.

Tracking the state in a structure will allow moving the code that
retries copying of data for UFFDIO_COPY into mfill_atomic_pte_copy()
and will make the loop in mfill_atomic() identical for all UFFDIO
operations on PTE-mapped memory.

The mfill_state definition is deliberately local to mm/userfaultfd.c,
hence shmem_mfill_atomic_pte() is not updated.
Signed-off-by: Mike Rapoport (Microsoft)
Acked-by: David Hildenbrand (Arm)
---
 mm/userfaultfd.c | 148 ++++++++++++++++++++++++++---------------------
 1 file changed, 82 insertions(+), 66 deletions(-)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 32637d557c95..e68d01743b03 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -20,6 +20,20 @@
 #include "internal.h"
 #include "swap.h"
 
+struct mfill_state {
+	struct userfaultfd_ctx *ctx;
+	unsigned long src_start;
+	unsigned long dst_start;
+	unsigned long len;
+	uffd_flags_t flags;
+
+	struct vm_area_struct *vma;
+	unsigned long src_addr;
+	unsigned long dst_addr;
+	struct folio *folio;
+	pmd_t *pmd;
+};
+
 static __always_inline
 bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_end)
 {
@@ -272,17 +286,17 @@ static int mfill_copy_folio_locked(struct folio *folio, unsigned long src_addr)
 	return ret;
 }
 
-static int mfill_atomic_pte_copy(pmd_t *dst_pmd,
-				 struct vm_area_struct *dst_vma,
-				 unsigned long dst_addr,
-				 unsigned long src_addr,
-				 uffd_flags_t flags,
-				 struct folio **foliop)
+static int mfill_atomic_pte_copy(struct mfill_state *state)
 {
-	int ret;
+	struct vm_area_struct *dst_vma = state->vma;
+	unsigned long dst_addr = state->dst_addr;
+	unsigned long src_addr = state->src_addr;
+	uffd_flags_t flags = state->flags;
+	pmd_t *dst_pmd = state->pmd;
 	struct folio *folio;
+	int ret;
 
-	if (!*foliop) {
+	if (!state->folio) {
 		ret = -ENOMEM;
 		folio = vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, dst_vma,
 					dst_addr);
@@ -294,13 +308,13 @@ static int mfill_atomic_pte_copy(pmd_t *dst_pmd,
 		/* fallback to copy_from_user outside mmap_lock */
 		if (unlikely(ret)) {
 			ret = -ENOENT;
-			*foliop = folio;
+			state->folio = folio;
 			/* don't free the page */
 			goto out;
 		}
 	} else {
-		folio = *foliop;
-		*foliop = NULL;
+		folio = state->folio;
+		state->folio = NULL;
 	}
 
 	/*
@@ -357,10 +371,11 @@ static int mfill_atomic_pte_zeroed_folio(pmd_t *dst_pmd,
 	return ret;
 }
 
-static int mfill_atomic_pte_zeropage(pmd_t *dst_pmd,
-				     struct vm_area_struct *dst_vma,
-				     unsigned long dst_addr)
+static int mfill_atomic_pte_zeropage(struct mfill_state *state)
 {
+	struct vm_area_struct *dst_vma = state->vma;
+	unsigned long dst_addr = state->dst_addr;
+	pmd_t *dst_pmd = state->pmd;
 	pte_t _dst_pte, *dst_pte;
 	spinlock_t *ptl;
 	int ret;
@@ -392,13 +407,14 @@ static int mfill_atomic_pte_zeropage(pmd_t *dst_pmd,
 }
 
 /* Handles UFFDIO_CONTINUE for all shmem VMAs (shared or private). */
-static int mfill_atomic_pte_continue(pmd_t *dst_pmd,
-				     struct vm_area_struct *dst_vma,
-				     unsigned long dst_addr,
-				     uffd_flags_t flags)
+static int mfill_atomic_pte_continue(struct mfill_state *state)
 {
-	struct inode *inode = file_inode(dst_vma->vm_file);
+	struct vm_area_struct *dst_vma = state->vma;
+	unsigned long dst_addr = state->dst_addr;
 	pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
+	struct inode *inode = file_inode(dst_vma->vm_file);
+	uffd_flags_t flags = state->flags;
+	pmd_t *dst_pmd = state->pmd;
 	struct folio *folio;
 	struct page *page;
 	int ret;
@@ -436,15 +452,15 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd,
 }
 
 /* Handles UFFDIO_POISON for all non-hugetlb VMAs. */
-static int mfill_atomic_pte_poison(pmd_t *dst_pmd,
-				   struct vm_area_struct *dst_vma,
-				   unsigned long dst_addr,
-				   uffd_flags_t flags)
+static int mfill_atomic_pte_poison(struct mfill_state *state)
 {
-	int ret;
+	struct vm_area_struct *dst_vma = state->vma;
 	struct mm_struct *dst_mm = dst_vma->vm_mm;
+	unsigned long dst_addr = state->dst_addr;
+	pmd_t *dst_pmd = state->pmd;
 	pte_t _dst_pte, *dst_pte;
 	spinlock_t *ptl;
+	int ret;
 
 	_dst_pte = make_pte_marker(PTE_MARKER_POISONED);
 	ret = -EAGAIN;
@@ -668,22 +684,20 @@ extern ssize_t mfill_atomic_hugetlb(struct userfaultfd_ctx *ctx,
 				    uffd_flags_t flags);
 #endif /* CONFIG_HUGETLB_PAGE */
 
-static __always_inline ssize_t mfill_atomic_pte(pmd_t *dst_pmd,
-						struct vm_area_struct *dst_vma,
-						unsigned long dst_addr,
-						unsigned long src_addr,
-						uffd_flags_t flags,
-						struct folio **foliop)
+static __always_inline ssize_t mfill_atomic_pte(struct mfill_state *state)
 {
+	struct vm_area_struct *dst_vma = state->vma;
+	unsigned long src_addr = state->src_addr;
+	unsigned long dst_addr = state->dst_addr;
+	struct folio **foliop = &state->folio;
+	uffd_flags_t flags = state->flags;
+	pmd_t *dst_pmd = state->pmd;
 	ssize_t err;
 
-	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) {
-		return mfill_atomic_pte_continue(dst_pmd, dst_vma,
-						 dst_addr, flags);
-	} else if (uffd_flags_mode_is(flags, MFILL_ATOMIC_POISON)) {
-		return mfill_atomic_pte_poison(dst_pmd, dst_vma,
-					       dst_addr, flags);
-	}
+	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
+		return mfill_atomic_pte_continue(state);
+	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_POISON))
+		return mfill_atomic_pte_poison(state);
 
 	/*
 	 * The normal page fault path for a shmem will invoke the
@@ -697,12 +711,9 @@ static __always_inline ssize_t mfill_atomic_pte(pmd_t *dst_pmd,
 	 */
 	if (!(dst_vma->vm_flags & VM_SHARED)) {
 		if (uffd_flags_mode_is(flags, MFILL_ATOMIC_COPY))
-			err = mfill_atomic_pte_copy(dst_pmd, dst_vma,
-						    dst_addr, src_addr,
-						    flags, foliop);
+			err = mfill_atomic_pte_copy(state);
 		else
-			err = mfill_atomic_pte_zeropage(dst_pmd,
-							dst_vma, dst_addr);
+			err = mfill_atomic_pte_zeropage(state);
 	} else {
 		err = shmem_mfill_atomic_pte(dst_pmd, dst_vma,
 					     dst_addr, src_addr,
@@ -718,13 +729,20 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 					    unsigned long len,
 					    uffd_flags_t flags)
 {
+	struct mfill_state state = (struct mfill_state){
+		.ctx = ctx,
+		.dst_start = dst_start,
+		.src_start = src_start,
+		.flags = flags,
+
+		.src_addr = src_start,
+		.dst_addr = dst_start,
+	};
 	struct mm_struct *dst_mm = ctx->mm;
 	struct vm_area_struct *dst_vma;
+	long copied = 0;
 	ssize_t err;
 	pmd_t *dst_pmd;
-	unsigned long src_addr, dst_addr;
-	long copied;
-	struct folio *folio;
 
 	/*
 	 * Sanitize the command parameters:
@@ -736,10 +754,6 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 	VM_WARN_ON_ONCE(src_start + len <= src_start);
 	VM_WARN_ON_ONCE(dst_start + len <= dst_start);
 
-	src_addr = src_start;
-	dst_addr = dst_start;
-	copied = 0;
-	folio = NULL;
 retry:
 	/*
 	 * Make sure the vma is not shared, that the dst range is
@@ -790,12 +804,14 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 	    uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
 		goto out_unlock;
 
-	while (src_addr < src_start + len) {
-		pmd_t dst_pmdval;
+	state.vma = dst_vma;
 
-		VM_WARN_ON_ONCE(dst_addr >= dst_start + len);
+	while (state.src_addr < src_start + len) {
+		VM_WARN_ON_ONCE(state.dst_addr >= dst_start + len);
+
+		pmd_t dst_pmdval;
 
-		dst_pmd = mm_alloc_pmd(dst_mm, dst_addr);
+		dst_pmd = mm_alloc_pmd(dst_mm, state.dst_addr);
 		if (unlikely(!dst_pmd)) {
 			err = -ENOMEM;
 			break;
@@ -827,34 +843,34 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 		 * tables under us; pte_offset_map_lock() will deal with that.
 		 */
 
-		err = mfill_atomic_pte(dst_pmd, dst_vma, dst_addr,
-				       src_addr, flags, &folio);
+		state.pmd = dst_pmd;
+		err = mfill_atomic_pte(&state);
 		cond_resched();
 
 		if (unlikely(err == -ENOENT)) {
 			void *kaddr;
 
 			up_read(&ctx->map_changing_lock);
-			uffd_mfill_unlock(dst_vma);
-			VM_WARN_ON_ONCE(!folio);
+			uffd_mfill_unlock(state.vma);
+			VM_WARN_ON_ONCE(!state.folio);
 
-			kaddr = kmap_local_folio(folio, 0);
+			kaddr = kmap_local_folio(state.folio, 0);
 			err = copy_from_user(kaddr,
-					     (const void __user *) src_addr,
+					     (const void __user *)state.src_addr,
 					     PAGE_SIZE);
 			kunmap_local(kaddr);
 			if (unlikely(err)) {
 				err = -EFAULT;
 				goto out;
 			}
-			flush_dcache_folio(folio);
+			flush_dcache_folio(state.folio);
 			goto retry;
 		} else
-			VM_WARN_ON_ONCE(folio);
+			VM_WARN_ON_ONCE(state.folio);
 
 		if (!err) {
-			dst_addr += PAGE_SIZE;
-			src_addr += PAGE_SIZE;
+			state.dst_addr += PAGE_SIZE;
+			state.src_addr += PAGE_SIZE;
 			copied += PAGE_SIZE;
 
 			if (fatal_signal_pending(current))
@@ -866,10 +882,10 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 
 out_unlock:
 	up_read(&ctx->map_changing_lock);
-	uffd_mfill_unlock(dst_vma);
+	uffd_mfill_unlock(state.vma);
 out:
-	if (folio)
-		folio_put(folio);
+	if (state.folio)
+		folio_put(state.folio);
 	VM_WARN_ON_ONCE(copied < 0);
 	VM_WARN_ON_ONCE(err > 0);
 	VM_WARN_ON_ONCE(!copied && !err);
-- 
2.51.0
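The refactoring idea above can be shown with a small userspace model: the per-operation state lives in one struct that each step reads and advances in place, so the caller's loop needs no per-mode parameter plumbing. A minimal sketch under invented names; the `PAGE_SIZE` constant and the trivial `mfill_one()` step are stand-ins, not the kernel code.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Userspace model of the mfill_state idea: instead of threading
 * src/dst/len through every helper, the operation state is one struct
 * that each per-page step advances in place.
 */

#define PAGE_SIZE 4096UL

struct mfill_state {
	unsigned long src_start, dst_start, len;
	unsigned long src_addr, dst_addr;
};

/* One per-page step: reads and updates the shared state. */
static int mfill_one(struct mfill_state *state)
{
	state->src_addr += PAGE_SIZE;
	state->dst_addr += PAGE_SIZE;
	return 0;
}

/* The caller's loop stays identical regardless of the operation type. */
static long mfill_loop(struct mfill_state *state)
{
	long copied = 0;

	while (state->src_addr < state->src_start + state->len) {
		if (mfill_one(state))
			break;
		copied += PAGE_SIZE;
	}
	return copied;
}
```

Because the cursor (`src_addr`/`dst_addr`) lives in the struct rather than in the caller's locals, a retry that re-enters the loop resumes from exactly where the previous attempt stopped.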
From: Mike Rapoport
To: Andrew Morton
Subject: [PATCH v2 03/15] userfaultfd: introduce mfill_get_pmd() helper.
Date: Fri, 6 Mar 2026 19:18:03 +0200
Message-ID: <20260306171815.3160826-4-rppt@kernel.org>
In-Reply-To: <20260306171815.3160826-1-rppt@kernel.org>
References: <20260306171815.3160826-1-rppt@kernel.org>

From: "Mike Rapoport (Microsoft)"

There is a lengthy code chunk in mfill_atomic() that establishes the
PMD for UFFDIO operations. This code may be called twice: the first
time when the copy is performed with the VMA/mm locks held, and again
after the copy is retried with the locks dropped.

Move the code that establishes a PMD into a helper function so that it
can be reused later during the refactoring of mfill_atomic_pte_copy().
Signed-off-by: Mike Rapoport (Microsoft)
---
 mm/userfaultfd.c | 103 ++++++++++++++++++++++++-----------------------
 1 file changed, 53 insertions(+), 50 deletions(-)

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index e68d01743b03..224b55804f99 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -157,6 +157,57 @@ static void uffd_mfill_unlock(struct vm_area_struct *vma)
 }
 #endif
 
+static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+
+	pgd = pgd_offset(mm, address);
+	p4d = p4d_alloc(mm, pgd, address);
+	if (!p4d)
+		return NULL;
+	pud = pud_alloc(mm, p4d, address);
+	if (!pud)
+		return NULL;
+	/*
+	 * Note that we didn't run this because the pmd was
+	 * missing, the *pmd may be already established and in
+	 * turn it may also be a trans_huge_pmd.
+	 */
+	return pmd_alloc(mm, pud, address);
+}
+
+static int mfill_get_pmd(struct mfill_state *state)
+{
+	struct mm_struct *dst_mm = state->ctx->mm;
+	pmd_t *dst_pmd;
+	pmd_t dst_pmdval;
+
+	dst_pmd = mm_alloc_pmd(dst_mm, state->dst_addr);
+	if (unlikely(!dst_pmd))
+		return -ENOMEM;
+
+	dst_pmdval = pmdp_get_lockless(dst_pmd);
+	if (unlikely(pmd_none(dst_pmdval)) &&
+	    unlikely(__pte_alloc(dst_mm, dst_pmd)))
+		return -ENOMEM;
+
+	dst_pmdval = pmdp_get_lockless(dst_pmd);
+	/*
+	 * If the dst_pmd is THP don't override it and just be strict.
+	 * (This includes the case where the PMD used to be THP and
+	 * changed back to none after __pte_alloc().)
+	 */
+	if (unlikely(!pmd_present(dst_pmdval) || pmd_trans_huge(dst_pmdval)))
+		return -EEXIST;
+	if (unlikely(pmd_bad(dst_pmdval)))
+		return -EFAULT;
+
+	state->pmd = dst_pmd;
+	return 0;
+}
+
 /* Check if dst_addr is outside of file's size. Must be called with ptl held. */
 static bool mfill_file_over_size(struct vm_area_struct *dst_vma,
 				 unsigned long dst_addr)
@@ -489,27 +540,6 @@ static int mfill_atomic_pte_poison(struct mfill_state *state)
 	return ret;
 }
 
-static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
-{
-	pgd_t *pgd;
-	p4d_t *p4d;
-	pud_t *pud;
-
-	pgd = pgd_offset(mm, address);
-	p4d = p4d_alloc(mm, pgd, address);
-	if (!p4d)
-		return NULL;
-	pud = pud_alloc(mm, p4d, address);
-	if (!pud)
-		return NULL;
-	/*
-	 * Note that we didn't run this because the pmd was
-	 * missing, the *pmd may be already established and in
-	 * turn it may also be a trans_huge_pmd.
-	 */
-	return pmd_alloc(mm, pud, address);
-}
-
 #ifdef CONFIG_HUGETLB_PAGE
 /*
  * mfill_atomic processing for HUGETLB vmas. Note that this routine is
@@ -742,7 +772,6 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 	struct vm_area_struct *dst_vma;
 	long copied = 0;
 	ssize_t err;
-	pmd_t *dst_pmd;
 
 	/*
 	 * Sanitize the command parameters:
@@ -809,41 +838,15 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 	while (state.src_addr < src_start + len) {
 		VM_WARN_ON_ONCE(state.dst_addr >= dst_start + len);
 
-		pmd_t dst_pmdval;
-
-		dst_pmd = mm_alloc_pmd(dst_mm, state.dst_addr);
-		if (unlikely(!dst_pmd)) {
-			err = -ENOMEM;
+		err = mfill_get_pmd(&state);
+		if (err)
 			break;
-		}
 
-		dst_pmdval = pmdp_get_lockless(dst_pmd);
-		if (unlikely(pmd_none(dst_pmdval)) &&
-		    unlikely(__pte_alloc(dst_mm, dst_pmd))) {
-			err = -ENOMEM;
-			break;
-		}
-		dst_pmdval = pmdp_get_lockless(dst_pmd);
-		/*
-		 * If the dst_pmd is THP don't override it and just be strict.
-		 * (This includes the case where the PMD used to be THP and
-		 * changed back to none after __pte_alloc().)
-		 */
-		if (unlikely(!pmd_present(dst_pmdval) ||
-			     pmd_trans_huge(dst_pmdval))) {
-			err = -EEXIST;
-			break;
-		}
-		if (unlikely(pmd_bad(dst_pmdval))) {
-			err = -EFAULT;
-			break;
-		}
 		/*
 		 * For shmem mappings, khugepaged is allowed to remove page
 		 * tables under us; pte_offset_map_lock() will deal with that.
 		 */
 
-		state.pmd = dst_pmd;
 		err = mfill_atomic_pte(&state);
 		cond_resched();
 
-- 
2.51.0
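The shape of mm_alloc_pmd()'s walk, which mfill_get_pmd() wraps, is "allocate missing intermediate levels on demand, hand back the leaf slot, and let the caller re-check what is already there". A toy two-level userspace model of that shape, with invented names (`toy_alloc_pmd`, fixed-size tables); it maps no real memory and stands in for nothing beyond the allocation pattern:

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Toy model of the mm_alloc_pmd() walk: each level is an array of
 * pointers to the next level, allocated on demand. The returned slot
 * may already be populated - callers must re-check it, just as
 * mfill_get_pmd() re-reads the pmd after allocation.
 */

#define ENTRIES 8

struct pmd_table { unsigned long pte[ENTRIES]; };
struct pgd_table { struct pmd_table *pmd[ENTRIES]; };

/* Ensure the intermediate table exists, then return the leaf slot. */
static unsigned long *toy_alloc_pmd(struct pgd_table *pgd,
				    unsigned long pgd_idx,
				    unsigned long pmd_idx)
{
	if (!pgd->pmd[pgd_idx]) {
		pgd->pmd[pgd_idx] = calloc(1, sizeof(struct pmd_table));
		if (!pgd->pmd[pgd_idx])
			return NULL;	/* -ENOMEM in the real code */
	}
	return &pgd->pmd[pgd_idx]->pte[pmd_idx];
}
```

Calling the helper twice for the same address returns the same slot without reallocating, which is why the real code must tolerate finding an already-established (possibly huge) entry.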
From: Mike Rapoport
To: Andrew Morton
Subject: [PATCH v2 04/15] userfaultfd: introduce mfill_get_vma() and mfill_put_vma()
Date: Fri, 6 Mar 2026 19:18:04 +0200
Message-ID: <20260306171815.3160826-5-rppt@kernel.org>
In-Reply-To: <20260306171815.3160826-1-rppt@kernel.org>
References: <20260306171815.3160826-1-rppt@kernel.org>

From: "Mike Rapoport (Microsoft)"

Split the code that finds, locks and verifies the VMA out of
mfill_atomic() into a helper function. This function will be used
later during the refactoring of mfill_atomic_pte_copy().

Add a counterpart mfill_put_vma() helper that unlocks the VMA and
releases map_changing_lock.
Signed-off-by: Mike Rapoport (Microsoft) --- mm/userfaultfd.c | 124 ++++++++++++++++++++++++++++------------------- 1 file changed, 73 insertions(+), 51 deletions(-) diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 224b55804f99..baff11e83101 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -157,6 +157,73 @@ static void uffd_mfill_unlock(struct vm_area_struct *v= ma) } #endif =20 +static void mfill_put_vma(struct mfill_state *state) +{ + up_read(&state->ctx->map_changing_lock); + uffd_mfill_unlock(state->vma); + state->vma =3D NULL; +} + +static int mfill_get_vma(struct mfill_state *state) +{ + struct userfaultfd_ctx *ctx =3D state->ctx; + uffd_flags_t flags =3D state->flags; + struct vm_area_struct *dst_vma; + int err; + + /* + * Make sure the vma is not shared, that the dst range is + * both valid and fully within a single existing vma. + */ + dst_vma =3D uffd_mfill_lock(ctx->mm, state->dst_start, state->len); + if (IS_ERR(dst_vma)) + return PTR_ERR(dst_vma); + + /* + * If memory mappings are changing because of non-cooperative + * operation (e.g. mremap) running in parallel, bail out and + * request the user to retry later + */ + down_read(&ctx->map_changing_lock); + err =3D -EAGAIN; + if (atomic_read(&ctx->mmap_changing)) + goto out_unlock; + + err =3D -EINVAL; + + /* + * shmem_zero_setup is invoked in mmap for MAP_ANONYMOUS|MAP_SHARED but + * it will overwrite vm_ops, so vma_is_anonymous must return false. + */ + if (WARN_ON_ONCE(vma_is_anonymous(dst_vma) && + dst_vma->vm_flags & VM_SHARED)) + goto out_unlock; + + /* + * validate 'mode' now that we know the dst_vma: don't allow + * a wrprotect copy if the userfaultfd didn't register as WP. 
+ */ + if ((flags & MFILL_ATOMIC_WP) && !(dst_vma->vm_flags & VM_UFFD_WP)) + goto out_unlock; + + if (is_vm_hugetlb_page(dst_vma)) + goto out; + + if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma)) + goto out_unlock; + if (!vma_is_shmem(dst_vma) && + uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) + goto out_unlock; + +out: + state->vma =3D dst_vma; + return 0; + +out_unlock: + mfill_put_vma(state); + return err; +} + static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address) { pgd_t *pgd; @@ -768,8 +835,6 @@ static __always_inline ssize_t mfill_atomic(struct user= faultfd_ctx *ctx, .src_addr =3D src_start, .dst_addr =3D dst_start, }; - struct mm_struct *dst_mm =3D ctx->mm; - struct vm_area_struct *dst_vma; long copied =3D 0; ssize_t err; =20 @@ -784,57 +849,17 @@ static __always_inline ssize_t mfill_atomic(struct us= erfaultfd_ctx *ctx, VM_WARN_ON_ONCE(dst_start + len <=3D dst_start); =20 retry: - /* - * Make sure the vma is not shared, that the dst range is - * both valid and fully within a single existing vma. - */ - dst_vma =3D uffd_mfill_lock(dst_mm, dst_start, len); - if (IS_ERR(dst_vma)) { - err =3D PTR_ERR(dst_vma); + err =3D mfill_get_vma(&state); + if (err) goto out; - } - - /* - * If memory mappings are changing because of non-cooperative - * operation (e.g. mremap) running in parallel, bail out and - * request the user to retry later - */ - down_read(&ctx->map_changing_lock); - err =3D -EAGAIN; - if (atomic_read(&ctx->mmap_changing)) - goto out_unlock; - - err =3D -EINVAL; - /* - * shmem_zero_setup is invoked in mmap for MAP_ANONYMOUS|MAP_SHARED but - * it will overwrite vm_ops, so vma_is_anonymous must return false. - */ - if (WARN_ON_ONCE(vma_is_anonymous(dst_vma) && - dst_vma->vm_flags & VM_SHARED)) - goto out_unlock; - - /* - * validate 'mode' now that we know the dst_vma: don't allow - * a wrprotect copy if the userfaultfd didn't register as WP. 
- */
-	if ((flags & MFILL_ATOMIC_WP) && !(dst_vma->vm_flags & VM_UFFD_WP))
-		goto out_unlock;
 
 	/*
 	 * If this is a HUGETLB vma, pass off to appropriate routine
 	 */
-	if (is_vm_hugetlb_page(dst_vma))
-		return mfill_atomic_hugetlb(ctx, dst_vma, dst_start,
+	if (is_vm_hugetlb_page(state.vma))
+		return mfill_atomic_hugetlb(ctx, state.vma, dst_start,
 					    src_start, len, flags);
 
-	if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma))
-		goto out_unlock;
-	if (!vma_is_shmem(dst_vma) &&
-	    uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
-		goto out_unlock;
-
-	state.vma = dst_vma;
-
 	while (state.src_addr < src_start + len) {
 		VM_WARN_ON_ONCE(state.dst_addr >= dst_start + len);
 
@@ -853,8 +878,7 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 		if (unlikely(err == -ENOENT)) {
 			void *kaddr;
 
-			up_read(&ctx->map_changing_lock);
-			uffd_mfill_unlock(state.vma);
+			mfill_put_vma(&state);
 			VM_WARN_ON_ONCE(!state.folio);
 
 			kaddr = kmap_local_folio(state.folio, 0);
@@ -883,9 +907,7 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
 			break;
 		}
 
-out_unlock:
-	up_read(&ctx->map_changing_lock);
-	uffd_mfill_unlock(state.vma);
+	mfill_put_vma(&state);
 out:
 	if (state.folio)
 		folio_put(state.folio);
-- 
2.51.0

From nobody Thu Apr 9 16:32:33 2026
From: Mike Rapoport To: Andrew Morton Cc: Andrea Arcangeli , Axel Rasmussen , Baolin Wang , David Hildenbrand , Hugh Dickins , James Houghton , "Liam R.
Howlett" , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , Michal Hocko , Mike Rapoport , Muchun Song , Nikita Kalyazin , Oscar Salvador , Paolo Bonzini , Peter Xu , Sean Christopherson , Shuah Khan , Suren Baghdasaryan , Vlastimil Babka , kvm@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH v2 05/15] userfaultfd: retry copying with locks dropped in mfill_atomic_pte_copy() Date: Fri, 6 Mar 2026 19:18:05 +0200 Message-ID: <20260306171815.3160826-6-rppt@kernel.org> In-Reply-To: <20260306171815.3160826-1-rppt@kernel.org> References: <20260306171815.3160826-1-rppt@kernel.org> From: "Mike Rapoport (Microsoft)"

The implementation of UFFDIO_COPY for anonymous memory might fail to copy data from the userspace buffer when the destination VMA is locked (either with mmap_lock or with a per-VMA lock). In that case, mfill_atomic() releases the locks, retries copying the data with the locks dropped, and then re-locks the destination VMA and re-establishes the PMD.

Since this retry-reget dance is only relevant for UFFDIO_COPY and it never happens for other UFFDIO_ operations, make it a part of mfill_atomic_pte_copy() that actually implements UFFDIO_COPY for anonymous memory.

As a temporary safety measure to avoid breaking bisection, mfill_atomic_pte_copy() makes sure to never return -ENOENT so that the loop in mfill_atomic() won't retry copying outside of mmap_lock. This is removed later, when the shmem implementation is updated and the loop in mfill_atomic() is adjusted.
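The drop-locks/copy/revalidate pattern described above can be sketched in plain userspace C. All names here are hypothetical toy stand-ins: a generation counter plays the role of the VMA/PMD revalidation that mfill_get_vma() and mfill_get_pmd() perform after the locks are re-taken; this is an illustration of the pattern, not the kernel implementation.

```c
#include <assert.h>
#include <string.h>

/* Toy stand-ins (hypothetical): a lock flag and a generation counter
 * play the roles of mmap_lock and VMA revalidation. */
struct toy_map {
	int locked;
	unsigned long generation;	/* bumped when "mappings" change */
};

static char dst_page[4096];		/* the destination "folio" */

static void toy_lock(struct toy_map *map)   { map->locked = 1; }
static void toy_unlock(struct toy_map *map) { map->locked = 0; }

/* First attempt: copy while the lock is held; may "fault" (return -1),
 * like copy_from_user() hitting an unpopulated source page. */
static int copy_locked(const char *src, int simulate_fault)
{
	if (simulate_fault)
		return -1;
	memcpy(dst_page, src, sizeof(dst_page));
	return 0;
}

/* The shape of the retry path: drop the lock, copy into a staging
 * buffer, then re-take the lock and revalidate before installing. */
static int copy_with_retry(struct toy_map *map, const char *src,
			   int simulate_fault)
{
	char staging[4096];
	unsigned long gen;

	toy_lock(map);
	gen = map->generation;
	if (copy_locked(src, simulate_fault) == 0) {
		toy_unlock(map);
		return 0;
	}

	toy_unlock(map);		/* retry with the lock dropped */
	memcpy(staging, src, sizeof(staging));

	toy_lock(map);			/* re-take and revalidate */
	if (gen != map->generation) {
		toy_unlock(map);
		return -11;		/* -EAGAIN: caller must restart */
	}
	memcpy(dst_page, staging, sizeof(staging));
	toy_unlock(map);
	return 0;
}
```

The key property the sketch shows is that nothing copied with the lock dropped is installed until the state has been re-checked under the lock, which is exactly why the patch re-runs mfill_get_vma() and mfill_get_pmd() after the retry copy.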
Signed-off-by: Mike Rapoport (Microsoft) --- mm/userfaultfd.c | 78 +++++++++++++++++++++++++++++++++--------------- 1 file changed, 54 insertions(+), 24 deletions(-) diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index baff11e83101..828f252c720c 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -159,6 +159,9 @@ static void uffd_mfill_unlock(struct vm_area_struct *vm= a) =20 static void mfill_put_vma(struct mfill_state *state) { + if (!state->vma) + return; + up_read(&state->ctx->map_changing_lock); uffd_mfill_unlock(state->vma); state->vma =3D NULL; @@ -404,35 +407,63 @@ static int mfill_copy_folio_locked(struct folio *foli= o, unsigned long src_addr) return ret; } =20 +static int mfill_copy_folio_retry(struct mfill_state *state, struct folio = *folio) +{ + unsigned long src_addr =3D state->src_addr; + void *kaddr; + int err; + + /* retry copying with mm_lock dropped */ + mfill_put_vma(state); + + kaddr =3D kmap_local_folio(folio, 0); + err =3D copy_from_user(kaddr, (const void __user *) src_addr, PAGE_SIZE); + kunmap_local(kaddr); + if (unlikely(err)) + return -EFAULT; + + flush_dcache_folio(folio); + + /* reget VMA and PMD, they could change underneath us */ + err =3D mfill_get_vma(state); + if (err) + return err; + + err =3D mfill_get_pmd(state); + if (err) + return err; + + return 0; +} + static int mfill_atomic_pte_copy(struct mfill_state *state) { - struct vm_area_struct *dst_vma =3D state->vma; unsigned long dst_addr =3D state->dst_addr; unsigned long src_addr =3D state->src_addr; uffd_flags_t flags =3D state->flags; - pmd_t *dst_pmd =3D state->pmd; struct folio *folio; int ret; =20 - if (!state->folio) { - ret =3D -ENOMEM; - folio =3D vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, dst_vma, - dst_addr); - if (!folio) - goto out; + folio =3D vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, state->vma, dst_addr); + if (!folio) + return -ENOMEM; =20 - ret =3D mfill_copy_folio_locked(folio, src_addr); + ret =3D -ENOMEM; + if (mem_cgroup_charge(folio, 
state->vma->vm_mm, GFP_KERNEL))
+		goto out_release;
 
-	/* fallback to copy_from_user outside mmap_lock */
-	if (unlikely(ret)) {
-		ret = -ENOENT;
-		state->folio = folio;
-		/* don't free the page */
-		goto out;
-	}
-	} else {
-		folio = state->folio;
-		state->folio = NULL;
+	ret = mfill_copy_folio_locked(folio, src_addr);
+	if (unlikely(ret)) {
+		/*
+		 * Fallback to copy_from_user outside mmap_lock.
+		 * If retry is successful, mfill_copy_folio_locked() returns
+		 * with locks retaken by mfill_get_vma().
+		 * If there was an error, we must mfill_put_vma() anyway and it
+		 * will take care of unlocking if needed.
+		 */
+		ret = mfill_copy_folio_retry(state, folio);
+		if (ret)
+			goto out_release;
 	}
 
 	/*
@@ -442,17 +473,16 @@ static int mfill_atomic_pte_copy(struct mfill_state *state)
 	 */
 	__folio_mark_uptodate(folio);
 
-	ret = -ENOMEM;
-	if (mem_cgroup_charge(folio, dst_vma->vm_mm, GFP_KERNEL))
-		goto out_release;
-
-	ret = mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr,
+	ret = mfill_atomic_install_pte(state->pmd, state->vma, dst_addr,
 					&folio->page, true, flags);
 	if (ret)
 		goto out_release;
 out:
 	return ret;
 out_release:
+	/* Don't return -ENOENT so that our caller won't retry */
+	if (ret == -ENOENT)
+		ret = -EFAULT;
 	folio_put(folio);
 	goto out;
 }
-- 
2.51.0

From nobody Thu Apr 9 16:32:33 2026
From: Mike Rapoport To: Andrew Morton Cc: Andrea Arcangeli , Axel Rasmussen , Baolin Wang , David Hildenbrand , Hugh Dickins , James Houghton , "Liam R.
Howlett" , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , Michal Hocko , Mike Rapoport , Muchun Song , Nikita Kalyazin , Oscar Salvador , Paolo Bonzini , Peter Xu , Sean Christopherson , Shuah Khan , Suren Baghdasaryan , Vlastimil Babka , kvm@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH v2 06/15] userfaultfd: move vma_can_userfault out of line Date: Fri, 6 Mar 2026 19:18:06 +0200 Message-ID: <20260306171815.3160826-7-rppt@kernel.org> In-Reply-To: <20260306171815.3160826-1-rppt@kernel.org> References: <20260306171815.3160826-1-rppt@kernel.org> From: "Mike Rapoport (Microsoft)"

vma_can_userfault() has grown pretty big and it is not called on a performance-critical path. Move it out of line.

No functional changes.

Reviewed-by: David Hildenbrand (Red Hat)
Reviewed-by: Liam R.
Howlett Signed-off-by: Mike Rapoport (Microsoft) --- include/linux/userfaultfd_k.h | 35 ++--------------------------------- mm/userfaultfd.c | 33 +++++++++++++++++++++++++++++++++ 2 files changed, 35 insertions(+), 33 deletions(-) diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index fd5f42765497..a49cf750e803 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -208,39 +208,8 @@ static inline bool userfaultfd_armed(struct vm_area_st= ruct *vma) return vma->vm_flags & __VM_UFFD_FLAGS; } =20 -static inline bool vma_can_userfault(struct vm_area_struct *vma, - vm_flags_t vm_flags, - bool wp_async) -{ - vm_flags &=3D __VM_UFFD_FLAGS; - - if (vma->vm_flags & VM_DROPPABLE) - return false; - - if ((vm_flags & VM_UFFD_MINOR) && - (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma))) - return false; - - /* - * If wp async enabled, and WP is the only mode enabled, allow any - * memory type. - */ - if (wp_async && (vm_flags =3D=3D VM_UFFD_WP)) - return true; - - /* - * If user requested uffd-wp but not enabled pte markers for - * uffd-wp, then shmem & hugetlbfs are not supported but only - * anonymous. - */ - if (!uffd_supports_wp_marker() && (vm_flags & VM_UFFD_WP) && - !vma_is_anonymous(vma)) - return false; - - /* By default, allow any of anon|shmem|hugetlb */ - return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) || - vma_is_shmem(vma); -} +bool vma_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags, + bool wp_async); =20 static inline bool vma_has_uffd_without_event_remap(struct vm_area_struct = *vma) { diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 828f252c720c..c5fd1e5c67b3 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -2020,6 +2020,39 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, unsi= gned long dst_start, return moved ? 
moved : err;
 }
 
+bool vma_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags,
+		       bool wp_async)
+{
+	vm_flags &= __VM_UFFD_FLAGS;
+
+	if (vma->vm_flags & VM_DROPPABLE)
+		return false;
+
+	if ((vm_flags & VM_UFFD_MINOR) &&
+	    (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma)))
+		return false;
+
+	/*
+	 * If wp async enabled, and WP is the only mode enabled, allow any
+	 * memory type.
+	 */
+	if (wp_async && (vm_flags == VM_UFFD_WP))
+		return true;
+
+	/*
+	 * If user requested uffd-wp but not enabled pte markers for
+	 * uffd-wp, then shmem & hugetlbfs are not supported but only
+	 * anonymous.
+	 */
+	if (!uffd_supports_wp_marker() && (vm_flags & VM_UFFD_WP) &&
+	    !vma_is_anonymous(vma))
+		return false;
+
+	/* By default, allow any of anon|shmem|hugetlb */
+	return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
+	       vma_is_shmem(vma);
+}
+
 static void userfaultfd_set_vm_flags(struct vm_area_struct *vma,
 				     vm_flags_t vm_flags)
 {
-- 
2.51.0

From nobody Thu Apr 9 16:32:33 2026
From: Mike Rapoport To: Andrew Morton Cc: Andrea Arcangeli , Axel Rasmussen , Baolin Wang , David Hildenbrand , Hugh Dickins , James Houghton , "Liam R.
Howlett" , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , Michal Hocko , Mike Rapoport , Muchun Song , Nikita Kalyazin , Oscar Salvador , Paolo Bonzini , Peter Xu , Sean Christopherson , Shuah Khan , Suren Baghdasaryan , Vlastimil Babka , kvm@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH v2 07/15] userfaultfd: introduce vm_uffd_ops Date: Fri, 6 Mar 2026 19:18:07 +0200 Message-ID: <20260306171815.3160826-8-rppt@kernel.org> In-Reply-To: <20260306171815.3160826-1-rppt@kernel.org> References: <20260306171815.3160826-1-rppt@kernel.org> From: "Mike Rapoport (Microsoft)"

The current userfaultfd implementation works only with memory managed by the core MM: anonymous, shmem and hugetlb.

First, there is no fundamental reason to limit userfaultfd support to these core memory types: userfaults can be handled similarly to regular page faults, provided a VMA owner implements the appropriate callbacks. Second, historically various code paths were conditioned on vma_is_anonymous(), vma_is_shmem() and is_vm_hugetlb_page(), and some of these conditions can be expressed as operations implemented by a particular memory type.

Introduce a vm_uffd_ops extension to vm_operations_struct that delegates memory-type-specific operations to the VMA owner. Operations for anonymous memory are handled internally in userfaultfd using anon_uffd_ops, which is implicitly assigned to anonymous VMAs.

Start with a single operation, ->can_userfault(), that verifies that a VMA meets the requirements for userfaultfd support at registration time. Implement the method for anonymous, shmem and hugetlb memory, and move the relevant parts of vma_can_userfault() into the new callbacks.
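The ops-table dispatch being introduced here — an optional per-type function table, with anonymous VMAs getting their ops implicitly and a NULL table meaning "not supported" — can be illustrated with a small self-contained userspace sketch. All `toy_*` names are hypothetical; this mirrors only the shape of the dispatch, not the kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A toy "VMA" and ops table mirroring the shape of vm_uffd_ops. */
struct toy_vma;
struct toy_uffd_ops {
	bool (*can_userfault)(const struct toy_vma *vma, unsigned int vm_flags);
};
struct toy_vma {
	bool anonymous;
	const struct toy_uffd_ops *uffd_ops;	/* NULL: type opts out */
};

#define TOY_UFFD_MINOR	0x1u

static bool anon_can_userfault(const struct toy_vma *vma,
			       unsigned int vm_flags)
{
	(void)vma;
	/* anonymous memory does not support MINOR mode */
	return !(vm_flags & TOY_UFFD_MINOR);
}

static const struct toy_uffd_ops anon_ops = {
	.can_userfault	= anon_can_userfault,
};

/* Anonymous VMAs get their ops implicitly; others supply their own. */
static const struct toy_uffd_ops *toy_vma_uffd_ops(const struct toy_vma *vma)
{
	return vma->anonymous ? &anon_ops : vma->uffd_ops;
}

static bool toy_can_userfault(const struct toy_vma *vma, unsigned int vm_flags)
{
	const struct toy_uffd_ops *ops = toy_vma_uffd_ops(vma);

	/* only VMA types that implement the ops are supported */
	if (!ops)
		return false;
	return ops->can_userfault(vma, vm_flags);
}
```

The NULL check is what replaces the open-coded `vma_is_anonymous() || vma_is_shmem() || is_vm_hugetlb_page()` tests: a memory type opts in by providing an ops table rather than by being special-cased in the core.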
Signed-off-by: Mike Rapoport (Microsoft) --- include/linux/mm.h | 5 +++++ include/linux/userfaultfd_k.h | 6 ++++++ mm/hugetlb.c | 15 +++++++++++++++ mm/shmem.c | 15 +++++++++++++++ mm/userfaultfd.c | 36 ++++++++++++++++++++++++++--------- 5 files changed, 68 insertions(+), 9 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 5be3d8a8f806..b63b28c65676 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -741,6 +741,8 @@ struct vm_fault { */ }; =20 +struct vm_uffd_ops; + /* * These are the virtual MM functions - opening of an area, closing and * unmapping it (needed to keep files on disk up-to-date etc), pointer @@ -826,6 +828,9 @@ struct vm_operations_struct { struct page *(*find_normal_page)(struct vm_area_struct *vma, unsigned long addr); #endif /* CONFIG_FIND_NORMAL_PAGE */ +#ifdef CONFIG_USERFAULTFD + const struct vm_uffd_ops *uffd_ops; +#endif }; =20 #ifdef CONFIG_NUMA_BALANCING diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index a49cf750e803..56e85ab166c7 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -80,6 +80,12 @@ struct userfaultfd_ctx { =20 extern vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long rea= son); =20 +/* VMA userfaultfd operations */ +struct vm_uffd_ops { + /* Checks if a VMA can support userfaultfd */ + bool (*can_userfault)(struct vm_area_struct *vma, vm_flags_t vm_flags); +}; + /* A combined operation mode + behavior flags. 
*/ typedef unsigned int __bitwise uffd_flags_t; =20 diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 0beb6e22bc26..077968a8a69a 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4818,6 +4818,18 @@ static vm_fault_t hugetlb_vm_op_fault(struct vm_faul= t *vmf) return 0; } =20 +#ifdef CONFIG_USERFAULTFD +static bool hugetlb_can_userfault(struct vm_area_struct *vma, + vm_flags_t vm_flags) +{ + return true; +} + +static const struct vm_uffd_ops hugetlb_uffd_ops =3D { + .can_userfault =3D hugetlb_can_userfault, +}; +#endif + /* * When a new function is introduced to vm_operations_struct and added * to hugetlb_vm_ops, please consider adding the function to shm_vm_ops. @@ -4831,6 +4843,9 @@ const struct vm_operations_struct hugetlb_vm_ops =3D { .close =3D hugetlb_vm_op_close, .may_split =3D hugetlb_vm_op_split, .pagesize =3D hugetlb_vm_op_pagesize, +#ifdef CONFIG_USERFAULTFD + .uffd_ops =3D &hugetlb_uffd_ops, +#endif }; =20 static pte_t make_huge_pte(struct vm_area_struct *vma, struct folio *folio, diff --git a/mm/shmem.c b/mm/shmem.c index b40f3cd48961..f2a25805b9bf 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -3294,6 +3294,15 @@ int shmem_mfill_atomic_pte(pmd_t *dst_pmd, shmem_inode_unacct_blocks(inode, 1); return ret; } + +static bool shmem_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_= flags) +{ + return true; +} + +static const struct vm_uffd_ops shmem_uffd_ops =3D { + .can_userfault =3D shmem_can_userfault, +}; #endif /* CONFIG_USERFAULTFD */ =20 #ifdef CONFIG_TMPFS @@ -5313,6 +5322,9 @@ static const struct vm_operations_struct shmem_vm_ops= =3D { .set_policy =3D shmem_set_policy, .get_policy =3D shmem_get_policy, #endif +#ifdef CONFIG_USERFAULTFD + .uffd_ops =3D &shmem_uffd_ops, +#endif }; =20 static const struct vm_operations_struct shmem_anon_vm_ops =3D { @@ -5322,6 +5334,9 @@ static const struct vm_operations_struct shmem_anon_v= m_ops =3D { .set_policy =3D shmem_set_policy, .get_policy =3D shmem_get_policy, #endif +#ifdef CONFIG_USERFAULTFD + .uffd_ops 
=3D &shmem_uffd_ops, +#endif }; =20 int shmem_init_fs_context(struct fs_context *fc) diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index c5fd1e5c67b3..b55d4a8d88cc 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -34,6 +34,25 @@ struct mfill_state { pmd_t *pmd; }; =20 +static bool anon_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_f= lags) +{ + /* anonymous memory does not support MINOR mode */ + if (vm_flags & VM_UFFD_MINOR) + return false; + return true; +} + +static const struct vm_uffd_ops anon_uffd_ops =3D { + .can_userfault =3D anon_can_userfault, +}; + +static const struct vm_uffd_ops *vma_uffd_ops(struct vm_area_struct *vma) +{ + if (vma_is_anonymous(vma)) + return &anon_uffd_ops; + return vma->vm_ops ? vma->vm_ops->uffd_ops : NULL; +} + static __always_inline bool validate_dst_vma(struct vm_area_struct *dst_vma, unsigned long dst_en= d) { @@ -2023,13 +2042,15 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, uns= igned long dst_start, bool vma_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags, bool wp_async) { - vm_flags &=3D __VM_UFFD_FLAGS; + const struct vm_uffd_ops *ops =3D vma_uffd_ops(vma); =20 - if (vma->vm_flags & VM_DROPPABLE) + /* only VMAs that implement vm_uffd_ops are supported */ + if (!ops) return false; =20 - if ((vm_flags & VM_UFFD_MINOR) && - (!is_vm_hugetlb_page(vma) && !vma_is_shmem(vma))) + vm_flags &=3D __VM_UFFD_FLAGS; + + if (vma->vm_flags & VM_DROPPABLE) return false; =20 /* @@ -2041,16 +2062,13 @@ bool vma_can_userfault(struct vm_area_struct *vma, = vm_flags_t vm_flags, =20 /* * If user requested uffd-wp but not enabled pte markers for - * uffd-wp, then shmem & hugetlbfs are not supported but only - * anonymous. 
+ * uffd-wp, then only anonymous memory is supported
 	 */
 	if (!uffd_supports_wp_marker() && (vm_flags & VM_UFFD_WP) &&
 	    !vma_is_anonymous(vma))
 		return false;
 
-	/* By default, allow any of anon|shmem|hugetlb */
-	return vma_is_anonymous(vma) || is_vm_hugetlb_page(vma) ||
-	       vma_is_shmem(vma);
+	return ops->can_userfault(vma, vm_flags);
 }
 
 static void userfaultfd_set_vm_flags(struct vm_area_struct *vma,
-- 
2.51.0

From nobody Thu Apr 9 16:32:33 2026
From: Mike Rapoport To: Andrew Morton Cc: Andrea Arcangeli , Axel Rasmussen , Baolin Wang , David Hildenbrand , Hugh Dickins , James Houghton , "Liam R. Howlett" , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , Michal Hocko , Mike Rapoport , Muchun Song , Nikita Kalyazin , Oscar Salvador , Paolo Bonzini , Peter Xu , Sean Christopherson , Shuah Khan , Suren Baghdasaryan , Vlastimil Babka , kvm@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH v2 08/15] shmem, userfaultfd: use a VMA callback to handle UFFDIO_CONTINUE Date: Fri, 6 Mar 2026 19:18:08 +0200 Message-ID: <20260306171815.3160826-9-rppt@kernel.org> In-Reply-To: <20260306171815.3160826-1-rppt@kernel.org> References: <20260306171815.3160826-1-rppt@kernel.org> From: "Mike Rapoport (Microsoft)"

When userspace resolves a page fault in a shmem VMA with UFFDIO_CONTINUE, it needs to get a folio that already exists in the pagecache backing that VMA. Instead of using shmem_get_folio() for that, add a get_folio_noalloc() method to 'struct vm_uffd_ops' that returns a folio if it exists in the VMA's pagecache at the given pgoff.
Implement get_folio_noalloc() method for shmem and slightly refactor userfaultfd's mfill_get_vma() and mfill_atomic_pte_continue() to support this new API. Signed-off-by: Mike Rapoport (Microsoft) Reviewed-by: James Houghton --- include/linux/userfaultfd_k.h | 7 +++++++ mm/shmem.c | 15 ++++++++++++++- mm/userfaultfd.c | 32 ++++++++++++++++---------------- 3 files changed, 37 insertions(+), 17 deletions(-) diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index 56e85ab166c7..66dfc3c164e6 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -84,6 +84,13 @@ extern vm_fault_t handle_userfault(struct vm_fault *vmf,= unsigned long reason); struct vm_uffd_ops { /* Checks if a VMA can support userfaultfd */ bool (*can_userfault)(struct vm_area_struct *vma, vm_flags_t vm_flags); + /* + * Called to resolve UFFDIO_CONTINUE request. + * Should return the folio found at pgoff in the VMA's pagecache if it + * exists or ERR_PTR otherwise. + * The returned folio is locked and with reference held. + */ + struct folio *(*get_folio_noalloc)(struct inode *inode, pgoff_t pgoff); }; =20 /* A combined operation mode + behavior flags. 
*/ diff --git a/mm/shmem.c b/mm/shmem.c index f2a25805b9bf..7bd887b64f62 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -3295,13 +3295,26 @@ int shmem_mfill_atomic_pte(pmd_t *dst_pmd, return ret; } =20 +static struct folio *shmem_get_folio_noalloc(struct inode *inode, pgoff_t = pgoff) +{ + struct folio *folio; + int err; + + err =3D shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC); + if (err) + return ERR_PTR(err); + + return folio; +} + static bool shmem_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_= flags) { return true; } =20 static const struct vm_uffd_ops shmem_uffd_ops =3D { - .can_userfault =3D shmem_can_userfault, + .can_userfault =3D shmem_can_userfault, + .get_folio_noalloc =3D shmem_get_folio_noalloc, }; #endif /* CONFIG_USERFAULTFD */ =20 diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index b55d4a8d88cc..98ade14eaa5b 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -191,6 +191,7 @@ static int mfill_get_vma(struct mfill_state *state) struct userfaultfd_ctx *ctx =3D state->ctx; uffd_flags_t flags =3D state->flags; struct vm_area_struct *dst_vma; + const struct vm_uffd_ops *ops; int err; =20 /* @@ -231,10 +232,12 @@ static int mfill_get_vma(struct mfill_state *state) if (is_vm_hugetlb_page(dst_vma)) goto out; =20 - if (!vma_is_anonymous(dst_vma) && !vma_is_shmem(dst_vma)) + ops =3D vma_uffd_ops(dst_vma); + if (!ops) goto out_unlock; - if (!vma_is_shmem(dst_vma) && - uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) + + if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE) && + !ops->get_folio_noalloc) goto out_unlock; =20 out: @@ -577,6 +580,7 @@ static int mfill_atomic_pte_zeropage(struct mfill_state= *state) static int mfill_atomic_pte_continue(struct mfill_state *state) { struct vm_area_struct *dst_vma =3D state->vma; + const struct vm_uffd_ops *ops =3D vma_uffd_ops(dst_vma); unsigned long dst_addr =3D state->dst_addr; pgoff_t pgoff =3D linear_page_index(dst_vma, dst_addr); struct inode *inode =3D file_inode(dst_vma->vm_file); @@ 
-586,16 +590,13 @@ static int mfill_atomic_pte_continue(struct mfill_state *state)
 	struct page *page;
 	int ret;
 
-	ret = shmem_get_folio(inode, pgoff, 0, &folio, SGP_NOALLOC);
+	if (!ops)
+		return -EOPNOTSUPP;
+
+	folio = ops->get_folio_noalloc(inode, pgoff);
 	/* Our caller expects us to return -EFAULT if we failed to find folio */
-	if (ret == -ENOENT)
-		ret = -EFAULT;
-	if (ret)
-		goto out;
-	if (!folio) {
-		ret = -EFAULT;
-		goto out;
-	}
+	if (IS_ERR_OR_NULL(folio))
+		return -EFAULT;
 
 	page = folio_file_page(folio, pgoff);
 	if (PageHWPoison(page)) {
@@ -609,13 +610,12 @@ static int mfill_atomic_pte_continue(struct mfill_state *state)
 		goto out_release;
 
 	folio_unlock(folio);
-	ret = 0;
-out:
-	return ret;
+	return 0;
+
 out_release:
 	folio_unlock(folio);
 	folio_put(folio);
-	goto out;
+	return ret;
 }
 
 /* Handles UFFDIO_POISON for all non-hugetlb VMAs. */
-- 
2.51.0

From nobody Thu Apr 9 16:32:33 2026
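The get_folio_noalloc() contract above — return the object if it is already present, an ERR_PTR on failure, never allocate — relies on the kernel's pointer-or-errno convention. A userspace re-creation of that convention (illustrative only; `lookup_noalloc` and the cache are hypothetical stand-ins for the pagecache lookup) shows why a single `IS_ERR_OR_NULL()` test can replace the patch's multiple error branches:

```c
#include <assert.h>
#include <stddef.h>

/* Userspace re-creation of the kernel's ERR_PTR convention: errno values
 * are encoded in the top page of the address space. */
#define MAX_ERRNO	4095

static void *ERR_PTR(long err)		{ return (void *)err; }
static long PTR_ERR(const void *ptr)	{ return (long)ptr; }
static int IS_ERR_OR_NULL(const void *ptr)
{
	return !ptr || (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
}

/* A lookup in the style of ->get_folio_noalloc(): return the cached
 * object if it is already present, ERR_PTR(-ENOENT) otherwise --
 * and never allocate a new one. */
static int cache[8];

static void *lookup_noalloc(size_t idx)
{
	if (idx >= 8 || !cache[idx])
		return ERR_PTR(-2);	/* -ENOENT: not in the "pagecache" */
	return &cache[idx];
}
```

Because failure is encoded in the returned pointer itself, the caller needs no separate `ret`/`folio` pair, which is what lets mfill_atomic_pte_continue() collapse its error handling to one `IS_ERR_OR_NULL(folio)` check.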
From: Mike Rapoport To: Andrew Morton Cc: Andrea Arcangeli , Axel Rasmussen , Baolin Wang , David Hildenbrand , Hugh Dickins , James Houghton , "Liam R.
Howlett" , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , Michal Hocko , Mike Rapoport , Muchun Song , Nikita Kalyazin , Oscar Salvador , Paolo Bonzini , Peter Xu , Sean Christopherson , Shuah Khan , Suren Baghdasaryan , Vlastimil Babka , kvm@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH v2 09/15] userfaultfd: introduce vm_uffd_ops->alloc_folio() Date: Fri, 6 Mar 2026 19:18:09 +0200 Message-ID: <20260306171815.3160826-10-rppt@kernel.org> In-Reply-To: <20260306171815.3160826-1-rppt@kernel.org> References: <20260306171815.3160826-1-rppt@kernel.org> From: "Mike Rapoport (Microsoft)" Add a vm_uffd_ops->alloc_folio() method and use it to refactor mfill_atomic_pte_zeroed_folio() and mfill_atomic_pte_copy(). mfill_atomic_pte_zeroed_folio() and mfill_atomic_pte_copy() perform almost identical actions: * allocate a folio * update the folio contents (either copy from userspace or fill with zeros) * update the page tables with the new folio Split out a __mfill_atomic_pte() helper that handles both cases and uses the newly introduced vm_uffd_ops->alloc_folio() to allocate the folio. Pass the ops structure from the callers to __mfill_atomic_pte() to later allow using anon_uffd_ops for MAP_PRIVATE mappings of file-backed VMAs. Note that the new ops method is called alloc_folio() rather than folio_alloc() to avoid a clash with the alloc_tag macro folio_alloc().
Signed-off-by: Mike Rapoport (Microsoft) Reviewed-by: James Houghton --- include/linux/userfaultfd_k.h | 6 +++ mm/userfaultfd.c | 92 ++++++++++++++++++----------------- 2 files changed, 54 insertions(+), 44 deletions(-) diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index 66dfc3c164e6..4d8b879eed91 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -91,6 +91,12 @@ struct vm_uffd_ops { * The returned folio is locked and with reference held. */ struct folio *(*get_folio_noalloc)(struct inode *inode, pgoff_t pgoff); + /* + * Called during resolution of UFFDIO_COPY request. + * Should return allocate a and return folio or NULL if allocation fails. + */ + struct folio *(*alloc_folio)(struct vm_area_struct *vma, + unsigned long addr); }; =20 /* A combined operation mode + behavior flags. */ diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 98ade14eaa5b..31f3ab6a73e2 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -42,8 +42,26 @@ static bool anon_can_userfault(struct vm_area_struct *vm= a, vm_flags_t vm_flags) return true; } =20 +static struct folio *anon_alloc_folio(struct vm_area_struct *vma, + unsigned long addr) +{ + struct folio *folio =3D vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, vma, + addr); + + if (!folio) + return NULL; + + if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL)) { + folio_put(folio); + return NULL; + } + + return folio; +} + static const struct vm_uffd_ops anon_uffd_ops =3D { .can_userfault =3D anon_can_userfault, + .alloc_folio =3D anon_alloc_folio, }; =20 static const struct vm_uffd_ops *vma_uffd_ops(struct vm_area_struct *vma) @@ -458,7 +476,8 @@ static int mfill_copy_folio_retry(struct mfill_state *s= tate, struct folio *folio return 0; } =20 -static int mfill_atomic_pte_copy(struct mfill_state *state) +static int __mfill_atomic_pte(struct mfill_state *state, + const struct vm_uffd_ops *ops) { unsigned long dst_addr =3D state->dst_addr; unsigned long src_addr =3D 
state->src_addr; @@ -466,16 +485,12 @@ static int mfill_atomic_pte_copy(struct mfill_state *= state) struct folio *folio; int ret; =20 - folio =3D vma_alloc_folio(GFP_HIGHUSER_MOVABLE, 0, state->vma, dst_addr); + folio =3D ops->alloc_folio(state->vma, state->dst_addr); if (!folio) return -ENOMEM; =20 - ret =3D -ENOMEM; - if (mem_cgroup_charge(folio, state->vma->vm_mm, GFP_KERNEL)) - goto out_release; - - ret =3D mfill_copy_folio_locked(folio, src_addr); - if (unlikely(ret)) { + if (uffd_flags_mode_is(flags, MFILL_ATOMIC_COPY)) { + ret =3D mfill_copy_folio_locked(folio, src_addr); /* * Fallback to copy_from_user outside mmap_lock. * If retry is successful, mfill_copy_folio_locked() returns @@ -483,9 +498,15 @@ static int mfill_atomic_pte_copy(struct mfill_state *s= tate) * If there was an error, we must mfill_put_vma() anyway and it * will take care of unlocking if needed. */ - ret =3D mfill_copy_folio_retry(state, folio); - if (ret) - goto out_release; + if (unlikely(ret)) { + ret =3D mfill_copy_folio_retry(state, folio); + if (ret) + goto err_folio_put; + } + } else if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE)) { + clear_user_highpage(&folio->page, state->dst_addr); + } else { + VM_WARN_ONCE(1, "unknown UFFDIO operation"); } =20 /* @@ -498,47 +519,30 @@ static int mfill_atomic_pte_copy(struct mfill_state *= state) ret =3D mfill_atomic_install_pte(state->pmd, state->vma, dst_addr, &folio->page, true, flags); if (ret) - goto out_release; -out: - return ret; -out_release: + goto err_folio_put; + + return 0; + +err_folio_put: + folio_put(folio); /* Don't return -ENOENT so that our caller won't retry */ if (ret =3D=3D -ENOENT) ret =3D -EFAULT; - folio_put(folio); - goto out; + return ret; } =20 -static int mfill_atomic_pte_zeroed_folio(pmd_t *dst_pmd, - struct vm_area_struct *dst_vma, - unsigned long dst_addr) +static int mfill_atomic_pte_copy(struct mfill_state *state) { - struct folio *folio; - int ret =3D -ENOMEM; - - folio =3D 
vma_alloc_zeroed_movable_folio(dst_vma, dst_addr); - if (!folio) - return ret; - - if (mem_cgroup_charge(folio, dst_vma->vm_mm, GFP_KERNEL)) - goto out_put; + const struct vm_uffd_ops *ops =3D vma_uffd_ops(state->vma); =20 - /* - * The memory barrier inside __folio_mark_uptodate makes sure that - * zeroing out the folio become visible before mapping the page - * using set_pte_at(). See do_anonymous_page(). - */ - __folio_mark_uptodate(folio); + return __mfill_atomic_pte(state, ops); +} =20 - ret =3D mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr, - &folio->page, true, 0); - if (ret) - goto out_put; +static int mfill_atomic_pte_zeroed_folio(struct mfill_state *state) +{ + const struct vm_uffd_ops *ops =3D vma_uffd_ops(state->vma); =20 - return 0; -out_put: - folio_put(folio); - return ret; + return __mfill_atomic_pte(state, ops); } =20 static int mfill_atomic_pte_zeropage(struct mfill_state *state) @@ -551,7 +555,7 @@ static int mfill_atomic_pte_zeropage(struct mfill_state= *state) int ret; =20 if (mm_forbids_zeropage(dst_vma->vm_mm)) - return mfill_atomic_pte_zeroed_folio(dst_pmd, dst_vma, dst_addr); + return mfill_atomic_pte_zeroed_folio(state); =20 _dst_pte =3D pte_mkspecial(pfn_pte(my_zero_pfn(dst_addr), dst_vma->vm_page_prot)); --=20 2.51.0 From nobody Thu Apr 9 16:32:33 2026
From: Mike Rapoport To: Andrew Morton Cc: Andrea Arcangeli , Axel Rasmussen , Baolin Wang , David Hildenbrand , Hugh Dickins , James Houghton , "Liam R.
Howlett" , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , Michal Hocko , Mike Rapoport , Muchun Song , Nikita Kalyazin , Oscar Salvador , Paolo Bonzini , Peter Xu , Sean Christopherson , Shuah Khan , Suren Baghdasaryan , Vlastimil Babka , kvm@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH v2 10/15] shmem, userfaultfd: implement shmem uffd operations using vm_uffd_ops Date: Fri, 6 Mar 2026 19:18:10 +0200 Message-ID: <20260306171815.3160826-11-rppt@kernel.org> In-Reply-To: <20260306171815.3160826-1-rppt@kernel.org> References: <20260306171815.3160826-1-rppt@kernel.org> From: "Mike Rapoport (Microsoft)" Add filemap_add() and filemap_remove() methods to vm_uffd_ops and use them in __mfill_atomic_pte() to add shmem folios to the page cache and remove them in case of error. Implement these methods in shmem along with vm_uffd_ops->alloc_folio() and drop shmem_mfill_atomic_pte(). Since userfaultfd no longer references any functions from shmem, drop the include of linux/shmem_fs.h from mm/userfaultfd.c. mfill_atomic_install_pte() is not used anywhere outside of mm/userfaultfd.c, so make it static.
Signed-off-by: Mike Rapoport (Microsoft) Reviewed-by: James Houghton --- include/linux/shmem_fs.h | 14 ---- include/linux/userfaultfd_k.h | 21 +++-- mm/shmem.c | 148 ++++++++++++---------------------- mm/userfaultfd.c | 79 +++++++++--------- 4 files changed, 106 insertions(+), 156 deletions(-) diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h index a8273b32e041..1a345142af7d 100644 --- a/include/linux/shmem_fs.h +++ b/include/linux/shmem_fs.h @@ -221,20 +221,6 @@ static inline pgoff_t shmem_fallocend(struct inode *in= ode, pgoff_t eof) =20 extern bool shmem_charge(struct inode *inode, long pages); =20 -#ifdef CONFIG_USERFAULTFD -#ifdef CONFIG_SHMEM -extern int shmem_mfill_atomic_pte(pmd_t *dst_pmd, - struct vm_area_struct *dst_vma, - unsigned long dst_addr, - unsigned long src_addr, - uffd_flags_t flags, - struct folio **foliop); -#else /* !CONFIG_SHMEM */ -#define shmem_mfill_atomic_pte(dst_pmd, dst_vma, dst_addr, \ - src_addr, flags, foliop) ({ BUG(); 0; }) -#endif /* CONFIG_SHMEM */ -#endif /* CONFIG_USERFAULTFD */ - /* * Used space is stored as unsigned 64-bit value in bytes but * quota core supports only signed 64-bit values so use that diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index 4d8b879eed91..bf4e595ac914 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -93,10 +93,24 @@ struct vm_uffd_ops { struct folio *(*get_folio_noalloc)(struct inode *inode, pgoff_t pgoff); /* * Called during resolution of UFFDIO_COPY request. - * Should return allocate a and return folio or NULL if allocation fails. + * Should allocate and return a folio or NULL if allocation + * fails. */ struct folio *(*alloc_folio)(struct vm_area_struct *vma, unsigned long addr); + /* + * Called during resolution of UFFDIO_COPY request. + * Should lock the folio and add it to VMA's page cache. + * Returns 0 on success, error code on failure. 
+ */ + int (*filemap_add)(struct folio *folio, struct vm_area_struct *vma, + unsigned long addr); + /* + * Called during resolution of UFFDIO_COPY request on the error + * handling path. + * Should revert the operation of ->filemap_add(). + */ + void (*filemap_remove)(struct folio *folio, struct vm_area_struct *vma); }; =20 /* A combined operation mode + behavior flags. */ @@ -130,11 +144,6 @@ static inline uffd_flags_t uffd_flags_set_mode(uffd_fl= ags_t flags, enum mfill_at /* Flags controlling behavior. These behavior changes are mode-independent= . */ #define MFILL_ATOMIC_WP MFILL_ATOMIC_FLAG(0) =20 -extern int mfill_atomic_install_pte(pmd_t *dst_pmd, - struct vm_area_struct *dst_vma, - unsigned long dst_addr, struct page *page, - bool newly_allocated, uffd_flags_t flags); - extern ssize_t mfill_atomic_copy(struct userfaultfd_ctx *ctx, unsigned lon= g dst_start, unsigned long src_start, unsigned long len, uffd_flags_t flags); diff --git a/mm/shmem.c b/mm/shmem.c index 7bd887b64f62..68620caaf75f 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -3181,118 +3181,73 @@ static struct inode *shmem_get_inode(struct mnt_id= map *idmap, #endif /* CONFIG_TMPFS_QUOTA */ =20 #ifdef CONFIG_USERFAULTFD -int shmem_mfill_atomic_pte(pmd_t *dst_pmd, - struct vm_area_struct *dst_vma, - unsigned long dst_addr, - unsigned long src_addr, - uffd_flags_t flags, - struct folio **foliop) -{ - struct inode *inode =3D file_inode(dst_vma->vm_file); - struct shmem_inode_info *info =3D SHMEM_I(inode); +static struct folio *shmem_mfill_folio_alloc(struct vm_area_struct *vma, + unsigned long addr) +{ + struct inode *inode =3D file_inode(vma->vm_file); struct address_space *mapping =3D inode->i_mapping; + struct shmem_inode_info *info =3D SHMEM_I(inode); + pgoff_t pgoff =3D linear_page_index(vma, addr); gfp_t gfp =3D mapping_gfp_mask(mapping); - pgoff_t pgoff =3D linear_page_index(dst_vma, dst_addr); - void *page_kaddr; struct folio *folio; - int ret; - pgoff_t max_off; - - if 
(shmem_inode_acct_blocks(inode, 1)) { - /* - * We may have got a page, returned -ENOENT triggering a retry, - * and now we find ourselves with -ENOMEM. Release the page, to - * avoid a BUG_ON in our caller. - */ - if (unlikely(*foliop)) { - folio_put(*foliop); - *foliop =3D NULL; - } - return -ENOMEM; - } =20 - if (!*foliop) { - ret =3D -ENOMEM; - folio =3D shmem_alloc_folio(gfp, 0, info, pgoff); - if (!folio) - goto out_unacct_blocks; + if (unlikely(pgoff >=3D DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE))) + return NULL; =20 - if (uffd_flags_mode_is(flags, MFILL_ATOMIC_COPY)) { - page_kaddr =3D kmap_local_folio(folio, 0); - /* - * The read mmap_lock is held here. Despite the - * mmap_lock being read recursive a deadlock is still - * possible if a writer has taken a lock. For example: - * - * process A thread 1 takes read lock on own mmap_lock - * process A thread 2 calls mmap, blocks taking write lock - * process B thread 1 takes page fault, read lock on own mmap lock - * process B thread 2 calls mmap, blocks taking write lock - * process A thread 1 blocks taking read lock on process B - * process B thread 1 blocks taking read lock on process A - * - * Disable page faults to prevent potential deadlock - * and retry the copy outside the mmap_lock. 
- */ - pagefault_disable(); - ret =3D copy_from_user(page_kaddr, - (const void __user *)src_addr, - PAGE_SIZE); - pagefault_enable(); - kunmap_local(page_kaddr); - - /* fallback to copy_from_user outside mmap_lock */ - if (unlikely(ret)) { - *foliop =3D folio; - ret =3D -ENOENT; - /* don't free the page */ - goto out_unacct_blocks; - } + folio =3D shmem_alloc_folio(gfp, 0, info, pgoff); + if (!folio) + return NULL; =20 - flush_dcache_folio(folio); - } else { /* ZEROPAGE */ - clear_user_highpage(&folio->page, dst_addr); - } - } else { - folio =3D *foliop; - VM_BUG_ON_FOLIO(folio_test_large(folio), folio); - *foliop =3D NULL; + if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL)) { + folio_put(folio); + return NULL; } =20 - VM_BUG_ON(folio_test_locked(folio)); - VM_BUG_ON(folio_test_swapbacked(folio)); + return folio; +} + +static int shmem_mfill_filemap_add(struct folio *folio, + struct vm_area_struct *vma, + unsigned long addr) +{ + struct inode *inode =3D file_inode(vma->vm_file); + struct address_space *mapping =3D inode->i_mapping; + pgoff_t pgoff =3D linear_page_index(vma, addr); + gfp_t gfp =3D mapping_gfp_mask(mapping); + int err; + __folio_set_locked(folio); __folio_set_swapbacked(folio); - __folio_mark_uptodate(folio); - - ret =3D -EFAULT; - max_off =3D DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE); - if (unlikely(pgoff >=3D max_off)) - goto out_release; =20 - ret =3D mem_cgroup_charge(folio, dst_vma->vm_mm, gfp); - if (ret) - goto out_release; - ret =3D shmem_add_to_page_cache(folio, mapping, pgoff, NULL, gfp); - if (ret) - goto out_release; + err =3D shmem_add_to_page_cache(folio, mapping, pgoff, NULL, gfp); + if (err) + goto err_unlock; =20 - ret =3D mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr, - &folio->page, true, flags); - if (ret) - goto out_delete_from_cache; + if (shmem_inode_acct_blocks(inode, 1)) { + err =3D -ENOMEM; + goto err_delete_from_cache; + } =20 + folio_add_lru(folio); shmem_recalc_inode(inode, 1, 0); - folio_unlock(folio); + 
return 0; -out_delete_from_cache: + +err_delete_from_cache: filemap_remove_folio(folio); -out_release: +err_unlock: + folio_unlock(folio); + return err; +} + +static void shmem_mfill_filemap_remove(struct folio *folio, + struct vm_area_struct *vma) +{ + struct inode *inode =3D file_inode(vma->vm_file); + + filemap_remove_folio(folio); + shmem_recalc_inode(inode, 0, 0); folio_unlock(folio); - folio_put(folio); -out_unacct_blocks: - shmem_inode_unacct_blocks(inode, 1); - return ret; } =20 static struct folio *shmem_get_folio_noalloc(struct inode *inode, pgoff_t = pgoff) @@ -3315,6 +3270,9 @@ static bool shmem_can_userfault(struct vm_area_struct= *vma, vm_flags_t vm_flags) static const struct vm_uffd_ops shmem_uffd_ops =3D { .can_userfault =3D shmem_can_userfault, .get_folio_noalloc =3D shmem_get_folio_noalloc, + .alloc_folio =3D shmem_mfill_folio_alloc, + .filemap_add =3D shmem_mfill_filemap_add, + .filemap_remove =3D shmem_mfill_filemap_remove, }; #endif /* CONFIG_USERFAULTFD */ =20 diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 31f3ab6a73e2..a0f8e67006d6 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -14,7 +14,6 @@ #include #include #include -#include #include #include #include "internal.h" @@ -340,10 +339,10 @@ static bool mfill_file_over_size(struct vm_area_struc= t *dst_vma, * This function handles both MCOPY_ATOMIC_NORMAL and _CONTINUE for both s= hmem * and anon, and for both shared and private VMAs. 
*/ -int mfill_atomic_install_pte(pmd_t *dst_pmd, - struct vm_area_struct *dst_vma, - unsigned long dst_addr, struct page *page, - bool newly_allocated, uffd_flags_t flags) +static int mfill_atomic_install_pte(pmd_t *dst_pmd, + struct vm_area_struct *dst_vma, + unsigned long dst_addr, struct page *page, + uffd_flags_t flags) { int ret; struct mm_struct *dst_mm =3D dst_vma->vm_mm; @@ -387,9 +386,6 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd, goto out_unlock; =20 if (page_in_cache) { - /* Usually, cache pages are already added to LRU */ - if (newly_allocated) - folio_add_lru(folio); folio_add_file_rmap_pte(folio, page, dst_vma); } else { folio_add_new_anon_rmap(folio, dst_vma, dst_addr, RMAP_EXCLUSIVE); @@ -404,6 +400,9 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd, =20 set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte); =20 + if (page_in_cache) + folio_unlock(folio); + /* No need to invalidate - it was non-present before */ update_mmu_cache(dst_vma, dst_addr, dst_pte); ret =3D 0; @@ -516,13 +515,22 @@ static int __mfill_atomic_pte(struct mfill_state *sta= te, */ __folio_mark_uptodate(folio); =20 + if (ops->filemap_add) { + ret =3D ops->filemap_add(folio, state->vma, state->dst_addr); + if (ret) + goto err_folio_put; + } + ret =3D mfill_atomic_install_pte(state->pmd, state->vma, dst_addr, - &folio->page, true, flags); + &folio->page, flags); if (ret) - goto err_folio_put; + goto err_filemap_remove; =20 return 0; =20 +err_filemap_remove: + if (ops->filemap_remove) + ops->filemap_remove(folio, state->vma); err_folio_put: folio_put(folio); /* Don't return -ENOENT so that our caller won't retry */ @@ -535,6 +543,18 @@ static int mfill_atomic_pte_copy(struct mfill_state *s= tate) { const struct vm_uffd_ops *ops =3D vma_uffd_ops(state->vma); =20 + /* + * The normal page fault path for a MAP_PRIVATE mapping in a + * file-backed VMA will invoke the fault, fill the hole in the file and + * COW it right away. The result generates plain anonymous memory. 
+ * So when we are asked to fill a hole in a MAP_PRIVATE mapping, we'll + * generate anonymous memory directly without actually filling the + * hole. For the MAP_PRIVATE case the robustness check only happens in + * the pagetable (to verify it's still none) and not in the page cache. + */ + if (!(state->vma->vm_flags & VM_SHARED)) + ops =3D &anon_uffd_ops; + return __mfill_atomic_pte(state, ops); } =20 @@ -554,7 +574,8 @@ static int mfill_atomic_pte_zeropage(struct mfill_state= *state) spinlock_t *ptl; int ret; =20 - if (mm_forbids_zeropage(dst_vma->vm_mm)) + if (mm_forbids_zeropage(dst_vma->vm_mm) || + (dst_vma->vm_flags & VM_SHARED)) return mfill_atomic_pte_zeroed_folio(state); =20 _dst_pte =3D pte_mkspecial(pfn_pte(my_zero_pfn(dst_addr), @@ -609,11 +630,10 @@ static int mfill_atomic_pte_continue(struct mfill_sta= te *state) } =20 ret =3D mfill_atomic_install_pte(dst_pmd, dst_vma, dst_addr, - page, false, flags); + page, flags); if (ret) goto out_release; =20 - folio_unlock(folio); return 0; =20 out_release: @@ -836,41 +856,18 @@ extern ssize_t mfill_atomic_hugetlb(struct userfaultf= d_ctx *ctx, =20 static __always_inline ssize_t mfill_atomic_pte(struct mfill_state *state) { - struct vm_area_struct *dst_vma =3D state->vma; - unsigned long src_addr =3D state->src_addr; - unsigned long dst_addr =3D state->dst_addr; - struct folio **foliop =3D &state->folio; uffd_flags_t flags =3D state->flags; - pmd_t *dst_pmd =3D state->pmd; - ssize_t err; =20 if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) return mfill_atomic_pte_continue(state); if (uffd_flags_mode_is(flags, MFILL_ATOMIC_POISON)) return mfill_atomic_pte_poison(state); + if (uffd_flags_mode_is(flags, MFILL_ATOMIC_COPY)) + return mfill_atomic_pte_copy(state); + if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE)) + return mfill_atomic_pte_zeropage(state); =20 - /* - * The normal page fault path for a shmem will invoke the - * fault, fill the hole in the file and COW it right away. 
The - * result generates plain anonymous memory. So when we are - * asked to fill an hole in a MAP_PRIVATE shmem mapping, we'll - * generate anonymous memory directly without actually filling - * the hole. For the MAP_PRIVATE case the robustness check - * only happens in the pagetable (to verify it's still none) - * and not in the radix tree. - */ - if (!(dst_vma->vm_flags & VM_SHARED)) { - if (uffd_flags_mode_is(flags, MFILL_ATOMIC_COPY)) - err =3D mfill_atomic_pte_copy(state); - else - err =3D mfill_atomic_pte_zeropage(state); - } else { - err =3D shmem_mfill_atomic_pte(dst_pmd, dst_vma, - dst_addr, src_addr, - flags, foliop); - } - - return err; + return -EOPNOTSUPP; } =20 static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx, --=20 2.51.0 From nobody Thu Apr 9 16:32:33 2026 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id B16BE3659F8; Fri, 6 Mar 2026 17:19:34 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1772817574; cv=none; b=GES+Wla8BPOa4h3A41IYgOvb0rUdvO0DXJ73mPFu5DSw27ZJB6pOBVMRgU3ZGNVXj2OmY0u4GxKj8a/ixmgx4CFOZp3Dv25uNzLIcbRlyBeOowoWvbGgKKJi78jfak8z5ADPCeyNxfTBhupBBLJgI3cXWKIlifvckf0eg55ZLWs= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1772817574; c=relaxed/simple; bh=EX01Sv5lXfTYk78BKTDVygQlYzTzVovEBgCSXUVGYbs=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=tIVHjlFUYh1AlyABpFHuWS3fCSrWo6CUWLz3Z8Fu9t/b8kAKnvetXtY//IOQ2K/RbKfKXhdMD9lnNfNEEfZrLAPgHRhgCvD8GlMPeYBii47S/eJpQrgan9/ch3xR3QkXCu3ESzLQpUPiVXU3L2A3tNEPmqVdvsY794hKoTa88AY= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org 
From: Mike Rapoport To: Andrew Morton Cc: Andrea Arcangeli , Axel Rasmussen , Baolin Wang , David Hildenbrand , Hugh Dickins , James Houghton , "Liam R. Howlett" , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , Michal Hocko , Mike Rapoport , Muchun Song , Nikita Kalyazin , Oscar Salvador , Paolo Bonzini , Peter Xu , Sean Christopherson , Shuah Khan , Suren Baghdasaryan , Vlastimil Babka , kvm@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH v2 11/15] userfaultfd: mfill_atomic(): remove retry logic Date: Fri, 6 Mar 2026 19:18:11 +0200 Message-ID: <20260306171815.3160826-12-rppt@kernel.org> In-Reply-To: <20260306171815.3160826-1-rppt@kernel.org> References: <20260306171815.3160826-1-rppt@kernel.org> From: "Mike Rapoport (Microsoft)" Since __mfill_atomic_pte() handles the retry for both anonymous and
shmem, there is no need to retry copying the data from userspace in the loop in mfill_atomic(). Drop the retry logic from mfill_atomic(). Signed-off-by: Mike Rapoport (Microsoft) --- mm/userfaultfd.c | 24 ------------------------ 1 file changed, 24 deletions(-) diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index a0f8e67006d6..7cd7c5d1ce84 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -29,7 +29,6 @@ struct mfill_state { struct vm_area_struct *vma; unsigned long src_addr; unsigned long dst_addr; - struct folio *folio; pmd_t *pmd; }; =20 @@ -898,7 +897,6 @@ static __always_inline ssize_t mfill_atomic(struct user= faultfd_ctx *ctx, VM_WARN_ON_ONCE(src_start + len <=3D src_start); VM_WARN_ON_ONCE(dst_start + len <=3D dst_start); =20 -retry: err =3D mfill_get_vma(&state); if (err) goto out; @@ -925,26 +923,6 @@ static __always_inline ssize_t mfill_atomic(struct use= rfaultfd_ctx *ctx, err =3D mfill_atomic_pte(&state); cond_resched(); =20 - if (unlikely(err =3D=3D -ENOENT)) { - void *kaddr; - - mfill_put_vma(&state); - VM_WARN_ON_ONCE(!state.folio); - - kaddr =3D kmap_local_folio(state.folio, 0); - err =3D copy_from_user(kaddr, - (const void __user *)state.src_addr, - PAGE_SIZE); - kunmap_local(kaddr); - if (unlikely(err)) { - err =3D -EFAULT; - goto out; - } - flush_dcache_folio(state.folio); - goto retry; - } else - VM_WARN_ON_ONCE(state.folio); - if (!err) { state.dst_addr +=3D PAGE_SIZE; state.src_addr +=3D PAGE_SIZE; @@ -959,8 +937,6 @@ static __always_inline ssize_t mfill_atomic(struct user= faultfd_ctx *ctx, =20 mfill_put_vma(&state); out: - if (state.folio) - folio_put(state.folio); VM_WARN_ON_ONCE(copied < 0); VM_WARN_ON_ONCE(err > 0); VM_WARN_ON_ONCE(!copied && !err); --=20 2.51.0 From nobody Thu Apr 9 16:32:33 2026
From: Mike Rapoport To: Andrew Morton Cc: Andrea Arcangeli , Axel Rasmussen , Baolin Wang , David Hildenbrand , Hugh Dickins , James Houghton , "Liam R.
Howlett" , Lorenzo Stoakes , "Matthew Wilcox (Oracle)" , Michal Hocko , Mike Rapoport , Muchun Song , Nikita Kalyazin , Oscar Salvador , Paolo Bonzini , Peter Xu , Sean Christopherson , Shuah Khan , Suren Baghdasaryan , Vlastimil Babka , kvm@vger.kernel.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-mm@kvack.org Subject: [PATCH v2 12/15] mm: generalize handling of userfaults in __do_fault() Date: Fri, 6 Mar 2026 19:18:12 +0200 Message-ID: <20260306171815.3160826-13-rppt@kernel.org> X-Mailer: git-send-email 2.51.0 In-Reply-To: <20260306171815.3160826-1-rppt@kernel.org> References: <20260306171815.3160826-1-rppt@kernel.org> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" From: Peter Xu When a VMA is registered with userfaulfd, its ->fault() method should check if a folio exists in the page cache and call handle_userfault() with appropriate mode: - VM_UFFD_MINOR if VMA is registered in minor mode and the folio exists - VM_UFFD_MISSING if VMA is registered in missing mode and the folio does not exist Instead of calling handle_userfault() directly from a specific ->fault() handler, call __do_userfault() helper from the generic __do_fault(). For VMAs registered with userfaultfd the new __do_userfault() helper will check if the folio is found in the page cache using vm_uffd_ops->get_folio_noalloc() and call handle_userfault() with the appropriate mode. Make vm_uffd_ops->get_folio_noalloc() required method for non-anonymous VMAs mapped at PTE level. 
Signed-off-by: Peter Xu
Co-developed-by: Mike Rapoport (Microsoft)
Signed-off-by: Mike Rapoport (Microsoft)
---
 mm/memory.c      | 43 +++++++++++++++++++++++++++++++++++++++++++
 mm/shmem.c       | 12 ------------
 mm/userfaultfd.c |  8 ++++++++
 3 files changed, 51 insertions(+), 12 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 07778814b4a8..e2183c44d70b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5328,6 +5328,41 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
 	return VM_FAULT_OOM;
 }
 
+#ifdef CONFIG_USERFAULTFD
+static vm_fault_t __do_userfault(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct inode *inode;
+	struct folio *folio;
+
+	if (!(userfaultfd_missing(vma) || userfaultfd_minor(vma)))
+		return 0;
+
+	inode = file_inode(vma->vm_file);
+	folio = vma->vm_ops->uffd_ops->get_folio_noalloc(inode, vmf->pgoff);
+	if (!IS_ERR_OR_NULL(folio)) {
+		/*
+		 * TODO: provide a flag for get_folio_noalloc() to avoid
+		 * locking (or even the extra reference?)
+		 */
+		folio_unlock(folio);
+		folio_put(folio);
+		if (userfaultfd_minor(vma))
+			return handle_userfault(vmf, VM_UFFD_MINOR);
+	} else {
+		if (userfaultfd_missing(vma))
+			return handle_userfault(vmf, VM_UFFD_MISSING);
+	}
+
+	return 0;
+}
+#else
+static inline vm_fault_t __do_userfault(struct vm_fault *vmf)
+{
+	return 0;
+}
+#endif
+
 /*
  * The mmap_lock must have been held on entry, and may have been
  * released depending on flags and vma->vm_ops->fault() return value.
@@ -5360,6 +5395,14 @@ static vm_fault_t __do_fault(struct vm_fault *vmf)
 			return VM_FAULT_OOM;
 	}
 
+	/*
+	 * If this is a userfaultfd trap, process it in advance before
+	 * triggering the genuine fault handler.
+	 */
+	ret = __do_userfault(vmf);
+	if (ret)
+		return ret;
+
 	ret = vma->vm_ops->fault(vmf);
 	if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY |
 			    VM_FAULT_DONE_COW)))
diff --git a/mm/shmem.c b/mm/shmem.c
index 68620caaf75f..239545352cd2 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2489,13 +2489,6 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 	fault_mm = vma ? vma->vm_mm : NULL;
 
 	folio = filemap_get_entry(inode->i_mapping, index);
-	if (folio && vma && userfaultfd_minor(vma)) {
-		if (!xa_is_value(folio))
-			folio_put(folio);
-		*fault_type = handle_userfault(vmf, VM_UFFD_MINOR);
-		return 0;
-	}
-
 	if (xa_is_value(folio)) {
 		error = shmem_swapin_folio(inode, index, &folio, sgp,
 					   gfp, vma, fault_type);
@@ -2540,11 +2533,6 @@ static int shmem_get_folio_gfp(struct inode *inode, pgoff_t index,
 	 * Fast cache lookup and swap lookup did not find it: allocate.
 	 */
 
-	if (vma && userfaultfd_missing(vma)) {
-		*fault_type = handle_userfault(vmf, VM_UFFD_MISSING);
-		return 0;
-	}
-
 	/* Find hugepage orders that are allowed for anonymous shmem and tmpfs. */
 	orders = shmem_allowable_huge_orders(inode, vma, index, write_end, false);
 	if (orders > 0) {
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 7cd7c5d1ce84..2ac5fad0ed6c 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -2045,6 +2045,14 @@ bool vma_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags,
 	    !vma_is_anonymous(vma))
 		return false;
 
+	/*
+	 * File backed memory with PTE level mappings must implement
+	 * ops->get_folio_noalloc()
+	 */
+	if (!vma_is_anonymous(vma) && !is_vm_hugetlb_page(vma) &&
+	    !ops->get_folio_noalloc)
+		return false;
+
 	return ops->can_userfault(vma, vm_flags);
 }
 
-- 
2.51.0

From nobody Thu Apr 9 16:32:33 2026
From: Mike Rapoport
Subject: [PATCH v2 13/15] KVM: guest_memfd: implement userfaultfd operations
Date: Fri, 6 Mar 2026 19:18:13 +0200
Message-ID: <20260306171815.3160826-14-rppt@kernel.org>
In-Reply-To: <20260306171815.3160826-1-rppt@kernel.org>

From: Nikita Kalyazin

userfaultfd notifications about page faults are used for live migration
and snapshotting of VMs.
MISSING mode allows post-copy live migration, and MINOR mode allows an
optimization of post-copy live migration for VMs backed with shared
hugetlbfs or tmpfs mappings, as described in detail in commit
7677f7fd8be7 ("userfaultfd: add minor fault registration mode").

To use the same mechanisms for VMs that use guest_memfd to map their
memory, guest_memfd should support userfaultfd operations.

Add an implementation of vm_uffd_ops to guest_memfd.

Signed-off-by: Nikita Kalyazin
Co-developed-by: Mike Rapoport (Microsoft)
Signed-off-by: Mike Rapoport (Microsoft)
---
 mm/filemap.c           |  1 +
 virt/kvm/guest_memfd.c | 84 +++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 83 insertions(+), 2 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 6cd7974d4ada..19dfcebcd23f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -262,6 +262,7 @@ void filemap_remove_folio(struct folio *folio)
 
 	filemap_free_folio(mapping, folio);
 }
+EXPORT_SYMBOL_FOR_MODULES(filemap_remove_folio, "kvm");
 
 /*
  * page_cache_delete_batch - delete several folios from page cache
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 017d84a7adf3..46582feeed75 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -7,6 +7,7 @@
 #include
 #include
 #include
+#include
 
 #include "kvm_mm.h"
 
@@ -107,6 +108,12 @@ static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
 	return __kvm_gmem_prepare_folio(kvm, slot, index, folio);
 }
 
+static struct folio *kvm_gmem_get_folio_noalloc(struct inode *inode, pgoff_t pgoff)
+{
+	return __filemap_get_folio(inode->i_mapping, pgoff,
+				   FGP_LOCK | FGP_ACCESSED, 0);
+}
+
 /*
  * Returns a locked folio on success. The caller is responsible for
  * setting the up-to-date flag before the memory is mapped into the guest.
@@ -126,8 +133,7 @@ static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
 	 * Fast-path: See if folio is already present in mapping to avoid
 	 * policy_lookup.
 	 */
-	folio = __filemap_get_folio(inode->i_mapping, index,
-				    FGP_LOCK | FGP_ACCESSED, 0);
+	folio = kvm_gmem_get_folio_noalloc(inode, index);
 	if (!IS_ERR(folio))
 		return folio;
 
@@ -457,12 +463,86 @@ static struct mempolicy *kvm_gmem_get_policy(struct vm_area_struct *vma,
 }
 #endif /* CONFIG_NUMA */
 
+#ifdef CONFIG_USERFAULTFD
+static bool kvm_gmem_can_userfault(struct vm_area_struct *vma, vm_flags_t vm_flags)
+{
+	struct inode *inode = file_inode(vma->vm_file);
+
+	/*
+	 * Only support userfaultfd for guest_memfd with INIT_SHARED flag.
+	 * This ensures the memory can be mapped to userspace.
+	 */
+	if (!(GMEM_I(inode)->flags & GUEST_MEMFD_FLAG_INIT_SHARED))
+		return false;
+
+	return true;
+}
+
+static struct folio *kvm_gmem_folio_alloc(struct vm_area_struct *vma,
+					  unsigned long addr)
+{
+	struct inode *inode = file_inode(vma->vm_file);
+	pgoff_t pgoff = linear_page_index(vma, addr);
+	struct mempolicy *mpol;
+	struct folio *folio;
+	gfp_t gfp;
+
+	if (unlikely(pgoff >= (i_size_read(inode) >> PAGE_SHIFT)))
+		return NULL;
+
+	gfp = mapping_gfp_mask(inode->i_mapping);
+	mpol = mpol_shared_policy_lookup(&GMEM_I(inode)->policy, pgoff);
+	mpol = mpol ?: get_task_policy(current);
+	folio = filemap_alloc_folio(gfp, 0, mpol);
+	mpol_cond_put(mpol);
+
+	return folio;
+}
+
+static int kvm_gmem_filemap_add(struct folio *folio,
+				struct vm_area_struct *vma,
+				unsigned long addr)
+{
+	struct inode *inode = file_inode(vma->vm_file);
+	struct address_space *mapping = inode->i_mapping;
+	pgoff_t pgoff = linear_page_index(vma, addr);
+	int err;
+
+	__folio_set_locked(folio);
+	err = filemap_add_folio(mapping, folio, pgoff, GFP_KERNEL);
+	if (err) {
+		folio_unlock(folio);
+		return err;
+	}
+
+	return 0;
+}
+
+static void kvm_gmem_filemap_remove(struct folio *folio,
+				    struct vm_area_struct *vma)
+{
+	filemap_remove_folio(folio);
+	folio_unlock(folio);
+}
+
+static const struct vm_uffd_ops kvm_gmem_uffd_ops = {
+	.can_userfault		= kvm_gmem_can_userfault,
+	.get_folio_noalloc	= kvm_gmem_get_folio_noalloc,
+	.alloc_folio		= kvm_gmem_folio_alloc,
+	.filemap_add		= kvm_gmem_filemap_add,
+	.filemap_remove		= kvm_gmem_filemap_remove,
+};
+#endif /* CONFIG_USERFAULTFD */
+
 static const struct vm_operations_struct kvm_gmem_vm_ops = {
 	.fault		= kvm_gmem_fault_user_mapping,
 #ifdef CONFIG_NUMA
 	.get_policy	= kvm_gmem_get_policy,
 	.set_policy	= kvm_gmem_set_policy,
 #endif
+#ifdef CONFIG_USERFAULTFD
+	.uffd_ops	= &kvm_gmem_uffd_ops,
+#endif
 };
 
 static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
-- 
2.51.0

From nobody Thu Apr 9 16:32:33 2026
From: Mike Rapoport
Subject: [PATCH v2 14/15] KVM: selftests: test userfaultfd minor for guest_memfd
Date: Fri, 6 Mar 2026 19:18:14 +0200
Message-ID: <20260306171815.3160826-15-rppt@kernel.org>
In-Reply-To: <20260306171815.3160826-1-rppt@kernel.org>

From: Nikita Kalyazin

The test demonstrates that a minor userfaultfd event in guest_memfd can
be resolved via a memcpy followed by a UFFDIO_CONTINUE ioctl.
Signed-off-by: Nikita Kalyazin
Signed-off-by: Mike Rapoport (Microsoft)
---
 .../testing/selftests/kvm/guest_memfd_test.c | 113 ++++++++++++++++++
 1 file changed, 113 insertions(+)

diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index 618c937f3c90..7612819e340a 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -10,13 +10,17 @@
 #include
 #include
 #include
+#include
 
 #include
 #include
 #include
+#include
 #include
 #include
 #include
+#include
+#include
 
 #include "kvm_util.h"
 #include "numaif.h"
@@ -329,6 +333,112 @@ static void test_create_guest_memfd_multiple(struct kvm_vm *vm)
 	close(fd1);
 }
 
+struct fault_args {
+	char *addr;
+	char value;
+};
+
+static void *fault_thread_fn(void *arg)
+{
+	struct fault_args *args = arg;
+
+	/* Trigger page fault */
+	args->value = *args->addr;
+	return NULL;
+}
+
+static void test_uffd_minor(int fd, size_t total_size)
+{
+	struct uffdio_register uffd_reg;
+	struct uffdio_continue uffd_cont;
+	struct uffd_msg msg;
+	struct fault_args args;
+	pthread_t fault_thread;
+	void *mem, *mem_nofault, *buf = NULL;
+	int uffd, ret;
+	off_t offset = page_size;
+	void *fault_addr;
+	const char test_val = 0xcd;
+
+	ret = posix_memalign(&buf, page_size, total_size);
+	TEST_ASSERT_EQ(ret, 0);
+	memset(buf, test_val, total_size);
+
+	uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
+	TEST_ASSERT(uffd != -1, "userfaultfd creation should succeed");
+
+	struct uffdio_api uffdio_api = {
+		.api = UFFD_API,
+		.features = 0,
+	};
+	ret = ioctl(uffd, UFFDIO_API, &uffdio_api);
+	TEST_ASSERT(ret != -1, "ioctl(UFFDIO_API) should succeed");
+
+	/* Map the guest_memfd twice: once with UFFD registered, once without */
+	mem = mmap(NULL, total_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	TEST_ASSERT(mem != MAP_FAILED, "mmap should succeed");
+
+	mem_nofault = mmap(NULL, total_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	TEST_ASSERT(mem_nofault != MAP_FAILED, "mmap should succeed");
+
+	/* Register UFFD_MINOR on the first mapping */
+	uffd_reg.range.start = (unsigned long)mem;
+	uffd_reg.range.len = total_size;
+	uffd_reg.mode = UFFDIO_REGISTER_MODE_MINOR;
+	ret = ioctl(uffd, UFFDIO_REGISTER, &uffd_reg);
+	TEST_ASSERT(ret != -1, "ioctl(UFFDIO_REGISTER) should succeed");
+
+	/*
+	 * Populate the page in the page cache first via mem_nofault.
+	 * This is required for UFFD_MINOR - the page must exist in the cache.
+	 * Write test data to the page.
+	 */
+	memcpy(mem_nofault + offset, buf + offset, page_size);
+
+	/*
+	 * Now access the same page via mem (which has UFFD_MINOR registered).
+	 * Since the page exists in the cache, this should trigger UFFD_MINOR.
+	 */
+	fault_addr = mem + offset;
+	args.addr = fault_addr;
+
+	ret = pthread_create(&fault_thread, NULL, fault_thread_fn, &args);
+	TEST_ASSERT(ret == 0, "pthread_create should succeed");
+
+	ret = read(uffd, &msg, sizeof(msg));
+	TEST_ASSERT(ret != -1, "read from userfaultfd should succeed");
+	TEST_ASSERT(msg.event == UFFD_EVENT_PAGEFAULT, "event type should be pagefault");
+	TEST_ASSERT((void *)(msg.arg.pagefault.address & ~(page_size - 1)) == fault_addr,
+		    "pagefault should occur at expected address");
+	TEST_ASSERT(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_MINOR,
+		    "pagefault should be minor fault");
+
+	/* Resolve the minor fault with UFFDIO_CONTINUE */
+	uffd_cont.range.start = (unsigned long)fault_addr;
+	uffd_cont.range.len = page_size;
+	uffd_cont.mode = 0;
+	ret = ioctl(uffd, UFFDIO_CONTINUE, &uffd_cont);
+	TEST_ASSERT(ret != -1, "ioctl(UFFDIO_CONTINUE) should succeed");
+
+	/* Wait for the faulting thread to complete */
+	ret = pthread_join(fault_thread, NULL);
+	TEST_ASSERT(ret == 0, "pthread_join should succeed");
+
+	/* Verify the thread read the correct value */
+	TEST_ASSERT(args.value == test_val,
		    "memory should contain the value that was written");
+	TEST_ASSERT(*(char *)(mem + offset) == test_val,
+		    "no further fault is expected");
+
+	ret = munmap(mem_nofault, total_size);
+	TEST_ASSERT(!ret, "munmap should succeed");
+
+	ret = munmap(mem, total_size);
+	TEST_ASSERT(!ret, "munmap should succeed");
+	free(buf);
+	close(uffd);
+}
+
 static void test_guest_memfd_flags(struct kvm_vm *vm)
 {
 	uint64_t valid_flags = vm_check_cap(vm, KVM_CAP_GUEST_MEMFD_FLAGS);
@@ -383,6 +493,9 @@ static void __test_guest_memfd(struct kvm_vm *vm, uint64_t flags)
 	gmem_test(file_size, vm, flags);
 	gmem_test(fallocate, vm, flags);
 	gmem_test(invalid_punch_hole, vm, flags);
+
+	if (flags & GUEST_MEMFD_FLAG_INIT_SHARED)
+		gmem_test(uffd_minor, vm, flags);
 }
 
 static void test_guest_memfd(unsigned long vm_type)
-- 
2.51.0

From nobody Thu Apr 9 16:32:33 2026
From: Mike Rapoport
Subject: [PATCH v2 15/15] KVM: selftests: test userfaultfd missing for guest_memfd
Date: Fri, 6 Mar 2026 19:18:15 +0200
Message-ID: <20260306171815.3160826-16-rppt@kernel.org>
In-Reply-To: <20260306171815.3160826-1-rppt@kernel.org>

From: Nikita Kalyazin

The test demonstrates that a missing userfaultfd event in guest_memfd
can be resolved via a UFFDIO_COPY ioctl.
Signed-off-by: Nikita Kalyazin
Signed-off-by: Mike Rapoport (Microsoft)
---
 .../testing/selftests/kvm/guest_memfd_test.c | 80 ++++++++++++++++++-
 1 file changed, 79 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index 7612819e340a..f77e70d22175 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -439,6 +439,82 @@ static void test_uffd_minor(int fd, size_t total_size)
 	close(uffd);
 }
 
+static void test_uffd_missing(int fd, size_t total_size)
+{
+	struct uffdio_register uffd_reg;
+	struct uffdio_copy uffd_copy;
+	struct uffd_msg msg;
+	struct fault_args args;
+	pthread_t fault_thread;
+	void *mem, *buf = NULL;
+	int uffd, ret;
+	off_t offset = page_size;
+	void *fault_addr;
+	const char test_val = 0xab;
+
+	ret = posix_memalign(&buf, page_size, total_size);
+	TEST_ASSERT_EQ(ret, 0);
+	memset(buf, test_val, total_size);
+
+	uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
+	TEST_ASSERT(uffd != -1, "userfaultfd creation should succeed");
+
+	struct uffdio_api uffdio_api = {
+		.api = UFFD_API,
+		.features = 0,
+	};
+	ret = ioctl(uffd, UFFDIO_API, &uffdio_api);
+	TEST_ASSERT(ret != -1, "ioctl(UFFDIO_API) should succeed");
+
+	mem = mmap(NULL, total_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	TEST_ASSERT(mem != MAP_FAILED, "mmap should succeed");
+
+	uffd_reg.range.start = (unsigned long)mem;
+	uffd_reg.range.len = total_size;
+	uffd_reg.mode = UFFDIO_REGISTER_MODE_MISSING;
+	ret = ioctl(uffd, UFFDIO_REGISTER, &uffd_reg);
+	TEST_ASSERT(ret != -1, "ioctl(UFFDIO_REGISTER) should succeed");
+
+	fault_addr = mem + offset;
+	args.addr = fault_addr;
+
+	ret = pthread_create(&fault_thread, NULL, fault_thread_fn, &args);
+	TEST_ASSERT(ret == 0, "pthread_create should succeed");
+
+	ret = read(uffd, &msg, sizeof(msg));
+	TEST_ASSERT(ret != -1, "read from userfaultfd should succeed");
+	TEST_ASSERT(msg.event == UFFD_EVENT_PAGEFAULT, "event type should be pagefault");
+	TEST_ASSERT((void *)(msg.arg.pagefault.address & ~(page_size - 1)) == fault_addr,
+		    "pagefault should occur at expected address");
+	TEST_ASSERT(!(msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP),
+		    "pagefault should not be write-protect");
+
+	uffd_copy.dst = (unsigned long)fault_addr;
+	uffd_copy.src = (unsigned long)(buf + offset);
+	uffd_copy.len = page_size;
+	uffd_copy.mode = 0;
+	ret = ioctl(uffd, UFFDIO_COPY, &uffd_copy);
+	TEST_ASSERT(ret != -1, "ioctl(UFFDIO_COPY) should succeed");
+
+	/* Wait for the faulting thread to complete - this provides the memory barrier */
+	ret = pthread_join(fault_thread, NULL);
+	TEST_ASSERT(ret == 0, "pthread_join should succeed");
+
+	/*
+	 * Now it's safe to check args.value - the thread has completed
+	 * and memory is synchronized
+	 */
+	TEST_ASSERT(args.value == test_val,
+		    "memory should contain the value that was copied");
+	TEST_ASSERT(*(char *)(mem + offset) == test_val,
+		    "no further fault is expected");
+
+	ret = munmap(mem, total_size);
+	TEST_ASSERT(!ret, "munmap should succeed");
+	free(buf);
+	close(uffd);
+}
+
 static void test_guest_memfd_flags(struct kvm_vm *vm)
 {
 	uint64_t valid_flags = vm_check_cap(vm, KVM_CAP_GUEST_MEMFD_FLAGS);
@@ -494,8 +570,10 @@ static void __test_guest_memfd(struct kvm_vm *vm, uint64_t flags)
 	gmem_test(fallocate, vm, flags);
 	gmem_test(invalid_punch_hole, vm, flags);
 
-	if (flags & GUEST_MEMFD_FLAG_INIT_SHARED)
+	if (flags & GUEST_MEMFD_FLAG_INIT_SHARED) {
 		gmem_test(uffd_minor, vm, flags);
+		gmem_test(uffd_missing, vm, flags);
+	}
 }
 
 static void test_guest_memfd(unsigned long vm_type)
-- 
2.51.0