From nobody Mon Jun 8 23:56:07 2026
From: "Kiryl Shutsemau (Meta)"
To: akpm@linux-foundation.org, rppt@kernel.org, peterx@redhat.com,
	david@kernel.org
Cc: ljs@kernel.org, surenb@google.com, vbabka@kernel.org,
	Liam.Howlett@oracle.com, ziy@nvidia.com, corbet@lwn.net,
	skhan@linuxfoundation.org, seanjc@google.com, pbonzini@redhat.com,
	jthoughton@google.com, aarcange@redhat.com, sj@kernel.org,
	usama.arif@linux.dev, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-kselftest@vger.kernel.org, kvm@vger.kernel.org,
	kernel-team@meta.com, "Kiryl Shutsemau (Meta)"
Subject: [PATCH v4 01/14] mm: decouple protnone helpers from CONFIG_NUMA_BALANCING
Date: Mon, 25 May 2026 12:37:15 +0100
Message-ID: <20260525113737.1942478-2-kas@kernel.org>
In-Reply-To: <20260525113737.1942478-1-kas@kernel.org>
References: <20260525113737.1942478-1-kas@kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

pte_protnone() and pmd_protnone() detect present-but-inaccessible page
table entries. This capability is useful beyond NUMA balancing -- for
example, userfaultfd working set tracking uses protnone PTEs to track
page access without unmapping pages.

Introduce CONFIG_ARCH_HAS_PTE_PROTNONE to decouple the protnone PTE
infrastructure from CONFIG_NUMA_BALANCING. The six architectures that
support protnone PTEs (x86_64, arm64, powerpc, s390, riscv, loongarch)
now select this option, and CONFIG_NUMA_BALANCING depends on it.

No functional change -- the same set of architectures continues to have
working protnone support, but the infrastructure is now available
independently of NUMA balancing.
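
For readers new to the helpers, the x86-style check can be modeled in
userspace roughly as follows. This is only an illustrative sketch: every
demo_* / DEMO_* name and both bit positions are assumptions made up for
the example, not kernel definitions, and the macro
DEMO_ARCH_HAS_PTE_PROTNONE merely stands in for the new Kconfig symbol.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout, for illustration only; real layouts are arch-specific. */
#define DEMO_PAGE_PRESENT	(UINT64_C(1) << 0)
#define DEMO_PAGE_PROTNONE	(UINT64_C(1) << 8)

/* Stand-in for CONFIG_ARCH_HAS_PTE_PROTNONE; remove to model an arch without it. */
#define DEMO_ARCH_HAS_PTE_PROTNONE 1

typedef struct { uint64_t val; } demo_pte_t;

#ifdef DEMO_ARCH_HAS_PTE_PROTNONE
/* Protnone: the PROTNONE bit is set while the hardware present bit is clear. */
static inline bool demo_pte_protnone(demo_pte_t pte)
{
	return (pte.val & (DEMO_PAGE_PROTNONE | DEMO_PAGE_PRESENT)) ==
		DEMO_PAGE_PROTNONE;
}
#else
/* Generic fallback: always "no", mirroring the always-zero stub. */
static inline bool demo_pte_protnone(demo_pte_t pte)
{
	(void)pte;
	return false;
}
#endif

/* Small self-check showing the three interesting cases. */
static inline void demo_protnone_selftest(void)
{
	demo_pte_t protnone = { DEMO_PAGE_PROTNONE };
	demo_pte_t present = { DEMO_PAGE_PRESENT };
	demo_pte_t none = { 0 };

	assert(demo_pte_protnone(protnone));
	assert(!demo_pte_protnone(present));
	assert(!demo_pte_protnone(none));
}
```

Gating on a dedicated capability macro rather than on a feature macro is
the point of the sketch: any feature that needs the check can rely on
it, not just NUMA balancing.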
Signed-off-by: Kiryl Shutsemau (Meta)
Assisted-by: Claude:claude-opus-4-6
Acked-by: SeongJae Park
Acked-by: Mike Rapoport (Microsoft)
---
 arch/arm64/Kconfig                           |  1 +
 arch/arm64/include/asm/pgtable.h             |  7 ++---
 arch/loongarch/Kconfig                       |  1 +
 arch/loongarch/include/asm/pgtable.h         |  4 +--
 arch/powerpc/include/asm/book3s/64/pgtable.h |  8 ++---
 arch/powerpc/platforms/Kconfig.cputype       |  1 +
 arch/riscv/Kconfig                           |  1 +
 arch/riscv/include/asm/pgtable.h             |  7 ++---
 arch/s390/Kconfig                            |  1 +
 arch/s390/include/asm/pgtable.h              |  4 +--
 arch/x86/Kconfig                             |  1 +
 arch/x86/include/asm/pgtable.h               |  8 ++---
 include/linux/pgtable.h                      | 32 ++++++++++++++------
 init/Kconfig                                 |  8 +++++
 mm/debug_vm_pgtable.c                        |  4 +--
 15 files changed, 52 insertions(+), 36 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index fe60738e5943..319470b3b1bb 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -78,6 +78,7 @@ config ARM64
 	select ARCH_SUPPORTS_CFI
 	select ARCH_SUPPORTS_ATOMIC_RMW
 	select ARCH_SUPPORTS_INT128 if CC_HAS_INT128
+	select ARCH_HAS_PTE_PROTNONE
 	select ARCH_SUPPORTS_NUMA_BALANCING
 	select ARCH_SUPPORTS_PAGE_TABLE_CHECK
 	select ARCH_SUPPORTS_PER_VMA_LOCK
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 4dfa42b7d053..873f4ea2e288 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -553,10 +553,7 @@ static inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
 }
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
 
-#ifdef CONFIG_NUMA_BALANCING
-/*
- * See the comment in include/linux/pgtable.h
- */
+#ifdef CONFIG_ARCH_HAS_PTE_PROTNONE
 static inline int pte_protnone(pte_t pte)
 {
 	/*
@@ -575,7 +572,7 @@ static inline int pmd_protnone(pmd_t pmd)
 {
 	return pte_protnone(pmd_pte(pmd));
 }
-#endif
+#endif /* CONFIG_ARCH_HAS_PTE_PROTNONE */
 
 #define pmd_present(pmd)	pte_present(pmd_pte(pmd))
 #define pmd_dirty(pmd)		pte_dirty(pmd_pte(pmd))
diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
index 606597da46b8..77e9a9a30483 100644
--- a/arch/loongarch/Kconfig
+++ b/arch/loongarch/Kconfig
@@ -67,6 +67,7 @@ config LOONGARCH
 	select ARCH_SUPPORTS_LTO_CLANG
 	select ARCH_SUPPORTS_LTO_CLANG_THIN
 	select ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS
+	select ARCH_HAS_PTE_PROTNONE
 	select ARCH_SUPPORTS_NUMA_BALANCING if NUMA
 	select ARCH_SUPPORTS_PER_VMA_LOCK
 	select ARCH_SUPPORTS_RT
diff --git a/arch/loongarch/include/asm/pgtable.h b/arch/loongarch/include/asm/pgtable.h
index 2a0b63ae421f..d295447a2763 100644
--- a/arch/loongarch/include/asm/pgtable.h
+++ b/arch/loongarch/include/asm/pgtable.h
@@ -619,7 +619,7 @@ static inline pmd_t pmdp_huge_get_and_clear(struct mm_struct *mm,
 
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
-#ifdef CONFIG_NUMA_BALANCING
+#ifdef CONFIG_ARCH_HAS_PTE_PROTNONE
 static inline long pte_protnone(pte_t pte)
 {
 	return (pte_val(pte) & _PAGE_PROTNONE);
@@ -629,7 +629,7 @@ static inline long pmd_protnone(pmd_t pmd)
 {
 	return (pmd_val(pmd) & _PAGE_PROTNONE);
 }
-#endif /* CONFIG_NUMA_BALANCING */
+#endif /* CONFIG_ARCH_HAS_PTE_PROTNONE */
 
 #define pmd_leaf(pmd)		((pmd_val(pmd) & _PAGE_HUGE) != 0)
 #define pud_leaf(pud)		((pud_val(pud) & _PAGE_HUGE) != 0)
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index e67e64ac6e8c..53a0c5892548 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -490,13 +490,13 @@ static inline pte_t pte_clear_soft_dirty(pte_t pte)
 }
 #endif /* CONFIG_HAVE_ARCH_SOFT_DIRTY */
 
-#ifdef CONFIG_NUMA_BALANCING
+#ifdef CONFIG_ARCH_HAS_PTE_PROTNONE
 static inline int pte_protnone(pte_t pte)
 {
 	return (pte_raw(pte) & cpu_to_be64(_PAGE_PRESENT | _PAGE_PTE | _PAGE_RWX)) ==
 		cpu_to_be64(_PAGE_PRESENT | _PAGE_PTE);
 }
-#endif /* CONFIG_NUMA_BALANCING */
+#endif /* CONFIG_ARCH_HAS_PTE_PROTNONE */
 
 static inline bool pte_hw_valid(pte_t pte)
 {
@@ -1067,12 +1067,12 @@ static inline pte_t *pmdp_ptep(pmd_t *pmd)
 #endif
 #endif /* CONFIG_HAVE_ARCH_SOFT_DIRTY */
 
-#ifdef CONFIG_NUMA_BALANCING
+#ifdef CONFIG_ARCH_HAS_PTE_PROTNONE
 static inline int pmd_protnone(pmd_t pmd)
 {
 	return pte_protnone(pmd_pte(pmd));
 }
-#endif /* CONFIG_NUMA_BALANCING */
+#endif /* CONFIG_ARCH_HAS_PTE_PROTNONE */
 
 #define pmd_write(pmd)		pte_write(pmd_pte(pmd))
 
diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index bac02c83bb3e..36b64a24cf30 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -87,6 +87,7 @@ config PPC_BOOK3S_64
 	select ARCH_ENABLE_HUGEPAGE_MIGRATION if HUGETLB_PAGE && MIGRATION
 	select ARCH_ENABLE_SPLIT_PMD_PTLOCK
 	select ARCH_SUPPORTS_HUGETLBFS
+	select ARCH_HAS_PTE_PROTNONE
 	select ARCH_SUPPORTS_NUMA_BALANCING
 	select HAVE_MOVE_PMD
 	select HAVE_MOVE_PUD
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index c5754942cf85..e2c5776d18cf 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -71,6 +71,7 @@ config RISCV
 	select ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS if 64BIT && MMU
 	select ARCH_SUPPORTS_PAGE_TABLE_CHECK if MMU
 	select ARCH_SUPPORTS_PER_VMA_LOCK if MMU
+	select ARCH_HAS_PTE_PROTNONE if MMU
 	select ARCH_SUPPORTS_RT
 	select ARCH_SUPPORTS_SHADOW_CALL_STACK if HAVE_SHADOW_CALL_STACK
 	select ARCH_SUPPORTS_SCHED_MC if SMP
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index a1a7c6520a09..48a127323b21 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -524,10 +524,7 @@ static inline pte_t pte_swp_clear_soft_dirty(pte_t pte)
 					     PAGE_SIZE)
 #endif
 
-#ifdef CONFIG_NUMA_BALANCING
-/*
- * See the comment in include/asm-generic/pgtable.h
- */
+#ifdef CONFIG_ARCH_HAS_PTE_PROTNONE
 static inline int pte_protnone(pte_t pte)
 {
 	return (pte_val(pte) & (_PAGE_PRESENT | _PAGE_PROT_NONE)) == _PAGE_PROT_NONE;
@@ -537,7 +534,7 @@ static inline int pmd_protnone(pmd_t pmd)
 {
 	return pte_protnone(pmd_pte(pmd));
 }
-#endif
+#endif /* CONFIG_ARCH_HAS_PTE_PROTNONE */
 
 /* Modify page protection bits */
 static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index ecbcbb781e40..bc5bef08454b 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -151,6 +151,7 @@ config S390
 	select ARCH_SUPPORTS_HUGETLBFS
 	select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 && CC_IS_CLANG
 	select ARCH_SUPPORTS_MSEAL_SYSTEM_MAPPINGS
+	select ARCH_HAS_PTE_PROTNONE
 	select ARCH_SUPPORTS_NUMA_BALANCING
 	select ARCH_SUPPORTS_PAGE_TABLE_CHECK
 	select ARCH_SUPPORTS_PER_VMA_LOCK
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 2c6cee8241e0..97241dea5573 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -842,7 +842,7 @@ static inline int pte_same(pte_t a, pte_t b)
 	return pte_val(a) == pte_val(b);
 }
 
-#ifdef CONFIG_NUMA_BALANCING
+#ifdef CONFIG_ARCH_HAS_PTE_PROTNONE
 static inline int pte_protnone(pte_t pte)
 {
 	return pte_present(pte) && !(pte_val(pte) & _PAGE_READ);
@@ -853,7 +853,7 @@ static inline int pmd_protnone(pmd_t pmd)
 	/* pmd_leaf(pmd) implies pmd_present(pmd) */
 	return pmd_leaf(pmd) && !(pmd_val(pmd) & _SEGMENT_ENTRY_READ);
 }
-#endif
+#endif /* CONFIG_ARCH_HAS_PTE_PROTNONE */
 
 static inline bool pte_swp_exclusive(pte_t pte)
 {
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f3f7cb01d69d..9da1119e8ff6 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -123,6 +123,7 @@ config X86
 	select ARCH_SUPPORTS_DEBUG_PAGEALLOC
 	select ARCH_SUPPORTS_HUGETLBFS
 	select ARCH_SUPPORTS_PAGE_TABLE_CHECK	if X86_64
+	select ARCH_HAS_PTE_PROTNONE		if X86_64
 	select ARCH_SUPPORTS_NUMA_BALANCING	if X86_64
 	select ARCH_SUPPORTS_KMAP_LOCAL_FORCE_MAP	if NR_CPUS <= 4096
 	select ARCH_SUPPORTS_CFI		if X86_64
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 2187e9cfcefa..c7f014cbf0a9 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -985,11 +985,7 @@ static inline int pmd_present(pmd_t pmd)
 	return pmd_flags(pmd) & (_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE);
 }
 
-#ifdef CONFIG_NUMA_BALANCING
-/*
- * These work without NUMA balancing but the kernel does not care. See the
- * comment in include/linux/pgtable.h
- */
+#ifdef CONFIG_ARCH_HAS_PTE_PROTNONE
 static inline int pte_protnone(pte_t pte)
 {
 	return (pte_flags(pte) & (_PAGE_PROTNONE | _PAGE_PRESENT))
@@ -1001,7 +997,7 @@ static inline int pmd_protnone(pmd_t pmd)
 	return (pmd_flags(pmd) & (_PAGE_PROTNONE | _PAGE_PRESENT))
 		== _PAGE_PROTNONE;
 }
-#endif /* CONFIG_NUMA_BALANCING */
+#endif /* CONFIG_ARCH_HAS_PTE_PROTNONE */
 
 static inline int pmd_none(pmd_t pmd)
 {
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index cdd68ed3ae1a..b6516a11adfa 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -2052,18 +2052,26 @@ static inline int pud_trans_unstable(pud_t *pud)
 	return 0;
 }
 
-#ifndef CONFIG_NUMA_BALANCING
+#ifndef CONFIG_ARCH_HAS_PTE_PROTNONE
 /*
- * In an inaccessible (PROT_NONE) VMA, pte_protnone() may indicate "yes". It is
- * perfectly valid to indicate "no" in that case, which is why our default
- * implementation defaults to "always no".
+ * In an inaccessible (PROT_NONE) VMA, pte_protnone() may indicate "yes". It
+ * is perfectly valid to indicate "no" in that case, which is why our
+ * default implementation defaults to "always no".
  *
- * In an accessible VMA, however, pte_protnone() reliably indicates PROT_NONE
- * page protection due to NUMA hinting. NUMA hinting faults only apply in
- * accessible VMAs.
+ * In an accessible VMA, pte_protnone() reliably indicates a present
+ * PROT_NONE page protection. Today the kernel uses such PTEs for two
+ * purposes: NUMA hinting faults, and userfaultfd RWP tracking on
+ * VM_UFFD_RWP VMAs. The two are distinguished by the uffd PTE bit and
+ * the VMA flag; see include/linux/userfaultfd_k.h.
  *
- * So, to reliably identify PROT_NONE PTEs that require a NUMA hinting fault,
- * looking at the VMA accessibility is sufficient.
+ * So, to reliably identify PROT_NONE PTEs that require kernel handling,
+ * looking at the VMA accessibility (and the uffd bit on RWP VMAs) is
+ * sufficient.
+ *
+ * Architectures without CONFIG_ARCH_HAS_PTE_PROTNONE get the always-zero
+ * stubs below; PAGE_NONE references that survive to runtime fire the
+ * BUILD_BUG() fallback, since callers should have folded such paths to
+ * dead code via IS_ENABLED(CONFIG_ARCH_HAS_PTE_PROTNONE).
  */
 static inline int pte_protnone(pte_t pte)
 {
@@ -2074,7 +2082,11 @@ static inline int pmd_protnone(pmd_t pmd)
 {
 	return 0;
 }
-#endif /* CONFIG_NUMA_BALANCING */
+
+#ifndef PAGE_NONE
+#define PAGE_NONE	({ BUILD_BUG(); (pgprot_t){0}; })
+#endif
+#endif /* CONFIG_ARCH_HAS_PTE_PROTNONE */
 
 #endif /* CONFIG_MMU */
 
diff --git a/init/Kconfig b/init/Kconfig
index 2937c4d308ae..58abb7f19206 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -944,6 +944,13 @@ config SCHED_PROXY_EXEC
 
 endmenu
 
+#
+# For architectures that support present-but-inaccessible (PROT_NONE) page
+# table entries detectable via pte_protnone() / pmd_protnone():
+#
+config ARCH_HAS_PTE_PROTNONE
+	bool
+
 #
 # For architectures that want to enable the support for NUMA-affine scheduler
 # balancing logic:
@@ -1010,6 +1017,7 @@ config ARCH_WANT_NUMA_VARIABLE_LOCALITY
 config NUMA_BALANCING
 	bool "Memory placement aware NUMA scheduler"
 	depends on ARCH_SUPPORTS_NUMA_BALANCING
+	depends on ARCH_HAS_PTE_PROTNONE
 	depends on !ARCH_WANT_NUMA_VARIABLE_LOCALITY
 	depends on SMP && NUMA_MIGRATION && !PREEMPT_RT
 	help
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 23dc3ee09561..5e9f3a35f924 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -672,7 +672,7 @@ static void __init pte_protnone_tests(struct pgtable_debug_args *args)
 {
 	pte_t pte = pfn_pte(args->fixed_pte_pfn, args->page_prot_none);
 
-	if (!IS_ENABLED(CONFIG_NUMA_BALANCING))
+	if (!IS_ENABLED(CONFIG_ARCH_HAS_PTE_PROTNONE))
 		return;
 
 	pr_debug("Validating PTE protnone\n");
@@ -685,7 +685,7 @@ static void __init pmd_protnone_tests(struct pgtable_debug_args *args)
 {
 	pmd_t pmd;
 
-	if (!IS_ENABLED(CONFIG_NUMA_BALANCING))
+	if (!IS_ENABLED(CONFIG_ARCH_HAS_PTE_PROTNONE))
 		return;
 
 	if (!has_transparent_hugepage())
-- 
2.54.0

From nobody Mon Jun 8 23:56:07 2026
From: "Kiryl Shutsemau (Meta)"
To: akpm@linux-foundation.org, rppt@kernel.org, peterx@redhat.com,
	david@kernel.org
Cc: ljs@kernel.org, surenb@google.com, vbabka@kernel.org,
	Liam.Howlett@oracle.com, ziy@nvidia.com, corbet@lwn.net,
	skhan@linuxfoundation.org, seanjc@google.com, pbonzini@redhat.com,
	jthoughton@google.com, aarcange@redhat.com, sj@kernel.org,
	usama.arif@linux.dev, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-kselftest@vger.kernel.org, kvm@vger.kernel.org,
	kernel-team@meta.com, "Kiryl Shutsemau (Meta)"
Subject: [PATCH v4 02/14] mm: rename uffd-wp PTE bit macros to uffd
Date: Mon, 25 May 2026 12:37:16 +0100
Message-ID: <20260525113737.1942478-3-kas@kernel.org>
In-Reply-To: <20260525113737.1942478-1-kas@kernel.org>
References: <20260525113737.1942478-1-kas@kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

The uffd-wp PTE bit is about to gain a second consumer: userfaultfd RWP
will use the same bit to mark access-tracking PTEs, distinct from
mprotect(PROT_NONE) or NUMA-hinting PTEs. WP vs RWP semantics come from
the VMA flag; the bit is just "uffd has claimed this entry."

Drop the "_wp" suffix from the arch-private bit macros so they reflect
that.

  x86:   _PAGE_BIT_UFFD_WP -> _PAGE_BIT_UFFD
         _PAGE_UFFD_WP     -> _PAGE_UFFD
         _PAGE_SWP_UFFD_WP -> _PAGE_SWP_UFFD
  arm64: PTE_UFFD_WP       -> PTE_UFFD
         PTE_SWP_UFFD_WP   -> PTE_SWP_UFFD
  riscv: _PAGE_UFFD_WP     -> _PAGE_UFFD
         _PAGE_SWP_UFFD_WP -> _PAGE_SWP_UFFD

Pure mechanical rename -- no behavior change.
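
To make the "bit plus VMA flag" split concrete, here is a rough
userspace sketch of how a single claimed bit could be interpreted. It is
not code from this series: the demo_* / DEMO_* names and flag values are
hypothetical, and checking RWP before WP is an arbitrary choice made
only for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the VMA mode flags; not the kernel's values. */
#define DEMO_VM_UFFD_WP		(1UL << 0)
#define DEMO_VM_UFFD_RWP	(1UL << 1)

enum demo_uffd_kind {
	DEMO_UFFD_NONE,	/* entry not claimed by userfaultfd */
	DEMO_UFFD_WP,	/* claimed, VMA is in write-protect mode */
	DEMO_UFFD_RWP,	/* claimed, VMA is in access-tracking (RWP) mode */
};

/*
 * The PTE bit only says "uffd has claimed this entry"; which kind of
 * tracking that means is decided by the flags of the containing VMA.
 */
static enum demo_uffd_kind demo_classify(bool pte_uffd_bit,
					 unsigned long vm_flags)
{
	if (!pte_uffd_bit)
		return DEMO_UFFD_NONE;
	if (vm_flags & DEMO_VM_UFFD_RWP)
		return DEMO_UFFD_RWP;
	if (vm_flags & DEMO_VM_UFFD_WP)
		return DEMO_UFFD_WP;
	/* Bit set but no uffd mode on the VMA: treated as unclaimed here. */
	return DEMO_UFFD_NONE;
}

/* Small self-check of the mapping above. */
static inline void demo_classify_selftest(void)
{
	assert(demo_classify(true, DEMO_VM_UFFD_WP) == DEMO_UFFD_WP);
	assert(demo_classify(true, DEMO_VM_UFFD_RWP) == DEMO_UFFD_RWP);
	assert(demo_classify(false, DEMO_VM_UFFD_RWP) == DEMO_UFFD_NONE);
}
```

Because the bit itself carries no mode, a mode-neutral macro name fits
both consumers.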
Signed-off-by: Kiryl Shutsemau
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Mike Rapoport (Microsoft)
Reviewed-by: SeongJae Park
---
 arch/arm64/include/asm/pgtable-prot.h |  8 ++++----
 arch/arm64/include/asm/pgtable.h      | 12 ++++++------
 arch/riscv/include/asm/pgtable-bits.h | 12 ++++++------
 arch/riscv/include/asm/pgtable.h      | 14 +++++++-------
 arch/x86/include/asm/pgtable.h        | 24 ++++++++++++------------
 arch/x86/include/asm/pgtable_types.h  | 16 ++++++++--------
 6 files changed, 43 insertions(+), 43 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
index 212ce1b02e15..09d7c00cf405 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -28,11 +28,11 @@
 #define PTE_PRESENT_VALID_KERNEL (PTE_VALID | PTE_MAYBE_NG)
 
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
-#define PTE_UFFD_WP		(_AT(pteval_t, 1) << 58) /* uffd-wp tracking */
-#define PTE_SWP_UFFD_WP		(_AT(pteval_t, 1) << 3)	 /* only for swp ptes */
+#define PTE_UFFD		(_AT(pteval_t, 1) << 58) /* userfaultfd tracking */
+#define PTE_SWP_UFFD		(_AT(pteval_t, 1) << 3)	 /* only for swp ptes */
 #else
-#define PTE_UFFD_WP		(_AT(pteval_t, 0))
-#define PTE_SWP_UFFD_WP		(_AT(pteval_t, 0))
+#define PTE_UFFD		(_AT(pteval_t, 0))
+#define PTE_SWP_UFFD		(_AT(pteval_t, 0))
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
 
 #define _PROT_DEFAULT		(PTE_TYPE_PAGE | PTE_AF | PTE_SHARED)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 873f4ea2e288..3eecb2c17711 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -343,17 +343,17 @@ static inline pmd_t pmd_mknoncont(pmd_t pmd)
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
 static inline int pte_uffd_wp(pte_t pte)
 {
-	return !!(pte_val(pte) & PTE_UFFD_WP);
+	return !!(pte_val(pte) & PTE_UFFD);
 }
 
 static inline pte_t pte_mkuffd_wp(pte_t pte)
 {
-	return pte_wrprotect(set_pte_bit(pte, __pgprot(PTE_UFFD_WP)));
+	return pte_wrprotect(set_pte_bit(pte, __pgprot(PTE_UFFD)));
 }
 
 static inline pte_t pte_clear_uffd_wp(pte_t pte)
 {
-	return clear_pte_bit(pte, __pgprot(PTE_UFFD_WP));
+	return clear_pte_bit(pte, __pgprot(PTE_UFFD));
 }
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
 
@@ -539,17 +539,17 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte)
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
 static inline pte_t pte_swp_mkuffd_wp(pte_t pte)
 {
-	return set_pte_bit(pte, __pgprot(PTE_SWP_UFFD_WP));
+	return set_pte_bit(pte, __pgprot(PTE_SWP_UFFD));
 }
 
 static inline int pte_swp_uffd_wp(pte_t pte)
 {
-	return !!(pte_val(pte) & PTE_SWP_UFFD_WP);
+	return !!(pte_val(pte) & PTE_SWP_UFFD);
 }
 
 static inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
 {
-	return clear_pte_bit(pte, __pgprot(PTE_SWP_UFFD_WP));
+	return clear_pte_bit(pte, __pgprot(PTE_SWP_UFFD));
 }
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
 
diff --git a/arch/riscv/include/asm/pgtable-bits.h b/arch/riscv/include/asm/pgtable-bits.h
index b422d9691e60..d5a86b4df3ce 100644
--- a/arch/riscv/include/asm/pgtable-bits.h
+++ b/arch/riscv/include/asm/pgtable-bits.h
@@ -40,20 +40,20 @@
 
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
 
-/* ext_svrsw60t59b: Bit(60) for uffd-wp tracking */
-#define _PAGE_UFFD_WP							\
+/* ext_svrsw60t59b: Bit(60) for userfaultfd tracking */
+#define _PAGE_UFFD							\
 	((riscv_has_extension_unlikely(RISCV_ISA_EXT_SVRSW60T59B)) ?	\
 	 (1UL << 60) : 0)
 /*
  * Bit 4 is not involved into swap entry computation, so we
- * can borrow it for swap page uffd-wp tracking.
+ * can borrow it for swap page userfaultfd tracking.
  */
-#define _PAGE_SWP_UFFD_WP						\
+#define _PAGE_SWP_UFFD							\
 	((riscv_has_extension_unlikely(RISCV_ISA_EXT_SVRSW60T59B)) ?	\
 	 _PAGE_USER : 0)
 #else
-#define _PAGE_UFFD_WP	0
-#define _PAGE_SWP_UFFD_WP	0
+#define _PAGE_UFFD	0
+#define _PAGE_SWP_UFFD	0
 #endif
 
 #define _PAGE_TABLE	_PAGE_PRESENT
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 48a127323b21..ca69948b3ed8 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -405,32 +405,32 @@ static inline pte_t pte_wrprotect(pte_t pte)
 
 static inline bool pte_uffd_wp(pte_t pte)
 {
-	return !!(pte_val(pte) & _PAGE_UFFD_WP);
+	return !!(pte_val(pte) & _PAGE_UFFD);
 }
 
 static inline pte_t pte_mkuffd_wp(pte_t pte)
 {
-	return pte_wrprotect(__pte(pte_val(pte) | _PAGE_UFFD_WP));
+	return pte_wrprotect(__pte(pte_val(pte) | _PAGE_UFFD));
 }
 
 static inline pte_t pte_clear_uffd_wp(pte_t pte)
 {
-	return __pte(pte_val(pte) & ~(_PAGE_UFFD_WP));
+	return __pte(pte_val(pte) & ~(_PAGE_UFFD));
 }
 
 static inline bool pte_swp_uffd_wp(pte_t pte)
 {
-	return !!(pte_val(pte) & _PAGE_SWP_UFFD_WP);
+	return !!(pte_val(pte) & _PAGE_SWP_UFFD);
 }
 
 static inline pte_t pte_swp_mkuffd_wp(pte_t pte)
 {
-	return __pte(pte_val(pte) | _PAGE_SWP_UFFD_WP);
+	return __pte(pte_val(pte) | _PAGE_SWP_UFFD);
 }
 
 static inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
 {
-	return __pte(pte_val(pte) & ~(_PAGE_SWP_UFFD_WP));
+	return __pte(pte_val(pte) & ~(_PAGE_SWP_UFFD));
 }
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
 
@@ -1157,7 +1157,7 @@ static inline pud_t pud_modify(pud_t pud, pgprot_t newprot)
  *	bit            0:	_PAGE_PRESENT (zero)
  *	bit       1 to 2:	(zero)
  *	bit            3:	_PAGE_SWP_SOFT_DIRTY
- *	bit            4:	_PAGE_SWP_UFFD_WP
+ *	bit            4:	_PAGE_SWP_UFFD
  *	bit            5:	_PAGE_PROT_NONE (zero)
  *	bit            6:	exclusive marker
  *	bits     7 to 11:	swap type
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index c7f014cbf0a9..038c806b50a2 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -413,17 +413,17 @@ static inline pte_t pte_wrprotect(pte_t pte)
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
 static inline int pte_uffd_wp(pte_t pte)
 {
-	return pte_flags(pte) & _PAGE_UFFD_WP;
+	return pte_flags(pte) & _PAGE_UFFD;
 }
 
 static inline pte_t pte_mkuffd_wp(pte_t pte)
 {
-	return pte_wrprotect(pte_set_flags(pte, _PAGE_UFFD_WP));
+	return pte_wrprotect(pte_set_flags(pte, _PAGE_UFFD));
 }
 
 static inline pte_t pte_clear_uffd_wp(pte_t pte)
 {
-	return pte_clear_flags(pte, _PAGE_UFFD_WP);
+	return pte_clear_flags(pte, _PAGE_UFFD);
 }
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
 
@@ -528,17 +528,17 @@ static inline pmd_t pmd_wrprotect(pmd_t pmd)
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
 static inline int pmd_uffd_wp(pmd_t pmd)
 {
-	return pmd_flags(pmd) & _PAGE_UFFD_WP;
+	return pmd_flags(pmd) & _PAGE_UFFD;
 }
 
 static inline pmd_t pmd_mkuffd_wp(pmd_t pmd)
 {
-	return pmd_wrprotect(pmd_set_flags(pmd, _PAGE_UFFD_WP));
+	return pmd_wrprotect(pmd_set_flags(pmd, _PAGE_UFFD));
 }
 
 static inline pmd_t pmd_clear_uffd_wp(pmd_t pmd)
 {
-	return pmd_clear_flags(pmd, _PAGE_UFFD_WP);
+	return pmd_clear_flags(pmd, _PAGE_UFFD);
 }
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
 
@@ -1550,32 +1550,32 @@ static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
 static inline pte_t pte_swp_mkuffd_wp(pte_t pte)
 {
-	return pte_set_flags(pte, _PAGE_SWP_UFFD_WP);
+	return pte_set_flags(pte, _PAGE_SWP_UFFD);
 }
 
 static inline int pte_swp_uffd_wp(pte_t pte)
 {
-	return pte_flags(pte) & _PAGE_SWP_UFFD_WP;
+	return pte_flags(pte) & _PAGE_SWP_UFFD;
 }
 
 static inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
 {
-	return pte_clear_flags(pte, _PAGE_SWP_UFFD_WP);
+	return pte_clear_flags(pte, _PAGE_SWP_UFFD);
 }
 
 static inline pmd_t pmd_swp_mkuffd_wp(pmd_t pmd)
 {
-	return pmd_set_flags(pmd, _PAGE_SWP_UFFD_WP);
+	return pmd_set_flags(pmd, _PAGE_SWP_UFFD);
 }
 
 static inline int pmd_swp_uffd_wp(pmd_t pmd)
 {
-	return pmd_flags(pmd) & _PAGE_SWP_UFFD_WP;
+	return pmd_flags(pmd) & _PAGE_SWP_UFFD;
 }
 
 static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd)
 {
-	return pmd_clear_flags(pmd, _PAGE_SWP_UFFD_WP);
+	return pmd_clear_flags(pmd, _PAGE_SWP_UFFD);
 }
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
 
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 2ec250ba467e..af08d98be930 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -31,7 +31,7 @@
 
 #define _PAGE_BIT_SPECIAL	_PAGE_BIT_SOFTW1
 #define _PAGE_BIT_CPA_TEST	_PAGE_BIT_SOFTW1
-#define _PAGE_BIT_UFFD_WP	_PAGE_BIT_SOFTW2 /* userfaultfd wrprotected */
+#define _PAGE_BIT_UFFD		_PAGE_BIT_SOFTW2 /* userfaultfd tracking */
 #define _PAGE_BIT_SOFT_DIRTY	_PAGE_BIT_SOFTW3 /* software dirty tracking */
 #define _PAGE_BIT_KERNEL_4K	_PAGE_BIT_SOFTW3 /* page must not be converted to large */
 
@@ -39,7 +39,7 @@
 #define _PAGE_BIT_SAVED_DIRTY	_PAGE_BIT_SOFTW5 /* Saved Dirty bit (leaf) */
 #define _PAGE_BIT_NOPTISHADOW	_PAGE_BIT_SOFTW5 /* No PTI shadow (root PGD) */
 #else
-/* Shared with _PAGE_BIT_UFFD_WP which is not supported on 32 bit */
+/* Shared with _PAGE_BIT_UFFD which is not supported on 32 bit */
 #define _PAGE_BIT_SAVED_DIRTY	_PAGE_BIT_SOFTW2 /* Saved Dirty bit (leaf) */
 #define _PAGE_BIT_NOPTISHADOW	_PAGE_BIT_SOFTW2 /* No PTI shadow (root PGD) */
 #endif
@@ -111,11 +111,11 @@
 #endif
 
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
-#define _PAGE_UFFD_WP		(_AT(pteval_t, 1) << _PAGE_BIT_UFFD_WP)
-#define _PAGE_SWP_UFFD_WP	_PAGE_USER
+#define _PAGE_UFFD		(_AT(pteval_t, 1) << _PAGE_BIT_UFFD)
+#define _PAGE_SWP_UFFD		_PAGE_USER
 #else
-#define _PAGE_UFFD_WP		(_AT(pteval_t, 0))
-#define _PAGE_SWP_UFFD_WP	(_AT(pteval_t, 0))
+#define _PAGE_UFFD		(_AT(pteval_t, 0))
+#define _PAGE_SWP_UFFD		(_AT(pteval_t, 0))
 #endif
 
 #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
@@ -129,7 +129,7 @@
 /*
  * The hardware requires shadow stack to be Write=0,Dirty=1. However,
  * there are valid cases where the kernel might create read-only PTEs that
- * are dirty (e.g., fork(), mprotect(), uffd-wp(), soft-dirty tracking). In
+ * are dirty (e.g., fork(), mprotect(), userfaultfd, soft-dirty tracking). In
  * this case, the _PAGE_SAVED_DIRTY bit is used instead of the HW-dirty bit,
  * to avoid creating a wrong "shadow stack" PTEs. Such PTEs have
  * (Write=0,SavedDirty=1,Dirty=0) set.
@@ -151,7 +151,7 @@
 #define _COMMON_PAGE_CHG_MASK	(PTE_PFN_MASK | _PAGE_PCD | _PAGE_PWT |	\
 				 _PAGE_SPECIAL | _PAGE_ACCESSED |	\
 				 _PAGE_DIRTY_BITS | _PAGE_SOFT_DIRTY |	\
-				 _PAGE_CC | _PAGE_UFFD_WP)
+				 _PAGE_CC | _PAGE_UFFD)
 #define _PAGE_CHG_MASK	(_COMMON_PAGE_CHG_MASK | _PAGE_PAT)
 #define _HPAGE_CHG_MASK (_COMMON_PAGE_CHG_MASK | _PAGE_PSE | _PAGE_PAT_LARGE)
 
-- 
2.54.0

From nobody Mon Jun 8 23:56:07 2026
From: "Kiryl Shutsemau (Meta)"
To: akpm@linux-foundation.org, rppt@kernel.org, peterx@redhat.com,
	david@kernel.org
Cc: ljs@kernel.org, surenb@google.com, vbabka@kernel.org,
	Liam.Howlett@oracle.com, ziy@nvidia.com, corbet@lwn.net,
	skhan@linuxfoundation.org, seanjc@google.com, pbonzini@redhat.com,
	jthoughton@google.com, aarcange@redhat.com, sj@kernel.org,
	usama.arif@linux.dev, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-kselftest@vger.kernel.org, kvm@vger.kernel.org,
	kernel-team@meta.com, "Kiryl Shutsemau (Meta)"
Subject: [PATCH v4 03/14] mm: rename uffd-wp PTE accessors to uffd
Date: Mon, 25 May 2026 12:37:17 +0100
Message-ID: <20260525113737.1942478-4-kas@kernel.org>
In-Reply-To: <20260525113737.1942478-1-kas@kernel.org>
References: <20260525113737.1942478-1-kas@kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

Userfaultfd RWP will reuse the uffd-wp PTE bit to mark access-tracking
PTEs, alongside the write-protected ones it already marks. The bit's
meaning now depends on the VMA flag (WP or RWP), not on its name.

Rename the kernel-internal names that describe the bit:

  - pte/pmd/huge_pte accessors (and swap variants)
  - pgtable_supports_uffd() capability query
  - SCAN_PTE_UFFD khugepaged enum

The ftrace string emitted by mm_khugepaged_scan_pmd for this enum is
kept as "pte_uffd_wp" so existing trace-based tooling keeps matching.

Pure mechanical rename -- no behavior change.

Signed-off-by: Kiryl Shutsemau
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Mike Rapoport (Microsoft)
Reviewed-by: SeongJae Park
---
 arch/arm64/include/asm/pgtable.h   | 28 ++++++++--------
 arch/riscv/include/asm/pgtable.h   | 38 +++++++++++-----------
 arch/s390/include/asm/hugetlb.h    | 12 +++----
 arch/x86/include/asm/pgtable.h     | 24 +++++++-------
 fs/proc/task_mmu.c                 | 44 ++++++++++++-------------
 include/asm-generic/hugetlb.h      | 18 +++++------
 include/asm-generic/pgtable_uffd.h | 32 +++++++++---------
 include/linux/leafops.h            |  4 +--
 include/linux/mm_inline.h          |  4 +--
 include/linux/swapops.h            |  4 +--
 include/linux/userfaultfd_k.h      | 14 ++++----
 include/trace/events/huge_memory.h |  2 +-
 mm/huge_memory.c                   | 52 +++++++++++++++---------------
 mm/hugetlb.c                       | 46 +++++++++++++-------------
 mm/internal.h                      |  4 +--
 mm/khugepaged.c                    | 22 ++++++-------
 mm/memory.c                        | 34 +++++++++----------
 mm/migrate.c                       | 12 +++----
 mm/migrate_device.c                |  8 ++---
 mm/mprotect.c                      | 12 +++----
 mm/mremap.c                        |  4 +--
 mm/page_table_check.c              |  8 ++---
 mm/rmap.c                          | 16 ++++-----
 mm/swapfile.c                      |  4 +--
 mm/userfaultfd.c                   |  6 ++--
 25 files changed, 226 insertions(+), 226 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 3eecb2c17711..c41e4d59dc9f 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -341,17 +341,17 @@ static inline pmd_t pmd_mknoncont(pmd_t pmd)
 }
 
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
-static inline int pte_uffd_wp(pte_t pte)
+static inline int pte_uffd(pte_t pte)
 {
 	return !!(pte_val(pte) & PTE_UFFD);
 }
 
-static inline pte_t pte_mkuffd_wp(pte_t pte)
+static inline pte_t pte_mkuffd(pte_t pte)
 {
 	return pte_wrprotect(set_pte_bit(pte, __pgprot(PTE_UFFD)));
 }
 
-static inline pte_t pte_clear_uffd_wp(pte_t pte)
+static inline pte_t pte_clear_uffd(pte_t pte)
 {
 	return clear_pte_bit(pte, __pgprot(PTE_UFFD));
 }
@@ -537,17 +537,17 @@ static inline pte_t pte_swp_clear_exclusive(pte_t pte)
 }
 
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
-static inline pte_t pte_swp_mkuffd_wp(pte_t pte)
+static inline pte_t pte_swp_mkuffd(pte_t pte)
 {
 	return set_pte_bit(pte, __pgprot(PTE_SWP_UFFD));
 }
 
-static inline int pte_swp_uffd_wp(pte_t pte)
+static inline int pte_swp_uffd(pte_t pte)
 {
 	return !!(pte_val(pte) & PTE_SWP_UFFD);
 }
 
-static inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
+static inline pte_t pte_swp_clear_uffd(pte_t pte)
 {
 	return clear_pte_bit(pte, __pgprot(PTE_SWP_UFFD));
 }
@@ -590,13 +590,13 @@ static inline int pmd_protnone(pmd_t pmd)
 #define pmd_mkvalid_k(pmd) pte_pmd(pte_mkvalid_k(pmd_pte(pmd)))
 #define pmd_mkinvalid(pmd) pte_pmd(pte_mkinvalid(pmd_pte(pmd)))
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
-#define pmd_uffd_wp(pmd) pte_uffd_wp(pmd_pte(pmd))
-#define pmd_mkuffd_wp(pmd) pte_pmd(pte_mkuffd_wp(pmd_pte(pmd)))
-#define pmd_clear_uffd_wp(pmd) pte_pmd(pte_clear_uffd_wp(pmd_pte(pmd)))
-#define pmd_swp_uffd_wp(pmd) pte_swp_uffd_wp(pmd_pte(pmd))
-#define pmd_swp_mkuffd_wp(pmd) pte_pmd(pte_swp_mkuffd_wp(pmd_pte(pmd)))
-#define pmd_swp_clear_uffd_wp(pmd) \
-	pte_pmd(pte_swp_clear_uffd_wp(pmd_pte(pmd)))
+#define pmd_uffd(pmd) pte_uffd(pmd_pte(pmd))
+#define pmd_mkuffd(pmd) pte_pmd(pte_mkuffd(pmd_pte(pmd)))
+#define pmd_clear_uffd(pmd) pte_pmd(pte_clear_uffd(pmd_pte(pmd)))
+#define pmd_swp_uffd(pmd) pte_swp_uffd(pmd_pte(pmd))
+#define pmd_swp_mkuffd(pmd) pte_pmd(pte_swp_mkuffd(pmd_pte(pmd)))
+#define pmd_swp_clear_uffd(pmd) \
+	pte_pmd(pte_swp_clear_uffd(pmd_pte(pmd)))
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
 
 #define pmd_write(pmd) pte_write(pmd_pte(pmd))
@@ -1512,7 +1512,7 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
  * Encode and decode a swap entry:
  * bits 0-1: present (must be zero)
  * bits 2: remember PG_anon_exclusive
- * bit 3: remember uffd-wp state
+ * bit 3: remember uffd state
  * bits 6-10: swap type
  * bit 11: PTE_PRESENT_INVALID (must be zero)
  * bits 12-61: swap offset
diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index ca69948b3ed8..b111e134795e 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -400,35 +400,35 @@ static inline pte_t pte_wrprotect(pte_t pte)
 }
 
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
-#define pgtable_supports_uffd_wp() \
+#define pgtable_supports_uffd() \
 	riscv_has_extension_unlikely(RISCV_ISA_EXT_SVRSW60T59B)
 
-static inline bool pte_uffd_wp(pte_t pte)
+static inline bool pte_uffd(pte_t pte)
 {
 	return !!(pte_val(pte) & _PAGE_UFFD);
 }
 
-static inline pte_t pte_mkuffd_wp(pte_t pte)
+static inline pte_t pte_mkuffd(pte_t pte)
 {
 	return pte_wrprotect(__pte(pte_val(pte) | _PAGE_UFFD));
 }
 
-static inline pte_t pte_clear_uffd_wp(pte_t pte)
+static inline pte_t pte_clear_uffd(pte_t pte)
 {
 	return __pte(pte_val(pte) & ~(_PAGE_UFFD));
 }
 
-static inline bool pte_swp_uffd_wp(pte_t pte)
+static inline bool pte_swp_uffd(pte_t pte)
 {
 	return !!(pte_val(pte) & _PAGE_SWP_UFFD);
 }
 
-static inline pte_t pte_swp_mkuffd_wp(pte_t pte)
+static inline pte_t pte_swp_mkuffd(pte_t pte)
 {
 	return __pte(pte_val(pte) | _PAGE_SWP_UFFD);
 }
 
-static inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
+static inline pte_t pte_swp_clear_uffd(pte_t pte)
 {
 	return __pte(pte_val(pte) & ~(_PAGE_SWP_UFFD));
 }
@@ -886,34 +886,34 @@ static inline pud_t pud_mkspecial(pud_t pud)
 #endif
 
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
-static inline bool pmd_uffd_wp(pmd_t pmd)
+static inline bool pmd_uffd(pmd_t pmd)
 {
-	return pte_uffd_wp(pmd_pte(pmd));
+	return pte_uffd(pmd_pte(pmd));
 }
 
-static inline pmd_t pmd_mkuffd_wp(pmd_t pmd)
+static inline pmd_t pmd_mkuffd(pmd_t pmd)
 {
-	return pte_pmd(pte_mkuffd_wp(pmd_pte(pmd)));
+	return pte_pmd(pte_mkuffd(pmd_pte(pmd)));
 }
 
-static inline pmd_t pmd_clear_uffd_wp(pmd_t pmd)
+static inline pmd_t pmd_clear_uffd(pmd_t pmd)
 {
-	return pte_pmd(pte_clear_uffd_wp(pmd_pte(pmd)));
+	return pte_pmd(pte_clear_uffd(pmd_pte(pmd)));
 }
 
-static inline bool pmd_swp_uffd_wp(pmd_t pmd)
+static inline bool pmd_swp_uffd(pmd_t pmd)
 {
-	return pte_swp_uffd_wp(pmd_pte(pmd));
+	return pte_swp_uffd(pmd_pte(pmd));
 }
 
-static inline pmd_t pmd_swp_mkuffd_wp(pmd_t pmd)
+static inline pmd_t pmd_swp_mkuffd(pmd_t pmd)
 {
-	return pte_pmd(pte_swp_mkuffd_wp(pmd_pte(pmd)));
+	return pte_pmd(pte_swp_mkuffd(pmd_pte(pmd)));
 }
 
-static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd)
+static inline pmd_t pmd_swp_clear_uffd(pmd_t pmd)
 {
-	return pte_pmd(pte_swp_clear_uffd_wp(pmd_pte(pmd)));
+	return pte_pmd(pte_swp_clear_uffd(pmd_pte(pmd)));
 }
 #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_WP */
 
diff --git a/arch/s390/include/asm/hugetlb.h b/arch/s390/include/asm/hugetlb.h
index 6983e52eaf81..cf8a176ff3d8 100644
--- a/arch/s390/include/asm/hugetlb.h
+++ b/arch/s390/include/asm/hugetlb.h
@@ -77,20 +77,20 @@ static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
 	__set_huge_pte_at(mm, addr, ptep, pte_wrprotect(pte));
 }
 
-#define __HAVE_ARCH_HUGE_PTE_MKUFFD_WP
-static inline pte_t huge_pte_mkuffd_wp(pte_t pte)
+#define __HAVE_ARCH_HUGE_PTE_MKUFFD
+static inline pte_t huge_pte_mkuffd(pte_t pte)
 {
 	return pte;
 }
 
-#define __HAVE_ARCH_HUGE_PTE_CLEAR_UFFD_WP
-static inline pte_t huge_pte_clear_uffd_wp(pte_t pte)
+#define __HAVE_ARCH_HUGE_PTE_CLEAR_UFFD
+static inline pte_t huge_pte_clear_uffd(pte_t pte)
 {
 	return pte;
 }
 
-#define __HAVE_ARCH_HUGE_PTE_UFFD_WP
-static inline int huge_pte_uffd_wp(pte_t pte)
+#define __HAVE_ARCH_HUGE_PTE_UFFD
+static inline int huge_pte_uffd(pte_t pte)
 {
 	return 0;
 }
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 038c806b50a2..d14c84b2a332 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -411,17 +411,17 @@ static inline pte_t pte_wrprotect(pte_t pte)
 }
 
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
-static inline int pte_uffd_wp(pte_t pte)
+static inline int pte_uffd(pte_t pte)
 {
 	return pte_flags(pte) & _PAGE_UFFD;
 }
 
-static inline pte_t pte_mkuffd_wp(pte_t pte)
+static inline pte_t pte_mkuffd(pte_t pte)
 {
 	return pte_wrprotect(pte_set_flags(pte, _PAGE_UFFD));
 }
 
-static inline pte_t pte_clear_uffd_wp(pte_t pte)
+static inline pte_t pte_clear_uffd(pte_t pte)
 {
 	return pte_clear_flags(pte, _PAGE_UFFD);
 }
@@ -526,17 +526,17 @@ static inline pmd_t pmd_wrprotect(pmd_t pmd)
 }
 
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
-static inline int pmd_uffd_wp(pmd_t pmd)
+static inline int pmd_uffd(pmd_t pmd)
 {
 	return pmd_flags(pmd) & _PAGE_UFFD;
 }
 
-static inline pmd_t pmd_mkuffd_wp(pmd_t pmd)
+static inline pmd_t pmd_mkuffd(pmd_t pmd)
 {
 	return pmd_wrprotect(pmd_set_flags(pmd, _PAGE_UFFD));
 }
 
-static inline pmd_t pmd_clear_uffd_wp(pmd_t pmd)
+static inline pmd_t pmd_clear_uffd(pmd_t pmd)
 {
 	return pmd_clear_flags(pmd, _PAGE_UFFD);
 }
@@ -1548,32 +1548,32 @@ static inline pmd_t pmd_swp_clear_soft_dirty(pmd_t pmd)
 #endif
 
 #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_WP
-static inline pte_t pte_swp_mkuffd_wp(pte_t pte)
+static inline pte_t pte_swp_mkuffd(pte_t pte)
 {
 	return pte_set_flags(pte, _PAGE_SWP_UFFD);
 }
 
-static inline int pte_swp_uffd_wp(pte_t pte)
+static inline int pte_swp_uffd(pte_t pte)
 {
 	return pte_flags(pte) & _PAGE_SWP_UFFD;
 }
 
-static inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
+static inline pte_t pte_swp_clear_uffd(pte_t pte)
 {
 	return pte_clear_flags(pte, _PAGE_SWP_UFFD);
 }
 
-static inline pmd_t pmd_swp_mkuffd_wp(pmd_t pmd)
+static inline pmd_t pmd_swp_mkuffd(pmd_t pmd)
 {
 	return pmd_set_flags(pmd, _PAGE_SWP_UFFD);
 }
 
-static inline int pmd_swp_uffd_wp(pmd_t pmd)
+static inline int pmd_swp_uffd(pmd_t pmd)
 {
 	return pmd_flags(pmd) & _PAGE_SWP_UFFD;
 }
 
-static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd)
+static inline pmd_t pmd_swp_clear_uffd(pmd_t pmd)
 {
 	return pmd_clear_flags(pmd, _PAGE_SWP_UFFD);
 }
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 1e3a15bf46f4..cbd164f4928f 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -2035,14 +2035,14 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
 		page = vm_normal_page(vma, addr, pte);
 		if (pte_soft_dirty(pte))
 			flags |= PM_SOFT_DIRTY;
-		if (pte_uffd_wp(pte))
+		if (pte_uffd(pte))
 			flags |= PM_UFFD_WP;
 	} else {
 		softleaf_t entry;
 
 		if (pte_swp_soft_dirty(pte))
 			flags |= PM_SOFT_DIRTY;
-		if (pte_swp_uffd_wp(pte))
+		if (pte_swp_uffd(pte))
 			flags |= PM_UFFD_WP;
 		entry = softleaf_from_pte(pte);
 		if (pm->show_pfn) {
@@ -2108,7 +2108,7 @@ static int pagemap_pmd_range_thp(pmd_t *pmdp, unsigned long addr,
 		flags |= PM_PRESENT;
 		if (pmd_soft_dirty(pmd))
 			flags |= PM_SOFT_DIRTY;
-		if (pmd_uffd_wp(pmd))
+		if (pmd_uffd(pmd))
 			flags |= PM_UFFD_WP;
 		if (pm->show_pfn)
 			frame = pmd_pfn(pmd) + idx;
@@ -2127,7 +2127,7 @@ static int pagemap_pmd_range_thp(pmd_t *pmdp, unsigned long addr,
 		flags |= PM_SWAP;
 		if (pmd_swp_soft_dirty(pmd))
 			flags |= PM_SOFT_DIRTY;
-		if (pmd_swp_uffd_wp(pmd))
+		if (pmd_swp_uffd(pmd))
 			flags |= PM_UFFD_WP;
 		VM_WARN_ON_ONCE(!pmd_is_migration_entry(pmd));
 		page = softleaf_to_page(entry);
@@ -2233,14 +2233,14 @@ static int pagemap_hugetlb_range(pte_t *ptep, unsigned long hmask,
 		    !hugetlb_pmd_shared(ptep))
 			flags |= PM_MMAP_EXCLUSIVE;
 
-		if (huge_pte_uffd_wp(pte))
+		if (huge_pte_uffd(pte))
 			flags |= PM_UFFD_WP;
 
 		flags |= PM_PRESENT;
 		if (pm->show_pfn)
 			frame = pte_pfn(pte) +
 				((addr & ~hmask) >> PAGE_SHIFT);
-	} else if (pte_swp_uffd_wp_any(pte)) {
+	} else if (pte_swp_uffd_any(pte)) {
 		flags |= PM_UFFD_WP;
 	}
 
@@ -2441,7 +2441,7 @@ static unsigned long pagemap_page_category(struct pagemap_scan_private *p,
 
 		categories = PAGE_IS_PRESENT;
 
-		if (!pte_uffd_wp(pte))
+		if (!pte_uffd(pte))
 			categories |= PAGE_IS_WRITTEN;
 
 		if (p->masks_of_interest & PAGE_IS_FILE) {
@@ -2459,7 +2459,7 @@ static unsigned long pagemap_page_category(struct pagemap_scan_private *p,
 
 		categories = PAGE_IS_SWAPPED;
 
-		if (!pte_swp_uffd_wp_any(pte))
+		if (!pte_swp_uffd_any(pte))
 			categories |= PAGE_IS_WRITTEN;
 
 		entry = softleaf_from_pte(pte);
@@ -2484,13 +2484,13 @@ static void make_uffd_wp_pte(struct vm_area_struct *vma,
 		pte_t old_pte;
 
 		old_pte = ptep_modify_prot_start(vma, addr, pte);
-		ptent = pte_mkuffd_wp(old_pte);
+		ptent = pte_mkuffd(old_pte);
 		ptep_modify_prot_commit(vma, addr, pte, old_pte, ptent);
 	} else if (pte_none(ptent)) {
 		set_pte_at(vma->vm_mm, addr, pte,
 			   make_pte_marker(PTE_MARKER_UFFD_WP));
 	} else {
-		ptent = pte_swp_mkuffd_wp(ptent);
+		ptent = pte_swp_mkuffd(ptent);
 		set_pte_at(vma->vm_mm, addr, pte, ptent);
 	}
 }
@@ -2509,7 +2509,7 @@ static unsigned long pagemap_thp_category(struct pagemap_scan_private *p,
 		struct page *page;
 
 		categories |= PAGE_IS_PRESENT;
-		if (!pmd_uffd_wp(pmd))
+		if (!pmd_uffd(pmd))
 			categories |= PAGE_IS_WRITTEN;
 
 		if (p->masks_of_interest & PAGE_IS_FILE) {
@@ -2524,7 +2524,7 @@ static unsigned long pagemap_thp_category(struct pagemap_scan_private *p,
 			categories |= PAGE_IS_SOFT_DIRTY;
 	} else {
 		categories |= PAGE_IS_SWAPPED;
-		if (!pmd_swp_uffd_wp(pmd))
+		if (!pmd_swp_uffd(pmd))
 			categories |= PAGE_IS_WRITTEN;
 		if (pmd_swp_soft_dirty(pmd))
 			categories |= PAGE_IS_SOFT_DIRTY;
@@ -2548,10 +2548,10 @@ static void make_uffd_wp_pmd(struct vm_area_struct *vma,
 
 	if (pmd_present(pmd)) {
 		old = pmdp_invalidate_ad(vma, addr, pmdp);
-		pmd = pmd_mkuffd_wp(old);
+		pmd = pmd_mkuffd(old);
 		set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
 	} else if (pmd_is_migration_entry(pmd)) {
-		pmd = pmd_swp_mkuffd_wp(pmd);
+		pmd = pmd_swp_mkuffd(pmd);
 		set_pmd_at(vma->vm_mm, addr, pmdp, pmd);
 	}
 }
@@ -2573,7 +2573,7 @@ static unsigned long pagemap_hugetlb_category(pte_t pte)
 	if (pte_present(pte)) {
 		categories |= PAGE_IS_PRESENT;
 
-		if (!huge_pte_uffd_wp(pte))
+		if (!huge_pte_uffd(pte))
 			categories |= PAGE_IS_WRITTEN;
 		if (!PageAnon(pte_page(pte)))
 			categories |= PAGE_IS_FILE;
@@ -2584,7 +2584,7 @@ static unsigned long pagemap_hugetlb_category(pte_t pte)
 	} else {
 		categories |= PAGE_IS_SWAPPED;
 
-		if (!pte_swp_uffd_wp_any(pte))
+		if (!pte_swp_uffd_any(pte))
 			categories |= PAGE_IS_WRITTEN;
 		if (pte_swp_soft_dirty(pte))
 			categories |= PAGE_IS_SOFT_DIRTY;
@@ -2612,10 +2612,10 @@ static void make_uffd_wp_huge_pte(struct vm_area_struct *vma,
 
 	if (softleaf_is_migration(entry))
 		set_huge_pte_at(vma->vm_mm, addr, ptep,
-				pte_swp_mkuffd_wp(ptent), psize);
+				pte_swp_mkuffd(ptent), psize);
 	else
 		huge_ptep_modify_prot_commit(vma, addr, ptep, ptent,
-					     huge_pte_mkuffd_wp(ptent));
+					     huge_pte_mkuffd(ptent));
 }
 #endif /* CONFIG_HUGETLB_PAGE */
 
@@ -2846,8 +2846,8 @@ static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
 		for (addr = start; addr != end; pte++, addr += PAGE_SIZE) {
 			pte_t ptent = ptep_get(pte);
 
-			if ((pte_present(ptent) && pte_uffd_wp(ptent)) ||
-			    pte_swp_uffd_wp_any(ptent))
+			if ((pte_present(ptent) && pte_uffd(ptent)) ||
+			    pte_swp_uffd_any(ptent))
 				continue;
 			make_uffd_wp_pte(vma, addr, pte, ptent);
 			if (!flush_end)
@@ -2864,8 +2864,8 @@ static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigned long start,
 			unsigned long next = addr + PAGE_SIZE;
 			pte_t ptent = ptep_get(pte);
 
-			if ((pte_present(ptent) && pte_uffd_wp(ptent)) ||
-			    pte_swp_uffd_wp_any(ptent))
+			if ((pte_present(ptent) && pte_uffd(ptent)) ||
+			    pte_swp_uffd_any(ptent))
 				continue;
 			ret = pagemap_scan_output(p->cur_vma_category | PAGE_IS_WRITTEN,
 						  p, addr, &next);
diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h
index e1a2e1b7c8e7..635c41cc3479 100644
--- a/include/asm-generic/hugetlb.h
+++ b/include/asm-generic/hugetlb.h
@@ -37,24 +37,24 @@ static inline pte_t huge_pte_modify(pte_t pte, pgprot_t newprot)
 	return pte_modify(pte, newprot);
 }
 
-#ifndef __HAVE_ARCH_HUGE_PTE_MKUFFD_WP
-static inline pte_t huge_pte_mkuffd_wp(pte_t pte)
+#ifndef __HAVE_ARCH_HUGE_PTE_MKUFFD
+static inline pte_t huge_pte_mkuffd(pte_t pte)
 {
-	return huge_pte_wrprotect(pte_mkuffd_wp(pte));
+	return huge_pte_wrprotect(pte_mkuffd(pte));
 }
 #endif
 
-#ifndef __HAVE_ARCH_HUGE_PTE_CLEAR_UFFD_WP
-static inline pte_t huge_pte_clear_uffd_wp(pte_t pte)
+#ifndef __HAVE_ARCH_HUGE_PTE_CLEAR_UFFD
+static inline pte_t huge_pte_clear_uffd(pte_t pte)
 {
-	return pte_clear_uffd_wp(pte);
+	return pte_clear_uffd(pte);
 }
 #endif
 
-#ifndef __HAVE_ARCH_HUGE_PTE_UFFD_WP
-static inline int huge_pte_uffd_wp(pte_t pte)
+#ifndef __HAVE_ARCH_HUGE_PTE_UFFD
+static inline int huge_pte_uffd(pte_t pte)
 {
-	return pte_uffd_wp(pte);
+	return pte_uffd(pte);
 }
 #endif
 
diff --git a/include/asm-generic/pgtable_uffd.h b/include/asm-generic/pgtable_uffd.h
index 0d85791efdf7..30e88fc1de2f 100644
--- a/include/asm-generic/pgtable_uffd.h
+++ b/include/asm-generic/pgtable_uffd.h
@@ -2,79 +2,79 @@
 #define _ASM_GENERIC_PGTABLE_UFFD_H
 
 /*
- * Some platforms can customize the uffd-wp bit, making it unavailable
+ * Some platforms can customize the uffd PTE bit, making it unavailable
  * even if the architecture provides the resource.
  * Adding this API allows architectures to add their own checks for the
  * devices on which the kernel is running.
  * Note: When overriding it, please make sure the
  * CONFIG_HAVE_ARCH_USERFAULTFD_WP is part of this macro.
  */
-#ifndef pgtable_supports_uffd_wp
-#define pgtable_supports_uffd_wp() IS_ENABLED(CONFIG_HAVE_ARCH_USERFAULTFD_WP)
+#ifndef pgtable_supports_uffd
+#define pgtable_supports_uffd() IS_ENABLED(CONFIG_HAVE_ARCH_USERFAULTFD_WP)
 #endif
 
 static inline bool uffd_supports_wp_marker(void)
 {
-	return pgtable_supports_uffd_wp() && IS_ENABLED(CONFIG_PTE_MARKER_UFFD_WP);
+	return pgtable_supports_uffd() && IS_ENABLED(CONFIG_PTE_MARKER_UFFD_WP);
 }
 
 #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_WP
-static __always_inline int pte_uffd_wp(pte_t pte)
+static __always_inline int pte_uffd(pte_t pte)
 {
 	return 0;
 }
 
-static __always_inline int pmd_uffd_wp(pmd_t pmd)
+static __always_inline int pmd_uffd(pmd_t pmd)
 {
 	return 0;
 }
 
-static __always_inline pte_t pte_mkuffd_wp(pte_t pte)
+static __always_inline pte_t pte_mkuffd(pte_t pte)
 {
 	return pte;
 }
 
-static __always_inline pmd_t pmd_mkuffd_wp(pmd_t pmd)
+static __always_inline pmd_t pmd_mkuffd(pmd_t pmd)
 {
 	return pmd;
 }
 
-static __always_inline pte_t pte_clear_uffd_wp(pte_t pte)
+static __always_inline pte_t pte_clear_uffd(pte_t pte)
 {
 	return pte;
 }
 
-static __always_inline pmd_t pmd_clear_uffd_wp(pmd_t pmd)
+static __always_inline pmd_t pmd_clear_uffd(pmd_t pmd)
 {
 	return pmd;
 }
 
-static __always_inline pte_t pte_swp_mkuffd_wp(pte_t pte)
+static __always_inline pte_t pte_swp_mkuffd(pte_t pte)
 {
 	return pte;
 }
 
-static __always_inline int pte_swp_uffd_wp(pte_t pte)
+static __always_inline int pte_swp_uffd(pte_t pte)
 {
 	return 0;
 }
 
-static __always_inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
+static __always_inline pte_t pte_swp_clear_uffd(pte_t pte)
 {
 	return pte;
 }
 
-static inline pmd_t pmd_swp_mkuffd_wp(pmd_t pmd)
+static inline pmd_t pmd_swp_mkuffd(pmd_t pmd)
 {
 	return pmd;
 }
 
-static inline int pmd_swp_uffd_wp(pmd_t pmd)
+static inline int pmd_swp_uffd(pmd_t pmd)
 {
 	return 0;
 }
 
-static inline pmd_t pmd_swp_clear_uffd_wp(pmd_t pmd)
+static inline pmd_t pmd_swp_clear_uffd(pmd_t pmd)
 {
 	return pmd;
 }
diff --git a/include/linux/leafops.h b/include/linux/leafops.h
index 992cd8bd8ed0..2ce2f37ac883 100644
--- a/include/linux/leafops.h
+++ b/include/linux/leafops.h
@@ -100,8 +100,8 @@ static inline softleaf_t softleaf_from_pmd(pmd_t pmd)
 
 	if (pmd_swp_soft_dirty(pmd))
 		pmd = pmd_swp_clear_soft_dirty(pmd);
-	if (pmd_swp_uffd_wp(pmd))
-		pmd = pmd_swp_clear_uffd_wp(pmd);
+	if (pmd_swp_uffd(pmd))
+		pmd = pmd_swp_clear_uffd(pmd);
 	arch_entry = __pmd_to_swp_entry(pmd);
 
 	/* Temporary until swp_entry_t eliminated. */
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index a171070e15f0..2811caf4188d 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -600,14 +600,14 @@ pte_install_uffd_wp_if_needed(struct vm_area_struct *vma, unsigned long addr,
 		return false;
 
 	/* A uffd-wp wr-protected normal pte */
-	if (unlikely(pte_present(pteval) && pte_uffd_wp(pteval)))
+	if (unlikely(pte_present(pteval) && pte_uffd(pteval)))
 		arm_uffd_pte = true;
 
 	/*
 	 * A uffd-wp wr-protected swap pte. Note: this should even cover an
 	 * existing pte marker with uffd-wp bit set.
 	 */
-	if (unlikely(pte_swp_uffd_wp_any(pteval)))
+	if (unlikely(pte_swp_uffd_any(pteval)))
 		arm_uffd_pte = true;
 
 	if (unlikely(arm_uffd_pte)) {
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 8cfc966eae48..15c6440e38dd 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -73,8 +73,8 @@ static inline pte_t pte_swp_clear_flags(pte_t pte)
 		pte = pte_swp_clear_exclusive(pte);
 	if (pte_swp_soft_dirty(pte))
 		pte = pte_swp_clear_soft_dirty(pte);
-	if (pte_swp_uffd_wp(pte))
-		pte = pte_swp_clear_uffd_wp(pte);
+	if (pte_swp_uffd(pte))
+		pte = pte_swp_clear_uffd(pte);
 	return pte;
 }
 
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 3ec8e1071673..f4cf5763f92c 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -211,13 +211,13 @@ static inline bool userfaultfd_minor(struct vm_area_struct *vma)
 static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
 				      pte_t pte)
 {
-	return userfaultfd_wp(vma) && pte_uffd_wp(pte);
+	return userfaultfd_wp(vma) && pte_uffd(pte);
 }
 
 static inline bool userfaultfd_huge_pmd_wp(struct vm_area_struct *vma,
 					   pmd_t pmd)
 {
-	return userfaultfd_wp(vma) && pmd_uffd_wp(pmd);
+	return userfaultfd_wp(vma) && pmd_uffd(pmd);
 }
 
 static inline bool userfaultfd_armed(struct vm_area_struct *vma)
@@ -272,10 +272,10 @@ static inline bool userfaultfd_wp_use_markers(struct vm_area_struct *vma)
 }
 
 /*
- * Returns true if this is a swap pte and was uffd-wp wr-protected in either
- * forms (pte marker or a normal swap pte), false otherwise.
+ * Returns true if this swap pte carries uffd-tracked state in either
+ * form (pte marker or a normal swap pte), false otherwise.
  */
-static inline bool pte_swp_uffd_wp_any(pte_t pte)
+static inline bool pte_swp_uffd_any(pte_t pte)
 {
 	if (!uffd_supports_wp_marker())
 		return false;
@@ -283,7 +283,7 @@ static inline bool pte_swp_uffd_wp_any(pte_t pte)
 	if (pte_present(pte))
 		return false;
 
-	if (pte_swp_uffd_wp(pte))
+	if (pte_swp_uffd(pte))
 		return true;
 
 	if (pte_is_uffd_wp_marker(pte))
@@ -424,7 +424,7 @@ static inline bool userfaultfd_wp_use_markers(struct vm_area_struct *vma)
  * Returns true if this is a swap pte and was uffd-wp wr-protected in either
  * forms (pte marker or a normal swap pte), false otherwise.
  */
-static inline bool pte_swp_uffd_wp_any(pte_t pte)
+static inline bool pte_swp_uffd_any(pte_t pte)
 {
 	return false;
 }
diff --git a/include/trace/events/huge_memory.h b/include/trace/events/huge_memory.h
index 291fae364c62..5a48c5406cce 100644
--- a/include/trace/events/huge_memory.h
+++ b/include/trace/events/huge_memory.h
@@ -16,7 +16,7 @@
 	EM( SCAN_EXCEED_SWAP_PTE, "exceed_swap_pte") \
 	EM( SCAN_EXCEED_SHARED_PTE, "exceed_shared_pte") \
 	EM( SCAN_PTE_NON_PRESENT, "pte_non_present") \
-	EM( SCAN_PTE_UFFD_WP, "pte_uffd_wp") \
+	EM( SCAN_PTE_UFFD, "pte_uffd_wp") \
 	EM( SCAN_PTE_MAPPED_HUGEPAGE, "pte_mapped_hugepage") \
 	EM( SCAN_LACK_REFERENCED_PAGE, "lack_referenced_page") \
 	EM( SCAN_PAGE_NULL, "page_null") \
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 42b86e8ab7c0..6017c73c92a0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1909,8 +1909,8 @@ static void copy_huge_non_present_pmd(
 		pmd = swp_entry_to_pmd(entry);
 		if (pmd_swp_soft_dirty(*src_pmd))
 			pmd = pmd_swp_mksoft_dirty(pmd);
-		if (pmd_swp_uffd_wp(*src_pmd))
-			pmd = pmd_swp_mkuffd_wp(pmd);
+		if (pmd_swp_uffd(*src_pmd))
+			pmd = pmd_swp_mkuffd(pmd);
 		set_pmd_at(src_mm, addr, src_pmd, pmd);
 	} else if (softleaf_is_device_private(entry)) {
 		/*
@@ -1923,8 +1923,8 @@ static void copy_huge_non_present_pmd(
 
 			if (pmd_swp_soft_dirty(*src_pmd))
 				pmd = pmd_swp_mksoft_dirty(pmd);
-			if (pmd_swp_uffd_wp(*src_pmd))
-				pmd = pmd_swp_mkuffd_wp(pmd);
+			if (pmd_swp_uffd(*src_pmd))
+				pmd = pmd_swp_mkuffd(pmd);
 			set_pmd_at(src_mm, addr, src_pmd, pmd);
 		}
 
@@ -1944,7 +1944,7 @@ static void copy_huge_non_present_pmd(
 	mm_inc_nr_ptes(dst_mm);
 	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
 	if (!userfaultfd_wp(dst_vma))
-		pmd = pmd_swp_clear_uffd_wp(pmd);
+		pmd = pmd_swp_clear_uffd(pmd);
 	set_pmd_at(dst_mm, addr, dst_pmd, pmd);
 }
 
@@ -2040,7 +2040,7 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
 	pmdp_set_wrprotect(src_mm, addr, src_pmd);
 	if (!userfaultfd_wp(dst_vma))
-		pmd = pmd_clear_uffd_wp(pmd);
+		pmd = pmd_clear_uffd(pmd);
 	pmd = pmd_wrprotect(pmd);
 set_pmd:
 	pmd = pmd_mkold(pmd);
@@ -2581,9 +2581,9 @@ static pmd_t clear_uffd_wp_pmd(pmd_t pmd)
 	if (pmd_none(pmd))
 		return pmd;
 	if (pmd_present(pmd))
-		pmd = pmd_clear_uffd_wp(pmd);
+		pmd = pmd_clear_uffd(pmd);
 	else
-		pmd = pmd_swp_clear_uffd_wp(pmd);
+		pmd = pmd_swp_clear_uffd(pmd);
 
 	return pmd;
 }
@@ -2668,9 +2668,9 @@ static void change_non_present_huge_pmd(struct mm_struct *mm,
 	}
 
 	if (uffd_wp)
-		newpmd = pmd_swp_mkuffd_wp(newpmd);
+		newpmd = pmd_swp_mkuffd(newpmd);
 	else if (uffd_wp_resolve)
-		newpmd = pmd_swp_clear_uffd_wp(newpmd);
+		newpmd = pmd_swp_clear_uffd(newpmd);
 	if (!pmd_same(*pmd, newpmd))
 		set_pmd_at(mm, addr, pmd, newpmd);
 }
@@ -2751,14 +2751,14 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 
 	entry = pmd_modify(oldpmd, newprot);
 	if (uffd_wp)
-		entry = pmd_mkuffd_wp(entry);
+		entry = pmd_mkuffd(entry);
 	else if (uffd_wp_resolve)
 		/*
 		 * Leave the write bit to be handled by PF interrupt
 		 * handler, then things like COW could be properly
 		 * handled.
 		 */
-		entry = pmd_clear_uffd_wp(entry);
+		entry = pmd_clear_uffd(entry);
 
 	/* See change_pte_range(). */
 	if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && !pmd_write(entry) &&
@@ -3101,8 +3101,8 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 
 		entry = pfn_pte(zero_pfn(addr), vma->vm_page_prot);
 		entry = pte_mkspecial(entry);
-		if (pmd_uffd_wp(old_pmd))
-			entry = pte_mkuffd_wp(entry);
+		if (pmd_uffd(old_pmd))
+			entry = pte_mkuffd(entry);
 		VM_BUG_ON(!pte_none(ptep_get(pte)));
 		set_pte_at(mm, addr, pte, entry);
 		pte++;
@@ -3186,7 +3186,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		folio = page_folio(page);
 
 		soft_dirty = pmd_swp_soft_dirty(old_pmd);
-		uffd_wp = pmd_swp_uffd_wp(old_pmd);
+		uffd_wp = pmd_swp_uffd(old_pmd);
 
 		write = softleaf_is_migration_write(entry);
 		if (PageAnon(page))
@@ -3202,7 +3202,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		folio = page_folio(page);
 
 		soft_dirty = pmd_swp_soft_dirty(old_pmd);
-		uffd_wp = pmd_swp_uffd_wp(old_pmd);
+		uffd_wp = pmd_swp_uffd(old_pmd);
 
 		write = softleaf_is_device_private_write(entry);
 		anon_exclusive = PageAnonExclusive(page);
@@ -3259,7 +3259,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		write = pmd_write(old_pmd);
 		young = pmd_young(old_pmd);
 		soft_dirty = pmd_soft_dirty(old_pmd);
-		uffd_wp = pmd_uffd_wp(old_pmd);
+		uffd_wp = pmd_uffd(old_pmd);
 
 		VM_WARN_ON_FOLIO(!folio_ref_count(folio), folio);
 		VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
@@ -3330,7 +3330,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 			if (soft_dirty)
 				entry = pte_swp_mksoft_dirty(entry);
 			if (uffd_wp)
-				entry = pte_swp_mkuffd_wp(entry);
+				entry = pte_swp_mkuffd(entry);
 			VM_WARN_ON(!pte_none(ptep_get(pte + i)));
 			set_pte_at(mm, addr, pte + i, entry);
 		}
@@ -3357,7 +3357,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 			if (soft_dirty)
 				entry = pte_swp_mksoft_dirty(entry);
 			if (uffd_wp)
-				entry = pte_swp_mkuffd_wp(entry);
+				entry = pte_swp_mkuffd(entry);
 			VM_WARN_ON(!pte_none(ptep_get(pte + i)));
 			set_pte_at(mm, addr, pte + i, entry);
 		}
@@ -3375,7 +3375,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		if (soft_dirty)
 			entry = pte_mksoft_dirty(entry);
 		if (uffd_wp)
-			entry = pte_mkuffd_wp(entry);
+			entry = pte_mkuffd(entry);
 
 		for (i = 0; i < HPAGE_PMD_NR; i++)
 			VM_WARN_ON(!pte_none(ptep_get(pte + i)));
@@ -5016,8 +5016,8 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
 	pmdswp = swp_entry_to_pmd(entry);
 	if (pmd_soft_dirty(pmdval))
 		pmdswp = pmd_swp_mksoft_dirty(pmdswp);
-	if (pmd_uffd_wp(pmdval))
-		pmdswp = pmd_swp_mkuffd_wp(pmdswp);
+	if (pmd_uffd(pmdval))
+		pmdswp = pmd_swp_mkuffd(pmdswp);
 	set_pmd_at(mm, address, pvmw->pmd, pmdswp);
 	folio_remove_rmap_pmd(folio, page, vma);
 	folio_put(folio);
@@ -5047,8 +5047,8 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 		pmde = pmd_mksoft_dirty(pmde);
 	if (softleaf_is_migration_write(entry))
 		pmde = pmd_mkwrite(pmde, vma);
-	if (pmd_swp_uffd_wp(*pvmw->pmd))
-		pmde = pmd_mkuffd_wp(pmde);
+	if (pmd_swp_uffd(*pvmw->pmd))
+		pmde = pmd_mkuffd(pmde);
 	if (!softleaf_is_migration_young(entry))
 		pmde = pmd_mkold(pmde);
 	/* NOTE: this may contain setting soft-dirty on some archs */
@@ -5068,8 +5068,8 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 
 		if (pmd_swp_soft_dirty(*pvmw->pmd))
 			pmde = pmd_swp_mksoft_dirty(pmde);
-		if (pmd_swp_uffd_wp(*pvmw->pmd))
-			pmde = pmd_swp_mkuffd_wp(pmde);
+		if (pmd_swp_uffd(*pvmw->pmd))
+			pmde = pmd_swp_mkuffd(pmde);
 	}
 
 	if (folio_test_anon(folio)) {
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 571212b80835..d0c81a056ae2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4843,8 +4843,8 @@ hugetlb_install_folio(struct vm_area_struct *vma, pte_t *ptep, unsigned long add
 
 	__folio_mark_uptodate(new_folio);
 	hugetlb_add_new_anon_rmap(new_folio, vma, addr);
-	if (userfaultfd_wp(vma) && huge_pte_uffd_wp(old))
-		newpte = huge_pte_mkuffd_wp(newpte);
+	if (userfaultfd_wp(vma) && huge_pte_uffd(old))
+		newpte = huge_pte_mkuffd(newpte);
 	set_huge_pte_at(vma->vm_mm, addr, ptep, newpte, sz);
 	hugetlb_count_add(pages_per_huge_page(hstate_vma(vma)), vma->vm_mm);
 	folio_set_hugetlb_migratable(new_folio);
@@ -4918,10 +4918,10 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 		softleaf = softleaf_from_pte(entry);
 		if (unlikely(softleaf_is_hwpoison(softleaf))) {
 			if (!userfaultfd_wp(dst_vma))
-				entry = huge_pte_clear_uffd_wp(entry);
+				entry = huge_pte_clear_uffd(entry);
 			set_huge_pte_at(dst, addr, dst_pte, entry, sz);
 		} else if (unlikely(softleaf_is_migration(softleaf))) {
-			bool uffd_wp = pte_swp_uffd_wp(entry);
+			bool uffd = pte_swp_uffd(entry);
 
 			if (!softleaf_is_migration_read(softleaf) && cow) {
 				/*
@@ -4931,12 +4931,12 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 				softleaf = make_readable_migration_entry(
 							swp_offset(softleaf));
 				entry = swp_entry_to_pte(softleaf);
-				if (userfaultfd_wp(src_vma) && uffd_wp)
-					entry = pte_swp_mkuffd_wp(entry);
+				if (userfaultfd_wp(src_vma) && uffd)
+					entry = pte_swp_mkuffd(entry);
 				set_huge_pte_at(src, addr, src_pte, entry, sz);
 			}
 			if (!userfaultfd_wp(dst_vma))
-				entry = huge_pte_clear_uffd_wp(entry);
+				entry = huge_pte_clear_uffd(entry);
 			set_huge_pte_at(dst, addr, dst_pte, entry, sz);
 		} else if (unlikely(pte_is_marker(entry))) {
 			const pte_marker marker = copy_pte_marker(softleaf, dst_vma);
@@ -5013,7 +5013,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			}
 
 			if (!userfaultfd_wp(dst_vma))
-				entry = huge_pte_clear_uffd_wp(entry);
+				entry = huge_pte_clear_uffd(entry);
 
 			set_huge_pte_at(dst, addr, dst_pte, entry, sz);
 			hugetlb_count_add(npages, dst);
@@ -5061,9 +5061,9 @@ static void move_huge_pte(struct vm_area_struct *vma, unsigned long old_addr,
 	} else {
 		if (need_clear_uffd_wp) {
 			if (pte_present(pte))
-				pte = huge_pte_clear_uffd_wp(pte);
+				pte = huge_pte_clear_uffd(pte);
 			else
-				pte = pte_swp_clear_uffd_wp(pte);
+				pte = pte_swp_clear_uffd(pte);
 		}
 		set_huge_pte_at(mm, new_addr, dst_pte, pte, sz);
 	}
@@ -5197,7 +5197,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			 * drop the uffd-wp bit in this zap, then replace the
 			 * pte with a marker.
 			 */
-			if (pte_swp_uffd_wp_any(pte) &&
+			if (pte_swp_uffd_any(pte) &&
 			    !(zap_flags & ZAP_FLAG_DROP_MARKER))
 				set_huge_pte_at(mm, address, ptep,
 						make_pte_marker(PTE_MARKER_UFFD_WP),
@@ -5233,7 +5233,7 @@ void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		if (huge_pte_dirty(pte))
 			folio_mark_dirty(folio);
 		/* Leave a uffd-wp pte marker if needed */
-		if (huge_pte_uffd_wp(pte) &&
+		if (huge_pte_uffd(pte) &&
 		    !(zap_flags & ZAP_FLAG_DROP_MARKER))
 			set_huge_pte_at(mm, address, ptep,
 					make_pte_marker(PTE_MARKER_UFFD_WP),
@@ -5437,7 +5437,7 @@ static vm_fault_t hugetlb_wp(struct vm_fault *vmf)
 	 * can trigger this, because hugetlb_fault() will always resolve
 	 * uffd-wp bit first.
 	 */
-	if (!unshare && huge_pte_uffd_wp(pte))
+	if (!unshare && huge_pte_uffd(pte))
 		return 0;
 
 	/* Let's take out MAP_SHARED mappings first. */
@@ -5581,8 +5581,8 @@ static vm_fault_t hugetlb_wp(struct vm_fault *vmf)
 		huge_ptep_clear_flush(vma, vmf->address, vmf->pte);
 		hugetlb_remove_rmap(old_folio);
 		hugetlb_add_new_anon_rmap(new_folio, vma, vmf->address);
-		if (huge_pte_uffd_wp(pte))
-			newpte = huge_pte_mkuffd_wp(newpte);
+		if (huge_pte_uffd(pte))
+			newpte = huge_pte_mkuffd(newpte);
 		set_huge_pte_at(mm, vmf->address, vmf->pte, newpte,
 				huge_page_size(h));
 		folio_set_hugetlb_migratable(new_folio);
@@ -5860,7 +5860,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
 	 * if populated.
*/ if (unlikely(pte_is_uffd_wp_marker(vmf->orig_pte))) - new_pte =3D huge_pte_mkuffd_wp(new_pte); + new_pte =3D huge_pte_mkuffd(new_pte); set_huge_pte_at(mm, vmf->address, vmf->pte, new_pte, huge_page_size(h)); =20 hugetlb_count_add(pages_per_huge_page(h), mm); @@ -6058,7 +6058,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct= vm_area_struct *vma, goto out_ptl; =20 /* Handle userfault-wp first, before trying to lock more pages */ - if (userfaultfd_wp(vma) && huge_pte_uffd_wp(huge_ptep_get(mm, vmf.address= , vmf.pte)) && + if (userfaultfd_wp(vma) && huge_pte_uffd(huge_ptep_get(mm, vmf.address, v= mf.pte)) && (flags & FAULT_FLAG_WRITE) && !huge_pte_write(vmf.orig_pte)) { if (!userfaultfd_wp_async(vma)) { spin_unlock(vmf.ptl); @@ -6067,7 +6067,7 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct= vm_area_struct *vma, return handle_userfault(&vmf, VM_UFFD_WP); } =20 - vmf.orig_pte =3D huge_pte_clear_uffd_wp(vmf.orig_pte); + vmf.orig_pte =3D huge_pte_clear_uffd(vmf.orig_pte); set_huge_pte_at(mm, vmf.address, vmf.pte, vmf.orig_pte, huge_page_size(hstate_vma(vma))); /* Fallthrough to CoW */ @@ -6352,7 +6352,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte, _dst_pte =3D pte_mkyoung(_dst_pte); =20 if (wp_enabled) - _dst_pte =3D huge_pte_mkuffd_wp(_dst_pte); + _dst_pte =3D huge_pte_mkuffd(_dst_pte); =20 set_huge_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte, size); =20 @@ -6476,9 +6476,9 @@ long hugetlb_change_protection(struct vm_area_struct = *vma, } =20 if (uffd_wp) - newpte =3D pte_swp_mkuffd_wp(newpte); + newpte =3D pte_swp_mkuffd(newpte); else if (uffd_wp_resolve) - newpte =3D pte_swp_clear_uffd_wp(newpte); + newpte =3D pte_swp_clear_uffd(newpte); if (!pte_same(pte, newpte)) set_huge_pte_at(mm, address, ptep, newpte, psize); } else if (unlikely(pte_is_marker(pte))) { @@ -6499,9 +6499,9 @@ long hugetlb_change_protection(struct vm_area_struct = *vma, pte =3D huge_pte_modify(old_pte, newprot); pte =3D arch_make_huge_pte(pte, shift, vma->vm_flags); if 
(uffd_wp) - pte =3D huge_pte_mkuffd_wp(pte); + pte =3D huge_pte_mkuffd(pte); else if (uffd_wp_resolve) - pte =3D huge_pte_clear_uffd_wp(pte); + pte =3D huge_pte_clear_uffd(pte); huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte); pages++; tlb_remove_huge_tlb_entry(h, &tlb, ptep, address); diff --git a/mm/internal.h b/mm/internal.h index 5602393054f3..9325eefbea6a 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -412,8 +412,8 @@ static inline pte_t pte_move_swp_offset(pte_t pte, long= delta) new =3D pte_swp_mksoft_dirty(new); if (pte_swp_exclusive(pte)) new =3D pte_swp_mkexclusive(new); - if (pte_swp_uffd_wp(pte)) - new =3D pte_swp_mkuffd_wp(new); + if (pte_swp_uffd(pte)) + new =3D pte_swp_mkuffd(new); =20 return new; } diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 4549a020bf73..afa218be15de 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -37,7 +37,7 @@ enum scan_result { SCAN_EXCEED_SWAP_PTE, SCAN_EXCEED_SHARED_PTE, SCAN_PTE_NON_PRESENT, - SCAN_PTE_UFFD_WP, + SCAN_PTE_UFFD, SCAN_PTE_MAPPED_HUGEPAGE, SCAN_LACK_REFERENCED_PAGE, SCAN_PAGE_NULL, @@ -712,8 +712,8 @@ static enum scan_result __collapse_huge_page_isolate(st= ruct vm_area_struct *vma, result =3D SCAN_PTE_NON_PRESENT; goto out; } - if (pte_uffd_wp(pteval)) { - result =3D SCAN_PTE_UFFD_WP; + if (pte_uffd(pteval)) { + result =3D SCAN_PTE_UFFD; goto out; } page =3D vm_normal_page(vma, addr, pteval); @@ -1566,7 +1566,7 @@ static int mthp_collapse(struct mm_struct *mm, struct= vm_area_struct *vma, case SCAN_PAGE_NULL: case SCAN_DEL_PAGE_LRU: case SCAN_PTE_NON_PRESENT: - case SCAN_PTE_UFFD_WP: + case SCAN_PTE_UFFD: case SCAN_ALLOC_HUGE_PAGE_FAIL: case SCAN_PAGE_LAZYFREE: goto next_order; @@ -1666,15 +1666,15 @@ static enum scan_result collapse_scan_pmd(struct mm= _struct *mm, /* * Always be strict with uffd-wp * enabled swap entries. Please see - * comment below for pte_uffd_wp(). + * comment below for pte_uffd(). 
*/ - if (pte_swp_uffd_wp_any(pteval)) { - result =3D SCAN_PTE_UFFD_WP; + if (pte_swp_uffd_any(pteval)) { + result =3D SCAN_PTE_UFFD; goto out_unmap; } continue; } - if (pte_uffd_wp(pteval)) { + if (pte_uffd(pteval)) { /* * Don't collapse the page if any of the small * PTEs are armed with uffd write protection. @@ -1684,7 +1684,7 @@ static enum scan_result collapse_scan_pmd(struct mm_s= truct *mm, * userfault messages that falls outside of * the registered range. So, just be simple. */ - result =3D SCAN_PTE_UFFD_WP; + result =3D SCAN_PTE_UFFD; goto out_unmap; } =20 @@ -1897,7 +1897,7 @@ static enum scan_result try_collapse_pte_mapped_thp(s= truct mm_struct *mm, unsign =20 /* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */ if (userfaultfd_wp(vma)) - return SCAN_PTE_UFFD_WP; + return SCAN_PTE_UFFD; =20 folio =3D filemap_lock_folio(vma->vm_file->f_mapping, linear_page_index(vma, haddr)); @@ -3244,7 +3244,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsi= gned long start, /* Whitelisted set of results where continuing OK */ case SCAN_NO_PTE_TABLE: case SCAN_PTE_NON_PRESENT: - case SCAN_PTE_UFFD_WP: + case SCAN_PTE_UFFD: case SCAN_LACK_REFERENCED_PAGE: case SCAN_PAGE_NULL: case SCAN_PAGE_COUNT: diff --git a/mm/memory.c b/mm/memory.c index 7c020995eafc..c4fd5cb4a08f 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -893,8 +893,8 @@ static void restore_exclusive_pte(struct vm_area_struct= *vma, if (pte_swp_soft_dirty(orig_pte)) pte =3D pte_mksoft_dirty(pte); =20 - if (pte_swp_uffd_wp(orig_pte)) - pte =3D pte_mkuffd_wp(pte); + if (pte_swp_uffd(orig_pte)) + pte =3D pte_mkuffd(pte); =20 if ((vma->vm_flags & VM_WRITE) && can_change_pte_writable(vma, address, pte)) { @@ -984,8 +984,8 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm= _struct *src_mm, pte =3D softleaf_to_pte(entry); if (pte_swp_soft_dirty(orig_pte)) pte =3D pte_swp_mksoft_dirty(pte); - if (pte_swp_uffd_wp(orig_pte)) - pte =3D pte_swp_mkuffd_wp(pte); + if 
(pte_swp_uffd(orig_pte)) + pte =3D pte_swp_mkuffd(pte); set_pte_at(src_mm, addr, src_pte, pte); } } else if (softleaf_is_device_private(entry)) { @@ -1018,8 +1018,8 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct = mm_struct *src_mm, entry =3D make_readable_device_private_entry( swp_offset(entry)); pte =3D swp_entry_to_pte(entry); - if (pte_swp_uffd_wp(orig_pte)) - pte =3D pte_swp_mkuffd_wp(pte); + if (pte_swp_uffd(orig_pte)) + pte =3D pte_swp_mkuffd(pte); set_pte_at(src_mm, addr, src_pte, pte); } } else if (softleaf_is_device_exclusive(entry)) { @@ -1042,7 +1042,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct = mm_struct *src_mm, return 0; } if (!userfaultfd_wp(dst_vma)) - pte =3D pte_swp_clear_uffd_wp(pte); + pte =3D pte_swp_clear_uffd(pte); set_pte_at(dst_mm, addr, dst_pte, pte); return 0; } @@ -1090,7 +1090,7 @@ copy_present_page(struct vm_area_struct *dst_vma, str= uct vm_area_struct *src_vma pte =3D maybe_mkwrite(pte_mkdirty(pte), dst_vma); if (userfaultfd_pte_wp(dst_vma, ptep_get(src_pte))) /* Uffd-wp needs to be delivered to dest pte as well */ - pte =3D pte_mkuffd_wp(pte); + pte =3D pte_mkuffd(pte); set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte); return 0; } @@ -1113,7 +1113,7 @@ static __always_inline void __copy_present_ptes(struc= t vm_area_struct *dst_vma, pte =3D pte_mkold(pte); =20 if (!userfaultfd_wp(dst_vma)) - pte =3D pte_clear_uffd_wp(pte); + pte =3D pte_clear_uffd(pte); =20 set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr); } @@ -3925,8 +3925,8 @@ static vm_fault_t wp_page_copy(struct vm_fault *vmf) if (unlikely(unshare)) { if (pte_soft_dirty(vmf->orig_pte)) entry =3D pte_mksoft_dirty(entry); - if (pte_uffd_wp(vmf->orig_pte)) - entry =3D pte_mkuffd_wp(entry); + if (pte_uffd(vmf->orig_pte)) + entry =3D pte_mkuffd(entry); } else { entry =3D maybe_mkwrite(pte_mkdirty(entry), vma); } @@ -4261,7 +4261,7 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf) * etc.) 
because we're only removing the uffd-wp bit, * which is completely invisible to the user. */ - pte =3D pte_clear_uffd_wp(ptep_get(vmf->pte)); + pte =3D pte_clear_uffd(ptep_get(vmf->pte)); =20 set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte); /* @@ -5038,8 +5038,8 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) pte =3D mk_pte(page, vma->vm_page_prot); if (pte_swp_soft_dirty(vmf->orig_pte)) pte =3D pte_mksoft_dirty(pte); - if (pte_swp_uffd_wp(vmf->orig_pte)) - pte =3D pte_mkuffd_wp(pte); + if (pte_swp_uffd(vmf->orig_pte)) + pte =3D pte_mkuffd(pte); =20 /* * Same logic as in do_wp_page(); however, optimize for pages that are @@ -5255,7 +5255,7 @@ void map_anon_folio_pte_nopf(struct folio *folio, pte= _t *pte, if (vma->vm_flags & VM_WRITE) entry =3D pte_mkwrite(pte_mkdirty(entry), vma); if (uffd_wp) - entry =3D pte_mkuffd_wp(entry); + entry =3D pte_mkuffd(entry); =20 folio_ref_add(folio, nr_pages - 1); folio_add_new_anon_rmap(folio, vma, addr, RMAP_EXCLUSIVE); @@ -5322,7 +5322,7 @@ static vm_fault_t do_anonymous_page(struct vm_fault *= vmf) return handle_userfault(vmf, VM_UFFD_MISSING); } if (vmf_orig_pte_uffd_wp(vmf)) - entry =3D pte_mkuffd_wp(entry); + entry =3D pte_mkuffd(entry); set_pte_at(vma->vm_mm, addr, vmf->pte, entry); =20 /* No need to invalidate - it was non-present before */ @@ -5572,7 +5572,7 @@ void set_pte_range(struct vm_fault *vmf, struct folio= *folio, else if (pte_write(entry) && folio_test_dirty(folio)) entry =3D pte_mkdirty(entry); if (unlikely(vmf_orig_pte_uffd_wp(vmf))) - entry =3D pte_mkuffd_wp(entry); + entry =3D pte_mkuffd(entry); /* copy-on-write page */ if (write && !(vma->vm_flags & VM_SHARED)) { VM_BUG_ON_FOLIO(nr !=3D 1, folio); diff --git a/mm/migrate.c b/mm/migrate.c index 0c6a0ab6ecce..4bdb5be7afbf 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -326,8 +326,8 @@ static bool try_to_map_unused_to_zeropage(struct page_v= ma_mapped_walk *pvmw, =20 if (pte_swp_soft_dirty(old_pte)) newpte =3D pte_mksoft_dirty(newpte); - if 
(pte_swp_uffd_wp(old_pte)) - newpte =3D pte_mkuffd_wp(newpte); + if (pte_swp_uffd(old_pte)) + newpte =3D pte_mkuffd(newpte); =20 set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte); =20 @@ -391,8 +391,8 @@ static bool remove_migration_pte(struct folio *folio, =20 if (softleaf_is_migration_write(entry)) pte =3D pte_mkwrite(pte, vma); - else if (pte_swp_uffd_wp(old_pte)) - pte =3D pte_mkuffd_wp(pte); + else if (pte_swp_uffd(old_pte)) + pte =3D pte_mkuffd(pte); =20 if (folio_test_anon(folio) && !softleaf_is_migration_read(entry)) rmap_flags |=3D RMAP_EXCLUSIVE; @@ -407,8 +407,8 @@ static bool remove_migration_pte(struct folio *folio, pte =3D softleaf_to_pte(entry); if (pte_swp_soft_dirty(old_pte)) pte =3D pte_swp_mksoft_dirty(pte); - if (pte_swp_uffd_wp(old_pte)) - pte =3D pte_swp_mkuffd_wp(pte); + if (pte_swp_uffd(old_pte)) + pte =3D pte_swp_mkuffd(pte); } =20 #ifdef CONFIG_HUGETLB_PAGE diff --git a/mm/migrate_device.c b/mm/migrate_device.c index 554754eb26ff..17da1bab0248 100644 --- a/mm/migrate_device.c +++ b/mm/migrate_device.c @@ -445,13 +445,13 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp, if (pte_present(pte)) { if (pte_soft_dirty(pte)) swp_pte =3D pte_swp_mksoft_dirty(swp_pte); - if (pte_uffd_wp(pte)) - swp_pte =3D pte_swp_mkuffd_wp(swp_pte); + if (pte_uffd(pte)) + swp_pte =3D pte_swp_mkuffd(swp_pte); } else { if (pte_swp_soft_dirty(pte)) swp_pte =3D pte_swp_mksoft_dirty(swp_pte); - if (pte_swp_uffd_wp(pte)) - swp_pte =3D pte_swp_mkuffd_wp(swp_pte); + if (pte_swp_uffd(pte)) + swp_pte =3D pte_swp_mkuffd(swp_pte); } set_pte_at(mm, addr, ptep, swp_pte); =20 diff --git a/mm/mprotect.c b/mm/mprotect.c index 9cbf932b028c..8340c8b228c6 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -240,8 +240,8 @@ static long change_softleaf_pte(struct vm_area_struct *= vma, */ entry =3D make_readable_device_private_entry(swp_offset(entry)); newpte =3D swp_entry_to_pte(entry); - if (pte_swp_uffd_wp(oldpte)) - newpte =3D pte_swp_mkuffd_wp(newpte); + if 
(pte_swp_uffd(oldpte)) + newpte =3D pte_swp_mkuffd(newpte); } else if (softleaf_is_marker(entry)) { /* * Ignore error swap entries unconditionally, @@ -266,9 +266,9 @@ static long change_softleaf_pte(struct vm_area_struct *= vma, } =20 if (uffd_wp) - newpte =3D pte_swp_mkuffd_wp(newpte); + newpte =3D pte_swp_mkuffd(newpte); else if (uffd_wp_resolve) - newpte =3D pte_swp_clear_uffd_wp(newpte); + newpte =3D pte_swp_clear_uffd(newpte); =20 if (!pte_same(oldpte, newpte)) { set_pte_at(vma->vm_mm, addr, pte, newpte); @@ -290,9 +290,9 @@ static __always_inline void change_present_ptes(struct = mmu_gather *tlb, ptent =3D pte_modify(oldpte, newprot); =20 if (uffd_wp) - ptent =3D pte_mkuffd_wp(ptent); + ptent =3D pte_mkuffd(ptent); else if (uffd_wp_resolve) - ptent =3D pte_clear_uffd_wp(ptent); + ptent =3D pte_clear_uffd(ptent); =20 /* * In some writable, shared mappings, we might want diff --git a/mm/mremap.c b/mm/mremap.c index e9c8b1d05832..12732a5c547e 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -297,9 +297,9 @@ static int move_ptes(struct pagetable_move_control *pmc, else { if (need_clear_uffd_wp) { if (pte_present(pte)) - pte =3D pte_clear_uffd_wp(pte); + pte =3D pte_clear_uffd(pte); else - pte =3D pte_swp_clear_uffd_wp(pte); + pte =3D pte_swp_clear_uffd(pte); } set_ptes(mm, new_addr, new_ptep, pte, nr_ptes); } diff --git a/mm/page_table_check.c b/mm/page_table_check.c index 53a8997ec043..3fb995e5d40d 100644 --- a/mm/page_table_check.c +++ b/mm/page_table_check.c @@ -188,8 +188,8 @@ static inline bool softleaf_cached_writable(softleaf_t = entry) static void page_table_check_pte_flags(pte_t pte) { if (pte_present(pte)) { - WARN_ON_ONCE(pte_uffd_wp(pte) && pte_write(pte)); - } else if (pte_swp_uffd_wp(pte)) { + WARN_ON_ONCE(pte_uffd(pte) && pte_write(pte)); + } else if (pte_swp_uffd(pte)) { const softleaf_t entry =3D softleaf_from_pte(pte); =20 WARN_ON_ONCE(softleaf_cached_writable(entry)); @@ -216,9 +216,9 @@ EXPORT_SYMBOL(__page_table_check_ptes_set); static inline 
void page_table_check_pmd_flags(pmd_t pmd) { if (pmd_present(pmd)) { - if (pmd_uffd_wp(pmd)) + if (pmd_uffd(pmd)) WARN_ON_ONCE(pmd_write(pmd)); - } else if (pmd_swp_uffd_wp(pmd)) { + } else if (pmd_swp_uffd(pmd)) { const softleaf_t entry =3D softleaf_from_pmd(pmd); =20 WARN_ON_ONCE(softleaf_cached_writable(entry)); diff --git a/mm/rmap.c b/mm/rmap.c index 1c77d5dc06e9..546bc1cf9391 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -2318,13 +2318,13 @@ static bool try_to_unmap_one(struct folio *folio, s= truct vm_area_struct *vma, if (likely(pte_present(pteval))) { if (pte_soft_dirty(pteval)) swp_pte =3D pte_swp_mksoft_dirty(swp_pte); - if (pte_uffd_wp(pteval)) - swp_pte =3D pte_swp_mkuffd_wp(swp_pte); + if (pte_uffd(pteval)) + swp_pte =3D pte_swp_mkuffd(swp_pte); } else { if (pte_swp_soft_dirty(pteval)) swp_pte =3D pte_swp_mksoft_dirty(swp_pte); - if (pte_swp_uffd_wp(pteval)) - swp_pte =3D pte_swp_mkuffd_wp(swp_pte); + if (pte_swp_uffd(pteval)) + swp_pte =3D pte_swp_mkuffd(swp_pte); } set_pte_at(mm, address, pvmw.pte, swp_pte); } else { @@ -2692,14 +2692,14 @@ static bool try_to_migrate_one(struct folio *folio,= struct vm_area_struct *vma, swp_pte =3D swp_entry_to_pte(entry); if (pte_soft_dirty(pteval)) swp_pte =3D pte_swp_mksoft_dirty(swp_pte); - if (pte_uffd_wp(pteval)) - swp_pte =3D pte_swp_mkuffd_wp(swp_pte); + if (pte_uffd(pteval)) + swp_pte =3D pte_swp_mkuffd(swp_pte); } else { swp_pte =3D swp_entry_to_pte(entry); if (pte_swp_soft_dirty(pteval)) swp_pte =3D pte_swp_mksoft_dirty(swp_pte); - if (pte_swp_uffd_wp(pteval)) - swp_pte =3D pte_swp_mkuffd_wp(swp_pte); + if (pte_swp_uffd(pteval)) + swp_pte =3D pte_swp_mkuffd(swp_pte); } if (folio_test_hugetlb(folio)) set_huge_pte_at(mm, address, pvmw.pte, swp_pte, diff --git a/mm/swapfile.c b/mm/swapfile.c index e3d126602a1e..15fdca2da1f7 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -2557,8 +2557,8 @@ static int unuse_pte(struct vm_area_struct *vma, pmd_= t *pmd, new_pte =3D pte_mkold(mk_pte(page, vma->vm_page_prot)); if 
(pte_swp_soft_dirty(old_pte)) new_pte =3D pte_mksoft_dirty(new_pte); - if (pte_swp_uffd_wp(old_pte)) - new_pte =3D pte_mkuffd_wp(new_pte); + if (pte_swp_uffd(old_pte)) + new_pte =3D pte_mkuffd(new_pte); setpte: set_pte_at(vma->vm_mm, addr, pte, new_pte); folio_put_swap(swapcache, folio_file_page(swapcache, swp_offset(entry))); diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 35b206cc9aa6..ebce642c8805 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -394,7 +394,7 @@ static int mfill_atomic_install_pte(pmd_t *dst_pmd, if (writable) _dst_pte =3D pte_mkwrite(_dst_pte, dst_vma); if (flags & MFILL_ATOMIC_WP) - _dst_pte =3D pte_mkuffd_wp(_dst_pte); + _dst_pte =3D pte_mkuffd(_dst_pte); =20 ret =3D -EAGAIN; dst_pte =3D pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl); @@ -3571,7 +3571,7 @@ static int userfaultfd_register(struct userfaultfd_ct= x *ctx, if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MISSING) vm_flags |=3D VM_UFFD_MISSING; if (uffdio_register.mode & UFFDIO_REGISTER_MODE_WP) { - if (!pgtable_supports_uffd_wp()) + if (!pgtable_supports_uffd()) goto out; =20 vm_flags |=3D VM_UFFD_WP; @@ -4281,7 +4281,7 @@ static int userfaultfd_api(struct userfaultfd_ctx *ct= x, uffdio_api.features &=3D ~(UFFD_FEATURE_MINOR_HUGETLBFS | UFFD_FEATURE_MINOR_SHMEM); #endif - if (!pgtable_supports_uffd_wp()) + if (!pgtable_supports_uffd()) uffdio_api.features &=3D ~UFFD_FEATURE_PAGEFAULT_FLAG_WP; =20 if (!uffd_supports_wp_marker()) { --=20 2.54.0
From nobody Mon Jun 8 23:56:07 2026
From: "Kiryl Shutsemau (Meta)"
To: akpm@linux-foundation.org, rppt@kernel.org, peterx@redhat.com, david@kernel.org
Cc: ljs@kernel.org, surenb@google.com, vbabka@kernel.org, Liam.Howlett@oracle.com, ziy@nvidia.com, corbet@lwn.net, skhan@linuxfoundation.org, seanjc@google.com, pbonzini@redhat.com, jthoughton@google.com, aarcange@redhat.com, sj@kernel.org, usama.arif@linux.dev, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org, kvm@vger.kernel.org, kernel-team@meta.com, "Kiryl Shutsemau (Meta)"
Subject: [PATCH v4 04/14] mm: add VM_UFFD_RWP VMA flag
Date: Mon, 25 May 2026 12:37:18 +0100
Message-ID: <20260525113737.1942478-5-kas@kernel.org>
X-Mailer: git-send-email 2.54.0
In-Reply-To: <20260525113737.1942478-1-kas@kernel.org>
References: <20260525113737.1942478-1-kas@kernel.org>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Preparatory patch for userfaultfd read-write protection (RWP).
RWP extends userfaultfd protection from plain write-protection (WP) to full read-write protection: accesses to an RWP-protected range -- reads as well as writes -- trap through userfaultfd. Reserve VM_UFFD_RWP, add the userfaultfd_rwp() and userfaultfd_protected() helpers, and wire up the smaps "ur" entry and the trace-flag table the rest of the series will use. The flag is gated on CONFIG_USERFAULTFD_RWP, which is introduced together with the UAPI in a later patch; until then VM_UFFD_RWP aliases VM_NONE and every downstream check folds to dead code. Nothing sets or queries the flag yet. Signed-off-by: Kiryl Shutsemau Assisted-by: Claude:claude-opus-4-6 Reviewed-by: Mike Rapoport (Microsoft) Reviewed-by: SeongJae Park --- Documentation/filesystems/proc.rst | 1 + fs/proc/task_mmu.c | 3 +++ include/linux/mm.h | 28 ++++++++++++++++--------- include/linux/userfaultfd_k.h | 33 ++++++++++++++++++++++++------ include/trace/events/mmflags.h | 7 +++++++ 5 files changed, 56 insertions(+), 16 deletions(-) diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems= /proc.rst index db6167befb7b..db28207c5290 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -607,6 +607,7 @@ encoded manner. 
The codes are the following: um userfaultfd missing tracking uw userfaultfd wr-protect tracking ui userfaultfd minor fault + ur userfaultfd read-write-protect tracking ss shadow/guarded control stack page sl sealed lf lock on fault pages diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index cbd164f4928f..5e74dadfb1cb 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -1237,6 +1237,9 @@ static void show_smap_vma_flags(struct seq_file *m, s= truct vm_area_struct *vma) #ifdef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR [ilog2(VM_UFFD_MINOR)] =3D "ui", #endif /* CONFIG_HAVE_ARCH_USERFAULTFD_MINOR */ +#ifdef CONFIG_USERFAULTFD_RWP + [ilog2(VM_UFFD_RWP)] =3D "ur", +#endif #ifdef CONFIG_ARCH_HAS_USER_SHADOW_STACK [ilog2(VM_SHADOW_STACK)] =3D "ss", #endif diff --git a/include/linux/mm.h b/include/linux/mm.h index 0f2612a70fb1..3d0a5ac3c717 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -353,6 +353,7 @@ enum { #endif DECLARE_VMA_BIT(UFFD_MINOR, 41), DECLARE_VMA_BIT(SEALED, 42), + DECLARE_VMA_BIT(UFFD_RWP, 43), /* Flags that reuse flags above. */ DECLARE_VMA_BIT_ALIAS(PKEY_BIT0, HIGH_ARCH_0), DECLARE_VMA_BIT_ALIAS(PKEY_BIT1, HIGH_ARCH_1), @@ -496,6 +497,11 @@ enum { #else #define VM_UFFD_MINOR VM_NONE #endif +#ifdef CONFIG_USERFAULTFD_RWP +#define VM_UFFD_RWP INIT_VM_FLAG(UFFD_RWP) +#else +#define VM_UFFD_RWP VM_NONE +#endif #ifdef CONFIG_64BIT #define VM_ALLOW_ANY_UNCACHED INIT_VM_FLAG(ALLOW_ANY_UNCACHED) #define VM_SEALED INIT_VM_FLAG(SEALED) @@ -633,22 +639,24 @@ enum { * reconsistuted upon page fault, so necessitate page table copying upon f= ork. * * Note that these flags should be compared with the DESTINATION VMA not t= he - * source, as VM_UFFD_WP may not be propagated to destination, while all o= ther - * flags will be. + * source: VM_UFFD_WP and VM_UFFD_RWP may be cleared on the destination + * (dup_userfaultfd() -> userfaultfd_reset_ctx() when the parent context d= id + * not negotiate UFFD_FEATURE_EVENT_FORK), while all other flags propagate. 
* * VM_PFNMAP / VM_MIXEDMAP - These contain kernel-mapped data which cannot= be * reasonably reconstructed on page fault. * * VM_UFFD_WP - Encodes metadata about an installed uffd - * write protect handler, which cannot be - * reconstructed on page fault. + * VM_UFFD_RWP write- or read-write-protect handler, which + * cannot be reconstructed on page fault. * - * We always copy pgtables when dst_vma has uffd= -wp - * enabled even if it's file-backed - * (e.g. shmem). Because when uffd-wp is enabled, - * pgtable contains uffd-wp protection informati= on, - * that's something we can't retrieve from page = cache, - * and skip copying will lose those info. + * We always copy pgtables when dst_vma has the + * uffd PTE bit in use even if it's file-backed + * (e.g. shmem). Because when the uffd bit is + * in use, the pgtable contains the protection + * information, that's something we can't + * retrieve from page cache, and skip copying + * will lose those info. * * VM_MAYBE_GUARD - Could contain page guard region markers which * by design are a property of the page tables diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index f4cf5763f92c..87a8cebd5938 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -21,10 +21,11 @@ #include =20 /* The set of all possible UFFD-related VM flags. 
*/ -#define __VM_UFFD_FLAGS (VM_UFFD_MISSING | VM_UFFD_WP | VM_UFFD_MINOR) +#define __VM_UFFD_FLAGS (VM_UFFD_MISSING | VM_UFFD_MINOR | \ + VM_UFFD_WP | VM_UFFD_RWP) =20 -#define __VMA_UFFD_FLAGS mk_vma_flags(VMA_UFFD_MISSING_BIT, VMA_UFFD_WP_BI= T, \ - VMA_UFFD_MINOR_BIT) +#define __VMA_UFFD_FLAGS mk_vma_flags(VMA_UFFD_MISSING_BIT, VMA_UFFD_MINOR= _BIT, \ + VMA_UFFD_WP_BIT, VMA_UFFD_RWP_BIT) =20 /* * CAREFUL: Check include/uapi/asm-generic/fcntl.h when defining @@ -178,7 +179,7 @@ static inline bool is_mergeable_vm_userfaultfd_ctx(stru= ct vm_area_struct *vma, */ static inline bool uffd_disable_huge_pmd_share(struct vm_area_struct *vma) { - return vma->vm_flags & (VM_UFFD_WP | VM_UFFD_MINOR); + return vma->vm_flags & (VM_UFFD_MINOR | VM_UFFD_WP | VM_UFFD_RWP); } =20 /* @@ -208,6 +209,16 @@ static inline bool userfaultfd_minor(struct vm_area_st= ruct *vma) return vma->vm_flags & VM_UFFD_MINOR; } =20 +static inline bool userfaultfd_rwp(struct vm_area_struct *vma) +{ + return vma->vm_flags & VM_UFFD_RWP; +} + +static inline bool userfaultfd_protected(struct vm_area_struct *vma) +{ + return userfaultfd_wp(vma) || userfaultfd_rwp(vma); +} + static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma, pte_t pte) { @@ -328,6 +339,16 @@ static inline bool userfaultfd_minor(struct vm_area_st= ruct *vma) return false; } =20 +static inline bool userfaultfd_rwp(struct vm_area_struct *vma) +{ + return false; +} + +static inline bool userfaultfd_protected(struct vm_area_struct *vma) +{ + return false; +} + static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma, pte_t pte) { @@ -421,8 +442,8 @@ static inline bool userfaultfd_wp_use_markers(struct vm= _area_struct *vma) } =20 /* - * Returns true if this is a swap pte and was uffd-wp wr-protected in eith= er - * forms (pte marker or a normal swap pte), false otherwise. + * Returns true if this swap pte carries uffd-tracked state in either + * form (pte marker or a normal swap pte), false otherwise. 
*/ static inline bool pte_swp_uffd_any(pte_t pte) { diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h index a6e5a44c9b42..bfface3d0203 100644 --- a/include/trace/events/mmflags.h +++ b/include/trace/events/mmflags.h @@ -194,6 +194,12 @@ IF_HAVE_PG_ARCH_3(arch_3) # define IF_HAVE_UFFD_MINOR(flag, name) #endif =20 +#ifdef CONFIG_USERFAULTFD_RWP +# define IF_HAVE_UFFD_RWP(flag, name) {flag, name}, +#else +# define IF_HAVE_UFFD_RWP(flag, name) +#endif + #if defined(CONFIG_64BIT) || defined(CONFIG_PPC32) # define IF_HAVE_VM_DROPPABLE(flag, name) {flag, name}, #else @@ -215,6 +221,7 @@ IF_HAVE_UFFD_MINOR(VM_UFFD_MINOR, "uffd_minor" ) \ {VM_PFNMAP, "pfnmap" }, \ {VM_MAYBE_GUARD, "maybe_guard" }, \ {VM_UFFD_WP, "uffd_wp" }, \ +IF_HAVE_UFFD_RWP(VM_UFFD_RWP, "uffd_rwp" ) \ {VM_LOCKED, "locked" }, \ {VM_IO, "io" }, \ {VM_SEQ_READ, "seqread" }, \ --=20 2.54.0 From nobody Mon Jun 8 23:56:07 2026 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-alma10-1.taild15c8.ts.net [100.103.45.18]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 213393537DE; Mon, 25 May 2026 11:38:34 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=100.103.45.18 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779709115; cv=none; b=dcG4UlCaEQQuvX7HNWjwW+DCTd1ll/RQrq8RC7VC4bpsawY3SVh/1/QCCmeGIruoQRXRqWC74JlYYx5MSYTBGBVInoif0wUUMCnA/MfIp8nRf3q5DSXmlc+qwL0UVfKQ2LMZ3HdXJurXTPsfRFPKAQEfi9E4/F7pcWWEgGf+iW0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779709115; c=relaxed/simple; bh=nH+DV8TMFPp7poGzqh3KlZ+MSgriT4DZeEGWEghoZfs=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=FxFal5B3ju/4UwenoyZFS7zc8BQgLY6bJm0LB4R+q+pabrv5erPLDLHuV0SVr854/wR9J9F1xLt/9n9tbkR7+YnXpD0FOk/jLg9FU7Plvebz9sPfgyT+n8yGAyDRQjQMXD24Z/gEwwjIk476MK7DisQbT98jWt8ZZHb2pkEqlao= 
From: "Kiryl Shutsemau (Meta)"
To: akpm@linux-foundation.org, rppt@kernel.org, peterx@redhat.com,
	david@kernel.org
Cc: ljs@kernel.org, surenb@google.com, vbabka@kernel.org,
	Liam.Howlett@oracle.com, ziy@nvidia.com, corbet@lwn.net,
	skhan@linuxfoundation.org, seanjc@google.com, pbonzini@redhat.com,
	jthoughton@google.com, aarcange@redhat.com, sj@kernel.org,
	usama.arif@linux.dev, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-kselftest@vger.kernel.org, kvm@vger.kernel.org,
	kernel-team@meta.com, "Kiryl Shutsemau (Meta)"
Subject: [PATCH v4 05/14] mm: add MM_CP_UFFD_RWP change_protection() flag
Date: Mon, 25 May 2026 12:37:19 +0100
Message-ID: <20260525113737.1942478-6-kas@kernel.org>
X-Mailer: git-send-email 2.54.0
In-Reply-To: <20260525113737.1942478-1-kas@kernel.org>
References: <20260525113737.1942478-1-kas@kernel.org>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Preparatory patch. Add the change_protection() primitive that userfaultfd
RWP will use.

An RWP-protected PTE is PAGE_NONE with the uffd PTE bit set. The
PROT_NONE half makes the CPU fault on any access; the uffd bit
distinguishes an RWP fault from a plain mprotect(PROT_NONE) or NUMA
hinting fault.

MM_CP_UFFD_WP and MM_CP_UFFD_RWP share the same PTE bit, so the two
cannot be used together on the same range.
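
To illustrate the exclusivity rule above, here is a minimal userspace
sketch of the flag combinations that change_protection() rejects in this
patch. It is a model, not kernel code. The RWP values are taken from the
include/linux/mm.h hunk of this patch; MM_CP_UFFD_WP at bit 2 and the
helper name uffd_cp_flags_valid() are assumptions made for illustration.

```c
#include <stdbool.h>

/*
 * Userspace model only. The RWP values follow the mm.h hunk in this
 * patch; MM_CP_UFFD_WP at bit 2 is an assumption.
 */
#define MM_CP_UFFD_WP		(1UL << 2)
#define MM_CP_UFFD_WP_RESOLVE	(1UL << 3)
#define MM_CP_UFFD_WP_ALL	(MM_CP_UFFD_WP | MM_CP_UFFD_WP_RESOLVE)
#define MM_CP_UFFD_RWP		(1UL << 4)
#define MM_CP_UFFD_RWP_RESOLVE	(1UL << 5)
#define MM_CP_UFFD_RWP_ALL	(MM_CP_UFFD_RWP | MM_CP_UFFD_RWP_RESOLVE)

/*
 * Hypothetical helper: returns true when the uffd bits in cp_flags form
 * a combination that the check in change_protection() would accept.
 */
static bool uffd_cp_flags_valid(unsigned long cp_flags)
{
	unsigned long wp = cp_flags & MM_CP_UFFD_WP_ALL;
	unsigned long rwp = cp_flags & MM_CP_UFFD_RWP_ALL;

	/* Protect and resolve are mutually exclusive within one change. */
	if (wp == MM_CP_UFFD_WP_ALL || rwp == MM_CP_UFFD_RWP_ALL)
		return false;

	/* WP and RWP share the uffd PTE bit, so they cannot be mixed. */
	if (wp && rwp)
		return false;

	return true;
}
```

Under these assumptions, uffd_cp_flags_valid(MM_CP_UFFD_RWP) is true,
while uffd_cp_flags_valid(MM_CP_UFFD_WP | MM_CP_UFFD_RWP) is false. In
the patch itself the rejected cases hit a WARN_ON_ONCE() and
change_protection() returns 0.
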

Two new change_protection() flags:

  MM_CP_UFFD_RWP          install PAGE_NONE and set the uffd bit
  MM_CP_UFFD_RWP_RESOLVE  restore vma->vm_page_prot, clear the uffd bit

Both are wired through change_pte_range(), change_huge_pmd(), and
hugetlb_change_protection() so anon, shmem, THP, and hugetlb all share
the same semantics.

Signed-off-by: Kiryl Shutsemau
Assisted-by: Claude:claude-opus-4-6
Reviewed-by: Mike Rapoport (Microsoft)
---
 include/linux/mm.h            |  5 ++++
 include/linux/userfaultfd_k.h |  1 -
 mm/huge_memory.c              | 30 +++++++++++++----------
 mm/hugetlb.c                  | 25 ++++++++++++++-----
 mm/mprotect.c                 | 46 +++++++++++++++++++++++++++--------
 5 files changed, 77 insertions(+), 30 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 3d0a5ac3c717..ecbf3e83a892 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3286,6 +3286,11 @@ int get_cmdline(struct task_struct *task, char *buffer, int buflen);
 #define MM_CP_UFFD_WP_RESOLVE		(1UL << 3) /* Resolve wp */
 #define MM_CP_UFFD_WP_ALL		(MM_CP_UFFD_WP | \
 					 MM_CP_UFFD_WP_RESOLVE)
+/* Whether this change is for uffd RWP */
+#define MM_CP_UFFD_RWP			(1UL << 4) /* do rwp */
+#define MM_CP_UFFD_RWP_RESOLVE		(1UL << 5) /* resolve rwp */
+#define MM_CP_UFFD_RWP_ALL		(MM_CP_UFFD_RWP | \
+					 MM_CP_UFFD_RWP_RESOLVE)
 
 bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
 			     pte_t pte);
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index 87a8cebd5938..16fbe11c0c55 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -361,7 +361,6 @@ static inline bool userfaultfd_huge_pmd_wp(struct vm_area_struct *vma,
 	return false;
 }
 
-
 static inline bool userfaultfd_armed(struct vm_area_struct *vma)
 {
 	return false;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6017c73c92a0..0d05abb0cd81 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2640,8 +2640,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 }
 
 static void
change_non_present_huge_pmd(struct mm_struct *mm, - unsigned long addr, pmd_t *pmd, bool uffd_wp, - bool uffd_wp_resolve) + unsigned long addr, pmd_t *pmd, bool uffd_prot, + bool uffd_prot_resolve) { softleaf_t entry =3D softleaf_from_pmd(*pmd); const struct folio *folio =3D softleaf_to_folio(entry); @@ -2667,9 +2667,9 @@ static void change_non_present_huge_pmd(struct mm_str= uct *mm, newpmd =3D *pmd; } =20 - if (uffd_wp) + if (uffd_prot) newpmd =3D pmd_swp_mkuffd(newpmd); - else if (uffd_wp_resolve) + else if (uffd_prot_resolve) newpmd =3D pmd_swp_clear_uffd(newpmd); if (!pmd_same(*pmd, newpmd)) set_pmd_at(mm, addr, pmd, newpmd); @@ -2690,8 +2690,9 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm= _area_struct *vma, spinlock_t *ptl; pmd_t oldpmd, entry; bool prot_numa =3D cp_flags & MM_CP_PROT_NUMA; - bool uffd_wp =3D cp_flags & MM_CP_UFFD_WP; - bool uffd_wp_resolve =3D cp_flags & MM_CP_UFFD_WP_RESOLVE; + bool uffd_prot =3D cp_flags & (MM_CP_UFFD_WP | MM_CP_UFFD_RWP); + bool uffd_prot_resolve =3D cp_flags & + (MM_CP_UFFD_WP_RESOLVE | MM_CP_UFFD_RWP_RESOLVE); int ret =3D 1; =20 tlb_change_page_size(tlb, HPAGE_PMD_SIZE); @@ -2704,11 +2705,17 @@ int change_huge_pmd(struct mmu_gather *tlb, struct = vm_area_struct *vma, return 0; =20 if (thp_migration_supported() && pmd_is_valid_softleaf(*pmd)) { - change_non_present_huge_pmd(mm, addr, pmd, uffd_wp, - uffd_wp_resolve); + change_non_present_huge_pmd(mm, addr, pmd, uffd_prot, + uffd_prot_resolve); goto unlock; } =20 + /* Already in the desired state */ + if (prot_numa && pmd_protnone(*pmd)) + goto unlock; + if ((cp_flags & MM_CP_UFFD_RWP) && pmd_protnone(*pmd) && pmd_uffd(*pmd)) + goto unlock; + if (prot_numa) { =20 /* @@ -2719,9 +2726,6 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm= _area_struct *vma, if (is_huge_zero_pmd(*pmd)) goto unlock; =20 - if (pmd_protnone(*pmd)) - goto unlock; - if (!folio_can_map_prot_numa(pmd_folio(*pmd), vma, vma_is_single_threaded_private(vma))) goto unlock; @@ -2750,9 
+2754,9 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm= _area_struct *vma, oldpmd =3D pmdp_invalidate_ad(vma, addr, pmd); =20 entry =3D pmd_modify(oldpmd, newprot); - if (uffd_wp) + if (uffd_prot) entry =3D pmd_mkuffd(entry); - else if (uffd_wp_resolve) + else if (uffd_prot_resolve) /* * Leave the write bit to be handled by PF interrupt * handler, then things like COW could be properly diff --git a/mm/hugetlb.c b/mm/hugetlb.c index d0c81a056ae2..4d75b69d4272 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -6395,6 +6395,8 @@ long hugetlb_change_protection(struct vm_area_struct = *vma, unsigned long last_addr_mask; bool uffd_wp =3D cp_flags & MM_CP_UFFD_WP; bool uffd_wp_resolve =3D cp_flags & MM_CP_UFFD_WP_RESOLVE; + bool uffd_rwp =3D cp_flags & MM_CP_UFFD_RWP; + bool uffd_rwp_resolve =3D cp_flags & MM_CP_UFFD_RWP_RESOLVE; struct mmu_gather tlb; =20 /* @@ -6420,6 +6422,11 @@ long hugetlb_change_protection(struct vm_area_struct= *vma, =20 ptep =3D hugetlb_walk(vma, address, psize); if (!ptep) { + /* + * uffd_wp installs a pte marker on the unpopulated + * entry; uffd_rwp does not install markers so the + * allocation is unnecessary for it. + */ if (!uffd_wp) { address |=3D last_addr_mask; continue; @@ -6441,7 +6448,8 @@ long hugetlb_change_protection(struct vm_area_struct = *vma, * shouldn't happen at all. Warn about it if it * happened due to some reason. 
*/ - WARN_ON_ONCE(uffd_wp || uffd_wp_resolve); + WARN_ON_ONCE(uffd_wp || uffd_wp_resolve || + uffd_rwp || uffd_rwp_resolve); pages++; spin_unlock(ptl); address |=3D last_addr_mask; @@ -6475,9 +6483,9 @@ long hugetlb_change_protection(struct vm_area_struct = *vma, pages++; } =20 - if (uffd_wp) + if (uffd_wp || uffd_rwp) newpte =3D pte_swp_mkuffd(newpte); - else if (uffd_wp_resolve) + else if (uffd_wp_resolve || uffd_rwp_resolve) newpte =3D pte_swp_clear_uffd(newpte); if (!pte_same(pte, newpte)) set_huge_pte_at(mm, address, ptep, newpte, psize); @@ -6488,19 +6496,24 @@ long hugetlb_change_protection(struct vm_area_struc= t *vma, * pte_marker_uffd_wp()=3D=3Dtrue implies !poison * because they're mutual exclusive. */ - if (pte_is_uffd_wp_marker(pte) && uffd_wp_resolve) + if (pte_is_uffd_wp_marker(pte) && + (uffd_wp_resolve || uffd_rwp_resolve)) /* Safe to modify directly (non-present->none). */ huge_pte_clear(mm, address, ptep, psize); } else { pte_t old_pte; unsigned int shift =3D huge_page_shift(hstate_vma(vma)); =20 + /* Already protnone with uffd bit set? Nothing to do. 
*/ + if (uffd_rwp && pte_protnone(pte) && huge_pte_uffd(pte)) + goto next; + old_pte =3D huge_ptep_modify_prot_start(vma, address, ptep); pte =3D huge_pte_modify(old_pte, newprot); pte =3D arch_make_huge_pte(pte, shift, vma->vm_flags); - if (uffd_wp) + if (uffd_wp || uffd_rwp) pte =3D huge_pte_mkuffd(pte); - else if (uffd_wp_resolve) + else if (uffd_wp_resolve || uffd_rwp_resolve) pte =3D huge_pte_clear_uffd(pte); huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte); pages++; diff --git a/mm/mprotect.c b/mm/mprotect.c index 8340c8b228c6..4a6b35482aee 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -214,8 +214,9 @@ static __always_inline void set_write_prot_commit_flush= _ptes(struct vm_area_stru static long change_softleaf_pte(struct vm_area_struct *vma, unsigned long addr, pte_t *pte, pte_t oldpte, unsigned long cp_flags) { - const bool uffd_wp =3D cp_flags & MM_CP_UFFD_WP; - const bool uffd_wp_resolve =3D cp_flags & MM_CP_UFFD_WP_RESOLVE; + const bool uffd_prot =3D cp_flags & (MM_CP_UFFD_WP | MM_CP_UFFD_RWP); + const bool uffd_prot_resolve =3D cp_flags & + (MM_CP_UFFD_WP_RESOLVE | MM_CP_UFFD_RWP_RESOLVE); softleaf_t entry =3D softleaf_from_pte(oldpte); pte_t newpte; =20 @@ -256,7 +257,7 @@ static long change_softleaf_pte(struct vm_area_struct *= vma, * to unprotect it, drop it; the next page * fault will trigger without uffd trapping. 
*/ - if (uffd_wp_resolve) { + if (uffd_prot_resolve) { pte_clear(vma->vm_mm, addr, pte); return 1; } @@ -265,9 +266,9 @@ static long change_softleaf_pte(struct vm_area_struct *= vma, newpte =3D oldpte; } =20 - if (uffd_wp) + if (uffd_prot) newpte =3D pte_swp_mkuffd(newpte); - else if (uffd_wp_resolve) + else if (uffd_prot_resolve) newpte =3D pte_swp_clear_uffd(newpte); =20 if (!pte_same(oldpte, newpte)) { @@ -282,16 +283,17 @@ static __always_inline void change_present_ptes(struc= t mmu_gather *tlb, int nr_ptes, unsigned long end, pgprot_t newprot, struct folio *folio, struct page *page, unsigned long cp_flags) { - const bool uffd_wp_resolve =3D cp_flags & MM_CP_UFFD_WP_RESOLVE; - const bool uffd_wp =3D cp_flags & MM_CP_UFFD_WP; + const bool uffd_prot =3D cp_flags & (MM_CP_UFFD_WP | MM_CP_UFFD_RWP); + const bool uffd_prot_resolve =3D cp_flags & + (MM_CP_UFFD_WP_RESOLVE | MM_CP_UFFD_RWP_RESOLVE); pte_t ptent, oldpte; =20 oldpte =3D modify_prot_start_ptes(vma, addr, ptep, nr_ptes); ptent =3D pte_modify(oldpte, newprot); =20 - if (uffd_wp) + if (uffd_prot) ptent =3D pte_mkuffd(ptent); - else if (uffd_wp_resolve) + else if (uffd_prot_resolve) ptent =3D pte_clear_uffd(ptent); =20 /* @@ -325,6 +327,7 @@ static long change_pte_range(struct mmu_gather *tlb, long pages =3D 0; bool is_private_single_threaded; bool prot_numa =3D cp_flags & MM_CP_PROT_NUMA; + bool uffd_rwp =3D cp_flags & MM_CP_UFFD_RWP; bool uffd_wp =3D cp_flags & MM_CP_UFFD_WP; int nr_ptes; =20 @@ -350,6 +353,14 @@ static long change_pte_range(struct mmu_gather *tlb, /* Already in the desired state. */ if (prot_numa && pte_protnone(oldpte)) continue; + /* + * RWP-protected PTEs carry _PAGE_UFFD as a marker on + * top of PROT_NONE. Skip only entries already in that + * exact state; plain PROT_NONE from mprotect() still needs + * to be promoted so future faults can be distinguished. 
+ */ + if (uffd_rwp && pte_protnone(oldpte) && pte_uffd(oldpte)) + continue; =20 page =3D vm_normal_page(vma, addr, oldpte); if (page) @@ -358,6 +369,8 @@ static long change_pte_range(struct mmu_gather *tlb, /* * Avoid trapping faults against the zero or KSM * pages. See similar comment in change_huge_pmd. + * Skip this filter for uffd RWP which + * must set protnone regardless of NUMA placement. */ if (prot_numa && !folio_can_map_prot_numa(folio, vma, @@ -667,7 +680,16 @@ long change_protection(struct mmu_gather *tlb, pgprot_t newprot =3D vma->vm_page_prot; long pages; =20 - BUG_ON((cp_flags & MM_CP_UFFD_WP_ALL) =3D=3D MM_CP_UFFD_WP_ALL); + /* + * MM_CP_UFFD_{WP,RWP} and _RESOLVE are mutually exclusive within one + * change, and WP and RWP cannot mix. Miswired callers get a warn and + * a no-op; userspace cannot reach this state. + */ + if (WARN_ON_ONCE((cp_flags & MM_CP_UFFD_WP_ALL) =3D=3D MM_CP_UFFD_WP_ALL = || + (cp_flags & MM_CP_UFFD_RWP_ALL) =3D=3D MM_CP_UFFD_RWP_ALL || + ((cp_flags & MM_CP_UFFD_WP_ALL) && + (cp_flags & MM_CP_UFFD_RWP_ALL)))) + return 0; =20 #ifdef CONFIG_NUMA_BALANCING /* @@ -681,6 +703,10 @@ long change_protection(struct mmu_gather *tlb, WARN_ON_ONCE(cp_flags & MM_CP_PROT_NUMA); #endif =20 + if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_PROTNONE) && + (cp_flags & MM_CP_UFFD_RWP)) + newprot =3D PAGE_NONE; + if (is_vm_hugetlb_page(vma)) pages =3D hugetlb_change_protection(vma, start, end, newprot, cp_flags); --=20 2.54.0 From nobody Mon Jun 8 23:56:07 2026 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-alma10-1.taild15c8.ts.net [100.103.45.18]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 75FAB3090C6; Mon, 25 May 2026 11:38:42 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=100.103.45.18 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779709124; cv=none; 
From: "Kiryl Shutsemau (Meta)"
To: akpm@linux-foundation.org, rppt@kernel.org, peterx@redhat.com,
	david@kernel.org
Cc: ljs@kernel.org, surenb@google.com, vbabka@kernel.org,
	Liam.Howlett@oracle.com, ziy@nvidia.com, corbet@lwn.net,
	skhan@linuxfoundation.org, seanjc@google.com, pbonzini@redhat.com,
	jthoughton@google.com, aarcange@redhat.com, sj@kernel.org,
	usama.arif@linux.dev, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-kselftest@vger.kernel.org, kvm@vger.kernel.org,
	kernel-team@meta.com, "Kiryl Shutsemau (Meta)"
Subject: [PATCH v4 06/14] mm: preserve RWP marker across PTE rewrites
Date: Mon, 25 May 2026 12:37:20 +0100
Message-ID: <20260525113737.1942478-7-kas@kernel.org>
X-Mailer: git-send-email 2.54.0
In-Reply-To: <20260525113737.1942478-1-kas@kernel.org>
References: <20260525113737.1942478-1-kas@kernel.org>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

The uffd PTE bit must survive any kernel path that rewrites a PTE on a
VM_UFFD_RWP VMA, otherwise the marker that carries PAGE_NONE semantics is
silently
dropped and the next access leaks past RWP tracking. Wire the
preservation through every path that rewrites a VM_UFFD_RWP PTE.

Swap and device-exclusive: do_swap_page(), restore_exclusive_pte(), and
unuse_pte() (swapoff()) re-apply PAGE_NONE when the swap PTE carries the
uffd bit and the VMA has VM_UFFD_RWP.

Migration: remove_migration_pte() and remove_migration_pmd() do the same
after the migration entry is replaced with a real PTE/PMD.

Fork: __copy_present_ptes(), copy_present_page(), copy_nonpresent_pte(),
copy_huge_pmd(), copy_huge_non_present_pmd(), and
copy_hugetlb_page_range() keep the uffd bit on the child when the
destination VMA has VM_UFFD_RWP, matching the existing VM_UFFD_WP
handling. Add VM_UFFD_RWP to VM_COPY_ON_FORK so the flag itself
propagates.

mprotect(): change_pte_range() and change_huge_pmd() restore PAGE_NONE
after pte_modify()/pmd_modify() have recomputed the base protection from
a (possibly user-changed) vm_page_prot. pte_modify() preserves
_PAGE_UFFD, so the bit stays; we just have to force PAGE_NONE back on
top.

Signed-off-by: Kiryl Shutsemau
Assisted-by: Claude:claude-opus-4-6
Acked-by: Mike Rapoport (Microsoft)
---
 include/linux/mm.h |  3 ++-
 mm/huge_memory.c   | 47 +++++++++++++++++++++++++++++++++++++----
 mm/hugetlb.c       | 52 ++++++++++++++++++++++++++++++++++++++--------
 mm/memory.c        | 47 ++++++++++++++++++++++++++++++++++-------
 mm/migrate.c       |  8 +++++++
 mm/mprotect.c      | 10 +++++++++
 mm/mremap.c        | 13 ++++++++++--
 mm/swapfile.c      |  5 +++++
 mm/userfaultfd.c   | 17 +++++++++++++++
 9 files changed, 179 insertions(+), 23 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecbf3e83a892..5953106758fa 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -663,7 +663,8 @@ enum {
  *                  only and thus cannot be reconstructed on page
  *                  fault.
*/ -#define VM_COPY_ON_FORK (VM_PFNMAP | VM_MIXEDMAP | VM_UFFD_WP | VM_MAYBE_G= UARD) +#define VM_COPY_ON_FORK (VM_PFNMAP | VM_MIXEDMAP | VM_UFFD_WP | VM_UFFD_RW= P | \ + VM_MAYBE_GUARD) =20 /* * mapping from the currently active vm_flags protection bits (the diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 0d05abb0cd81..8620ba92263f 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1943,7 +1943,7 @@ static void copy_huge_non_present_pmd( add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR); mm_inc_nr_ptes(dst_mm); pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable); - if (!userfaultfd_wp(dst_vma)) + if (!userfaultfd_protected(dst_vma)) pmd =3D pmd_swp_clear_uffd(pmd); set_pmd_at(dst_mm, addr, dst_pmd, pmd); } @@ -2038,9 +2038,15 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct m= m_struct *src_mm, out_zero_page: mm_inc_nr_ptes(dst_mm); pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable); - pmdp_set_wrprotect(src_mm, addr, src_pmd); - if (!userfaultfd_wp(dst_vma)) + + /* See __copy_present_ptes(): restore accessible protection. */ + if (!userfaultfd_protected(dst_vma)) { + if (userfaultfd_rwp(src_vma)) + pmd =3D pmd_modify(pmd, dst_vma->vm_page_prot); pmd =3D pmd_clear_uffd(pmd); + } + + pmdp_set_wrprotect(src_mm, addr, src_pmd); pmd =3D pmd_wrprotect(pmd); set_pmd: pmd =3D pmd_mkold(pmd); @@ -2626,8 +2632,16 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsig= ned long old_addr, pgtable_trans_huge_deposit(mm, new_pmd, pgtable); } pmd =3D move_soft_dirty_pmd(pmd); - if (vma_has_uffd_without_event_remap(vma)) + if (vma_has_uffd_without_event_remap(vma)) { + /* + * See __copy_present_ptes(): normalise RWP PMDs so + * the destination starts accessible instead of taking + * a numa-hinting fault on first access. 
+ */ + if (pmd_present(pmd) && userfaultfd_rwp(vma)) + pmd =3D pmd_modify(pmd, vma->vm_page_prot); pmd =3D clear_uffd_wp_pmd(pmd); + } set_pmd_at(mm, new_addr, new_pmd, pmd); if (force_flush) flush_pmd_tlb_range(vma, old_addr, old_addr + PMD_SIZE); @@ -2764,6 +2778,10 @@ int change_huge_pmd(struct mmu_gather *tlb, struct v= m_area_struct *vma, */ entry =3D pmd_clear_uffd(entry); =20 + /* See change_pte_range(): preserve RWP protection across mprotect() */ + if (userfaultfd_rwp(vma) && pmd_uffd(entry)) + entry =3D pmd_modify(entry, PAGE_NONE); + /* See change_pte_range(). */ if ((cp_flags & MM_CP_TRY_CHANGE_WRITABLE) && !pmd_write(entry) && can_change_pmd_writable(vma, addr, entry)) @@ -2931,6 +2949,13 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t = *dst_pmd, pmd_t *src_pmd, pm _dst_pmd =3D move_soft_dirty_pmd(src_pmdval); _dst_pmd =3D clear_uffd_wp_pmd(_dst_pmd); } + + /* Re-arm RWP on the moved PMD if dst_vma is RWP-registered. */ + if (userfaultfd_rwp(dst_vma)) { + _dst_pmd =3D pmd_modify(_dst_pmd, PAGE_NONE); + _dst_pmd =3D pmd_mkuffd(_dst_pmd); + } + set_pmd_at(mm, dst_addr, dst_pmd, _dst_pmd); =20 src_pgtable =3D pgtable_trans_huge_withdraw(mm, src_pmd); @@ -3107,6 +3132,11 @@ static void __split_huge_zero_page_pmd(struct vm_are= a_struct *vma, entry =3D pte_mkspecial(entry); if (pmd_uffd(old_pmd)) entry =3D pte_mkuffd(entry); + + /* Restore PAGE_NONE so an RWP marker keeps trapping */ + if (userfaultfd_rwp(vma) && pmd_uffd(old_pmd)) + entry =3D pte_modify(entry, PAGE_NONE); + VM_BUG_ON(!pte_none(ptep_get(pte))); set_pte_at(mm, addr, pte, entry); pte++; @@ -3381,6 +3411,10 @@ static void __split_huge_pmd_locked(struct vm_area_s= truct *vma, pmd_t *pmd, if (uffd_wp) entry =3D pte_mkuffd(entry); =20 + /* Restore PAGE_NONE so an RWP marker keeps trapping */ + if (userfaultfd_rwp(vma) && uffd_wp) + entry =3D pte_modify(entry, PAGE_NONE); + for (i =3D 0; i < HPAGE_PMD_NR; i++) VM_WARN_ON(!pte_none(ptep_get(pte + i))); =20 @@ -5053,6 +5087,11 @@ void 
remove_migration_pmd(struct page_vma_mapped_wal= k *pvmw, struct page *new) pmde =3D pmd_mkwrite(pmde, vma); if (pmd_swp_uffd(*pvmw->pmd)) pmde =3D pmd_mkuffd(pmde); + + /* See do_swap_page(): restore PAGE_NONE for RWP */ + if (pmd_swp_uffd(*pvmw->pmd) && userfaultfd_rwp(vma)) + pmde =3D pmd_modify(pmde, PAGE_NONE); + if (!softleaf_is_migration_young(entry)) pmde =3D pmd_mkold(pmde); /* NOTE: this may contain setting soft-dirty on some archs */ diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 4d75b69d4272..8555810cd42e 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4843,8 +4843,16 @@ hugetlb_install_folio(struct vm_area_struct *vma, pt= e_t *ptep, unsigned long add =20 __folio_mark_uptodate(new_folio); hugetlb_add_new_anon_rmap(new_folio, vma, addr); - if (userfaultfd_wp(vma) && huge_pte_uffd(old)) + if (userfaultfd_protected(vma) && huge_pte_uffd(old)) { newpte =3D huge_pte_mkuffd(newpte); + /* Restore PAGE_NONE so the RWP marker keeps trapping. */ + if (userfaultfd_rwp(vma)) { + unsigned int shift =3D huge_page_shift(hstate_vma(vma)); + + newpte =3D huge_pte_modify(newpte, PAGE_NONE); + newpte =3D arch_make_huge_pte(newpte, shift, vma->vm_flags); + } + } set_huge_pte_at(vma->vm_mm, addr, ptep, newpte, sz); hugetlb_count_add(pages_per_huge_page(hstate_vma(vma)), vma->vm_mm); folio_set_hugetlb_migratable(new_folio); @@ -4917,7 +4925,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, st= ruct mm_struct *src, =20 softleaf =3D softleaf_from_pte(entry); if (unlikely(softleaf_is_hwpoison(softleaf))) { - if (!userfaultfd_wp(dst_vma)) + if (!userfaultfd_protected(dst_vma)) entry =3D huge_pte_clear_uffd(entry); set_huge_pte_at(dst, addr, dst_pte, entry, sz); } else if (unlikely(softleaf_is_migration(softleaf))) { @@ -4931,11 +4939,11 @@ int copy_hugetlb_page_range(struct mm_struct *dst, = struct mm_struct *src, softleaf =3D make_readable_migration_entry( swp_offset(softleaf)); entry =3D swp_entry_to_pte(softleaf); - if (userfaultfd_wp(src_vma) && uffd) + if 
(userfaultfd_protected(src_vma) && uffd) entry =3D pte_swp_mkuffd(entry); set_huge_pte_at(src, addr, src_pte, entry, sz); } - if (!userfaultfd_wp(dst_vma)) + if (!userfaultfd_protected(dst_vma)) entry =3D huge_pte_clear_uffd(entry); set_huge_pte_at(dst, addr, dst_pte, entry, sz); } else if (unlikely(pte_is_marker(entry))) { @@ -5000,6 +5008,16 @@ int copy_hugetlb_page_range(struct mm_struct *dst, s= truct mm_struct *src, goto next; } =20 + /* See __copy_present_ptes(): restore accessible protection. */ + if (!userfaultfd_protected(dst_vma)) { + if (userfaultfd_rwp(src_vma)) { + entry =3D huge_pte_modify(entry, dst_vma->vm_page_prot); + entry =3D arch_make_huge_pte(entry, huge_page_shift(h), + dst_vma->vm_flags); + } + entry =3D huge_pte_clear_uffd(entry); + } + if (cow) { /* * No need to notify as we are downgrading page @@ -5012,9 +5030,6 @@ int copy_hugetlb_page_range(struct mm_struct *dst, st= ruct mm_struct *src, entry =3D huge_pte_wrprotect(entry); } =20 - if (!userfaultfd_wp(dst_vma)) - entry =3D huge_pte_clear_uffd(entry); - set_huge_pte_at(dst, addr, dst_pte, entry, sz); hugetlb_count_add(npages, dst); } @@ -5060,10 +5075,22 @@ static void move_huge_pte(struct vm_area_struct *vm= a, unsigned long old_addr, huge_pte_clear(mm, new_addr, dst_pte, sz); } else { if (need_clear_uffd_wp) { - if (pte_present(pte)) + if (pte_present(pte)) { + /* + * See __copy_present_ptes(): normalise RWP + * PTEs so the destination starts accessible + * instead of taking a numa-hinting fault on + * first access. 
+ */ + if (userfaultfd_rwp(vma)) { + pte =3D huge_pte_modify(pte, vma->vm_page_prot); + pte =3D arch_make_huge_pte(pte, huge_page_shift(h), + vma->vm_flags); + } pte =3D huge_pte_clear_uffd(pte); - else + } else { pte =3D pte_swp_clear_uffd(pte); + } } set_huge_pte_at(mm, new_addr, dst_pte, pte, sz); } @@ -6515,6 +6542,13 @@ long hugetlb_change_protection(struct vm_area_struct= *vma, pte =3D huge_pte_mkuffd(pte); else if (uffd_wp_resolve || uffd_rwp_resolve) pte =3D huge_pte_clear_uffd(pte); + + /* Preserve RWP protection across mprotect() */ + if (userfaultfd_rwp(vma) && huge_pte_uffd(pte)) { + pte =3D huge_pte_modify(pte, PAGE_NONE); + pte =3D arch_make_huge_pte(pte, shift, vma->vm_flags); + } + huge_ptep_modify_prot_commit(vma, address, ptep, old_pte, pte); pages++; tlb_remove_huge_tlb_entry(h, &tlb, ptep, address); diff --git a/mm/memory.c b/mm/memory.c index c4fd5cb4a08f..e4ae5350db41 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -896,6 +896,10 @@ static void restore_exclusive_pte(struct vm_area_struc= t *vma, if (pte_swp_uffd(orig_pte)) pte =3D pte_mkuffd(pte); =20 + /* See do_swap_page(): restore PAGE_NONE for RWP */ + if (pte_swp_uffd(orig_pte) && userfaultfd_rwp(vma)) + pte =3D pte_modify(pte, PAGE_NONE); + if ((vma->vm_flags & VM_WRITE) && can_change_pte_writable(vma, address, pte)) { if (folio_test_dirty(folio)) @@ -1041,7 +1045,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct = mm_struct *src_mm, make_pte_marker(marker)); return 0; } - if (!userfaultfd_wp(dst_vma)) + if (!userfaultfd_protected(dst_vma)) pte =3D pte_swp_clear_uffd(pte); set_pte_at(dst_mm, addr, dst_pte, pte); return 0; @@ -1088,9 +1092,13 @@ copy_present_page(struct vm_area_struct *dst_vma, st= ruct vm_area_struct *src_vma /* All done, just insert the new page copy in the child */ pte =3D folio_mk_pte(new_folio, dst_vma->vm_page_prot); pte =3D maybe_mkwrite(pte_mkdirty(pte), dst_vma); - if (userfaultfd_pte_wp(dst_vma, ptep_get(src_pte))) - /* Uffd-wp needs to be delivered to 
dest pte as well */ + if (userfaultfd_protected(dst_vma) && pte_uffd(ptep_get(src_pte))) { + /* The uffd bit needs to be delivered to the dest pte as well */ pte =3D pte_mkuffd(pte); + /* Restore PAGE_NONE so the RWP marker keeps trapping */ + if (userfaultfd_rwp(dst_vma)) + pte =3D pte_modify(pte, PAGE_NONE); + } set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte); return 0; } @@ -1100,9 +1108,29 @@ static __always_inline void __copy_present_ptes(stru= ct vm_area_struct *dst_vma, pte_t pte, unsigned long addr, int nr) { struct mm_struct *src_mm =3D src_vma->vm_mm; + bool writable; + + /* + * Snapshot writability before the RWP-disarm rewrite below: when the + * child is not RWP-armed, pte_modify(pte, dst_vma->vm_page_prot) can + * silently drop _PAGE_RW from a resolved (no-marker) writable PTE, + * so a later pte_write(pte) check would skip the COW wrprotect and + * leave the parent writable over a folio shared with the child. + */ + writable =3D pte_write(pte); + + /* + * Child is not RWP-armed: restore accessible protection so the + * inherited PAGE_NONE does not cost a fault on first read. + */ + if (!userfaultfd_protected(dst_vma)) { + if (userfaultfd_rwp(src_vma)) + pte =3D pte_modify(pte, dst_vma->vm_page_prot); + pte =3D pte_clear_uffd(pte); + } =20 /* If it's a COW mapping, write protect it both processes. 
*/ - if (is_cow_mapping(src_vma->vm_flags) && pte_write(pte)) { + if (is_cow_mapping(src_vma->vm_flags) && writable) { wrprotect_ptes(src_mm, addr, src_pte, nr); pte =3D pte_wrprotect(pte); } @@ -1112,9 +1140,6 @@ static __always_inline void __copy_present_ptes(struc= t vm_area_struct *dst_vma, pte =3D pte_mkclean(pte); pte =3D pte_mkold(pte); =20 - if (!userfaultfd_wp(dst_vma)) - pte =3D pte_clear_uffd(pte); - set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr); } =20 @@ -5041,6 +5066,14 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) if (pte_swp_uffd(vmf->orig_pte)) pte =3D pte_mkuffd(pte); =20 + /* + * A page reclaimed while RWP-protected carries the uffd bit on + * its swap entry. Re-apply PAGE_NONE on swap-in so the first access + * still traps as an RWP fault. pte_modify() preserves _PAGE_UFFD. + */ + if (pte_swp_uffd(vmf->orig_pte) && userfaultfd_rwp(vma)) + pte =3D pte_modify(pte, PAGE_NONE); + /* * Same logic as in do_wp_page(); however, optimize for pages that are * certainly not shared either because we just allocated them without diff --git a/mm/migrate.c b/mm/migrate.c index 4bdb5be7afbf..8d7fd0b056b6 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -329,6 +329,10 @@ static bool try_to_map_unused_to_zeropage(struct page_= vma_mapped_walk *pvmw, if (pte_swp_uffd(old_pte)) newpte =3D pte_mkuffd(newpte); =20 + /* See remove_migration_pte(): restore PAGE_NONE for RWP */ + if (pte_swp_uffd(old_pte) && userfaultfd_rwp(pvmw->vma)) + newpte =3D pte_modify(newpte, PAGE_NONE); + set_pte_at(pvmw->vma->vm_mm, pvmw->address, pvmw->pte, newpte); =20 dec_mm_counter(pvmw->vma->vm_mm, mm_counter(folio)); @@ -394,6 +398,10 @@ static bool remove_migration_pte(struct folio *folio, else if (pte_swp_uffd(old_pte)) pte =3D pte_mkuffd(pte); =20 + /* See do_swap_page(): restore PAGE_NONE for RWP */ + if (pte_swp_uffd(old_pte) && userfaultfd_rwp(vma)) + pte =3D pte_modify(pte, PAGE_NONE); + if (folio_test_anon(folio) && !softleaf_is_migration_read(entry)) rmap_flags |=3D 
RMAP_EXCLUSIVE; =20 diff --git a/mm/mprotect.c b/mm/mprotect.c index 4a6b35482aee..e0b5fe7c66b2 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -296,6 +296,16 @@ static __always_inline void change_present_ptes(struct= mmu_gather *tlb, else if (uffd_prot_resolve) ptent =3D pte_clear_uffd(ptent); =20 + /* + * The uffd bit on a VM_UFFD_RWP VMA carries PROT_NONE + * semantics. If mprotect() or NUMA hinting changed the + * base protection, restore PAGE_NONE so the PTE still + * traps on any access. pte_modify() preserves + * _PAGE_UFFD. + */ + if (userfaultfd_rwp(vma) && pte_uffd(ptent)) + ptent =3D pte_modify(ptent, PAGE_NONE); + /* * In some writable, shared mappings, we might want * to catch actual write access -- see diff --git a/mm/mremap.c b/mm/mremap.c index 12732a5c547e..14e5df316f83 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -296,10 +296,19 @@ static int move_ptes(struct pagetable_move_control *p= mc, pte_clear(mm, new_addr, new_ptep); else { if (need_clear_uffd_wp) { - if (pte_present(pte)) + if (pte_present(pte)) { + /* + * See __copy_present_ptes(): normalise + * RWP PTEs so the destination starts + * accessible instead of taking a + * numa-hinting fault on first access. 
+ */ + if (userfaultfd_rwp(vma)) + pte =3D pte_modify(pte, vma->vm_page_prot); pte =3D pte_clear_uffd(pte); - else + } else { pte =3D pte_swp_clear_uffd(pte); + } } set_ptes(mm, new_addr, new_ptep, pte, nr_ptes); } diff --git a/mm/swapfile.c b/mm/swapfile.c index 15fdca2da1f7..27cc299ead9b 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -2559,6 +2559,11 @@ static int unuse_pte(struct vm_area_struct *vma, pmd= _t *pmd, new_pte =3D pte_mksoft_dirty(new_pte); if (pte_swp_uffd(old_pte)) new_pte =3D pte_mkuffd(new_pte); + + /* See do_swap_page(): restore PAGE_NONE for RWP */ + if (pte_swp_uffd(old_pte) && userfaultfd_rwp(vma)) + new_pte =3D pte_modify(new_pte, PAGE_NONE); + setpte: set_pte_at(vma->vm_mm, addr, pte, new_pte); folio_put_swap(swapcache, folio_file_page(swapcache, swp_offset(entry))); diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index ebce642c8805..9799abff1e76 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -1285,6 +1285,13 @@ static long move_present_ptes(struct mm_struct *mm, if (pte_dirty(orig_src_pte)) orig_dst_pte =3D pte_mkdirty(orig_dst_pte); orig_dst_pte =3D pte_mkwrite(orig_dst_pte, dst_vma); + + /* Re-arm RWP on the moved PTE if dst_vma is RWP-registered. */ + if (userfaultfd_rwp(dst_vma)) { + orig_dst_pte =3D pte_modify(orig_dst_pte, PAGE_NONE); + orig_dst_pte =3D pte_mkuffd(orig_dst_pte); + } + set_pte_at(mm, dst_addr, dst_pte, orig_dst_pte); =20 src_addr +=3D PAGE_SIZE; @@ -1366,6 +1373,9 @@ static int move_swap_pte(struct mm_struct *mm, struct= vm_area_struct *dst_vma, orig_src_pte =3D ptep_get_and_clear(mm, src_addr, src_pte); if (pgtable_supports_soft_dirty()) orig_src_pte =3D pte_swp_mksoft_dirty(orig_src_pte); + /* Re-arm RWP on the moved swap entry if dst_vma is RWP-registered. 
*/ + if (userfaultfd_rwp(dst_vma)) + orig_src_pte =3D pte_swp_mkuffd(orig_src_pte); set_pte_at(mm, dst_addr, dst_pte, orig_src_pte); double_pt_unlock(dst_ptl, src_ptl); =20 @@ -1392,6 +1402,13 @@ static int move_zeropage_pte(struct mm_struct *mm, =20 zero_pte =3D pte_mkspecial(pfn_pte(zero_pfn(dst_addr), dst_vma->vm_page_prot)); + + /* Re-arm RWP on the moved PTE if dst_vma is RWP-registered. */ + if (userfaultfd_rwp(dst_vma)) { + zero_pte =3D pte_modify(zero_pte, PAGE_NONE); + zero_pte =3D pte_mkuffd(zero_pte); + } + ptep_clear_flush(src_vma, src_addr, src_pte); set_pte_at(mm, dst_addr, dst_pte, zero_pte); double_pt_unlock(dst_ptl, src_ptl); --=20 2.54.0 From nobody Mon Jun 8 23:56:07 2026 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-alma10-1.taild15c8.ts.net [100.103.45.18]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 705263090C6; Mon, 25 May 2026 11:38:51 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=100.103.45.18 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779709132; cv=none; b=LEjUfYuwIhzpxRq/dOnJyu2NjA1BPXpGYK5BaAEDu9xL2Vtz2cQmSUEGb3T4ucnujB3s4fMvpPFl87pWnRak/0Evv7iOk4pmg3eOSRchyYL19cTYYQNkUCZNpqO2m/nXLgNym3zI0Nyv5yBor+VqqQJOETC/6ZxgZPvWrYScL30= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779709132; c=relaxed/simple; bh=u3dPUTchYjuZGOdBM6mVzH4C5Zzq6kXfnLj6+3I0za8=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=Ew6XMD+Ox0UXsTXtfH6C++16HQvflmCqGWJINBrmmauWnLWFPsOSAqmBNrtVpfsaqJOiyYTKwH8ihCNY/OHT+iMBj0meK2IuZJCY2BK+qok8OF3ab8deJps3LRNh+BoPIBfQS5VUktsAjALEvsLWgTQyrC4slYGSLk0jPugJ6fs= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=dkBJOnaF; arc=none smtp.client-ip=100.103.45.18 Authentication-Results: 
smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="dkBJOnaF" Received: by smtp.kernel.org (Postfix) with ESMTPSA id C13911F00A3A; Mon, 25 May 2026 11:38:50 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel.org; s=k20260515; t=1779709131; bh=Ps0N9FhSg2WInzdVQ3glYAf7iP/rHFVJxbQNtaX3gcU=; h=From:To:Cc:Subject:Date:In-Reply-To:References; b=dkBJOnaFtPJHLE94XqDZFU860sPGU7UzqZ0m6KATTLDfn97gNkoyhMMGP3n8BUDdf yStaI0Zj9oe/YyTCO46KyUf2YrQOFc+RbTr5Kscr3wEQ/ZRzGssPHhDEZqa+ikY/hV O/gnXnAnHs6fDoXFFXOmzL96PSb1hAOXEjsHLnmdSZKIj6yrZ8C6wqL/Fdsh/E7g/I K/SOZJbGzjeWiUYiOoUepHv+IkntOZFO3WmXkOWSqRZUNq3hTbsl1imM54fX3vSEo+ BELRoFfFt9w89ZbcpZYyfiqvi1dmsREfIzeLwnb1sGMjzRv+xNHudvCqctojOUahB5 iUFaSHNAFqfEw== Received: from phl-compute-06.internal (phl-compute-06.internal [10.202.2.46]) by mailfauth.phl.internal (Postfix) with ESMTP id 2B394F40082; Mon, 25 May 2026 07:38:50 -0400 (EDT) Received: from phl-frontend-03 ([10.202.2.162]) by phl-compute-06.internal (MEProxy); Mon, 25 May 2026 07:38:50 -0400 X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: dmFkZTEfoRdVBJYWBUADV7d9kIV+Nxm/pVUcgu2VEqiWnj8eQrxuVRmWJvpc9q3Oqv2JCI in3fN4Zd8kb+AIzxXT1aE8XhYzXjllPczHZpsDuave1RqUEJt2aM8NK4F+1xGB1lHbYgSB 4+lR3LF0kMycZigQtO5w6VfLN0Aex+LApk+NDe4mN/AQtgEH82UY4yLdpanr1nUEhfYRPX qXaeB8tr6YycPJO/GKcHBfWh4XfvSwFXkiL84UUAHkAJqW6Bc0LpWOIYef91VRiFvO85Cg Xd8jmXx4jYBCkCANHtSsVLhpCbXIiIojLruOFr51QxylDrx8yfg2/g0kqIsUfNxsLFZyNP nEjj3iDx3Pih6xK1tSzuV2TKDG9t7caUpsOxcJHSxNo/j6vMD5+xZJdixPCSN5F5VW5BpS CMhUff1ycXkGeotLXYjcA7XaRuiwzGRJJYP1CiWCis02P2P2D+sbk/R6Vq3YcOpvx+7NVY 02CdvOeg4gtHAIdwNDSdP6hvITAphIltUDsfDNLu1gKBL4QkleYh63JZk6PEiUDceeiqTk pjg5DSe82vF51LR2ATCUKQH7BsFFjuliK5Avxdd8WLkHY6w4J3QIZyAaJgZHBy+Rd9+8bN 1XCHyZJhWdnxFkS1nrcinn/kw+oCB9wEdzzkHbYBwGwNowh5hGHHYdqYGn2Q X-ME-Proxy: Feedback-ID: i10464835:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Mon, 25 May 2026 07:38:48 -0400 (EDT) From: "Kiryl Shutsemau (Meta)" To: 
akpm@linux-foundation.org, rppt@kernel.org, peterx@redhat.com, david@kernel.org Cc: ljs@kernel.org, surenb@google.com, vbabka@kernel.org, Liam.Howlett@oracle.com, ziy@nvidia.com, corbet@lwn.net, skhan@linuxfoundation.org, seanjc@google.com, pbonzini@redhat.com, jthoughton@google.com, aarcange@redhat.com, sj@kernel.org, usama.arif@linux.dev, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org, kvm@vger.kernel.org, kernel-team@meta.com, "Kiryl Shutsemau (Meta)" Subject: [PATCH v4 07/14] mm: handle VM_UFFD_RWP in khugepaged, rmap, and GUP Date: Mon, 25 May 2026 12:37:21 +0100 Message-ID: <20260525113737.1942478-8-kas@kernel.org> X-Mailer: git-send-email 2.54.0 In-Reply-To: <20260525113737.1942478-1-kas@kernel.org> References: <20260525113737.1942478-1-kas@kernel.org> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Three mm paths outside the fault handler gate on the uffd PTE bit today: khugepaged (skip collapse on ranges carrying markers), rmap (cap unmap batching), and GUP (force a fault through gup_can_follow_protnone). Extend each to treat VM_UFFD_RWP the same as VM_UFFD_WP; otherwise per-PTE RWP state is silently destroyed or bypassed. khugepaged: try_collapse_pte_mapped_thp() and file_backed_vma_is_retractable() already refuse to collapse or retract page tables on ranges carrying the uffd PTE bit. Broaden the VMA predicate from userfaultfd_wp() to userfaultfd_protected() so VM_UFFD_RWP ranges get the same protection. hpage_collapse_scan_pmd() needs no change =E2=80=94 its existing pte_uffd() check already catches an RWP PTE because it carries the uffd bit. rmap: folio_unmap_pte_batch() caps batching at 1 for VM_UFFD_RWP so the restore path handles each PTE with its own marker. 
GUP: gup_can_follow_protnone() forces a fault on VM_UFFD_RWP VMAs regardless of FOLL_HONOR_NUMA_FAULT. RWP uses protnone as an access-tracking marker, not for NUMA hinting, so any GUP =E2=80=94 read or write =E2=80=94 must go through the userfaultfd fault path. Signed-off-by: Kiryl Shutsemau Assisted-by: Claude:claude-opus-4-6 Acked-by: Mike Rapoport (Microsoft) --- include/linux/mm.h | 10 +++++++++- mm/khugepaged.c | 18 +++++++++++------- mm/rmap.c | 2 +- 3 files changed, 21 insertions(+), 9 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 5953106758fa..f72bf5ccf72c 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -4600,11 +4600,19 @@ static inline int vm_fault_to_errno(vm_fault_t vm_f= ault, int foll_flags) =20 /* * Indicates whether GUP can follow a PROT_NONE mapped page, or whether - * a (NUMA hinting) fault is required. + * a (NUMA hinting or userfaultfd RWP) fault is required. */ static inline bool gup_can_follow_protnone(const struct vm_area_struct *vm= a, unsigned int flags) { + /* + * VM_UFFD_RWP uses protnone as an access-tracking marker, not for + * NUMA hinting. GUP must always take a fault so the access is + * delivered to userfaultfd, regardless of FOLL_HONOR_NUMA_FAULT. + */ + if (vma->vm_flags & VM_UFFD_RWP) + return false; + /* * If callers don't want to honor NUMA hinting faults, no need to * determine if we would actually have to trigger a NUMA hinting fault. diff --git a/mm/khugepaged.c b/mm/khugepaged.c index afa218be15de..4f3fedcd75cf 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1895,8 +1895,11 @@ static enum scan_result try_collapse_pte_mapped_thp(= struct mm_struct *mm, unsign if (!thp_vma_allowable_order(vma, vma->vm_flags, TVA_FORCED_COLLAPSE, PMD= _ORDER)) return SCAN_VMA_CHECK; =20 - /* Keep pmd pgtable for uffd-wp; see comment in retract_page_tables() */ - if (userfaultfd_wp(vma)) + /* + * Keep pmd pgtable while the uffd bit is in use; see comment in + * retract_page_tables(). 
+ */ + if (userfaultfd_protected(vma)) return SCAN_PTE_UFFD; =20 folio =3D filemap_lock_folio(vma->vm_file->f_mapping, @@ -2109,13 +2112,14 @@ static bool file_backed_vma_is_retractable(struct v= m_area_struct *vma) return false; =20 /* - * When a vma is registered with uffd-wp, we cannot recycle + * When a vma is registered with uffd-wp or RWP, we cannot recycle * the page table because there may be pte markers installed. - * Other vmas can still have the same file mapped hugely, but - * skip this one: it will always be mapped in small page size - * for uffd-wp registered ranges. + * VM_UFFD_RWP ranges similarly rely on per-PTE uffd state + * and cannot be recycled to a shared PMD. Other vmas can still + * have the same file mapped hugely, but skip this one: it will + * always be mapped in small page size for these registrations. */ - if (userfaultfd_wp(vma)) + if (userfaultfd_protected(vma)) return false; =20 /* diff --git a/mm/rmap.c b/mm/rmap.c index 546bc1cf9391..9fb733489898 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1965,7 +1965,7 @@ static inline unsigned int folio_unmap_pte_batch(stru= ct folio *folio, if (pte_unused(pte)) return 1; =20 - if (userfaultfd_wp(vma)) + if (userfaultfd_protected(vma)) return 1; =20 /* --=20 2.54.0 From nobody Mon Jun 8 23:56:07 2026 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-alma10-1.taild15c8.ts.net [100.103.45.18]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 35E6C37FF6F for ; Mon, 25 May 2026 11:39:01 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=100.103.45.18 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779709143; cv=none; b=B/L0C9CQ0XQ2BrF6Bb/e3yW8+jPSBME8ipbdpRcDqSEYlTIp/OjZp9F3ErpRhuhiuhoUAW4HE26c4BR2dFOVCfmpGNRk3L95xyyTFqs87+5kOmx4BRU3Oh3Q9CD4B56Xkoz3sxBw8A/hSjc+N3m+FACIgdOYBHqK0RiS8ktNYg0= ARC-Message-Signature: i=1; a=rsa-sha256; 
d=subspace.kernel.org; s=arc-20240116; t=1779709143; c=relaxed/simple; bh=9c/ZjsZcD9wQN50Vsfqif7MYYHGPlyxMKase1/S0UQI=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=YKZd8A43q00uChRp8efc1W8ylhpUtKXpwSZnTZH+K7rmhD83e2j5/zOHDN2KqStB0dSAuyuY1vJ1ACkzXOYHACLdUAN+3piRGNQ9zLV3Y/Fevu+CYYzIROWIjksduIFkuWKqF0yUWWmrFdwv+T7z8LBuXokCcESeFFB8OjqqK+0= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=M6WiPsrP; arc=none smtp.client-ip=100.103.45.18 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="M6WiPsrP" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 5E9661F000E9; Mon, 25 May 2026 11:39:00 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel.org; s=k20260515; t=1779709141; bh=t6ELIsi6v/elkIirziBiyXbJIJnO66l+DtRj2dcMPkI=; h=From:To:Cc:Subject:Date:In-Reply-To:References; b=M6WiPsrPDVI7piTeRLQVlZwXT6na6td4SXgIJSj6KrG4OypO7NfI3vDXMN+bPZnAo jjI9ko280kK0SHW3E0s8N2DFVvQm39nog4nEg7UuU59OyCF71R+uJrZPWBER+cTvHu yneH5U6aTWidN6aX4SUulnxeb/hBCczIk3J46hcUGtuIB+1duSfzTedA5a4gFDwX1D ELg9G6mm4iwWnBHotdImZjm0IxWiNwio8mimkkkF0pUvqyimyXj/PCF/HeSltU/ZOo lOuE9vj8z7GL6sUn7Q00tufr5te8/67FnnNAX0U2davwTkd47rbTrX9k0koJXuLt56 Fx+v5gIS57GGQ== Received: from phl-compute-03.internal (phl-compute-03.internal [10.202.2.43]) by mailfauth.phl.internal (Postfix) with ESMTP id AE798F40082; Mon, 25 May 2026 07:38:59 -0400 (EDT) Received: from phl-frontend-04 ([10.202.2.163]) by phl-compute-03.internal (MEProxy); Mon, 25 May 2026 07:38:59 -0400 X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: dmFkZTEfoRdVBJYWBUADV7d9kIV+Nxm/pVUcgu2VEqiWnj8eQrxuVRmWJvpc9q3Oqv2JCI in3fN4Zd8kb+AIzxXT1aE8XhYzXjllPczHZpsDuave1RqUEJt2aM8NK4F+1xGB1lHbYgSB 4+lR3LF0kMycZigQtO5w6VfLN0Aex+LApk+NDe4mN/AQtgEH82UY4yLdpanr1nUEhfYRPX qXaeB8tr6YycPJO/GKcHBfWh4XfvSwFXkiL84UUAHkAJqW6Bc0LpWOIYef91VRiFvO85Cg 
Xd8jmXx4jYBCkCANHtSsVLhpCbXIiIojLruOFr51QxylDrx8yfg2/g0kqIsUfNxsLFZykD xsECufDkTIqu1uB1L0alwtBEQPeH52fWYDFmmuO+AgYdLW9KN4qx2Iuqr7BaVzQ99OmnPX 8a7RumJP58Q7Qau0x9uitLUcPsFtl1ePdjopmET/d06HvYnah5FKgAvvQObF0g/lNDRtpq enZc3ZSvdbH/hNZ+XS1g60wiFt5fIdgLwnRGKX6RXZgMYYShwhvZSFzlZ/98FyeLK7pZmB VNz+3USHL0NrFMKE8wp4TFYY1YJ1CeDzz5t9PLzQkz7qMwlRWcjgH+I5UiTwuWlfIBUhjO DZgwp4lFPds3RPPiUHyatoh3BnYOBqzrgrk9Qq0+DQrIOBhHEPr3YzuiSGvA X-ME-Proxy: Feedback-ID: i10464835:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Mon, 25 May 2026 07:38:57 -0400 (EDT) From: "Kiryl Shutsemau (Meta)" To: akpm@linux-foundation.org, rppt@kernel.org, peterx@redhat.com, david@kernel.org Cc: ljs@kernel.org, surenb@google.com, vbabka@kernel.org, Liam.Howlett@oracle.com, ziy@nvidia.com, corbet@lwn.net, skhan@linuxfoundation.org, seanjc@google.com, pbonzini@redhat.com, jthoughton@google.com, aarcange@redhat.com, sj@kernel.org, usama.arif@linux.dev, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org, kvm@vger.kernel.org, kernel-team@meta.com, "Kiryl Shutsemau (Meta)" Subject: [PATCH v4 08/14] userfaultfd: add UFFDIO_REGISTER_MODE_RWP and UFFDIO_RWPROTECT plumbing Date: Mon, 25 May 2026 12:37:22 +0100 Message-ID: <20260525113737.1942478-9-kas@kernel.org> X-Mailer: git-send-email 2.54.0 In-Reply-To: <20260525113737.1942478-1-kas@kernel.org> References: <20260525113737.1942478-1-kas@kernel.org> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Add the userspace interface for read-write protection tracking: - UFFDIO_REGISTER_MODE_RWP register a range for RWP tracking - UFFD_FEATURE_RWP capability bit - UFFDIO_RWPROTECT install / remove RWP on a range Introduce CONFIG_USERFAULTFD_RWP, auto-selected on 64-bit kernels with ARCH_HAS_PTE_PROTNONE and HAVE_ARCH_USERFAULTFD_WP. 
The symbol gates VM_UFFD_RWP (previously aliased to VM_NONE) and the smaps/trace-flag hooks added in the preparatory patches; without it the UAPI bits added here have nothing to drive and would be unreachable.

Registration sets VM_UFFD_RWP on the VMA. Combining MODE_WP with MODE_RWP is rejected because both modes claim the uffd PTE bit.

UFFDIO_RWPROTECT is the bidirectional counterpart of UFFDIO_WRITEPROTECT:

  - MODE_RWP    change_protection() with MM_CP_UFFD_RWP installs PAGE_NONE and sets the uffd bit on present PTEs
  - !MODE_RWP   change_protection() with MM_CP_UFFD_RWP_RESOLVE restores vma->vm_page_prot and clears the bit

userfaultfd_clear_vma() runs the same resolve pass on unregister so RWP state cannot outlive the uffd.

Re-registering a range must not drop a mode that installs per-PTE markers (WP or RWP); doing so returns -EBUSY. This also closes a pre-existing window where re-registering without MODE_WP would strand uffd-wp markers: before, those caused extra write-faults but were otherwise benign; with RWP preservation in place, a subsequent mprotect() on a VM_UFFD_RWP VMA would silently promote the stale markers to RWP.

The feature is not yet advertised. UFFDIO_REGISTER_MODE_RWP, UFFD_FEATURE_RWP, and _UFFDIO_RWPROTECT are intentionally absent from UFFD_API_REGISTER_MODES, UFFD_API_FEATURES, and UFFD_API_RANGE_IOCTLS, so UFFDIO_API masks them out and the register-mode validator rejects the bit. The follow-up patch adds fault dispatch and exposes the UAPI.
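
For illustration only, the validation rules described above can be modelled in a few lines of standalone C. This is a sketch, not code from the patch: the sketch_* helpers and the SKETCH_VM_UFFD_* values are hypothetical stand-ins (the real VM_UFFD_* flags are kernel-internal), and the -EINVAL returned for the WP+RWP combination is an assumption about the value ret holds at that point in userfaultfd_register(). The UFFDIO_RWPROTECT_MODE_* bits and the -EBUSY case are taken from this patch.

```c
#include <errno.h>
#include <stdint.h>

/* Mode bits as defined in the uapi hunk of this patch. */
#define UFFDIO_RWPROTECT_MODE_RWP	((uint64_t)1 << 0)
#define UFFDIO_RWPROTECT_MODE_DONTWAKE	((uint64_t)1 << 1)

/* Hypothetical stand-ins for the kernel-internal VMA flags. */
#define SKETCH_VM_UFFD_WP	(1u << 0)
#define SKETCH_VM_UFFD_RWP	(1u << 1)

/*
 * Models the mode checks in userfaultfd_rwprotect(): unknown bits are
 * rejected, and so is arming RWP together with DONTWAKE.
 */
static int sketch_rwprotect_check_mode(uint64_t mode)
{
	if (mode & ~(UFFDIO_RWPROTECT_MODE_DONTWAKE |
		     UFFDIO_RWPROTECT_MODE_RWP))
		return -EINVAL;

	if ((mode & UFFDIO_RWPROTECT_MODE_RWP) &&
	    (mode & UFFDIO_RWPROTECT_MODE_DONTWAKE))
		return -EINVAL;

	return 0;
}

/*
 * Models the register-time rules. cur_flags are the marker modes already
 * set on a VMA registered with the same context (0 if none); new_flags
 * are the modes requested now.
 */
static int sketch_register_check(unsigned int cur_flags,
				 unsigned int new_flags)
{
	/* WP and RWP both claim the uffd PTE bit. Assumed -EINVAL. */
	if ((new_flags & SKETCH_VM_UFFD_WP) &&
	    (new_flags & SKETCH_VM_UFFD_RWP))
		return -EINVAL;

	/* Dropping a marker-installing mode would strand its markers. */
	if (cur_flags & (SKETCH_VM_UFFD_WP | SKETCH_VM_UFFD_RWP) & ~new_flags)
		return -EBUSY;

	return 0;
}
```

In this model, switching a range from WP to RWP (or the reverse) without an intervening UFFDIO_UNREGISTER hits the -EBUSY branch, which matches the stale-marker promotion case the commit message describes.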
Signed-off-by: Kiryl Shutsemau Assisted-by: Claude:claude-opus-4-6 Reviewed-by: Mike Rapoport (Microsoft) --- Documentation/admin-guide/mm/userfaultfd.rst | 10 ++ include/linux/userfaultfd_k.h | 2 + include/uapi/linux/userfaultfd.h | 19 ++ mm/Kconfig | 9 + mm/userfaultfd.c | 180 ++++++++++++++++++- 5 files changed, 217 insertions(+), 3 deletions(-) diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/a= dmin-guide/mm/userfaultfd.rst index e5cc8848dcb3..1e533639fd50 100644 --- a/Documentation/admin-guide/mm/userfaultfd.rst +++ b/Documentation/admin-guide/mm/userfaultfd.rst @@ -131,6 +131,16 @@ userfaults on the range registered. Not all ioctls wil= l necessarily be supported for all memory types (e.g. anonymous memory vs. shmem vs. hugetlbfs), or all types of intercepted faults. =20 +.. note:: + + Re-registering an already-registered range must not drop any of the + modes that install per-PTE markers =E2=80=94 currently + ``UFFDIO_REGISTER_MODE_WP`` and ``UFFDIO_REGISTER_MODE_RWP``. Doing + so would strand markers with no flag to describe them, so the call + is rejected with ``-EBUSY``; userspace must issue + ``UFFDIO_UNREGISTER`` first. This differs from older kernels, which + silently replaced the mode bits on re-registration. + Userland can use the ``uffdio_register.ioctls`` to manage the virtual address space in the background (to add or potentially also remove memory from the ``userfaultfd`` registered range). 
This means a userfault diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index 16fbe11c0c55..f78d5d370d0a 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -150,6 +150,8 @@ static inline uffd_flags_t uffd_flags_set_mode(uffd_fla= gs_t flags, enum mfill_at =20 extern long uffd_wp_range(struct vm_area_struct *vma, unsigned long start, unsigned long len, bool enable_wp); +extern int mrwprotect_range(struct userfaultfd_ctx *ctx, unsigned long sta= rt, + unsigned long len, bool enable_rwp); =20 /* move_pages */ void double_pt_lock(spinlock_t *ptl1, spinlock_t *ptl2); diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaul= tfd.h index 2841e4ea8f2c..7b78aa3b5318 100644 --- a/include/uapi/linux/userfaultfd.h +++ b/include/uapi/linux/userfaultfd.h @@ -79,6 +79,7 @@ #define _UFFDIO_WRITEPROTECT (0x06) #define _UFFDIO_CONTINUE (0x07) #define _UFFDIO_POISON (0x08) +#define _UFFDIO_RWPROTECT (0x09) #define _UFFDIO_API (0x3F) =20 /* userfaultfd ioctl ids */ @@ -103,6 +104,8 @@ struct uffdio_continue) #define UFFDIO_POISON _IOWR(UFFDIO, _UFFDIO_POISON, \ struct uffdio_poison) +#define UFFDIO_RWPROTECT _IOWR(UFFDIO, _UFFDIO_RWPROTECT, \ + struct uffdio_rwprotect) =20 /* read() structure */ struct uffd_msg { @@ -158,6 +161,7 @@ struct uffd_msg { #define UFFD_PAGEFAULT_FLAG_WRITE (1<<0) /* If this was a write fault */ #define UFFD_PAGEFAULT_FLAG_WP (1<<1) /* If reason is VM_UFFD_WP */ #define UFFD_PAGEFAULT_FLAG_MINOR (1<<2) /* If reason is VM_UFFD_MINOR */ +#define UFFD_PAGEFAULT_FLAG_RWP (1<<3) /* If reason is VM_UFFD_RWP */ =20 struct uffdio_api { /* userland asks for an API number and the features to enable */ @@ -230,6 +234,11 @@ struct uffdio_api { * * UFFD_FEATURE_MOVE indicates that the kernel supports moving an * existing page contents from userspace. + * + * UFFD_FEATURE_RWP indicates that the kernel supports + * UFFDIO_REGISTER_MODE_RWP for read-write protection tracking. 
+ * Pages are made inaccessible via UFFDIO_RWPROTECT and faults + * are delivered when the pages are re-accessed. */ #define UFFD_FEATURE_PAGEFAULT_FLAG_WP (1<<0) #define UFFD_FEATURE_EVENT_FORK (1<<1) @@ -248,6 +257,7 @@ struct uffdio_api { #define UFFD_FEATURE_POISON (1<<14) #define UFFD_FEATURE_WP_ASYNC (1<<15) #define UFFD_FEATURE_MOVE (1<<16) +#define UFFD_FEATURE_RWP (1<<17) __u64 features; =20 __u64 ioctls; @@ -263,6 +273,7 @@ struct uffdio_register { #define UFFDIO_REGISTER_MODE_MISSING ((__u64)1<<0) #define UFFDIO_REGISTER_MODE_WP ((__u64)1<<1) #define UFFDIO_REGISTER_MODE_MINOR ((__u64)1<<2) +#define UFFDIO_REGISTER_MODE_RWP ((__u64)1<<3) __u64 mode; =20 /* @@ -356,6 +367,14 @@ struct uffdio_poison { __s64 updated; }; =20 +struct uffdio_rwprotect { + struct uffdio_range range; + /* !RWP means undo RWP-protection */ +#define UFFDIO_RWPROTECT_MODE_RWP ((__u64)1<<0) +#define UFFDIO_RWPROTECT_MODE_DONTWAKE ((__u64)1<<1) + __u64 mode; +}; + struct uffdio_move { __u64 dst; __u64 src; diff --git a/mm/Kconfig b/mm/Kconfig index 776b67c66e82..fac01bcfc0d1 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -1333,6 +1333,15 @@ config HAVE_ARCH_USERFAULTFD_MINOR help Arch has userfaultfd minor fault support =20 +config USERFAULTFD_RWP + def_bool y + depends on 64BIT && ARCH_HAS_PTE_PROTNONE && HAVE_ARCH_USERFAULTFD_WP + help + Userfaultfd read-write protection (UFFDIO_RWPROTECT) delivers a + userfaultfd notification on every access -- read or write -- to a + protected range, letting userspace observe the working set of a + process. 
+ menuconfig USERFAULTFD bool "Enable userfaultfd() system call" depends on MMU diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 9799abff1e76..78eb63702649 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -1157,6 +1157,75 @@ static int mwriteprotect_range(struct userfaultfd_ct= x *ctx, unsigned long start, return err; } =20 +int mrwprotect_range(struct userfaultfd_ctx *ctx, unsigned long start, + unsigned long len, bool enable_rwp) +{ + struct mm_struct *dst_mm =3D ctx->mm; + unsigned long end =3D start + len; + struct vm_area_struct *dst_vma; + unsigned int mm_cp_flags; + struct mmu_gather tlb; + bool found =3D false; + VMA_ITERATOR(vmi, dst_mm, start); + + VM_WARN_ON_ONCE(start & ~PAGE_MASK); + VM_WARN_ON_ONCE(len & ~PAGE_MASK); + VM_WARN_ON_ONCE(start + len <=3D start); + + guard(mmap_read_lock)(dst_mm); + guard(rwsem_read)(&ctx->map_changing_lock); + + if (atomic_read(&ctx->mmap_changing)) + return -EAGAIN; + + if (enable_rwp) + mm_cp_flags =3D MM_CP_UFFD_RWP; + else + mm_cp_flags =3D MM_CP_UFFD_RWP_RESOLVE; + + /* + * Pre-scan the range: validate every spanned VMA before applying + * any change_protection() so a partial failure cannot leave the + * process with only a prefix of the range re-protected. 
+ */ + for_each_vma_range(vmi, dst_vma, end) { + if (!userfaultfd_rwp(dst_vma)) + return -ENOENT; + + if (is_vm_hugetlb_page(dst_vma)) { + unsigned long page_mask; + + page_mask =3D vma_kernel_pagesize(dst_vma) - 1; + if ((start & page_mask) || (len & page_mask)) + return -EINVAL; + } + found =3D true; + } + if (!found) + return -ENOENT; + + vma_iter_set(&vmi, start); + tlb_gather_mmu(&tlb, dst_mm); + for_each_vma_range(vmi, dst_vma, end) { + unsigned long vma_start =3D max(dst_vma->vm_start, start); + unsigned long vma_end =3D min(dst_vma->vm_end, end); + unsigned int flags =3D mm_cp_flags; + + /* + * On resolve, try to upgrade writability per-VMA -- + * MM_CP_TRY_CHANGE_WRITABLE WARNs in + * maybe_change_pte_writable() if the VMA is not VM_WRITE, + * and RWP can be registered on PROT_READ-only mappings. + */ + if (!enable_rwp && vma_wants_manual_pte_write_upgrade(dst_vma)) + flags |=3D MM_CP_TRY_CHANGE_WRITABLE; + + change_protection(&tlb, dst_vma, vma_start, vma_end, flags); + } + tlb_finish_mmu(&tlb); + + return 0; +} =20 void double_pt_lock(spinlock_t *ptl1, spinlock_t *ptl2) @@ -2197,9 +2266,22 @@ static struct vm_area_struct *userfaultfd_clear_vma(= struct vma_iterator *vmi, if (start =3D=3D vma->vm_start && end =3D=3D vma->vm_end) give_up_on_oom =3D true; =20 - /* Reset ptes for the whole vma range if wr-protected */ - if (userfaultfd_wp(vma)) - uffd_wp_range(vma, start, end - start, false); + /* Clear the uffd bit and/or restore protnone PTEs */ + if (userfaultfd_protected(vma)) { + unsigned int mm_cp_flags =3D 0; + struct mmu_gather tlb; + + if (userfaultfd_wp(vma)) + mm_cp_flags |=3D MM_CP_UFFD_WP_RESOLVE; + if (userfaultfd_rwp(vma)) + mm_cp_flags |=3D MM_CP_UFFD_RWP_RESOLVE; + if (vma_wants_manual_pte_write_upgrade(vma)) + mm_cp_flags |=3D MM_CP_TRY_CHANGE_WRITABLE; + + tlb_gather_mmu(&tlb, vma->vm_mm); + change_protection(&tlb, vma, start, end, mm_cp_flags); + tlb_finish_mmu(&tlb); + } =20 ret =3D vma_modify_flags_uffd(vmi, prev, vma, start, end, 
&new_vma_flags, NULL_VM_UFFD_CTX, @@ -2248,6 +2330,14 @@ static int userfaultfd_register_range(struct userfau= ltfd_ctx *ctx, vma_test_all_mask(vma, vma_flags)) goto skip; =20 + /* + * Pre-scan in userfaultfd_register() already rejected mode + * switches that would drop VM_UFFD_WP or VM_UFFD_RWP, so a + * stray bit here is a bug. + */ + VM_WARN_ON_ONCE(vma->vm_userfaultfd_ctx.ctx =3D=3D ctx && + vma->vm_flags & (VM_UFFD_WP | VM_UFFD_RWP) & ~vm_flags); + if (vma->vm_start > start) start =3D vma->vm_start; vma_end =3D min(end, vma->vm_end); @@ -2514,6 +2604,8 @@ static inline struct uffd_msg userfault_msg(unsigned = long address, msg.arg.pagefault.flags |=3D UFFD_PAGEFAULT_FLAG_WRITE; if (reason & VM_UFFD_WP) msg.arg.pagefault.flags |=3D UFFD_PAGEFAULT_FLAG_WP; + if (reason & VM_UFFD_RWP) + msg.arg.pagefault.flags |=3D UFFD_PAGEFAULT_FLAG_RWP; if (reason & VM_UFFD_MINOR) msg.arg.pagefault.flags |=3D UFFD_PAGEFAULT_FLAG_MINOR; if (features & UFFD_FEATURE_THREAD_ID) @@ -3593,6 +3685,22 @@ static int userfaultfd_register(struct userfaultfd_c= tx *ctx, =20 vm_flags |=3D VM_UFFD_WP; } + if (uffdio_register.mode & UFFDIO_REGISTER_MODE_RWP) { + if (!pgtable_supports_uffd() || VM_UFFD_RWP =3D=3D VM_NONE) + goto out; + if (!(ctx->features & UFFD_FEATURE_RWP)) + goto out; + vm_flags |=3D VM_UFFD_RWP; + } + + /* + * WP and RWP share the uffd PTE bit and + * cannot coexist in the same VMA =E2=80=94 the bit would carry ambiguous + * semantics. Reject the combination up front. 
+ */ + if ((vm_flags & VM_UFFD_WP) && (vm_flags & VM_UFFD_RWP)) + goto out; + if (uffdio_register.mode & UFFDIO_REGISTER_MODE_MINOR) { #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR goto out; @@ -3686,6 +3794,16 @@ static int userfaultfd_register(struct userfaultfd_c= tx *ctx, cur->vm_userfaultfd_ctx.ctx !=3D ctx) goto out_unlock; =20 + /* + * Mode switches that drop VM_UFFD_WP or VM_UFFD_RWP would + * leave PTE markers without the flag that describes them; + * subsequent mprotect() would then promote stale markers + * into the other mode. Require an unregister first. + */ + if (cur->vm_userfaultfd_ctx.ctx =3D=3D ctx && + cur->vm_flags & (VM_UFFD_WP | VM_UFFD_RWP) & ~vm_flags) + goto out_unlock; + /* * Note vmas containing huge pages */ @@ -3719,6 +3837,10 @@ static int userfaultfd_register(struct userfaultfd_c= tx *ctx, if (!(uffdio_register.mode & UFFDIO_REGISTER_MODE_MINOR)) ioctls_out &=3D ~((__u64)1 << _UFFDIO_CONTINUE); =20 + /* RWPROTECT is only supported for RWP ranges */ + if (!(uffdio_register.mode & UFFDIO_REGISTER_MODE_RWP)) + ioctls_out &=3D ~((__u64)1 << _UFFDIO_RWPROTECT); + /* * Now that we scanned all vmas we can already tell * userland which ioctls methods are guaranteed to @@ -4066,6 +4188,55 @@ static int userfaultfd_writeprotect(struct userfault= fd_ctx *ctx, return ret; } =20 +static int userfaultfd_rwprotect(struct userfaultfd_ctx *ctx, + unsigned long arg) +{ + int ret; + struct uffdio_rwprotect uffdio_rwp; + struct userfaultfd_wake_range range; + bool mode_rwp, mode_dontwake; + + if (atomic_read(&ctx->mmap_changing)) + return -EAGAIN; + + if (copy_from_user(&uffdio_rwp, (void __user *)arg, + sizeof(uffdio_rwp))) + return -EFAULT; + + ret =3D validate_range(ctx->mm, uffdio_rwp.range.start, + uffdio_rwp.range.len); + if (ret) + return ret; + + if (uffdio_rwp.mode & ~(UFFDIO_RWPROTECT_MODE_DONTWAKE | + UFFDIO_RWPROTECT_MODE_RWP)) + return -EINVAL; + + mode_rwp =3D uffdio_rwp.mode & UFFDIO_RWPROTECT_MODE_RWP; + mode_dontwake =3D uffdio_rwp.mode & 
UFFDIO_RWPROTECT_MODE_DONTWAKE; + + if (mode_rwp && mode_dontwake) + return -EINVAL; + + if (mmget_not_zero(ctx->mm)) { + ret =3D mrwprotect_range(ctx, uffdio_rwp.range.start, + uffdio_rwp.range.len, mode_rwp); + mmput(ctx->mm); + } else { + return -ESRCH; + } + + if (ret) + return ret; + + if (!mode_rwp && !mode_dontwake) { + range.start =3D uffdio_rwp.range.start; + range.len =3D uffdio_rwp.range.len; + wake_userfault(ctx, &range); + } + return ret; +} + static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long= arg) { __s64 ret; @@ -4372,6 +4543,9 @@ static long userfaultfd_ioctl(struct file *file, unsi= gned cmd, case UFFDIO_POISON: ret =3D userfaultfd_poison(ctx, arg); break; + case UFFDIO_RWPROTECT: + ret =3D userfaultfd_rwprotect(ctx, arg); + break; } return ret; } --=20 2.54.0 From nobody Mon Jun 8 23:56:07 2026 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-alma10-1.taild15c8.ts.net [100.103.45.18]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 59EFD37FF43; Mon, 25 May 2026 11:39:07 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=100.103.45.18 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779709149; cv=none; b=UDLC8gwMn4Iq+xPFlVxyC3fY7EE+9ef4JIz49pgZ+HSuUqdVVRU8nRD0cRyc3DRWrQw2v6T3Ila4tsF8yBlFpyUZ0ek604yNTNsiQcwN74PypZI4TD4dZvAehgoVOdYr+kk25VuSCPUrxUe9YhHAOaS9KiEIDd9bGVheuOPSSgM= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779709149; c=relaxed/simple; bh=NKLkFY4oqFNiOxUjUUiiCIHAC4/l518pOisYv3gMBHg=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=ScG7r4M+ALl8LDaVFMrRlI2eRUu2Ww/jOa1xBGsymEAF2a9lyOhk4uvXeM4V7UmtGxhmfstbrDIa6E9DXN9ePK+FzhpF9pp49eZSoFbPMHh/C1SLsFYWCtdkQPFrnT2Jdhb+ouTmnw+h4db59TnIQpQlIYZ7ATMePaajeX8Qmdc= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass 
(2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=lH5cJQmP; arc=none smtp.client-ip=100.103.45.18 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="lH5cJQmP" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 99E2C1F00A3A; Mon, 25 May 2026 11:39:06 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel.org; s=k20260515; t=1779709147; bh=UYFNxHBj+X9DzBRzdjiAmOuXk1cN5trQXnOJ5Ey2j8w=; h=From:To:Cc:Subject:Date:In-Reply-To:References; b=lH5cJQmPcVaM965/SY5fYob8+8tNzRtNhvKv2aNo7S5IqoIC5NCzXq9t361OWY4NU SMaOAKjIh0S08NwdKGs6XnF76wQaaRxnSYbOdNYAtyAGlzQrSsvw89MzFNCKcZ9OtF Qag2P2Qk2XZCYYhvMgJcpGBOShbobSxah/x4RqhS5MwGPsKuQ79nviIuG6X32QbmiD Q69HuzgrhmY9KpRfgPZas5VpKxDT3w+Cm0iH1CvAKDR4brRoyEHw7knvZ9Q14fXIYG ogpbeAHz9/2zvTohgSAIHU5azqnXfjf/sDE2HYRsFuw/EdBd2iHCtzeZG8XxsxyeDr AP8dGf4a6N57Q== Received: from phl-compute-03.internal (phl-compute-03.internal [10.202.2.43]) by mailfauth.phl.internal (Postfix) with ESMTP id 046D2F40082; Mon, 25 May 2026 07:39:06 -0400 (EDT) Received: from phl-frontend-03 ([10.202.2.162]) by phl-compute-03.internal (MEProxy); Mon, 25 May 2026 07:39:06 -0400 X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: dmFkZTFnTuEJhvyD+myNg4p794atgQHzdZOXs2A3gxvR3FSD7M0iqWwHWhOfEG1IgnscR2 dpWEnUQHTt04gBiZJ6FOYLwJWkueK3/2x7yPKsRIizx2ATY4cFZISTDZP9EiTwl/39cNmM OA0CsMrDRjxwva2a2lWcAgzMzYsZoGkumkodmX+5ok307LF0rRD0oLFOgPIpccglfJ7Cmv AURvPb/dWg6Eh/o3LACLo5CgxqvMM0m0aATVsjZcPfM5Eqk+WNKfRQ8GkDG8MvyLCL9P+n Ffx6yvOWdEN3fPIKUkuyF3PJ8RYaK5VmWWEYQaxtLNJTEh5sz7hSNo8ZFBfkwknbfQV6SR ZJG/EnGQXxPUd+ThT0wRe2z8HAL+SXTI7hO/F/eeJly9kq9yylfFG88SwAqzAwD9HHERme G8HfeJltgyPiJI1xtRmMnfUJBikdE64v2k24H+YipvRguofRyGj1JEcS5zsa3rIE9v+moT lrp54cePWta33SuKtNISKb73LMJJD9GOKep7qocolhnD5wwXEg5TwqXkChD+gFQhn9AHQl B2zgy61tB2MbbieFTu2lQtOfDMTWaiOmuvSA3YJ/QfUbtCX6K5lD5Lfis+Qkmf8nwaxdVA /jbvKAZkJQPhjydCFet30pKzX0mF3h8Ch9Ro/O/f8GXt6rcBL6DczNArHwYA X-ME-Proxy: Feedback-ID: i10464835:Fastmail Received: 
by mail.messagingengine.com (Postfix) with ESMTPA; Mon, 25 May 2026 07:39:04 -0400 (EDT)
From: "Kiryl Shutsemau (Meta)"
To: akpm@linux-foundation.org, rppt@kernel.org, peterx@redhat.com,
 david@kernel.org
Cc: ljs@kernel.org, surenb@google.com, vbabka@kernel.org,
 Liam.Howlett@oracle.com, ziy@nvidia.com, corbet@lwn.net,
 skhan@linuxfoundation.org, seanjc@google.com, pbonzini@redhat.com,
 jthoughton@google.com, aarcange@redhat.com, sj@kernel.org,
 usama.arif@linux.dev, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org,
 kvm@vger.kernel.org, kernel-team@meta.com, "Kiryl Shutsemau (Meta)"
Subject: [PATCH v4 09/14] mm/userfaultfd: add RWP fault delivery and expose UFFDIO_REGISTER_MODE_RWP
Date: Mon, 25 May 2026 12:37:23 +0100
Message-ID: <20260525113737.1942478-10-kas@kernel.org>
X-Mailer: git-send-email 2.54.0
In-Reply-To: <20260525113737.1942478-1-kas@kernel.org>
References: <20260525113737.1942478-1-kas@kernel.org>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-8"

Wire the fault side of read-write protection tracking and turn the
userspace interface on.

An RWP-protected PTE is PAGE_NONE with the uffd bit set. The PROT_NONE
triggers a fault on any access; the uffd bit distinguishes it from plain
mprotect(PROT_NONE) or NUMA hinting.

Fault dispatch, per level:

  PTE      handle_pte_fault()   -> do_uffd_rwp()
  PMD      __handle_mm_fault()  -> do_huge_pmd_uffd_rwp()
  hugetlb  hugetlb_fault()      -> hugetlb_handle_userfault()

The RWP branches gate on userfaultfd_pte_rwp() /
userfaultfd_huge_pmd_rwp() (VM_UFFD_RWP plus the uffd bit) and fall
through to do_numa_page() / do_huge_pmd_numa_page() otherwise. Each
delivers a UFFD_PAGEFAULT_FLAG_RWP message through handle_userfault();
the handler resolves it with UFFDIO_RWPROTECT clearing MODE_RWP.
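
The PTE/PMD gating described above can be sketched as a small decision
function. This is a userspace model only, not kernel code: the enum and
route_protnone_fault() are hypothetical names, and the four booleans
stand in for pte_protnone()/pmd_protnone(), the uffd bit, VM_UFFD_RWP
on the VMA, and vma_is_accessible(). The hugetlb path is not modelled,
since it has no NUMA-hinting fallback.

```c
#include <stdbool.h>

/* Hypothetical labels for where a fault on a present entry would go. */
enum fault_route {
	ROUTE_OTHER,     /* not a protnone fault on an accessible VMA */
	ROUTE_NUMA_HINT, /* do_numa_page() / do_huge_pmd_numa_page() */
	ROUTE_UFFD_RWP,  /* do_uffd_rwp() / do_huge_pmd_uffd_rwp() */
};

/*
 * Model of the PTE/PMD dispatch: a protnone entry on an accessible VMA
 * is an RWP fault only when the VMA is registered for RWP and the
 * entry carries the uffd bit; any other protnone entry is treated as
 * NUMA hinting.
 */
static enum fault_route route_protnone_fault(bool protnone, bool uffd_bit,
					     bool vma_rwp, bool vma_accessible)
{
	if (!protnone || !vma_accessible)
		return ROUTE_OTHER;
	if (vma_rwp && uffd_bit)
		return ROUTE_UFFD_RWP;
	return ROUTE_NUMA_HINT;
}
```

In this model a protnone entry without the uffd bit still routes to
NUMA hinting even inside an RWP-registered VMA, which matches the
fall-through described above.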
userfaultfd_must_wait() and userfaultfd_huge_must_wait() add matching protnone+uffd waiters so sync-mode fault handlers block correctly. Expose the UAPI: UFFDIO_REGISTER_MODE_RWP -> UFFD_API_REGISTER_MODES UFFD_FEATURE_RWP -> UFFD_API_FEATURES _UFFDIO_RWPROTECT -> UFFD_API_RANGE_IOCTLS UFFD_API_RANGE_IOCTLS_BASIC UFFD_FEATURE_RWP is masked out at UFFDIO_API time when PROT_NONE is not available or VM_UFFD_RWP aliases VM_NONE (32-bit), so userspace never sees an advertised-but-broken feature. Works on anonymous, shmem, and hugetlb memory. Signed-off-by: Kiryl Shutsemau Assisted-by: Claude:claude-opus-4-6 Reviewed-by: Mike Rapoport (Microsoft) --- include/linux/huge_mm.h | 7 +++++++ include/linux/userfaultfd_k.h | 24 ++++++++++++++++++++++++ include/uapi/linux/userfaultfd.h | 12 ++++++++---- mm/huge_memory.c | 5 +++++ mm/hugetlb.c | 11 +++++++++++ mm/memory.c | 21 +++++++++++++++++++-- mm/userfaultfd.c | 32 ++++++++++++++++++++++++++++++-- 7 files changed, 104 insertions(+), 8 deletions(-) diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index edece3e26985..fe48d76957fb 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -529,6 +529,8 @@ static inline bool folio_test_pmd_mappable(struct folio= *folio) =20 vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf); =20 +vm_fault_t do_huge_pmd_uffd_rwp(struct vm_fault *vmf); + vm_fault_t do_huge_pmd_device_private(struct vm_fault *vmf); =20 extern struct folio *huge_zero_folio; @@ -716,6 +718,11 @@ static inline spinlock_t *pud_trans_huge_lock(pud_t *p= ud, return NULL; } =20 +static inline vm_fault_t do_huge_pmd_uffd_rwp(struct vm_fault *vmf) +{ + return 0; +} + static inline vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf) { return 0; diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index f78d5d370d0a..d8f5f400c8ef 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -233,6 +233,18 @@ static inline bool 
userfaultfd_huge_pmd_wp(struct vm_a= rea_struct *vma, return userfaultfd_wp(vma) && pmd_uffd(pmd); } =20 +static inline bool userfaultfd_pte_rwp(struct vm_area_struct *vma, + pte_t pte) +{ + return userfaultfd_rwp(vma) && pte_uffd(pte); +} + +static inline bool userfaultfd_huge_pmd_rwp(struct vm_area_struct *vma, + pmd_t pmd) +{ + return userfaultfd_rwp(vma) && pmd_uffd(pmd); +} + static inline bool userfaultfd_armed(struct vm_area_struct *vma) { return vma->vm_flags & __VM_UFFD_FLAGS; @@ -363,6 +375,18 @@ static inline bool userfaultfd_huge_pmd_wp(struct vm_a= rea_struct *vma, return false; } =20 +static inline bool userfaultfd_pte_rwp(struct vm_area_struct *vma, + pte_t pte) +{ + return false; +} + +static inline bool userfaultfd_huge_pmd_rwp(struct vm_area_struct *vma, + pmd_t pmd) +{ + return false; +} + static inline bool userfaultfd_armed(struct vm_area_struct *vma) { return false; diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaul= tfd.h index 7b78aa3b5318..d803e76d47ad 100644 --- a/include/uapi/linux/userfaultfd.h +++ b/include/uapi/linux/userfaultfd.h @@ -25,7 +25,8 @@ #define UFFD_API ((__u64)0xAA) #define UFFD_API_REGISTER_MODES (UFFDIO_REGISTER_MODE_MISSING | \ UFFDIO_REGISTER_MODE_WP | \ - UFFDIO_REGISTER_MODE_MINOR) + UFFDIO_REGISTER_MODE_MINOR | \ + UFFDIO_REGISTER_MODE_RWP) #define UFFD_API_FEATURES (UFFD_FEATURE_PAGEFAULT_FLAG_WP | \ UFFD_FEATURE_EVENT_FORK | \ UFFD_FEATURE_EVENT_REMAP | \ @@ -42,7 +43,8 @@ UFFD_FEATURE_WP_UNPOPULATED | \ UFFD_FEATURE_POISON | \ UFFD_FEATURE_WP_ASYNC | \ - UFFD_FEATURE_MOVE) + UFFD_FEATURE_MOVE | \ + UFFD_FEATURE_RWP) #define UFFD_API_IOCTLS \ ((__u64)1 << _UFFDIO_REGISTER | \ (__u64)1 << _UFFDIO_UNREGISTER | \ @@ -54,13 +56,15 @@ (__u64)1 << _UFFDIO_MOVE | \ (__u64)1 << _UFFDIO_WRITEPROTECT | \ (__u64)1 << _UFFDIO_CONTINUE | \ - (__u64)1 << _UFFDIO_POISON) + (__u64)1 << _UFFDIO_POISON | \ + (__u64)1 << _UFFDIO_RWPROTECT) #define UFFD_API_RANGE_IOCTLS_BASIC \ ((__u64)1 << _UFFDIO_WAKE | \ 
(__u64)1 << _UFFDIO_COPY | \ (__u64)1 << _UFFDIO_WRITEPROTECT | \ (__u64)1 << _UFFDIO_CONTINUE | \ - (__u64)1 << _UFFDIO_POISON) + (__u64)1 << _UFFDIO_POISON | \ + (__u64)1 << _UFFDIO_RWPROTECT) =20 /* * Valid ioctl command number range with this API is from 0x00 to diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 8620ba92263f..cd32bd51e311 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2289,6 +2289,11 @@ static inline bool can_change_pmd_writable(struct vm= _area_struct *vma, return pmd_dirty(pmd); } =20 +vm_fault_t do_huge_pmd_uffd_rwp(struct vm_fault *vmf) +{ + return handle_userfault(vmf, VM_UFFD_RWP); +} + /* NUMA hinting page fault entry point for trans huge pmds */ vm_fault_t do_huge_pmd_numa_page(struct vm_fault *vmf) { diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 8555810cd42e..f63718296cc2 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -6062,6 +6062,17 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struc= t vm_area_struct *vma, goto out_mutex; } =20 + /* + * Protnone hugetlb PTEs with the uffd bit are used by + * userfaultfd RWP for access tracking. Plain PROT_NONE (without the + * marker) is not an RWP fault and is not expected on hugetlb (no + * NUMA hinting), so let normal hugetlb fault handling proceed. + */ + if (pte_protnone(vmf.orig_pte) && vma_is_accessible(vma) && + userfaultfd_rwp(vma) && huge_pte_uffd(vmf.orig_pte)) { + return hugetlb_handle_userfault(&vmf, mapping, VM_UFFD_RWP); + } + /* * If we are going to COW/unshare the mapping later, we examine the * pending reservations for this page now. 
This will ensure that any diff --git a/mm/memory.c b/mm/memory.c index e4ae5350db41..3e393881031d 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -6135,6 +6135,12 @@ static void numa_rebuild_large_mapping(struct vm_fau= lt *vmf, struct vm_area_stru } } =20 +static vm_fault_t do_uffd_rwp(struct vm_fault *vmf) +{ + pte_unmap(vmf->pte); + return handle_userfault(vmf, VM_UFFD_RWP); +} + static vm_fault_t do_numa_page(struct vm_fault *vmf) { struct vm_area_struct *vma =3D vmf->vma; @@ -6410,8 +6416,16 @@ static vm_fault_t handle_pte_fault(struct vm_fault *= vmf) if (!pte_present(vmf->orig_pte)) return do_swap_page(vmf); =20 - if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma)) + if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma)) { + /* + * RWP-protected PTEs are protnone plus the uffd bit. On a + * VM_UFFD_RWP VMA, a protnone PTE without the uffd bit is + * NUMA hinting and must still fall through to do_numa_page(). + */ + if (userfaultfd_pte_rwp(vmf->vma, vmf->orig_pte)) + return do_uffd_rwp(vmf); return do_numa_page(vmf); + } =20 spin_lock(vmf->ptl); entry =3D vmf->orig_pte; @@ -6525,8 +6539,11 @@ static vm_fault_t __handle_mm_fault(struct vm_area_s= truct *vma, return 0; } if (pmd_trans_huge(vmf.orig_pmd)) { - if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma)) + if (pmd_protnone(vmf.orig_pmd) && vma_is_accessible(vma)) { + if (userfaultfd_huge_pmd_rwp(vma, vmf.orig_pmd)) + return do_huge_pmd_uffd_rwp(&vmf); return do_huge_pmd_numa_page(&vmf); + } =20 if ((flags & (FAULT_FLAG_WRITE|FAULT_FLAG_UNSHARE)) && !pmd_write(vmf.orig_pmd)) { diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 78eb63702649..b966df47800c 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -2650,6 +2650,12 @@ static inline bool userfaultfd_huge_must_wait(struct= userfaultfd_ctx *ctx, */ if (!huge_pte_write(pte) && (reason & VM_UFFD_WP)) return true; + /* + * PTE is still RW-protected (protnone with uffd bit), wait for + * resolution. 
Plain PROT_NONE without the marker is not an RWP fault. + */ + if (pte_protnone(pte) && huge_pte_uffd(pte) && (reason & VM_UFFD_RWP)) + return true; =20 return false; } @@ -2710,8 +2716,14 @@ static inline bool userfaultfd_must_wait(struct user= faultfd_ctx *ctx, if (!pmd_present(_pmd)) return false; =20 - if (pmd_trans_huge(_pmd)) - return !pmd_write(_pmd) && (reason & VM_UFFD_WP); + if (pmd_trans_huge(_pmd)) { + if (!pmd_write(_pmd) && (reason & VM_UFFD_WP)) + return true; + if (pmd_protnone(_pmd) && pmd_uffd(_pmd) && + (reason & VM_UFFD_RWP)) + return true; + return false; + } =20 pte =3D pte_offset_map(pmd, address); if (!pte) @@ -2736,6 +2748,13 @@ static inline bool userfaultfd_must_wait(struct user= faultfd_ctx *ctx, */ if (!pte_write(ptent) && (reason & VM_UFFD_WP)) goto out; + /* + * PTE is still RW-protected (protnone with uffd bit), wait for + * userspace to resolve. Plain PROT_NONE without the marker is not + * an RWP fault. + */ + if (pte_protnone(ptent) && pte_uffd(ptent) && (reason & VM_UFFD_RWP)) + goto out; =20 ret =3D false; out: @@ -4477,6 +4496,15 @@ static int userfaultfd_api(struct userfaultfd_ctx *c= tx, uffdio_api.features &=3D ~UFFD_FEATURE_WP_UNPOPULATED; uffdio_api.features &=3D ~UFFD_FEATURE_WP_ASYNC; } + /* + * RWP needs both PROT_NONE support and the uffd-wp PTE bit. The + * VM_UFFD_RWP check covers compile-time unavailability; the + * pgtable_supports_uffd() check covers runtime (e.g. riscv + * without the SVRSW60T59B extension) where the PTE bit is declared + * but not actually usable. 
+ */ + if (VM_UFFD_RWP =3D=3D VM_NONE || !pgtable_supports_uffd()) + uffdio_api.features &=3D ~UFFD_FEATURE_RWP; =20 ret =3D -EINVAL; if (features & ~uffdio_api.features) --=20 2.54.0 From nobody Mon Jun 8 23:56:07 2026 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-alma10-1.taild15c8.ts.net [100.103.45.18]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 40DDC3806AC; Mon, 25 May 2026 11:39:13 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=100.103.45.18 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779709154; cv=none; b=J9NJ8COMPQxHyQGm3odJ9YAyqfZPAYJgYatXar2OYwuBGQlYcHcXaXPRT/cg8LHFCUYeqZmOHRZLH3gflST8H+QcBrEIJf6R2UiwD+UL5sSjubDgVh8AirwSPdFOzJXqk9lUhnMy5v2xsYClryv3KKNzk+pkH0lWhkSPecj2w/g= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779709154; c=relaxed/simple; bh=cgUKeDh0U3noHm2VzcyyAI5TawD4qN1P3cWwZgXDIok=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=k4BfQ8PXyXfhZkmLXzYHqDraZDbC1owGFKG4OD3Wc46d4UCuqXj5Cpj4+2E0YP2x4EejXO6niQUMXzaxD/DPL7iQ2WHOlyL3N74xVbVDgFAqcm/ocpVhx+8cDBJBMInIvT5BNoAnym0XJZ99/SAQAhoLSDTUrZhhwpi9Y2OVvrg= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=auce5c+1; arc=none smtp.client-ip=100.103.45.18 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="auce5c+1" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 475B51F000E9; Mon, 25 May 2026 11:39:12 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel.org; s=k20260515; t=1779709153; bh=AP1OQJh4VME3VhcFR5nWryP6+e9wX0Pv2VjfuiXGvp4=; h=From:To:Cc:Subject:Date:In-Reply-To:References; b=auce5c+1bAVlr/Lb+GTwY/nNXzFVaZjzfbIedLiW+6rNS8GDYGRNhpUhAvMOq/u4T 
obQ1Gn/pwGO5u+kIL67qcJGKdBKb75YfuWj7iiAyX6BtFoxnbTNAIV6d910hn5unvX Sc+3rPLNpt6tsCiq0NnaZMWwiEIjndsUNhzMhQe5Sl3D6oLBxopsPQy6VpZm4hMC3g EdMXeqDlgcJAGYERrtukmakK+8Ec6wcGtPEjclIkVTFfvXs1tRKzWOx5j1cGTORa/U 54L0lG4Ap5FqrxWOLTUMeHN2CMtC4yuHiBlTwqPRFczqijB+4vTktjNkQLK+grpvtc VppF6MWEfOd2w== Received: from phl-compute-04.internal (phl-compute-04.internal [10.202.2.44]) by mailfauth.phl.internal (Postfix) with ESMTP id A5613F40082; Mon, 25 May 2026 07:39:11 -0400 (EDT) Received: from phl-frontend-04 ([10.202.2.163]) by phl-compute-04.internal (MEProxy); Mon, 25 May 2026 07:39:11 -0400 X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: dmFkZTEfoRdVBJYWBUADV7d9kIV+Nxm/pVUcgu2VEqiWnj8eQrxuVRmWJvpc9q3Oqv2JCI in3fN4Zd8kb+AIzxXT1aE8XhYzXjllPczHZpsDuave1RqUEJt2aM8NK4F+1xGB1lHbYgSB 4+lR3LF0kMycZigQtO5w6VfLN0Aex+LApk+NDe4mN/AQtgEH82UY4yLdpanr1nUEhfYRPX qXaeB8tr6YycPJO/GKcHBfWh4XfvSwFXkiL84UUAHkAJqW6Bc0LpWOIYef91VRiFvO85Cg Xd8jmXx4jYBCkCANHtSsVLhpCbXIiIojLruOFr51QxylDrx8yfg2/g0kqIsUfNxsLFZyOU 3RrUT5nIDeRZpNvsTWcFpr7rGsBwuIoOQCf2HRU39VtHyP2sFvtytEkvkniYVobYJyQa/W WOCpBX7J02APixnIAWiyEVh8PQ0YOVDsZ+/BYFLl0sGUyceYds+vcT5oRhwHSw5N04CBnJ Ec0J0YNaB4Uo8ddpNOenTZL3ESheTCHLPulUAvG5Rf93FAF0BRmzNjNRtu+8yjLF70Wl76 1VzHMXRcbSnv8wQwvVqyINrDUw2lVXeGPoSAOGFSxh2wdB6nF0oYBZStdnnrmf67rpLtk9 bWtUBzNd70dT3zbA1HDTe9N3NvcMjJp919CXc05ghVOvAf4bNYq2d3Ms/phA X-ME-Proxy: Feedback-ID: i10464835:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Mon, 25 May 2026 07:39:09 -0400 (EDT) From: "Kiryl Shutsemau (Meta)" To: akpm@linux-foundation.org, rppt@kernel.org, peterx@redhat.com, david@kernel.org Cc: ljs@kernel.org, surenb@google.com, vbabka@kernel.org, Liam.Howlett@oracle.com, ziy@nvidia.com, corbet@lwn.net, skhan@linuxfoundation.org, seanjc@google.com, pbonzini@redhat.com, jthoughton@google.com, aarcange@redhat.com, sj@kernel.org, usama.arif@linux.dev, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org, kvm@vger.kernel.org, 
kernel-team@meta.com, "Kiryl Shutsemau (Meta)"
Subject: [PATCH v4 10/14] mm/pagemap: add PAGE_IS_ACCESSED for RWP tracking
Date: Mon, 25 May 2026 12:37:24 +0100
Message-ID: <20260525113737.1942478-11-kas@kernel.org>
X-Mailer: git-send-email 2.54.0
In-Reply-To: <20260525113737.1942478-1-kas@kernel.org>
References: <20260525113737.1942478-1-kas@kernel.org>
Precedence: bulk
X-Mailing-List: linux-kernel@vger.kernel.org
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

PAGEMAP_SCAN already reports PAGE_IS_WRITTEN from the inverted uffd PTE
bit, targeting the UFFDIO_WRITEPROTECT workflow. UFFDIO_RWPROTECT reuses
the same PTE bit as a marker for read-write protection, but "has been
written" and "has been accessed" are distinct semantic signals; they
happen to share one PTE bit today only because the two implementations
share infrastructure.

Give RWP its own pagemap category so the UAPI does not conflate them:

  PAGE_IS_WRITTEN   reported on VM_UFFD_WP VMAs,  !pte_uffd(pte)
  PAGE_IS_ACCESSED  reported on VM_UFFD_RWP VMAs, !pte_uffd(pte)

Both still read the same PTE bit today, but each is scoped to the VMA
whose registered mode makes the bit meaningful. If a future
implementation moves RWP to a separate PTE bit, only PAGE_IS_ACCESSED
switches over.

This is a UAPI narrowing. Outside VM_UFFD_WP VMAs the uffd bit is always
clear, so PAGEMAP_SCAN used to flag PAGE_IS_WRITTEN on every present PTE
there, a meaningless duplicate of PAGE_IS_PRESENT. Now PAGE_IS_WRITTEN
fires only inside VM_UFFD_WP VMAs.

pagemap_hugetlb_category() now takes the vma like its PTE/PMD peers.
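
The scoping rule above can be sketched as a standalone function. This
is an illustrative userspace model, not the fs/proc/task_mmu.c code:
the CAT_* constants are stand-ins with arbitrary values (they are not
the UAPI PAGE_IS_* values), and present_pte_categories() is a
hypothetical name. The booleans model pte_uffd(), userfaultfd_wp(vma)
and userfaultfd_rwp(vma).

```c
#include <stdbool.h>

/* Stand-in category bits for the model; values are arbitrary. */
#define CAT_PRESENT  (1u << 0)
#define CAT_WRITTEN  (1u << 1)
#define CAT_ACCESSED (1u << 2)

/*
 * Model of the category computation for a present PTE: a clear uffd
 * bit is reported as "written" only on a WP-registered VMA and as
 * "accessed" only on an RWP-registered VMA. Outside both, a clear
 * bit carries no extra category.
 */
static unsigned int present_pte_categories(bool uffd_bit, bool vma_wp,
					   bool vma_rwp)
{
	unsigned int categories = CAT_PRESENT;

	if (!uffd_bit) {
		if (vma_wp)
			categories |= CAT_WRITTEN;
		if (vma_rwp)
			categories |= CAT_ACCESSED;
	}
	return categories;
}
```

The first case below shows the narrowing: with neither mode registered,
a present PTE with a clear uffd bit reports only the present category,
where the old behaviour would also have flagged it as written.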
Signed-off-by: Kiryl Shutsemau Assisted-by: Claude:claude-opus-4-6 Acked-by: Mike Rapoport (Microsoft) --- Documentation/admin-guide/mm/pagemap.rst | 13 ++++- fs/proc/task_mmu.c | 73 ++++++++++++++++++------ include/uapi/linux/fs.h | 1 + tools/include/uapi/linux/fs.h | 1 + 4 files changed, 67 insertions(+), 21 deletions(-) diff --git a/Documentation/admin-guide/mm/pagemap.rst b/Documentation/admin= -guide/mm/pagemap.rst index c57e61b5d8aa..ffa690a171c8 100644 --- a/Documentation/admin-guide/mm/pagemap.rst +++ b/Documentation/admin-guide/mm/pagemap.rst @@ -19,8 +19,11 @@ There are four components to pagemap: * Bit 55 pte is soft-dirty (see Documentation/admin-guide/mm/soft-dirty.rst) * Bit 56 page exclusively mapped (since 4.2) - * Bit 57 pte is uffd-wp write-protected (since 5.13) (see - Documentation/admin-guide/mm/userfaultfd.rst) + * Bit 57 pte is tracked by userfaultfd (since 5.13) =E2=80=94 in a + ``VM_UFFD_WP`` VMA this indicates a write-protected PTE; in a + ``VM_UFFD_RWP`` VMA it indicates an RWP-protected PTE. WP and + RWP are mutually exclusive per VMA, so the meaning is + unambiguous. See Documentation/admin-guide/mm/userfaultfd.rst. * Bit 58 pte is a guard region (since 6.15) (see madvise (2) man p= age) * Bits 59-60 zero * Bit 61 page is file-page or shared-anon (since 3.5) @@ -244,7 +247,8 @@ in this IOCTL: Following flags about pages are currently supported: =20 - ``PAGE_IS_WPALLOWED`` - Page has async-write-protection enabled -- ``PAGE_IS_WRITTEN`` - Page has been written to from the time it was writ= e protected +- ``PAGE_IS_WRITTEN`` - Page in a ``UFFDIO_REGISTER_MODE_WP`` VMA has been + written to since it was write-protected. Only reported inside such VMAs. 
- ``PAGE_IS_FILE`` - Page is file backed - ``PAGE_IS_PRESENT`` - Page is present in the memory - ``PAGE_IS_SWAPPED`` - Page is in swapped @@ -252,6 +256,9 @@ Following flags about pages are currently supported: - ``PAGE_IS_HUGE`` - Page is PMD-mapped THP or Hugetlb backed - ``PAGE_IS_SOFT_DIRTY`` - Page is soft-dirty - ``PAGE_IS_GUARD`` - Page is a part of a guard region +- ``PAGE_IS_ACCESSED`` - Page in a ``UFFDIO_REGISTER_MODE_RWP`` VMA has be= en + accessed since RWP was applied. Only reported inside such VMAs. See + Documentation/admin-guide/mm/userfaultfd.rst for the RWP workflow. =20 The ``struct pm_scan_arg`` is used as the argument of the IOCTL. =20 diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c index 5e74dadfb1cb..97fb941871a3 100644 --- a/fs/proc/task_mmu.c +++ b/fs/proc/task_mmu.c @@ -2284,7 +2284,7 @@ static const struct mm_walk_ops pagemap_ops =3D { * Bits 5-54 swap offset if swapped * Bit 55 pte is soft-dirty (see Documentation/admin-guide/mm/soft-dir= ty.rst) * Bit 56 page exclusively mapped - * Bit 57 pte is uffd-wp write-protected + * Bit 57 pte is tracked by userfaultfd (uffd-wp or RWP) * Bit 58 pte is a guard region * Bits 59-60 zero * Bit 61 page is file-page or shared-anon @@ -2419,7 +2419,7 @@ static int pagemap_release(struct inode *inode, struc= t file *file) PAGE_IS_FILE | PAGE_IS_PRESENT | \ PAGE_IS_SWAPPED | PAGE_IS_PFNZERO | \ PAGE_IS_HUGE | PAGE_IS_SOFT_DIRTY | \ - PAGE_IS_GUARD) + PAGE_IS_GUARD | PAGE_IS_ACCESSED) #define PM_SCAN_FLAGS (PM_SCAN_WP_MATCHING | PM_SCAN_CHECK_WPASYNC) =20 struct pagemap_scan_private { @@ -2444,8 +2444,12 @@ static unsigned long pagemap_page_category(struct pa= gemap_scan_private *p, =20 categories =3D PAGE_IS_PRESENT; =20 - if (!pte_uffd(pte)) - categories |=3D PAGE_IS_WRITTEN; + if (!pte_uffd(pte)) { + if (userfaultfd_wp(vma)) + categories |=3D PAGE_IS_WRITTEN; + if (userfaultfd_rwp(vma)) + categories |=3D PAGE_IS_ACCESSED; + } =20 if (p->masks_of_interest & PAGE_IS_FILE) { page =3D 
vm_normal_page(vma, addr, pte); @@ -2462,8 +2466,12 @@ static unsigned long pagemap_page_category(struct pa= gemap_scan_private *p, =20 categories =3D PAGE_IS_SWAPPED; =20 - if (!pte_swp_uffd_any(pte)) - categories |=3D PAGE_IS_WRITTEN; + if (!pte_swp_uffd_any(pte)) { + if (userfaultfd_wp(vma)) + categories |=3D PAGE_IS_WRITTEN; + if (userfaultfd_rwp(vma)) + categories |=3D PAGE_IS_ACCESSED; + } =20 entry =3D softleaf_from_pte(pte); if (softleaf_is_guard_marker(entry)) @@ -2512,8 +2520,12 @@ static unsigned long pagemap_thp_category(struct pag= emap_scan_private *p, struct page *page; =20 categories |=3D PAGE_IS_PRESENT; - if (!pmd_uffd(pmd)) - categories |=3D PAGE_IS_WRITTEN; + if (!pmd_uffd(pmd)) { + if (userfaultfd_wp(vma)) + categories |=3D PAGE_IS_WRITTEN; + if (userfaultfd_rwp(vma)) + categories |=3D PAGE_IS_ACCESSED; + } =20 if (p->masks_of_interest & PAGE_IS_FILE) { page =3D vm_normal_page_pmd(vma, addr, pmd); @@ -2527,8 +2539,12 @@ static unsigned long pagemap_thp_category(struct pag= emap_scan_private *p, categories |=3D PAGE_IS_SOFT_DIRTY; } else { categories |=3D PAGE_IS_SWAPPED; - if (!pmd_swp_uffd(pmd)) - categories |=3D PAGE_IS_WRITTEN; + if (!pmd_swp_uffd(pmd)) { + if (userfaultfd_wp(vma)) + categories |=3D PAGE_IS_WRITTEN; + if (userfaultfd_rwp(vma)) + categories |=3D PAGE_IS_ACCESSED; + } if (pmd_swp_soft_dirty(pmd)) categories |=3D PAGE_IS_SOFT_DIRTY; =20 @@ -2561,7 +2577,8 @@ static void make_uffd_wp_pmd(struct vm_area_struct *v= ma, #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ =20 #ifdef CONFIG_HUGETLB_PAGE -static unsigned long pagemap_hugetlb_category(pte_t pte) +static unsigned long pagemap_hugetlb_category(struct vm_area_struct *vma, + pte_t pte) { unsigned long categories =3D PAGE_IS_HUGE; =20 @@ -2576,8 +2593,12 @@ static unsigned long pagemap_hugetlb_category(pte_t = pte) if (pte_present(pte)) { categories |=3D PAGE_IS_PRESENT; =20 - if (!huge_pte_uffd(pte)) - categories |=3D PAGE_IS_WRITTEN; + if (!huge_pte_uffd(pte)) { + if 
(userfaultfd_wp(vma)) + categories |=3D PAGE_IS_WRITTEN; + if (userfaultfd_rwp(vma)) + categories |=3D PAGE_IS_ACCESSED; + } if (!PageAnon(pte_page(pte))) categories |=3D PAGE_IS_FILE; if (is_zero_pfn(pte_pfn(pte))) @@ -2587,8 +2608,12 @@ static unsigned long pagemap_hugetlb_category(pte_t = pte) } else { categories |=3D PAGE_IS_SWAPPED; =20 - if (!pte_swp_uffd_any(pte)) - categories |=3D PAGE_IS_WRITTEN; + if (!pte_swp_uffd_any(pte)) { + if (userfaultfd_wp(vma)) + categories |=3D PAGE_IS_WRITTEN; + if (userfaultfd_rwp(vma)) + categories |=3D PAGE_IS_ACCESSED; + } if (pte_swp_soft_dirty(pte)) categories |=3D PAGE_IS_SOFT_DIRTY; } @@ -2673,6 +2698,16 @@ static int pagemap_scan_test_walk(unsigned long star= t, unsigned long end, bool wp_allowed =3D userfaultfd_wp_async(vma) && userfaultfd_wp_use_markers(vma); =20 + /* + * PM_SCAN_WP_MATCHING is the atomic read-and-reset flavour of the + * scan and is implemented for the WP marker only. Reject it on + * VM_UFFD_RWP VMAs explicitly so userspace gets a clear error + * instead of a silently-skipped range; re-arming is done with + * UFFDIO_RWPROTECT(MODE_RWP). + */ + if (userfaultfd_rwp(vma) && (p->arg.flags & PM_SCAN_WP_MATCHING)) + return -EINVAL; + if (!wp_allowed) { /* User requested explicit failure over wp-async capability */ if (p->arg.flags & PM_SCAN_CHECK_WPASYNC) @@ -2860,7 +2895,8 @@ static int pagemap_scan_pmd_entry(pmd_t *pmd, unsigne= d long start, goto flush_and_return; } =20 - if (!p->arg.category_anyof_mask && !p->arg.category_inverted && + if (userfaultfd_wp(vma) && !p->arg.category_anyof_mask && + !p->arg.category_inverted && p->arg.category_mask =3D=3D PAGE_IS_WRITTEN && p->arg.return_mask =3D=3D PAGE_IS_WRITTEN) { for (addr =3D start; addr < end; pte++, addr +=3D PAGE_SIZE) { @@ -2935,7 +2971,8 @@ static int pagemap_scan_hugetlb_entry(pte_t *ptep, un= signed long hmask, /* Go the short route when not write-protecting pages. 
*/ =20 pte =3D huge_ptep_get(walk->mm, start, ptep); - categories =3D p->cur_vma_category | pagemap_hugetlb_category(pte); + categories =3D p->cur_vma_category | + pagemap_hugetlb_category(vma, pte); =20 if (!pagemap_scan_is_interesting_page(categories, p)) return 0; @@ -2947,7 +2984,7 @@ static int pagemap_scan_hugetlb_entry(pte_t *ptep, un= signed long hmask, ptl =3D huge_pte_lock(hstate_vma(vma), vma->vm_mm, ptep); =20 pte =3D huge_ptep_get(walk->mm, start, ptep); - categories =3D p->cur_vma_category | pagemap_hugetlb_category(pte); + categories =3D p->cur_vma_category | pagemap_hugetlb_category(vma, pte); =20 if (!pagemap_scan_is_interesting_page(categories, p)) goto out_unlock; diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h index 13f71202845e..c4aeaa0c31c7 100644 --- a/include/uapi/linux/fs.h +++ b/include/uapi/linux/fs.h @@ -455,6 +455,7 @@ typedef int __bitwise __kernel_rwf_t; #define PAGE_IS_HUGE (1 << 6) #define PAGE_IS_SOFT_DIRTY (1 << 7) #define PAGE_IS_GUARD (1 << 8) +#define PAGE_IS_ACCESSED (1 << 9) =20 /* * struct page_region - Page region with flags diff --git a/tools/include/uapi/linux/fs.h b/tools/include/uapi/linux/fs.h index 24ddf7bc4f25..f0a26309b6d5 100644 --- a/tools/include/uapi/linux/fs.h +++ b/tools/include/uapi/linux/fs.h @@ -364,6 +364,7 @@ typedef int __bitwise __kernel_rwf_t; #define PAGE_IS_HUGE (1 << 6) #define PAGE_IS_SOFT_DIRTY (1 << 7) #define PAGE_IS_GUARD (1 << 8) +#define PAGE_IS_ACCESSED (1 << 9) =20 /* * struct page_region - Page region with flags --=20 2.54.0 From nobody Mon Jun 8 23:56:07 2026 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-alma10-1.taild15c8.ts.net [100.103.45.18]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id CE92E37FF51; Mon, 25 May 2026 11:39:14 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=100.103.45.18 ARC-Seal: i=1; 
a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779709156; cv=none; b=dp+uJn6tcJDY6wMbPwXPkNuadN+mNxXCtwOM1PCeo6W1aJuFkOYVSWoWOIj+kQ63+rRxWD70bshngUEgSyZUxWUD3a5elqJzxPbDdGciEgBuO00sYunthQHoTVY/RhBFPNT7xZof1mNs4AmpvB8RpyCv4WK9yyztYMRMqF9qYGQ= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779709156; c=relaxed/simple; bh=CpJQe5C+A/HE9ZnfyPuS+tt2Xm4ZmYgNeNYYYBBHrWY=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=OUd7rX+YXTrKezr9c4fisaf/EflhtcHcQbsu5jiW+ISiPmeHzJ/O5/HJkfKub5PNFbpQYTf4Fjq0RIqX45mTvjfV274OYgRH5Qbj0FDUczSIulS0l6drCUrK7whjrGiqsP+hGcA42exwFKhiFG9ju8rpKMqZJHKWT6hFPFNKB8c= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=mZE7sTR3; arc=none smtp.client-ip=100.103.45.18 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="mZE7sTR3" Received: by smtp.kernel.org (Postfix) with ESMTPSA id CAAC31F00A3E; Mon, 25 May 2026 11:39:13 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel.org; s=k20260515; t=1779709154; bh=lCGr5XWCC6+imV6TUBSzXRf560ZlDCIsQjlB2eChBBs=; h=From:To:Cc:Subject:Date:In-Reply-To:References; b=mZE7sTR3XSOWaYUezwS97BOjQy86emLDG7x2Ipc/x0Z5MPoMa4zlH4HahEpI92yw7 6XxJ9AjDowozm1C3bLydbuGauzWwJaHDUtB6iEDyNE95QLU2W8spWIfMexB5A/chJo RLbc6i5el5/S7xzErYApeeLWwdhPqwL54QZCzUCoF5Eb5DkIT0Y9EaSWHYosr8PI0J cX3FNyUrKigscRyrnMSg/uMWL2wQ3T3XfSQzqfKVQerCdBhhDCSJru3CvdJzek1rBS J+NvO2Bm3Thl8rc91+6gj9Xcv54mzYsr+Rp6aSSPowTMDanoO/a4ZgZixWd56gmyYi DbxQvAh4dMSVA== Received: from phl-compute-03.internal (phl-compute-03.internal [10.202.2.43]) by mailfauth.phl.internal (Postfix) with ESMTP id 35549F40085; Mon, 25 May 2026 07:39:13 -0400 (EDT) Received: from phl-frontend-03 ([10.202.2.162]) by phl-compute-03.internal (MEProxy); Mon, 25 May 2026 07:39:13 -0400 X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: 
dmFkZTE63z7Xr862URFTA+sRkbbWHzxlgalCSEhhNVgHzskp81KGOwvsn/jXJEp/u3k61f TFIEBvaujjjH3eJ6mZELD7tk9E825mNMYD8C27DqqXF/qXnX+dpUrYKsw7bOabDmRE2JNk w+koB9Xvv465LvXG0jkirzGdghO6WF8fUKxXa7niMd2i8oC4DUvZusWovo9fww0ymMbY+w EpmEfSpqgCkzsSYUrzkxgBYKe5HluhaWQMPb6+nZGJhjp54N25KObppup8JOiNuoOM8NuA CGqxQDUSfVDoRQ0zg7DBJlZixuBLr2HprGw2v6vMJ7mH4XRXAQQ+vMdrlMih3UVuJzuxhT qt36uCXSlhzvec3+yrUwK460fzuWZZPD9cb8bGvqiveuoD09MfMTVmsi3gz/Pu09IHnRk6 M9akhwulE0Yl9TMpln8orCLIihNiSHqyC72Oo22IRQyZtKBsE/eEBvt5b6rwazh5A5iV8+ 0ZzQ+S/xwzbcmhv4ITkSnla2595Hvg8KaTRqZjeiR6h6KT1pBjT8JV9lylApH4gOQDo1D0 ASOgRulwXfyHS5ieGrZ3j8W5LmNOgHDLVTGX3QbBUuQ/IbWf7se82SdFaQRemTAend0kYD pjHsW1ACAdnMjGYMK5ojEL54kAheeMAJSxpnxUHPuZjut+8TClNORA7o+WkA X-ME-Proxy: Feedback-ID: i10464835:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Mon, 25 May 2026 07:39:12 -0400 (EDT) From: "Kiryl Shutsemau (Meta)" To: akpm@linux-foundation.org, rppt@kernel.org, peterx@redhat.com, david@kernel.org Cc: ljs@kernel.org, surenb@google.com, vbabka@kernel.org, Liam.Howlett@oracle.com, ziy@nvidia.com, corbet@lwn.net, skhan@linuxfoundation.org, seanjc@google.com, pbonzini@redhat.com, jthoughton@google.com, aarcange@redhat.com, sj@kernel.org, usama.arif@linux.dev, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org, kvm@vger.kernel.org, kernel-team@meta.com, "Kiryl Shutsemau (Meta)" Subject: [PATCH v4 11/14] userfaultfd: add UFFD_FEATURE_RWP_ASYNC for async fault resolution Date: Mon, 25 May 2026 12:37:25 +0100 Message-ID: <20260525113737.1942478-12-kas@kernel.org> X-Mailer: git-send-email 2.54.0 In-Reply-To: <20260525113737.1942478-1-kas@kernel.org> References: <20260525113737.1942478-1-kas@kernel.org> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Sync RWP delivers a message and blocks the faulting thread until 
the handler resolves the fault. For working-set tracking the VMM does not need the message: it just needs to know, at scan time, which pages were touched. Async RWP serves that use case =E2=80=94 the kernel restores access in-place and the faulting thread continues without blocking. The VMM reconstructs the access pattern after the fact via PAGEMAP_SCAN: pages whose uffd bit is still set (inverted PAGE_IS_ACCESSED) were not re-accessed since the last RWP cycle. Worth calling out: async resolution upgrades writable private anon PTEs via pte_mkwrite() when can_change_pte_writable() allows, mirroring do_numa_page(). Without it, every re-access of an RWP'd writable page would COW-fault a second time. UFFD_FEATURE_RWP_ASYNC requires UFFD_FEATURE_RWP. Signed-off-by: Kiryl Shutsemau Assisted-by: Claude:claude-opus-4-6 Acked-by: Mike Rapoport (Microsoft) --- include/linux/userfaultfd_k.h | 6 ++++++ include/uapi/linux/userfaultfd.h | 11 ++++++++++- mm/huge_memory.c | 25 ++++++++++++++++++++++++- mm/hugetlb.c | 32 +++++++++++++++++++++++++++++++- mm/memory.c | 27 +++++++++++++++++++++++++-- mm/userfaultfd.c | 19 ++++++++++++++++++- 6 files changed, 114 insertions(+), 6 deletions(-) diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h index d8f5f400c8ef..43b2fb587ce3 100644 --- a/include/linux/userfaultfd_k.h +++ b/include/linux/userfaultfd_k.h @@ -278,6 +278,7 @@ extern void userfaultfd_unmap_complete(struct mm_struct= *mm, struct list_head *uf); extern bool userfaultfd_wp_unpopulated(struct vm_area_struct *vma); extern bool userfaultfd_wp_async(struct vm_area_struct *vma); +extern bool userfaultfd_rwp_async(struct vm_area_struct *vma); =20 static inline bool userfaultfd_wp_use_markers(struct vm_area_struct *vma) { @@ -456,6 +457,11 @@ static inline bool userfaultfd_wp_async(struct vm_area= _struct *vma) return false; } =20 +static inline bool userfaultfd_rwp_async(struct vm_area_struct *vma) +{ + return false; +} + static inline bool 
vma_has_uffd_without_event_remap(struct vm_area_struct = *vma) { return false; diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaul= tfd.h index d803e76d47ad..c10f08f8a618 100644 --- a/include/uapi/linux/userfaultfd.h +++ b/include/uapi/linux/userfaultfd.h @@ -44,7 +44,8 @@ UFFD_FEATURE_POISON | \ UFFD_FEATURE_WP_ASYNC | \ UFFD_FEATURE_MOVE | \ - UFFD_FEATURE_RWP) + UFFD_FEATURE_RWP | \ + UFFD_FEATURE_RWP_ASYNC) #define UFFD_API_IOCTLS \ ((__u64)1 << _UFFDIO_REGISTER | \ (__u64)1 << _UFFDIO_UNREGISTER | \ @@ -243,6 +244,13 @@ struct uffdio_api { * UFFDIO_REGISTER_MODE_RWP for read-write protection tracking. * Pages are made inaccessible via UFFDIO_RWPROTECT and faults * are delivered when the pages are re-accessed. + * + * UFFD_FEATURE_RWP_ASYNC indicates asynchronous mode for + * UFFDIO_REGISTER_MODE_RWP. When set, faults on read-write + * protected pages are auto-resolved by the kernel (PTE + * permissions restored immediately) without delivering a message + * to the userfaultfd handler. Use PAGEMAP_SCAN with inverted + * PAGE_IS_ACCESSED to find pages that were not re-accessed. 
*/ #define UFFD_FEATURE_PAGEFAULT_FLAG_WP (1<<0) #define UFFD_FEATURE_EVENT_FORK (1<<1) @@ -262,6 +270,7 @@ struct uffdio_api { #define UFFD_FEATURE_WP_ASYNC (1<<15) #define UFFD_FEATURE_MOVE (1<<16) #define UFFD_FEATURE_RWP (1<<17) +#define UFFD_FEATURE_RWP_ASYNC (1<<18) __u64 features; =20 __u64 ioctls; diff --git a/mm/huge_memory.c b/mm/huge_memory.c index cd32bd51e311..803fbc41e501 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2291,7 +2291,30 @@ static inline bool can_change_pmd_writable(struct vm= _area_struct *vma, =20 vm_fault_t do_huge_pmd_uffd_rwp(struct vm_fault *vmf) { - return handle_userfault(vmf, VM_UFFD_RWP); + struct vm_area_struct *vma =3D vmf->vma; + pmd_t pmd; + + if (!userfaultfd_rwp_async(vma)) + return handle_userfault(vmf, VM_UFFD_RWP); + + vmf->ptl =3D pmd_lock(vma->vm_mm, vmf->pmd); + if (unlikely(!pmd_same(pmdp_get(vmf->pmd), vmf->orig_pmd))) { + spin_unlock(vmf->ptl); + return 0; + } + pmd =3D pmd_modify(vmf->orig_pmd, vma->vm_page_prot); + /* pmd_modify() preserves _PAGE_UFFD; drop it on resolution */ + pmd =3D pmd_clear_uffd(pmd); + pmd =3D pmd_mkyoung(pmd); + if (!pmd_write(pmd) && + vma_wants_manual_pte_write_upgrade(vma) && + can_change_pmd_writable(vma, vmf->address, pmd)) + pmd =3D pmd_mkwrite(pmd, vma); + set_pmd_at(vma->vm_mm, vmf->address & HPAGE_PMD_MASK, + vmf->pmd, pmd); + update_mmu_cache_pmd(vma, vmf->address, vmf->pmd); + spin_unlock(vmf->ptl); + return 0; } =20 /* NUMA hinting page fault entry point for trans huge pmds */ diff --git a/mm/hugetlb.c b/mm/hugetlb.c index f63718296cc2..a5ff9018af06 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -6070,7 +6070,37 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struc= t vm_area_struct *vma, */ if (pte_protnone(vmf.orig_pte) && vma_is_accessible(vma) && userfaultfd_rwp(vma) && huge_pte_uffd(vmf.orig_pte)) { - return hugetlb_handle_userfault(&vmf, mapping, VM_UFFD_RWP); + spinlock_t *ptl; + pte_t pte; + + /* Sync: drop hugetlb locks before blocking in 
handle_userfault() */ + if (!userfaultfd_rwp_async(vma)) + return hugetlb_handle_userfault(&vmf, mapping, VM_UFFD_RWP); + + ptl =3D huge_pte_lock(h, mm, vmf.pte); + pte =3D huge_ptep_get(mm, vmf.address, vmf.pte); + if (pte_protnone(pte) && huge_pte_uffd(pte)) { + unsigned int shift =3D huge_page_shift(h); + + pte =3D huge_pte_modify(pte, vma->vm_page_prot); + pte =3D arch_make_huge_pte(pte, shift, vma->vm_flags); + /* huge_pte_modify() preserves _PAGE_UFFD; drop it on resolution */ + pte =3D huge_pte_clear_uffd(pte); + pte =3D pte_mkyoung(pte); + /* + * Unlike do_uffd_rwp(), do not upgrade to writable + * here. Hugetlb lacks a can_change_huge_pte_writable() + * equivalent, so a write access will take a separate + * COW fault =E2=80=94 acceptable for the rare private hugetlb + * case. + */ + set_huge_pte_at(mm, vmf.address, vmf.pte, pte, + huge_page_size(h)); + update_mmu_cache(vma, vmf.address, vmf.pte); + } + spin_unlock(ptl); + ret =3D 0; + goto out_mutex; } =20 /* diff --git a/mm/memory.c b/mm/memory.c index 3e393881031d..89c9a44d07ce 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -6137,8 +6137,31 @@ static void numa_rebuild_large_mapping(struct vm_fau= lt *vmf, struct vm_area_stru =20 static vm_fault_t do_uffd_rwp(struct vm_fault *vmf) { - pte_unmap(vmf->pte); - return handle_userfault(vmf, VM_UFFD_RWP); + pte_t pte; + + if (!userfaultfd_rwp_async(vmf->vma)) { + /* Sync mode: unmap PTE and deliver to userfaultfd handler */ + pte_unmap(vmf->pte); + return handle_userfault(vmf, VM_UFFD_RWP); + } + + spin_lock(vmf->ptl); + if (unlikely(!pte_same(ptep_get(vmf->pte), vmf->orig_pte))) { + pte_unmap_unlock(vmf->pte, vmf->ptl); + return 0; + } + pte =3D pte_modify(vmf->orig_pte, vmf->vma->vm_page_prot); + /* pte_modify() preserves _PAGE_UFFD; drop it on resolution */ + pte =3D pte_clear_uffd(pte); + pte =3D pte_mkyoung(pte); + if (!pte_write(pte) && + vma_wants_manual_pte_write_upgrade(vmf->vma) && + can_change_pte_writable(vmf->vma, vmf->address, pte)) + pte =3D 
pte_mkwrite(pte, vmf->vma); + set_pte_at(vmf->vma->vm_mm, vmf->address, vmf->pte, pte); + update_mmu_cache(vmf->vma, vmf->address, vmf->pte); + pte_unmap_unlock(vmf->pte, vmf->ptl); + return 0; } =20 static vm_fault_t do_numa_page(struct vm_fault *vmf) diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index b966df47800c..20478bb37311 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -2478,6 +2478,11 @@ static bool userfaultfd_wp_async_ctx(struct userfaul= tfd_ctx *ctx) return ctx && (ctx->features & UFFD_FEATURE_WP_ASYNC); } =20 +static bool userfaultfd_rwp_async_ctx(struct userfaultfd_ctx *ctx) +{ + return ctx && (ctx->features & UFFD_FEATURE_RWP_ASYNC); +} + /* * Whether WP_UNPOPULATED is enabled on the uffd context. It is only * meaningful when userfaultfd_wp()=3D=3Dtrue on the vma and when it's @@ -4379,6 +4384,11 @@ bool userfaultfd_wp_async(struct vm_area_struct *vma) return userfaultfd_wp_async_ctx(vma->vm_userfaultfd_ctx.ctx); } =20 +bool userfaultfd_rwp_async(struct vm_area_struct *vma) +{ + return userfaultfd_rwp_async_ctx(vma->vm_userfaultfd_ctx.ctx); +} + static inline unsigned int uffd_ctx_features(__u64 user_features) { /* @@ -4482,6 +4492,12 @@ static int userfaultfd_api(struct userfaultfd_ctx *c= tx, if (features & UFFD_FEATURE_WP_ASYNC) features |=3D UFFD_FEATURE_WP_UNPOPULATED; =20 + ret =3D -EINVAL; + /* RWP_ASYNC requires RWP */ + if ((features & UFFD_FEATURE_RWP_ASYNC) && + !(features & UFFD_FEATURE_RWP)) + goto err_out; + /* report all available features and ioctls to userland */ uffdio_api.features =3D UFFD_API_FEATURES; #ifndef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR @@ -4504,7 +4520,8 @@ static int userfaultfd_api(struct userfaultfd_ctx *ct= x, * but not actually usable. 
*/ if (VM_UFFD_RWP =3D=3D VM_NONE || !pgtable_supports_uffd()) - uffdio_api.features &=3D ~UFFD_FEATURE_RWP; + uffdio_api.features &=3D + ~(UFFD_FEATURE_RWP | UFFD_FEATURE_RWP_ASYNC); =20 ret =3D -EINVAL; if (features & ~uffdio_api.features) --=20 2.54.0 From nobody Mon Jun 8 23:56:07 2026 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-alma10-1.taild15c8.ts.net [100.103.45.18]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 5F3CE3812FE for ; Mon, 25 May 2026 11:39:16 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=100.103.45.18 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779709158; cv=none; b=oKk9iWbKHjIZE3D/Ty8zUpG3Isj64LCkeb7Awsjc/jlG/7jPEQOrgPV2d+ARW7zx/q1P+ROgp2xfvzC43x6O40zGCOBjzv0MQoWp8ITZpSXaycTtpiiV55psq0+ybrV+CjDQyQk+d5tP2IV/h92UXfAuFOFrXRAung2/T703GM0= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779709158; c=relaxed/simple; bh=mctcQ/FJGtncFlKppsIgH+mn3mHj9FbVyJUEWoObBIY=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version; b=RII+noss+5O+eZf604ABIxiC5DcRHt9qJKRUUoxnqVp70RwX4ZH3357K3JLXdBnksf+E20pv5TwFpg3ARt8DQVcYpmHLH/Q50T+ugkyigvtSOd8qOamZJ67EEh5P2u444PQU/oiZrarG6FgH1mkiq2c5Za626bW15tdvU8cCx20= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=XVMFF+sX; arc=none smtp.client-ip=100.103.45.18 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="XVMFF+sX" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 855861F00A3C; Mon, 25 May 2026 11:39:15 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=kernel.org; s=k20260515; t=1779709156; bh=VJiLnnmxKx1AqdCXsILEaQEuBxwUu2dxhQnKdI014eI=; h=From:To:Cc:Subject:Date:In-Reply-To:References; 
b=XVMFF+sX4StSiQ06wtHszuJChE0tmiL0oaL4TgEg5vBgnOCciuOEoUVq8SajUq3yO eEr1Yej9aB6JUmeXV7HYUrSiXQJqWC6pxp1SQgXsib2rv3V9/uKO+pGqfNxcFxC3OI BXS+VLmNp0Za7Yf8V9JKFbbjxYWA3r9SEBhoIeGbbAjbX64q1Xsc9DqRh9Md5haRry /TIzkSE3CIr8izT6JW6Q/u3VCnH2JtgF6bP66vx+Oso70pLllfdvsHiMV+RuEPNkpP Bg4OlmhfNKAnhBzWdLIIhvtyFDA3p7xQ0CkSy3OeqMMq4bn2ljrHfxfJUnttFd6lCg xOjLJGzPvlvdw== Received: from phl-compute-02.internal (phl-compute-02.internal [10.202.2.42]) by mailfauth.phl.internal (Postfix) with ESMTP id E2944F40082; Mon, 25 May 2026 07:39:14 -0400 (EDT) Received: from phl-frontend-04 ([10.202.2.163]) by phl-compute-02.internal (MEProxy); Mon, 25 May 2026 07:39:14 -0400 X-ME-Sender: X-ME-Received: X-ME-Proxy-Cause: dmFkZTFnTuEJhvyD+myNg4p794atgQHzdZOXs2A3gxvR3FSD7M0iqWwHWhOfEG1IgnscR2 dpWEnUQHTt04gBiZJ6FOYLwJWkueK3/2x7yPKsRIizx2ATY4cFZISTDZP9EiTwl/39cNmM OA0CsMrDRjxwva2a2lWcAgzMzYsZoGkumkodmX+5ok307LF0rRD0oLFOgPIpccglfJ7Cmv AURvPb/dWg6Eh/o3LACLo5CgxqvMM0m0aATVsjZcPfM5Eqk+WNKfRQ8GkDG8MvyLCL9P+n Ffx6yvOWdEN3fPIKUkuyF3PJ8RYaK5VmWWEYQaxtLNJTEh5sz7hSNo8ZFBfkwknbfQV6HA 7SIS4yYDrm7L/JyPwzoUh/jhVh8fTVT6g27fnKgicpk86vK9a+0Gke1AfOkjM+6TDjHjVb fNKLEKArXXTuzvROeBsxd9FXHQjGTxjvBjnkm8Ggf6DzqfgOycjNsGDy/TzWnbSS1vbbVo /BBpYUUqtwJrP3ujfvaq/a4LYgi9p7yt07xwmPlqwQHyqbUp+9hXF9zxPMSPHZB7iuDeWn C5rF9J8xyX7m27pqToy+P7UD7UA7a2SlYIYmmPTlo+C7r5OxT0rcGXSs80ECVaoqaOgSMG pYsrF2sXk2BI4xWyuyWQtQgrXPE/nN6dVL1YbLMMmMFxPfglk91Y7iNhsJcQ X-ME-Proxy: Feedback-ID: i10464835:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Mon, 25 May 2026 07:39:14 -0400 (EDT) From: "Kiryl Shutsemau (Meta)" To: akpm@linux-foundation.org, rppt@kernel.org, peterx@redhat.com, david@kernel.org Cc: ljs@kernel.org, surenb@google.com, vbabka@kernel.org, Liam.Howlett@oracle.com, ziy@nvidia.com, corbet@lwn.net, skhan@linuxfoundation.org, seanjc@google.com, pbonzini@redhat.com, jthoughton@google.com, aarcange@redhat.com, sj@kernel.org, usama.arif@linux.dev, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, 
linux-kselftest@vger.kernel.org, kvm@vger.kernel.org, kernel-team@meta.com, "Kiryl Shutsemau (Meta)" Subject: [PATCH v4 12/14] userfaultfd: add UFFDIO_SET_MODE for runtime sync/async toggle Date: Mon, 25 May 2026 12:37:26 +0100 Message-ID: <20260525113737.1942478-13-kas@kernel.org> X-Mailer: git-send-email 2.54.0 In-Reply-To: <20260525113737.1942478-1-kas@kernel.org> References: <20260525113737.1942478-1-kas@kernel.org> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset="utf-8" Add an ioctl to toggle async mode at runtime without re-registering the userfaultfd. This allows a VMM to switch between sync and async RWP modes on-the-fly -- for example, starting in async mode for working set scanning, then switching to sync mode to intercept faults during page eviction. UFFDIO_SET_MODE takes an enable/disable bitmask of UFFD_FEATURE_* flags. Only UFFD_FEATURE_RWP_ASYNC is toggleable today; the ioctl rejects any other bit with -EINVAL. Enabling RWP_ASYNC also requires RWP to have been negotiated at UFFDIO_API time, mirroring the UFFDIO_API invariant. Fault-path readers of ctx->features run under mmap_read_lock or a per-VMA lock; the RMW takes mmap_write_lock and calls vma_start_write() on every UFFD-armed VMA, so those readers are fully excluded. userfaultfd_show_fdinfo(), however, reads ctx->features without any lock, so the RMW is written as a single WRITE_ONCE and fdinfo reads it with READ_ONCE. That keeps the lockless observer from seeing a mid-RMW intermediate and removes the audit burden when new toggleable bits are added later. When switching to async, pending sync waiters are woken so they retry and auto-resolve under the new mode. 
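
For illustration only (not part of the patch): the enable/disable rules described above can be modelled as a small pure function in userspace. This is a sketch under stated assumptions. Every sketch_* name is hypothetical, the bit values are copied from the uapi hunks in this series, and the model ignores the locking, the vma_start_write() drain and the waiter wakeup entirely.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Bit positions as defined in the uapi hunks of this series. */
#define SKETCH_FEATURE_RWP        (UINT64_C(1) << 17)
#define SKETCH_FEATURE_RWP_ASYNC  (UINT64_C(1) << 18)
/* Per the commit message, only RWP_ASYNC is toggleable today. */
#define SKETCH_TOGGLEABLE         SKETCH_FEATURE_RWP_ASYNC

/* Hypothetical mirror of struct uffdio_set_mode. */
struct sketch_set_mode {
	uint64_t enable;
	uint64_t disable;
};

/*
 * Hypothetical model of the checks: returns 0 and updates *features,
 * or -EINVAL and leaves *features untouched. 'available' stands in
 * for the feature set the kernel/arch supports.
 */
static int sketch_set_mode(uint64_t *features, uint64_t available,
			   const struct sketch_set_mode *mode)
{
	/* A bit may not be enabled and disabled in the same call. */
	if (mode->enable & mode->disable)
		return -EINVAL;
	/* Only supported, toggleable bits are accepted. */
	if ((mode->enable | mode->disable) & ~(available & SKETCH_TOGGLEABLE))
		return -EINVAL;
	/* Enabling RWP_ASYNC requires RWP to have been negotiated. */
	if ((mode->enable & SKETCH_FEATURE_RWP_ASYNC) &&
	    !(*features & SKETCH_FEATURE_RWP))
		return -EINVAL;
	/* Compute the new value in one expression, as the patch does. */
	*features = (*features | mode->enable) & ~mode->disable;
	return 0;
}

/* Usage: enable async on an RWP context, then reject an overlap. */
static void sketch_demo(void)
{
	uint64_t avail = SKETCH_FEATURE_RWP | SKETCH_FEATURE_RWP_ASYNC;
	uint64_t features = SKETCH_FEATURE_RWP;
	struct sketch_set_mode on = { .enable = SKETCH_FEATURE_RWP_ASYNC };
	struct sketch_set_mode bad = {
		.enable = SKETCH_FEATURE_RWP_ASYNC,
		.disable = SKETCH_FEATURE_RWP_ASYNC,
	};

	assert(sketch_set_mode(&features, avail, &on) == 0);
	assert(features & SKETCH_FEATURE_RWP_ASYNC);
	assert(sketch_set_mode(&features, avail, &bad) == -EINVAL);
}
```

The model computes the new value in a single expression to mirror the single-store update the message describes. It does not reproduce the WRITE_ONCE/READ_ONCE pairing, which only matters with concurrent readers.
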
Signed-off-by: Kiryl Shutsemau (Meta) Assisted-by: Claude:claude-opus-4-6 Reviewed-by: Mike Rapoport (Microsoft) --- include/uapi/linux/userfaultfd.h | 14 +++ mm/userfaultfd.c | 150 +++++++++++++++++++++++++------ 2 files changed, 136 insertions(+), 28 deletions(-) diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaul= tfd.h index c10f08f8a618..cea11aad6b54 100644 --- a/include/uapi/linux/userfaultfd.h +++ b/include/uapi/linux/userfaultfd.h @@ -49,6 +49,7 @@ #define UFFD_API_IOCTLS \ ((__u64)1 << _UFFDIO_REGISTER | \ (__u64)1 << _UFFDIO_UNREGISTER | \ + (__u64)1 << _UFFDIO_SET_MODE | \ (__u64)1 << _UFFDIO_API) #define UFFD_API_RANGE_IOCTLS \ ((__u64)1 << _UFFDIO_WAKE | \ @@ -85,6 +86,7 @@ #define _UFFDIO_CONTINUE (0x07) #define _UFFDIO_POISON (0x08) #define _UFFDIO_RWPROTECT (0x09) +#define _UFFDIO_SET_MODE (0x0A) #define _UFFDIO_API (0x3F) =20 /* userfaultfd ioctl ids */ @@ -111,6 +113,8 @@ struct uffdio_poison) #define UFFDIO_RWPROTECT _IOWR(UFFDIO, _UFFDIO_RWPROTECT, \ struct uffdio_rwprotect) +#define UFFDIO_SET_MODE _IOW(UFFDIO, _UFFDIO_SET_MODE, \ + struct uffdio_set_mode) =20 /* read() structure */ struct uffd_msg { @@ -406,6 +410,16 @@ struct uffdio_move { __s64 move; }; =20 +struct uffdio_set_mode { + /* + * Toggle async mode for features at runtime. + * Supported: UFFD_FEATURE_RWP_ASYNC. + * Setting a bit in both enable and disable is invalid. + */ + __u64 enable; + __u64 disable; +}; + /* * Flags for the userfaultfd(2) system call itself. 
*/ diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c index 20478bb37311..680ef9bd57fd 100644 --- a/mm/userfaultfd.c +++ b/mm/userfaultfd.c @@ -2468,19 +2468,29 @@ struct userfaultfd_wake_range { /* internal indication that UFFD_API ioctl was successfully executed */ #define UFFD_FEATURE_INITIALIZED (1u << 31) =20 +/* + * UFFDIO_SET_MODE updates ctx->features under mmap_write_lock with + * WRITE_ONCE; readers that run outside mmap_read_lock or the per-VMA + * lock (poll/read_iter/ioctl, fdinfo) must pair with READ_ONCE. + */ +static unsigned int userfaultfd_features(struct userfaultfd_ctx *ctx) +{ + return READ_ONCE(ctx->features); +} + static bool userfaultfd_is_initialized(struct userfaultfd_ctx *ctx) { - return ctx->features & UFFD_FEATURE_INITIALIZED; + return userfaultfd_features(ctx) & UFFD_FEATURE_INITIALIZED; } =20 static bool userfaultfd_wp_async_ctx(struct userfaultfd_ctx *ctx) { - return ctx && (ctx->features & UFFD_FEATURE_WP_ASYNC); + return ctx && (userfaultfd_features(ctx) & UFFD_FEATURE_WP_ASYNC); } =20 static bool userfaultfd_rwp_async_ctx(struct userfaultfd_ctx *ctx) { - return ctx && (ctx->features & UFFD_FEATURE_RWP_ASYNC); + return ctx && (userfaultfd_features(ctx) & UFFD_FEATURE_RWP_ASYNC); } =20 /* @@ -2495,7 +2505,7 @@ bool userfaultfd_wp_unpopulated(struct vm_area_struct= *vma) if (!ctx) return false; =20 - return ctx->features & UFFD_FEATURE_WP_UNPOPULATED; + return userfaultfd_features(ctx) & UFFD_FEATURE_WP_UNPOPULATED; } =20 static int userfaultfd_wake_function(wait_queue_entry_t *wq, unsigned mode, @@ -4261,6 +4271,109 @@ static int userfaultfd_rwprotect(struct userfaultfd= _ctx *ctx, return ret; } =20 +/* Subset of UFFD_API_FEATURES actually supported by this kernel/arch */ +static __u64 uffd_api_available_features(void) +{ + __u64 f =3D UFFD_API_FEATURES; + + if (!IS_ENABLED(CONFIG_HAVE_ARCH_USERFAULTFD_MINOR)) + f &=3D ~(UFFD_FEATURE_MINOR_HUGETLBFS | UFFD_FEATURE_MINOR_SHMEM); + if (!pgtable_supports_uffd()) + f &=3D 
~UFFD_FEATURE_PAGEFAULT_FLAG_WP; + if (!uffd_supports_wp_marker()) + f &=3D ~(UFFD_FEATURE_WP_HUGETLBFS_SHMEM | + UFFD_FEATURE_WP_UNPOPULATED | + UFFD_FEATURE_WP_ASYNC); + /* + * RWP needs both PROT_NONE support and the uffd PTE bit. The + * VM_UFFD_RWP check covers compile-time unavailability; the + * pgtable_supports_uffd() check covers runtime (e.g. riscv + * without the SVRSW60T59B extension) where the PTE bit is declared + * but not actually usable. + */ + if (VM_UFFD_RWP =3D=3D VM_NONE || !pgtable_supports_uffd()) + f &=3D ~(UFFD_FEATURE_RWP | UFFD_FEATURE_RWP_ASYNC); + return f; +} + +/* Async features that can be toggled at runtime via UFFDIO_SET_MODE */ +#define UFFD_FEATURE_TOGGLEABLE UFFD_FEATURE_RWP_ASYNC + +static int userfaultfd_set_mode(struct userfaultfd_ctx *ctx, + unsigned long arg) +{ + struct uffdio_set_mode mode; + struct mm_struct *mm =3D ctx->mm; + + if (copy_from_user(&mode, (void __user *)arg, sizeof(mode))) + return -EFAULT; + + /* enable and disable must not overlap */ + if (mode.enable & mode.disable) + return -EINVAL; + + /* only toggleable features that this kernel/arch actually supports */ + if ((mode.enable | mode.disable) & + ~(uffd_api_available_features() & UFFD_FEATURE_TOGGLEABLE)) + return -EINVAL; + + /* RWP_ASYNC can only be enabled on contexts that negotiated RWP */ + if ((mode.enable & UFFD_FEATURE_RWP_ASYNC) && + !(ctx->features & UFFD_FEATURE_RWP)) + return -EINVAL; + + if (!mmget_not_zero(mm)) + return -ESRCH; + + /* + * Drain in-flight faults before flipping features. mmap_write_lock() + * blocks new mmap_read_lock() callers, but per-VMA locked faults + * (lock_vma_under_rcu() + FAULT_FLAG_VMA_LOCK) that acquired before + * this point keep running. Calling vma_start_write() on each UFFD- + * armed VMA waits for those readers to drop, so no in-flight fault + * can observe the old features after mmap_write_unlock(). 
+ */ + mmap_write_lock(mm); + { + struct vm_area_struct *vma; + VMA_ITERATOR(vmi, mm, 0); + + for_each_vma(vmi, vma) { + if (vma->vm_userfaultfd_ctx.ctx =3D=3D ctx) + vma_start_write(vma); + } + } + /* + * Single WRITE_ONCE so lockless readers (fdinfo, poll/read_iter + * via userfaultfd_is_initialized(), and the userfaultfd_features() + * helper used elsewhere) can't observe a mid-RMW intermediate + * value. Hot-path readers already serialise through the mmap lock + * + vma_start_write() drain above, so their load doesn't need an + * annotation. + */ + WRITE_ONCE(ctx->features, + (ctx->features | mode.enable) & ~mode.disable); + mmap_write_unlock(mm); + + /* + * If switching to async, wake threads blocked in handle_userfault(). + * They will retry the fault and auto-resolve under the new mode. + * len=3D0 means wake all pending faults on this context. + */ + if (mode.enable & UFFD_FEATURE_RWP_ASYNC) { + struct userfaultfd_wake_range range =3D { .len =3D 0 }; + + spin_lock_irq(&ctx->fault_pending_wqh.lock); + __wake_up_locked_key(&ctx->fault_pending_wqh, TASK_NORMAL, + &range); + __wake_up(&ctx->fault_wqh, TASK_NORMAL, 1, &range); + spin_unlock_irq(&ctx->fault_pending_wqh.lock); + } + + mmput(mm); + return 0; +} + static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long= arg) { __s64 ret; @@ -4499,29 +4612,7 @@ static int userfaultfd_api(struct userfaultfd_ctx *c= tx, goto err_out; =20 /* report all available features and ioctls to userland */ - uffdio_api.features =3D UFFD_API_FEATURES; -#ifndef CONFIG_HAVE_ARCH_USERFAULTFD_MINOR - uffdio_api.features &=3D - ~(UFFD_FEATURE_MINOR_HUGETLBFS | UFFD_FEATURE_MINOR_SHMEM); -#endif - if (!pgtable_supports_uffd()) - uffdio_api.features &=3D ~UFFD_FEATURE_PAGEFAULT_FLAG_WP; - - if (!uffd_supports_wp_marker()) { - uffdio_api.features &=3D ~UFFD_FEATURE_WP_HUGETLBFS_SHMEM; - uffdio_api.features &=3D ~UFFD_FEATURE_WP_UNPOPULATED; - uffdio_api.features &=3D ~UFFD_FEATURE_WP_ASYNC; - } - /* - * RWP needs both 
PROT_NONE support and the uffd-wp PTE bit. The - * VM_UFFD_RWP check covers compile-time unavailability; the - * pgtable_supports_uffd() check covers runtime (e.g. riscv - * without the SVRSW60T59B extension) where the PTE bit is declared - * but not actually usable. - */ - if (VM_UFFD_RWP =3D=3D VM_NONE || !pgtable_supports_uffd()) - uffdio_api.features &=3D - ~(UFFD_FEATURE_RWP | UFFD_FEATURE_RWP_ASYNC); + uffdio_api.features =3D uffd_api_available_features(); =20 ret =3D -EINVAL; if (features & ~uffdio_api.features) @@ -4591,6 +4682,9 @@ static long userfaultfd_ioctl(struct file *file, unsi= gned cmd, case UFFDIO_RWPROTECT: ret =3D userfaultfd_rwprotect(ctx, arg); break; + case UFFDIO_SET_MODE: + ret =3D userfaultfd_set_mode(ctx, arg); + break; } return ret; } @@ -4618,7 +4712,7 @@ static void userfaultfd_show_fdinfo(struct seq_file *= m, struct file *f) * protocols: aa:... bb:... */ seq_printf(m, "pending:\t%lu\ntotal:\t%lu\nAPI:\t%Lx:%x:%Lx\n", - pending, total, UFFD_API, ctx->features, + pending, total, UFFD_API, userfaultfd_features(ctx), UFFD_API_IOCTLS|UFFD_API_RANGE_IOCTLS); } #endif --=20 2.54.0 From nobody Mon Jun 8 23:56:07 2026 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-alma10-1.taild15c8.ts.net [100.103.45.18]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 535ED3812EE; Mon, 25 May 2026 11:39:18 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=100.103.45.18 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779709160; cv=none; b=MsRDg9bVEuO/4X6defqhCp1mDdXkBYOoqKw21tAwDGZcKFb5Vugij1bGsqp23/Ltv0hP7rBp4XPg/KWxtHenJkZlMMjy1pQG0DL5byOW/tGJqVMe8iaIZ8Yd0gbeNjbiVJ+pyX/PCoVDff4Q8Dh2DhH27ohSzMFiAurpoz157j4= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1779709160; c=relaxed/simple; bh=80VEwQRpOwdvDbuzRL9gByGCUy7q/2d9np3g1aaR25c=; 
O9pSZdr0rbc6kh+tY5RaeQWFElQFFEQ5C8jUzkTKgalcxVAEILHqs4mxBagviCZMf4P8zc Tl2m1aVHXIpS+jnMmIYoG7xtwTxGDlgfJ0s/XgKeqG/EXsL3dLjv4fdyHKqLUiE3OAzMxG FkJnasO99z9fWynQMn9ZShKukQwwb6FLvBLxqQ/XOQCHWNRc4pdGmdoypgqApPUFf6ehXH m0+yquNsiozSQUrpsR7b8kTvQK9nCAK8iMjLEzXYGAZ72hUoV4MV643e9dQRhfrX0CNzfG kauGw6kiHwJP0Mjflq4efMGJiB2GjMyxiWkWbjvheEKxjrJM4bmdm1VsqsbQ X-ME-Proxy: Feedback-ID: i10464835:Fastmail Received: by mail.messagingengine.com (Postfix) with ESMTPA; Mon, 25 May 2026 07:39:16 -0400 (EDT) From: "Kiryl Shutsemau (Meta)" To: akpm@linux-foundation.org, rppt@kernel.org, peterx@redhat.com, david@kernel.org Cc: ljs@kernel.org, surenb@google.com, vbabka@kernel.org, Liam.Howlett@oracle.com, ziy@nvidia.com, corbet@lwn.net, skhan@linuxfoundation.org, seanjc@google.com, pbonzini@redhat.com, jthoughton@google.com, aarcange@redhat.com, sj@kernel.org, usama.arif@linux.dev, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org, kvm@vger.kernel.org, kernel-team@meta.com, "Kiryl Shutsemau (Meta)" Subject: [PATCH v4 13/14] selftests/mm: add userfaultfd RWP tests Date: Mon, 25 May 2026 12:37:27 +0100 Message-ID: <20260525113737.1942478-14-kas@kernel.org> X-Mailer: git-send-email 2.54.0 In-Reply-To: <20260525113737.1942478-1-kas@kernel.org> References: <20260525113737.1942478-1-kas@kernel.org> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: quoted-printable Coverage for UFFDIO_REGISTER_MODE_RWP and UFFDIO_RWPROTECT: rwp-async async mode =E2=80=94 touch pages, verify permissions a= re auto-restored without a message rwp-sync sync mode =E2=80=94 access blocks, handler resolves via UFFDIO_RWPROTECT rwp-pagemap PAGEMAP_SCAN reports still-cold pages via inverted PAGE_IS_ACCESSED rwp-mprotect RWP survives mprotect(PROT_NONE) -> mprotect(PROT_READ|PROT_WRITE) round-trip rwp-gup GUP walks through a 
protnone RWP PTE (pipe write/read drives the GUP path) rwp-async-toggle UFFDIO_SET_MODE flips between sync and async without re-registering rwp-close closing the uffd restores page permissions rwp-fork RWP survives fork() with EVENT_FORK; child's PTEs keep the uffd bit rwp-fork-pin RWP survives fork() on an RO-longterm-pinned anon page (forces copy_present_page()); child read auto-resolves and clears the bit, proving PAGE_NONE was in place rwp-wp-exclusive register with MODE_WP|MODE_RWP returns -EINVAL All tests run against anon, shmem, shmem-private, hugetlb, and hugetlb-private memory, except rwp-fork-pin which is anon-only =E2=80=94 copy_present_page() is the private-anon pinned-exclusive fork path. Signed-off-by: Kiryl Shutsemau Assisted-by: Claude:claude-opus-4-6 Reviewed-by: Mike Rapoport (Microsoft) --- tools/testing/selftests/mm/uffd-unit-tests.c | 766 +++++++++++++++++++ 1 file changed, 766 insertions(+) diff --git a/tools/testing/selftests/mm/uffd-unit-tests.c b/tools/testing/s= elftests/mm/uffd-unit-tests.c index a6c14109e818..bd6f35ddaa4d 100644 --- a/tools/testing/selftests/mm/uffd-unit-tests.c +++ b/tools/testing/selftests/mm/uffd-unit-tests.c @@ -7,6 +7,8 @@ =20 #include "uffd-common.h" =20 +#include +#include #include "../../../../mm/gup_test.h" =20 #ifdef __NR_userfaultfd @@ -109,6 +111,11 @@ static void uffd_test_skip(const char *message) =20 static void test_uffd_api(bool use_dev) { + const uint64_t expected_ioctls =3D + BIT_ULL(_UFFDIO_REGISTER) | + BIT_ULL(_UFFDIO_UNREGISTER) | + BIT_ULL(_UFFDIO_API) | + BIT_ULL(_UFFDIO_SET_MODE); struct uffdio_api uffdio_api; int uffd; =20 @@ -148,6 +155,15 @@ static void test_uffd_api(bool use_dev) goto out; } =20 + /* Verify returned fd-level ioctls bitmask */ + if ((uffdio_api.ioctls & expected_ioctls) !=3D expected_ioctls) { + uffd_test_fail("UFFDIO_API missing expected ioctls: " + "got=3D0x%"PRIx64", expected=3D0x%"PRIx64, + (uint64_t)uffdio_api.ioctls, + expected_ioctls); + goto out; + } + /* Test double 
requests of UFFDIO_API with a random feature set */ uffdio_api.features =3D BIT_ULL(0); if (ioctl(uffd, UFFDIO_API, &uffdio_api) =3D=3D 0) { @@ -602,6 +618,685 @@ void uffd_minor_collapse_test(uffd_global_test_opts_t= *gopts, uffd_test_args_t * uffd_minor_test_common(gopts, true, false); } =20 +static int uffd_register_rwp(int uffd, void *addr, uint64_t len) +{ + struct uffdio_register reg =3D { + .range =3D { .start =3D (unsigned long)addr, .len =3D len }, + .mode =3D UFFDIO_REGISTER_MODE_RWP, + }; + + if (ioctl(uffd, UFFDIO_REGISTER, &reg) =3D=3D -1) + return -errno; + return 0; +} + +static void rwprotect_range(int uffd, __u64 start, __u64 len, bool protect) +{ + struct uffdio_rwprotect rwp =3D { + .range =3D { .start =3D start, .len =3D len }, + .mode =3D protect ? UFFDIO_RWPROTECT_MODE_RWP : 0, + }; + + if (ioctl(uffd, UFFDIO_RWPROTECT, &rwp)) + err("UFFDIO_RWPROTECT failed"); +} + +static void set_async_mode(int uffd, bool enable) +{ + struct uffdio_set_mode mode =3D { }; + + if (enable) + mode.enable =3D UFFD_FEATURE_RWP_ASYNC; + else + mode.disable =3D UFFD_FEATURE_RWP_ASYNC; + + if (ioctl(uffd, UFFDIO_SET_MODE, &mode)) + err("UFFDIO_SET_MODE failed"); +} + +/* + * Test async RWP faults on anonymous memory. + * Populate pages, register MODE_RWP with RWP_ASYNC, + * RW-protect, re-access, verify content preserved and no faults delivered. 
+ */
+static void uffd_rwp_async_test(uffd_global_test_opts_t *gopts,
+				uffd_test_args_t *args)
+{
+	unsigned long nr_pages = gopts->nr_pages;
+	unsigned long page_size = gopts->page_size;
+	unsigned long p;
+
+	/* Populate all pages with known content */
+	for (p = 0; p < nr_pages; p++)
+		memset(gopts->area_dst + p * page_size, p % 255 + 1, page_size);
+
+	/* Register MODE_RWP */
+	if (uffd_register_rwp(gopts->uffd, gopts->area_dst,
+			      nr_pages * page_size))
+		err("register failure");
+
+	/* RW-protect all pages (sets protnone) */
+	rwprotect_range(gopts->uffd, (uint64_t)gopts->area_dst,
+			nr_pages * page_size, true);
+
+	/* Access all pages -- should auto-resolve, no faults */
+	for (p = 0; p < nr_pages; p++) {
+		unsigned char *page = (unsigned char *)gopts->area_dst +
+				      p * page_size;
+		unsigned char expected = p % 255 + 1;
+
+		if (page[0] != expected) {
+			uffd_test_fail("page %lu content mismatch: %u != %u",
+				       p, page[0], expected);
+			return;
+		}
+	}
+
+	uffd_test_pass();
+}
+
+/*
+ * Fault handler for RWP -- unprotect the page via UFFDIO_RWPROTECT.
+ */
+static void uffd_handle_rwp_fault(uffd_global_test_opts_t *gopts,
+				  struct uffd_msg *msg,
+				  struct uffd_args *uargs)
+{
+	if (!(msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_RWP))
+		err("expected RWP fault, got 0x%llx",
+		    msg->arg.pagefault.flags);
+
+	rwprotect_range(gopts->uffd, msg->arg.pagefault.address,
+			gopts->page_size, false);
+	uargs->minor_faults++;
+}
+
+/*
+ * Test sync RWP faults on anonymous memory.
+ * Populate pages, register MODE_RWP (sync), RW-protect,
+ * access from worker thread, verify fault delivered, UFFDIO_RWPROTECT resolves.
+ */
+static void uffd_rwp_sync_test(uffd_global_test_opts_t *gopts,
+			       uffd_test_args_t *args)
+{
+	unsigned long nr_pages = gopts->nr_pages;
+	unsigned long page_size = gopts->page_size;
+	pthread_t uffd_mon;
+	struct uffd_args uargs = { };
+	bool failed = false;
+	char c = '\0';
+	unsigned long p;
+
+	uargs.gopts = gopts;
+	uargs.handle_fault = uffd_handle_rwp_fault;
+
+	/* Populate all pages */
+	for (p = 0; p < nr_pages; p++)
+		memset(gopts->area_dst + p * page_size, p % 255 + 1, page_size);
+
+	/* Register MODE_RWP */
+	if (uffd_register_rwp(gopts->uffd, gopts->area_dst,
+			      nr_pages * page_size))
+		err("register failure");
+
+	/* RW-protect all pages */
+	rwprotect_range(gopts->uffd, (uint64_t)gopts->area_dst,
+			nr_pages * page_size, true);
+
+	/* Start fault handler thread */
+	if (pthread_create(&uffd_mon, NULL, uffd_poll_thread, &uargs))
+		err("uffd_poll_thread create");
+
+	/* Access all pages -- triggers sync RWP faults, handler unprotects */
+	for (p = 0; p < nr_pages; p++) {
+		unsigned char *page = (unsigned char *)gopts->area_dst +
+				      p * page_size;
+
+		if (page[0] != (p % 255 + 1)) {
+			uffd_test_fail("page %lu content mismatch", p);
+			failed = true;
+			goto out;
+		}
+	}
+
+out:
+	/*
+	 * Stop the handler before reading minor_faults: the last fault
+	 * resolution rwprotect_range()s before incrementing the counter,
+	 * so the main thread can race ahead of the increment.
+	 */
+	if (write(gopts->pipefd[1], &c, sizeof(c)) != sizeof(c))
+		err("pipe write");
+	if (pthread_join(uffd_mon, NULL))
+		err("join() failed");
+
+	if (failed)
+		return;
+	if (uargs.minor_faults == 0)
+		uffd_test_fail("expected RWP faults, got 0");
+	else
+		uffd_test_pass();
+}
+
+/*
+ * Test PAGEMAP_SCAN detection of RW-protected (cold) pages.
+ */
+static void uffd_rwp_pagemap_test(uffd_global_test_opts_t *gopts,
+				  uffd_test_args_t *args)
+{
+	unsigned long nr_pages = gopts->nr_pages;
+	unsigned long page_size = gopts->page_size;
+	unsigned long p;
+	struct page_region regions[16];
+	struct pm_scan_arg pm_arg;
+	int pagemap_fd;
+	long ret;
+
+	/* Need at least 4 pages */
+	if (nr_pages < 4) {
+		uffd_test_skip("need at least 4 pages");
+		return;
+	}
+
+	/* Populate all pages */
+	for (p = 0; p < nr_pages; p++)
+		memset(gopts->area_dst + p * page_size, 0xab, page_size);
+
+	/* Register and RW-protect */
+	if (uffd_register_rwp(gopts->uffd, gopts->area_dst,
+			      nr_pages * page_size))
+		err("register failure");
+
+	rwprotect_range(gopts->uffd, (uint64_t)gopts->area_dst,
+			nr_pages * page_size, true);
+
+	/* Touch first half of pages to re-activate them (async auto-resolve) */
+	for (p = 0; p < nr_pages / 2; p++) {
+		volatile char *page = gopts->area_dst + p * page_size;
+		(void)*page;
+	}
+
+	/* Scan for cold (still RW-protected) pages */
+	pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
+	if (pagemap_fd < 0)
+		err("open pagemap");
+
+	/*
+	 * PAGE_IS_ACCESSED is set once the uffd-wp bit has been cleared
+	 * (access happened, or the user resolved).  Invert it to select
+	 * still-protected (cold) pages.
+	 */
+	memset(&pm_arg, 0, sizeof(pm_arg));
+	pm_arg.size = sizeof(pm_arg);
+	pm_arg.start = (uint64_t)gopts->area_dst;
+	pm_arg.end = (uint64_t)gopts->area_dst + nr_pages * page_size;
+	pm_arg.vec = (uint64_t)regions;
+	pm_arg.vec_len = ARRAY_SIZE(regions);
+	pm_arg.category_mask = PAGE_IS_ACCESSED;
+	pm_arg.category_inverted = PAGE_IS_ACCESSED;
+	pm_arg.return_mask = PAGE_IS_ACCESSED;
+
+	ret = ioctl(pagemap_fd, PAGEMAP_SCAN, &pm_arg);
+	close(pagemap_fd);
+
+	if (ret < 0) {
+		uffd_test_fail("PAGEMAP_SCAN failed: %s", strerror(errno));
+		return;
+	}
+
+	/*
+	 * The second half of pages should be reported as RW-protected.
+	 * They may be coalesced into one region.
+	 */
+	if (ret < 1) {
+		uffd_test_fail("expected cold pages, got %ld regions", ret);
+		return;
+	}
+
+	/* Verify the cold region covers the second half */
+	uint64_t cold_start = regions[0].start;
+	uint64_t expected_start = (uint64_t)gopts->area_dst +
+				  (nr_pages / 2) * page_size;
+
+	if (cold_start != expected_start) {
+		uffd_test_fail("cold region starts at 0x%lx, expected 0x%lx",
+			       (unsigned long)cold_start,
+			       (unsigned long)expected_start);
+		return;
+	}
+
+	uffd_test_pass();
+}
+
+/*
+ * Test that RWP protection survives a mprotect(PROT_NONE) ->
+ * mprotect(PROT_READ|PROT_WRITE) round-trip.  The uffd-wp bit on a
+ * VM_UFFD_RWP VMA must continue to carry PROT_NONE semantics after
+ * mprotect() changes the base protection; otherwise accesses would
+ * silently succeed and the pagemap bit would stick without a fault
+ * ever clearing it.
+ */
+static void uffd_rwp_mprotect_test(uffd_global_test_opts_t *gopts,
+				   uffd_test_args_t *args)
+{
+	unsigned long nr_pages = gopts->nr_pages;
+	unsigned long page_size = gopts->page_size;
+	unsigned long p;
+	struct page_region regions[16];
+	struct pm_scan_arg pm_arg;
+	int pagemap_fd;
+	long ret;
+
+	/* Populate all pages */
+	for (p = 0; p < nr_pages; p++)
+		memset(gopts->area_dst + p * page_size, 0xab, page_size);
+
+	/* Register and RW-protect the whole range */
+	if (uffd_register_rwp(gopts->uffd, gopts->area_dst,
+			      nr_pages * page_size))
+		err("register failure");
+	rwprotect_range(gopts->uffd, (uint64_t)gopts->area_dst,
+			nr_pages * page_size, true);
+
+	/* Round-trip mprotect(): PROT_NONE -> PROT_READ|PROT_WRITE */
+	if (mprotect(gopts->area_dst, nr_pages * page_size, PROT_NONE))
+		err("mprotect() PROT_NONE");
+	if (mprotect(gopts->area_dst, nr_pages * page_size,
+		     PROT_READ | PROT_WRITE))
+		err("mprotect() PROT_READ|PROT_WRITE");
+
+	/* Touch every page.  Async RWP must auto-resolve each fault.
*/
+	for (p = 0; p < nr_pages; p++) {
+		volatile char *page = gopts->area_dst + p * page_size;
+		(void)*page;
+	}
+
+	/*
+	 * After touching, no page should remain RW-protected.  A stuck
+	 * uffd-wp bit would mean mprotect() silently dropped PROT_NONE and
+	 * the access never faulted.
+	 */
+	pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
+	if (pagemap_fd < 0)
+		err("open pagemap");
+
+	memset(&pm_arg, 0, sizeof(pm_arg));
+	pm_arg.size = sizeof(pm_arg);
+	pm_arg.start = (uint64_t)gopts->area_dst;
+	pm_arg.end = (uint64_t)gopts->area_dst + nr_pages * page_size;
+	pm_arg.vec = (uint64_t)regions;
+	pm_arg.vec_len = ARRAY_SIZE(regions);
+	pm_arg.category_mask = PAGE_IS_ACCESSED;
+	pm_arg.category_inverted = PAGE_IS_ACCESSED;
+	pm_arg.return_mask = PAGE_IS_ACCESSED;
+
+	ret = ioctl(pagemap_fd, PAGEMAP_SCAN, &pm_arg);
+	close(pagemap_fd);
+
+	if (ret < 0) {
+		uffd_test_fail("PAGEMAP_SCAN failed: %s", strerror(errno));
+		return;
+	}
+	if (ret != 0) {
+		uffd_test_fail("expected no cold pages after mprotect()+touch, got %ld regions",
+			       ret);
+		return;
+	}
+
+	uffd_test_pass();
+}
+
+/*
+ * Test that GUP resolves through protnone PTEs (async mode).
+ * vmsplice() into a pipe pins user pages via get_user_pages_fast() --
+ * unlike write(), which goes through copy_from_user() and ordinary
+ * hardware page faults -- so it exercises gup_can_follow_protnone() on
+ * the RW-protected PTE.  In async mode the kernel auto-restores
+ * permissions and GUP returns the page.
+ */
+static void uffd_rwp_gup_test(uffd_global_test_opts_t *gopts,
+			      uffd_test_args_t *args)
+{
+	struct iovec iov;
+	char buf;
+	int pipefd[2];
+
+	/* Populate first page with known content */
+	memset(gopts->area_dst, 0xCD, gopts->page_size);
+
+	if (uffd_register_rwp(gopts->uffd, gopts->area_dst, gopts->page_size))
+		err("register failure");
+
+	rwprotect_range(gopts->uffd, (uint64_t)gopts->area_dst,
+			gopts->page_size, true);
+
+	if (pipe(pipefd))
+		err("pipe");
+
+	/*
+	 * One byte's worth of iov is enough to GUP the containing page and
+	 * keeps the pipe transfer well under any pipe-capacity limit even on
+	 * hugetlb-backed runs.
+	 */
+	iov.iov_base = gopts->area_dst;
+	iov.iov_len = 1;
+	if (vmsplice(pipefd[1], &iov, 1, 0) != 1) {
+		uffd_test_fail("vmsplice from RW-protected page failed: %s",
+			       strerror(errno));
+		goto out;
+	}
+
+	if (read(pipefd[0], &buf, 1) != 1) {
+		uffd_test_fail("read from pipe failed");
+		goto out;
+	}
+
+	if (buf != (char)0xCD) {
+		uffd_test_fail("content mismatch: got 0x%02x, expected 0xCD",
+			       (unsigned char)buf);
+		goto out;
+	}
+
+	uffd_test_pass();
+out:
+	close(pipefd[0]);
+	close(pipefd[1]);
+}
+
+/*
+ * Test runtime toggle between async and sync modes.
+ * Start in async mode (detection), flip to sync (eviction), verify faults
+ * block, resolve them, flip back to async.
+ */
+static void uffd_rwp_async_toggle_test(uffd_global_test_opts_t *gopts,
+				       uffd_test_args_t *args)
+{
+	unsigned long nr_pages = gopts->nr_pages;
+	unsigned long page_size = gopts->page_size;
+	struct uffd_args uargs = { };
+	pthread_t uffd_mon;
+	char c = '\0';
+	unsigned long p;
+
+	uargs.gopts = gopts;
+	uargs.handle_fault = uffd_handle_rwp_fault;
+
+	/* Populate */
+	for (p = 0; p < nr_pages; p++)
+		memset(gopts->area_dst + p * page_size, p % 255 + 1, page_size);
+
+	if (uffd_register_rwp(gopts->uffd, gopts->area_dst,
+			      nr_pages * page_size))
+		err("register failure");
+
+	/* Phase 1: async detection -- RW-protect, access first half */
+	rwprotect_range(gopts->uffd, (uint64_t)gopts->area_dst,
+			nr_pages * page_size, true);
+
+	for (p = 0; p < nr_pages / 2; p++) {
+		volatile char *page = gopts->area_dst + p * page_size;
+		(void)*page; /* auto-resolves in async mode */
+	}
+
+	/* Phase 2: flip to sync for eviction */
+	set_async_mode(gopts->uffd, false);
+
+	/* Start handler -- will receive faults for cold pages */
+	if (pthread_create(&uffd_mon, NULL, uffd_poll_thread, &uargs))
+		err("uffd_poll_thread create");
+
+	/* Access second half (cold pages) -- should trigger sync faults */
+	for (p = nr_pages / 2; p < nr_pages; p++) {
+		unsigned char *page = (unsigned char *)gopts->area_dst +
+				      p * page_size;
+		if (page[0] != (p % 255 + 1)) {
+			uffd_test_fail("page %lu content mismatch", p);
+			goto out;
+		}
+	}
+
+	/*
+	 * Stop the handler before reading minor_faults: the last fault
+	 * resolution rwprotect_range()s before incrementing the counter,
+	 * so the main thread can race ahead of the increment.  Stopping
+	 * here also makes Phase 3 a clean async-only test -- with the
+	 * handler still running it would silently resolve any sync fault
+	 * the kernel erroneously delivers, masking a regression.
+	 */
+	if (write(gopts->pipefd[1], &c, sizeof(c)) != sizeof(c))
+		err("pipe write");
+	if (pthread_join(uffd_mon, NULL))
+		err("join() failed");
+
+	if (uargs.minor_faults == 0) {
+		uffd_test_fail("expected sync faults, got 0");
+		return;
+	}
+
+	/* Phase 3: flip back to async */
+	set_async_mode(gopts->uffd, true);
+
+	/* RW-protect and access again -- should auto-resolve */
+	rwprotect_range(gopts->uffd, (uint64_t)gopts->area_dst,
+			nr_pages * page_size, true);
+
+	for (p = 0; p < nr_pages; p++) {
+		volatile char *page = gopts->area_dst + p * page_size;
+		(void)*page;
+	}
+
+	uffd_test_pass();
+	return;
+out:
+	if (write(gopts->pipefd[1], &c, sizeof(c)) != sizeof(c))
+		err("pipe write");
+	if (pthread_join(uffd_mon, NULL))
+		err("join() failed");
+}
+
+/*
+ * Test that RW-protected pages become accessible after closing uffd.
+ */
+static void uffd_rwp_close_test(uffd_global_test_opts_t *gopts,
+				uffd_test_args_t *args)
+{
+	unsigned long nr_pages = gopts->nr_pages;
+	unsigned long page_size = gopts->page_size;
+	unsigned long p;
+
+	/* Populate */
+	for (p = 0; p < nr_pages; p++)
+		memset(gopts->area_dst + p * page_size, p % 255 + 1, page_size);
+
+	if (uffd_register_rwp(gopts->uffd, gopts->area_dst,
+			      nr_pages * page_size))
+		err("register failure");
+
+	rwprotect_range(gopts->uffd, (uint64_t)gopts->area_dst,
+			nr_pages * page_size, true);
+
+	/* Close uffd -- should restore protnone PTEs */
+	close(gopts->uffd);
+	gopts->uffd = -1;
+
+	/* All pages should be accessible with original content */
+	for (p = 0; p < nr_pages; p++) {
+		unsigned char *page = (unsigned char *)gopts->area_dst +
+				      p * page_size;
+		unsigned char expected = p % 255 + 1;
+
+		if (page[0] != expected) {
+			uffd_test_fail("page %lu not accessible after close", p);
+			return;
+		}
+	}
+
+	uffd_test_pass();
+}
+
+/*
+ * Test that RWP protection is preserved across fork() when
+ * UFFD_FEATURE_EVENT_FORK is enabled.
Without preservation, the child's
+ * PTEs would lose the uffd-wp marker and RWP-protected accesses would
+ * silently fall through to do_numa_page().
+ */
+static void uffd_rwp_fork_test(uffd_global_test_opts_t *gopts,
+			       uffd_test_args_t *args)
+{
+	unsigned long nr_pages = gopts->nr_pages;
+	unsigned long page_size = gopts->page_size;
+	int pagemap_fd;
+	uint64_t value;
+
+	if (uffd_register_rwp(gopts->uffd, gopts->area_dst,
+			      nr_pages * page_size))
+		err("register failed");
+
+	/* Populate + RWP-protect */
+	*gopts->area_dst = 1;
+	rwprotect_range(gopts->uffd, (uint64_t)gopts->area_dst,
+			page_size, true);
+
+	/* Parent: verify uffd-wp bit is set before fork */
+	pagemap_fd = pagemap_open();
+	value = pagemap_get_entry(pagemap_fd, gopts->area_dst);
+	pagemap_check_wp(value, true);
+
+	/*
+	 * Fork with EVENT_FORK: child inherits VM_UFFD_RWP.  Child reads
+	 * its own pagemap and must still see the uffd-wp bit set.
+	 */
+	if (pagemap_test_fork(gopts, true, false)) {
+		uffd_test_fail("RWP marker lost in child after fork");
+		goto out;
+	}
+
+	uffd_test_pass();
+out:
+	close(pagemap_fd);
+}
+
+/*
+ * Test that RWP protection on a pinned anon page is preserved across fork().
+ * Pinning forces copy_present_page() in the child path, which must restore
+ * PAGE_NONE on top of the uffd bit.  Using async mode, a read in the child
+ * auto-resolves if -- and only if -- the PTE was actually protnone+uffd; the
+ * cleared uffd bit afterward proves the fault path ran.
+ */
+static void uffd_rwp_fork_pin_test(uffd_global_test_opts_t *gopts,
+				   uffd_test_args_t *args)
+{
+	unsigned long page_size = gopts->page_size;
+	fork_event_args fevent_args = { .gopts = gopts, .child_uffd = -1 };
+	pin_args pin_args = {};
+	int pagemap_fd, status;
+	pthread_t fevent_thread;
+	uint64_t value;
+	pid_t child;
+
+	if (uffd_register_rwp(gopts->uffd, gopts->area_dst, page_size))
+		err("register failed");
+
+	/* Populate.
*/
+	*gopts->area_dst = 1;
+
+	/* RO-longterm pin so fork() takes copy_present_page() for this PTE. */
+	if (pin_pages(&pin_args, gopts->area_dst, page_size)) {
+		uffd_test_skip("Possibly CONFIG_GUP_TEST missing or unprivileged");
+		uffd_unregister(gopts->uffd, gopts->area_dst, page_size);
+		return;
+	}
+
+	/* RWP-protect: PTE is now PAGE_NONE + uffd bit. */
+	rwprotect_range(gopts->uffd, (uint64_t)gopts->area_dst, page_size, true);
+
+	pagemap_fd = pagemap_open();
+	value = pagemap_get_entry(pagemap_fd, gopts->area_dst);
+	pagemap_check_wp(value, true);
+
+	/*
+	 * UFFD_FEATURE_EVENT_FORK is required so the child inherits
+	 * VM_UFFD_RWP and the marker; without it dup_userfaultfd() resets
+	 * the child VMA and the test would pass for the wrong reason.
+	 * dup_userfaultfd() blocks until the EVENT_FORK message is consumed,
+	 * so spawn a reader before the fork().
+	 */
+	gopts->ready_for_fork = false;
+	if (pthread_create(&fevent_thread, NULL, fork_event_consumer,
+			   &fevent_args))
+		err("pthread_create() for fork event consumer");
+	while (!gopts->ready_for_fork)
+		; /* Wait for consumer to start polling. */
+
+	child = fork();
+	if (child < 0)
+		err("fork");
+	if (child == 0) {
+		volatile char c;
+		int cfd;
+
+		/*
+		 * Read the pinned page.  Only reaches the fault path if the
+		 * child PTE is protnone + uffd; async mode auto-resolves and
+		 * clears the uffd bit.  If copy_present_page() dropped
+		 * PAGE_NONE, the read would silently succeed and the bit
+		 * would still be set.
+		 */
+		c = *(volatile char *)gopts->area_dst;
+		(void)c;
+
+		cfd = pagemap_open();
+		value = pagemap_get_entry(cfd, gopts->area_dst);
+		close(cfd);
+		_exit((value & PM_UFFD_WP) ?
1 : 0);
+	}
+	if (waitpid(child, &status, 0) < 0)
+		err("waitpid");
+	if (pthread_join(fevent_thread, NULL))
+		err("pthread_join() for fork event consumer");
+	if (fevent_args.child_uffd >= 0)
+		close(fevent_args.child_uffd);
+
+	unpin_pages(&pin_args);
+	close(pagemap_fd);
+	if (uffd_unregister(gopts->uffd, gopts->area_dst, page_size))
+		err("unregister failed");
+
+	if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) {
+		uffd_test_fail("RWP not enforced in child after pinned fork");
+		return;
+	}
+
+	uffd_test_pass();
+}
+
+/*
+ * WP and RWP share the uffd-wp PTE bit and cannot coexist in the same VMA.
+ * Registration requesting both modes must be rejected.
+ */
+static void uffd_rwp_wp_exclusive_test(uffd_global_test_opts_t *gopts,
+				       uffd_test_args_t *args)
+{
+	unsigned long nr_pages = gopts->nr_pages;
+	unsigned long page_size = gopts->page_size;
+	struct uffdio_register reg = { };
+
+	reg.range.start = (unsigned long)gopts->area_dst;
+	reg.range.len = nr_pages * page_size;
+	reg.mode = UFFDIO_REGISTER_MODE_WP | UFFDIO_REGISTER_MODE_RWP;
+
+	if (ioctl(gopts->uffd, UFFDIO_REGISTER, &reg) == 0) {
+		uffd_test_fail("register with WP|RWP unexpectedly succeeded");
+		return;
+	}
+	if (errno != EINVAL) {
+		uffd_test_fail("register with WP|RWP: expected EINVAL, got %d",
+			       errno);
+		return;
+	}
+	uffd_test_pass();
+}
+
 static sigjmp_buf jbuf, *sigbuf;
 
 static void sighndl(int sig, siginfo_t *siginfo, void *ptr)
@@ -1604,6 +2299,77 @@ uffd_test_case_t uffd_tests[] = {
 		/* We can't test MADV_COLLAPSE, so try our luck */
 		.uffd_feature_required = UFFD_FEATURE_MINOR_SHMEM,
 	},
+	{
+		.name = "rwp-async",
+		.uffd_fn = uffd_rwp_async_test,
+		.mem_targets = MEM_ALL,
+		.uffd_feature_required =
+		UFFD_FEATURE_RWP | UFFD_FEATURE_RWP_ASYNC,
+	},
+	{
+		.name = "rwp-sync",
+		.uffd_fn = uffd_rwp_sync_test,
+		.mem_targets = MEM_ALL,
+		.uffd_feature_required = UFFD_FEATURE_RWP,
+	},
+	{
+		.name = "rwp-pagemap",
+		.uffd_fn =
uffd_rwp_pagemap_test,
+		.mem_targets = MEM_ALL,
+		.uffd_feature_required =
+		UFFD_FEATURE_RWP | UFFD_FEATURE_RWP_ASYNC,
+	},
+	{
+		.name = "rwp-mprotect",
+		.uffd_fn = uffd_rwp_mprotect_test,
+		.mem_targets = MEM_ALL,
+		.uffd_feature_required =
+		UFFD_FEATURE_RWP | UFFD_FEATURE_RWP_ASYNC,
+	},
+	{
+		.name = "rwp-gup",
+		.uffd_fn = uffd_rwp_gup_test,
+		.mem_targets = MEM_ALL,
+		.uffd_feature_required =
+		UFFD_FEATURE_RWP | UFFD_FEATURE_RWP_ASYNC,
+	},
+	{
+		.name = "rwp-async-toggle",
+		.uffd_fn = uffd_rwp_async_toggle_test,
+		.mem_targets = MEM_ALL,
+		.uffd_feature_required =
+		UFFD_FEATURE_RWP | UFFD_FEATURE_RWP_ASYNC,
+	},
+	{
+		.name = "rwp-close",
+		.uffd_fn = uffd_rwp_close_test,
+		.mem_targets = MEM_ALL,
+		.uffd_feature_required = UFFD_FEATURE_RWP,
+	},
+	{
+		.name = "rwp-fork",
+		.uffd_fn = uffd_rwp_fork_test,
+		.mem_targets = MEM_ALL,
+		.uffd_feature_required =
+		UFFD_FEATURE_RWP | UFFD_FEATURE_EVENT_FORK,
+	},
+	{
+		.name = "rwp-fork-pin",
+		.uffd_fn = uffd_rwp_fork_pin_test,
+		.mem_targets = MEM_ANON,
+		.uffd_feature_required =
+		UFFD_FEATURE_RWP | UFFD_FEATURE_RWP_ASYNC |
+		UFFD_FEATURE_EVENT_FORK,
+	},
+	{
+		.name = "rwp-wp-exclusive",
+		.uffd_fn = uffd_rwp_wp_exclusive_test,
+		.mem_targets = MEM_ALL,
+		.uffd_feature_required =
+		UFFD_FEATURE_RWP |
+		UFFD_FEATURE_PAGEFAULT_FLAG_WP |
+		UFFD_FEATURE_WP_HUGETLBFS_SHMEM,
+	},
 	{
 		.name = "sigbus",
 		.uffd_fn = uffd_sigbus_test,
-- 
2.54.0

From nobody Mon Jun 8 23:56:07 2026
From: "Kiryl Shutsemau (Meta)"
To: akpm@linux-foundation.org, rppt@kernel.org, peterx@redhat.com,
	david@kernel.org
Cc: ljs@kernel.org, surenb@google.com, vbabka@kernel.org,
	Liam.Howlett@oracle.com, ziy@nvidia.com, corbet@lwn.net,
	skhan@linuxfoundation.org, seanjc@google.com, pbonzini@redhat.com,
	jthoughton@google.com, aarcange@redhat.com, sj@kernel.org,
	usama.arif@linux.dev, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-kselftest@vger.kernel.org, kvm@vger.kernel.org,
	kernel-team@meta.com, "Kiryl Shutsemau (Meta)"
Subject: [PATCH v4 14/14] Documentation/userfaultfd: document RWP working set tracking
Date: Mon, 25 May 2026 12:37:28 +0100
Message-ID: <20260525113737.1942478-15-kas@kernel.org>
In-Reply-To: <20260525113737.1942478-1-kas@kernel.org>
References: <20260525113737.1942478-1-kas@kernel.org>

Add an admin-guide section covering UFFDIO_REGISTER_MODE_RWP:

  - sync and async fault models;
  - UFFDIO_RWPROTECT semantics;
  - UFFD_FEATURE_RWP_ASYNC;
  - UFFDIO_SET_MODE runtime mode flips.

It also covers a typical VMM working-set-tracking workflow, from the
detection loop through sync-mode eviction and back to async.

Signed-off-by: Kiryl Shutsemau
Assisted-by: Claude:claude-opus-4-6
---
 Documentation/admin-guide/mm/userfaultfd.rst | 226 ++++++++++++++++++-
 1 file changed, 220 insertions(+), 6 deletions(-)

diff --git a/Documentation/admin-guide/mm/userfaultfd.rst b/Documentation/admin-guide/mm/userfaultfd.rst
index 1e533639fd50..cb5d0e0c9fff 100644
--- a/Documentation/admin-guide/mm/userfaultfd.rst
+++ b/Documentation/admin-guide/mm/userfaultfd.rst
@@ -275,16 +275,16 @@ tracking and it can be different in a few ways:
   - Dirty information will not get lost if the pte was zapped due to
     various reasons (e.g. during split of a shmem transparent huge page).
 
-  - Due to a reverted meaning of soft-dirty (page clean when uffd-wp bit
-    set; dirty when uffd-wp bit cleared), it has different semantics on
-    some of the memory operations.  For example: ``MADV_DONTNEED`` on
+  - Due to a reverted meaning of soft-dirty (page clean when the uffd bit
+    is set; dirty when the uffd bit is cleared), it has different semantics
+    on some of the memory operations.  For example: ``MADV_DONTNEED`` on
     anonymous (or ``MADV_REMOVE`` on a file mapping) will be treated as
-    dirtying of memory by dropping uffd-wp bit during the procedure.
+    dirtying of memory by dropping the uffd bit during the procedure.
 
 The user app can collect the "written/dirty" status by looking up the
-uffd-wp bit for the pages being interested in /proc/pagemap.
+uffd bit for the pages being interested in /proc/pagemap.
 
-The page will not be under track of uffd-wp async mode until the page is
+The page will not be under track of userfaultfd-wp async mode until the page is
 explicitly write-protected by ``ioctl(UFFDIO_WRITEPROTECT)`` with the mode
 flag ``UFFDIO_WRITEPROTECT_MODE_WP`` set.
Trying to resolve a page fault that
 was tracked by async mode userfaultfd-wp is invalid.
@@ -307,6 +307,220 @@ transparent to the guest, we want that same address range to act as if it
 was still poisoned, even though it's on a new physical host which ostensibly
 doesn't have a memory error in the exact same spot.
 
+Read-Write Protection
+---------------------
+
+``UFFDIO_REGISTER_MODE_RWP`` enables read-write protection tracking on a
+memory range.  It is similar to (but faster than) ``mprotect(PROT_NONE)``
+combined with a signal handler; unlike ``mprotect(PROT_NONE)``, RWP only
+traps accesses to *present* PTEs, so accesses to unpopulated addresses in a
+protected range fall through to the normal missing-page path.  It uses the
+PROT_NONE hinting mechanism (same as NUMA balancing) to make pages
+inaccessible while keeping them resident in memory.  It works on anonymous,
+shmem, and hugetlbfs memory.
+
+RWP is designed for VM memory managers that need to track the working set
+of guest memory for cold page eviction to tiered or remote storage.
+
+**Setup:**
+
+1. Open a userfaultfd and enable ``UFFD_FEATURE_RWP`` via ``UFFDIO_API``.
+   Optionally request ``UFFD_FEATURE_RWP_ASYNC`` as well; it requires
+   ``UFFD_FEATURE_RWP`` to be set in the same ``UFFDIO_API`` call.
+
+2. Register the guest memory range with ``UFFDIO_REGISTER_MODE_RWP``
+   (and ``UFFDIO_REGISTER_MODE_MISSING`` if evicted pages will need to be
+   fetched back from storage).
+
+**Feature availability:**
+
+RWP is built on top of two kernel primitives: a spare PTE bit owned by
+userfaultfd (``CONFIG_HAVE_ARCH_USERFAULTFD_WP``) and architecture support
+for present-but-inaccessible PTEs (``CONFIG_ARCH_HAS_PTE_PROTNONE``).  When both
+are available on a 64-bit kernel, the build selects
+``CONFIG_USERFAULTFD_RWP=y`` and the ``VM_UFFD_RWP`` VMA flag becomes
+available.
+
+``UFFD_FEATURE_RWP`` and ``UFFD_FEATURE_RWP_ASYNC`` are masked out of the
+features returned by ``UFFDIO_API`` when the running kernel or architecture
+cannot support them -- for example 32-bit kernels (where ``VM_UFFD_RWP`` is
+unavailable), kernels built without ``CONFIG_USERFAULTFD_RWP``, and
+architectures whose ptes cannot carry the uffd bit at runtime (e.g. riscv
+without the ``SVRSW60T59B`` extension).  ``UFFDIO_API`` does not fail;
+unsupported bits are simply absent from ``uffdio_api.features`` on return.
+Callers should inspect the returned ``features`` after ``UFFDIO_API`` and
+fall back to another tracking method when RWP is unavailable.
+
+**Protecting and Unprotecting:**
+
+Use ``UFFDIO_RWPROTECT`` to protect or unprotect a range, mirroring the
+``UFFDIO_WRITEPROTECT`` interface::
+
+    struct uffdio_rwprotect rwp = {
+        .range = { .start = addr, .len = len },
+        .mode = UFFDIO_RWPROTECT_MODE_RWP,  /* protect */
+    };
+    ioctl(uffd, UFFDIO_RWPROTECT, &rwp);
+
+Setting ``UFFDIO_RWPROTECT_MODE_RWP`` sets PROT_NONE on present PTEs in the
+range.  Pages stay resident and their physical frames are preserved -- only
+access permissions are removed.
+
+Clearing ``UFFDIO_RWPROTECT_MODE_RWP`` restores normal VMA permissions and
+wakes any faulting threads (unless ``UFFDIO_RWPROTECT_MODE_DONTWAKE`` is set).
+
+**Scope of protection:**
+
+RWP protection is a property of *present* PTEs.  ``UFFDIO_RWPROTECT`` only
+affects entries that are already populated.  Unpopulated addresses within
+the range remain unpopulated; when first accessed they fault through the
+normal missing path (``do_anonymous_page()``, ``do_swap_page()``,
+``finish_fault()``) and the resulting PTE is not RWP-protected.  To observe
+the population itself, co-register the range with
+``UFFDIO_REGISTER_MODE_MISSING``.
+
+Protection is preserved across page reclaim: a page swapped out while
+RWP-protected carries the marker on its swap entry, and swap-in restores
+the PROT_NONE state so the first access after swap-in still faults.  The
+same applies to pages temporarily replaced by migration entries.
+
+Operations that drop the PTE entirely -- ``MADV_DONTNEED`` on anonymous
+memory, hole-punch on shmem, truncation of a file mapping -- also drop the
+RWP marker: the next access re-populates the range without protection.
+Unlike WP (which persists via ``PTE_MARKER_UFFD_WP``), there is no
+persistent RWP marker today.  The user needs to re-arm the range with
+``UFFDIO_RWPROTECT`` after any operation that explicitly frees PTEs.
+
+**Fault Handling:**
+
+When a protected page is accessed:
+
+- **Sync mode** (default): The faulting thread blocks and a
+  ``UFFD_PAGEFAULT_FLAG_RWP`` message is delivered to the userfaultfd
+  handler.  The handler resolves the fault with ``UFFDIO_RWPROTECT``
+  (clearing ``MODE_RWP``), which restores the PTE permissions and wakes
+  the faulting thread.
+
+- **Async mode** (``UFFD_FEATURE_RWP_ASYNC``): The kernel automatically
+  restores PTE permissions and the thread continues without blocking.  No
+  message is delivered to the handler.
+
+**Runtime Mode Switching:**
+
+``UFFDIO_SET_MODE`` toggles ``UFFD_FEATURE_RWP_ASYNC`` at runtime, allowing
+the VMM to switch between lightweight async detection and safe sync
+eviction without re-registering.  The toggle takes ``mmap_write_lock()`` to
+ensure all in-flight faults complete before the mode change takes effect.
+
+**Cold Page Detection with PAGEMAP_SCAN:**
+
+RWP-protected PTEs carry the uffd PTE bit; the fault-resolution path
+clears it.
``PAGEMAP_SCAN`` reports ``PAGE_IS_ACCESSED`` once the bit is +clear on a ``VM_UFFD_RWP`` VMA, so inverting it efficiently reports the +still-protected (cold) pages:: + + struct pm_scan_arg arg =3D { + .size =3D sizeof(arg), + .start =3D guest_mem_start, + .end =3D guest_mem_end, + .vec =3D (uint64_t)regions, + .vec_len =3D regions_len, + .category_mask =3D PAGE_IS_ACCESSED, + .category_inverted =3D PAGE_IS_ACCESSED, + .return_mask =3D PAGE_IS_ACCESSED, + }; + long n =3D ioctl(pagemap_fd, PAGEMAP_SCAN, &arg); + +The returned ``page_region`` array contains contiguous cold ranges that can +then be evicted. + +**Cleanup:** + +When the userfaultfd is closed or the range is unregistered, all PROT_NONE +PTEs are automatically restored to their normal VMA permissions. This +prevents pages from becoming permanently inaccessible. + +**VMM Working Set Tracking Workflow:** + +A typical VMM lifecycle for cold page eviction to tiered storage. Two +mappings of the same shmem (or hugetlbfs) file are used: ``guest_mem`` is +the RWP-registered mapping that vCPUs access through, and ``io_mem`` is a +private mapping for VMM-side I/O. 
Reading ``io_mem`` does not go through +the RWP-protected PTEs of ``guest_mem``, so the VMM's own ``pwrite()`` +never traps on its own :: + + /* One-time setup */ + fd =3D memfd_create("guest", MFD_CLOEXEC); + ftruncate(fd, guest_size); + guest_mem =3D mmap(NULL, guest_size, PROT_READ | PROT_WRITE, + MAP_SHARED, fd, 0); /* vCPU view, RWP-registered */ + io_mem =3D mmap(NULL, guest_size, PROT_READ | PROT_WRITE, + MAP_SHARED, fd, 0); /* VMM I/O view, unprotected */ + + uffd =3D userfaultfd(O_CLOEXEC | O_NONBLOCK); + ioctl(uffd, UFFDIO_API, &(struct uffdio_api){ + .api =3D UFFD_API, + .features =3D UFFD_FEATURE_RWP | UFFD_FEATURE_RWP_ASYNC, + }); + ioctl(uffd, UFFDIO_REGISTER, &(struct uffdio_register){ + .range =3D { guest_mem, guest_size }, + .mode =3D UFFDIO_REGISTER_MODE_RWP | + UFFDIO_REGISTER_MODE_MISSING, + }); + + /* Tracking loop */ + while (vm_running) { + /* 1. Detection phase (async =E2=80=94 no vCPU stalls) */ + ioctl(uffd, UFFDIO_RWPROTECT, &(struct uffdio_rwprotect){ + .range =3D full_range, + .mode =3D UFFDIO_RWPROTECT_MODE_RWP }); + sleep(tracking_interval); + + /* 2. Find cold pages (uffd bit still set) */ + ioctl(pagemap_fd, PAGEMAP_SCAN, &(struct pm_scan_arg){ + .category_mask =3D PAGE_IS_ACCESSED, + .category_inverted =3D PAGE_IS_ACCESSED, + .return_mask =3D PAGE_IS_ACCESSED, + ... + }); + + /* 3. Switch to sync for safe eviction */ + ioctl(uffd, UFFDIO_SET_MODE, + &(struct uffdio_set_mode){ + .disable =3D UFFD_FEATURE_RWP_ASYNC }); + + /* 4. Evict cold pages (vCPU faults block on guest_mem) */ + for each cold range: + /* Read from io_mem -- bypasses RWP, no fault. */ + pwrite(storage_fd, io_mem + cold_offset, len, offset); + /* Drop the page from the shared file. */ + fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, + cold_offset, len); + /* + * Wake any vCPU blocked on the RWP fault for this range: + * fallocate() does not iterate ctx->fault_pending_wqh. 
+             */
+            ioctl(uffd, UFFDIO_WAKE, &(struct uffdio_range){
+                .start = (uintptr_t)guest_mem + cold_offset,
+                .len = len });
+
+        /* 5. Resume async tracking */
+        ioctl(uffd, UFFDIO_SET_MODE,
+              &(struct uffdio_set_mode){
+                  .enable = UFFD_FEATURE_RWP_ASYNC });
+    }
+
+During step 4, a vCPU that accesses ``guest_mem + cold_offset`` blocks
+with a ``UFFD_PAGEFAULT_FLAG_RWP`` fault while the eviction is in
+progress. After ``fallocate()`` punches the page out and ``UFFDIO_WAKE``
+fires, the vCPU retries the access, faults as ``MISSING``, and the
+handler resolves it with ``UFFDIO_COPY`` from storage.
+
+This workflow targets shmem and hugetlbfs (both support a private
+``io_mem`` mapping over the same fd). Anonymous-memory backings need a
+different inner-loop strategy because the VMM has no way to read the
+page without going through the RWP-protected mapping.
+
 QEMU/KVM
 ========
 
-- 
2.54.0