PTE markers were previously only concerned with UFFD-specific logic - that
is, PTE entries with the UFFD WP marker set or those marked via
UFFDIO_POISON.

However, since the introduction of guard markers in commit
7c53dfbdb024 ("mm: add PTE_MARKER_GUARD PTE marker"), this has no longer
been the case.

Issues have been avoided as guard regions are not permitted in conjunction
with UFFD, but this still leaves very confusing logic in place, most
notably the misleading and poorly named pte_none_mostly() and
huge_pte_none_mostly().

These predicates return true for PTE entries that ought to be treated as
none, but only in certain circumstances, and on the assumption that we are
dealing with H/W poison markers or UFFD WP markers.

This patch removes these functions and instead makes each caller
explicitly check what it needs to check.

As part of this effort it introduces is_uffd_pte_marker() to explicitly
determine whether a marker is in fact used as part of UFFD.

In the HMM logic we note that the only time we would need to check for a
fault is in the case of a UFFD WP marker; otherwise we simply encounter a
fault error (VM_FAULT_HWPOISON for a H/W poisoned marker, VM_FAULT_SIGSEGV
for a guard marker), so only check for the UFFD WP case.

While we're here, we also refactor the code to make it easier to
understand.
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
fs/userfaultfd.c | 83 +++++++++++++++++++----------------
include/asm-generic/hugetlb.h | 8 ----
include/linux/swapops.h | 18 --------
include/linux/userfaultfd_k.h | 21 +++++++++
mm/hmm.c | 2 +-
mm/hugetlb.c | 47 ++++++++++----------
mm/mincore.c | 17 +++++--
mm/userfaultfd.c | 27 +++++++-----
8 files changed, 123 insertions(+), 100 deletions(-)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 54c6cc7fe9c6..04c66b5001d5 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -233,40 +233,46 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
{
struct vm_area_struct *vma = vmf->vma;
pte_t *ptep, pte;
- bool ret = true;
assert_fault_locked(vmf);
ptep = hugetlb_walk(vma, vmf->address, vma_mmu_pagesize(vma));
if (!ptep)
- goto out;
+ return true;
- ret = false;
pte = huge_ptep_get(vma->vm_mm, vmf->address, ptep);
/*
* Lockless access: we're in a wait_event so it's ok if it
- * changes under us. PTE markers should be handled the same as none
- * ptes here.
+ * changes under us.
*/
- if (huge_pte_none_mostly(pte))
- ret = true;
+
+ /* If missing entry, wait for handler. */
+ if (huge_pte_none(pte))
+ return true;
+ /* UFFD PTE markers require handling. */
+ if (is_uffd_pte_marker(pte))
+ return true;
+ /* If VMA has UFFD WP faults enabled and WP fault, wait for handler. */
if (!huge_pte_write(pte) && (reason & VM_UFFD_WP))
- ret = true;
-out:
- return ret;
+ return true;
+
+ /* Otherwise, if entry isn't present, let fault handler deal with it. */
+ return false;
}
#else
static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
struct vm_fault *vmf,
unsigned long reason)
{
- return false; /* should never get here */
+ /* Should never get here. */
+ VM_WARN_ON_ONCE(1);
+ return false;
}
#endif /* CONFIG_HUGETLB_PAGE */
/*
- * Verify the pagetables are still not ok after having reigstered into
+ * Verify the pagetables are still not ok after having registered into
* the fault_pending_wqh to avoid userland having to UFFDIO_WAKE any
* userfault that has already been resolved, if userfaultfd_read_iter and
* UFFDIO_COPY|ZEROPAGE are being run simultaneously on two different
@@ -284,53 +290,55 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
pmd_t *pmd, _pmd;
pte_t *pte;
pte_t ptent;
- bool ret = true;
+ bool ret;
assert_fault_locked(vmf);
pgd = pgd_offset(mm, address);
if (!pgd_present(*pgd))
- goto out;
+ return true;
p4d = p4d_offset(pgd, address);
if (!p4d_present(*p4d))
- goto out;
+ return true;
pud = pud_offset(p4d, address);
if (!pud_present(*pud))
- goto out;
+ return true;
pmd = pmd_offset(pud, address);
again:
_pmd = pmdp_get_lockless(pmd);
if (pmd_none(_pmd))
- goto out;
+ return true;
- ret = false;
if (!pmd_present(_pmd))
- goto out;
+ return false;
- if (pmd_trans_huge(_pmd)) {
- if (!pmd_write(_pmd) && (reason & VM_UFFD_WP))
- ret = true;
- goto out;
- }
+ if (pmd_trans_huge(_pmd))
+ return !pmd_write(_pmd) && (reason & VM_UFFD_WP);
pte = pte_offset_map(pmd, address);
- if (!pte) {
- ret = true;
+ if (!pte)
goto again;
- }
+
/*
* Lockless access: we're in a wait_event so it's ok if it
- * changes under us. PTE markers should be handled the same as none
- * ptes here.
+ * changes under us.
*/
ptent = ptep_get(pte);
- if (pte_none_mostly(ptent))
- ret = true;
+
+ ret = true;
+ /* If missing entry, wait for handler. */
+ if (pte_none(ptent))
+ goto out;
+ /* UFFD PTE markers require handling. */
+ if (is_uffd_pte_marker(ptent))
+ goto out;
+ /* If VMA has UFFD WP faults enabled and WP fault, wait for handler. */
if (!pte_write(ptent) && (reason & VM_UFFD_WP))
- ret = true;
- pte_unmap(pte);
+ goto out;
+ ret = false;
out:
+ pte_unmap(pte);
return ret;
}
@@ -490,12 +498,13 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
set_current_state(blocking_state);
spin_unlock_irq(&ctx->fault_pending_wqh.lock);
- if (!is_vm_hugetlb_page(vma))
- must_wait = userfaultfd_must_wait(ctx, vmf, reason);
- else
+ if (is_vm_hugetlb_page(vma)) {
must_wait = userfaultfd_huge_must_wait(ctx, vmf, reason);
- if (is_vm_hugetlb_page(vma))
hugetlb_vma_unlock_read(vma);
+ } else {
+ must_wait = userfaultfd_must_wait(ctx, vmf, reason);
+ }
+
release_fault_lock(vmf);
if (likely(must_wait && !READ_ONCE(ctx->released))) {
diff --git a/include/asm-generic/hugetlb.h b/include/asm-generic/hugetlb.h
index dcb8727f2b82..e1a2e1b7c8e7 100644
--- a/include/asm-generic/hugetlb.h
+++ b/include/asm-generic/hugetlb.h
@@ -97,14 +97,6 @@ static inline int huge_pte_none(pte_t pte)
}
#endif
-/* Please refer to comments above pte_none_mostly() for the usage */
-#ifndef __HAVE_ARCH_HUGE_PTE_NONE_MOSTLY
-static inline int huge_pte_none_mostly(pte_t pte)
-{
- return huge_pte_none(pte) || is_pte_marker(pte);
-}
-#endif
-
#ifndef __HAVE_ARCH_HUGE_PTEP_SET_WRPROTECT
static inline void huge_ptep_set_wrprotect(struct mm_struct *mm,
unsigned long addr, pte_t *ptep)
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 2687928a8146..d1f665935cfc 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -469,24 +469,6 @@ static inline int is_guard_swp_entry(swp_entry_t entry)
(pte_marker_get(entry) & PTE_MARKER_GUARD);
}
-/*
- * This is a special version to check pte_none() just to cover the case when
- * the pte is a pte marker. It existed because in many cases the pte marker
- * should be seen as a none pte; it's just that we have stored some information
- * onto the none pte so it becomes not-none any more.
- *
- * It should be used when the pte is file-backed, ram-based and backing
- * userspace pages, like shmem. It is not needed upon pgtables that do not
- * support pte markers at all. For example, it's not needed on anonymous
- * memory, kernel-only memory (including when the system is during-boot),
- * non-ram based generic file-system. It's fine to be used even there, but the
- * extra pte marker check will be pure overhead.
- */
-static inline int pte_none_mostly(pte_t pte)
-{
- return pte_none(pte) || is_pte_marker(pte);
-}
-
static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
{
struct page *p = pfn_to_page(swp_offset_pfn(entry));
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index c0e716aec26a..da0b4fcc566f 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -479,4 +479,25 @@ static inline bool pte_swp_uffd_wp_any(pte_t pte)
return false;
}
+
+static inline bool is_uffd_pte_marker(pte_t pte)
+{
+ swp_entry_t entry;
+
+ if (pte_present(pte))
+ return false;
+
+ entry = pte_to_swp_entry(pte);
+ if (!is_pte_marker_entry(entry))
+ return false;
+
+ /* UFFD WP, poisoned swap entries are UFFD handled. */
+ if (pte_marker_entry_uffd_wp(entry))
+ return true;
+ if (is_poisoned_swp_entry(entry))
+ return true;
+
+ return false;
+}
+
#endif /* _LINUX_USERFAULTFD_K_H */
diff --git a/mm/hmm.c b/mm/hmm.c
index a56081d67ad6..43d4a91035ff 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -244,7 +244,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
uint64_t pfn_req_flags = *hmm_pfn;
uint64_t new_pfn_flags = 0;
- if (pte_none_mostly(pte)) {
+ if (pte_none(pte) || pte_marker_uffd_wp(pte)) {
required_fault =
hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0);
if (required_fault)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 1ea459723cce..01c784547d1e 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6743,29 +6743,28 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
}
vmf.orig_pte = huge_ptep_get(mm, vmf.address, vmf.pte);
- if (huge_pte_none_mostly(vmf.orig_pte)) {
- if (is_pte_marker(vmf.orig_pte)) {
- pte_marker marker =
- pte_marker_get(pte_to_swp_entry(vmf.orig_pte));
-
- if (marker & PTE_MARKER_POISONED) {
- ret = VM_FAULT_HWPOISON_LARGE |
- VM_FAULT_SET_HINDEX(hstate_index(h));
- goto out_mutex;
- } else if (WARN_ON_ONCE(marker & PTE_MARKER_GUARD)) {
- /* This isn't supported in hugetlb. */
- ret = VM_FAULT_SIGSEGV;
- goto out_mutex;
- }
- }
-
+ if (huge_pte_none(vmf.orig_pte))
/*
- * Other PTE markers should be handled the same way as none PTE.
- *
* hugetlb_no_page will drop vma lock and hugetlb fault
* mutex internally, which make us return immediately.
*/
return hugetlb_no_page(mapping, &vmf);
+
+ if (is_pte_marker(vmf.orig_pte)) {
+ const pte_marker marker =
+ pte_marker_get(pte_to_swp_entry(vmf.orig_pte));
+
+ if (marker & PTE_MARKER_POISONED) {
+ ret = VM_FAULT_HWPOISON_LARGE |
+ VM_FAULT_SET_HINDEX(hstate_index(h));
+ goto out_mutex;
+ } else if (WARN_ON_ONCE(marker & PTE_MARKER_GUARD)) {
+ /* This isn't supported in hugetlb. */
+ ret = VM_FAULT_SIGSEGV;
+ goto out_mutex;
+ }
+
+ return hugetlb_no_page(mapping, &vmf);
}
ret = 0;
@@ -6934,6 +6933,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
int ret = -ENOMEM;
struct folio *folio;
bool folio_in_pagecache = false;
+ pte_t dst_ptep;
if (uffd_flags_mode_is(flags, MFILL_ATOMIC_POISON)) {
ptl = huge_pte_lock(h, dst_mm, dst_pte);
@@ -7073,13 +7073,14 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
if (folio_test_hwpoison(folio))
goto out_release_unlock;
+ ret = -EEXIST;
+
+ dst_ptep = huge_ptep_get(dst_mm, dst_addr, dst_pte);
/*
- * We allow to overwrite a pte marker: consider when both MISSING|WP
- * registered, we firstly wr-protect a none pte which has no page cache
- * page backing it, then access the page.
+ * See comment about UFFD marker overwriting in
+ * mfill_atomic_install_pte().
*/
- ret = -EEXIST;
- if (!huge_pte_none_mostly(huge_ptep_get(dst_mm, dst_addr, dst_pte)))
+ if (!huge_pte_none(dst_ptep) && !is_uffd_pte_marker(dst_ptep))
goto out_release_unlock;
if (folio_in_pagecache)
diff --git a/mm/mincore.c b/mm/mincore.c
index 8ec4719370e1..151b2dbb783b 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -32,11 +32,22 @@ static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
spinlock_t *ptl;
ptl = huge_pte_lock(hstate_vma(walk->vma), walk->mm, pte);
+
/*
* Hugepages under user process are always in RAM and never
* swapped out, but theoretically it needs to be checked.
*/
- present = pte && !huge_pte_none_mostly(huge_ptep_get(walk->mm, addr, pte));
+ if (!pte) {
+ present = 0;
+ } else {
+ const pte_t ptep = huge_ptep_get(walk->mm, addr, pte);
+
+ if (huge_pte_none(ptep) || is_pte_marker(ptep))
+ present = 0;
+ else
+ present = 1;
+ }
+
for (; addr != end; vec++, addr += PAGE_SIZE)
*vec = present;
walk->private = vec;
@@ -175,8 +186,8 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
pte_t pte = ptep_get(ptep);
step = 1;
- /* We need to do cache lookup too for pte markers */
- if (pte_none_mostly(pte))
+ /* We need to do cache lookup too for UFFD pte markers */
+ if (pte_none(pte) || is_uffd_pte_marker(pte))
__mincore_unmapped_range(addr, addr + PAGE_SIZE,
vma, vec);
else if (pte_present(pte)) {
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 00122f42718c..cc4ce205bbec 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -178,6 +178,7 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd,
spinlock_t *ptl;
struct folio *folio = page_folio(page);
bool page_in_cache = folio_mapping(folio);
+ pte_t dst_ptep;
_dst_pte = mk_pte(page, dst_vma->vm_page_prot);
_dst_pte = pte_mkdirty(_dst_pte);
@@ -199,12 +200,15 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd,
}
ret = -EEXIST;
+
+ dst_ptep = ptep_get(dst_pte);
+
/*
- * We allow to overwrite a pte marker: consider when both MISSING|WP
- * registered, we firstly wr-protect a none pte which has no page cache
- * page backing it, then access the page.
+ * We are allowed to overwrite a UFFD pte marker: consider when both
+ * MISSING|WP registered, we firstly wr-protect a none pte which has no
+ * page cache page backing it, then access the page.
*/
- if (!pte_none_mostly(ptep_get(dst_pte)))
+ if (!pte_none(dst_ptep) && !is_uffd_pte_marker(dst_ptep))
goto out_unlock;
if (page_in_cache) {
@@ -583,12 +587,15 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
goto out_unlock;
}
- if (!uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE) &&
- !huge_pte_none_mostly(huge_ptep_get(dst_mm, dst_addr, dst_pte))) {
- err = -EEXIST;
- hugetlb_vma_unlock_read(dst_vma);
- mutex_unlock(&hugetlb_fault_mutex_table[hash]);
- goto out_unlock;
+ if (!uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) {
+ const pte_t ptep = huge_ptep_get(dst_mm, dst_addr, dst_pte);
+
+ if (!huge_pte_none(ptep) && !is_uffd_pte_marker(ptep)) {
+ err = -EEXIST;
+ hugetlb_vma_unlock_read(dst_vma);
+ mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+ goto out_unlock;
+ }
}
err = hugetlb_mfill_atomic_pte(dst_pte, dst_vma, dst_addr,
--
2.51.0
On Sat, Nov 08, 2025 at 05:08:15PM +0000, Lorenzo Stoakes wrote:
> /*
> * Lockless access: we're in a wait_event so it's ok if it
> - * changes under us. PTE markers should be handled the same as none
> - * ptes here.
> + * changes under us.
> */
> - if (huge_pte_none_mostly(pte))
> - ret = true;
> +
> + /* If missing entry, wait for handler. */
It's actually the #PF handler that waits ;-)

When userfaultfd_(huge_)must_wait() returns true, it means that the
process that caused the fault should wait until userspace resolves the
fault; returning false means that it's ok to retry the #PF.

So the comment here should probably read as

	/* entry is still missing, wait for userspace to resolve the fault */

and the rest of the comments here and in userfaultfd_must_wait() need a
similar update.
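E.g. a sketch of the suggested wording (just a suggestion, adjust as you
see fit):

```
	/* Entry is still missing, wait for userspace to resolve the fault. */
	if (huge_pte_none(pte))
		return true;

	...

	/* Otherwise the pte is good enough to retry the #PF, don't wait. */
	return false;
```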
> + if (huge_pte_none(pte))
> + return true;
> + /* UFFD PTE markers require handling. */
> + if (is_uffd_pte_marker(pte))
> + return true;
> + /* If VMA has UFFD WP faults enabled and WP fault, wait for handler. */
> if (!huge_pte_write(pte) && (reason & VM_UFFD_WP))
> - ret = true;
> -out:
> - return ret;
> + return true;
> +
> + /* Otherwise, if entry isn't present, let fault handler deal with it. */
The entry is actually present here, e.g. because a thread called
UFFDIO_COPY in parallel with the fault, so there is no need to block the
faulting process.
> again:
> _pmd = pmdp_get_lockless(pmd);
> if (pmd_none(_pmd))
> - goto out;
> + return true;
>
> - ret = false;
> if (!pmd_present(_pmd))
> - goto out;
> + return false;
This one is actually tricky; maybe it's worth adding a gist of the commit
log from a365ac09d334 ("mm, userfaultfd, THP: avoid waiting when PMD under
THP migration") as a comment.
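Something along these lines, perhaps (a rough sketch of the gist from
memory, please double-check it against the actual commit log):

```
	/*
	 * A PMD that is !pmd_none() but !pmd_present() is e.g. a PMD
	 * migration entry of a THP under migration. The #PF will be
	 * retried once migration completes, so don't make the faulting
	 * thread wait on userspace (see a365ac09d334).
	 */
	if (!pmd_present(_pmd))
		return false;
```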
...
> diff --git a/mm/hmm.c b/mm/hmm.c
> index a56081d67ad6..43d4a91035ff 100644
> --- a/mm/hmm.c
> +++ b/mm/hmm.c
> @@ -244,7 +244,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
> uint64_t pfn_req_flags = *hmm_pfn;
> uint64_t new_pfn_flags = 0;
>
> - if (pte_none_mostly(pte)) {
> + if (pte_none(pte) || pte_marker_uffd_wp(pte)) {
Would be nice to add the note from the changelog as a comment here.
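E.g. a sketch based on the changelog wording:

```
	/*
	 * The only marker that can require a fault here is a UFFD WP
	 * marker - any other marker (H/W poison, guard) simply results
	 * in a fault error (VM_FAULT_HWPOISON, VM_FAULT_SIGSEGV), so
	 * only check for the UFFD WP case.
	 */
	if (pte_none(pte) || pte_marker_uffd_wp(pte)) {
```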
--
Sincerely yours,
Mike.
On Mon, Nov 10, 2025 at 01:17:37PM +0200, Mike Rapoport wrote:
> On Sat, Nov 08, 2025 at 05:08:15PM +0000, Lorenzo Stoakes wrote:
> > + /* If missing entry, wait for handler. */
>
> It's actually the #PF handler that waits ;-)
I think I meant the uffd userland 'handler', as in handle_userfault(). But
this is obviously not clear.
>
> When userfaultfd_(huge_)must_wait() returns true, it means that the
> process that caused the fault should wait until userspace resolves the
> fault; returning false means that it's ok to retry the #PF.
Yup.
>
> So the comment here should probably read as
>
> /* entry is still missing, wait for userspace to resolve the fault */
>
Will update to make this clearer, thanks.
>
> > + if (huge_pte_none(pte))
> > + return true;
> > + /* UFFD PTE markers require handling. */
> > + if (is_uffd_pte_marker(pte))
> > + return true;
> > + /* If VMA has UFFD WP faults enabled and WP fault, wait for handler. */
> > if (!huge_pte_write(pte) && (reason & VM_UFFD_WP))
> > - ret = true;
> > -out:
> > - return ret;
> > + return true;
> > +
> > + /* Otherwise, if entry isn't present, let fault handler deal with it. */
>
> The entry is actually present here, e.g. because a thread called
> UFFDIO_COPY in parallel with the fault, so there is no need to block the
> faulting process.
Well, it might not be? It could be a swap entry, migration entry, etc.,
unless I'm missing cases? The point of the comment was 'ok, if the entry is
non-present in a way that doesn't require a userfaultfd userland handler,
the fault handler will deal with it'.

But anyway, agreed this isn't clear; probably better to just say 'otherwise
no need for the userland uffd handler to do anything here' or similar.

Will update.
> > again:
> > _pmd = pmdp_get_lockless(pmd);
> > if (pmd_none(_pmd))
> > - goto out;
> > + return true;
> >
> > - ret = false;
> > if (!pmd_present(_pmd))
> > - goto out;
> > + return false;
>
> This one is actually tricky; maybe it's worth adding a gist of the commit
> log from a365ac09d334 ("mm, userfaultfd, THP: avoid waiting when PMD under
> THP migration") as a comment.
OK.
> > @@ -244,7 +244,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
> > uint64_t pfn_req_flags = *hmm_pfn;
> > uint64_t new_pfn_flags = 0;
> >
> > - if (pte_none_mostly(pte)) {
> > + if (pte_none(pte) || pte_marker_uffd_wp(pte)) {
>
> Would be nice to add the note from the changelog as a comment here.
OK will do.
Cheers, Lorenzo
On Mon, Nov 10, 2025 at 01:01:36PM +0000, Lorenzo Stoakes wrote:
> On Mon, Nov 10, 2025 at 01:17:37PM +0200, Mike Rapoport wrote:
> > On Sat, Nov 08, 2025 at 05:08:15PM +0000, Lorenzo Stoakes wrote:
> > > + /* Otherwise, if entry isn't present, let fault handler deal with it. */
> >
> > The entry is actually present here, e.g. because a thread called
> > UFFDIO_COPY in parallel with the fault, so there is no need to block the
> > faulting process.
>
> Well, it might not be? It could be a swap entry, migration entry, etc.,
> unless I'm missing cases? The point of the comment was 'ok, if the entry
> is non-present in a way that doesn't require a userfaultfd userland
> handler, the fault handler will deal with it'.
>
> But anyway, agreed this isn't clear; probably better to just say
> 'otherwise no need for the userland uffd handler to do anything here' or
> similar.
It's not that userspace does not need to do anything; it's just that the
pte is good enough for the faulting thread to retry the page fault without
waiting for userspace to resolve the fault.
--
Sincerely yours,
Mike.
On Mon, Nov 10, 2025 at 03:44:43PM +0200, Mike Rapoport wrote:
> On Mon, Nov 10, 2025 at 01:01:36PM +0000, Lorenzo Stoakes wrote:
> > On Mon, Nov 10, 2025 at 01:17:37PM +0200, Mike Rapoport wrote:
> > > On Sat, Nov 08, 2025 at 05:08:15PM +0000, Lorenzo Stoakes wrote:
> > But anyway, agreed this isn't clear; probably better to just say
> > 'otherwise no need for the userland uffd handler to do anything here'
> > or similar.
>
> It's not that userspace does not need to do anything; it's just that the
> pte is good enough for the faulting thread to retry the page fault
> without waiting for userspace to resolve the fault.
OK I will clarify that in the comment.
On 2025/11/9 01:08, Lorenzo Stoakes wrote:
> @@ -175,8 +186,8 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> pte_t pte = ptep_get(ptep);
>
> step = 1;
> - /* We need to do cache lookup too for pte markers */
> - if (pte_none_mostly(pte))
> + /* We need to do cache lookup too for UFFD pte markers */
> + if (pte_none(pte) || is_uffd_pte_marker(pte))
Seems like something has changed: the new is_uffd_pte_marker() check will
miss non-UFFD markers (like guard markers), which would then fall through
to the swap entry logic and be misreported as resident by mincore_swap().
```
/* We need to do cache lookup too for UFFD pte markers */
if (pte_none(pte) || is_uffd_pte_marker(pte))
__mincore_unmapped_range(addr, addr + PAGE_SIZE,
vma, vec);
else if (pte_present(pte)) {
unsigned int batch = pte_batch_hint(ptep, pte);
if (batch > 1) {
unsigned int max_nr = (end - addr) >> PAGE_SHIFT;
step = min_t(unsigned int, batch, max_nr);
}
for (i = 0; i < step; i++)
vec[i] = 1;
} else { /* pte is a swap entry */
*vec = mincore_swap(pte_to_swp_entry(pte), false);
}
```
Wouldn't the generic is_pte_marker() be safer here?
Thanks,
Lance
On Mon, Nov 10, 2025 at 12:26:26AM +0800, Lance Yang wrote:
> > @@ -175,8 +186,8 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> > pte_t pte = ptep_get(ptep);
> > step = 1;
> > - /* We need to do cache lookup too for pte markers */
> > - if (pte_none_mostly(pte))
> > + /* We need to do cache lookup too for UFFD pte markers */
> > + if (pte_none(pte) || is_uffd_pte_marker(pte))
>
> Seems like something has changed: the new is_uffd_pte_marker() check will
> miss non-UFFD markers (like guard markers), which would then fall through
> to the swap entry logic and be misreported as resident by mincore_swap().
I intentionally changed the cases that seemed to explicitly want to check
only for is_uffd_pte_marker().

The issue with markers is that they were first implemented on the
assumption that there was only one kind (UFFD WP); then other markers were
added without correction.

Since we explicitly test for the 'softleaf is swap' case, I assumed we were
good (we check for softleaf entries explicitly):

	/*
	 * Shmem mapping may contain swapin error entries, which are
	 * absent. Page table may contain migration or hwpoison
	 * entries which are always uptodate.
	 */
	if (!leafent_is_swap(entry))
		return !shmem;

But obviously I didn't read that carefully enough - mincore assumes
literally all soft leaf entries can be considered present for non-shmem,
and that shmem would only have no-longer-existing swapin error entries...

Really, that function needs refactoring, with the is-swap check moved
higher.

But TL;DR: you're right, I'll send a fixpatch...
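Probably something along these lines (an untested sketch, going back to the
generic marker check as you suggest):

```
-	/* We need to do cache lookup too for UFFD pte markers */
-	if (pte_none(pte) || is_uffd_pte_marker(pte))
+	/* We need to do cache lookup too for pte markers */
+	if (pte_none(pte) || is_pte_marker(pte))
 		__mincore_unmapped_range(addr, addr + PAGE_SIZE,
 					 vma, vec);
```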
> Wouldn't the generic is_pte_marker() be safer here?
pte_is_marker() now :) I fixed the silly naming inconsistency...
Cheers, Lorenzo