[PATCH v3 02/16] mm: introduce leaf entry type and use to simplify leaf entry logic

Lorenzo Stoakes posted 16 patches 1 month, 1 week ago
[PATCH v3 02/16] mm: introduce leaf entry type and use to simplify leaf entry logic
Posted by Lorenzo Stoakes 1 month, 1 week ago
The kernel maintains leaf page table entries which contain either:

- Nothing ('none' entries)
- Present entries (that is stuff the hardware can navigate without fault)
- Everything else that will cause a fault which the kernel handles

In the 'everything else' group we include swap entries, but we also include
a number of other things such as migration entries, device private entries
and marker entries.

Unfortunately this 'everything else' group expresses everything through
a swp_entry_t type, and these entries are referred to swap entries even
though they may well not contain a... swap entry.

This is compounded by the rather mind-boggling concept of a non-swap swap
entry (checked via non_swap_entry()) and the means by which we twist and
turn to satisfy this.

This patch lays the foundation for reducing this confusion.

We refer to 'everything else' as a 'software-define leaf entry' or
'softleaf'. for short And in fact we scoop up the 'none' entries into this
concept also so we are left with:

- Present entries.
- Softleaf entries (which may be empty).

This allows for radical simplification across the board - one can simply
convert any leaf page table entry to a leaf entry via softleaf_from_pte().

If the entry is present, we return an empty leaf entry, so it is assumed
the caller is aware that they must differentiate between the two categories
of page table entries, checking for the former via pte_present().

As a result, we can eliminate a number of places where we would otherwise
need to use predicates to see if we can proceed with leaf page table entry
conversion and instead just go ahead and do it unconditionally.

We do so where we can, adjusting surrounding logic as necessary to
integrate the new softleaf_t logic as far as seems reasonable at this
stage.

We typedef swp_entry_t to softleaf_t for the time being until the
conversion can be complete, meaning everything remains compatible
regardless of which type is used. We will eventually remove swp_entry_t
when the conversion is complete.

We introduce a new header file to keep things clear - leafops.h - this
imports swapops.h so can direct replace swapops imports without issue, and
we do so in all the files that require it.

Additionally, add new leafops.h file to core mm maintainers entry.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 MAINTAINERS                   |   1 +
 fs/proc/task_mmu.c            |  26 +--
 fs/userfaultfd.c              |   6 +-
 include/linux/leafops.h       | 387 ++++++++++++++++++++++++++++++++++
 include/linux/mm_inline.h     |   6 +-
 include/linux/mm_types.h      |  25 +++
 include/linux/swapops.h       |  28 ---
 include/linux/userfaultfd_k.h |  51 +----
 mm/hmm.c                      |   2 +-
 mm/hugetlb.c                  |  37 ++--
 mm/madvise.c                  |  16 +-
 mm/memory.c                   |  41 ++--
 mm/mincore.c                  |   6 +-
 mm/mprotect.c                 |   6 +-
 mm/mremap.c                   |   4 +-
 mm/page_vma_mapped.c          |  11 +-
 mm/shmem.c                    |   7 +-
 mm/userfaultfd.c              |   6 +-
 18 files changed, 502 insertions(+), 164 deletions(-)
 create mode 100644 include/linux/leafops.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 2628431dcdfe..314910a70bbf 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16257,6 +16257,7 @@ T:	git git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
 F:	include/linux/gfp.h
 F:	include/linux/gfp_types.h
 F:	include/linux/highmem.h
+F:	include/linux/leafops.h
 F:	include/linux/memory.h
 F:	include/linux/mm.h
 F:	include/linux/mm_*.h
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index fc35a0543f01..24d26b49d870 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -14,7 +14,7 @@
 #include <linux/rmap.h>
 #include <linux/swap.h>
 #include <linux/sched/mm.h>
-#include <linux/swapops.h>
+#include <linux/leafops.h>
 #include <linux/mmu_notifier.h>
 #include <linux/page_idle.h>
 #include <linux/shmem_fs.h>
@@ -1230,11 +1230,11 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
 	if (pte_present(ptent)) {
 		folio = page_folio(pte_page(ptent));
 		present = true;
-	} else if (is_swap_pte(ptent)) {
-		swp_entry_t swpent = pte_to_swp_entry(ptent);
+	} else {
+		const softleaf_t entry = softleaf_from_pte(ptent);
 
-		if (is_pfn_swap_entry(swpent))
-			folio = pfn_swap_entry_folio(swpent);
+		if (softleaf_has_pfn(entry))
+			folio = softleaf_to_folio(entry);
 	}
 
 	if (folio) {
@@ -1955,9 +1955,9 @@ static pagemap_entry_t pte_to_pagemap_entry(struct pagemapread *pm,
 		flags |= PM_SWAP;
 		if (is_pfn_swap_entry(entry))
 			page = pfn_swap_entry_to_page(entry);
-		if (pte_marker_entry_uffd_wp(entry))
+		if (softleaf_is_uffd_wp_marker(entry))
 			flags |= PM_UFFD_WP;
-		if (is_guard_swp_entry(entry))
+		if (softleaf_is_guard_marker(entry))
 			flags |=  PM_GUARD_REGION;
 	}
 
@@ -2330,18 +2330,18 @@ static unsigned long pagemap_page_category(struct pagemap_scan_private *p,
 		if (pte_soft_dirty(pte))
 			categories |= PAGE_IS_SOFT_DIRTY;
 	} else if (is_swap_pte(pte)) {
-		swp_entry_t swp;
+		softleaf_t entry;
 
 		categories |= PAGE_IS_SWAPPED;
 		if (!pte_swp_uffd_wp_any(pte))
 			categories |= PAGE_IS_WRITTEN;
 
-		swp = pte_to_swp_entry(pte);
-		if (is_guard_swp_entry(swp))
+		entry = softleaf_from_pte(pte);
+		if (softleaf_is_guard_marker(entry))
 			categories |= PAGE_IS_GUARD;
 		else if ((p->masks_of_interest & PAGE_IS_FILE) &&
-			 is_pfn_swap_entry(swp) &&
-			 !folio_test_anon(pfn_swap_entry_folio(swp)))
+			 softleaf_has_pfn(entry) &&
+			 !folio_test_anon(softleaf_to_folio(entry)))
 			categories |= PAGE_IS_FILE;
 
 		if (pte_swp_soft_dirty(pte))
@@ -2466,7 +2466,7 @@ static void make_uffd_wp_huge_pte(struct vm_area_struct *vma,
 {
 	unsigned long psize;
 
-	if (is_hugetlb_entry_hwpoisoned(ptent) || is_pte_marker(ptent))
+	if (is_hugetlb_entry_hwpoisoned(ptent) || pte_is_marker(ptent))
 		return;
 
 	psize = huge_page_size(hstate_vma(vma));
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 375494309182..4e900091849b 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -29,7 +29,7 @@
 #include <linux/ioctl.h>
 #include <linux/security.h>
 #include <linux/hugetlb.h>
-#include <linux/swapops.h>
+#include <linux/leafops.h>
 #include <linux/miscdevice.h>
 #include <linux/uio.h>
 
@@ -251,7 +251,7 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
 	if (huge_pte_none(pte))
 		return true;
 	/* UFFD PTE markers require userspace to resolve the fault. */
-	if (is_uffd_pte_marker(pte))
+	if (pte_is_uffd_marker(pte))
 		return true;
 	/*
 	 * If VMA has UFFD WP faults enabled and WP fault, wait for userspace to
@@ -337,7 +337,7 @@ static inline bool userfaultfd_must_wait(struct userfaultfd_ctx *ctx,
 	if (pte_none(ptent))
 		goto out;
 	/* UFFD PTE markers require userspace to resolve the fault. */
-	if (is_uffd_pte_marker(ptent))
+	if (pte_is_uffd_marker(ptent))
 		goto out;
 	/*
 	 * If VMA has UFFD WP faults enabled and WP fault, wait for userspace to
diff --git a/include/linux/leafops.h b/include/linux/leafops.h
new file mode 100644
index 000000000000..cff9d94fd5d1
--- /dev/null
+++ b/include/linux/leafops.h
@@ -0,0 +1,387 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Describes operations that can be performed on software-defined page table
+ * leaf entries. These are abstracted from the hardware page table entries
+ * themselves by the softleaf_t type, see mm_types.h.
+ */
+#ifndef _LINUX_LEAFOPS_H
+#define _LINUX_LEAFOPS_H
+
+#include <linux/mm_types.h>
+#include <linux/swapops.h>
+#include <linux/swap.h>
+
+#ifdef CONFIG_MMU
+
+/* Temporary until swp_entry_t eliminated. */
+#define LEAF_TYPE_SHIFT SWP_TYPE_SHIFT
+
+enum softleaf_type {
+	/* Fundamental types. */
+	SOFTLEAF_NONE,
+	SOFTLEAF_SWAP,
+	/* Migration types. */
+	SOFTLEAF_MIGRATION_READ,
+	SOFTLEAF_MIGRATION_READ_EXCLUSIVE,
+	SOFTLEAF_MIGRATION_WRITE,
+	/* Device types. */
+	SOFTLEAF_DEVICE_PRIVATE_READ,
+	SOFTLEAF_DEVICE_PRIVATE_WRITE,
+	SOFTLEAF_DEVICE_EXCLUSIVE,
+	/* H/W posion types. */
+	SOFTLEAF_HWPOISON,
+	/* Marker types. */
+	SOFTLEAF_MARKER,
+};
+
+/**
+ * softleaf_mk_none() - Create an empty ('none') leaf entry.
+ * Returns: empty leaf entry.
+ */
+static inline softleaf_t softleaf_mk_none(void)
+{
+	return ((softleaf_t) { 0 });
+}
+
+/**
+ * softleaf_from_pte() - Obtain a leaf entry from a PTE entry.
+ * @pte: PTE entry.
+ *
+ * If @pte is present (therefore not a leaf entry) the function returns an empty
+ * leaf entry. Otherwise, it returns a leaf entry.
+ *
+ * Returns: Leaf entry.
+ */
+static inline softleaf_t softleaf_from_pte(pte_t pte)
+{
+	if (pte_present(pte) || pte_none(pte))
+		return softleaf_mk_none();
+
+	/* Temporary until swp_entry_t eliminated. */
+	return pte_to_swp_entry(pte);
+}
+
+/**
+ * softleaf_is_none() - Is the leaf entry empty?
+ * @entry: Leaf entry.
+ *
+ * Empty entries are typically the result of a 'none' page table leaf entry
+ * being converted to a leaf entry.
+ *
+ * Returns: true if the entry is empty, false otherwise.
+ */
+static inline bool softleaf_is_none(softleaf_t entry)
+{
+	return entry.val == 0;
+}
+
+/**
+ * softleaf_type() - Identify the type of leaf entry.
+ * @enntry: Leaf entry.
+ *
+ * Returns: the leaf entry type associated with @entry.
+ */
+static inline enum softleaf_type softleaf_type(softleaf_t entry)
+{
+	unsigned int type_num;
+
+	if (softleaf_is_none(entry))
+		return SOFTLEAF_NONE;
+
+	type_num = entry.val >> LEAF_TYPE_SHIFT;
+
+	if (type_num < MAX_SWAPFILES)
+		return SOFTLEAF_SWAP;
+
+	switch (type_num) {
+#ifdef CONFIG_MIGRATION
+	case SWP_MIGRATION_READ:
+		return SOFTLEAF_MIGRATION_READ;
+	case SWP_MIGRATION_READ_EXCLUSIVE:
+		return SOFTLEAF_MIGRATION_READ_EXCLUSIVE;
+	case SWP_MIGRATION_WRITE:
+		return SOFTLEAF_MIGRATION_WRITE;
+#endif
+#ifdef CONFIG_DEVICE_PRIVATE
+	case SWP_DEVICE_WRITE:
+		return SOFTLEAF_DEVICE_PRIVATE_WRITE;
+	case SWP_DEVICE_READ:
+		return SOFTLEAF_DEVICE_PRIVATE_READ;
+	case SWP_DEVICE_EXCLUSIVE:
+		return SOFTLEAF_DEVICE_EXCLUSIVE;
+#endif
+#ifdef CONFIG_MEMORY_FAILURE
+	case SWP_HWPOISON:
+		return SOFTLEAF_HWPOISON;
+#endif
+	case SWP_PTE_MARKER:
+		return SOFTLEAF_MARKER;
+	}
+
+	/* Unknown entry type. */
+	VM_WARN_ON_ONCE(1);
+	return SOFTLEAF_NONE;
+}
+
+/**
+ * softleaf_is_swap() - Is this leaf entry a swap entry?
+ * @entry: Leaf entry.
+ *
+ * Returns: true if the leaf entry is a swap entry, otherwise false.
+ */
+static inline bool softleaf_is_swap(softleaf_t entry)
+{
+	return softleaf_type(entry) == SOFTLEAF_SWAP;
+}
+
+/**
+ * softleaf_is_migration() - Is this leaf entry a migration entry?
+ * @entry: Leaf entry.
+ *
+ * Returns: true if the leaf entry is a migration entry, otherwise false.
+ */
+static inline bool softleaf_is_migration(softleaf_t entry)
+{
+	switch (softleaf_type(entry)) {
+	case SOFTLEAF_MIGRATION_READ:
+	case SOFTLEAF_MIGRATION_READ_EXCLUSIVE:
+	case SOFTLEAF_MIGRATION_WRITE:
+		return true;
+	default:
+		return false;
+	}
+}
+
+/**
+ * softleaf_is_device_private() - Is this leaf entry a device private entry?
+ * @entry: Leaf entry.
+ *
+ * Returns: true if the leaf entry is a device private entry, otherwise false.
+ */
+static inline bool softleaf_is_device_private(softleaf_t entry)
+{
+	switch (softleaf_type(entry)) {
+	case SOFTLEAF_DEVICE_PRIVATE_WRITE:
+	case SOFTLEAF_DEVICE_PRIVATE_READ:
+		return true;
+	default:
+		return false;
+	}
+}
+
+/**
+ * softleaf_is_device_exclusive() - Is this leaf entry a device exclusive entry?
+ * @entry: Leaf entry.
+ *
+ * Returns: true if the leaf entry is a device exclusive entry, otherwise false.
+ */
+static inline bool softleaf_is_device_exclusive(softleaf_t entry)
+{
+	return softleaf_type(entry) == SOFTLEAF_DEVICE_EXCLUSIVE;
+}
+
+/**
+ * softleaf_is_hwpoison() - Is this leaf entry a hardware poison entry?
+ * @entry: Leaf entry.
+ *
+ * Returns: true if the leaf entry is a hardware poison entry, otherwise false.
+ */
+static inline bool softleaf_is_hwpoison(softleaf_t entry)
+{
+	return softleaf_type(entry) == SOFTLEAF_HWPOISON;
+}
+
+/**
+ * softleaf_is_marker() - Is this leaf entry a marker?
+ * @entry: Leaf entry.
+ *
+ * Returns: true if the leaf entry is a marker entry, otherwise false.
+ */
+static inline bool softleaf_is_marker(softleaf_t entry)
+{
+	return softleaf_type(entry) == SOFTLEAF_MARKER;
+}
+
+/**
+ * softleaf_to_marker() - Obtain marker associated with leaf entry.
+ * @entry: Leaf entry, softleaf_is_marker(@entry) must return true.
+ *
+ * Returns: Marker associated with the leaf entry.
+ */
+static inline pte_marker softleaf_to_marker(softleaf_t entry)
+{
+	VM_WARN_ON_ONCE(!softleaf_is_marker(entry));
+
+	return swp_offset(entry) & PTE_MARKER_MASK;
+}
+
+/**
+ * softleaf_has_pfn() - Does this leaf entry encode a valid PFN number?
+ * @entry: Leaf entry.
+ *
+ * A pfn swap entry is a special type of swap entry that always has a pfn stored
+ * in the swap offset. They can either be used to represent unaddressable device
+ * memory, to restrict access to a page undergoing migration or to represent a
+ * pfn which has been hwpoisoned and unmapped.
+ *
+ * Returns: true if the leaf entry encodes a PFN, otherwise false.
+ */
+static inline bool softleaf_has_pfn(softleaf_t entry)
+{
+	/* Make sure the swp offset can always store the needed fields. */
+	BUILD_BUG_ON(SWP_TYPE_SHIFT < SWP_PFN_BITS);
+
+	if (softleaf_is_migration(entry))
+		return true;
+	if (softleaf_is_device_private(entry))
+		return true;
+	if (softleaf_is_device_exclusive(entry))
+		return true;
+	if (softleaf_is_hwpoison(entry))
+		return true;
+
+	return false;
+}
+
+/**
+ * softleaf_to_pfn() - Obtain PFN encoded within leaf entry.
+ * @entry: Leaf entry, softleaf_has_pfn(@entry) must return true.
+ *
+ * Returns: The PFN associated with the leaf entry.
+ */
+static inline unsigned long softleaf_to_pfn(softleaf_t entry)
+{
+	VM_WARN_ON_ONCE(!softleaf_has_pfn(entry));
+
+	/* Temporary until swp_entry_t eliminated. */
+	return swp_offset_pfn(entry);
+}
+
+/**
+ * softleaf_to_page() - Obtains struct page for PFN encoded within leaf entry.
+ * @entry: Leaf entry, softleaf_has_pfn(@entry) must return true.
+ *
+ * Returns: Pointer to the struct page associated with the leaf entry's PFN.
+ */
+static inline struct page *softleaf_to_page(softleaf_t entry)
+{
+	VM_WARN_ON_ONCE(!softleaf_has_pfn(entry));
+
+	/* Temporary until swp_entry_t eliminated. */
+	return pfn_swap_entry_to_page(entry);
+}
+
+/**
+ * softleaf_to_folio() - Obtains struct folio for PFN encoded within leaf entry.
+ * @entry: Leaf entry, softleaf_has_pfn(@entry) must return true.
+ *
+ * Returns: Pointer to the struct folio associated with the leaf entry's PFN.
+ */
+static inline struct folio *softleaf_to_folio(softleaf_t entry)
+{
+	VM_WARN_ON_ONCE(!softleaf_has_pfn(entry));
+
+	/* Temporary until swp_entry_t eliminated. */
+	return pfn_swap_entry_folio(entry);
+}
+
+/**
+ * softleaf_is_poison_marker() - Is this leaf entry a poison marker?
+ * @entry: Leaf entry.
+ *
+ * The poison marker is set via UFFDIO_POISON. Userfaultfd-specific.
+ *
+ * Returns: true if the leaf entry is a poison marker, otherwise false.
+ */
+static inline bool softleaf_is_poison_marker(softleaf_t entry)
+{
+	if (!softleaf_is_marker(entry))
+		return false;
+
+	return softleaf_to_marker(entry) & PTE_MARKER_POISONED;
+}
+
+/**
+ * softleaf_is_guard_marker() - Is this leaf entry a guard region marker?
+ * @entry: Leaf entry.
+ *
+ * Returns: true if the leaf entry is a guard marker, otherwise false.
+ */
+static inline bool softleaf_is_guard_marker(softleaf_t entry)
+{
+	if (!softleaf_is_marker(entry))
+		return false;
+
+	return softleaf_to_marker(entry) & PTE_MARKER_GUARD;
+}
+
+/**
+ * softleaf_is_uffd_wp_marker() - Is this leaf entry a userfautlfd write protect
+ * marker?
+ * @entry: Leaf entry.
+ *
+ * Userfaultfd-specific.
+ *
+ * Returns: true if the leaf entry is a UFFD WP marker, otherwise false.
+ */
+static inline bool softleaf_is_uffd_wp_marker(softleaf_t entry)
+{
+	if (!softleaf_is_marker(entry))
+		return false;
+
+	return softleaf_to_marker(entry) & PTE_MARKER_UFFD_WP;
+}
+
+/**
+ * pte_is_marker() - Does the PTE entry encode a marker leaf entry?
+ * @pte: PTE entry.
+ *
+ * Returns: true if this PTE is a marker leaf entry, otherwise false.
+ */
+static inline bool pte_is_marker(pte_t pte)
+{
+	return softleaf_is_marker(softleaf_from_pte(pte));
+}
+
+/**
+ * pte_is_uffd_wp_marker() - Does this PTE entry encode a userfaultfd write
+ * protect marker leaf entry?
+ * @pte: PTE entry.
+ *
+ * Returns: true if this PTE is a UFFD WP marker leaf entry, otherwise false.
+ */
+static inline bool pte_is_uffd_wp_marker(pte_t pte)
+{
+	const softleaf_t entry = softleaf_from_pte(pte);
+
+	return softleaf_is_uffd_wp_marker(entry);
+}
+
+/**
+ * pte_is_uffd_marker() - Does this PTE entry encode a userfault-specific marker
+ * leaf entry?
+ * @entry: Leaf entry.
+ *
+ * It's useful to be able to determine which leaf entries encode UFFD-specific
+ * markers so we can handle these correctly.
+ *
+ * Returns: true if this PTE entry is a UFFD-specific marker, otherwise false.
+ */
+static inline bool pte_is_uffd_marker(pte_t pte)
+{
+	const softleaf_t entry = softleaf_from_pte(pte);
+
+	if (!softleaf_is_marker(entry))
+		return false;
+
+	/* UFFD WP, poisoned swap entries are UFFD-handled. */
+	if (softleaf_is_uffd_wp_marker(entry))
+		return true;
+	if (softleaf_is_poison_marker(entry))
+		return true;
+
+	return false;
+}
+
+#endif  /* CONFIG_MMU */
+#endif  /* _LINUX_LEAFOPS_H */
diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
index f6a2b2d20016..ca7a18351797 100644
--- a/include/linux/mm_inline.h
+++ b/include/linux/mm_inline.h
@@ -8,7 +8,7 @@
 #include <linux/swap.h>
 #include <linux/string.h>
 #include <linux/userfaultfd_k.h>
-#include <linux/swapops.h>
+#include <linux/leafops.h>
 
 /**
  * folio_is_file_lru - Should the folio be on a file LRU or anon LRU?
@@ -541,9 +541,9 @@ static inline bool mm_tlb_flush_nested(const struct mm_struct *mm)
  * The caller should insert a new pte created with make_pte_marker().
  */
 static inline pte_marker copy_pte_marker(
-		swp_entry_t entry, struct vm_area_struct *dst_vma)
+		softleaf_t entry, struct vm_area_struct *dst_vma)
 {
-	pte_marker srcm = pte_marker_get(entry);
+	const pte_marker srcm = softleaf_to_marker(entry);
 	/* Always copy error entries. */
 	pte_marker dstm = srcm & (PTE_MARKER_POISONED | PTE_MARKER_GUARD);
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 5021047485a9..4f66a3206a63 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -285,6 +285,31 @@ typedef struct {
 	unsigned long val;
 } swp_entry_t;
 
+/**
+ * typedef softleaf_t - Describes a page table software leaf entry, abstracted
+ * from its architecture-specific encoding.
+ *
+ * Page table leaf entries are those which do not reference any descendent page
+ * tables but rather either reference a data page, are an empty (or 'none'
+ * entry), or contain a non-present entry.
+ *
+ * If referencing another page table or a data page then the page table entry is
+ * pertinent to hardware - that is it tells the hardware how to decode the page
+ * table entry.
+ *
+ * Otherwise it is a software-defined leaf page table entry, which this type
+ * describes. See leafops.h and specifically @softleaf_type for a list of all
+ * possible kinds of software leaf entry.
+ *
+ * A softleaf_t entry is abstracted from the hardware page table entry, so is
+ * not architecture-specific.
+ *
+ * NOTE: While we transition from the confusing swp_entry_t type used for this
+ *       purpose, we simply alias this type. This will be removed once the
+ *       transition is complete.
+ */
+typedef swp_entry_t softleaf_t;
+
 #if defined(CONFIG_MEMCG) || defined(CONFIG_SLAB_OBJ_EXT)
 /* We have some extra room after the refcount in tail pages. */
 #define NR_PAGES_IN_LARGE_FOLIO
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index d1f665935cfc..0a4b3f51ecf5 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -426,21 +426,6 @@ static inline swp_entry_t make_pte_marker_entry(pte_marker marker)
 	return swp_entry(SWP_PTE_MARKER, marker);
 }
 
-static inline bool is_pte_marker_entry(swp_entry_t entry)
-{
-	return swp_type(entry) == SWP_PTE_MARKER;
-}
-
-static inline pte_marker pte_marker_get(swp_entry_t entry)
-{
-	return swp_offset(entry) & PTE_MARKER_MASK;
-}
-
-static inline bool is_pte_marker(pte_t pte)
-{
-	return is_swap_pte(pte) && is_pte_marker_entry(pte_to_swp_entry(pte));
-}
-
 static inline pte_t make_pte_marker(pte_marker marker)
 {
 	return swp_entry_to_pte(make_pte_marker_entry(marker));
@@ -451,24 +436,11 @@ static inline swp_entry_t make_poisoned_swp_entry(void)
 	return make_pte_marker_entry(PTE_MARKER_POISONED);
 }
 
-static inline int is_poisoned_swp_entry(swp_entry_t entry)
-{
-	return is_pte_marker_entry(entry) &&
-	    (pte_marker_get(entry) & PTE_MARKER_POISONED);
-
-}
-
 static inline swp_entry_t make_guard_swp_entry(void)
 {
 	return make_pte_marker_entry(PTE_MARKER_GUARD);
 }
 
-static inline int is_guard_swp_entry(swp_entry_t entry)
-{
-	return is_pte_marker_entry(entry) &&
-		(pte_marker_get(entry) & PTE_MARKER_GUARD);
-}
-
 static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
 {
 	struct page *p = pfn_to_page(swp_offset_pfn(entry));
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index da0b4fcc566f..983c860a00f1 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -16,7 +16,7 @@
 #include <linux/fcntl.h>
 #include <linux/mm.h>
 #include <linux/swap.h>
-#include <linux/swapops.h>
+#include <linux/leafops.h>
 #include <asm-generic/pgtable_uffd.h>
 #include <linux/hugetlb_inline.h>
 
@@ -434,32 +434,6 @@ static inline bool userfaultfd_wp_use_markers(struct vm_area_struct *vma)
 	return userfaultfd_wp_unpopulated(vma);
 }
 
-static inline bool pte_marker_entry_uffd_wp(swp_entry_t entry)
-{
-#ifdef CONFIG_PTE_MARKER_UFFD_WP
-	return is_pte_marker_entry(entry) &&
-	    (pte_marker_get(entry) & PTE_MARKER_UFFD_WP);
-#else
-	return false;
-#endif
-}
-
-static inline bool pte_marker_uffd_wp(pte_t pte)
-{
-#ifdef CONFIG_PTE_MARKER_UFFD_WP
-	swp_entry_t entry;
-
-	if (!is_swap_pte(pte))
-		return false;
-
-	entry = pte_to_swp_entry(pte);
-
-	return pte_marker_entry_uffd_wp(entry);
-#else
-	return false;
-#endif
-}
-
 /*
  * Returns true if this is a swap pte and was uffd-wp wr-protected in either
  * forms (pte marker or a normal swap pte), false otherwise.
@@ -473,31 +447,10 @@ static inline bool pte_swp_uffd_wp_any(pte_t pte)
 	if (pte_swp_uffd_wp(pte))
 		return true;
 
-	if (pte_marker_uffd_wp(pte))
+	if (pte_is_uffd_wp_marker(pte))
 		return true;
 #endif
 	return false;
 }
 
-
-static inline bool is_uffd_pte_marker(pte_t pte)
-{
-	swp_entry_t entry;
-
-	if (pte_present(pte))
-		return false;
-
-	entry = pte_to_swp_entry(pte);
-	if (!is_pte_marker_entry(entry))
-		return false;
-
-	/* UFFD WP, poisoned swap entries are UFFD handled. */
-	if (pte_marker_entry_uffd_wp(entry))
-		return true;
-	if (is_poisoned_swp_entry(entry))
-		return true;
-
-	return false;
-}
-
 #endif /* _LINUX_USERFAULTFD_K_H */
diff --git a/mm/hmm.c b/mm/hmm.c
index 387a38bbaf6a..e350d0cc9d41 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -249,7 +249,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 	 * that will be correctly handled, so we need only check for UFFD WP
 	 * here.
 	 */
-	if (pte_none(pte) || pte_marker_uffd_wp(pte)) {
+	if (pte_none(pte) || pte_is_uffd_wp_marker(pte)) {
 		required_fault =
 			hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0);
 		if (required_fault)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 01c784547d1e..a05edefec1ca 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -28,7 +28,7 @@
 #include <linux/string_choices.h>
 #include <linux/string_helpers.h>
 #include <linux/swap.h>
-#include <linux/swapops.h>
+#include <linux/leafops.h>
 #include <linux/jhash.h>
 #include <linux/numa.h>
 #include <linux/llist.h>
@@ -5662,17 +5662,17 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 				entry = huge_pte_clear_uffd_wp(entry);
 			set_huge_pte_at(dst, addr, dst_pte, entry, sz);
 		} else if (unlikely(is_hugetlb_entry_migration(entry))) {
-			swp_entry_t swp_entry = pte_to_swp_entry(entry);
+			softleaf_t softleaf = softleaf_from_pte(entry);
 			bool uffd_wp = pte_swp_uffd_wp(entry);
 
-			if (!is_readable_migration_entry(swp_entry) && cow) {
+			if (!is_readable_migration_entry(softleaf) && cow) {
 				/*
 				 * COW mappings require pages in both
 				 * parent and child to be set to read.
 				 */
-				swp_entry = make_readable_migration_entry(
-							swp_offset(swp_entry));
-				entry = swp_entry_to_pte(swp_entry);
+				softleaf = make_readable_migration_entry(
+							swp_offset(softleaf));
+				entry = swp_entry_to_pte(softleaf);
 				if (userfaultfd_wp(src_vma) && uffd_wp)
 					entry = pte_swp_mkuffd_wp(entry);
 				set_huge_pte_at(src, addr, src_pte, entry, sz);
@@ -5680,9 +5680,9 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 			if (!userfaultfd_wp(dst_vma))
 				entry = huge_pte_clear_uffd_wp(entry);
 			set_huge_pte_at(dst, addr, dst_pte, entry, sz);
-		} else if (unlikely(is_pte_marker(entry))) {
-			pte_marker marker = copy_pte_marker(
-				pte_to_swp_entry(entry), dst_vma);
+		} else if (unlikely(pte_is_marker(entry))) {
+			const softleaf_t softleaf = softleaf_from_pte(entry);
+			const pte_marker marker = copy_pte_marker(softleaf, dst_vma);
 
 			if (marker)
 				set_huge_pte_at(dst, addr, dst_pte,
@@ -5798,7 +5798,7 @@ static void move_huge_pte(struct vm_area_struct *vma, unsigned long old_addr,
 
 	pte = huge_ptep_get_and_clear(mm, old_addr, src_pte, sz);
 
-	if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
+	if (need_clear_uffd_wp && pte_is_uffd_wp_marker(pte))
 		huge_pte_clear(mm, new_addr, dst_pte, sz);
 	else {
 		if (need_clear_uffd_wp) {
@@ -6617,7 +6617,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
 	 * If this pte was previously wr-protected, keep it wr-protected even
 	 * if populated.
 	 */
-	if (unlikely(pte_marker_uffd_wp(vmf->orig_pte)))
+	if (unlikely(pte_is_uffd_wp_marker(vmf->orig_pte)))
 		new_pte = huge_pte_mkuffd_wp(new_pte);
 	set_huge_pte_at(mm, vmf->address, vmf->pte, new_pte, huge_page_size(h));
 
@@ -6750,9 +6750,9 @@ vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		 */
 		return hugetlb_no_page(mapping, &vmf);
 
-	if (is_pte_marker(vmf.orig_pte)) {
+	if (pte_is_marker(vmf.orig_pte)) {
 		const pte_marker marker =
-			pte_marker_get(pte_to_swp_entry(vmf.orig_pte));
+			softleaf_to_marker(softleaf_from_pte(vmf.orig_pte));
 
 		if (marker & PTE_MARKER_POISONED) {
 			ret = VM_FAULT_HWPOISON_LARGE |
@@ -7080,7 +7080,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
 	 * See comment about UFFD marker overwriting in
 	 * mfill_atomic_install_pte().
 	 */
-	if (!huge_pte_none(dst_ptep) && !is_uffd_pte_marker(dst_ptep))
+	if (!huge_pte_none(dst_ptep) && !pte_is_uffd_marker(dst_ptep))
 		goto out_release_unlock;
 
 	if (folio_in_pagecache)
@@ -7201,8 +7201,9 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
 		if (unlikely(is_hugetlb_entry_hwpoisoned(pte))) {
 			/* Nothing to do. */
 		} else if (unlikely(is_hugetlb_entry_migration(pte))) {
-			swp_entry_t entry = pte_to_swp_entry(pte);
-			struct folio *folio = pfn_swap_entry_folio(entry);
+			softleaf_t entry = softleaf_from_pte(pte);
+
+			struct folio *folio = softleaf_to_folio(entry);
 			pte_t newpte = pte;
 
 			if (is_writable_migration_entry(entry)) {
@@ -7222,14 +7223,14 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
 				newpte = pte_swp_clear_uffd_wp(newpte);
 			if (!pte_same(pte, newpte))
 				set_huge_pte_at(mm, address, ptep, newpte, psize);
-		} else if (unlikely(is_pte_marker(pte))) {
+		} else if (unlikely(pte_is_marker(pte))) {
 			/*
 			 * Do nothing on a poison marker; page is
 			 * corrupted, permissions do not apply. Here
 			 * pte_marker_uffd_wp()==true implies !poison
 			 * because they're mutual exclusive.
 			 */
-			if (pte_marker_uffd_wp(pte) && uffd_wp_resolve)
+			if (pte_is_uffd_wp_marker(pte) && uffd_wp_resolve)
 				/* Safe to modify directly (non-present->none). */
 				huge_pte_clear(mm, address, ptep, psize);
 		} else if (!huge_pte_none(pte)) {
diff --git a/mm/madvise.c b/mm/madvise.c
index fb1c86e630b6..2d5ad3cb37bb 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -29,7 +29,7 @@
 #include <linux/backing-dev.h>
 #include <linux/pagewalk.h>
 #include <linux/swap.h>
-#include <linux/swapops.h>
+#include <linux/leafops.h>
 #include <linux/shmem_fs.h>
 #include <linux/mmu_notifier.h>
 
@@ -690,17 +690,16 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 		 * (page allocation + zeroing).
 		 */
 		if (!pte_present(ptent)) {
-			swp_entry_t entry;
+			softleaf_t entry = softleaf_from_pte(ptent);
 
-			entry = pte_to_swp_entry(ptent);
-			if (!non_swap_entry(entry)) {
+			if (softleaf_is_swap(entry)) {
 				max_nr = (end - addr) / PAGE_SIZE;
 				nr = swap_pte_batch(pte, max_nr, ptent);
 				nr_swap -= nr;
 				free_swap_and_cache_nr(entry, nr);
 				clear_not_present_full_ptes(mm, addr, pte, nr, tlb->fullmm);
-			} else if (is_hwpoison_entry(entry) ||
-				   is_poisoned_swp_entry(entry)) {
+			} else if (softleaf_is_hwpoison(entry) ||
+				   softleaf_is_poison_marker(entry)) {
 				pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
 			}
 			continue;
@@ -1071,8 +1070,9 @@ static bool is_valid_guard_vma(struct vm_area_struct *vma, bool allow_locked)
 
 static bool is_guard_pte_marker(pte_t ptent)
 {
-	return is_swap_pte(ptent) &&
-	       is_guard_swp_entry(pte_to_swp_entry(ptent));
+	const softleaf_t entry = softleaf_from_pte(ptent);
+
+	return softleaf_is_guard_marker(entry);
 }
 
 static int guard_install_pud_entry(pud_t *pud, unsigned long addr,
diff --git a/mm/memory.c b/mm/memory.c
index 4c3a7e09a159..7493ed084b99 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -60,7 +60,7 @@
 #include <linux/writeback.h>
 #include <linux/memcontrol.h>
 #include <linux/mmu_notifier.h>
-#include <linux/swapops.h>
+#include <linux/leafops.h>
 #include <linux/elf.h>
 #include <linux/gfp.h>
 #include <linux/migrate.h>
@@ -109,7 +109,7 @@ static __always_inline bool vmf_orig_pte_uffd_wp(struct vm_fault *vmf)
 	if (!(vmf->flags & FAULT_FLAG_ORIG_PTE_VALID))
 		return false;
 
-	return pte_marker_uffd_wp(vmf->orig_pte);
+	return pte_is_uffd_wp_marker(vmf->orig_pte);
 }
 
 /*
@@ -927,10 +927,10 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 {
 	vm_flags_t vm_flags = dst_vma->vm_flags;
 	pte_t orig_pte = ptep_get(src_pte);
+	softleaf_t entry = softleaf_from_pte(orig_pte);
 	pte_t pte = orig_pte;
 	struct folio *folio;
 	struct page *page;
-	swp_entry_t entry = pte_to_swp_entry(orig_pte);
 
 	if (likely(!non_swap_entry(entry))) {
 		if (swap_duplicate(entry) < 0)
@@ -1016,7 +1016,7 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 		if (try_restore_exclusive_pte(src_vma, addr, src_pte, orig_pte))
 			return -EBUSY;
 		return -ENOENT;
-	} else if (is_pte_marker_entry(entry)) {
+	} else if (softleaf_is_marker(entry)) {
 		pte_marker marker = copy_pte_marker(entry, dst_vma);
 
 		if (marker)
@@ -1717,14 +1717,14 @@ static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
 		unsigned int max_nr, unsigned long addr,
 		struct zap_details *details, int *rss, bool *any_skipped)
 {
-	swp_entry_t entry;
+	softleaf_t entry;
 	int nr = 1;
 
 	*any_skipped = true;
-	entry = pte_to_swp_entry(ptent);
-	if (is_device_private_entry(entry) ||
-		is_device_exclusive_entry(entry)) {
-		struct page *page = pfn_swap_entry_to_page(entry);
+	entry = softleaf_from_pte(ptent);
+	if (softleaf_is_device_private(entry) ||
+	    softleaf_is_device_exclusive(entry)) {
+		struct page *page = softleaf_to_page(entry);
 		struct folio *folio = page_folio(page);
 
 		if (unlikely(!should_zap_folio(details, folio)))
@@ -1739,7 +1739,7 @@ static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
 		rss[mm_counter(folio)]--;
 		folio_remove_rmap_pte(folio, page, vma);
 		folio_put(folio);
-	} else if (!non_swap_entry(entry)) {
+	} else if (softleaf_is_swap(entry)) {
 		/* Genuine swap entries, hence a private anon pages */
 		if (!should_zap_cows(details))
 			return 1;
@@ -1747,20 +1747,20 @@ static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
 		nr = swap_pte_batch(pte, max_nr, ptent);
 		rss[MM_SWAPENTS] -= nr;
 		free_swap_and_cache_nr(entry, nr);
-	} else if (is_migration_entry(entry)) {
-		struct folio *folio = pfn_swap_entry_folio(entry);
+	} else if (softleaf_is_migration(entry)) {
+		struct folio *folio = softleaf_to_folio(entry);
 
 		if (!should_zap_folio(details, folio))
 			return 1;
 		rss[mm_counter(folio)]--;
-	} else if (pte_marker_entry_uffd_wp(entry)) {
+	} else if (softleaf_is_uffd_wp_marker(entry)) {
 		/*
 		 * For anon: always drop the marker; for file: only
 		 * drop the marker if explicitly requested.
 		 */
 		if (!vma_is_anonymous(vma) && !zap_drop_markers(details))
 			return 1;
-	} else if (is_guard_swp_entry(entry)) {
+	} else if (softleaf_is_guard_marker(entry)) {
 		/*
 		 * Ordinary zapping should not remove guard PTE
 		 * markers. Only do so if we should remove PTE markers
@@ -1768,7 +1768,8 @@ static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
 		 */
 		if (!zap_drop_markers(details))
 			return 1;
-	} else if (is_hwpoison_entry(entry) || is_poisoned_swp_entry(entry)) {
+	} else if (softleaf_is_hwpoison(entry) ||
+		   softleaf_is_poison_marker(entry)) {
 		if (!should_zap_cows(details))
 			return 1;
 	} else {
@@ -4390,7 +4391,7 @@ static vm_fault_t pte_marker_clear(struct vm_fault *vmf)
 	 *
 	 * This should also cover the case where e.g. the pte changed
 	 * quickly from a PTE_MARKER_UFFD_WP into PTE_MARKER_POISONED.
-	 * So is_pte_marker() check is not enough to safely drop the pte.
+	 * So pte_is_marker() check is not enough to safely drop the pte.
 	 */
 	if (pte_same(vmf->orig_pte, ptep_get(vmf->pte)))
 		pte_clear(vmf->vma->vm_mm, vmf->address, vmf->pte);
@@ -4424,8 +4425,8 @@ static vm_fault_t pte_marker_handle_uffd_wp(struct vm_fault *vmf)
 
 static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
 {
-	swp_entry_t entry = pte_to_swp_entry(vmf->orig_pte);
-	unsigned long marker = pte_marker_get(entry);
+	const softleaf_t entry = softleaf_from_pte(vmf->orig_pte);
+	const pte_marker marker = softleaf_to_marker(entry);
 
 	/*
 	 * PTE markers should never be empty.  If anything weird happened,
@@ -4442,7 +4443,7 @@ static vm_fault_t handle_pte_marker(struct vm_fault *vmf)
 	if (marker & PTE_MARKER_GUARD)
 		return VM_FAULT_SIGSEGV;
 
-	if (pte_marker_entry_uffd_wp(entry))
+	if (softleaf_is_uffd_wp_marker(entry))
 		return pte_marker_handle_uffd_wp(vmf);
 
 	/* This is an unknown pte marker */
@@ -4690,7 +4691,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 			}
 		} else if (is_hwpoison_entry(entry)) {
 			ret = VM_FAULT_HWPOISON;
-		} else if (is_pte_marker_entry(entry)) {
+		} else if (softleaf_is_marker(entry)) {
 			ret = handle_pte_marker(vmf);
 		} else {
 			print_bad_pte(vma, vmf->address, vmf->orig_pte, NULL);
diff --git a/mm/mincore.c b/mm/mincore.c
index fb80becd6119..b3682488a65d 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -14,7 +14,7 @@
 #include <linux/mman.h>
 #include <linux/syscalls.h>
 #include <linux/swap.h>
-#include <linux/swapops.h>
+#include <linux/leafops.h>
 #include <linux/shmem_fs.h>
 #include <linux/hugetlb.h>
 #include <linux/pgtable.h>
@@ -42,7 +42,7 @@ static int mincore_hugetlb(pte_t *pte, unsigned long hmask, unsigned long addr,
 	} else {
 		const pte_t ptep = huge_ptep_get(walk->mm, addr, pte);
 
-		if (huge_pte_none(ptep) || is_pte_marker(ptep))
+		if (huge_pte_none(ptep) || pte_is_marker(ptep))
 			present = 0;
 		else
 			present = 1;
@@ -187,7 +187,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 
 		step = 1;
 		/* We need to do cache lookup too for markers */
-		if (pte_none(pte) || is_pte_marker(pte))
+		if (pte_none(pte) || pte_is_marker(pte))
 			__mincore_unmapped_range(addr, addr + PAGE_SIZE,
 						 vma, vec);
 		else if (pte_present(pte)) {
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ab4e06cd9a69..0bae241eb7aa 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -326,14 +326,14 @@ static long change_pte_range(struct mmu_gather *tlb,
 				newpte = swp_entry_to_pte(entry);
 				if (pte_swp_uffd_wp(oldpte))
 					newpte = pte_swp_mkuffd_wp(newpte);
-			} else if (is_pte_marker_entry(entry)) {
+			} else if (softleaf_is_marker(entry)) {
 				/*
 				 * Ignore error swap entries unconditionally,
 				 * because any access should sigbus/sigsegv
 				 * anyway.
 				 */
-				if (is_poisoned_swp_entry(entry) ||
-				    is_guard_swp_entry(entry))
+				if (softleaf_is_poison_marker(entry) ||
+				    softleaf_is_guard_marker(entry))
 					continue;
 				/*
 				 * If this is uffd-wp pte marker and we'd like
diff --git a/mm/mremap.c b/mm/mremap.c
index 8ad06cf50783..7c21b2ad13f6 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -17,7 +17,7 @@
 #include <linux/swap.h>
 #include <linux/capability.h>
 #include <linux/fs.h>
-#include <linux/swapops.h>
+#include <linux/leafops.h>
 #include <linux/highmem.h>
 #include <linux/security.h>
 #include <linux/syscalls.h>
@@ -288,7 +288,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
 		pte = move_pte(pte, old_addr, new_addr);
 		pte = move_soft_dirty_pte(pte);
 
-		if (need_clear_uffd_wp && pte_marker_uffd_wp(pte))
+		if (need_clear_uffd_wp && pte_is_uffd_wp_marker(pte))
 			pte_clear(mm, new_addr, new_ptep);
 		else {
 			if (need_clear_uffd_wp) {
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index 137ce27ff68c..be20468fb5a9 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -3,7 +3,7 @@
 #include <linux/rmap.h>
 #include <linux/hugetlb.h>
 #include <linux/swap.h>
-#include <linux/swapops.h>
+#include <linux/leafops.h>
 
 #include "internal.h"
 
@@ -107,15 +107,12 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw, unsigned long pte_nr)
 	pte_t ptent = ptep_get(pvmw->pte);
 
 	if (pvmw->flags & PVMW_MIGRATION) {
-		swp_entry_t entry;
-		if (!is_swap_pte(ptent))
-			return false;
-		entry = pte_to_swp_entry(ptent);
+		const softleaf_t entry = softleaf_from_pte(ptent);
 
-		if (!is_migration_entry(entry))
+		if (!softleaf_is_migration(entry))
 			return false;
 
-		pfn = swp_offset_pfn(entry);
+		pfn = softleaf_to_pfn(entry);
 	} else if (is_swap_pte(ptent)) {
 		swp_entry_t entry;
 
diff --git a/mm/shmem.c b/mm/shmem.c
index 6580f3cd24bb..395ca58ac4a5 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -66,7 +66,7 @@ static struct vfsmount *shm_mnt __ro_after_init;
 #include <linux/falloc.h>
 #include <linux/splice.h>
 #include <linux/security.h>
-#include <linux/swapops.h>
+#include <linux/leafops.h>
 #include <linux/mempolicy.h>
 #include <linux/namei.h>
 #include <linux/ctype.h>
@@ -2286,7 +2286,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	struct address_space *mapping = inode->i_mapping;
 	struct mm_struct *fault_mm = vma ? vma->vm_mm : NULL;
 	struct shmem_inode_info *info = SHMEM_I(inode);
-	swp_entry_t swap, index_entry;
+	swp_entry_t swap;
+	softleaf_t index_entry;
 	struct swap_info_struct *si;
 	struct folio *folio = NULL;
 	bool skip_swapcache = false;
@@ -2298,7 +2299,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
 	swap = index_entry;
 	*foliop = NULL;
 
-	if (is_poisoned_swp_entry(index_entry))
+	if (softleaf_is_poison_marker(index_entry))
 		return -EIO;
 
 	si = get_swap_device(index_entry);
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index cc4ce205bbec..055ec1050776 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -10,7 +10,7 @@
 #include <linux/pagemap.h>
 #include <linux/rmap.h>
 #include <linux/swap.h>
-#include <linux/swapops.h>
+#include <linux/leafops.h>
 #include <linux/userfaultfd_k.h>
 #include <linux/mmu_notifier.h>
 #include <linux/hugetlb.h>
@@ -208,7 +208,7 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd,
 	 * MISSING|WP registered, we firstly wr-protect a none pte which has no
 	 * page cache page backing it, then access the page.
 	 */
-	if (!pte_none(dst_ptep) && !is_uffd_pte_marker(dst_ptep))
+	if (!pte_none(dst_ptep) && !pte_is_uffd_marker(dst_ptep))
 		goto out_unlock;
 
 	if (page_in_cache) {
@@ -590,7 +590,7 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
 		if (!uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) {
 			const pte_t ptep = huge_ptep_get(dst_mm, dst_addr, dst_pte);
 
-			if (!huge_pte_none(ptep) && !is_uffd_pte_marker(ptep)) {
+			if (!huge_pte_none(ptep) && !pte_is_uffd_marker(ptep)) {
 				err = -EEXIST;
 				hugetlb_vma_unlock_read(dst_vma);
 				mutex_unlock(&hugetlb_fault_mutex_table[hash]);
-- 
2.51.0
Re: [PATCH v3 02/16] mm: introduce leaf entry type and use to simplify leaf entry logic
Posted by Lorenzo Stoakes 1 month ago
Hi Andrew,

Please apply the attached fix-patch.

This ensures that we do not accidentally conflict with any valid swap
entry. We can do so without occupying any additional swap (softleaf) type.

Cheers, Lorenzo

----8<----
From 78439310eded5db10692c3e8d0d322bdd6409eff Mon Sep 17 00:00:00 2001
From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Date: Thu, 13 Nov 2025 14:20:34 +0000
Subject: [PATCH] mm: avoid any possible collision between swap entries,
 SOFTLEAF_NONE

The way swap entries are encoded varies by architecture. For x86-64 for
instance, the encoded swap offset is the one's complement of the specified
swap offset.

As a result, device 0, offset 0 would be encoded as 0..01..10b.

This means it is possible to specify a PTE entry that is both device 0,
offset 0 and something that will be identified as a swap entry rather than
a pte_none() entry.

For other architectures, the encoding may preclude such entries being
valid.

The softleaf implementation currently depends on a 0..0b entry being
uniquely identifiable as a none entry.

This is therefore not a safe assumption, so let's fix that.

PTE markers unconditionally occupy a softleaf type, and currently use only
3 bits of the offset field to encode their type with no further information
recorded.

It is therefore no issue at all to add an additional marker type
designating the field as a none entry.

We also make the none checks more canonical by adjusting softleaf_is_none()
to reference softleaf_mk_none().

By doing so we avoid any possible collision with swap file entries while
taking up no further meaningful resource.

Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 include/linux/leafops.h | 8 +++++---
 include/linux/swap.h    | 1 +
 include/linux/swapops.h | 6 +++++-
 3 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/include/linux/leafops.h b/include/linux/leafops.h
index cff9d94fd5d1..74fd95b55e9c 100644
--- a/include/linux/leafops.h
+++ b/include/linux/leafops.h
@@ -40,7 +40,8 @@ enum softleaf_type {
  */
 static inline softleaf_t softleaf_mk_none(void)
 {
-	return ((softleaf_t) { 0 });
+	/* Uniquely identifies none entry. */
+	return make_pte_marker_entry(PTE_MARKER_SOFTLEAF_NONE);
 }

 /**
@@ -72,7 +73,7 @@ static inline softleaf_t softleaf_from_pte(pte_t pte)
  */
 static inline bool softleaf_is_none(softleaf_t entry)
 {
-	return entry.val == 0;
+	return entry.val == softleaf_mk_none().val;
 }

 /**
@@ -199,7 +200,8 @@ static inline bool softleaf_is_hwpoison(softleaf_t entry)
  */
 static inline bool softleaf_is_marker(softleaf_t entry)
 {
-	return softleaf_type(entry) == SOFTLEAF_MARKER;
+	return softleaf_type(entry) == SOFTLEAF_MARKER &&
+		!softleaf_is_none(entry);
 }

 /**
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 38ca3df68716..e5abea55448b 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -112,6 +112,7 @@ static inline int current_is_kswapd(void)
 #define SWP_HWPOISON_NUM 0
 #endif

+/* Leave a type reserved for softleaf none. */
 #define MAX_SWAPFILES \
 	((1 << MAX_SWAPFILES_SHIFT) - SWP_DEVICE_NUM - \
 	SWP_MIGRATION_NUM - SWP_HWPOISON_NUM - \
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 0a4b3f51ecf5..04e74716a845 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -419,7 +419,11 @@ typedef unsigned long pte_marker;
  * PROT_NONE, rather than if they were a memory hole or equivalent.
  */
 #define  PTE_MARKER_GUARD			BIT(2)
-#define  PTE_MARKER_MASK			(BIT(3) - 1)
+
+/* Internal use by the softleaf implementation to represent 'none' entries. */
+#define  PTE_MARKER_SOFTLEAF_NONE		BIT(3)
+
+#define  PTE_MARKER_MASK			(BIT(4) - 1)

 static inline swp_entry_t make_pte_marker_entry(pte_marker marker)
 {
--
2.51.0
Re: [PATCH v3 02/16] mm: introduce leaf entry type and use to simplify leaf entry logic
Posted by Lorenzo Stoakes 1 month ago
OK Andrew please disregard and drop this, I was being too paranoid...

Comment at SWAP_ENTRY_INVALID indicates 0,0 is reserved as I thought before.

Cheers, Lorenzo

On Thu, Nov 13, 2025 at 02:56:09PM +0000, Lorenzo Stoakes wrote:
> Hi Andrew,
>
> Please apply the attached fix-patch.
>
> This ensures that we do not accidentally conflict with any valid swap
> entry. We can do so without occupying any additional swap (softleaf) type.
>
> Cheers, Lorenzo
>
> ----8<----
> From 78439310eded5db10692c3e8d0d322bdd6409eff Mon Sep 17 00:00:00 2001
> From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Date: Thu, 13 Nov 2025 14:20:34 +0000
> Subject: [PATCH] mm: avoid any possible collision between swap entries,
>  SOFTLEAF_NONE
>
> The way swap entries are encoded varies by architecture. For x86-64 for
> instance, the encoded swap offset is the one's complement of the specified
> swap offset.
>
> As a result, device 0, offset 0 would be encoded as 0..01..10b.
>
> This means it is possible to specify a PTE entry that is both device 0,
> offset 0 and something that will be identified as a swap entry rather than
> a pte_none() entry.
>
> For other architectures, the encoding may preclude such entries being
> valid.
>
> The softleaf implementation currently depends on a 0..0b entry being
> uniquely identifiable as a none entry.
>
> This is therefore not a safe assumption, so let's fix that.
>
> PTE markers unconditionally occupy a softleaf type, and currently use only
> 3 bits of the offset field to encode their type with no further information
> recorded.
>
> It is therefore no issue at all to add an additional marker type
> designating the field as a none entry.
>
> We also make the none checks more canonical by adjusting softleaf_is_none()
> to reference softleaf_mk_none().
>
> By doing so we avoid any possible collision with swap file entries while
> taking up no further meaningful resource.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>  include/linux/leafops.h | 8 +++++---
>  include/linux/swap.h    | 1 +
>  include/linux/swapops.h | 6 +++++-
>  3 files changed, 11 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/leafops.h b/include/linux/leafops.h
> index cff9d94fd5d1..74fd95b55e9c 100644
> --- a/include/linux/leafops.h
> +++ b/include/linux/leafops.h
> @@ -40,7 +40,8 @@ enum softleaf_type {
>   */
>  static inline softleaf_t softleaf_mk_none(void)
>  {
> -	return ((softleaf_t) { 0 });
> +	/* Uniquely identifies none entry. */
> +	return make_pte_marker_entry(PTE_MARKER_SOFTLEAF_NONE);
>  }
>
>  /**
> @@ -72,7 +73,7 @@ static inline softleaf_t softleaf_from_pte(pte_t pte)
>   */
>  static inline bool softleaf_is_none(softleaf_t entry)
>  {
> -	return entry.val == 0;
> +	return entry.val == softleaf_mk_none().val;
>  }
>
>  /**
> @@ -199,7 +200,8 @@ static inline bool softleaf_is_hwpoison(softleaf_t entry)
>   */
>  static inline bool softleaf_is_marker(softleaf_t entry)
>  {
> -	return softleaf_type(entry) == SOFTLEAF_MARKER;
> +	return softleaf_type(entry) == SOFTLEAF_MARKER &&
> +		!softleaf_is_none(entry);
>  }
>
>  /**
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 38ca3df68716..e5abea55448b 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -112,6 +112,7 @@ static inline int current_is_kswapd(void)
>  #define SWP_HWPOISON_NUM 0
>  #endif
>
> +/* Leave a type reserved for softleaf none. */
>  #define MAX_SWAPFILES \
>  	((1 << MAX_SWAPFILES_SHIFT) - SWP_DEVICE_NUM - \
>  	SWP_MIGRATION_NUM - SWP_HWPOISON_NUM - \
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 0a4b3f51ecf5..04e74716a845 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -419,7 +419,11 @@ typedef unsigned long pte_marker;
>   * PROT_NONE, rather than if they were a memory hole or equivalent.
>   */
>  #define  PTE_MARKER_GUARD			BIT(2)
> -#define  PTE_MARKER_MASK			(BIT(3) - 1)
> +
> +/* Internal use by the softleaf implementation to represent 'none' entries. */
> +#define  PTE_MARKER_SOFTLEAF_NONE		BIT(3)
> +
> +#define  PTE_MARKER_MASK			(BIT(4) - 1)
>
>  static inline swp_entry_t make_pte_marker_entry(pte_marker marker)
>  {
> --
> 2.51.0
Re: [PATCH v3 02/16] mm: introduce leaf entry type and use to simplify leaf entry logic
Posted by Vlastimil Babka 1 month, 1 week ago
On 11/10/25 23:21, Lorenzo Stoakes wrote:
> The kernel maintains leaf page table entries which contain either:
> 
> - Nothing ('none' entries)
> - Present entries (that is stuff the hardware can navigate without fault)
> - Everything else that will cause a fault which the kernel handles
> 
> In the 'everything else' group we include swap entries, but we also include
> a number of other things such as migration entries, device private entries
> and marker entries.
> 
> Unfortunately this 'everything else' group expresses everything through
> a swp_entry_t type, and these entries are referred to swap entries even
> though they may well not contain a... swap entry.
> 
> This is compounded by the rather mind-boggling concept of a non-swap swap
> entry (checked via non_swap_entry()) and the means by which we twist and
> turn to satisfy this.
> 
> This patch lays the foundation for reducing this confusion.
> 
> We refer to 'everything else' as a 'software-define leaf entry' or
> 'softleaf'. for short And in fact we scoop up the 'none' entries into this
> concept also so we are left with:
> 
> - Present entries.
> - Softleaf entries (which may be empty).
> 
> This allows for radical simplification across the board - one can simply
> convert any leaf page table entry to a leaf entry via softleaf_from_pte().
> 
> If the entry is present, we return an empty leaf entry, so it is assumed
> the caller is aware that they must differentiate between the two categories
> of page table entries, checking for the former via pte_present().
> 
> As a result, we can eliminate a number of places where we would otherwise
> need to use predicates to see if we can proceed with leaf page table entry
> conversion and instead just go ahead and do it unconditionally.
> 
> We do so where we can, adjusting surrounding logic as necessary to
> integrate the new softleaf_t logic as far as seems reasonable at this
> stage.
> 
> We typedef swp_entry_t to softleaf_t for the time being until the
> conversion can be complete, meaning everything remains compatible
> regardless of which type is used. We will eventually remove swp_entry_t
> when the conversion is complete.
> 
> We introduce a new header file to keep things clear - leafops.h - this
> imports swapops.h so can direct replace swapops imports without issue, and
> we do so in all the files that require it.
> 
> Additionally, add new leafops.h file to core mm maintainers entry.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Seems like a reasonable first step of the conversion.

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Re: [PATCH v3 02/16] mm: introduce leaf entry type and use to simplify leaf entry logic
Posted by Lorenzo Stoakes 1 month, 1 week ago
Hi Andrew,

Could we replace:

On Mon, Nov 10, 2025 at 10:21:20PM +0000, Lorenzo Stoakes wrote:
> The kernel maintains leaf page table entries which contain either:
>
> - Nothing ('none' entries)
> - Present entries (that is stuff the hardware can navigate without fault)
> - Everything else that will cause a fault which the kernel handles

With:

The kernel maintains leaf page table entries which contain either:

 - Nothing ('none' entries)
 - Present entries*
 - Everything else that will cause a fault which the kernel handles

* Present entries are either entries the hardware can navigate without page
  fault or special cases like NUMA hint protnone or PMD with cleared
  present bit which contain hardware-valid entries modulo the present bit.


Thanks, Lorenzo
Re: [PATCH v3 02/16] mm: introduce leaf entry type and use to simplify leaf entry logic
Posted by Zi Yan 1 month, 1 week ago
On 10 Nov 2025, at 17:21, Lorenzo Stoakes wrote:

> The kernel maintains leaf page table entries which contain either:
>
> - Nothing ('none' entries)
> - Present entries (that is stuff the hardware can navigate without fault)
> - Everything else that will cause a fault which the kernel handles
>
> In the 'everything else' group we include swap entries, but we also include
> a number of other things such as migration entries, device private entries
> and marker entries.
>
> Unfortunately this 'everything else' group expresses everything through
> a swp_entry_t type, and these entries are referred to swap entries even
> though they may well not contain a... swap entry.
>
> This is compounded by the rather mind-boggling concept of a non-swap swap
> entry (checked via non_swap_entry()) and the means by which we twist and
> turn to satisfy this.
>
> This patch lays the foundation for reducing this confusion.
>
> We refer to 'everything else' as a 'software-define leaf entry' or
> 'softleaf'. for short And in fact we scoop up the 'none' entries into this
> concept also so we are left with:
>
> - Present entries.
> - Softleaf entries (which may be empty).
>
> This allows for radical simplification across the board - one can simply
> convert any leaf page table entry to a leaf entry via softleaf_from_pte().
>
> If the entry is present, we return an empty leaf entry, so it is assumed
> the caller is aware that they must differentiate between the two categories
> of page table entries, checking for the former via pte_present().
>
> As a result, we can eliminate a number of places where we would otherwise
> need to use predicates to see if we can proceed with leaf page table entry
> conversion and instead just go ahead and do it unconditionally.
>
> We do so where we can, adjusting surrounding logic as necessary to
> integrate the new softleaf_t logic as far as seems reasonable at this
> stage.
>
> We typedef swp_entry_t to softleaf_t for the time being until the
> conversion can be complete, meaning everything remains compatible
> regardless of which type is used. We will eventually remove swp_entry_t
> when the conversion is complete.
>
> We introduce a new header file to keep things clear - leafops.h - this
> imports swapops.h so can direct replace swapops imports without issue, and
> we do so in all the files that require it.
>
> Additionally, add new leafops.h file to core mm maintainers entry.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>  MAINTAINERS                   |   1 +
>  fs/proc/task_mmu.c            |  26 +--
>  fs/userfaultfd.c              |   6 +-
>  include/linux/leafops.h       | 387 ++++++++++++++++++++++++++++++++++
>  include/linux/mm_inline.h     |   6 +-
>  include/linux/mm_types.h      |  25 +++
>  include/linux/swapops.h       |  28 ---
>  include/linux/userfaultfd_k.h |  51 +----
>  mm/hmm.c                      |   2 +-
>  mm/hugetlb.c                  |  37 ++--
>  mm/madvise.c                  |  16 +-
>  mm/memory.c                   |  41 ++--
>  mm/mincore.c                  |   6 +-
>  mm/mprotect.c                 |   6 +-
>  mm/mremap.c                   |   4 +-
>  mm/page_vma_mapped.c          |  11 +-
>  mm/shmem.c                    |   7 +-
>  mm/userfaultfd.c              |   6 +-
>  18 files changed, 502 insertions(+), 164 deletions(-)
>  create mode 100644 include/linux/leafops.h
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 2628431dcdfe..314910a70bbf 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -16257,6 +16257,7 @@ T:	git git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
>  F:	include/linux/gfp.h
>  F:	include/linux/gfp_types.h
>  F:	include/linux/highmem.h
> +F:	include/linux/leafops.h
>  F:	include/linux/memory.h
>  F:	include/linux/mm.h
>  F:	include/linux/mm_*.h
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index fc35a0543f01..24d26b49d870 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -14,7 +14,7 @@
>  #include <linux/rmap.h>
>  #include <linux/swap.h>
>  #include <linux/sched/mm.h>
> -#include <linux/swapops.h>
> +#include <linux/leafops.h>
>  #include <linux/mmu_notifier.h>
>  #include <linux/page_idle.h>
>  #include <linux/shmem_fs.h>
> @@ -1230,11 +1230,11 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
>  	if (pte_present(ptent)) {
>  		folio = page_folio(pte_page(ptent));
>  		present = true;
> -	} else if (is_swap_pte(ptent)) {
> -		swp_entry_t swpent = pte_to_swp_entry(ptent);
> +	} else {
> +		const softleaf_t entry = softleaf_from_pte(ptent);
>
> -		if (is_pfn_swap_entry(swpent))
> -			folio = pfn_swap_entry_folio(swpent);
> +		if (softleaf_has_pfn(entry))
> +			folio = softleaf_to_folio(entry);
>  	}
>
>  	if (folio) {

<snip>

>
> @@ -2330,18 +2330,18 @@ static unsigned long pagemap_page_category(struct pagemap_scan_private *p,
>  		if (pte_soft_dirty(pte))
>  			categories |= PAGE_IS_SOFT_DIRTY;
>  	} else if (is_swap_pte(pte)) {

This should be just “else” like smaps_hugetlb_range()’s change, right?

> -		swp_entry_t swp;
> +		softleaf_t entry;
>
>  		categories |= PAGE_IS_SWAPPED;
>  		if (!pte_swp_uffd_wp_any(pte))
>  			categories |= PAGE_IS_WRITTEN;
>
> -		swp = pte_to_swp_entry(pte);
> -		if (is_guard_swp_entry(swp))
> +		entry = softleaf_from_pte(pte);
> +		if (softleaf_is_guard_marker(entry))
>  			categories |= PAGE_IS_GUARD;
>  		else if ((p->masks_of_interest & PAGE_IS_FILE) &&
> -			 is_pfn_swap_entry(swp) &&
> -			 !folio_test_anon(pfn_swap_entry_folio(swp)))
> +			 softleaf_has_pfn(entry) &&
> +			 !folio_test_anon(softleaf_to_folio(entry)))
>  			categories |= PAGE_IS_FILE;
>
>  		if (pte_swp_soft_dirty(pte))

<snip>

> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index 137ce27ff68c..be20468fb5a9 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -3,7 +3,7 @@
>  #include <linux/rmap.h>
>  #include <linux/hugetlb.h>
>  #include <linux/swap.h>
> -#include <linux/swapops.h>
> +#include <linux/leafops.h>
>
>  #include "internal.h"
>
> @@ -107,15 +107,12 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw, unsigned long pte_nr)
>  	pte_t ptent = ptep_get(pvmw->pte);
>
>  	if (pvmw->flags & PVMW_MIGRATION) {
> -		swp_entry_t entry;
> -		if (!is_swap_pte(ptent))
> -			return false;
> -		entry = pte_to_swp_entry(ptent);
> +		const softleaf_t entry = softleaf_from_pte(ptent);

We do not need is_swap_pte() check here because softleaf_from_pte()
does the check. Just trying to reason the code with myself here.

>
> -		if (!is_migration_entry(entry))
> +		if (!softleaf_is_migration(entry))
>  			return false;
>
> -		pfn = swp_offset_pfn(entry);
> +		pfn = softleaf_to_pfn(entry);
>  	} else if (is_swap_pte(ptent)) {
>  		swp_entry_t entry;
>
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 6580f3cd24bb..395ca58ac4a5 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -66,7 +66,7 @@ static struct vfsmount *shm_mnt __ro_after_init;
>  #include <linux/falloc.h>
>  #include <linux/splice.h>
>  #include <linux/security.h>
> -#include <linux/swapops.h>
> +#include <linux/leafops.h>
>  #include <linux/mempolicy.h>
>  #include <linux/namei.h>
>  #include <linux/ctype.h>
> @@ -2286,7 +2286,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>  	struct address_space *mapping = inode->i_mapping;
>  	struct mm_struct *fault_mm = vma ? vma->vm_mm : NULL;
>  	struct shmem_inode_info *info = SHMEM_I(inode);
> -	swp_entry_t swap, index_entry;
> +	swp_entry_t swap;
> +	softleaf_t index_entry;
>  	struct swap_info_struct *si;
>  	struct folio *folio = NULL;
>  	bool skip_swapcache = false;
> @@ -2298,7 +2299,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>  	swap = index_entry;
>  	*foliop = NULL;
>
> -	if (is_poisoned_swp_entry(index_entry))
> +	if (softleaf_is_poison_marker(index_entry))
>  		return -EIO;
>
>  	si = get_swap_device(index_entry);
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index cc4ce205bbec..055ec1050776 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -10,7 +10,7 @@
>  #include <linux/pagemap.h>
>  #include <linux/rmap.h>
>  #include <linux/swap.h>
> -#include <linux/swapops.h>
> +#include <linux/leafops.h>
>  #include <linux/userfaultfd_k.h>
>  #include <linux/mmu_notifier.h>
>  #include <linux/hugetlb.h>
> @@ -208,7 +208,7 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd,
>  	 * MISSING|WP registered, we firstly wr-protect a none pte which has no
>  	 * page cache page backing it, then access the page.
>  	 */
> -	if (!pte_none(dst_ptep) && !is_uffd_pte_marker(dst_ptep))
> +	if (!pte_none(dst_ptep) && !pte_is_uffd_marker(dst_ptep))
>  		goto out_unlock;
>
>  	if (page_in_cache) {
> @@ -590,7 +590,7 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
>  		if (!uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) {
>  			const pte_t ptep = huge_ptep_get(dst_mm, dst_addr, dst_pte);
>
> -			if (!huge_pte_none(ptep) && !is_uffd_pte_marker(ptep)) {
> +			if (!huge_pte_none(ptep) && !pte_is_uffd_marker(ptep)) {
>  				err = -EEXIST;
>  				hugetlb_vma_unlock_read(dst_vma);
>  				mutex_unlock(&hugetlb_fault_mutex_table[hash]);

The rest of the code looks good to me. I will check it again once
you fix the commit log and comments. Thank you for working on this.

Best Regards,
Yan, Zi
Re: [PATCH v3 02/16] mm: introduce leaf entry type and use to simplify leaf entry logic
Posted by Lorenzo Stoakes 1 month, 1 week ago
On Mon, Nov 10, 2025 at 10:56:33PM -0500, Zi Yan wrote:
> On 10 Nov 2025, at 17:21, Lorenzo Stoakes wrote:
>
> > The kernel maintains leaf page table entries which contain either:
> >
> > - Nothing ('none' entries)
> > - Present entries (that is stuff the hardware can navigate without fault)
> > - Everything else that will cause a fault which the kernel handles
> >
> > In the 'everything else' group we include swap entries, but we also include
> > a number of other things such as migration entries, device private entries
> > and marker entries.
> >
> > Unfortunately this 'everything else' group expresses everything through
> > a swp_entry_t type, and these entries are referred to swap entries even
> > though they may well not contain a... swap entry.
> >
> > This is compounded by the rather mind-boggling concept of a non-swap swap
> > entry (checked via non_swap_entry()) and the means by which we twist and
> > turn to satisfy this.
> >
> > This patch lays the foundation for reducing this confusion.
> >
> > We refer to 'everything else' as a 'software-define leaf entry' or
> > 'softleaf'. for short And in fact we scoop up the 'none' entries into this
> > concept also so we are left with:
> >
> > - Present entries.
> > - Softleaf entries (which may be empty).
> >
> > This allows for radical simplification across the board - one can simply
> > convert any leaf page table entry to a leaf entry via softleaf_from_pte().
> >
> > If the entry is present, we return an empty leaf entry, so it is assumed
> > the caller is aware that they must differentiate between the two categories
> > of page table entries, checking for the former via pte_present().
> >
> > As a result, we can eliminate a number of places where we would otherwise
> > need to use predicates to see if we can proceed with leaf page table entry
> > conversion and instead just go ahead and do it unconditionally.
> >
> > We do so where we can, adjusting surrounding logic as necessary to
> > integrate the new softleaf_t logic as far as seems reasonable at this
> > stage.
> >
> > We typedef swp_entry_t to softleaf_t for the time being until the
> > conversion can be complete, meaning everything remains compatible
> > regardless of which type is used. We will eventually remove swp_entry_t
> > when the conversion is complete.
> >
> > We introduce a new header file to keep things clear - leafops.h - this
> > imports swapops.h so can direct replace swapops imports without issue, and
> > we do so in all the files that require it.
> >
> > Additionally, add new leafops.h file to core mm maintainers entry.
> >
> > Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > ---
> >  MAINTAINERS                   |   1 +
> >  fs/proc/task_mmu.c            |  26 +--
> >  fs/userfaultfd.c              |   6 +-
> >  include/linux/leafops.h       | 387 ++++++++++++++++++++++++++++++++++
> >  include/linux/mm_inline.h     |   6 +-
> >  include/linux/mm_types.h      |  25 +++
> >  include/linux/swapops.h       |  28 ---
> >  include/linux/userfaultfd_k.h |  51 +----
> >  mm/hmm.c                      |   2 +-
> >  mm/hugetlb.c                  |  37 ++--
> >  mm/madvise.c                  |  16 +-
> >  mm/memory.c                   |  41 ++--
> >  mm/mincore.c                  |   6 +-
> >  mm/mprotect.c                 |   6 +-
> >  mm/mremap.c                   |   4 +-
> >  mm/page_vma_mapped.c          |  11 +-
> >  mm/shmem.c                    |   7 +-
> >  mm/userfaultfd.c              |   6 +-
> >  18 files changed, 502 insertions(+), 164 deletions(-)
> >  create mode 100644 include/linux/leafops.h
> >
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 2628431dcdfe..314910a70bbf 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -16257,6 +16257,7 @@ T:	git git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> >  F:	include/linux/gfp.h
> >  F:	include/linux/gfp_types.h
> >  F:	include/linux/highmem.h
> > +F:	include/linux/leafops.h
> >  F:	include/linux/memory.h
> >  F:	include/linux/mm.h
> >  F:	include/linux/mm_*.h
> > diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> > index fc35a0543f01..24d26b49d870 100644
> > --- a/fs/proc/task_mmu.c
> > +++ b/fs/proc/task_mmu.c
> > @@ -14,7 +14,7 @@
> >  #include <linux/rmap.h>
> >  #include <linux/swap.h>
> >  #include <linux/sched/mm.h>
> > -#include <linux/swapops.h>
> > +#include <linux/leafops.h>
> >  #include <linux/mmu_notifier.h>
> >  #include <linux/page_idle.h>
> >  #include <linux/shmem_fs.h>
> > @@ -1230,11 +1230,11 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
> >  	if (pte_present(ptent)) {
> >  		folio = page_folio(pte_page(ptent));
> >  		present = true;
> > -	} else if (is_swap_pte(ptent)) {
> > -		swp_entry_t swpent = pte_to_swp_entry(ptent);
> > +	} else {
> > +		const softleaf_t entry = softleaf_from_pte(ptent);
> >
> > -		if (is_pfn_swap_entry(swpent))
> > -			folio = pfn_swap_entry_folio(swpent);
> > +		if (softleaf_has_pfn(entry))
> > +			folio = softleaf_to_folio(entry);
> >  	}
> >
> >  	if (folio) {
>
> <snip>
>
> >
> > @@ -2330,18 +2330,18 @@ static unsigned long pagemap_page_category(struct pagemap_scan_private *p,
> >  		if (pte_soft_dirty(pte))
> >  			categories |= PAGE_IS_SOFT_DIRTY;
> >  	} else if (is_swap_pte(pte)) {
>
> This should be just “else” like smaps_hugetlb_range()’s change, right?

This is code this patch doesn't touch? :) It's not my fault...

Actually in a follow-up patch I do exactly this, taking advantage of the fact
that we handle none entries automatically in softleaf_from_pte().

But it's onne step at a time here to make it easier to review/life easier on
bisect in case there's any mistakes.

>
> > -		swp_entry_t swp;
> > +		softleaf_t entry;
> >
> >  		categories |= PAGE_IS_SWAPPED;
> >  		if (!pte_swp_uffd_wp_any(pte))
> >  			categories |= PAGE_IS_WRITTEN;
> >
> > -		swp = pte_to_swp_entry(pte);
> > -		if (is_guard_swp_entry(swp))
> > +		entry = softleaf_from_pte(pte);
> > +		if (softleaf_is_guard_marker(entry))
> >  			categories |= PAGE_IS_GUARD;
> >  		else if ((p->masks_of_interest & PAGE_IS_FILE) &&
> > -			 is_pfn_swap_entry(swp) &&
> > -			 !folio_test_anon(pfn_swap_entry_folio(swp)))
> > +			 softleaf_has_pfn(entry) &&
> > +			 !folio_test_anon(softleaf_to_folio(entry)))
> >  			categories |= PAGE_IS_FILE;
> >
> >  		if (pte_swp_soft_dirty(pte))
>
> <snip>
>
> > diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> > index 137ce27ff68c..be20468fb5a9 100644
> > --- a/mm/page_vma_mapped.c
> > +++ b/mm/page_vma_mapped.c
> > @@ -3,7 +3,7 @@
> >  #include <linux/rmap.h>
> >  #include <linux/hugetlb.h>
> >  #include <linux/swap.h>
> > -#include <linux/swapops.h>
> > +#include <linux/leafops.h>
> >
> >  #include "internal.h"
> >
> > @@ -107,15 +107,12 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw, unsigned long pte_nr)
> >  	pte_t ptent = ptep_get(pvmw->pte);
> >
> >  	if (pvmw->flags & PVMW_MIGRATION) {
> > -		swp_entry_t entry;
> > -		if (!is_swap_pte(ptent))
> > -			return false;
> > -		entry = pte_to_swp_entry(ptent);
> > +		const softleaf_t entry = softleaf_from_pte(ptent);
>
> We do not need is_swap_pte() check here because softleaf_from_pte()
> does the check. Just trying to reason the code with myself here.

Right, see the next patch :) I'm laying the groundwork for us to be able to do
that.

>
> >
> > -		if (!is_migration_entry(entry))
> > +		if (!softleaf_is_migration(entry))
> >  			return false;
> >
> > -		pfn = swp_offset_pfn(entry);
> > +		pfn = softleaf_to_pfn(entry);
> >  	} else if (is_swap_pte(ptent)) {
> >  		swp_entry_t entry;
> >
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 6580f3cd24bb..395ca58ac4a5 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -66,7 +66,7 @@ static struct vfsmount *shm_mnt __ro_after_init;
> >  #include <linux/falloc.h>
> >  #include <linux/splice.h>
> >  #include <linux/security.h>
> > -#include <linux/swapops.h>
> > +#include <linux/leafops.h>
> >  #include <linux/mempolicy.h>
> >  #include <linux/namei.h>
> >  #include <linux/ctype.h>
> > @@ -2286,7 +2286,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> >  	struct address_space *mapping = inode->i_mapping;
> >  	struct mm_struct *fault_mm = vma ? vma->vm_mm : NULL;
> >  	struct shmem_inode_info *info = SHMEM_I(inode);
> > -	swp_entry_t swap, index_entry;
> > +	swp_entry_t swap;
> > +	softleaf_t index_entry;
> >  	struct swap_info_struct *si;
> >  	struct folio *folio = NULL;
> >  	bool skip_swapcache = false;
> > @@ -2298,7 +2299,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> >  	swap = index_entry;
> >  	*foliop = NULL;
> >
> > -	if (is_poisoned_swp_entry(index_entry))
> > +	if (softleaf_is_poison_marker(index_entry))
> >  		return -EIO;
> >
> >  	si = get_swap_device(index_entry);
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index cc4ce205bbec..055ec1050776 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -10,7 +10,7 @@
> >  #include <linux/pagemap.h>
> >  #include <linux/rmap.h>
> >  #include <linux/swap.h>
> > -#include <linux/swapops.h>
> > +#include <linux/leafops.h>
> >  #include <linux/userfaultfd_k.h>
> >  #include <linux/mmu_notifier.h>
> >  #include <linux/hugetlb.h>
> > @@ -208,7 +208,7 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd,
> >  	 * MISSING|WP registered, we firstly wr-protect a none pte which has no
> >  	 * page cache page backing it, then access the page.
> >  	 */
> > -	if (!pte_none(dst_ptep) && !is_uffd_pte_marker(dst_ptep))
> > +	if (!pte_none(dst_ptep) && !pte_is_uffd_marker(dst_ptep))
> >  		goto out_unlock;
> >
> >  	if (page_in_cache) {
> > @@ -590,7 +590,7 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
> >  		if (!uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) {
> >  			const pte_t ptep = huge_ptep_get(dst_mm, dst_addr, dst_pte);
> >
> > -			if (!huge_pte_none(ptep) && !is_uffd_pte_marker(ptep)) {
> > +			if (!huge_pte_none(ptep) && !pte_is_uffd_marker(ptep)) {
> >  				err = -EEXIST;
> >  				hugetlb_vma_unlock_read(dst_vma);
> >  				mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>
> The rest of the code looks good to me. I will check it again once
> you fix the commit log and comments. Thank you for working on this.

As I said before I'm not respinning this entire series to change every single
reference to present/none to include one or several paragraphs about how we
hacked in protnone and other such things.

If I have to respin the series, I'll add a reference in the commit log.

I beleive the only pertinent comment is:

+ * If referencing another page table or a data page then the page table entry is
+ * pertinent to hardware - that is it tells the hardware how to decode the page
+ * table entry.

From the softleaf_t kdoc.

I think this is fine as-is - protnone entries or _PAGE_PSE-only PMD entries
_are_ pertinent to the hardware fault handler, literally every bit except for
the present bit are set ready for the hardware to decode, telling it how to
decode the leaf entry.

Rather than adding additional confusion by citing this stuff and probably
whatever awful architecture-specific stuff lurks in the arch/ directory I think
we are fine as-is.

Again we decided as a community to hack this stuff in so we as a community have
to live with it like a guy who puts a chimney on his car :)

(mm has many such chimneys on a car that only Homer Simpson would be proud of)

>
> Best Regards,
> Yan, Zi

Cheers, Lorenzo
Re: [PATCH v3 02/16] mm: introduce leaf entry type and use to simplify leaf entry logic
Posted by Zi Yan 1 month, 1 week ago
On 11 Nov 2025, at 2:31, Lorenzo Stoakes wrote:

> On Mon, Nov 10, 2025 at 10:56:33PM -0500, Zi Yan wrote:
>> On 10 Nov 2025, at 17:21, Lorenzo Stoakes wrote:
>>
>>> The kernel maintains leaf page table entries which contain either:
>>>
>>> - Nothing ('none' entries)
>>> - Present entries (that is stuff the hardware can navigate without fault)
>>> - Everything else that will cause a fault which the kernel handles
>>>
>>> In the 'everything else' group we include swap entries, but we also include
>>> a number of other things such as migration entries, device private entries
>>> and marker entries.
>>>
>>> Unfortunately this 'everything else' group expresses everything through
>>> a swp_entry_t type, and these entries are referred to swap entries even
>>> though they may well not contain a... swap entry.
>>>
>>> This is compounded by the rather mind-boggling concept of a non-swap swap
>>> entry (checked via non_swap_entry()) and the means by which we twist and
>>> turn to satisfy this.
>>>
>>> This patch lays the foundation for reducing this confusion.
>>>
>>> We refer to 'everything else' as a 'software-define leaf entry' or
>>> 'softleaf'. for short And in fact we scoop up the 'none' entries into this
>>> concept also so we are left with:
>>>
>>> - Present entries.
>>> - Softleaf entries (which may be empty).
>>>
>>> This allows for radical simplification across the board - one can simply
>>> convert any leaf page table entry to a leaf entry via softleaf_from_pte().
>>>
>>> If the entry is present, we return an empty leaf entry, so it is assumed
>>> the caller is aware that they must differentiate between the two categories
>>> of page table entries, checking for the former via pte_present().
>>>
>>> As a result, we can eliminate a number of places where we would otherwise
>>> need to use predicates to see if we can proceed with leaf page table entry
>>> conversion and instead just go ahead and do it unconditionally.
>>>
>>> We do so where we can, adjusting surrounding logic as necessary to
>>> integrate the new softleaf_t logic as far as seems reasonable at this
>>> stage.
>>>
>>> We typedef swp_entry_t to softleaf_t for the time being until the
>>> conversion can be complete, meaning everything remains compatible
>>> regardless of which type is used. We will eventually remove swp_entry_t
>>> when the conversion is complete.
>>>
>>> We introduce a new header file to keep things clear - leafops.h - this
>>> imports swapops.h so can direct replace swapops imports without issue, and
>>> we do so in all the files that require it.
>>>
>>> Additionally, add new leafops.h file to core mm maintainers entry.
>>>
>>> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>> ---
>>>  MAINTAINERS                   |   1 +
>>>  fs/proc/task_mmu.c            |  26 +--
>>>  fs/userfaultfd.c              |   6 +-
>>>  include/linux/leafops.h       | 387 ++++++++++++++++++++++++++++++++++
>>>  include/linux/mm_inline.h     |   6 +-
>>>  include/linux/mm_types.h      |  25 +++
>>>  include/linux/swapops.h       |  28 ---
>>>  include/linux/userfaultfd_k.h |  51 +----
>>>  mm/hmm.c                      |   2 +-
>>>  mm/hugetlb.c                  |  37 ++--
>>>  mm/madvise.c                  |  16 +-
>>>  mm/memory.c                   |  41 ++--
>>>  mm/mincore.c                  |   6 +-
>>>  mm/mprotect.c                 |   6 +-
>>>  mm/mremap.c                   |   4 +-
>>>  mm/page_vma_mapped.c          |  11 +-
>>>  mm/shmem.c                    |   7 +-
>>>  mm/userfaultfd.c              |   6 +-
>>>  18 files changed, 502 insertions(+), 164 deletions(-)
>>>  create mode 100644 include/linux/leafops.h
>>>
>>> diff --git a/MAINTAINERS b/MAINTAINERS
>>> index 2628431dcdfe..314910a70bbf 100644
>>> --- a/MAINTAINERS
>>> +++ b/MAINTAINERS
>>> @@ -16257,6 +16257,7 @@ T:	git git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
>>>  F:	include/linux/gfp.h
>>>  F:	include/linux/gfp_types.h
>>>  F:	include/linux/highmem.h
>>> +F:	include/linux/leafops.h
>>>  F:	include/linux/memory.h
>>>  F:	include/linux/mm.h
>>>  F:	include/linux/mm_*.h
>>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
>>> index fc35a0543f01..24d26b49d870 100644
>>> --- a/fs/proc/task_mmu.c
>>> +++ b/fs/proc/task_mmu.c
>>> @@ -14,7 +14,7 @@
>>>  #include <linux/rmap.h>
>>>  #include <linux/swap.h>
>>>  #include <linux/sched/mm.h>
>>> -#include <linux/swapops.h>
>>> +#include <linux/leafops.h>
>>>  #include <linux/mmu_notifier.h>
>>>  #include <linux/page_idle.h>
>>>  #include <linux/shmem_fs.h>
>>> @@ -1230,11 +1230,11 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
>>>  	if (pte_present(ptent)) {
>>>  		folio = page_folio(pte_page(ptent));
>>>  		present = true;
>>> -	} else if (is_swap_pte(ptent)) {
>>> -		swp_entry_t swpent = pte_to_swp_entry(ptent);
>>> +	} else {
>>> +		const softleaf_t entry = softleaf_from_pte(ptent);
>>>
>>> -		if (is_pfn_swap_entry(swpent))
>>> -			folio = pfn_swap_entry_folio(swpent);
>>> +		if (softleaf_has_pfn(entry))
>>> +			folio = softleaf_to_folio(entry);
>>>  	}
>>>
>>>  	if (folio) {
>>
>> <snip>
>>
>>>
>>> @@ -2330,18 +2330,18 @@ static unsigned long pagemap_page_category(struct pagemap_scan_private *p,
>>>  		if (pte_soft_dirty(pte))
>>>  			categories |= PAGE_IS_SOFT_DIRTY;
>>>  	} else if (is_swap_pte(pte)) {
>>
>> This should be just “else” like smaps_hugetlb_range()’s change, right?
>
> This is code this patch doesn't touch? :) It's not my fault...
>
> Actually in a follow-up patch I do exactly this, taking advantage of the fact
> that we handle none entries automatically in softleaf_from_pte().
>
> But it's onne step at a time here to make it easier to review/life easier on
> bisect in case there's any mistakes.

OK.

>
>>
>>> -		swp_entry_t swp;
>>> +		softleaf_t entry;
>>>
>>>  		categories |= PAGE_IS_SWAPPED;
>>>  		if (!pte_swp_uffd_wp_any(pte))
>>>  			categories |= PAGE_IS_WRITTEN;
>>>
>>> -		swp = pte_to_swp_entry(pte);
>>> -		if (is_guard_swp_entry(swp))
>>> +		entry = softleaf_from_pte(pte);
>>> +		if (softleaf_is_guard_marker(entry))
>>>  			categories |= PAGE_IS_GUARD;
>>>  		else if ((p->masks_of_interest & PAGE_IS_FILE) &&
>>> -			 is_pfn_swap_entry(swp) &&
>>> -			 !folio_test_anon(pfn_swap_entry_folio(swp)))
>>> +			 softleaf_has_pfn(entry) &&
>>> +			 !folio_test_anon(softleaf_to_folio(entry)))
>>>  			categories |= PAGE_IS_FILE;
>>>
>>>  		if (pte_swp_soft_dirty(pte))
>>
>> <snip>
>>
>>> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
>>> index 137ce27ff68c..be20468fb5a9 100644
>>> --- a/mm/page_vma_mapped.c
>>> +++ b/mm/page_vma_mapped.c
>>> @@ -3,7 +3,7 @@
>>>  #include <linux/rmap.h>
>>>  #include <linux/hugetlb.h>
>>>  #include <linux/swap.h>
>>> -#include <linux/swapops.h>
>>> +#include <linux/leafops.h>
>>>
>>>  #include "internal.h"
>>>
>>> @@ -107,15 +107,12 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw, unsigned long pte_nr)
>>>  	pte_t ptent = ptep_get(pvmw->pte);
>>>
>>>  	if (pvmw->flags & PVMW_MIGRATION) {
>>> -		swp_entry_t entry;
>>> -		if (!is_swap_pte(ptent))
>>> -			return false;
>>> -		entry = pte_to_swp_entry(ptent);
>>> +		const softleaf_t entry = softleaf_from_pte(ptent);
>>
>> We do not need is_swap_pte() check here because softleaf_from_pte()
>> does the check. Just trying to reason the code with myself here.
>
> Right, see the next patch :) I'm laying the groundwork for us to be able to do
> that.
>
>>
>>>
>>> -		if (!is_migration_entry(entry))
>>> +		if (!softleaf_is_migration(entry))
>>>  			return false;
>>>
>>> -		pfn = swp_offset_pfn(entry);
>>> +		pfn = softleaf_to_pfn(entry);
>>>  	} else if (is_swap_pte(ptent)) {
>>>  		swp_entry_t entry;
>>>
>>> diff --git a/mm/shmem.c b/mm/shmem.c
>>> index 6580f3cd24bb..395ca58ac4a5 100644
>>> --- a/mm/shmem.c
>>> +++ b/mm/shmem.c
>>> @@ -66,7 +66,7 @@ static struct vfsmount *shm_mnt __ro_after_init;
>>>  #include <linux/falloc.h>
>>>  #include <linux/splice.h>
>>>  #include <linux/security.h>
>>> -#include <linux/swapops.h>
>>> +#include <linux/leafops.h>
>>>  #include <linux/mempolicy.h>
>>>  #include <linux/namei.h>
>>>  #include <linux/ctype.h>
>>> @@ -2286,7 +2286,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>>>  	struct address_space *mapping = inode->i_mapping;
>>>  	struct mm_struct *fault_mm = vma ? vma->vm_mm : NULL;
>>>  	struct shmem_inode_info *info = SHMEM_I(inode);
>>> -	swp_entry_t swap, index_entry;
>>> +	swp_entry_t swap;
>>> +	softleaf_t index_entry;
>>>  	struct swap_info_struct *si;
>>>  	struct folio *folio = NULL;
>>>  	bool skip_swapcache = false;
>>> @@ -2298,7 +2299,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
>>>  	swap = index_entry;
>>>  	*foliop = NULL;
>>>
>>> -	if (is_poisoned_swp_entry(index_entry))
>>> +	if (softleaf_is_poison_marker(index_entry))
>>>  		return -EIO;
>>>
>>>  	si = get_swap_device(index_entry);
>>> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
>>> index cc4ce205bbec..055ec1050776 100644
>>> --- a/mm/userfaultfd.c
>>> +++ b/mm/userfaultfd.c
>>> @@ -10,7 +10,7 @@
>>>  #include <linux/pagemap.h>
>>>  #include <linux/rmap.h>
>>>  #include <linux/swap.h>
>>> -#include <linux/swapops.h>
>>> +#include <linux/leafops.h>
>>>  #include <linux/userfaultfd_k.h>
>>>  #include <linux/mmu_notifier.h>
>>>  #include <linux/hugetlb.h>
>>> @@ -208,7 +208,7 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd,
>>>  	 * MISSING|WP registered, we firstly wr-protect a none pte which has no
>>>  	 * page cache page backing it, then access the page.
>>>  	 */
>>> -	if (!pte_none(dst_ptep) && !is_uffd_pte_marker(dst_ptep))
>>> +	if (!pte_none(dst_ptep) && !pte_is_uffd_marker(dst_ptep))
>>>  		goto out_unlock;
>>>
>>>  	if (page_in_cache) {
>>> @@ -590,7 +590,7 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
>>>  		if (!uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) {
>>>  			const pte_t ptep = huge_ptep_get(dst_mm, dst_addr, dst_pte);
>>>
>>> -			if (!huge_pte_none(ptep) && !is_uffd_pte_marker(ptep)) {
>>> +			if (!huge_pte_none(ptep) && !pte_is_uffd_marker(ptep)) {
>>>  				err = -EEXIST;
>>>  				hugetlb_vma_unlock_read(dst_vma);
>>>  				mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>>
>> The rest of the code looks good to me. I will check it again once
>> you fix the commit log and comments. Thank you for working on this.
>
> As I said before I'm not respinning this entire series to change every single
> reference to present/none to include one or several paragraphs about how we
> hacked in protnone and other such things.

No, I do not want you to do that.

>
> If I have to respin the series, I'll add a reference in the commit log.
>
> I beleive the only pertinent comment is:
>
> + * If referencing another page table or a data page then the page table entry is
> + * pertinent to hardware - that is it tells the hardware how to decode the page
> + * table entry.

I would just remove “(that is stuff the hardware can navigate without fault)”.
People can look at the definition of present entries to get the categorization.
Basically, you just need to only talk about present entries without mentioning
whether it is HW accessible or not, since that is another can of worms.

>
> From the softleaf_t kdoc.
>
> I think this is fine as-is - protnone entries or _PAGE_PSE-only PMD entries
> _are_ pertinent to the hardware fault handler, literally every bit except for
> the present bit are set ready for the hardware to decode, telling it how to
> decode the leaf entry.

After reading it again, I agree the kdoc looks good.

>
> Rather than adding additional confusion by citing this stuff and probably
> whatever awful architecture-specific stuff lurks in the arch/ directory I think
> we are fine as-is.
>
> Again we decided as a community to hack this stuff in so we as a community have
> to live with it like a guy who puts a chimney on his car :)
>
> (mm has many such chimneys on a car that only Homer Simpson would be proud of)

Yeah, it is not pretty, but that is how people get their work done. ;)

Anyway, feel free to add Acked-by: Zi Yan <ziy@nvidia.com>

Best Regards,
Yan, Zi
Re: [PATCH v3 02/16] mm: introduce leaf entry type and use to simplify leaf entry logic
Posted by Lorenzo Stoakes 1 month, 1 week ago
On Tue, Nov 11, 2025 at 11:40:32AM -0500, Zi Yan wrote:
> On 11 Nov 2025, at 2:31, Lorenzo Stoakes wrote:
>
> > On Mon, Nov 10, 2025 at 10:56:33PM -0500, Zi Yan wrote:
> >> On 10 Nov 2025, at 17:21, Lorenzo Stoakes wrote:
> >>
> >>> The kernel maintains leaf page table entries which contain either:
> >>>
> >>> - Nothing ('none' entries)
> >>> - Present entries (that is stuff the hardware can navigate without fault)
> >>> - Everything else that will cause a fault which the kernel handles
> >>>
> >>> In the 'everything else' group we include swap entries, but we also include
> >>> a number of other things such as migration entries, device private entries
> >>> and marker entries.
> >>>
> >>> Unfortunately this 'everything else' group expresses everything through
> >>> a swp_entry_t type, and these entries are referred to swap entries even
> >>> though they may well not contain a... swap entry.
> >>>
> >>> This is compounded by the rather mind-boggling concept of a non-swap swap
> >>> entry (checked via non_swap_entry()) and the means by which we twist and
> >>> turn to satisfy this.
> >>>
> >>> This patch lays the foundation for reducing this confusion.
> >>>
> >>> We refer to 'everything else' as a 'software-define leaf entry' or
> >>> 'softleaf'. for short And in fact we scoop up the 'none' entries into this
> >>> concept also so we are left with:
> >>>
> >>> - Present entries.
> >>> - Softleaf entries (which may be empty).
> >>>
> >>> This allows for radical simplification across the board - one can simply
> >>> convert any leaf page table entry to a leaf entry via softleaf_from_pte().
> >>>
> >>> If the entry is present, we return an empty leaf entry, so it is assumed
> >>> the caller is aware that they must differentiate between the two categories
> >>> of page table entries, checking for the former via pte_present().
> >>>
> >>> As a result, we can eliminate a number of places where we would otherwise
> >>> need to use predicates to see if we can proceed with leaf page table entry
> >>> conversion and instead just go ahead and do it unconditionally.
> >>>
> >>> We do so where we can, adjusting surrounding logic as necessary to
> >>> integrate the new softleaf_t logic as far as seems reasonable at this
> >>> stage.
> >>>
> >>> We typedef swp_entry_t to softleaf_t for the time being until the
> >>> conversion can be complete, meaning everything remains compatible
> >>> regardless of which type is used. We will eventually remove swp_entry_t
> >>> when the conversion is complete.
> >>>
> >>> We introduce a new header file to keep things clear - leafops.h - this
> >>> imports swapops.h so can direct replace swapops imports without issue, and
> >>> we do so in all the files that require it.
> >>>
> >>> Additionally, add new leafops.h file to core mm maintainers entry.
> >>>
> >>> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >>> ---
> >>>  MAINTAINERS                   |   1 +
> >>>  fs/proc/task_mmu.c            |  26 +--
> >>>  fs/userfaultfd.c              |   6 +-
> >>>  include/linux/leafops.h       | 387 ++++++++++++++++++++++++++++++++++
> >>>  include/linux/mm_inline.h     |   6 +-
> >>>  include/linux/mm_types.h      |  25 +++
> >>>  include/linux/swapops.h       |  28 ---
> >>>  include/linux/userfaultfd_k.h |  51 +----
> >>>  mm/hmm.c                      |   2 +-
> >>>  mm/hugetlb.c                  |  37 ++--
> >>>  mm/madvise.c                  |  16 +-
> >>>  mm/memory.c                   |  41 ++--
> >>>  mm/mincore.c                  |   6 +-
> >>>  mm/mprotect.c                 |   6 +-
> >>>  mm/mremap.c                   |   4 +-
> >>>  mm/page_vma_mapped.c          |  11 +-
> >>>  mm/shmem.c                    |   7 +-
> >>>  mm/userfaultfd.c              |   6 +-
> >>>  18 files changed, 502 insertions(+), 164 deletions(-)
> >>>  create mode 100644 include/linux/leafops.h
> >>>
> >>> diff --git a/MAINTAINERS b/MAINTAINERS
> >>> index 2628431dcdfe..314910a70bbf 100644
> >>> --- a/MAINTAINERS
> >>> +++ b/MAINTAINERS
> >>> @@ -16257,6 +16257,7 @@ T:	git git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> >>>  F:	include/linux/gfp.h
> >>>  F:	include/linux/gfp_types.h
> >>>  F:	include/linux/highmem.h
> >>> +F:	include/linux/leafops.h
> >>>  F:	include/linux/memory.h
> >>>  F:	include/linux/mm.h
> >>>  F:	include/linux/mm_*.h
> >>> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> >>> index fc35a0543f01..24d26b49d870 100644
> >>> --- a/fs/proc/task_mmu.c
> >>> +++ b/fs/proc/task_mmu.c
> >>> @@ -14,7 +14,7 @@
> >>>  #include <linux/rmap.h>
> >>>  #include <linux/swap.h>
> >>>  #include <linux/sched/mm.h>
> >>> -#include <linux/swapops.h>
> >>> +#include <linux/leafops.h>
> >>>  #include <linux/mmu_notifier.h>
> >>>  #include <linux/page_idle.h>
> >>>  #include <linux/shmem_fs.h>
> >>> @@ -1230,11 +1230,11 @@ static int smaps_hugetlb_range(pte_t *pte, unsigned long hmask,
> >>>  	if (pte_present(ptent)) {
> >>>  		folio = page_folio(pte_page(ptent));
> >>>  		present = true;
> >>> -	} else if (is_swap_pte(ptent)) {
> >>> -		swp_entry_t swpent = pte_to_swp_entry(ptent);
> >>> +	} else {
> >>> +		const softleaf_t entry = softleaf_from_pte(ptent);
> >>>
> >>> -		if (is_pfn_swap_entry(swpent))
> >>> -			folio = pfn_swap_entry_folio(swpent);
> >>> +		if (softleaf_has_pfn(entry))
> >>> +			folio = softleaf_to_folio(entry);
> >>>  	}
> >>>
> >>>  	if (folio) {
> >>
> >> <snip>
> >>
> >>>
> >>> @@ -2330,18 +2330,18 @@ static unsigned long pagemap_page_category(struct pagemap_scan_private *p,
> >>>  		if (pte_soft_dirty(pte))
> >>>  			categories |= PAGE_IS_SOFT_DIRTY;
> >>>  	} else if (is_swap_pte(pte)) {
> >>
> >> This should be just “else” like smaps_hugetlb_range()’s change, right?
> >
> > This is code this patch doesn't touch? :) It's not my fault...
> >
> > Actually in a follow-up patch I do exactly this, taking advantage of the fact
> > that we handle none entries automatically in softleaf_from_pte().
> >
> > But it's onne step at a time here to make it easier to review/life easier on
> > bisect in case there's any mistakes.
>
> OK.

Yeah you're ahead of the game :)

>
> >
> >>
> >>> -		swp_entry_t swp;
> >>> +		softleaf_t entry;
> >>>
> >>>  		categories |= PAGE_IS_SWAPPED;
> >>>  		if (!pte_swp_uffd_wp_any(pte))
> >>>  			categories |= PAGE_IS_WRITTEN;
> >>>
> >>> -		swp = pte_to_swp_entry(pte);
> >>> -		if (is_guard_swp_entry(swp))
> >>> +		entry = softleaf_from_pte(pte);
> >>> +		if (softleaf_is_guard_marker(entry))
> >>>  			categories |= PAGE_IS_GUARD;
> >>>  		else if ((p->masks_of_interest & PAGE_IS_FILE) &&
> >>> -			 is_pfn_swap_entry(swp) &&
> >>> -			 !folio_test_anon(pfn_swap_entry_folio(swp)))
> >>> +			 softleaf_has_pfn(entry) &&
> >>> +			 !folio_test_anon(softleaf_to_folio(entry)))
> >>>  			categories |= PAGE_IS_FILE;
> >>>
> >>>  		if (pte_swp_soft_dirty(pte))
> >>
> >> <snip>
> >>
> >>> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> >>> index 137ce27ff68c..be20468fb5a9 100644
> >>> --- a/mm/page_vma_mapped.c
> >>> +++ b/mm/page_vma_mapped.c
> >>> @@ -3,7 +3,7 @@
> >>>  #include <linux/rmap.h>
> >>>  #include <linux/hugetlb.h>
> >>>  #include <linux/swap.h>
> >>> -#include <linux/swapops.h>
> >>> +#include <linux/leafops.h>
> >>>
> >>>  #include "internal.h"
> >>>
> >>> @@ -107,15 +107,12 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw, unsigned long pte_nr)
> >>>  	pte_t ptent = ptep_get(pvmw->pte);
> >>>
> >>>  	if (pvmw->flags & PVMW_MIGRATION) {
> >>> -		swp_entry_t entry;
> >>> -		if (!is_swap_pte(ptent))
> >>> -			return false;
> >>> -		entry = pte_to_swp_entry(ptent);
> >>> +		const softleaf_t entry = softleaf_from_pte(ptent);
> >>
> >> We do not need is_swap_pte() check here because softleaf_from_pte()
> >> does the check. Just trying to reason the code with myself here.
> >
> > Right, see the next patch :) I'm laying the groundwork for us to be able to do
> > that.
> >
> >>
> >>>
> >>> -		if (!is_migration_entry(entry))
> >>> +		if (!softleaf_is_migration(entry))
> >>>  			return false;
> >>>
> >>> -		pfn = swp_offset_pfn(entry);
> >>> +		pfn = softleaf_to_pfn(entry);
> >>>  	} else if (is_swap_pte(ptent)) {
> >>>  		swp_entry_t entry;
> >>>
> >>> diff --git a/mm/shmem.c b/mm/shmem.c
> >>> index 6580f3cd24bb..395ca58ac4a5 100644
> >>> --- a/mm/shmem.c
> >>> +++ b/mm/shmem.c
> >>> @@ -66,7 +66,7 @@ static struct vfsmount *shm_mnt __ro_after_init;
> >>>  #include <linux/falloc.h>
> >>>  #include <linux/splice.h>
> >>>  #include <linux/security.h>
> >>> -#include <linux/swapops.h>
> >>> +#include <linux/leafops.h>
> >>>  #include <linux/mempolicy.h>
> >>>  #include <linux/namei.h>
> >>>  #include <linux/ctype.h>
> >>> @@ -2286,7 +2286,8 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> >>>  	struct address_space *mapping = inode->i_mapping;
> >>>  	struct mm_struct *fault_mm = vma ? vma->vm_mm : NULL;
> >>>  	struct shmem_inode_info *info = SHMEM_I(inode);
> >>> -	swp_entry_t swap, index_entry;
> >>> +	swp_entry_t swap;
> >>> +	softleaf_t index_entry;
> >>>  	struct swap_info_struct *si;
> >>>  	struct folio *folio = NULL;
> >>>  	bool skip_swapcache = false;
> >>> @@ -2298,7 +2299,7 @@ static int shmem_swapin_folio(struct inode *inode, pgoff_t index,
> >>>  	swap = index_entry;
> >>>  	*foliop = NULL;
> >>>
> >>> -	if (is_poisoned_swp_entry(index_entry))
> >>> +	if (softleaf_is_poison_marker(index_entry))
> >>>  		return -EIO;
> >>>
> >>>  	si = get_swap_device(index_entry);
> >>> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> >>> index cc4ce205bbec..055ec1050776 100644
> >>> --- a/mm/userfaultfd.c
> >>> +++ b/mm/userfaultfd.c
> >>> @@ -10,7 +10,7 @@
> >>>  #include <linux/pagemap.h>
> >>>  #include <linux/rmap.h>
> >>>  #include <linux/swap.h>
> >>> -#include <linux/swapops.h>
> >>> +#include <linux/leafops.h>
> >>>  #include <linux/userfaultfd_k.h>
> >>>  #include <linux/mmu_notifier.h>
> >>>  #include <linux/hugetlb.h>
> >>> @@ -208,7 +208,7 @@ int mfill_atomic_install_pte(pmd_t *dst_pmd,
> >>>  	 * MISSING|WP registered, we firstly wr-protect a none pte which has no
> >>>  	 * page cache page backing it, then access the page.
> >>>  	 */
> >>> -	if (!pte_none(dst_ptep) && !is_uffd_pte_marker(dst_ptep))
> >>> +	if (!pte_none(dst_ptep) && !pte_is_uffd_marker(dst_ptep))
> >>>  		goto out_unlock;
> >>>
> >>>  	if (page_in_cache) {
> >>> @@ -590,7 +590,7 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
> >>>  		if (!uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) {
> >>>  			const pte_t ptep = huge_ptep_get(dst_mm, dst_addr, dst_pte);
> >>>
> >>> -			if (!huge_pte_none(ptep) && !is_uffd_pte_marker(ptep)) {
> >>> +			if (!huge_pte_none(ptep) && !pte_is_uffd_marker(ptep)) {
> >>>  				err = -EEXIST;
> >>>  				hugetlb_vma_unlock_read(dst_vma);
> >>>  				mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> >>
> >> The rest of the code looks good to me. I will check it again once
> >> you fix the commit log and comments. Thank you for working on this.
> >
> > As I said before I'm not respinning this entire series to change every single
> > reference to present/none to include one or several paragraphs about how we
> > hacked in protnone and other such things.
>
> No, I do not want you to do that.

Good! :)

>
> >
> > If I have to respin the series, I'll add a reference in the commit log.
> >
> > I beleive the only pertinent comment is:
> >
> > + * If referencing another page table or a data page then the page table entry is
> > + * pertinent to hardware - that is it tells the hardware how to decode the page
> > + * table entry.
>
> I would just remove “(that is stuff the hardware can navigate without fault)”.
> People can look at the definition of present entries to get the categorization.
> Basically, you just need to only talk about present entries without mentioning
> whether it is HW accessible or not, since that is another can of worms.

OK I'll ask Andrew to update.

>
> >
> > From the softleaf_t kdoc.
> >
> > I think this is fine as-is - protnone entries or _PAGE_PSE-only PMD entries
> > _are_ pertinent to the hardware fault handler, literally every bit except for
> > the present bit are set ready for the hardware to decode, telling it how to
> > decode the leaf entry.
>
> After reading it again, I agree the kdoc looks good.

Thanks!

>
> >
> > Rather than adding additional confusion by citing this stuff and probably
> > whatever awful architecture-specific stuff lurks in the arch/ directory I think
> > we are fine as-is.
> >
> > Again we decided as a community to hack this stuff in so we as a community have
> > to live with it like a guy who puts a chimney on his car :)
> >
> > (mm has many such chimneys on a car that only Homer Simpson would be proud of)
>
> Yeah, it is not pretty, but that is how people get their work done. ;)

Well, it's also how people mess things up :)

I wish we'd done it differently, ideally by separating hardware and sofware page
table state.

pte_sw_present() vs. pte_hw_present() would have cleared this up immediately.

Maybe one I should do at some point...

>
> Anyway, feel free to add Acked-by: Zi Yan <ziy@nvidia.com>

Thanks!

>
> Best Regards,
> Yan, Zi

Your review is much appreciated :)

Cheers, Lorenzo
Re: [PATCH v3 02/16] mm: introduce leaf entry type and use to simplify leaf entry logic
Posted by Zi Yan 1 month, 1 week ago
On 10 Nov 2025, at 17:21, Lorenzo Stoakes wrote:

> The kernel maintains leaf page table entries which contain either:
>
> - Nothing ('none' entries)
> - Present entries (that is stuff the hardware can navigate without fault)

This is not true for:

1. pXX_protnone(), where _PAGE_PROTNONE flag also means pXX_present() is
true, but hardware would still trigger a fault.
2. pmd_present() where _PAGE_PSE also means a present PMD (see the comment
in pmd_present()).

This commit log needs to be updated.

> - Everything else that will cause a fault which the kernel handles

This is not true because of the reasons above.

How should we categorize these non-present to HW but present to SW entries,
like protnone and under splitting PMDs? Strictly speaking, they are
softleaf entries, but that would require more changes to the kernel code
and pXX_present() means HW present.

To not make this series more complicated, I think updating commit log
and comments to use pXX_present() instead of HW present might be
the easiest way out. We can revisit pXX_present() vs HW present later.

OK, I will focus on code review now.

>
> In the 'everything else' group we include swap entries, but we also include
> a number of other things such as migration entries, device private entries
> and marker entries.
>
> Unfortunately this 'everything else' group expresses everything through
> a swp_entry_t type, and these entries are referred to swap entries even
> though they may well not contain a... swap entry.
>
> This is compounded by the rather mind-boggling concept of a non-swap swap
> entry (checked via non_swap_entry()) and the means by which we twist and
> turn to satisfy this.
>
> This patch lays the foundation for reducing this confusion.
>
> We refer to 'everything else' as a 'software-define leaf entry' or
> 'softleaf'. for short And in fact we scoop up the 'none' entries into this
> concept also so we are left with:
>
> - Present entries.
> - Softleaf entries (which may be empty).
>
> This allows for radical simplification across the board - one can simply
> convert any leaf page table entry to a leaf entry via softleaf_from_pte().
>
> If the entry is present, we return an empty leaf entry, so it is assumed
> the caller is aware that they must differentiate between the two categories
> of page table entries, checking for the former via pte_present().
>
> As a result, we can eliminate a number of places where we would otherwise
> need to use predicates to see if we can proceed with leaf page table entry
> conversion and instead just go ahead and do it unconditionally.
>
> We do so where we can, adjusting surrounding logic as necessary to
> integrate the new softleaf_t logic as far as seems reasonable at this
> stage.
>
> We typedef swp_entry_t to softleaf_t for the time being until the
> conversion can be complete, meaning everything remains compatible
> regardless of which type is used. We will eventually remove swp_entry_t
> when the conversion is complete.
>
> We introduce a new header file to keep things clear - leafops.h - this
> imports swapops.h so can direct replace swapops imports without issue, and
> we do so in all the files that require it.
>
> Additionally, add new leafops.h file to core mm maintainers entry.
>
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>  MAINTAINERS                   |   1 +
>  fs/proc/task_mmu.c            |  26 +--
>  fs/userfaultfd.c              |   6 +-
>  include/linux/leafops.h       | 387 ++++++++++++++++++++++++++++++++++
>  include/linux/mm_inline.h     |   6 +-
>  include/linux/mm_types.h      |  25 +++
>  include/linux/swapops.h       |  28 ---
>  include/linux/userfaultfd_k.h |  51 +----
>  mm/hmm.c                      |   2 +-
>  mm/hugetlb.c                  |  37 ++--
>  mm/madvise.c                  |  16 +-
>  mm/memory.c                   |  41 ++--
>  mm/mincore.c                  |   6 +-
>  mm/mprotect.c                 |   6 +-
>  mm/mremap.c                   |   4 +-
>  mm/page_vma_mapped.c          |  11 +-
>  mm/shmem.c                    |   7 +-
>  mm/userfaultfd.c              |   6 +-
>  18 files changed, 502 insertions(+), 164 deletions(-)
>  create mode 100644 include/linux/leafops.h
>


Best Regards,
Yan, Zi
Re: [PATCH v3 02/16] mm: introduce leaf entry type and use to simplify leaf entry logic
Posted by David Hildenbrand (Red Hat) 1 month, 1 week ago
On 11.11.25 04:25, Zi Yan wrote:
> On 10 Nov 2025, at 17:21, Lorenzo Stoakes wrote:
> 
>> The kernel maintains leaf page table entries which contain either:
>>
>> - Nothing ('none' entries)
>> - Present entries (that is stuff the hardware can navigate without fault)
> 
> This is not true for:
> 
> 1. pXX_protnone(), where _PAGE_PROTNONE flag also means pXX_present() is
> true, but hardware would still trigger a fault.
> 2. pmd_present() where _PAGE_PSE also means a present PMD (see the comment
> in pmd_present()).

I'll note that pte_present/pmd_present etc is always about "soft-present".

For example, if the hardware does not have a hw-managed access bit, 
doing a pte_mkyoung() would also clear the hw-valid/hw-present bit 
because we have to catch any next access done by hardware.

[fun fact: some hardware has an invalid bit instead of a valid/present 
bit :) IIRC s390x falls into that category]

Similar things happen on ordinary PROT_NONE of course (independent of 
pte_protnone).

A better description might be "there is a page/pfn mapped here, but it 
might not be accessible by the CPU right now".

We have device-exclusive/device-private nonswap (before this series) 
entries that fall into the same category, unfortunately ("there is 
something mapped there that is not accessible by the CPU")

-- 
Cheers

David
Re: [PATCH v3 02/16] mm: introduce leaf entry type and use to simplify leaf entry logic
Posted by Zi Yan 1 month, 1 week ago
On 11 Nov 2025, at 8:06, David Hildenbrand (Red Hat) wrote:

> On 11.11.25 04:25, Zi Yan wrote:
>> On 10 Nov 2025, at 17:21, Lorenzo Stoakes wrote:
>>
>>> The kernel maintains leaf page table entries which contain either:
>>>
>>> - Nothing ('none' entries)
>>> - Present entries (that is stuff the hardware can navigate without fault)
>>
>> This is not true for:
>>
>> 1. pXX_protnone(), where _PAGE_PROTNONE flag also means pXX_present() is
>> true, but hardware would still trigger a fault.
>> 2. pmd_present() where _PAGE_PSE also means a present PMD (see the comment
>> in pmd_present()).
>
> I'll note that pte_present/pmd_present etc is always about "soft-present".
>
> For example, if the hardware does not have a hw-managed access bit, doing a pte_mkyoung() would also clear the hw-valid/hw-present bit because we have to catch any next access done by hardware.
>
> [fun fact: some hardware has an invalid bit instead of a valid/present bit :) IIRC s390x falls into that category]
>
> Similar things happen on ordinary PROT_NONE of course (independent of pte_protnone).
>
> A better description might be "there is a page/pfn mapped here, but it might not be accessible by the CPU right now".
>
> We have device-exclusive/device-private nonswap (before this series) entries that fall into the same category, unfortunately ("there is something mapped there that is not accessible by the CPU")

I agree. I am fine with the categorization using pte_none(), pte_present(),
and softleaf. It is “hardware can navigate without fault” that causes
confusion. Removing this comment would work for me, since people can look
at the definition of pXX_present() for further clarification.

Best Regards,
Yan, Zi
Re: [PATCH v3 02/16] mm: introduce leaf entry type and use to simplify leaf entry logic
Posted by Lorenzo Stoakes 1 month, 1 week ago
On Tue, Nov 11, 2025 at 11:26:03AM -0500, Zi Yan wrote:
> On 11 Nov 2025, at 8:06, David Hildenbrand (Red Hat) wrote:
>
> > On 11.11.25 04:25, Zi Yan wrote:
> >> On 10 Nov 2025, at 17:21, Lorenzo Stoakes wrote:
> >>
> >>> The kernel maintains leaf page table entries which contain either:
> >>>
> >>> - Nothing ('none' entries)
> >>> - Present entries (that is stuff the hardware can navigate without fault)
> >>
> >> This is not true for:
> >>
> >> 1. pXX_protnone(), where _PAGE_PROTNONE flag also means pXX_present() is
> >> true, but hardware would still trigger a fault.
> >> 2. pmd_present() where _PAGE_PSE also means a present PMD (see the comment
> >> in pmd_present()).
> >
> > I'll note that pte_present/pmd_present etc is always about "soft-present".
> >
> > For example, if the hardware does not have a hw-managed access bit, doing a pte_mkyoung() would also clear the hw-valid/hw-present bit because we have to catch any next access done by hardware.
> >
> > [fun fact: some hardware has an invalid bit instead of a valid/present bit :) IIRC s390x falls into that category]
> >
> > Similar things happen on ordinary PROT_NONE of course (independent of pte_protnone).
> >
> > A better description might be "there is a page/pfn mapped here, but it might not be accessible by the CPU right now".
> >
> > We have device-exclusive/device-private nonswap (before this series) entries that fall into the same category, unfortunately ("there is something mapped there that is not accessible by the CPU")
>
> I agree. I am fine with the categorization using pte_none(), pte_present(),
> and softleaf. It is “hardware can navigate without fault” that causes
> confusion. Removing this comment would work for me, since people can look
> at the definition of pXX_present() for further clarification.

Well I proposed a suggestion :) I think referencing PFN is problematic ecause of
those softleaf entries that may also have a PFN.

I think better definition is that the entry is valid modulo the present bit.

>
> Best Regards,
> Yan, Zi

Cheers, Lorenzo
Re: [PATCH v3 02/16] mm: introduce leaf entry type and use to simplify leaf entry logic
Posted by Lorenzo Stoakes 1 month, 1 week ago
On Mon, Nov 10, 2025 at 10:25:40PM -0500, Zi Yan wrote:
> On 10 Nov 2025, at 17:21, Lorenzo Stoakes wrote:
>
> > The kernel maintains leaf page table entries which contain either:
> >
> > - Nothing ('none' entries)
> > - Present entries (that is stuff the hardware can navigate without fault)
>
> This is not true for:
>
> 1. pXX_protnone(), where _PAGE_PROTNONE flag also means pXX_present() is
> true, but hardware would still trigger a fault.

Sigh. I'm very well aware of this, I've commented on this issue at length
in discussions on-list and off.

But for good or pad we decided to hack in protnone this way. As far as the
kernel is concerned they _are_ present.

Yes, technically, they're not, and will result in a fault, and will result in
the whole NUMA balancing hint mechanism firing off.

But I feel like it only adds noise and confusion to get into all that here,
frankly.

> 2. pmd_present() where _PAGE_PSE also means a present PMD (see the comment
> in pmd_present()).

Right, and here we go again with another 'wise decision'. That's just intensely
gross, and one I wasn't aware of.

But again, I'm not really interested in asterixing all of these.

'As far as the kernel is concerned' these are present. We have to lie in the bed
we made AFAIC.

>
> This commit log needs to be updated.

No it doesn't. As per the above, we have literally decided to treat these as if
they were present in cases where, in fact, they're not.

Note that to be thorough here I'd have to go through every single architecture
and check every single caveat that exists in pXX_present() and pXX_none().

Because I guarantee you there will be some oddities there.

Is that a good use of my or anybody else's time?

I think we have to draw the pedantry line somewhere.

>
> > - Everything else that will cause a fault which the kernel handles
>
> This is not true because of the reasons above.

I covered this off in the above. I'm not really that interested in adding
additional noise here, sorry.

As a compromise - if I have to respin - I can add a very brief comment like

	* Note that there are exceptions such as protnone which for
	everything but the kernel fault handler ought to be treated as
	present but are in fact not. For avoidance of doubt, soft leaf
	entries treat pXX_none() and pXX_present() as the authoritative
	determinants of whether a page table entry is empty/present,
	regardless of hacked-in implementation details.

Note how _already_ saying stuff like this adds confusion and 'wtf'. THis is
what I'm trying to avoid.

But if I have to respin, can add that.


>
> How should we categorize these non-present to HW but present to SW entries,
> like protnone and under splitting PMDs? Strictly speaking, they are
> softleaf entries, but that would require more changes to the kernel code
> and pXX_present() means HW present.

No they're not strictly speaking softleaf entries at all. These page table
entries use every single bit except present/PSE. The softleaf abstraction
does not retain all of these bits, and then it becomes impossible to
determine which is 'present' in a software sense or not.

We categorise pXX_present() leaf page table entries as... being present,
even if past kernel developers decided to hack in cases which are present
as far as the HW faulting mechanism is concerned, piling yet more confusion
on everything.

We made our bed on this and have to lie in it. There are numerous places
where in page table code to all intents and purposes it looks like we're
literally testing for hw-present entries whereas in fact we are not.

So I don't think it is beneficial to do anything more on this other than
perhaps updating _this_ commit message on respin.

>
> To not make this series more complicated, I think updating commit log
> and comments to use pXX_present() instead of HW present might be
> the easiest way out. We can revisit pXX_present() vs HW present later.

No, there's nothing to revisit AFAIC.

I'm not going to go through and update every single mention of faulting to
account for that.

I think it's an unreasonable level of pedantry.

>
> OK, I will focus on code review now.

Thanks.
Re: [PATCH v3 02/16] mm: introduce leaf entry type and use to simplify leaf entry logic
Posted by Zi Yan 1 month, 1 week ago
On 11 Nov 2025, at 2:16, Lorenzo Stoakes wrote:

> On Mon, Nov 10, 2025 at 10:25:40PM -0500, Zi Yan wrote:
>> On 10 Nov 2025, at 17:21, Lorenzo Stoakes wrote:
>>
>>> The kernel maintains leaf page table entries which contain either:
>>>
>>> - Nothing ('none' entries)
>>> - Present entries (that is stuff the hardware can navigate without fault)
>>
>> This is not true for:
>>
>> 1. pXX_protnone(), where _PAGE_PROTNONE flag also means pXX_present() is
>> true, but hardware would still trigger a fault.
>
> Sigh. I'm very well aware of this, I've commented on this issue at length
> in discussions on-list and off.
>
> But for good or pad we decided to hack in protnone this way. As far as the
> kernel is concerned they _are_ present.
>
> Yes, technically, they're not, and will result in a fault, and will result in
> the whole NUMA balancing hint mechanism firing off.
>
> But I feel like it only adds noise and confusion to get into all that here,
> frankly.
>
>> 2. pmd_present() where _PAGE_PSE also means a present PMD (see the comment
>> in pmd_present()).
>
> Right, and here we go again with another 'wise decision'. That's just intensely
> gross, and one I wasn't aware of.
>
> But again, I'm not really interested in asterixing all of these.
>
> 'As far as the kernel is concerned' these are present. We have to lie in the bed
> we made AFAIC.
>
>>
>> This commit log needs to be updated.
>
> No it doesn't. As per the above, we have literally decided to treat these as if
> they were present in cases where, in fact, they're not.
>
> Note that to be thorough here I'd have to go through every single architecture
> and check every single caveat that exists in pXX_present() and pXX_none().
>
> Because I guarantee you there will be some oddities there.
>
> Is that a good use of my or anybody else's time?
>
> I think we have to draw the pedantry line somewhere.
>
>>
>>> - Everything else that will cause a fault which the kernel handles
>>
>> This is not true because of the reasons above.
>
> I covered this off in the above. I'm not really that interested in adding
> additional noise here, sorry.
>
> As a compromise - if I have to respin - I can add a very brief comment like
>
> 	* Note that there are exceptions such as protnone which for
> 	everything but the kernel fault handler ought to be treated as
> 	present but are in fact not. For avoidance of doubt, soft leaf
> 	entries treat pXX_none() and pXX_present() as the authoritative
> 	determinants of whether a page table entry is empty/present,
> 	regardless of hacked-in implementation details.
>
> Note how _already_ saying stuff like this adds confusion and 'wtf'. THis is
> what I'm trying to avoid.
>
> But if I have to respin, can add that.
>
>
>>
>> How should we categorize these non-present to HW but present to SW entries,
>> like protnone and under splitting PMDs? Strictly speaking, they are
>> softleaf entries, but that would require more changes to the kernel code
>> and pXX_present() means HW present.
>
> No they're not strictly speaking softleaf entries at all. These page table
> entries use every single bit except present/PSE. The softleaf abstraction
> does not retain all of these bits, and then it becomes impossible to
> determine which is 'present' in a software sense or not.
>
> We categorise pXX_present() leaf page table entries as... being present,
> even if past kernel developers decided to hack in cases which are present
> as far as the HW faulting mechanism is concerned, piling yet more confusion
> on everything.
>
> We made our bed on this and have to lie in it. There are numerous places
> where in page table code to all intents and purposes it looks like we're
> literally testing for hw-present entries whereas in fact we are not.
>
> So I don't think it is beneficial to do anything more on this other than
> perhaps updating _this_ commit message on respin.
>
>>
>> To not make this series more complicated, I think updating commit log
>> and comments to use pXX_present() instead of HW present might be
>> the easiest way out. We can revisit pXX_present() vs HW present later.
>
> No, there's nothing to revisit AFAIC.
>
> I'm not going to go through and update every single mention of faulting to
> account for that.
>
> I think it's an unreasonable level of pedantry.

Got it. As long as you are aware of this, I am fine with what you have now.


Best Regards,
Yan, Zi