[RFC 01/12] mm: add PUD THP ptdesc and rmap support

Posted by Usama Arif 5 days, 6 hours ago
For page table management, PUD THPs need to pre-deposit page tables
that will be used when the huge page is later split. When a PUD THP
is allocated, we cannot know in advance when or why it might need to
be split (COW, partial unmap, reclaim), but we need page tables ready
for that eventuality. Similar to how PMD THPs deposit a single PTE
table, PUD THPs deposit a PMD table which itself contains deposited
PTE tables - a two-level deposit. This commit adds the deposit/withdraw
infrastructure and a new pud_huge_pmd field in ptdesc to store the
deposited PMD.
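
For illustration, a fault-time caller is expected to use the new helpers
roughly as follows (sketch only; allocation failure handling is omitted and
mm/haddr/pudp are the caller's usual fault context):

  pmd_t *pmd_table = pmd_alloc_one(mm, haddr);
  spinlock_t *ptl;
  int i;

  /* Pre-load the PMD table with one PTE table per PMD entry (512). */
  for (i = 0; i < PTRS_PER_PMD; i++)
          pud_deposit_pte(pmd_table, pte_alloc_one(mm));

  /* Then deposit the PMD table itself, under the PUD lock. */
  ptl = pud_lock(mm, pudp);
  pgtable_trans_huge_pud_deposit(mm, pudp, pmd_table);
  spin_unlock(ptl);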

The deposited PMD tables are stored as a singly-linked stack using only
page->lru.next as the link pointer. A doubly-linked list using the
standard list_head mechanism would cause memory corruption: list_del()
poisons both lru.next (offset 8) and lru.prev (offset 16), but lru.prev
overlaps with ptdesc->pmd_huge_pte at offset 16. Since deposited PMD
tables have their own deposited PTE tables stored in pmd_huge_pte,
poisoning lru.prev would corrupt the PTE table list and cause crashes
when withdrawing PTE tables during split. PMD THPs don't have this
problem because their deposited PTE tables don't have sub-deposits.
Using only lru.next avoids the overlap entirely.
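
For reference, the overlapping fields line up like this on 64-bit:

  struct page (deposited PMD table)     struct ptdesc view of the same page
  offset  0: flags                      __page_flags
  offset  8: lru.next (stack link)      _pt_pad_1
  offset 16: lru.prev (left untouched)  pmd_huge_pte (this PMD's deposited PTEs)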

For reverse mapping, PUD THPs need the same rmap support that PMD THPs
have. The page_vma_mapped_walk() function is extended to recognize and
handle PUD-mapped folios during rmap traversal. A new TTU_SPLIT_HUGE_PUD
flag tells the unmap path to split PUD THPs before proceeding, since
there is no PUD-level migration entry format - the split converts the
single PUD mapping into individual PTE mappings that can be migrated
or swapped normally.
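
A caller that can tolerate losing the PUD mapping passes the new flag the
same way TTU_SPLIT_HUGE_PMD is passed today, e.g. (illustrative only):

  try_to_unmap(folio, TTU_SPLIT_HUGE_PUD);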

Signed-off-by: Usama Arif <usamaarif642@gmail.com>
---
 include/linux/huge_mm.h  |  5 +++
 include/linux/mm.h       | 19 ++++++++
 include/linux/mm_types.h |  5 ++-
 include/linux/pgtable.h  |  8 ++++
 include/linux/rmap.h     |  7 ++-
 mm/huge_memory.c         |  8 ++++
 mm/internal.h            |  3 ++
 mm/page_vma_mapped.c     | 35 +++++++++++++++
 mm/pgtable-generic.c     | 83 ++++++++++++++++++++++++++++++++++
 mm/rmap.c                | 96 +++++++++++++++++++++++++++++++++++++---
 10 files changed, 260 insertions(+), 9 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index a4d9f964dfdea..e672e45bb9cc7 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -463,10 +463,15 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
 		unsigned long address);
 
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+void split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
+			   unsigned long address);
 int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		    pud_t *pudp, unsigned long addr, pgprot_t newprot,
 		    unsigned long cp_flags);
 #else
+static inline void
+split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
+		      unsigned long address) {}
 static inline int
 change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		pud_t *pudp, unsigned long addr, pgprot_t newprot,
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ab2e7e30aef96..a15e18df0f771 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3455,6 +3455,22 @@ static inline bool pagetable_pmd_ctor(struct mm_struct *mm,
  * considered ready to switch to split PUD locks yet; there may be places
  * which need to be converted from page_table_lock.
  */
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+static inline struct page *pud_pgtable_page(pud_t *pud)
+{
+	unsigned long mask = ~(PTRS_PER_PUD * sizeof(pud_t) - 1);
+
+	return virt_to_page((void *)((unsigned long)pud & mask));
+}
+
+static inline struct ptdesc *pud_ptdesc(pud_t *pud)
+{
+	return page_ptdesc(pud_pgtable_page(pud));
+}
+
+#define pud_huge_pmd(pud) (pud_ptdesc(pud)->pud_huge_pmd)
+#endif
+
 static inline spinlock_t *pud_lockptr(struct mm_struct *mm, pud_t *pud)
 {
 	return &mm->page_table_lock;
@@ -3471,6 +3487,9 @@ static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
 static inline void pagetable_pud_ctor(struct ptdesc *ptdesc)
 {
 	__pagetable_ctor(ptdesc);
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+	ptdesc->pud_huge_pmd = NULL;
+#endif
 }
 
 static inline void pagetable_p4d_ctor(struct ptdesc *ptdesc)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 78950eb8926dc..26a38490ae2e1 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -577,7 +577,10 @@ struct ptdesc {
 		struct list_head pt_list;
 		struct {
 			unsigned long _pt_pad_1;
-			pgtable_t pmd_huge_pte;
+			union {
+				pgtable_t pmd_huge_pte;  /* For PMD tables: deposited PTE */
+				pgtable_t pud_huge_pmd;  /* For PUD tables: deposited PMD list */
+			};
 		};
 	};
 	unsigned long __page_mapping;
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 2f0dd3a4ace1a..3ce733c1d71a2 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1168,6 +1168,14 @@ extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
 #define arch_needs_pgtable_deposit() (false)
 #endif
 
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+extern void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
+					   pmd_t *pmd_table);
+extern pmd_t *pgtable_trans_huge_pud_withdraw(struct mm_struct *mm, pud_t *pudp);
+extern void pud_deposit_pte(pmd_t *pmd_table, pgtable_t pgtable);
+extern pgtable_t pud_withdraw_pte(pmd_t *pmd_table);
+#endif
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 /*
  * This is an implementation of pmdp_establish() that is only suitable for an
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index daa92a58585d9..08cd0a0eb8763 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -101,6 +101,7 @@ enum ttu_flags {
 					 * do a final flush if necessary */
 	TTU_RMAP_LOCKED		= 0x80,	/* do not grab rmap lock:
 					 * caller holds it */
+	TTU_SPLIT_HUGE_PUD	= 0x100, /* split huge PUD if any */
 };
 
 #ifdef CONFIG_MMU
@@ -473,6 +474,8 @@ void folio_add_anon_rmap_ptes(struct folio *, struct page *, int nr_pages,
 	folio_add_anon_rmap_ptes(folio, page, 1, vma, address, flags)
 void folio_add_anon_rmap_pmd(struct folio *, struct page *,
 		struct vm_area_struct *, unsigned long address, rmap_t flags);
+void folio_add_anon_rmap_pud(struct folio *, struct page *,
+		struct vm_area_struct *, unsigned long address, rmap_t flags);
 void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
 		unsigned long address, rmap_t flags);
 void folio_add_file_rmap_ptes(struct folio *, struct page *, int nr_pages,
@@ -933,6 +936,7 @@ struct page_vma_mapped_walk {
 	pgoff_t pgoff;
 	struct vm_area_struct *vma;
 	unsigned long address;
+	pud_t *pud;
 	pmd_t *pmd;
 	pte_t *pte;
 	spinlock_t *ptl;
@@ -970,7 +974,7 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
 static inline void
 page_vma_mapped_walk_restart(struct page_vma_mapped_walk *pvmw)
 {
-	WARN_ON_ONCE(!pvmw->pmd && !pvmw->pte);
+	WARN_ON_ONCE(!pvmw->pud && !pvmw->pmd && !pvmw->pte);
 
 	if (likely(pvmw->ptl))
 		spin_unlock(pvmw->ptl);
@@ -978,6 +982,7 @@ page_vma_mapped_walk_restart(struct page_vma_mapped_walk *pvmw)
 		WARN_ON_ONCE(1);
 
 	pvmw->ptl = NULL;
+	pvmw->pud = NULL;
 	pvmw->pmd = NULL;
 	pvmw->pte = NULL;
 }
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 40cf59301c21a..3128b3beedb0a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2933,6 +2933,14 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
 	spin_unlock(ptl);
 	mmu_notifier_invalidate_range_end(&range);
 }
+
+void split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
+			   unsigned long address)
+{
+	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PUD_SIZE));
+	if (pud_trans_huge(*pud))
+		__split_huge_pud_locked(vma, pud, address);
+}
 #else
 void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
 		unsigned long address)
diff --git a/mm/internal.h b/mm/internal.h
index 9ee336aa03656..21d5c00f638dc 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -545,6 +545,9 @@ int user_proactive_reclaim(char *buf,
  * in mm/rmap.c:
  */
 pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+pud_t *mm_find_pud(struct mm_struct *mm, unsigned long address);
+#endif
 
 /*
  * in mm/page_alloc.c
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index b38a1d00c971b..d31eafba38041 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -146,6 +146,18 @@ static bool check_pmd(unsigned long pfn, struct page_vma_mapped_walk *pvmw)
 	return true;
 }
 
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+/* Returns true if the two ranges overlap.  Careful to not overflow. */
+static bool check_pud(unsigned long pfn, struct page_vma_mapped_walk *pvmw)
+{
+	if ((pfn + HPAGE_PUD_NR - 1) < pvmw->pfn)
+		return false;
+	if (pfn > pvmw->pfn + pvmw->nr_pages - 1)
+		return false;
+	return true;
+}
+#endif
+
 static void step_forward(struct page_vma_mapped_walk *pvmw, unsigned long size)
 {
 	pvmw->address = (pvmw->address + size) & ~(size - 1);
@@ -188,6 +200,10 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 	pud_t *pud;
 	pmd_t pmde;
 
+	/* The only possible pud mapping has been handled on last iteration */
+	if (pvmw->pud && !pvmw->pmd)
+		return not_found(pvmw);
+
 	/* The only possible pmd mapping has been handled on last iteration */
 	if (pvmw->pmd && !pvmw->pte)
 		return not_found(pvmw);
@@ -234,6 +250,25 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 			continue;
 		}
 
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+		/* Check for PUD-mapped THP */
+		if (pud_trans_huge(*pud)) {
+			pvmw->pud = pud;
+			pvmw->ptl = pud_lock(mm, pud);
+			if (likely(pud_trans_huge(*pud))) {
+				if (pvmw->flags & PVMW_MIGRATION)
+					return not_found(pvmw);
+				if (!check_pud(pud_pfn(*pud), pvmw))
+					return not_found(pvmw);
+				return true;
+			}
+			/* PUD was split under us, retry at PMD level */
+			spin_unlock(pvmw->ptl);
+			pvmw->ptl = NULL;
+			pvmw->pud = NULL;
+		}
+#endif
+
 		pvmw->pmd = pmd_offset(pud, pvmw->address);
 		/*
 		 * Make sure the pmd value isn't cached in a register by the
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index d3aec7a9926ad..2047558ddcd79 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -195,6 +195,89 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
 }
 #endif
 
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+/*
+ * Deposit page tables for PUD THP.
+ * Called with PUD lock held. Stores PMD tables in a singly-linked stack
+ * via pud_huge_pmd, using only pmd_page->lru.next as the link pointer.
+ *
+ * IMPORTANT: We use only lru.next (offset 8) for linking, NOT the full
+ * list_head. This is because lru.prev (offset 16) overlaps with
+ * ptdesc->pmd_huge_pte, which stores the PMD table's deposited PTE tables.
+ * Using list_del() would corrupt pmd_huge_pte with LIST_POISON2.
+ *
+ * PTE tables should be deposited into the PMD using pud_deposit_pte().
+ */
+void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
+				    pmd_t *pmd_table)
+{
+	pgtable_t pmd_page = virt_to_page(pmd_table);
+
+	assert_spin_locked(pud_lockptr(mm, pudp));
+
+	/* Push onto stack using only lru.next as the link */
+	pmd_page->lru.next = (struct list_head *)pud_huge_pmd(pudp);
+	pud_huge_pmd(pudp) = pmd_page;
+}
+
+/*
+ * Withdraw the deposited PMD table for PUD THP split or zap.
+ * Called with PUD lock held.
+ * Returns NULL if no more PMD tables are deposited.
+ */
+pmd_t *pgtable_trans_huge_pud_withdraw(struct mm_struct *mm, pud_t *pudp)
+{
+	pgtable_t pmd_page;
+
+	assert_spin_locked(pud_lockptr(mm, pudp));
+
+	pmd_page = pud_huge_pmd(pudp);
+	if (!pmd_page)
+		return NULL;
+
+	/* Pop from stack - lru.next points to next PMD page (or NULL) */
+	pud_huge_pmd(pudp) = (pgtable_t)pmd_page->lru.next;
+
+	return page_address(pmd_page);
+}
+
+/*
+ * Deposit a PTE table into a standalone PMD table (not yet in page table hierarchy).
+ * Used for PUD THP pre-deposit. The PMD table's pmd_huge_pte stores a linked list.
+ * No lock assertion since the PMD isn't visible yet.
+ */
+void pud_deposit_pte(pmd_t *pmd_table, pgtable_t pgtable)
+{
+	struct ptdesc *ptdesc = virt_to_ptdesc(pmd_table);
+
+	/* FIFO - add to front of list */
+	if (!ptdesc->pmd_huge_pte)
+		INIT_LIST_HEAD(&pgtable->lru);
+	else
+		list_add(&pgtable->lru, &ptdesc->pmd_huge_pte->lru);
+	ptdesc->pmd_huge_pte = pgtable;
+}
+
+/*
+ * Withdraw a PTE table from a standalone PMD table.
+ * Returns NULL if no more PTE tables are deposited.
+ */
+pgtable_t pud_withdraw_pte(pmd_t *pmd_table)
+{
+	struct ptdesc *ptdesc = virt_to_ptdesc(pmd_table);
+	pgtable_t pgtable;
+
+	pgtable = ptdesc->pmd_huge_pte;
+	if (!pgtable)
+		return NULL;
+	ptdesc->pmd_huge_pte = list_first_entry_or_null(&pgtable->lru,
+							struct page, lru);
+	if (ptdesc->pmd_huge_pte)
+		list_del(&pgtable->lru);
+	return pgtable;
+}
+#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
+
 #ifndef __HAVE_ARCH_PMDP_INVALIDATE
 pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
 		     pmd_t *pmdp)
diff --git a/mm/rmap.c b/mm/rmap.c
index 7b9879ef442d9..69acabd763da4 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -811,6 +811,32 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
 	return pmd;
 }
 
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+/*
+ * Returns the actual pud_t* where we expect 'address' to be mapped from, or
+ * NULL if it doesn't exist.  No guarantees / checks on what the pud_t*
+ * represents.
+ */
+pud_t *mm_find_pud(struct mm_struct *mm, unsigned long address)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud = NULL;
+
+	pgd = pgd_offset(mm, address);
+	if (!pgd_present(*pgd))
+		goto out;
+
+	p4d = p4d_offset(pgd, address);
+	if (!p4d_present(*p4d))
+		goto out;
+
+	pud = pud_offset(p4d, address);
+out:
+	return pud;
+}
+#endif
+
 struct folio_referenced_arg {
 	int mapcount;
 	int referenced;
@@ -1415,11 +1441,7 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
 			SetPageAnonExclusive(page);
 			break;
 		case PGTABLE_LEVEL_PUD:
-			/*
-			 * Keep the compiler happy, we don't support anonymous
-			 * PUD mappings.
-			 */
-			WARN_ON_ONCE(1);
+			SetPageAnonExclusive(page);
 			break;
 		default:
 			BUILD_BUG();
@@ -1503,6 +1525,31 @@ void folio_add_anon_rmap_pmd(struct folio *folio, struct page *page,
 #endif
 }
 
+/**
+ * folio_add_anon_rmap_pud - add a PUD mapping to a page range of an anon folio
+ * @folio:	The folio to add the mapping to
+ * @page:	The first page to add
+ * @vma:	The vm area in which the mapping is added
+ * @address:	The user virtual address of the first page to map
+ * @flags:	The rmap flags
+ *
+ * The page range of folio is defined by [first_page, first_page + HPAGE_PUD_NR)
+ *
+ * The caller needs to hold the page table lock, and the page must be locked in
+ * the anon_vma case: to serialize mapping,index checking after setting.
+ */
+void folio_add_anon_rmap_pud(struct folio *folio, struct page *page,
+		struct vm_area_struct *vma, unsigned long address, rmap_t flags)
+{
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
+	defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
+	__folio_add_anon_rmap(folio, page, HPAGE_PUD_NR, vma, address, flags,
+			      PGTABLE_LEVEL_PUD);
+#else
+	WARN_ON_ONCE(true);
+#endif
+}
+
 /**
  * folio_add_new_anon_rmap - Add mapping to a new anonymous folio.
  * @folio:	The folio to add the mapping to.
@@ -1934,6 +1981,20 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 		}
 
 		if (!pvmw.pte) {
+			/*
+			 * Check for PUD-mapped THP first.
+			 * If we have a PUD mapping and TTU_SPLIT_HUGE_PUD is set,
+			 * split the PUD to PMD level and restart the walk.
+			 */
+			if (pvmw.pud && pud_trans_huge(*pvmw.pud)) {
+				if (flags & TTU_SPLIT_HUGE_PUD) {
+					split_huge_pud_locked(vma, pvmw.pud, pvmw.address);
+					flags &= ~TTU_SPLIT_HUGE_PUD;
+					page_vma_mapped_walk_restart(&pvmw);
+					continue;
+				}
+			}
+
 			if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
 				if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
 					goto walk_done;
@@ -2325,6 +2386,27 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 	mmu_notifier_invalidate_range_start(&range);
 
 	while (page_vma_mapped_walk(&pvmw)) {
+		/* Handle PUD-mapped THP first */
+		if (!pvmw.pte && !pvmw.pmd) {
+#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+			/*
+			 * PUD-mapped THP: skip migration to preserve the huge
+			 * page. Splitting would defeat the purpose of PUD THPs.
+			 * Return false to indicate migration failure, which
+			 * will cause alloc_contig_range() to try a different
+			 * memory region.
+			 */
+			if (pvmw.pud && pud_trans_huge(*pvmw.pud)) {
+				page_vma_mapped_walk_done(&pvmw);
+				ret = false;
+				break;
+			}
+#endif
+			/* Unexpected state: !pte && !pmd but not a PUD THP */
+			page_vma_mapped_walk_done(&pvmw);
+			break;
+		}
+
 		/* PMD-mapped THP migration entry */
 		if (!pvmw.pte) {
 			__maybe_unused unsigned long pfn;
@@ -2607,10 +2689,10 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
 
 	/*
 	 * Migration always ignores mlock and only supports TTU_RMAP_LOCKED and
-	 * TTU_SPLIT_HUGE_PMD, TTU_SYNC, and TTU_BATCH_FLUSH flags.
+	 * TTU_SPLIT_HUGE_PMD, TTU_SPLIT_HUGE_PUD, TTU_SYNC, and TTU_BATCH_FLUSH flags.
 	 */
 	if (WARN_ON_ONCE(flags & ~(TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD |
-					TTU_SYNC | TTU_BATCH_FLUSH)))
+					TTU_SPLIT_HUGE_PUD | TTU_SYNC | TTU_BATCH_FLUSH)))
 		return;
 
 	if (folio_is_zone_device(folio) &&
-- 
2.47.3
Re: [RFC 01/12] mm: add PUD THP ptdesc and rmap support
Posted by Lorenzo Stoakes 4 days, 18 hours ago
I think I'm going to have to do several passes on this, so this is just a
first one :)

On Sun, Feb 01, 2026 at 04:50:18PM -0800, Usama Arif wrote:
> For page table management, PUD THPs need to pre-deposit page tables
> that will be used when the huge page is later split. When a PUD THP
> is allocated, we cannot know in advance when or why it might need to
> be split (COW, partial unmap, reclaim), but we need page tables ready
> for that eventuality. Similar to how PMD THPs deposit a single PTE
> table, PUD THPs deposit a PMD table which itself contains deposited
> PTE tables - a two-level deposit. This commit adds the deposit/withdraw
> infrastructure and a new pud_huge_pmd field in ptdesc to store the
> deposited PMD.

This feels like you're hacking this support in, honestly. The list_head
abuse only adds to that feeling.

And are we now not required to store rather a lot of memory to keep all of
this coherent?

>
> The deposited PMD tables are stored as a singly-linked stack using only
> page->lru.next as the link pointer. A doubly-linked list using the
> standard list_head mechanism would cause memory corruption: list_del()
> poisons both lru.next (offset 8) and lru.prev (offset 16), but lru.prev
> overlaps with ptdesc->pmd_huge_pte at offset 16. Since deposited PMD
> tables have their own deposited PTE tables stored in pmd_huge_pte,
> poisoning lru.prev would corrupt the PTE table list and cause crashes
> when withdrawing PTE tables during split. PMD THPs don't have this
> problem because their deposited PTE tables don't have sub-deposits.
> Using only lru.next avoids the overlap entirely.

Yeah this is horrendous and a hack, I don't consider this at all
upstreamable.

You need to completely rework this.

>
> For reverse mapping, PUD THPs need the same rmap support that PMD THPs
> have. The page_vma_mapped_walk() function is extended to recognize and
> handle PUD-mapped folios during rmap traversal. A new TTU_SPLIT_HUGE_PUD
> flag tells the unmap path to split PUD THPs before proceeding, since
> there is no PUD-level migration entry format - the split converts the
> single PUD mapping into individual PTE mappings that can be migrated
> or swapped normally.

Individual PTE... mappings? You need to be a lot clearer here, page tables
are naturally confusing with entries vs. tables.

Let's be VERY specific here. Do you mean you have 1 PMD table and 512 PTE
tables reserved, spanning 1 PUD entry and 262,144 PTE entries?

>
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>

How does this change interact with existing DAX/VFIO code, which now it
seems will be subject to the mechanisms you introduce here?

Right now DAX/VFIO is only obtainable via a specially THP-aligned
get_unmapped_area() + then can only be obtained at fault time.

Is that the intent here also?

What is your intent - that khugepaged do this, or on alloc? How does it
interact with MADV_COLLAPSE?

I noted on the 2nd patch, but you're changing THP_ORDERS_ALL_ANON which
alters __thp_vma_allowable_orders() behaviour, that change belongs here...


> ---
>  include/linux/huge_mm.h  |  5 +++
>  include/linux/mm.h       | 19 ++++++++
>  include/linux/mm_types.h |  5 ++-
>  include/linux/pgtable.h  |  8 ++++
>  include/linux/rmap.h     |  7 ++-
>  mm/huge_memory.c         |  8 ++++
>  mm/internal.h            |  3 ++
>  mm/page_vma_mapped.c     | 35 +++++++++++++++
>  mm/pgtable-generic.c     | 83 ++++++++++++++++++++++++++++++++++
>  mm/rmap.c                | 96 +++++++++++++++++++++++++++++++++++++---
>  10 files changed, 260 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index a4d9f964dfdea..e672e45bb9cc7 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -463,10 +463,15 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>  		unsigned long address);
>
>  #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +void split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
> +			   unsigned long address);
>  int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		    pud_t *pudp, unsigned long addr, pgprot_t newprot,
>  		    unsigned long cp_flags);
>  #else
> +static inline void
> +split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
> +		      unsigned long address) {}
>  static inline int
>  change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		pud_t *pudp, unsigned long addr, pgprot_t newprot,
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ab2e7e30aef96..a15e18df0f771 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3455,6 +3455,22 @@ static inline bool pagetable_pmd_ctor(struct mm_struct *mm,
>   * considered ready to switch to split PUD locks yet; there may be places
>   * which need to be converted from page_table_lock.
>   */
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +static inline struct page *pud_pgtable_page(pud_t *pud)
> +{
> +	unsigned long mask = ~(PTRS_PER_PUD * sizeof(pud_t) - 1);
> +
> +	return virt_to_page((void *)((unsigned long)pud & mask));
> +}
> +
> +static inline struct ptdesc *pud_ptdesc(pud_t *pud)
> +{
> +	return page_ptdesc(pud_pgtable_page(pud));
> +}
> +
> +#define pud_huge_pmd(pud) (pud_ptdesc(pud)->pud_huge_pmd)
> +#endif
> +
>  static inline spinlock_t *pud_lockptr(struct mm_struct *mm, pud_t *pud)
>  {
>  	return &mm->page_table_lock;
> @@ -3471,6 +3487,9 @@ static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
>  static inline void pagetable_pud_ctor(struct ptdesc *ptdesc)
>  {
>  	__pagetable_ctor(ptdesc);
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +	ptdesc->pud_huge_pmd = NULL;
> +#endif
>  }
>
>  static inline void pagetable_p4d_ctor(struct ptdesc *ptdesc)
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 78950eb8926dc..26a38490ae2e1 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -577,7 +577,10 @@ struct ptdesc {
>  		struct list_head pt_list;
>  		struct {
>  			unsigned long _pt_pad_1;
> -			pgtable_t pmd_huge_pte;
> +			union {
> +				pgtable_t pmd_huge_pte;  /* For PMD tables: deposited PTE */
> +				pgtable_t pud_huge_pmd;  /* For PUD tables: deposited PMD list */
> +			};
>  		};
>  	};
>  	unsigned long __page_mapping;
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 2f0dd3a4ace1a..3ce733c1d71a2 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1168,6 +1168,14 @@ extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
>  #define arch_needs_pgtable_deposit() (false)
>  #endif
>
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +extern void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
> +					   pmd_t *pmd_table);
> +extern pmd_t *pgtable_trans_huge_pud_withdraw(struct mm_struct *mm, pud_t *pudp);
> +extern void pud_deposit_pte(pmd_t *pmd_table, pgtable_t pgtable);
> +extern pgtable_t pud_withdraw_pte(pmd_t *pmd_table);

These are useless extern's.

> +#endif
> +
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  /*
>   * This is an implementation of pmdp_establish() that is only suitable for an
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index daa92a58585d9..08cd0a0eb8763 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -101,6 +101,7 @@ enum ttu_flags {
>  					 * do a final flush if necessary */
>  	TTU_RMAP_LOCKED		= 0x80,	/* do not grab rmap lock:
>  					 * caller holds it */
> +	TTU_SPLIT_HUGE_PUD	= 0x100, /* split huge PUD if any */
>  };
>
>  #ifdef CONFIG_MMU
> @@ -473,6 +474,8 @@ void folio_add_anon_rmap_ptes(struct folio *, struct page *, int nr_pages,
>  	folio_add_anon_rmap_ptes(folio, page, 1, vma, address, flags)
>  void folio_add_anon_rmap_pmd(struct folio *, struct page *,
>  		struct vm_area_struct *, unsigned long address, rmap_t flags);
> +void folio_add_anon_rmap_pud(struct folio *, struct page *,
> +		struct vm_area_struct *, unsigned long address, rmap_t flags);
>  void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
>  		unsigned long address, rmap_t flags);
>  void folio_add_file_rmap_ptes(struct folio *, struct page *, int nr_pages,
> @@ -933,6 +936,7 @@ struct page_vma_mapped_walk {
>  	pgoff_t pgoff;
>  	struct vm_area_struct *vma;
>  	unsigned long address;
> +	pud_t *pud;
>  	pmd_t *pmd;
>  	pte_t *pte;
>  	spinlock_t *ptl;
> @@ -970,7 +974,7 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
>  static inline void
>  page_vma_mapped_walk_restart(struct page_vma_mapped_walk *pvmw)
>  {
> -	WARN_ON_ONCE(!pvmw->pmd && !pvmw->pte);
> +	WARN_ON_ONCE(!pvmw->pud && !pvmw->pmd && !pvmw->pte);
>
>  	if (likely(pvmw->ptl))
>  		spin_unlock(pvmw->ptl);
> @@ -978,6 +982,7 @@ page_vma_mapped_walk_restart(struct page_vma_mapped_walk *pvmw)
>  		WARN_ON_ONCE(1);
>
>  	pvmw->ptl = NULL;
> +	pvmw->pud = NULL;
>  	pvmw->pmd = NULL;
>  	pvmw->pte = NULL;
>  }
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 40cf59301c21a..3128b3beedb0a 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2933,6 +2933,14 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>  	spin_unlock(ptl);
>  	mmu_notifier_invalidate_range_end(&range);
>  }
> +
> +void split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
> +			   unsigned long address)
> +{
> +	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PUD_SIZE));
> +	if (pud_trans_huge(*pud))
> +		__split_huge_pud_locked(vma, pud, address);
> +}
>  #else
>  void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>  		unsigned long address)
> diff --git a/mm/internal.h b/mm/internal.h
> index 9ee336aa03656..21d5c00f638dc 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -545,6 +545,9 @@ int user_proactive_reclaim(char *buf,
>   * in mm/rmap.c:
>   */
>  pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +pud_t *mm_find_pud(struct mm_struct *mm, unsigned long address);
> +#endif
>
>  /*
>   * in mm/page_alloc.c
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index b38a1d00c971b..d31eafba38041 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -146,6 +146,18 @@ static bool check_pmd(unsigned long pfn, struct page_vma_mapped_walk *pvmw)
>  	return true;
>  }
>
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +/* Returns true if the two ranges overlap.  Careful to not overflow. */
> +static bool check_pud(unsigned long pfn, struct page_vma_mapped_walk *pvmw)
> +{
> +	if ((pfn + HPAGE_PUD_NR - 1) < pvmw->pfn)
> +		return false;
> +	if (pfn > pvmw->pfn + pvmw->nr_pages - 1)
> +		return false;
> +	return true;
> +}
> +#endif
> +
>  static void step_forward(struct page_vma_mapped_walk *pvmw, unsigned long size)
>  {
>  	pvmw->address = (pvmw->address + size) & ~(size - 1);
> @@ -188,6 +200,10 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>  	pud_t *pud;
>  	pmd_t pmde;
>
> +	/* The only possible pud mapping has been handled on last iteration */
> +	if (pvmw->pud && !pvmw->pmd)
> +		return not_found(pvmw);
> +
>  	/* The only possible pmd mapping has been handled on last iteration */
>  	if (pvmw->pmd && !pvmw->pte)
>  		return not_found(pvmw);
> @@ -234,6 +250,25 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>  			continue;
>  		}
>
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD

Said it elsewhere, but it's really weird to treat an arch having the
ability to do something as a go ahead for doing it.

> +		/* Check for PUD-mapped THP */
> +		if (pud_trans_huge(*pud)) {
> +			pvmw->pud = pud;
> +			pvmw->ptl = pud_lock(mm, pud);
> +			if (likely(pud_trans_huge(*pud))) {
> +				if (pvmw->flags & PVMW_MIGRATION)
> +					return not_found(pvmw);
> +				if (!check_pud(pud_pfn(*pud), pvmw))
> +					return not_found(pvmw);
> +				return true;
> +			}
> +			/* PUD was split under us, retry at PMD level */
> +			spin_unlock(pvmw->ptl);
> +			pvmw->ptl = NULL;
> +			pvmw->pud = NULL;
> +		}
> +#endif
> +

Yeah, as I said elsewhere, we got to be refactoring not copy/pasting with
modifications :)


>  		pvmw->pmd = pmd_offset(pud, pvmw->address);
>  		/*
>  		 * Make sure the pmd value isn't cached in a register by the
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index d3aec7a9926ad..2047558ddcd79 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -195,6 +195,89 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
>  }
>  #endif
>
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +/*
> + * Deposit page tables for PUD THP.
> + * Called with PUD lock held. Stores PMD tables in a singly-linked stack
> + * via pud_huge_pmd, using only pmd_page->lru.next as the link pointer.
> + *
> + * IMPORTANT: We use only lru.next (offset 8) for linking, NOT the full
> + * list_head. This is because lru.prev (offset 16) overlaps with
> + * ptdesc->pmd_huge_pte, which stores the PMD table's deposited PTE tables.
> + * Using list_del() would corrupt pmd_huge_pte with LIST_POISON2.

This is horrible and feels like a hack? Treating a doubly-linked list as a
singly-linked one like this is not upstreamable.

> + *
> + * PTE tables should be deposited into the PMD using pud_deposit_pte().
> + */
> +void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
> +				    pmd_t *pmd_table)

This is horrid, you're depositing the PMD using the... questionable
list_head abuse, but then also have pud_deposit_pte()... But here we're
depositing a PMD shouldn't the name reflect that?

> +{
> +	pgtable_t pmd_page = virt_to_page(pmd_table);
> +
> +	assert_spin_locked(pud_lockptr(mm, pudp));
> +
> +	/* Push onto stack using only lru.next as the link */
> +	pmd_page->lru.next = (struct list_head *)pud_huge_pmd(pudp);

Yikes...

> +	pud_huge_pmd(pudp) = pmd_page;
> +}
> +
> +/*
> + * Withdraw the deposited PMD table for PUD THP split or zap.
> + * Called with PUD lock held.
> + * Returns NULL if no more PMD tables are deposited.
> + */
> +pmd_t *pgtable_trans_huge_pud_withdraw(struct mm_struct *mm, pud_t *pudp)
> +{
> +	pgtable_t pmd_page;
> +
> +	assert_spin_locked(pud_lockptr(mm, pudp));
> +
> +	pmd_page = pud_huge_pmd(pudp);
> +	if (!pmd_page)
> +		return NULL;
> +
> +	/* Pop from stack - lru.next points to next PMD page (or NULL) */
> +	pud_huge_pmd(pudp) = (pgtable_t)pmd_page->lru.next;

Where's the popping? You're just assigning here.

> +
> +	return page_address(pmd_page);
> +}
> +
> +/*
> + * Deposit a PTE table into a standalone PMD table (not yet in page table hierarchy).
> + * Used for PUD THP pre-deposit. The PMD table's pmd_huge_pte stores a linked list.
> + * No lock assertion since the PMD isn't visible yet.
> + */
> +void pud_deposit_pte(pmd_t *pmd_table, pgtable_t pgtable)
> +{
> +	struct ptdesc *ptdesc = virt_to_ptdesc(pmd_table);
> +
> +	/* FIFO - add to front of list */
> +	if (!ptdesc->pmd_huge_pte)
> +		INIT_LIST_HEAD(&pgtable->lru);
> +	else
> +		list_add(&pgtable->lru, &ptdesc->pmd_huge_pte->lru);
> +	ptdesc->pmd_huge_pte = pgtable;
> +}
> +
> +/*
> + * Withdraw a PTE table from a standalone PMD table.
> + * Returns NULL if no more PTE tables are deposited.
> + */
> +pgtable_t pud_withdraw_pte(pmd_t *pmd_table)
> +{
> +	struct ptdesc *ptdesc = virt_to_ptdesc(pmd_table);
> +	pgtable_t pgtable;
> +
> +	pgtable = ptdesc->pmd_huge_pte;
> +	if (!pgtable)
> +		return NULL;
> +	ptdesc->pmd_huge_pte = list_first_entry_or_null(&pgtable->lru,
> +							struct page, lru);
> +	if (ptdesc->pmd_huge_pte)
> +		list_del(&pgtable->lru);
> +	return pgtable;
> +}
> +#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
> +
>  #ifndef __HAVE_ARCH_PMDP_INVALIDATE
>  pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>  		     pmd_t *pmdp)
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 7b9879ef442d9..69acabd763da4 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -811,6 +811,32 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
>  	return pmd;
>  }
>
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +/*
> + * Returns the actual pud_t* where we expect 'address' to be mapped from, or
> + * NULL if it doesn't exist.  No guarantees / checks on what the pud_t*
> + * represents.
> + */
> +pud_t *mm_find_pud(struct mm_struct *mm, unsigned long address)

This series seems to be full of copy/paste.

It's just not acceptable given the state of THP code as I said in reply to
the cover letter - you need to _refactor_ the code.

The code is bug-prone and difficult to maintain as-is, your series has to
improve the technical debt, not add to it.

> +{
> +	pgd_t *pgd;
> +	p4d_t *p4d;
> +	pud_t *pud = NULL;
> +
> +	pgd = pgd_offset(mm, address);
> +	if (!pgd_present(*pgd))
> +		goto out;
> +
> +	p4d = p4d_offset(pgd, address);
> +	if (!p4d_present(*p4d))
> +		goto out;
> +
> +	pud = pud_offset(p4d, address);
> +out:
> +	return pud;
> +}
> +#endif
> +
>  struct folio_referenced_arg {
>  	int mapcount;
>  	int referenced;
> @@ -1415,11 +1441,7 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
>  			SetPageAnonExclusive(page);
>  			break;
>  		case PGTABLE_LEVEL_PUD:
> -			/*
> -			 * Keep the compiler happy, we don't support anonymous
> -			 * PUD mappings.
> -			 */
> -			WARN_ON_ONCE(1);
> +			SetPageAnonExclusive(page);
>  			break;
>  		default:
>  			BUILD_BUG();
> @@ -1503,6 +1525,31 @@ void folio_add_anon_rmap_pmd(struct folio *folio, struct page *page,
>  #endif
>  }
>
> +/**
> + * folio_add_anon_rmap_pud - add a PUD mapping to a page range of an anon folio
> + * @folio:	The folio to add the mapping to
> + * @page:	The first page to add
> + * @vma:	The vm area in which the mapping is added
> + * @address:	The user virtual address of the first page to map
> + * @flags:	The rmap flags
> + *
> + * The page range of folio is defined by [first_page, first_page + HPAGE_PUD_NR)
> + *
> + * The caller needs to hold the page table lock, and the page must be locked in
> + * the anon_vma case: to serialize mapping,index checking after setting.
> + */
> +void folio_add_anon_rmap_pud(struct folio *folio, struct page *page,
> +		struct vm_area_struct *vma, unsigned long address, rmap_t flags)
> +{
> +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
> +	defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
> +	__folio_add_anon_rmap(folio, page, HPAGE_PUD_NR, vma, address, flags,
> +			      PGTABLE_LEVEL_PUD);
> +#else
> +	WARN_ON_ONCE(true);
> +#endif
> +}

More copy/paste... Maybe unavoidable in this case, but be good to try.

> +
>  /**
>   * folio_add_new_anon_rmap - Add mapping to a new anonymous folio.
>   * @folio:	The folio to add the mapping to.
> @@ -1934,6 +1981,20 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>  		}
>
>  		if (!pvmw.pte) {
> +			/*
> +			 * Check for PUD-mapped THP first.
> +			 * If we have a PUD mapping and TTU_SPLIT_HUGE_PUD is set,
> +			 * split the PUD to PMD level and restart the walk.
> +			 */

This is literally describing the code below, it's not useful.

> +			if (pvmw.pud && pud_trans_huge(*pvmw.pud)) {
> +				if (flags & TTU_SPLIT_HUGE_PUD) {
> +					split_huge_pud_locked(vma, pvmw.pud, pvmw.address);
> +					flags &= ~TTU_SPLIT_HUGE_PUD;
> +					page_vma_mapped_walk_restart(&pvmw);
> +					continue;
> +				}
> +			}
> +
>  			if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
>  				if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
>  					goto walk_done;
> @@ -2325,6 +2386,27 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>  	mmu_notifier_invalidate_range_start(&range);
>
>  	while (page_vma_mapped_walk(&pvmw)) {
> +		/* Handle PUD-mapped THP first */

How did/will this interact with DAX, VFIO PUD THP?

> +		if (!pvmw.pte && !pvmw.pmd) {
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD

Won't pud_trans_huge() imply this...

> +			/*
> +			 * PUD-mapped THP: skip migration to preserve the huge
> +			 * page. Splitting would defeat the purpose of PUD THPs.
> +			 * Return false to indicate migration failure, which
> +			 * will cause alloc_contig_range() to try a different
> +			 * memory region.
> +			 */
> +			if (pvmw.pud && pud_trans_huge(*pvmw.pud)) {
> +				page_vma_mapped_walk_done(&pvmw);
> +				ret = false;
> +				break;
> +			}
> +#endif
> +			/* Unexpected state: !pte && !pmd but not a PUD THP */
> +			page_vma_mapped_walk_done(&pvmw);
> +			break;
> +		}
> +
>  		/* PMD-mapped THP migration entry */
>  		if (!pvmw.pte) {
>  			__maybe_unused unsigned long pfn;
> @@ -2607,10 +2689,10 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
>
>  	/*
>  	 * Migration always ignores mlock and only supports TTU_RMAP_LOCKED and
> -	 * TTU_SPLIT_HUGE_PMD, TTU_SYNC, and TTU_BATCH_FLUSH flags.
> +	 * TTU_SPLIT_HUGE_PMD, TTU_SPLIT_HUGE_PUD, TTU_SYNC, and TTU_BATCH_FLUSH flags.
>  	 */
>  	if (WARN_ON_ONCE(flags & ~(TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD |
> -					TTU_SYNC | TTU_BATCH_FLUSH)))
> +					TTU_SPLIT_HUGE_PUD | TTU_SYNC | TTU_BATCH_FLUSH)))
>  		return;
>
>  	if (folio_is_zone_device(folio) &&
> --
> 2.47.3
>

This isn't a final review, I'll have to look more thoroughly through here
over time and you're going to have to be patient in general :)

Cheers, Lorenzo
Re: [RFC 01/12] mm: add PUD THP ptdesc and rmap support
Posted by Usama Arif 2 days, 23 hours ago

On 02/02/2026 04:15, Lorenzo Stoakes wrote:
> I think I'm going to have to do several passes on this, so this is just a
> first one :)
> 

Thanks! Really appreciate the reviews!

One thing to settle here is the higher-level design decision around migration
of 1G pages. As Zi said in [1]:
"I also wonder what the purpose of PUD THP migration can be.
It does not create memory fragmentation, since it is the largest folio size
we have and contiguous. NUMA balancing 1GB THP seems too much work."

> On Sun, Feb 01, 2026 at 04:50:18PM -0800, Usama Arif wrote:
>> For page table management, PUD THPs need to pre-deposit page tables
>> that will be used when the huge page is later split. When a PUD THP
>> is allocated, we cannot know in advance when or why it might need to
>> be split (COW, partial unmap, reclaim), but we need page tables ready
>> for that eventuality. Similar to how PMD THPs deposit a single PTE
>> table, PUD THPs deposit a PMD table which itself contains deposited
>> PTE tables - a two-level deposit. This commit adds the deposit/withdraw
>> infrastructure and a new pud_huge_pmd field in ptdesc to store the
>> deposited PMD.
> 
> This feels like you're hacking this support in, honestly. The list_head
> abuse only adds to that feeling.
> 

Yeah, so I hope turning it into something like [2] is the way forward.

> And are we now not required to store rather a lot of memory to keep all of
> this coherent?

PMD THP allocates one 4K PTE table (pte_alloc_one) at fault time so that a
later split doesn't fail.

For PUD we allocate 2M worth of PTE page tables and one 4K PMD table at fault
time so that a split doesn't fail due to there not being enough memory.
It's not great, but it's not bad either.
The alternative is to allocate these at split time, so nothing is
pre-reserved. But then there is a chance that the allocation, and therefore
the split, fails, so the trade-off is some memory vs reliability. This patch
favours reliability.

Let's say a user gets 100x1G THPs. They would end up using ~200M for the
deposited tables. I think that is OK-ish. If the user has 100G, 200M might
not be an issue for them :)
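
To put numbers on it:

  per 1G THP:  512 PTE tables * 4K + 1 PMD table * 4K = 2052K ~= 2M
  100 1G THPs: ~200M of deposited page tables for 100G of mappings, ~0.2% overhead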

> 
>>
>> The deposited PMD tables are stored as a singly-linked stack using only
>> page->lru.next as the link pointer. A doubly-linked list using the
>> standard list_head mechanism would cause memory corruption: list_del()
>> poisons both lru.next (offset 8) and lru.prev (offset 16), but lru.prev
>> overlaps with ptdesc->pmd_huge_pte at offset 16. Since deposited PMD
>> tables have their own deposited PTE tables stored in pmd_huge_pte,
>> poisoning lru.prev would corrupt the PTE table list and cause crashes
>> when withdrawing PTE tables during split. PMD THPs don't have this
>> problem because their deposited PTE tables don't have sub-deposits.
>> Using only lru.next avoids the overlap entirely.
> 
> Yeah this is horrendous and a hack, I don't consider this at all
> upstreamable.
> 
> You need to completely rework this.

Hopefully [2] is the path forward!
> 
>>
>> For reverse mapping, PUD THPs need the same rmap support that PMD THPs
>> have. The page_vma_mapped_walk() function is extended to recognize and
>> handle PUD-mapped folios during rmap traversal. A new TTU_SPLIT_HUGE_PUD
>> flag tells the unmap path to split PUD THPs before proceeding, since
>> there is no PUD-level migration entry format - the split converts the
>> single PUD mapping into individual PTE mappings that can be migrated
>> or swapped normally.
> 
> Individual PTE... mappings? You need to be a lot clearer here, page tables
> are naturally confusing with entries vs. tables.
> 
> Let's be VERY specific here. Do you mean you have 1 PMD table and 512 PTE
> tables reserved, spanning 1 PUD entry and 262,144 PTE entries?
> 

Yes, that is correct, thanks! I will change the commit message in the next
revision to what you have written: 1 PMD table and 512 PTE tables reserved,
spanning 1 PUD entry and 262,144 PTE entries.
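
i.e.:

  1 PUD entry -> 512 PMD entries -> 512 * 512 = 262,144 PTE entries,
  backed by 1 deposited PMD table + 512 deposited PTE tables.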

>>
>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> 
> How does this change interact with existing DAX/VFIO code, which now it
> seems will be subject to the mechanisms you introduce here?

I think what you mean here is the change in try_to_migrate_one?


So one 

> 
> Right now DAX/VFIO is only obtainable via a specially THP-aligned
> get_unmapped_area() + then can only be obtained at fault time.
> 
> Is that the intent here also?
> 

Ah thanks for pointing this out. This is something the series is missing.

What I did in the selftest and benchmark was fault on an address that was already aligned.
i.e. basically call the below function before faulting in.

static inline void *pud_align(void *addr)
{
	return (void *)(((unsigned long)addr + PUD_SIZE - 1) & ~(PUD_SIZE - 1));
}
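
and then do something along these lines (illustrative, not the exact selftest
code):

  char *map = mmap(NULL, 2 * PUD_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  char *buf = pud_align(map);

  /* The first write to the PUD-aligned region faults in the 1G THP. */
  memset(buf, 1, PUD_SIZE);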


I think what you are suggesting the series is missing is something like the below diff? (it's untested)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 87b2c21df4a49..461158a0840db 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1236,6 +1236,12 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
        unsigned long ret;
        loff_t off = (loff_t)pgoff << PAGE_SHIFT;
 
+       if (IS_ENABLED(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) && len >= PUD_SIZE) {
+               ret = __thp_get_unmapped_area(filp, addr, len, off, flags, PUD_SIZE, vm_flags);
+               if (ret)
+                       return ret;
+       }
+


> What is your intent - that khugepaged do this, or on alloc? How does it
> interact with MADV_COLLAPSE?
> 

Ah, basically what I mentioned in [3]: we want to go slow and only enable PUD THP
page faults at the start. If there is data supporting that khugepaged will work,
then we do it, but for now we keep it disabled.

> I noted on the 2nd patch, but you're changing THP_ORDERS_ALL_ANON which
> alters __thp_vma_allowable_orders() behaviour, that change belongs here...
> 
> 

Thanks for this! I only tried to split this code into logical commits
after the whole thing was working. Some things are tightly coupled
and I would need to move them to the right commit.

>> ---
>>  include/linux/huge_mm.h  |  5 +++
>>  include/linux/mm.h       | 19 ++++++++
>>  include/linux/mm_types.h |  5 ++-
>>  include/linux/pgtable.h  |  8 ++++
>>  include/linux/rmap.h     |  7 ++-
>>  mm/huge_memory.c         |  8 ++++
>>  mm/internal.h            |  3 ++
>>  mm/page_vma_mapped.c     | 35 +++++++++++++++
>>  mm/pgtable-generic.c     | 83 ++++++++++++++++++++++++++++++++++
>>  mm/rmap.c                | 96 +++++++++++++++++++++++++++++++++++++---
>>  10 files changed, 260 insertions(+), 9 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index a4d9f964dfdea..e672e45bb9cc7 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -463,10 +463,15 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>>  		unsigned long address);
>>
>>  #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>> +void split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
>> +			   unsigned long address);
>>  int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>  		    pud_t *pudp, unsigned long addr, pgprot_t newprot,
>>  		    unsigned long cp_flags);
>>  #else
>> +static inline void
>> +split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
>> +		      unsigned long address) {}
>>  static inline int
>>  change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>  		pud_t *pudp, unsigned long addr, pgprot_t newprot,
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index ab2e7e30aef96..a15e18df0f771 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -3455,6 +3455,22 @@ static inline bool pagetable_pmd_ctor(struct mm_struct *mm,
>>   * considered ready to switch to split PUD locks yet; there may be places
>>   * which need to be converted from page_table_lock.
>>   */
>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>> +static inline struct page *pud_pgtable_page(pud_t *pud)
>> +{
>> +	unsigned long mask = ~(PTRS_PER_PUD * sizeof(pud_t) - 1);
>> +
>> +	return virt_to_page((void *)((unsigned long)pud & mask));
>> +}
>> +
>> +static inline struct ptdesc *pud_ptdesc(pud_t *pud)
>> +{
>> +	return page_ptdesc(pud_pgtable_page(pud));
>> +}
>> +
>> +#define pud_huge_pmd(pud) (pud_ptdesc(pud)->pud_huge_pmd)
>> +#endif
>> +
>>  static inline spinlock_t *pud_lockptr(struct mm_struct *mm, pud_t *pud)
>>  {
>>  	return &mm->page_table_lock;
>> @@ -3471,6 +3487,9 @@ static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
>>  static inline void pagetable_pud_ctor(struct ptdesc *ptdesc)
>>  {
>>  	__pagetable_ctor(ptdesc);
>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>> +	ptdesc->pud_huge_pmd = NULL;
>> +#endif
>>  }
>>
>>  static inline void pagetable_p4d_ctor(struct ptdesc *ptdesc)
>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>> index 78950eb8926dc..26a38490ae2e1 100644
>> --- a/include/linux/mm_types.h
>> +++ b/include/linux/mm_types.h
>> @@ -577,7 +577,10 @@ struct ptdesc {
>>  		struct list_head pt_list;
>>  		struct {
>>  			unsigned long _pt_pad_1;
>> -			pgtable_t pmd_huge_pte;
>> +			union {
>> +				pgtable_t pmd_huge_pte;  /* For PMD tables: deposited PTE */
>> +				pgtable_t pud_huge_pmd;  /* For PUD tables: deposited PMD list */
>> +			};
>>  		};
>>  	};
>>  	unsigned long __page_mapping;
>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>> index 2f0dd3a4ace1a..3ce733c1d71a2 100644
>> --- a/include/linux/pgtable.h
>> +++ b/include/linux/pgtable.h
>> @@ -1168,6 +1168,14 @@ extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
>>  #define arch_needs_pgtable_deposit() (false)
>>  #endif
>>
>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>> +extern void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
>> +					   pmd_t *pmd_table);
>> +extern pmd_t *pgtable_trans_huge_pud_withdraw(struct mm_struct *mm, pud_t *pudp);
>> +extern void pud_deposit_pte(pmd_t *pmd_table, pgtable_t pgtable);
>> +extern pgtable_t pud_withdraw_pte(pmd_t *pmd_table);
> 
> These are useless extern's.
> 


ack

These are copied from the existing declarations in the file:
extern void pgtable_trans_huge_deposit
extern pgtable_t pgtable_trans_huge_withdraw

I think the externs can be removed from these as well? We can
fix those in a separate patch.


>> +#endif
>> +
>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>  /*
>>   * This is an implementation of pmdp_establish() that is only suitable for an
>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>> index daa92a58585d9..08cd0a0eb8763 100644
>> --- a/include/linux/rmap.h
>> +++ b/include/linux/rmap.h
>> @@ -101,6 +101,7 @@ enum ttu_flags {
>>  					 * do a final flush if necessary */
>>  	TTU_RMAP_LOCKED		= 0x80,	/* do not grab rmap lock:
>>  					 * caller holds it */
>> +	TTU_SPLIT_HUGE_PUD	= 0x100, /* split huge PUD if any */
>>  };
>>
>>  #ifdef CONFIG_MMU
>> @@ -473,6 +474,8 @@ void folio_add_anon_rmap_ptes(struct folio *, struct page *, int nr_pages,
>>  	folio_add_anon_rmap_ptes(folio, page, 1, vma, address, flags)
>>  void folio_add_anon_rmap_pmd(struct folio *, struct page *,
>>  		struct vm_area_struct *, unsigned long address, rmap_t flags);
>> +void folio_add_anon_rmap_pud(struct folio *, struct page *,
>> +		struct vm_area_struct *, unsigned long address, rmap_t flags);
>>  void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
>>  		unsigned long address, rmap_t flags);
>>  void folio_add_file_rmap_ptes(struct folio *, struct page *, int nr_pages,
>> @@ -933,6 +936,7 @@ struct page_vma_mapped_walk {
>>  	pgoff_t pgoff;
>>  	struct vm_area_struct *vma;
>>  	unsigned long address;
>> +	pud_t *pud;
>>  	pmd_t *pmd;
>>  	pte_t *pte;
>>  	spinlock_t *ptl;
>> @@ -970,7 +974,7 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
>>  static inline void
>>  page_vma_mapped_walk_restart(struct page_vma_mapped_walk *pvmw)
>>  {
>> -	WARN_ON_ONCE(!pvmw->pmd && !pvmw->pte);
>> +	WARN_ON_ONCE(!pvmw->pud && !pvmw->pmd && !pvmw->pte);
>>
>>  	if (likely(pvmw->ptl))
>>  		spin_unlock(pvmw->ptl);
>> @@ -978,6 +982,7 @@ page_vma_mapped_walk_restart(struct page_vma_mapped_walk *pvmw)
>>  		WARN_ON_ONCE(1);
>>
>>  	pvmw->ptl = NULL;
>> +	pvmw->pud = NULL;
>>  	pvmw->pmd = NULL;
>>  	pvmw->pte = NULL;
>>  }
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 40cf59301c21a..3128b3beedb0a 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2933,6 +2933,14 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>>  	spin_unlock(ptl);
>>  	mmu_notifier_invalidate_range_end(&range);
>>  }
>> +
>> +void split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
>> +			   unsigned long address)
>> +{
>> +	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PUD_SIZE));
>> +	if (pud_trans_huge(*pud))
>> +		__split_huge_pud_locked(vma, pud, address);
>> +}
>>  #else
>>  void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>>  		unsigned long address)
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 9ee336aa03656..21d5c00f638dc 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -545,6 +545,9 @@ int user_proactive_reclaim(char *buf,
>>   * in mm/rmap.c:
>>   */
>>  pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>> +pud_t *mm_find_pud(struct mm_struct *mm, unsigned long address);
>> +#endif
>>
>>  /*
>>   * in mm/page_alloc.c
>> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
>> index b38a1d00c971b..d31eafba38041 100644
>> --- a/mm/page_vma_mapped.c
>> +++ b/mm/page_vma_mapped.c
>> @@ -146,6 +146,18 @@ static bool check_pmd(unsigned long pfn, struct page_vma_mapped_walk *pvmw)
>>  	return true;
>>  }
>>
>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>> +/* Returns true if the two ranges overlap.  Careful to not overflow. */
>> +static bool check_pud(unsigned long pfn, struct page_vma_mapped_walk *pvmw)
>> +{
>> +	if ((pfn + HPAGE_PUD_NR - 1) < pvmw->pfn)
>> +		return false;
>> +	if (pfn > pvmw->pfn + pvmw->nr_pages - 1)
>> +		return false;
>> +	return true;
>> +}
>> +#endif
>> +
>>  static void step_forward(struct page_vma_mapped_walk *pvmw, unsigned long size)
>>  {
>>  	pvmw->address = (pvmw->address + size) & ~(size - 1);
>> @@ -188,6 +200,10 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>>  	pud_t *pud;
>>  	pmd_t pmde;
>>
>> +	/* The only possible pud mapping has been handled on last iteration */
>> +	if (pvmw->pud && !pvmw->pmd)
>> +		return not_found(pvmw);
>> +
>>  	/* The only possible pmd mapping has been handled on last iteration */
>>  	if (pvmw->pmd && !pvmw->pte)
>>  		return not_found(pvmw);
>> @@ -234,6 +250,25 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>>  			continue;
>>  		}
>>
>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> 
> Said it elsewhere, but it's really weird to treat an arch having the
> ability to do something as a go ahead for doing it.
> 
>> +		/* Check for PUD-mapped THP */
>> +		if (pud_trans_huge(*pud)) {
>> +			pvmw->pud = pud;
>> +			pvmw->ptl = pud_lock(mm, pud);
>> +			if (likely(pud_trans_huge(*pud))) {
>> +				if (pvmw->flags & PVMW_MIGRATION)
>> +					return not_found(pvmw);
>> +				if (!check_pud(pud_pfn(*pud), pvmw))
>> +					return not_found(pvmw);
>> +				return true;
>> +			}
>> +			/* PUD was split under us, retry at PMD level */
>> +			spin_unlock(pvmw->ptl);
>> +			pvmw->ptl = NULL;
>> +			pvmw->pud = NULL;
>> +		}
>> +#endif
>> +
> 
> Yeah, as I said elsewhere, we got to be refactoring not copy/pasting with
> modifications :)
> 

Yeah, there is repeated code in multiple places, where all I did was take
what was done for PMD and redo it for PUD. In a lot of places it's actually
difficult not to repeat the code (unless we want function macros, which IMO
is much worse).
 
> 
>>  		pvmw->pmd = pmd_offset(pud, pvmw->address);
>>  		/*
>>  		 * Make sure the pmd value isn't cached in a register by the
>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>> index d3aec7a9926ad..2047558ddcd79 100644
>> --- a/mm/pgtable-generic.c
>> +++ b/mm/pgtable-generic.c
>> @@ -195,6 +195,89 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
>>  }
>>  #endif
>>
>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>> +/*
>> + * Deposit page tables for PUD THP.
>> + * Called with PUD lock held. Stores PMD tables in a singly-linked stack
>> + * via pud_huge_pmd, using only pmd_page->lru.next as the link pointer.
>> + *
>> + * IMPORTANT: We use only lru.next (offset 8) for linking, NOT the full
>> + * list_head. This is because lru.prev (offset 16) overlaps with
>> + * ptdesc->pmd_huge_pte, which stores the PMD table's deposited PTE tables.
>> + * Using list_del() would corrupt pmd_huge_pte with LIST_POISON2.
> 
> This is horrible and feels like a hack? Treating a doubly-linked list as a
> singly-linked one like this is not upstreamable.
> 
>> + *
>> + * PTE tables should be deposited into the PMD using pud_deposit_pte().
>> + */
>> +void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
>> +				    pmd_t *pmd_table)
> 
> This is horrid, you're depositing the PMD using the... questionable
> list_head abuse, but then also have pud_deposit_pte()... But here we're
> depositing a PMD shouldn't the name reflect that?
> 
>> +{
>> +	pgtable_t pmd_page = virt_to_page(pmd_table);
>> +
>> +	assert_spin_locked(pud_lockptr(mm, pudp));
>> +
>> +	/* Push onto stack using only lru.next as the link */
>> +	pmd_page->lru.next = (struct list_head *)pud_huge_pmd(pudp);
> 
> Yikes...
> 
>> +	pud_huge_pmd(pudp) = pmd_page;
>> +}
>> +
>> +/*
>> + * Withdraw the deposited PMD table for PUD THP split or zap.
>> + * Called with PUD lock held.
>> + * Returns NULL if no more PMD tables are deposited.
>> + */
>> +pmd_t *pgtable_trans_huge_pud_withdraw(struct mm_struct *mm, pud_t *pudp)
>> +{
>> +	pgtable_t pmd_page;
>> +
>> +	assert_spin_locked(pud_lockptr(mm, pudp));
>> +
>> +	pmd_page = pud_huge_pmd(pudp);
>> +	if (!pmd_page)
>> +		return NULL;
>> +
>> +	/* Pop from stack - lru.next points to next PMD page (or NULL) */
>> +	pud_huge_pmd(pudp) = (pgtable_t)pmd_page->lru.next;
> 
> Where's the popping? You're just assigning here.


Ack on all of the above. Hopefully [1] is better.
> 
>> +
>> +	return page_address(pmd_page);
>> +}
>> +
>> +/*
>> + * Deposit a PTE table into a standalone PMD table (not yet in page table hierarchy).
>> + * Used for PUD THP pre-deposit. The PMD table's pmd_huge_pte stores a linked list.
>> + * No lock assertion since the PMD isn't visible yet.
>> + */
>> +void pud_deposit_pte(pmd_t *pmd_table, pgtable_t pgtable)
>> +{
>> +	struct ptdesc *ptdesc = virt_to_ptdesc(pmd_table);
>> +
>> +	/* FIFO - add to front of list */
>> +	if (!ptdesc->pmd_huge_pte)
>> +		INIT_LIST_HEAD(&pgtable->lru);
>> +	else
>> +		list_add(&pgtable->lru, &ptdesc->pmd_huge_pte->lru);
>> +	ptdesc->pmd_huge_pte = pgtable;
>> +}
>> +
>> +/*
>> + * Withdraw a PTE table from a standalone PMD table.
>> + * Returns NULL if no more PTE tables are deposited.
>> + */
>> +pgtable_t pud_withdraw_pte(pmd_t *pmd_table)
>> +{
>> +	struct ptdesc *ptdesc = virt_to_ptdesc(pmd_table);
>> +	pgtable_t pgtable;
>> +
>> +	pgtable = ptdesc->pmd_huge_pte;
>> +	if (!pgtable)
>> +		return NULL;
>> +	ptdesc->pmd_huge_pte = list_first_entry_or_null(&pgtable->lru,
>> +							struct page, lru);
>> +	if (ptdesc->pmd_huge_pte)
>> +		list_del(&pgtable->lru);
>> +	return pgtable;
>> +}
>> +#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
>> +
>>  #ifndef __HAVE_ARCH_PMDP_INVALIDATE
>>  pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>>  		     pmd_t *pmdp)
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 7b9879ef442d9..69acabd763da4 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -811,6 +811,32 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
>>  	return pmd;
>>  }
>>
>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>> +/*
>> + * Returns the actual pud_t* where we expect 'address' to be mapped from, or
>> + * NULL if it doesn't exist.  No guarantees / checks on what the pud_t*
>> + * represents.
>> + */
>> +pud_t *mm_find_pud(struct mm_struct *mm, unsigned long address)
> 
> This series seems to be full of copy/paste.
> 
> It's just not acceptable given the state of THP code as I said in reply to
> the cover letter - you need to _refactor_ the code.
> 
> The code is bug-prone and difficult to maintain as-is, your series has to
> improve the technical debt, not add to it.
> 

In some cases we might not be able to avoid the copy, but this is definitely
a place where we don't need to. I will change it here. Thanks!
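
Something like the below is what I have in mind (untested sketch, just to show
the direction): keep mm_find_pud() as the common walk and have mm_find_pmd()
reuse it instead of open-coding the pgd/p4d/pud steps again.

pud_t *mm_find_pud(struct mm_struct *mm, unsigned long address)
{
	pgd_t *pgd;
	p4d_t *p4d;

	pgd = pgd_offset(mm, address);
	if (!pgd_present(*pgd))
		return NULL;

	p4d = p4d_offset(pgd, address);
	if (!p4d_present(*p4d))
		return NULL;

	/* As before, no guarantees on what the returned pud_t* represents. */
	return pud_offset(p4d, address);
}

pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
{
	/* Reuse the walk above rather than duplicating it. */
	pud_t *pud = mm_find_pud(mm, address);

	if (!pud || !pud_present(*pud))
		return NULL;

	return pmd_offset(pud, address);
}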

>> +{
>> +	pgd_t *pgd;
>> +	p4d_t *p4d;
>> +	pud_t *pud = NULL;
>> +
>> +	pgd = pgd_offset(mm, address);
>> +	if (!pgd_present(*pgd))
>> +		goto out;
>> +
>> +	p4d = p4d_offset(pgd, address);
>> +	if (!p4d_present(*p4d))
>> +		goto out;
>> +
>> +	pud = pud_offset(p4d, address);
>> +out:
>> +	return pud;
>> +}
>> +#endif
>> +
>>  struct folio_referenced_arg {
>>  	int mapcount;
>>  	int referenced;
>> @@ -1415,11 +1441,7 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
>>  			SetPageAnonExclusive(page);
>>  			break;
>>  		case PGTABLE_LEVEL_PUD:
>> -			/*
>> -			 * Keep the compiler happy, we don't support anonymous
>> -			 * PUD mappings.
>> -			 */
>> -			WARN_ON_ONCE(1);
>> +			SetPageAnonExclusive(page);
>>  			break;
>>  		default:
>>  			BUILD_BUG();
>> @@ -1503,6 +1525,31 @@ void folio_add_anon_rmap_pmd(struct folio *folio, struct page *page,
>>  #endif
>>  }
>>
>> +/**
>> + * folio_add_anon_rmap_pud - add a PUD mapping to a page range of an anon folio
>> + * @folio:	The folio to add the mapping to
>> + * @page:	The first page to add
>> + * @vma:	The vm area in which the mapping is added
>> + * @address:	The user virtual address of the first page to map
>> + * @flags:	The rmap flags
>> + *
>> + * The page range of folio is defined by [first_page, first_page + HPAGE_PUD_NR)
>> + *
>> + * The caller needs to hold the page table lock, and the page must be locked in
>> + * the anon_vma case: to serialize mapping,index checking after setting.
>> + */
>> +void folio_add_anon_rmap_pud(struct folio *folio, struct page *page,
>> +		struct vm_area_struct *vma, unsigned long address, rmap_t flags)
>> +{
>> +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
>> +	defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
>> +	__folio_add_anon_rmap(folio, page, HPAGE_PUD_NR, vma, address, flags,
>> +			      PGTABLE_LEVEL_PUD);
>> +#else
>> +	WARN_ON_ONCE(true);
>> +#endif
>> +}
> 
> More copy/paste... Maybe unavoidable in this case, but be good to try.
> 
>> +
>>  /**
>>   * folio_add_new_anon_rmap - Add mapping to a new anonymous folio.
>>   * @folio:	The folio to add the mapping to.
>> @@ -1934,6 +1981,20 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>  		}
>>
>>  		if (!pvmw.pte) {
>> +			/*
>> +			 * Check for PUD-mapped THP first.
>> +			 * If we have a PUD mapping and TTU_SPLIT_HUGE_PUD is set,
>> +			 * split the PUD to PMD level and restart the walk.
>> +			 */
> 
> This is literally describing the code below, it's not useful.

Ack, will remove this comment. Thanks!
> 
>> +			if (pvmw.pud && pud_trans_huge(*pvmw.pud)) {
>> +				if (flags & TTU_SPLIT_HUGE_PUD) {
>> +					split_huge_pud_locked(vma, pvmw.pud, pvmw.address);
>> +					flags &= ~TTU_SPLIT_HUGE_PUD;
>> +					page_vma_mapped_walk_restart(&pvmw);
>> +					continue;
>> +				}
>> +			}
>> +
>>  			if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
>>  				if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
>>  					goto walk_done;
>> @@ -2325,6 +2386,27 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>>  	mmu_notifier_invalidate_range_start(&range);
>>
>>  	while (page_vma_mapped_walk(&pvmw)) {
>> +		/* Handle PUD-mapped THP first */
> 
> How did/will this interact with DAX, VFIO PUD THP?

It won't interact with DAX. try_to_migrate() has the below check and just returns:

	if (folio_is_zone_device(folio) &&
	    (!folio_is_device_private(folio) && !folio_is_device_coherent(folio)))
		return;

so DAX would never reach here.

I think VFIO pages are pinned and therefore can't be migrated? (I have not
looked at the VFIO code; I will try to get a better understanding tomorrow,
but please let me know if that sounds wrong.)


> 
>> +		if (!pvmw.pte && !pvmw.pmd) {
>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> 
> Won't pud_trans_huge() imply this...
> 

Agreed, I think it should cover it.

>> +			/*
>> +			 * PUD-mapped THP: skip migration to preserve the huge
>> +			 * page. Splitting would defeat the purpose of PUD THPs.
>> +			 * Return false to indicate migration failure, which
>> +			 * will cause alloc_contig_range() to try a different
>> +			 * memory region.
>> +			 */
>> +			if (pvmw.pud && pud_trans_huge(*pvmw.pud)) {
>> +				page_vma_mapped_walk_done(&pvmw);
>> +				ret = false;
>> +				break;
>> +			}
>> +#endif
>> +			/* Unexpected state: !pte && !pmd but not a PUD THP */
>> +			page_vma_mapped_walk_done(&pvmw);
>> +			break;
>> +		}
>> +
>>  		/* PMD-mapped THP migration entry */
>>  		if (!pvmw.pte) {
>>  			__maybe_unused unsigned long pfn;
>> @@ -2607,10 +2689,10 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
>>
>>  	/*
>>  	 * Migration always ignores mlock and only supports TTU_RMAP_LOCKED and
>> -	 * TTU_SPLIT_HUGE_PMD, TTU_SYNC, and TTU_BATCH_FLUSH flags.
>> +	 * TTU_SPLIT_HUGE_PMD, TTU_SPLIT_HUGE_PUD, TTU_SYNC, and TTU_BATCH_FLUSH flags.
>>  	 */
>>  	if (WARN_ON_ONCE(flags & ~(TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD |
>> -					TTU_SYNC | TTU_BATCH_FLUSH)))
>> +					TTU_SPLIT_HUGE_PUD | TTU_SYNC | TTU_BATCH_FLUSH)))
>>  		return;
>>
>>  	if (folio_is_zone_device(folio) &&
>> --
>> 2.47.3
>>
> 
> This isn't a final review, I'll have to look more thoroughly through here
> over time and you're going to have to be patient in general :)
> 
> Cheers, Lorenzo


Thanks for the review, this is awesome!


[1] https://lore.kernel.org/all/20f92576-e932-435f-bb7b-de49eb84b012@gmail.com/
[2] https://lore.kernel.org/all/05d5918f-b61b-4091-b8c6-20eebfffc3c4@gmail.com/
[3] https://lore.kernel.org/all/2efaa5ed-bd09-41f0-9c07-5cd6cccc4595@gmail.com/
Re: [RFC 01/12] mm: add PUD THP ptdesc and rmap support
Posted by Lorenzo Stoakes 2 days, 18 hours ago
On Tue, Feb 03, 2026 at 11:38:02PM -0800, Usama Arif wrote:
>
>
> On 02/02/2026 04:15, Lorenzo Stoakes wrote:
> > I think I'm going to have to do several passes on this, so this is just a
> > first one :)
> >
>
> Thanks! Really appreciate the reviews!

No worries!

>
> One thing over here is the higher level design decision when it comes to migration
> of 1G pages. As Zi said in [1]:
> "I also wonder what the purpose of PUD THP migration can be.
> It does not create memory fragmentation, since it is the largest folio size
> we have and contiguous. NUMA balancing 1GB THP seems too much work."
>
> > On Sun, Feb 01, 2026 at 04:50:18PM -0800, Usama Arif wrote:
> >> For page table management, PUD THPs need to pre-deposit page tables
> >> that will be used when the huge page is later split. When a PUD THP
> >> is allocated, we cannot know in advance when or why it might need to
> >> be split (COW, partial unmap, reclaim), but we need page tables ready
> >> for that eventuality. Similar to how PMD THPs deposit a single PTE
> >> table, PUD THPs deposit a PMD table which itself contains deposited
> >> PTE tables - a two-level deposit. This commit adds the deposit/withdraw
> >> infrastructure and a new pud_huge_pmd field in ptdesc to store the
> >> deposited PMD.
> >
> > This feels like you're hacking this support in, honestly. The list_head
> > abuse only adds to that feeling.
> >
>
> Yeah so I hope turning it to something like [2] is the way forward.

Right, that's one option, though David suggested avoiding this altogether by
only pre-allocating PTEs?

>
> > And are we now not required to store rather a lot of memory to keep all of
> > this coherent?
>
> PMD THP allocates 1 4K page (pte_alloc_one) at fault time so that split
> doesnt fail.
>
> For PUD we allocate 2M worth of PTE page tables and 1 4K PMD table at fault
> time so that split doesnt fail due to there not being enough memory.
> Its not great, but its not bad as well.
> The alternative is to allocate this at split time and so we are not
> pre-reserving them. Now there is a chance that allocation and therefore split
> fails, so the tradeoff is some memory vs reliability. This patch favours
> reliability.

That's a significant amount of unmovable, unreclaimable memory though. Going
from 4K to 2M is a pretty huge uptick.

>
> Lets say a user gets 100x1G THPs. They would end up using ~200M for it.
> I think that is okish. If the user has 100G, 200M might not be an issue
> for them :)

But there's more than one user on boxes big enough for this, so this makes me
think we want this to be somehow opt-in right?

And that means we're incurring an unmovable memory penalty, the kind which we're
trying to avoid in general elsewhere in the kernel.

>
> >
> >>
> >> The deposited PMD tables are stored as a singly-linked stack using only
> >> page->lru.next as the link pointer. A doubly-linked list using the
> >> standard list_head mechanism would cause memory corruption: list_del()
> >> poisons both lru.next (offset 8) and lru.prev (offset 16), but lru.prev
> >> overlaps with ptdesc->pmd_huge_pte at offset 16. Since deposited PMD
> >> tables have their own deposited PTE tables stored in pmd_huge_pte,
> >> poisoning lru.prev would corrupt the PTE table list and cause crashes
> >> when withdrawing PTE tables during split. PMD THPs don't have this
> >> problem because their deposited PTE tables don't have sub-deposits.
> >> Using only lru.next avoids the overlap entirely.
> >
> > Yeah this is horrendous and a hack, I don't consider this at all
> > upstreamable.
> >
> > You need to completely rework this.
>
> Hopefully [2] is the path forward!

Ack

> >
> >>
> >> For reverse mapping, PUD THPs need the same rmap support that PMD THPs
> >> have. The page_vma_mapped_walk() function is extended to recognize and
> >> handle PUD-mapped folios during rmap traversal. A new TTU_SPLIT_HUGE_PUD
> >> flag tells the unmap path to split PUD THPs before proceeding, since
> >> there is no PUD-level migration entry format - the split converts the
> >> single PUD mapping into individual PTE mappings that can be migrated
> >> or swapped normally.
> >
> > Individual PTE... mappings? You need to be a lot clearer here, page tables
> > are naturally confusing with entries vs. tables.
> >
> > Let's be VERY specific here. Do you mean you have 1 PMD table and 512 PTE
> > tables reserved, spanning 1 PUD entry and 262,144 PTE entries?
> >
>
> Yes that is correct, Thanks! I will change the commit message in the next revision
> to what you have written: 1 PMD table and 512 PTE tables reserved, spanning
> 1 PUD entry and 262,144 PTE entries.

Yeah :) my concerns remain :)

>
> >>
> >> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> >
> > How does this change interact with existing DAX/VFIO code, which now it
> > seems will be subject to the mechanisms you introduce here?
>
> I think what you mean here is the change in try_to_migrate_one?
>
>
> So one

Unfinished sentence? :P

No, I mean we currently support 1G THP for DAX/VFIO, right? So how does this
interact with how that currently works? Does it change how DAX/VFIO works?
Will it impact existing users?

Or are we extending the existing mechanism?

>
> >
> > Right now DAX/VFIO is only obtainable via a specially THP-aligned
> > get_unmapped_area() + then can only be obtained at fault time.
> > > Is that the intent here also?
> >
>
> Ah thanks for pointing this out. This is something the series is missing.
>
> What I did in the selftest and benchmark was fault on an address that was already aligned.
> i.e. basically call the below function before faulting in.
>
> static inline void *pud_align(void *addr)
> {
> 	return (void *)(((unsigned long)addr + PUD_SIZE - 1) & ~(PUD_SIZE - 1));
> }

Right yeah :)

>
>
> What I think you are suggesting this series is missing is the below diff? (its untested).
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 87b2c21df4a49..461158a0840db 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1236,6 +1236,12 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>         unsigned long ret;
>         loff_t off = (loff_t)pgoff << PAGE_SHIFT;
>
> +       if (IS_ENABLED(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) && len >= PUD_SIZE) {
> +               ret = __thp_get_unmapped_area(filp, addr, len, off, flags, PUD_SIZE, vm_flags);
> +               if (ret)
> +                       return ret;
> +       }

No, not that; that's going to cause issues, see commit d4148aeab4 for details
as to why this can go wrong.

Do it in __get_unmapped_area(), where the current 'if PMD size aligned then
align area' logic lives.
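
Very roughly something like the below (untested, and I'm quoting the
surrounding __get_unmapped_area() code from memory so the details may differ;
thp_get_unmapped_area_pud() is just a made-up name for a PUD-sized variant of
the existing helper):

	} else if (IS_ENABLED(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
		   && IS_ALIGNED(len, PUD_SIZE)) {
		/* Hypothetical PUD-sized twin of the PMD alignment path. */
		addr = thp_get_unmapped_area_pud(file, addr, len, pgoff,
						 flags, vm_flags);
	} else if (IS_ENABLED(CONFIG_TRANSPARENT_HUGEPAGE)
		   && IS_ALIGNED(len, PMD_SIZE)) {
		/* Existing: align larger anonymous mappings to PMD size. */
		addr = thp_get_unmapped_area_vmflags(file, addr, len, pgoff,
						     flags, vm_flags);
	}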

> +
>
>
> > What is your intent - that khugepaged do this, or on alloc? How does it
> > interact with MADV_COLLAPSE?
> >
>
> Ah basically what I mentioned in [3], we want to go slow. Only enable PUD THPs
> page faults at the start. If there is data supporting that khugepaged will work
> than we do it, but we keep it disabled.

Yes, I think khugepaged is probably never going to be all that good an idea
with this.

>
> > I noted on the 2nd patch, but you're changing THP_ORDERS_ALL_ANON which
> > alters __thp_vma_allowable_orders() behaviour, that change belongs here...
> >
> >
>
> Thanks for this! I only tried to split this code into logical commits
> after the whole thing was working. Some things are tightly coupled
> and I would need to move them to the right commit.

Yes, there are a bunch of things that need tweaking here; to reiterate, let's
try to pay down the technical debt and avoid copy/pasting :>)

>
> >> ---
> >>  include/linux/huge_mm.h  |  5 +++
> >>  include/linux/mm.h       | 19 ++++++++
> >>  include/linux/mm_types.h |  5 ++-
> >>  include/linux/pgtable.h  |  8 ++++
> >>  include/linux/rmap.h     |  7 ++-
> >>  mm/huge_memory.c         |  8 ++++
> >>  mm/internal.h            |  3 ++
> >>  mm/page_vma_mapped.c     | 35 +++++++++++++++
> >>  mm/pgtable-generic.c     | 83 ++++++++++++++++++++++++++++++++++
> >>  mm/rmap.c                | 96 +++++++++++++++++++++++++++++++++++++---
> >>  10 files changed, 260 insertions(+), 9 deletions(-)
> >>
> >> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> >> index a4d9f964dfdea..e672e45bb9cc7 100644
> >> --- a/include/linux/huge_mm.h
> >> +++ b/include/linux/huge_mm.h
> >> @@ -463,10 +463,15 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
> >>  		unsigned long address);
> >>
> >>  #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> >> +void split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
> >> +			   unsigned long address);
> >>  int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >>  		    pud_t *pudp, unsigned long addr, pgprot_t newprot,
> >>  		    unsigned long cp_flags);
> >>  #else
> >> +static inline void
> >> +split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
> >> +		      unsigned long address) {}
> >>  static inline int
> >>  change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
> >>  		pud_t *pudp, unsigned long addr, pgprot_t newprot,
> >> diff --git a/include/linux/mm.h b/include/linux/mm.h
> >> index ab2e7e30aef96..a15e18df0f771 100644
> >> --- a/include/linux/mm.h
> >> +++ b/include/linux/mm.h
> >> @@ -3455,6 +3455,22 @@ static inline bool pagetable_pmd_ctor(struct mm_struct *mm,
> >>   * considered ready to switch to split PUD locks yet; there may be places
> >>   * which need to be converted from page_table_lock.
> >>   */
> >> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> >> +static inline struct page *pud_pgtable_page(pud_t *pud)
> >> +{
> >> +	unsigned long mask = ~(PTRS_PER_PUD * sizeof(pud_t) - 1);
> >> +
> >> +	return virt_to_page((void *)((unsigned long)pud & mask));
> >> +}
> >> +
> >> +static inline struct ptdesc *pud_ptdesc(pud_t *pud)
> >> +{
> >> +	return page_ptdesc(pud_pgtable_page(pud));
> >> +}
> >> +
> >> +#define pud_huge_pmd(pud) (pud_ptdesc(pud)->pud_huge_pmd)
> >> +#endif
> >> +
> >>  static inline spinlock_t *pud_lockptr(struct mm_struct *mm, pud_t *pud)
> >>  {
> >>  	return &mm->page_table_lock;
> >> @@ -3471,6 +3487,9 @@ static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
> >>  static inline void pagetable_pud_ctor(struct ptdesc *ptdesc)
> >>  {
> >>  	__pagetable_ctor(ptdesc);
> >> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> >> +	ptdesc->pud_huge_pmd = NULL;
> >> +#endif
> >>  }
> >>
> >>  static inline void pagetable_p4d_ctor(struct ptdesc *ptdesc)
> >> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> >> index 78950eb8926dc..26a38490ae2e1 100644
> >> --- a/include/linux/mm_types.h
> >> +++ b/include/linux/mm_types.h
> >> @@ -577,7 +577,10 @@ struct ptdesc {
> >>  		struct list_head pt_list;
> >>  		struct {
> >>  			unsigned long _pt_pad_1;
> >> -			pgtable_t pmd_huge_pte;
> >> +			union {
> >> +				pgtable_t pmd_huge_pte;  /* For PMD tables: deposited PTE */
> >> +				pgtable_t pud_huge_pmd;  /* For PUD tables: deposited PMD list */
> >> +			};
> >>  		};
> >>  	};
> >>  	unsigned long __page_mapping;
> >> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> >> index 2f0dd3a4ace1a..3ce733c1d71a2 100644
> >> --- a/include/linux/pgtable.h
> >> +++ b/include/linux/pgtable.h
> >> @@ -1168,6 +1168,14 @@ extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
> >>  #define arch_needs_pgtable_deposit() (false)
> >>  #endif
> >>
> >> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> >> +extern void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
> >> +					   pmd_t *pmd_table);
> >> +extern pmd_t *pgtable_trans_huge_pud_withdraw(struct mm_struct *mm, pud_t *pudp);
> >> +extern void pud_deposit_pte(pmd_t *pmd_table, pgtable_t pgtable);
> >> +extern pgtable_t pud_withdraw_pte(pmd_t *pmd_table);
> >
> > These are useless extern's.
> >
>
>
> ack
>
> These are coming from the existing functions from the file:
> extern void pgtable_trans_huge_deposit
> extern pgtable_t pgtable_trans_huge_withdraw
>
> I think the externs can be removed from these as well? We can
> fix those in a separate patch.

Generally the approach is to remove externs when adding/changing new stuff as
otherwise we get completely useless churn on that and annoying git history
changes.

>
>
> >> +#endif
> >> +
> >>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> >>  /*
> >>   * This is an implementation of pmdp_establish() that is only suitable for an
> >> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> >> index daa92a58585d9..08cd0a0eb8763 100644
> >> --- a/include/linux/rmap.h
> >> +++ b/include/linux/rmap.h
> >> @@ -101,6 +101,7 @@ enum ttu_flags {
> >>  					 * do a final flush if necessary */
> >>  	TTU_RMAP_LOCKED		= 0x80,	/* do not grab rmap lock:
> >>  					 * caller holds it */
> >> +	TTU_SPLIT_HUGE_PUD	= 0x100, /* split huge PUD if any */
> >>  };
> >>
> >>  #ifdef CONFIG_MMU
> >> @@ -473,6 +474,8 @@ void folio_add_anon_rmap_ptes(struct folio *, struct page *, int nr_pages,
> >>  	folio_add_anon_rmap_ptes(folio, page, 1, vma, address, flags)
> >>  void folio_add_anon_rmap_pmd(struct folio *, struct page *,
> >>  		struct vm_area_struct *, unsigned long address, rmap_t flags);
> >> +void folio_add_anon_rmap_pud(struct folio *, struct page *,
> >> +		struct vm_area_struct *, unsigned long address, rmap_t flags);
> >>  void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
> >>  		unsigned long address, rmap_t flags);
> >>  void folio_add_file_rmap_ptes(struct folio *, struct page *, int nr_pages,
> >> @@ -933,6 +936,7 @@ struct page_vma_mapped_walk {
> >>  	pgoff_t pgoff;
> >>  	struct vm_area_struct *vma;
> >>  	unsigned long address;
> >> +	pud_t *pud;
> >>  	pmd_t *pmd;
> >>  	pte_t *pte;
> >>  	spinlock_t *ptl;
> >> @@ -970,7 +974,7 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
> >>  static inline void
> >>  page_vma_mapped_walk_restart(struct page_vma_mapped_walk *pvmw)
> >>  {
> >> -	WARN_ON_ONCE(!pvmw->pmd && !pvmw->pte);
> >> +	WARN_ON_ONCE(!pvmw->pud && !pvmw->pmd && !pvmw->pte);
> >>
> >>  	if (likely(pvmw->ptl))
> >>  		spin_unlock(pvmw->ptl);
> >> @@ -978,6 +982,7 @@ page_vma_mapped_walk_restart(struct page_vma_mapped_walk *pvmw)
> >>  		WARN_ON_ONCE(1);
> >>
> >>  	pvmw->ptl = NULL;
> >> +	pvmw->pud = NULL;
> >>  	pvmw->pmd = NULL;
> >>  	pvmw->pte = NULL;
> >>  }
> >> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> >> index 40cf59301c21a..3128b3beedb0a 100644
> >> --- a/mm/huge_memory.c
> >> +++ b/mm/huge_memory.c
> >> @@ -2933,6 +2933,14 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
> >>  	spin_unlock(ptl);
> >>  	mmu_notifier_invalidate_range_end(&range);
> >>  }
> >> +
> >> +void split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
> >> +			   unsigned long address)
> >> +{
> >> +	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PUD_SIZE));
> >> +	if (pud_trans_huge(*pud))
> >> +		__split_huge_pud_locked(vma, pud, address);
> >> +}
> >>  #else
> >>  void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
> >>  		unsigned long address)
> >> diff --git a/mm/internal.h b/mm/internal.h
> >> index 9ee336aa03656..21d5c00f638dc 100644
> >> --- a/mm/internal.h
> >> +++ b/mm/internal.h
> >> @@ -545,6 +545,9 @@ int user_proactive_reclaim(char *buf,
> >>   * in mm/rmap.c:
> >>   */
> >>  pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
> >> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> >> +pud_t *mm_find_pud(struct mm_struct *mm, unsigned long address);
> >> +#endif
> >>
> >>  /*
> >>   * in mm/page_alloc.c
> >> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> >> index b38a1d00c971b..d31eafba38041 100644
> >> --- a/mm/page_vma_mapped.c
> >> +++ b/mm/page_vma_mapped.c
> >> @@ -146,6 +146,18 @@ static bool check_pmd(unsigned long pfn, struct page_vma_mapped_walk *pvmw)
> >>  	return true;
> >>  }
> >>
> >> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> >> +/* Returns true if the two ranges overlap.  Careful to not overflow. */
> >> +static bool check_pud(unsigned long pfn, struct page_vma_mapped_walk *pvmw)
> >> +{
> >> +	if ((pfn + HPAGE_PUD_NR - 1) < pvmw->pfn)
> >> +		return false;
> >> +	if (pfn > pvmw->pfn + pvmw->nr_pages - 1)
> >> +		return false;
> >> +	return true;
> >> +}
> >> +#endif
> >> +
> >>  static void step_forward(struct page_vma_mapped_walk *pvmw, unsigned long size)
> >>  {
> >>  	pvmw->address = (pvmw->address + size) & ~(size - 1);
> >> @@ -188,6 +200,10 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> >>  	pud_t *pud;
> >>  	pmd_t pmde;
> >>
> >> +	/* The only possible pud mapping has been handled on last iteration */
> >> +	if (pvmw->pud && !pvmw->pmd)
> >> +		return not_found(pvmw);
> >> +
> >>  	/* The only possible pmd mapping has been handled on last iteration */
> >>  	if (pvmw->pmd && !pvmw->pte)
> >>  		return not_found(pvmw);
> >> @@ -234,6 +250,25 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
> >>  			continue;
> >>  		}
> >>
> >> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> >
> > Said it elsewhere, but it's really weird to treat an arch having the
> > ability to do something as a go ahead for doing it.
> >
> >> +		/* Check for PUD-mapped THP */
> >> +		if (pud_trans_huge(*pud)) {
> >> +			pvmw->pud = pud;
> >> +			pvmw->ptl = pud_lock(mm, pud);
> >> +			if (likely(pud_trans_huge(*pud))) {
> >> +				if (pvmw->flags & PVMW_MIGRATION)
> >> +					return not_found(pvmw);
> >> +				if (!check_pud(pud_pfn(*pud), pvmw))
> >> +					return not_found(pvmw);
> >> +				return true;
> >> +			}
> >> +			/* PUD was split under us, retry at PMD level */
> >> +			spin_unlock(pvmw->ptl);
> >> +			pvmw->ptl = NULL;
> >> +			pvmw->pud = NULL;
> >> +		}
> >> +#endif
> >> +
> >
> > Yeah, as I said elsewhere, we got to be refactoring not copy/pasting with
> > modifications :)
> >
>
> Yeah there is repeated code in multiple places, where all I did was replace
> what was done from PMD into PUD. In a lot of places, its actually difficult
> to not repeat the code (unless we want function macros, which is much worse
> IMO).

Not if we actually refactor the existing code :)

When I wanted to make functional changes to mremap I took a lot of time to
refactor the code into something sane before even starting that.

Because I _could_ have added the features there as-is, but it would have been
hellish to do so and would have added more confusion etc.

So yeah, I think a similar mentality has to be had with this change.
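
E.g. for the check_pmd()/check_pud() pair in this patch, something like the
below (untested) avoids the second copy entirely:

/* Returns true if [pfn, pfn + nr) overlaps the walk's pfn range. */
static bool check_pfn_range(unsigned long pfn, unsigned long nr,
			    struct page_vma_mapped_walk *pvmw)
{
	if (pfn + nr - 1 < pvmw->pfn)
		return false;
	if (pfn > pvmw->pfn + pvmw->nr_pages - 1)
		return false;
	return true;
}

/*
 * check_pmd(pfn, pvmw) becomes check_pfn_range(pfn, HPAGE_PMD_NR, pvmw),
 * check_pud(pfn, pvmw) becomes check_pfn_range(pfn, HPAGE_PUD_NR, pvmw).
 */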

>
> >
> >>  		pvmw->pmd = pmd_offset(pud, pvmw->address);
> >>  		/*
> >>  		 * Make sure the pmd value isn't cached in a register by the
> >> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> >> index d3aec7a9926ad..2047558ddcd79 100644
> >> --- a/mm/pgtable-generic.c
> >> +++ b/mm/pgtable-generic.c
> >> @@ -195,6 +195,89 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
> >>  }
> >>  #endif
> >>
> >> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> >> +/*
> >> + * Deposit page tables for PUD THP.
> >> + * Called with PUD lock held. Stores PMD tables in a singly-linked stack
> >> + * via pud_huge_pmd, using only pmd_page->lru.next as the link pointer.
> >> + *
> >> + * IMPORTANT: We use only lru.next (offset 8) for linking, NOT the full
> >> + * list_head. This is because lru.prev (offset 16) overlaps with
> >> + * ptdesc->pmd_huge_pte, which stores the PMD table's deposited PTE tables.
> >> + * Using list_del() would corrupt pmd_huge_pte with LIST_POISON2.
> >
> > This is horrible and feels like a hack? Treating a doubly-linked list as a
> > singly-linked one like this is not upstreamable.
> >
> >> + *
> >> + * PTE tables should be deposited into the PMD using pud_deposit_pte().
> >> + */
> >> +void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
> >> +				    pmd_t *pmd_table)
> >
> > This is a horrid, you're depositing the PMD using the... questionable
> > list_head abuse, but then also have pud_deposit_pte()... But here we're
> > depositing a PMD shouldn't the name reflect that?
> >
> >> +{
> >> +	pgtable_t pmd_page = virt_to_page(pmd_table);
> >> +
> >> +	assert_spin_locked(pud_lockptr(mm, pudp));
> >> +
> >> +	/* Push onto stack using only lru.next as the link */
> >> +	pmd_page->lru.next = (struct list_head *)pud_huge_pmd(pudp);
> >
> > Yikes...
> >
> >> +	pud_huge_pmd(pudp) = pmd_page;
> >> +}
> >> +
> >> +/*
> >> + * Withdraw the deposited PMD table for PUD THP split or zap.
> >> + * Called with PUD lock held.
> >> + * Returns NULL if no more PMD tables are deposited.
> >> + */
> >> +pmd_t *pgtable_trans_huge_pud_withdraw(struct mm_struct *mm, pud_t *pudp)
> >> +{
> >> +	pgtable_t pmd_page;
> >> +
> >> +	assert_spin_locked(pud_lockptr(mm, pudp));
> >> +
> >> +	pmd_page = pud_huge_pmd(pudp);
> >> +	if (!pmd_page)
> >> +		return NULL;
> >> +
> >> +	/* Pop from stack - lru.next points to next PMD page (or NULL) */
> >> +	pud_huge_pmd(pudp) = (pgtable_t)pmd_page->lru.next;
> >
> > Where's the popping? You're just assigning here.
>
>
> Ack on all of the above. Hopefully [1] is better.

Thanks!

> >
> >> +
> >> +	return page_address(pmd_page);
> >> +}
> >> +
> >> +/*
> >> + * Deposit a PTE table into a standalone PMD table (not yet in page table hierarchy).
> >> + * Used for PUD THP pre-deposit. The PMD table's pmd_huge_pte stores a linked list.
> >> + * No lock assertion since the PMD isn't visible yet.
> >> + */
> >> +void pud_deposit_pte(pmd_t *pmd_table, pgtable_t pgtable)
> >> +{
> >> +	struct ptdesc *ptdesc = virt_to_ptdesc(pmd_table);
> >> +
> >> +	/* FIFO - add to front of list */
> >> +	if (!ptdesc->pmd_huge_pte)
> >> +		INIT_LIST_HEAD(&pgtable->lru);
> >> +	else
> >> +		list_add(&pgtable->lru, &ptdesc->pmd_huge_pte->lru);
> >> +	ptdesc->pmd_huge_pte = pgtable;
> >> +}
> >> +
> >> +/*
> >> + * Withdraw a PTE table from a standalone PMD table.
> >> + * Returns NULL if no more PTE tables are deposited.
> >> + */
> >> +pgtable_t pud_withdraw_pte(pmd_t *pmd_table)
> >> +{
> >> +	struct ptdesc *ptdesc = virt_to_ptdesc(pmd_table);
> >> +	pgtable_t pgtable;
> >> +
> >> +	pgtable = ptdesc->pmd_huge_pte;
> >> +	if (!pgtable)
> >> +		return NULL;
> >> +	ptdesc->pmd_huge_pte = list_first_entry_or_null(&pgtable->lru,
> >> +							struct page, lru);
> >> +	if (ptdesc->pmd_huge_pte)
> >> +		list_del(&pgtable->lru);
> >> +	return pgtable;
> >> +}
> >> +#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
> >> +
> >>  #ifndef __HAVE_ARCH_PMDP_INVALIDATE
> >>  pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
> >>  		     pmd_t *pmdp)
> >> diff --git a/mm/rmap.c b/mm/rmap.c
> >> index 7b9879ef442d9..69acabd763da4 100644
> >> --- a/mm/rmap.c
> >> +++ b/mm/rmap.c
> >> @@ -811,6 +811,32 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
> >>  	return pmd;
> >>  }
> >>
> >> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> >> +/*
> >> + * Returns the actual pud_t* where we expect 'address' to be mapped from, or
> >> + * NULL if it doesn't exist.  No guarantees / checks on what the pud_t*
> >> + * represents.
> >> + */
> >> +pud_t *mm_find_pud(struct mm_struct *mm, unsigned long address)
> >
> > This series seems to be full of copy/paste.
> >
> > It's just not acceptable given the state of THP code as I said in reply to
> > the cover letter - you need to _refactor_ the code.
> >
> > The code is bug-prone and difficult to maintain as-is, your series has to
> > improve the technical debt, not add to it.
> >
>
> In some cases we might not be able to avoid the copy, but this is definitely
> a place where we dont need to. I will change here. Thanks!

I disagree, see above :) But thanks on this one

>
> >> +{
> >> +	pgd_t *pgd;
> >> +	p4d_t *p4d;
> >> +	pud_t *pud = NULL;
> >> +
> >> +	pgd = pgd_offset(mm, address);
> >> +	if (!pgd_present(*pgd))
> >> +		goto out;
> >> +
> >> +	p4d = p4d_offset(pgd, address);
> >> +	if (!p4d_present(*p4d))
> >> +		goto out;
> >> +
> >> +	pud = pud_offset(p4d, address);
> >> +out:
> >> +	return pud;
> >> +}
> >> +#endif
> >> +
> >>  struct folio_referenced_arg {
> >>  	int mapcount;
> >>  	int referenced;
> >> @@ -1415,11 +1441,7 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
> >>  			SetPageAnonExclusive(page);
> >>  			break;
> >>  		case PGTABLE_LEVEL_PUD:
> >> -			/*
> >> -			 * Keep the compiler happy, we don't support anonymous
> >> -			 * PUD mappings.
> >> -			 */
> >> -			WARN_ON_ONCE(1);
> >> +			SetPageAnonExclusive(page);
> >>  			break;
> >>  		default:
> >>  			BUILD_BUG();
> >> @@ -1503,6 +1525,31 @@ void folio_add_anon_rmap_pmd(struct folio *folio, struct page *page,
> >>  #endif
> >>  }
> >>
> >> +/**
> >> + * folio_add_anon_rmap_pud - add a PUD mapping to a page range of an anon folio
> >> + * @folio:	The folio to add the mapping to
> >> + * @page:	The first page to add
> >> + * @vma:	The vm area in which the mapping is added
> >> + * @address:	The user virtual address of the first page to map
> >> + * @flags:	The rmap flags
> >> + *
> >> + * The page range of folio is defined by [first_page, first_page + HPAGE_PUD_NR)
> >> + *
> >> + * The caller needs to hold the page table lock, and the page must be locked in
> >> + * the anon_vma case: to serialize mapping,index checking after setting.
> >> + */
> >> +void folio_add_anon_rmap_pud(struct folio *folio, struct page *page,
> >> +		struct vm_area_struct *vma, unsigned long address, rmap_t flags)
> >> +{
> >> +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
> >> +	defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
> >> +	__folio_add_anon_rmap(folio, page, HPAGE_PUD_NR, vma, address, flags,
> >> +			      PGTABLE_LEVEL_PUD);
> >> +#else
> >> +	WARN_ON_ONCE(true);
> >> +#endif
> >> +}
> >
> > More copy/paste... Maybe unavoidable in this case, but be good to try.
> >
> >> +
> >>  /**
> >>   * folio_add_new_anon_rmap - Add mapping to a new anonymous folio.
> >>   * @folio:	The folio to add the mapping to.
> >> @@ -1934,6 +1981,20 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >>  		}
> >>
> >>  		if (!pvmw.pte) {
> >> +			/*
> >> +			 * Check for PUD-mapped THP first.
> >> +			 * If we have a PUD mapping and TTU_SPLIT_HUGE_PUD is set,
> >> +			 * split the PUD to PMD level and restart the walk.
> >> +			 */
> >
> > This is literally describing the code below, it's not useful.
>
> Ack, Will remove this comment, Thanks!

Thanks

> >
> >> +			if (pvmw.pud && pud_trans_huge(*pvmw.pud)) {
> >> +				if (flags & TTU_SPLIT_HUGE_PUD) {
> >> +					split_huge_pud_locked(vma, pvmw.pud, pvmw.address);
> >> +					flags &= ~TTU_SPLIT_HUGE_PUD;
> >> +					page_vma_mapped_walk_restart(&pvmw);
> >> +					continue;
> >> +				}
> >> +			}
> >> +
> >>  			if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
> >>  				if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
> >>  					goto walk_done;
> >> @@ -2325,6 +2386,27 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
> >>  	mmu_notifier_invalidate_range_start(&range);
> >>
> >>  	while (page_vma_mapped_walk(&pvmw)) {
> >> +		/* Handle PUD-mapped THP first */
> >
> > How did/will this interact with DAX, VFIO PUD THP?
>
> It wont interact with DAX. try_to_migrate does the below and just returns:
>
> 	if (folio_is_zone_device(folio) &&
> 	    (!folio_is_device_private(folio) && !folio_is_device_coherent(folio)))
> 		return;
>
> so DAX would never reach here.

Hmm folio_is_zone_device() always returns true for DAX?

Also that's just one rmap call right?

>
> I think vfio pages are pinned and therefore cant be migrated? (I have
> not looked at vfio code, I will try to get a better understanding tomorrow,
> but please let me know if that sounds wrong.)

OK, I've not dug into this either, so please do check, and it would be really
good to test this code against actual DAX/VFIO scenarios if you can find a way
to do that, thanks!

>
>
> >
> >> +		if (!pvmw.pte && !pvmw.pmd) {
> >> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> >
> > Won't pud_trans_huge() imply this...
> >
>
> Agreed, I think it should cover it.

Thanks!

>
> >> +			/*
> >> +			 * PUD-mapped THP: skip migration to preserve the huge
> >> +			 * page. Splitting would defeat the purpose of PUD THPs.
> >> +			 * Return false to indicate migration failure, which
> >> +			 * will cause alloc_contig_range() to try a different
> >> +			 * memory region.
> >> +			 */
> >> +			if (pvmw.pud && pud_trans_huge(*pvmw.pud)) {
> >> +				page_vma_mapped_walk_done(&pvmw);
> >> +				ret = false;
> >> +				break;
> >> +			}
> >> +#endif
> >> +			/* Unexpected state: !pte && !pmd but not a PUD THP */
> >> +			page_vma_mapped_walk_done(&pvmw);
> >> +			break;
> >> +		}
> >> +
> >>  		/* PMD-mapped THP migration entry */
> >>  		if (!pvmw.pte) {
> >>  			__maybe_unused unsigned long pfn;
> >> @@ -2607,10 +2689,10 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
> >>
> >>  	/*
> >>  	 * Migration always ignores mlock and only supports TTU_RMAP_LOCKED and
> >> -	 * TTU_SPLIT_HUGE_PMD, TTU_SYNC, and TTU_BATCH_FLUSH flags.
> >> +	 * TTU_SPLIT_HUGE_PMD, TTU_SPLIT_HUGE_PUD, TTU_SYNC, and TTU_BATCH_FLUSH flags.
> >>  	 */
> >>  	if (WARN_ON_ONCE(flags & ~(TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD |
> >> -					TTU_SYNC | TTU_BATCH_FLUSH)))
> >> +					TTU_SPLIT_HUGE_PUD | TTU_SYNC | TTU_BATCH_FLUSH)))
> >>  		return;
> >>
> >>  	if (folio_is_zone_device(folio) &&
> >> --
> >> 2.47.3
> >>
> >
> > This isn't a final review, I'll have to look more thoroughly through here
> > over time and you're going to have to be patient in general :)
> >
> > Cheers, Lorenzo
>
>
> Thanks for the review, this is awesome!

Ack, will do more when I have time, and obviously you're getting a lot of input
from others too.

Be good to get a summary at next THP cabal ;)

>
>
> [1] https://lore.kernel.org/all/20f92576-e932-435f-bb7b-de49eb84b012@gmail.com/
> [2] https://lore.kernel.org/all/05d5918f-b61b-4091-b8c6-20eebfffc3c4@gmail.com/
> [3] https://lore.kernel.org/all/2efaa5ed-bd09-41f0-9c07-5cd6cccc4595@gmail.com/
>
>
>

cheers, Lorenzo
Re: [RFC 01/12] mm: add PUD THP ptdesc and rmap support
Posted by Usama Arif 2 days ago

On 04/02/2026 04:55, Lorenzo Stoakes wrote:
> On Tue, Feb 03, 2026 at 11:38:02PM -0800, Usama Arif wrote:
>>
>>
>> On 02/02/2026 04:15, Lorenzo Stoakes wrote:
>>> I think I'm going to have to do several passes on this, so this is just a
>>> first one :)
>>>
>>
>> Thanks! Really appreciate the reviews!
> 
> No worries!
> 
>>
>> One thing over here is the higher level design decision when it comes to migration
>> of 1G pages. As Zi said in [1]:
>> "I also wonder what the purpose of PUD THP migration can be.
>> It does not create memory fragmentation, since it is the largest folio size
>> we have and contiguous. NUMA balancing 1GB THP seems too much work."
>>
>>> On Sun, Feb 01, 2026 at 04:50:18PM -0800, Usama Arif wrote:
>>>> For page table management, PUD THPs need to pre-deposit page tables
>>>> that will be used when the huge page is later split. When a PUD THP
>>>> is allocated, we cannot know in advance when or why it might need to
>>>> be split (COW, partial unmap, reclaim), but we need page tables ready
>>>> for that eventuality. Similar to how PMD THPs deposit a single PTE
>>>> table, PUD THPs deposit a PMD table which itself contains deposited
>>>> PTE tables - a two-level deposit. This commit adds the deposit/withdraw
>>>> infrastructure and a new pud_huge_pmd field in ptdesc to store the
>>>> deposited PMD.
>>>
>>> This feels like you're hacking this support in, honestly. The list_head
>>> abuse only adds to that feeling.
>>>
>>
>> Yeah so I hope turning it to something like [2] is the way forward.
> 
> Right, that's one option, though David suggested avoiding this altogether by
> only pre-allocating PTEs?

Maybe I don't understand it properly, but that won't work, right?

You need 1 PMD table and 512 PTE tables to split a PUD; you can't just have
PTE tables, right?
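
Spelling out the numbers (assuming 4K pages and 512 entries per table, as on
x86-64), the pre-reservation per 1G PUD THP works out to:

	1 PMD table               =  4 KiB
	512 PTE tables * 4 KiB    =  2 MiB
	----------------------------------
	total per 1G PUD THP      =  2 MiB + 4 KiB  (~0.2% of the mapping)

which is also where the ~200M for 100x1G THPs quoted further down comes from.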


> 
>>
>>> And are we now not required to store rather a lot of memory to keep all of
>>> this coherent?
>>
>> PMD THP allocates 1 4K page (pte_alloc_one) at fault time so that split
>> doesnt fail.
>>
>> For PUD we allocate 2M worth of PTE page tables and 1 4K PMD table at fault
>> time so that split doesnt fail due to there not being enough memory.
>> Its not great, but its not bad as well.
>> The alternative is to allocate this at split time and so we are not
>> pre-reserving them. Now there is a chance that allocation and therefore split
>> fails, so the tradeoff is some memory vs reliability. This patch favours
>> reliability.
> 
> That's a significant amount of unmovable, unreclaimable memory though. Going
> from 4K to 2M is a pretty huge uptick.
> 

Yeah, I don't like it either, but it's the only way to make sure the split
doesn't fail. I think there will always be a tradeoff between reliability and
memory.

>>
>> Lets say a user gets 100x1G THPs. They would end up using ~200M for it.
>> I think that is okish. If the user has 100G, 200M might not be an issue
>> for them :)
> 
> But there's more than one user on boxes big enough for this, so this makes me
> think we want this to be somehow opt-in right?
>

Do you mean madvise?

Another idea (although probably a very bad one :)) is to have MADV_HUGEPAGE_1G.

> And that means we're incurring an unmovable memory penalty, the kind which we're
> trying to avoid in general elsewhere in the kernel.
> 

ack.

>>
>>>
>>>>
>>>> The deposited PMD tables are stored as a singly-linked stack using only
>>>> page->lru.next as the link pointer. A doubly-linked list using the
>>>> standard list_head mechanism would cause memory corruption: list_del()
>>>> poisons both lru.next (offset 8) and lru.prev (offset 16), but lru.prev
>>>> overlaps with ptdesc->pmd_huge_pte at offset 16. Since deposited PMD
>>>> tables have their own deposited PTE tables stored in pmd_huge_pte,
>>>> poisoning lru.prev would corrupt the PTE table list and cause crashes
>>>> when withdrawing PTE tables during split. PMD THPs don't have this
>>>> problem because their deposited PTE tables don't have sub-deposits.
>>>> Using only lru.next avoids the overlap entirely.
>>>
>>> Yeah this is horrendous and a hack, I don't consider this at all
>>> upstreamable.
>>>
>>> You need to completely rework this.
>>
>> Hopefully [2] is the path forward!
> 
> Ack
> 
>>>
>>>>
>>>> For reverse mapping, PUD THPs need the same rmap support that PMD THPs
>>>> have. The page_vma_mapped_walk() function is extended to recognize and
>>>> handle PUD-mapped folios during rmap traversal. A new TTU_SPLIT_HUGE_PUD
>>>> flag tells the unmap path to split PUD THPs before proceeding, since
>>>> there is no PUD-level migration entry format - the split converts the
>>>> single PUD mapping into individual PTE mappings that can be migrated
>>>> or swapped normally.
>>>
>>> Individual PTE... mappings? You need to be a lot clearer here, page tables
>>> are naturally confusing with entries vs. tables.
>>>
>>> Let's be VERY specific here. Do you mean you have 1 PMD table and 512 PTE
>>> tables reserved, spanning 1 PUD entry and 262,144 PTE entries?
>>>
>>
>> Yes that is correct, Thanks! I will change the commit message in the next revision
>> to what you have written: 1 PMD table and 512 PTE tables reserved, spanning
>> 1 PUD entry and 262,144 PTE entries.
> 
> Yeah :) my concerns remain :)
> 
>>
>>>>
>>>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>>>
>>> How does this change interact with existing DAX/VFIO code, which now it
>>> seems will be subject to the mechanisms you introduce here?
>>
>> I think what you mean here is the change in try_to_migrate_one?
>>
>>
>> So one
> 
> Unfinished sentence? :P
> 
> No I mean currently we support 1G THP for DAX/VFIO right? So how does this
> interplay with how that currently works? Does that change how DAX/VFIO works?
> Will that impact existing users?
> 
> Or are we extending the existing mechanism?
> 

A lot of the stuff like copy_huge_pud(), zap_huge_pud(), __split_huge_pud_locked(),
create_huge_pud(), wp_huge_pud()... is protected by a vma_is_anonymous() check.
I will try to do a better audit of DAX and VFIO.
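
For reference, the kind of separation I mean is the existing pattern in the
PUD fault path (paraphrasing mainline from memory, so the details may be
slightly off): DAX/VFIO get their PUD mappings via ->huge_fault(), while
anonymous VMAs are kept apart behind vma_is_anonymous():

static vm_fault_t create_huge_pud(struct vm_fault *vmf)
{
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
	defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
	struct vm_area_struct *vma = vmf->vma;

	/* Mainline: no support for anonymous transparent PUD pages yet. */
	if (vma_is_anonymous(vma))
		return VM_FAULT_FALLBACK;
	/* DAX/VFIO style mappings come in through here. */
	if (vma->vm_ops->huge_fault)
		return vma->vm_ops->huge_fault(vmf, PUD_ORDER);
#endif
	return VM_FAULT_FALLBACK;
}

So the DAX/VFIO side should keep going through the same code it does today; I
will confirm that as part of the audit.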

>>
>>>
>>> Right now DAX/VFIO is only obtainable via a specially THP-aligned
>>> get_unmapped_area() + then can only be obtained at fault time.
>>>> Is that the intent here also?
>>>
>>
>> Ah thanks for pointing this out. This is something the series is missing.
>>
>> What I did in the selftest and benchmark was fault on an address that was already aligned.
>> i.e. basically call the below function before faulting in.
>>
>> static inline void *pud_align(void *addr)
>> {
>> 	return (void *)(((unsigned long)addr + PUD_SIZE - 1) & ~(PUD_SIZE - 1));
>> }
> 
> Right yeah :)
> 
>>
>>
>> What I think you are suggesting this series is missing is the below diff? (its untested).
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 87b2c21df4a49..461158a0840db 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1236,6 +1236,12 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>         unsigned long ret;
>>         loff_t off = (loff_t)pgoff << PAGE_SHIFT;
>>
>> +       if (IS_ENABLED(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD) && len >= PUD_SIZE) {
>> +               ret = __thp_get_unmapped_area(filp, addr, len, off, flags, PUD_SIZE, vm_flags);
>> +               if (ret)
>> +                       return ret;
>> +       }
> 
> No not that, that's going to cause issues, see commit d4148aeab4 for details as
> to why this can go wrong.
> 
> In __get_unmapped_area() where the current 'if PMD size aligned then align area'
> logic, like that.


Ack, thanks for pointing me to this. I will also add another selftest to check
that we actually get this alignment from __get_unmapped_area() when we don't do
the pud_align() trick I currently have in the selftests.
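
Something along these lines for the additional selftest (untested sketch,
assuming x86-64's 1G PUD size; the real test would use the kselftest harness
and also check via smaps that a PUD THP actually got faulted in):

#include <stdio.h>
#include <sys/mman.h>

#define PUD_SIZE	(1UL << 30)

int main(void)
{
	/*
	 * Ask for more than PUD_SIZE with no address hint and no manual
	 * alignment; if __get_unmapped_area() does the alignment for us,
	 * the returned address should already be PUD-aligned.
	 */
	size_t len = 2UL * PUD_SIZE;
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	printf("addr %p is %sPUD aligned\n", p,
	       ((unsigned long)p & (PUD_SIZE - 1)) ? "not " : "");

	munmap(p, len);
	return 0;
}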

> 
>> +
>>
>>
>>> What is your intent - that khugepaged do this, or on alloc? How does it
>>> interact with MADV_COLLAPSE?
>>>
>>
>> Ah basically what I mentioned in [3], we want to go slow. Only enable PUD THPs
>> page faults at the start. If there is data supporting that khugepaged will work
>> than we do it, but we keep it disabled.
> 
> Yes I think khugepaged is probably never going to be all that a good idea with
> this.
> 
>>
>>> I noted on the 2nd patch, but you're changing THP_ORDERS_ALL_ANON which
>>> alters __thp_vma_allowable_orders() behaviour, that change belongs here...
>>>
>>>
>>
>> Thanks for this! I only tried to split this code into logical commits
>> after the whole thing was working. Some things are tightly coupled
>> and I would need to move them to the right commit.
> 
> Yes there's a bunch of things that need tweaking here, to reiterate let's try to
> pay down technical debt here and avoid copy/pasting :>)

Yes, will spend a lot more time thinking about how to avoid copy/paste.


> 
>>
>>>> ---
>>>>  include/linux/huge_mm.h  |  5 +++
>>>>  include/linux/mm.h       | 19 ++++++++
>>>>  include/linux/mm_types.h |  5 ++-
>>>>  include/linux/pgtable.h  |  8 ++++
>>>>  include/linux/rmap.h     |  7 ++-
>>>>  mm/huge_memory.c         |  8 ++++
>>>>  mm/internal.h            |  3 ++
>>>>  mm/page_vma_mapped.c     | 35 +++++++++++++++
>>>>  mm/pgtable-generic.c     | 83 ++++++++++++++++++++++++++++++++++
>>>>  mm/rmap.c                | 96 +++++++++++++++++++++++++++++++++++++---
>>>>  10 files changed, 260 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>> index a4d9f964dfdea..e672e45bb9cc7 100644
>>>> --- a/include/linux/huge_mm.h
>>>> +++ b/include/linux/huge_mm.h
>>>> @@ -463,10 +463,15 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>>>>  		unsigned long address);
>>>>
>>>>  #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>>> +void split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
>>>> +			   unsigned long address);
>>>>  int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>>>  		    pud_t *pudp, unsigned long addr, pgprot_t newprot,
>>>>  		    unsigned long cp_flags);
>>>>  #else
>>>> +static inline void
>>>> +split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
>>>> +		      unsigned long address) {}
>>>>  static inline int
>>>>  change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
>>>>  		pud_t *pudp, unsigned long addr, pgprot_t newprot,
>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>> index ab2e7e30aef96..a15e18df0f771 100644
>>>> --- a/include/linux/mm.h
>>>> +++ b/include/linux/mm.h
>>>> @@ -3455,6 +3455,22 @@ static inline bool pagetable_pmd_ctor(struct mm_struct *mm,
>>>>   * considered ready to switch to split PUD locks yet; there may be places
>>>>   * which need to be converted from page_table_lock.
>>>>   */
>>>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>>> +static inline struct page *pud_pgtable_page(pud_t *pud)
>>>> +{
>>>> +	unsigned long mask = ~(PTRS_PER_PUD * sizeof(pud_t) - 1);
>>>> +
>>>> +	return virt_to_page((void *)((unsigned long)pud & mask));
>>>> +}
>>>> +
>>>> +static inline struct ptdesc *pud_ptdesc(pud_t *pud)
>>>> +{
>>>> +	return page_ptdesc(pud_pgtable_page(pud));
>>>> +}
>>>> +
>>>> +#define pud_huge_pmd(pud) (pud_ptdesc(pud)->pud_huge_pmd)
>>>> +#endif
>>>> +
>>>>  static inline spinlock_t *pud_lockptr(struct mm_struct *mm, pud_t *pud)
>>>>  {
>>>>  	return &mm->page_table_lock;
>>>> @@ -3471,6 +3487,9 @@ static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
>>>>  static inline void pagetable_pud_ctor(struct ptdesc *ptdesc)
>>>>  {
>>>>  	__pagetable_ctor(ptdesc);
>>>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>>> +	ptdesc->pud_huge_pmd = NULL;
>>>> +#endif
>>>>  }
>>>>
>>>>  static inline void pagetable_p4d_ctor(struct ptdesc *ptdesc)
>>>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
>>>> index 78950eb8926dc..26a38490ae2e1 100644
>>>> --- a/include/linux/mm_types.h
>>>> +++ b/include/linux/mm_types.h
>>>> @@ -577,7 +577,10 @@ struct ptdesc {
>>>>  		struct list_head pt_list;
>>>>  		struct {
>>>>  			unsigned long _pt_pad_1;
>>>> -			pgtable_t pmd_huge_pte;
>>>> +			union {
>>>> +				pgtable_t pmd_huge_pte;  /* For PMD tables: deposited PTE */
>>>> +				pgtable_t pud_huge_pmd;  /* For PUD tables: deposited PMD list */
>>>> +			};
>>>>  		};
>>>>  	};
>>>>  	unsigned long __page_mapping;
>>>> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
>>>> index 2f0dd3a4ace1a..3ce733c1d71a2 100644
>>>> --- a/include/linux/pgtable.h
>>>> +++ b/include/linux/pgtable.h
>>>> @@ -1168,6 +1168,14 @@ extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
>>>>  #define arch_needs_pgtable_deposit() (false)
>>>>  #endif
>>>>
>>>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>>> +extern void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
>>>> +					   pmd_t *pmd_table);
>>>> +extern pmd_t *pgtable_trans_huge_pud_withdraw(struct mm_struct *mm, pud_t *pudp);
>>>> +extern void pud_deposit_pte(pmd_t *pmd_table, pgtable_t pgtable);
>>>> +extern pgtable_t pud_withdraw_pte(pmd_t *pmd_table);
>>>
>>> These are useless extern's.
>>>
>>
>>
>> ack
>>
>> These are coming from the existing functions from the file:
>> extern void pgtable_trans_huge_deposit
>> extern pgtable_t pgtable_trans_huge_withdraw
>>
>> I think the externs can be removed from these as well? We can
>> fix those in a separate patch.
> 
> Generally the approach is to remove externs when adding/changing new stuff as
> otherwise we get completely useless churn on that and annoying git history
> changes.

Ack

>>
>>
>>>> +#endif
>>>> +
>>>>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>  /*
>>>>   * This is an implementation of pmdp_establish() that is only suitable for an
>>>> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
>>>> index daa92a58585d9..08cd0a0eb8763 100644
>>>> --- a/include/linux/rmap.h
>>>> +++ b/include/linux/rmap.h
>>>> @@ -101,6 +101,7 @@ enum ttu_flags {
>>>>  					 * do a final flush if necessary */
>>>>  	TTU_RMAP_LOCKED		= 0x80,	/* do not grab rmap lock:
>>>>  					 * caller holds it */
>>>> +	TTU_SPLIT_HUGE_PUD	= 0x100, /* split huge PUD if any */
>>>>  };
>>>>
>>>>  #ifdef CONFIG_MMU
>>>> @@ -473,6 +474,8 @@ void folio_add_anon_rmap_ptes(struct folio *, struct page *, int nr_pages,
>>>>  	folio_add_anon_rmap_ptes(folio, page, 1, vma, address, flags)
>>>>  void folio_add_anon_rmap_pmd(struct folio *, struct page *,
>>>>  		struct vm_area_struct *, unsigned long address, rmap_t flags);
>>>> +void folio_add_anon_rmap_pud(struct folio *, struct page *,
>>>> +		struct vm_area_struct *, unsigned long address, rmap_t flags);
>>>>  void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
>>>>  		unsigned long address, rmap_t flags);
>>>>  void folio_add_file_rmap_ptes(struct folio *, struct page *, int nr_pages,
>>>> @@ -933,6 +936,7 @@ struct page_vma_mapped_walk {
>>>>  	pgoff_t pgoff;
>>>>  	struct vm_area_struct *vma;
>>>>  	unsigned long address;
>>>> +	pud_t *pud;
>>>>  	pmd_t *pmd;
>>>>  	pte_t *pte;
>>>>  	spinlock_t *ptl;
>>>> @@ -970,7 +974,7 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
>>>>  static inline void
>>>>  page_vma_mapped_walk_restart(struct page_vma_mapped_walk *pvmw)
>>>>  {
>>>> -	WARN_ON_ONCE(!pvmw->pmd && !pvmw->pte);
>>>> +	WARN_ON_ONCE(!pvmw->pud && !pvmw->pmd && !pvmw->pte);
>>>>
>>>>  	if (likely(pvmw->ptl))
>>>>  		spin_unlock(pvmw->ptl);
>>>> @@ -978,6 +982,7 @@ page_vma_mapped_walk_restart(struct page_vma_mapped_walk *pvmw)
>>>>  		WARN_ON_ONCE(1);
>>>>
>>>>  	pvmw->ptl = NULL;
>>>> +	pvmw->pud = NULL;
>>>>  	pvmw->pmd = NULL;
>>>>  	pvmw->pte = NULL;
>>>>  }
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index 40cf59301c21a..3128b3beedb0a 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -2933,6 +2933,14 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>>>>  	spin_unlock(ptl);
>>>>  	mmu_notifier_invalidate_range_end(&range);
>>>>  }
>>>> +
>>>> +void split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
>>>> +			   unsigned long address)
>>>> +{
>>>> +	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PUD_SIZE));
>>>> +	if (pud_trans_huge(*pud))
>>>> +		__split_huge_pud_locked(vma, pud, address);
>>>> +}
>>>>  #else
>>>>  void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>>>>  		unsigned long address)
>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>> index 9ee336aa03656..21d5c00f638dc 100644
>>>> --- a/mm/internal.h
>>>> +++ b/mm/internal.h
>>>> @@ -545,6 +545,9 @@ int user_proactive_reclaim(char *buf,
>>>>   * in mm/rmap.c:
>>>>   */
>>>>  pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
>>>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>>> +pud_t *mm_find_pud(struct mm_struct *mm, unsigned long address);
>>>> +#endif
>>>>
>>>>  /*
>>>>   * in mm/page_alloc.c
>>>> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
>>>> index b38a1d00c971b..d31eafba38041 100644
>>>> --- a/mm/page_vma_mapped.c
>>>> +++ b/mm/page_vma_mapped.c
>>>> @@ -146,6 +146,18 @@ static bool check_pmd(unsigned long pfn, struct page_vma_mapped_walk *pvmw)
>>>>  	return true;
>>>>  }
>>>>
>>>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>>> +/* Returns true if the two ranges overlap.  Careful to not overflow. */
>>>> +static bool check_pud(unsigned long pfn, struct page_vma_mapped_walk *pvmw)
>>>> +{
>>>> +	if ((pfn + HPAGE_PUD_NR - 1) < pvmw->pfn)
>>>> +		return false;
>>>> +	if (pfn > pvmw->pfn + pvmw->nr_pages - 1)
>>>> +		return false;
>>>> +	return true;
>>>> +}
>>>> +#endif
>>>> +
>>>>  static void step_forward(struct page_vma_mapped_walk *pvmw, unsigned long size)
>>>>  {
>>>>  	pvmw->address = (pvmw->address + size) & ~(size - 1);
>>>> @@ -188,6 +200,10 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>>>>  	pud_t *pud;
>>>>  	pmd_t pmde;
>>>>
>>>> +	/* The only possible pud mapping has been handled on last iteration */
>>>> +	if (pvmw->pud && !pvmw->pmd)
>>>> +		return not_found(pvmw);
>>>> +
>>>>  	/* The only possible pmd mapping has been handled on last iteration */
>>>>  	if (pvmw->pmd && !pvmw->pte)
>>>>  		return not_found(pvmw);
>>>> @@ -234,6 +250,25 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>>>>  			continue;
>>>>  		}
>>>>
>>>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>>
>>> Said it elsewhere, but it's really weird to treat an arch having the
>>> ability to do something as a go ahead for doing it.
>>>
>>>> +		/* Check for PUD-mapped THP */
>>>> +		if (pud_trans_huge(*pud)) {
>>>> +			pvmw->pud = pud;
>>>> +			pvmw->ptl = pud_lock(mm, pud);
>>>> +			if (likely(pud_trans_huge(*pud))) {
>>>> +				if (pvmw->flags & PVMW_MIGRATION)
>>>> +					return not_found(pvmw);
>>>> +				if (!check_pud(pud_pfn(*pud), pvmw))
>>>> +					return not_found(pvmw);
>>>> +				return true;
>>>> +			}
>>>> +			/* PUD was split under us, retry at PMD level */
>>>> +			spin_unlock(pvmw->ptl);
>>>> +			pvmw->ptl = NULL;
>>>> +			pvmw->pud = NULL;
>>>> +		}
>>>> +#endif
>>>> +
>>>
>>> Yeah, as I said elsewhere, we've got to be refactoring, not copy/pasting with
>>> modifications :)
>>>
>>
>> Yeah, there is repeated code in multiple places, where all I did was replace
>> what was done for PMD with PUD. In a lot of places, it's actually difficult
>> not to repeat the code (unless we want function macros, which is much worse
>> IMO).
> 
> Not if we actually refactor the existing code :)
> 
> When I wanted to make functional changes to mremap I took a lot of time to
> refactor the code into something sane before even starting that.
> 
> Because I _could_ have added the features there as-is, but it would have been
> hellish to do so as-is and added more confusion etc.
> 
> So yeah, I think a similar mentality has to be had with this change.

Ack, I will spend a lot more time thinking about the refactoring.
> 
>>
>>>
>>>>  		pvmw->pmd = pmd_offset(pud, pvmw->address);
>>>>  		/*
>>>>  		 * Make sure the pmd value isn't cached in a register by the
>>>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>>>> index d3aec7a9926ad..2047558ddcd79 100644
>>>> --- a/mm/pgtable-generic.c
>>>> +++ b/mm/pgtable-generic.c
>>>> @@ -195,6 +195,89 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
>>>>  }
>>>>  #endif
>>>>
>>>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>>> +/*
>>>> + * Deposit page tables for PUD THP.
>>>> + * Called with PUD lock held. Stores PMD tables in a singly-linked stack
>>>> + * via pud_huge_pmd, using only pmd_page->lru.next as the link pointer.
>>>> + *
>>>> + * IMPORTANT: We use only lru.next (offset 8) for linking, NOT the full
>>>> + * list_head. This is because lru.prev (offset 16) overlaps with
>>>> + * ptdesc->pmd_huge_pte, which stores the PMD table's deposited PTE tables.
>>>> + * Using list_del() would corrupt pmd_huge_pte with LIST_POISON2.
>>>
>>> This is horrible and feels like a hack? Treating a doubly-linked list as a
>>> singly-linked one like this is not upstreamable.
>>>
>>>> + *
>>>> + * PTE tables should be deposited into the PMD using pud_deposit_pte().
>>>> + */
>>>> +void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
>>>> +				    pmd_t *pmd_table)
>>>
>>> This is horrid: you're depositing the PMD using the... questionable
>>> list_head abuse, but then also have pud_deposit_pte()... But here we're
>>> depositing a PMD, so shouldn't the name reflect that?
>>>
>>>> +{
>>>> +	pgtable_t pmd_page = virt_to_page(pmd_table);
>>>> +
>>>> +	assert_spin_locked(pud_lockptr(mm, pudp));
>>>> +
>>>> +	/* Push onto stack using only lru.next as the link */
>>>> +	pmd_page->lru.next = (struct list_head *)pud_huge_pmd(pudp);
>>>
>>> Yikes...
>>>
>>>> +	pud_huge_pmd(pudp) = pmd_page;
>>>> +}
>>>> +
>>>> +/*
>>>> + * Withdraw the deposited PMD table for PUD THP split or zap.
>>>> + * Called with PUD lock held.
>>>> + * Returns NULL if no more PMD tables are deposited.
>>>> + */
>>>> +pmd_t *pgtable_trans_huge_pud_withdraw(struct mm_struct *mm, pud_t *pudp)
>>>> +{
>>>> +	pgtable_t pmd_page;
>>>> +
>>>> +	assert_spin_locked(pud_lockptr(mm, pudp));
>>>> +
>>>> +	pmd_page = pud_huge_pmd(pudp);
>>>> +	if (!pmd_page)
>>>> +		return NULL;
>>>> +
>>>> +	/* Pop from stack - lru.next points to next PMD page (or NULL) */
>>>> +	pud_huge_pmd(pudp) = (pgtable_t)pmd_page->lru.next;
>>>
>>> Where's the popping? You're just assigning here.
>>
>>
>> Ack on all of the above. Hopefully [1] is better.
> 
> Thanks!
> 
>>>
>>>> +
>>>> +	return page_address(pmd_page);
>>>> +}
>>>> +
>>>> +/*
>>>> + * Deposit a PTE table into a standalone PMD table (not yet in page table hierarchy).
>>>> + * Used for PUD THP pre-deposit. The PMD table's pmd_huge_pte stores a linked list.
>>>> + * No lock assertion since the PMD isn't visible yet.
>>>> + */
>>>> +void pud_deposit_pte(pmd_t *pmd_table, pgtable_t pgtable)
>>>> +{
>>>> +	struct ptdesc *ptdesc = virt_to_ptdesc(pmd_table);
>>>> +
>>>> +	/* FIFO - add to front of list */
>>>> +	if (!ptdesc->pmd_huge_pte)
>>>> +		INIT_LIST_HEAD(&pgtable->lru);
>>>> +	else
>>>> +		list_add(&pgtable->lru, &ptdesc->pmd_huge_pte->lru);
>>>> +	ptdesc->pmd_huge_pte = pgtable;
>>>> +}
>>>> +
>>>> +/*
>>>> + * Withdraw a PTE table from a standalone PMD table.
>>>> + * Returns NULL if no more PTE tables are deposited.
>>>> + */
>>>> +pgtable_t pud_withdraw_pte(pmd_t *pmd_table)
>>>> +{
>>>> +	struct ptdesc *ptdesc = virt_to_ptdesc(pmd_table);
>>>> +	pgtable_t pgtable;
>>>> +
>>>> +	pgtable = ptdesc->pmd_huge_pte;
>>>> +	if (!pgtable)
>>>> +		return NULL;
>>>> +	ptdesc->pmd_huge_pte = list_first_entry_or_null(&pgtable->lru,
>>>> +							struct page, lru);
>>>> +	if (ptdesc->pmd_huge_pte)
>>>> +		list_del(&pgtable->lru);
>>>> +	return pgtable;
>>>> +}
>>>> +#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
>>>> +
>>>>  #ifndef __HAVE_ARCH_PMDP_INVALIDATE
>>>>  pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>>>>  		     pmd_t *pmdp)
>>>> diff --git a/mm/rmap.c b/mm/rmap.c
>>>> index 7b9879ef442d9..69acabd763da4 100644
>>>> --- a/mm/rmap.c
>>>> +++ b/mm/rmap.c
>>>> @@ -811,6 +811,32 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
>>>>  	return pmd;
>>>>  }
>>>>
>>>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>>> +/*
>>>> + * Returns the actual pud_t* where we expect 'address' to be mapped from, or
>>>> + * NULL if it doesn't exist.  No guarantees / checks on what the pud_t*
>>>> + * represents.
>>>> + */
>>>> +pud_t *mm_find_pud(struct mm_struct *mm, unsigned long address)
>>>
>>> This series seems to be full of copy/paste.
>>>
>>> It's just not acceptable given the state of THP code as I said in reply to
>>> the cover letter - you need to _refactor_ the code.
>>>
>>> The code is bug-prone and difficult to maintain as-is, your series has to
>>> improve the technical debt, not add to it.
>>>
>>
>> In some cases we might not be able to avoid the copy, but this is definitely
>> a place where we don't need to. I will change it here. Thanks!
> 
> I disagree, see above :) But thanks on this one
> 
>>
>>>> +{
>>>> +	pgd_t *pgd;
>>>> +	p4d_t *p4d;
>>>> +	pud_t *pud = NULL;
>>>> +
>>>> +	pgd = pgd_offset(mm, address);
>>>> +	if (!pgd_present(*pgd))
>>>> +		goto out;
>>>> +
>>>> +	p4d = p4d_offset(pgd, address);
>>>> +	if (!p4d_present(*p4d))
>>>> +		goto out;
>>>> +
>>>> +	pud = pud_offset(p4d, address);
>>>> +out:
>>>> +	return pud;
>>>> +}
>>>> +#endif
>>>> +
>>>>  struct folio_referenced_arg {
>>>>  	int mapcount;
>>>>  	int referenced;
>>>> @@ -1415,11 +1441,7 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
>>>>  			SetPageAnonExclusive(page);
>>>>  			break;
>>>>  		case PGTABLE_LEVEL_PUD:
>>>> -			/*
>>>> -			 * Keep the compiler happy, we don't support anonymous
>>>> -			 * PUD mappings.
>>>> -			 */
>>>> -			WARN_ON_ONCE(1);
>>>> +			SetPageAnonExclusive(page);
>>>>  			break;
>>>>  		default:
>>>>  			BUILD_BUG();
>>>> @@ -1503,6 +1525,31 @@ void folio_add_anon_rmap_pmd(struct folio *folio, struct page *page,
>>>>  #endif
>>>>  }
>>>>
>>>> +/**
>>>> + * folio_add_anon_rmap_pud - add a PUD mapping to a page range of an anon folio
>>>> + * @folio:	The folio to add the mapping to
>>>> + * @page:	The first page to add
>>>> + * @vma:	The vm area in which the mapping is added
>>>> + * @address:	The user virtual address of the first page to map
>>>> + * @flags:	The rmap flags
>>>> + *
>>>> + * The page range of folio is defined by [first_page, first_page + HPAGE_PUD_NR)
>>>> + *
>>>> + * The caller needs to hold the page table lock, and the page must be locked in
>>>> + * the anon_vma case: to serialize mapping,index checking after setting.
>>>> + */
>>>> +void folio_add_anon_rmap_pud(struct folio *folio, struct page *page,
>>>> +		struct vm_area_struct *vma, unsigned long address, rmap_t flags)
>>>> +{
>>>> +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
>>>> +	defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
>>>> +	__folio_add_anon_rmap(folio, page, HPAGE_PUD_NR, vma, address, flags,
>>>> +			      PGTABLE_LEVEL_PUD);
>>>> +#else
>>>> +	WARN_ON_ONCE(true);
>>>> +#endif
>>>> +}
>>>
>>> More copy/paste... Maybe unavoidable in this case, but be good to try.
>>>
>>>> +
>>>>  /**
>>>>   * folio_add_new_anon_rmap - Add mapping to a new anonymous folio.
>>>>   * @folio:	The folio to add the mapping to.
>>>> @@ -1934,6 +1981,20 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>>>>  		}
>>>>
>>>>  		if (!pvmw.pte) {
>>>> +			/*
>>>> +			 * Check for PUD-mapped THP first.
>>>> +			 * If we have a PUD mapping and TTU_SPLIT_HUGE_PUD is set,
>>>> +			 * split the PUD to PMD level and restart the walk.
>>>> +			 */
>>>
>>> This is literally describing the code below, it's not useful.
>>
>> Ack, Will remove this comment, Thanks!
> 
> Thanks
> 
>>>
>>>> +			if (pvmw.pud && pud_trans_huge(*pvmw.pud)) {
>>>> +				if (flags & TTU_SPLIT_HUGE_PUD) {
>>>> +					split_huge_pud_locked(vma, pvmw.pud, pvmw.address);
>>>> +					flags &= ~TTU_SPLIT_HUGE_PUD;
>>>> +					page_vma_mapped_walk_restart(&pvmw);
>>>> +					continue;
>>>> +				}
>>>> +			}
>>>> +
>>>>  			if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
>>>>  				if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
>>>>  					goto walk_done;
>>>> @@ -2325,6 +2386,27 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>>>>  	mmu_notifier_invalidate_range_start(&range);
>>>>
>>>>  	while (page_vma_mapped_walk(&pvmw)) {
>>>> +		/* Handle PUD-mapped THP first */
>>>
>>> How did/will this interact with DAX, VFIO PUD THP?
>>
>> It won't interact with DAX. try_to_migrate() does the below check and just returns:
>>
>> 	if (folio_is_zone_device(folio) &&
>> 	    (!folio_is_device_private(folio) && !folio_is_device_coherent(folio)))
>> 		return;
>>
>> so DAX would never reach here.
> 
> Hmm folio_is_zone_device() always returns true for DAX?
> 

Yes, that is my understanding. Both fsdax and devdax call into
devm_memremap_pages() -> memremap_pages() in mm/memremap.c, which
unconditionally places all pages in ZONE_DEVICE.
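
folio_is_zone_device() is basically just a zone check (paraphrasing from
memory, not the exact source):

	static inline bool folio_is_zone_device(const struct folio *folio)
	{
		return page_zonenum(&folio->page) == ZONE_DEVICE;
	}

so anything that went through memremap_pages() takes the early return
quoted above.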

> Also that's just one rmap call right?
> 
Yes.

>>
>> I think vfio pages are pinned and therefore can't be migrated? (I have
>> not looked at vfio code, I will try to get a better understanding tomorrow,
>> but please let me know if that sounds wrong.)
> 
> OK, I've not dug into this either, please do check. And it would be good to test
> this code vs. actual DAX/VFIO scenarios if you can find a way to test that, thanks!

I think DAX is OK; I will look more into VFIO. I will also CC the people who added
DAX and VFIO PUD support on the next RFC.

> 
>>
>>
>>>
>>>> +		if (!pvmw.pte && !pvmw.pmd) {
>>>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>>
>>> Won't pud_trans_huge() imply this...
>>>
>>
>> Agreed, I think it should cover it.
> Thanks!
> 
>>
>>>> +			/*
>>>> +			 * PUD-mapped THP: skip migration to preserve the huge
>>>> +			 * page. Splitting would defeat the purpose of PUD THPs.
>>>> +			 * Return false to indicate migration failure, which
>>>> +			 * will cause alloc_contig_range() to try a different
>>>> +			 * memory region.
>>>> +			 */
>>>> +			if (pvmw.pud && pud_trans_huge(*pvmw.pud)) {
>>>> +				page_vma_mapped_walk_done(&pvmw);
>>>> +				ret = false;
>>>> +				break;
>>>> +			}
>>>> +#endif
>>>> +			/* Unexpected state: !pte && !pmd but not a PUD THP */
>>>> +			page_vma_mapped_walk_done(&pvmw);
>>>> +			break;
>>>> +		}
>>>> +
>>>>  		/* PMD-mapped THP migration entry */
>>>>  		if (!pvmw.pte) {
>>>>  			__maybe_unused unsigned long pfn;
>>>> @@ -2607,10 +2689,10 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
>>>>
>>>>  	/*
>>>>  	 * Migration always ignores mlock and only supports TTU_RMAP_LOCKED and
>>>> -	 * TTU_SPLIT_HUGE_PMD, TTU_SYNC, and TTU_BATCH_FLUSH flags.
>>>> +	 * TTU_SPLIT_HUGE_PMD, TTU_SPLIT_HUGE_PUD, TTU_SYNC, and TTU_BATCH_FLUSH flags.
>>>>  	 */
>>>>  	if (WARN_ON_ONCE(flags & ~(TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD |
>>>> -					TTU_SYNC | TTU_BATCH_FLUSH)))
>>>> +					TTU_SPLIT_HUGE_PUD | TTU_SYNC | TTU_BATCH_FLUSH)))
>>>>  		return;
>>>>
>>>>  	if (folio_is_zone_device(folio) &&
>>>> --
>>>> 2.47.3
>>>>
>>>
>>> This isn't a final review, I'll have to look more thoroughly through here
>>> over time and you're going to have to be patient in general :)
>>>
>>> Cheers, Lorenzo
>>
>>
>> Thanks for the review, this is awesome!
> 
> Ack, will do more when I have time, and obviously you're getting a lot of input
> from others too.
> 
> Be good to get a summary at next THP cabal ;)
> 
>>
>>
>> [1] https://lore.kernel.org/all/20f92576-e932-435f-bb7b-de49eb84b012@gmail.com/
>> [2] https://lore.kernel.org/all/05d5918f-b61b-4091-b8c6-20eebfffc3c4@gmail.com/
>> [3] https://lore.kernel.org/all/2efaa5ed-bd09-41f0-9c07-5cd6cccc4595@gmail.com/
>>
>>
>>
> 
> cheers, Lorenzo
Re: [RFC 01/12] mm: add PUD THP ptdesc and rmap support
Posted by Kiryl Shutsemau 4 days, 20 hours ago
On Sun, Feb 01, 2026 at 04:50:18PM -0800, Usama Arif wrote:
> For page table management, PUD THPs need to pre-deposit page tables
> that will be used when the huge page is later split. When a PUD THP
> is allocated, we cannot know in advance when or why it might need to
> be split (COW, partial unmap, reclaim), but we need page tables ready
> for that eventuality. Similar to how PMD THPs deposit a single PTE
> table, PUD THPs deposit a PMD table which itself contains deposited
> PTE tables - a two-level deposit. This commit adds the deposit/withdraw
> infrastructure and a new pud_huge_pmd field in ptdesc to store the
> deposited PMD.
> 
> The deposited PMD tables are stored as a singly-linked stack using only
> page->lru.next as the link pointer. A doubly-linked list using the
> standard list_head mechanism would cause memory corruption: list_del()
> poisons both lru.next (offset 8) and lru.prev (offset 16), but lru.prev
> overlaps with ptdesc->pmd_huge_pte at offset 16. Since deposited PMD
> tables have their own deposited PTE tables stored in pmd_huge_pte,
> poisoning lru.prev would corrupt the PTE table list and cause crashes
> when withdrawing PTE tables during split. PMD THPs don't have this
> problem because their deposited PTE tables don't have sub-deposits.
> Using only lru.next avoids the overlap entirely.
> 
> For reverse mapping, PUD THPs need the same rmap support that PMD THPs
> have. The page_vma_mapped_walk() function is extended to recognize and
> handle PUD-mapped folios during rmap traversal. A new TTU_SPLIT_HUGE_PUD
> flag tells the unmap path to split PUD THPs before proceeding, since
> there is no PUD-level migration entry format - the split converts the
> single PUD mapping into individual PTE mappings that can be migrated
> or swapped normally.
> 
> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
> ---
>  include/linux/huge_mm.h  |  5 +++
>  include/linux/mm.h       | 19 ++++++++
>  include/linux/mm_types.h |  5 ++-
>  include/linux/pgtable.h  |  8 ++++
>  include/linux/rmap.h     |  7 ++-
>  mm/huge_memory.c         |  8 ++++
>  mm/internal.h            |  3 ++
>  mm/page_vma_mapped.c     | 35 +++++++++++++++
>  mm/pgtable-generic.c     | 83 ++++++++++++++++++++++++++++++++++
>  mm/rmap.c                | 96 +++++++++++++++++++++++++++++++++++++---
>  10 files changed, 260 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index a4d9f964dfdea..e672e45bb9cc7 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -463,10 +463,15 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>  		unsigned long address);
>  
>  #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +void split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
> +			   unsigned long address);
>  int change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		    pud_t *pudp, unsigned long addr, pgprot_t newprot,
>  		    unsigned long cp_flags);
>  #else
> +static inline void
> +split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
> +		      unsigned long address) {}
>  static inline int
>  change_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		pud_t *pudp, unsigned long addr, pgprot_t newprot,
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ab2e7e30aef96..a15e18df0f771 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3455,6 +3455,22 @@ static inline bool pagetable_pmd_ctor(struct mm_struct *mm,
>   * considered ready to switch to split PUD locks yet; there may be places
>   * which need to be converted from page_table_lock.
>   */
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +static inline struct page *pud_pgtable_page(pud_t *pud)
> +{
> +	unsigned long mask = ~(PTRS_PER_PUD * sizeof(pud_t) - 1);
> +
> +	return virt_to_page((void *)((unsigned long)pud & mask));
> +}
> +
> +static inline struct ptdesc *pud_ptdesc(pud_t *pud)
> +{
> +	return page_ptdesc(pud_pgtable_page(pud));
> +}
> +
> +#define pud_huge_pmd(pud) (pud_ptdesc(pud)->pud_huge_pmd)
> +#endif
> +
>  static inline spinlock_t *pud_lockptr(struct mm_struct *mm, pud_t *pud)
>  {
>  	return &mm->page_table_lock;
> @@ -3471,6 +3487,9 @@ static inline spinlock_t *pud_lock(struct mm_struct *mm, pud_t *pud)
>  static inline void pagetable_pud_ctor(struct ptdesc *ptdesc)
>  {
>  	__pagetable_ctor(ptdesc);
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +	ptdesc->pud_huge_pmd = NULL;
> +#endif
>  }
>  
>  static inline void pagetable_p4d_ctor(struct ptdesc *ptdesc)
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 78950eb8926dc..26a38490ae2e1 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -577,7 +577,10 @@ struct ptdesc {
>  		struct list_head pt_list;
>  		struct {
>  			unsigned long _pt_pad_1;
> -			pgtable_t pmd_huge_pte;
> +			union {
> +				pgtable_t pmd_huge_pte;  /* For PMD tables: deposited PTE */
> +				pgtable_t pud_huge_pmd;  /* For PUD tables: deposited PMD list */
> +			};
>  		};
>  	};
>  	unsigned long __page_mapping;
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 2f0dd3a4ace1a..3ce733c1d71a2 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1168,6 +1168,14 @@ extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
>  #define arch_needs_pgtable_deposit() (false)
>  #endif
>  
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +extern void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
> +					   pmd_t *pmd_table);
> +extern pmd_t *pgtable_trans_huge_pud_withdraw(struct mm_struct *mm, pud_t *pudp);
> +extern void pud_deposit_pte(pmd_t *pmd_table, pgtable_t pgtable);
> +extern pgtable_t pud_withdraw_pte(pmd_t *pmd_table);
> +#endif
> +
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  /*
>   * This is an implementation of pmdp_establish() that is only suitable for an
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index daa92a58585d9..08cd0a0eb8763 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -101,6 +101,7 @@ enum ttu_flags {
>  					 * do a final flush if necessary */
>  	TTU_RMAP_LOCKED		= 0x80,	/* do not grab rmap lock:
>  					 * caller holds it */
> +	TTU_SPLIT_HUGE_PUD	= 0x100, /* split huge PUD if any */
>  };
>  
>  #ifdef CONFIG_MMU
> @@ -473,6 +474,8 @@ void folio_add_anon_rmap_ptes(struct folio *, struct page *, int nr_pages,
>  	folio_add_anon_rmap_ptes(folio, page, 1, vma, address, flags)
>  void folio_add_anon_rmap_pmd(struct folio *, struct page *,
>  		struct vm_area_struct *, unsigned long address, rmap_t flags);
> +void folio_add_anon_rmap_pud(struct folio *, struct page *,
> +		struct vm_area_struct *, unsigned long address, rmap_t flags);
>  void folio_add_new_anon_rmap(struct folio *, struct vm_area_struct *,
>  		unsigned long address, rmap_t flags);
>  void folio_add_file_rmap_ptes(struct folio *, struct page *, int nr_pages,
> @@ -933,6 +936,7 @@ struct page_vma_mapped_walk {
>  	pgoff_t pgoff;
>  	struct vm_area_struct *vma;
>  	unsigned long address;
> +	pud_t *pud;
>  	pmd_t *pmd;
>  	pte_t *pte;
>  	spinlock_t *ptl;
> @@ -970,7 +974,7 @@ static inline void page_vma_mapped_walk_done(struct page_vma_mapped_walk *pvmw)
>  static inline void
>  page_vma_mapped_walk_restart(struct page_vma_mapped_walk *pvmw)
>  {
> -	WARN_ON_ONCE(!pvmw->pmd && !pvmw->pte);
> +	WARN_ON_ONCE(!pvmw->pud && !pvmw->pmd && !pvmw->pte);
>  
>  	if (likely(pvmw->ptl))
>  		spin_unlock(pvmw->ptl);
> @@ -978,6 +982,7 @@ page_vma_mapped_walk_restart(struct page_vma_mapped_walk *pvmw)
>  		WARN_ON_ONCE(1);
>  
>  	pvmw->ptl = NULL;
> +	pvmw->pud = NULL;
>  	pvmw->pmd = NULL;
>  	pvmw->pte = NULL;
>  }
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 40cf59301c21a..3128b3beedb0a 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2933,6 +2933,14 @@ void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>  	spin_unlock(ptl);
>  	mmu_notifier_invalidate_range_end(&range);
>  }
> +
> +void split_huge_pud_locked(struct vm_area_struct *vma, pud_t *pud,
> +			   unsigned long address)
> +{
> +	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PUD_SIZE));
> +	if (pud_trans_huge(*pud))
> +		__split_huge_pud_locked(vma, pud, address);
> +}
>  #else
>  void __split_huge_pud(struct vm_area_struct *vma, pud_t *pud,
>  		unsigned long address)
> diff --git a/mm/internal.h b/mm/internal.h
> index 9ee336aa03656..21d5c00f638dc 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -545,6 +545,9 @@ int user_proactive_reclaim(char *buf,
>   * in mm/rmap.c:
>   */
>  pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address);
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +pud_t *mm_find_pud(struct mm_struct *mm, unsigned long address);
> +#endif
>  
>  /*
>   * in mm/page_alloc.c
> diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
> index b38a1d00c971b..d31eafba38041 100644
> --- a/mm/page_vma_mapped.c
> +++ b/mm/page_vma_mapped.c
> @@ -146,6 +146,18 @@ static bool check_pmd(unsigned long pfn, struct page_vma_mapped_walk *pvmw)
>  	return true;
>  }
>  
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +/* Returns true if the two ranges overlap.  Careful to not overflow. */
> +static bool check_pud(unsigned long pfn, struct page_vma_mapped_walk *pvmw)
> +{
> +	if ((pfn + HPAGE_PUD_NR - 1) < pvmw->pfn)
> +		return false;
> +	if (pfn > pvmw->pfn + pvmw->nr_pages - 1)
> +		return false;
> +	return true;
> +}
> +#endif
> +
>  static void step_forward(struct page_vma_mapped_walk *pvmw, unsigned long size)
>  {
>  	pvmw->address = (pvmw->address + size) & ~(size - 1);
> @@ -188,6 +200,10 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>  	pud_t *pud;
>  	pmd_t pmde;
>  
> +	/* The only possible pud mapping has been handled on last iteration */
> +	if (pvmw->pud && !pvmw->pmd)
> +		return not_found(pvmw);
> +
>  	/* The only possible pmd mapping has been handled on last iteration */
>  	if (pvmw->pmd && !pvmw->pte)
>  		return not_found(pvmw);
> @@ -234,6 +250,25 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
>  			continue;
>  		}
>  
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +		/* Check for PUD-mapped THP */
> +		if (pud_trans_huge(*pud)) {
> +			pvmw->pud = pud;
> +			pvmw->ptl = pud_lock(mm, pud);
> +			if (likely(pud_trans_huge(*pud))) {
> +				if (pvmw->flags & PVMW_MIGRATION)
> +					return not_found(pvmw);
> +				if (!check_pud(pud_pfn(*pud), pvmw))
> +					return not_found(pvmw);
> +				return true;
> +			}
> +			/* PUD was split under us, retry at PMD level */
> +			spin_unlock(pvmw->ptl);
> +			pvmw->ptl = NULL;
> +			pvmw->pud = NULL;
> +		}
> +#endif
> +
>  		pvmw->pmd = pmd_offset(pud, pvmw->address);
>  		/*
>  		 * Make sure the pmd value isn't cached in a register by the
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index d3aec7a9926ad..2047558ddcd79 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -195,6 +195,89 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
>  }
>  #endif
>  
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +/*
> + * Deposit page tables for PUD THP.
> + * Called with PUD lock held. Stores PMD tables in a singly-linked stack
> + * via pud_huge_pmd, using only pmd_page->lru.next as the link pointer.
> + *
> + * IMPORTANT: We use only lru.next (offset 8) for linking, NOT the full
> + * list_head. This is because lru.prev (offset 16) overlaps with
> + * ptdesc->pmd_huge_pte, which stores the PMD table's deposited PTE tables.
> + * Using list_del() would corrupt pmd_huge_pte with LIST_POISON2.

This is ugly.

Sounds like you want to use llist_node/head instead of list_head for this.

You might be able to avoid taking the lock in some cases. Note that
pud_lockptr() is mm->page_table_lock as of now.
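
Something like this, maybe (completely untested sketch: "deposit_node" is a
made-up field and pud_huge_pmd would become a struct llist_head):

	static void deposit_pud_pmd_table(pud_t *pudp, struct page *pmd_page)
	{
		/* lock-free push of the PMD table onto the per-PUD stack */
		llist_add(&pmd_page->deposit_node, &pud_huge_pmd(pudp));
	}

	static struct page *withdraw_pud_pmd_table(pud_t *pudp)
	{
		/* pop one PMD table; NULL when the stack is empty */
		struct llist_node *node = llist_del_first(&pud_huge_pmd(pudp));

		return node ? llist_entry(node, struct page, deposit_node) : NULL;
	}

llist_del_first() only needs the removers serialized, and the PUD lock you
already take covers that.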

> + *
> + * PTE tables should be deposited into the PMD using pud_deposit_pte().
> + */
> +void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
> +				    pmd_t *pmd_table)
> +{
> +	pgtable_t pmd_page = virt_to_page(pmd_table);
> +
> +	assert_spin_locked(pud_lockptr(mm, pudp));
> +
> +	/* Push onto stack using only lru.next as the link */
> +	pmd_page->lru.next = (struct list_head *)pud_huge_pmd(pudp);
> +	pud_huge_pmd(pudp) = pmd_page;
> +}
> +
> +/*
> + * Withdraw the deposited PMD table for PUD THP split or zap.
> + * Called with PUD lock held.
> + * Returns NULL if no more PMD tables are deposited.
> + */
> +pmd_t *pgtable_trans_huge_pud_withdraw(struct mm_struct *mm, pud_t *pudp)
> +{
> +	pgtable_t pmd_page;
> +
> +	assert_spin_locked(pud_lockptr(mm, pudp));
> +
> +	pmd_page = pud_huge_pmd(pudp);
> +	if (!pmd_page)
> +		return NULL;
> +
> +	/* Pop from stack - lru.next points to next PMD page (or NULL) */
> +	pud_huge_pmd(pudp) = (pgtable_t)pmd_page->lru.next;
> +
> +	return page_address(pmd_page);
> +}
> +
> +/*
> + * Deposit a PTE table into a standalone PMD table (not yet in page table hierarchy).
> + * Used for PUD THP pre-deposit. The PMD table's pmd_huge_pte stores a linked list.
> + * No lock assertion since the PMD isn't visible yet.
> + */
> +void pud_deposit_pte(pmd_t *pmd_table, pgtable_t pgtable)
> +{
> +	struct ptdesc *ptdesc = virt_to_ptdesc(pmd_table);
> +
> +	/* FIFO - add to front of list */
> +	if (!ptdesc->pmd_huge_pte)
> +		INIT_LIST_HEAD(&pgtable->lru);
> +	else
> +		list_add(&pgtable->lru, &ptdesc->pmd_huge_pte->lru);
> +	ptdesc->pmd_huge_pte = pgtable;
> +}
> +
> +/*
> + * Withdraw a PTE table from a standalone PMD table.
> + * Returns NULL if no more PTE tables are deposited.
> + */
> +pgtable_t pud_withdraw_pte(pmd_t *pmd_table)
> +{
> +	struct ptdesc *ptdesc = virt_to_ptdesc(pmd_table);
> +	pgtable_t pgtable;
> +
> +	pgtable = ptdesc->pmd_huge_pte;
> +	if (!pgtable)
> +		return NULL;
> +	ptdesc->pmd_huge_pte = list_first_entry_or_null(&pgtable->lru,
> +							struct page, lru);
> +	if (ptdesc->pmd_huge_pte)
> +		list_del(&pgtable->lru);
> +	return pgtable;
> +}
> +#endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
> +
>  #ifndef __HAVE_ARCH_PMDP_INVALIDATE
>  pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>  		     pmd_t *pmdp)
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 7b9879ef442d9..69acabd763da4 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -811,6 +811,32 @@ pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
>  	return pmd;
>  }
>  
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +/*
> + * Returns the actual pud_t* where we expect 'address' to be mapped from, or
> + * NULL if it doesn't exist.  No guarantees / checks on what the pud_t*
> + * represents.
> + */
> +pud_t *mm_find_pud(struct mm_struct *mm, unsigned long address)

Remove the ifdef and make mm_find_pmd() call it.

And in general, try to avoid ifdeffery where possible.
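
I.e. something along these lines (untested sketch):

	pmd_t *mm_find_pmd(struct mm_struct *mm, unsigned long address)
	{
		pud_t *pud = mm_find_pud(mm, address);

		/* mm_find_pud() already handled the pgd/p4d levels */
		if (!pud || !pud_present(*pud))
			return NULL;

		return pmd_offset(pud, address);
	}

so the pgd/p4d walk only lives in one place.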

> +{
> +	pgd_t *pgd;
> +	p4d_t *p4d;
> +	pud_t *pud = NULL;
> +
> +	pgd = pgd_offset(mm, address);
> +	if (!pgd_present(*pgd))
> +		goto out;
> +
> +	p4d = p4d_offset(pgd, address);
> +	if (!p4d_present(*p4d))
> +		goto out;
> +
> +	pud = pud_offset(p4d, address);
> +out:
> +	return pud;
> +}
> +#endif
> +
>  struct folio_referenced_arg {
>  	int mapcount;
>  	int referenced;
> @@ -1415,11 +1441,7 @@ static __always_inline void __folio_add_anon_rmap(struct folio *folio,
>  			SetPageAnonExclusive(page);
>  			break;
>  		case PGTABLE_LEVEL_PUD:
> -			/*
> -			 * Keep the compiler happy, we don't support anonymous
> -			 * PUD mappings.
> -			 */
> -			WARN_ON_ONCE(1);
> +			SetPageAnonExclusive(page);
>  			break;
>  		default:
>  			BUILD_BUG();
> @@ -1503,6 +1525,31 @@ void folio_add_anon_rmap_pmd(struct folio *folio, struct page *page,
>  #endif
>  }
>  
> +/**
> + * folio_add_anon_rmap_pud - add a PUD mapping to a page range of an anon folio
> + * @folio:	The folio to add the mapping to
> + * @page:	The first page to add
> + * @vma:	The vm area in which the mapping is added
> + * @address:	The user virtual address of the first page to map
> + * @flags:	The rmap flags
> + *
> + * The page range of folio is defined by [first_page, first_page + HPAGE_PUD_NR)
> + *
> + * The caller needs to hold the page table lock, and the page must be locked in
> + * the anon_vma case: to serialize mapping,index checking after setting.
> + */
> +void folio_add_anon_rmap_pud(struct folio *folio, struct page *page,
> +		struct vm_area_struct *vma, unsigned long address, rmap_t flags)
> +{
> +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && \
> +	defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
> +	__folio_add_anon_rmap(folio, page, HPAGE_PUD_NR, vma, address, flags,
> +			      PGTABLE_LEVEL_PUD);
> +#else
> +	WARN_ON_ONCE(true);
> +#endif
> +}
> +
>  /**
>   * folio_add_new_anon_rmap - Add mapping to a new anonymous folio.
>   * @folio:	The folio to add the mapping to.
> @@ -1934,6 +1981,20 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>  		}
>  
>  		if (!pvmw.pte) {
> +			/*
> +			 * Check for PUD-mapped THP first.
> +			 * If we have a PUD mapping and TTU_SPLIT_HUGE_PUD is set,
> +			 * split the PUD to PMD level and restart the walk.
> +			 */
> +			if (pvmw.pud && pud_trans_huge(*pvmw.pud)) {
> +				if (flags & TTU_SPLIT_HUGE_PUD) {
> +					split_huge_pud_locked(vma, pvmw.pud, pvmw.address);
> +					flags &= ~TTU_SPLIT_HUGE_PUD;
> +					page_vma_mapped_walk_restart(&pvmw);
> +					continue;
> +				}
> +			}
> +
>  			if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
>  				if (unmap_huge_pmd_locked(vma, pvmw.address, pvmw.pmd, folio))
>  					goto walk_done;
> @@ -2325,6 +2386,27 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
>  	mmu_notifier_invalidate_range_start(&range);
>  
>  	while (page_vma_mapped_walk(&pvmw)) {
> +		/* Handle PUD-mapped THP first */
> +		if (!pvmw.pte && !pvmw.pmd) {
> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
> +			/*
> +			 * PUD-mapped THP: skip migration to preserve the huge
> +			 * page. Splitting would defeat the purpose of PUD THPs.
> +			 * Return false to indicate migration failure, which
> +			 * will cause alloc_contig_range() to try a different
> +			 * memory region.
> +			 */
> +			if (pvmw.pud && pud_trans_huge(*pvmw.pud)) {
> +				page_vma_mapped_walk_done(&pvmw);
> +				ret = false;
> +				break;
> +			}
> +#endif
> +			/* Unexpected state: !pte && !pmd but not a PUD THP */
> +			page_vma_mapped_walk_done(&pvmw);
> +			break;
> +		}
> +
>  		/* PMD-mapped THP migration entry */
>  		if (!pvmw.pte) {
>  			__maybe_unused unsigned long pfn;
> @@ -2607,10 +2689,10 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
>  
>  	/*
>  	 * Migration always ignores mlock and only supports TTU_RMAP_LOCKED and
> -	 * TTU_SPLIT_HUGE_PMD, TTU_SYNC, and TTU_BATCH_FLUSH flags.
> +	 * TTU_SPLIT_HUGE_PMD, TTU_SPLIT_HUGE_PUD, TTU_SYNC, and TTU_BATCH_FLUSH flags.
>  	 */
>  	if (WARN_ON_ONCE(flags & ~(TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD |
> -					TTU_SYNC | TTU_BATCH_FLUSH)))
> +					TTU_SPLIT_HUGE_PUD | TTU_SYNC | TTU_BATCH_FLUSH)))
>  		return;
>  
>  	if (folio_is_zone_device(folio) &&
> -- 
> 2.47.3
> 

-- 
  Kiryl Shutsemau / Kirill A. Shutemov
Re: [RFC 01/12] mm: add PUD THP ptdesc and rmap support
Posted by Zi Yan 4 days, 15 hours ago
On 2 Feb 2026, at 5:44, Kiryl Shutsemau wrote:

> On Sun, Feb 01, 2026 at 04:50:18PM -0800, Usama Arif wrote:
>> For page table management, PUD THPs need to pre-deposit page tables
>> that will be used when the huge page is later split. When a PUD THP
>> is allocated, we cannot know in advance when or why it might need to
>> be split (COW, partial unmap, reclaim), but we need page tables ready
>> for that eventuality. Similar to how PMD THPs deposit a single PTE
>> table, PUD THPs deposit a PMD table which itself contains deposited
>> PTE tables - a two-level deposit. This commit adds the deposit/withdraw
>> infrastructure and a new pud_huge_pmd field in ptdesc to store the
>> deposited PMD.
>>
>> The deposited PMD tables are stored as a singly-linked stack using only
>> page->lru.next as the link pointer. A doubly-linked list using the
>> standard list_head mechanism would cause memory corruption: list_del()
>> poisons both lru.next (offset 8) and lru.prev (offset 16), but lru.prev
>> overlaps with ptdesc->pmd_huge_pte at offset 16. Since deposited PMD
>> tables have their own deposited PTE tables stored in pmd_huge_pte,
>> poisoning lru.prev would corrupt the PTE table list and cause crashes
>> when withdrawing PTE tables during split. PMD THPs don't have this
>> problem because their deposited PTE tables don't have sub-deposits.
>> Using only lru.next avoids the overlap entirely.
>>
>> For reverse mapping, PUD THPs need the same rmap support that PMD THPs
>> have. The page_vma_mapped_walk() function is extended to recognize and
>> handle PUD-mapped folios during rmap traversal. A new TTU_SPLIT_HUGE_PUD
>> flag tells the unmap path to split PUD THPs before proceeding, since
>> there is no PUD-level migration entry format - the split converts the
>> single PUD mapping into individual PTE mappings that can be migrated
>> or swapped normally.
>>
>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>> ---
>>  include/linux/huge_mm.h  |  5 +++
>>  include/linux/mm.h       | 19 ++++++++
>>  include/linux/mm_types.h |  5 ++-
>>  include/linux/pgtable.h  |  8 ++++
>>  include/linux/rmap.h     |  7 ++-
>>  mm/huge_memory.c         |  8 ++++
>>  mm/internal.h            |  3 ++
>>  mm/page_vma_mapped.c     | 35 +++++++++++++++
>>  mm/pgtable-generic.c     | 83 ++++++++++++++++++++++++++++++++++
>>  mm/rmap.c                | 96 +++++++++++++++++++++++++++++++++++++---
>>  10 files changed, 260 insertions(+), 9 deletions(-)
>>

<snip>

>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>> index d3aec7a9926ad..2047558ddcd79 100644
>> --- a/mm/pgtable-generic.c
>> +++ b/mm/pgtable-generic.c
>> @@ -195,6 +195,89 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
>>  }
>>  #endif
>>
>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>> +/*
>> + * Deposit page tables for PUD THP.
>> + * Called with PUD lock held. Stores PMD tables in a singly-linked stack
>> + * via pud_huge_pmd, using only pmd_page->lru.next as the link pointer.
>> + *
>> + * IMPORTANT: We use only lru.next (offset 8) for linking, NOT the full
>> + * list_head. This is because lru.prev (offset 16) overlaps with
>> + * ptdesc->pmd_huge_pte, which stores the PMD table's deposited PTE tables.
>> + * Using list_del() would corrupt pmd_huge_pte with LIST_POISON2.
>
> This is ugly.
>
> Sounds like you want to use llist_node/head instead of list_head for this.
>
> You might be able to avoid taking the lock in some cases. Note that
> pud_lockptr() is mm->page_table_lock as of now.

I agree. I used llist_node/head in my implementation[1] and it works.
I have an illustration at [2] to show the concept. Feel free to reuse the code.


[1] https://lore.kernel.org/all/20200928193428.GB30994@casper.infradead.org/
[2] https://normal.zone/blog/2021-01-04-linux-1gb-thp-2/#new-mechanism

Best Regards,
Yan, Zi
Re: [RFC 01/12] mm: add PUD THP ptdesc and rmap support
Posted by Usama Arif 3 days, 9 hours ago

On 02/02/2026 08:01, Zi Yan wrote:
> On 2 Feb 2026, at 5:44, Kiryl Shutsemau wrote:
> 
>> On Sun, Feb 01, 2026 at 04:50:18PM -0800, Usama Arif wrote:
>>> For page table management, PUD THPs need to pre-deposit page tables
>>> that will be used when the huge page is later split. When a PUD THP
>>> is allocated, we cannot know in advance when or why it might need to
>>> be split (COW, partial unmap, reclaim), but we need page tables ready
>>> for that eventuality. Similar to how PMD THPs deposit a single PTE
>>> table, PUD THPs deposit a PMD table which itself contains deposited
>>> PTE tables - a two-level deposit. This commit adds the deposit/withdraw
>>> infrastructure and a new pud_huge_pmd field in ptdesc to store the
>>> deposited PMD.
>>>
>>> The deposited PMD tables are stored as a singly-linked stack using only
>>> page->lru.next as the link pointer. A doubly-linked list using the
>>> standard list_head mechanism would cause memory corruption: list_del()
>>> poisons both lru.next (offset 8) and lru.prev (offset 16), but lru.prev
>>> overlaps with ptdesc->pmd_huge_pte at offset 16. Since deposited PMD
>>> tables have their own deposited PTE tables stored in pmd_huge_pte,
>>> poisoning lru.prev would corrupt the PTE table list and cause crashes
>>> when withdrawing PTE tables during split. PMD THPs don't have this
>>> problem because their deposited PTE tables don't have sub-deposits.
>>> Using only lru.next avoids the overlap entirely.
>>>
>>> For reverse mapping, PUD THPs need the same rmap support that PMD THPs
>>> have. The page_vma_mapped_walk() function is extended to recognize and
>>> handle PUD-mapped folios during rmap traversal. A new TTU_SPLIT_HUGE_PUD
>>> flag tells the unmap path to split PUD THPs before proceeding, since
>>> there is no PUD-level migration entry format - the split converts the
>>> single PUD mapping into individual PTE mappings that can be migrated
>>> or swapped normally.
>>>
>>> Signed-off-by: Usama Arif <usamaarif642@gmail.com>
>>> ---
>>>  include/linux/huge_mm.h  |  5 +++
>>>  include/linux/mm.h       | 19 ++++++++
>>>  include/linux/mm_types.h |  5 ++-
>>>  include/linux/pgtable.h  |  8 ++++
>>>  include/linux/rmap.h     |  7 ++-
>>>  mm/huge_memory.c         |  8 ++++
>>>  mm/internal.h            |  3 ++
>>>  mm/page_vma_mapped.c     | 35 +++++++++++++++
>>>  mm/pgtable-generic.c     | 83 ++++++++++++++++++++++++++++++++++
>>>  mm/rmap.c                | 96 +++++++++++++++++++++++++++++++++++++---
>>>  10 files changed, 260 insertions(+), 9 deletions(-)
>>>
> 
> <snip>
> 
>>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>>> index d3aec7a9926ad..2047558ddcd79 100644
>>> --- a/mm/pgtable-generic.c
>>> +++ b/mm/pgtable-generic.c
>>> @@ -195,6 +195,89 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
>>>  }
>>>  #endif
>>>
>>> +#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
>>> +/*
>>> + * Deposit page tables for PUD THP.
>>> + * Called with PUD lock held. Stores PMD tables in a singly-linked stack
>>> + * via pud_huge_pmd, using only pmd_page->lru.next as the link pointer.
>>> + *
>>> + * IMPORTANT: We use only lru.next (offset 8) for linking, NOT the full
>>> + * list_head. This is because lru.prev (offset 16) overlaps with
>>> + * ptdesc->pmd_huge_pte, which stores the PMD table's deposited PTE tables.
>>> + * Using list_del() would corrupt pmd_huge_pte with LIST_POISON2.
>>
>> This is ugly.
>>
>> Sounds like you want to use llist_node/head instead of list_head for this.
>>
>> You might be able to avoid taking the lock in some cases. Note that
>> pud_lockptr() is mm->page_table_lock as of now.
> 
> I agree. I used llist_node/head in my implementation[1] and it works.
> I have an illustration at [2] to show the concept. Feel free to reuse the code.
> 
> 
> [1] https://lore.kernel.org/all/20200928193428.GB30994@casper.infradead.org/
> [2] https://normal.zone/blog/2021-01-04-linux-1gb-thp-2/#new-mechanism
> 
> Best Regards,
> Yan, Zi



Ah, I should have looked at your patches more! I started working by just using lru
and list_add/list_del, which was of course corrupting the list, and it took me
way more time than I would like to admit to debug what was going on! The diagrams
in your 2nd link are really useful; I ended up drawing those by hand to debug
the corruption issue. I will point to that link in the next series :)

How about something like the below diff over this patch? (Not including the comment
changes that I will make everywhere.)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 26a38490ae2e1..3653e24ce97d7 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -99,6 +99,9 @@ struct page {
                                struct list_head buddy_list;
                                struct list_head pcp_list;
                                struct llist_node pcp_llist;
+
+                               /* PMD pagetable deposit head */
+                               struct llist_node pgtable_deposit_head;
                        };
                        struct address_space *mapping;
                        union {
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 2047558ddcd79..764f14d0afcbb 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -215,9 +215,7 @@ void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
 
        assert_spin_locked(pud_lockptr(mm, pudp));
 
-       /* Push onto stack using only lru.next as the link */
-       pmd_page->lru.next = (struct list_head *)pud_huge_pmd(pudp);
-       pud_huge_pmd(pudp) = pmd_page;
+       llist_add(&pmd_page->pgtable_deposit_head, (struct llist_head *)&pud_huge_pmd(pudp));
 }
 
 /*
@@ -227,16 +225,16 @@ void pgtable_trans_huge_pud_deposit(struct mm_struct *mm, pud_t *pudp,
  */
 pmd_t *pgtable_trans_huge_pud_withdraw(struct mm_struct *mm, pud_t *pudp)
 {
+       struct llist_node *node;
        pgtable_t pmd_page;
 
        assert_spin_locked(pud_lockptr(mm, pudp));
 
-       pmd_page = pud_huge_pmd(pudp);
-       if (!pmd_page)
+       node = llist_del_first((struct llist_head *)&pud_huge_pmd(pudp));
+       if (!node)
                return NULL;
 
-       /* Pop from stack - lru.next points to next PMD page (or NULL) */
-       pud_huge_pmd(pudp) = (pgtable_t)pmd_page->lru.next;
+       pmd_page = llist_entry(node, struct page, pgtable_deposit_head);
 
        return page_address(pmd_page);
 }
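
(The cast of &pud_huge_pmd(pudp) to struct llist_head * assumes the field
stays a single pointer-sized slot, since llist_head is just one pointer.
If that is too subtle, I can make the ptdesc field a proper struct
llist_head instead.)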

 Also, Zi, is it OK if I add your Co-developed-by on this patch in future revisions?
 I didn't want to do that without your explicit approval.
Re: [RFC 01/12] mm: add PUD THP ptdesc and rmap support
Posted by Matthew Wilcox 2 days, 2 hours ago
On Tue, Feb 03, 2026 at 02:07:25PM -0800, Usama Arif wrote:
> Ah, I should have looked at your patches more! I started working by just using lru
> and list_add/list_del, which was of course corrupting the list, and it took me
> way more time than I would like to admit to debug what was going on! The diagrams
> in your 2nd link are really useful; I ended up drawing those by hand to debug
> the corruption issue. I will point to that link in the next series :)
> 
> How about something like the below diff over this patch? (Not including the comment
> changes that I will make everywhere.)

Why are you even talking about "the next series"?  The approach is
wrong.  You need to put this POC aside and solve the problems that
you've bypassed to create this POC.
Re: [RFC 01/12] mm: add PUD THP ptdesc and rmap support
Posted by Matthew Wilcox 2 days, 2 hours ago
On Thu, Feb 05, 2026 at 04:17:19AM +0000, Matthew Wilcox wrote:
> Why are you even talking about "the next series"?  The approach is
> wrong.  You need to put this POC aside and solve the problems that
> you've bypassed to create this POC.

... and gmail is rejecting this email as being spam.  You need to stop
using gmail for kernel development work.
Re: [RFC 01/12] mm: add PUD THP ptdesc and rmap support
Posted by Usama Arif 2 days, 1 hour ago

On 04/02/2026 20:21, Matthew Wilcox wrote:
> On Thu, Feb 05, 2026 at 04:17:19AM +0000, Matthew Wilcox wrote:
>> Why are you even talking about "the next series"?  The approach is
>> wrong.  You need to put this POC aside and solve the problems that
>> you've bypassed to create this POC.


Ah, is the issue the code duplication that Lorenzo has raised (I of course
completely agree that there is quite a bit), the lru.next patch I did,
which hopefully [1] makes better, or investigating whether it might be
interfering with DAX/VFIO, as Lorenzo pointed out (I will of course
investigate before sending the next revision)? The mapcount work
(I think David is working on this?) that is needed to allow splitting
PUDs to PMDs is a completely separate issue and can be tackled in parallel
to this.

> 
> ... and gmail is rejecting this email as being spam.  You need to stop
> using gmail for kernel development work.

I asked a couple of folks now and it seems they got it without any issue.
I have used it for a long time. I will try and see if something has changed.

[1] https://lore.kernel.org/all/05d5918f-b61b-4091-b8c6-20eebfffc3c4@gmail.com/
Re: [RFC 01/12] mm: add PUD THP ptdesc and rmap support
Posted by David Hildenbrand (Arm) 1 day, 13 hours ago
On 2/5/26 06:13, Usama Arif wrote:
> 
> 
> On 04/02/2026 20:21, Matthew Wilcox wrote:
>> On Thu, Feb 05, 2026 at 04:17:19AM +0000, Matthew Wilcox wrote:
>>> Why are you even talking about "the next series"?  The approach is
>>> wrong.  You need to put this POC aside and solve the problems that
>>> you've bypassed to create this POC.
> 
> 
> Ah, is the issue the code duplication that Lorenzo has raised (I of course
> completely agree that there is quite a bit), the lru.next patch I did,
> which hopefully [1] makes better, or investigating whether it might be
> interfering with DAX/VFIO, as Lorenzo pointed out (I will of course
> investigate before sending the next revision)? The mapcount work
> (I think David is working on this?) that is needed to allow splitting
> PUDs to PMDs is a completely separate issue and can be tackled in parallel
> to this.

I would enjoy seeing an investigation where we see what might have to be 
done to avoid preallocating page tables for anonymous memory THPs, and 
instead, try allocating them on demand when remapping. If allocation 
fails, it's just another -ENOMEM or -EAGAIN.

That would not only reduce the page table overhead when using THPs, it 
would also avoid the preallocation of two levels like you need here.

Maybe it's doable, maybe not.

Last time I looked into it I was like "there must be a better way to 
achieve that" :)

Spinlocks might require preallocating etc.

(as raised elsewhere, starting with shmem support avoids the page table
problem)

> 
>>
>> ... and gmail is rejecting this email as being spam.  You need to stop
>> using gmail for kernel development work.
> 
> I asked a couple of folks now and it seems they got it without any issue.
> I have used it for a long time. I will try and see if something has changed.

Gmail is absolutely horrible for upstream development. For example, 
linux-mm recently un-subscribed all gmail addresses.

When I moved to my kernel.org address I thought using gmail as a backend
would be a great choice. I was wrong: I kept getting daily bounce
notifications from MLs (even though my spam-filter rules essentially
allowed everything). So I moved to something else (I now pay 3 Euro a
month, omg! :) ).

-- 
Cheers,

David
Re: [RFC 01/12] mm: add PUD THP ptdesc and rmap support
Posted by Usama Arif 1 day, 13 hours ago

On 05/02/2026 09:40, David Hildenbrand (Arm) wrote:
> On 2/5/26 06:13, Usama Arif wrote:
>>
>>
>> On 04/02/2026 20:21, Matthew Wilcox wrote:
>>> On Thu, Feb 05, 2026 at 04:17:19AM +0000, Matthew Wilcox wrote:
>>>> Why are you even talking about "the next series"?  The approach is
>>>> wrong.  You need to put this POC aside and solve the problems that
>>>> you've bypassed to create this POC.
>>
>>
>> Ah, is the issue the code duplication that Lorenzo has raised (I of course
>> completely agree that there is quite a bit), the lru.next patch I did,
>> which hopefully [1] makes better, or investigating whether it might be
>> interfering with DAX/VFIO, as Lorenzo pointed out (I will of course
>> investigate before sending the next revision)? The mapcount work
>> (I think David is working on this?) that is needed to allow splitting
>> PUDs to PMDs is a completely separate issue and can be tackled in parallel
>> to this.
> 
> I would enjoy seeing an investigation where we see what might have to be done to avoid preallocating page tables for anonymous memory THPs, and instead, try allocating them on demand when remapping. If allocation fails, it's just another -ENOMEM or -EAGAIN.
> 
> That would not only reduce the page table overhead when using THPs, it would also avoid the preallocation of two levels like you need here.
> 
> Maybe it's doable, maybe not.
> 
> Last time I looked into it I was like "there must be a better way to achieve that" :)
> 
> Spinlocks might require preallocating etc.


Thanks for this! I am going to try to implement this now and stress test it for 2M THPs as well.
I have access to some production workloads that use a lot of THPs, and I can add
counters to see how often this even happens in prod workloads, i.e. how often page table
allocation fails for 2M THPs if it is done on demand instead of preallocated.
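
Very roughly, the shape I have in mind for the on-demand path plus the
counter (just a sketch, the counter name is a placeholder, not real code
yet):

	/* in the split path, instead of withdrawing a deposited table */
	static pgtable_t split_alloc_pgtable(struct mm_struct *mm)
	{
		pgtable_t pgtable = pte_alloc_one(mm);

		if (!pgtable)
			count_vm_event(THP_SPLIT_PGTABLE_ALLOC_FAIL); /* new counter */

		/* NULL means the caller bails out with -ENOMEM/-EAGAIN */
		return pgtable;
	}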

> 
> (as raised elsewhere, starting with shmem support avoids the page table problem)
> 





Re: [RFC 01/12] mm: add PUD THP ptdesc and rmap support
Posted by Usama Arif 1 day, 12 hours ago
>>
>> (as raised elsewhere, starting with shmem support avoids the page table problem)
>>
> 

Also, forgot to add here: I will look into this (starting with shmem) before PUD anon THPs.