[v2 02/11] mm/thp: zone_device awareness in THP handling code

Balbir Singh posted 11 patches 2 months, 1 week ago
Posted by Balbir Singh 2 months, 1 week ago
Make the THP handling code in the mm subsystem aware of zone device
pages. Although the code is designed to be generic when it comes to
handling the splitting of pages, it currently works only for THP page
sizes corresponding to HPAGE_PMD_NR.

Modify page_vma_mapped_walk() to return true when a zone device huge
entry is present, enabling try_to_migrate() and other migration code
paths to process the entry appropriately. page_vma_mapped_walk() returns
true for zone device private large folios only when
PVMW_THP_DEVICE_PRIVATE is passed, so that call sites that never deal
with zone device private pages do not need to add any awareness. The key
callback that needs this flag is try_to_migrate_one(). The other
callbacks, page idle and damon, use the walk to set young/dirty bits,
which is not significant for pmd level bit harvesting.
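
For reference, the opt-in in try_to_migrate_one() is simply the extra
walk flag (taken from the mm/rmap.c hunk below); page_vma_mapped_walk()
then returns with the pmd lock held when it finds a device private pmd
entry:

	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address,
				PVMW_THP_DEVICE_PRIVATE);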

pmd_pfn() does not work well with zone device entries; for such entries,
obtain the pfn from the swap entry instead (via pmd_to_swp_entry() and
swp_offset_pfn()) for checking and comparison.
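
As an illustration only (this helper is hypothetical and not part of the
patch; the real code open-codes the lookup in try_to_migrate_one()
below), the pfn lookup amounts to:

	/* Sketch: pfn of a non-present device private huge pmd entry */
	static unsigned long device_private_pmd_pfn(pmd_t pmd)
	{
		swp_entry_t entry = pmd_to_swp_entry(pmd);

		VM_WARN_ON(!is_device_private_entry(entry));
		return swp_offset_pfn(entry);
	}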

Zone device private entries that are split via munmap go through a pmd
split, but also need a folio split. Deferred split does not work if a
fault is encountered afterwards, because fault handling involves
migration entries (via folio_migrate_mapping) and the folio sizes are
expected to match there. This introduces the need to split the folio
while handling the pmd split. Because the folio is still mapped and
calling folio_split() would cause lock recursion, the
__split_unmapped_folio() code is used via a new wrapper,
split_device_private_folio(), which skips the checks around
folio->mapping and the swapcache as well as the unmap/remap of the
folio.
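
In outline, the sequence used by __split_huge_pmd_locked() below is as
follows; for_each_new_order0_folio() is shorthand for the folio_next()
loop in the patch, not a real helper:

	folio_lock(folio);
	folio_get(folio);
	split_device_private_folio(folio);
	for_each_new_order0_folio(folio, new_folio) {
		/* addr advances by PAGE_SIZE per new order-0 folio */
		folio_unlock(new_folio);
		folio_add_anon_rmap_ptes(new_folio, &new_folio->page, 1,
					 vma, addr, rmap_flags);
	}
	folio_unlock(folio);
	folio_add_anon_rmap_ptes(folio, &folio->page, 1, vma, haddr,
				 rmap_flags);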

Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>

Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
 include/linux/huge_mm.h |   1 +
 include/linux/rmap.h    |   2 +
 include/linux/swapops.h |  17 +++
 mm/huge_memory.c        | 268 +++++++++++++++++++++++++++++++++-------
 mm/page_vma_mapped.c    |  13 +-
 mm/pgtable-generic.c    |   6 +
 mm/rmap.c               |  22 +++-
 7 files changed, 278 insertions(+), 51 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 7748489fde1b..2a6f5ff7bca3 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -345,6 +345,7 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
 bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
 int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 		unsigned int new_order);
+int split_device_private_folio(struct folio *folio);
 int min_order_for_split(struct folio *folio);
 int split_folio_to_list(struct folio *folio, struct list_head *list);
 bool uniform_split_supported(struct folio *folio, unsigned int new_order,
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 20803fcb49a7..625f36dcc121 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -905,6 +905,8 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
 #define PVMW_SYNC		(1 << 0)
 /* Look for migration entries rather than present PTEs */
 #define PVMW_MIGRATION		(1 << 1)
+/* Look for device private THP entries */
+#define PVMW_THP_DEVICE_PRIVATE	(1 << 2)
 
 struct page_vma_mapped_walk {
 	unsigned long pfn;
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 64ea151a7ae3..2641c01bd5d2 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -563,6 +563,7 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
 {
 	return is_swap_pmd(pmd) && is_migration_entry(pmd_to_swp_entry(pmd));
 }
+
 #else  /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
 static inline int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
 		struct page *page)
@@ -594,6 +595,22 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
 }
 #endif  /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
 
+#if defined(CONFIG_ZONE_DEVICE) && defined(CONFIG_ARCH_ENABLE_THP_MIGRATION)
+
+static inline int is_pmd_device_private_entry(pmd_t pmd)
+{
+	return is_swap_pmd(pmd) && is_device_private_entry(pmd_to_swp_entry(pmd));
+}
+
+#else /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
+
+static inline int is_pmd_device_private_entry(pmd_t pmd)
+{
+	return 0;
+}
+
+#endif /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
+
 static inline int non_swap_entry(swp_entry_t entry)
 {
 	return swp_type(entry) >= MAX_SWAPFILES;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9c38a95e9f09..e373c6578894 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -72,6 +72,10 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
 					  struct shrink_control *sc);
 static unsigned long deferred_split_scan(struct shrinker *shrink,
 					 struct shrink_control *sc);
+static int __split_unmapped_folio(struct folio *folio, int new_order,
+		struct page *split_at, struct xa_state *xas,
+		struct address_space *mapping, bool uniform_split);
+
 static bool split_underused_thp = true;
 
 static atomic_t huge_zero_refcount;
@@ -1711,8 +1715,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 	if (unlikely(is_swap_pmd(pmd))) {
 		swp_entry_t entry = pmd_to_swp_entry(pmd);
 
-		VM_BUG_ON(!is_pmd_migration_entry(pmd));
-		if (!is_readable_migration_entry(entry)) {
+		VM_WARN_ON(!is_pmd_migration_entry(pmd) &&
+				!is_pmd_device_private_entry(pmd));
+
+		if (is_migration_entry(entry) &&
+			is_writable_migration_entry(entry)) {
 			entry = make_readable_migration_entry(
 							swp_offset(entry));
 			pmd = swp_entry_to_pmd(entry);
@@ -1722,6 +1729,32 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 				pmd = pmd_swp_mkuffd_wp(pmd);
 			set_pmd_at(src_mm, addr, src_pmd, pmd);
 		}
+
+		if (is_device_private_entry(entry)) {
+			if (is_writable_device_private_entry(entry)) {
+				entry = make_readable_device_private_entry(
+					swp_offset(entry));
+				pmd = swp_entry_to_pmd(entry);
+
+				if (pmd_swp_soft_dirty(*src_pmd))
+					pmd = pmd_swp_mksoft_dirty(pmd);
+				if (pmd_swp_uffd_wp(*src_pmd))
+					pmd = pmd_swp_mkuffd_wp(pmd);
+				set_pmd_at(src_mm, addr, src_pmd, pmd);
+			}
+
+			src_folio = pfn_swap_entry_folio(entry);
+			VM_WARN_ON(!folio_test_large(src_folio));
+
+			folio_get(src_folio);
+			/*
+			 * folio_try_dup_anon_rmap_pmd does not fail for
+			 * device private entries.
+			 */
+			VM_WARN_ON(folio_try_dup_anon_rmap_pmd(src_folio,
+					  &src_folio->page, dst_vma, src_vma));
+		}
+
 		add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
 		mm_inc_nr_ptes(dst_mm);
 		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
@@ -2219,15 +2252,22 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			folio_remove_rmap_pmd(folio, page, vma);
 			WARN_ON_ONCE(folio_mapcount(folio) < 0);
 			VM_BUG_ON_PAGE(!PageHead(page), page);
-		} else if (thp_migration_supported()) {
+		} else if (is_pmd_migration_entry(orig_pmd) ||
+				is_pmd_device_private_entry(orig_pmd)) {
 			swp_entry_t entry;
 
-			VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
 			entry = pmd_to_swp_entry(orig_pmd);
 			folio = pfn_swap_entry_folio(entry);
 			flush_needed = 0;
-		} else
-			WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
+
+			if (!thp_migration_supported())
+				WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
+
+			if (is_pmd_device_private_entry(orig_pmd)) {
+				folio_remove_rmap_pmd(folio, &folio->page, vma);
+				WARN_ON_ONCE(folio_mapcount(folio) < 0);
+			}
+		}
 
 		if (folio_test_anon(folio)) {
 			zap_deposited_table(tlb->mm, pmd);
@@ -2247,6 +2287,15 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 				folio_mark_accessed(folio);
 		}
 
+		/*
+		 * Do a folio put on zone device private pages after
+		 * changes to mm_counter, because the folio_put() will
+		 * clean folio->mapping and the folio_test_anon() check
+		 * will not be usable.
+		 */
+		if (folio_is_device_private(folio))
+			folio_put(folio);
+
 		spin_unlock(ptl);
 		if (flush_needed)
 			tlb_remove_page_size(tlb, &folio->page, HPAGE_PMD_SIZE);
@@ -2375,7 +2424,8 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		struct folio *folio = pfn_swap_entry_folio(entry);
 		pmd_t newpmd;
 
-		VM_BUG_ON(!is_pmd_migration_entry(*pmd));
+		VM_WARN_ON(!is_pmd_migration_entry(*pmd) &&
+			   !folio_is_device_private(folio));
 		if (is_writable_migration_entry(entry)) {
 			/*
 			 * A protection check is difficult so
@@ -2388,6 +2438,10 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 			newpmd = swp_entry_to_pmd(entry);
 			if (pmd_swp_soft_dirty(*pmd))
 				newpmd = pmd_swp_mksoft_dirty(newpmd);
+		} else if (is_writable_device_private_entry(entry)) {
+			entry = make_readable_device_private_entry(
+							swp_offset(entry));
+			newpmd = swp_entry_to_pmd(entry);
 		} else {
 			newpmd = *pmd;
 		}
@@ -2834,6 +2888,44 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 	pmd_populate(mm, pmd, pgtable);
 }
 
+/**
+ * split_device_private_folio - split a huge device private folio into
+ * smaller pages (of order 0), currently used by migrate_device logic to
+ * split folios for pages that are partially mapped
+ *
+ * @folio: the folio to split
+ *
+ * The caller has to hold the folio_lock and a reference via folio_get
+ */
+int split_device_private_folio(struct folio *folio)
+{
+	struct folio *end_folio = folio_next(folio);
+	struct folio *new_folio;
+	int ret = 0;
+
+	/*
+	 * Split the folio now. In the case of device
+	 * private pages, this path is executed when
+	 * the pmd is split and since freeze is not true
+	 * it is likely the folio will be deferred_split.
+	 *
+	 * With device private pages, deferred splits of
+	 * folios should be handled here to prevent partial
+	 * unmaps from causing issues later on in migration
+	 * and fault handling flows.
+	 */
+	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
+	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
+	VM_WARN_ON(ret);
+	for (new_folio = folio_next(folio); new_folio != end_folio;
+					new_folio = folio_next(new_folio)) {
+		folio_ref_unfreeze(new_folio, 1 + folio_expected_ref_count(
+								new_folio));
+	}
+	folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));
+	return ret;
+}
+
 static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long haddr, bool freeze)
 {
@@ -2842,16 +2934,19 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	struct page *page;
 	pgtable_t pgtable;
 	pmd_t old_pmd, _pmd;
-	bool young, write, soft_dirty, pmd_migration = false, uffd_wp = false;
-	bool anon_exclusive = false, dirty = false;
+	bool young, write, soft_dirty, uffd_wp = false;
+	bool anon_exclusive = false, dirty = false, present = false;
 	unsigned long addr;
 	pte_t *pte;
 	int i;
+	swp_entry_t swp_entry;
 
 	VM_BUG_ON(haddr & ~HPAGE_PMD_MASK);
 	VM_BUG_ON_VMA(vma->vm_start > haddr, vma);
 	VM_BUG_ON_VMA(vma->vm_end < haddr + HPAGE_PMD_SIZE, vma);
-	VM_BUG_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd));
+
+	VM_WARN_ON(!is_pmd_migration_entry(*pmd) && !pmd_trans_huge(*pmd)
+			&& !(is_pmd_device_private_entry(*pmd)));
 
 	count_vm_event(THP_SPLIT_PMD);
 
@@ -2899,18 +2994,60 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		return __split_huge_zero_page_pmd(vma, haddr, pmd);
 	}
 
-	pmd_migration = is_pmd_migration_entry(*pmd);
-	if (unlikely(pmd_migration)) {
-		swp_entry_t entry;
 
+	present = pmd_present(*pmd);
+	if (unlikely(!present)) {
+		swp_entry = pmd_to_swp_entry(*pmd);
 		old_pmd = *pmd;
-		entry = pmd_to_swp_entry(old_pmd);
-		page = pfn_swap_entry_to_page(entry);
-		write = is_writable_migration_entry(entry);
-		if (PageAnon(page))
-			anon_exclusive = is_readable_exclusive_migration_entry(entry);
-		young = is_migration_entry_young(entry);
-		dirty = is_migration_entry_dirty(entry);
+
+		folio = pfn_swap_entry_folio(swp_entry);
+		VM_WARN_ON(!is_migration_entry(swp_entry) &&
+				!is_device_private_entry(swp_entry));
+		page = pfn_swap_entry_to_page(swp_entry);
+
+		if (is_pmd_migration_entry(old_pmd)) {
+			write = is_writable_migration_entry(swp_entry);
+			if (PageAnon(page))
+				anon_exclusive =
+					is_readable_exclusive_migration_entry(
+								swp_entry);
+			young = is_migration_entry_young(swp_entry);
+			dirty = is_migration_entry_dirty(swp_entry);
+		} else if (is_pmd_device_private_entry(old_pmd)) {
+			write = is_writable_device_private_entry(swp_entry);
+			anon_exclusive = PageAnonExclusive(page);
+			if (freeze && anon_exclusive &&
+			    folio_try_share_anon_rmap_pmd(folio, page))
+				freeze = false;
+			if (!freeze) {
+				rmap_t rmap_flags = RMAP_NONE;
+				unsigned long addr = haddr;
+				struct folio *new_folio;
+				struct folio *end_folio = folio_next(folio);
+
+				if (anon_exclusive)
+					rmap_flags |= RMAP_EXCLUSIVE;
+
+				folio_lock(folio);
+				folio_get(folio);
+
+				split_device_private_folio(folio);
+
+				for (new_folio = folio_next(folio);
+					new_folio != end_folio;
+					new_folio = folio_next(new_folio)) {
+					addr += PAGE_SIZE;
+					folio_unlock(new_folio);
+					folio_add_anon_rmap_ptes(new_folio,
+						&new_folio->page, 1,
+						vma, addr, rmap_flags);
+				}
+				folio_unlock(folio);
+				folio_add_anon_rmap_ptes(folio, &folio->page,
+						1, vma, haddr, rmap_flags);
+			}
+		}
+
 		soft_dirty = pmd_swp_soft_dirty(old_pmd);
 		uffd_wp = pmd_swp_uffd_wp(old_pmd);
 	} else {
@@ -2996,30 +3133,49 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	 * Note that NUMA hinting access restrictions are not transferred to
 	 * avoid any possibility of altering permissions across VMAs.
 	 */
-	if (freeze || pmd_migration) {
+	if (freeze || !present) {
 		for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
 			pte_t entry;
-			swp_entry_t swp_entry;
-
-			if (write)
-				swp_entry = make_writable_migration_entry(
-							page_to_pfn(page + i));
-			else if (anon_exclusive)
-				swp_entry = make_readable_exclusive_migration_entry(
-							page_to_pfn(page + i));
-			else
-				swp_entry = make_readable_migration_entry(
-							page_to_pfn(page + i));
-			if (young)
-				swp_entry = make_migration_entry_young(swp_entry);
-			if (dirty)
-				swp_entry = make_migration_entry_dirty(swp_entry);
-			entry = swp_entry_to_pte(swp_entry);
-			if (soft_dirty)
-				entry = pte_swp_mksoft_dirty(entry);
-			if (uffd_wp)
-				entry = pte_swp_mkuffd_wp(entry);
-
+			if (freeze || is_migration_entry(swp_entry)) {
+				if (write)
+					swp_entry = make_writable_migration_entry(
+								page_to_pfn(page + i));
+				else if (anon_exclusive)
+					swp_entry = make_readable_exclusive_migration_entry(
+								page_to_pfn(page + i));
+				else
+					swp_entry = make_readable_migration_entry(
+								page_to_pfn(page + i));
+				if (young)
+					swp_entry = make_migration_entry_young(swp_entry);
+				if (dirty)
+					swp_entry = make_migration_entry_dirty(swp_entry);
+				entry = swp_entry_to_pte(swp_entry);
+				if (soft_dirty)
+					entry = pte_swp_mksoft_dirty(entry);
+				if (uffd_wp)
+					entry = pte_swp_mkuffd_wp(entry);
+			} else {
+				/*
+				 * anon_exclusive was already propagated to the relevant
+				 * pages corresponding to the pte entries when freeze
+				 * is false.
+				 */
+				if (write)
+					swp_entry = make_writable_device_private_entry(
+								page_to_pfn(page + i));
+				else
+					swp_entry = make_readable_device_private_entry(
+								page_to_pfn(page + i));
+				/*
+				 * Young and dirty bits are not propagated via swp_entry
+				 */
+				entry = swp_entry_to_pte(swp_entry);
+				if (soft_dirty)
+					entry = pte_swp_mksoft_dirty(entry);
+				if (uffd_wp)
+					entry = pte_swp_mkuffd_wp(entry);
+			}
 			VM_WARN_ON(!pte_none(ptep_get(pte + i)));
 			set_pte_at(mm, addr, pte + i, entry);
 		}
@@ -3046,7 +3202,7 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 	}
 	pte_unmap(pte);
 
-	if (!pmd_migration)
+	if (present)
 		folio_remove_rmap_pmd(folio, page, vma);
 	if (freeze)
 		put_page(page);
@@ -3058,8 +3214,10 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 void split_huge_pmd_locked(struct vm_area_struct *vma, unsigned long address,
 			   pmd_t *pmd, bool freeze)
 {
+
 	VM_WARN_ON_ONCE(!IS_ALIGNED(address, HPAGE_PMD_SIZE));
-	if (pmd_trans_huge(*pmd) || is_pmd_migration_entry(*pmd))
+	if (pmd_trans_huge(*pmd) || is_pmd_migration_entry(*pmd) ||
+			(is_pmd_device_private_entry(*pmd)))
 		__split_huge_pmd_locked(vma, pmd, address, freeze);
 }
 
@@ -3238,6 +3396,9 @@ static void lru_add_split_folio(struct folio *folio, struct folio *new_folio,
 	VM_BUG_ON_FOLIO(folio_test_lru(new_folio), folio);
 	lockdep_assert_held(&lruvec->lru_lock);
 
+	if (folio_is_device_private(folio))
+		return;
+
 	if (list) {
 		/* page reclaim is reclaiming a huge page */
 		VM_WARN_ON(folio_test_lru(folio));
@@ -3252,6 +3413,7 @@ static void lru_add_split_folio(struct folio *folio, struct folio *new_folio,
 			list_add_tail(&new_folio->lru, &folio->lru);
 		folio_set_lru(new_folio);
 	}
+
 }
 
 /* Racy check whether the huge page can be split */
@@ -3727,7 +3889,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 
 	/* Prevent deferred_split_scan() touching ->_refcount */
 	spin_lock(&ds_queue->split_queue_lock);
-	if (folio_ref_freeze(folio, 1 + extra_pins)) {
+	if (folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio))) {
 		struct address_space *swap_cache = NULL;
 		struct lruvec *lruvec;
 		int expected_refs;
@@ -4603,7 +4765,10 @@ int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
 		return 0;
 
 	flush_cache_range(vma, address, address + HPAGE_PMD_SIZE);
-	pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
+	if (unlikely(is_pmd_device_private_entry(*pvmw->pmd)))
+		pmdval = pmdp_huge_clear_flush(vma, address, pvmw->pmd);
+	else
+		pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
 
 	/* See folio_try_share_anon_rmap_pmd(): invalidate PMD first. */
 	anon_exclusive = folio_test_anon(folio) && PageAnonExclusive(page);
@@ -4653,6 +4818,17 @@ void remove_migration_pmd(struct page_vma_mapped_walk *pvmw, struct page *new)
 	entry = pmd_to_swp_entry(*pvmw->pmd);
 	folio_get(folio);
 	pmde = folio_mk_pmd(folio, READ_ONCE(vma->vm_page_prot));
+
+	if (folio_is_device_private(folio)) {
+		if (pmd_write(pmde))
+			entry = make_writable_device_private_entry(
+							page_to_pfn(new));
+		else
+			entry = make_readable_device_private_entry(
+							page_to_pfn(new));
+		pmde = swp_entry_to_pmd(entry);
+	}
+
 	if (pmd_swp_soft_dirty(*pvmw->pmd))
 		pmde = pmd_mksoft_dirty(pmde);
 	if (is_writable_migration_entry(entry))
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index e981a1a292d2..246e6c211f34 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -250,12 +250,11 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 			pvmw->ptl = pmd_lock(mm, pvmw->pmd);
 			pmde = *pvmw->pmd;
 			if (!pmd_present(pmde)) {
-				swp_entry_t entry;
+				swp_entry_t entry = pmd_to_swp_entry(pmde);
 
 				if (!thp_migration_supported() ||
 				    !(pvmw->flags & PVMW_MIGRATION))
 					return not_found(pvmw);
-				entry = pmd_to_swp_entry(pmde);
 				if (!is_migration_entry(entry) ||
 				    !check_pmd(swp_offset_pfn(entry), pvmw))
 					return not_found(pvmw);
@@ -277,6 +276,16 @@ bool page_vma_mapped_walk(struct page_vma_mapped_walk *pvmw)
 			 * cannot return prematurely, while zap_huge_pmd() has
 			 * cleared *pmd but not decremented compound_mapcount().
 			 */
+			swp_entry_t entry;
+
+			entry = pmd_to_swp_entry(pmde);
+
+			if (is_device_private_entry(entry) &&
+				(pvmw->flags & PVMW_THP_DEVICE_PRIVATE)) {
+				pvmw->ptl = pmd_lock(mm, pvmw->pmd);
+				return true;
+			}
+
 			if ((pvmw->flags & PVMW_SYNC) &&
 			    thp_vma_suitable_order(vma, pvmw->address,
 						   PMD_ORDER) &&
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 567e2d084071..604e8206a2ec 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -292,6 +292,12 @@ pte_t *___pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp)
 		*pmdvalp = pmdval;
 	if (unlikely(pmd_none(pmdval) || is_pmd_migration_entry(pmdval)))
 		goto nomap;
+	if (is_swap_pmd(pmdval)) {
+		swp_entry_t entry = pmd_to_swp_entry(pmdval);
+
+		if (is_device_private_entry(entry))
+			goto nomap;
+	}
 	if (unlikely(pmd_trans_huge(pmdval)))
 		goto nomap;
 	if (unlikely(pmd_bad(pmdval))) {
diff --git a/mm/rmap.c b/mm/rmap.c
index f93ce27132ab..5c5c1c777ce3 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2281,7 +2281,8 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 		     unsigned long address, void *arg)
 {
 	struct mm_struct *mm = vma->vm_mm;
-	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
+	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address,
+				PVMW_THP_DEVICE_PRIVATE);
 	bool anon_exclusive, writable, ret = true;
 	pte_t pteval;
 	struct page *subpage;
@@ -2326,6 +2327,8 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 	while (page_vma_mapped_walk(&pvmw)) {
 		/* PMD-mapped THP migration entry */
 		if (!pvmw.pte) {
+			unsigned long pfn;
+
 			if (flags & TTU_SPLIT_HUGE_PMD) {
 				split_huge_pmd_locked(vma, pvmw.address,
 						      pvmw.pmd, true);
@@ -2334,8 +2337,21 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
 				break;
 			}
 #ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
-			subpage = folio_page(folio,
-				pmd_pfn(*pvmw.pmd) - folio_pfn(folio));
+			/*
+			 * Zone device private folios do not work well with
+			 * pmd_pfn() on some architectures due to pte
+			 * inversion.
+			 */
+			if (is_pmd_device_private_entry(*pvmw.pmd)) {
+				swp_entry_t entry = pmd_to_swp_entry(*pvmw.pmd);
+
+				pfn = swp_offset_pfn(entry);
+			} else {
+				pfn = pmd_pfn(*pvmw.pmd);
+			}
+
+			subpage = folio_page(folio, pfn - folio_pfn(folio));
+
 			VM_BUG_ON_FOLIO(folio_test_hugetlb(folio) ||
 					!folio_test_pmd_mappable(folio), folio);
 
-- 
2.50.1

Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by kernel test robot 2 months, 1 week ago
Hi Balbir,

kernel test robot noticed the following build warnings:

[auto build test WARNING on akpm-mm/mm-everything]
[also build test WARNING on next-20250730]
[cannot apply to akpm-mm/mm-nonmm-unstable shuah-kselftest/next shuah-kselftest/fixes linus/master v6.16]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Balbir-Singh/mm-zone_device-support-large-zone-device-private-folios/20250730-172600
base:   https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link:    https://lore.kernel.org/r/20250730092139.3890844-3-balbirs%40nvidia.com
patch subject: [v2 02/11] mm/thp: zone_device awareness in THP handling code
config: i386-buildonly-randconfig-001-20250731 (https://download.01.org/0day-ci/archive/20250731/202507310343.ZipoyitU-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14+deb12u1) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250731/202507310343.ZipoyitU-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507310343.ZipoyitU-lkp@intel.com/

All warnings (new ones prefixed by >>):

   mm/rmap.c: In function 'try_to_migrate_one':
>> mm/rmap.c:2330:39: warning: unused variable 'pfn' [-Wunused-variable]
    2330 |                         unsigned long pfn;
         |                                       ^~~


vim +/pfn +2330 mm/rmap.c

  2273	
  2274	/*
  2275	 * @arg: enum ttu_flags will be passed to this argument.
  2276	 *
  2277	 * If TTU_SPLIT_HUGE_PMD is specified any PMD mappings will be split into PTEs
  2278	 * containing migration entries.
  2279	 */
  2280	static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
  2281			     unsigned long address, void *arg)
  2282	{
  2283		struct mm_struct *mm = vma->vm_mm;
  2284		DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address,
  2285					PVMW_THP_DEVICE_PRIVATE);
  2286		bool anon_exclusive, writable, ret = true;
  2287		pte_t pteval;
  2288		struct page *subpage;
  2289		struct mmu_notifier_range range;
  2290		enum ttu_flags flags = (enum ttu_flags)(long)arg;
  2291		unsigned long pfn;
  2292		unsigned long hsz = 0;
  2293	
  2294		/*
  2295		 * When racing against e.g. zap_pte_range() on another cpu,
  2296		 * in between its ptep_get_and_clear_full() and folio_remove_rmap_*(),
  2297		 * try_to_migrate() may return before page_mapped() has become false,
  2298		 * if page table locking is skipped: use TTU_SYNC to wait for that.
  2299		 */
  2300		if (flags & TTU_SYNC)
  2301			pvmw.flags = PVMW_SYNC;
  2302	
  2303		/*
  2304		 * For THP, we have to assume the worse case ie pmd for invalidation.
  2305		 * For hugetlb, it could be much worse if we need to do pud
  2306		 * invalidation in the case of pmd sharing.
  2307		 *
  2308		 * Note that the page can not be free in this function as call of
  2309		 * try_to_unmap() must hold a reference on the page.
  2310		 */
  2311		range.end = vma_address_end(&pvmw);
  2312		mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
  2313					address, range.end);
  2314		if (folio_test_hugetlb(folio)) {
  2315			/*
  2316			 * If sharing is possible, start and end will be adjusted
  2317			 * accordingly.
  2318			 */
  2319			adjust_range_if_pmd_sharing_possible(vma, &range.start,
  2320							     &range.end);
  2321	
  2322			/* We need the huge page size for set_huge_pte_at() */
  2323			hsz = huge_page_size(hstate_vma(vma));
  2324		}
  2325		mmu_notifier_invalidate_range_start(&range);
  2326	
  2327		while (page_vma_mapped_walk(&pvmw)) {
  2328			/* PMD-mapped THP migration entry */
  2329			if (!pvmw.pte) {
> 2330				unsigned long pfn;
  2331	
  2332				if (flags & TTU_SPLIT_HUGE_PMD) {
  2333					split_huge_pmd_locked(vma, pvmw.address,
  2334							      pvmw.pmd, true);
  2335					ret = false;
  2336					page_vma_mapped_walk_done(&pvmw);
  2337					break;
  2338				}
  2339	#ifdef CONFIG_ARCH_ENABLE_THP_MIGRATION
  2340				/*
  2341				 * Zone device private folios do not work well with
  2342				 * pmd_pfn() on some architectures due to pte
  2343				 * inversion.
  2344				 */
  2345				if (is_pmd_device_private_entry(*pvmw.pmd)) {
  2346					swp_entry_t entry = pmd_to_swp_entry(*pvmw.pmd);
  2347	
  2348					pfn = swp_offset_pfn(entry);
  2349				} else {
  2350					pfn = pmd_pfn(*pvmw.pmd);
  2351				}
  2352	
  2353				subpage = folio_page(folio, pfn - folio_pfn(folio));
  2354	
  2355				VM_BUG_ON_FOLIO(folio_test_hugetlb(folio) ||
  2356						!folio_test_pmd_mappable(folio), folio);
  2357	
  2358				if (set_pmd_migration_entry(&pvmw, subpage)) {
  2359					ret = false;
  2360					page_vma_mapped_walk_done(&pvmw);
  2361					break;
  2362				}
  2363				continue;
  2364	#endif
  2365			}
  2366	
  2367			/* Unexpected PMD-mapped THP? */
  2368			VM_BUG_ON_FOLIO(!pvmw.pte, folio);
  2369	
  2370			/*
  2371			 * Handle PFN swap PTEs, such as device-exclusive ones, that
  2372			 * actually map pages.
  2373			 */
  2374			pteval = ptep_get(pvmw.pte);
  2375			if (likely(pte_present(pteval))) {
  2376				pfn = pte_pfn(pteval);
  2377			} else {
  2378				pfn = swp_offset_pfn(pte_to_swp_entry(pteval));
  2379				VM_WARN_ON_FOLIO(folio_test_hugetlb(folio), folio);
  2380			}
  2381	
  2382			subpage = folio_page(folio, pfn - folio_pfn(folio));
  2383			address = pvmw.address;
  2384			anon_exclusive = folio_test_anon(folio) &&
  2385					 PageAnonExclusive(subpage);
  2386	
  2387			if (folio_test_hugetlb(folio)) {
  2388				bool anon = folio_test_anon(folio);
  2389	
  2390				/*
  2391				 * huge_pmd_unshare may unmap an entire PMD page.
  2392				 * There is no way of knowing exactly which PMDs may
  2393				 * be cached for this mm, so we must flush them all.
  2394				 * start/end were already adjusted above to cover this
  2395				 * range.
  2396				 */
  2397				flush_cache_range(vma, range.start, range.end);
  2398	
  2399				/*
  2400				 * To call huge_pmd_unshare, i_mmap_rwsem must be
  2401				 * held in write mode.  Caller needs to explicitly
  2402				 * do this outside rmap routines.
  2403				 *
  2404				 * We also must hold hugetlb vma_lock in write mode.
  2405				 * Lock order dictates acquiring vma_lock BEFORE
  2406				 * i_mmap_rwsem.  We can only try lock here and
  2407				 * fail if unsuccessful.
  2408				 */
  2409				if (!anon) {
  2410					VM_BUG_ON(!(flags & TTU_RMAP_LOCKED));
  2411					if (!hugetlb_vma_trylock_write(vma)) {
  2412						page_vma_mapped_walk_done(&pvmw);
  2413						ret = false;
  2414						break;
  2415					}
  2416					if (huge_pmd_unshare(mm, vma, address, pvmw.pte)) {
  2417						hugetlb_vma_unlock_write(vma);
  2418						flush_tlb_range(vma,
  2419							range.start, range.end);
  2420	
  2421						/*
  2422						 * The ref count of the PMD page was
  2423						 * dropped which is part of the way map
  2424						 * counting is done for shared PMDs.
  2425						 * Return 'true' here.  When there is
  2426						 * no other sharing, huge_pmd_unshare
  2427						 * returns false and we will unmap the
  2428						 * actual page and drop map count
  2429						 * to zero.
  2430						 */
  2431						page_vma_mapped_walk_done(&pvmw);
  2432						break;
  2433					}
  2434					hugetlb_vma_unlock_write(vma);
  2435				}
  2436				/* Nuke the hugetlb page table entry */
  2437				pteval = huge_ptep_clear_flush(vma, address, pvmw.pte);
  2438				if (pte_dirty(pteval))
  2439					folio_mark_dirty(folio);
  2440				writable = pte_write(pteval);
  2441			} else if (likely(pte_present(pteval))) {
  2442				flush_cache_page(vma, address, pfn);
  2443				/* Nuke the page table entry. */
  2444				if (should_defer_flush(mm, flags)) {
  2445					/*
  2446					 * We clear the PTE but do not flush so potentially
  2447					 * a remote CPU could still be writing to the folio.
  2448					 * If the entry was previously clean then the
  2449					 * architecture must guarantee that a clear->dirty
  2450					 * transition on a cached TLB entry is written through
  2451					 * and traps if the PTE is unmapped.
  2452					 */
  2453					pteval = ptep_get_and_clear(mm, address, pvmw.pte);
  2454	
  2455					set_tlb_ubc_flush_pending(mm, pteval, address, address + PAGE_SIZE);
  2456				} else {
  2457					pteval = ptep_clear_flush(vma, address, pvmw.pte);
  2458				}
  2459				if (pte_dirty(pteval))
  2460					folio_mark_dirty(folio);
  2461				writable = pte_write(pteval);
  2462			} else {
  2463				pte_clear(mm, address, pvmw.pte);
  2464				writable = is_writable_device_private_entry(pte_to_swp_entry(pteval));
  2465			}
  2466	
  2467			VM_WARN_ON_FOLIO(writable && folio_test_anon(folio) &&
  2468					!anon_exclusive, folio);
  2469	
  2470			/* Update high watermark before we lower rss */
  2471			update_hiwater_rss(mm);
  2472	
  2473			if (PageHWPoison(subpage)) {
  2474				VM_WARN_ON_FOLIO(folio_is_device_private(folio), folio);
  2475	
  2476				pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
  2477				if (folio_test_hugetlb(folio)) {
  2478					hugetlb_count_sub(folio_nr_pages(folio), mm);
  2479					set_huge_pte_at(mm, address, pvmw.pte, pteval,
  2480							hsz);
  2481				} else {
  2482					dec_mm_counter(mm, mm_counter(folio));
  2483					set_pte_at(mm, address, pvmw.pte, pteval);
  2484				}
  2485			} else if (likely(pte_present(pteval)) && pte_unused(pteval) &&
  2486				   !userfaultfd_armed(vma)) {
  2487				/*
  2488				 * The guest indicated that the page content is of no
  2489				 * interest anymore. Simply discard the pte, vmscan
  2490				 * will take care of the rest.
  2491				 * A future reference will then fault in a new zero
  2492				 * page. When userfaultfd is active, we must not drop
  2493				 * this page though, as its main user (postcopy
  2494				 * migration) will not expect userfaults on already
  2495				 * copied pages.
  2496				 */
  2497				dec_mm_counter(mm, mm_counter(folio));
  2498			} else {
  2499				swp_entry_t entry;
  2500				pte_t swp_pte;
  2501	
  2502				/*
  2503				 * arch_unmap_one() is expected to be a NOP on
  2504				 * architectures where we could have PFN swap PTEs,
  2505				 * so we'll not check/care.
  2506				 */
  2507				if (arch_unmap_one(mm, vma, address, pteval) < 0) {
  2508					if (folio_test_hugetlb(folio))
  2509						set_huge_pte_at(mm, address, pvmw.pte,
  2510								pteval, hsz);
  2511					else
  2512						set_pte_at(mm, address, pvmw.pte, pteval);
  2513					ret = false;
  2514					page_vma_mapped_walk_done(&pvmw);
  2515					break;
  2516				}
  2517	
  2518				/* See folio_try_share_anon_rmap_pte(): clear PTE first. */
  2519				if (folio_test_hugetlb(folio)) {
  2520					if (anon_exclusive &&
  2521					    hugetlb_try_share_anon_rmap(folio)) {
  2522						set_huge_pte_at(mm, address, pvmw.pte,
  2523								pteval, hsz);
  2524						ret = false;
  2525						page_vma_mapped_walk_done(&pvmw);
  2526						break;
  2527					}
  2528				} else if (anon_exclusive &&
  2529					   folio_try_share_anon_rmap_pte(folio, subpage)) {
  2530					set_pte_at(mm, address, pvmw.pte, pteval);
  2531					ret = false;
  2532					page_vma_mapped_walk_done(&pvmw);
  2533					break;
  2534				}
  2535	
  2536				/*
  2537				 * Store the pfn of the page in a special migration
  2538				 * pte. do_swap_page() will wait until the migration
  2539				 * pte is removed and then restart fault handling.
  2540				 */
  2541				if (writable)
  2542					entry = make_writable_migration_entry(
  2543								page_to_pfn(subpage));
  2544				else if (anon_exclusive)
  2545					entry = make_readable_exclusive_migration_entry(
  2546								page_to_pfn(subpage));
  2547				else
  2548					entry = make_readable_migration_entry(
  2549								page_to_pfn(subpage));
  2550				if (likely(pte_present(pteval))) {
  2551					if (pte_young(pteval))
  2552						entry = make_migration_entry_young(entry);
  2553					if (pte_dirty(pteval))
  2554						entry = make_migration_entry_dirty(entry);
  2555					swp_pte = swp_entry_to_pte(entry);
  2556					if (pte_soft_dirty(pteval))
  2557						swp_pte = pte_swp_mksoft_dirty(swp_pte);
  2558					if (pte_uffd_wp(pteval))
  2559						swp_pte = pte_swp_mkuffd_wp(swp_pte);
  2560				} else {
  2561					swp_pte = swp_entry_to_pte(entry);
  2562					if (pte_swp_soft_dirty(pteval))
  2563						swp_pte = pte_swp_mksoft_dirty(swp_pte);
  2564					if (pte_swp_uffd_wp(pteval))
  2565						swp_pte = pte_swp_mkuffd_wp(swp_pte);
  2566				}
  2567				if (folio_test_hugetlb(folio))
  2568					set_huge_pte_at(mm, address, pvmw.pte, swp_pte,
  2569							hsz);
  2570				else
  2571					set_pte_at(mm, address, pvmw.pte, swp_pte);
  2572				trace_set_migration_pte(address, pte_val(swp_pte),
  2573							folio_order(folio));
  2574				/*
  2575				 * No need to invalidate here it will synchronize on
  2576				 * against the special swap migration pte.
  2577				 */
  2578			}
  2579	
  2580			if (unlikely(folio_test_hugetlb(folio)))
  2581				hugetlb_remove_rmap(folio);
  2582			else
  2583				folio_remove_rmap_pte(folio, subpage, vma);
  2584			if (vma->vm_flags & VM_LOCKED)
  2585				mlock_drain_local();
  2586			folio_put(folio);
  2587		}
  2588	
  2589		mmu_notifier_invalidate_range_end(&range);
  2590	
  2591		return ret;
  2592	}
  2593	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
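
The warning comes from the W=1 i386 build, where
CONFIG_ARCH_ENABLE_THP_MIGRATION is not set: the inner "unsigned long
pfn;" added at line 2330 shadows the function-scope pfn declared at line
2291 and is only referenced inside the #ifdef block. One possible way to
resolve it, as a rough sketch rather than a tested follow-up patch, is
to drop the inner declaration and reuse the existing variable:

	--- a/mm/rmap.c
	+++ b/mm/rmap.c
	@@ -2327,8 +2327,6 @@ static bool try_to_migrate_one(struct folio *folio, struct vm_area_struct *vma,
	 	while (page_vma_mapped_walk(&pvmw)) {
	 		/* PMD-mapped THP migration entry */
	 		if (!pvmw.pte) {
	-			unsigned long pfn;
	-
	 			if (flags & TTU_SPLIT_HUGE_PMD) {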
Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by Mika Penttilä 2 months, 1 week ago
Hi,

On 7/30/25 12:21, Balbir Singh wrote:
> Make THP handling code in the mm subsystem for THP pages aware of zone
> device pages. Although the code is designed to be generic when it comes
> to handling splitting of pages, the code is designed to work for THP
> page sizes corresponding to HPAGE_PMD_NR.
>
> Modify page_vma_mapped_walk() to return true when a zone device huge
> entry is present, enabling try_to_migrate() and other code migration
> paths to appropriately process the entry. page_vma_mapped_walk() will
> return true for zone device private large folios only when
> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
> not zone device private pages from having to add awareness. The key
> callback that needs this flag is try_to_migrate_one(). The other
> callbacks page idle, damon use it for setting young/dirty bits, which is
> not significant when it comes to pmd level bit harvesting.
>
> pmd_pfn() does not work well with zone device entries, use
> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
> entries.
>
> Zone device private entries when split via munmap go through pmd split,
> but need to go through a folio split, deferred split does not work if a
> fault is encountered because fault handling involves migration entries
> (via folio_migrate_mapping) and the folio sizes are expected to be the
> same there. This introduces the need to split the folio while handling
> the pmd split. Because the folio is still mapped, but calling
> folio_split() will cause lock recursion, the __split_unmapped_folio()
> code is used with a new helper to wrap the code
> split_device_private_folio(), which skips the checks around
> folio->mapping, swapcache and the need to go through unmap and remap
> folio.
>
> Cc: Karol Herbst <kherbst@redhat.com>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
> Cc: Jane Chu <jane.chu@oracle.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Donet Tom <donettom@linux.ibm.com>
> Cc: Mika Penttilä <mpenttil@redhat.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Francois Dugast <francois.dugast@intel.com>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
>
> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> ---
>  include/linux/huge_mm.h |   1 +
>  include/linux/rmap.h    |   2 +
>  include/linux/swapops.h |  17 +++
>  mm/huge_memory.c        | 268 +++++++++++++++++++++++++++++++++-------
>  mm/page_vma_mapped.c    |  13 +-
>  mm/pgtable-generic.c    |   6 +
>  mm/rmap.c               |  22 +++-
>  7 files changed, 278 insertions(+), 51 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 7748489fde1b..2a6f5ff7bca3 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -345,6 +345,7 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>  int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>  		unsigned int new_order);
> +int split_device_private_folio(struct folio *folio);
>  int min_order_for_split(struct folio *folio);
>  int split_folio_to_list(struct folio *folio, struct list_head *list);
>  bool uniform_split_supported(struct folio *folio, unsigned int new_order,
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 20803fcb49a7..625f36dcc121 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -905,6 +905,8 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
>  #define PVMW_SYNC		(1 << 0)
>  /* Look for migration entries rather than present PTEs */
>  #define PVMW_MIGRATION		(1 << 1)
> +/* Look for device private THP entries */
> +#define PVMW_THP_DEVICE_PRIVATE	(1 << 2)
>  
>  struct page_vma_mapped_walk {
>  	unsigned long pfn;
> diff --git a/include/linux/swapops.h b/include/linux/swapops.h
> index 64ea151a7ae3..2641c01bd5d2 100644
> --- a/include/linux/swapops.h
> +++ b/include/linux/swapops.h
> @@ -563,6 +563,7 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
>  {
>  	return is_swap_pmd(pmd) && is_migration_entry(pmd_to_swp_entry(pmd));
>  }
> +
>  #else  /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
>  static inline int set_pmd_migration_entry(struct page_vma_mapped_walk *pvmw,
>  		struct page *page)
> @@ -594,6 +595,22 @@ static inline int is_pmd_migration_entry(pmd_t pmd)
>  }
>  #endif  /* CONFIG_ARCH_ENABLE_THP_MIGRATION */
>  
> +#if defined(CONFIG_ZONE_DEVICE) && defined(CONFIG_ARCH_ENABLE_THP_MIGRATION)
> +
> +static inline int is_pmd_device_private_entry(pmd_t pmd)
> +{
> +	return is_swap_pmd(pmd) && is_device_private_entry(pmd_to_swp_entry(pmd));
> +}
> +
> +#else /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
> +
> +static inline int is_pmd_device_private_entry(pmd_t pmd)
> +{
> +	return 0;
> +}
> +
> +#endif /* CONFIG_ZONE_DEVICE && CONFIG_ARCH_ENABLE_THP_MIGRATION */
> +
>  static inline int non_swap_entry(swp_entry_t entry)
>  {
>  	return swp_type(entry) >= MAX_SWAPFILES;
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 9c38a95e9f09..e373c6578894 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -72,6 +72,10 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
>  					  struct shrink_control *sc);
>  static unsigned long deferred_split_scan(struct shrinker *shrink,
>  					 struct shrink_control *sc);
> +static int __split_unmapped_folio(struct folio *folio, int new_order,
> +		struct page *split_at, struct xa_state *xas,
> +		struct address_space *mapping, bool uniform_split);
> +
>  static bool split_underused_thp = true;
>  
>  static atomic_t huge_zero_refcount;
> @@ -1711,8 +1715,11 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  	if (unlikely(is_swap_pmd(pmd))) {
>  		swp_entry_t entry = pmd_to_swp_entry(pmd);
>  
> -		VM_BUG_ON(!is_pmd_migration_entry(pmd));
> -		if (!is_readable_migration_entry(entry)) {
> +		VM_WARN_ON(!is_pmd_migration_entry(pmd) &&
> +				!is_pmd_device_private_entry(pmd));
> +
> +		if (is_migration_entry(entry) &&
> +			is_writable_migration_entry(entry)) {
>  			entry = make_readable_migration_entry(
>  							swp_offset(entry));
>  			pmd = swp_entry_to_pmd(entry);
> @@ -1722,6 +1729,32 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  				pmd = pmd_swp_mkuffd_wp(pmd);
>  			set_pmd_at(src_mm, addr, src_pmd, pmd);
>  		}
> +
> +		if (is_device_private_entry(entry)) {
> +			if (is_writable_device_private_entry(entry)) {
> +				entry = make_readable_device_private_entry(
> +					swp_offset(entry));
> +				pmd = swp_entry_to_pmd(entry);
> +
> +				if (pmd_swp_soft_dirty(*src_pmd))
> +					pmd = pmd_swp_mksoft_dirty(pmd);
> +				if (pmd_swp_uffd_wp(*src_pmd))
> +					pmd = pmd_swp_mkuffd_wp(pmd);
> +				set_pmd_at(src_mm, addr, src_pmd, pmd);
> +			}
> +
> +			src_folio = pfn_swap_entry_folio(entry);
> +			VM_WARN_ON(!folio_test_large(src_folio));
> +
> +			folio_get(src_folio);
> +			/*
> +			 * folio_try_dup_anon_rmap_pmd does not fail for
> +			 * device private entries.
> +			 */
> +			VM_WARN_ON(folio_try_dup_anon_rmap_pmd(src_folio,
> +					  &src_folio->page, dst_vma, src_vma));
> +		}
> +
>  		add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
>  		mm_inc_nr_ptes(dst_mm);
>  		pgtable_trans_huge_deposit(dst_mm, dst_pmd, pgtable);
> @@ -2219,15 +2252,22 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  			folio_remove_rmap_pmd(folio, page, vma);
>  			WARN_ON_ONCE(folio_mapcount(folio) < 0);
>  			VM_BUG_ON_PAGE(!PageHead(page), page);
> -		} else if (thp_migration_supported()) {
> +		} else if (is_pmd_migration_entry(orig_pmd) ||
> +				is_pmd_device_private_entry(orig_pmd)) {
>  			swp_entry_t entry;
>  
> -			VM_BUG_ON(!is_pmd_migration_entry(orig_pmd));
>  			entry = pmd_to_swp_entry(orig_pmd);
>  			folio = pfn_swap_entry_folio(entry);
>  			flush_needed = 0;
> -		} else
> -			WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
> +
> +			if (!thp_migration_supported())
> +				WARN_ONCE(1, "Non present huge pmd without pmd migration enabled!");
> +
> +			if (is_pmd_device_private_entry(orig_pmd)) {
> +				folio_remove_rmap_pmd(folio, &folio->page, vma);
> +				WARN_ON_ONCE(folio_mapcount(folio) < 0);
> +			}
> +		}
>  
>  		if (folio_test_anon(folio)) {
>  			zap_deposited_table(tlb->mm, pmd);
> @@ -2247,6 +2287,15 @@ int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  				folio_mark_accessed(folio);
>  		}
>  
> +		/*
> +		 * Do a folio put on zone device private pages after
> +		 * changes to mm_counter, because the folio_put() will
> +		 * clean folio->mapping and the folio_test_anon() check
> +		 * will not be usable.
> +		 */
> +		if (folio_is_device_private(folio))
> +			folio_put(folio);
> +
>  		spin_unlock(ptl);
>  		if (flush_needed)
>  			tlb_remove_page_size(tlb, &folio->page, HPAGE_PMD_SIZE);
> @@ -2375,7 +2424,8 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  		struct folio *folio = pfn_swap_entry_folio(entry);
>  		pmd_t newpmd;
>  
> -		VM_BUG_ON(!is_pmd_migration_entry(*pmd));
> +		VM_WARN_ON(!is_pmd_migration_entry(*pmd) &&
> +			   !folio_is_device_private(folio));
>  		if (is_writable_migration_entry(entry)) {
>  			/*
>  			 * A protection check is difficult so
> @@ -2388,6 +2438,10 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
>  			newpmd = swp_entry_to_pmd(entry);
>  			if (pmd_swp_soft_dirty(*pmd))
>  				newpmd = pmd_swp_mksoft_dirty(newpmd);
> +		} else if (is_writable_device_private_entry(entry)) {
> +			entry = make_readable_device_private_entry(
> +							swp_offset(entry));
> +			newpmd = swp_entry_to_pmd(entry);
>  		} else {
>  			newpmd = *pmd;
>  		}
> @@ -2834,6 +2888,44 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>  	pmd_populate(mm, pmd, pgtable);
>  }
>  
> +/**
> + * split_huge_device_private_folio - split a huge device private folio into
> + * smaller pages (of order 0), currently used by migrate_device logic to
> + * split folios for pages that are partially mapped
> + *
> + * @folio: the folio to split
> + *
> + * The caller has to hold the folio_lock and a reference via folio_get
> + */
> +int split_device_private_folio(struct folio *folio)
> +{
> +	struct folio *end_folio = folio_next(folio);
> +	struct folio *new_folio;
> +	int ret = 0;
> +
> +	/*
> +	 * Split the folio now. In the case of device
> +	 * private pages, this path is executed when
> +	 * the pmd is split and since freeze is not true
> +	 * it is likely the folio will be deferred_split.
> +	 *
> +	 * With device private pages, deferred splits of
> +	 * folios should be handled here to prevent partial
> +	 * unmaps from causing issues later on in migration
> +	 * and fault handling flows.
> +	 */
> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));

Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?

> +	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);

Confusing to call __split_unmapped_folio() if the folio is mapped...

--Mika
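
folio_ref_freeze(folio, n) succeeds only when the refcount is exactly n,
atomically replacing it with 0, so any additional reference makes it
fail, and the call in the patch above ignores the return value. A
minimal user-space model of that semantics, for illustration only and
not kernel code:

	#include <stdatomic.h>
	#include <stdbool.h>

	/* Succeeds only when the count is exactly "expected"; any extra
	 * reference held elsewhere makes the cmpxchg (and the freeze) fail.
	 */
	static bool ref_freeze(atomic_int *refcount, int expected)
	{
		int old = expected;

		return atomic_compare_exchange_strong(refcount, &old, 0);
	}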


Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by Zi Yan 2 months, 1 week ago
On 30 Jul 2025, at 7:16, Mika Penttilä wrote:

> Hi,
>
> On 7/30/25 12:21, Balbir Singh wrote:
>> Make THP handling code in the mm subsystem for THP pages aware of zone
>> device pages. Although the code is designed to be generic when it comes
>> to handling splitting of pages, the code is designed to work for THP
>> page sizes corresponding to HPAGE_PMD_NR.
>>
>> Modify page_vma_mapped_walk() to return true when a zone device huge
>> entry is present, enabling try_to_migrate() and other code migration
>> paths to appropriately process the entry. page_vma_mapped_walk() will
>> return true for zone device private large folios only when
>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>> not zone device private pages from having to add awareness. The key
>> callback that needs this flag is try_to_migrate_one(). The other
>> callbacks page idle, damon use it for setting young/dirty bits, which is
>> not significant when it comes to pmd level bit harvesting.
>>
>> pmd_pfn() does not work well with zone device entries, use
>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>> entries.
>>
>> Zone device private entries when split via munmap go through pmd split,
>> but need to go through a folio split, deferred split does not work if a
>> fault is encountered because fault handling involves migration entries
>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>> same there. This introduces the need to split the folio while handling
>> the pmd split. Because the folio is still mapped, but calling
>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>> code is used with a new helper to wrap the code
>> split_device_private_folio(), which skips the checks around
>> folio->mapping, swapcache and the need to go through unmap and remap
>> folio.
>>
>> Cc: Karol Herbst <kherbst@redhat.com>
>> Cc: Lyude Paul <lyude@redhat.com>
>> Cc: Danilo Krummrich <dakr@kernel.org>
>> Cc: David Airlie <airlied@gmail.com>
>> Cc: Simona Vetter <simona@ffwll.ch>
>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>> Cc: Shuah Khan <shuah@kernel.org>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Barry Song <baohua@kernel.org>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: Peter Xu <peterx@redhat.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>> Cc: Jane Chu <jane.chu@oracle.com>
>> Cc: Alistair Popple <apopple@nvidia.com>
>> Cc: Donet Tom <donettom@linux.ibm.com>
>> Cc: Mika Penttilä <mpenttil@redhat.com>
>> Cc: Matthew Brost <matthew.brost@intel.com>
>> Cc: Francois Dugast <francois.dugast@intel.com>
>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>
>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>> ---
>>  include/linux/huge_mm.h |   1 +
>>  include/linux/rmap.h    |   2 +
>>  include/linux/swapops.h |  17 +++
>>  mm/huge_memory.c        | 268 +++++++++++++++++++++++++++++++++-------
>>  mm/page_vma_mapped.c    |  13 +-
>>  mm/pgtable-generic.c    |   6 +
>>  mm/rmap.c               |  22 +++-
>>  7 files changed, 278 insertions(+), 51 deletions(-)
>>

<snip>

>> +/**
>> + * split_huge_device_private_folio - split a huge device private folio into
>> + * smaller pages (of order 0), currently used by migrate_device logic to
>> + * split folios for pages that are partially mapped
>> + *
>> + * @folio: the folio to split
>> + *
>> + * The caller has to hold the folio_lock and a reference via folio_get
>> + */
>> +int split_device_private_folio(struct folio *folio)
>> +{
>> +	struct folio *end_folio = folio_next(folio);
>> +	struct folio *new_folio;
>> +	int ret = 0;
>> +
>> +	/*
>> +	 * Split the folio now. In the case of device
>> +	 * private pages, this path is executed when
>> +	 * the pmd is split and since freeze is not true
>> +	 * it is likely the folio will be deferred_split.
>> +	 *
>> +	 * With device private pages, deferred splits of
>> +	 * folios should be handled here to prevent partial
>> +	 * unmaps from causing issues later on in migration
>> +	 * and fault handling flows.
>> +	 */
>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>
> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?

Based on my off-list conversation with Balbir, the folio is unmapped on
the CPU side but mapped in the device. folio_ref_freeze() is not aware
of the device-side mapping.

>
>> +	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>
> Confusing to call __split_unmapped_folio() if the folio is mapped...

From the driver's point of view, __split_unmapped_folio() should
probably be renamed to __split_cpu_unmapped_folio(), since it only deals
with CPU-side folio metadata during the split.


Best Regards,
Yan, Zi
Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by Zi Yan 2 months, 1 week ago
On 30 Jul 2025, at 7:27, Zi Yan wrote:

> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>
>> Hi,
>>
>> On 7/30/25 12:21, Balbir Singh wrote:
>>> <snip>
>
> <snip>
>
>>> +/**
>>> + * split_huge_device_private_folio - split a huge device private folio into
>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>> + * split folios for pages that are partially mapped
>>> + *
>>> + * @folio: the folio to split
>>> + *
>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>> + */
>>> +int split_device_private_folio(struct folio *folio)
>>> +{
>>> +	struct folio *end_folio = folio_next(folio);
>>> +	struct folio *new_folio;
>>> +	int ret = 0;
>>> +
>>> +	/*
>>> +	 * Split the folio now. In the case of device
>>> +	 * private pages, this path is executed when
>>> +	 * the pmd is split and since freeze is not true
>>> +	 * it is likely the folio will be deferred_split.
>>> +	 *
>>> +	 * With device private pages, deferred splits of
>>> +	 * folios should be handled here to prevent partial
>>> +	 * unmaps from causing issues later on in migration
>>> +	 * and fault handling flows.
>>> +	 */
>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>
>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>
> Based on my off-list conversation with Balbir, the folio is unmapped in
> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
> device side mapping.

Maybe we should make it aware of the device private mapping, so that the
process mirrors the CPU-side folio split: 1) unmap the device private mapping,
2) freeze the device private folio, 3) split the unmapped folio, 4) unfreeze,
5) remap the device private mapping.
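
A minimal sketch of that sequence is below. unmap_device_private_mapping()
and remap_device_private_mapping() are hypothetical driver-side hooks (they
do not exist today), and the post-split refcount handling of the new order-0
folios is glossed over; this is an outline of the proposed flow, not working
code.

	/* Sketch only: mirrors the CPU-side split for a device private folio. */
	static int split_device_private_folio_mirrored(struct folio *folio)
	{
		const int expected = folio_expected_ref_count(folio) + 1;
		int ret;

		/* 1) unmap the device private mapping (hypothetical hook) */
		unmap_device_private_mapping(folio);

		/* 2) freeze; may fail if there are unexpected references */
		if (!folio_ref_freeze(folio, expected))
			return -EAGAIN;

		/* 3) split the now fully unmapped folio down to order 0 */
		ret = __split_unmapped_folio(folio, 0, &folio->page,
					     NULL, NULL, true);

		/* 4) unfreeze (per-folio counts after the split are elided) */
		folio_ref_unfreeze(folio, expected);

		/* 5) re-establish the device private mapping (hypothetical hook) */
		remap_device_private_mapping(folio);

		return ret;
	}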

>
>>
>>> +	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>
>> Confusing to  __split_unmapped_folio() if folio is mapped...
>
> From driver point of view, __split_unmapped_folio() probably should be renamed
> to __split_cpu_unmapped_folio(), since it is only dealing with CPU side
> folio meta data for split.
>
>
> Best Regards,
> Yan, Zi


Best Regards,
Yan, Zi
Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by Mika Penttilä 2 months, 1 week ago
On 7/30/25 14:30, Zi Yan wrote:
> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>
>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>
>>> Hi,
>>>
>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>> <snip>
>> <snip>
>>
>>>> +/**
>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>> + * split folios for pages that are partially mapped
>>>> + *
>>>> + * @folio: the folio to split
>>>> + *
>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>> + */
>>>> +int split_device_private_folio(struct folio *folio)
>>>> +{
>>>> +	struct folio *end_folio = folio_next(folio);
>>>> +	struct folio *new_folio;
>>>> +	int ret = 0;
>>>> +
>>>> +	/*
>>>> +	 * Split the folio now. In the case of device
>>>> +	 * private pages, this path is executed when
>>>> +	 * the pmd is split and since freeze is not true
>>>> +	 * it is likely the folio will be deferred_split.
>>>> +	 *
>>>> +	 * With device private pages, deferred splits of
>>>> +	 * folios should be handled here to prevent partial
>>>> +	 * unmaps from causing issues later on in migration
>>>> +	 * and fault handling flows.
>>>> +	 */
>>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>> Based on my off-list conversation with Balbir, the folio is unmapped in
>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>> device side mapping.
> Maybe we should make it aware of device private mapping? So that the
> process mirrors CPU side folio split: 1) unmap device private mapping,
> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
> 5) remap device private mapping.

Ah, OK, this was obviously about a device private page here, never mind.

>>>> +	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>> Confusing to  __split_unmapped_folio() if folio is mapped...
>> From driver point of view, __split_unmapped_folio() probably should be renamed
>> to __split_cpu_unmapped_folio(), since it is only dealing with CPU side
>> folio meta data for split.
>>
>>
>> Best Regards,
>> Yan, Zi
>
> Best Regards,
> Yan, Zi
>

Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by Mika Penttilä 2 months, 1 week ago
On 7/30/25 14:42, Mika Penttilä wrote:
> On 7/30/25 14:30, Zi Yan wrote:
>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>
>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>
>>>> Hi,
>>>>
>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>> <snip>
>>> <snip>
>>>
>>>>> +/**
>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>> + * split folios for pages that are partially mapped
>>>>> + *
>>>>> + * @folio: the folio to split
>>>>> + *
>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>> + */
>>>>> +int split_device_private_folio(struct folio *folio)
>>>>> +{
>>>>> +	struct folio *end_folio = folio_next(folio);
>>>>> +	struct folio *new_folio;
>>>>> +	int ret = 0;
>>>>> +
>>>>> +	/*
>>>>> +	 * Split the folio now. In the case of device
>>>>> +	 * private pages, this path is executed when
>>>>> +	 * the pmd is split and since freeze is not true
>>>>> +	 * it is likely the folio will be deferred_split.
>>>>> +	 *
>>>>> +	 * With device private pages, deferred splits of
>>>>> +	 * folios should be handled here to prevent partial
>>>>> +	 * unmaps from causing issues later on in migration
>>>>> +	 * and fault handling flows.
>>>>> +	 */
>>>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>> device side mapping.
>> Maybe we should make it aware of device private mapping? So that the
>> process mirrors CPU side folio split: 1) unmap device private mapping,
>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>> 5) remap device private mapping.
> Ah ok this was about device private page obviously here, nevermind..

Still, isn't this reachable from the split_huge_pmd() paths, where the folio is mapped into the CPU page tables as a huge device page by one or more tasks?

>
>>>>> +	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>> Confusing to  __split_unmapped_folio() if folio is mapped...
>>> From driver point of view, __split_unmapped_folio() probably should be renamed
>>> to __split_cpu_unmapped_folio(), since it is only dealing with CPU side
>>> folio meta data for split.
>>>
>>>
>>> Best Regards,
>>> Yan, Zi
>> Best Regards,
>> Yan, Zi
>>

--Mika


Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by Zi Yan 2 months, 1 week ago
On 30 Jul 2025, at 8:08, Mika Penttilä wrote:

> On 7/30/25 14:42, Mika Penttilä wrote:
>> On 7/30/25 14:30, Zi Yan wrote:
>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>
>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>> <snip>
>>>> <snip>
>>>>
>>>>>> +/**
>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>> + * split folios for pages that are partially mapped
>>>>>> + *
>>>>>> + * @folio: the folio to split
>>>>>> + *
>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>> + */
>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>> +{
>>>>>> +	struct folio *end_folio = folio_next(folio);
>>>>>> +	struct folio *new_folio;
>>>>>> +	int ret = 0;
>>>>>> +
>>>>>> +	/*
>>>>>> +	 * Split the folio now. In the case of device
>>>>>> +	 * private pages, this path is executed when
>>>>>> +	 * the pmd is split and since freeze is not true
>>>>>> +	 * it is likely the folio will be deferred_split.
>>>>>> +	 *
>>>>>> +	 * With device private pages, deferred splits of
>>>>>> +	 * folios should be handled here to prevent partial
>>>>>> +	 * unmaps from causing issues later on in migration
>>>>>> +	 * and fault handling flows.
>>>>>> +	 */
>>>>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>> device side mapping.
>>> Maybe we should make it aware of device private mapping? So that the
>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>> 5) remap device private mapping.
>> Ah ok this was about device private page obviously here, nevermind..
>
> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?

The folio only has migration entries pointing to it. From the CPU perspective,
it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
folio by replacing the existing page table entries with migration entries,
and after that the folio is regarded as “unmapped”.

A migration entry is an invalid CPU page table entry, so it is not a CPU
mapping, IIUC.
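
To make that concrete, here is a small, illustrative classification of a CPU
pte using the existing swapops helpers (not code from this patch); both
migration entries and device private entries land in the non-present branch:

	#include <linux/swapops.h>

	/* Illustrative only: how a page table walker sees the entry kinds. */
	static const char *classify_pte(pte_t pte)
	{
		swp_entry_t entry;

		if (pte_none(pte))
			return "empty";
		if (pte_present(pte))
			return "CPU mapping";

		/* Non-present: some flavour of swap entry. */
		entry = pte_to_swp_entry(pte);
		if (is_migration_entry(entry))
			return "migration entry (not a CPU mapping)";
		if (is_device_private_entry(entry))
			return "device private entry (not a CPU mapping)";
		return "other swap entry";
	}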

>
>>
>>>>>> +	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>>> Confusing to  __split_unmapped_folio() if folio is mapped...
>>>> From driver point of view, __split_unmapped_folio() probably should be renamed
>>>> to __split_cpu_unmapped_folio(), since it is only dealing with CPU side
>>>> folio meta data for split.
>>>>
>>>>
>>>> Best Regards,
>>>> Yan, Zi
>>> Best Regards,
>>> Yan, Zi
>>>
>
> --Mika


Best Regards,
Yan, Zi
Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by Mika Penttilä 2 months, 1 week ago
On 7/30/25 15:25, Zi Yan wrote:
> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>
>> On 7/30/25 14:42, Mika Penttilä wrote:
>>> On 7/30/25 14:30, Zi Yan wrote:
>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>
>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>> <snip>
>>>>> <snip>
>>>>>
>>>>>>> +/**
>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>> + * split folios for pages that are partially mapped
>>>>>>> + *
>>>>>>> + * @folio: the folio to split
>>>>>>> + *
>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>> + */
>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>> +{
>>>>>>> +	struct folio *end_folio = folio_next(folio);
>>>>>>> +	struct folio *new_folio;
>>>>>>> +	int ret = 0;
>>>>>>> +
>>>>>>> +	/*
>>>>>>> +	 * Split the folio now. In the case of device
>>>>>>> +	 * private pages, this path is executed when
>>>>>>> +	 * the pmd is split and since freeze is not true
>>>>>>> +	 * it is likely the folio will be deferred_split.
>>>>>>> +	 *
>>>>>>> +	 * With device private pages, deferred splits of
>>>>>>> +	 * folios should be handled here to prevent partial
>>>>>>> +	 * unmaps from causing issues later on in migration
>>>>>>> +	 * and fault handling flows.
>>>>>>> +	 */
>>>>>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>> device side mapping.
>>>> Maybe we should make it aware of device private mapping? So that the
>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>> 5) remap device private mapping.
>>> Ah ok this was about device private page obviously here, nevermind..
>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
> The folio only has migration entries pointing to it. From CPU perspective,
> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
> folio by replacing existing page table entries with migration entries
> and after that the folio is regarded as “unmapped”.
>
> The migration entry is an invalid CPU page table entry, so it is not a CPU

split_device_private_folio() is called for a device private entry, not a migration entry, AFAICS.
And it is called from split_huge_pmd() with freeze == false, i.e. from a pmd split, not a folio split.
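
Roughly the path being described, sketched from the discussion rather than
taken from the patch itself (split_device_private_folio() is the helper this
series introduces):

	/* Sketch of the pmd-split call site under discussion (simplified). */
	if (is_swap_pmd(*pmd)) {
		swp_entry_t entry = pmd_to_swp_entry(*pmd);

		if (is_device_private_entry(entry) && !freeze) {
			struct folio *folio = pfn_swap_entry_folio(entry);

			/* munmap -> split_huge_pmd() also splits the folio */
			split_device_private_folio(folio);
		}
	}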

> mapping, IIUC.
>
>>>>>>> +	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>>>> Confusing to  __split_unmapped_folio() if folio is mapped...
>>>>> From driver point of view, __split_unmapped_folio() probably should be renamed
>>>>> to __split_cpu_unmapped_folio(), since it is only dealing with CPU side
>>>>> folio meta data for split.
>>>>>
>>>>>
>>>>> Best Regards,
>>>>> Yan, Zi
>>>> Best Regards,
>>>> Yan, Zi
>>>>
>> --Mika
>
> Best Regards,
> Yan, Zi
>
--Mika


Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by Zi Yan 2 months, 1 week ago
On 30 Jul 2025, at 8:49, Mika Penttilä wrote:

> On 7/30/25 15:25, Zi Yan wrote:
>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>
>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>
>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>> <snip>
>>>>>> <snip>
>>>>>>
>>>>>>>> +/**
>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>> + *
>>>>>>>> + * @folio: the folio to split
>>>>>>>> + *
>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>> + */
>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>> +{
>>>>>>>> +	struct folio *end_folio = folio_next(folio);
>>>>>>>> +	struct folio *new_folio;
>>>>>>>> +	int ret = 0;
>>>>>>>> +
>>>>>>>> +	/*
>>>>>>>> +	 * Split the folio now. In the case of device
>>>>>>>> +	 * private pages, this path is executed when
>>>>>>>> +	 * the pmd is split and since freeze is not true
>>>>>>>> +	 * it is likely the folio will be deferred_split.
>>>>>>>> +	 *
>>>>>>>> +	 * With device private pages, deferred splits of
>>>>>>>> +	 * folios should be handled here to prevent partial
>>>>>>>> +	 * unmaps from causing issues later on in migration
>>>>>>>> +	 * and fault handling flows.
>>>>>>>> +	 */
>>>>>>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>> device side mapping.
>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>> 5) remap device private mapping.
>>>> Ah ok this was about device private page obviously here, nevermind..
>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>> The folio only has migration entries pointing to it. From CPU perspective,
>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>> folio by replacing existing page table entries with migration entries
>> and after that the folio is regarded as “unmapped”.
>>
>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>
> split_device_private_folio() is called for device private entry, not migrate entry afaics.

Yes, but from the CPU perspective, both device private entries and migration
entries are invalid CPU page table entries, so the device private folio is
“unmapped” on the CPU side.


> And it is called from split_huge_pmd() with freeze == false, not from folio split but pmd split.

I am not sure that is the right time to split the folio. The device private
folio could be kept unsplit at split_huge_pmd() time.

But from the CPU perspective, a device private folio has no CPU mapping, so no
other CPU can access or manipulate the folio. It should be OK to split it.

>
>> mapping, IIUC.
>>
>>>>>>>> +	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>>>>> Confusing to  __split_unmapped_folio() if folio is mapped...
>>>>>> From driver point of view, __split_unmapped_folio() probably should be renamed
>>>>>> to __split_cpu_unmapped_folio(), since it is only dealing with CPU side
>>>>>> folio meta data for split.
>>>>>>
>>>>>>
>>>>>> Best Regards,
>>>>>> Yan, Zi
>>>>> Best Regards,
>>>>> Yan, Zi
>>>>>
>>> --Mika
>>
>> Best Regards,
>> Yan, Zi
>>
> --Mika


Best Regards,
Yan, Zi
Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by Mika Penttilä 2 months, 1 week ago
On 7/30/25 18:10, Zi Yan wrote:
> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>
>> On 7/30/25 15:25, Zi Yan wrote:
>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>
>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>
>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>> <snip>
>>>>>>> <snip>
>>>>>>>
>>>>>>>>> +/**
>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>> + *
>>>>>>>>> + * @folio: the folio to split
>>>>>>>>> + *
>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>> + */
>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>> +{
>>>>>>>>> +	struct folio *end_folio = folio_next(folio);
>>>>>>>>> +	struct folio *new_folio;
>>>>>>>>> +	int ret = 0;
>>>>>>>>> +
>>>>>>>>> +	/*
>>>>>>>>> +	 * Split the folio now. In the case of device
>>>>>>>>> +	 * private pages, this path is executed when
>>>>>>>>> +	 * the pmd is split and since freeze is not true
>>>>>>>>> +	 * it is likely the folio will be deferred_split.
>>>>>>>>> +	 *
>>>>>>>>> +	 * With device private pages, deferred splits of
>>>>>>>>> +	 * folios should be handled here to prevent partial
>>>>>>>>> +	 * unmaps from causing issues later on in migration
>>>>>>>>> +	 * and fault handling flows.
>>>>>>>>> +	 */
>>>>>>>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>> device side mapping.
>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>> 5) remap device private mapping.
>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>> The folio only has migration entries pointing to it. From CPU perspective,
>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>> folio by replacing existing page table entries with migration entries
>>> and after that the folio is regarded as “unmapped”.
>>>
>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
> Yes, but from CPU perspective, both device private entry and migration entry
> are invalid CPU page table entries, so the device private folio is “unmapped”
> at CPU side.

Yes, both are "swap entries", but there is a difference: the device private ones contribute to the mapcount and refcount.

Also, what might confuse here is that v1 of the series had only
  migrate_vma_split_pages()
which operated only on truly unmapped (mapcount-wise) folios, which was a motivation for split_unmapped_folio(). Now,
  split_device_private_folio()
operates on folios with mapcount != 0.
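
For the refcount side, the accounting the freeze call seems to rely on can be
written out; this is a rough sketch, assuming folio_expected_ref_count() folds
in the references backing the mapcount (which the device private entries
contribute to):

	/*
	 * Assumed accounting (not authoritative):
	 *
	 *   folio_ref_count(folio) == folio_expected_ref_count(folio)  // mappings etc.
	 *                             + 1                              // caller's folio_get()
	 *
	 * so folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio)) only
	 * succeeds if nobody else holds a transient reference at that moment,
	 * which is why ignoring its boolean return value looks risky.
	 */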

>
>
>> And it is called from split_huge_pmd() with freeze == false, not from folio split but pmd split.
> I am not sure that is the right timing of splitting a folio. The device private
> folio can be kept without splitting at split_huge_pmd() time.

Yes, this doesn't look quite right, and also

+	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));

looks suspicious.

Maybe split_device_private_folio() tries to solve some corner case, but it would be good to elaborate more on the exact conditions; there might be a better fix.

>
> But from CPU perspective, a device private folio has no CPU mapping, no other
> CPU can access or manipulate the folio. It should be OK to split it.
>
>>> mapping, IIUC.
>>>
>>>>>>>>> +	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>>>>>> Confusing to  __split_unmapped_folio() if folio is mapped...
>>>>>>> From driver point of view, __split_unmapped_folio() probably should be renamed
>>>>>>> to __split_cpu_unmapped_folio(), since it is only dealing with CPU side
>>>>>>> folio meta data for split.
>>>>>>>
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Yan, Zi
>>>>>> Best Regards,
>>>>>> Yan, Zi
>>>>>>
>>>> --Mika
>>> Best Regards,
>>> Yan, Zi
>>>
>> --Mika
>
> Best Regards,
> Yan, Zi
>
--Mika


Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by Zi Yan 2 months, 1 week ago
On 30 Jul 2025, at 11:40, Mika Penttilä wrote:

> On 7/30/25 18:10, Zi Yan wrote:
>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>
>>> On 7/30/25 15:25, Zi Yan wrote:
>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>
>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>
>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>> <snip>
>>>>>>>> <snip>
>>>>>>>>
>>>>>>>>>> +/**
>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>> + *
>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>> + *
>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>> + */
>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>> +{
>>>>>>>>>> +	struct folio *end_folio = folio_next(folio);
>>>>>>>>>> +	struct folio *new_folio;
>>>>>>>>>> +	int ret = 0;
>>>>>>>>>> +
>>>>>>>>>> +	/*
>>>>>>>>>> +	 * Split the folio now. In the case of device
>>>>>>>>>> +	 * private pages, this path is executed when
>>>>>>>>>> +	 * the pmd is split and since freeze is not true
>>>>>>>>>> +	 * it is likely the folio will be deferred_split.
>>>>>>>>>> +	 *
>>>>>>>>>> +	 * With device private pages, deferred splits of
>>>>>>>>>> +	 * folios should be handled here to prevent partial
>>>>>>>>>> +	 * unmaps from causing issues later on in migration
>>>>>>>>>> +	 * and fault handling flows.
>>>>>>>>>> +	 */
>>>>>>>>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>> device side mapping.
>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>> 5) remap device private mapping.
>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>> folio by replacing existing page table entries with migration entries
>>>> and after that the folio is regarded as “unmapped”.
>>>>
>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>> Yes, but from CPU perspective, both device private entry and migration entry
>> are invalid CPU page table entries, so the device private folio is “unmapped”
>> at CPU side.
>
> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.

Right. That confused me when I was talking to Balbir and looking at v1.
When a device private folio is processed in __folio_split(), Balbir needed to
add code to skip the CPU mapping handling. Basically, device private folios
are CPU-unmapped and device-mapped.

Here are my questions on device private folios:
1. How is the mapcount used for device private folios? Why is it needed from
   the CPU perspective? Could it be stored in a device-private-specific data
   structure?
2. When a device private folio is mapped on the device, can anyone other than
   the device driver manipulate it, assuming core-mm just skips device private
   folios (barring the CPU access fault handling)? (See the sketch below.)

Where I am going with this: can device private folios be treated as unmapped
folios by the CPU, with only the device driver manipulating their mappings?
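
As a concrete illustration of question 2, this is the kind of check core-mm
code can use to leave such folios to the owning driver; a sketch built on the
existing folio_is_device_private() helper, not a claim about what every call
site does today:

	#include <linux/memremap.h>

	/* Illustrative: skip device private folios in generic CPU-side code. */
	static bool cpu_side_should_touch(struct folio *folio)
	{
		/*
		 * Device private folios are managed by the owning driver; the
		 * CPU only gets involved via the fault path that migrates the
		 * data back to system memory.
		 */
		return !folio_is_device_private(folio);
	}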


>
> Also which might confuse is that v1 of the series had only
>   migrate_vma_split_pages()
> which operated only on truly unmapped (mapcount wise) folios. Which was a motivation for split_unmapped_folio()..
> Now,
>   split_device_private_folio()
> operates on mapcount != 0 folios.
>
>>
>>
>>> And it is called from split_huge_pmd() with freeze == false, not from folio split but pmd split.
>> I am not sure that is the right timing of splitting a folio. The device private
>> folio can be kept without splitting at split_huge_pmd() time.
>
> Yes this doesn't look quite right, and also
> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));

I wonder whether we need to freeze a device private folio at all. Can anyone
other than the device driver change its refcount, given that the CPU just sees
it as an unmapped folio?

>
> looks suspicious
>
> Maybe split_device_private_folio() tries to solve some corner case but maybe good to elaborate
> more the exact conditions, there might be a better fix.
>
>>
>> But from CPU perspective, a device private folio has no CPU mapping, no other
>> CPU can access or manipulate the folio. It should be OK to split it.
>>
>>>> mapping, IIUC.
>>>>
>>>>>>>>>> +	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>>>>>>> Confusing to  __split_unmapped_folio() if folio is mapped...
>>>>>>>> From driver point of view, __split_unmapped_folio() probably should be renamed
>>>>>>>> to __split_cpu_unmapped_folio(), since it is only dealing with CPU side
>>>>>>>> folio meta data for split.



Best Regards,
Yan, Zi
Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by Mika Penttilä 2 months, 1 week ago
On 7/30/25 18:58, Zi Yan wrote:
> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>
>> On 7/30/25 18:10, Zi Yan wrote:
>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>
>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>
>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>
>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>> <snip>
>>>>>>>>> <snip>
>>>>>>>>>
>>>>>>>>>>> +/**
>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>> + *
>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>> + *
>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>> + */
>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>> +{
>>>>>>>>>>> +	struct folio *end_folio = folio_next(folio);
>>>>>>>>>>> +	struct folio *new_folio;
>>>>>>>>>>> +	int ret = 0;
>>>>>>>>>>> +
>>>>>>>>>>> +	/*
>>>>>>>>>>> +	 * Split the folio now. In the case of device
>>>>>>>>>>> +	 * private pages, this path is executed when
>>>>>>>>>>> +	 * the pmd is split and since freeze is not true
>>>>>>>>>>> +	 * it is likely the folio will be deferred_split.
>>>>>>>>>>> +	 *
>>>>>>>>>>> +	 * With device private pages, deferred splits of
>>>>>>>>>>> +	 * folios should be handled here to prevent partial
>>>>>>>>>>> +	 * unmaps from causing issues later on in migration
>>>>>>>>>>> +	 * and fault handling flows.
>>>>>>>>>>> +	 */
>>>>>>>>>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>> device side mapping.
>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>> 5) remap device private mapping.
>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>> folio by replacing existing page table entries with migration entries
>>>>> and after that the folio is regarded as “unmapped”.
>>>>>
>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>> Yes, but from CPU perspective, both device private entry and migration entry
>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>> at CPU side.
>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
> Right. That confused me when I was talking to Balbir and looking at v1.
> When a device private folio is processed in __folio_split(), Balbir needed to
> add code to skip CPU mapping handling code. Basically device private folios are
> CPU unmapped and device mapped.
>
> Here are my questions on device private folios:
> 1. How is mapcount used for device private folios? Why is it needed from CPU
>    perspective? Can it be stored in a device private specific data structure?

Mostly like for normal folios, for instance for rmap when doing migrate. I think it would
make the common code more messy if not done that way, but sure, it's possible.
And not consuming pfns (address space) at all would have benefits.

> 2. When a device private folio is mapped on device, can someone other than
>    the device driver manipulate it assuming core-mm just skips device private
>    folios (barring the CPU access fault handling)?
>
> Where I am going is that can device private folios be treated as unmapped folios
> by CPU and only device driver manipulates their mappings?
>
Yes, not present from the CPU side, but mm has bookkeeping on them. The private page has
no content someone could change while it is in the device; it's just a pfn.

>> Also which might confuse is that v1 of the series had only
>>   migrate_vma_split_pages()
>> which operated only on truly unmapped (mapcount wise) folios. Which was a motivation for split_unmapped_folio()..
>> Now,
>>   split_device_private_folio()
>> operates on mapcount != 0 folios.
>>
>>>
>>>> And it is called from split_huge_pmd() with freeze == false, not from folio split but pmd split.
>>> I am not sure that is the right timing of splitting a folio. The device private
>>> folio can be kept without splitting at split_huge_pmd() time.
>> Yes this doesn't look quite right, and also
>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
> I wonder if we need to freeze a device private folio. Can anyone other than
> device driver change its refcount? Since CPU just sees it as an unmapped folio.
>
>> looks suspicious
>>
>> Maybe split_device_private_folio() tries to solve some corner case but maybe good to elaborate
>> more the exact conditions, there might be a better fix.
>>
>>> But from CPU perspective, a device private folio has no CPU mapping, no other
>>> CPU can access or manipulate the folio. It should be OK to split it.
>>>
>>>>> mapping, IIUC.
>>>>>
>>>>>>>>>>> +	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>>>>>>>> Confusing to  __split_unmapped_folio() if folio is mapped...
>>>>>>>>> From driver point of view, __split_unmapped_folio() probably should be renamed
>>>>>>>>> to __split_cpu_unmapped_folio(), since it is only dealing with CPU side
>>>>>>>>> folio meta data for split.
>
>
> Best Regards,
> Yan, Zi
>

--Mika


Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by David Hildenbrand 2 months ago
On 30.07.25 18:29, Mika Penttilä wrote:
> 
> On 7/30/25 18:58, Zi Yan wrote:
>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>
>>> On 7/30/25 18:10, Zi Yan wrote:
>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>
>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>
>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>
>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>> <snip>
>>>>>>>>>>
>>>>>>>>>>>> +/**
>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>> + *
>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>> + */
>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>> +{
>>>>>>>>>>>> +	struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>> +	struct folio *new_folio;
>>>>>>>>>>>> +	int ret = 0;
>>>>>>>>>>>> +
>>>>>>>>>>>> +	/*
>>>>>>>>>>>> +	 * Split the folio now. In the case of device
>>>>>>>>>>>> +	 * private pages, this path is executed when
>>>>>>>>>>>> +	 * the pmd is split and since freeze is not true
>>>>>>>>>>>> +	 * it is likely the folio will be deferred_split.
>>>>>>>>>>>> +	 *
>>>>>>>>>>>> +	 * With device private pages, deferred splits of
>>>>>>>>>>>> +	 * folios should be handled here to prevent partial
>>>>>>>>>>>> +	 * unmaps from causing issues later on in migration
>>>>>>>>>>>> +	 * and fault handling flows.
>>>>>>>>>>>> +	 */
>>>>>>>>>>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>> device side mapping.
>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>> 5) remap device private mapping.
>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>> folio by replacing existing page table entries with migration entries
>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>
>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>> at CPU side.
>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>> Right. That confused me when I was talking to Balbir and looking at v1.
>> When a device private folio is processed in __folio_split(), Balbir needed to
>> add code to skip CPU mapping handling code. Basically device private folios are
>> CPU unmapped and device mapped.
>>
>> Here are my questions on device private folios:
>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>     perspective? Can it be stored in a device private specific data structure?
> 
> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
> common code more messy if not done that way but sure possible.
> And not consuming pfns (address space) at all would have benefits.
> 
>> 2. When a device private folio is mapped on device, can someone other than
>>     the device driver manipulate it assuming core-mm just skips device private
>>     folios (barring the CPU access fault handling)?
>>
>> Where I am going is that can device private folios be treated as unmapped folios
>> by CPU and only device driver manipulates their mappings?
>>
> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
> someone could change while in device, it's just pfn.

Just to clarify: a device-private entry, like a device-exclusive entry,
is a *page table mapping* tracked through the rmap -- even though such
entries are not present page table entries.

It would be better if they were present page table entries that are
PROT_NONE, but it's tricky to mark them as being "special"
device-private, device-exclusive etc. Maybe there are ways to do that in
the future.

Maybe device-private could just be PROT_NONE, because we can identify
the entry type based on the folio. device-exclusive is harder ...

So consider device-private entries just like PROT_NONE present page
table entries. Refcount and mapcount are adjusted accordingly by rmap
functions.
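
As a rough sketch of what that means in code (illustrative only, not part
of the patch; it only uses the existing swapops/huge_mm helpers, and the
function name below is made up):

#include <linux/mm.h>
#include <linux/swapops.h>
#include <linux/huge_mm.h>

/*
 * A non-present PMD can still be a device-private *mapping*: the
 * swap-style entry pins a folio and is tracked through the rmap,
 * so it contributes to mapcount/refcount like a present entry.
 */
static bool pmd_is_device_private(pmd_t pmdval, struct folio **foliop)
{
        swp_entry_t entry;

        /* is_swap_pmd(): neither none nor present */
        if (!is_swap_pmd(pmdval))
                return false;

        entry = pmd_to_swp_entry(pmdval);
        if (!is_device_private_entry(entry))
                return false;

        *foliop = page_folio(pfn_swap_entry_to_page(entry));
        return true;
}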

-- 
Cheers,

David / dhildenb

Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by Zi Yan 2 months ago
On 31 Jul 2025, at 3:15, David Hildenbrand wrote:

> On 30.07.25 18:29, Mika Penttilä wrote:
>>
>> On 7/30/25 18:58, Zi Yan wrote:
>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>
>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>
>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>
>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>
>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>> <snip>
>>>>>>>>>>>
>>>>>>>>>>>>> +/**
>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>> + *
>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>> + *
>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>> + */
>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>> +{
>>>>>>>>>>>>> +	struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>> +	struct folio *new_folio;
>>>>>>>>>>>>> +	int ret = 0;
>>>>>>>>>>>>> +
>>>>>>>>>>>>> +	/*
>>>>>>>>>>>>> +	 * Split the folio now. In the case of device
>>>>>>>>>>>>> +	 * private pages, this path is executed when
>>>>>>>>>>>>> +	 * the pmd is split and since freeze is not true
>>>>>>>>>>>>> +	 * it is likely the folio will be deferred_split.
>>>>>>>>>>>>> +	 *
>>>>>>>>>>>>> +	 * With device private pages, deferred splits of
>>>>>>>>>>>>> +	 * folios should be handled here to prevent partial
>>>>>>>>>>>>> +	 * unmaps from causing issues later on in migration
>>>>>>>>>>>>> +	 * and fault handling flows.
>>>>>>>>>>>>> +	 */
>>>>>>>>>>>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>> device side mapping.
>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>
>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>> at CPU side.
>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>> add code to skip CPU mapping handling code. Basically device private folios are
>>> CPU unmapped and device mapped.
>>>
>>> Here are my questions on device private folios:
>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>     perspective? Can it be stored in a device private specific data structure?
>>
>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>> common code more messy if not done that way but sure possible.
>> And not consuming pfns (address space) at all would have benefits.
>>
>>> 2. When a device private folio is mapped on device, can someone other than
>>>     the device driver manipulate it assuming core-mm just skips device private
>>>     folios (barring the CPU access fault handling)?
>>>
>>> Where I am going is that can device private folios be treated as unmapped folios
>>> by CPU and only device driver manipulates their mappings?
>>>
>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>> someone could change while in device, it's just pfn.
>
> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>
> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>
> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>
>
> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.

Thanks for the clarification.

So folio_mapcount() for device private folios should be treated the same
as normal folios, even if the corresponding PTEs are not accessible from CPUs.
Then I wonder if the device private large folio split should go through
__folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
remap. Otherwise, how can we prevent rmap changes during the split?
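
As a sketch of that ordering (illustrative only; it assumes the code sits
in mm/huge_memory.c next to the static helpers __folio_split() already
uses, and do_split() is a placeholder, not a real function):

/* placeholder for the actual split step, e.g. __split_unmapped_folio() */
static int do_split(struct folio *folio);

static int device_private_split_sketch(struct folio *folio)
{
        long nr = folio_nr_pages(folio);
        int ret;

        /* 1) unmap: replace the device private entries via the rmap walk */
        unmap_folio(folio);

        /* 2) freeze: fails if someone else still holds a reference */
        if (!folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio))) {
                remap_page(folio, nr, 0);
                return -EAGAIN;
        }

        /* 3) split the now-unmapped folio */
        ret = do_split(folio);

        /* 4) unfreeze (the real code unfreezes each resulting folio) */
        folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));

        /* 5) remap: restore the device private entries */
        remap_page(folio, nr, 0);
        return ret;
}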


Best Regards,
Yan, Zi
Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by Balbir Singh 2 months ago
On 7/31/25 21:26, Zi Yan wrote:
> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
> 
>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>
>>> On 7/30/25 18:58, Zi Yan wrote:
>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>
>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>
>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>
>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>> <snip>
>>>>>>>>>>>>
>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>> +	struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>> +	struct folio *new_folio;
>>>>>>>>>>>>>> +	int ret = 0;
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +	/*
>>>>>>>>>>>>>> +	 * Split the folio now. In the case of device
>>>>>>>>>>>>>> +	 * private pages, this path is executed when
>>>>>>>>>>>>>> +	 * the pmd is split and since freeze is not true
>>>>>>>>>>>>>> +	 * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>> +	 *
>>>>>>>>>>>>>> +	 * With device private pages, deferred splits of
>>>>>>>>>>>>>> +	 * folios should be handled here to prevent partial
>>>>>>>>>>>>>> +	 * unmaps from causing issues later on in migration
>>>>>>>>>>>>>> +	 * and fault handling flows.
>>>>>>>>>>>>>> +	 */
>>>>>>>>>>>>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>> device side mapping.
>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>
>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>> at CPU side.
>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>> CPU unmapped and device mapped.
>>>>
>>>> Here are my questions on device private folios:
>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>     perspective? Can it be stored in a device private specific data structure?
>>>
>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>> common code more messy if not done that way but sure possible.
>>> And not consuming pfns (address space) at all would have benefits.
>>>
>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>     the device driver manipulate it assuming core-mm just skips device private
>>>>     folios (barring the CPU access fault handling)?
>>>>
>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>> by CPU and only device driver manipulates their mappings?
>>>>
>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>> someone could change while in device, it's just pfn.
>>
>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>
>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>
>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>
>>
>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
> 
> Thanks for the clarification.
> 
> So folio_mapcount() for device private folios should be treated the same
> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
> Then I wonder if the device private large folio split should go through
> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
> remap. Otherwise, how can we prevent rmap changes during the split?
> 

That is true in general, the special cases I mentioned are:

1. Split during migration (where the sizes on source/destination do not
   match), so we need to split in the middle of migration. The entries
   there are already unmapped, hence the special handling.
2. Partial unmap case, where we need to split in the context of the unmap
   due to the issues mentioned in the patch. I expanded the folio split code
   for device private folios into its own helper, which does not
   need to do the xas/mapped/lru folio handling. During partial unmap the
   original folio does get replaced by new anon rmap ptes (split_huge_pmd_locked).

For (2), I spent some time examining the implications of not unmapping the
folios prior to the split; in the partial unmap path, once we split the PMD
the folios diverge. I did not run into any particular race with the tests
either.

Balbir Singh
Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by Mika Penttilä 2 months ago
Hi,

On 8/1/25 03:49, Balbir Singh wrote:

> On 7/31/25 21:26, Zi Yan wrote:
>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>
>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>
>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>
>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>
>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>> +	struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>> +	struct folio *new_folio;
>>>>>>>>>>>>>>> +	int ret = 0;
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>> +	/*
>>>>>>>>>>>>>>> +	 * Split the folio now. In the case of device
>>>>>>>>>>>>>>> +	 * private pages, this path is executed when
>>>>>>>>>>>>>>> +	 * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>> +	 * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>> +	 *
>>>>>>>>>>>>>>> +	 * With device private pages, deferred splits of
>>>>>>>>>>>>>>> +	 * folios should be handled here to prevent partial
>>>>>>>>>>>>>>> +	 * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>> +	 * and fault handling flows.
>>>>>>>>>>>>>>> +	 */
>>>>>>>>>>>>>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>>> device side mapping.
>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>>
>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>>> at CPU side.
>>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>>> CPU unmapped and device mapped.
>>>>>
>>>>> Here are my questions on device private folios:
>>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>>     perspective? Can it be stored in a device private specific data structure?
>>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>>> common code more messy if not done that way but sure possible.
>>>> And not consuming pfns (address space) at all would have benefits.
>>>>
>>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>>     the device driver manipulate it assuming core-mm just skips device private
>>>>>     folios (barring the CPU access fault handling)?
>>>>>
>>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>>> by CPU and only device driver manipulates their mappings?
>>>>>
>>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>>> someone could change while in device, it's just pfn.
>>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>>
>>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>>
>>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>>
>>>
>>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>> Thanks for the clarification.
>>
>> So folio_mapcount() for device private folios should be treated the same
>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>> Then I wonder if the device private large folio split should go through
>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>> remap. Otherwise, how can we prevent rmap changes during the split?
>>
> That is true in general, the special cases I mentioned are:
>
> 1. Split during migration (where the sizes on source/destination do not
>    match), so we need to split in the middle of migration. The entries
>    there are already unmapped, hence the special handling.
> 2. Partial unmap case, where we need to split in the context of the unmap
>    due to the issues mentioned in the patch. I expanded the folio split code
>    for device private folios into its own helper, which does not
>    need to do the xas/mapped/lru folio handling. During partial unmap the
>    original folio does get replaced by new anon rmap ptes (split_huge_pmd_locked).
>
> For (2), I spent some time examining the implications of not unmapping the
> folios prior to the split; in the partial unmap path, once we split the PMD
> the folios diverge. I did not run into any particular race with the tests
> either.

1) is totally fine. This was in v1 and led to Zi's split_unmapped_folio().

2) is a problem because the folio is mapped. split_huge_pmd() can also be reached from paths other than the unmap path.
It is vulnerable to races via rmap. And, for instance, this does not look right without checking the return value:

   folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));

You mention 2) is needed because of some later problems in the fault path after the pmd split. Would it be
possible to split the folio at fault time then?
Also, I didn't quite follow what kind of lock recursion you encountered when doing a proper split_folio()
instead?
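
For illustration, the kind of check being asked for would look something
like this (sketch only, not the patch as posted; the
__split_unmapped_folio() call is the one quoted earlier in the thread):

        /* bail out instead of assuming the freeze always succeeds */
        if (!folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio)))
                return -EAGAIN;

        ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);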


> Balbir Singh
>
--Mika

Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by Balbir Singh 2 months ago
On 8/1/25 11:16, Mika Penttilä wrote:
> Hi,
> 
> On 8/1/25 03:49, Balbir Singh wrote:
> 
>> On 7/31/25 21:26, Zi Yan wrote:
>>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>>
>>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>>
>>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>>
>>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>>
>>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>> +	struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>>> +	struct folio *new_folio;
>>>>>>>>>>>>>>>> +	int ret = 0;
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +	/*
>>>>>>>>>>>>>>>> +	 * Split the folio now. In the case of device
>>>>>>>>>>>>>>>> +	 * private pages, this path is executed when
>>>>>>>>>>>>>>>> +	 * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>>> +	 * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>>> +	 *
>>>>>>>>>>>>>>>> +	 * With device private pages, deferred splits of
>>>>>>>>>>>>>>>> +	 * folios should be handled here to prevent partial
>>>>>>>>>>>>>>>> +	 * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>>> +	 * and fault handling flows.
>>>>>>>>>>>>>>>> +	 */
>>>>>>>>>>>>>>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>>>> device side mapping.
>>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>>>
>>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>>>> split_device_private_folio() is called for a device private entry, not a migration entry afaics.
>>>>>>>> Yes, but from the CPU perspective, both device private entries and migration entries
>>>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>>>> on the CPU side.
>>>>>>> Yes, both are "swap entries", but there's a difference: the device private ones contribute to mapcount and refcount.
>>>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>>>> CPU unmapped and device mapped.
>>>>>>
>>>>>> Here are my questions on device private folios:
>>>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>>>     perspective? Can it be stored in a device private specific data structure?
>>>>> Mostly like for normal folios, for instance for rmap when doing migration. I think it would make
>>>>> the common code more messy if not done that way, but it is certainly possible.
>>>>> And not consuming pfns (address space) at all would have benefits.
>>>>>
>>>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>>>     the device driver manipulate it assuming core-mm just skips device private
>>>>>>     folios (barring the CPU access fault handling)?
>>>>>>
>>>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>>>> by CPU and only device driver manipulates their mappings?
>>>>>>
>>>>> Yes, not present to the CPU, but mm has bookkeeping on them. The private page has no content
>>>>> someone could change while it is in the device; it's just a pfn.
>>>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>>>
>>>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>>>
>>>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>>>
>>>>
>>>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>>> Thanks for the clarification.
>>>
>>> So folio_mapcount() for device private folios should be treated the same
>>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>>> Then I wonder if the device private large folio split should go through
>>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>>> remap. Otherwise, how can we prevent rmap changes during the split?
>>>
>> That is true in general, the special cases I mentioned are:
>>
>> 1. split during migration (where the sizes on source/destination do not
>>    match) and so we need to split in the middle of migration. The entries
>>    there are already unmapped and hence the special handling
>> 2. Partial unmap case, where we need to split in the context of the unmap
>>    due to the issues mentioned in the patch. I expanded the folio split code
>>    for device private folios into its own helper, which does not
>>    need to do the xas/mapped/lru folio handling. During partial unmap the
>>    original folio does get replaced by new anon rmap ptes (split_huge_pmd_locked)
>>
>> For (2), I spent some time examining the implications of not unmapping the
>> folios prior to the split; in the partial unmap path, once we split the PMD,
>> the folios diverge. I did not run into any particular race with the
>> tests either.
> 
> 1) is totally fine. This was in v1 and led to Zi's split_unmapped_folio()
> 
> 2) is a problem because the folio is mapped. split_huge_pmd() can also be reached from paths other than unmap.
> It is vulnerable to races via rmap. And for instance this does not look right without checking:
> 
>    folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
> 

I can add checks to make sure that the call does succeed. 
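
A minimal sketch of what such a check could look like inside split_device_private_folio(), assuming the helper keeps its current shape (the -EAGAIN return value is illustrative, not taken from the patch):

        /*
         * Hypothetical check: bail out instead of assuming the freeze
         * succeeds, e.g. if another transient reference on the device
         * private folio shows up concurrently.
         */
        if (!folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio)))
                return -EAGAIN;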

> You mention 2) is needed because of some later problems in fault path after pmd split. Would it be
> possible to split the folio at fault time then?

So after the partial unmap, the folio ends up in a somewhat strange situation: the folio is large,
but not mapped (since large_mapcount can be 0 after all the folio_remove_rmap_ptes() calls). Calling folio_split()
on a partially unmapped folio fails because folio_get_anon_vma() fails, due to the folio_mapped() failures
related to folio_large_mapcount. There is also additional complexity with ref counts and mapping.
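
For reference, the failure mode described here corresponds roughly to this check in __folio_split() for anonymous folios (simplified; the exact structure varies across kernel versions):

        /* Without folio_mapped() being true, the anon_vma lookup returns NULL
         * and the split gives up. */
        anon_vma = folio_get_anon_vma(folio);
        if (!anon_vma) {
                ret = -EBUSY;
                goto out;
        }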


> Also, didn't quite follow what kind of lock recursion did you encounter doing proper split_folio()
> instead?
> 
> 

Splitting during partial unmap causes recursive locking issues with anon_vma when invoked from
split_huge_pmd_locked() path. Deferred splits do not work for device private pages, due to the
migration requirements for fault handling.

Balbir Singh
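
As a sketch only, the sequence Zi Yan proposes in the quoted discussion (unmap the device private mapping, freeze, split the unmapped folio, unfreeze, remap) could look roughly like this; the unmap/remap/split helper names are hypothetical, only the ordering comes from the thread:

        static int split_device_private_folio_sketch(struct folio *folio)
        {
                int ret = -EAGAIN;

                unmap_device_private_mapping(folio);    /* hypothetical */
                if (folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio))) {
                        /* split while frozen and (device-)unmapped */
                        ret = split_unmapped_folio_to_order_0(folio);   /* hypothetical */
                        folio_ref_unfreeze(folio, 1);
                }
                remap_device_private_mapping(folio);    /* hypothetical */

                return ret;
        }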

Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by Balbir Singh 2 months ago
On 8/1/25 14:44, Balbir Singh wrote:
> On 8/1/25 11:16, Mika Penttilä wrote:
>> <snip>
>> You mention 2) is needed because of some later problems in fault path after pmd split. Would it be
>> possible to split the folio at fault time then?
> 
> So after the partial unmap, the folio ends up in a somewhat strange situation: the folio is large,
> but not mapped (since large_mapcount can be 0 after all the folio_remove_rmap_ptes() calls). Calling folio_split()
> on a partially unmapped folio fails because folio_get_anon_vma() fails, due to the folio_mapped() failures
> related to folio_large_mapcount. There is also additional complexity with ref counts and mapping.
> 

Let me get back to you on this with data, I was playing around with CONFIG_MM_IDS and might
have different data from it.

Balbir
Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by Mika Penttilä 2 months ago
On 8/1/25 07:44, Balbir Singh wrote:
> On 8/1/25 11:16, Mika Penttilä wrote:
>> <snip>
>> You mention 2) is needed because of some later problems in fault path after pmd split. Would it be
>> possible to split the folio at fault time then?
> So after the partial unmap, the folio ends up in a somewhat strange situation: the folio is large,
> but not mapped (since large_mapcount can be 0 after all the folio_remove_rmap_ptes() calls). Calling folio_split()
> on a partially unmapped folio fails because folio_get_anon_vma() fails, due to the folio_mapped() failures
> related to folio_large_mapcount. There is also additional complexity with ref counts and mapping.

Is this after the deferred split -> map_unused_to_zeropage flow which would leave the page unmapped? Maybe disable that for device pages?


--Mika
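
One possible shape of that opt-out, purely as a sketch (the placement inside the shared-zeropage remapping helper in mm/migrate.c is an assumption):

        /* Hypothetical guard: never remap unused subpages of a device private
         * folio to the shared zeropage. */
        if (folio_is_device_private(folio))
                return false;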

Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by David Hildenbrand 2 months ago
On 01.08.25 06:44, Balbir Singh wrote:
> On 8/1/25 11:16, Mika Penttilä wrote:
>> <snip>
>> You mention 2) is needed because of some later problems in fault path after pmd split. Would it be
>> possible to split the folio at fault time then?
> 
> So after the partial unmap, the folio ends up in a somewhat strange situation: the folio is large,
> but not mapped (since large_mapcount can be 0 after all the folio_remove_rmap_ptes() calls). Calling folio_split()
> on a partially unmapped folio fails because folio_get_anon_vma() fails, due to the folio_mapped() failures
> related to folio_large_mapcount. There is also additional complexity with ref counts and mapping.

I think you mean "Calling folio_split() on a *fully* unmapped folio 
fails ..."

A partially mapped folio still has folio_mapcount() > 0 -> 
folio_mapped() == true.

> 
> 
>> Also, didn't quite follow what kind of lock recursion did you encounter doing proper split_folio()
>> instead?
>>
>>
> 
> Splitting during partial unmap causes recursive locking issues with anon_vma when invoked from
> split_huge_pmd_locked() path.

Yes, that's very complicated.

> Deferred splits do not work for device private pages, due to the
> migration requirements for fault handling.

Can you elaborate on that?

-- 
Cheers,

David / dhildenb

Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by Balbir Singh 2 months ago
On 8/1/25 17:04, David Hildenbrand wrote:
> On 01.08.25 06:44, Balbir Singh wrote:
>> On 8/1/25 11:16, Mika Penttilä wrote:
>>> Hi,
>>>
>>> On 8/1/25 03:49, Balbir Singh wrote:
>>>
>>>> On 7/31/25 21:26, Zi Yan wrote:
>>>>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>>>>
>>>>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>>>>
>>>>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>>>>
>>>>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>    include/linux/huge_mm.h |   1 +
>>>>>>>>>>>>>>>>>>    include/linux/rmap.h    |   2 +
>>>>>>>>>>>>>>>>>>    include/linux/swapops.h |  17 +++
>>>>>>>>>>>>>>>>>>    mm/huge_memory.c        | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>>>>>    mm/page_vma_mapped.c    |  13 +-
>>>>>>>>>>>>>>>>>>    mm/pgtable-generic.c    |   6 +
>>>>>>>>>>>>>>>>>>    mm/rmap.c               |  22 +++-
>>>>>>>>>>>>>>>>>>    7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>>> +    struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>>>>> +    struct folio *new_folio;
>>>>>>>>>>>>>>>>>> +    int ret = 0;
>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>> +    /*
>>>>>>>>>>>>>>>>>> +     * Split the folio now. In the case of device
>>>>>>>>>>>>>>>>>> +     * private pages, this path is executed when
>>>>>>>>>>>>>>>>>> +     * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>>>>> +     * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>>>>> +     *
>>>>>>>>>>>>>>>>>> +     * With device private pages, deferred splits of
>>>>>>>>>>>>>>>>>> +     * folios should be handled here to prevent partial
>>>>>>>>>>>>>>>>>> +     * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>>>>> +     * and fault handling flows.
>>>>>>>>>>>>>>>>>> +     */
>>>>>>>>>>>>>>>>>> +    folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>>>>>> device side mapping.
>>>>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>>>>>
>>>>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>>>>>> at CPU side.
>>>>>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>>>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>>>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>>>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>>>>>> CPU unmapped and device mapped.
>>>>>>>>
>>>>>>>> Here are my questions on device private folios:
>>>>>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>>>>>      perspective? Can it be stored in a device private specific data structure?
>>>>>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>>>>>> common code more messy if not done that way but sure possible.
>>>>>>> And not consuming pfns (address space) at all would have benefits.
>>>>>>>
>>>>>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>>>>>      the device driver manipulate it assuming core-mm just skips device private
>>>>>>>>      folios (barring the CPU access fault handling)?
>>>>>>>>
>>>>>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>>>>>> by CPU and only device driver manipulates their mappings?
>>>>>>>>
>>>>>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>>>>>> someone could change while in device, it's just pfn.
>>>>>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>>>>>
>>>>>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>>>>>
>>>>>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>>>>>
>>>>>>
>>>>>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>>>>> Thanks for the clarification.
>>>>>
>>>>> So folio_mapcount() for device private folios should be treated the same
>>>>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>>>>> Then I wonder if the device private large folio split should go through
>>>>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>>>>> remap. Otherwise, how can we prevent rmap changes during the split?
>>>>>
>>>> That is true in general, the special cases I mentioned are:
>>>>
>>>> 1. split during migration (where the sizes on source/destination do not
>>>>     match) and so we need to split in the middle of migration. The entries
>>>>     there are already unmapped and hence the special handling
>>>> 2. Partial unmap case, where we need to split in the context of the unmap
>>>>     due to the issues mentioned in the patch. The folio split code for
>>>>     device private folios can be expanded into its own helper, which does not
>>>>     need to do the xas/mapped/lru folio handling. During partial unmap the
>>>>     original folio does get replaced by new anon rmap ptes (split_huge_pmd_locked)
>>>>
>>>> For (2), I spent some time examining the implications of not unmapping the
>>>> folios prior to the split; in the partial unmap path, once we split the PMD
>>>> the folios diverge. I did not run into any particular race with the tests
>>>> either.
>>>
>>> 1) is totally fine. This was in v1 and lead to Zi's split_unmapped_folio()
>>>
>>> 2) is a problem because folio is mapped. split_huge_pmd() can be reached also from other than unmap path.
>>> It is vulnerable to races by rmap. And for instance this does not look right without checking:
>>>
>>>     folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>
>>
>> I can add checks to make sure that the call does succeed.
>>
>>> You mention 2) is needed because of some later problems in fault path after pmd split. Would it be
>>> possible to split the folio at fault time then?
>>
>> So after the partial unmap, the folio ends up in a somewhat strange situation: the folio is large,
>> but not mapped (since large_mapcount can be 0 after all the folio_remove_rmap_ptes() calls). Calling folio_split()
>> on a partially unmapped folio fails because folio_get_anon_vma() fails due to the folio_mapped() failures
>> related to folio_large_mapcount. There is also additional complexity with ref counts and mapping.
> 
> I think you mean "Calling folio_split() on a *fully* unmapped folio fails ..."
> 
> A partially mapped folio still has folio_mapcount() > 0 -> folio_mapped() == true.
> 

Looking into this again at my end

>>
>>
>>> Also, didn't quite follow what kind of lock recursion did you encounter doing proper split_folio()
>>> instead?
>>>
>>>
>>
>> Splitting during partial unmap causes recursive locking issues with anon_vma when invoked from
>> split_huge_pmd_locked() path.
> 
> Yes, that's very complicated.
> 

Yes and I want to avoid going down that path.

>> Deferred splits do not work for device private pages, due to the
>> migration requirements for fault handling.
> 
> Can you elaborate on that?
> 

If a folio has been queued via deferred_split() and is still pending a split, and a fault is then handled on that
partially mapped folio, the code in folio_migrate_mapping() assumes, as part of the fault handling during migration,
that the source and destination folio sizes are the same (via the checks on reference and map counts).
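
Roughly, the shape of that assumption is sketched below. This is not the exact upstream code, just an
illustration (sketch_can_migrate() is a made-up name); the point is that migration freezes the source at an
expected reference count derived from its size and mappings, so a source folio that is still large while the
destination side is order-0 cannot get past this step:

        /*
         * Sketch only: migration freezes the source folio at an expected
         * reference count derived from its size and mappings. A source
         * folio that was pmd-split but never folio-split (still large)
         * cannot pass this when migrating into order-0 destinations.
         */
        static bool sketch_can_migrate(struct folio *src)
        {
                int expected = folio_expected_ref_count(src) + 1;

                return folio_ref_freeze(src, expected);
        }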

Balbir


Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by David Hildenbrand 2 months ago
On 01.08.25 10:01, Balbir Singh wrote:
> On 8/1/25 17:04, David Hildenbrand wrote:
>> On 01.08.25 06:44, Balbir Singh wrote:
>>> On 8/1/25 11:16, Mika Penttilä wrote:
>>>> Hi,
>>>>
>>>> On 8/1/25 03:49, Balbir Singh wrote:
>>>>
>>>>> On 7/31/25 21:26, Zi Yan wrote:
>>>>>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>>>>>
>>>>>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>>>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>>>>>
>>>>>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>>>> +    struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>>>>>> +    struct folio *new_folio;
>>>>>>>>>>>>>>>>>>> +    int ret = 0;
>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>> +    /*
>>>>>>>>>>>>>>>>>>> +     * Split the folio now. In the case of device
>>>>>>>>>>>>>>>>>>> +     * private pages, this path is executed when
>>>>>>>>>>>>>>>>>>> +     * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>>>>>> +     * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>>>>>> +     *
>>>>>>>>>>>>>>>>>>> +     * With device private pages, deferred splits of
>>>>>>>>>>>>>>>>>>> +     * folios should be handled here to prevent partial
>>>>>>>>>>>>>>>>>>> +     * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>>>>>> +     * and fault handling flows.
>>>>>>>>>>>>>>>>>>> +     */
>>>>>>>>>>>>>>>>>>> +    folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>>>>>>> device side mapping.
>>>>>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>>>>>>> at CPU side.
>>>>>>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>>>>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>>>>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>>>>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>>>>>>> CPU unmapped and device mapped.
>>>>>>>>>
>>>>>>>>> Here are my questions on device private folios:
>>>>>>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>>>>>>       perspective? Can it be stored in a device private specific data structure?
>>>>>>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>>>>>>> common code more messy if not done that way but sure possible.
>>>>>>>> And not consuming pfns (address space) at all would have benefits.
>>>>>>>>
>>>>>>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>>>>>>       the device driver manipulate it assuming core-mm just skips device private
>>>>>>>>>       folios (barring the CPU access fault handling)?
>>>>>>>>>
>>>>>>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>>>>>>> by CPU and only device driver manipulates their mappings?
>>>>>>>>>
>>>>>>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>>>>>>> someone could change while in device, it's just pfn.
>>>>>>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>>>>>>
>>>>>>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>>>>>>
>>>>>>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>>>>>>
>>>>>>>
>>>>>>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>>>>>> Thanks for the clarification.
>>>>>>
>>>>>> So folio_mapcount() for device private folios should be treated the same
>>>>>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>>>>>> Then I wonder if the device private large folio split should go through
>>>>>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>>>>>> remap. Otherwise, how can we prevent rmap changes during the split?
>>>>>>
>>>>> That is true in general, the special cases I mentioned are:
>>>>>
>>>>> 1. split during migration (where we the sizes on source/destination do not
>>>>>      match) and so we need to split in the middle of migration. The entries
>>>>>      there are already unmapped and hence the special handling
>>>>> 2. Partial unmap case, where we need to split in the context of the unmap
>>>>>      due to the isses mentioned in the patch. I expanded the folio split code
>>>>>      for device private can be expanded into its own helper, which does not
>>>>>      need to do the xas/mapped/lru folio handling. During partial unmap the
>>>>>      original folio does get replaced by new anon rmap ptes (split_huge_pmd_locked)
>>>>>
>>>>> For (2), I spent some time examining the implications of not unmapping the
>>>>> folios prior to split and in the partial unmap path, once we split the PMD
>>>>> the folios diverge. I did not run into any particular race either with the
>>>>> tests.
>>>>
>>>> 1) is totally fine. This was in v1 and lead to Zi's split_unmapped_folio()
>>>>
>>>> 2) is a problem because folio is mapped. split_huge_pmd() can be reached also from other than unmap path.
>>>> It is vulnerable to races by rmap. And for instance this does not look right without checking:
>>>>
>>>>      folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>
>>>
>>> I can add checks to make sure that the call does succeed.
>>>
>>>> You mention 2) is needed because of some later problems in fault path after pmd split. Would it be
>>>> possible to split the folio at fault time then?
>>>
>>> So after the partial unmap, the folio ends up in a little strange situation, the folio is large,
>>> but not mapped (since large_mapcount can be 0, after all the folio_rmap_remove_ptes). Calling folio_split()
>>> on partially unmapped fails because folio_get_anon_vma() fails due to the folio_mapped() failures
>>> related to folio_large_mapcount. There is also additional complexity with ref counts and mapping.
>>
>> I think you mean "Calling folio_split() on a *fully* unmapped folio fails ..."
>>
>> A partially mapped folio still has folio_mapcount() > 0 -> folio_mapped() == true.
>>
> 
> Looking into this again at my end
> 
>>>
>>>
>>>> Also, didn't quite follow what kind of lock recursion did you encounter doing proper split_folio()
>>>> instead?
>>>>
>>>>
>>>
>>> Splitting during partial unmap causes recursive locking issues with anon_vma when invoked from
>>> split_huge_pmd_locked() path.
>>
>> Yes, that's very complicated.
>>
> 
> Yes and I want to avoid going down that path.
> 
>>> Deferred splits do not work for device private pages, due to the
>>> migration requirements for fault handling.
>>
>> Can you elaborate on that?
>>
> 
> If a folio is under deferred_split() and is still pending a split. When a fault is handled on a partially
> mapped folio, the expectation is that as a part of fault handling during migration, the code in migrate_folio_mapping()
> assumes that the folio sizes are the same (via check for reference and mapcount)

If you hit a partially-mapped folio, instead of migrating, you would 
actually want to split and then migrate I assume.
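
Something along these lines, purely illustrative (migrate_small_folio_to_ram() below is a made-up name,
only the ordering matters):

        /*
         * Illustration only: split the partially-mapped device private
         * THP first, then migrate the now order-0 folio covering the
         * faulting address, instead of migrating the large folio as one
         * unit.
         */
        static int sketch_handle_fault(struct folio *folio)
        {
                int ret;

                if (folio_test_large(folio)) {
                        /* requires the folio lock to be held */
                        ret = split_folio(folio);
                        if (ret)
                                return ret;
                }
                return migrate_small_folio_to_ram(folio);       /* made up */
        }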

-- 
Cheers,

David / dhildenb

Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by Zi Yan 2 months ago
On 1 Aug 2025, at 4:46, David Hildenbrand wrote:

> On 01.08.25 10:01, Balbir Singh wrote:
>> On 8/1/25 17:04, David Hildenbrand wrote:
>>> On 01.08.25 06:44, Balbir Singh wrote:
>>>> On 8/1/25 11:16, Mika Penttilä wrote:
>>>>> Hi,
>>>>>
>>>>> On 8/1/25 03:49, Balbir Singh wrote:
>>>>>
>>>>>> On 7/31/25 21:26, Zi Yan wrote:
>>>>>>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>>>>>>
>>>>>>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>>>>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>>>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>>>>>>
>>>>>>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>>>>> +    struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>>>>>>> +    struct folio *new_folio;
>>>>>>>>>>>>>>>>>>>> +    int ret = 0;
>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>> +    /*
>>>>>>>>>>>>>>>>>>>> +     * Split the folio now. In the case of device
>>>>>>>>>>>>>>>>>>>> +     * private pages, this path is executed when
>>>>>>>>>>>>>>>>>>>> +     * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>>>>>>> +     * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>>>>>>> +     *
>>>>>>>>>>>>>>>>>>>> +     * With device private pages, deferred splits of
>>>>>>>>>>>>>>>>>>>> +     * folios should be handled here to prevent partial
>>>>>>>>>>>>>>>>>>>> +     * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>>>>>>> +     * and fault handling flows.
>>>>>>>>>>>>>>>>>>>> +     */
>>>>>>>>>>>>>>>>>>>> +    folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>>>>>>>> device side mapping.
>>>>>>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>>>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>>>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>>>>>>>> at CPU side.
>>>>>>>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>>>>>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>>>>>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>>>>>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>>>>>>>> CPU unmapped and device mapped.
>>>>>>>>>>
>>>>>>>>>> Here are my questions on device private folios:
>>>>>>>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>>>>>>>       perspective? Can it be stored in a device private specific data structure?
>>>>>>>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>>>>>>>> common code more messy if not done that way but sure possible.
>>>>>>>>> And not consuming pfns (address space) at all would have benefits.
>>>>>>>>>
>>>>>>>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>>>>>>>       the device driver manipulate it assuming core-mm just skips device private
>>>>>>>>>>       folios (barring the CPU access fault handling)?
>>>>>>>>>>
>>>>>>>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>>>>>>>> by CPU and only device driver manipulates their mappings?
>>>>>>>>>>
>>>>>>>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>>>>>>>> someone could change while in device, it's just pfn.
>>>>>>>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>>>>>>>
>>>>>>>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>>>>>>>
>>>>>>>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>>>>>>>
>>>>>>>>
>>>>>>>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>>>>>>> Thanks for the clarification.
>>>>>>>
>>>>>>> So folio_mapcount() for device private folios should be treated the same
>>>>>>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>>>>>>> Then I wonder if the device private large folio split should go through
>>>>>>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>>>>>>> remap. Otherwise, how can we prevent rmap changes during the split?
>>>>>>>
>>>>>> That is true in general, the special cases I mentioned are:
>>>>>>
>>>>>> 1. split during migration (where we the sizes on source/destination do not
>>>>>>      match) and so we need to split in the middle of migration. The entries
>>>>>>      there are already unmapped and hence the special handling
>>>>>> 2. Partial unmap case, where we need to split in the context of the unmap
>>>>>>      due to the isses mentioned in the patch. I expanded the folio split code
>>>>>>      for device private can be expanded into its own helper, which does not
>>>>>>      need to do the xas/mapped/lru folio handling. During partial unmap the
>>>>>>      original folio does get replaced by new anon rmap ptes (split_huge_pmd_locked)
>>>>>>
>>>>>> For (2), I spent some time examining the implications of not unmapping the
>>>>>> folios prior to split and in the partial unmap path, once we split the PMD
>>>>>> the folios diverge. I did not run into any particular race either with the
>>>>>> tests.
>>>>>
>>>>> 1) is totally fine. This was in v1 and lead to Zi's split_unmapped_folio()
>>>>>
>>>>> 2) is a problem because folio is mapped. split_huge_pmd() can be reached also from other than unmap path.
>>>>> It is vulnerable to races by rmap. And for instance this does not look right without checking:
>>>>>
>>>>>      folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>
>>>>
>>>> I can add checks to make sure that the call does succeed.
>>>>
>>>>> You mention 2) is needed because of some later problems in fault path after pmd split. Would it be
>>>>> possible to split the folio at fault time then?
>>>>
>>>> So after the partial unmap, the folio ends up in a little strange situation, the folio is large,
>>>> but not mapped (since large_mapcount can be 0, after all the folio_rmap_remove_ptes). Calling folio_split()
>>>> on partially unmapped fails because folio_get_anon_vma() fails due to the folio_mapped() failures
>>>> related to folio_large_mapcount. There is also additional complexity with ref counts and mapping.
>>>
>>> I think you mean "Calling folio_split() on a *fully* unmapped folio fails ..."
>>>
>>> A partially mapped folio still has folio_mapcount() > 0 -> folio_mapped() == true.
>>>
>>
>> Looking into this again at my end
>>
>>>>
>>>>
>>>>> Also, didn't quite follow what kind of lock recursion did you encounter doing proper split_folio()
>>>>> instead?
>>>>>
>>>>>
>>>>
>>>> Splitting during partial unmap causes recursive locking issues with anon_vma when invoked from
>>>> split_huge_pmd_locked() path.
>>>
>>> Yes, that's very complicated.
>>>
>>
>> Yes and I want to avoid going down that path.
>>
>>>> Deferred splits do not work for device private pages, due to the
>>>> migration requirements for fault handling.
>>>
>>> Can you elaborate on that?
>>>
>>
>> If a folio is under deferred_split() and is still pending a split. When a fault is handled on a partially
>> mapped folio, the expectation is that as a part of fault handling during migration, the code in migrate_folio_mapping()
>> assumes that the folio sizes are the same (via check for reference and mapcount)
>
> If you hit a partially-mapped folio, instead of migrating, you would actually want to split and then migrate I assume.

Yes, that is exactly what migrate_pages() does. And if the split fails, the migration
fails too. Device private folios probably should do the same thing, assuming
splitting a device private folio would always succeed.
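
For reference, a stripped-down sketch of that pattern (not the real mm/migrate.c code; try_to_move_folio()
is just a placeholder for the unmap+move steps):

        /*
         * Sketch of the migrate_pages() fallback: when moving a large
         * folio fails, try to split it back onto the list and retry the
         * resulting small folios; if the split fails as well, migration
         * of that folio fails.
         */
        static unsigned int sketch_migrate_list(struct list_head *from)
        {
                struct folio *folio, *next;
                unsigned int nr_failed = 0;

                list_for_each_entry_safe(folio, next, from, lru) {
                        if (!try_to_move_folio(folio))  /* placeholder */
                                continue;               /* moved fine */
                        if (folio_test_large(folio) &&
                            !split_folio_to_list(folio, from))
                                continue;               /* retry pieces */
                        nr_failed++;
                }
                return nr_failed;
        }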

Best Regards,
Yan, Zi
Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by Mika Penttilä 2 months ago
On 8/1/25 14:10, Zi Yan wrote:
> On 1 Aug 2025, at 4:46, David Hildenbrand wrote:
>
>> On 01.08.25 10:01, Balbir Singh wrote:
>>> On 8/1/25 17:04, David Hildenbrand wrote:
>>>> On 01.08.25 06:44, Balbir Singh wrote:
>>>>> On 8/1/25 11:16, Mika Penttilä wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 8/1/25 03:49, Balbir Singh wrote:
>>>>>>
>>>>>>> On 7/31/25 21:26, Zi Yan wrote:
>>>>>>>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>>>>>>>
>>>>>>>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>>>>>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>>>>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>>>>>> +    struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>>>>>>>> +    struct folio *new_folio;
>>>>>>>>>>>>>>>>>>>>> +    int ret = 0;
>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>> +    /*
>>>>>>>>>>>>>>>>>>>>> +     * Split the folio now. In the case of device
>>>>>>>>>>>>>>>>>>>>> +     * private pages, this path is executed when
>>>>>>>>>>>>>>>>>>>>> +     * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>>>>>>>> +     * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>>>>>>>> +     *
>>>>>>>>>>>>>>>>>>>>> +     * With device private pages, deferred splits of
>>>>>>>>>>>>>>>>>>>>> +     * folios should be handled here to prevent partial
>>>>>>>>>>>>>>>>>>>>> +     * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>>>>>>>> +     * and fault handling flows.
>>>>>>>>>>>>>>>>>>>>> +     */
>>>>>>>>>>>>>>>>>>>>> +    folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>>>>>>>>> device side mapping.
>>>>>>>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>>>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>>>>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>>>>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>>>>>>>>> at CPU side.
>>>>>>>>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>>>>>>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>>>>>>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>>>>>>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>>>>>>>>> CPU unmapped and device mapped.
>>>>>>>>>>>
>>>>>>>>>>> Here are my questions on device private folios:
>>>>>>>>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>>>>>>>>       perspective? Can it be stored in a device private specific data structure?
>>>>>>>>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>>>>>>>>> common code more messy if not done that way but sure possible.
>>>>>>>>>> And not consuming pfns (address space) at all would have benefits.
>>>>>>>>>>
>>>>>>>>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>>>>>>>>       the device driver manipulate it assuming core-mm just skips device private
>>>>>>>>>>>       folios (barring the CPU access fault handling)?
>>>>>>>>>>>
>>>>>>>>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>>>>>>>>> by CPU and only device driver manipulates their mappings?
>>>>>>>>>>>
>>>>>>>>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>>>>>>>>> someone could change while in device, it's just pfn.
>>>>>>>>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>>>>>>>>
>>>>>>>>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>>>>>>>>
>>>>>>>>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>>>>>>>> Thanks for the clarification.
>>>>>>>>
>>>>>>>> So folio_mapcount() for device private folios should be treated the same
>>>>>>>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>>>>>>>> Then I wonder if the device private large folio split should go through
>>>>>>>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>>>>>>>> remap. Otherwise, how can we prevent rmap changes during the split?
>>>>>>>>
>>>>>>> That is true in general, the special cases I mentioned are:
>>>>>>>
>>>>>>> 1. split during migration (where we the sizes on source/destination do not
>>>>>>>      match) and so we need to split in the middle of migration. The entries
>>>>>>>      there are already unmapped and hence the special handling
>>>>>>> 2. Partial unmap case, where we need to split in the context of the unmap
>>>>>>>      due to the isses mentioned in the patch. I expanded the folio split code
>>>>>>>      for device private can be expanded into its own helper, which does not
>>>>>>>      need to do the xas/mapped/lru folio handling. During partial unmap the
>>>>>>>      original folio does get replaced by new anon rmap ptes (split_huge_pmd_locked)
>>>>>>>
>>>>>>> For (2), I spent some time examining the implications of not unmapping the
>>>>>>> folios prior to split and in the partial unmap path, once we split the PMD
>>>>>>> the folios diverge. I did not run into any particular race either with the
>>>>>>> tests.
>>>>>> 1) is totally fine. This was in v1 and lead to Zi's split_unmapped_folio()
>>>>>>
>>>>>> 2) is a problem because folio is mapped. split_huge_pmd() can be reached also from other than unmap path.
>>>>>> It is vulnerable to races by rmap. And for instance this does not look right without checking:
>>>>>>
>>>>>>      folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>
>>>>> I can add checks to make sure that the call does succeed.
>>>>>
>>>>>> You mention 2) is needed because of some later problems in fault path after pmd split. Would it be
>>>>>> possible to split the folio at fault time then?
>>>>> So after the partial unmap, the folio ends up in a little strange situation, the folio is large,
>>>>> but not mapped (since large_mapcount can be 0, after all the folio_rmap_remove_ptes). Calling folio_split()
>>>>> on partially unmapped fails because folio_get_anon_vma() fails due to the folio_mapped() failures
>>>>> related to folio_large_mapcount. There is also additional complexity with ref counts and mapping.
>>>> I think you mean "Calling folio_split() on a *fully* unmapped folio fails ..."
>>>>
>>>> A partially mapped folio still has folio_mapcount() > 0 -> folio_mapped() == true.
>>>>
>>> Looking into this again at my end
>>>
>>>>>
>>>>>> Also, didn't quite follow what kind of lock recursion did you encounter doing proper split_folio()
>>>>>> instead?
>>>>>>
>>>>>>
>>>>> Splitting during partial unmap causes recursive locking issues with anon_vma when invoked from
>>>>> split_huge_pmd_locked() path.
>>>> Yes, that's very complicated.
>>>>
>>> Yes and I want to avoid going down that path.
>>>
>>>>> Deferred splits do not work for device private pages, due to the
>>>>> migration requirements for fault handling.
>>>> Can you elaborate on that?
>>>>
>>> If a folio is under deferred_split() and is still pending a split. When a fault is handled on a partially
>>> mapped folio, the expectation is that as a part of fault handling during migration, the code in migrate_folio_mapping()
>>> assumes that the folio sizes are the same (via check for reference and mapcount)
>> If you hit a partially-mapped folio, instead of migrating, you would actually want to split and then migrate I assume.
> Yes, that is exactly what migrate_pages() does. And if split fails, the migration
> fails too. Device private folio probably should do the same thing, assuming
> splitting device private folio would always succeed.

hmm, afaics the normal folio_split() wants to use RMP_USE_SHARED_ZEROPAGE when splitting and remapping
device private pages, and that can't work..
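
i.e. the remap step would need something like the rough sketch below (sketch_remap_after_split() is not the
real remap_page(), and the exact remove_migration_ptes() flag handling may differ upstream):

        /*
         * Rough sketch: the shared zeropage optimisation must not be
         * applied when remapping a device private folio after a split,
         * since its contents live on the device, not in system memory.
         */
        static void sketch_remap_after_split(struct folio *folio)
        {
                int flags = RMP_LOCKED;

                if (!folio_is_device_private(folio))
                        flags |= RMP_USE_SHARED_ZEROPAGE;

                remove_migration_ptes(folio, folio, flags);
        }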

>
> Best Regards,
> Yan, Zi
>
--Mika

Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by Zi Yan 2 months ago
On 1 Aug 2025, at 8:20, Mika Penttilä wrote:

> On 8/1/25 14:10, Zi Yan wrote:
>> On 1 Aug 2025, at 4:46, David Hildenbrand wrote:
>>
>>> On 01.08.25 10:01, Balbir Singh wrote:
>>>> On 8/1/25 17:04, David Hildenbrand wrote:
>>>>> On 01.08.25 06:44, Balbir Singh wrote:
>>>>>> On 8/1/25 11:16, Mika Penttilä wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 8/1/25 03:49, Balbir Singh wrote:
>>>>>>>
>>>>>>>> On 7/31/25 21:26, Zi Yan wrote:
>>>>>>>>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>>>>>>>>
>>>>>>>>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>>>>>>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>>>>>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>>>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>>>>>>> +    struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>>>>>>>>> +    struct folio *new_folio;
>>>>>>>>>>>>>>>>>>>>>> +    int ret = 0;
>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>> +    /*
>>>>>>>>>>>>>>>>>>>>>> +     * Split the folio now. In the case of device
>>>>>>>>>>>>>>>>>>>>>> +     * private pages, this path is executed when
>>>>>>>>>>>>>>>>>>>>>> +     * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>>>>>>>>> +     * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>>>>>>>>> +     *
>>>>>>>>>>>>>>>>>>>>>> +     * With device private pages, deferred splits of
>>>>>>>>>>>>>>>>>>>>>> +     * folios should be handled here to prevent partial
>>>>>>>>>>>>>>>>>>>>>> +     * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>>>>>>>>> +     * and fault handling flows.
>>>>>>>>>>>>>>>>>>>>>> +     */
>>>>>>>>>>>>>>>>>>>>>> +    folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>>>>>>>>>> device side mapping.
>>>>>>>>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>>>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>>>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>>>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>>>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>>>>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>>>>>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>>>>>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>>>>>>>>>> at CPU side.
>>>>>>>>>>>>> Yes, both are "swap entries", but there's a difference: the device private ones contribute to mapcount and refcount.
>>>>>>>>>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>>>>>>>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>>>>>>>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>>>>>>>>>> CPU unmapped and device mapped.
>>>>>>>>>>>>
>>>>>>>>>>>> Here are my questions on device private folios:
>>>>>>>>>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>>>>>>>>>       perspective? Can it be stored in a device private specific data structure?
>>>>>>>>>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>>>>>>>>>> common code more messy if not done that way but sure possible.
>>>>>>>>>>> And not consuming pfns (address space) at all would have benefits.
>>>>>>>>>>>
>>>>>>>>>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>>>>>>>>>       the device driver manipulate it assuming core-mm just skips device private
>>>>>>>>>>>>       folios (barring the CPU access fault handling)?
>>>>>>>>>>>>
>>>>>>>>>>>> Where I am going is: can device private folios be treated as unmapped folios
>>>>>>>>>>>> by the CPU, with only the device driver manipulating their mappings?
>>>>>>>>>>>>
>>>>>>>>>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>>>>>>>>>> someone could change while in device, it's just pfn.
>>>>>>>>>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>>>>>>>>>
>>>>>>>>>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>>>>>>>>>
>>>>>>>>>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>>>>>>>>> Thanks for the clarification.
>>>>>>>>>
>>>>>>>>> So folio_mapcount() for device private folios should be treated the same
>>>>>>>>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>>>>>>>>> Then I wonder if the device private large folio split should go through
>>>>>>>>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>>>>>>>>> remap. Otherwise, how can we prevent rmap changes during the split?
>>>>>>>>>
>>>>>>>> That is true in general, the special cases I mentioned are:
>>>>>>>>
>>>>>>>> 1. split during migration (where the sizes on source/destination do not
>>>>>>>>      match) and so we need to split in the middle of migration. The entries
>>>>>>>>      there are already unmapped and hence the special handling
>>>>>>>> 2. Partial unmap case, where we need to split in the context of the unmap
>>>>>>>>      due to the issues mentioned in the patch. I expanded the folio split code
>>>>>>>>      for device private into its own helper, which does not
>>>>>>>>      need to do the xas/mapped/lru folio handling. During partial unmap the
>>>>>>>>      original folio does get replaced by new anon rmap ptes (split_huge_pmd_locked)
>>>>>>>>
>>>>>>>> For (2), I spent some time examining the implications of not unmapping the
>>>>>>>> folios prior to split and in the partial unmap path, once we split the PMD
>>>>>>>> the folios diverge. I did not run into any particular race either with the
>>>>>>>> tests.
>>>>>>> 1) is totally fine. This was in v1 and lead to Zi's split_unmapped_folio()
>>>>>>>
>>>>>>> 2) is a problem because the folio is mapped. split_huge_pmd() can also be reached from paths other than the unmap path.
>>>>>>> It is vulnerable to races via rmap. And for instance this does not look right without checking the return value:
>>>>>>>
>>>>>>>      folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>
>>>>>> I can add checks to make sure that the call does succeed.
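
A minimal sketch of such a check, reusing the calls from the quoted helper (error handling here is illustrative only, not the final code):

        int expected = 1 + folio_expected_ref_count(folio);

        if (!folio_ref_freeze(folio, expected))
                return -EBUSY;  /* unexpected extra references, let the caller bail out or retry */

        ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
        /* ... unfreeze the resulting folios as in the quoted patch ... */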
>>>>>>
>>>>>>> You mention 2) is needed because of some later problems in fault path after pmd split. Would it be
>>>>>>> possible to split the folio at fault time then?
>>>>>> So after the partial unmap, the folio ends up in a somewhat strange situation: the folio is large,
>>>>>> but not mapped (since large_mapcount can be 0 after all the folio_remove_rmap_ptes() calls). Calling folio_split()
>>>>>> on a partially unmapped folio fails because folio_get_anon_vma() fails due to the folio_mapped() failures
>>>>>> related to folio_large_mapcount. There is also additional complexity with ref counts and mapping.
>>>>> I think you mean "Calling folio_split() on a *fully* unmapped folio fails ..."
>>>>>
>>>>> A partially mapped folio still has folio_mapcount() > 0 -> folio_mapped() == true.
>>>>>
>>>> Looking into this again at my end
>>>>
>>>>>>
>>>>>>> Also, I didn't quite follow what kind of lock recursion you encountered when doing a proper split_folio()
>>>>>>> instead?
>>>>>>>
>>>>>>>
>>>>>> Splitting during partial unmap causes recursive locking issues with anon_vma when invoked from
>>>>>> split_huge_pmd_locked() path.
>>>>> Yes, that's very complicated.
>>>>>
>>>> Yes and I want to avoid going down that path.
>>>>
>>>>>> Deferred splits do not work for device private pages, due to the
>>>>>> migration requirements for fault handling.
>>>>> Can you elaborate on that?
>>>>>
>>>> If a folio is under deferred_split() and is still pending a split, and a fault is handled on a partially
>>>> mapped folio, the expectation is that as a part of fault handling during migration, the code in folio_migrate_mapping()
>>>> assumes that the folio sizes are the same (via checks on the reference count and mapcount).
>>> If you hit a partially-mapped folio, instead of migrating, you would actually want to split and then migrate I assume.
>> Yes, that is exactly what migrate_pages() does. And if the split fails, the migration
>> fails too. Device private folios probably should do the same thing, assuming
>> splitting a device private folio would always succeed.
>
> hmm afaics the normal folio_split wants to use RMP_USE_SHARED_ZEROPAGE when splitting and remapping
> device private pages, that can't work..

It is fine to exclude device private folios from using RMP_USE_SHARED_ZEROPAGE, like:

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 2b4ea5a2ce7d..b97dfd3521a9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3858,7 +3858,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
        if (nr_shmem_dropped)
                shmem_uncharge(mapping->host, nr_shmem_dropped);

-       if (!ret && is_anon)
+       if (!ret && is_anon && !folio_is_device_private(folio))
                remap_flags = RMP_USE_SHARED_ZEROPAGE;
        remap_page(folio, 1 << order, remap_flags);

Or it can be done in remove_migration_pte().
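
For reference, the remove_migration_pte() variant could look roughly like the fragment below; the surrounding context (rmap_walk_arg->map_unused_to_zeropage and try_to_map_unused_to_zeropage()) is taken from current mainline and may differ in this series:

        /* inside the page_vma_mapped_walk() loop of remove_migration_pte() */
        if (rmap_walk_arg->map_unused_to_zeropage &&
            !folio_is_device_private(folio) &&
            try_to_map_unused_to_zeropage(&pvmw, folio, idx))
                continue;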

Best Regards,
Yan, Zi
Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by Balbir Singh 2 months ago
On 8/1/25 22:28, Zi Yan wrote:
> On 1 Aug 2025, at 8:20, Mika Penttilä wrote:
> 
>> On 8/1/25 14:10, Zi Yan wrote:
>>> On 1 Aug 2025, at 4:46, David Hildenbrand wrote:
>>>
>>>> On 01.08.25 10:01, Balbir Singh wrote:
>>>>> On 8/1/25 17:04, David Hildenbrand wrote:
>>>>>> On 01.08.25 06:44, Balbir Singh wrote:
>>>>>>> On 8/1/25 11:16, Mika Penttilä wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 8/1/25 03:49, Balbir Singh wrote:
>>>>>>>>
>>>>>>>>> On 7/31/25 21:26, Zi Yan wrote:
>>>>>>>>>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>>>>>>>>>
>>>>>>>>>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>>>>>>>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>>>>>>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>>>>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>>>>>>     include/linux/huge_mm.h |   1 +
>>>>>>>>>>>>>>>>>>>>>>>     include/linux/rmap.h    |   2 +
>>>>>>>>>>>>>>>>>>>>>>>     include/linux/swapops.h |  17 +++
>>>>>>>>>>>>>>>>>>>>>>>     mm/huge_memory.c        | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>>>>>>>>>>     mm/page_vma_mapped.c    |  13 +-
>>>>>>>>>>>>>>>>>>>>>>>     mm/pgtable-generic.c    |   6 +
>>>>>>>>>>>>>>>>>>>>>>>     mm/rmap.c               |  22 +++-
>>>>>>>>>>>>>>>>>>>>>>>     7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>>>>>>>> +    struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>>>>>>>>>> +    struct folio *new_folio;
>>>>>>>>>>>>>>>>>>>>>>> +    int ret = 0;
>>>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>>>> +    /*
>>>>>>>>>>>>>>>>>>>>>>> +     * Split the folio now. In the case of device
>>>>>>>>>>>>>>>>>>>>>>> +     * private pages, this path is executed when
>>>>>>>>>>>>>>>>>>>>>>> +     * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>>>>>>>>>> +     * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>>>>>>>>>> +     *
>>>>>>>>>>>>>>>>>>>>>>> +     * With device private pages, deferred splits of
>>>>>>>>>>>>>>>>>>>>>>> +     * folios should be handled here to prevent partial
>>>>>>>>>>>>>>>>>>>>>>> +     * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>>>>>>>>>> +     * and fault handling flows.
>>>>>>>>>>>>>>>>>>>>>>> +     */
>>>>>>>>>>>>>>>>>>>>>>> +    folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped on the
>>>>>>>>>>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeze() is not aware of the
>>>>>>>>>>>>>>>>>>>>> device-side mapping.
>>>>>>>>>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>>>>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>>>>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>>>>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>>>>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>>>>>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>>>>>>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>>>>>>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>>>>>>>>>>> at CPU side.
>>>>>>>>>>>>>> Yes, both are "swap entries", but there's a difference: the device private ones contribute to mapcount and refcount.
>>>>>>>>>>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>>>>>>>>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>>>>>>>>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>>>>>>>>>>> CPU unmapped and device mapped.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here are my questions on device private folios:
>>>>>>>>>>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>>>>>>>>>>       perspective? Can it be stored in a device private specific data structure?
>>>>>>>>>>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>>>>>>>>>>> common code more messy if not done that way but sure possible.
>>>>>>>>>>>> And not consuming pfns (address space) at all would have benefits.
>>>>>>>>>>>>
>>>>>>>>>>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>>>>>>>>>>       the device driver manipulate it assuming core-mm just skips device private
>>>>>>>>>>>>>       folios (barring the CPU access fault handling)?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Where I am going is: can device private folios be treated as unmapped folios
>>>>>>>>>>>>> by the CPU, with only the device driver manipulating their mappings?
>>>>>>>>>>>>>
>>>>>>>>>>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>>>>>>>>>>> someone could change while in device, it's just pfn.
>>>>>>>>>>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>>>>>>>>>>
>>>>>>>>>>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>>>>>>>>>>
>>>>>>>>>>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>>>>>>>>>> Thanks for the clarification.
>>>>>>>>>>
>>>>>>>>>> So folio_mapcount() for device private folios should be treated the same
>>>>>>>>>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>>>>>>>>>> Then I wonder if the device private large folio split should go through
>>>>>>>>>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>>>>>>>>>> remap. Otherwise, how can we prevent rmap changes during the split?
>>>>>>>>>>
>>>>>>>>> That is true in general, the special cases I mentioned are:
>>>>>>>>>
>>>>>>>>> 1. split during migration (where the sizes on source/destination do not
>>>>>>>>>      match) and so we need to split in the middle of migration. The entries
>>>>>>>>>      there are already unmapped and hence the special handling
>>>>>>>>> 2. Partial unmap case, where we need to split in the context of the unmap
>>>>>>>>>      due to the issues mentioned in the patch. I expanded the folio split code
>>>>>>>>>      for device private into its own helper, which does not
>>>>>>>>>      need to do the xas/mapped/lru folio handling. During partial unmap the
>>>>>>>>>      original folio does get replaced by new anon rmap ptes (split_huge_pmd_locked)
>>>>>>>>>
>>>>>>>>> For (2), I spent some time examining the implications of not unmapping the
>>>>>>>>> folios prior to split and in the partial unmap path, once we split the PMD
>>>>>>>>> the folios diverge. I did not run into any particular race either with the
>>>>>>>>> tests.
>>>>>>>> 1) is totally fine. This was in v1 and lead to Zi's split_unmapped_folio()
>>>>>>>>
>>>>>>>> 2) is a problem because the folio is mapped. split_huge_pmd() can also be reached from paths other than the unmap path.
>>>>>>>> It is vulnerable to races via rmap. And for instance this does not look right without checking the return value:
>>>>>>>>
>>>>>>>>      folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>
>>>>>>> I can add checks to make sure that the call does succeed.
>>>>>>>
>>>>>>>> You mention 2) is needed because of some later problems in fault path after pmd split. Would it be
>>>>>>>> possible to split the folio at fault time then?
>>>>>>> So after the partial unmap, the folio ends up in a somewhat strange situation: the folio is large,
>>>>>>> but not mapped (since large_mapcount can be 0 after all the folio_remove_rmap_ptes() calls). Calling folio_split()
>>>>>>> on a partially unmapped folio fails because folio_get_anon_vma() fails due to the folio_mapped() failures
>>>>>>> related to folio_large_mapcount. There is also additional complexity with ref counts and mapping.
>>>>>> I think you mean "Calling folio_split() on a *fully* unmapped folio fails ..."
>>>>>>
>>>>>> A partially mapped folio still has folio_mapcount() > 0 -> folio_mapped() == true.
>>>>>>
>>>>> Looking into this again at my end
>>>>>
>>>>>>>
>>>>>>>> Also, I didn't quite follow what kind of lock recursion you encountered when doing a proper split_folio()
>>>>>>>> instead?
>>>>>>>>
>>>>>>>>
>>>>>>> Splitting during partial unmap causes recursive locking issues with anon_vma when invoked from
>>>>>>> split_huge_pmd_locked() path.
>>>>>> Yes, that's very complicated.
>>>>>>
>>>>> Yes and I want to avoid going down that path.
>>>>>
>>>>>>> Deferred splits do not work for device private pages, due to the
>>>>>>> migration requirements for fault handling.
>>>>>> Can you elaborate on that?
>>>>>>
>>>>> If a folio is under deferred_split() and is still pending a split, and a fault is handled on a partially
>>>>> mapped folio, the expectation is that as a part of fault handling during migration, the code in folio_migrate_mapping()
>>>>> assumes that the folio sizes are the same (via checks on the reference count and mapcount).
>>>> If you hit a partially-mapped folio, instead of migrating, you would actually want to split and then migrate I assume.
>>> Yes, that is exactly what migrate_pages() does. And if the split fails, the migration
>>> fails too. Device private folios probably should do the same thing, assuming
>>> splitting a device private folio would always succeed.
>>
>> hmm afaics the normal folio_split wants to use RMP_USE_SHARED_ZEROPAGE when splitting and remapping
>> device private pages, that can't work..
> 
> It is fine to exclude device private folios from using RMP_USE_SHARED_ZEROPAGE, like:
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 2b4ea5a2ce7d..b97dfd3521a9 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3858,7 +3858,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>         if (nr_shmem_dropped)
>                 shmem_uncharge(mapping->host, nr_shmem_dropped);
> 
> -       if (!ret && is_anon)
> +       if (!ret && is_anon && !folio_is_device_private(folio))
>                 remap_flags = RMP_USE_SHARED_ZEROPAGE;
>         remap_page(folio, 1 << order, remap_flags);
> 
> Or it can be done in remove_migration_pte().


I have the same set of changes, plus more, to see if the logic can be simplified and well-known
paths can be taken.

Balbir Singh
Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by Balbir Singh 2 months ago
FYI:

I have the following patch on top of my series that seems to make it work
without requiring the helper to split device private folios


Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
 include/linux/huge_mm.h |  1 -
 lib/test_hmm.c          | 11 +++++-
 mm/huge_memory.c        | 76 ++++-------------------------------------
 mm/migrate_device.c     | 51 +++++++++++++++++++++++++++
 4 files changed, 67 insertions(+), 72 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 19e7e3b7c2b7..52d8b435950b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -343,7 +343,6 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
 		vm_flags_t vm_flags);
 
 bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
-int split_device_private_folio(struct folio *folio);
 int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 		unsigned int new_order, bool unmapped);
 int min_order_for_split(struct folio *folio);
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 341ae2af44ec..444477785882 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -1625,13 +1625,22 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
 	 * the mirror but here we use it to hold the page for the simulated
 	 * device memory and that page holds the pointer to the mirror.
 	 */
-	rpage = vmf->page->zone_device_data;
+	rpage = folio_page(page_folio(vmf->page), 0)->zone_device_data;
 	dmirror = rpage->zone_device_data;
 
 	/* FIXME demonstrate how we can adjust migrate range */
 	order = folio_order(page_folio(vmf->page));
 	nr = 1 << order;
 
+	/*
+	 * When folios are partially mapped, we can't rely on the folio
+	 * order of vmf->page as the folio might not be fully split yet
+	 */
+	if (vmf->pte) {
+		order = 0;
+		nr = 1;
+	}
+
 	/*
 	 * Consider a per-cpu cache of src and dst pfns, but with
 	 * large number of cpus that might not scale well.
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1fc1efa219c8..863393dec1f1 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -72,10 +72,6 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
 					  struct shrink_control *sc);
 static unsigned long deferred_split_scan(struct shrinker *shrink,
 					 struct shrink_control *sc);
-static int __split_unmapped_folio(struct folio *folio, int new_order,
-		struct page *split_at, struct xa_state *xas,
-		struct address_space *mapping, bool uniform_split);
-
 static bool split_underused_thp = true;
 
 static atomic_t huge_zero_refcount;
@@ -2924,51 +2920,6 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
 	pmd_populate(mm, pmd, pgtable);
 }
 
-/**
- * split_huge_device_private_folio - split a huge device private folio into
- * smaller pages (of order 0), currently used by migrate_device logic to
- * split folios for pages that are partially mapped
- *
- * @folio: the folio to split
- *
- * The caller has to hold the folio_lock and a reference via folio_get
- */
-int split_device_private_folio(struct folio *folio)
-{
-	struct folio *end_folio = folio_next(folio);
-	struct folio *new_folio;
-	int ret = 0;
-
-	/*
-	 * Split the folio now. In the case of device
-	 * private pages, this path is executed when
-	 * the pmd is split and since freeze is not true
-	 * it is likely the folio will be deferred_split.
-	 *
-	 * With device private pages, deferred splits of
-	 * folios should be handled here to prevent partial
-	 * unmaps from causing issues later on in migration
-	 * and fault handling flows.
-	 */
-	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
-	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
-	VM_WARN_ON(ret);
-	for (new_folio = folio_next(folio); new_folio != end_folio;
-					new_folio = folio_next(new_folio)) {
-		zone_device_private_split_cb(folio, new_folio);
-		folio_ref_unfreeze(new_folio, 1 + folio_expected_ref_count(
-								new_folio));
-	}
-
-	/*
-	 * Mark the end of the folio split for device private THP
-	 * split
-	 */
-	zone_device_private_split_cb(folio, NULL);
-	folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));
-	return ret;
-}
-
 static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long haddr, bool freeze)
 {
@@ -3064,30 +3015,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
 				freeze = false;
 			if (!freeze) {
 				rmap_t rmap_flags = RMAP_NONE;
-				unsigned long addr = haddr;
-				struct folio *new_folio;
-				struct folio *end_folio = folio_next(folio);
 
 				if (anon_exclusive)
 					rmap_flags |= RMAP_EXCLUSIVE;
 
-				folio_lock(folio);
-				folio_get(folio);
-
-				split_device_private_folio(folio);
-
-				for (new_folio = folio_next(folio);
-					new_folio != end_folio;
-					new_folio = folio_next(new_folio)) {
-					addr += PAGE_SIZE;
-					folio_unlock(new_folio);
-					folio_add_anon_rmap_ptes(new_folio,
-						&new_folio->page, 1,
-						vma, addr, rmap_flags);
-				}
-				folio_unlock(folio);
-				folio_add_anon_rmap_ptes(folio, &folio->page,
-						1, vma, haddr, rmap_flags);
+				folio_ref_add(folio, HPAGE_PMD_NR - 1);
+				if (anon_exclusive)
+					rmap_flags |= RMAP_EXCLUSIVE;
+				folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
+						 vma, haddr, rmap_flags);
 			}
 		}
 
@@ -4065,7 +4001,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 	if (nr_shmem_dropped)
 		shmem_uncharge(mapping->host, nr_shmem_dropped);
 
-	if (!ret && is_anon)
+	if (!ret && is_anon && !folio_is_device_private(folio))
 		remap_flags = RMP_USE_SHARED_ZEROPAGE;
 
 	remap_page(folio, 1 << order, remap_flags);
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 49962ea19109..4264c0290d08 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -248,6 +248,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			 * page table entry. Other special swap entries are not
 			 * migratable, and we ignore regular swapped page.
 			 */
+			struct folio *folio;
+
 			entry = pte_to_swp_entry(pte);
 			if (!is_device_private_entry(entry))
 				goto next;
@@ -259,6 +261,55 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			    pgmap->owner != migrate->pgmap_owner)
 				goto next;
 
+			folio = page_folio(page);
+			if (folio_test_large(folio)) {
+				struct folio *new_folio;
+				struct folio *new_fault_folio;
+
+				/*
+				 * The reason for finding pmd present with a
+				 * device private pte and a large folio for the
+				 * pte is partial unmaps. Split the folio now
+				 * for the migration to be handled correctly
+				 */
+				pte_unmap_unlock(ptep, ptl);
+
+				folio_get(folio);
+				if (folio != fault_folio)
+					folio_lock(folio);
+				if (split_folio(folio)) {
+					if (folio != fault_folio)
+						folio_unlock(folio);
+					ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+					goto next;
+				}
+
+				/*
+				 * After the split, get back the extra reference
+				 * on the fault_page, this reference is checked during
+				 * folio_migrate_mapping()
+				 */
+				if (migrate->fault_page) {
+					new_fault_folio = page_folio(migrate->fault_page);
+					folio_get(new_fault_folio);
+				}
+
+				new_folio = page_folio(page);
+				pfn = page_to_pfn(page);
+
+				/*
+				 * Ensure the lock is held on the correct
+				 * folio after the split
+				 */
+				if (folio != new_folio) {
+					folio_unlock(folio);
+					folio_lock(new_folio);
+				}
+				folio_put(folio);
+				addr = start;
+				goto again;
+			}
+
 			mpfn = migrate_pfn(page_to_pfn(page)) |
 					MIGRATE_PFN_MIGRATE;
 			if (is_writable_device_private_entry(entry))
-- 
2.50.1
Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by Mika Penttilä 2 months ago
Hi,

On 8/2/25 13:37, Balbir Singh wrote:
> FYI:
>
> I have the following patch on top of my series that seems to make it work
> without requiring the helper to split device private folios
>
I think this looks much better!

> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> ---
>  include/linux/huge_mm.h |  1 -
>  lib/test_hmm.c          | 11 +++++-
>  mm/huge_memory.c        | 76 ++++-------------------------------------
>  mm/migrate_device.c     | 51 +++++++++++++++++++++++++++
>  4 files changed, 67 insertions(+), 72 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 19e7e3b7c2b7..52d8b435950b 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -343,7 +343,6 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>  		vm_flags_t vm_flags);
>  
>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
> -int split_device_private_folio(struct folio *folio);
>  int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>  		unsigned int new_order, bool unmapped);
>  int min_order_for_split(struct folio *folio);
> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
> index 341ae2af44ec..444477785882 100644
> --- a/lib/test_hmm.c
> +++ b/lib/test_hmm.c
> @@ -1625,13 +1625,22 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>  	 * the mirror but here we use it to hold the page for the simulated
>  	 * device memory and that page holds the pointer to the mirror.
>  	 */
> -	rpage = vmf->page->zone_device_data;
> +	rpage = folio_page(page_folio(vmf->page), 0)->zone_device_data;
>  	dmirror = rpage->zone_device_data;
>  
>  	/* FIXME demonstrate how we can adjust migrate range */
>  	order = folio_order(page_folio(vmf->page));
>  	nr = 1 << order;
>  
> +	/*
> +	 * When folios are partially mapped, we can't rely on the folio
> +	 * order of vmf->page as the folio might not be fully split yet
> +	 */
> +	if (vmf->pte) {
> +		order = 0;
> +		nr = 1;
> +	}
> +
>  	/*
>  	 * Consider a per-cpu cache of src and dst pfns, but with
>  	 * large number of cpus that might not scale well.
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 1fc1efa219c8..863393dec1f1 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -72,10 +72,6 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
>  					  struct shrink_control *sc);
>  static unsigned long deferred_split_scan(struct shrinker *shrink,
>  					 struct shrink_control *sc);
> -static int __split_unmapped_folio(struct folio *folio, int new_order,
> -		struct page *split_at, struct xa_state *xas,
> -		struct address_space *mapping, bool uniform_split);
> -
>  static bool split_underused_thp = true;
>  
>  static atomic_t huge_zero_refcount;
> @@ -2924,51 +2920,6 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>  	pmd_populate(mm, pmd, pgtable);
>  }
>  
> -/**
> - * split_huge_device_private_folio - split a huge device private folio into
> - * smaller pages (of order 0), currently used by migrate_device logic to
> - * split folios for pages that are partially mapped
> - *
> - * @folio: the folio to split
> - *
> - * The caller has to hold the folio_lock and a reference via folio_get
> - */
> -int split_device_private_folio(struct folio *folio)
> -{
> -	struct folio *end_folio = folio_next(folio);
> -	struct folio *new_folio;
> -	int ret = 0;
> -
> -	/*
> -	 * Split the folio now. In the case of device
> -	 * private pages, this path is executed when
> -	 * the pmd is split and since freeze is not true
> -	 * it is likely the folio will be deferred_split.
> -	 *
> -	 * With device private pages, deferred splits of
> -	 * folios should be handled here to prevent partial
> -	 * unmaps from causing issues later on in migration
> -	 * and fault handling flows.
> -	 */
> -	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
> -	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
> -	VM_WARN_ON(ret);
> -	for (new_folio = folio_next(folio); new_folio != end_folio;
> -					new_folio = folio_next(new_folio)) {
> -		zone_device_private_split_cb(folio, new_folio);
> -		folio_ref_unfreeze(new_folio, 1 + folio_expected_ref_count(
> -								new_folio));
> -	}
> -
> -	/*
> -	 * Mark the end of the folio split for device private THP
> -	 * split
> -	 */
> -	zone_device_private_split_cb(folio, NULL);
> -	folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));
> -	return ret;
> -}
> -
>  static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  		unsigned long haddr, bool freeze)
>  {
> @@ -3064,30 +3015,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>  				freeze = false;
>  			if (!freeze) {
>  				rmap_t rmap_flags = RMAP_NONE;
> -				unsigned long addr = haddr;
> -				struct folio *new_folio;
> -				struct folio *end_folio = folio_next(folio);
>  
>  				if (anon_exclusive)
>  					rmap_flags |= RMAP_EXCLUSIVE;
>  
> -				folio_lock(folio);
> -				folio_get(folio);
> -
> -				split_device_private_folio(folio);
> -
> -				for (new_folio = folio_next(folio);
> -					new_folio != end_folio;
> -					new_folio = folio_next(new_folio)) {
> -					addr += PAGE_SIZE;
> -					folio_unlock(new_folio);
> -					folio_add_anon_rmap_ptes(new_folio,
> -						&new_folio->page, 1,
> -						vma, addr, rmap_flags);
> -				}
> -				folio_unlock(folio);
> -				folio_add_anon_rmap_ptes(folio, &folio->page,
> -						1, vma, haddr, rmap_flags);
> +				folio_ref_add(folio, HPAGE_PMD_NR - 1);
> +				if (anon_exclusive)
> +					rmap_flags |= RMAP_EXCLUSIVE;
> +				folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
> +						 vma, haddr, rmap_flags);
>  			}
>  		}
>  
> @@ -4065,7 +4001,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>  	if (nr_shmem_dropped)
>  		shmem_uncharge(mapping->host, nr_shmem_dropped);
>  
> -	if (!ret && is_anon)
> +	if (!ret && is_anon && !folio_is_device_private(folio))
>  		remap_flags = RMP_USE_SHARED_ZEROPAGE;
>  
>  	remap_page(folio, 1 << order, remap_flags);
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 49962ea19109..4264c0290d08 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -248,6 +248,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  			 * page table entry. Other special swap entries are not
>  			 * migratable, and we ignore regular swapped page.
>  			 */
> +			struct folio *folio;
> +
>  			entry = pte_to_swp_entry(pte);
>  			if (!is_device_private_entry(entry))
>  				goto next;
> @@ -259,6 +261,55 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  			    pgmap->owner != migrate->pgmap_owner)
>  				goto next;
>  
> +			folio = page_folio(page);
> +			if (folio_test_large(folio)) {
> +				struct folio *new_folio;
> +				struct folio *new_fault_folio;
> +
> +				/*
> +				 * The reason for finding pmd present with a
> +				 * device private pte and a large folio for the
> +				 * pte is partial unmaps. Split the folio now
> +				 * for the migration to be handled correctly
> +				 */
> +				pte_unmap_unlock(ptep, ptl);
> +
> +				folio_get(folio);
> +				if (folio != fault_folio)
> +					folio_lock(folio);
> +				if (split_folio(folio)) {
> +					if (folio != fault_folio)
> +						folio_unlock(folio);
> +					ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> +					goto next;
> +				}
> +

The nouveau migrate_to_ram handler also needs adjustment if the split happens.

> +				/*
> +				 * After the split, get back the extra reference
> +				 * on the fault_page, this reference is checked during
> +				 * folio_migrate_mapping()
> +				 */
> +				if (migrate->fault_page) {
> +					new_fault_folio = page_folio(migrate->fault_page);
> +					folio_get(new_fault_folio);
> +				}
> +
> +				new_folio = page_folio(page);
> +				pfn = page_to_pfn(page);
> +
> +				/*
> +				 * Ensure the lock is held on the correct
> +				 * folio after the split
> +				 */
> +				if (folio != new_folio) {
> +					folio_unlock(folio);
> +					folio_lock(new_folio);
> +				}

Maybe careful not to unlock fault_page ?

> +				folio_put(folio);
> +				addr = start;
> +				goto again;
> +			}
> +
>  			mpfn = migrate_pfn(page_to_pfn(page)) |
>  					MIGRATE_PFN_MIGRATE;
>  			if (is_writable_device_private_entry(entry))

--Mika
Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by Balbir Singh 2 months ago
On 8/2/25 22:13, Mika Penttilä wrote:
> Hi,
> 
> On 8/2/25 13:37, Balbir Singh wrote:
>> FYI:
>>
>> I have the following patch on top of my series that seems to make it work
>> without requiring the helper to split device private folios
>>
> I think this looks much better!
> 

Thanks!

>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>> ---
>>  include/linux/huge_mm.h |  1 -
>>  lib/test_hmm.c          | 11 +++++-
>>  mm/huge_memory.c        | 76 ++++-------------------------------------
>>  mm/migrate_device.c     | 51 +++++++++++++++++++++++++++
>>  4 files changed, 67 insertions(+), 72 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 19e7e3b7c2b7..52d8b435950b 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -343,7 +343,6 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>  		vm_flags_t vm_flags);
>>  
>>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>> -int split_device_private_folio(struct folio *folio);
>>  int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>  		unsigned int new_order, bool unmapped);
>>  int min_order_for_split(struct folio *folio);
>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>> index 341ae2af44ec..444477785882 100644
>> --- a/lib/test_hmm.c
>> +++ b/lib/test_hmm.c
>> @@ -1625,13 +1625,22 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>  	 * the mirror but here we use it to hold the page for the simulated
>>  	 * device memory and that page holds the pointer to the mirror.
>>  	 */
>> -	rpage = vmf->page->zone_device_data;
>> +	rpage = folio_page(page_folio(vmf->page), 0)->zone_device_data;
>>  	dmirror = rpage->zone_device_data;
>>  
>>  	/* FIXME demonstrate how we can adjust migrate range */
>>  	order = folio_order(page_folio(vmf->page));
>>  	nr = 1 << order;
>>  
>> +	/*
>> +	 * When folios are partially mapped, we can't rely on the folio
>> +	 * order of vmf->page as the folio might not be fully split yet
>> +	 */
>> +	if (vmf->pte) {
>> +		order = 0;
>> +		nr = 1;
>> +	}
>> +
>>  	/*
>>  	 * Consider a per-cpu cache of src and dst pfns, but with
>>  	 * large number of cpus that might not scale well.
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 1fc1efa219c8..863393dec1f1 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -72,10 +72,6 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
>>  					  struct shrink_control *sc);
>>  static unsigned long deferred_split_scan(struct shrinker *shrink,
>>  					 struct shrink_control *sc);
>> -static int __split_unmapped_folio(struct folio *folio, int new_order,
>> -		struct page *split_at, struct xa_state *xas,
>> -		struct address_space *mapping, bool uniform_split);
>> -
>>  static bool split_underused_thp = true;
>>  
>>  static atomic_t huge_zero_refcount;
>> @@ -2924,51 +2920,6 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>>  	pmd_populate(mm, pmd, pgtable);
>>  }
>>  
>> -/**
>> - * split_huge_device_private_folio - split a huge device private folio into
>> - * smaller pages (of order 0), currently used by migrate_device logic to
>> - * split folios for pages that are partially mapped
>> - *
>> - * @folio: the folio to split
>> - *
>> - * The caller has to hold the folio_lock and a reference via folio_get
>> - */
>> -int split_device_private_folio(struct folio *folio)
>> -{
>> -	struct folio *end_folio = folio_next(folio);
>> -	struct folio *new_folio;
>> -	int ret = 0;
>> -
>> -	/*
>> -	 * Split the folio now. In the case of device
>> -	 * private pages, this path is executed when
>> -	 * the pmd is split and since freeze is not true
>> -	 * it is likely the folio will be deferred_split.
>> -	 *
>> -	 * With device private pages, deferred splits of
>> -	 * folios should be handled here to prevent partial
>> -	 * unmaps from causing issues later on in migration
>> -	 * and fault handling flows.
>> -	 */
>> -	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>> -	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>> -	VM_WARN_ON(ret);
>> -	for (new_folio = folio_next(folio); new_folio != end_folio;
>> -					new_folio = folio_next(new_folio)) {
>> -		zone_device_private_split_cb(folio, new_folio);
>> -		folio_ref_unfreeze(new_folio, 1 + folio_expected_ref_count(
>> -								new_folio));
>> -	}
>> -
>> -	/*
>> -	 * Mark the end of the folio split for device private THP
>> -	 * split
>> -	 */
>> -	zone_device_private_split_cb(folio, NULL);
>> -	folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));
>> -	return ret;
>> -}
>> -
>>  static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>  		unsigned long haddr, bool freeze)
>>  {
>> @@ -3064,30 +3015,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>  				freeze = false;
>>  			if (!freeze) {
>>  				rmap_t rmap_flags = RMAP_NONE;
>> -				unsigned long addr = haddr;
>> -				struct folio *new_folio;
>> -				struct folio *end_folio = folio_next(folio);
>>  
>>  				if (anon_exclusive)
>>  					rmap_flags |= RMAP_EXCLUSIVE;
>>  
>> -				folio_lock(folio);
>> -				folio_get(folio);
>> -
>> -				split_device_private_folio(folio);
>> -
>> -				for (new_folio = folio_next(folio);
>> -					new_folio != end_folio;
>> -					new_folio = folio_next(new_folio)) {
>> -					addr += PAGE_SIZE;
>> -					folio_unlock(new_folio);
>> -					folio_add_anon_rmap_ptes(new_folio,
>> -						&new_folio->page, 1,
>> -						vma, addr, rmap_flags);
>> -				}
>> -				folio_unlock(folio);
>> -				folio_add_anon_rmap_ptes(folio, &folio->page,
>> -						1, vma, haddr, rmap_flags);
>> +				folio_ref_add(folio, HPAGE_PMD_NR - 1);
>> +				if (anon_exclusive)
>> +					rmap_flags |= RMAP_EXCLUSIVE;
>> +				folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
>> +						 vma, haddr, rmap_flags);
>>  			}
>>  		}
>>  
>> @@ -4065,7 +4001,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>  	if (nr_shmem_dropped)
>>  		shmem_uncharge(mapping->host, nr_shmem_dropped);
>>  
>> -	if (!ret && is_anon)
>> +	if (!ret && is_anon && !folio_is_device_private(folio))
>>  		remap_flags = RMP_USE_SHARED_ZEROPAGE;
>>  
>>  	remap_page(folio, 1 << order, remap_flags);
>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>> index 49962ea19109..4264c0290d08 100644
>> --- a/mm/migrate_device.c
>> +++ b/mm/migrate_device.c
>> @@ -248,6 +248,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>  			 * page table entry. Other special swap entries are not
>>  			 * migratable, and we ignore regular swapped page.
>>  			 */
>> +			struct folio *folio;
>> +
>>  			entry = pte_to_swp_entry(pte);
>>  			if (!is_device_private_entry(entry))
>>  				goto next;
>> @@ -259,6 +261,55 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>  			    pgmap->owner != migrate->pgmap_owner)
>>  				goto next;
>>  
>> +			folio = page_folio(page);
>> +			if (folio_test_large(folio)) {
>> +				struct folio *new_folio;
>> +				struct folio *new_fault_folio;
>> +
>> +				/*
>> +				 * The reason for finding pmd present with a
>> +				 * device private pte and a large folio for the
>> +				 * pte is partial unmaps. Split the folio now
>> +				 * for the migration to be handled correctly
>> +				 */
>> +				pte_unmap_unlock(ptep, ptl);
>> +
>> +				folio_get(folio);
>> +				if (folio != fault_folio)
>> +					folio_lock(folio);
>> +				if (split_folio(folio)) {
>> +					if (folio != fault_folio)
>> +						folio_unlock(folio);
>> +					ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>> +					goto next;
>> +				}
>> +
> 
> The nouveau migrate_to_ram handler also needs adjustment if the split happens.
> 

test_hmm needs adjustment because of the way the backup folios are set up.

>> +				/*
>> +				 * After the split, get back the extra reference
>> +				 * on the fault_page, this reference is checked during
>> +				 * folio_migrate_mapping()
>> +				 */
>> +				if (migrate->fault_page) {
>> +					new_fault_folio = page_folio(migrate->fault_page);
>> +					folio_get(new_fault_folio);
>> +				}
>> +
>> +				new_folio = page_folio(page);
>> +				pfn = page_to_pfn(page);
>> +
>> +				/*
>> +				 * Ensure the lock is held on the correct
>> +				 * folio after the split
>> +				 */
>> +				if (folio != new_folio) {
>> +					folio_unlock(folio);
>> +					folio_lock(new_folio);
>> +				}
> 
> Maybe careful not to unlock fault_page ?
> 

split_folio() will unlock everything but the original folio; the code takes the lock
on the folio corresponding to the new folio.

>> +				folio_put(folio);
>> +				addr = start;
>> +				goto again;
>> +			}
>> +
>>  			mpfn = migrate_pfn(page_to_pfn(page)) |
>>  					MIGRATE_PFN_MIGRATE;
>>  			if (is_writable_device_private_entry(entry))
> 

Balbir
Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by Mika Penttilä 2 months ago
Hi,

On 8/5/25 01:46, Balbir Singh wrote:
> On 8/2/25 22:13, Mika Penttilä wrote:
>> Hi,
>>
>> On 8/2/25 13:37, Balbir Singh wrote:
>>> FYI:
>>>
>>> I have the following patch on top of my series that seems to make it work
>>> without requiring the helper to split device private folios
>>>
>> I think this looks much better!
>>
> Thanks!
>
>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>> ---
>>>  include/linux/huge_mm.h |  1 -
>>>  lib/test_hmm.c          | 11 +++++-
>>>  mm/huge_memory.c        | 76 ++++-------------------------------------
>>>  mm/migrate_device.c     | 51 +++++++++++++++++++++++++++
>>>  4 files changed, 67 insertions(+), 72 deletions(-)
>>>
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index 19e7e3b7c2b7..52d8b435950b 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -343,7 +343,6 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>>  		vm_flags_t vm_flags);
>>>  
>>>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>> -int split_device_private_folio(struct folio *folio);
>>>  int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>  		unsigned int new_order, bool unmapped);
>>>  int min_order_for_split(struct folio *folio);
>>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>>> index 341ae2af44ec..444477785882 100644
>>> --- a/lib/test_hmm.c
>>> +++ b/lib/test_hmm.c
>>> @@ -1625,13 +1625,22 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>>  	 * the mirror but here we use it to hold the page for the simulated
>>>  	 * device memory and that page holds the pointer to the mirror.
>>>  	 */
>>> -	rpage = vmf->page->zone_device_data;
>>> +	rpage = folio_page(page_folio(vmf->page), 0)->zone_device_data;
>>>  	dmirror = rpage->zone_device_data;
>>>  
>>>  	/* FIXME demonstrate how we can adjust migrate range */
>>>  	order = folio_order(page_folio(vmf->page));
>>>  	nr = 1 << order;
>>>  
>>> +	/*
>>> +	 * When folios are partially mapped, we can't rely on the folio
>>> +	 * order of vmf->page as the folio might not be fully split yet
>>> +	 */
>>> +	if (vmf->pte) {
>>> +		order = 0;
>>> +		nr = 1;
>>> +	}
>>> +
>>>  	/*
>>>  	 * Consider a per-cpu cache of src and dst pfns, but with
>>>  	 * large number of cpus that might not scale well.
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 1fc1efa219c8..863393dec1f1 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -72,10 +72,6 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
>>>  					  struct shrink_control *sc);
>>>  static unsigned long deferred_split_scan(struct shrinker *shrink,
>>>  					 struct shrink_control *sc);
>>> -static int __split_unmapped_folio(struct folio *folio, int new_order,
>>> -		struct page *split_at, struct xa_state *xas,
>>> -		struct address_space *mapping, bool uniform_split);
>>> -
>>>  static bool split_underused_thp = true;
>>>  
>>>  static atomic_t huge_zero_refcount;
>>> @@ -2924,51 +2920,6 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>>>  	pmd_populate(mm, pmd, pgtable);
>>>  }
>>>  
>>> -/**
>>> - * split_huge_device_private_folio - split a huge device private folio into
>>> - * smaller pages (of order 0), currently used by migrate_device logic to
>>> - * split folios for pages that are partially mapped
>>> - *
>>> - * @folio: the folio to split
>>> - *
>>> - * The caller has to hold the folio_lock and a reference via folio_get
>>> - */
>>> -int split_device_private_folio(struct folio *folio)
>>> -{
>>> -	struct folio *end_folio = folio_next(folio);
>>> -	struct folio *new_folio;
>>> -	int ret = 0;
>>> -
>>> -	/*
>>> -	 * Split the folio now. In the case of device
>>> -	 * private pages, this path is executed when
>>> -	 * the pmd is split and since freeze is not true
>>> -	 * it is likely the folio will be deferred_split.
>>> -	 *
>>> -	 * With device private pages, deferred splits of
>>> -	 * folios should be handled here to prevent partial
>>> -	 * unmaps from causing issues later on in migration
>>> -	 * and fault handling flows.
>>> -	 */
>>> -	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>> -	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>> -	VM_WARN_ON(ret);
>>> -	for (new_folio = folio_next(folio); new_folio != end_folio;
>>> -					new_folio = folio_next(new_folio)) {
>>> -		zone_device_private_split_cb(folio, new_folio);
>>> -		folio_ref_unfreeze(new_folio, 1 + folio_expected_ref_count(
>>> -								new_folio));
>>> -	}
>>> -
>>> -	/*
>>> -	 * Mark the end of the folio split for device private THP
>>> -	 * split
>>> -	 */
>>> -	zone_device_private_split_cb(folio, NULL);
>>> -	folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));
>>> -	return ret;
>>> -}
>>> -
>>>  static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>  		unsigned long haddr, bool freeze)
>>>  {
>>> @@ -3064,30 +3015,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>  				freeze = false;
>>>  			if (!freeze) {
>>>  				rmap_t rmap_flags = RMAP_NONE;
>>> -				unsigned long addr = haddr;
>>> -				struct folio *new_folio;
>>> -				struct folio *end_folio = folio_next(folio);
>>>  
>>>  				if (anon_exclusive)
>>>  					rmap_flags |= RMAP_EXCLUSIVE;
>>>  
>>> -				folio_lock(folio);
>>> -				folio_get(folio);
>>> -
>>> -				split_device_private_folio(folio);
>>> -
>>> -				for (new_folio = folio_next(folio);
>>> -					new_folio != end_folio;
>>> -					new_folio = folio_next(new_folio)) {
>>> -					addr += PAGE_SIZE;
>>> -					folio_unlock(new_folio);
>>> -					folio_add_anon_rmap_ptes(new_folio,
>>> -						&new_folio->page, 1,
>>> -						vma, addr, rmap_flags);
>>> -				}
>>> -				folio_unlock(folio);
>>> -				folio_add_anon_rmap_ptes(folio, &folio->page,
>>> -						1, vma, haddr, rmap_flags);
>>> +				folio_ref_add(folio, HPAGE_PMD_NR - 1);
>>> +				if (anon_exclusive)
>>> +					rmap_flags |= RMAP_EXCLUSIVE;
>>> +				folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
>>> +						 vma, haddr, rmap_flags);
>>>  			}
>>>  		}
>>>  
>>> @@ -4065,7 +4001,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>  	if (nr_shmem_dropped)
>>>  		shmem_uncharge(mapping->host, nr_shmem_dropped);
>>>  
>>> -	if (!ret && is_anon)
>>> +	if (!ret && is_anon && !folio_is_device_private(folio))
>>>  		remap_flags = RMP_USE_SHARED_ZEROPAGE;
>>>  
>>>  	remap_page(folio, 1 << order, remap_flags);
>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>> index 49962ea19109..4264c0290d08 100644
>>> --- a/mm/migrate_device.c
>>> +++ b/mm/migrate_device.c
>>> @@ -248,6 +248,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>  			 * page table entry. Other special swap entries are not
>>>  			 * migratable, and we ignore regular swapped page.
>>>  			 */
>>> +			struct folio *folio;
>>> +
>>>  			entry = pte_to_swp_entry(pte);
>>>  			if (!is_device_private_entry(entry))
>>>  				goto next;
>>> @@ -259,6 +261,55 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>  			    pgmap->owner != migrate->pgmap_owner)
>>>  				goto next;
>>>  
>>> +			folio = page_folio(page);
>>> +			if (folio_test_large(folio)) {
>>> +				struct folio *new_folio;
>>> +				struct folio *new_fault_folio;
>>> +
>>> +				/*
>>> +				 * The reason for finding pmd present with a
>>> +				 * device private pte and a large folio for the
>>> +				 * pte is partial unmaps. Split the folio now
>>> +				 * for the migration to be handled correctly
>>> +				 */
>>> +				pte_unmap_unlock(ptep, ptl);
>>> +
>>> +				folio_get(folio);
>>> +				if (folio != fault_folio)
>>> +					folio_lock(folio);
>>> +				if (split_folio(folio)) {
>>> +					if (folio != fault_folio)
>>> +						folio_unlock(folio);
>>> +					ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>>> +					goto next;
>>> +				}
>>> +
>> The nouveau migrate_to_ram handler needs adjustment also if split happens.
>>
> test_hmm needs adjustment because of the way the backup folios are setup.

nouveau should check the folio order after the possible split happens.

>
>>> +				/*
>>> +				 * After the split, get back the extra reference
>>> +				 * on the fault_page, this reference is checked during
>>> +				 * folio_migrate_mapping()
>>> +				 */
>>> +				if (migrate->fault_page) {
>>> +					new_fault_folio = page_folio(migrate->fault_page);
>>> +					folio_get(new_fault_folio);
>>> +				}
>>> +
>>> +				new_folio = page_folio(page);
>>> +				pfn = page_to_pfn(page);
>>> +
>>> +				/*
>>> +				 * Ensure the lock is held on the correct
>>> +				 * folio after the split
>>> +				 */
>>> +				if (folio != new_folio) {
>>> +					folio_unlock(folio);
>>> +					folio_lock(new_folio);
>>> +				}
>> Maybe careful not to unlock fault_page ?
>>
> split_page will unlock everything but the original folio, the code takes the lock
> on the folio corresponding to the new folio

I mean do_swap_page() unlocks folio of fault_page and expects it to remain locked.

>
>>> +				folio_put(folio);
>>> +				addr = start;
>>> +				goto again;
>>> +			}
>>> +
>>>  			mpfn = migrate_pfn(page_to_pfn(page)) |
>>>  					MIGRATE_PFN_MIGRATE;
>>>  			if (is_writable_device_private_entry(entry))
> Balbir
>
--Mika

Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by Balbir Singh 2 months ago
On 8/5/25 09:26, Mika Penttilä wrote:
> Hi,
> 
> On 8/5/25 01:46, Balbir Singh wrote:
>> On 8/2/25 22:13, Mika Penttilä wrote:
>>> Hi,
>>>
>>> On 8/2/25 13:37, Balbir Singh wrote:
>>>> FYI:
>>>>
>>>> I have the following patch on top of my series that seems to make it work
>>>> without requiring the helper to split device private folios
>>>>
>>> I think this looks much better!
>>>
>> Thanks!
>>
>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>> ---
>>>>  include/linux/huge_mm.h |  1 -
>>>>  lib/test_hmm.c          | 11 +++++-
>>>>  mm/huge_memory.c        | 76 ++++-------------------------------------
>>>>  mm/migrate_device.c     | 51 +++++++++++++++++++++++++++
>>>>  4 files changed, 67 insertions(+), 72 deletions(-)
>>>>
>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>> index 19e7e3b7c2b7..52d8b435950b 100644
>>>> --- a/include/linux/huge_mm.h
>>>> +++ b/include/linux/huge_mm.h
>>>> @@ -343,7 +343,6 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>>>  		vm_flags_t vm_flags);
>>>>  
>>>>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>>> -int split_device_private_folio(struct folio *folio);
>>>>  int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>>  		unsigned int new_order, bool unmapped);
>>>>  int min_order_for_split(struct folio *folio);
>>>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>>>> index 341ae2af44ec..444477785882 100644
>>>> --- a/lib/test_hmm.c
>>>> +++ b/lib/test_hmm.c
>>>> @@ -1625,13 +1625,22 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>>>  	 * the mirror but here we use it to hold the page for the simulated
>>>>  	 * device memory and that page holds the pointer to the mirror.
>>>>  	 */
>>>> -	rpage = vmf->page->zone_device_data;
>>>> +	rpage = folio_page(page_folio(vmf->page), 0)->zone_device_data;
>>>>  	dmirror = rpage->zone_device_data;
>>>>  
>>>>  	/* FIXME demonstrate how we can adjust migrate range */
>>>>  	order = folio_order(page_folio(vmf->page));
>>>>  	nr = 1 << order;
>>>>  
>>>> +	/*
>>>> +	 * When folios are partially mapped, we can't rely on the folio
>>>> +	 * order of vmf->page as the folio might not be fully split yet
>>>> +	 */
>>>> +	if (vmf->pte) {
>>>> +		order = 0;
>>>> +		nr = 1;
>>>> +	}
>>>> +
>>>>  	/*
>>>>  	 * Consider a per-cpu cache of src and dst pfns, but with
>>>>  	 * large number of cpus that might not scale well.
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index 1fc1efa219c8..863393dec1f1 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -72,10 +72,6 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
>>>>  					  struct shrink_control *sc);
>>>>  static unsigned long deferred_split_scan(struct shrinker *shrink,
>>>>  					 struct shrink_control *sc);
>>>> -static int __split_unmapped_folio(struct folio *folio, int new_order,
>>>> -		struct page *split_at, struct xa_state *xas,
>>>> -		struct address_space *mapping, bool uniform_split);
>>>> -
>>>>  static bool split_underused_thp = true;
>>>>  
>>>>  static atomic_t huge_zero_refcount;
>>>> @@ -2924,51 +2920,6 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>>>>  	pmd_populate(mm, pmd, pgtable);
>>>>  }
>>>>  
>>>> -/**
>>>> - * split_huge_device_private_folio - split a huge device private folio into
>>>> - * smaller pages (of order 0), currently used by migrate_device logic to
>>>> - * split folios for pages that are partially mapped
>>>> - *
>>>> - * @folio: the folio to split
>>>> - *
>>>> - * The caller has to hold the folio_lock and a reference via folio_get
>>>> - */
>>>> -int split_device_private_folio(struct folio *folio)
>>>> -{
>>>> -	struct folio *end_folio = folio_next(folio);
>>>> -	struct folio *new_folio;
>>>> -	int ret = 0;
>>>> -
>>>> -	/*
>>>> -	 * Split the folio now. In the case of device
>>>> -	 * private pages, this path is executed when
>>>> -	 * the pmd is split and since freeze is not true
>>>> -	 * it is likely the folio will be deferred_split.
>>>> -	 *
>>>> -	 * With device private pages, deferred splits of
>>>> -	 * folios should be handled here to prevent partial
>>>> -	 * unmaps from causing issues later on in migration
>>>> -	 * and fault handling flows.
>>>> -	 */
>>>> -	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>> -	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>> -	VM_WARN_ON(ret);
>>>> -	for (new_folio = folio_next(folio); new_folio != end_folio;
>>>> -					new_folio = folio_next(new_folio)) {
>>>> -		zone_device_private_split_cb(folio, new_folio);
>>>> -		folio_ref_unfreeze(new_folio, 1 + folio_expected_ref_count(
>>>> -								new_folio));
>>>> -	}
>>>> -
>>>> -	/*
>>>> -	 * Mark the end of the folio split for device private THP
>>>> -	 * split
>>>> -	 */
>>>> -	zone_device_private_split_cb(folio, NULL);
>>>> -	folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));
>>>> -	return ret;
>>>> -}
>>>> -
>>>>  static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>  		unsigned long haddr, bool freeze)
>>>>  {
>>>> @@ -3064,30 +3015,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>  				freeze = false;
>>>>  			if (!freeze) {
>>>>  				rmap_t rmap_flags = RMAP_NONE;
>>>> -				unsigned long addr = haddr;
>>>> -				struct folio *new_folio;
>>>> -				struct folio *end_folio = folio_next(folio);
>>>>  
>>>>  				if (anon_exclusive)
>>>>  					rmap_flags |= RMAP_EXCLUSIVE;
>>>>  
>>>> -				folio_lock(folio);
>>>> -				folio_get(folio);
>>>> -
>>>> -				split_device_private_folio(folio);
>>>> -
>>>> -				for (new_folio = folio_next(folio);
>>>> -					new_folio != end_folio;
>>>> -					new_folio = folio_next(new_folio)) {
>>>> -					addr += PAGE_SIZE;
>>>> -					folio_unlock(new_folio);
>>>> -					folio_add_anon_rmap_ptes(new_folio,
>>>> -						&new_folio->page, 1,
>>>> -						vma, addr, rmap_flags);
>>>> -				}
>>>> -				folio_unlock(folio);
>>>> -				folio_add_anon_rmap_ptes(folio, &folio->page,
>>>> -						1, vma, haddr, rmap_flags);
>>>> +				folio_ref_add(folio, HPAGE_PMD_NR - 1);
>>>> +				if (anon_exclusive)
>>>> +					rmap_flags |= RMAP_EXCLUSIVE;
>>>> +				folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
>>>> +						 vma, haddr, rmap_flags);
>>>>  			}
>>>>  		}
>>>>  
>>>> @@ -4065,7 +4001,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>  	if (nr_shmem_dropped)
>>>>  		shmem_uncharge(mapping->host, nr_shmem_dropped);
>>>>  
>>>> -	if (!ret && is_anon)
>>>> +	if (!ret && is_anon && !folio_is_device_private(folio))
>>>>  		remap_flags = RMP_USE_SHARED_ZEROPAGE;
>>>>  
>>>>  	remap_page(folio, 1 << order, remap_flags);
>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>> index 49962ea19109..4264c0290d08 100644
>>>> --- a/mm/migrate_device.c
>>>> +++ b/mm/migrate_device.c
>>>> @@ -248,6 +248,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>  			 * page table entry. Other special swap entries are not
>>>>  			 * migratable, and we ignore regular swapped page.
>>>>  			 */
>>>> +			struct folio *folio;
>>>> +
>>>>  			entry = pte_to_swp_entry(pte);
>>>>  			if (!is_device_private_entry(entry))
>>>>  				goto next;
>>>> @@ -259,6 +261,55 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>  			    pgmap->owner != migrate->pgmap_owner)
>>>>  				goto next;
>>>>  
>>>> +			folio = page_folio(page);
>>>> +			if (folio_test_large(folio)) {
>>>> +				struct folio *new_folio;
>>>> +				struct folio *new_fault_folio;
>>>> +
>>>> +				/*
>>>> +				 * The reason for finding pmd present with a
>>>> +				 * device private pte and a large folio for the
>>>> +				 * pte is partial unmaps. Split the folio now
>>>> +				 * for the migration to be handled correctly
>>>> +				 */
>>>> +				pte_unmap_unlock(ptep, ptl);
>>>> +
>>>> +				folio_get(folio);
>>>> +				if (folio != fault_folio)
>>>> +					folio_lock(folio);
>>>> +				if (split_folio(folio)) {
>>>> +					if (folio != fault_folio)
>>>> +						folio_unlock(folio);
>>>> +					ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>>>> +					goto next;
>>>> +				}
>>>> +
>>> The nouveau migrate_to_ram handler needs adjustment also if split happens.
>>>
>> test_hmm needs adjustment because of the way the backup folios are setup.
> 
> nouveau should check the folio order after the possible split happens.
> 

You mean the folio_split callback?

>>
>>>> +				/*
>>>> +				 * After the split, get back the extra reference
>>>> +				 * on the fault_page, this reference is checked during
>>>> +				 * folio_migrate_mapping()
>>>> +				 */
>>>> +				if (migrate->fault_page) {
>>>> +					new_fault_folio = page_folio(migrate->fault_page);
>>>> +					folio_get(new_fault_folio);
>>>> +				}
>>>> +
>>>> +				new_folio = page_folio(page);
>>>> +				pfn = page_to_pfn(page);
>>>> +
>>>> +				/*
>>>> +				 * Ensure the lock is held on the correct
>>>> +				 * folio after the split
>>>> +				 */
>>>> +				if (folio != new_folio) {
>>>> +					folio_unlock(folio);
>>>> +					folio_lock(new_folio);
>>>> +				}
>>> Maybe careful not to unlock fault_page ?
>>>
>> split_page will unlock everything but the original folio, the code takes the lock
>> on the folio corresponding to the new folio
> 
> I mean do_swap_page() unlocks folio of fault_page and expects it to remain locked.
> 

Not sure I follow what you're trying to elaborate on here

Balbir

Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by Mika Penttilä 2 months ago
Hi,

On 8/5/25 07:10, Balbir Singh wrote:
> On 8/5/25 09:26, Mika Penttilä wrote:
>> Hi,
>>
>> On 8/5/25 01:46, Balbir Singh wrote:
>>> On 8/2/25 22:13, Mika Penttilä wrote:
>>>> Hi,
>>>>
>>>> On 8/2/25 13:37, Balbir Singh wrote:
>>>>> FYI:
>>>>>
>>>>> I have the following patch on top of my series that seems to make it work
>>>>> without requiring the helper to split device private folios
>>>>>
>>>> I think this looks much better!
>>>>
>>> Thanks!
>>>
>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>> ---
>>>>>  include/linux/huge_mm.h |  1 -
>>>>>  lib/test_hmm.c          | 11 +++++-
>>>>>  mm/huge_memory.c        | 76 ++++-------------------------------------
>>>>>  mm/migrate_device.c     | 51 +++++++++++++++++++++++++++
>>>>>  4 files changed, 67 insertions(+), 72 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>>> index 19e7e3b7c2b7..52d8b435950b 100644
>>>>> --- a/include/linux/huge_mm.h
>>>>> +++ b/include/linux/huge_mm.h
>>>>> @@ -343,7 +343,6 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>>>>  		vm_flags_t vm_flags);
>>>>>  
>>>>>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>>>> -int split_device_private_folio(struct folio *folio);
>>>>>  int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>>>  		unsigned int new_order, bool unmapped);
>>>>>  int min_order_for_split(struct folio *folio);
>>>>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>>>>> index 341ae2af44ec..444477785882 100644
>>>>> --- a/lib/test_hmm.c
>>>>> +++ b/lib/test_hmm.c
>>>>> @@ -1625,13 +1625,22 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>>>>  	 * the mirror but here we use it to hold the page for the simulated
>>>>>  	 * device memory and that page holds the pointer to the mirror.
>>>>>  	 */
>>>>> -	rpage = vmf->page->zone_device_data;
>>>>> +	rpage = folio_page(page_folio(vmf->page), 0)->zone_device_data;
>>>>>  	dmirror = rpage->zone_device_data;
>>>>>  
>>>>>  	/* FIXME demonstrate how we can adjust migrate range */
>>>>>  	order = folio_order(page_folio(vmf->page));
>>>>>  	nr = 1 << order;
>>>>>  
>>>>> +	/*
>>>>> +	 * When folios are partially mapped, we can't rely on the folio
>>>>> +	 * order of vmf->page as the folio might not be fully split yet
>>>>> +	 */
>>>>> +	if (vmf->pte) {
>>>>> +		order = 0;
>>>>> +		nr = 1;
>>>>> +	}
>>>>> +
>>>>>  	/*
>>>>>  	 * Consider a per-cpu cache of src and dst pfns, but with
>>>>>  	 * large number of cpus that might not scale well.
>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>> index 1fc1efa219c8..863393dec1f1 100644
>>>>> --- a/mm/huge_memory.c
>>>>> +++ b/mm/huge_memory.c
>>>>> @@ -72,10 +72,6 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
>>>>>  					  struct shrink_control *sc);
>>>>>  static unsigned long deferred_split_scan(struct shrinker *shrink,
>>>>>  					 struct shrink_control *sc);
>>>>> -static int __split_unmapped_folio(struct folio *folio, int new_order,
>>>>> -		struct page *split_at, struct xa_state *xas,
>>>>> -		struct address_space *mapping, bool uniform_split);
>>>>> -
>>>>>  static bool split_underused_thp = true;
>>>>>  
>>>>>  static atomic_t huge_zero_refcount;
>>>>> @@ -2924,51 +2920,6 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>>>>>  	pmd_populate(mm, pmd, pgtable);
>>>>>  }
>>>>>  
>>>>> -/**
>>>>> - * split_huge_device_private_folio - split a huge device private folio into
>>>>> - * smaller pages (of order 0), currently used by migrate_device logic to
>>>>> - * split folios for pages that are partially mapped
>>>>> - *
>>>>> - * @folio: the folio to split
>>>>> - *
>>>>> - * The caller has to hold the folio_lock and a reference via folio_get
>>>>> - */
>>>>> -int split_device_private_folio(struct folio *folio)
>>>>> -{
>>>>> -	struct folio *end_folio = folio_next(folio);
>>>>> -	struct folio *new_folio;
>>>>> -	int ret = 0;
>>>>> -
>>>>> -	/*
>>>>> -	 * Split the folio now. In the case of device
>>>>> -	 * private pages, this path is executed when
>>>>> -	 * the pmd is split and since freeze is not true
>>>>> -	 * it is likely the folio will be deferred_split.
>>>>> -	 *
>>>>> -	 * With device private pages, deferred splits of
>>>>> -	 * folios should be handled here to prevent partial
>>>>> -	 * unmaps from causing issues later on in migration
>>>>> -	 * and fault handling flows.
>>>>> -	 */
>>>>> -	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>> -	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>>> -	VM_WARN_ON(ret);
>>>>> -	for (new_folio = folio_next(folio); new_folio != end_folio;
>>>>> -					new_folio = folio_next(new_folio)) {
>>>>> -		zone_device_private_split_cb(folio, new_folio);
>>>>> -		folio_ref_unfreeze(new_folio, 1 + folio_expected_ref_count(
>>>>> -								new_folio));
>>>>> -	}
>>>>> -
>>>>> -	/*
>>>>> -	 * Mark the end of the folio split for device private THP
>>>>> -	 * split
>>>>> -	 */
>>>>> -	zone_device_private_split_cb(folio, NULL);
>>>>> -	folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));
>>>>> -	return ret;
>>>>> -}
>>>>> -
>>>>>  static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>>  		unsigned long haddr, bool freeze)
>>>>>  {
>>>>> @@ -3064,30 +3015,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>>  				freeze = false;
>>>>>  			if (!freeze) {
>>>>>  				rmap_t rmap_flags = RMAP_NONE;
>>>>> -				unsigned long addr = haddr;
>>>>> -				struct folio *new_folio;
>>>>> -				struct folio *end_folio = folio_next(folio);
>>>>>  
>>>>>  				if (anon_exclusive)
>>>>>  					rmap_flags |= RMAP_EXCLUSIVE;
>>>>>  
>>>>> -				folio_lock(folio);
>>>>> -				folio_get(folio);
>>>>> -
>>>>> -				split_device_private_folio(folio);
>>>>> -
>>>>> -				for (new_folio = folio_next(folio);
>>>>> -					new_folio != end_folio;
>>>>> -					new_folio = folio_next(new_folio)) {
>>>>> -					addr += PAGE_SIZE;
>>>>> -					folio_unlock(new_folio);
>>>>> -					folio_add_anon_rmap_ptes(new_folio,
>>>>> -						&new_folio->page, 1,
>>>>> -						vma, addr, rmap_flags);
>>>>> -				}
>>>>> -				folio_unlock(folio);
>>>>> -				folio_add_anon_rmap_ptes(folio, &folio->page,
>>>>> -						1, vma, haddr, rmap_flags);
>>>>> +				folio_ref_add(folio, HPAGE_PMD_NR - 1);
>>>>> +				if (anon_exclusive)
>>>>> +					rmap_flags |= RMAP_EXCLUSIVE;
>>>>> +				folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
>>>>> +						 vma, haddr, rmap_flags);
>>>>>  			}
>>>>>  		}
>>>>>  
>>>>> @@ -4065,7 +4001,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>  	if (nr_shmem_dropped)
>>>>>  		shmem_uncharge(mapping->host, nr_shmem_dropped);
>>>>>  
>>>>> -	if (!ret && is_anon)
>>>>> +	if (!ret && is_anon && !folio_is_device_private(folio))
>>>>>  		remap_flags = RMP_USE_SHARED_ZEROPAGE;
>>>>>  
>>>>>  	remap_page(folio, 1 << order, remap_flags);
>>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>>> index 49962ea19109..4264c0290d08 100644
>>>>> --- a/mm/migrate_device.c
>>>>> +++ b/mm/migrate_device.c
>>>>> @@ -248,6 +248,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>  			 * page table entry. Other special swap entries are not
>>>>>  			 * migratable, and we ignore regular swapped page.
>>>>>  			 */
>>>>> +			struct folio *folio;
>>>>> +
>>>>>  			entry = pte_to_swp_entry(pte);
>>>>>  			if (!is_device_private_entry(entry))
>>>>>  				goto next;
>>>>> @@ -259,6 +261,55 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>  			    pgmap->owner != migrate->pgmap_owner)
>>>>>  				goto next;
>>>>>  
>>>>> +			folio = page_folio(page);
>>>>> +			if (folio_test_large(folio)) {
>>>>> +				struct folio *new_folio;
>>>>> +				struct folio *new_fault_folio;
>>>>> +
>>>>> +				/*
>>>>> +				 * The reason for finding pmd present with a
>>>>> +				 * device private pte and a large folio for the
>>>>> +				 * pte is partial unmaps. Split the folio now
>>>>> +				 * for the migration to be handled correctly
>>>>> +				 */
>>>>> +				pte_unmap_unlock(ptep, ptl);
>>>>> +
>>>>> +				folio_get(folio);
>>>>> +				if (folio != fault_folio)
>>>>> +					folio_lock(folio);
>>>>> +				if (split_folio(folio)) {
>>>>> +					if (folio != fault_folio)
>>>>> +						folio_unlock(folio);
>>>>> +					ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>>>>> +					goto next;
>>>>> +				}
>>>>> +
>>>> The nouveau migrate_to_ram handler needs adjustment also if split happens.
>>>>
>>> test_hmm needs adjustment because of the way the backup folios are setup.
>> nouveau should check the folio order after the possible split happens.
>>
> You mean the folio_split callback?

no, nouveau_dmem_migrate_to_ram():
..
        sfolio = page_folio(vmf->page);
        order = folio_order(sfolio);
...
        migrate_vma_setup()
..
if sfolio is split order still reflects the pre-split order
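
Roughly something like this, I think (untested sketch only; "args" is the
usual struct migrate_vma, and the exact placement depends on the rest of
the series):

        sfolio = page_folio(vmf->page);
        order = folio_order(sfolio);            /* pre-split order */
        ...
        if (migrate_vma_setup(&args) < 0)
                return VM_FAULT_SIGBUS;
        /*
         * migrate_vma_collect_pmd() may have split sfolio on a partial
         * unmap, so re-read the order from the folio vmf->page now
         * belongs to before sizing the destination allocation.
         */
        order = folio_order(page_folio(vmf->page));
        nr = 1UL << order;
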

>
>>>>> +				/*
>>>>> +				 * After the split, get back the extra reference
>>>>> +				 * on the fault_page, this reference is checked during
>>>>> +				 * folio_migrate_mapping()
>>>>> +				 */
>>>>> +				if (migrate->fault_page) {
>>>>> +					new_fault_folio = page_folio(migrate->fault_page);
>>>>> +					folio_get(new_fault_folio);
>>>>> +				}
>>>>> +
>>>>> +				new_folio = page_folio(page);
>>>>> +				pfn = page_to_pfn(page);
>>>>> +
>>>>> +				/*
>>>>> +				 * Ensure the lock is held on the correct
>>>>> +				 * folio after the split
>>>>> +				 */
>>>>> +				if (folio != new_folio) {
>>>>> +					folio_unlock(folio);
>>>>> +					folio_lock(new_folio);
>>>>> +				}
>>>> Maybe careful not to unlock fault_page ?
>>>>
>>> split_page will unlock everything but the original folio, the code takes the lock
>>> on the folio corresponding to the new folio
>> I mean do_swap_page() unlocks folio of fault_page and expects it to remain locked.
>>
> Not sure I follow what you're trying to elaborate on here

do_swap_page:
..
        if (trylock_page(vmf->page)) {
                ret = pgmap->ops->migrate_to_ram(vmf);
                               <- vmf->page should be locked here even after split
                unlock_page(vmf->page);
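
(i.e. if migrate_to_ram() returns with a different folio locked, the
unlock_page(vmf->page) above would hit a page that is no longer locked)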

> Balbir
>
--Mika

Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by Mika Penttilä 2 months ago
On 8/5/25 07:24, Mika Penttilä wrote:

> Hi,
>
> On 8/5/25 07:10, Balbir Singh wrote:
>> On 8/5/25 09:26, Mika Penttilä wrote:
>>> Hi,
>>>
>>> On 8/5/25 01:46, Balbir Singh wrote:
>>>> On 8/2/25 22:13, Mika Penttilä wrote:
>>>>> Hi,
>>>>>
>>>>> On 8/2/25 13:37, Balbir Singh wrote:
>>>>>> FYI:
>>>>>>
>>>>>> I have the following patch on top of my series that seems to make it work
>>>>>> without requiring the helper to split device private folios
>>>>>>
>>>>> I think this looks much better!
>>>>>
>>>> Thanks!
>>>>
>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>> ---
>>>>>>  include/linux/huge_mm.h |  1 -
>>>>>>  lib/test_hmm.c          | 11 +++++-
>>>>>>  mm/huge_memory.c        | 76 ++++-------------------------------------
>>>>>>  mm/migrate_device.c     | 51 +++++++++++++++++++++++++++
>>>>>>  4 files changed, 67 insertions(+), 72 deletions(-)
>>>>>>
>>>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>>>> index 19e7e3b7c2b7..52d8b435950b 100644
>>>>>> --- a/include/linux/huge_mm.h
>>>>>> +++ b/include/linux/huge_mm.h
>>>>>> @@ -343,7 +343,6 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>>>>>  		vm_flags_t vm_flags);
>>>>>>  
>>>>>>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>>>>> -int split_device_private_folio(struct folio *folio);
>>>>>>  int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>>>>  		unsigned int new_order, bool unmapped);
>>>>>>  int min_order_for_split(struct folio *folio);
>>>>>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>>>>>> index 341ae2af44ec..444477785882 100644
>>>>>> --- a/lib/test_hmm.c
>>>>>> +++ b/lib/test_hmm.c
>>>>>> @@ -1625,13 +1625,22 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>>>>>  	 * the mirror but here we use it to hold the page for the simulated
>>>>>>  	 * device memory and that page holds the pointer to the mirror.
>>>>>>  	 */
>>>>>> -	rpage = vmf->page->zone_device_data;
>>>>>> +	rpage = folio_page(page_folio(vmf->page), 0)->zone_device_data;
>>>>>>  	dmirror = rpage->zone_device_data;
>>>>>>  
>>>>>>  	/* FIXME demonstrate how we can adjust migrate range */
>>>>>>  	order = folio_order(page_folio(vmf->page));
>>>>>>  	nr = 1 << order;
>>>>>>  
>>>>>> +	/*
>>>>>> +	 * When folios are partially mapped, we can't rely on the folio
>>>>>> +	 * order of vmf->page as the folio might not be fully split yet
>>>>>> +	 */
>>>>>> +	if (vmf->pte) {
>>>>>> +		order = 0;
>>>>>> +		nr = 1;
>>>>>> +	}
>>>>>> +
>>>>>>  	/*
>>>>>>  	 * Consider a per-cpu cache of src and dst pfns, but with
>>>>>>  	 * large number of cpus that might not scale well.
>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>> index 1fc1efa219c8..863393dec1f1 100644
>>>>>> --- a/mm/huge_memory.c
>>>>>> +++ b/mm/huge_memory.c
>>>>>> @@ -72,10 +72,6 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
>>>>>>  					  struct shrink_control *sc);
>>>>>>  static unsigned long deferred_split_scan(struct shrinker *shrink,
>>>>>>  					 struct shrink_control *sc);
>>>>>> -static int __split_unmapped_folio(struct folio *folio, int new_order,
>>>>>> -		struct page *split_at, struct xa_state *xas,
>>>>>> -		struct address_space *mapping, bool uniform_split);
>>>>>> -
>>>>>>  static bool split_underused_thp = true;
>>>>>>  
>>>>>>  static atomic_t huge_zero_refcount;
>>>>>> @@ -2924,51 +2920,6 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>>>>>>  	pmd_populate(mm, pmd, pgtable);
>>>>>>  }
>>>>>>  
>>>>>> -/**
>>>>>> - * split_huge_device_private_folio - split a huge device private folio into
>>>>>> - * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>> - * split folios for pages that are partially mapped
>>>>>> - *
>>>>>> - * @folio: the folio to split
>>>>>> - *
>>>>>> - * The caller has to hold the folio_lock and a reference via folio_get
>>>>>> - */
>>>>>> -int split_device_private_folio(struct folio *folio)
>>>>>> -{
>>>>>> -	struct folio *end_folio = folio_next(folio);
>>>>>> -	struct folio *new_folio;
>>>>>> -	int ret = 0;
>>>>>> -
>>>>>> -	/*
>>>>>> -	 * Split the folio now. In the case of device
>>>>>> -	 * private pages, this path is executed when
>>>>>> -	 * the pmd is split and since freeze is not true
>>>>>> -	 * it is likely the folio will be deferred_split.
>>>>>> -	 *
>>>>>> -	 * With device private pages, deferred splits of
>>>>>> -	 * folios should be handled here to prevent partial
>>>>>> -	 * unmaps from causing issues later on in migration
>>>>>> -	 * and fault handling flows.
>>>>>> -	 */
>>>>>> -	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>> -	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>>>> -	VM_WARN_ON(ret);
>>>>>> -	for (new_folio = folio_next(folio); new_folio != end_folio;
>>>>>> -					new_folio = folio_next(new_folio)) {
>>>>>> -		zone_device_private_split_cb(folio, new_folio);
>>>>>> -		folio_ref_unfreeze(new_folio, 1 + folio_expected_ref_count(
>>>>>> -								new_folio));
>>>>>> -	}
>>>>>> -
>>>>>> -	/*
>>>>>> -	 * Mark the end of the folio split for device private THP
>>>>>> -	 * split
>>>>>> -	 */
>>>>>> -	zone_device_private_split_cb(folio, NULL);
>>>>>> -	folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>> -	return ret;
>>>>>> -}
>>>>>> -
>>>>>>  static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>>>  		unsigned long haddr, bool freeze)
>>>>>>  {
>>>>>> @@ -3064,30 +3015,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>>>  				freeze = false;
>>>>>>  			if (!freeze) {
>>>>>>  				rmap_t rmap_flags = RMAP_NONE;
>>>>>> -				unsigned long addr = haddr;
>>>>>> -				struct folio *new_folio;
>>>>>> -				struct folio *end_folio = folio_next(folio);
>>>>>>  
>>>>>>  				if (anon_exclusive)
>>>>>>  					rmap_flags |= RMAP_EXCLUSIVE;
>>>>>>  
>>>>>> -				folio_lock(folio);
>>>>>> -				folio_get(folio);
>>>>>> -
>>>>>> -				split_device_private_folio(folio);
>>>>>> -
>>>>>> -				for (new_folio = folio_next(folio);
>>>>>> -					new_folio != end_folio;
>>>>>> -					new_folio = folio_next(new_folio)) {
>>>>>> -					addr += PAGE_SIZE;
>>>>>> -					folio_unlock(new_folio);
>>>>>> -					folio_add_anon_rmap_ptes(new_folio,
>>>>>> -						&new_folio->page, 1,
>>>>>> -						vma, addr, rmap_flags);
>>>>>> -				}
>>>>>> -				folio_unlock(folio);
>>>>>> -				folio_add_anon_rmap_ptes(folio, &folio->page,
>>>>>> -						1, vma, haddr, rmap_flags);
>>>>>> +				folio_ref_add(folio, HPAGE_PMD_NR - 1);
>>>>>> +				if (anon_exclusive)
>>>>>> +					rmap_flags |= RMAP_EXCLUSIVE;
>>>>>> +				folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
>>>>>> +						 vma, haddr, rmap_flags);
>>>>>>  			}
>>>>>>  		}
>>>>>>  
>>>>>> @@ -4065,7 +4001,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>  	if (nr_shmem_dropped)
>>>>>>  		shmem_uncharge(mapping->host, nr_shmem_dropped);
>>>>>>  
>>>>>> -	if (!ret && is_anon)
>>>>>> +	if (!ret && is_anon && !folio_is_device_private(folio))
>>>>>>  		remap_flags = RMP_USE_SHARED_ZEROPAGE;
>>>>>>  
>>>>>>  	remap_page(folio, 1 << order, remap_flags);
>>>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>>>> index 49962ea19109..4264c0290d08 100644
>>>>>> --- a/mm/migrate_device.c
>>>>>> +++ b/mm/migrate_device.c
>>>>>> @@ -248,6 +248,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>>  			 * page table entry. Other special swap entries are not
>>>>>>  			 * migratable, and we ignore regular swapped page.
>>>>>>  			 */
>>>>>> +			struct folio *folio;
>>>>>> +
>>>>>>  			entry = pte_to_swp_entry(pte);
>>>>>>  			if (!is_device_private_entry(entry))
>>>>>>  				goto next;
>>>>>> @@ -259,6 +261,55 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>>  			    pgmap->owner != migrate->pgmap_owner)
>>>>>>  				goto next;
>>>>>>  
>>>>>> +			folio = page_folio(page);
>>>>>> +			if (folio_test_large(folio)) {
>>>>>> +				struct folio *new_folio;
>>>>>> +				struct folio *new_fault_folio;
>>>>>> +
>>>>>> +				/*
>>>>>> +				 * The reason for finding pmd present with a
>>>>>> +				 * device private pte and a large folio for the
>>>>>> +				 * pte is partial unmaps. Split the folio now
>>>>>> +				 * for the migration to be handled correctly
>>>>>> +				 */
>>>>>> +				pte_unmap_unlock(ptep, ptl);
>>>>>> +
>>>>>> +				folio_get(folio);
>>>>>> +				if (folio != fault_folio)
>>>>>> +					folio_lock(folio);
>>>>>> +				if (split_folio(folio)) {
>>>>>> +					if (folio != fault_folio)
>>>>>> +						folio_unlock(folio);
>>>>>> +					ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>>>>>> +					goto next;
>>>>>> +				}
>>>>>> +
>>>>> The nouveau migrate_to_ram handler needs adjustment also if split happens.
>>>>>
>>>> test_hmm needs adjustment because of the way the backup folios are setup.
>>> nouveau should check the folio order after the possible split happens.
>>>
>> You mean the folio_split callback?
> no, nouveau_dmem_migrate_to_ram():
> ..
>         sfolio = page_folio(vmf->page);
>         order = folio_order(sfolio);
> ...
>         migrate_vma_setup()
> ..
> if sfolio is split order still reflects the pre-split order
>
>>>>>> +				/*
>>>>>> +				 * After the split, get back the extra reference
>>>>>> +				 * on the fault_page, this reference is checked during
>>>>>> +				 * folio_migrate_mapping()
>>>>>> +				 */
>>>>>> +				if (migrate->fault_page) {
>>>>>> +					new_fault_folio = page_folio(migrate->fault_page);
>>>>>> +					folio_get(new_fault_folio);
>>>>>> +				}
>>>>>> +
>>>>>> +				new_folio = page_folio(page);
>>>>>> +				pfn = page_to_pfn(page);
>>>>>> +
>>>>>> +				/*
>>>>>> +				 * Ensure the lock is held on the correct
>>>>>> +				 * folio after the split
>>>>>> +				 */
>>>>>> +				if (folio != new_folio) {
>>>>>> +					folio_unlock(folio);
>>>>>> +					folio_lock(new_folio);
>>>>>> +				}
>>>>> Maybe careful not to unlock fault_page ?
>>>>>
>>>> split_page will unlock everything but the original folio, the code takes the lock
>>>> on the folio corresponding to the new folio
>>> I mean do_swap_page() unlocks folio of fault_page and expects it to remain locked.
>>>
>> Not sure I follow what you're trying to elaborate on here
>
Actually fault_folio should be fine, but should we have something like:

if (fault_folio) {
        if (folio != new_folio) {
                folio_unlock(folio);
                folio_lock(new_folio);
        }
} else {
        folio_unlock(folio);
}
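
i.e. keep only the folio backing the faulting pte locked across the split,
and in the non-fault case leave nothing locked behind.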




>> Balbir
>>
> --Mika
>

Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by Balbir Singh 2 months ago
On 8/5/25 14:24, Mika Penttilä wrote:
> Hi,
> 
> On 8/5/25 07:10, Balbir Singh wrote:
>> On 8/5/25 09:26, Mika Penttilä wrote:
>>> Hi,
>>>
>>> On 8/5/25 01:46, Balbir Singh wrote:
>>>> On 8/2/25 22:13, Mika Penttilä wrote:
>>>>> Hi,
>>>>>
>>>>> On 8/2/25 13:37, Balbir Singh wrote:
>>>>>> FYI:
>>>>>>
>>>>>> I have the following patch on top of my series that seems to make it work
>>>>>> without requiring the helper to split device private folios
>>>>>>
>>>>> I think this looks much better!
>>>>>
>>>> Thanks!
>>>>
>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>> ---
>>>>>>  include/linux/huge_mm.h |  1 -
>>>>>>  lib/test_hmm.c          | 11 +++++-
>>>>>>  mm/huge_memory.c        | 76 ++++-------------------------------------
>>>>>>  mm/migrate_device.c     | 51 +++++++++++++++++++++++++++
>>>>>>  4 files changed, 67 insertions(+), 72 deletions(-)
>>>>>>
>>>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>>>> index 19e7e3b7c2b7..52d8b435950b 100644
>>>>>> --- a/include/linux/huge_mm.h
>>>>>> +++ b/include/linux/huge_mm.h
>>>>>> @@ -343,7 +343,6 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>>>>>  		vm_flags_t vm_flags);
>>>>>>  
>>>>>>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>>>>> -int split_device_private_folio(struct folio *folio);
>>>>>>  int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>>>>  		unsigned int new_order, bool unmapped);
>>>>>>  int min_order_for_split(struct folio *folio);
>>>>>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>>>>>> index 341ae2af44ec..444477785882 100644
>>>>>> --- a/lib/test_hmm.c
>>>>>> +++ b/lib/test_hmm.c
>>>>>> @@ -1625,13 +1625,22 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>>>>>  	 * the mirror but here we use it to hold the page for the simulated
>>>>>>  	 * device memory and that page holds the pointer to the mirror.
>>>>>>  	 */
>>>>>> -	rpage = vmf->page->zone_device_data;
>>>>>> +	rpage = folio_page(page_folio(vmf->page), 0)->zone_device_data;
>>>>>>  	dmirror = rpage->zone_device_data;
>>>>>>  
>>>>>>  	/* FIXME demonstrate how we can adjust migrate range */
>>>>>>  	order = folio_order(page_folio(vmf->page));
>>>>>>  	nr = 1 << order;
>>>>>>  
>>>>>> +	/*
>>>>>> +	 * When folios are partially mapped, we can't rely on the folio
>>>>>> +	 * order of vmf->page as the folio might not be fully split yet
>>>>>> +	 */
>>>>>> +	if (vmf->pte) {
>>>>>> +		order = 0;
>>>>>> +		nr = 1;
>>>>>> +	}
>>>>>> +
>>>>>>  	/*
>>>>>>  	 * Consider a per-cpu cache of src and dst pfns, but with
>>>>>>  	 * large number of cpus that might not scale well.
>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>> index 1fc1efa219c8..863393dec1f1 100644
>>>>>> --- a/mm/huge_memory.c
>>>>>> +++ b/mm/huge_memory.c
>>>>>> @@ -72,10 +72,6 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
>>>>>>  					  struct shrink_control *sc);
>>>>>>  static unsigned long deferred_split_scan(struct shrinker *shrink,
>>>>>>  					 struct shrink_control *sc);
>>>>>> -static int __split_unmapped_folio(struct folio *folio, int new_order,
>>>>>> -		struct page *split_at, struct xa_state *xas,
>>>>>> -		struct address_space *mapping, bool uniform_split);
>>>>>> -
>>>>>>  static bool split_underused_thp = true;
>>>>>>  
>>>>>>  static atomic_t huge_zero_refcount;
>>>>>> @@ -2924,51 +2920,6 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>>>>>>  	pmd_populate(mm, pmd, pgtable);
>>>>>>  }
>>>>>>  
>>>>>> -/**
>>>>>> - * split_huge_device_private_folio - split a huge device private folio into
>>>>>> - * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>> - * split folios for pages that are partially mapped
>>>>>> - *
>>>>>> - * @folio: the folio to split
>>>>>> - *
>>>>>> - * The caller has to hold the folio_lock and a reference via folio_get
>>>>>> - */
>>>>>> -int split_device_private_folio(struct folio *folio)
>>>>>> -{
>>>>>> -	struct folio *end_folio = folio_next(folio);
>>>>>> -	struct folio *new_folio;
>>>>>> -	int ret = 0;
>>>>>> -
>>>>>> -	/*
>>>>>> -	 * Split the folio now. In the case of device
>>>>>> -	 * private pages, this path is executed when
>>>>>> -	 * the pmd is split and since freeze is not true
>>>>>> -	 * it is likely the folio will be deferred_split.
>>>>>> -	 *
>>>>>> -	 * With device private pages, deferred splits of
>>>>>> -	 * folios should be handled here to prevent partial
>>>>>> -	 * unmaps from causing issues later on in migration
>>>>>> -	 * and fault handling flows.
>>>>>> -	 */
>>>>>> -	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>> -	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>>>> -	VM_WARN_ON(ret);
>>>>>> -	for (new_folio = folio_next(folio); new_folio != end_folio;
>>>>>> -					new_folio = folio_next(new_folio)) {
>>>>>> -		zone_device_private_split_cb(folio, new_folio);
>>>>>> -		folio_ref_unfreeze(new_folio, 1 + folio_expected_ref_count(
>>>>>> -								new_folio));
>>>>>> -	}
>>>>>> -
>>>>>> -	/*
>>>>>> -	 * Mark the end of the folio split for device private THP
>>>>>> -	 * split
>>>>>> -	 */
>>>>>> -	zone_device_private_split_cb(folio, NULL);
>>>>>> -	folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>> -	return ret;
>>>>>> -}
>>>>>> -
>>>>>>  static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>>>  		unsigned long haddr, bool freeze)
>>>>>>  {
>>>>>> @@ -3064,30 +3015,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>>>  				freeze = false;
>>>>>>  			if (!freeze) {
>>>>>>  				rmap_t rmap_flags = RMAP_NONE;
>>>>>> -				unsigned long addr = haddr;
>>>>>> -				struct folio *new_folio;
>>>>>> -				struct folio *end_folio = folio_next(folio);
>>>>>>  
>>>>>>  				if (anon_exclusive)
>>>>>>  					rmap_flags |= RMAP_EXCLUSIVE;
>>>>>>  
>>>>>> -				folio_lock(folio);
>>>>>> -				folio_get(folio);
>>>>>> -
>>>>>> -				split_device_private_folio(folio);
>>>>>> -
>>>>>> -				for (new_folio = folio_next(folio);
>>>>>> -					new_folio != end_folio;
>>>>>> -					new_folio = folio_next(new_folio)) {
>>>>>> -					addr += PAGE_SIZE;
>>>>>> -					folio_unlock(new_folio);
>>>>>> -					folio_add_anon_rmap_ptes(new_folio,
>>>>>> -						&new_folio->page, 1,
>>>>>> -						vma, addr, rmap_flags);
>>>>>> -				}
>>>>>> -				folio_unlock(folio);
>>>>>> -				folio_add_anon_rmap_ptes(folio, &folio->page,
>>>>>> -						1, vma, haddr, rmap_flags);
>>>>>> +				folio_ref_add(folio, HPAGE_PMD_NR - 1);
>>>>>> +				if (anon_exclusive)
>>>>>> +					rmap_flags |= RMAP_EXCLUSIVE;
>>>>>> +				folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
>>>>>> +						 vma, haddr, rmap_flags);
>>>>>>  			}
>>>>>>  		}
>>>>>>  
>>>>>> @@ -4065,7 +4001,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>  	if (nr_shmem_dropped)
>>>>>>  		shmem_uncharge(mapping->host, nr_shmem_dropped);
>>>>>>  
>>>>>> -	if (!ret && is_anon)
>>>>>> +	if (!ret && is_anon && !folio_is_device_private(folio))
>>>>>>  		remap_flags = RMP_USE_SHARED_ZEROPAGE;
>>>>>>  
>>>>>>  	remap_page(folio, 1 << order, remap_flags);
>>>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>>>> index 49962ea19109..4264c0290d08 100644
>>>>>> --- a/mm/migrate_device.c
>>>>>> +++ b/mm/migrate_device.c
>>>>>> @@ -248,6 +248,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>>  			 * page table entry. Other special swap entries are not
>>>>>>  			 * migratable, and we ignore regular swapped page.
>>>>>>  			 */
>>>>>> +			struct folio *folio;
>>>>>> +
>>>>>>  			entry = pte_to_swp_entry(pte);
>>>>>>  			if (!is_device_private_entry(entry))
>>>>>>  				goto next;
>>>>>> @@ -259,6 +261,55 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>>  			    pgmap->owner != migrate->pgmap_owner)
>>>>>>  				goto next;
>>>>>>  
>>>>>> +			folio = page_folio(page);
>>>>>> +			if (folio_test_large(folio)) {
>>>>>> +				struct folio *new_folio;
>>>>>> +				struct folio *new_fault_folio;
>>>>>> +
>>>>>> +				/*
>>>>>> +				 * The reason for finding pmd present with a
>>>>>> +				 * device private pte and a large folio for the
>>>>>> +				 * pte is partial unmaps. Split the folio now
>>>>>> +				 * for the migration to be handled correctly
>>>>>> +				 */
>>>>>> +				pte_unmap_unlock(ptep, ptl);
>>>>>> +
>>>>>> +				folio_get(folio);
>>>>>> +				if (folio != fault_folio)
>>>>>> +					folio_lock(folio);
>>>>>> +				if (split_folio(folio)) {
>>>>>> +					if (folio != fault_folio)
>>>>>> +						folio_unlock(folio);
>>>>>> +					ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>>>>>> +					goto next;
>>>>>> +				}
>>>>>> +
>>>>> The nouveau migrate_to_ram handler needs adjustment also if split happens.
>>>>>
>>>> test_hmm needs adjustment because of the way the backup folios are setup.
>>> nouveau should check the folio order after the possible split happens.
>>>
>> You mean the folio_split callback?
> 
> no, nouveau_dmem_migrate_to_ram():
> ..
>         sfolio = page_folio(vmf->page);
>         order = folio_order(sfolio);
> ...
>         migrate_vma_setup()
> ..
> if sfolio is split order still reflects the pre-split order
> 

Will fix, good catch!

>>
>>>>>> +				/*
>>>>>> +				 * After the split, get back the extra reference
>>>>>> +				 * on the fault_page, this reference is checked during
>>>>>> +				 * folio_migrate_mapping()
>>>>>> +				 */
>>>>>> +				if (migrate->fault_page) {
>>>>>> +					new_fault_folio = page_folio(migrate->fault_page);
>>>>>> +					folio_get(new_fault_folio);
>>>>>> +				}
>>>>>> +
>>>>>> +				new_folio = page_folio(page);
>>>>>> +				pfn = page_to_pfn(page);
>>>>>> +
>>>>>> +				/*
>>>>>> +				 * Ensure the lock is held on the correct
>>>>>> +				 * folio after the split
>>>>>> +				 */
>>>>>> +				if (folio != new_folio) {
>>>>>> +					folio_unlock(folio);
>>>>>> +					folio_lock(new_folio);
>>>>>> +				}
>>>>> Maybe careful not to unlock fault_page ?
>>>>>
>>>> split_page will unlock everything but the original folio, the code takes the lock
>>>> on the folio corresponding to the new folio
>>> I mean do_swap_page() unlocks folio of fault_page and expects it to remain locked.
>>>
>> Not sure I follow what you're trying to elaborate on here
> 
> do_swap_page:
> ..
>         if (trylock_page(vmf->page)) {
>                 ret = pgmap->ops->migrate_to_ram(vmf);
>                                <- vmf->page should be locked here even after split
>                 unlock_page(vmf->page);
> 

Yep, the split will unlock all tail folios, leaving just the head folio locked.
With this change, the lock we need to hold is the folio lock associated with
fault_page's pte entry, and we must not unlock it when the cause is a fault.
The code seems to do the right thing there, let me double check

Balbir
and the code does the right thing there.
Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by Mika Penttilä 2 months ago
On 8/5/25 13:27, Balbir Singh wrote:

> On 8/5/25 14:24, Mika Penttilä wrote:
>> Hi,
>>
>> On 8/5/25 07:10, Balbir Singh wrote:
>>> On 8/5/25 09:26, Mika Penttilä wrote:
>>>> Hi,
>>>>
>>>> On 8/5/25 01:46, Balbir Singh wrote:
>>>>> On 8/2/25 22:13, Mika Penttilä wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 8/2/25 13:37, Balbir Singh wrote:
>>>>>>> FYI:
>>>>>>>
>>>>>>> I have the following patch on top of my series that seems to make it work
>>>>>>> without requiring the helper to split device private folios
>>>>>>>
>>>>>> I think this looks much better!
>>>>>>
>>>>> Thanks!
>>>>>
>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>> ---
>>>>>>>  include/linux/huge_mm.h |  1 -
>>>>>>>  lib/test_hmm.c          | 11 +++++-
>>>>>>>  mm/huge_memory.c        | 76 ++++-------------------------------------
>>>>>>>  mm/migrate_device.c     | 51 +++++++++++++++++++++++++++
>>>>>>>  4 files changed, 67 insertions(+), 72 deletions(-)
>>>>>>>
>>>>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>>>>> index 19e7e3b7c2b7..52d8b435950b 100644
>>>>>>> --- a/include/linux/huge_mm.h
>>>>>>> +++ b/include/linux/huge_mm.h
>>>>>>> @@ -343,7 +343,6 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>>>>>>  		vm_flags_t vm_flags);
>>>>>>>  
>>>>>>>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>>>>>> -int split_device_private_folio(struct folio *folio);
>>>>>>>  int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>>>>>  		unsigned int new_order, bool unmapped);
>>>>>>>  int min_order_for_split(struct folio *folio);
>>>>>>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>>>>>>> index 341ae2af44ec..444477785882 100644
>>>>>>> --- a/lib/test_hmm.c
>>>>>>> +++ b/lib/test_hmm.c
>>>>>>> @@ -1625,13 +1625,22 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>>>>>>  	 * the mirror but here we use it to hold the page for the simulated
>>>>>>>  	 * device memory and that page holds the pointer to the mirror.
>>>>>>>  	 */
>>>>>>> -	rpage = vmf->page->zone_device_data;
>>>>>>> +	rpage = folio_page(page_folio(vmf->page), 0)->zone_device_data;
>>>>>>>  	dmirror = rpage->zone_device_data;
>>>>>>>  
>>>>>>>  	/* FIXME demonstrate how we can adjust migrate range */
>>>>>>>  	order = folio_order(page_folio(vmf->page));
>>>>>>>  	nr = 1 << order;
>>>>>>>  
>>>>>>> +	/*
>>>>>>> +	 * When folios are partially mapped, we can't rely on the folio
>>>>>>> +	 * order of vmf->page as the folio might not be fully split yet
>>>>>>> +	 */
>>>>>>> +	if (vmf->pte) {
>>>>>>> +		order = 0;
>>>>>>> +		nr = 1;
>>>>>>> +	}
>>>>>>> +
>>>>>>>  	/*
>>>>>>>  	 * Consider a per-cpu cache of src and dst pfns, but with
>>>>>>>  	 * large number of cpus that might not scale well.
>>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>>> index 1fc1efa219c8..863393dec1f1 100644
>>>>>>> --- a/mm/huge_memory.c
>>>>>>> +++ b/mm/huge_memory.c
>>>>>>> @@ -72,10 +72,6 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
>>>>>>>  					  struct shrink_control *sc);
>>>>>>>  static unsigned long deferred_split_scan(struct shrinker *shrink,
>>>>>>>  					 struct shrink_control *sc);
>>>>>>> -static int __split_unmapped_folio(struct folio *folio, int new_order,
>>>>>>> -		struct page *split_at, struct xa_state *xas,
>>>>>>> -		struct address_space *mapping, bool uniform_split);
>>>>>>> -
>>>>>>>  static bool split_underused_thp = true;
>>>>>>>  
>>>>>>>  static atomic_t huge_zero_refcount;
>>>>>>> @@ -2924,51 +2920,6 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>>>>>>>  	pmd_populate(mm, pmd, pgtable);
>>>>>>>  }
>>>>>>>  
>>>>>>> -/**
>>>>>>> - * split_huge_device_private_folio - split a huge device private folio into
>>>>>>> - * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>> - * split folios for pages that are partially mapped
>>>>>>> - *
>>>>>>> - * @folio: the folio to split
>>>>>>> - *
>>>>>>> - * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>> - */
>>>>>>> -int split_device_private_folio(struct folio *folio)
>>>>>>> -{
>>>>>>> -	struct folio *end_folio = folio_next(folio);
>>>>>>> -	struct folio *new_folio;
>>>>>>> -	int ret = 0;
>>>>>>> -
>>>>>>> -	/*
>>>>>>> -	 * Split the folio now. In the case of device
>>>>>>> -	 * private pages, this path is executed when
>>>>>>> -	 * the pmd is split and since freeze is not true
>>>>>>> -	 * it is likely the folio will be deferred_split.
>>>>>>> -	 *
>>>>>>> -	 * With device private pages, deferred splits of
>>>>>>> -	 * folios should be handled here to prevent partial
>>>>>>> -	 * unmaps from causing issues later on in migration
>>>>>>> -	 * and fault handling flows.
>>>>>>> -	 */
>>>>>>> -	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>> -	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>>>>> -	VM_WARN_ON(ret);
>>>>>>> -	for (new_folio = folio_next(folio); new_folio != end_folio;
>>>>>>> -					new_folio = folio_next(new_folio)) {
>>>>>>> -		zone_device_private_split_cb(folio, new_folio);
>>>>>>> -		folio_ref_unfreeze(new_folio, 1 + folio_expected_ref_count(
>>>>>>> -								new_folio));
>>>>>>> -	}
>>>>>>> -
>>>>>>> -	/*
>>>>>>> -	 * Mark the end of the folio split for device private THP
>>>>>>> -	 * split
>>>>>>> -	 */
>>>>>>> -	zone_device_private_split_cb(folio, NULL);
>>>>>>> -	folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>> -	return ret;
>>>>>>> -}
>>>>>>> -
>>>>>>>  static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>>>>  		unsigned long haddr, bool freeze)
>>>>>>>  {
>>>>>>> @@ -3064,30 +3015,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>>>>  				freeze = false;
>>>>>>>  			if (!freeze) {
>>>>>>>  				rmap_t rmap_flags = RMAP_NONE;
>>>>>>> -				unsigned long addr = haddr;
>>>>>>> -				struct folio *new_folio;
>>>>>>> -				struct folio *end_folio = folio_next(folio);
>>>>>>>  
>>>>>>>  				if (anon_exclusive)
>>>>>>>  					rmap_flags |= RMAP_EXCLUSIVE;
>>>>>>>  
>>>>>>> -				folio_lock(folio);
>>>>>>> -				folio_get(folio);
>>>>>>> -
>>>>>>> -				split_device_private_folio(folio);
>>>>>>> -
>>>>>>> -				for (new_folio = folio_next(folio);
>>>>>>> -					new_folio != end_folio;
>>>>>>> -					new_folio = folio_next(new_folio)) {
>>>>>>> -					addr += PAGE_SIZE;
>>>>>>> -					folio_unlock(new_folio);
>>>>>>> -					folio_add_anon_rmap_ptes(new_folio,
>>>>>>> -						&new_folio->page, 1,
>>>>>>> -						vma, addr, rmap_flags);
>>>>>>> -				}
>>>>>>> -				folio_unlock(folio);
>>>>>>> -				folio_add_anon_rmap_ptes(folio, &folio->page,
>>>>>>> -						1, vma, haddr, rmap_flags);
>>>>>>> +				folio_ref_add(folio, HPAGE_PMD_NR - 1);
>>>>>>> +				if (anon_exclusive)
>>>>>>> +					rmap_flags |= RMAP_EXCLUSIVE;
>>>>>>> +				folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
>>>>>>> +						 vma, haddr, rmap_flags);
>>>>>>>  			}
>>>>>>>  		}
>>>>>>>  
>>>>>>> @@ -4065,7 +4001,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>  	if (nr_shmem_dropped)
>>>>>>>  		shmem_uncharge(mapping->host, nr_shmem_dropped);
>>>>>>>  
>>>>>>> -	if (!ret && is_anon)
>>>>>>> +	if (!ret && is_anon && !folio_is_device_private(folio))
>>>>>>>  		remap_flags = RMP_USE_SHARED_ZEROPAGE;
>>>>>>>  
>>>>>>>  	remap_page(folio, 1 << order, remap_flags);
>>>>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>>>>> index 49962ea19109..4264c0290d08 100644
>>>>>>> --- a/mm/migrate_device.c
>>>>>>> +++ b/mm/migrate_device.c
>>>>>>> @@ -248,6 +248,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>>>  			 * page table entry. Other special swap entries are not
>>>>>>>  			 * migratable, and we ignore regular swapped page.
>>>>>>>  			 */
>>>>>>> +			struct folio *folio;
>>>>>>> +
>>>>>>>  			entry = pte_to_swp_entry(pte);
>>>>>>>  			if (!is_device_private_entry(entry))
>>>>>>>  				goto next;
>>>>>>> @@ -259,6 +261,55 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>>>  			    pgmap->owner != migrate->pgmap_owner)
>>>>>>>  				goto next;
>>>>>>>  
>>>>>>> +			folio = page_folio(page);
>>>>>>> +			if (folio_test_large(folio)) {
>>>>>>> +				struct folio *new_folio;
>>>>>>> +				struct folio *new_fault_folio;
>>>>>>> +
>>>>>>> +				/*
>>>>>>> +				 * The reason for finding pmd present with a
>>>>>>> +				 * device private pte and a large folio for the
>>>>>>> +				 * pte is partial unmaps. Split the folio now
>>>>>>> +				 * for the migration to be handled correctly
>>>>>>> +				 */
>>>>>>> +				pte_unmap_unlock(ptep, ptl);
>>>>>>> +
>>>>>>> +				folio_get(folio);
>>>>>>> +				if (folio != fault_folio)
>>>>>>> +					folio_lock(folio);
>>>>>>> +				if (split_folio(folio)) {
>>>>>>> +					if (folio != fault_folio)
>>>>>>> +						folio_unlock(folio);
>>>>>>> +					ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>>>>>>> +					goto next;
>>>>>>> +				}
>>>>>>> +
>>>>>> The nouveau migrate_to_ram handler needs adjustment also if split happens.
>>>>>>
>>>>> test_hmm needs adjustment because of the way the backup folios are setup.
>>>> nouveau should check the folio order after the possible split happens.
>>>>
>>> You mean the folio_split callback?
>> no, nouveau_dmem_migrate_to_ram():
>> ..
>>         sfolio = page_folio(vmf->page);
>>         order = folio_order(sfolio);
>> ...
>>         migrate_vma_setup()
>> ..
>> if sfolio is split order still reflects the pre-split order
>>
> Will fix, good catch!
>
>>>>>>> +				/*
>>>>>>> +				 * After the split, get back the extra reference
>>>>>>> +				 * on the fault_page, this reference is checked during
>>>>>>> +				 * folio_migrate_mapping()
>>>>>>> +				 */
>>>>>>> +				if (migrate->fault_page) {
>>>>>>> +					new_fault_folio = page_folio(migrate->fault_page);
>>>>>>> +					folio_get(new_fault_folio);
>>>>>>> +				}
>>>>>>> +
>>>>>>> +				new_folio = page_folio(page);
>>>>>>> +				pfn = page_to_pfn(page);
>>>>>>> +
>>>>>>> +				/*
>>>>>>> +				 * Ensure the lock is held on the correct
>>>>>>> +				 * folio after the split
>>>>>>> +				 */
>>>>>>> +				if (folio != new_folio) {
>>>>>>> +					folio_unlock(folio);
>>>>>>> +					folio_lock(new_folio);
>>>>>>> +				}
>>>>>> Maybe careful not to unlock fault_page ?
>>>>>>
>>>>> split_page will unlock everything but the original folio, the code takes the lock
>>>>> on the folio corresponding to the new folio
>>>> I mean do_swap_page() unlocks folio of fault_page and expects it to remain locked.
>>>>
>>> Not sure I follow what you're trying to elaborate on here
>> do_swap_page:
>> ..
>>         if (trylock_page(vmf->page)) {
>>                 ret = pgmap->ops->migrate_to_ram(vmf);
>>                                <- vmf->page should be locked here even after split
>>                 unlock_page(vmf->page);
>>
> Yep, the split will unlock all tail folios, leaving just the head folio locked.
> With this change, the lock we need to hold is the folio lock associated with
> fault_page's pte entry, and we must not unlock it when the cause is a fault.
> The code seems to do the right thing there, let me double check

Yes, the fault case is ok. But if the migration is not for a fault, we should not leave any page locked

>
> Balbir
> and the code does the right thing there.
>
--Mika

Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by Balbir Singh 2 months ago
On 8/5/25 20:35, Mika Penttilä wrote:
> 
> On 8/5/25 13:27, Balbir Singh wrote:
> 
>> On 8/5/25 14:24, Mika Penttilä wrote:
>>> Hi,
>>>
>>> On 8/5/25 07:10, Balbir Singh wrote:
>>>> On 8/5/25 09:26, Mika Penttilä wrote:
>>>>> Hi,
>>>>>
>>>>> On 8/5/25 01:46, Balbir Singh wrote:
>>>>>> On 8/2/25 22:13, Mika Penttilä wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 8/2/25 13:37, Balbir Singh wrote:
>>>>>>>> FYI:
>>>>>>>>
>>>>>>>> I have the following patch on top of my series that seems to make it work
>>>>>>>> without requiring the helper to split device private folios
>>>>>>>>
>>>>>>> I think this looks much better!
>>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>> ---
>>>>>>>>  include/linux/huge_mm.h |  1 -
>>>>>>>>  lib/test_hmm.c          | 11 +++++-
>>>>>>>>  mm/huge_memory.c        | 76 ++++-------------------------------------
>>>>>>>>  mm/migrate_device.c     | 51 +++++++++++++++++++++++++++
>>>>>>>>  4 files changed, 67 insertions(+), 72 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>>>>>> index 19e7e3b7c2b7..52d8b435950b 100644
>>>>>>>> --- a/include/linux/huge_mm.h
>>>>>>>> +++ b/include/linux/huge_mm.h
>>>>>>>> @@ -343,7 +343,6 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>>>>>>>  		vm_flags_t vm_flags);
>>>>>>>>  
>>>>>>>>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>>>>>>> -int split_device_private_folio(struct folio *folio);
>>>>>>>>  int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>>>>>>  		unsigned int new_order, bool unmapped);
>>>>>>>>  int min_order_for_split(struct folio *folio);
>>>>>>>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>>>>>>>> index 341ae2af44ec..444477785882 100644
>>>>>>>> --- a/lib/test_hmm.c
>>>>>>>> +++ b/lib/test_hmm.c
>>>>>>>> @@ -1625,13 +1625,22 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>>>>>>>  	 * the mirror but here we use it to hold the page for the simulated
>>>>>>>>  	 * device memory and that page holds the pointer to the mirror.
>>>>>>>>  	 */
>>>>>>>> -	rpage = vmf->page->zone_device_data;
>>>>>>>> +	rpage = folio_page(page_folio(vmf->page), 0)->zone_device_data;
>>>>>>>>  	dmirror = rpage->zone_device_data;
>>>>>>>>  
>>>>>>>>  	/* FIXME demonstrate how we can adjust migrate range */
>>>>>>>>  	order = folio_order(page_folio(vmf->page));
>>>>>>>>  	nr = 1 << order;
>>>>>>>>  
>>>>>>>> +	/*
>>>>>>>> +	 * When folios are partially mapped, we can't rely on the folio
>>>>>>>> +	 * order of vmf->page as the folio might not be fully split yet
>>>>>>>> +	 */
>>>>>>>> +	if (vmf->pte) {
>>>>>>>> +		order = 0;
>>>>>>>> +		nr = 1;
>>>>>>>> +	}
>>>>>>>> +
>>>>>>>>  	/*
>>>>>>>>  	 * Consider a per-cpu cache of src and dst pfns, but with
>>>>>>>>  	 * large number of cpus that might not scale well.
>>>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>>>> index 1fc1efa219c8..863393dec1f1 100644
>>>>>>>> --- a/mm/huge_memory.c
>>>>>>>> +++ b/mm/huge_memory.c
>>>>>>>> @@ -72,10 +72,6 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
>>>>>>>>  					  struct shrink_control *sc);
>>>>>>>>  static unsigned long deferred_split_scan(struct shrinker *shrink,
>>>>>>>>  					 struct shrink_control *sc);
>>>>>>>> -static int __split_unmapped_folio(struct folio *folio, int new_order,
>>>>>>>> -		struct page *split_at, struct xa_state *xas,
>>>>>>>> -		struct address_space *mapping, bool uniform_split);
>>>>>>>> -
>>>>>>>>  static bool split_underused_thp = true;
>>>>>>>>  
>>>>>>>>  static atomic_t huge_zero_refcount;
>>>>>>>> @@ -2924,51 +2920,6 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>>>>>>>>  	pmd_populate(mm, pmd, pgtable);
>>>>>>>>  }
>>>>>>>>  
>>>>>>>> -/**
>>>>>>>> - * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>> - * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>> - * split folios for pages that are partially mapped
>>>>>>>> - *
>>>>>>>> - * @folio: the folio to split
>>>>>>>> - *
>>>>>>>> - * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>> - */
>>>>>>>> -int split_device_private_folio(struct folio *folio)
>>>>>>>> -{
>>>>>>>> -	struct folio *end_folio = folio_next(folio);
>>>>>>>> -	struct folio *new_folio;
>>>>>>>> -	int ret = 0;
>>>>>>>> -
>>>>>>>> -	/*
>>>>>>>> -	 * Split the folio now. In the case of device
>>>>>>>> -	 * private pages, this path is executed when
>>>>>>>> -	 * the pmd is split and since freeze is not true
>>>>>>>> -	 * it is likely the folio will be deferred_split.
>>>>>>>> -	 *
>>>>>>>> -	 * With device private pages, deferred splits of
>>>>>>>> -	 * folios should be handled here to prevent partial
>>>>>>>> -	 * unmaps from causing issues later on in migration
>>>>>>>> -	 * and fault handling flows.
>>>>>>>> -	 */
>>>>>>>> -	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>> -	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>>>>>> -	VM_WARN_ON(ret);
>>>>>>>> -	for (new_folio = folio_next(folio); new_folio != end_folio;
>>>>>>>> -					new_folio = folio_next(new_folio)) {
>>>>>>>> -		zone_device_private_split_cb(folio, new_folio);
>>>>>>>> -		folio_ref_unfreeze(new_folio, 1 + folio_expected_ref_count(
>>>>>>>> -								new_folio));
>>>>>>>> -	}
>>>>>>>> -
>>>>>>>> -	/*
>>>>>>>> -	 * Mark the end of the folio split for device private THP
>>>>>>>> -	 * split
>>>>>>>> -	 */
>>>>>>>> -	zone_device_private_split_cb(folio, NULL);
>>>>>>>> -	folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>> -	return ret;
>>>>>>>> -}
>>>>>>>> -
>>>>>>>>  static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>>>>>  		unsigned long haddr, bool freeze)
>>>>>>>>  {
>>>>>>>> @@ -3064,30 +3015,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>>>>>  				freeze = false;
>>>>>>>>  			if (!freeze) {
>>>>>>>>  				rmap_t rmap_flags = RMAP_NONE;
>>>>>>>> -				unsigned long addr = haddr;
>>>>>>>> -				struct folio *new_folio;
>>>>>>>> -				struct folio *end_folio = folio_next(folio);
>>>>>>>>  
>>>>>>>>  				if (anon_exclusive)
>>>>>>>>  					rmap_flags |= RMAP_EXCLUSIVE;
>>>>>>>>  
>>>>>>>> -				folio_lock(folio);
>>>>>>>> -				folio_get(folio);
>>>>>>>> -
>>>>>>>> -				split_device_private_folio(folio);
>>>>>>>> -
>>>>>>>> -				for (new_folio = folio_next(folio);
>>>>>>>> -					new_folio != end_folio;
>>>>>>>> -					new_folio = folio_next(new_folio)) {
>>>>>>>> -					addr += PAGE_SIZE;
>>>>>>>> -					folio_unlock(new_folio);
>>>>>>>> -					folio_add_anon_rmap_ptes(new_folio,
>>>>>>>> -						&new_folio->page, 1,
>>>>>>>> -						vma, addr, rmap_flags);
>>>>>>>> -				}
>>>>>>>> -				folio_unlock(folio);
>>>>>>>> -				folio_add_anon_rmap_ptes(folio, &folio->page,
>>>>>>>> -						1, vma, haddr, rmap_flags);
>>>>>>>> +				folio_ref_add(folio, HPAGE_PMD_NR - 1);
>>>>>>>> +				if (anon_exclusive)
>>>>>>>> +					rmap_flags |= RMAP_EXCLUSIVE;
>>>>>>>> +				folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
>>>>>>>> +						 vma, haddr, rmap_flags);
>>>>>>>>  			}
>>>>>>>>  		}
>>>>>>>>  
>>>>>>>> @@ -4065,7 +4001,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>  	if (nr_shmem_dropped)
>>>>>>>>  		shmem_uncharge(mapping->host, nr_shmem_dropped);
>>>>>>>>  
>>>>>>>> -	if (!ret && is_anon)
>>>>>>>> +	if (!ret && is_anon && !folio_is_device_private(folio))
>>>>>>>>  		remap_flags = RMP_USE_SHARED_ZEROPAGE;
>>>>>>>>  
>>>>>>>>  	remap_page(folio, 1 << order, remap_flags);
>>>>>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>>>>>> index 49962ea19109..4264c0290d08 100644
>>>>>>>> --- a/mm/migrate_device.c
>>>>>>>> +++ b/mm/migrate_device.c
>>>>>>>> @@ -248,6 +248,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>>>>  			 * page table entry. Other special swap entries are not
>>>>>>>>  			 * migratable, and we ignore regular swapped page.
>>>>>>>>  			 */
>>>>>>>> +			struct folio *folio;
>>>>>>>> +
>>>>>>>>  			entry = pte_to_swp_entry(pte);
>>>>>>>>  			if (!is_device_private_entry(entry))
>>>>>>>>  				goto next;
>>>>>>>> @@ -259,6 +261,55 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>>>>  			    pgmap->owner != migrate->pgmap_owner)
>>>>>>>>  				goto next;
>>>>>>>>  
>>>>>>>> +			folio = page_folio(page);
>>>>>>>> +			if (folio_test_large(folio)) {
>>>>>>>> +				struct folio *new_folio;
>>>>>>>> +				struct folio *new_fault_folio;
>>>>>>>> +
>>>>>>>> +				/*
>>>>>>>> +				 * The reason for finding pmd present with a
>>>>>>>> +				 * device private pte and a large folio for the
>>>>>>>> +				 * pte is partial unmaps. Split the folio now
>>>>>>>> +				 * for the migration to be handled correctly
>>>>>>>> +				 */
>>>>>>>> +				pte_unmap_unlock(ptep, ptl);
>>>>>>>> +
>>>>>>>> +				folio_get(folio);
>>>>>>>> +				if (folio != fault_folio)
>>>>>>>> +					folio_lock(folio);
>>>>>>>> +				if (split_folio(folio)) {
>>>>>>>> +					if (folio != fault_folio)
>>>>>>>> +						folio_unlock(folio);
>>>>>>>> +					ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>>>>>>>> +					goto next;
>>>>>>>> +				}
>>>>>>>> +
>>>>>>> The nouveau migrate_to_ram handler needs adjustment also if split happens.
>>>>>>>
>>>>>> test_hmm needs adjustment because of the way the backup folios are setup.
>>>>> nouveau should check the folio order after the possible split happens.
>>>>>
>>>> You mean the folio_split callback?
>>> no, nouveau_dmem_migrate_to_ram():
>>> ..
>>>         sfolio = page_folio(vmf->page);
>>>         order = folio_order(sfolio);
>>> ...
>>>         migrate_vma_setup()
>>> ..
>>> if sfolio is split order still reflects the pre-split order
>>>
>> Will fix, good catch!
>>
>>>>>>>> +				/*
>>>>>>>> +				 * After the split, get back the extra reference
>>>>>>>> +				 * on the fault_page, this reference is checked during
>>>>>>>> +				 * folio_migrate_mapping()
>>>>>>>> +				 */
>>>>>>>> +				if (migrate->fault_page) {
>>>>>>>> +					new_fault_folio = page_folio(migrate->fault_page);
>>>>>>>> +					folio_get(new_fault_folio);
>>>>>>>> +				}
>>>>>>>> +
>>>>>>>> +				new_folio = page_folio(page);
>>>>>>>> +				pfn = page_to_pfn(page);
>>>>>>>> +
>>>>>>>> +				/*
>>>>>>>> +				 * Ensure the lock is held on the correct
>>>>>>>> +				 * folio after the split
>>>>>>>> +				 */
>>>>>>>> +				if (folio != new_folio) {
>>>>>>>> +					folio_unlock(folio);
>>>>>>>> +					folio_lock(new_folio);
>>>>>>>> +				}
>>>>>>> Maybe careful not to unlock fault_page ?
>>>>>>>
>>>>>> split_page will unlock everything but the original folio, the code takes the lock
>>>>>> on the folio corresponding to the new folio
>>>>> I mean do_swap_page() unlocks folio of fault_page and expects it to remain locked.
>>>>>
>>>> Not sure I follow what you're trying to elaborate on here
>>> do_swap_page:
>>> ..
>>>         if (trylock_page(vmf->page)) {
>>>                 ret = pgmap->ops->migrate_to_ram(vmf);
>>>                                <- vmf->page should be locked here even after split
>>>                 unlock_page(vmf->page);
>>>
>> Yep, the split will unlock all tail folios, leaving the just head folio locked
>> and this the change, the lock we need to hold is the folio lock associated with
>> fault_page, pte entry and not unlock when the cause is a fault. The code seems
>> to do the right thing there, let me double check
> 
> Yes the fault case is ok. But if migrate not for a fault, we should not leave any page locked
> 

migrate_vma_finalize() handles this

Balbir
Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by Mika Penttilä 2 months ago
On 8/5/25 13:36, Balbir Singh wrote:
> On 8/5/25 20:35, Mika Penttilä wrote:
>> On 8/5/25 13:27, Balbir Singh wrote:
>>
>>> On 8/5/25 14:24, Mika Penttilä wrote:
>>>> Hi,
>>>>
>>>> On 8/5/25 07:10, Balbir Singh wrote:
>>>>> On 8/5/25 09:26, Mika Penttilä wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 8/5/25 01:46, Balbir Singh wrote:
>>>>>>> On 8/2/25 22:13, Mika Penttilä wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On 8/2/25 13:37, Balbir Singh wrote:
>>>>>>>>> FYI:
>>>>>>>>>
>>>>>>>>> I have the following patch on top of my series that seems to make it work
>>>>>>>>> without requiring the helper to split device private folios
>>>>>>>>>
>>>>>>>> I think this looks much better!
>>>>>>>>
>>>>>>> Thanks!
>>>>>>>
>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>> ---
>>>>>>>>>  include/linux/huge_mm.h |  1 -
>>>>>>>>>  lib/test_hmm.c          | 11 +++++-
>>>>>>>>>  mm/huge_memory.c        | 76 ++++-------------------------------------
>>>>>>>>>  mm/migrate_device.c     | 51 +++++++++++++++++++++++++++
>>>>>>>>>  4 files changed, 67 insertions(+), 72 deletions(-)
>>>>>>>>>
>>>>>>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>>>>>>> index 19e7e3b7c2b7..52d8b435950b 100644
>>>>>>>>> --- a/include/linux/huge_mm.h
>>>>>>>>> +++ b/include/linux/huge_mm.h
>>>>>>>>> @@ -343,7 +343,6 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>>>>>>>>  		vm_flags_t vm_flags);
>>>>>>>>>  
>>>>>>>>>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>>>>>>>> -int split_device_private_folio(struct folio *folio);
>>>>>>>>>  int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>>>>>>>  		unsigned int new_order, bool unmapped);
>>>>>>>>>  int min_order_for_split(struct folio *folio);
>>>>>>>>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>>>>>>>>> index 341ae2af44ec..444477785882 100644
>>>>>>>>> --- a/lib/test_hmm.c
>>>>>>>>> +++ b/lib/test_hmm.c
>>>>>>>>> @@ -1625,13 +1625,22 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>>>>>>>>  	 * the mirror but here we use it to hold the page for the simulated
>>>>>>>>>  	 * device memory and that page holds the pointer to the mirror.
>>>>>>>>>  	 */
>>>>>>>>> -	rpage = vmf->page->zone_device_data;
>>>>>>>>> +	rpage = folio_page(page_folio(vmf->page), 0)->zone_device_data;
>>>>>>>>>  	dmirror = rpage->zone_device_data;
>>>>>>>>>  
>>>>>>>>>  	/* FIXME demonstrate how we can adjust migrate range */
>>>>>>>>>  	order = folio_order(page_folio(vmf->page));
>>>>>>>>>  	nr = 1 << order;
>>>>>>>>>  
>>>>>>>>> +	/*
>>>>>>>>> +	 * When folios are partially mapped, we can't rely on the folio
>>>>>>>>> +	 * order of vmf->page as the folio might not be fully split yet
>>>>>>>>> +	 */
>>>>>>>>> +	if (vmf->pte) {
>>>>>>>>> +		order = 0;
>>>>>>>>> +		nr = 1;
>>>>>>>>> +	}
>>>>>>>>> +
>>>>>>>>>  	/*
>>>>>>>>>  	 * Consider a per-cpu cache of src and dst pfns, but with
>>>>>>>>>  	 * large number of cpus that might not scale well.
>>>>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>>>>> index 1fc1efa219c8..863393dec1f1 100644
>>>>>>>>> --- a/mm/huge_memory.c
>>>>>>>>> +++ b/mm/huge_memory.c
>>>>>>>>> @@ -72,10 +72,6 @@ static unsigned long deferred_split_count(struct shrinker *shrink,
>>>>>>>>>  					  struct shrink_control *sc);
>>>>>>>>>  static unsigned long deferred_split_scan(struct shrinker *shrink,
>>>>>>>>>  					 struct shrink_control *sc);
>>>>>>>>> -static int __split_unmapped_folio(struct folio *folio, int new_order,
>>>>>>>>> -		struct page *split_at, struct xa_state *xas,
>>>>>>>>> -		struct address_space *mapping, bool uniform_split);
>>>>>>>>> -
>>>>>>>>>  static bool split_underused_thp = true;
>>>>>>>>>  
>>>>>>>>>  static atomic_t huge_zero_refcount;
>>>>>>>>> @@ -2924,51 +2920,6 @@ static void __split_huge_zero_page_pmd(struct vm_area_struct *vma,
>>>>>>>>>  	pmd_populate(mm, pmd, pgtable);
>>>>>>>>>  }
>>>>>>>>>  
>>>>>>>>> -/**
>>>>>>>>> - * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>> - * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>> - * split folios for pages that are partially mapped
>>>>>>>>> - *
>>>>>>>>> - * @folio: the folio to split
>>>>>>>>> - *
>>>>>>>>> - * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>> - */
>>>>>>>>> -int split_device_private_folio(struct folio *folio)
>>>>>>>>> -{
>>>>>>>>> -	struct folio *end_folio = folio_next(folio);
>>>>>>>>> -	struct folio *new_folio;
>>>>>>>>> -	int ret = 0;
>>>>>>>>> -
>>>>>>>>> -	/*
>>>>>>>>> -	 * Split the folio now. In the case of device
>>>>>>>>> -	 * private pages, this path is executed when
>>>>>>>>> -	 * the pmd is split and since freeze is not true
>>>>>>>>> -	 * it is likely the folio will be deferred_split.
>>>>>>>>> -	 *
>>>>>>>>> -	 * With device private pages, deferred splits of
>>>>>>>>> -	 * folios should be handled here to prevent partial
>>>>>>>>> -	 * unmaps from causing issues later on in migration
>>>>>>>>> -	 * and fault handling flows.
>>>>>>>>> -	 */
>>>>>>>>> -	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>> -	ret = __split_unmapped_folio(folio, 0, &folio->page, NULL, NULL, true);
>>>>>>>>> -	VM_WARN_ON(ret);
>>>>>>>>> -	for (new_folio = folio_next(folio); new_folio != end_folio;
>>>>>>>>> -					new_folio = folio_next(new_folio)) {
>>>>>>>>> -		zone_device_private_split_cb(folio, new_folio);
>>>>>>>>> -		folio_ref_unfreeze(new_folio, 1 + folio_expected_ref_count(
>>>>>>>>> -								new_folio));
>>>>>>>>> -	}
>>>>>>>>> -
>>>>>>>>> -	/*
>>>>>>>>> -	 * Mark the end of the folio split for device private THP
>>>>>>>>> -	 * split
>>>>>>>>> -	 */
>>>>>>>>> -	zone_device_private_split_cb(folio, NULL);
>>>>>>>>> -	folio_ref_unfreeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>> -	return ret;
>>>>>>>>> -}
>>>>>>>>> -
>>>>>>>>>  static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>>>>>>  		unsigned long haddr, bool freeze)
>>>>>>>>>  {
>>>>>>>>> @@ -3064,30 +3015,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
>>>>>>>>>  				freeze = false;
>>>>>>>>>  			if (!freeze) {
>>>>>>>>>  				rmap_t rmap_flags = RMAP_NONE;
>>>>>>>>> -				unsigned long addr = haddr;
>>>>>>>>> -				struct folio *new_folio;
>>>>>>>>> -				struct folio *end_folio = folio_next(folio);
>>>>>>>>>  
>>>>>>>>>  				if (anon_exclusive)
>>>>>>>>>  					rmap_flags |= RMAP_EXCLUSIVE;
>>>>>>>>>  
>>>>>>>>> -				folio_lock(folio);
>>>>>>>>> -				folio_get(folio);
>>>>>>>>> -
>>>>>>>>> -				split_device_private_folio(folio);
>>>>>>>>> -
>>>>>>>>> -				for (new_folio = folio_next(folio);
>>>>>>>>> -					new_folio != end_folio;
>>>>>>>>> -					new_folio = folio_next(new_folio)) {
>>>>>>>>> -					addr += PAGE_SIZE;
>>>>>>>>> -					folio_unlock(new_folio);
>>>>>>>>> -					folio_add_anon_rmap_ptes(new_folio,
>>>>>>>>> -						&new_folio->page, 1,
>>>>>>>>> -						vma, addr, rmap_flags);
>>>>>>>>> -				}
>>>>>>>>> -				folio_unlock(folio);
>>>>>>>>> -				folio_add_anon_rmap_ptes(folio, &folio->page,
>>>>>>>>> -						1, vma, haddr, rmap_flags);
>>>>>>>>> +				folio_ref_add(folio, HPAGE_PMD_NR - 1);
>>>>>>>>> +				if (anon_exclusive)
>>>>>>>>> +					rmap_flags |= RMAP_EXCLUSIVE;
>>>>>>>>> +				folio_add_anon_rmap_ptes(folio, page, HPAGE_PMD_NR,
>>>>>>>>> +						 vma, haddr, rmap_flags);
>>>>>>>>>  			}
>>>>>>>>>  		}
>>>>>>>>>  
>>>>>>>>> @@ -4065,7 +4001,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>  	if (nr_shmem_dropped)
>>>>>>>>>  		shmem_uncharge(mapping->host, nr_shmem_dropped);
>>>>>>>>>  
>>>>>>>>> -	if (!ret && is_anon)
>>>>>>>>> +	if (!ret && is_anon && !folio_is_device_private(folio))
>>>>>>>>>  		remap_flags = RMP_USE_SHARED_ZEROPAGE;
>>>>>>>>>  
>>>>>>>>>  	remap_page(folio, 1 << order, remap_flags);
>>>>>>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>>>>>>> index 49962ea19109..4264c0290d08 100644
>>>>>>>>> --- a/mm/migrate_device.c
>>>>>>>>> +++ b/mm/migrate_device.c
>>>>>>>>> @@ -248,6 +248,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>>>>>  			 * page table entry. Other special swap entries are not
>>>>>>>>>  			 * migratable, and we ignore regular swapped page.
>>>>>>>>>  			 */
>>>>>>>>> +			struct folio *folio;
>>>>>>>>> +
>>>>>>>>>  			entry = pte_to_swp_entry(pte);
>>>>>>>>>  			if (!is_device_private_entry(entry))
>>>>>>>>>  				goto next;
>>>>>>>>> @@ -259,6 +261,55 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>>>>>>  			    pgmap->owner != migrate->pgmap_owner)
>>>>>>>>>  				goto next;
>>>>>>>>>  
>>>>>>>>> +			folio = page_folio(page);
>>>>>>>>> +			if (folio_test_large(folio)) {
>>>>>>>>> +				struct folio *new_folio;
>>>>>>>>> +				struct folio *new_fault_folio;
>>>>>>>>> +
>>>>>>>>> +				/*
>>>>>>>>> +				 * The reason for finding pmd present with a
>>>>>>>>> +				 * device private pte and a large folio for the
>>>>>>>>> +				 * pte is partial unmaps. Split the folio now
>>>>>>>>> +				 * for the migration to be handled correctly
>>>>>>>>> +				 */
>>>>>>>>> +				pte_unmap_unlock(ptep, ptl);
>>>>>>>>> +
>>>>>>>>> +				folio_get(folio);
>>>>>>>>> +				if (folio != fault_folio)
>>>>>>>>> +					folio_lock(folio);
>>>>>>>>> +				if (split_folio(folio)) {
>>>>>>>>> +					if (folio != fault_folio)
>>>>>>>>> +						folio_unlock(folio);
>>>>>>>>> +					ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>>>>>>>>> +					goto next;
>>>>>>>>> +				}
>>>>>>>>> +
>>>>>>>> The nouveau migrate_to_ram handler needs adjustment also if split happens.
>>>>>>>>
>>>>>>> test_hmm needs adjustment because of the way the backup folios are setup.
>>>>>> nouveau should check the folio order after the possible split happens.
>>>>>>
>>>>> You mean the folio_split callback?
>>>> no, nouveau_dmem_migrate_to_ram():
>>>> ..
>>>>         sfolio = page_folio(vmf->page);
>>>>         order = folio_order(sfolio);
>>>> ...
>>>>         migrate_vma_setup()
>>>> ..
>>>> if sfolio is split order still reflects the pre-split order
>>>>
>>> Will fix, good catch!
>>>
>>>>>>>>> +				/*
>>>>>>>>> +				 * After the split, get back the extra reference
>>>>>>>>> +				 * on the fault_page, this reference is checked during
>>>>>>>>> +				 * folio_migrate_mapping()
>>>>>>>>> +				 */
>>>>>>>>> +				if (migrate->fault_page) {
>>>>>>>>> +					new_fault_folio = page_folio(migrate->fault_page);
>>>>>>>>> +					folio_get(new_fault_folio);
>>>>>>>>> +				}
>>>>>>>>> +
>>>>>>>>> +				new_folio = page_folio(page);
>>>>>>>>> +				pfn = page_to_pfn(page);
>>>>>>>>> +
>>>>>>>>> +				/*
>>>>>>>>> +				 * Ensure the lock is held on the correct
>>>>>>>>> +				 * folio after the split
>>>>>>>>> +				 */
>>>>>>>>> +				if (folio != new_folio) {
>>>>>>>>> +					folio_unlock(folio);
>>>>>>>>> +					folio_lock(new_folio);
>>>>>>>>> +				}
>>>>>>>> Maybe careful not to unlock fault_page ?
>>>>>>>>
>>>>>>> split_page will unlock everything but the original folio, the code takes the lock
>>>>>>> on the folio corresponding to the new folio
>>>>>> I mean do_swap_page() unlocks folio of fault_page and expects it to remain locked.
>>>>>>
>>>>> Not sure I follow what you're trying to elaborate on here
>>>> do_swap_page:
>>>> ..
>>>>         if (trylock_page(vmf->page)) {
>>>>                 ret = pgmap->ops->migrate_to_ram(vmf);
>>>>                                <- vmf->page should be locked here even after split
>>>>                 unlock_page(vmf->page);
>>>>
>>> Yep, the split will unlock all tail folios, leaving the just head folio locked
>>> and this the change, the lock we need to hold is the folio lock associated with
>>> fault_page, pte entry and not unlock when the cause is a fault. The code seems
>>> to do the right thing there, let me double check
>> Yes the fault case is ok. But if migrate not for a fault, we should not leave any page locked
>>
> migrate_vma_finalize() handles this

But after the split we are back in migrate_vma_collect_pmd() and try to collect the pte, which locks the page again.
So it needs to be unlocked after the split.
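
Something along these lines is what I mean (an untested sketch against the
quoted hunk, reusing its locals):

	/*
	 * Sketch: after the split, keep only the folio that backs
	 * migrate->fault_page locked (do_swap_page() expects it to still
	 * be held on return); the pte walk below locks the now order-0
	 * folio again when it collects the entry, so nothing else may be
	 * left locked here.
	 */
	new_fault_folio = migrate->fault_page ?
			  page_folio(migrate->fault_page) : NULL;
	if (folio != new_fault_folio)
		folio_unlock(folio);
	ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);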

>
> Balbir
>

Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by Zi Yan 2 months ago
On 31 Jul 2025, at 20:49, Balbir Singh wrote:

> On 7/31/25 21:26, Zi Yan wrote:
>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>
>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>>
>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>
>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>
>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>
>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>   include/linux/huge_mm.h |   1 +
>>>>>>>>>>>>>>>   include/linux/rmap.h    |   2 +
>>>>>>>>>>>>>>>   include/linux/swapops.h |  17 +++
>>>>>>>>>>>>>>>   mm/huge_memory.c        | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>>   mm/page_vma_mapped.c    |  13 +-
>>>>>>>>>>>>>>>   mm/pgtable-generic.c    |   6 +
>>>>>>>>>>>>>>>   mm/rmap.c               |  22 +++-
>>>>>>>>>>>>>>>   7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>>>
>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>> +	struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>> +	struct folio *new_folio;
>>>>>>>>>>>>>>> +	int ret = 0;
>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>> +	/*
>>>>>>>>>>>>>>> +	 * Split the folio now. In the case of device
>>>>>>>>>>>>>>> +	 * private pages, this path is executed when
>>>>>>>>>>>>>>> +	 * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>> +	 * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>> +	 *
>>>>>>>>>>>>>>> +	 * With device private pages, deferred splits of
>>>>>>>>>>>>>>> +	 * folios should be handled here to prevent partial
>>>>>>>>>>>>>>> +	 * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>> +	 * and fault handling flows.
>>>>>>>>>>>>>>> +	 */
>>>>>>>>>>>>>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>>> device side mapping.
>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>>
>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>>> at CPU side.
>>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>>> CPU unmapped and device mapped.
>>>>>
>>>>> Here are my questions on device private folios:
>>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>>     perspective? Can it be stored in a device private specific data structure?
>>>>
>>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>>> common code more messy if not done that way but sure possible.
>>>> And not consuming pfns (address space) at all would have benefits.
>>>>
>>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>>     the device driver manipulate it assuming core-mm just skips device private
>>>>>     folios (barring the CPU access fault handling)?
>>>>>
>>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>>> by CPU and only device driver manipulates their mappings?
>>>>>
>>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>>> someone could change while in device, it's just pfn.
>>>
>>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>>
>>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>>
>>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>>
>>>
>>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>>
>> Thanks for the clarification.
>>
>> So folio_mapcount() for device private folios should be treated the same
>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>> Then I wonder if the device private large folio split should go through
>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>> remap. Otherwise, how can we prevent rmap changes during the split?
>>
>
> That is true in general, the special cases I mentioned are:
>
> 1. split during migration (where the sizes on source/destination do not
>    match) and so we need to split in the middle of migration. The entries
>    there are already unmapped and hence the special handling

In this case, all device private entries pointing to this device private
folio should be turned into migration entries and folio_mapcount() should
be 0. split_device_private_folio() handles this situation, although
the function name is not very descriptive. You might want to add a comment
to this function about its use and add a check to make sure folio_mapcount()
is 0.
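
Something like this, just as a sketch on top of your helper:

	int split_device_private_folio(struct folio *folio)
	{
		...
		/*
		 * All entries pointing to this folio must already have been
		 * replaced with migration entries by the caller, so the
		 * frozen split below cannot race with the rmap.
		 */
		VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio);
		folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
		...
	}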

> 2. Partial unmap case, where we need to split in the context of the unmap
>    due to the issues mentioned in the patch. I expanded the folio split code
>    for device private into its own helper, which does not need to do the
>    xas/mapped/lru folio handling. During partial unmap the original folio
>    does get replaced by new anon rmap ptes (split_huge_pmd_locked)
>
> For (2), I spent some time examining the implications of not unmapping the
> folios prior to the split; in the partial unmap path, once we split the PMD,
> the folios diverge. I did not run into any particular race with the tests
> either.

For the partial unmap case, you should be able to handle it in the same way
as a normal PTE-mapped large folio, since, as David said, each device private
entry can be seen as a PROT_NONE entry. At PMD split, the PMD page table page
should be filled with device private PTEs, each of them pointing to the
corresponding subpage. When some of these PTEs are then unmapped, the rmap
code should take care of the folio_mapcount().
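
Roughly like this for the PMD split, mirroring what __split_huge_pmd_locked()
does for present pages (only a sketch, the locals and the write handling are
illustrative):

	for (i = 0, addr = haddr; i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE) {
		swp_entry_t swp_entry;
		pte_t entry;

		if (write)
			swp_entry = make_writable_device_private_entry(pfn + i);
		else
			swp_entry = make_readable_device_private_entry(pfn + i);
		entry = swp_entry_to_pte(swp_entry);
		/* each PTE-level device private entry maps one subpage */
		set_pte_at(mm, addr, pte + i, entry);
	}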


Best Regards,
Yan, Zi
Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by David Hildenbrand 2 months ago
On 01.08.25 03:09, Zi Yan wrote:
> On 31 Jul 2025, at 20:49, Balbir Singh wrote:
> 
>> On 7/31/25 21:26, Zi Yan wrote:
>>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>>
>>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>>>
>>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>>
>>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>>
>>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>>
>>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>    include/linux/huge_mm.h |   1 +
>>>>>>>>>>>>>>>>    include/linux/rmap.h    |   2 +
>>>>>>>>>>>>>>>>    include/linux/swapops.h |  17 +++
>>>>>>>>>>>>>>>>    mm/huge_memory.c        | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>>>    mm/page_vma_mapped.c    |  13 +-
>>>>>>>>>>>>>>>>    mm/pgtable-generic.c    |   6 +
>>>>>>>>>>>>>>>>    mm/rmap.c               |  22 +++-
>>>>>>>>>>>>>>>>    7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>> <snip>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>> +	struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>>> +	struct folio *new_folio;
>>>>>>>>>>>>>>>> +	int ret = 0;
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> +	/*
>>>>>>>>>>>>>>>> +	 * Split the folio now. In the case of device
>>>>>>>>>>>>>>>> +	 * private pages, this path is executed when
>>>>>>>>>>>>>>>> +	 * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>>> +	 * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>>> +	 *
>>>>>>>>>>>>>>>> +	 * With device private pages, deferred splits of
>>>>>>>>>>>>>>>> +	 * folios should be handled here to prevent partial
>>>>>>>>>>>>>>>> +	 * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>>> +	 * and fault handling flows.
>>>>>>>>>>>>>>>> +	 */
>>>>>>>>>>>>>>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>>>> device side mapping.
>>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>>>
>>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>>>> at CPU side.
>>>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>>>> CPU unmapped and device mapped.
>>>>>>
>>>>>> Here are my questions on device private folios:
>>>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>>>      perspective? Can it be stored in a device private specific data structure?
>>>>>
>>>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>>>> common code more messy if not done that way but sure possible.
>>>>> And not consuming pfns (address space) at all would have benefits.
>>>>>
>>>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>>>      the device driver manipulate it assuming core-mm just skips device private
>>>>>>      folios (barring the CPU access fault handling)?
>>>>>>
>>>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>>>> by CPU and only device driver manipulates their mappings?
>>>>>>
>>>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>>>> someone could change while in device, it's just pfn.
>>>>
>>>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>>>
>>>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>>>
>>>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>>>
>>>>
>>>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
>>>
>>> Thanks for the clarification.
>>>
>>> So folio_mapcount() for device private folios should be treated the same
>>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>>> Then I wonder if the device private large folio split should go through
>>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>>> remap. Otherwise, how can we prevent rmap changes during the split?
>>>
>>
>> That is true in general, the special cases I mentioned are:
>>
>> 1. split during migration (where we the sizes on source/destination do not
>>     match) and so we need to split in the middle of migration. The entries
>>     there are already unmapped and hence the special handling
> 
> In this case, all device private entries pointing to this device private
> folio should be turned into migration entries and folio_mapcount() should
> be 0. The split_device_private_folio() is handling this situation, although
> the function name is not very descriptive. You might want to add a comment
> to this function about its use and add a check to make sure folio_mapcount()
> is 0.
> 
>> 2. Partial unmap case, where we need to split in the context of the unmap
>>     due to the isses mentioned in the patch. I expanded the folio split code
>>     for device private can be expanded into its own helper, which does not
>>     need to do the xas/mapped/lru folio handling. During partial unmap the
>>     original folio does get replaced by new anon rmap ptes (split_huge_pmd_locked)
>>
>> For (2), I spent some time examining the implications of not unmapping the
>> folios prior to split and in the partial unmap path, once we split the PMD
>> the folios diverge. I did not run into any particular race either with the
>> tests.
> 
> For partial unmap case, you should be able to handle it in the same way
> as normal PTE-mapped large folio. Since like David said, each device private
> entry can be seen as a PROTNONE entry. At PMD split, PMD page table page
> should be filled with device private PTEs. Each of them points to the
> corresponding subpage. When device unmaps some of the PTEs, rmap code
> should take care of the folio_mapcount().

Right. In general, no splitting of any THP with a mapcount > 0 
(folio_mapped()). It's a clear indication that you are doing something 
wrong.
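
IOW, any "unmapped split" helper could simply refuse to proceed (sketch):

	/* an unmapped split must never see a mapped folio */
	if (WARN_ON_ONCE(folio_mapped(folio)))
		return -EBUSY;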

-- 
Cheers,

David / dhildenb

Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by David Hildenbrand 2 months ago
On 31.07.25 13:26, Zi Yan wrote:
> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
> 
>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>
>>> On 7/30/25 18:58, Zi Yan wrote:
>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>
>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>
>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>
>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>> Make THP handling code in the mm subsystem for THP pages aware of zone
>>>>>>>>>>>>>> device pages. Although the code is designed to be generic when it comes
>>>>>>>>>>>>>> to handling splitting of pages, the code is designed to work for THP
>>>>>>>>>>>>>> page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device huge
>>>>>>>>>>>>>> entry is present, enabling try_to_migrate() and other code migration
>>>>>>>>>>>>>> paths to appropriately process the entry. page_vma_mapped_walk() will
>>>>>>>>>>>>>> return true for zone device private large folios only when
>>>>>>>>>>>>>> PVMW_THP_DEVICE_PRIVATE is passed. This is to prevent locations that are
>>>>>>>>>>>>>> not zone device private pages from having to add awareness. The key
>>>>>>>>>>>>>> callback that needs this flag is try_to_migrate_one(). The other
>>>>>>>>>>>>>> callbacks page idle, damon use it for setting young/dirty bits, which is
>>>>>>>>>>>>>> not significant when it comes to pmd level bit harvesting.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries, use
>>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison as for zone device
>>>>>>>>>>>>>> entries.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Zone device private entries when split via munmap go through pmd split,
>>>>>>>>>>>>>> but need to go through a folio split, deferred split does not work if a
>>>>>>>>>>>>>> fault is encountered because fault handling involves migration entries
>>>>>>>>>>>>>> (via folio_migrate_mapping) and the folio sizes are expected to be the
>>>>>>>>>>>>>> same there. This introduces the need to split the folio while handling
>>>>>>>>>>>>>> the pmd split. Because the folio is still mapped, but calling
>>>>>>>>>>>>>> folio_split() will cause lock recursion, the __split_unmapped_folio()
>>>>>>>>>>>>>> code is used with a new helper to wrap the code
>>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>>> folio->mapping, swapcache and the need to go through unmap and remap
>>>>>>>>>>>>>> folio.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Cc: Karol Herbst <kherbst@redhat.com>
>>>>>>>>>>>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>>>>>>>>>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>>>>>>>>>>>> Cc: David Airlie <airlied@gmail.com>
>>>>>>>>>>>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>>>>>>>>>>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>>>>>>>>>>>>> Cc: Shuah Khan <shuah@kernel.org>
>>>>>>>>>>>>>> Cc: David Hildenbrand <david@redhat.com>
>>>>>>>>>>>>>> Cc: Barry Song <baohua@kernel.org>
>>>>>>>>>>>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>>>>>>>>>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>>>>>>>> Cc: Matthew Wilcox <willy@infradead.org>
>>>>>>>>>>>>>> Cc: Peter Xu <peterx@redhat.com>
>>>>>>>>>>>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>>>>>>>>>>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>>>>>>>>>>>>> Cc: Jane Chu <jane.chu@oracle.com>
>>>>>>>>>>>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>>>>>>>>>>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>>>>>>>>>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>>>>>>>>>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>>>>>>>>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Signed-off-by: Matthew Brost <matthew.brost@intel.com>
>>>>>>>>>>>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>    include/linux/huge_mm.h |   1 +
>>>>>>>>>>>>>>    include/linux/rmap.h    |   2 +
>>>>>>>>>>>>>>    include/linux/swapops.h |  17 +++
>>>>>>>>>>>>>>    mm/huge_memory.c        | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>    mm/page_vma_mapped.c    |  13 +-
>>>>>>>>>>>>>>    mm/pgtable-generic.c    |   6 +
>>>>>>>>>>>>>>    mm/rmap.c               |  22 +++-
>>>>>>>>>>>>>>    7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>>
>>>>>>>>>>>> <snip>
>>>>>>>>>>>>
>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>> +	struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>> +	struct folio *new_folio;
>>>>>>>>>>>>>> +	int ret = 0;
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> +	/*
>>>>>>>>>>>>>> +	 * Split the folio now. In the case of device
>>>>>>>>>>>>>> +	 * private pages, this path is executed when
>>>>>>>>>>>>>> +	 * the pmd is split and since freeze is not true
>>>>>>>>>>>>>> +	 * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>> +	 *
>>>>>>>>>>>>>> +	 * With device private pages, deferred splits of
>>>>>>>>>>>>>> +	 * folios should be handled here to prevent partial
>>>>>>>>>>>>>> +	 * unmaps from causing issues later on in migration
>>>>>>>>>>>>>> +	 * and fault handling flows.
>>>>>>>>>>>>>> +	 */
>>>>>>>>>>>>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics, why can't there be other references in addition to the caller?
>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped in
>>>>>>>>>>>> CPU side but mapped in the device. folio_ref_freeeze() is not aware of
>>>>>>>>>>>> device side mapping.
>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>> process mirrors CPU side folio split: 1) unmap device private mapping,
>>>>>>>>>>> 2) freeze device private folio, 3) split unmapped folio, 4) unfreeze,
>>>>>>>>>>> 5) remap device private mapping.
>>>>>>>>>> Ah ok this was about device private page obviously here, nevermind..
>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths and folio is mapped to CPU page tables as a huge device page by one or more task?
>>>>>>>> The folio only has migration entries pointing to it. From CPU perspective,
>>>>>>>> it is not mapped. The unmap_folio() used by __folio_split() unmaps a to-be-split
>>>>>>>> folio by replacing existing page table entries with migration entries
>>>>>>>> and after that the folio is regarded as “unmapped”.
>>>>>>>>
>>>>>>>> The migration entry is an invalid CPU page table entry, so it is not a CPU
>>>>>>> split_device_private_folio() is called for device private entry, not migrate entry afaics.
>>>>>> Yes, but from CPU perspective, both device private entry and migration entry
>>>>>> are invalid CPU page table entries, so the device private folio is “unmapped”
>>>>>> at CPU side.
>>>>> Yes both are "swap entries" but there's difference, the device private ones contribute to mapcount and refcount.
>>>> Right. That confused me when I was talking to Balbir and looking at v1.
>>>> When a device private folio is processed in __folio_split(), Balbir needed to
>>>> add code to skip CPU mapping handling code. Basically device private folios are
>>>> CPU unmapped and device mapped.
>>>>
>>>> Here are my questions on device private folios:
>>>> 1. How is mapcount used for device private folios? Why is it needed from CPU
>>>>      perspective? Can it be stored in a device private specific data structure?
>>>
>>> Mostly like for normal folios, for instance rmap when doing migrate. I think it would make
>>> common code more messy if not done that way but sure possible.
>>> And not consuming pfns (address space) at all would have benefits.
>>>
>>>> 2. When a device private folio is mapped on device, can someone other than
>>>>      the device driver manipulate it assuming core-mm just skips device private
>>>>      folios (barring the CPU access fault handling)?
>>>>
>>>> Where I am going is that can device private folios be treated as unmapped folios
>>>> by CPU and only device driver manipulates their mappings?
>>>>
>>> Yes not present by CPU but mm has bookkeeping on them. The private page has no content
>>> someone could change while in device, it's just pfn.
>>
>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>
>> It would be better if they would be present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>
>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>
>>
>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount is adjusted accordingly by rmap functions.
> 
> Thanks for the clarification.
> 
> So folio_mapcount() for device private folios should be treated the same
> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
> Then I wonder if the device private large folio split should go through
> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
> remap. Otherwise, how can we prevent rmap changes during the split?

That is what I would expect: Replace device-private by migration 
entries, perform the migration/split/whatever, restore migration entries 
to device-private entries.

That will drive the mapcount to 0.
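
Roughly, as an illustrative sketch (not the code in this patch; split_unmapped_to_order0() and restore_device_private_entries() are hypothetical stand-ins for the __split_unmapped_folio() / remove_migration_ptes() plumbing):

static int device_private_split_sketch(struct folio *folio)
{
        int ret;

        VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);

        /* 1) Unmap: device-private entries become migration entries. */
        try_to_migrate(folio, TTU_SPLIT_HUGE_PMD);
        if (folio_mapped(folio))
                return -EAGAIN;

        /* 2) Freeze: fails if anyone else still holds a reference. */
        if (!folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio)))
                return -EAGAIN;

        /* 3) Split the now-unmapped folio down to order 0 and unfreeze
         *    the resulting folios (hypothetical helper).
         */
        ret = split_unmapped_to_order0(folio);

        /* 4) Restore: turn the migration entries back into
         *    device-private entries (hypothetical helper wrapping
         *    remove_migration_ptes()).
         */
        restore_device_private_entries(folio);

        return ret;
}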

-- 
Cheers,

David / dhildenb

Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by Zi Yan 2 months ago
On 31 Jul 2025, at 8:32, David Hildenbrand wrote:

> On 31.07.25 13:26, Zi Yan wrote:
>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>
>>> <snip>
>>>
>>> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
>>>
>>> It would be better if they were present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>>>
>>> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>>>
>>>
>>> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount are adjusted accordingly by rmap functions.
>>
>> Thanks for the clarification.
>>
>> So folio_mapcount() for device private folios should be treated the same
>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>> Then I wonder if the device private large folio split should go through
>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>> remap. Otherwise, how can we prevent rmap changes during the split?
>
> That is what I would expect: Replace device-private by migration entries, perform the migration/split/whatever, restore migration entries to device-private entries.
>
> That will drive the mapcount to 0.

Great. That matches my expectations as well. One potential optimization: since
the device private entry is already CPU inaccessible, the TLB flush can be
avoided.
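
Something like this in the PMD migration-entry install path (illustrative fragment only; names and context are assumptions, not the patch's code):

static void install_migration_pmd(struct vm_area_struct *vma, unsigned long addr,
                                  pmd_t *pmdp, pmd_t migration_pmd)
{
        pmd_t old_pmd;

        if (pmd_present(*pmdp)) {
                /* Present huge PMD: clear and flush as usual. */
                old_pmd = pmdp_huge_clear_flush(vma, addr, pmdp);
        } else {
                /* Device-private swap PMD: nothing can be cached in the
                 * TLB for it, so the flush can be skipped.
                 */
                old_pmd = *pmdp;
                pmd_clear(pmdp);
        }
        /* old_pmd would be used to carry dirty/soft-dirty bits over. */
        set_pmd_at(vma->vm_mm, addr, pmdp, migration_pmd);
}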


Best Regards,
Yan, Zi
Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by David Hildenbrand 2 months ago
On 31.07.25 15:34, Zi Yan wrote:
> On 31 Jul 2025, at 8:32, David Hildenbrand wrote:
> 
>> On 31.07.25 13:26, Zi Yan wrote:
>>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>>
>>>> <snip>
>>>
>>> Thanks for the clarification.
>>>
>>> So folio_mapcount() for device private folios should be treated the same
>>> as normal folios, even if the corresponding PTEs are not accessible from CPUs.
>>> Then I wonder if the device private large folio split should go through
>>> __folio_split(), the same as normal folios: unmap, freeze, split, unfreeze,
>>> remap. Otherwise, how can we prevent rmap changes during the split?
>>
>> That is what I would expect: Replace device-private by migration entries, perform the migration/split/whatever, restore migration entries to device-private entries.
>>
>> That will drive the mapcount to 0.
> 
> Great. That matches my expectations as well. One potential optimization: since
> the device private entry is already CPU inaccessible, the TLB flush can be
> avoided.

Right, I would assume that is already done, or could easily be added.

Not using proper migration entries sounds like a hack that we shouldn't
start with. We should start with as few special cases as possible in
core-mm.

For example, as you probably implied, there is nothing stopping a
concurrent fork() or zap from messing with the refcount+mapcount.
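
To make that concrete (simplified illustration, not the patch's code):

static bool try_freeze_mapped_device_private(struct folio *folio)
{
        const int expected = folio_expected_ref_count(folio);

        /*
         * While the folio is still mapped by device-private entries, a
         * concurrent fork() duplicating those entries, or a concurrent
         * zap dropping them, changes the refcount (and mapcount) right
         * here -- so the freeze below can fail and must be checked.
         */
        return folio_ref_freeze(folio, 1 + expected);
}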

-- 
Cheers,

David / dhildenb

Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
Posted by Balbir Singh 2 months ago
On 7/31/25 17:15, David Hildenbrand wrote:
> On 30.07.25 18:29, Mika Penttilä wrote:
>>
>> On 7/30/25 18:58, Zi Yan wrote:
>>> <snip>
>>>
>>> Where I am going is: can device private folios be treated as unmapped folios
>>> by the CPU, with only the device driver manipulating their mappings?
>>>
>> Yes, not present to the CPU, but mm keeps bookkeeping on them. The private page has no content
>> someone could change while it is in the device; it's just a pfn.
> 
> Just to clarify: a device-private entry, like a device-exclusive entry, is a *page table mapping* tracked through the rmap -- even though they are not present page table entries.
> 
> It would be better if they were present page table entries that are PROT_NONE, but it's tricky to mark them as being "special" device-private, device-exclusive etc. Maybe there are ways to do that in the future.
>
> Maybe device-private could just be PROT_NONE, because we can identify the entry type based on the folio. device-exclusive is harder ...
>
>
> So consider device-private entries just like PROT_NONE present page table entries. Refcount and mapcount are adjusted accordingly by rmap functions.
> 

Thanks for clarifying on my behalf, I am just catching up with the discussion.

When I was referring to "mapped" with Zi, I was talking about how touching the entry causes a fault and a migration back; in that sense the pages can be considered unmapped, because they are mapped to the device. Device private entries are mapped into the page tables and have a refcount associated with the page/folio that represents the device entries.
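
For reference, the "touching the entry causes a fault and migration back" part looks roughly like this on the CPU side (paraphrased sketch of the do_swap_page() handling, not verbatim; page_pgmap() is the newer accessor, older kernels use page->pgmap):

static vm_fault_t handle_device_private_fault(struct vm_fault *vmf)
{
        swp_entry_t entry = pte_to_swp_entry(vmf->orig_pte);
        struct page *page;

        if (!is_device_private_entry(entry))
                return 0;

        page = pfn_swap_entry_to_page(entry);
        /*
         * The owning driver migrates the data back to system RAM and
         * replaces the device-private entry with a present PTE.
         */
        return page_pgmap(page)->ops->migrate_to_ram(vmf);
}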

Balbir Singh