[v7] mm: support device-private THP

[v7 11/16] mm/migrate_device: add THP splitting during migration

Posted by Balbir Singh 4 months, 1 week ago

Implement migrate_vma_split_unmapped_folio() to handle THP splitting during the
migration process when the destination cannot allocate compound pages.

This addresses the common scenario where migrate_vma_setup() succeeds with
MIGRATE_PFN_COMPOUND pages, but the destination device cannot allocate
large pages during the migration phase.

Key changes:
- migrate_vma_split_unmapped_folio(): Split already-isolated pages during migration
- Enhanced __folio_split() and __split_huge_page_to_list_to_order() with an
  unmapped parameter to avoid redundant unmap/remap operations

This provides a fallback mechanism to ensure migration succeeds even when
large page allocation fails at the destination.
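
As a rough illustration of the fallback, the sketch below models in userspace C
how a single compound source entry can be rewritten into per-page entries once
the folio has been split, following the loop in
migrate_vma_split_unmapped_folio(). The flag values and the helper name
expand_compound_entry() are assumptions made for this example; only the
bit manipulation mirrors the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed encoding, modelled on the migrate PFN flags; values are illustrative. */
#define MIGRATE_PFN_VALID    (1UL << 0)
#define MIGRATE_PFN_MIGRATE  (1UL << 1)
#define MIGRATE_PFN_COMPOUND (1UL << 4)
#define MIGRATE_PFN_SHIFT    6

/* Encode a pfn as a valid migrate entry, in the spirit of migrate_pfn(). */
static unsigned long migrate_pfn(unsigned long pfn)
{
	return (pfn << MIGRATE_PFN_SHIFT) | MIGRATE_PFN_VALID;
}

/*
 * Hypothetical helper: after the nr-page folio described by src[idx] has
 * been split, give every base page its own entry with the same flags.
 */
static void expand_compound_entry(unsigned long *src, size_t idx, size_t nr)
{
	unsigned long flags, pfn;
	size_t i;

	assert(nr > 0);

	/* The head entry no longer describes a compound page. */
	src[idx] &= ~MIGRATE_PFN_COMPOUND;

	/* Low bits carry the flags, high bits carry the pfn. */
	flags = src[idx] & ((1UL << MIGRATE_PFN_SHIFT) - 1);
	pfn = src[idx] >> MIGRATE_PFN_SHIFT;

	/* Fill in the tail entries with consecutive pfns and inherited flags. */
	for (i = 1; i < nr; i++)
		src[idx + i] = migrate_pfn(pfn + i) | flags;
}
```

In the patch itself nr is HPAGE_PMD_NR and the array is migrate->src; the
sketch takes both as parameters so it can be exercised in isolation.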

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Ying Huang <ying.huang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Cc: Mika Penttilä <mpenttil@redhat.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Francois Dugast <francois.dugast@intel.com>

Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
 include/linux/huge_mm.h | 11 +++++-
 lib/test_hmm.c          |  9 +++++
 mm/huge_memory.c        | 46 ++++++++++++----------
 mm/migrate_device.c     | 85 +++++++++++++++++++++++++++++++++++------
 4 files changed, 117 insertions(+), 34 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2d669be7f1c8..a166be872628 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -365,8 +365,8 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
 		vm_flags_t vm_flags);
 
 bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
-int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
-		unsigned int new_order);
+int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
+		unsigned int new_order, bool unmapped);
 int min_order_for_split(struct folio *folio);
 int split_folio_to_list(struct folio *folio, struct list_head *list);
 bool uniform_split_supported(struct folio *folio, unsigned int new_order,
@@ -375,6 +375,13 @@ bool non_uniform_split_supported(struct folio *folio, unsigned int new_order,
 		bool warns);
 int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
 		struct list_head *list);
+
+static inline int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
+		unsigned int new_order)
+{
+	return __split_huge_page_to_list_to_order(page, list, new_order, false);
+}
+
 /*
  * try_folio_split - try to split a @folio at @page using non uniform split.
  * @folio: folio to be split
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 46fa9e200db8..df429670633e 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -1612,6 +1612,15 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
 	order = folio_order(page_folio(vmf->page));
 	nr = 1 << order;
 
+	/*
+	 * When folios are partially mapped, we can't rely on the folio
+	 * order of vmf->page as the folio might not be fully split yet
+	 */
+	if (vmf->pte) {
+		order = 0;
+		nr = 1;
+	}
+
 	/*
 	 * Consider a per-cpu cache of src and dst pfns, but with
 	 * large number of cpus that might not scale well.
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8c95a658b3ec..022b0729f826 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3463,15 +3463,6 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
 		new_folio->mapping = folio->mapping;
 		new_folio->index = folio->index + i;
 
-		/*
-		 * page->private should not be set in tail pages. Fix up and warn once
-		 * if private is unexpectedly set.
-		 */
-		if (unlikely(new_folio->private)) {
-			VM_WARN_ON_ONCE_PAGE(true, new_head);
-			new_folio->private = NULL;
-		}
-
 		if (folio_test_swapcache(folio))
 			new_folio->swap.val = folio->swap.val + i;
 
@@ -3700,6 +3691,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
  * @lock_at: a page within @folio to be left locked to caller
  * @list: after-split folios will be put on it if non NULL
  * @uniform_split: perform uniform split or not (non-uniform split)
+ * @unmapped: The pages are already unmapped, they are migration entries.
  *
  * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
  * It is in charge of checking whether the split is supported or not and
@@ -3715,7 +3707,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
  */
 static int __folio_split(struct folio *folio, unsigned int new_order,
 		struct page *split_at, struct page *lock_at,
-		struct list_head *list, bool uniform_split)
+		struct list_head *list, bool uniform_split, bool unmapped)
 {
 	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
 	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
@@ -3765,13 +3757,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 		 * is taken to serialise against parallel split or collapse
 		 * operations.
 		 */
-		anon_vma = folio_get_anon_vma(folio);
-		if (!anon_vma) {
-			ret = -EBUSY;
-			goto out;
+		if (!unmapped) {
+			anon_vma = folio_get_anon_vma(folio);
+			if (!anon_vma) {
+				ret = -EBUSY;
+				goto out;
+			}
+			anon_vma_lock_write(anon_vma);
 		}
 		mapping = NULL;
-		anon_vma_lock_write(anon_vma);
 	} else {
 		unsigned int min_order;
 		gfp_t gfp;
@@ -3838,7 +3832,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 		goto out_unlock;
 	}
 
-	unmap_folio(folio);
+	if (!unmapped)
+		unmap_folio(folio);
 
 	/* block interrupt reentry in xa_lock and spinlock */
 	local_irq_disable();
@@ -3925,10 +3920,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 
 			next = folio_next(new_folio);
 
+			zone_device_private_split_cb(folio, new_folio);
+
 			expected_refs = folio_expected_ref_count(new_folio) + 1;
 			folio_ref_unfreeze(new_folio, expected_refs);
 
-			lru_add_split_folio(folio, new_folio, lruvec, list);
+			if (!unmapped)
+				lru_add_split_folio(folio, new_folio, lruvec, list);
 
 			/*
 			 * Anonymous folio with swap cache.
@@ -3959,6 +3957,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 			__filemap_remove_folio(new_folio, NULL);
 			folio_put_refs(new_folio, nr_pages);
 		}
+
+		zone_device_private_split_cb(folio, NULL);
 		/*
 		 * Unfreeze @folio only after all page cache entries, which
 		 * used to point to it, have been updated with new folios.
@@ -3982,6 +3982,9 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 
 	local_irq_enable();
 
+	if (unmapped)
+		return ret;
+
 	if (nr_shmem_dropped)
 		shmem_uncharge(mapping->host, nr_shmem_dropped);
 
@@ -4072,12 +4075,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
  * Returns -EINVAL when trying to split to an order that is incompatible
  * with the folio. Splitting to order 0 is compatible with all folios.
  */
-int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
-				     unsigned int new_order)
+int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
+				     unsigned int new_order, bool unmapped)
 {
 	struct folio *folio = page_folio(page);
 
-	return __folio_split(folio, new_order, &folio->page, page, list, true);
+	return __folio_split(folio, new_order, &folio->page, page, list, true,
+				unmapped);
 }
 
 /*
@@ -4106,7 +4110,7 @@ int folio_split(struct folio *folio, unsigned int new_order,
 		struct page *split_at, struct list_head *list)
 {
 	return __folio_split(folio, new_order, split_at, &folio->page, list,
-			false);
+			false, false);
 }
 
 int min_order_for_split(struct folio *folio)
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 4156fd6190d2..fa42d2ebd024 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -306,6 +306,23 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
 			    pgmap->owner != migrate->pgmap_owner)
 				goto next;
 
+			folio = page_folio(page);
+			if (folio_test_large(folio)) {
+				int ret;
+
+				pte_unmap_unlock(ptep, ptl);
+				ret = migrate_vma_split_folio(folio,
+							  migrate->fault_page);
+
+				if (ret) {
+					ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
+					goto next;
+				}
+
+				addr = start;
+				goto again;
+			}
+
 			mpfn = migrate_pfn(page_to_pfn(page)) |
 					MIGRATE_PFN_MIGRATE;
 			if (is_writable_device_private_entry(entry))
@@ -880,6 +897,29 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
 		src[i] &= ~MIGRATE_PFN_MIGRATE;
 	return 0;
 }
+
+static int migrate_vma_split_unmapped_folio(struct migrate_vma *migrate,
+					    unsigned long idx, unsigned long addr,
+					    struct folio *folio)
+{
+	unsigned long i;
+	unsigned long pfn;
+	unsigned long flags;
+	int ret = 0;
+
+	folio_get(folio);
+	split_huge_pmd_address(migrate->vma, addr, true);
+	ret = __split_huge_page_to_list_to_order(folio_page(folio, 0), NULL,
+							0, true);
+	if (ret)
+		return ret;
+	migrate->src[idx] &= ~MIGRATE_PFN_COMPOUND;
+	flags = migrate->src[idx] & ((1UL << MIGRATE_PFN_SHIFT) - 1);
+	pfn = migrate->src[idx] >> MIGRATE_PFN_SHIFT;
+	for (i = 1; i < HPAGE_PMD_NR; i++)
+		migrate->src[i+idx] = migrate_pfn(pfn + i) | flags;
+	return ret;
+}
 #else /* !CONFIG_ARCH_ENABLE_THP_MIGRATION */
 static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
 					 unsigned long addr,
@@ -889,6 +929,13 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
 {
 	return 0;
 }
+
+static int migrate_vma_split_unmapped_folio(struct migrate_vma *migrate,
+					    unsigned long idx, unsigned long addr,
+					    struct folio *folio)
+{
+	return 0;
+}
 #endif
 
 static unsigned long migrate_vma_nr_pages(unsigned long *src)
@@ -1050,8 +1097,9 @@ static void __migrate_device_pages(unsigned long *src_pfns,
 				struct migrate_vma *migrate)
 {
 	struct mmu_notifier_range range;
-	unsigned long i;
+	unsigned long i, j;
 	bool notified = false;
+	unsigned long addr;
 
 	for (i = 0; i < npages; ) {
 		struct page *newpage = migrate_pfn_to_page(dst_pfns[i]);
@@ -1093,12 +1141,16 @@ static void __migrate_device_pages(unsigned long *src_pfns,
 				(!(dst_pfns[i] & MIGRATE_PFN_COMPOUND))) {
 				nr = migrate_vma_nr_pages(&src_pfns[i]);
 				src_pfns[i] &= ~MIGRATE_PFN_COMPOUND;
-				src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
-				goto next;
+			} else {
+				nr = 1;
 			}
 
-			migrate_vma_insert_page(migrate, addr, &dst_pfns[i],
-						&src_pfns[i]);
+			for (j = 0; j < nr && i + j < npages; j++) {
+				src_pfns[i+j] |= MIGRATE_PFN_MIGRATE;
+				migrate_vma_insert_page(migrate,
+					addr + j * PAGE_SIZE,
+					&dst_pfns[i+j], &src_pfns[i+j]);
+			}
 			goto next;
 		}
 
@@ -1120,7 +1172,13 @@ static void __migrate_device_pages(unsigned long *src_pfns,
 							 MIGRATE_PFN_COMPOUND);
 					goto next;
 				}
-				src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
+				nr = 1 << folio_order(folio);
+				addr = migrate->start + i * PAGE_SIZE;
+				if (migrate_vma_split_unmapped_folio(migrate, i, addr, folio)) {
+					src_pfns[i] &= ~(MIGRATE_PFN_MIGRATE |
+							 MIGRATE_PFN_COMPOUND);
+					goto next;
+				}
 			} else if ((src_pfns[i] & MIGRATE_PFN_MIGRATE) &&
 				(dst_pfns[i] & MIGRATE_PFN_COMPOUND) &&
 				!(src_pfns[i] & MIGRATE_PFN_COMPOUND)) {
@@ -1156,11 +1214,16 @@ static void __migrate_device_pages(unsigned long *src_pfns,
 
 		if (migrate && migrate->fault_page == page)
 			extra_cnt = 1;
-		r = folio_migrate_mapping(mapping, newfolio, folio, extra_cnt);
-		if (r)
-			src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
-		else
-			folio_migrate_flags(newfolio, folio);
+		for (j = 0; j < nr && i + j < npages; j++) {
+			folio = page_folio(migrate_pfn_to_page(src_pfns[i+j]));
+			newfolio = page_folio(migrate_pfn_to_page(dst_pfns[i+j]));
+
+			r = folio_migrate_mapping(mapping, newfolio, folio, extra_cnt);
+			if (r)
+				src_pfns[i+j] &= ~MIGRATE_PFN_MIGRATE;
+			else
+				folio_migrate_flags(newfolio, folio);
+		}
 next:
 		i += nr;
 	}
-- 
2.51.0

Re: [v7 11/16] mm/migrate_device: add THP splitting during migration

Posted by Wei Yang 3 months, 3 weeks ago

On Wed, Oct 01, 2025 at 04:57:02PM +1000, Balbir Singh wrote:
[...]
> static int __folio_split(struct folio *folio, unsigned int new_order,
> 		struct page *split_at, struct page *lock_at,
>-		struct list_head *list, bool uniform_split)
>+		struct list_head *list, bool uniform_split, bool unmapped)
> {
> 	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
> 	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>@@ -3765,13 +3757,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> 		 * is taken to serialise against parallel split or collapse
> 		 * operations.
> 		 */
>-		anon_vma = folio_get_anon_vma(folio);
>-		if (!anon_vma) {
>-			ret = -EBUSY;
>-			goto out;
>+		if (!unmapped) {
>+			anon_vma = folio_get_anon_vma(folio);
>+			if (!anon_vma) {
>+				ret = -EBUSY;
>+				goto out;
>+			}
>+			anon_vma_lock_write(anon_vma);
> 		}
> 		mapping = NULL;
>-		anon_vma_lock_write(anon_vma);
> 	} else {
> 		unsigned int min_order;
> 		gfp_t gfp;
>@@ -3838,7 +3832,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> 		goto out_unlock;
> 	}
> 
>-	unmap_folio(folio);
>+	if (!unmapped)
>+		unmap_folio(folio);
> 
> 	/* block interrupt reentry in xa_lock and spinlock */
> 	local_irq_disable();
>@@ -3925,10 +3920,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> 
> 			next = folio_next(new_folio);
> 
>+			zone_device_private_split_cb(folio, new_folio);
>+
> 			expected_refs = folio_expected_ref_count(new_folio) + 1;
> 			folio_ref_unfreeze(new_folio, expected_refs);
> 
>-			lru_add_split_folio(folio, new_folio, lruvec, list);
>+			if (!unmapped)
>+				lru_add_split_folio(folio, new_folio, lruvec, list);
> 
> 			/*
> 			 * Anonymous folio with swap cache.
>@@ -3959,6 +3957,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> 			__filemap_remove_folio(new_folio, NULL);
> 			folio_put_refs(new_folio, nr_pages);
> 		}
>+
>+		zone_device_private_split_cb(folio, NULL);
> 		/*
> 		 * Unfreeze @folio only after all page cache entries, which
> 		 * used to point to it, have been updated with new folios.
>@@ -3982,6 +3982,9 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> 
> 	local_irq_enable();
> 
>+	if (unmapped)
>+		return ret;

As the comments on __folio_split() and __split_huge_page_to_list_to_order()
mention:

  * The large folio must be locked
  * After splitting, the after-split folio containing @lock_at remains locked

But here we seem to change the prerequisites.

Hmm.. I am not sure this is correct.

>+
> 	if (nr_shmem_dropped)
> 		shmem_uncharge(mapping->host, nr_shmem_dropped);
> 

-- 
Wei Yang
Help you, Help me

Re: [v7 11/16] mm/migrate_device: add THP splitting during migration

Posted by Balbir Singh 3 months, 3 weeks ago

On 10/19/25 19:19, Wei Yang wrote:
> On Wed, Oct 01, 2025 at 04:57:02PM +1000, Balbir Singh wrote:
> [...]
>> static int __folio_split(struct folio *folio, unsigned int new_order,
>> 		struct page *split_at, struct page *lock_at,
>> -		struct list_head *list, bool uniform_split)
>> +		struct list_head *list, bool uniform_split, bool unmapped)
>> {
>> 	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>> 	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>> @@ -3765,13 +3757,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>> 		 * is taken to serialise against parallel split or collapse
>> 		 * operations.
>> 		 */
>> -		anon_vma = folio_get_anon_vma(folio);
>> -		if (!anon_vma) {
>> -			ret = -EBUSY;
>> -			goto out;
>> +		if (!unmapped) {
>> +			anon_vma = folio_get_anon_vma(folio);
>> +			if (!anon_vma) {
>> +				ret = -EBUSY;
>> +				goto out;
>> +			}
>> +			anon_vma_lock_write(anon_vma);
>> 		}
>> 		mapping = NULL;
>> -		anon_vma_lock_write(anon_vma);
>> 	} else {
>> 		unsigned int min_order;
>> 		gfp_t gfp;
>> @@ -3838,7 +3832,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>> 		goto out_unlock;
>> 	}
>>
>> -	unmap_folio(folio);
>> +	if (!unmapped)
>> +		unmap_folio(folio);
>>
>> 	/* block interrupt reentry in xa_lock and spinlock */
>> 	local_irq_disable();
>> @@ -3925,10 +3920,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>
>> 			next = folio_next(new_folio);
>>
>> +			zone_device_private_split_cb(folio, new_folio);
>> +
>> 			expected_refs = folio_expected_ref_count(new_folio) + 1;
>> 			folio_ref_unfreeze(new_folio, expected_refs);
>>
>> -			lru_add_split_folio(folio, new_folio, lruvec, list);
>> +			if (!unmapped)
>> +				lru_add_split_folio(folio, new_folio, lruvec, list);
>>
>> 			/*
>> 			 * Anonymous folio with swap cache.
>> @@ -3959,6 +3957,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>> 			__filemap_remove_folio(new_folio, NULL);
>> 			folio_put_refs(new_folio, nr_pages);
>> 		}
>> +
>> +		zone_device_private_split_cb(folio, NULL);
>> 		/*
>> 		 * Unfreeze @folio only after all page cache entries, which
>> 		 * used to point to it, have been updated with new folios.
>> @@ -3982,6 +3982,9 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>
>> 	local_irq_enable();
>>
>> +	if (unmapped)
>> +		return ret;
> 
> As the comments on __folio_split() and __split_huge_page_to_list_to_order()
> mention:
> 
>   * The large folio must be locked
>   * After splitting, the after-split folio containing @lock_at remains locked
> 
> But here we seem to change the prerequisites.
> 
> Hmm.. I am not sure this is correct.
> 

The code is correct, but you are right in that the documentation needs to be updated.
When "unmapped", we do want to leave the folios locked after the split.

Balbir
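
To make the locking difference discussed above concrete, here is a toy
userspace model, not kernel code. It assumes, based on the quoted comment and
the reply, that the regular path leaves only the piece containing @lock_at
locked, while the unmapped path returns early and leaves every piece locked.
struct toy_folio, toy_finish_split() and toy_count_locked() are hypothetical
names invented for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a folio: only the lock bit is modelled. */
struct toy_folio {
	bool locked;
};

/*
 * Hypothetical model of the step that runs after the split itself.
 * All pieces are expected to be locked on entry, since the large folio
 * was locked before it was split.
 */
static void toy_finish_split(struct toy_folio *pieces, size_t nr,
			     size_t lock_at, bool unmapped)
{
	size_t i;

	assert(lock_at < nr);

	/* Unmapped path: return early, so nothing is unlocked. */
	if (unmapped)
		return;

	/* Regular path: unlock everything except the piece holding lock_at. */
	for (i = 0; i < nr; i++)
		if (i != lock_at)
			pieces[i].locked = false;
}

/* Count how many pieces are still locked. */
static size_t toy_count_locked(const struct toy_folio *pieces, size_t nr)
{
	size_t i, n = 0;

	for (i = 0; i < nr; i++)
		if (pieces[i].locked)
			n++;
	return n;
}
```

With four pieces, the regular path would leave one locked piece and the
unmapped path would leave all four locked, which is the postcondition the
function comment would need to spell out.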

Re: [v7 11/16] mm/migrate_device: add THP splitting during migration

Posted by Zi Yan 3 months, 3 weeks ago

On 19 Oct 2025, at 18:49, Balbir Singh wrote:

> On 10/19/25 19:19, Wei Yang wrote:
>> On Wed, Oct 01, 2025 at 04:57:02PM +1000, Balbir Singh wrote:
>> [...]
>>> static int __folio_split(struct folio *folio, unsigned int new_order,
>>> 		struct page *split_at, struct page *lock_at,
>>> -		struct list_head *list, bool uniform_split)
>>> +		struct list_head *list, bool uniform_split, bool unmapped)
>>> {
>>> 	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>> 	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>> @@ -3765,13 +3757,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>> 		 * is taken to serialise against parallel split or collapse
>>> 		 * operations.
>>> 		 */
>>> -		anon_vma = folio_get_anon_vma(folio);
>>> -		if (!anon_vma) {
>>> -			ret = -EBUSY;
>>> -			goto out;
>>> +		if (!unmapped) {
>>> +			anon_vma = folio_get_anon_vma(folio);
>>> +			if (!anon_vma) {
>>> +				ret = -EBUSY;
>>> +				goto out;
>>> +			}
>>> +			anon_vma_lock_write(anon_vma);
>>> 		}
>>> 		mapping = NULL;
>>> -		anon_vma_lock_write(anon_vma);
>>> 	} else {
>>> 		unsigned int min_order;
>>> 		gfp_t gfp;
>>> @@ -3838,7 +3832,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>> 		goto out_unlock;
>>> 	}
>>>
>>> -	unmap_folio(folio);
>>> +	if (!unmapped)
>>> +		unmap_folio(folio);
>>>
>>> 	/* block interrupt reentry in xa_lock and spinlock */
>>> 	local_irq_disable();
>>> @@ -3925,10 +3920,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>
>>> 			next = folio_next(new_folio);
>>>
>>> +			zone_device_private_split_cb(folio, new_folio);
>>> +
>>> 			expected_refs = folio_expected_ref_count(new_folio) + 1;
>>> 			folio_ref_unfreeze(new_folio, expected_refs);
>>>
>>> -			lru_add_split_folio(folio, new_folio, lruvec, list);
>>> +			if (!unmapped)
>>> +				lru_add_split_folio(folio, new_folio, lruvec, list);
>>>
>>> 			/*
>>> 			 * Anonymous folio with swap cache.
>>> @@ -3959,6 +3957,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>> 			__filemap_remove_folio(new_folio, NULL);
>>> 			folio_put_refs(new_folio, nr_pages);
>>> 		}
>>> +
>>> +		zone_device_private_split_cb(folio, NULL);
>>> 		/*
>>> 		 * Unfreeze @folio only after all page cache entries, which
>>> 		 * used to point to it, have been updated with new folios.
>>> @@ -3982,6 +3982,9 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>
>>> 	local_irq_enable();
>>>
>>> +	if (unmapped)
>>> +		return ret;
>>
>> As the comments on __folio_split() and __split_huge_page_to_list_to_order()
>> mention:
>>
>>   * The large folio must be locked
>>   * After splitting, the after-split folio containing @lock_at remains locked
>>
>> But here we seem to change the prerequisites.
>>
>> Hmm.. I am not sure this is correct.
>>
>
> The code is correct, but you are right in that the documentation needs to be updated.
> When "unmapped", we do want to leave the folios locked after the split.

Sigh, this "unmapped" code needs so many special branches and a different locking
requirement. It should be a separate function to avoid confusion.

--
Best Regards,
Yan, Zi

Re: [v7 11/16] mm/migrate_device: add THP splitting during migration

Posted by Balbir Singh 3 months, 2 weeks ago

On 10/20/25 09:59, Zi Yan wrote:
> On 19 Oct 2025, at 18:49, Balbir Singh wrote:
> 
>> On 10/19/25 19:19, Wei Yang wrote:
>>> On Wed, Oct 01, 2025 at 04:57:02PM +1000, Balbir Singh wrote:
>>> [...]
>>>> static int __folio_split(struct folio *folio, unsigned int new_order,
>>>> 		struct page *split_at, struct page *lock_at,
>>>> -		struct list_head *list, bool uniform_split)
>>>> +		struct list_head *list, bool uniform_split, bool unmapped)
>>>> {
>>>> 	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>> 	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>> @@ -3765,13 +3757,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>> 		 * is taken to serialise against parallel split or collapse
>>>> 		 * operations.
>>>> 		 */
>>>> -		anon_vma = folio_get_anon_vma(folio);
>>>> -		if (!anon_vma) {
>>>> -			ret = -EBUSY;
>>>> -			goto out;
>>>> +		if (!unmapped) {
>>>> +			anon_vma = folio_get_anon_vma(folio);
>>>> +			if (!anon_vma) {
>>>> +				ret = -EBUSY;
>>>> +				goto out;
>>>> +			}
>>>> +			anon_vma_lock_write(anon_vma);
>>>> 		}
>>>> 		mapping = NULL;
>>>> -		anon_vma_lock_write(anon_vma);
>>>> 	} else {
>>>> 		unsigned int min_order;
>>>> 		gfp_t gfp;
>>>> @@ -3838,7 +3832,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>> 		goto out_unlock;
>>>> 	}
>>>>
>>>> -	unmap_folio(folio);
>>>> +	if (!unmapped)
>>>> +		unmap_folio(folio);
>>>>
>>>> 	/* block interrupt reentry in xa_lock and spinlock */
>>>> 	local_irq_disable();
>>>> @@ -3925,10 +3920,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>
>>>> 			next = folio_next(new_folio);
>>>>
>>>> +			zone_device_private_split_cb(folio, new_folio);
>>>> +
>>>> 			expected_refs = folio_expected_ref_count(new_folio) + 1;
>>>> 			folio_ref_unfreeze(new_folio, expected_refs);
>>>>
>>>> -			lru_add_split_folio(folio, new_folio, lruvec, list);
>>>> +			if (!unmapped)
>>>> +				lru_add_split_folio(folio, new_folio, lruvec, list);
>>>>
>>>> 			/*
>>>> 			 * Anonymous folio with swap cache.
>>>> @@ -3959,6 +3957,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>> 			__filemap_remove_folio(new_folio, NULL);
>>>> 			folio_put_refs(new_folio, nr_pages);
>>>> 		}
>>>> +
>>>> +		zone_device_private_split_cb(folio, NULL);
>>>> 		/*
>>>> 		 * Unfreeze @folio only after all page cache entries, which
>>>> 		 * used to point to it, have been updated with new folios.
>>>> @@ -3982,6 +3982,9 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>
>>>> 	local_irq_enable();
>>>>
>>>> +	if (unmapped)
>>>> +		return ret;
>>>
>>> As the comments on __folio_split() and __split_huge_page_to_list_to_order()
>>> mention:
>>>
>>>   * The large folio must be locked
>>>   * After splitting, the after-split folio containing @lock_at remains locked
>>>
>>> But here we seem to change the prerequisites.
>>>
>>> Hmm.. I am not sure this is correct.
>>>
>>
>> The code is correct, but you are right in that the documentation needs to be updated.
>> When "unmapped", we do want to leave the folios locked after the split.
> 
> Sigh, this "unmapped" code needs so many special branches and a different locking
> requirement. It should be a separate function to avoid confusion.
> 

Yep, I have a patch for it. I am also waiting on Matthew's feedback. FYI, here is
a WIP patch that can be applied on top of the series:

---
 include/linux/huge_mm.h |   5 +-
 mm/huge_memory.c        | 137 ++++++++++++++++++++++++++++++++++------
 mm/migrate_device.c     |   3 +-
 3 files changed, 120 insertions(+), 25 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index c4a811958cda..86e1cefaf391 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -366,7 +366,8 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
 
 bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
 int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
-		unsigned int new_order, bool unmapped);
+		unsigned int new_order);
+int split_unmapped_folio_to_order(struct folio *folio, unsigned int new_order);
 int min_order_for_split(struct folio *folio);
 int split_folio_to_list(struct folio *folio, struct list_head *list);
 bool uniform_split_supported(struct folio *folio, unsigned int new_order,
@@ -379,7 +380,7 @@ int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
 static inline int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
 		unsigned int new_order)
 {
-	return __split_huge_page_to_list_to_order(page, list, new_order, false);
+	return __split_huge_page_to_list_to_order(page, list, new_order);
 }
 
 /*
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 8c82a0ac6e69..e20cbf68d037 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3711,7 +3711,6 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
  * @lock_at: a page within @folio to be left locked to caller
  * @list: after-split folios will be put on it if non NULL
  * @uniform_split: perform uniform split or not (non-uniform split)
- * @unmapped: The pages are already unmapped, they are migration entries.
  *
  * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
  * It is in charge of checking whether the split is supported or not and
@@ -3727,7 +3726,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
  */
 static int __folio_split(struct folio *folio, unsigned int new_order,
 		struct page *split_at, struct page *lock_at,
-		struct list_head *list, bool uniform_split, bool unmapped)
+		struct list_head *list, bool uniform_split)
 {
 	struct deferred_split *ds_queue;
 	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
@@ -3777,14 +3776,12 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 		 * is taken to serialise against parallel split or collapse
 		 * operations.
 		 */
-		if (!unmapped) {
-			anon_vma = folio_get_anon_vma(folio);
-			if (!anon_vma) {
-				ret = -EBUSY;
-				goto out;
-			}
-			anon_vma_lock_write(anon_vma);
+		anon_vma = folio_get_anon_vma(folio);
+		if (!anon_vma) {
+			ret = -EBUSY;
+			goto out;
 		}
+		anon_vma_lock_write(anon_vma);
 		mapping = NULL;
 	} else {
 		unsigned int min_order;
@@ -3852,8 +3849,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 		goto out_unlock;
 	}
 
-	if (!unmapped)
-		unmap_folio(folio);
+	unmap_folio(folio);
 
 	/* block interrupt reentry in xa_lock and spinlock */
 	local_irq_disable();
@@ -3954,8 +3950,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 			expected_refs = folio_expected_ref_count(new_folio) + 1;
 			folio_ref_unfreeze(new_folio, expected_refs);
 
-			if (!unmapped)
-				lru_add_split_folio(folio, new_folio, lruvec, list);
+			lru_add_split_folio(folio, new_folio, lruvec, list);
 
 			/*
 			 * Anonymous folio with swap cache.
@@ -4011,9 +4006,6 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 
 	local_irq_enable();
 
-	if (unmapped)
-		return ret;
-
 	if (nr_shmem_dropped)
 		shmem_uncharge(mapping->host, nr_shmem_dropped);
 
@@ -4057,6 +4049,111 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 	return ret;
 }
 
+/*
+ * This function is a helper for splitting folios that have already been unmapped.
+ * The use case is that the device or the CPU can refuse to migrate THP pages in
+ * the middle of migration, due to allocation issues on either side
+ *
+ * The high level code is copied from __folio_split, since the pages are anonymous
+ * and are already isolated from the LRU, the code has been simplified to not
+ * burden __folio_split with unmapped sprinkled into the code.
+ *
+ * None of the split folios are unlocked
+ */
+int split_unmapped_folio_to_order(struct folio *folio, unsigned int new_order)
+{
+	int extra_pins;
+	int ret = 0;
+	struct folio *new_folio, *next;
+	struct folio *end_folio = folio_next(folio);
+	struct deferred_split *ds_queue;
+	int old_order = folio_order(folio);
+
+	VM_WARN_ON_FOLIO(folio_mapped(folio), folio);
+	VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
+	VM_WARN_ON_ONCE_FOLIO(!folio_test_large(folio), folio);
+
+	if (!can_split_folio(folio, 1, &extra_pins)) {
+		ret = -EAGAIN;
+		goto err;
+	}
+
+	local_irq_disable();
+	/* Prevent deferred_split_scan() touching ->_refcount */
+	ds_queue = folio_split_queue_lock(folio);
+	if (folio_ref_freeze(folio, 1 + extra_pins)) {
+		int expected_refs;
+		struct swap_cluster_info *ci = NULL;
+
+		if (old_order > 1) {
+			if (!list_empty(&folio->_deferred_list)) {
+				ds_queue->split_queue_len--;
+				/*
+				 * Reinitialize page_deferred_list after
+				 * removing the page from the split_queue,
+				 * otherwise a subsequent split will see list
+				 * corruption when checking the
+				 * page_deferred_list.
+				 */
+				list_del_init(&folio->_deferred_list);
+			}
+			if (folio_test_partially_mapped(folio)) {
+				folio_clear_partially_mapped(folio);
+				mod_mthp_stat(old_order,
+					MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
+			}
+			/*
+			 * Reinitialize page_deferred_list after removing the
+			 * page from the split_queue, otherwise a subsequent
+			 * split will see list corruption when checking the
+			 * page_deferred_list.
+			 */
+			list_del_init(&folio->_deferred_list);
+		}
+		split_queue_unlock(ds_queue);
+
+		if (folio_test_swapcache(folio))
+			ci = swap_cluster_get_and_lock(folio);
+
+		ret = __split_unmapped_folio(folio, new_order, &folio->page,
+					     NULL, NULL, true);
+
+		/*
+		 * Unfreeze after-split folios
+		 */
+		for (new_folio = folio_next(folio); new_folio != end_folio;
+		     new_folio = next) {
+			next = folio_next(new_folio);
+
+			zone_device_private_split_cb(folio, new_folio);
+
+			expected_refs = folio_expected_ref_count(new_folio) + 1;
+			folio_ref_unfreeze(new_folio, expected_refs);
+			if (ci)
+				__swap_cache_replace_folio(ci, folio, new_folio);
+		}
+
+		zone_device_private_split_cb(folio, NULL);
+		/*
+		 * Unfreeze @folio only after all page cache entries, which
+		 * used to point to it, have been updated with new folios.
+		 * Otherwise, a parallel folio_try_get() can grab @folio
+		 * and its caller can see stale page cache entries.
+		 */
+		expected_refs = folio_expected_ref_count(folio) + 1;
+		folio_ref_unfreeze(folio, expected_refs);
+
+		if (ci)
+			swap_cluster_unlock(ci);
+	} else {
+		split_queue_unlock(ds_queue);
+		ret = -EAGAIN;
+	}
+	local_irq_enable();
+err:
+	return ret;
+}
+
 /*
  * This function splits a large folio into smaller folios of order @new_order.
  * @page can point to any page of the large folio to split. The split operation
@@ -4105,12 +4202,11 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
  * with the folio. Splitting to order 0 is compatible with all folios.
  */
 int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
-				     unsigned int new_order, bool unmapped)
+				     unsigned int new_order)
 {
 	struct folio *folio = page_folio(page);
 
-	return __folio_split(folio, new_order, &folio->page, page, list, true,
-				unmapped);
+	return __folio_split(folio, new_order, &folio->page, page, list, true);
 }
 
 /*
@@ -4138,8 +4234,7 @@ int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list
 int folio_split(struct folio *folio, unsigned int new_order,
 		struct page *split_at, struct list_head *list)
 {
-	return __folio_split(folio, new_order, split_at, &folio->page, list,
-			false, false);
+	return __folio_split(folio, new_order, split_at, &folio->page, list, false);
 }
 
 int min_order_for_split(struct folio *folio)
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index c869b272e85a..23515f3ffc35 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -918,8 +918,7 @@ static int migrate_vma_split_unmapped_folio(struct migrate_vma *migrate,
 
 	folio_get(folio);
 	split_huge_pmd_address(migrate->vma, addr, true);
-	ret = __split_huge_page_to_list_to_order(folio_page(folio, 0), NULL,
-							0, true);
+	ret = split_unmapped_folio_to_order(folio, 0);
 	if (ret)
 		return ret;
 	migrate->src[idx] &= ~MIGRATE_PFN_COMPOUND;
-- 
2.51.0
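
The freeze/unfreeze step that split_unmapped_folio_to_order() relies on can be illustrated with a small userspace model. This is only a sketch under stated assumptions: it models the reference count as a C11 atomic, and every name in it (`model_folio`, `model_ref_freeze`, `model_ref_unfreeze`, `model_try_get`) is hypothetical rather than a kernel API. It shows why the split is attempted only when the count matches the expected value, and why a concurrent speculative lookup cannot take a reference while the count is frozen at zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a folio's reference count. */
struct model_folio {
	atomic_int refcount;
};

/*
 * Freeze succeeds only when the count equals what the caller expects.
 * On success the count is atomically replaced with 0, so a
 * "get unless zero" lookup cannot take a new reference meanwhile.
 */
static bool model_ref_freeze(struct model_folio *f, int expected)
{
	return atomic_compare_exchange_strong(&f->refcount, &expected, 0);
}

/* Unfreeze publishes the new count once the split work is complete. */
static void model_ref_unfreeze(struct model_folio *f, int count)
{
	atomic_store_explicit(&f->refcount, count, memory_order_release);
}

/* A speculative lookup takes a reference only if the count is non-zero. */
static bool model_try_get(struct model_folio *f)
{
	int old = atomic_load(&f->refcount);

	while (old != 0) {
		/* On failure, old is reloaded with the current count. */
		if (atomic_compare_exchange_weak(&f->refcount, &old, old + 1))
			return true;
	}
	return false;
}
```

In this model, a failed freeze corresponds to the -EAGAIN path in the patch: somebody else holds an unexpected reference, so the caller backs off without modifying the folio.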

Re: [v7 11/16] mm/migrate_device: add THP splitting during migration

Posted by Zi Yan 3 months, 2 weeks ago

On 21 Oct 2025, at 17:34, Balbir Singh wrote:

> On 10/20/25 09:59, Zi Yan wrote:
>> On 19 Oct 2025, at 18:49, Balbir Singh wrote:
>>
>>> On 10/19/25 19:19, Wei Yang wrote:
>>>> On Wed, Oct 01, 2025 at 04:57:02PM +1000, Balbir Singh wrote:
>>>> [...]
>>>>> static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>> 		struct page *split_at, struct page *lock_at,
>>>>> -		struct list_head *list, bool uniform_split)
>>>>> +		struct list_head *list, bool uniform_split, bool unmapped)
>>>>> {
>>>>> 	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>>> 	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>>> @@ -3765,13 +3757,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>> 		 * is taken to serialise against parallel split or collapse
>>>>> 		 * operations.
>>>>> 		 */
>>>>> -		anon_vma = folio_get_anon_vma(folio);
>>>>> -		if (!anon_vma) {
>>>>> -			ret = -EBUSY;
>>>>> -			goto out;
>>>>> +		if (!unmapped) {
>>>>> +			anon_vma = folio_get_anon_vma(folio);
>>>>> +			if (!anon_vma) {
>>>>> +				ret = -EBUSY;
>>>>> +				goto out;
>>>>> +			}
>>>>> +			anon_vma_lock_write(anon_vma);
>>>>> 		}
>>>>> 		mapping = NULL;
>>>>> -		anon_vma_lock_write(anon_vma);
>>>>> 	} else {
>>>>> 		unsigned int min_order;
>>>>> 		gfp_t gfp;
>>>>> @@ -3838,7 +3832,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>> 		goto out_unlock;
>>>>> 	}
>>>>>
>>>>> -	unmap_folio(folio);
>>>>> +	if (!unmapped)
>>>>> +		unmap_folio(folio);
>>>>>
>>>>> 	/* block interrupt reentry in xa_lock and spinlock */
>>>>> 	local_irq_disable();
>>>>> @@ -3925,10 +3920,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>
>>>>> 			next = folio_next(new_folio);
>>>>>
>>>>> +			zone_device_private_split_cb(folio, new_folio);
>>>>> +
>>>>> 			expected_refs = folio_expected_ref_count(new_folio) + 1;
>>>>> 			folio_ref_unfreeze(new_folio, expected_refs);
>>>>>
>>>>> -			lru_add_split_folio(folio, new_folio, lruvec, list);
>>>>> +			if (!unmapped)
>>>>> +				lru_add_split_folio(folio, new_folio, lruvec, list);
>>>>>
>>>>> 			/*
>>>>> 			 * Anonymous folio with swap cache.
>>>>> @@ -3959,6 +3957,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>> 			__filemap_remove_folio(new_folio, NULL);
>>>>> 			folio_put_refs(new_folio, nr_pages);
>>>>> 		}
>>>>> +
>>>>> +		zone_device_private_split_cb(folio, NULL);
>>>>> 		/*
>>>>> 		 * Unfreeze @folio only after all page cache entries, which
>>>>> 		 * used to point to it, have been updated with new folios.
>>>>> @@ -3982,6 +3982,9 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>
>>>>> 	local_irq_enable();
>>>>>
>>>>> +	if (unmapped)
>>>>> +		return ret;
>>>>
>>>> As the comment of __folio_split() and __split_huge_page_to_list_to_order()
>>>> mentioned:
>>>>
>>>>   * The large folio must be locked
>>>>   * After splitting, the after-split folio containing @lock_at remains locked
>>>>
>>>> But here we seem to change the prerequisites.
>>>>
>>>> Hmm.. I am not sure this is correct.
>>>>
>>>
>>> The code is correct, but you are right in that the documentation needs to be updated.
>>> When "unmapped", we do want to leave the folios locked after the split.
>>
>> Sigh, this "unmapped" code needs so many special branches and a different locking
>> requirement. It should be a separate function to avoid confusion.
>>
>
> Yep, I have a patch for it. I am also waiting on Matthew's feedback. FYI, here is
> a WIP patch that can be applied on top of the series:

Nice cleanup! Thanks.

>
> ---
>  include/linux/huge_mm.h |   5 +-
>  mm/huge_memory.c        | 137 ++++++++++++++++++++++++++++++++++------
>  mm/migrate_device.c     |   3 +-
>  3 files changed, 120 insertions(+), 25 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index c4a811958cda..86e1cefaf391 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -366,7 +366,8 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>
>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>  int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> -		unsigned int new_order, bool unmapped);
> +		unsigned int new_order);
> +int split_unmapped_folio_to_order(struct folio *folio, unsigned int new_order);
>  int min_order_for_split(struct folio *folio);
>  int split_folio_to_list(struct folio *folio, struct list_head *list);
>  bool uniform_split_supported(struct folio *folio, unsigned int new_order,
> @@ -379,7 +380,7 @@ int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
>  static inline int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>  		unsigned int new_order)
>  {
> -	return __split_huge_page_to_list_to_order(page, list, new_order, false);
> +	return __split_huge_page_to_list_to_order(page, list, new_order);
>  }
>
>  /*
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 8c82a0ac6e69..e20cbf68d037 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3711,7 +3711,6 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>   * @lock_at: a page within @folio to be left locked to caller
>   * @list: after-split folios will be put on it if non NULL
>   * @uniform_split: perform uniform split or not (non-uniform split)
> - * @unmapped: The pages are already unmapped, they are migration entries.
>   *
>   * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>   * It is in charge of checking whether the split is supported or not and
> @@ -3727,7 +3726,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>   */
>  static int __folio_split(struct folio *folio, unsigned int new_order,
>  		struct page *split_at, struct page *lock_at,
> -		struct list_head *list, bool uniform_split, bool unmapped)
> +		struct list_head *list, bool uniform_split)
>  {
>  	struct deferred_split *ds_queue;
>  	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
> @@ -3777,14 +3776,12 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>  		 * is taken to serialise against parallel split or collapse
>  		 * operations.
>  		 */
> -		if (!unmapped) {
> -			anon_vma = folio_get_anon_vma(folio);
> -			if (!anon_vma) {
> -				ret = -EBUSY;
> -				goto out;
> -			}
> -			anon_vma_lock_write(anon_vma);
> +		anon_vma = folio_get_anon_vma(folio);
> +		if (!anon_vma) {
> +			ret = -EBUSY;
> +			goto out;
>  		}
> +		anon_vma_lock_write(anon_vma);
>  		mapping = NULL;
>  	} else {
>  		unsigned int min_order;
> @@ -3852,8 +3849,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>  		goto out_unlock;
>  	}
>
> -	if (!unmapped)
> -		unmap_folio(folio);
> +	unmap_folio(folio);
>
>  	/* block interrupt reentry in xa_lock and spinlock */
>  	local_irq_disable();
> @@ -3954,8 +3950,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>  			expected_refs = folio_expected_ref_count(new_folio) + 1;
>  			folio_ref_unfreeze(new_folio, expected_refs);
>
> -			if (!unmapped)
> -				lru_add_split_folio(folio, new_folio, lruvec, list);
> +			lru_add_split_folio(folio, new_folio, lruvec, list);
>
>  			/*
>  			 * Anonymous folio with swap cache.
> @@ -4011,9 +4006,6 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>
>  	local_irq_enable();
>
> -	if (unmapped)
> -		return ret;
> -
>  	if (nr_shmem_dropped)
>  		shmem_uncharge(mapping->host, nr_shmem_dropped);
>
> @@ -4057,6 +4049,111 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>  	return ret;
>  }
>
> +/*
> + * This function is a helper for splitting folios that have already been unmapped.
> + * The use case is that the device or the CPU can refuse to migrate THP pages in
> + * the middle of migration, due to allocation issues on either side
> + *
> + * The high level code is copied from __folio_split, since the pages are anonymous
> + * and are already isolated from the LRU, the code has been simplified to not
> + * burden __folio_split with unmapped sprinkled into the code.

I wonder if it makes sense to remove the CPU-side folio from both the deferred_split queue
and the swap cache before migration, to further simplify split_unmapped_folio_to_order().
Basically, require that device-private folios be on neither the deferred_split queue nor
the swap cache.

> + *
> + * None of the split folios are unlocked
> + */
> +int split_unmapped_folio_to_order(struct folio *folio, unsigned int new_order)
> +{
> +	int extra_pins;
> +	int ret = 0;
> +	struct folio *new_folio, *next;
> +	struct folio *end_folio = folio_next(folio);
> +	struct deferred_split *ds_queue;
> +	int old_order = folio_order(folio);
> +
> +	VM_WARN_ON_FOLIO(folio_mapped(folio), folio);
> +	VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
> +	VM_WARN_ON_ONCE_FOLIO(!folio_test_large(folio), folio);
> +
> +	if (!can_split_folio(folio, 1, &extra_pins)) {
> +		ret = -EAGAIN;
> +		goto err;
> +	}
> +
> +	local_irq_disable();
> +	/* Prevent deferred_split_scan() touching ->_refcount */
> +	ds_queue = folio_split_queue_lock(folio);
> +	if (folio_ref_freeze(folio, 1 + extra_pins)) {
> +		int expected_refs;
> +		struct swap_cluster_info *ci = NULL;
> +
> +		if (old_order > 1) {
> +			if (!list_empty(&folio->_deferred_list)) {
> +				ds_queue->split_queue_len--;
> +				/*
> +				 * Reinitialize page_deferred_list after
> +				 * removing the page from the split_queue,
> +				 * otherwise a subsequent split will see list
> +				 * corruption when checking the
> +				 * page_deferred_list.
> +				 */
> +				list_del_init(&folio->_deferred_list);
> +			}
> +			if (folio_test_partially_mapped(folio)) {
> +				folio_clear_partially_mapped(folio);
> +				mod_mthp_stat(old_order,
> +					MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
> +			}
> +			/*
> +			 * Reinitialize page_deferred_list after removing the
> +			 * page from the split_queue, otherwise a subsequent
> +			 * split will see list corruption when checking the
> +			 * page_deferred_list.
> +			 */
> +			list_del_init(&folio->_deferred_list);
> +		}
> +		split_queue_unlock(ds_queue);
> +
> +		if (folio_test_swapcache(folio))
> +			ci = swap_cluster_get_and_lock(folio);
> +
> +		ret = __split_unmapped_folio(folio, new_order, &folio->page,
> +					     NULL, NULL, true);
> +
> +		/*
> +		 * Unfreeze after-split folios
> +		 */
> +		for (new_folio = folio_next(folio); new_folio != end_folio;
> +		     new_folio = next) {
> +			next = folio_next(new_folio);
> +
> +			zone_device_private_split_cb(folio, new_folio);
> +
> +			expected_refs = folio_expected_ref_count(new_folio) + 1;
> +			folio_ref_unfreeze(new_folio, expected_refs);
> +			if (ci)
> +				__swap_cache_replace_folio(ci, folio, new_folio);
> +		}
> +
> +		zone_device_private_split_cb(folio, NULL);
> +		/*
> +		 * Unfreeze @folio only after all page cache entries, which
> +		 * used to point to it, have been updated with new folios.
> +		 * Otherwise, a parallel folio_try_get() can grab @folio
> +		 * and its caller can see stale page cache entries.
> +		 */
> +		expected_refs = folio_expected_ref_count(folio) + 1;
> +		folio_ref_unfreeze(folio, expected_refs);
> +
> +		if (ci)
> +			swap_cluster_unlock(ci);
> +	} else {
> +		split_queue_unlock(ds_queue);
> +		ret = -EAGAIN;
> +	}
> +	local_irq_enable();
> +err:
> +	return ret;
> +}
> +
>  /*
>   * This function splits a large folio into smaller folios of order @new_order.
>   * @page can point to any page of the large folio to split. The split operation
> @@ -4105,12 +4202,11 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>   * with the folio. Splitting to order 0 is compatible with all folios.
>   */
>  int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> -				     unsigned int new_order, bool unmapped)
> +				     unsigned int new_order)
>  {
>  	struct folio *folio = page_folio(page);
>
> -	return __folio_split(folio, new_order, &folio->page, page, list, true,
> -				unmapped);
> +	return __folio_split(folio, new_order, &folio->page, page, list, true);
>  }
>
>  /*
> @@ -4138,8 +4234,7 @@ int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list
>  int folio_split(struct folio *folio, unsigned int new_order,
>  		struct page *split_at, struct list_head *list)
>  {
> -	return __folio_split(folio, new_order, split_at, &folio->page, list,
> -			false, false);
> +	return __folio_split(folio, new_order, split_at, &folio->page, list, false);
>  }
>
>  int min_order_for_split(struct folio *folio)
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index c869b272e85a..23515f3ffc35 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -918,8 +918,7 @@ static int migrate_vma_split_unmapped_folio(struct migrate_vma *migrate,
>
>  	folio_get(folio);
>  	split_huge_pmd_address(migrate->vma, addr, true);
> -	ret = __split_huge_page_to_list_to_order(folio_page(folio, 0), NULL,
> -							0, true);
> +	ret = split_unmapped_folio_to_order(folio, 0);
>  	if (ret)
>  		return ret;
>  	migrate->src[idx] &= ~MIGRATE_PFN_COMPOUND;
> -- 
> 2.51.0


--
Best Regards,
Yan, Zi

Re: [v7 11/16] mm/migrate_device: add THP splitting during migration

Posted by Balbir Singh 3 months, 2 weeks ago

On 10/22/25 13:59, Zi Yan wrote:
> On 21 Oct 2025, at 17:34, Balbir Singh wrote:
> 
>> On 10/20/25 09:59, Zi Yan wrote:
>>> On 19 Oct 2025, at 18:49, Balbir Singh wrote:
>>>
>>>> On 10/19/25 19:19, Wei Yang wrote:
>>>>> On Wed, Oct 01, 2025 at 04:57:02PM +1000, Balbir Singh wrote:
>>>>> [...]
>>>>>> static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>> 		struct page *split_at, struct page *lock_at,
>>>>>> -		struct list_head *list, bool uniform_split)
>>>>>> +		struct list_head *list, bool uniform_split, bool unmapped)
>>>>>> {
>>>>>> 	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>>>> 	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>>>> @@ -3765,13 +3757,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>> 		 * is taken to serialise against parallel split or collapse
>>>>>> 		 * operations.
>>>>>> 		 */
>>>>>> -		anon_vma = folio_get_anon_vma(folio);
>>>>>> -		if (!anon_vma) {
>>>>>> -			ret = -EBUSY;
>>>>>> -			goto out;
>>>>>> +		if (!unmapped) {
>>>>>> +			anon_vma = folio_get_anon_vma(folio);
>>>>>> +			if (!anon_vma) {
>>>>>> +				ret = -EBUSY;
>>>>>> +				goto out;
>>>>>> +			}
>>>>>> +			anon_vma_lock_write(anon_vma);
>>>>>> 		}
>>>>>> 		mapping = NULL;
>>>>>> -		anon_vma_lock_write(anon_vma);
>>>>>> 	} else {
>>>>>> 		unsigned int min_order;
>>>>>> 		gfp_t gfp;
>>>>>> @@ -3838,7 +3832,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>> 		goto out_unlock;
>>>>>> 	}
>>>>>>
>>>>>> -	unmap_folio(folio);
>>>>>> +	if (!unmapped)
>>>>>> +		unmap_folio(folio);
>>>>>>
>>>>>> 	/* block interrupt reentry in xa_lock and spinlock */
>>>>>> 	local_irq_disable();
>>>>>> @@ -3925,10 +3920,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>
>>>>>> 			next = folio_next(new_folio);
>>>>>>
>>>>>> +			zone_device_private_split_cb(folio, new_folio);
>>>>>> +
>>>>>> 			expected_refs = folio_expected_ref_count(new_folio) + 1;
>>>>>> 			folio_ref_unfreeze(new_folio, expected_refs);
>>>>>>
>>>>>> -			lru_add_split_folio(folio, new_folio, lruvec, list);
>>>>>> +			if (!unmapped)
>>>>>> +				lru_add_split_folio(folio, new_folio, lruvec, list);
>>>>>>
>>>>>> 			/*
>>>>>> 			 * Anonymous folio with swap cache.
>>>>>> @@ -3959,6 +3957,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>> 			__filemap_remove_folio(new_folio, NULL);
>>>>>> 			folio_put_refs(new_folio, nr_pages);
>>>>>> 		}
>>>>>> +
>>>>>> +		zone_device_private_split_cb(folio, NULL);
>>>>>> 		/*
>>>>>> 		 * Unfreeze @folio only after all page cache entries, which
>>>>>> 		 * used to point to it, have been updated with new folios.
>>>>>> @@ -3982,6 +3982,9 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>
>>>>>> 	local_irq_enable();
>>>>>>
>>>>>> +	if (unmapped)
>>>>>> +		return ret;
>>>>>
>>>>> As the comment of __folio_split() and __split_huge_page_to_list_to_order()
>>>>> mentioned:
>>>>>
>>>>>   * The large folio must be locked
>>>>>   * After splitting, the after-split folio containing @lock_at remains locked
>>>>>
>>>>> But here we seem to change the prerequisites.
>>>>>
>>>>> Hmm.. I am not sure this is correct.
>>>>>
>>>>
>>>> The code is correct, but you are right in that the documentation needs to be updated.
>>>> When "unmapped", we do want to leave the folios locked after the split.
>>>
>>> Sigh, this "unmapped" code needs so many special branches and a different locking
>>> requirement. It should be a separate function to avoid confusion.
>>>
>>
>> Yep, I have a patch for it. I am also waiting on Matthew's feedback. FYI, here is
>> a WIP patch that can be applied on top of the series:
> 
> Nice cleanup! Thanks.
> 
>>
>> ---
>>  include/linux/huge_mm.h |   5 +-
>>  mm/huge_memory.c        | 137 ++++++++++++++++++++++++++++++++++------
>>  mm/migrate_device.c     |   3 +-
>>  3 files changed, 120 insertions(+), 25 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index c4a811958cda..86e1cefaf391 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -366,7 +366,8 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>
>>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>  int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> -		unsigned int new_order, bool unmapped);
>> +		unsigned int new_order);
>> +int split_unmapped_folio_to_order(struct folio *folio, unsigned int new_order);
>>  int min_order_for_split(struct folio *folio);
>>  int split_folio_to_list(struct folio *folio, struct list_head *list);
>>  bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>> @@ -379,7 +380,7 @@ int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
>>  static inline int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>  		unsigned int new_order)
>>  {
>> -	return __split_huge_page_to_list_to_order(page, list, new_order, false);
>> +	return __split_huge_page_to_list_to_order(page, list, new_order);
>>  }
>>
>>  /*
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 8c82a0ac6e69..e20cbf68d037 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3711,7 +3711,6 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>   * @lock_at: a page within @folio to be left locked to caller
>>   * @list: after-split folios will be put on it if non NULL
>>   * @uniform_split: perform uniform split or not (non-uniform split)
>> - * @unmapped: The pages are already unmapped, they are migration entries.
>>   *
>>   * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>   * It is in charge of checking whether the split is supported or not and
>> @@ -3727,7 +3726,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>   */
>>  static int __folio_split(struct folio *folio, unsigned int new_order,
>>  		struct page *split_at, struct page *lock_at,
>> -		struct list_head *list, bool uniform_split, bool unmapped)
>> +		struct list_head *list, bool uniform_split)
>>  {
>>  	struct deferred_split *ds_queue;
>>  	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>> @@ -3777,14 +3776,12 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>  		 * is taken to serialise against parallel split or collapse
>>  		 * operations.
>>  		 */
>> -		if (!unmapped) {
>> -			anon_vma = folio_get_anon_vma(folio);
>> -			if (!anon_vma) {
>> -				ret = -EBUSY;
>> -				goto out;
>> -			}
>> -			anon_vma_lock_write(anon_vma);
>> +		anon_vma = folio_get_anon_vma(folio);
>> +		if (!anon_vma) {
>> +			ret = -EBUSY;
>> +			goto out;
>>  		}
>> +		anon_vma_lock_write(anon_vma);
>>  		mapping = NULL;
>>  	} else {
>>  		unsigned int min_order;
>> @@ -3852,8 +3849,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>  		goto out_unlock;
>>  	}
>>
>> -	if (!unmapped)
>> -		unmap_folio(folio);
>> +	unmap_folio(folio);
>>
>>  	/* block interrupt reentry in xa_lock and spinlock */
>>  	local_irq_disable();
>> @@ -3954,8 +3950,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>  			expected_refs = folio_expected_ref_count(new_folio) + 1;
>>  			folio_ref_unfreeze(new_folio, expected_refs);
>>
>> -			if (!unmapped)
>> -				lru_add_split_folio(folio, new_folio, lruvec, list);
>> +			lru_add_split_folio(folio, new_folio, lruvec, list);
>>
>>  			/*
>>  			 * Anonymous folio with swap cache.
>> @@ -4011,9 +4006,6 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>
>>  	local_irq_enable();
>>
>> -	if (unmapped)
>> -		return ret;
>> -
>>  	if (nr_shmem_dropped)
>>  		shmem_uncharge(mapping->host, nr_shmem_dropped);
>>
>> @@ -4057,6 +4049,111 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>  	return ret;
>>  }
>>
>> +/*
>> + * This function is a helper for splitting folios that have already been unmapped.
>> + * The use case is that the device or the CPU can refuse to migrate THP pages in
>> + * the middle of migration, due to allocation issues on either side
>> + *
>> + * The high level code is copied from __folio_split, since the pages are anonymous
>> + * and are already isolated from the LRU, the code has been simplified to not
>> + * burden __folio_split with unmapped sprinkled into the code.
> 
> I wonder if it makes sense to remove the CPU-side folio from both the deferred_split queue
> and the swap cache before migration, to further simplify split_unmapped_folio_to_order().
> Basically, require that device-private folios be on neither the deferred_split queue nor
> the swap cache.
> 

This API can also be called for folios that are not device-private. Device-private
folios are already not placed on the deferred split queue. The use case is:

1. A large folio is being migrated from the CPU to the device.
2. SRC: the CPU side has a THP (large folio).
3. DST: the device cannot allocate a large page, so the SRC folio has to be split.
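
The fallback in the use case above can be modelled in a few lines of userspace C. This is only an illustrative sketch, not the kernel code: `MODEL_PFN_COMPOUND` is a hypothetical stand-in for MIGRATE_PFN_COMPOUND, `model_migrate` and `model_prepare_src()` are made-up names, and the 512-subpage count assumes a 2 MiB THP built from 4 KiB base pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag standing in for MIGRATE_PFN_COMPOUND. */
#define MODEL_PFN_COMPOUND (1UL << 0)
/* Assumes a 2 MiB THP made of 4 KiB base pages. */
#define MODEL_NR_SUBPAGES 512

struct model_migrate {
	unsigned long src[MODEL_NR_SUBPAGES];
	size_t npages; /* number of valid entries in src[] */
};

/*
 * Model of the fallback: when the destination cannot provide a large
 * page, the source entry is split into base-page entries and the
 * compound flag is dropped. Returns the number of source entries
 * that will be migrated.
 */
static size_t model_prepare_src(struct model_migrate *m, bool dst_has_large_page)
{
	if (!(m->src[0] & MODEL_PFN_COMPOUND))
		return m->npages; /* SRC is already base pages */

	if (dst_has_large_page)
		return m->npages; /* migrate as a single compound entry */

	/* DST allocation failed: split SRC and clear the compound flag. */
	unsigned long base = m->src[0] & ~MODEL_PFN_COMPOUND;

	for (size_t i = 0; i < MODEL_NR_SUBPAGES; i++)
		m->src[i] = base;
	m->npages = MODEL_NR_SUBPAGES;
	return m->npages;
}
```

The point of the model is that the decision is made per source entry at migration time, after setup has already succeeded with a compound entry, which is why the split has to work on a folio that is already unmapped and isolated.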


[...]


Thanks for the review!
Balbir

Re: [v7 11/16] mm/migrate_device: add THP splitting during migration

Posted by Zi Yan 3 months, 2 weeks ago

On 22 Oct 2025, at 3:16, Balbir Singh wrote:

> On 10/22/25 13:59, Zi Yan wrote:
>> On 21 Oct 2025, at 17:34, Balbir Singh wrote:
>>
>>> On 10/20/25 09:59, Zi Yan wrote:
>>>> On 19 Oct 2025, at 18:49, Balbir Singh wrote:
>>>>
>>>>> On 10/19/25 19:19, Wei Yang wrote:
>>>>>> On Wed, Oct 01, 2025 at 04:57:02PM +1000, Balbir Singh wrote:
>>>>>> [...]
>>>>>>> static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>> 		struct page *split_at, struct page *lock_at,
>>>>>>> -		struct list_head *list, bool uniform_split)
>>>>>>> +		struct list_head *list, bool uniform_split, bool unmapped)
>>>>>>> {
>>>>>>> 	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>>>>> 	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>>>>> @@ -3765,13 +3757,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>> 		 * is taken to serialise against parallel split or collapse
>>>>>>> 		 * operations.
>>>>>>> 		 */
>>>>>>> -		anon_vma = folio_get_anon_vma(folio);
>>>>>>> -		if (!anon_vma) {
>>>>>>> -			ret = -EBUSY;
>>>>>>> -			goto out;
>>>>>>> +		if (!unmapped) {
>>>>>>> +			anon_vma = folio_get_anon_vma(folio);
>>>>>>> +			if (!anon_vma) {
>>>>>>> +				ret = -EBUSY;
>>>>>>> +				goto out;
>>>>>>> +			}
>>>>>>> +			anon_vma_lock_write(anon_vma);
>>>>>>> 		}
>>>>>>> 		mapping = NULL;
>>>>>>> -		anon_vma_lock_write(anon_vma);
>>>>>>> 	} else {
>>>>>>> 		unsigned int min_order;
>>>>>>> 		gfp_t gfp;
>>>>>>> @@ -3838,7 +3832,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>> 		goto out_unlock;
>>>>>>> 	}
>>>>>>>
>>>>>>> -	unmap_folio(folio);
>>>>>>> +	if (!unmapped)
>>>>>>> +		unmap_folio(folio);
>>>>>>>
>>>>>>> 	/* block interrupt reentry in xa_lock and spinlock */
>>>>>>> 	local_irq_disable();
>>>>>>> @@ -3925,10 +3920,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>
>>>>>>> 			next = folio_next(new_folio);
>>>>>>>
>>>>>>> +			zone_device_private_split_cb(folio, new_folio);
>>>>>>> +
>>>>>>> 			expected_refs = folio_expected_ref_count(new_folio) + 1;
>>>>>>> 			folio_ref_unfreeze(new_folio, expected_refs);
>>>>>>>
>>>>>>> -			lru_add_split_folio(folio, new_folio, lruvec, list);
>>>>>>> +			if (!unmapped)
>>>>>>> +				lru_add_split_folio(folio, new_folio, lruvec, list);
>>>>>>>
>>>>>>> 			/*
>>>>>>> 			 * Anonymous folio with swap cache.
>>>>>>> @@ -3959,6 +3957,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>> 			__filemap_remove_folio(new_folio, NULL);
>>>>>>> 			folio_put_refs(new_folio, nr_pages);
>>>>>>> 		}
>>>>>>> +
>>>>>>> +		zone_device_private_split_cb(folio, NULL);
>>>>>>> 		/*
>>>>>>> 		 * Unfreeze @folio only after all page cache entries, which
>>>>>>> 		 * used to point to it, have been updated with new folios.
>>>>>>> @@ -3982,6 +3982,9 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>
>>>>>>> 	local_irq_enable();
>>>>>>>
>>>>>>> +	if (unmapped)
>>>>>>> +		return ret;
>>>>>>
>>>>>> As the comment of __folio_split() and __split_huge_page_to_list_to_order()
>>>>>> mentioned:
>>>>>>
>>>>>>   * The large folio must be locked
>>>>>>   * After splitting, the after-split folio containing @lock_at remains locked
>>>>>>
>>>>>> But here we seem to change the prerequisites.
>>>>>>
>>>>>> Hmm.. I am not sure this is correct.
>>>>>>
>>>>>
>>>>> The code is correct, but you are right in that the documentation needs to be updated.
>>>>> When "unmapped", we do want to leave the folios locked after the split.
>>>>
>>>> Sigh, this "unmapped" code needs so many special branches and a different locking
>>>> requirement. It should be a separate function to avoid confusion.
>>>>
>>>
>>> Yep, I have a patch for it. I am also waiting on Matthew's feedback. FYI, here is
>>> a WIP patch that can be applied on top of the series:
>>
>> Nice cleanup! Thanks.
>>
>>>
>>> ---
>>>  include/linux/huge_mm.h |   5 +-
>>>  mm/huge_memory.c        | 137 ++++++++++++++++++++++++++++++++++------
>>>  mm/migrate_device.c     |   3 +-
>>>  3 files changed, 120 insertions(+), 25 deletions(-)
>>>
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index c4a811958cda..86e1cefaf391 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -366,7 +366,8 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>>
>>>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>>  int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>> -		unsigned int new_order, bool unmapped);
>>> +		unsigned int new_order);
>>> +int split_unmapped_folio_to_order(struct folio *folio, unsigned int new_order);
>>>  int min_order_for_split(struct folio *folio);
>>>  int split_folio_to_list(struct folio *folio, struct list_head *list);
>>>  bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>> @@ -379,7 +380,7 @@ int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
>>>  static inline int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>  		unsigned int new_order)
>>>  {
>>> -	return __split_huge_page_to_list_to_order(page, list, new_order, false);
>>> +	return __split_huge_page_to_list_to_order(page, list, new_order);
>>>  }
>>>
>>>  /*
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 8c82a0ac6e69..e20cbf68d037 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -3711,7 +3711,6 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>   * @lock_at: a page within @folio to be left locked to caller
>>>   * @list: after-split folios will be put on it if non NULL
>>>   * @uniform_split: perform uniform split or not (non-uniform split)
>>> - * @unmapped: The pages are already unmapped, they are migration entries.
>>>   *
>>>   * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>>   * It is in charge of checking whether the split is supported or not and
>>> @@ -3727,7 +3726,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>   */
>>>  static int __folio_split(struct folio *folio, unsigned int new_order,
>>>  		struct page *split_at, struct page *lock_at,
>>> -		struct list_head *list, bool uniform_split, bool unmapped)
>>> +		struct list_head *list, bool uniform_split)
>>>  {
>>>  	struct deferred_split *ds_queue;
>>>  	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>> @@ -3777,14 +3776,12 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>  		 * is taken to serialise against parallel split or collapse
>>>  		 * operations.
>>>  		 */
>>> -		if (!unmapped) {
>>> -			anon_vma = folio_get_anon_vma(folio);
>>> -			if (!anon_vma) {
>>> -				ret = -EBUSY;
>>> -				goto out;
>>> -			}
>>> -			anon_vma_lock_write(anon_vma);
>>> +		anon_vma = folio_get_anon_vma(folio);
>>> +		if (!anon_vma) {
>>> +			ret = -EBUSY;
>>> +			goto out;
>>>  		}
>>> +		anon_vma_lock_write(anon_vma);
>>>  		mapping = NULL;
>>>  	} else {
>>>  		unsigned int min_order;
>>> @@ -3852,8 +3849,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>  		goto out_unlock;
>>>  	}
>>>
>>> -	if (!unmapped)
>>> -		unmap_folio(folio);
>>> +	unmap_folio(folio);
>>>
>>>  	/* block interrupt reentry in xa_lock and spinlock */
>>>  	local_irq_disable();
>>> @@ -3954,8 +3950,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>  			expected_refs = folio_expected_ref_count(new_folio) + 1;
>>>  			folio_ref_unfreeze(new_folio, expected_refs);
>>>
>>> -			if (!unmapped)
>>> -				lru_add_split_folio(folio, new_folio, lruvec, list);
>>> +			lru_add_split_folio(folio, new_folio, lruvec, list);
>>>
>>>  			/*
>>>  			 * Anonymous folio with swap cache.
>>> @@ -4011,9 +4006,6 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>
>>>  	local_irq_enable();
>>>
>>> -	if (unmapped)
>>> -		return ret;
>>> -
>>>  	if (nr_shmem_dropped)
>>>  		shmem_uncharge(mapping->host, nr_shmem_dropped);
>>>
>>> @@ -4057,6 +4049,111 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>  	return ret;
>>>  }
>>>
>>> +/*
>>> + * This function is a helper for splitting folios that have already been unmapped.
>>> + * The use case is that the device or the CPU can refuse to migrate THP pages in
>>> + * the middle of migration, due to allocation issues on either side
>>> + *
>>> + * The high level code is copied from __folio_split, since the pages are anonymous
>>> + * and are already isolated from the LRU, the code has been simplified to not
>>> + * burden __folio_split with unmapped sprinkled into the code.
>>
>> I wonder if it makes sense to remove CPU side folio from both deferred_split queue
>> and swap cache before migration to further simplify split_unmapped_folio_to_order().
>> Basically require that device private folios cannot be on deferred_split queue nor
>> swap cache.
>>
>
> This API can be called for non-device private folios as well. Device private folios are
> already not on the deferred queue. The use case is
>
> 1. Migrate a large folio page from CPU to Device
> 2. SRC - CPU has a THP (large folio page)
> 3. DST - Device cannot allocate a large page, hence split the SRC page

Right, that is what I was talking about; sorry I was not clear.
I mean that when migrating a large folio from CPU to device, the CPU large folio
can first be removed from the deferred_split queue and the swap cache, if it is
there, before the migration process begins. That way the CPU large folio is always
off the deferred_split queue and out of the swap cache, and this split
function does not need to handle these two situations.
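
To make the suggestion concrete, here is a userspace toy model, not kernel code.
All names (struct model_folio, model_prepare_for_device_migration(),
model_split_unmapped()) are hypothetical and exist only for illustration. The
idea is that a preparation step establishes the precondition once, so the split
helper can simply reject anything that violates it instead of handling it.

```c
#include <stdbool.h>

/* Toy stand-in for the folio state relevant to this discussion (hypothetical) */
struct model_folio {
	bool on_deferred_split_queue;
	bool in_swap_cache;
	unsigned int order;
};

/* Hypothetical pre-migration step: detach the folio from both structures */
static void model_prepare_for_device_migration(struct model_folio *f)
{
	f->on_deferred_split_queue = false;
	f->in_swap_cache = false;
}

/*
 * Simplified split: relies on the precondition instead of handling the
 * deferred_split queue or swap cache itself. Returns 0 on success, -1 if
 * the precondition is violated or the requested order is too large.
 */
static int model_split_unmapped(struct model_folio *f, unsigned int new_order)
{
	if (f->on_deferred_split_queue || f->in_swap_cache)
		return -1;
	if (new_order > f->order)
		return -1;
	f->order = new_order;
	return 0;
}
```

In this model a folio that is still queued or in the swap cache fails the split
until the preparation step has run; whether the real code can give the same
guarantee cheaply is exactly the open question.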

--
Best Regards,
Yan, Zi

Re: [v7 11/16] mm/migrate_device: add THP splitting during migration

Posted by Balbir Singh 3 months, 1 week ago

On 10/23/25 02:26, Zi Yan wrote:
> On 22 Oct 2025, at 3:16, Balbir Singh wrote:
> 
>> On 10/22/25 13:59, Zi Yan wrote:
>>> On 21 Oct 2025, at 17:34, Balbir Singh wrote:
>>>
>>>> On 10/20/25 09:59, Zi Yan wrote:
>>>>> On 19 Oct 2025, at 18:49, Balbir Singh wrote:
>>>>>
>>>>>> On 10/19/25 19:19, Wei Yang wrote:
>>>>>>> On Wed, Oct 01, 2025 at 04:57:02PM +1000, Balbir Singh wrote:
>>>>>>> [...]
>>>>>>>> static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>> 		struct page *split_at, struct page *lock_at,
>>>>>>>> -		struct list_head *list, bool uniform_split)
>>>>>>>> +		struct list_head *list, bool uniform_split, bool unmapped)
>>>>>>>> {
>>>>>>>> 	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>>>>>> 	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>>>>>> @@ -3765,13 +3757,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>> 		 * is taken to serialise against parallel split or collapse
>>>>>>>> 		 * operations.
>>>>>>>> 		 */
>>>>>>>> -		anon_vma = folio_get_anon_vma(folio);
>>>>>>>> -		if (!anon_vma) {
>>>>>>>> -			ret = -EBUSY;
>>>>>>>> -			goto out;
>>>>>>>> +		if (!unmapped) {
>>>>>>>> +			anon_vma = folio_get_anon_vma(folio);
>>>>>>>> +			if (!anon_vma) {
>>>>>>>> +				ret = -EBUSY;
>>>>>>>> +				goto out;
>>>>>>>> +			}
>>>>>>>> +			anon_vma_lock_write(anon_vma);
>>>>>>>> 		}
>>>>>>>> 		mapping = NULL;
>>>>>>>> -		anon_vma_lock_write(anon_vma);
>>>>>>>> 	} else {
>>>>>>>> 		unsigned int min_order;
>>>>>>>> 		gfp_t gfp;
>>>>>>>> @@ -3838,7 +3832,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>> 		goto out_unlock;
>>>>>>>> 	}
>>>>>>>>
>>>>>>>> -	unmap_folio(folio);
>>>>>>>> +	if (!unmapped)
>>>>>>>> +		unmap_folio(folio);
>>>>>>>>
>>>>>>>> 	/* block interrupt reentry in xa_lock and spinlock */
>>>>>>>> 	local_irq_disable();
>>>>>>>> @@ -3925,10 +3920,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>
>>>>>>>> 			next = folio_next(new_folio);
>>>>>>>>
>>>>>>>> +			zone_device_private_split_cb(folio, new_folio);
>>>>>>>> +
>>>>>>>> 			expected_refs = folio_expected_ref_count(new_folio) + 1;
>>>>>>>> 			folio_ref_unfreeze(new_folio, expected_refs);
>>>>>>>>
>>>>>>>> -			lru_add_split_folio(folio, new_folio, lruvec, list);
>>>>>>>> +			if (!unmapped)
>>>>>>>> +				lru_add_split_folio(folio, new_folio, lruvec, list);
>>>>>>>>
>>>>>>>> 			/*
>>>>>>>> 			 * Anonymous folio with swap cache.
>>>>>>>> @@ -3959,6 +3957,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>> 			__filemap_remove_folio(new_folio, NULL);
>>>>>>>> 			folio_put_refs(new_folio, nr_pages);
>>>>>>>> 		}
>>>>>>>> +
>>>>>>>> +		zone_device_private_split_cb(folio, NULL);
>>>>>>>> 		/*
>>>>>>>> 		 * Unfreeze @folio only after all page cache entries, which
>>>>>>>> 		 * used to point to it, have been updated with new folios.
>>>>>>>> @@ -3982,6 +3982,9 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>
>>>>>>>> 	local_irq_enable();
>>>>>>>>
>>>>>>>> +	if (unmapped)
>>>>>>>> +		return ret;
>>>>>>>
>>>>>>> As the comment of __folio_split() and __split_huge_page_to_list_to_order()
>>>>>>> mentioned:
>>>>>>>
>>>>>>>   * The large folio must be locked
>>>>>>>   * After splitting, the after-split folio containing @lock_at remains locked
>>>>>>>
>>>>>>> But here we seem to change the prerequisites.
>>>>>>>
>>>>>>> Hmm.. I am not sure this is correct.
>>>>>>>
>>>>>>
>>>>>> The code is correct, but you are right in that the documentation needs to be updated.
>>>>>> When "unmapped", we do want to leave the folios locked after the split.
>>>>>
>>>>> Sigh, this "unmapped" code needs so many special branches and a different locking
>>>>> requirement. It should be a separate function to avoid confusions.
>>>>>
>>>>
>>>> Yep, I have a patch for it, I am also waiting on Matthew's feedback, FYI, here is
>>>> a WIP patch that can be applied on top of the series
>>>
>>> Nice cleanup! Thanks.
>>>
>>>>
>>>> ---
>>>>  include/linux/huge_mm.h |   5 +-
>>>>  mm/huge_memory.c        | 137 ++++++++++++++++++++++++++++++++++------
>>>>  mm/migrate_device.c     |   3 +-
>>>>  3 files changed, 120 insertions(+), 25 deletions(-)
>>>>
>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>> index c4a811958cda..86e1cefaf391 100644
>>>> --- a/include/linux/huge_mm.h
>>>> +++ b/include/linux/huge_mm.h
>>>> @@ -366,7 +366,8 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>>>
>>>>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>>>  int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>> -		unsigned int new_order, bool unmapped);
>>>> +		unsigned int new_order);
>>>> +int split_unmapped_folio_to_order(struct folio *folio, unsigned int new_order);
>>>>  int min_order_for_split(struct folio *folio);
>>>>  int split_folio_to_list(struct folio *folio, struct list_head *list);
>>>>  bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>> @@ -379,7 +380,7 @@ int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
>>>>  static inline int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>>  		unsigned int new_order)
>>>>  {
>>>> -	return __split_huge_page_to_list_to_order(page, list, new_order, false);
>>>> +	return __split_huge_page_to_list_to_order(page, list, new_order);
>>>>  }
>>>>
>>>>  /*
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index 8c82a0ac6e69..e20cbf68d037 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -3711,7 +3711,6 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>>   * @lock_at: a page within @folio to be left locked to caller
>>>>   * @list: after-split folios will be put on it if non NULL
>>>>   * @uniform_split: perform uniform split or not (non-uniform split)
>>>> - * @unmapped: The pages are already unmapped, they are migration entries.
>>>>   *
>>>>   * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>>>   * It is in charge of checking whether the split is supported or not and
>>>> @@ -3727,7 +3726,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>>   */
>>>>  static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>  		struct page *split_at, struct page *lock_at,
>>>> -		struct list_head *list, bool uniform_split, bool unmapped)
>>>> +		struct list_head *list, bool uniform_split)
>>>>  {
>>>>  	struct deferred_split *ds_queue;
>>>>  	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>> @@ -3777,14 +3776,12 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>  		 * is taken to serialise against parallel split or collapse
>>>>  		 * operations.
>>>>  		 */
>>>> -		if (!unmapped) {
>>>> -			anon_vma = folio_get_anon_vma(folio);
>>>> -			if (!anon_vma) {
>>>> -				ret = -EBUSY;
>>>> -				goto out;
>>>> -			}
>>>> -			anon_vma_lock_write(anon_vma);
>>>> +		anon_vma = folio_get_anon_vma(folio);
>>>> +		if (!anon_vma) {
>>>> +			ret = -EBUSY;
>>>> +			goto out;
>>>>  		}
>>>> +		anon_vma_lock_write(anon_vma);
>>>>  		mapping = NULL;
>>>>  	} else {
>>>>  		unsigned int min_order;
>>>> @@ -3852,8 +3849,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>  		goto out_unlock;
>>>>  	}
>>>>
>>>> -	if (!unmapped)
>>>> -		unmap_folio(folio);
>>>> +	unmap_folio(folio);
>>>>
>>>>  	/* block interrupt reentry in xa_lock and spinlock */
>>>>  	local_irq_disable();
>>>> @@ -3954,8 +3950,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>  			expected_refs = folio_expected_ref_count(new_folio) + 1;
>>>>  			folio_ref_unfreeze(new_folio, expected_refs);
>>>>
>>>> -			if (!unmapped)
>>>> -				lru_add_split_folio(folio, new_folio, lruvec, list);
>>>> +			lru_add_split_folio(folio, new_folio, lruvec, list);
>>>>
>>>>  			/*
>>>>  			 * Anonymous folio with swap cache.
>>>> @@ -4011,9 +4006,6 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>
>>>>  	local_irq_enable();
>>>>
>>>> -	if (unmapped)
>>>> -		return ret;
>>>> -
>>>>  	if (nr_shmem_dropped)
>>>>  		shmem_uncharge(mapping->host, nr_shmem_dropped);
>>>>
>>>> @@ -4057,6 +4049,111 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>  	return ret;
>>>>  }
>>>>
>>>> +/*
>>>> + * This function is a helper for splitting folios that have already been unmapped.
>>>> + * The use case is that the device or the CPU can refuse to migrate THP pages in
>>>> + * the middle of migration, due to allocation issues on either side
>>>> + *
>>>> + * The high level code is copied from __folio_split, since the pages are anonymous
>>>> + * and are already isolated from the LRU, the code has been simplified to not
>>>> + * burden __folio_split with unmapped sprinkled into the code.
>>>
>>> I wonder if it makes sense to remove CPU side folio from both deferred_split queue
>>> and swap cache before migration to further simplify split_unmapped_folio_to_order().
>>> Basically require that device private folios cannot be on deferred_split queue nor
>>> swap cache.
>>>
>>
>> This API can be called for non-device private folios as well. Device private folios are
>> already not on the deferred queue. The use case is
>>
>> 1. Migrate a large folio page from CPU to Device
>> 2. SRC - CPU has a THP (large folio page)
>> 3. DST - Device cannot allocate a large page, hence split the SRC page
> 
> Right. That is what I am talking about, sorry I was not clear.
> I mean when migrating a large folio from CPU to device, the CPU large folio
> can be first removed from deferred_split queue and swap cache, if it is there,
> then the migration process begins, so that the CPU large folio will always
> be out of deferred_split queue and not in swap cache. As a result, this split
> function does not need to handle these two situations.

That leads to more specialization for LRU and zone device-private folios.
I gave it a go and ended up with a fair amount of duplication: taking the
lock, freezing the refs and, before that, finding the extra_refs with
can_split_folio().
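
As a reminder of what "freezing the refs" involves, below is a minimal
userspace sketch using C11 atomics. It is only a model: the names
model_folio, model_ref_freeze() and model_ref_unfreeze() are made up for this
example, and the kernel helpers operate on the page refcount rather than on a
standalone struct. The point is that the freeze only succeeds when the
reference count equals the value the caller expects, which is why the expected
count (including any extra references) has to be worked out first.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a folio; only the refcount is modelled */
struct model_folio {
	atomic_int refcount;
};

/*
 * Freeze: atomically replace the refcount with 0, but only if it currently
 * equals @expected. Returns false (and changes nothing) on a mismatch.
 */
static bool model_ref_freeze(struct model_folio *f, int expected)
{
	return atomic_compare_exchange_strong(&f->refcount, &expected, 0);
}

/* Unfreeze: publish the new refcount once the split has been completed */
static void model_ref_unfreeze(struct model_folio *f, int count)
{
	atomic_store(&f->refcount, count);
}
```

Any separate split path would need its own version of this
compute-expected-then-freeze sequence, which is where the duplication comes from.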

Balbir Singh

Re: [v7 11/16] mm/migrate_device: add THP splitting during migration

Posted by Zi Yan 3 months, 3 weeks ago

On 1 Oct 2025, at 2:57, Balbir Singh wrote:

> Implement migrate_vma_split_pages() to handle THP splitting during the
> migration process when destination cannot allocate compound pages.
>
> This addresses the common scenario where migrate_vma_setup() succeeds with
> MIGRATE_PFN_COMPOUND pages, but the destination device cannot allocate
> large pages during the migration phase.
>
> Key changes:
> - migrate_vma_split_pages(): Split already-isolated pages during migration
> - Enhanced folio_split() and __split_unmapped_folio() with isolated
>   parameter to avoid redundant unmap/remap operations
>
> This provides a fallback mechanism to ensure migration succeeds even when
> large page allocation fails at the destination.
>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
> Cc: Rakie Kim <rakie.kim@sk.com>
> Cc: Byungchul Park <byungchul@sk.com>
> Cc: Gregory Price <gourry@gourry.net>
> Cc: Ying Huang <ying.huang@linux.alibaba.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Nico Pache <npache@redhat.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Dev Jain <dev.jain@arm.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: Ralph Campbell <rcampbell@nvidia.com>
> Cc: Mika Penttilä <mpenttil@redhat.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Francois Dugast <francois.dugast@intel.com>
>
> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> ---
>  include/linux/huge_mm.h | 11 +++++-
>  lib/test_hmm.c          |  9 +++++
>  mm/huge_memory.c        | 46 ++++++++++++----------
>  mm/migrate_device.c     | 85 +++++++++++++++++++++++++++++++++++------
>  4 files changed, 117 insertions(+), 34 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 2d669be7f1c8..a166be872628 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -365,8 +365,8 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>  		vm_flags_t vm_flags);
>
>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
> -int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> -		unsigned int new_order);
> +int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> +		unsigned int new_order, bool unmapped);
>  int min_order_for_split(struct folio *folio);
>  int split_folio_to_list(struct folio *folio, struct list_head *list);
>  bool uniform_split_supported(struct folio *folio, unsigned int new_order,
> @@ -375,6 +375,13 @@ bool non_uniform_split_supported(struct folio *folio, unsigned int new_order,
>  		bool warns);
>  int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
>  		struct list_head *list);
> +
> +static inline int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> +		unsigned int new_order)
> +{
> +	return __split_huge_page_to_list_to_order(page, list, new_order, false);
> +}
> +
>  /*
>   * try_folio_split - try to split a @folio at @page using non uniform split.
>   * @folio: folio to be split
> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
> index 46fa9e200db8..df429670633e 100644
> --- a/lib/test_hmm.c
> +++ b/lib/test_hmm.c
> @@ -1612,6 +1612,15 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>  	order = folio_order(page_folio(vmf->page));
>  	nr = 1 << order;
>
> +	/*
> +	 * When folios are partially mapped, we can't rely on the folio
> +	 * order of vmf->page as the folio might not be fully split yet
> +	 */
> +	if (vmf->pte) {
> +		order = 0;
> +		nr = 1;
> +	}
> +
>  	/*
>  	 * Consider a per-cpu cache of src and dst pfns, but with
>  	 * large number of cpus that might not scale well.
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 8c95a658b3ec..022b0729f826 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3463,15 +3463,6 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
>  		new_folio->mapping = folio->mapping;
>  		new_folio->index = folio->index + i;
>
> -		/*
> -		 * page->private should not be set in tail pages. Fix up and warn once
> -		 * if private is unexpectedly set.
> -		 */
> -		if (unlikely(new_folio->private)) {
> -			VM_WARN_ON_ONCE_PAGE(true, new_head);
> -			new_folio->private = NULL;
> -		}
> -
>  		if (folio_test_swapcache(folio))
>  			new_folio->swap.val = folio->swap.val + i;
>
> @@ -3700,6 +3691,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>   * @lock_at: a page within @folio to be left locked to caller
>   * @list: after-split folios will be put on it if non NULL
>   * @uniform_split: perform uniform split or not (non-uniform split)
> + * @unmapped: The pages are already unmapped, they are migration entries.
>   *
>   * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>   * It is in charge of checking whether the split is supported or not and
> @@ -3715,7 +3707,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>   */
>  static int __folio_split(struct folio *folio, unsigned int new_order,
>  		struct page *split_at, struct page *lock_at,
> -		struct list_head *list, bool uniform_split)
> +		struct list_head *list, bool uniform_split, bool unmapped)
>  {
>  	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>  	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
> @@ -3765,13 +3757,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>  		 * is taken to serialise against parallel split or collapse
>  		 * operations.
>  		 */
> -		anon_vma = folio_get_anon_vma(folio);
> -		if (!anon_vma) {
> -			ret = -EBUSY;
> -			goto out;
> +		if (!unmapped) {
> +			anon_vma = folio_get_anon_vma(folio);
> +			if (!anon_vma) {
> +				ret = -EBUSY;
> +				goto out;
> +			}
> +			anon_vma_lock_write(anon_vma);
>  		}
>  		mapping = NULL;
> -		anon_vma_lock_write(anon_vma);
>  	} else {
>  		unsigned int min_order;
>  		gfp_t gfp;
> @@ -3838,7 +3832,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>  		goto out_unlock;
>  	}
>
> -	unmap_folio(folio);
> +	if (!unmapped)
> +		unmap_folio(folio);
>
>  	/* block interrupt reentry in xa_lock and spinlock */
>  	local_irq_disable();
> @@ -3925,10 +3920,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>
>  			next = folio_next(new_folio);
>
> +			zone_device_private_split_cb(folio, new_folio);
> +
>  			expected_refs = folio_expected_ref_count(new_folio) + 1;
>  			folio_ref_unfreeze(new_folio, expected_refs);
>
> -			lru_add_split_folio(folio, new_folio, lruvec, list);
> +			if (!unmapped)
> +				lru_add_split_folio(folio, new_folio, lruvec, list);
>
>  			/*
>  			 * Anonymous folio with swap cache.
> @@ -3959,6 +3957,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>  			__filemap_remove_folio(new_folio, NULL);
>  			folio_put_refs(new_folio, nr_pages);
>  		}
> +
> +		zone_device_private_split_cb(folio, NULL);
>  		/*
>  		 * Unfreeze @folio only after all page cache entries, which
>  		 * used to point to it, have been updated with new folios.
> @@ -3982,6 +3982,9 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>
>  	local_irq_enable();
>
> +	if (unmapped)
> +		return ret;
> +
>  	if (nr_shmem_dropped)
>  		shmem_uncharge(mapping->host, nr_shmem_dropped);
>
> @@ -4072,12 +4075,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>   * Returns -EINVAL when trying to split to an order that is incompatible
>   * with the folio. Splitting to order 0 is compatible with all folios.
>   */
> -int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> -				     unsigned int new_order)
> +int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> +				     unsigned int new_order, bool unmapped)
>  {
>  	struct folio *folio = page_folio(page);
>
> -	return __folio_split(folio, new_order, &folio->page, page, list, true);
> +	return __folio_split(folio, new_order, &folio->page, page, list, true,
> +				unmapped);
>  }
>
>  /*
> @@ -4106,7 +4110,7 @@ int folio_split(struct folio *folio, unsigned int new_order,
>  		struct page *split_at, struct list_head *list)
>  {
>  	return __folio_split(folio, new_order, split_at, &folio->page, list,
> -			false);
> +			false, false);
>  }
>
>  int min_order_for_split(struct folio *folio)
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 4156fd6190d2..fa42d2ebd024 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -306,6 +306,23 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>  			    pgmap->owner != migrate->pgmap_owner)
>  				goto next;
>
> +			folio = page_folio(page);
> +			if (folio_test_large(folio)) {
> +				int ret;
> +
> +				pte_unmap_unlock(ptep, ptl);
> +				ret = migrate_vma_split_folio(folio,
> +							  migrate->fault_page);
> +
> +				if (ret) {
> +					ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
> +					goto next;
> +				}
> +
> +				addr = start;
> +				goto again;
> +			}
> +
>  			mpfn = migrate_pfn(page_to_pfn(page)) |
>  					MIGRATE_PFN_MIGRATE;
>  			if (is_writable_device_private_entry(entry))
> @@ -880,6 +897,29 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
>  		src[i] &= ~MIGRATE_PFN_MIGRATE;
>  	return 0;
>  }
> +
> +static int migrate_vma_split_unmapped_folio(struct migrate_vma *migrate,
> +					    unsigned long idx, unsigned long addr,
> +					    struct folio *folio)
> +{
> +	unsigned long i;
> +	unsigned long pfn;
> +	unsigned long flags;
> +	int ret = 0;
> +
> +	folio_get(folio);
> +	split_huge_pmd_address(migrate->vma, addr, true);
> +	ret = __split_huge_page_to_list_to_order(folio_page(folio, 0), NULL,
> +							0, true);

Why not just call __split_unmapped_folio() here? Then, you do not need to add
a new unmapped parameter in __folio_split().
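
For readers following the hunk: independent of which split function is called,
the rest of this helper rewrites the src array so that one compound entry
becomes a run of order-0 entries. The standalone sketch below models just that
bookkeeping in userspace. The flag bit positions, the shift and NR_SUBPAGES
(standing in for HPAGE_PMD_NR with 4K pages and 2M THPs) are illustrative
assumptions, not copied from the kernel headers, and the model_ names are
hypothetical.

```c
#include <stddef.h>

/* Illustrative encoding only; the real values live in include/linux/migrate.h */
#define MIGRATE_PFN_VALID    (1UL << 0)
#define MIGRATE_PFN_MIGRATE  (1UL << 1)
#define MIGRATE_PFN_COMPOUND (1UL << 4)
#define MIGRATE_PFN_SHIFT    6
#define NR_SUBPAGES          512UL	/* assumed stand-in for HPAGE_PMD_NR */

/* Encode a pfn as a valid migrate entry (models migrate_pfn()) */
static unsigned long model_migrate_pfn(unsigned long pfn)
{
	return (pfn << MIGRATE_PFN_SHIFT) | MIGRATE_PFN_VALID;
}

/*
 * After a successful split, clear the compound bit on the head entry and
 * fill in the following NR_SUBPAGES - 1 entries with consecutive pfns,
 * carrying over the head entry's remaining flag bits.
 */
static void model_fixup_src_after_split(unsigned long *src, size_t idx)
{
	unsigned long flags, pfn, i;

	src[idx] &= ~MIGRATE_PFN_COMPOUND;
	flags = src[idx] & ((1UL << MIGRATE_PFN_SHIFT) - 1);
	pfn = src[idx] >> MIGRATE_PFN_SHIFT;
	for (i = 1; i < NR_SUBPAGES; i++)
		src[idx + i] = model_migrate_pfn(pfn + i) | flags;
}
```

Note that the compound bit is cleared before the flags are captured, so none of
the tail entries inherit it.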


> +	if (ret)
> +		return ret;
> +	migrate->src[idx] &= ~MIGRATE_PFN_COMPOUND;
> +	flags = migrate->src[idx] & ((1UL << MIGRATE_PFN_SHIFT) - 1);
> +	pfn = migrate->src[idx] >> MIGRATE_PFN_SHIFT;
> +	for (i = 1; i < HPAGE_PMD_NR; i++)
> +		migrate->src[i+idx] = migrate_pfn(pfn + i) | flags;
> +	return ret;
> +}
>  #else /* !CONFIG_ARCH_ENABLE_THP_MIGRATION */
>  static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
>  					 unsigned long addr,
> @@ -889,6 +929,13 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
>  {
>  	return 0;
>  }
> +
> +static int migrate_vma_split_unmapped_folio(struct migrate_vma *migrate,
> +					    unsigned long idx, unsigned long addr,
> +					    struct folio *folio)
> +{
> +	return 0;
> +}
>  #endif
>
>  static unsigned long migrate_vma_nr_pages(unsigned long *src)
> @@ -1050,8 +1097,9 @@ static void __migrate_device_pages(unsigned long *src_pfns,
>  				struct migrate_vma *migrate)
>  {
>  	struct mmu_notifier_range range;
> -	unsigned long i;
> +	unsigned long i, j;
>  	bool notified = false;
> +	unsigned long addr;
>
>  	for (i = 0; i < npages; ) {
>  		struct page *newpage = migrate_pfn_to_page(dst_pfns[i]);
> @@ -1093,12 +1141,16 @@ static void __migrate_device_pages(unsigned long *src_pfns,
>  				(!(dst_pfns[i] & MIGRATE_PFN_COMPOUND))) {
>  				nr = migrate_vma_nr_pages(&src_pfns[i]);
>  				src_pfns[i] &= ~MIGRATE_PFN_COMPOUND;
> -				src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
> -				goto next;
> +			} else {
> +				nr = 1;
>  			}
>
> -			migrate_vma_insert_page(migrate, addr, &dst_pfns[i],
> -						&src_pfns[i]);
> +			for (j = 0; j < nr && i + j < npages; j++) {
> +				src_pfns[i+j] |= MIGRATE_PFN_MIGRATE;
> +				migrate_vma_insert_page(migrate,
> +					addr + j * PAGE_SIZE,
> +					&dst_pfns[i+j], &src_pfns[i+j]);
> +			}
>  			goto next;
>  		}
>
> @@ -1120,7 +1172,13 @@ static void __migrate_device_pages(unsigned long *src_pfns,
>  							 MIGRATE_PFN_COMPOUND);
>  					goto next;
>  				}
> -				src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
> +				nr = 1 << folio_order(folio);
> +				addr = migrate->start + i * PAGE_SIZE;
> +				if (migrate_vma_split_unmapped_folio(migrate, i, addr, folio)) {
> +					src_pfns[i] &= ~(MIGRATE_PFN_MIGRATE |
> +							 MIGRATE_PFN_COMPOUND);
> +					goto next;
> +				}
>  			} else if ((src_pfns[i] & MIGRATE_PFN_MIGRATE) &&
>  				(dst_pfns[i] & MIGRATE_PFN_COMPOUND) &&
>  				!(src_pfns[i] & MIGRATE_PFN_COMPOUND)) {
> @@ -1156,11 +1214,16 @@ static void __migrate_device_pages(unsigned long *src_pfns,
>
>  		if (migrate && migrate->fault_page == page)
>  			extra_cnt = 1;
> -		r = folio_migrate_mapping(mapping, newfolio, folio, extra_cnt);
> -		if (r)
> -			src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
> -		else
> -			folio_migrate_flags(newfolio, folio);
> +		for (j = 0; j < nr && i + j < npages; j++) {
> +			folio = page_folio(migrate_pfn_to_page(src_pfns[i+j]));
> +			newfolio = page_folio(migrate_pfn_to_page(dst_pfns[i+j]));
> +
> +			r = folio_migrate_mapping(mapping, newfolio, folio, extra_cnt);
> +			if (r)
> +				src_pfns[i+j] &= ~MIGRATE_PFN_MIGRATE;
> +			else
> +				folio_migrate_flags(newfolio, folio);
> +		}
>  next:
>  		i += nr;
>  	}
> -- 
> 2.51.0


--
Best Regards,
Yan, Zi

Re: [v7 11/16] mm/migrate_device: add THP splitting during migration

Posted by Balbir Singh 3 months, 3 weeks ago

On 10/14/25 08:17, Zi Yan wrote:
> On 1 Oct 2025, at 2:57, Balbir Singh wrote:
> 
>> Implement migrate_vma_split_pages() to handle THP splitting during the
>> migration process when destination cannot allocate compound pages.
>>
>> This addresses the common scenario where migrate_vma_setup() succeeds with
>> MIGRATE_PFN_COMPOUND pages, but the destination device cannot allocate
>> large pages during the migration phase.
>>
>> Key changes:
>> - migrate_vma_split_pages(): Split already-isolated pages during migration
>> - Enhanced folio_split() and __split_unmapped_folio() with isolated
>>   parameter to avoid redundant unmap/remap operations
>>
>> This provides a fallback mechanism to ensure migration succeeds even when
>> large page allocation fails at the destination.
>>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
>> Cc: Rakie Kim <rakie.kim@sk.com>
>> Cc: Byungchul Park <byungchul@sk.com>
>> Cc: Gregory Price <gourry@gourry.net>
>> Cc: Ying Huang <ying.huang@linux.alibaba.com>
>> Cc: Alistair Popple <apopple@nvidia.com>
>> Cc: Oscar Salvador <osalvador@suse.de>
>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>> Cc: Nico Pache <npache@redhat.com>
>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>> Cc: Dev Jain <dev.jain@arm.com>
>> Cc: Barry Song <baohua@kernel.org>
>> Cc: Lyude Paul <lyude@redhat.com>
>> Cc: Danilo Krummrich <dakr@kernel.org>
>> Cc: David Airlie <airlied@gmail.com>
>> Cc: Simona Vetter <simona@ffwll.ch>
>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>> Cc: Mika Penttilä <mpenttil@redhat.com>
>> Cc: Matthew Brost <matthew.brost@intel.com>
>> Cc: Francois Dugast <francois.dugast@intel.com>
>>
>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>> ---
>>  include/linux/huge_mm.h | 11 +++++-
>>  lib/test_hmm.c          |  9 +++++
>>  mm/huge_memory.c        | 46 ++++++++++++----------
>>  mm/migrate_device.c     | 85 +++++++++++++++++++++++++++++++++++------
>>  4 files changed, 117 insertions(+), 34 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 2d669be7f1c8..a166be872628 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -365,8 +365,8 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>  		vm_flags_t vm_flags);
>>
>>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>> -int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> -		unsigned int new_order);
>> +int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> +		unsigned int new_order, bool unmapped);
>>  int min_order_for_split(struct folio *folio);
>>  int split_folio_to_list(struct folio *folio, struct list_head *list);
>>  bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>> @@ -375,6 +375,13 @@ bool non_uniform_split_supported(struct folio *folio, unsigned int new_order,
>>  		bool warns);
>>  int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
>>  		struct list_head *list);
>> +
>> +static inline int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> +		unsigned int new_order)
>> +{
>> +	return __split_huge_page_to_list_to_order(page, list, new_order, false);
>> +}
>> +
>>  /*
>>   * try_folio_split - try to split a @folio at @page using non uniform split.
>>   * @folio: folio to be split
>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>> index 46fa9e200db8..df429670633e 100644
>> --- a/lib/test_hmm.c
>> +++ b/lib/test_hmm.c
>> @@ -1612,6 +1612,15 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>  	order = folio_order(page_folio(vmf->page));
>>  	nr = 1 << order;
>>
>> +	/*
>> +	 * When folios are partially mapped, we can't rely on the folio
>> +	 * order of vmf->page as the folio might not be fully split yet
>> +	 */
>> +	if (vmf->pte) {
>> +		order = 0;
>> +		nr = 1;
>> +	}
>> +
>>  	/*
>>  	 * Consider a per-cpu cache of src and dst pfns, but with
>>  	 * large number of cpus that might not scale well.
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 8c95a658b3ec..022b0729f826 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3463,15 +3463,6 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
>>  		new_folio->mapping = folio->mapping;
>>  		new_folio->index = folio->index + i;
>>
>> -		/*
>> -		 * page->private should not be set in tail pages. Fix up and warn once
>> -		 * if private is unexpectedly set.
>> -		 */
>> -		if (unlikely(new_folio->private)) {
>> -			VM_WARN_ON_ONCE_PAGE(true, new_head);
>> -			new_folio->private = NULL;
>> -		}
>> -
>>  		if (folio_test_swapcache(folio))
>>  			new_folio->swap.val = folio->swap.val + i;
>>
>> @@ -3700,6 +3691,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>   * @lock_at: a page within @folio to be left locked to caller
>>   * @list: after-split folios will be put on it if non NULL
>>   * @uniform_split: perform uniform split or not (non-uniform split)
>> + * @unmapped: The pages are already unmapped, they are migration entries.
>>   *
>>   * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>   * It is in charge of checking whether the split is supported or not and
>> @@ -3715,7 +3707,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>   */
>>  static int __folio_split(struct folio *folio, unsigned int new_order,
>>  		struct page *split_at, struct page *lock_at,
>> -		struct list_head *list, bool uniform_split)
>> +		struct list_head *list, bool uniform_split, bool unmapped)
>>  {
>>  	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>  	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>> @@ -3765,13 +3757,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>  		 * is taken to serialise against parallel split or collapse
>>  		 * operations.
>>  		 */
>> -		anon_vma = folio_get_anon_vma(folio);
>> -		if (!anon_vma) {
>> -			ret = -EBUSY;
>> -			goto out;
>> +		if (!unmapped) {
>> +			anon_vma = folio_get_anon_vma(folio);
>> +			if (!anon_vma) {
>> +				ret = -EBUSY;
>> +				goto out;
>> +			}
>> +			anon_vma_lock_write(anon_vma);
>>  		}
>>  		mapping = NULL;
>> -		anon_vma_lock_write(anon_vma);
>>  	} else {
>>  		unsigned int min_order;
>>  		gfp_t gfp;
>> @@ -3838,7 +3832,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>  		goto out_unlock;
>>  	}
>>
>> -	unmap_folio(folio);
>> +	if (!unmapped)
>> +		unmap_folio(folio);
>>
>>  	/* block interrupt reentry in xa_lock and spinlock */
>>  	local_irq_disable();
>> @@ -3925,10 +3920,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>
>>  			next = folio_next(new_folio);
>>
>> +			zone_device_private_split_cb(folio, new_folio);
>> +
>>  			expected_refs = folio_expected_ref_count(new_folio) + 1;
>>  			folio_ref_unfreeze(new_folio, expected_refs);
>>
>> -			lru_add_split_folio(folio, new_folio, lruvec, list);
>> +			if (!unmapped)
>> +				lru_add_split_folio(folio, new_folio, lruvec, list);
>>
>>  			/*
>>  			 * Anonymous folio with swap cache.
>> @@ -3959,6 +3957,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>  			__filemap_remove_folio(new_folio, NULL);
>>  			folio_put_refs(new_folio, nr_pages);
>>  		}
>> +
>> +		zone_device_private_split_cb(folio, NULL);
>>  		/*
>>  		 * Unfreeze @folio only after all page cache entries, which
>>  		 * used to point to it, have been updated with new folios.
>> @@ -3982,6 +3982,9 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>
>>  	local_irq_enable();
>>
>> +	if (unmapped)
>> +		return ret;
>> +
>>  	if (nr_shmem_dropped)
>>  		shmem_uncharge(mapping->host, nr_shmem_dropped);
>>
>> @@ -4072,12 +4075,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>   * Returns -EINVAL when trying to split to an order that is incompatible
>>   * with the folio. Splitting to order 0 is compatible with all folios.
>>   */
>> -int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> -				     unsigned int new_order)
>> +int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> +				     unsigned int new_order, bool unmapped)
>>  {
>>  	struct folio *folio = page_folio(page);
>>
>> -	return __folio_split(folio, new_order, &folio->page, page, list, true);
>> +	return __folio_split(folio, new_order, &folio->page, page, list, true,
>> +				unmapped);
>>  }
>>
>>  /*
>> @@ -4106,7 +4110,7 @@ int folio_split(struct folio *folio, unsigned int new_order,
>>  		struct page *split_at, struct list_head *list)
>>  {
>>  	return __folio_split(folio, new_order, split_at, &folio->page, list,
>> -			false);
>> +			false, false);
>>  }
>>
>>  int min_order_for_split(struct folio *folio)
>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>> index 4156fd6190d2..fa42d2ebd024 100644
>> --- a/mm/migrate_device.c
>> +++ b/mm/migrate_device.c
>> @@ -306,6 +306,23 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>  			    pgmap->owner != migrate->pgmap_owner)
>>  				goto next;
>>
>> +			folio = page_folio(page);
>> +			if (folio_test_large(folio)) {
>> +				int ret;
>> +
>> +				pte_unmap_unlock(ptep, ptl);
>> +				ret = migrate_vma_split_folio(folio,
>> +							  migrate->fault_page);
>> +
>> +				if (ret) {
>> +					ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>> +					goto next;
>> +				}
>> +
>> +				addr = start;
>> +				goto again;
>> +			}
>> +
>>  			mpfn = migrate_pfn(page_to_pfn(page)) |
>>  					MIGRATE_PFN_MIGRATE;
>>  			if (is_writable_device_private_entry(entry))
>> @@ -880,6 +897,29 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
>>  		src[i] &= ~MIGRATE_PFN_MIGRATE;
>>  	return 0;
>>  }
>> +
>> +static int migrate_vma_split_unmapped_folio(struct migrate_vma *migrate,
>> +					    unsigned long idx, unsigned long addr,
>> +					    struct folio *folio)
>> +{
>> +	unsigned long i;
>> +	unsigned long pfn;
>> +	unsigned long flags;
>> +	int ret = 0;
>> +
>> +	folio_get(folio);
>> +	split_huge_pmd_address(migrate->vma, addr, true);
>> +	ret = __split_huge_page_to_list_to_order(folio_page(folio, 0), NULL,
>> +							0, true);
> 
> Why not just call __split_unmapped_folio() here? Then, you do not need to add
> a new unmapped parameter in __folio_split().
> 
> 

The benefit is reuse of the common code in __folio_split(): the ref count checks,
the freeze/unfreeze, and the callbacks that have to be made to the drivers on
folio split. These paths are required for both mapped and unmapped folios.

Otherwise we'd have to replicate that logic and those checks for unmapped folios,
and handle post-split processing a second time.
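
As a rough illustration of the freeze/unfreeze pattern mentioned above, here is
a minimal userspace sketch in C. It is not kernel code: struct toy_folio and the
toy_* helpers are hypothetical stand-ins, and C11 atomics only loosely model what
folio_ref_freeze()/folio_ref_unfreeze() do. The idea shown is that a split may
proceed only if the reference count matches exactly what the caller expects.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for a folio's reference count (hypothetical, not struct folio). */
struct toy_folio {
	atomic_int refcount;
};

/*
 * Freeze: atomically swap the count to 0, but only if it currently equals
 * the expected value. Any extra reference makes the freeze fail.
 */
static bool toy_ref_freeze(struct toy_folio *f, int expected)
{
	int old = expected;

	return atomic_compare_exchange_strong(&f->refcount, &old, 0);
}

/* Unfreeze: restore a count; only valid while the folio is frozen (count 0). */
static void toy_ref_unfreeze(struct toy_folio *f, int count)
{
	assert(atomic_load(&f->refcount) == 0);
	atomic_store(&f->refcount, count);
}

/* Returns 0 on success, -1 if an unexpected reference blocks the split. */
static int toy_split(struct toy_folio *f, int expected)
{
	if (!toy_ref_freeze(f, expected))
		return -1;

	/* The actual split work would happen here, while frozen. */

	toy_ref_unfreeze(f, expected);
	return 0;
}
```

In this sketch a caller holding the only extra reference would call
toy_split(&f, 2) on a folio whose count is 2; the same call fails once anyone
else takes a reference. This check is needed whether or not the folio is still
mapped.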

[...]

Thanks,
Balbir

Re: [v7 11/16] mm/migrate_device: add THP splitting during migration

Posted by Zi Yan 3 months, 3 weeks ago

On 13 Oct 2025, at 17:33, Balbir Singh wrote:

> On 10/14/25 08:17, Zi Yan wrote:
>> On 1 Oct 2025, at 2:57, Balbir Singh wrote:
>>
>>> Implement migrate_vma_split_pages() to handle THP splitting during the
>>> migration process when destination cannot allocate compound pages.
>>>
>>> This addresses the common scenario where migrate_vma_setup() succeeds with
>>> MIGRATE_PFN_COMPOUND pages, but the destination device cannot allocate
>>> large pages during the migration phase.
>>>
>>> Key changes:
>>> - migrate_vma_split_pages(): Split already-isolated pages during migration
>>> - Enhanced folio_split() and __split_unmapped_folio() with isolated
>>>   parameter to avoid redundant unmap/remap operations
>>>
>>> This provides a fallback mechanism to ensure migration succeeds even when
>>> large page allocation fails at the destination.
>>>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: David Hildenbrand <david@redhat.com>
>>> Cc: Zi Yan <ziy@nvidia.com>
>>> Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
>>> Cc: Rakie Kim <rakie.kim@sk.com>
>>> Cc: Byungchul Park <byungchul@sk.com>
>>> Cc: Gregory Price <gourry@gourry.net>
>>> Cc: Ying Huang <ying.huang@linux.alibaba.com>
>>> Cc: Alistair Popple <apopple@nvidia.com>
>>> Cc: Oscar Salvador <osalvador@suse.de>
>>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>>> Cc: Nico Pache <npache@redhat.com>
>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>> Cc: Dev Jain <dev.jain@arm.com>
>>> Cc: Barry Song <baohua@kernel.org>
>>> Cc: Lyude Paul <lyude@redhat.com>
>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>> Cc: David Airlie <airlied@gmail.com>
>>> Cc: Simona Vetter <simona@ffwll.ch>
>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>
>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>> ---
>>>  include/linux/huge_mm.h | 11 +++++-
>>>  lib/test_hmm.c          |  9 +++++
>>>  mm/huge_memory.c        | 46 ++++++++++++----------
>>>  mm/migrate_device.c     | 85 +++++++++++++++++++++++++++++++++++------
>>>  4 files changed, 117 insertions(+), 34 deletions(-)
>>>
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index 2d669be7f1c8..a166be872628 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -365,8 +365,8 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>>  		vm_flags_t vm_flags);
>>>
>>>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>> -int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>> -		unsigned int new_order);
>>> +int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>> +		unsigned int new_order, bool unmapped);
>>>  int min_order_for_split(struct folio *folio);
>>>  int split_folio_to_list(struct folio *folio, struct list_head *list);
>>>  bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>> @@ -375,6 +375,13 @@ bool non_uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>  		bool warns);
>>>  int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
>>>  		struct list_head *list);
>>> +
>>> +static inline int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>> +		unsigned int new_order)
>>> +{
>>> +	return __split_huge_page_to_list_to_order(page, list, new_order, false);
>>> +}
>>> +
>>>  /*
>>>   * try_folio_split - try to split a @folio at @page using non uniform split.
>>>   * @folio: folio to be split
>>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>>> index 46fa9e200db8..df429670633e 100644
>>> --- a/lib/test_hmm.c
>>> +++ b/lib/test_hmm.c
>>> @@ -1612,6 +1612,15 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>>  	order = folio_order(page_folio(vmf->page));
>>>  	nr = 1 << order;
>>>
>>> +	/*
>>> +	 * When folios are partially mapped, we can't rely on the folio
>>> +	 * order of vmf->page as the folio might not be fully split yet
>>> +	 */
>>> +	if (vmf->pte) {
>>> +		order = 0;
>>> +		nr = 1;
>>> +	}
>>> +
>>>  	/*
>>>  	 * Consider a per-cpu cache of src and dst pfns, but with
>>>  	 * large number of cpus that might not scale well.
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 8c95a658b3ec..022b0729f826 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -3463,15 +3463,6 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
>>>  		new_folio->mapping = folio->mapping;
>>>  		new_folio->index = folio->index + i;
>>>
>>> -		/*
>>> -		 * page->private should not be set in tail pages. Fix up and warn once
>>> -		 * if private is unexpectedly set.
>>> -		 */
>>> -		if (unlikely(new_folio->private)) {
>>> -			VM_WARN_ON_ONCE_PAGE(true, new_head);
>>> -			new_folio->private = NULL;
>>> -		}
>>> -
>>>  		if (folio_test_swapcache(folio))
>>>  			new_folio->swap.val = folio->swap.val + i;
>>>
>>> @@ -3700,6 +3691,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>   * @lock_at: a page within @folio to be left locked to caller
>>>   * @list: after-split folios will be put on it if non NULL
>>>   * @uniform_split: perform uniform split or not (non-uniform split)
>>> + * @unmapped: The pages are already unmapped, they are migration entries.
>>>   *
>>>   * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>>   * It is in charge of checking whether the split is supported or not and
>>> @@ -3715,7 +3707,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>   */
>>>  static int __folio_split(struct folio *folio, unsigned int new_order,
>>>  		struct page *split_at, struct page *lock_at,
>>> -		struct list_head *list, bool uniform_split)
>>> +		struct list_head *list, bool uniform_split, bool unmapped)
>>>  {
>>>  	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>  	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>> @@ -3765,13 +3757,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>  		 * is taken to serialise against parallel split or collapse
>>>  		 * operations.
>>>  		 */
>>> -		anon_vma = folio_get_anon_vma(folio);
>>> -		if (!anon_vma) {
>>> -			ret = -EBUSY;
>>> -			goto out;
>>> +		if (!unmapped) {
>>> +			anon_vma = folio_get_anon_vma(folio);
>>> +			if (!anon_vma) {
>>> +				ret = -EBUSY;
>>> +				goto out;
>>> +			}
>>> +			anon_vma_lock_write(anon_vma);
>>>  		}
>>>  		mapping = NULL;
>>> -		anon_vma_lock_write(anon_vma);
>>>  	} else {
>>>  		unsigned int min_order;
>>>  		gfp_t gfp;
>>> @@ -3838,7 +3832,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>  		goto out_unlock;
>>>  	}
>>>
>>> -	unmap_folio(folio);
>>> +	if (!unmapped)
>>> +		unmap_folio(folio);
>>>
>>>  	/* block interrupt reentry in xa_lock and spinlock */
>>>  	local_irq_disable();
>>> @@ -3925,10 +3920,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>
>>>  			next = folio_next(new_folio);
>>>
>>> +			zone_device_private_split_cb(folio, new_folio);
>>> +
>>>  			expected_refs = folio_expected_ref_count(new_folio) + 1;
>>>  			folio_ref_unfreeze(new_folio, expected_refs);
>>>
>>> -			lru_add_split_folio(folio, new_folio, lruvec, list);
>>> +			if (!unmapped)
>>> +				lru_add_split_folio(folio, new_folio, lruvec, list);
>>>
>>>  			/*
>>>  			 * Anonymous folio with swap cache.
>>> @@ -3959,6 +3957,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>  			__filemap_remove_folio(new_folio, NULL);
>>>  			folio_put_refs(new_folio, nr_pages);
>>>  		}
>>> +
>>> +		zone_device_private_split_cb(folio, NULL);
>>>  		/*
>>>  		 * Unfreeze @folio only after all page cache entries, which
>>>  		 * used to point to it, have been updated with new folios.
>>> @@ -3982,6 +3982,9 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>
>>>  	local_irq_enable();
>>>
>>> +	if (unmapped)
>>> +		return ret;
>>> +
>>>  	if (nr_shmem_dropped)
>>>  		shmem_uncharge(mapping->host, nr_shmem_dropped);
>>>
>>> @@ -4072,12 +4075,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>   * Returns -EINVAL when trying to split to an order that is incompatible
>>>   * with the folio. Splitting to order 0 is compatible with all folios.
>>>   */
>>> -int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>> -				     unsigned int new_order)
>>> +int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>> +				     unsigned int new_order, bool unmapped)
>>>  {
>>>  	struct folio *folio = page_folio(page);
>>>
>>> -	return __folio_split(folio, new_order, &folio->page, page, list, true);
>>> +	return __folio_split(folio, new_order, &folio->page, page, list, true,
>>> +				unmapped);
>>>  }
>>>
>>>  /*
>>> @@ -4106,7 +4110,7 @@ int folio_split(struct folio *folio, unsigned int new_order,
>>>  		struct page *split_at, struct list_head *list)
>>>  {
>>>  	return __folio_split(folio, new_order, split_at, &folio->page, list,
>>> -			false);
>>> +			false, false);
>>>  }
>>>
>>>  int min_order_for_split(struct folio *folio)
>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>> index 4156fd6190d2..fa42d2ebd024 100644
>>> --- a/mm/migrate_device.c
>>> +++ b/mm/migrate_device.c
>>> @@ -306,6 +306,23 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>  			    pgmap->owner != migrate->pgmap_owner)
>>>  				goto next;
>>>
>>> +			folio = page_folio(page);
>>> +			if (folio_test_large(folio)) {
>>> +				int ret;
>>> +
>>> +				pte_unmap_unlock(ptep, ptl);
>>> +				ret = migrate_vma_split_folio(folio,
>>> +							  migrate->fault_page);
>>> +
>>> +				if (ret) {
>>> +					ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>>> +					goto next;
>>> +				}
>>> +
>>> +				addr = start;
>>> +				goto again;
>>> +			}
>>> +
>>>  			mpfn = migrate_pfn(page_to_pfn(page)) |
>>>  					MIGRATE_PFN_MIGRATE;
>>>  			if (is_writable_device_private_entry(entry))
>>> @@ -880,6 +897,29 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
>>>  		src[i] &= ~MIGRATE_PFN_MIGRATE;
>>>  	return 0;
>>>  }
>>> +
>>> +static int migrate_vma_split_unmapped_folio(struct migrate_vma *migrate,
>>> +					    unsigned long idx, unsigned long addr,
>>> +					    struct folio *folio)
>>> +{
>>> +	unsigned long i;
>>> +	unsigned long pfn;
>>> +	unsigned long flags;
>>> +	int ret = 0;
>>> +
>>> +	folio_get(folio);
>>> +	split_huge_pmd_address(migrate->vma, addr, true);
>>> +	ret = __split_huge_page_to_list_to_order(folio_page(folio, 0), NULL,
>>> +							0, true);
>>
>> Why not just call __split_unmapped_folio() here? Then, you do not need to add
>> a new unmapped parameter in __folio_split().
>>
>>
>
> The benefit comes from the ref count checks and freeze/unfreeze (common code) in
> __folio_split() and also from the callbacks that are to be made to the drivers on
> folio split. These paths are required for both mapped and unmapped folios.
>
> Otherwise we'd have to replicate that logic and checks again for unmapped folios
> and handle post split processing again.

Replicating the freeze/unfreeze code would be much better than adding an unmapped
parameter and a new path in __folio_split(). When it comes to adding support for
file-backed folios, are you going to use the unmapped parameter to guard the
file-backed code in __folio_split() as well? Do we just keep piling up special paths?


--
Best Regards,
Yan, Zi

Re: [v7 11/16] mm/migrate_device: add THP splitting during migration

Posted by Balbir Singh 3 months, 3 weeks ago

On 10/14/25 08:55, Zi Yan wrote:
> On 13 Oct 2025, at 17:33, Balbir Singh wrote:
> 
>> On 10/14/25 08:17, Zi Yan wrote:
>>> On 1 Oct 2025, at 2:57, Balbir Singh wrote:
>>>
>>>> Implement migrate_vma_split_pages() to handle THP splitting during the
>>>> migration process when destination cannot allocate compound pages.
>>>>
>>>> This addresses the common scenario where migrate_vma_setup() succeeds with
>>>> MIGRATE_PFN_COMPOUND pages, but the destination device cannot allocate
>>>> large pages during the migration phase.
>>>>
>>>> Key changes:
>>>> - migrate_vma_split_pages(): Split already-isolated pages during migration
>>>> - Enhanced folio_split() and __split_unmapped_folio() with isolated
>>>>   parameter to avoid redundant unmap/remap operations
>>>>
>>>> This provides a fallback mechanism to ensure migration succeeds even when
>>>> large page allocation fails at the destination.
>>>>
>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>>> Cc: David Hildenbrand <david@redhat.com>
>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>> Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
>>>> Cc: Rakie Kim <rakie.kim@sk.com>
>>>> Cc: Byungchul Park <byungchul@sk.com>
>>>> Cc: Gregory Price <gourry@gourry.net>
>>>> Cc: Ying Huang <ying.huang@linux.alibaba.com>
>>>> Cc: Alistair Popple <apopple@nvidia.com>
>>>> Cc: Oscar Salvador <osalvador@suse.de>
>>>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>>>> Cc: Nico Pache <npache@redhat.com>
>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>> Cc: Dev Jain <dev.jain@arm.com>
>>>> Cc: Barry Song <baohua@kernel.org>
>>>> Cc: Lyude Paul <lyude@redhat.com>
>>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>>> Cc: David Airlie <airlied@gmail.com>
>>>> Cc: Simona Vetter <simona@ffwll.ch>
>>>> Cc: Ralph Campbell <rcampbell@nvidia.com>
>>>> Cc: Mika Penttilä <mpenttil@redhat.com>
>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>> Cc: Francois Dugast <francois.dugast@intel.com>
>>>>
>>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>>> ---
>>>>  include/linux/huge_mm.h | 11 +++++-
>>>>  lib/test_hmm.c          |  9 +++++
>>>>  mm/huge_memory.c        | 46 ++++++++++++----------
>>>>  mm/migrate_device.c     | 85 +++++++++++++++++++++++++++++++++++------
>>>>  4 files changed, 117 insertions(+), 34 deletions(-)
>>>>
>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>> index 2d669be7f1c8..a166be872628 100644
>>>> --- a/include/linux/huge_mm.h
>>>> +++ b/include/linux/huge_mm.h
>>>> @@ -365,8 +365,8 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>>>  		vm_flags_t vm_flags);
>>>>
>>>>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>>> -int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>> -		unsigned int new_order);
>>>> +int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>> +		unsigned int new_order, bool unmapped);
>>>>  int min_order_for_split(struct folio *folio);
>>>>  int split_folio_to_list(struct folio *folio, struct list_head *list);
>>>>  bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>> @@ -375,6 +375,13 @@ bool non_uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>>  		bool warns);
>>>>  int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
>>>>  		struct list_head *list);
>>>> +
>>>> +static inline int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>> +		unsigned int new_order)
>>>> +{
>>>> +	return __split_huge_page_to_list_to_order(page, list, new_order, false);
>>>> +}
>>>> +
>>>>  /*
>>>>   * try_folio_split - try to split a @folio at @page using non uniform split.
>>>>   * @folio: folio to be split
>>>> diff --git a/lib/test_hmm.c b/lib/test_hmm.c
>>>> index 46fa9e200db8..df429670633e 100644
>>>> --- a/lib/test_hmm.c
>>>> +++ b/lib/test_hmm.c
>>>> @@ -1612,6 +1612,15 @@ static vm_fault_t dmirror_devmem_fault(struct vm_fault *vmf)
>>>>  	order = folio_order(page_folio(vmf->page));
>>>>  	nr = 1 << order;
>>>>
>>>> +	/*
>>>> +	 * When folios are partially mapped, we can't rely on the folio
>>>> +	 * order of vmf->page as the folio might not be fully split yet
>>>> +	 */
>>>> +	if (vmf->pte) {
>>>> +		order = 0;
>>>> +		nr = 1;
>>>> +	}
>>>> +
>>>>  	/*
>>>>  	 * Consider a per-cpu cache of src and dst pfns, but with
>>>>  	 * large number of cpus that might not scale well.
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index 8c95a658b3ec..022b0729f826 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -3463,15 +3463,6 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
>>>>  		new_folio->mapping = folio->mapping;
>>>>  		new_folio->index = folio->index + i;
>>>>
>>>> -		/*
>>>> -		 * page->private should not be set in tail pages. Fix up and warn once
>>>> -		 * if private is unexpectedly set.
>>>> -		 */
>>>> -		if (unlikely(new_folio->private)) {
>>>> -			VM_WARN_ON_ONCE_PAGE(true, new_head);
>>>> -			new_folio->private = NULL;
>>>> -		}
>>>> -
>>>>  		if (folio_test_swapcache(folio))
>>>>  			new_folio->swap.val = folio->swap.val + i;
>>>>
>>>> @@ -3700,6 +3691,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>>   * @lock_at: a page within @folio to be left locked to caller
>>>>   * @list: after-split folios will be put on it if non NULL
>>>>   * @uniform_split: perform uniform split or not (non-uniform split)
>>>> + * @unmapped: The pages are already unmapped, they are migration entries.
>>>>   *
>>>>   * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>>>   * It is in charge of checking whether the split is supported or not and
>>>> @@ -3715,7 +3707,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>>   */
>>>>  static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>  		struct page *split_at, struct page *lock_at,
>>>> -		struct list_head *list, bool uniform_split)
>>>> +		struct list_head *list, bool uniform_split, bool unmapped)
>>>>  {
>>>>  	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>>  	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>> @@ -3765,13 +3757,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>  		 * is taken to serialise against parallel split or collapse
>>>>  		 * operations.
>>>>  		 */
>>>> -		anon_vma = folio_get_anon_vma(folio);
>>>> -		if (!anon_vma) {
>>>> -			ret = -EBUSY;
>>>> -			goto out;
>>>> +		if (!unmapped) {
>>>> +			anon_vma = folio_get_anon_vma(folio);
>>>> +			if (!anon_vma) {
>>>> +				ret = -EBUSY;
>>>> +				goto out;
>>>> +			}
>>>> +			anon_vma_lock_write(anon_vma);
>>>>  		}
>>>>  		mapping = NULL;
>>>> -		anon_vma_lock_write(anon_vma);
>>>>  	} else {
>>>>  		unsigned int min_order;
>>>>  		gfp_t gfp;
>>>> @@ -3838,7 +3832,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>  		goto out_unlock;
>>>>  	}
>>>>
>>>> -	unmap_folio(folio);
>>>> +	if (!unmapped)
>>>> +		unmap_folio(folio);
>>>>
>>>>  	/* block interrupt reentry in xa_lock and spinlock */
>>>>  	local_irq_disable();
>>>> @@ -3925,10 +3920,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>
>>>>  			next = folio_next(new_folio);
>>>>
>>>> +			zone_device_private_split_cb(folio, new_folio);
>>>> +
>>>>  			expected_refs = folio_expected_ref_count(new_folio) + 1;
>>>>  			folio_ref_unfreeze(new_folio, expected_refs);
>>>>
>>>> -			lru_add_split_folio(folio, new_folio, lruvec, list);
>>>> +			if (!unmapped)
>>>> +				lru_add_split_folio(folio, new_folio, lruvec, list);
>>>>
>>>>  			/*
>>>>  			 * Anonymous folio with swap cache.
>>>> @@ -3959,6 +3957,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>  			__filemap_remove_folio(new_folio, NULL);
>>>>  			folio_put_refs(new_folio, nr_pages);
>>>>  		}
>>>> +
>>>> +		zone_device_private_split_cb(folio, NULL);
>>>>  		/*
>>>>  		 * Unfreeze @folio only after all page cache entries, which
>>>>  		 * used to point to it, have been updated with new folios.
>>>> @@ -3982,6 +3982,9 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>
>>>>  	local_irq_enable();
>>>>
>>>> +	if (unmapped)
>>>> +		return ret;
>>>> +
>>>>  	if (nr_shmem_dropped)
>>>>  		shmem_uncharge(mapping->host, nr_shmem_dropped);
>>>>
>>>> @@ -4072,12 +4075,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>   * Returns -EINVAL when trying to split to an order that is incompatible
>>>>   * with the folio. Splitting to order 0 is compatible with all folios.
>>>>   */
>>>> -int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>> -				     unsigned int new_order)
>>>> +int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>>> +				     unsigned int new_order, bool unmapped)
>>>>  {
>>>>  	struct folio *folio = page_folio(page);
>>>>
>>>> -	return __folio_split(folio, new_order, &folio->page, page, list, true);
>>>> +	return __folio_split(folio, new_order, &folio->page, page, list, true,
>>>> +				unmapped);
>>>>  }
>>>>
>>>>  /*
>>>> @@ -4106,7 +4110,7 @@ int folio_split(struct folio *folio, unsigned int new_order,
>>>>  		struct page *split_at, struct list_head *list)
>>>>  {
>>>>  	return __folio_split(folio, new_order, split_at, &folio->page, list,
>>>> -			false);
>>>> +			false, false);
>>>>  }
>>>>
>>>>  int min_order_for_split(struct folio *folio)
>>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>>> index 4156fd6190d2..fa42d2ebd024 100644
>>>> --- a/mm/migrate_device.c
>>>> +++ b/mm/migrate_device.c
>>>> @@ -306,6 +306,23 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>>>>  			    pgmap->owner != migrate->pgmap_owner)
>>>>  				goto next;
>>>>
>>>> +			folio = page_folio(page);
>>>> +			if (folio_test_large(folio)) {
>>>> +				int ret;
>>>> +
>>>> +				pte_unmap_unlock(ptep, ptl);
>>>> +				ret = migrate_vma_split_folio(folio,
>>>> +							  migrate->fault_page);
>>>> +
>>>> +				if (ret) {
>>>> +					ptep = pte_offset_map_lock(mm, pmdp, addr, &ptl);
>>>> +					goto next;
>>>> +				}
>>>> +
>>>> +				addr = start;
>>>> +				goto again;
>>>> +			}
>>>> +
>>>>  			mpfn = migrate_pfn(page_to_pfn(page)) |
>>>>  					MIGRATE_PFN_MIGRATE;
>>>>  			if (is_writable_device_private_entry(entry))
>>>> @@ -880,6 +897,29 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
>>>>  		src[i] &= ~MIGRATE_PFN_MIGRATE;
>>>>  	return 0;
>>>>  }
>>>> +
>>>> +static int migrate_vma_split_unmapped_folio(struct migrate_vma *migrate,
>>>> +					    unsigned long idx, unsigned long addr,
>>>> +					    struct folio *folio)
>>>> +{
>>>> +	unsigned long i;
>>>> +	unsigned long pfn;
>>>> +	unsigned long flags;
>>>> +	int ret = 0;
>>>> +
>>>> +	folio_get(folio);
>>>> +	split_huge_pmd_address(migrate->vma, addr, true);
>>>> +	ret = __split_huge_page_to_list_to_order(folio_page(folio, 0), NULL,
>>>> +							0, true);
>>>
>>> Why not just call __split_unmapped_folio() here? Then, you do not need to add
>>> a new unmapped parameter in __folio_split().
>>>
>>>
>>
>> The benefit comes from the ref count checks and freeze/unfreeze (common code) in
>> __folio_split() and also from the callbacks that are to be made to the drivers on
>> folio split. These paths are required for both mapped and unmapped folios.
>>
>> Otherwise we'd have to replicate that logic and checks again for unmapped folios
>> and handle post split processing again.
> 
> Replicating freeze/unfreeze code would be much better than adding unmapped parameter
> and new path in __folio_split(). When it comes to adding support for file-backed
> folios, are you going to use unmapped parameter to guard code for file-backed code
> in __folio_split()? Just keep piling up special paths?
> 

Adding file-backed support would require even more code duplication, hence the aim
to reuse as much as possible. I am happy to refactor the code to separate out
the unmapped part as a follow-on patch to the series.
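
One possible shape for such a refactoring, sketched in userspace C purely for
illustration: a shared core does the work common to both cases, and two thin
callers differ only in whether they unmap and remap around it. All names here
(struct toy_trace, toy_split_core(), toy_split_mapped(), toy_split_unmapped())
are hypothetical and do not correspond to functions in the series.

```c
#include <stdbool.h>

/* Hypothetical counters so the order of steps can be observed. */
struct toy_trace {
	int unmaps;
	int splits;
	int remaps;
};

/* Shared core: the work needed regardless of mapping state. */
static int toy_split_core(struct toy_trace *t, bool refs_ok)
{
	if (!refs_ok)
		return -1;	/* models a failed reference count check */
	t->splits++;
	return 0;
}

/* Caller for folios that are still mapped: unmap, split, remap. */
static int toy_split_mapped(struct toy_trace *t, bool refs_ok)
{
	int ret;

	t->unmaps++;
	ret = toy_split_core(t, refs_ok);
	t->remaps++;	/* remap whether or not the split succeeded */
	return ret;
}

/* Caller for folios already replaced by migration entries: core only. */
static int toy_split_unmapped(struct toy_trace *t, bool refs_ok)
{
	return toy_split_core(t, refs_ok);
}
```

With this layout the mapped/unmapped distinction lives in the callers rather
than in a boolean threaded through the core, which is the trade-off under
discussion above.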

Balbir