[v1 resend 08/12] mm/thp: add split during migration support

Posted by Balbir Singh 3 months ago
Support splitting pages during THP zone device migration as needed.
The common case is that, after setup, the destination is unable to
allocate MIGRATE_PFN_COMPOUND pages during the migrate phase.

Add a new routine migrate_vma_split_pages() to support the splitting
of already isolated pages. The pages being migrated are already unmapped
and marked for migration during setup (via unmap). folio_split() and
__split_unmapped_folio() take an additional isolated argument, to avoid
unmapping and remapping these pages and unlocking/putting the folio.
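
A minimal sketch of the resulting calling convention (names as in the
diff below; the migration caller is migrate_vma_split_pages()):

	/* Regular callers: the folio is mapped, split code unmaps/remaps it. */
	split_huge_page_to_list_to_order(page, NULL, 0);

	/* Migration path: the folio was already unmapped by migrate_vma_setup(),
	 * so pass isolated == true to skip unmap_folio()/remap_page() and leave
	 * the resulting folios locked for the caller.
	 */
	__split_huge_page_to_list_to_order(folio_page(folio, 0), NULL, 0, true);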

Cc: Karol Herbst <kherbst@redhat.com>
Cc: Lyude Paul <lyude@redhat.com>
Cc: Danilo Krummrich <dakr@kernel.org>
Cc: David Airlie <airlied@gmail.com>
Cc: Simona Vetter <simona@ffwll.ch>
Cc: "Jérôme Glisse" <jglisse@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Xu <peterx@redhat.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Jane Chu <jane.chu@oracle.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Donet Tom <donettom@linux.ibm.com>

Signed-off-by: Balbir Singh <balbirs@nvidia.com>
---
 include/linux/huge_mm.h | 11 ++++++--
 mm/huge_memory.c        | 54 ++++++++++++++++++++-----------------
 mm/migrate_device.c     | 59 ++++++++++++++++++++++++++++++++---------
 3 files changed, 85 insertions(+), 39 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 65a1bdf29bb9..5f55a754e57c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -343,8 +343,8 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
 		vm_flags_t vm_flags);
 
 bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
-int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
-		unsigned int new_order);
+int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
+		unsigned int new_order, bool isolated);
 int min_order_for_split(struct folio *folio);
 int split_folio_to_list(struct folio *folio, struct list_head *list);
 bool uniform_split_supported(struct folio *folio, unsigned int new_order,
@@ -353,6 +353,13 @@ bool non_uniform_split_supported(struct folio *folio, unsigned int new_order,
 		bool warns);
 int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
 		struct list_head *list);
+
+static inline int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
+		unsigned int new_order)
+{
+	return __split_huge_page_to_list_to_order(page, list, new_order, false);
+}
+
 /*
  * try_folio_split - try to split a @folio at @page using non uniform split.
  * @folio: folio to be split
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d55e36ae0c39..e00ddfed22fa 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3424,15 +3424,6 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
 		new_folio->mapping = folio->mapping;
 		new_folio->index = folio->index + i;
 
-		/*
-		 * page->private should not be set in tail pages. Fix up and warn once
-		 * if private is unexpectedly set.
-		 */
-		if (unlikely(new_folio->private)) {
-			VM_WARN_ON_ONCE_PAGE(true, new_head);
-			new_folio->private = NULL;
-		}
-
 		if (folio_test_swapcache(folio))
 			new_folio->swap.val = folio->swap.val + i;
 
@@ -3519,7 +3510,7 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
 		struct page *split_at, struct page *lock_at,
 		struct list_head *list, pgoff_t end,
 		struct xa_state *xas, struct address_space *mapping,
-		bool uniform_split)
+		bool uniform_split, bool isolated)
 {
 	struct lruvec *lruvec;
 	struct address_space *swap_cache = NULL;
@@ -3643,8 +3634,9 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
 				percpu_ref_get_many(&release->pgmap->ref,
 							(1 << new_order) - 1);
 
-			lru_add_split_folio(origin_folio, release, lruvec,
-					list);
+			if (!isolated)
+				lru_add_split_folio(origin_folio, release,
+							lruvec, list);
 
 			/* Some pages can be beyond EOF: drop them from cache */
 			if (release->index >= end) {
@@ -3697,6 +3689,12 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
 	if (nr_dropped)
 		shmem_uncharge(mapping->host, nr_dropped);
 
+	/*
+	 * Don't remap and unlock isolated folios
+	 */
+	if (isolated)
+		return ret;
+
 	remap_page(origin_folio, 1 << order,
 			folio_test_anon(origin_folio) ?
 				RMP_USE_SHARED_ZEROPAGE : 0);
@@ -3790,6 +3788,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
  * @lock_at: a page within @folio to be left locked to caller
  * @list: after-split folios will be put on it if non NULL
  * @uniform_split: perform uniform split or not (non-uniform split)
+ * @isolated: The pages are already unmapped
  *
  * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
  * It is in charge of checking whether the split is supported or not and
@@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
  */
 static int __folio_split(struct folio *folio, unsigned int new_order,
 		struct page *split_at, struct page *lock_at,
-		struct list_head *list, bool uniform_split)
+		struct list_head *list, bool uniform_split, bool isolated)
 {
 	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
 	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
@@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 		 * is taken to serialise against parallel split or collapse
 		 * operations.
 		 */
-		anon_vma = folio_get_anon_vma(folio);
-		if (!anon_vma) {
-			ret = -EBUSY;
-			goto out;
+		if (!isolated) {
+			anon_vma = folio_get_anon_vma(folio);
+			if (!anon_vma) {
+				ret = -EBUSY;
+				goto out;
+			}
+			anon_vma_lock_write(anon_vma);
 		}
 		end = -1;
 		mapping = NULL;
-		anon_vma_lock_write(anon_vma);
 	} else {
 		unsigned int min_order;
 		gfp_t gfp;
@@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 		goto out_unlock;
 	}
 
-	unmap_folio(folio);
+	if (!isolated)
+		unmap_folio(folio);
 
 	/* block interrupt reentry in xa_lock and spinlock */
 	local_irq_disable();
@@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 
 		ret = __split_unmapped_folio(folio, new_order,
 				split_at, lock_at, list, end, &xas, mapping,
-				uniform_split);
+				uniform_split, isolated);
 	} else {
 		spin_unlock(&ds_queue->split_queue_lock);
 fail:
 		if (mapping)
 			xas_unlock(&xas);
 		local_irq_enable();
-		remap_page(folio, folio_nr_pages(folio), 0);
+		if (!isolated)
+			remap_page(folio, folio_nr_pages(folio), 0);
 		ret = -EAGAIN;
 	}
 
@@ -4046,12 +4049,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
  * Returns -EINVAL when trying to split to an order that is incompatible
  * with the folio. Splitting to order 0 is compatible with all folios.
  */
-int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
-				     unsigned int new_order)
+int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
+				     unsigned int new_order, bool isolated)
 {
 	struct folio *folio = page_folio(page);
 
-	return __folio_split(folio, new_order, &folio->page, page, list, true);
+	return __folio_split(folio, new_order, &folio->page, page, list, true,
+				isolated);
 }
 
 /*
@@ -4080,7 +4084,7 @@ int folio_split(struct folio *folio, unsigned int new_order,
 		struct page *split_at, struct list_head *list)
 {
 	return __folio_split(folio, new_order, split_at, &folio->page, list,
-			false);
+			false, false);
 }
 
 int min_order_for_split(struct folio *folio)
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 41d0bd787969..acd2f03b178d 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -813,6 +813,24 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
 		src[i] &= ~MIGRATE_PFN_MIGRATE;
 	return 0;
 }
+
+static void migrate_vma_split_pages(struct migrate_vma *migrate,
+					unsigned long idx, unsigned long addr,
+					struct folio *folio)
+{
+	unsigned long i;
+	unsigned long pfn;
+	unsigned long flags;
+
+	folio_get(folio);
+	split_huge_pmd_address(migrate->vma, addr, true);
+	__split_huge_page_to_list_to_order(folio_page(folio, 0), NULL, 0, true);
+	migrate->src[idx] &= ~MIGRATE_PFN_COMPOUND;
+	flags = migrate->src[idx] & ((1UL << MIGRATE_PFN_SHIFT) - 1);
+	pfn = migrate->src[idx] >> MIGRATE_PFN_SHIFT;
+	for (i = 1; i < HPAGE_PMD_NR; i++)
+		migrate->src[i+idx] = migrate_pfn(pfn + i) | flags;
+}
 #else /* !CONFIG_ARCH_ENABLE_THP_MIGRATION */
 static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
 					 unsigned long addr,
@@ -822,6 +840,11 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
 {
 	return 0;
 }
+
+static void migrate_vma_split_pages(struct migrate_vma *migrate,
+					unsigned long idx, unsigned long addr,
+					struct folio *folio)
+{}
 #endif
 
 /*
@@ -971,8 +994,9 @@ static void __migrate_device_pages(unsigned long *src_pfns,
 				struct migrate_vma *migrate)
 {
 	struct mmu_notifier_range range;
-	unsigned long i;
+	unsigned long i, j;
 	bool notified = false;
+	unsigned long addr;
 
 	for (i = 0; i < npages; ) {
 		struct page *newpage = migrate_pfn_to_page(dst_pfns[i]);
@@ -1014,12 +1038,16 @@ static void __migrate_device_pages(unsigned long *src_pfns,
 				(!(dst_pfns[i] & MIGRATE_PFN_COMPOUND))) {
 				nr = HPAGE_PMD_NR;
 				src_pfns[i] &= ~MIGRATE_PFN_COMPOUND;
-				src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
-				goto next;
+			} else {
+				nr = 1;
 			}
 
-			migrate_vma_insert_page(migrate, addr, &dst_pfns[i],
-						&src_pfns[i]);
+			for (j = 0; j < nr && i + j < npages; j++) {
+				src_pfns[i+j] |= MIGRATE_PFN_MIGRATE;
+				migrate_vma_insert_page(migrate,
+					addr + j * PAGE_SIZE,
+					&dst_pfns[i+j], &src_pfns[i+j]);
+			}
 			goto next;
 		}
 
@@ -1041,7 +1069,9 @@ static void __migrate_device_pages(unsigned long *src_pfns,
 							 MIGRATE_PFN_COMPOUND);
 					goto next;
 				}
-				src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
+				nr = 1 << folio_order(folio);
+				addr = migrate->start + i * PAGE_SIZE;
+				migrate_vma_split_pages(migrate, i, addr, folio);
 			} else if ((src_pfns[i] & MIGRATE_PFN_MIGRATE) &&
 				(dst_pfns[i] & MIGRATE_PFN_COMPOUND) &&
 				!(src_pfns[i] & MIGRATE_PFN_COMPOUND)) {
@@ -1076,12 +1106,17 @@ static void __migrate_device_pages(unsigned long *src_pfns,
 		BUG_ON(folio_test_writeback(folio));
 
 		if (migrate && migrate->fault_page == page)
-			extra_cnt = 1;
-		r = folio_migrate_mapping(mapping, newfolio, folio, extra_cnt);
-		if (r != MIGRATEPAGE_SUCCESS)
-			src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
-		else
-			folio_migrate_flags(newfolio, folio);
+			extra_cnt++;
+		for (j = 0; j < nr && i + j < npages; j++) {
+			folio = page_folio(migrate_pfn_to_page(src_pfns[i+j]));
+			newfolio = page_folio(migrate_pfn_to_page(dst_pfns[i+j]));
+
+			r = folio_migrate_mapping(mapping, newfolio, folio, extra_cnt);
+			if (r != MIGRATEPAGE_SUCCESS)
+				src_pfns[i+j] &= ~MIGRATE_PFN_MIGRATE;
+			else
+				folio_migrate_flags(newfolio, folio);
+		}
 next:
 		i += nr;
 	}
-- 
2.49.0

Re: [v1 resend 08/12] mm/thp: add split during migration support
Posted by Zi Yan 3 months ago
On 3 Jul 2025, at 19:35, Balbir Singh wrote:

> Support splitting pages during THP zone device migration as needed.
> The common case that arises is that after setup, during migrate
> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
> pages.
>
> Add a new routine migrate_vma_split_pages() to support the splitting
> of already isolated pages. The pages being migrated are already unmapped
> and marked for migration during setup (via unmap). folio_split() and
> __split_unmapped_folio() take additional isolated arguments, to avoid
> unmapping and remaping these pages and unlocking/putting the folio.
>
> Cc: Karol Herbst <kherbst@redhat.com>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
> Cc: Jane Chu <jane.chu@oracle.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Donet Tom <donettom@linux.ibm.com>
>
> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> ---
>  include/linux/huge_mm.h | 11 ++++++--
>  mm/huge_memory.c        | 54 ++++++++++++++++++++-----------------
>  mm/migrate_device.c     | 59 ++++++++++++++++++++++++++++++++---------
>  3 files changed, 85 insertions(+), 39 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 65a1bdf29bb9..5f55a754e57c 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -343,8 +343,8 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>  		vm_flags_t vm_flags);
>
>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
> -int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> -		unsigned int new_order);
> +int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> +		unsigned int new_order, bool isolated);
>  int min_order_for_split(struct folio *folio);
>  int split_folio_to_list(struct folio *folio, struct list_head *list);
>  bool uniform_split_supported(struct folio *folio, unsigned int new_order,
> @@ -353,6 +353,13 @@ bool non_uniform_split_supported(struct folio *folio, unsigned int new_order,
>  		bool warns);
>  int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
>  		struct list_head *list);
> +
> +static inline int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> +		unsigned int new_order)
> +{
> +	return __split_huge_page_to_list_to_order(page, list, new_order, false);
> +}
> +
>  /*
>   * try_folio_split - try to split a @folio at @page using non uniform split.
>   * @folio: folio to be split
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d55e36ae0c39..e00ddfed22fa 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3424,15 +3424,6 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
>  		new_folio->mapping = folio->mapping;
>  		new_folio->index = folio->index + i;
>
> -		/*
> -		 * page->private should not be set in tail pages. Fix up and warn once
> -		 * if private is unexpectedly set.
> -		 */
> -		if (unlikely(new_folio->private)) {
> -			VM_WARN_ON_ONCE_PAGE(true, new_head);
> -			new_folio->private = NULL;
> -		}
> -
>  		if (folio_test_swapcache(folio))
>  			new_folio->swap.val = folio->swap.val + i;
>
> @@ -3519,7 +3510,7 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>  		struct page *split_at, struct page *lock_at,
>  		struct list_head *list, pgoff_t end,
>  		struct xa_state *xas, struct address_space *mapping,
> -		bool uniform_split)
> +		bool uniform_split, bool isolated)
>  {
>  	struct lruvec *lruvec;
>  	struct address_space *swap_cache = NULL;
> @@ -3643,8 +3634,9 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>  				percpu_ref_get_many(&release->pgmap->ref,
>  							(1 << new_order) - 1);
>
> -			lru_add_split_folio(origin_folio, release, lruvec,
> -					list);
> +			if (!isolated)
> +				lru_add_split_folio(origin_folio, release,
> +							lruvec, list);
>
>  			/* Some pages can be beyond EOF: drop them from cache */
>  			if (release->index >= end) {
> @@ -3697,6 +3689,12 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>  	if (nr_dropped)
>  		shmem_uncharge(mapping->host, nr_dropped);
>
> +	/*
> +	 * Don't remap and unlock isolated folios
> +	 */
> +	if (isolated)
> +		return ret;
> +
>  	remap_page(origin_folio, 1 << order,
>  			folio_test_anon(origin_folio) ?
>  				RMP_USE_SHARED_ZEROPAGE : 0);
> @@ -3790,6 +3788,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>   * @lock_at: a page within @folio to be left locked to caller
>   * @list: after-split folios will be put on it if non NULL
>   * @uniform_split: perform uniform split or not (non-uniform split)
> + * @isolated: The pages are already unmapped

s/pages/folio

Why name it isolated if the folio is unmapped? Isolated folios often mean
they are removed from LRU lists. isolated here causes confusion.

>   *
>   * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>   * It is in charge of checking whether the split is supported or not and
> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>   */
>  static int __folio_split(struct folio *folio, unsigned int new_order,
>  		struct page *split_at, struct page *lock_at,
> -		struct list_head *list, bool uniform_split)
> +		struct list_head *list, bool uniform_split, bool isolated)
>  {
>  	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>  	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>  		 * is taken to serialise against parallel split or collapse
>  		 * operations.
>  		 */
> -		anon_vma = folio_get_anon_vma(folio);
> -		if (!anon_vma) {
> -			ret = -EBUSY;
> -			goto out;
> +		if (!isolated) {
> +			anon_vma = folio_get_anon_vma(folio);
> +			if (!anon_vma) {
> +				ret = -EBUSY;
> +				goto out;
> +			}
> +			anon_vma_lock_write(anon_vma);
>  		}
>  		end = -1;
>  		mapping = NULL;
> -		anon_vma_lock_write(anon_vma);
>  	} else {
>  		unsigned int min_order;
>  		gfp_t gfp;
> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>  		goto out_unlock;
>  	}
>
> -	unmap_folio(folio);
> +	if (!isolated)
> +		unmap_folio(folio);
>
>  	/* block interrupt reentry in xa_lock and spinlock */
>  	local_irq_disable();
> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>
>  		ret = __split_unmapped_folio(folio, new_order,
>  				split_at, lock_at, list, end, &xas, mapping,
> -				uniform_split);
> +				uniform_split, isolated);
>  	} else {
>  		spin_unlock(&ds_queue->split_queue_lock);
>  fail:
>  		if (mapping)
>  			xas_unlock(&xas);
>  		local_irq_enable();
> -		remap_page(folio, folio_nr_pages(folio), 0);
> +		if (!isolated)
> +			remap_page(folio, folio_nr_pages(folio), 0);
>  		ret = -EAGAIN;
>  	}

This "isolated" special handling does not look good; I wonder if there
is a way of letting the split code handle device private folios more gracefully.
It also causes confusion: why do "isolated/unmapped" folios
not need unmap_page(), remap_page(), or unlock?


>
> @@ -4046,12 +4049,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>   * Returns -EINVAL when trying to split to an order that is incompatible
>   * with the folio. Splitting to order 0 is compatible with all folios.
>   */
> -int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> -				     unsigned int new_order)
> +int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> +				     unsigned int new_order, bool isolated)
>  {
>  	struct folio *folio = page_folio(page);
>
> -	return __folio_split(folio, new_order, &folio->page, page, list, true);
> +	return __folio_split(folio, new_order, &folio->page, page, list, true,
> +				isolated);
>  }
>
>  /*
> @@ -4080,7 +4084,7 @@ int folio_split(struct folio *folio, unsigned int new_order,
>  		struct page *split_at, struct list_head *list)
>  {
>  	return __folio_split(folio, new_order, split_at, &folio->page, list,
> -			false);
> +			false, false);
>  }
>
>  int min_order_for_split(struct folio *folio)
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 41d0bd787969..acd2f03b178d 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -813,6 +813,24 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
>  		src[i] &= ~MIGRATE_PFN_MIGRATE;
>  	return 0;
>  }
> +
> +static void migrate_vma_split_pages(struct migrate_vma *migrate,
> +					unsigned long idx, unsigned long addr,
> +					struct folio *folio)
> +{
> +	unsigned long i;
> +	unsigned long pfn;
> +	unsigned long flags;
> +
> +	folio_get(folio);
> +	split_huge_pmd_address(migrate->vma, addr, true);
> +	__split_huge_page_to_list_to_order(folio_page(folio, 0), NULL, 0, true);

If you need to split PMD entries here, why not let unmap_page() and remap_page()
in the split code do that?

--
Best Regards,
Yan, Zi
Re: [v1 resend 08/12] mm/thp: add split during migration support
Posted by Balbir Singh 3 months ago
On 7/4/25 21:24, Zi Yan wrote:
> 
> s/pages/folio
> 

Thanks, will make the changes

> Why name it isolated if the folio is unmapped? Isolated folios often mean
> they are removed from LRU lists. isolated here causes confusion.
> 

Ack, will change the name


>>   *
>>   * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>   * It is in charge of checking whether the split is supported or not and
>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>   */
>>  static int __folio_split(struct folio *folio, unsigned int new_order,
>>  		struct page *split_at, struct page *lock_at,
>> -		struct list_head *list, bool uniform_split)
>> +		struct list_head *list, bool uniform_split, bool isolated)
>>  {
>>  	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>  	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>  		 * is taken to serialise against parallel split or collapse
>>  		 * operations.
>>  		 */
>> -		anon_vma = folio_get_anon_vma(folio);
>> -		if (!anon_vma) {
>> -			ret = -EBUSY;
>> -			goto out;
>> +		if (!isolated) {
>> +			anon_vma = folio_get_anon_vma(folio);
>> +			if (!anon_vma) {
>> +				ret = -EBUSY;
>> +				goto out;
>> +			}
>> +			anon_vma_lock_write(anon_vma);
>>  		}
>>  		end = -1;
>>  		mapping = NULL;
>> -		anon_vma_lock_write(anon_vma);
>>  	} else {
>>  		unsigned int min_order;
>>  		gfp_t gfp;
>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>  		goto out_unlock;
>>  	}
>>
>> -	unmap_folio(folio);
>> +	if (!isolated)
>> +		unmap_folio(folio);
>>
>>  	/* block interrupt reentry in xa_lock and spinlock */
>>  	local_irq_disable();
>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>
>>  		ret = __split_unmapped_folio(folio, new_order,
>>  				split_at, lock_at, list, end, &xas, mapping,
>> -				uniform_split);
>> +				uniform_split, isolated);
>>  	} else {
>>  		spin_unlock(&ds_queue->split_queue_lock);
>>  fail:
>>  		if (mapping)
>>  			xas_unlock(&xas);
>>  		local_irq_enable();
>> -		remap_page(folio, folio_nr_pages(folio), 0);
>> +		if (!isolated)
>> +			remap_page(folio, folio_nr_pages(folio), 0);
>>  		ret = -EAGAIN;
>>  	}
> 
> These "isolated" special handlings does not look good, I wonder if there
> is a way of letting split code handle device private folios more gracefully.
> It also causes confusions, since why does "isolated/unmapped" folios
> not need to unmap_page(), remap_page(), or unlock?
> 
> 

There are two reasons for going down the current code path

1. if the isolated check is not present, folio_get_anon_vma will fail and cause
   the split routine to return with -EBUSY
2. Going through unmap_page(), remap_page() causes a full page table walk, which
   the migrate_device API has already done as part of the migration. The
   entries under consideration are already migration entries in this case.
   This is wasteful and in some cases unexpected.


Thanks for the review,
Balbir Singh
Re: [v1 resend 08/12] mm/thp: add split during migration support
Posted by Zi Yan 3 months ago
On 4 Jul 2025, at 20:58, Balbir Singh wrote:

> On 7/4/25 21:24, Zi Yan wrote:
>>
>> s/pages/folio
>>
>
> Thanks, will make the changes
>
>> Why name it isolated if the folio is unmapped? Isolated folios often mean
>> they are removed from LRU lists. isolated here causes confusion.
>>
>
> Ack, will change the name
>
>
>>>   *
>>>   * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>>   * It is in charge of checking whether the split is supported or not and
>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>   */
>>>  static int __folio_split(struct folio *folio, unsigned int new_order,
>>>  		struct page *split_at, struct page *lock_at,
>>> -		struct list_head *list, bool uniform_split)
>>> +		struct list_head *list, bool uniform_split, bool isolated)
>>>  {
>>>  	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>  	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>  		 * is taken to serialise against parallel split or collapse
>>>  		 * operations.
>>>  		 */
>>> -		anon_vma = folio_get_anon_vma(folio);
>>> -		if (!anon_vma) {
>>> -			ret = -EBUSY;
>>> -			goto out;
>>> +		if (!isolated) {
>>> +			anon_vma = folio_get_anon_vma(folio);
>>> +			if (!anon_vma) {
>>> +				ret = -EBUSY;
>>> +				goto out;
>>> +			}
>>> +			anon_vma_lock_write(anon_vma);
>>>  		}
>>>  		end = -1;
>>>  		mapping = NULL;
>>> -		anon_vma_lock_write(anon_vma);
>>>  	} else {
>>>  		unsigned int min_order;
>>>  		gfp_t gfp;
>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>  		goto out_unlock;
>>>  	}
>>>
>>> -	unmap_folio(folio);
>>> +	if (!isolated)
>>> +		unmap_folio(folio);
>>>
>>>  	/* block interrupt reentry in xa_lock and spinlock */
>>>  	local_irq_disable();
>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>
>>>  		ret = __split_unmapped_folio(folio, new_order,
>>>  				split_at, lock_at, list, end, &xas, mapping,
>>> -				uniform_split);
>>> +				uniform_split, isolated);
>>>  	} else {
>>>  		spin_unlock(&ds_queue->split_queue_lock);
>>>  fail:
>>>  		if (mapping)
>>>  			xas_unlock(&xas);
>>>  		local_irq_enable();
>>> -		remap_page(folio, folio_nr_pages(folio), 0);
>>> +		if (!isolated)
>>> +			remap_page(folio, folio_nr_pages(folio), 0);
>>>  		ret = -EAGAIN;
>>>  	}
>>
>> These "isolated" special handlings does not look good, I wonder if there
>> is a way of letting split code handle device private folios more gracefully.
>> It also causes confusions, since why does "isolated/unmapped" folios
>> not need to unmap_page(), remap_page(), or unlock?
>>
>>
>
> There are two reasons for going down the current code path

After thinking more, I think adding isolated/unmapped is not the right
way, since an unmapped folio is a very generic concept. If you add it,
one can easily misuse the folio split code by first unmapping a folio
and trying to split it with unmapped = true. I do not think that is
supported, and your patch does not prevent that from happening in the future.

You should teach different parts of folio split code path to handle
device private folios properly. Details are below.

>
> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
>    the split routine to return with -EBUSY

You do something below instead.

if (!anon_vma && !folio_is_device_private(folio)) {
	ret = -EBUSY;
	goto out;
} else if (anon_vma) {
	anon_vma_lock_write(anon_vma);
}

That way people can see that device private folio split needs special handling.

BTW, why can a device private folio also be anonymous? Does it mean that
if a page cache folio is migrated to device private, the kernel also
sees it as both device private and file-backed?


> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
>    the migrate_device API has already just done as a part of the migration. The
>    entries under consideration are already migration entries in this case.
>    This is wasteful and in some case unexpected.

unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split the
PMD mapping, which is what you did in migrate_vma_split_pages(). You could
probably teach either try_to_migrate() or try_to_unmap() to just split the
device private PMD mapping. Or, if that is not preferred,
you can simply call split_huge_pmd_address() when unmap_folio()
sees a device private folio.

For remap_page(), you can simply return for device private folios,
like it currently does for non-anonymous folios.


For lru_add_split_folio(), you can skip it if a device private
folio is seen.
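
I.e., something along these lines (just an untested sketch of the idea):

	/* in remap_page(): an unmapped device private folio has nothing to remap */
	if (!folio_test_anon(folio) || folio_is_device_private(folio))
		return;

	/* in __split_unmapped_folio(): device private folios are not on the LRU */
	if (!folio_is_device_private(origin_folio))
		lru_add_split_folio(origin_folio, release, lruvec, list);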

Last, for the unlock part, why do you need to keep all after-split folios
locked? It should be possible to just keep the to-be-migrated folio
locked and unlock the rest for a later retry. But I could be missing something
since I am not familiar with the device private migration code.

--
Best Regards,
Yan, Zi
Re: [v1 resend 08/12] mm/thp: add split during migration support
Posted by Balbir Singh 3 months ago
On 7/5/25 11:55, Zi Yan wrote:
> On 4 Jul 2025, at 20:58, Balbir Singh wrote:
> 
>> On 7/4/25 21:24, Zi Yan wrote:
>>>
>>> s/pages/folio
>>>
>>
>> Thanks, will make the changes
>>
>>> Why name it isolated if the folio is unmapped? Isolated folios often mean
>>> they are removed from LRU lists. isolated here causes confusion.
>>>
>>
>> Ack, will change the name
>>
>>
>>>>   *
>>>>   * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>>>   * It is in charge of checking whether the split is supported or not and
>>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>>   */
>>>>  static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>  		struct page *split_at, struct page *lock_at,
>>>> -		struct list_head *list, bool uniform_split)
>>>> +		struct list_head *list, bool uniform_split, bool isolated)
>>>>  {
>>>>  	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>>  	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>  		 * is taken to serialise against parallel split or collapse
>>>>  		 * operations.
>>>>  		 */
>>>> -		anon_vma = folio_get_anon_vma(folio);
>>>> -		if (!anon_vma) {
>>>> -			ret = -EBUSY;
>>>> -			goto out;
>>>> +		if (!isolated) {
>>>> +			anon_vma = folio_get_anon_vma(folio);
>>>> +			if (!anon_vma) {
>>>> +				ret = -EBUSY;
>>>> +				goto out;
>>>> +			}
>>>> +			anon_vma_lock_write(anon_vma);
>>>>  		}
>>>>  		end = -1;
>>>>  		mapping = NULL;
>>>> -		anon_vma_lock_write(anon_vma);
>>>>  	} else {
>>>>  		unsigned int min_order;
>>>>  		gfp_t gfp;
>>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>  		goto out_unlock;
>>>>  	}
>>>>
>>>> -	unmap_folio(folio);
>>>> +	if (!isolated)
>>>> +		unmap_folio(folio);
>>>>
>>>>  	/* block interrupt reentry in xa_lock and spinlock */
>>>>  	local_irq_disable();
>>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>
>>>>  		ret = __split_unmapped_folio(folio, new_order,
>>>>  				split_at, lock_at, list, end, &xas, mapping,
>>>> -				uniform_split);
>>>> +				uniform_split, isolated);
>>>>  	} else {
>>>>  		spin_unlock(&ds_queue->split_queue_lock);
>>>>  fail:
>>>>  		if (mapping)
>>>>  			xas_unlock(&xas);
>>>>  		local_irq_enable();
>>>> -		remap_page(folio, folio_nr_pages(folio), 0);
>>>> +		if (!isolated)
>>>> +			remap_page(folio, folio_nr_pages(folio), 0);
>>>>  		ret = -EAGAIN;
>>>>  	}
>>>
>>> These "isolated" special handlings does not look good, I wonder if there
>>> is a way of letting split code handle device private folios more gracefully.
>>> It also causes confusions, since why does "isolated/unmapped" folios
>>> not need to unmap_page(), remap_page(), or unlock?
>>>
>>>
>>
>> There are two reasons for going down the current code path
> 
> After thinking more, I think adding isolated/unmapped is not the right
> way, since unmapped folio is a very generic concept. If you add it,
> one can easily misuse the folio split code by first unmapping a folio
> and trying to split it with unmapped = true. I do not think that is
> supported and your patch does not prevent that from happening in the future.
> 

I don't understand the misuse case you mention; I assume you mean someone can
get the usage wrong? The responsibility is on the caller to do the right thing
when calling the API with unmapped.

> You should teach different parts of folio split code path to handle
> device private folios properly. Details are below.
> 
>>
>> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
>>    the split routine to return with -EBUSY
> 
> You do something below instead.
> 
> if (!anon_vma && !folio_is_device_private(folio)) {
> 	ret = -EBUSY;
> 	goto out;
> } else if (anon_vma) {
> 	anon_vma_lock_write(anon_vma);
> }
> 

folio_get_anon_vma() cannot be called for unmapped folios. In our case the page has
already been unmapped. Is there a reason why you mix anon_vma_lock_write() with
the check for device private folios?

> People can know device private folio split needs a special handling.
> 
> BTW, why a device private folio can also be anonymous? Does it mean
> if a page cache folio is migrated to device private, kernel also
> sees it as both device private and file-backed?
> 

FYI: device private folios only work with anonymous private pages, hence
the name device private.

> 
>> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
>>    the migrate_device API has already just done as a part of the migration. The
>>    entries under consideration are already migration entries in this case.
>>    This is wasteful and in some case unexpected.
> 
> unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split
> PMD mapping, which you did in migrate_vma_split_pages(). You probably
> can teach either try_to_migrate() or try_to_unmap() to just split
> device private PMD mapping. Or if that is not preferred,
> you can simply call split_huge_pmd_address() when unmap_folio()
> sees a device private folio.
> 
> For remap_page(), you can simply return for device private folios
> like it is currently doing for non anonymous folios.
> 

Doing a full rmap walk does not make sense with unmap_folio() and
remap_page(), because

1. We need to do a page table walk/rmap walk again
2. We'll need special handling of migration <-> migration entries
   in the rmap handling (set/remove migration ptes)
3. In this context, the code is already in the middle of migration,
   so trying to do that again does not make sense.


> 
> For lru_add_split_folio(), you can skip it if a device private
> folio is seen.
> 
> Last, for unlock part, why do you need to keep all after-split folios
> locked? It should be possible to just keep the to-be-migrated folio
> locked and unlock the rest for a later retry. But I could miss something
> since I am not familiar with device private migration code.
> 

Not sure I follow this comment

Balbir
Re: [v1 resend 08/12] mm/thp: add split during migration support
Posted by Zi Yan 3 months ago
On 5 Jul 2025, at 21:15, Balbir Singh wrote:

> On 7/5/25 11:55, Zi Yan wrote:
>> On 4 Jul 2025, at 20:58, Balbir Singh wrote:
>>
>>> On 7/4/25 21:24, Zi Yan wrote:
>>>>
>>>> s/pages/folio
>>>>
>>>
>>> Thanks, will make the changes
>>>
>>>> Why name it isolated if the folio is unmapped? Isolated folios often mean
>>>> they are removed from LRU lists. isolated here causes confusion.
>>>>
>>>
>>> Ack, will change the name
>>>
>>>
>>>>>   *
>>>>>   * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>>>>   * It is in charge of checking whether the split is supported or not and
>>>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>>>   */
>>>>>  static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>  		struct page *split_at, struct page *lock_at,
>>>>> -		struct list_head *list, bool uniform_split)
>>>>> +		struct list_head *list, bool uniform_split, bool isolated)
>>>>>  {
>>>>>  	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>>>  	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>  		 * is taken to serialise against parallel split or collapse
>>>>>  		 * operations.
>>>>>  		 */
>>>>> -		anon_vma = folio_get_anon_vma(folio);
>>>>> -		if (!anon_vma) {
>>>>> -			ret = -EBUSY;
>>>>> -			goto out;
>>>>> +		if (!isolated) {
>>>>> +			anon_vma = folio_get_anon_vma(folio);
>>>>> +			if (!anon_vma) {
>>>>> +				ret = -EBUSY;
>>>>> +				goto out;
>>>>> +			}
>>>>> +			anon_vma_lock_write(anon_vma);
>>>>>  		}
>>>>>  		end = -1;
>>>>>  		mapping = NULL;
>>>>> -		anon_vma_lock_write(anon_vma);
>>>>>  	} else {
>>>>>  		unsigned int min_order;
>>>>>  		gfp_t gfp;
>>>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>  		goto out_unlock;
>>>>>  	}
>>>>>
>>>>> -	unmap_folio(folio);
>>>>> +	if (!isolated)
>>>>> +		unmap_folio(folio);
>>>>>
>>>>>  	/* block interrupt reentry in xa_lock and spinlock */
>>>>>  	local_irq_disable();
>>>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>
>>>>>  		ret = __split_unmapped_folio(folio, new_order,
>>>>>  				split_at, lock_at, list, end, &xas, mapping,
>>>>> -				uniform_split);
>>>>> +				uniform_split, isolated);
>>>>>  	} else {
>>>>>  		spin_unlock(&ds_queue->split_queue_lock);
>>>>>  fail:
>>>>>  		if (mapping)
>>>>>  			xas_unlock(&xas);
>>>>>  		local_irq_enable();
>>>>> -		remap_page(folio, folio_nr_pages(folio), 0);
>>>>> +		if (!isolated)
>>>>> +			remap_page(folio, folio_nr_pages(folio), 0);
>>>>>  		ret = -EAGAIN;
>>>>>  	}
>>>>
>>>> These "isolated" special handlings does not look good, I wonder if there
>>>> is a way of letting split code handle device private folios more gracefully.
>>>> It also causes confusions, since why does "isolated/unmapped" folios
>>>> not need to unmap_page(), remap_page(), or unlock?
>>>>
>>>>
>>>
>>> There are two reasons for going down the current code path
>>
>> After thinking more, I think adding isolated/unmapped is not the right
>> way, since unmapped folio is a very generic concept. If you add it,
>> one can easily misuse the folio split code by first unmapping a folio
>> and trying to split it with unmapped = true. I do not think that is
>> supported and your patch does not prevent that from happening in the future.
>>
>
> I don't understand the misuse case you mention, I assume you mean someone can
> get the usage wrong? The responsibility is on the caller to do the right thing
> if calling the API with unmapped

Before your patch, there was no use case for splitting unmapped folios.
Your patch only adds support for device private page split, not for any unmapped
folio split. So using a generic isolated/unmapped parameter is not OK.

>
>> You should teach different parts of folio split code path to handle
>> device private folios properly. Details are below.
>>
>>>
>>> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
>>>    the split routine to return with -EBUSY
>>
>> You do something below instead.
>>
>> if (!anon_vma && !folio_is_device_private(folio)) {
>> 	ret = -EBUSY;
>> 	goto out;
>> } else if (anon_vma) {
>> 	anon_vma_lock_write(anon_vma);
>> }
>>
>
> folio_get_anon() cannot be called for unmapped folios. In our case the page has
> already been unmapped. Is there a reason why you mix anon_vma_lock_write with
> the check for device private folios?

Oh, I did not notice that anon_vma = folio_get_anon_vma(folio) is also
in the if (!isolated) branch. In that case, just do

if (folio_is_device_private(folio)) {
...
} else if (is_anon) {
...
} else {
...
}
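
Filling in the branches based on the hunk above (just a sketch; I am
assuming device private folios stay anonymous, so end/mapping are set as
in the anon case):

	if (folio_is_device_private(folio)) {
		/* already unmapped by the migration code, no anon_vma to pin */
		end = -1;
		mapping = NULL;
	} else if (folio_test_anon(folio)) {
		anon_vma = folio_get_anon_vma(folio);
		if (!anon_vma) {
			ret = -EBUSY;
			goto out;
		}
		anon_vma_lock_write(anon_vma);
		end = -1;
		mapping = NULL;
	} else {
		/* file-backed: keep the existing mapping/xas handling */
	}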

>
>> People can know device private folio split needs a special handling.
>>
>> BTW, why a device private folio can also be anonymous? Does it mean
>> if a page cache folio is migrated to device private, kernel also
>> sees it as both device private and file-backed?
>>
>
> FYI: device private folios only work with anonymous private pages, hence
> the name device private.

OK.

>
>>
>>> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
>>>    the migrate_device API has already just done as a part of the migration. The
>>>    entries under consideration are already migration entries in this case.
>>>    This is wasteful and in some case unexpected.
>>
>> unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split
>> PMD mapping, which you did in migrate_vma_split_pages(). You probably
>> can teach either try_to_migrate() or try_to_unmap() to just split
>> device private PMD mapping. Or if that is not preferred,
>> you can simply call split_huge_pmd_address() when unmap_folio()
>> sees a device private folio.
>>
>> For remap_page(), you can simply return for device private folios
>> like it is currently doing for non anonymous folios.
>>
>
> Doing a full rmap walk does not make sense with unmap_folio() and
> remap_folio(), because
>
> 1. We need to do a page table walk/rmap walk again
> 2. We'll need special handling of migration <-> migration entries
>    in the rmap handling (set/remove migration ptes)
> 3. In this context, the code is already in the middle of migration,
>    so trying to do that again does not make sense.

Why do the split in the middle of migration? The existing split code
assumes that to-be-split folios are mapped.

What prevents doing the split before migration?

>
>
>>
>> For lru_add_split_folio(), you can skip it if a device private
>> folio is seen.
>>
>> Last, for unlock part, why do you need to keep all after-split folios
>> locked? It should be possible to just keep the to-be-migrated folio
>> locked and unlock the rest for a later retry. But I could miss something
>> since I am not familiar with device private migration code.
>>
>
> Not sure I follow this comment

Because the patch is doing the split in the middle of migration, which the
existing split code has never supported. My comment is based on the assumption
that the split is done when a folio is mapped.

--
Best Regards,
Yan, Zi
Re: [v1 resend 08/12] mm/thp: add split during migration support
Posted by Balbir Singh 3 months ago
On 7/6/25 11:34, Zi Yan wrote:
> On 5 Jul 2025, at 21:15, Balbir Singh wrote:
> 
>> On 7/5/25 11:55, Zi Yan wrote:
>>> On 4 Jul 2025, at 20:58, Balbir Singh wrote:
>>>
>>>> On 7/4/25 21:24, Zi Yan wrote:
>>>>>
>>>>> s/pages/folio
>>>>>
>>>>
>>>> Thanks, will make the changes
>>>>
>>>>> Why name it isolated if the folio is unmapped? Isolated folios often mean
>>>>> they are removed from LRU lists. isolated here causes confusion.
>>>>>
>>>>
>>>> Ack, will change the name
>>>>
>>>>
>>>>>>   *
>>>>>>   * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>>>>>   * It is in charge of checking whether the split is supported or not and
>>>>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>>>>   */
>>>>>>  static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>  		struct page *split_at, struct page *lock_at,
>>>>>> -		struct list_head *list, bool uniform_split)
>>>>>> +		struct list_head *list, bool uniform_split, bool isolated)
>>>>>>  {
>>>>>>  	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>>>>  	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>  		 * is taken to serialise against parallel split or collapse
>>>>>>  		 * operations.
>>>>>>  		 */
>>>>>> -		anon_vma = folio_get_anon_vma(folio);
>>>>>> -		if (!anon_vma) {
>>>>>> -			ret = -EBUSY;
>>>>>> -			goto out;
>>>>>> +		if (!isolated) {
>>>>>> +			anon_vma = folio_get_anon_vma(folio);
>>>>>> +			if (!anon_vma) {
>>>>>> +				ret = -EBUSY;
>>>>>> +				goto out;
>>>>>> +			}
>>>>>> +			anon_vma_lock_write(anon_vma);
>>>>>>  		}
>>>>>>  		end = -1;
>>>>>>  		mapping = NULL;
>>>>>> -		anon_vma_lock_write(anon_vma);
>>>>>>  	} else {
>>>>>>  		unsigned int min_order;
>>>>>>  		gfp_t gfp;
>>>>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>  		goto out_unlock;
>>>>>>  	}
>>>>>>
>>>>>> -	unmap_folio(folio);
>>>>>> +	if (!isolated)
>>>>>> +		unmap_folio(folio);
>>>>>>
>>>>>>  	/* block interrupt reentry in xa_lock and spinlock */
>>>>>>  	local_irq_disable();
>>>>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>
>>>>>>  		ret = __split_unmapped_folio(folio, new_order,
>>>>>>  				split_at, lock_at, list, end, &xas, mapping,
>>>>>> -				uniform_split);
>>>>>> +				uniform_split, isolated);
>>>>>>  	} else {
>>>>>>  		spin_unlock(&ds_queue->split_queue_lock);
>>>>>>  fail:
>>>>>>  		if (mapping)
>>>>>>  			xas_unlock(&xas);
>>>>>>  		local_irq_enable();
>>>>>> -		remap_page(folio, folio_nr_pages(folio), 0);
>>>>>> +		if (!isolated)
>>>>>> +			remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>  		ret = -EAGAIN;
>>>>>>  	}
>>>>>
>>>>> These "isolated" special handlings does not look good, I wonder if there
>>>>> is a way of letting split code handle device private folios more gracefully.
>>>>> It also causes confusions, since why does "isolated/unmapped" folios
>>>>> not need to unmap_page(), remap_page(), or unlock?
>>>>>
>>>>>
>>>>
>>>> There are two reasons for going down the current code path
>>>
>>> After thinking more, I think adding isolated/unmapped is not the right
>>> way, since unmapped folio is a very generic concept. If you add it,
>>> one can easily misuse the folio split code by first unmapping a folio
>>> and trying to split it with unmapped = true. I do not think that is
>>> supported and your patch does not prevent that from happening in the future.
>>>
>>
>> I don't understand the misuse case you mention, I assume you mean someone can
>> get the usage wrong? The responsibility is on the caller to do the right thing
>> if calling the API with unmapped
> 
> Before your patch, there is no use case of splitting unmapped folios.
> Your patch only adds support for device private page split, not any unmapped
> folio split. So using a generic isolated/unmapped parameter is not OK.
> 

There is a use for splitting unmapped folios (see below)

>>
>>> You should teach different parts of folio split code path to handle
>>> device private folios properly. Details are below.
>>>
>>>>
>>>> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
>>>>    the split routine to return with -EBUSY
>>>
>>> You do something below instead.
>>>
>>> if (!anon_vma && !folio_is_device_private(folio)) {
>>> 	ret = -EBUSY;
>>> 	goto out;
>>> } else if (anon_vma) {
>>> 	anon_vma_lock_write(anon_vma);
>>> }
>>>
>>
>> folio_get_anon() cannot be called for unmapped folios. In our case the page has
>> already been unmapped. Is there a reason why you mix anon_vma_lock_write with
>> the check for device private folios?
> 
> Oh, I did not notice that anon_vma = folio_get_anon_vma(folio) is also
> in if (!isolated) branch. In that case, just do
> 
> if (folio_is_device_private(folio) {
> ...
> } else if (is_anon) {
> ...
> } else {
> ...
> }
> 
>>
>>> People can know device private folio split needs a special handling.
>>>
>>> BTW, why a device private folio can also be anonymous? Does it mean
>>> if a page cache folio is migrated to device private, kernel also
>>> sees it as both device private and file-backed?
>>>
>>
>> FYI: device private folios only work with anonymous private pages, hence
>> the name device private.
> 
> OK.
> 
>>
>>>
>>>> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
>>>>    the migrate_device API has already just done as a part of the migration. The
>>>>    entries under consideration are already migration entries in this case.
>>>>    This is wasteful and in some case unexpected.
>>>
>>> unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split
>>> PMD mapping, which you did in migrate_vma_split_pages(). You probably
>>> can teach either try_to_migrate() or try_to_unmap() to just split
>>> device private PMD mapping. Or if that is not preferred,
>>> you can simply call split_huge_pmd_address() when unmap_folio()
>>> sees a device private folio.
>>>
>>> For remap_page(), you can simply return for device private folios
>>> like it is currently doing for non anonymous folios.
>>>
>>
>> Doing a full rmap walk does not make sense with unmap_folio() and
>> remap_folio(), because
>>
>> 1. We need to do a page table walk/rmap walk again
>> 2. We'll need special handling of migration <-> migration entries
>>    in the rmap handling (set/remove migration ptes)
>> 3. In this context, the code is already in the middle of migration,
>>    so trying to do that again does not make sense.
> 
> Why doing split in the middle of migration? Existing split code
> assumes to-be-split folios are mapped.
> 
> What prevents doing split before migration?
> 

The code does do a split prior to migration if THP selection fails

Please see https://lore.kernel.org/lkml/20250703233511.2028395-5-balbirs@nvidia.com/
and the fallback part which calls split_folio()

But the case under consideration is special since the device needs to allocate
the corresponding pfns as well. The changelog mentions it:

"The common case that arises is that after setup, during migrate
the destination might not be able to allocate MIGRATE_PFN_COMPOUND
pages."

I can expand on it, because migrate_vma() is a multi-phase operation

1. migrate_vma_setup()
2. migrate_vma_pages()
3. migrate_vma_finalize()

It can happen that, when the destination pfns are allocated, the destination
is not able to allocate a large page, so we do the split in migrate_vma_pages().

The pages have been unmapped and collected in migrate_vma_setup()
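
To make the flow concrete, the driver side looks roughly like this (a
sketch only; vma, addr, src_pfns and dst_pfns are placeholders, the
struct fields are the ones from include/linux/migrate.h):

	struct migrate_vma args = {
		.vma	= vma,
		.start	= addr,
		.end	= addr + HPAGE_PMD_SIZE,
		.src	= src_pfns,
		.dst	= dst_pfns,
		.flags	= MIGRATE_VMA_SELECT_SYSTEM,
	};

	migrate_vma_setup(&args);	/* unmaps and collects the THP */
	/* the driver fills args.dst[]; if it cannot allocate a
	 * MIGRATE_PFN_COMPOUND destination it falls back to order-0 pages
	 */
	migrate_vma_pages(&args);	/* splits the source folio if needed */
	migrate_vma_finalize(&args);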

The next patch in the series 9/12 (https://lore.kernel.org/lkml/20250703233511.2028395-10-balbirs@nvidia.com/)
tests the split and emulates a failure on the device side to allocate large pages
and tests it in 10/12 (https://lore.kernel.org/lkml/20250703233511.2028395-11-balbirs@nvidia.com/)


>>
>>
>>>
>>> For lru_add_split_folio(), you can skip it if a device private
>>> folio is seen.
>>>
>>> Last, for unlock part, why do you need to keep all after-split folios
>>> locked? It should be possible to just keep the to-be-migrated folio
>>> locked and unlock the rest for a later retry. But I could miss something
>>> since I am not familiar with device private migration code.
>>>
>>
>> Not sure I follow this comment
> 
> Because the patch is doing split in the middle of migration and existing
> split code never supports. My comment is based on the assumption that
> the split is done when a folio is mapped.
> 

Understood, hopefully I've explained the reason for the split in the middle
of migration

Thanks for the detailed review
Balbir
Re: [v1 resend 08/12] mm/thp: add split during migration support
Posted by Matthew Brost 2 months, 3 weeks ago
On Sun, Jul 06, 2025 at 11:47:10AM +1000, Balbir Singh wrote:
> On 7/6/25 11:34, Zi Yan wrote:
> > On 5 Jul 2025, at 21:15, Balbir Singh wrote:
> > 
> >> On 7/5/25 11:55, Zi Yan wrote:
> >>> On 4 Jul 2025, at 20:58, Balbir Singh wrote:
> >>>
> >>>> On 7/4/25 21:24, Zi Yan wrote:
> >>>>>
> >>>>> s/pages/folio
> >>>>>
> >>>>
> >>>> Thanks, will make the changes
> >>>>
> >>>>> Why name it isolated if the folio is unmapped? Isolated folios often mean
> >>>>> they are removed from LRU lists. isolated here causes confusion.
> >>>>>
> >>>>
> >>>> Ack, will change the name
> >>>>
> >>>>
> >>>>>>   *
> >>>>>>   * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
> >>>>>>   * It is in charge of checking whether the split is supported or not and
> >>>>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
> >>>>>>   */
> >>>>>>  static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>  		struct page *split_at, struct page *lock_at,
> >>>>>> -		struct list_head *list, bool uniform_split)
> >>>>>> +		struct list_head *list, bool uniform_split, bool isolated)
> >>>>>>  {
> >>>>>>  	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
> >>>>>>  	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
> >>>>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>  		 * is taken to serialise against parallel split or collapse
> >>>>>>  		 * operations.
> >>>>>>  		 */
> >>>>>> -		anon_vma = folio_get_anon_vma(folio);
> >>>>>> -		if (!anon_vma) {
> >>>>>> -			ret = -EBUSY;
> >>>>>> -			goto out;
> >>>>>> +		if (!isolated) {
> >>>>>> +			anon_vma = folio_get_anon_vma(folio);
> >>>>>> +			if (!anon_vma) {
> >>>>>> +				ret = -EBUSY;
> >>>>>> +				goto out;
> >>>>>> +			}
> >>>>>> +			anon_vma_lock_write(anon_vma);
> >>>>>>  		}
> >>>>>>  		end = -1;
> >>>>>>  		mapping = NULL;
> >>>>>> -		anon_vma_lock_write(anon_vma);
> >>>>>>  	} else {
> >>>>>>  		unsigned int min_order;
> >>>>>>  		gfp_t gfp;
> >>>>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>  		goto out_unlock;
> >>>>>>  	}
> >>>>>>
> >>>>>> -	unmap_folio(folio);
> >>>>>> +	if (!isolated)
> >>>>>> +		unmap_folio(folio);
> >>>>>>
> >>>>>>  	/* block interrupt reentry in xa_lock and spinlock */
> >>>>>>  	local_irq_disable();
> >>>>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>
> >>>>>>  		ret = __split_unmapped_folio(folio, new_order,
> >>>>>>  				split_at, lock_at, list, end, &xas, mapping,
> >>>>>> -				uniform_split);
> >>>>>> +				uniform_split, isolated);
> >>>>>>  	} else {
> >>>>>>  		spin_unlock(&ds_queue->split_queue_lock);
> >>>>>>  fail:
> >>>>>>  		if (mapping)
> >>>>>>  			xas_unlock(&xas);
> >>>>>>  		local_irq_enable();
> >>>>>> -		remap_page(folio, folio_nr_pages(folio), 0);
> >>>>>> +		if (!isolated)
> >>>>>> +			remap_page(folio, folio_nr_pages(folio), 0);
> >>>>>>  		ret = -EAGAIN;
> >>>>>>  	}
> >>>>>
> >>>>> These "isolated" special handlings does not look good, I wonder if there
> >>>>> is a way of letting split code handle device private folios more gracefully.
> >>>>> It also causes confusions, since why does "isolated/unmapped" folios
> >>>>> not need to unmap_page(), remap_page(), or unlock?
> >>>>>
> >>>>>
> >>>>
> >>>> There are two reasons for going down the current code path
> >>>
> >>> After thinking more, I think adding isolated/unmapped is not the right
> >>> way, since unmapped folio is a very generic concept. If you add it,
> >>> one can easily misuse the folio split code by first unmapping a folio
> >>> and trying to split it with unmapped = true. I do not think that is
> >>> supported and your patch does not prevent that from happening in the future.
> >>>
> >>
> >> I don't understand the misuse case you mention, I assume you mean someone can
> >> get the usage wrong? The responsibility is on the caller to do the right thing
> >> if calling the API with unmapped
> > 
> > Before your patch, there is no use case of splitting unmapped folios.
> > Your patch only adds support for device private page split, not any unmapped
> > folio split. So using a generic isolated/unmapped parameter is not OK.
> > 
> 
> There is a use for splitting unmapped folios (see below)
> 
> >>
> >>> You should teach different parts of folio split code path to handle
> >>> device private folios properly. Details are below.
> >>>
> >>>>
> >>>> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
> >>>>    the split routine to return with -EBUSY
> >>>
> >>> You do something below instead.
> >>>
> >>> if (!anon_vma && !folio_is_device_private(folio)) {
> >>> 	ret = -EBUSY;
> >>> 	goto out;
> >>> } else if (anon_vma) {
> >>> 	anon_vma_lock_write(anon_vma);
> >>> }
> >>>
> >>
> >> folio_get_anon() cannot be called for unmapped folios. In our case the page has
> >> already been unmapped. Is there a reason why you mix anon_vma_lock_write with
> >> the check for device private folios?
> > 
> > Oh, I did not notice that anon_vma = folio_get_anon_vma(folio) is also
> > in if (!isolated) branch. In that case, just do
> > 
> > if (folio_is_device_private(folio) {
> > ...
> > } else if (is_anon) {
> > ...
> > } else {
> > ...
> > }
> > 
> >>
> >>> People can know device private folio split needs a special handling.
> >>>
> >>> BTW, why a device private folio can also be anonymous? Does it mean
> >>> if a page cache folio is migrated to device private, kernel also
> >>> sees it as both device private and file-backed?
> >>>
> >>
> >> FYI: device private folios only work with anonymous private pages, hence
> >> the name device private.
> > 
> > OK.
> > 
> >>
> >>>
> >>>> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
> >>>>    the migrate_device API has already just done as a part of the migration. The
> >>>>    entries under consideration are already migration entries in this case.
> >>>>    This is wasteful and in some case unexpected.
> >>>
> >>> unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split
> >>> PMD mapping, which you did in migrate_vma_split_pages(). You probably
> >>> can teach either try_to_migrate() or try_to_unmap() to just split
> >>> device private PMD mapping. Or if that is not preferred,
> >>> you can simply call split_huge_pmd_address() when unmap_folio()
> >>> sees a device private folio.
> >>>
> >>> For remap_page(), you can simply return for device private folios
> >>> like it is currently doing for non anonymous folios.
> >>>
> >>
> >> Doing a full rmap walk does not make sense with unmap_folio() and
> >> remap_folio(), because
> >>
> >> 1. We need to do a page table walk/rmap walk again
> >> 2. We'll need special handling of migration <-> migration entries
> >>    in the rmap handling (set/remove migration ptes)
> >> 3. In this context, the code is already in the middle of migration,
> >>    so trying to do that again does not make sense.
> > 
> > Why doing split in the middle of migration? Existing split code
> > assumes to-be-split folios are mapped.
> > 
> > What prevents doing split before migration?
> > 
> 
> The code does do a split prior to migration if THP selection fails
> 
> Please see https://lore.kernel.org/lkml/20250703233511.2028395-5-balbirs@nvidia.com/
> and the fallback part which calls split_folio()
> 
> But the case under consideration is special since the device needs to allocate
> corresponding pfn's as well. The changelog mentions it:
> 
> "The common case that arises is that after setup, during migrate
> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
> pages."
> 
> I can expand on it, because migrate_vma() is a multi-phase operation
> 
> 1. migrate_vma_setup()
> 2. migrate_vma_pages()
> 3. migrate_vma_finalize()
> 
> It can so happen that when we get the destination pfn's allocated the destination
> might not be able to allocate a large page, so we do the split in migrate_vma_pages().
> 
> The pages have been unmapped and collected in migrate_vma_setup()
> 
> The next patch in the series 9/12 (https://lore.kernel.org/lkml/20250703233511.2028395-10-balbirs@nvidia.com/)
> tests the split and emulates a failure on the device side to allocate large pages
> and tests it in 10/12 (https://lore.kernel.org/lkml/20250703233511.2028395-11-balbirs@nvidia.com/)
> 

Another use case I’ve seen is when a previously allocated high-order
folio, now in the free memory pool, is reallocated as a lower-order
page. For example, a 2MB fault allocates a folio, the memory is later
freed, and then a 4KB fault reuses a page from that previously allocated
folio. This will actually be quite common in Xe / GPU SVM. In such
cases, the folio in an unmapped state needs to be split. I’d suggest a
migrate_device_* helper built on top of the core MM __split_folio
function be added here.

Matt

> 
> >>
> >>
> >>>
> >>> For lru_add_split_folio(), you can skip it if a device private
> >>> folio is seen.
> >>>
> >>> Last, for unlock part, why do you need to keep all after-split folios
> >>> locked? It should be possible to just keep the to-be-migrated folio
> >>> locked and unlock the rest for a later retry. But I could miss something
> >>> since I am not familiar with device private migration code.
> >>>
> >>
> >> Not sure I follow this comment
> > 
> > Because the patch is doing split in the middle of migration and existing
> > split code never supports. My comment is based on the assumption that
> > the split is done when a folio is mapped.
> > 
> 
> Understood, hopefully I've explained the reason for the split in the middle
> of migration
> 
> Thanks for the detailed review
> Balbir
Re: [v1 resend 08/12] mm/thp: add split during migration support
Posted by Zi Yan 2 months, 3 weeks ago
On 16 Jul 2025, at 1:34, Matthew Brost wrote:

> On Sun, Jul 06, 2025 at 11:47:10AM +1000, Balbir Singh wrote:
>> On 7/6/25 11:34, Zi Yan wrote:
>>> On 5 Jul 2025, at 21:15, Balbir Singh wrote:
>>>
>>>> On 7/5/25 11:55, Zi Yan wrote:
>>>>> On 4 Jul 2025, at 20:58, Balbir Singh wrote:
>>>>>
>>>>>> On 7/4/25 21:24, Zi Yan wrote:
>>>>>>>
>>>>>>> s/pages/folio
>>>>>>>
>>>>>>
>>>>>> Thanks, will make the changes
>>>>>>
>>>>>>> Why name it isolated if the folio is unmapped? Isolated folios often mean
>>>>>>> they are removed from LRU lists. isolated here causes confusion.
>>>>>>>
>>>>>>
>>>>>> Ack, will change the name
>>>>>>
>>>>>>
>>>>>>>>   *
>>>>>>>>   * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>>>>>>>   * It is in charge of checking whether the split is supported or not and
>>>>>>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>>>>>>   */
>>>>>>>>  static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>  		struct page *split_at, struct page *lock_at,
>>>>>>>> -		struct list_head *list, bool uniform_split)
>>>>>>>> +		struct list_head *list, bool uniform_split, bool isolated)
>>>>>>>>  {
>>>>>>>>  	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>>>>>>  	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>>>>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>  		 * is taken to serialise against parallel split or collapse
>>>>>>>>  		 * operations.
>>>>>>>>  		 */
>>>>>>>> -		anon_vma = folio_get_anon_vma(folio);
>>>>>>>> -		if (!anon_vma) {
>>>>>>>> -			ret = -EBUSY;
>>>>>>>> -			goto out;
>>>>>>>> +		if (!isolated) {
>>>>>>>> +			anon_vma = folio_get_anon_vma(folio);
>>>>>>>> +			if (!anon_vma) {
>>>>>>>> +				ret = -EBUSY;
>>>>>>>> +				goto out;
>>>>>>>> +			}
>>>>>>>> +			anon_vma_lock_write(anon_vma);
>>>>>>>>  		}
>>>>>>>>  		end = -1;
>>>>>>>>  		mapping = NULL;
>>>>>>>> -		anon_vma_lock_write(anon_vma);
>>>>>>>>  	} else {
>>>>>>>>  		unsigned int min_order;
>>>>>>>>  		gfp_t gfp;
>>>>>>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>  		goto out_unlock;
>>>>>>>>  	}
>>>>>>>>
>>>>>>>> -	unmap_folio(folio);
>>>>>>>> +	if (!isolated)
>>>>>>>> +		unmap_folio(folio);
>>>>>>>>
>>>>>>>>  	/* block interrupt reentry in xa_lock and spinlock */
>>>>>>>>  	local_irq_disable();
>>>>>>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>
>>>>>>>>  		ret = __split_unmapped_folio(folio, new_order,
>>>>>>>>  				split_at, lock_at, list, end, &xas, mapping,
>>>>>>>> -				uniform_split);
>>>>>>>> +				uniform_split, isolated);
>>>>>>>>  	} else {
>>>>>>>>  		spin_unlock(&ds_queue->split_queue_lock);
>>>>>>>>  fail:
>>>>>>>>  		if (mapping)
>>>>>>>>  			xas_unlock(&xas);
>>>>>>>>  		local_irq_enable();
>>>>>>>> -		remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>>> +		if (!isolated)
>>>>>>>> +			remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>>>  		ret = -EAGAIN;
>>>>>>>>  	}
>>>>>>>
>>>>>>> These "isolated" special handlings does not look good, I wonder if there
>>>>>>> is a way of letting split code handle device private folios more gracefully.
>>>>>>> It also causes confusions, since why does "isolated/unmapped" folios
>>>>>>> not need to unmap_page(), remap_page(), or unlock?
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> There are two reasons for going down the current code path
>>>>>
>>>>> After thinking more, I think adding isolated/unmapped is not the right
>>>>> way, since unmapped folio is a very generic concept. If you add it,
>>>>> one can easily misuse the folio split code by first unmapping a folio
>>>>> and trying to split it with unmapped = true. I do not think that is
>>>>> supported and your patch does not prevent that from happening in the future.
>>>>>
>>>>
>>>> I don't understand the misuse case you mention, I assume you mean someone can
>>>> get the usage wrong? The responsibility is on the caller to do the right thing
>>>> if calling the API with unmapped
>>>
>>> Before your patch, there is no use case of splitting unmapped folios.
>>> Your patch only adds support for device private page split, not any unmapped
>>> folio split. So using a generic isolated/unmapped parameter is not OK.
>>>
>>
>> There is a use for splitting unmapped folios (see below)
>>
>>>>
>>>>> You should teach different parts of folio split code path to handle
>>>>> device private folios properly. Details are below.
>>>>>
>>>>>>
>>>>>> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
>>>>>>    the split routine to return with -EBUSY
>>>>>
>>>>> You do something below instead.
>>>>>
>>>>> if (!anon_vma && !folio_is_device_private(folio)) {
>>>>> 	ret = -EBUSY;
>>>>> 	goto out;
>>>>> } else if (anon_vma) {
>>>>> 	anon_vma_lock_write(anon_vma);
>>>>> }
>>>>>
>>>>
>>>> folio_get_anon() cannot be called for unmapped folios. In our case the page has
>>>> already been unmapped. Is there a reason why you mix anon_vma_lock_write with
>>>> the check for device private folios?
>>>
>>> Oh, I did not notice that anon_vma = folio_get_anon_vma(folio) is also
>>> in if (!isolated) branch. In that case, just do
>>>
>>> if (folio_is_device_private(folio) {
>>> ...
>>> } else if (is_anon) {
>>> ...
>>> } else {
>>> ...
>>> }
>>>
>>>>
>>>>> People can know device private folio split needs a special handling.
>>>>>
>>>>> BTW, why a device private folio can also be anonymous? Does it mean
>>>>> if a page cache folio is migrated to device private, kernel also
>>>>> sees it as both device private and file-backed?
>>>>>
>>>>
>>>> FYI: device private folios only work with anonymous private pages, hence
>>>> the name device private.
>>>
>>> OK.
>>>
>>>>
>>>>>
>>>>>> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
>>>>>>    the migrate_device API has already just done as a part of the migration. The
>>>>>>    entries under consideration are already migration entries in this case.
>>>>>>    This is wasteful and in some case unexpected.
>>>>>
>>>>> unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split
>>>>> PMD mapping, which you did in migrate_vma_split_pages(). You probably
>>>>> can teach either try_to_migrate() or try_to_unmap() to just split
>>>>> device private PMD mapping. Or if that is not preferred,
>>>>> you can simply call split_huge_pmd_address() when unmap_folio()
>>>>> sees a device private folio.
>>>>>
>>>>> For remap_page(), you can simply return for device private folios
>>>>> like it is currently doing for non anonymous folios.
>>>>>
>>>>
>>>> Doing a full rmap walk does not make sense with unmap_folio() and
>>>> remap_folio(), because
>>>>
>>>> 1. We need to do a page table walk/rmap walk again
>>>> 2. We'll need special handling of migration <-> migration entries
>>>>    in the rmap handling (set/remove migration ptes)
>>>> 3. In this context, the code is already in the middle of migration,
>>>>    so trying to do that again does not make sense.
>>>
>>> Why doing split in the middle of migration? Existing split code
>>> assumes to-be-split folios are mapped.
>>>
>>> What prevents doing split before migration?
>>>
>>
>> The code does do a split prior to migration if THP selection fails
>>
>> Please see https://lore.kernel.org/lkml/20250703233511.2028395-5-balbirs@nvidia.com/
>> and the fallback part which calls split_folio()
>>
>> But the case under consideration is special since the device needs to allocate
>> corresponding pfn's as well. The changelog mentions it:
>>
>> "The common case that arises is that after setup, during migrate
>> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
>> pages."
>>
>> I can expand on it, because migrate_vma() is a multi-phase operation
>>
>> 1. migrate_vma_setup()
>> 2. migrate_vma_pages()
>> 3. migrate_vma_finalize()
>>
>> It can so happen that when we get the destination pfn's allocated the destination
>> might not be able to allocate a large page, so we do the split in migrate_vma_pages().
>>
>> The pages have been unmapped and collected in migrate_vma_setup()
>>
>> The next patch in the series 9/12 (https://lore.kernel.org/lkml/20250703233511.2028395-10-balbirs@nvidia.com/)
>> tests the split and emulates a failure on the device side to allocate large pages
>> and tests it in 10/12 (https://lore.kernel.org/lkml/20250703233511.2028395-11-balbirs@nvidia.com/)
>>
>
> Another use case I’ve seen is when a previously allocated high-order
> folio, now in the free memory pool, is reallocated as a lower-order
> page. For example, a 2MB fault allocates a folio, the memory is later

That is different. If the high-order folio is free, it should be split
using split_page() from mm/page_alloc.c.
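
Something like the following is what I mean (untested sketch, not part of
this series, and the function name is made up; it assumes the high-order
allocation was done without __GFP_COMP, so the pages are not compound and
the caller holds the only reference):

#include <linux/gfp.h>
#include <linux/mm.h>

/* Allocate a high-order run and break it into independent order-0 pages. */
static struct page *alloc_split_pool_pages(unsigned int order)
{
	struct page *page = alloc_pages(GFP_KERNEL, order);
	unsigned int i;

	if (!page)
		return NULL;

	/* each of the 1 << order pages now carries its own reference */
	split_page(page, order);

	/* keep the first page, return the rest to the allocator */
	for (i = 1; i < (1U << order); i++)
		__free_page(page + i);

	return page;
}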

> freed, and then a 4KB fault reuses a page from that previously allocated
> folio. This will be actually quite common in Xe / GPU SVM. In such
> cases, the folio in an unmapped state needs to be split. I’d suggest a

This folio is unused, so ->flags, ->mapping, etc. are not set;
__split_unmapped_folio() is not for it, unless you mean something
different by a free folio.

> migrate_device_* helper built on top of the core MM __split_folio
> function add here.
>

--
Best Regards,
Yan, Zi
Re: [v1 resend 08/12] mm/thp: add split during migration support
Posted by Matthew Brost 2 months, 3 weeks ago
On Wed, Jul 16, 2025 at 07:19:10AM -0400, Zi Yan wrote:
> On 16 Jul 2025, at 1:34, Matthew Brost wrote:
> 
> > On Sun, Jul 06, 2025 at 11:47:10AM +1000, Balbir Singh wrote:
> >> On 7/6/25 11:34, Zi Yan wrote:
> >>> On 5 Jul 2025, at 21:15, Balbir Singh wrote:
> >>>
> >>>> On 7/5/25 11:55, Zi Yan wrote:
> >>>>> On 4 Jul 2025, at 20:58, Balbir Singh wrote:
> >>>>>
> >>>>>> On 7/4/25 21:24, Zi Yan wrote:
> >>>>>>>
> >>>>>>> s/pages/folio
> >>>>>>>
> >>>>>>
> >>>>>> Thanks, will make the changes
> >>>>>>
> >>>>>>> Why name it isolated if the folio is unmapped? Isolated folios often mean
> >>>>>>> they are removed from LRU lists. isolated here causes confusion.
> >>>>>>>
> >>>>>>
> >>>>>> Ack, will change the name
> >>>>>>
> >>>>>>
> >>>>>>>>   *
> >>>>>>>>   * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
> >>>>>>>>   * It is in charge of checking whether the split is supported or not and
> >>>>>>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
> >>>>>>>>   */
> >>>>>>>>  static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>>  		struct page *split_at, struct page *lock_at,
> >>>>>>>> -		struct list_head *list, bool uniform_split)
> >>>>>>>> +		struct list_head *list, bool uniform_split, bool isolated)
> >>>>>>>>  {
> >>>>>>>>  	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
> >>>>>>>>  	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
> >>>>>>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>>  		 * is taken to serialise against parallel split or collapse
> >>>>>>>>  		 * operations.
> >>>>>>>>  		 */
> >>>>>>>> -		anon_vma = folio_get_anon_vma(folio);
> >>>>>>>> -		if (!anon_vma) {
> >>>>>>>> -			ret = -EBUSY;
> >>>>>>>> -			goto out;
> >>>>>>>> +		if (!isolated) {
> >>>>>>>> +			anon_vma = folio_get_anon_vma(folio);
> >>>>>>>> +			if (!anon_vma) {
> >>>>>>>> +				ret = -EBUSY;
> >>>>>>>> +				goto out;
> >>>>>>>> +			}
> >>>>>>>> +			anon_vma_lock_write(anon_vma);
> >>>>>>>>  		}
> >>>>>>>>  		end = -1;
> >>>>>>>>  		mapping = NULL;
> >>>>>>>> -		anon_vma_lock_write(anon_vma);
> >>>>>>>>  	} else {
> >>>>>>>>  		unsigned int min_order;
> >>>>>>>>  		gfp_t gfp;
> >>>>>>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>>  		goto out_unlock;
> >>>>>>>>  	}
> >>>>>>>>
> >>>>>>>> -	unmap_folio(folio);
> >>>>>>>> +	if (!isolated)
> >>>>>>>> +		unmap_folio(folio);
> >>>>>>>>
> >>>>>>>>  	/* block interrupt reentry in xa_lock and spinlock */
> >>>>>>>>  	local_irq_disable();
> >>>>>>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>>
> >>>>>>>>  		ret = __split_unmapped_folio(folio, new_order,
> >>>>>>>>  				split_at, lock_at, list, end, &xas, mapping,
> >>>>>>>> -				uniform_split);
> >>>>>>>> +				uniform_split, isolated);
> >>>>>>>>  	} else {
> >>>>>>>>  		spin_unlock(&ds_queue->split_queue_lock);
> >>>>>>>>  fail:
> >>>>>>>>  		if (mapping)
> >>>>>>>>  			xas_unlock(&xas);
> >>>>>>>>  		local_irq_enable();
> >>>>>>>> -		remap_page(folio, folio_nr_pages(folio), 0);
> >>>>>>>> +		if (!isolated)
> >>>>>>>> +			remap_page(folio, folio_nr_pages(folio), 0);
> >>>>>>>>  		ret = -EAGAIN;
> >>>>>>>>  	}
> >>>>>>>
> >>>>>>> These "isolated" special handlings does not look good, I wonder if there
> >>>>>>> is a way of letting split code handle device private folios more gracefully.
> >>>>>>> It also causes confusions, since why does "isolated/unmapped" folios
> >>>>>>> not need to unmap_page(), remap_page(), or unlock?
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>> There are two reasons for going down the current code path
> >>>>>
> >>>>> After thinking more, I think adding isolated/unmapped is not the right
> >>>>> way, since unmapped folio is a very generic concept. If you add it,
> >>>>> one can easily misuse the folio split code by first unmapping a folio
> >>>>> and trying to split it with unmapped = true. I do not think that is
> >>>>> supported and your patch does not prevent that from happening in the future.
> >>>>>
> >>>>
> >>>> I don't understand the misuse case you mention, I assume you mean someone can
> >>>> get the usage wrong? The responsibility is on the caller to do the right thing
> >>>> if calling the API with unmapped
> >>>
> >>> Before your patch, there is no use case of splitting unmapped folios.
> >>> Your patch only adds support for device private page split, not any unmapped
> >>> folio split. So using a generic isolated/unmapped parameter is not OK.
> >>>
> >>
> >> There is a use for splitting unmapped folios (see below)
> >>
> >>>>
> >>>>> You should teach different parts of folio split code path to handle
> >>>>> device private folios properly. Details are below.
> >>>>>
> >>>>>>
> >>>>>> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
> >>>>>>    the split routine to return with -EBUSY
> >>>>>
> >>>>> You do something below instead.
> >>>>>
> >>>>> if (!anon_vma && !folio_is_device_private(folio)) {
> >>>>> 	ret = -EBUSY;
> >>>>> 	goto out;
> >>>>> } else if (anon_vma) {
> >>>>> 	anon_vma_lock_write(anon_vma);
> >>>>> }
> >>>>>
> >>>>
> >>>> folio_get_anon() cannot be called for unmapped folios. In our case the page has
> >>>> already been unmapped. Is there a reason why you mix anon_vma_lock_write with
> >>>> the check for device private folios?
> >>>
> >>> Oh, I did not notice that anon_vma = folio_get_anon_vma(folio) is also
> >>> in if (!isolated) branch. In that case, just do
> >>>
> >>> if (folio_is_device_private(folio) {
> >>> ...
> >>> } else if (is_anon) {
> >>> ...
> >>> } else {
> >>> ...
> >>> }
> >>>
> >>>>
> >>>>> People can know device private folio split needs a special handling.
> >>>>>
> >>>>> BTW, why a device private folio can also be anonymous? Does it mean
> >>>>> if a page cache folio is migrated to device private, kernel also
> >>>>> sees it as both device private and file-backed?
> >>>>>
> >>>>
> >>>> FYI: device private folios only work with anonymous private pages, hence
> >>>> the name device private.
> >>>
> >>> OK.
> >>>
> >>>>
> >>>>>
> >>>>>> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
> >>>>>>    the migrate_device API has already just done as a part of the migration. The
> >>>>>>    entries under consideration are already migration entries in this case.
> >>>>>>    This is wasteful and in some case unexpected.
> >>>>>
> >>>>> unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split
> >>>>> PMD mapping, which you did in migrate_vma_split_pages(). You probably
> >>>>> can teach either try_to_migrate() or try_to_unmap() to just split
> >>>>> device private PMD mapping. Or if that is not preferred,
> >>>>> you can simply call split_huge_pmd_address() when unmap_folio()
> >>>>> sees a device private folio.
> >>>>>
> >>>>> For remap_page(), you can simply return for device private folios
> >>>>> like it is currently doing for non anonymous folios.
> >>>>>
> >>>>
> >>>> Doing a full rmap walk does not make sense with unmap_folio() and
> >>>> remap_folio(), because
> >>>>
> >>>> 1. We need to do a page table walk/rmap walk again
> >>>> 2. We'll need special handling of migration <-> migration entries
> >>>>    in the rmap handling (set/remove migration ptes)
> >>>> 3. In this context, the code is already in the middle of migration,
> >>>>    so trying to do that again does not make sense.
> >>>
> >>> Why doing split in the middle of migration? Existing split code
> >>> assumes to-be-split folios are mapped.
> >>>
> >>> What prevents doing split before migration?
> >>>
> >>
> >> The code does do a split prior to migration if THP selection fails
> >>
> >> Please see https://lore.kernel.org/lkml/20250703233511.2028395-5-balbirs@nvidia.com/
> >> and the fallback part which calls split_folio()
> >>
> >> But the case under consideration is special since the device needs to allocate
> >> corresponding pfn's as well. The changelog mentions it:
> >>
> >> "The common case that arises is that after setup, during migrate
> >> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
> >> pages."
> >>
> >> I can expand on it, because migrate_vma() is a multi-phase operation
> >>
> >> 1. migrate_vma_setup()
> >> 2. migrate_vma_pages()
> >> 3. migrate_vma_finalize()
> >>
> >> It can so happen that when we get the destination pfn's allocated the destination
> >> might not be able to allocate a large page, so we do the split in migrate_vma_pages().
> >>
> >> The pages have been unmapped and collected in migrate_vma_setup()
> >>
> >> The next patch in the series 9/12 (https://lore.kernel.org/lkml/20250703233511.2028395-10-balbirs@nvidia.com/)
> >> tests the split and emulates a failure on the device side to allocate large pages
> >> and tests it in 10/12 (https://lore.kernel.org/lkml/20250703233511.2028395-11-balbirs@nvidia.com/)
> >>
> >
> > Another use case I’ve seen is when a previously allocated high-order
> > folio, now in the free memory pool, is reallocated as a lower-order
> > page. For example, a 2MB fault allocates a folio, the memory is later
> 
> That is different. If the high-order folio is free, it should be split
> using split_page() from mm/page_alloc.c.
> 

Ah, ok. Let me see if that works - it would be easier.

> > freed, and then a 4KB fault reuses a page from that previously allocated
> > folio. This will be actually quite common in Xe / GPU SVM. In such
> > cases, the folio in an unmapped state needs to be split. I’d suggest a
> 
> This folio is unused, so ->flags, ->mapping, and etc. are not set,
> __split_unmapped_folio() is not for it, unless you mean free folio
> differently.
> 

This is right, those fields should be clear.

Thanks for the tip.

Matt

> > migrate_device_* helper built on top of the core MM __split_folio
> > function add here.
> >
> 
> --
> Best Regards,
> Yan, Zi
Re: [v1 resend 08/12] mm/thp: add split during migration support
Posted by Balbir Singh 2 months, 3 weeks ago
On 7/17/25 02:24, Matthew Brost wrote:
> On Wed, Jul 16, 2025 at 07:19:10AM -0400, Zi Yan wrote:
>> On 16 Jul 2025, at 1:34, Matthew Brost wrote:
>>
>>> On Sun, Jul 06, 2025 at 11:47:10AM +1000, Balbir Singh wrote:
>>>> On 7/6/25 11:34, Zi Yan wrote:
>>>>> On 5 Jul 2025, at 21:15, Balbir Singh wrote:
>>>>>
>>>>>> On 7/5/25 11:55, Zi Yan wrote:
>>>>>>> On 4 Jul 2025, at 20:58, Balbir Singh wrote:
>>>>>>>
>>>>>>>> On 7/4/25 21:24, Zi Yan wrote:
>>>>>>>>>
>>>>>>>>> s/pages/folio
>>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks, will make the changes
>>>>>>>>
>>>>>>>>> Why name it isolated if the folio is unmapped? Isolated folios often mean
>>>>>>>>> they are removed from LRU lists. isolated here causes confusion.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Ack, will change the name
>>>>>>>>
>>>>>>>>
>>>>>>>>>>   *
>>>>>>>>>>   * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>>>>>>>>>   * It is in charge of checking whether the split is supported or not and
>>>>>>>>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>>>>>>>>   */
>>>>>>>>>>  static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>  		struct page *split_at, struct page *lock_at,
>>>>>>>>>> -		struct list_head *list, bool uniform_split)
>>>>>>>>>> +		struct list_head *list, bool uniform_split, bool isolated)
>>>>>>>>>>  {
>>>>>>>>>>  	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>>>>>>>>  	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>>>>>>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>  		 * is taken to serialise against parallel split or collapse
>>>>>>>>>>  		 * operations.
>>>>>>>>>>  		 */
>>>>>>>>>> -		anon_vma = folio_get_anon_vma(folio);
>>>>>>>>>> -		if (!anon_vma) {
>>>>>>>>>> -			ret = -EBUSY;
>>>>>>>>>> -			goto out;
>>>>>>>>>> +		if (!isolated) {
>>>>>>>>>> +			anon_vma = folio_get_anon_vma(folio);
>>>>>>>>>> +			if (!anon_vma) {
>>>>>>>>>> +				ret = -EBUSY;
>>>>>>>>>> +				goto out;
>>>>>>>>>> +			}
>>>>>>>>>> +			anon_vma_lock_write(anon_vma);
>>>>>>>>>>  		}
>>>>>>>>>>  		end = -1;
>>>>>>>>>>  		mapping = NULL;
>>>>>>>>>> -		anon_vma_lock_write(anon_vma);
>>>>>>>>>>  	} else {
>>>>>>>>>>  		unsigned int min_order;
>>>>>>>>>>  		gfp_t gfp;
>>>>>>>>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>  		goto out_unlock;
>>>>>>>>>>  	}
>>>>>>>>>>
>>>>>>>>>> -	unmap_folio(folio);
>>>>>>>>>> +	if (!isolated)
>>>>>>>>>> +		unmap_folio(folio);
>>>>>>>>>>
>>>>>>>>>>  	/* block interrupt reentry in xa_lock and spinlock */
>>>>>>>>>>  	local_irq_disable();
>>>>>>>>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>
>>>>>>>>>>  		ret = __split_unmapped_folio(folio, new_order,
>>>>>>>>>>  				split_at, lock_at, list, end, &xas, mapping,
>>>>>>>>>> -				uniform_split);
>>>>>>>>>> +				uniform_split, isolated);
>>>>>>>>>>  	} else {
>>>>>>>>>>  		spin_unlock(&ds_queue->split_queue_lock);
>>>>>>>>>>  fail:
>>>>>>>>>>  		if (mapping)
>>>>>>>>>>  			xas_unlock(&xas);
>>>>>>>>>>  		local_irq_enable();
>>>>>>>>>> -		remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>>>>> +		if (!isolated)
>>>>>>>>>> +			remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>>>>>  		ret = -EAGAIN;
>>>>>>>>>>  	}
>>>>>>>>>
>>>>>>>>> These "isolated" special handlings does not look good, I wonder if there
>>>>>>>>> is a way of letting split code handle device private folios more gracefully.
>>>>>>>>> It also causes confusions, since why does "isolated/unmapped" folios
>>>>>>>>> not need to unmap_page(), remap_page(), or unlock?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> There are two reasons for going down the current code path
>>>>>>>
>>>>>>> After thinking more, I think adding isolated/unmapped is not the right
>>>>>>> way, since unmapped folio is a very generic concept. If you add it,
>>>>>>> one can easily misuse the folio split code by first unmapping a folio
>>>>>>> and trying to split it with unmapped = true. I do not think that is
>>>>>>> supported and your patch does not prevent that from happening in the future.
>>>>>>>
>>>>>>
>>>>>> I don't understand the misuse case you mention, I assume you mean someone can
>>>>>> get the usage wrong? The responsibility is on the caller to do the right thing
>>>>>> if calling the API with unmapped
>>>>>
>>>>> Before your patch, there is no use case of splitting unmapped folios.
>>>>> Your patch only adds support for device private page split, not any unmapped
>>>>> folio split. So using a generic isolated/unmapped parameter is not OK.
>>>>>
>>>>
>>>> There is a use for splitting unmapped folios (see below)
>>>>
>>>>>>
>>>>>>> You should teach different parts of folio split code path to handle
>>>>>>> device private folios properly. Details are below.
>>>>>>>
>>>>>>>>
>>>>>>>> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
>>>>>>>>    the split routine to return with -EBUSY
>>>>>>>
>>>>>>> You do something below instead.
>>>>>>>
>>>>>>> if (!anon_vma && !folio_is_device_private(folio)) {
>>>>>>> 	ret = -EBUSY;
>>>>>>> 	goto out;
>>>>>>> } else if (anon_vma) {
>>>>>>> 	anon_vma_lock_write(anon_vma);
>>>>>>> }
>>>>>>>
>>>>>>
>>>>>> folio_get_anon() cannot be called for unmapped folios. In our case the page has
>>>>>> already been unmapped. Is there a reason why you mix anon_vma_lock_write with
>>>>>> the check for device private folios?
>>>>>
>>>>> Oh, I did not notice that anon_vma = folio_get_anon_vma(folio) is also
>>>>> in if (!isolated) branch. In that case, just do
>>>>>
>>>>> if (folio_is_device_private(folio) {
>>>>> ...
>>>>> } else if (is_anon) {
>>>>> ...
>>>>> } else {
>>>>> ...
>>>>> }
>>>>>
>>>>>>
>>>>>>> People can know device private folio split needs a special handling.
>>>>>>>
>>>>>>> BTW, why a device private folio can also be anonymous? Does it mean
>>>>>>> if a page cache folio is migrated to device private, kernel also
>>>>>>> sees it as both device private and file-backed?
>>>>>>>
>>>>>>
>>>>>> FYI: device private folios only work with anonymous private pages, hence
>>>>>> the name device private.
>>>>>
>>>>> OK.
>>>>>
>>>>>>
>>>>>>>
>>>>>>>> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
>>>>>>>>    the migrate_device API has already just done as a part of the migration. The
>>>>>>>>    entries under consideration are already migration entries in this case.
>>>>>>>>    This is wasteful and in some case unexpected.
>>>>>>>
>>>>>>> unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split
>>>>>>> PMD mapping, which you did in migrate_vma_split_pages(). You probably
>>>>>>> can teach either try_to_migrate() or try_to_unmap() to just split
>>>>>>> device private PMD mapping. Or if that is not preferred,
>>>>>>> you can simply call split_huge_pmd_address() when unmap_folio()
>>>>>>> sees a device private folio.
>>>>>>>
>>>>>>> For remap_page(), you can simply return for device private folios
>>>>>>> like it is currently doing for non anonymous folios.
>>>>>>>
>>>>>>
>>>>>> Doing a full rmap walk does not make sense with unmap_folio() and
>>>>>> remap_folio(), because
>>>>>>
>>>>>> 1. We need to do a page table walk/rmap walk again
>>>>>> 2. We'll need special handling of migration <-> migration entries
>>>>>>    in the rmap handling (set/remove migration ptes)
>>>>>> 3. In this context, the code is already in the middle of migration,
>>>>>>    so trying to do that again does not make sense.
>>>>>
>>>>> Why doing split in the middle of migration? Existing split code
>>>>> assumes to-be-split folios are mapped.
>>>>>
>>>>> What prevents doing split before migration?
>>>>>
>>>>
>>>> The code does do a split prior to migration if THP selection fails
>>>>
>>>> Please see https://lore.kernel.org/lkml/20250703233511.2028395-5-balbirs@nvidia.com/
>>>> and the fallback part which calls split_folio()
>>>>
>>>> But the case under consideration is special since the device needs to allocate
>>>> corresponding pfn's as well. The changelog mentions it:
>>>>
>>>> "The common case that arises is that after setup, during migrate
>>>> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
>>>> pages."
>>>>
>>>> I can expand on it, because migrate_vma() is a multi-phase operation
>>>>
>>>> 1. migrate_vma_setup()
>>>> 2. migrate_vma_pages()
>>>> 3. migrate_vma_finalize()
>>>>
>>>> It can so happen that when we get the destination pfn's allocated the destination
>>>> might not be able to allocate a large page, so we do the split in migrate_vma_pages().
>>>>
>>>> The pages have been unmapped and collected in migrate_vma_setup()
>>>>
>>>> The next patch in the series 9/12 (https://lore.kernel.org/lkml/20250703233511.2028395-10-balbirs@nvidia.com/)
>>>> tests the split and emulates a failure on the device side to allocate large pages
>>>> and tests it in 10/12 (https://lore.kernel.org/lkml/20250703233511.2028395-11-balbirs@nvidia.com/)
>>>>
>>>
>>> Another use case I’ve seen is when a previously allocated high-order
>>> folio, now in the free memory pool, is reallocated as a lower-order
>>> page. For example, a 2MB fault allocates a folio, the memory is later
>>
>> That is different. If the high-order folio is free, it should be split
>> using split_page() from mm/page_alloc.c.
>>
> 
> Ah, ok. Let me see if that works - it would easier.
> 
>>> freed, and then a 4KB fault reuses a page from that previously allocated
>>> folio. This will be actually quite common in Xe / GPU SVM. In such
>>> cases, the folio in an unmapped state needs to be split. I’d suggest a
>>
>> This folio is unused, so ->flags, ->mapping, and etc. are not set,
>> __split_unmapped_folio() is not for it, unless you mean free folio
>> differently.
>>
> 
> This is right, those fields should be clear.
> 
> Thanks for the tip.
> 
I was hoping to reuse __split_folio_to_order() at some point in the future
to split the backing pages in the driver, but it is not an immediate priority.

Balbir
Re: [v1 resend 08/12] mm/thp: add split during migration support
Posted by Matthew Brost 2 months, 3 weeks ago
On Thu, Jul 17, 2025 at 07:53:40AM +1000, Balbir Singh wrote:
> On 7/17/25 02:24, Matthew Brost wrote:
> > On Wed, Jul 16, 2025 at 07:19:10AM -0400, Zi Yan wrote:
> >> On 16 Jul 2025, at 1:34, Matthew Brost wrote:
> >>
> >>> On Sun, Jul 06, 2025 at 11:47:10AM +1000, Balbir Singh wrote:
> >>>> On 7/6/25 11:34, Zi Yan wrote:
> >>>>> On 5 Jul 2025, at 21:15, Balbir Singh wrote:
> >>>>>
> >>>>>> On 7/5/25 11:55, Zi Yan wrote:
> >>>>>>> On 4 Jul 2025, at 20:58, Balbir Singh wrote:
> >>>>>>>
> >>>>>>>> On 7/4/25 21:24, Zi Yan wrote:
> >>>>>>>>>
> >>>>>>>>> s/pages/folio
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> Thanks, will make the changes
> >>>>>>>>
> >>>>>>>>> Why name it isolated if the folio is unmapped? Isolated folios often mean
> >>>>>>>>> they are removed from LRU lists. isolated here causes confusion.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> Ack, will change the name
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>>   *
> >>>>>>>>>>   * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
> >>>>>>>>>>   * It is in charge of checking whether the split is supported or not and
> >>>>>>>>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
> >>>>>>>>>>   */
> >>>>>>>>>>  static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>>>>  		struct page *split_at, struct page *lock_at,
> >>>>>>>>>> -		struct list_head *list, bool uniform_split)
> >>>>>>>>>> +		struct list_head *list, bool uniform_split, bool isolated)
> >>>>>>>>>>  {
> >>>>>>>>>>  	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
> >>>>>>>>>>  	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
> >>>>>>>>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>>>>  		 * is taken to serialise against parallel split or collapse
> >>>>>>>>>>  		 * operations.
> >>>>>>>>>>  		 */
> >>>>>>>>>> -		anon_vma = folio_get_anon_vma(folio);
> >>>>>>>>>> -		if (!anon_vma) {
> >>>>>>>>>> -			ret = -EBUSY;
> >>>>>>>>>> -			goto out;
> >>>>>>>>>> +		if (!isolated) {
> >>>>>>>>>> +			anon_vma = folio_get_anon_vma(folio);
> >>>>>>>>>> +			if (!anon_vma) {
> >>>>>>>>>> +				ret = -EBUSY;
> >>>>>>>>>> +				goto out;
> >>>>>>>>>> +			}
> >>>>>>>>>> +			anon_vma_lock_write(anon_vma);
> >>>>>>>>>>  		}
> >>>>>>>>>>  		end = -1;
> >>>>>>>>>>  		mapping = NULL;
> >>>>>>>>>> -		anon_vma_lock_write(anon_vma);
> >>>>>>>>>>  	} else {
> >>>>>>>>>>  		unsigned int min_order;
> >>>>>>>>>>  		gfp_t gfp;
> >>>>>>>>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>>>>  		goto out_unlock;
> >>>>>>>>>>  	}
> >>>>>>>>>>
> >>>>>>>>>> -	unmap_folio(folio);
> >>>>>>>>>> +	if (!isolated)
> >>>>>>>>>> +		unmap_folio(folio);
> >>>>>>>>>>
> >>>>>>>>>>  	/* block interrupt reentry in xa_lock and spinlock */
> >>>>>>>>>>  	local_irq_disable();
> >>>>>>>>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>>>>
> >>>>>>>>>>  		ret = __split_unmapped_folio(folio, new_order,
> >>>>>>>>>>  				split_at, lock_at, list, end, &xas, mapping,
> >>>>>>>>>> -				uniform_split);
> >>>>>>>>>> +				uniform_split, isolated);
> >>>>>>>>>>  	} else {
> >>>>>>>>>>  		spin_unlock(&ds_queue->split_queue_lock);
> >>>>>>>>>>  fail:
> >>>>>>>>>>  		if (mapping)
> >>>>>>>>>>  			xas_unlock(&xas);
> >>>>>>>>>>  		local_irq_enable();
> >>>>>>>>>> -		remap_page(folio, folio_nr_pages(folio), 0);
> >>>>>>>>>> +		if (!isolated)
> >>>>>>>>>> +			remap_page(folio, folio_nr_pages(folio), 0);
> >>>>>>>>>>  		ret = -EAGAIN;
> >>>>>>>>>>  	}
> >>>>>>>>>
> >>>>>>>>> These "isolated" special handlings does not look good, I wonder if there
> >>>>>>>>> is a way of letting split code handle device private folios more gracefully.
> >>>>>>>>> It also causes confusions, since why does "isolated/unmapped" folios
> >>>>>>>>> not need to unmap_page(), remap_page(), or unlock?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> There are two reasons for going down the current code path
> >>>>>>>
> >>>>>>> After thinking more, I think adding isolated/unmapped is not the right
> >>>>>>> way, since unmapped folio is a very generic concept. If you add it,
> >>>>>>> one can easily misuse the folio split code by first unmapping a folio
> >>>>>>> and trying to split it with unmapped = true. I do not think that is
> >>>>>>> supported and your patch does not prevent that from happening in the future.
> >>>>>>>
> >>>>>>
> >>>>>> I don't understand the misuse case you mention, I assume you mean someone can
> >>>>>> get the usage wrong? The responsibility is on the caller to do the right thing
> >>>>>> if calling the API with unmapped
> >>>>>
> >>>>> Before your patch, there is no use case of splitting unmapped folios.
> >>>>> Your patch only adds support for device private page split, not any unmapped
> >>>>> folio split. So using a generic isolated/unmapped parameter is not OK.
> >>>>>
> >>>>
> >>>> There is a use for splitting unmapped folios (see below)
> >>>>
> >>>>>>
> >>>>>>> You should teach different parts of folio split code path to handle
> >>>>>>> device private folios properly. Details are below.
> >>>>>>>
> >>>>>>>>
> >>>>>>>> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
> >>>>>>>>    the split routine to return with -EBUSY
> >>>>>>>
> >>>>>>> You do something below instead.
> >>>>>>>
> >>>>>>> if (!anon_vma && !folio_is_device_private(folio)) {
> >>>>>>> 	ret = -EBUSY;
> >>>>>>> 	goto out;
> >>>>>>> } else if (anon_vma) {
> >>>>>>> 	anon_vma_lock_write(anon_vma);
> >>>>>>> }
> >>>>>>>
> >>>>>>
> >>>>>> folio_get_anon() cannot be called for unmapped folios. In our case the page has
> >>>>>> already been unmapped. Is there a reason why you mix anon_vma_lock_write with
> >>>>>> the check for device private folios?
> >>>>>
> >>>>> Oh, I did not notice that anon_vma = folio_get_anon_vma(folio) is also
> >>>>> in if (!isolated) branch. In that case, just do
> >>>>>
> >>>>> if (folio_is_device_private(folio) {
> >>>>> ...
> >>>>> } else if (is_anon) {
> >>>>> ...
> >>>>> } else {
> >>>>> ...
> >>>>> }
> >>>>>
> >>>>>>
> >>>>>>> People can know device private folio split needs a special handling.
> >>>>>>>
> >>>>>>> BTW, why a device private folio can also be anonymous? Does it mean
> >>>>>>> if a page cache folio is migrated to device private, kernel also
> >>>>>>> sees it as both device private and file-backed?
> >>>>>>>
> >>>>>>
> >>>>>> FYI: device private folios only work with anonymous private pages, hence
> >>>>>> the name device private.
> >>>>>
> >>>>> OK.
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>>> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
> >>>>>>>>    the migrate_device API has already just done as a part of the migration. The
> >>>>>>>>    entries under consideration are already migration entries in this case.
> >>>>>>>>    This is wasteful and in some case unexpected.
> >>>>>>>
> >>>>>>> unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split
> >>>>>>> PMD mapping, which you did in migrate_vma_split_pages(). You probably
> >>>>>>> can teach either try_to_migrate() or try_to_unmap() to just split
> >>>>>>> device private PMD mapping. Or if that is not preferred,
> >>>>>>> you can simply call split_huge_pmd_address() when unmap_folio()
> >>>>>>> sees a device private folio.
> >>>>>>>
> >>>>>>> For remap_page(), you can simply return for device private folios
> >>>>>>> like it is currently doing for non anonymous folios.
> >>>>>>>
> >>>>>>
> >>>>>> Doing a full rmap walk does not make sense with unmap_folio() and
> >>>>>> remap_folio(), because
> >>>>>>
> >>>>>> 1. We need to do a page table walk/rmap walk again
> >>>>>> 2. We'll need special handling of migration <-> migration entries
> >>>>>>    in the rmap handling (set/remove migration ptes)
> >>>>>> 3. In this context, the code is already in the middle of migration,
> >>>>>>    so trying to do that again does not make sense.
> >>>>>
> >>>>> Why doing split in the middle of migration? Existing split code
> >>>>> assumes to-be-split folios are mapped.
> >>>>>
> >>>>> What prevents doing split before migration?
> >>>>>
> >>>>
> >>>> The code does do a split prior to migration if THP selection fails
> >>>>
> >>>> Please see https://lore.kernel.org/lkml/20250703233511.2028395-5-balbirs@nvidia.com/
> >>>> and the fallback part which calls split_folio()
> >>>>
> >>>> But the case under consideration is special since the device needs to allocate
> >>>> corresponding pfn's as well. The changelog mentions it:
> >>>>
> >>>> "The common case that arises is that after setup, during migrate
> >>>> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
> >>>> pages."
> >>>>
> >>>> I can expand on it, because migrate_vma() is a multi-phase operation
> >>>>
> >>>> 1. migrate_vma_setup()
> >>>> 2. migrate_vma_pages()
> >>>> 3. migrate_vma_finalize()
> >>>>
> >>>> It can so happen that when we get the destination pfn's allocated the destination
> >>>> might not be able to allocate a large page, so we do the split in migrate_vma_pages().
> >>>>
> >>>> The pages have been unmapped and collected in migrate_vma_setup()
> >>>>
> >>>> The next patch in the series 9/12 (https://lore.kernel.org/lkml/20250703233511.2028395-10-balbirs@nvidia.com/)
> >>>> tests the split and emulates a failure on the device side to allocate large pages
> >>>> and tests it in 10/12 (https://lore.kernel.org/lkml/20250703233511.2028395-11-balbirs@nvidia.com/)
> >>>>
> >>>
> >>> Another use case I’ve seen is when a previously allocated high-order
> >>> folio, now in the free memory pool, is reallocated as a lower-order
> >>> page. For example, a 2MB fault allocates a folio, the memory is later
> >>
> >> That is different. If the high-order folio is free, it should be split
> >> using split_page() from mm/page_alloc.c.
> >>
> > 
> > Ah, ok. Let me see if that works - it would easier.
> > 

This suggestion quickly blows up as PageCompound is true and page_count
here is zero.
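
For reference, the first thing split_page() does (mm/page_alloc.c, quoting
from memory) is:

	VM_BUG_ON_PAGE(PageCompound(page), page);
	VM_BUG_ON_PAGE(!page_count(page), page);

so a compound folio with a zero refcount trips both checks.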

> >>> freed, and then a 4KB fault reuses a page from that previously allocated
> >>> folio. This will be actually quite common in Xe / GPU SVM. In such
> >>> cases, the folio in an unmapped state needs to be split. I’d suggest a
> >>
> >> This folio is unused, so ->flags, ->mapping, and etc. are not set,
> >> __split_unmapped_folio() is not for it, unless you mean free folio
> >> differently.
> >>
> > 
> > This is right, those fields should be clear.
> > 
> > Thanks for the tip.
> > 
> I was hoping to reuse __split_folio_to_order() at some point in the future
> to split the backing pages in the driver, but it is not an immediate priority
> 

I think we need something for the scenario I describe here. I was able to
make __split_huge_page_to_list_to_order() work with a couple of hacks, but
it is almost certainly not right, as Zi pointed out.

I'm new to the MM stuff, but I'll play around with this a bit and see if I
can come up with something that will work here.

Matt

> Balbir
Re: [v1 resend 08/12] mm/thp: add split during migration support
Posted by Zi Yan 2 months, 3 weeks ago
On 17 Jul 2025, at 18:24, Matthew Brost wrote:

> On Thu, Jul 17, 2025 at 07:53:40AM +1000, Balbir Singh wrote:
>> On 7/17/25 02:24, Matthew Brost wrote:
>>> On Wed, Jul 16, 2025 at 07:19:10AM -0400, Zi Yan wrote:
>>>> On 16 Jul 2025, at 1:34, Matthew Brost wrote:
>>>>
>>>>> On Sun, Jul 06, 2025 at 11:47:10AM +1000, Balbir Singh wrote:
>>>>>> On 7/6/25 11:34, Zi Yan wrote:
>>>>>>> On 5 Jul 2025, at 21:15, Balbir Singh wrote:
>>>>>>>
>>>>>>>> On 7/5/25 11:55, Zi Yan wrote:
>>>>>>>>> On 4 Jul 2025, at 20:58, Balbir Singh wrote:
>>>>>>>>>
>>>>>>>>>> On 7/4/25 21:24, Zi Yan wrote:
>>>>>>>>>>>
>>>>>>>>>>> s/pages/folio
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks, will make the changes
>>>>>>>>>>
>>>>>>>>>>> Why name it isolated if the folio is unmapped? Isolated folios often mean
>>>>>>>>>>> they are removed from LRU lists. isolated here causes confusion.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Ack, will change the name
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>>   *
>>>>>>>>>>>>   * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>>>>>>>>>>>   * It is in charge of checking whether the split is supported or not and
>>>>>>>>>>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>>>>>>>>>>   */
>>>>>>>>>>>>  static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>>>  		struct page *split_at, struct page *lock_at,
>>>>>>>>>>>> -		struct list_head *list, bool uniform_split)
>>>>>>>>>>>> +		struct list_head *list, bool uniform_split, bool isolated)
>>>>>>>>>>>>  {
>>>>>>>>>>>>  	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>>>>>>>>>>  	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>>>>>>>>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>>>  		 * is taken to serialise against parallel split or collapse
>>>>>>>>>>>>  		 * operations.
>>>>>>>>>>>>  		 */
>>>>>>>>>>>> -		anon_vma = folio_get_anon_vma(folio);
>>>>>>>>>>>> -		if (!anon_vma) {
>>>>>>>>>>>> -			ret = -EBUSY;
>>>>>>>>>>>> -			goto out;
>>>>>>>>>>>> +		if (!isolated) {
>>>>>>>>>>>> +			anon_vma = folio_get_anon_vma(folio);
>>>>>>>>>>>> +			if (!anon_vma) {
>>>>>>>>>>>> +				ret = -EBUSY;
>>>>>>>>>>>> +				goto out;
>>>>>>>>>>>> +			}
>>>>>>>>>>>> +			anon_vma_lock_write(anon_vma);
>>>>>>>>>>>>  		}
>>>>>>>>>>>>  		end = -1;
>>>>>>>>>>>>  		mapping = NULL;
>>>>>>>>>>>> -		anon_vma_lock_write(anon_vma);
>>>>>>>>>>>>  	} else {
>>>>>>>>>>>>  		unsigned int min_order;
>>>>>>>>>>>>  		gfp_t gfp;
>>>>>>>>>>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>>>  		goto out_unlock;
>>>>>>>>>>>>  	}
>>>>>>>>>>>>
>>>>>>>>>>>> -	unmap_folio(folio);
>>>>>>>>>>>> +	if (!isolated)
>>>>>>>>>>>> +		unmap_folio(folio);
>>>>>>>>>>>>
>>>>>>>>>>>>  	/* block interrupt reentry in xa_lock and spinlock */
>>>>>>>>>>>>  	local_irq_disable();
>>>>>>>>>>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>>>
>>>>>>>>>>>>  		ret = __split_unmapped_folio(folio, new_order,
>>>>>>>>>>>>  				split_at, lock_at, list, end, &xas, mapping,
>>>>>>>>>>>> -				uniform_split);
>>>>>>>>>>>> +				uniform_split, isolated);
>>>>>>>>>>>>  	} else {
>>>>>>>>>>>>  		spin_unlock(&ds_queue->split_queue_lock);
>>>>>>>>>>>>  fail:
>>>>>>>>>>>>  		if (mapping)
>>>>>>>>>>>>  			xas_unlock(&xas);
>>>>>>>>>>>>  		local_irq_enable();
>>>>>>>>>>>> -		remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>>>>>>> +		if (!isolated)
>>>>>>>>>>>> +			remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>>>>>>>  		ret = -EAGAIN;
>>>>>>>>>>>>  	}
>>>>>>>>>>>
>>>>>>>>>>> These "isolated" special handlings does not look good, I wonder if there
>>>>>>>>>>> is a way of letting split code handle device private folios more gracefully.
>>>>>>>>>>> It also causes confusions, since why does "isolated/unmapped" folios
>>>>>>>>>>> not need to unmap_page(), remap_page(), or unlock?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> There are two reasons for going down the current code path
>>>>>>>>>
>>>>>>>>> After thinking more, I think adding isolated/unmapped is not the right
>>>>>>>>> way, since unmapped folio is a very generic concept. If you add it,
>>>>>>>>> one can easily misuse the folio split code by first unmapping a folio
>>>>>>>>> and trying to split it with unmapped = true. I do not think that is
>>>>>>>>> supported and your patch does not prevent that from happening in the future.
>>>>>>>>>
>>>>>>>>
>>>>>>>> I don't understand the misuse case you mention, I assume you mean someone can
>>>>>>>> get the usage wrong? The responsibility is on the caller to do the right thing
>>>>>>>> if calling the API with unmapped
>>>>>>>
>>>>>>> Before your patch, there is no use case of splitting unmapped folios.
>>>>>>> Your patch only adds support for device private page split, not any unmapped
>>>>>>> folio split. So using a generic isolated/unmapped parameter is not OK.
>>>>>>>
>>>>>>
>>>>>> There is a use for splitting unmapped folios (see below)
>>>>>>
>>>>>>>>
>>>>>>>>> You should teach different parts of folio split code path to handle
>>>>>>>>> device private folios properly. Details are below.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
>>>>>>>>>>    the split routine to return with -EBUSY
>>>>>>>>>
>>>>>>>>> You do something below instead.
>>>>>>>>>
>>>>>>>>> if (!anon_vma && !folio_is_device_private(folio)) {
>>>>>>>>> 	ret = -EBUSY;
>>>>>>>>> 	goto out;
>>>>>>>>> } else if (anon_vma) {
>>>>>>>>> 	anon_vma_lock_write(anon_vma);
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>
>>>>>>>> folio_get_anon() cannot be called for unmapped folios. In our case the page has
>>>>>>>> already been unmapped. Is there a reason why you mix anon_vma_lock_write with
>>>>>>>> the check for device private folios?
>>>>>>>
>>>>>>> Oh, I did not notice that anon_vma = folio_get_anon_vma(folio) is also
>>>>>>> in if (!isolated) branch. In that case, just do
>>>>>>>
>>>>>>> if (folio_is_device_private(folio) {
>>>>>>> ...
>>>>>>> } else if (is_anon) {
>>>>>>> ...
>>>>>>> } else {
>>>>>>> ...
>>>>>>> }
>>>>>>>
>>>>>>>>
>>>>>>>>> People can know device private folio split needs a special handling.
>>>>>>>>>
>>>>>>>>> BTW, why a device private folio can also be anonymous? Does it mean
>>>>>>>>> if a page cache folio is migrated to device private, kernel also
>>>>>>>>> sees it as both device private and file-backed?
>>>>>>>>>
>>>>>>>>
>>>>>>>> FYI: device private folios only work with anonymous private pages, hence
>>>>>>>> the name device private.
>>>>>>>
>>>>>>> OK.
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
>>>>>>>>>>    the migrate_device API has already just done as a part of the migration. The
>>>>>>>>>>    entries under consideration are already migration entries in this case.
>>>>>>>>>>    This is wasteful and in some case unexpected.
>>>>>>>>>
>>>>>>>>> unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split
>>>>>>>>> PMD mapping, which you did in migrate_vma_split_pages(). You probably
>>>>>>>>> can teach either try_to_migrate() or try_to_unmap() to just split
>>>>>>>>> device private PMD mapping. Or if that is not preferred,
>>>>>>>>> you can simply call split_huge_pmd_address() when unmap_folio()
>>>>>>>>> sees a device private folio.
>>>>>>>>>
>>>>>>>>> For remap_page(), you can simply return for device private folios
>>>>>>>>> like it is currently doing for non anonymous folios.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Doing a full rmap walk does not make sense with unmap_folio() and
>>>>>>>> remap_folio(), because
>>>>>>>>
>>>>>>>> 1. We need to do a page table walk/rmap walk again
>>>>>>>> 2. We'll need special handling of migration <-> migration entries
>>>>>>>>    in the rmap handling (set/remove migration ptes)
>>>>>>>> 3. In this context, the code is already in the middle of migration,
>>>>>>>>    so trying to do that again does not make sense.
>>>>>>>
>>>>>>> Why doing split in the middle of migration? Existing split code
>>>>>>> assumes to-be-split folios are mapped.
>>>>>>>
>>>>>>> What prevents doing split before migration?
>>>>>>>
>>>>>>
>>>>>> The code does do a split prior to migration if THP selection fails
>>>>>>
>>>>>> Please see https://lore.kernel.org/lkml/20250703233511.2028395-5-balbirs@nvidia.com/
>>>>>> and the fallback part which calls split_folio()
>>>>>>
>>>>>> But the case under consideration is special since the device needs to allocate
>>>>>> corresponding pfn's as well. The changelog mentions it:
>>>>>>
>>>>>> "The common case that arises is that after setup, during migrate
>>>>>> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
>>>>>> pages."
>>>>>>
>>>>>> I can expand on it, because migrate_vma() is a multi-phase operation
>>>>>>
>>>>>> 1. migrate_vma_setup()
>>>>>> 2. migrate_vma_pages()
>>>>>> 3. migrate_vma_finalize()
>>>>>>
>>>>>> It can so happen that when we get the destination pfn's allocated the destination
>>>>>> might not be able to allocate a large page, so we do the split in migrate_vma_pages().
>>>>>>
>>>>>> The pages have been unmapped and collected in migrate_vma_setup()
>>>>>>
>>>>>> The next patch in the series 9/12 (https://lore.kernel.org/lkml/20250703233511.2028395-10-balbirs@nvidia.com/)
>>>>>> tests the split and emulates a failure on the device side to allocate large pages
>>>>>> and tests it in 10/12 (https://lore.kernel.org/lkml/20250703233511.2028395-11-balbirs@nvidia.com/)
>>>>>>
>>>>>
>>>>> Another use case I’ve seen is when a previously allocated high-order
>>>>> folio, now in the free memory pool, is reallocated as a lower-order
>>>>> page. For example, a 2MB fault allocates a folio, the memory is later
>>>>
>>>> That is different. If the high-order folio is free, it should be split
>>>> using split_page() from mm/page_alloc.c.
>>>>
>>>
>>> Ah, ok. Let me see if that works - it would easier.
>>>
>
> This suggestion quickly blows up as PageCompound is true and page_count
> here is zero.

OK, your folio has PageCompound set. Then you will need __split_unmapped_folio().

>
>>>>> freed, and then a 4KB fault reuses a page from that previously allocated
>>>>> folio. This will be actually quite common in Xe / GPU SVM. In such
>>>>> cases, the folio in an unmapped state needs to be split. I’d suggest a
>>>>
>>>> This folio is unused, so ->flags, ->mapping, and etc. are not set,
>>>> __split_unmapped_folio() is not for it, unless you mean free folio
>>>> differently.
>>>>
>>>
>>> This is right, those fields should be clear.
>>>
>>> Thanks for the tip.
>>>
>> I was hoping to reuse __split_folio_to_order() at some point in the future
>> to split the backing pages in the driver, but it is not an immediate priority
>>
>
> I think we need something for the scenario I describe here. I was able to
> make __split_huge_page_to_list_to_order work with a couple of hacks but it
> is almost certainly not right as Zi pointed out.
>
> New to the MM stuff, but play around with this a bit and see if I can
> come up with something that will work here.

Can you try to write a new split_page function with __split_unmapped_folio()?
Since based on your description, your folio is not mapped.
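
To make that concrete, here is a rough sketch of what I have in mind. It is
untested and only illustrative: the helper name is made up, the argument list
follows the __split_unmapped_folio() version in this series, and the NULL
xas/mapping arguments plus the last flag are my assumptions about what the
no-mapping path needs. Since __split_unmapped_folio() is static, it would have
to live in mm/huge_memory.c.

/* Illustrative sketch only -- see the caveats above. */
static int split_unmapped_device_folio(struct folio *folio,
				       unsigned int new_order)
{
	int ret;

	/* The folio is unused here: no ->mapping, not mapped anywhere. */
	VM_WARN_ON_ONCE_FOLIO(folio_mapped(folio), folio);

	folio_lock(folio);
	ret = __split_unmapped_folio(folio, new_order,
				     &folio->page,	/* split_at */
				     &folio->page,	/* lock_at */
				     NULL,	/* list: caller keeps the pages */
				     -1,	/* end: anon convention */
				     NULL,	/* xas: assumed unused without a mapping */
				     NULL,	/* mapping is NULL for this folio */
				     true,	/* uniform split */
				     true);	/* the unmapped ("isolated") path */
	folio_unlock(folio);

	return ret;
}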


Best Regards,
Yan, Zi
Re: [v1 resend 08/12] mm/thp: add split during migration support
Posted by Matthew Brost 2 months, 3 weeks ago
On Thu, Jul 17, 2025 at 07:04:48PM -0400, Zi Yan wrote:
> On 17 Jul 2025, at 18:24, Matthew Brost wrote:
> 
> > On Thu, Jul 17, 2025 at 07:53:40AM +1000, Balbir Singh wrote:
> >> On 7/17/25 02:24, Matthew Brost wrote:
> >>> On Wed, Jul 16, 2025 at 07:19:10AM -0400, Zi Yan wrote:
> >>>> On 16 Jul 2025, at 1:34, Matthew Brost wrote:
> >>>>
> >>>>> On Sun, Jul 06, 2025 at 11:47:10AM +1000, Balbir Singh wrote:
> >>>>>> On 7/6/25 11:34, Zi Yan wrote:
> >>>>>>> On 5 Jul 2025, at 21:15, Balbir Singh wrote:
> >>>>>>>
> >>>>>>>> On 7/5/25 11:55, Zi Yan wrote:
> >>>>>>>>> On 4 Jul 2025, at 20:58, Balbir Singh wrote:
> >>>>>>>>>
> >>>>>>>>>> On 7/4/25 21:24, Zi Yan wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> s/pages/folio
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Thanks, will make the changes
> >>>>>>>>>>
> >>>>>>>>>>> Why name it isolated if the folio is unmapped? Isolated folios often mean
> >>>>>>>>>>> they are removed from LRU lists. isolated here causes confusion.
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Ack, will change the name
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>>>   *
> >>>>>>>>>>>>   * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
> >>>>>>>>>>>>   * It is in charge of checking whether the split is supported or not and
> >>>>>>>>>>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
> >>>>>>>>>>>>   */
> >>>>>>>>>>>>  static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>>>>>>  		struct page *split_at, struct page *lock_at,
> >>>>>>>>>>>> -		struct list_head *list, bool uniform_split)
> >>>>>>>>>>>> +		struct list_head *list, bool uniform_split, bool isolated)
> >>>>>>>>>>>>  {
> >>>>>>>>>>>>  	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
> >>>>>>>>>>>>  	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
> >>>>>>>>>>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>>>>>>  		 * is taken to serialise against parallel split or collapse
> >>>>>>>>>>>>  		 * operations.
> >>>>>>>>>>>>  		 */
> >>>>>>>>>>>> -		anon_vma = folio_get_anon_vma(folio);
> >>>>>>>>>>>> -		if (!anon_vma) {
> >>>>>>>>>>>> -			ret = -EBUSY;
> >>>>>>>>>>>> -			goto out;
> >>>>>>>>>>>> +		if (!isolated) {
> >>>>>>>>>>>> +			anon_vma = folio_get_anon_vma(folio);
> >>>>>>>>>>>> +			if (!anon_vma) {
> >>>>>>>>>>>> +				ret = -EBUSY;
> >>>>>>>>>>>> +				goto out;
> >>>>>>>>>>>> +			}
> >>>>>>>>>>>> +			anon_vma_lock_write(anon_vma);
> >>>>>>>>>>>>  		}
> >>>>>>>>>>>>  		end = -1;
> >>>>>>>>>>>>  		mapping = NULL;
> >>>>>>>>>>>> -		anon_vma_lock_write(anon_vma);
> >>>>>>>>>>>>  	} else {
> >>>>>>>>>>>>  		unsigned int min_order;
> >>>>>>>>>>>>  		gfp_t gfp;
> >>>>>>>>>>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>>>>>>  		goto out_unlock;
> >>>>>>>>>>>>  	}
> >>>>>>>>>>>>
> >>>>>>>>>>>> -	unmap_folio(folio);
> >>>>>>>>>>>> +	if (!isolated)
> >>>>>>>>>>>> +		unmap_folio(folio);
> >>>>>>>>>>>>
> >>>>>>>>>>>>  	/* block interrupt reentry in xa_lock and spinlock */
> >>>>>>>>>>>>  	local_irq_disable();
> >>>>>>>>>>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>>>>>>
> >>>>>>>>>>>>  		ret = __split_unmapped_folio(folio, new_order,
> >>>>>>>>>>>>  				split_at, lock_at, list, end, &xas, mapping,
> >>>>>>>>>>>> -				uniform_split);
> >>>>>>>>>>>> +				uniform_split, isolated);
> >>>>>>>>>>>>  	} else {
> >>>>>>>>>>>>  		spin_unlock(&ds_queue->split_queue_lock);
> >>>>>>>>>>>>  fail:
> >>>>>>>>>>>>  		if (mapping)
> >>>>>>>>>>>>  			xas_unlock(&xas);
> >>>>>>>>>>>>  		local_irq_enable();
> >>>>>>>>>>>> -		remap_page(folio, folio_nr_pages(folio), 0);
> >>>>>>>>>>>> +		if (!isolated)
> >>>>>>>>>>>> +			remap_page(folio, folio_nr_pages(folio), 0);
> >>>>>>>>>>>>  		ret = -EAGAIN;
> >>>>>>>>>>>>  	}
> >>>>>>>>>>>
> >>>>>>>>>>> These "isolated" special handlings does not look good, I wonder if there
> >>>>>>>>>>> is a way of letting split code handle device private folios more gracefully.
> >>>>>>>>>>> It also causes confusions, since why does "isolated/unmapped" folios
> >>>>>>>>>>> not need to unmap_page(), remap_page(), or unlock?
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> There are two reasons for going down the current code path
> >>>>>>>>>
> >>>>>>>>> After thinking more, I think adding isolated/unmapped is not the right
> >>>>>>>>> way, since unmapped folio is a very generic concept. If you add it,
> >>>>>>>>> one can easily misuse the folio split code by first unmapping a folio
> >>>>>>>>> and trying to split it with unmapped = true. I do not think that is
> >>>>>>>>> supported and your patch does not prevent that from happening in the future.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> I don't understand the misuse case you mention, I assume you mean someone can
> >>>>>>>> get the usage wrong? The responsibility is on the caller to do the right thing
> >>>>>>>> if calling the API with unmapped
> >>>>>>>
> >>>>>>> Before your patch, there is no use case of splitting unmapped folios.
> >>>>>>> Your patch only adds support for device private page split, not any unmapped
> >>>>>>> folio split. So using a generic isolated/unmapped parameter is not OK.
> >>>>>>>
> >>>>>>
> >>>>>> There is a use for splitting unmapped folios (see below)
> >>>>>>
> >>>>>>>>
> >>>>>>>>> You should teach different parts of folio split code path to handle
> >>>>>>>>> device private folios properly. Details are below.
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
> >>>>>>>>>>    the split routine to return with -EBUSY
> >>>>>>>>>
> >>>>>>>>> You do something below instead.
> >>>>>>>>>
> >>>>>>>>> if (!anon_vma && !folio_is_device_private(folio)) {
> >>>>>>>>> 	ret = -EBUSY;
> >>>>>>>>> 	goto out;
> >>>>>>>>> } else if (anon_vma) {
> >>>>>>>>> 	anon_vma_lock_write(anon_vma);
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> folio_get_anon() cannot be called for unmapped folios. In our case the page has
> >>>>>>>> already been unmapped. Is there a reason why you mix anon_vma_lock_write with
> >>>>>>>> the check for device private folios?
> >>>>>>>
> >>>>>>> Oh, I did not notice that anon_vma = folio_get_anon_vma(folio) is also
> >>>>>>> in if (!isolated) branch. In that case, just do
> >>>>>>>
> >>>>>>> if (folio_is_device_private(folio) {
> >>>>>>> ...
> >>>>>>> } else if (is_anon) {
> >>>>>>> ...
> >>>>>>> } else {
> >>>>>>> ...
> >>>>>>> }
> >>>>>>>
> >>>>>>>>
> >>>>>>>>> People can know device private folio split needs a special handling.
> >>>>>>>>>
> >>>>>>>>> BTW, why a device private folio can also be anonymous? Does it mean
> >>>>>>>>> if a page cache folio is migrated to device private, kernel also
> >>>>>>>>> sees it as both device private and file-backed?
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> FYI: device private folios only work with anonymous private pages, hence
> >>>>>>>> the name device private.
> >>>>>>>
> >>>>>>> OK.
> >>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
> >>>>>>>>>>    the migrate_device API has already just done as a part of the migration. The
> >>>>>>>>>>    entries under consideration are already migration entries in this case.
> >>>>>>>>>>    This is wasteful and in some case unexpected.
> >>>>>>>>>
> >>>>>>>>> unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split
> >>>>>>>>> PMD mapping, which you did in migrate_vma_split_pages(). You probably
> >>>>>>>>> can teach either try_to_migrate() or try_to_unmap() to just split
> >>>>>>>>> device private PMD mapping. Or if that is not preferred,
> >>>>>>>>> you can simply call split_huge_pmd_address() when unmap_folio()
> >>>>>>>>> sees a device private folio.
> >>>>>>>>>
> >>>>>>>>> For remap_page(), you can simply return for device private folios
> >>>>>>>>> like it is currently doing for non anonymous folios.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> Doing a full rmap walk does not make sense with unmap_folio() and
> >>>>>>>> remap_folio(), because
> >>>>>>>>
> >>>>>>>> 1. We need to do a page table walk/rmap walk again
> >>>>>>>> 2. We'll need special handling of migration <-> migration entries
> >>>>>>>>    in the rmap handling (set/remove migration ptes)
> >>>>>>>> 3. In this context, the code is already in the middle of migration,
> >>>>>>>>    so trying to do that again does not make sense.
> >>>>>>>
> >>>>>>> Why doing split in the middle of migration? Existing split code
> >>>>>>> assumes to-be-split folios are mapped.
> >>>>>>>
> >>>>>>> What prevents doing split before migration?
> >>>>>>>
> >>>>>>
> >>>>>> The code does do a split prior to migration if THP selection fails
> >>>>>>
> >>>>>> Please see https://lore.kernel.org/lkml/20250703233511.2028395-5-balbirs@nvidia.com/
> >>>>>> and the fallback part which calls split_folio()
> >>>>>>
> >>>>>> But the case under consideration is special since the device needs to allocate
> >>>>>> corresponding pfn's as well. The changelog mentions it:
> >>>>>>
> >>>>>> "The common case that arises is that after setup, during migrate
> >>>>>> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
> >>>>>> pages."
> >>>>>>
> >>>>>> I can expand on it, because migrate_vma() is a multi-phase operation
> >>>>>>
> >>>>>> 1. migrate_vma_setup()
> >>>>>> 2. migrate_vma_pages()
> >>>>>> 3. migrate_vma_finalize()
> >>>>>>
> >>>>>> It can so happen that when we get the destination pfn's allocated the destination
> >>>>>> might not be able to allocate a large page, so we do the split in migrate_vma_pages().
> >>>>>>
> >>>>>> The pages have been unmapped and collected in migrate_vma_setup()
> >>>>>>
> >>>>>> The next patch in the series 9/12 (https://lore.kernel.org/lkml/20250703233511.2028395-10-balbirs@nvidia.com/)
> >>>>>> tests the split and emulates a failure on the device side to allocate large pages
> >>>>>> and tests it in 10/12 (https://lore.kernel.org/lkml/20250703233511.2028395-11-balbirs@nvidia.com/)
> >>>>>>
> >>>>>
> >>>>> Another use case I’ve seen is when a previously allocated high-order
> >>>>> folio, now in the free memory pool, is reallocated as a lower-order
> >>>>> page. For example, a 2MB fault allocates a folio, the memory is later
> >>>>
> >>>> That is different. If the high-order folio is free, it should be split
> >>>> using split_page() from mm/page_alloc.c.
> >>>>
> >>>
> >>> Ah, ok. Let me see if that works - it would be easier.
> >>>
> >
> > This suggestion quickly blows up as PageCompound is true and page_count
> > here is zero.
> 
> OK, your folio has PageCompound set. Then you will need __split_unmapped_folio().
> 
> >
> >>>>> freed, and then a 4KB fault reuses a page from that previously allocated
> >>>>> folio. This will be actually quite common in Xe / GPU SVM. In such
> >>>>> cases, the folio in an unmapped state needs to be split. I’d suggest a
> >>>>
> >>>> This folio is unused, so ->flags, ->mapping, and etc. are not set,
> >>>> __split_unmapped_folio() is not for it, unless you mean free folio
> >>>> differently.
> >>>>
> >>>
> >>> This is right, those fields should be clear.
> >>>
> >>> Thanks for the tip.
> >>>
> >> I was hoping to reuse __split_folio_to_order() at some point in the future
> >> to split the backing pages in the driver, but it is not an immediate priority
> >>
> >
> > I think we need something for the scenario I describe here. I was able to
> > make __split_huge_page_to_list_to_order work with a couple of hacks but it
> > is almost certainly not right as Zi pointed out.
> >
> > New to the MM stuff, but play around with this a bit and see if I can
> > come up with something that will work here.
> 
> Can you try to write a new split_page function with __split_unmapped_folio()?
> Since based on your description, your folio is not mapped.
> 

Yes, page->mapping is NULL in this case - that was part of the hacks to
__split_huge_page_to_list_to_order (more specifically __folio_split) I had
to make in order to get something working for this case.

I can try out something based on __split_unmapped_folio and report back.

Matt 

> 
> Best Regards,
> Yan, Zi
Re: [v1 resend 08/12] mm/thp: add split during migration support
Posted by Zi Yan 2 months, 3 weeks ago
On 17 Jul 2025, at 20:41, Matthew Brost wrote:

> On Thu, Jul 17, 2025 at 07:04:48PM -0400, Zi Yan wrote:
>> On 17 Jul 2025, at 18:24, Matthew Brost wrote:
>>
>>> On Thu, Jul 17, 2025 at 07:53:40AM +1000, Balbir Singh wrote:
>>>> On 7/17/25 02:24, Matthew Brost wrote:
>>>>> On Wed, Jul 16, 2025 at 07:19:10AM -0400, Zi Yan wrote:
>>>>>> On 16 Jul 2025, at 1:34, Matthew Brost wrote:
>>>>>>
>>>>>>> On Sun, Jul 06, 2025 at 11:47:10AM +1000, Balbir Singh wrote:
>>>>>>>> On 7/6/25 11:34, Zi Yan wrote:
>>>>>>>>> On 5 Jul 2025, at 21:15, Balbir Singh wrote:
>>>>>>>>>
>>>>>>>>>> On 7/5/25 11:55, Zi Yan wrote:
>>>>>>>>>>> On 4 Jul 2025, at 20:58, Balbir Singh wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On 7/4/25 21:24, Zi Yan wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> s/pages/folio
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks, will make the changes
>>>>>>>>>>>>
>>>>>>>>>>>>> Why name it isolated if the folio is unmapped? Isolated folios often mean
>>>>>>>>>>>>> they are removed from LRU lists. isolated here causes confusion.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Ack, will change the name
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>>   *
>>>>>>>>>>>>>>   * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>>>>>>>>>>>>>   * It is in charge of checking whether the split is supported or not and
>>>>>>>>>>>>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>>>>>>>>>>>>   */
>>>>>>>>>>>>>>  static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>>>>>  		struct page *split_at, struct page *lock_at,
>>>>>>>>>>>>>> -		struct list_head *list, bool uniform_split)
>>>>>>>>>>>>>> +		struct list_head *list, bool uniform_split, bool isolated)
>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>  	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>>>>>>>>>>>>  	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>>>>>>>>>>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>>>>>  		 * is taken to serialise against parallel split or collapse
>>>>>>>>>>>>>>  		 * operations.
>>>>>>>>>>>>>>  		 */
>>>>>>>>>>>>>> -		anon_vma = folio_get_anon_vma(folio);
>>>>>>>>>>>>>> -		if (!anon_vma) {
>>>>>>>>>>>>>> -			ret = -EBUSY;
>>>>>>>>>>>>>> -			goto out;
>>>>>>>>>>>>>> +		if (!isolated) {
>>>>>>>>>>>>>> +			anon_vma = folio_get_anon_vma(folio);
>>>>>>>>>>>>>> +			if (!anon_vma) {
>>>>>>>>>>>>>> +				ret = -EBUSY;
>>>>>>>>>>>>>> +				goto out;
>>>>>>>>>>>>>> +			}
>>>>>>>>>>>>>> +			anon_vma_lock_write(anon_vma);
>>>>>>>>>>>>>>  		}
>>>>>>>>>>>>>>  		end = -1;
>>>>>>>>>>>>>>  		mapping = NULL;
>>>>>>>>>>>>>> -		anon_vma_lock_write(anon_vma);
>>>>>>>>>>>>>>  	} else {
>>>>>>>>>>>>>>  		unsigned int min_order;
>>>>>>>>>>>>>>  		gfp_t gfp;
>>>>>>>>>>>>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>>>>>  		goto out_unlock;
>>>>>>>>>>>>>>  	}
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -	unmap_folio(folio);
>>>>>>>>>>>>>> +	if (!isolated)
>>>>>>>>>>>>>> +		unmap_folio(folio);
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  	/* block interrupt reentry in xa_lock and spinlock */
>>>>>>>>>>>>>>  	local_irq_disable();
>>>>>>>>>>>>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  		ret = __split_unmapped_folio(folio, new_order,
>>>>>>>>>>>>>>  				split_at, lock_at, list, end, &xas, mapping,
>>>>>>>>>>>>>> -				uniform_split);
>>>>>>>>>>>>>> +				uniform_split, isolated);
>>>>>>>>>>>>>>  	} else {
>>>>>>>>>>>>>>  		spin_unlock(&ds_queue->split_queue_lock);
>>>>>>>>>>>>>>  fail:
>>>>>>>>>>>>>>  		if (mapping)
>>>>>>>>>>>>>>  			xas_unlock(&xas);
>>>>>>>>>>>>>>  		local_irq_enable();
>>>>>>>>>>>>>> -		remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>>>>>>>>> +		if (!isolated)
>>>>>>>>>>>>>> +			remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>>>>>>>>>  		ret = -EAGAIN;
>>>>>>>>>>>>>>  	}
>>>>>>>>>>>>>
>>>>>>>>>>>>> These "isolated" special handlings does not look good, I wonder if there
>>>>>>>>>>>>> is a way of letting split code handle device private folios more gracefully.
>>>>>>>>>>>>> It also causes confusions, since why does "isolated/unmapped" folios
>>>>>>>>>>>>> not need to unmap_page(), remap_page(), or unlock?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> There are two reasons for going down the current code path
>>>>>>>>>>>
>>>>>>>>>>> After thinking more, I think adding isolated/unmapped is not the right
>>>>>>>>>>> way, since unmapped folio is a very generic concept. If you add it,
>>>>>>>>>>> one can easily misuse the folio split code by first unmapping a folio
>>>>>>>>>>> and trying to split it with unmapped = true. I do not think that is
>>>>>>>>>>> supported and your patch does not prevent that from happening in the future.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I don't understand the misuse case you mention, I assume you mean someone can
>>>>>>>>>> get the usage wrong? The responsibility is on the caller to do the right thing
>>>>>>>>>> if calling the API with unmapped
>>>>>>>>>
>>>>>>>>> Before your patch, there is no use case of splitting unmapped folios.
>>>>>>>>> Your patch only adds support for device private page split, not any unmapped
>>>>>>>>> folio split. So using a generic isolated/unmapped parameter is not OK.
>>>>>>>>>
>>>>>>>>
>>>>>>>> There is a use for splitting unmapped folios (see below)
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> You should teach different parts of folio split code path to handle
>>>>>>>>>>> device private folios properly. Details are below.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
>>>>>>>>>>>>    the split routine to return with -EBUSY
>>>>>>>>>>>
>>>>>>>>>>> You do something below instead.
>>>>>>>>>>>
>>>>>>>>>>> if (!anon_vma && !folio_is_device_private(folio)) {
>>>>>>>>>>> 	ret = -EBUSY;
>>>>>>>>>>> 	goto out;
>>>>>>>>>>> } else if (anon_vma) {
>>>>>>>>>>> 	anon_vma_lock_write(anon_vma);
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> folio_get_anon() cannot be called for unmapped folios. In our case the page has
>>>>>>>>>> already been unmapped. Is there a reason why you mix anon_vma_lock_write with
>>>>>>>>>> the check for device private folios?
>>>>>>>>>
>>>>>>>>> Oh, I did not notice that anon_vma = folio_get_anon_vma(folio) is also
>>>>>>>>> in if (!isolated) branch. In that case, just do
>>>>>>>>>
>>>>>>>>> if (folio_is_device_private(folio) {
>>>>>>>>> ...
>>>>>>>>> } else if (is_anon) {
>>>>>>>>> ...
>>>>>>>>> } else {
>>>>>>>>> ...
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> People can know device private folio split needs a special handling.
>>>>>>>>>>>
>>>>>>>>>>> BTW, why a device private folio can also be anonymous? Does it mean
>>>>>>>>>>> if a page cache folio is migrated to device private, kernel also
>>>>>>>>>>> sees it as both device private and file-backed?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> FYI: device private folios only work with anonymous private pages, hence
>>>>>>>>>> the name device private.
>>>>>>>>>
>>>>>>>>> OK.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
>>>>>>>>>>>>    the migrate_device API has already just done as a part of the migration. The
>>>>>>>>>>>>    entries under consideration are already migration entries in this case.
>>>>>>>>>>>>    This is wasteful and in some case unexpected.
>>>>>>>>>>>
>>>>>>>>>>> unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split
>>>>>>>>>>> PMD mapping, which you did in migrate_vma_split_pages(). You probably
>>>>>>>>>>> can teach either try_to_migrate() or try_to_unmap() to just split
>>>>>>>>>>> device private PMD mapping. Or if that is not preferred,
>>>>>>>>>>> you can simply call split_huge_pmd_address() when unmap_folio()
>>>>>>>>>>> sees a device private folio.
>>>>>>>>>>>
>>>>>>>>>>> For remap_page(), you can simply return for device private folios
>>>>>>>>>>> like it is currently doing for non anonymous folios.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Doing a full rmap walk does not make sense with unmap_folio() and
>>>>>>>>>> remap_folio(), because
>>>>>>>>>>
>>>>>>>>>> 1. We need to do a page table walk/rmap walk again
>>>>>>>>>> 2. We'll need special handling of migration <-> migration entries
>>>>>>>>>>    in the rmap handling (set/remove migration ptes)
>>>>>>>>>> 3. In this context, the code is already in the middle of migration,
>>>>>>>>>>    so trying to do that again does not make sense.
>>>>>>>>>
>>>>>>>>> Why doing split in the middle of migration? Existing split code
>>>>>>>>> assumes to-be-split folios are mapped.
>>>>>>>>>
>>>>>>>>> What prevents doing split before migration?
>>>>>>>>>
>>>>>>>>
>>>>>>>> The code does do a split prior to migration if THP selection fails
>>>>>>>>
>>>>>>>> Please see https://lore.kernel.org/lkml/20250703233511.2028395-5-balbirs@nvidia.com/
>>>>>>>> and the fallback part which calls split_folio()
>>>>>>>>
>>>>>>>> But the case under consideration is special since the device needs to allocate
>>>>>>>> corresponding pfn's as well. The changelog mentions it:
>>>>>>>>
>>>>>>>> "The common case that arises is that after setup, during migrate
>>>>>>>> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
>>>>>>>> pages."
>>>>>>>>
>>>>>>>> I can expand on it, because migrate_vma() is a multi-phase operation
>>>>>>>>
>>>>>>>> 1. migrate_vma_setup()
>>>>>>>> 2. migrate_vma_pages()
>>>>>>>> 3. migrate_vma_finalize()
>>>>>>>>
>>>>>>>> It can so happen that when we get the destination pfn's allocated the destination
>>>>>>>> might not be able to allocate a large page, so we do the split in migrate_vma_pages().
>>>>>>>>
>>>>>>>> The pages have been unmapped and collected in migrate_vma_setup()
>>>>>>>>
>>>>>>>> The next patch in the series 9/12 (https://lore.kernel.org/lkml/20250703233511.2028395-10-balbirs@nvidia.com/)
>>>>>>>> tests the split and emulates a failure on the device side to allocate large pages
>>>>>>>> and tests it in 10/12 (https://lore.kernel.org/lkml/20250703233511.2028395-11-balbirs@nvidia.com/)
>>>>>>>>
>>>>>>>
>>>>>>> Another use case I’ve seen is when a previously allocated high-order
>>>>>>> folio, now in the free memory pool, is reallocated as a lower-order
>>>>>>> page. For example, a 2MB fault allocates a folio, the memory is later
>>>>>>
>>>>>> That is different. If the high-order folio is free, it should be split
>>>>>> using split_page() from mm/page_alloc.c.
>>>>>>
>>>>>
>>>>> Ah, ok. Let me see if that works - it would be easier.
>>>>>
>>>
>>> This suggestion quickly blows up as PageCompound is true and page_count
>>> here is zero.
>>
>> OK, your folio has PageCompound set. Then you will need __split_unmapped_folio().
>>
>>>
>>>>>>> freed, and then a 4KB fault reuses a page from that previously allocated
>>>>>>> folio. This will be actually quite common in Xe / GPU SVM. In such
>>>>>>> cases, the folio in an unmapped state needs to be split. I’d suggest a
>>>>>>
>>>>>> This folio is unused, so ->flags, ->mapping, and etc. are not set,
>>>>>> __split_unmapped_folio() is not for it, unless you mean free folio
>>>>>> differently.
>>>>>>
>>>>>
>>>>> This is right, those fields should be clear.
>>>>>
>>>>> Thanks for the tip.
>>>>>
>>>> I was hoping to reuse __split_folio_to_order() at some point in the future
>>>> to split the backing pages in the driver, but it is not an immediate priority
>>>>
>>>
>>> I think we need something for the scenario I describe here. I was able to
>>> make __split_huge_page_to_list_to_order work with a couple of hacks but it
>>> is almost certainly not right as Zi pointed out.
>>>
>>> New to the MM stuff, but play around with this a bit and see if I can
>>> come up with something that will work here.
>>
>> Can you try to write a new split_page function with __split_unmapped_folio()?
>> Since based on your description, your folio is not mapped.
>>
>
> Yes, page->mapping is NULL in this case - that was part of the hacks to
> __split_huge_page_to_list_to_order (more specifically __folio_split) I had
> to make in order to get something working for this case.
>
> I can try out something based on __split_unmapped_folio and report back.

mm-new tree has an updated __split_unmapped_folio() version; it moves
all of the unmap-irrelevant code out of __split_unmapped_folio(). You might find
it easier to reuse.

See: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/tree/mm/huge_memory.c?h=mm-new#n3430

I am about to update the code with v4 patches. I will cc you, so that
you can get the updated __split_unmapped_folio().

Feel free to ask questions on folio split code.

Best Regards,
Yan, Zi
Re: [v1 resend 08/12] mm/thp: add split during migration support
Posted by Matthew Brost 2 months, 3 weeks ago
On Thu, Jul 17, 2025 at 09:25:02PM -0400, Zi Yan wrote:
> On 17 Jul 2025, at 20:41, Matthew Brost wrote:
> 
> > On Thu, Jul 17, 2025 at 07:04:48PM -0400, Zi Yan wrote:
> >> On 17 Jul 2025, at 18:24, Matthew Brost wrote:
> >>
> >>> On Thu, Jul 17, 2025 at 07:53:40AM +1000, Balbir Singh wrote:
> >>>> On 7/17/25 02:24, Matthew Brost wrote:
> >>>>> On Wed, Jul 16, 2025 at 07:19:10AM -0400, Zi Yan wrote:
> >>>>>> On 16 Jul 2025, at 1:34, Matthew Brost wrote:
> >>>>>>
> >>>>>>> On Sun, Jul 06, 2025 at 11:47:10AM +1000, Balbir Singh wrote:
> >>>>>>>> On 7/6/25 11:34, Zi Yan wrote:
> >>>>>>>>> On 5 Jul 2025, at 21:15, Balbir Singh wrote:
> >>>>>>>>>
> >>>>>>>>>> On 7/5/25 11:55, Zi Yan wrote:
> >>>>>>>>>>> On 4 Jul 2025, at 20:58, Balbir Singh wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> On 7/4/25 21:24, Zi Yan wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> s/pages/folio
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks, will make the changes
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Why name it isolated if the folio is unmapped? Isolated folios often mean
> >>>>>>>>>>>>> they are removed from LRU lists. isolated here causes confusion.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Ack, will change the name
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>>>   *
> >>>>>>>>>>>>>>   * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
> >>>>>>>>>>>>>>   * It is in charge of checking whether the split is supported or not and
> >>>>>>>>>>>>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
> >>>>>>>>>>>>>>   */
> >>>>>>>>>>>>>>  static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>>>>>>>>  		struct page *split_at, struct page *lock_at,
> >>>>>>>>>>>>>> -		struct list_head *list, bool uniform_split)
> >>>>>>>>>>>>>> +		struct list_head *list, bool uniform_split, bool isolated)
> >>>>>>>>>>>>>>  {
> >>>>>>>>>>>>>>  	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
> >>>>>>>>>>>>>>  	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
> >>>>>>>>>>>>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>>>>>>>>  		 * is taken to serialise against parallel split or collapse
> >>>>>>>>>>>>>>  		 * operations.
> >>>>>>>>>>>>>>  		 */
> >>>>>>>>>>>>>> -		anon_vma = folio_get_anon_vma(folio);
> >>>>>>>>>>>>>> -		if (!anon_vma) {
> >>>>>>>>>>>>>> -			ret = -EBUSY;
> >>>>>>>>>>>>>> -			goto out;
> >>>>>>>>>>>>>> +		if (!isolated) {
> >>>>>>>>>>>>>> +			anon_vma = folio_get_anon_vma(folio);
> >>>>>>>>>>>>>> +			if (!anon_vma) {
> >>>>>>>>>>>>>> +				ret = -EBUSY;
> >>>>>>>>>>>>>> +				goto out;
> >>>>>>>>>>>>>> +			}
> >>>>>>>>>>>>>> +			anon_vma_lock_write(anon_vma);
> >>>>>>>>>>>>>>  		}
> >>>>>>>>>>>>>>  		end = -1;
> >>>>>>>>>>>>>>  		mapping = NULL;
> >>>>>>>>>>>>>> -		anon_vma_lock_write(anon_vma);
> >>>>>>>>>>>>>>  	} else {
> >>>>>>>>>>>>>>  		unsigned int min_order;
> >>>>>>>>>>>>>>  		gfp_t gfp;
> >>>>>>>>>>>>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>>>>>>>>  		goto out_unlock;
> >>>>>>>>>>>>>>  	}
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> -	unmap_folio(folio);
> >>>>>>>>>>>>>> +	if (!isolated)
> >>>>>>>>>>>>>> +		unmap_folio(folio);
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>  	/* block interrupt reentry in xa_lock and spinlock */
> >>>>>>>>>>>>>>  	local_irq_disable();
> >>>>>>>>>>>>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>  		ret = __split_unmapped_folio(folio, new_order,
> >>>>>>>>>>>>>>  				split_at, lock_at, list, end, &xas, mapping,
> >>>>>>>>>>>>>> -				uniform_split);
> >>>>>>>>>>>>>> +				uniform_split, isolated);
> >>>>>>>>>>>>>>  	} else {
> >>>>>>>>>>>>>>  		spin_unlock(&ds_queue->split_queue_lock);
> >>>>>>>>>>>>>>  fail:
> >>>>>>>>>>>>>>  		if (mapping)
> >>>>>>>>>>>>>>  			xas_unlock(&xas);
> >>>>>>>>>>>>>>  		local_irq_enable();
> >>>>>>>>>>>>>> -		remap_page(folio, folio_nr_pages(folio), 0);
> >>>>>>>>>>>>>> +		if (!isolated)
> >>>>>>>>>>>>>> +			remap_page(folio, folio_nr_pages(folio), 0);
> >>>>>>>>>>>>>>  		ret = -EAGAIN;
> >>>>>>>>>>>>>>  	}
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> These "isolated" special handlings does not look good, I wonder if there
> >>>>>>>>>>>>> is a way of letting split code handle device private folios more gracefully.
> >>>>>>>>>>>>> It also causes confusions, since why does "isolated/unmapped" folios
> >>>>>>>>>>>>> not need to unmap_page(), remap_page(), or unlock?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> There are two reasons for going down the current code path
> >>>>>>>>>>>
> >>>>>>>>>>> After thinking more, I think adding isolated/unmapped is not the right
> >>>>>>>>>>> way, since unmapped folio is a very generic concept. If you add it,
> >>>>>>>>>>> one can easily misuse the folio split code by first unmapping a folio
> >>>>>>>>>>> and trying to split it with unmapped = true. I do not think that is
> >>>>>>>>>>> supported and your patch does not prevent that from happening in the future.
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> I don't understand the misuse case you mention, I assume you mean someone can
> >>>>>>>>>> get the usage wrong? The responsibility is on the caller to do the right thing
> >>>>>>>>>> if calling the API with unmapped
> >>>>>>>>>
> >>>>>>>>> Before your patch, there is no use case of splitting unmapped folios.
> >>>>>>>>> Your patch only adds support for device private page split, not any unmapped
> >>>>>>>>> folio split. So using a generic isolated/unmapped parameter is not OK.
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> There is a use for splitting unmapped folios (see below)
> >>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> You should teach different parts of folio split code path to handle
> >>>>>>>>>>> device private folios properly. Details are below.
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
> >>>>>>>>>>>>    the split routine to return with -EBUSY
> >>>>>>>>>>>
> >>>>>>>>>>> You do something below instead.
> >>>>>>>>>>>
> >>>>>>>>>>> if (!anon_vma && !folio_is_device_private(folio)) {
> >>>>>>>>>>> 	ret = -EBUSY;
> >>>>>>>>>>> 	goto out;
> >>>>>>>>>>> } else if (anon_vma) {
> >>>>>>>>>>> 	anon_vma_lock_write(anon_vma);
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> folio_get_anon() cannot be called for unmapped folios. In our case the page has
> >>>>>>>>>> already been unmapped. Is there a reason why you mix anon_vma_lock_write with
> >>>>>>>>>> the check for device private folios?
> >>>>>>>>>
> >>>>>>>>> Oh, I did not notice that anon_vma = folio_get_anon_vma(folio) is also
> >>>>>>>>> in if (!isolated) branch. In that case, just do
> >>>>>>>>>
> >>>>>>>>> if (folio_is_device_private(folio) {
> >>>>>>>>> ...
> >>>>>>>>> } else if (is_anon) {
> >>>>>>>>> ...
> >>>>>>>>> } else {
> >>>>>>>>> ...
> >>>>>>>>> }
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> People can know device private folio split needs a special handling.
> >>>>>>>>>>>
> >>>>>>>>>>> BTW, why a device private folio can also be anonymous? Does it mean
> >>>>>>>>>>> if a page cache folio is migrated to device private, kernel also
> >>>>>>>>>>> sees it as both device private and file-backed?
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> FYI: device private folios only work with anonymous private pages, hence
> >>>>>>>>>> the name device private.
> >>>>>>>>>
> >>>>>>>>> OK.
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
> >>>>>>>>>>>>    the migrate_device API has already just done as a part of the migration. The
> >>>>>>>>>>>>    entries under consideration are already migration entries in this case.
> >>>>>>>>>>>>    This is wasteful and in some case unexpected.
> >>>>>>>>>>>
> >>>>>>>>>>> unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split
> >>>>>>>>>>> PMD mapping, which you did in migrate_vma_split_pages(). You probably
> >>>>>>>>>>> can teach either try_to_migrate() or try_to_unmap() to just split
> >>>>>>>>>>> device private PMD mapping. Or if that is not preferred,
> >>>>>>>>>>> you can simply call split_huge_pmd_address() when unmap_folio()
> >>>>>>>>>>> sees a device private folio.
> >>>>>>>>>>>
> >>>>>>>>>>> For remap_page(), you can simply return for device private folios
> >>>>>>>>>>> like it is currently doing for non anonymous folios.
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Doing a full rmap walk does not make sense with unmap_folio() and
> >>>>>>>>>> remap_folio(), because
> >>>>>>>>>>
> >>>>>>>>>> 1. We need to do a page table walk/rmap walk again
> >>>>>>>>>> 2. We'll need special handling of migration <-> migration entries
> >>>>>>>>>>    in the rmap handling (set/remove migration ptes)
> >>>>>>>>>> 3. In this context, the code is already in the middle of migration,
> >>>>>>>>>>    so trying to do that again does not make sense.
> >>>>>>>>>
> >>>>>>>>> Why doing split in the middle of migration? Existing split code
> >>>>>>>>> assumes to-be-split folios are mapped.
> >>>>>>>>>
> >>>>>>>>> What prevents doing split before migration?
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> The code does do a split prior to migration if THP selection fails
> >>>>>>>>
> >>>>>>>> Please see https://lore.kernel.org/lkml/20250703233511.2028395-5-balbirs@nvidia.com/
> >>>>>>>> and the fallback part which calls split_folio()
> >>>>>>>>
> >>>>>>>> But the case under consideration is special since the device needs to allocate
> >>>>>>>> corresponding pfn's as well. The changelog mentions it:
> >>>>>>>>
> >>>>>>>> "The common case that arises is that after setup, during migrate
> >>>>>>>> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
> >>>>>>>> pages."
> >>>>>>>>
> >>>>>>>> I can expand on it, because migrate_vma() is a multi-phase operation
> >>>>>>>>
> >>>>>>>> 1. migrate_vma_setup()
> >>>>>>>> 2. migrate_vma_pages()
> >>>>>>>> 3. migrate_vma_finalize()
> >>>>>>>>
> >>>>>>>> It can so happen that when we get the destination pfn's allocated the destination
> >>>>>>>> might not be able to allocate a large page, so we do the split in migrate_vma_pages().
> >>>>>>>>
> >>>>>>>> The pages have been unmapped and collected in migrate_vma_setup()
> >>>>>>>>
> >>>>>>>> The next patch in the series 9/12 (https://lore.kernel.org/lkml/20250703233511.2028395-10-balbirs@nvidia.com/)
> >>>>>>>> tests the split and emulates a failure on the device side to allocate large pages
> >>>>>>>> and tests it in 10/12 (https://lore.kernel.org/lkml/20250703233511.2028395-11-balbirs@nvidia.com/)
> >>>>>>>>
> >>>>>>>
> >>>>>>> Another use case I’ve seen is when a previously allocated high-order
> >>>>>>> folio, now in the free memory pool, is reallocated as a lower-order
> >>>>>>> page. For example, a 2MB fault allocates a folio, the memory is later
> >>>>>>
> >>>>>> That is different. If the high-order folio is free, it should be split
> >>>>>> using split_page() from mm/page_alloc.c.
> >>>>>>
> >>>>>
> >>>>> Ah, ok. Let me see if that works - it would be easier.
> >>>>>
> >>>
> >>> This suggestion quickly blows up as PageCompound is true and page_count
> >>> here is zero.
> >>
>>>> OK, your folio has PageCompound set. Then you will need __split_unmapped_folio().
> >>
> >>>
> >>>>>>> freed, and then a 4KB fault reuses a page from that previously allocated
> >>>>>>> folio. This will be actually quite common in Xe / GPU SVM. In such
> >>>>>>> cases, the folio in an unmapped state needs to be split. I’d suggest a
> >>>>>>
> >>>>>> This folio is unused, so ->flags, ->mapping, and etc. are not set,
> >>>>>> __split_unmapped_folio() is not for it, unless you mean free folio
> >>>>>> differently.
> >>>>>>
> >>>>>
> >>>>> This is right, those fields should be clear.
> >>>>>
> >>>>> Thanks for the tip.
> >>>>>
> >>>> I was hoping to reuse __split_folio_to_order() at some point in the future
> >>>> to split the backing pages in the driver, but it is not an immediate priority
> >>>>
> >>>
> >>> I think we need something for the scenario I describe here. I was able to
> >>> make __split_huge_page_to_list_to_order work with a couple of hacks but it
> >>> is almost certainly not right as Zi pointed out.
> >>>
> >>> New to the MM stuff, but play around with this a bit and see if I can
> >>> come up with something that will work here.
> >>
> >> Can you try to write a new split_page function with __split_unmapped_folio()?
> >> Since based on your description, your folio is not mapped.
> >>
> >
> > Yes, page->mapping is NULL in this case - that was part of the hacks to
> > __split_huge_page_to_list_to_order (more specifically __folio_split) I had
> > to make in order to get something working for this case.
> >
> > I can try out something based on __split_unmapped_folio and report back.
> 
> mm-new tree has an updated __split_unmapped_folio() version; it moves
> all of the unmap-irrelevant code out of __split_unmapped_folio(). You might find
> it easier to reuse.
> 
> See: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/tree/mm/huge_memory.c?h=mm-new#n3430
> 

Will take a look. It is possible some of the issues we are hitting are
due to working on drm-tip + pulling in core MM patches in this series on
top of that branch and then missing some other patches in mm-new. I'll see
if we can figure out a workflow to have the latest and greatest from
both drm-tip and the MM branches.

Will these changes be in 6.17?

> I am about to update the code with v4 patches. I will cc you, so that
> you can get the updated __split_unmapped_folio().
> 
> Feel free to ask questions on folio split code.
>

Thanks.

Matt
 
> Best Regards,
> Yan, Zi
Re: [v1 resend 08/12] mm/thp: add split during migration support
Posted by Zi Yan 2 months, 3 weeks ago
On 17 Jul 2025, at 23:33, Matthew Brost wrote:

> On Thu, Jul 17, 2025 at 09:25:02PM -0400, Zi Yan wrote:
>> On 17 Jul 2025, at 20:41, Matthew Brost wrote:
>>
>>> On Thu, Jul 17, 2025 at 07:04:48PM -0400, Zi Yan wrote:
>>>> On 17 Jul 2025, at 18:24, Matthew Brost wrote:
>>>>
>>>>> On Thu, Jul 17, 2025 at 07:53:40AM +1000, Balbir Singh wrote:
>>>>>> On 7/17/25 02:24, Matthew Brost wrote:
>>>>>>> On Wed, Jul 16, 2025 at 07:19:10AM -0400, Zi Yan wrote:
>>>>>>>> On 16 Jul 2025, at 1:34, Matthew Brost wrote:
>>>>>>>>
>>>>>>>>> On Sun, Jul 06, 2025 at 11:47:10AM +1000, Balbir Singh wrote:
>>>>>>>>>> On 7/6/25 11:34, Zi Yan wrote:
>>>>>>>>>>> On 5 Jul 2025, at 21:15, Balbir Singh wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On 7/5/25 11:55, Zi Yan wrote:
>>>>>>>>>>>>> On 4 Jul 2025, at 20:58, Balbir Singh wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 7/4/25 21:24, Zi Yan wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> s/pages/folio
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks, will make the changes
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Why name it isolated if the folio is unmapped? Isolated folios often mean
>>>>>>>>>>>>>>> they are removed from LRU lists. isolated here causes confusion.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ack, will change the name
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>   *
>>>>>>>>>>>>>>>>   * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>>>>>>>>>>>>>>>   * It is in charge of checking whether the split is supported or not and
>>>>>>>>>>>>>>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>>>>>>>>>>>>>>   */
>>>>>>>>>>>>>>>>  static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>>>>>>>  		struct page *split_at, struct page *lock_at,
>>>>>>>>>>>>>>>> -		struct list_head *list, bool uniform_split)
>>>>>>>>>>>>>>>> +		struct list_head *list, bool uniform_split, bool isolated)
>>>>>>>>>>>>>>>>  {
>>>>>>>>>>>>>>>>  	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>>>>>>>>>>>>>>  	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>>>>>>>>>>>>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>>>>>>>  		 * is taken to serialise against parallel split or collapse
>>>>>>>>>>>>>>>>  		 * operations.
>>>>>>>>>>>>>>>>  		 */
>>>>>>>>>>>>>>>> -		anon_vma = folio_get_anon_vma(folio);
>>>>>>>>>>>>>>>> -		if (!anon_vma) {
>>>>>>>>>>>>>>>> -			ret = -EBUSY;
>>>>>>>>>>>>>>>> -			goto out;
>>>>>>>>>>>>>>>> +		if (!isolated) {
>>>>>>>>>>>>>>>> +			anon_vma = folio_get_anon_vma(folio);
>>>>>>>>>>>>>>>> +			if (!anon_vma) {
>>>>>>>>>>>>>>>> +				ret = -EBUSY;
>>>>>>>>>>>>>>>> +				goto out;
>>>>>>>>>>>>>>>> +			}
>>>>>>>>>>>>>>>> +			anon_vma_lock_write(anon_vma);
>>>>>>>>>>>>>>>>  		}
>>>>>>>>>>>>>>>>  		end = -1;
>>>>>>>>>>>>>>>>  		mapping = NULL;
>>>>>>>>>>>>>>>> -		anon_vma_lock_write(anon_vma);
>>>>>>>>>>>>>>>>  	} else {
>>>>>>>>>>>>>>>>  		unsigned int min_order;
>>>>>>>>>>>>>>>>  		gfp_t gfp;
>>>>>>>>>>>>>>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>>>>>>>  		goto out_unlock;
>>>>>>>>>>>>>>>>  	}
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> -	unmap_folio(folio);
>>>>>>>>>>>>>>>> +	if (!isolated)
>>>>>>>>>>>>>>>> +		unmap_folio(folio);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  	/* block interrupt reentry in xa_lock and spinlock */
>>>>>>>>>>>>>>>>  	local_irq_disable();
>>>>>>>>>>>>>>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  		ret = __split_unmapped_folio(folio, new_order,
>>>>>>>>>>>>>>>>  				split_at, lock_at, list, end, &xas, mapping,
>>>>>>>>>>>>>>>> -				uniform_split);
>>>>>>>>>>>>>>>> +				uniform_split, isolated);
>>>>>>>>>>>>>>>>  	} else {
>>>>>>>>>>>>>>>>  		spin_unlock(&ds_queue->split_queue_lock);
>>>>>>>>>>>>>>>>  fail:
>>>>>>>>>>>>>>>>  		if (mapping)
>>>>>>>>>>>>>>>>  			xas_unlock(&xas);
>>>>>>>>>>>>>>>>  		local_irq_enable();
>>>>>>>>>>>>>>>> -		remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>>>>>>>>>>> +		if (!isolated)
>>>>>>>>>>>>>>>> +			remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>>>>>>>>>>>  		ret = -EAGAIN;
>>>>>>>>>>>>>>>>  	}
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> These "isolated" special handlings does not look good, I wonder if there
>>>>>>>>>>>>>>> is a way of letting split code handle device private folios more gracefully.
>>>>>>>>>>>>>>> It also causes confusions, since why does "isolated/unmapped" folios
>>>>>>>>>>>>>>> not need to unmap_page(), remap_page(), or unlock?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> There are two reasons for going down the current code path
>>>>>>>>>>>>>
>>>>>>>>>>>>> After thinking more, I think adding isolated/unmapped is not the right
>>>>>>>>>>>>> way, since unmapped folio is a very generic concept. If you add it,
>>>>>>>>>>>>> one can easily misuse the folio split code by first unmapping a folio
>>>>>>>>>>>>> and trying to split it with unmapped = true. I do not think that is
>>>>>>>>>>>>> supported and your patch does not prevent that from happening in the future.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I don't understand the misuse case you mention, I assume you mean someone can
>>>>>>>>>>>> get the usage wrong? The responsibility is on the caller to do the right thing
>>>>>>>>>>>> if calling the API with unmapped
>>>>>>>>>>>
>>>>>>>>>>> Before your patch, there is no use case of splitting unmapped folios.
>>>>>>>>>>> Your patch only adds support for device private page split, not any unmapped
>>>>>>>>>>> folio split. So using a generic isolated/unmapped parameter is not OK.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> There is a use for splitting unmapped folios (see below)
>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> You should teach different parts of folio split code path to handle
>>>>>>>>>>>>> device private folios properly. Details are below.
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
>>>>>>>>>>>>>>    the split routine to return with -EBUSY
>>>>>>>>>>>>>
>>>>>>>>>>>>> You do something below instead.
>>>>>>>>>>>>>
>>>>>>>>>>>>> if (!anon_vma && !folio_is_device_private(folio)) {
>>>>>>>>>>>>> 	ret = -EBUSY;
>>>>>>>>>>>>> 	goto out;
>>>>>>>>>>>>> } else if (anon_vma) {
>>>>>>>>>>>>> 	anon_vma_lock_write(anon_vma);
>>>>>>>>>>>>> }
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> folio_get_anon() cannot be called for unmapped folios. In our case the page has
>>>>>>>>>>>> already been unmapped. Is there a reason why you mix anon_vma_lock_write with
>>>>>>>>>>>> the check for device private folios?
>>>>>>>>>>>
>>>>>>>>>>> Oh, I did not notice that anon_vma = folio_get_anon_vma(folio) is also
>>>>>>>>>>> in if (!isolated) branch. In that case, just do
>>>>>>>>>>>
>>>>>>>>>>> if (folio_is_device_private(folio) {
>>>>>>>>>>> ...
>>>>>>>>>>> } else if (is_anon) {
>>>>>>>>>>> ...
>>>>>>>>>>> } else {
>>>>>>>>>>> ...
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> People can know device private folio split needs a special handling.
>>>>>>>>>>>>>
>>>>>>>>>>>>> BTW, why a device private folio can also be anonymous? Does it mean
>>>>>>>>>>>>> if a page cache folio is migrated to device private, kernel also
>>>>>>>>>>>>> sees it as both device private and file-backed?
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> FYI: device private folios only work with anonymous private pages, hence
>>>>>>>>>>>> the name device private.
>>>>>>>>>>>
>>>>>>>>>>> OK.
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
>>>>>>>>>>>>>>    the migrate_device API has already just done as a part of the migration. The
>>>>>>>>>>>>>>    entries under consideration are already migration entries in this case.
>>>>>>>>>>>>>>    This is wasteful and in some case unexpected.
>>>>>>>>>>>>>
>>>>>>>>>>>>> unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split
>>>>>>>>>>>>> PMD mapping, which you did in migrate_vma_split_pages(). You probably
>>>>>>>>>>>>> can teach either try_to_migrate() or try_to_unmap() to just split
>>>>>>>>>>>>> device private PMD mapping. Or if that is not preferred,
>>>>>>>>>>>>> you can simply call split_huge_pmd_address() when unmap_folio()
>>>>>>>>>>>>> sees a device private folio.
>>>>>>>>>>>>>
>>>>>>>>>>>>> For remap_page(), you can simply return for device private folios
>>>>>>>>>>>>> like it is currently doing for non anonymous folios.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Doing a full rmap walk does not make sense with unmap_folio() and
>>>>>>>>>>>> remap_folio(), because
>>>>>>>>>>>>
>>>>>>>>>>>> 1. We need to do a page table walk/rmap walk again
>>>>>>>>>>>> 2. We'll need special handling of migration <-> migration entries
>>>>>>>>>>>>    in the rmap handling (set/remove migration ptes)
>>>>>>>>>>>> 3. In this context, the code is already in the middle of migration,
>>>>>>>>>>>>    so trying to do that again does not make sense.
>>>>>>>>>>>
>>>>>>>>>>> Why doing split in the middle of migration? Existing split code
>>>>>>>>>>> assumes to-be-split folios are mapped.
>>>>>>>>>>>
>>>>>>>>>>> What prevents doing split before migration?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The code does do a split prior to migration if THP selection fails
>>>>>>>>>>
>>>>>>>>>> Please see https://lore.kernel.org/lkml/20250703233511.2028395-5-balbirs@nvidia.com/
>>>>>>>>>> and the fallback part which calls split_folio()
>>>>>>>>>>
>>>>>>>>>> But the case under consideration is special since the device needs to allocate
>>>>>>>>>> corresponding pfn's as well. The changelog mentions it:
>>>>>>>>>>
>>>>>>>>>> "The common case that arises is that after setup, during migrate
>>>>>>>>>> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
>>>>>>>>>> pages."
>>>>>>>>>>
>>>>>>>>>> I can expand on it, because migrate_vma() is a multi-phase operation
>>>>>>>>>>
>>>>>>>>>> 1. migrate_vma_setup()
>>>>>>>>>> 2. migrate_vma_pages()
>>>>>>>>>> 3. migrate_vma_finalize()
>>>>>>>>>>
>>>>>>>>>> It can so happen that when we get the destination pfn's allocated the destination
>>>>>>>>>> might not be able to allocate a large page, so we do the split in migrate_vma_pages().
>>>>>>>>>>
>>>>>>>>>> The pages have been unmapped and collected in migrate_vma_setup()
>>>>>>>>>>
>>>>>>>>>> The next patch in the series 9/12 (https://lore.kernel.org/lkml/20250703233511.2028395-10-balbirs@nvidia.com/)
>>>>>>>>>> tests the split and emulates a failure on the device side to allocate large pages
>>>>>>>>>> and tests it in 10/12 (https://lore.kernel.org/lkml/20250703233511.2028395-11-balbirs@nvidia.com/)
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Another use case I’ve seen is when a previously allocated high-order
>>>>>>>>> folio, now in the free memory pool, is reallocated as a lower-order
>>>>>>>>> page. For example, a 2MB fault allocates a folio, the memory is later
>>>>>>>>
>>>>>>>> That is different. If the high-order folio is free, it should be split
>>>>>>>> using split_page() from mm/page_alloc.c.
>>>>>>>>
>>>>>>>
>>>>>>> Ah, ok. Let me see if that works - it would be easier.
>>>>>>>
>>>>>
>>>>> This suggestion quickly blows up as PageCompound is true and page_count
>>>>> here is zero.
>>>>
>>>> OK, your folio has PageCompound set. Then you will need __split_unmapped_folio().
>>>>
>>>>>
>>>>>>>>> freed, and then a 4KB fault reuses a page from that previously allocated
>>>>>>>>> folio. This will be actually quite common in Xe / GPU SVM. In such
>>>>>>>>> cases, the folio in an unmapped state needs to be split. I’d suggest a
>>>>>>>>
>>>>>>>> This folio is unused, so ->flags, ->mapping, and etc. are not set,
>>>>>>>> __split_unmapped_folio() is not for it, unless you mean free folio
>>>>>>>> differently.
>>>>>>>>
>>>>>>>
>>>>>>> This is right, those fields should be clear.
>>>>>>>
>>>>>>> Thanks for the tip.
>>>>>>>
>>>>>> I was hoping to reuse __split_folio_to_order() at some point in the future
>>>>>> to split the backing pages in the driver, but it is not an immediate priority
>>>>>>
>>>>>
>>>>> I think we need something for the scenario I describe here. I was able to
>>>>> make __split_huge_page_to_list_to_order work with a couple of hacks but it
>>>>> is almost certainly not right as Zi pointed out.
>>>>>
>>>>> New to the MM stuff, but play around with this a bit and see if I can
>>>>> come up with something that will work here.
>>>>
>>>> Can you try to write a new split_page function with __split_unmapped_folio()?
>>>> Since based on your description, your folio is not mapped.
>>>>
>>>
>>> Yes, page->mapping is NULL in this case - that was part of the hacks to
>>> __split_huge_page_to_list_to_order (more specifically __folio_split) I had
>>> to make in order to get something working for this case.
>>>
>>> I can try out something based on __split_unmapped_folio and report back.
>>
>> mm-new tree has an updated __split_unmapped_folio() version; it moves
>> all of the unmap-irrelevant code out of __split_unmapped_folio(). You might find
>> it easier to reuse.
>>
>> See: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/tree/mm/huge_memory.c?h=mm-new#n3430
>>
>
> Will take a look. It is possible some of the issues we are hitting are
> due to working on drm-tip + pulling in core MM patches in this series on
> top of that branch and then missing some other patches in mm-new. I'll see
> if we can figure out a workflow to have the latest and greatest from
> both drm-tip and the MM branches.
>
> Will these changes be in 6.17?

Hopefully yes. mm patches usually go from mm-new to mm-unstable
to mm-stable to mainline. If not, we will figure it out. :)

>
>> I am about to update the code with v4 patches. I will cc you, so that
>> you can get the updated __split_unmapped_folio().
>>
>> Feel free to ask questions on folio split code.
>>
>
> Thanks.
>
> Matt
>
>> Best Regards,
>> Yan, Zi


Best Regards,
Yan, Zi
Re: [v1 resend 08/12] mm/thp: add split during migration support
Posted by Matthew Brost 2 months, 2 weeks ago
On Fri, Jul 18, 2025 at 11:06:09AM -0400, Zi Yan wrote:
> On 17 Jul 2025, at 23:33, Matthew Brost wrote:
> 
> > On Thu, Jul 17, 2025 at 09:25:02PM -0400, Zi Yan wrote:
> >> On 17 Jul 2025, at 20:41, Matthew Brost wrote:
> >>
> >>> On Thu, Jul 17, 2025 at 07:04:48PM -0400, Zi Yan wrote:
> >>>> On 17 Jul 2025, at 18:24, Matthew Brost wrote:
> >>>>
> >>>>> On Thu, Jul 17, 2025 at 07:53:40AM +1000, Balbir Singh wrote:
> >>>>>> On 7/17/25 02:24, Matthew Brost wrote:
> >>>>>>> On Wed, Jul 16, 2025 at 07:19:10AM -0400, Zi Yan wrote:
> >>>>>>>> On 16 Jul 2025, at 1:34, Matthew Brost wrote:
> >>>>>>>>
> >>>>>>>>> On Sun, Jul 06, 2025 at 11:47:10AM +1000, Balbir Singh wrote:
> >>>>>>>>>> On 7/6/25 11:34, Zi Yan wrote:
> >>>>>>>>>>> On 5 Jul 2025, at 21:15, Balbir Singh wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> On 7/5/25 11:55, Zi Yan wrote:
> >>>>>>>>>>>>> On 4 Jul 2025, at 20:58, Balbir Singh wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 7/4/25 21:24, Zi Yan wrote:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> s/pages/folio
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks, will make the changes
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Why name it isolated if the folio is unmapped? Isolated folios often mean
> >>>>>>>>>>>>>>> they are removed from LRU lists. isolated here causes confusion.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Ack, will change the name
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>   *
> >>>>>>>>>>>>>>>>   * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
> >>>>>>>>>>>>>>>>   * It is in charge of checking whether the split is supported or not and
> >>>>>>>>>>>>>>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
> >>>>>>>>>>>>>>>>   */
> >>>>>>>>>>>>>>>>  static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>>>>>>>>>>  		struct page *split_at, struct page *lock_at,
> >>>>>>>>>>>>>>>> -		struct list_head *list, bool uniform_split)
> >>>>>>>>>>>>>>>> +		struct list_head *list, bool uniform_split, bool isolated)
> >>>>>>>>>>>>>>>>  {
> >>>>>>>>>>>>>>>>  	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
> >>>>>>>>>>>>>>>>  	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
> >>>>>>>>>>>>>>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>>>>>>>>>>  		 * is taken to serialise against parallel split or collapse
> >>>>>>>>>>>>>>>>  		 * operations.
> >>>>>>>>>>>>>>>>  		 */
> >>>>>>>>>>>>>>>> -		anon_vma = folio_get_anon_vma(folio);
> >>>>>>>>>>>>>>>> -		if (!anon_vma) {
> >>>>>>>>>>>>>>>> -			ret = -EBUSY;
> >>>>>>>>>>>>>>>> -			goto out;
> >>>>>>>>>>>>>>>> +		if (!isolated) {
> >>>>>>>>>>>>>>>> +			anon_vma = folio_get_anon_vma(folio);
> >>>>>>>>>>>>>>>> +			if (!anon_vma) {
> >>>>>>>>>>>>>>>> +				ret = -EBUSY;
> >>>>>>>>>>>>>>>> +				goto out;
> >>>>>>>>>>>>>>>> +			}
> >>>>>>>>>>>>>>>> +			anon_vma_lock_write(anon_vma);
> >>>>>>>>>>>>>>>>  		}
> >>>>>>>>>>>>>>>>  		end = -1;
> >>>>>>>>>>>>>>>>  		mapping = NULL;
> >>>>>>>>>>>>>>>> -		anon_vma_lock_write(anon_vma);
> >>>>>>>>>>>>>>>>  	} else {
> >>>>>>>>>>>>>>>>  		unsigned int min_order;
> >>>>>>>>>>>>>>>>  		gfp_t gfp;
> >>>>>>>>>>>>>>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>>>>>>>>>>  		goto out_unlock;
> >>>>>>>>>>>>>>>>  	}
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> -	unmap_folio(folio);
> >>>>>>>>>>>>>>>> +	if (!isolated)
> >>>>>>>>>>>>>>>> +		unmap_folio(folio);
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>  	/* block interrupt reentry in xa_lock and spinlock */
> >>>>>>>>>>>>>>>>  	local_irq_disable();
> >>>>>>>>>>>>>>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>  		ret = __split_unmapped_folio(folio, new_order,
> >>>>>>>>>>>>>>>>  				split_at, lock_at, list, end, &xas, mapping,
> >>>>>>>>>>>>>>>> -				uniform_split);
> >>>>>>>>>>>>>>>> +				uniform_split, isolated);
> >>>>>>>>>>>>>>>>  	} else {
> >>>>>>>>>>>>>>>>  		spin_unlock(&ds_queue->split_queue_lock);
> >>>>>>>>>>>>>>>>  fail:
> >>>>>>>>>>>>>>>>  		if (mapping)
> >>>>>>>>>>>>>>>>  			xas_unlock(&xas);
> >>>>>>>>>>>>>>>>  		local_irq_enable();
> >>>>>>>>>>>>>>>> -		remap_page(folio, folio_nr_pages(folio), 0);
> >>>>>>>>>>>>>>>> +		if (!isolated)
> >>>>>>>>>>>>>>>> +			remap_page(folio, folio_nr_pages(folio), 0);
> >>>>>>>>>>>>>>>>  		ret = -EAGAIN;
> >>>>>>>>>>>>>>>>  	}
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> These "isolated" special handlings does not look good, I wonder if there
> >>>>>>>>>>>>>>> is a way of letting split code handle device private folios more gracefully.
> >>>>>>>>>>>>>>> It also causes confusions, since why does "isolated/unmapped" folios
> >>>>>>>>>>>>>>> not need to unmap_page(), remap_page(), or unlock?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> There are two reasons for going down the current code path
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> After thinking more, I think adding isolated/unmapped is not the right
> >>>>>>>>>>>>> way, since unmapped folio is a very generic concept. If you add it,
> >>>>>>>>>>>>> one can easily misuse the folio split code by first unmapping a folio
> >>>>>>>>>>>>> and trying to split it with unmapped = true. I do not think that is
> >>>>>>>>>>>>> supported and your patch does not prevent that from happening in the future.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> I don't understand the misuse case you mention, I assume you mean someone can
> >>>>>>>>>>>> get the usage wrong? The responsibility is on the caller to do the right thing
> >>>>>>>>>>>> if calling the API with unmapped
> >>>>>>>>>>>
> >>>>>>>>>>> Before your patch, there is no use case of splitting unmapped folios.
> >>>>>>>>>>> Your patch only adds support for device private page split, not any unmapped
> >>>>>>>>>>> folio split. So using a generic isolated/unmapped parameter is not OK.
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> There is a use for splitting unmapped folios (see below)
> >>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> You should teach different parts of folio split code path to handle
> >>>>>>>>>>>>> device private folios properly. Details are below.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
> >>>>>>>>>>>>>>    the split routine to return with -EBUSY
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> You do something below instead.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> if (!anon_vma && !folio_is_device_private(folio)) {
> >>>>>>>>>>>>> 	ret = -EBUSY;
> >>>>>>>>>>>>> 	goto out;
> >>>>>>>>>>>>> } else if (anon_vma) {
> >>>>>>>>>>>>> 	anon_vma_lock_write(anon_vma);
> >>>>>>>>>>>>> }
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> folio_get_anon() cannot be called for unmapped folios. In our case the page has
> >>>>>>>>>>>> already been unmapped. Is there a reason why you mix anon_vma_lock_write with
> >>>>>>>>>>>> the check for device private folios?
> >>>>>>>>>>>
> >>>>>>>>>>> Oh, I did not notice that anon_vma = folio_get_anon_vma(folio) is also
> >>>>>>>>>>> in if (!isolated) branch. In that case, just do
> >>>>>>>>>>>
> >>>>>>>>>>> if (folio_is_device_private(folio) {
> >>>>>>>>>>> ...
> >>>>>>>>>>> } else if (is_anon) {
> >>>>>>>>>>> ...
> >>>>>>>>>>> } else {
> >>>>>>>>>>> ...
> >>>>>>>>>>> }
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> People can know device private folio split needs a special handling.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> BTW, why a device private folio can also be anonymous? Does it mean
> >>>>>>>>>>>>> if a page cache folio is migrated to device private, kernel also
> >>>>>>>>>>>>> sees it as both device private and file-backed?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> FYI: device private folios only work with anonymous private pages, hence
> >>>>>>>>>>>> the name device private.
> >>>>>>>>>>>
> >>>>>>>>>>> OK.
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
> >>>>>>>>>>>>>>    the migrate_device API has already just done as a part of the migration. The
> >>>>>>>>>>>>>>    entries under consideration are already migration entries in this case.
> >>>>>>>>>>>>>>    This is wasteful and in some case unexpected.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split
> >>>>>>>>>>>>> PMD mapping, which you did in migrate_vma_split_pages(). You probably
> >>>>>>>>>>>>> can teach either try_to_migrate() or try_to_unmap() to just split
> >>>>>>>>>>>>> device private PMD mapping. Or if that is not preferred,
> >>>>>>>>>>>>> you can simply call split_huge_pmd_address() when unmap_folio()
> >>>>>>>>>>>>> sees a device private folio.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> For remap_page(), you can simply return for device private folios
> >>>>>>>>>>>>> like it is currently doing for non anonymous folios.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Doing a full rmap walk does not make sense with unmap_folio() and
> >>>>>>>>>>>> remap_folio(), because
> >>>>>>>>>>>>
> >>>>>>>>>>>> 1. We need to do a page table walk/rmap walk again
> >>>>>>>>>>>> 2. We'll need special handling of migration <-> migration entries
> >>>>>>>>>>>>    in the rmap handling (set/remove migration ptes)
> >>>>>>>>>>>> 3. In this context, the code is already in the middle of migration,
> >>>>>>>>>>>>    so trying to do that again does not make sense.
> >>>>>>>>>>>
> >>>>>>>>>>> Why doing split in the middle of migration? Existing split code
> >>>>>>>>>>> assumes to-be-split folios are mapped.
> >>>>>>>>>>>
> >>>>>>>>>>> What prevents doing split before migration?
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> The code does do a split prior to migration if THP selection fails
> >>>>>>>>>>
> >>>>>>>>>> Please see https://lore.kernel.org/lkml/20250703233511.2028395-5-balbirs@nvidia.com/
> >>>>>>>>>> and the fallback part which calls split_folio()
> >>>>>>>>>>
> >>>>>>>>>> But the case under consideration is special since the device needs to allocate
> >>>>>>>>>> corresponding pfn's as well. The changelog mentions it:
> >>>>>>>>>>
> >>>>>>>>>> "The common case that arises is that after setup, during migrate
> >>>>>>>>>> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
> >>>>>>>>>> pages."
> >>>>>>>>>>
> >>>>>>>>>> I can expand on it, because migrate_vma() is a multi-phase operation
> >>>>>>>>>>
> >>>>>>>>>> 1. migrate_vma_setup()
> >>>>>>>>>> 2. migrate_vma_pages()
> >>>>>>>>>> 3. migrate_vma_finalize()
> >>>>>>>>>>
> >>>>>>>>>> It can so happen that when we get the destination pfn's allocated the destination
> >>>>>>>>>> might not be able to allocate a large page, so we do the split in migrate_vma_pages().
> >>>>>>>>>>
> >>>>>>>>>> The pages have been unmapped and collected in migrate_vma_setup()
> >>>>>>>>>>
> >>>>>>>>>> The next patch in the series 9/12 (https://lore.kernel.org/lkml/20250703233511.2028395-10-balbirs@nvidia.com/)
> >>>>>>>>>> tests the split and emulates a failure on the device side to allocate large pages
> >>>>>>>>>> and tests it in 10/12 (https://lore.kernel.org/lkml/20250703233511.2028395-11-balbirs@nvidia.com/)
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Another use case I’ve seen is when a previously allocated high-order
> >>>>>>>>> folio, now in the free memory pool, is reallocated as a lower-order
> >>>>>>>>> page. For example, a 2MB fault allocates a folio, the memory is later
> >>>>>>>>
> >>>>>>>> That is different. If the high-order folio is free, it should be split
> >>>>>>>> using split_page() from mm/page_alloc.c.
> >>>>>>>>
> >>>>>>>
> >>>>>>> Ah, ok. Let me see if that works - it would be easier.
> >>>>>>>
> >>>>>
> >>>>> This suggestion quickly blows up as PageCompound is true and page_count
> >>>>> here is zero.
> >>>>
> >>>> OK, your folio has PageCompound set. Then you will need __split_unmapped_folio().
> >>>>
> >>>>>
> >>>>>>>>> freed, and then a 4KB fault reuses a page from that previously allocated
> >>>>>>>>> folio. This will be actually quite common in Xe / GPU SVM. In such
> >>>>>>>>> cases, the folio in an unmapped state needs to be split. I’d suggest a
> >>>>>>>>
> >>>>>>>> This folio is unused, so ->flags, ->mapping, etc. are not set;
> >>>>>>>> __split_unmapped_folio() is not for it, unless you mean free folio
> >>>>>>>> differently.
> >>>>>>>>
> >>>>>>>
> >>>>>>> This is right, those fields should be clear.
> >>>>>>>
> >>>>>>> Thanks for the tip.
> >>>>>>>
> >>>>>> I was hoping to reuse __split_folio_to_order() at some point in the future
> >>>>>> to split the backing pages in the driver, but it is not an immediate priority
> >>>>>>
> >>>>>
> >>>>> I think we need something for the scenario I describe here. I was able to
> >>>>> make __split_huge_page_to_list_to_order() work with a couple of hacks, but
> >>>>> it is almost certainly not right, as Zi pointed out.
> >>>>>
> >>>>> I'm new to the MM stuff, but I'll play around with this a bit and see if I
> >>>>> can come up with something that will work here.
> >>>>
> >>>> Can you try to write a new split_page function with __split_unmapped_folio()?
> >>>> Since based on your description, your folio is not mapped.
> >>>>
> >>>
> >>> Yes, page->mapping is NULL in this case - that was part of the hacks to
> >>> __split_huge_page_to_list_to_order (more specifically __folio_split) I had
> >>> to make in order to get something working for this case.
> >>>
> >>> I can try out something based on __split_unmapped_folio and report back.
> >>
> >> The mm-new tree has an updated __split_unmapped_folio() version; it moves
> >> all unmap-irrelevant code out of __split_unmapped_folio(). You might find
> >> it easier to reuse.
> >>
> >> See: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git/tree/mm/huge_memory.c?h=mm-new#n3430
> >>

I pulled in the new version and it works for this case.

Matt

> >
> > Will take a look. It is possible some of the issues we are hitting are
> > due to working on drm-tip + pulling in core MM patches in this series on
> > top of that branch, then missing some other patches in mm-new. I'll see
> > if we can figure out a workflow to have the latest and greatest from
> > both drm-tip and the MM branches.
> >
> > Will these changes be in 6.17?
> 
> Hopefully yes. mm patches usually go from mm-new to mm-unstable
> to mm-stable to mainline. If not, we will figure it out. :)
> 
> >
> >> I am about to update the code with v4 patches. I will cc you, so that
> >> you can get the updated __split_unmapped_folio().
> >>
> >> Feel free to ask questions on folio split code.
> >>
> >
> > Thanks.
> >
> > Matt
> >
> >> Best Regards,
> >> Yan, Zi
> 
> 
> Best Regards,
> Yan, Zi
Re: [v1 resend 08/12] mm/thp: add split during migration support
Posted by Zi Yan 3 months ago
On 5 Jul 2025, at 21:47, Balbir Singh wrote:

> On 7/6/25 11:34, Zi Yan wrote:
>> On 5 Jul 2025, at 21:15, Balbir Singh wrote:
>>
>>> On 7/5/25 11:55, Zi Yan wrote:
>>>> On 4 Jul 2025, at 20:58, Balbir Singh wrote:
>>>>
>>>>> On 7/4/25 21:24, Zi Yan wrote:
>>>>>>
>>>>>> s/pages/folio
>>>>>>
>>>>>
>>>>> Thanks, will make the changes
>>>>>
>>>>>> Why name it isolated if the folio is unmapped? Isolated folios often mean
>>>>>> they are removed from LRU lists. isolated here causes confusion.
>>>>>>
>>>>>
>>>>> Ack, will change the name
>>>>>
>>>>>
>>>>>>>   *
>>>>>>>   * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>>>>>>   * It is in charge of checking whether the split is supported or not and
>>>>>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>>>>>   */
>>>>>>>  static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>  		struct page *split_at, struct page *lock_at,
>>>>>>> -		struct list_head *list, bool uniform_split)
>>>>>>> +		struct list_head *list, bool uniform_split, bool isolated)
>>>>>>>  {
>>>>>>>  	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>>>>>  	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>>>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>  		 * is taken to serialise against parallel split or collapse
>>>>>>>  		 * operations.
>>>>>>>  		 */
>>>>>>> -		anon_vma = folio_get_anon_vma(folio);
>>>>>>> -		if (!anon_vma) {
>>>>>>> -			ret = -EBUSY;
>>>>>>> -			goto out;
>>>>>>> +		if (!isolated) {
>>>>>>> +			anon_vma = folio_get_anon_vma(folio);
>>>>>>> +			if (!anon_vma) {
>>>>>>> +				ret = -EBUSY;
>>>>>>> +				goto out;
>>>>>>> +			}
>>>>>>> +			anon_vma_lock_write(anon_vma);
>>>>>>>  		}
>>>>>>>  		end = -1;
>>>>>>>  		mapping = NULL;
>>>>>>> -		anon_vma_lock_write(anon_vma);
>>>>>>>  	} else {
>>>>>>>  		unsigned int min_order;
>>>>>>>  		gfp_t gfp;
>>>>>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>  		goto out_unlock;
>>>>>>>  	}
>>>>>>>
>>>>>>> -	unmap_folio(folio);
>>>>>>> +	if (!isolated)
>>>>>>> +		unmap_folio(folio);
>>>>>>>
>>>>>>>  	/* block interrupt reentry in xa_lock and spinlock */
>>>>>>>  	local_irq_disable();
>>>>>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>
>>>>>>>  		ret = __split_unmapped_folio(folio, new_order,
>>>>>>>  				split_at, lock_at, list, end, &xas, mapping,
>>>>>>> -				uniform_split);
>>>>>>> +				uniform_split, isolated);
>>>>>>>  	} else {
>>>>>>>  		spin_unlock(&ds_queue->split_queue_lock);
>>>>>>>  fail:
>>>>>>>  		if (mapping)
>>>>>>>  			xas_unlock(&xas);
>>>>>>>  		local_irq_enable();
>>>>>>> -		remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>> +		if (!isolated)
>>>>>>> +			remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>>  		ret = -EAGAIN;
>>>>>>>  	}
>>>>>>
>>>>>> These "isolated" special handlings does not look good, I wonder if there
>>>>>> is a way of letting split code handle device private folios more gracefully.
>>>>>> It also causes confusions, since why does "isolated/unmapped" folios
>>>>>> not need to unmap_page(), remap_page(), or unlock?
>>>>>>
>>>>>>
>>>>>
>>>>> There are two reasons for going down the current code path
>>>>
>>>> After thinking more, I think adding isolated/unmapped is not the right
>>>> way, since unmapped folio is a very generic concept. If you add it,
>>>> one can easily misuse the folio split code by first unmapping a folio
>>>> and trying to split it with unmapped = true. I do not think that is
>>>> supported and your patch does not prevent that from happening in the future.
>>>>
>>>
>>> I don't understand the misuse case you mention, I assume you mean someone can
>>> get the usage wrong? The responsibility is on the caller to do the right thing
>>> if calling the API with unmapped
>>
>> Before your patch, there is no use case of splitting unmapped folios.
>> Your patch only adds support for device private page split, not any unmapped
>> folio split. So using a generic isolated/unmapped parameter is not OK.
>>
>
> There is a use for splitting unmapped folios (see below)
>
>>>
>>>> You should teach different parts of folio split code path to handle
>>>> device private folios properly. Details are below.
>>>>
>>>>>
>>>>> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
>>>>>    the split routine to return with -EBUSY
>>>>
>>>> You do something below instead.
>>>>
>>>> if (!anon_vma && !folio_is_device_private(folio)) {
>>>> 	ret = -EBUSY;
>>>> 	goto out;
>>>> } else if (anon_vma) {
>>>> 	anon_vma_lock_write(anon_vma);
>>>> }
>>>>
>>>
>>> folio_get_anon() cannot be called for unmapped folios. In our case the page has
>>> already been unmapped. Is there a reason why you mix anon_vma_lock_write with
>>> the check for device private folios?
>>
>> Oh, I did not notice that anon_vma = folio_get_anon_vma(folio) is also
>> in if (!isolated) branch. In that case, just do
>>
>> if (folio_is_device_private(folio) {
>> ...
>> } else if (is_anon) {
>> ...
>> } else {
>> ...
>> }
>>
>>>
>>>> People can know device private folio split needs a special handling.
>>>>
>>>> BTW, why a device private folio can also be anonymous? Does it mean
>>>> if a page cache folio is migrated to device private, kernel also
>>>> sees it as both device private and file-backed?
>>>>
>>>
>>> FYI: device private folios only work with anonymous private pages, hence
>>> the name device private.
>>
>> OK.
>>
>>>
>>>>
>>>>> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
>>>>>    the migrate_device API has already just done as a part of the migration. The
>>>>>    entries under consideration are already migration entries in this case.
>>>>>    This is wasteful and in some case unexpected.
>>>>
>>>> unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split
>>>> PMD mapping, which you did in migrate_vma_split_pages(). You probably
>>>> can teach either try_to_migrate() or try_to_unmap() to just split
>>>> device private PMD mapping. Or if that is not preferred,
>>>> you can simply call split_huge_pmd_address() when unmap_folio()
>>>> sees a device private folio.
>>>>
>>>> For remap_page(), you can simply return for device private folios
>>>> like it is currently doing for non anonymous folios.
>>>>
>>>
>>> Doing a full rmap walk does not make sense with unmap_folio() and
>>> remap_folio(), because
>>>
>>> 1. We need to do a page table walk/rmap walk again
>>> 2. We'll need special handling of migration <-> migration entries
>>>    in the rmap handling (set/remove migration ptes)
>>> 3. In this context, the code is already in the middle of migration,
>>>    so trying to do that again does not make sense.
>>
>> Why doing split in the middle of migration? Existing split code
>> assumes to-be-split folios are mapped.
>>
>> What prevents doing split before migration?
>>
>
> The code does do a split prior to migration if THP selection fails
>
> Please see https://lore.kernel.org/lkml/20250703233511.2028395-5-balbirs@nvidia.com/
> and the fallback part which calls split_folio()

So this split is done when the folio in system memory is mapped.

>
> But the case under consideration is special since the device needs to allocate
> corresponding pfn's as well. The changelog mentions it:
>
> "The common case that arises is that after setup, during migrate
> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
> pages."
>
> I can expand on it, because migrate_vma() is a multi-phase operation
>
> 1. migrate_vma_setup()
> 2. migrate_vma_pages()
> 3. migrate_vma_finalize()
>
> It can so happen that when we get the destination pfn's allocated the destination
> might not be able to allocate a large page, so we do the split in migrate_vma_pages().
>
> The pages have been unmapped and collected in migrate_vma_setup()

So these unmapped folios are system memory folios? I thought they are
large device private folios.

OK. It sounds like splitting unmapped folios is really needed. I think
it is better to make a new split_unmapped_folio() function
by reusing __split_unmapped_folio(), since __folio_split() assumes
the input folio is mapped.
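
To make the flow concrete, here is a minimal sketch of where such an
unmapped split sits in the migrate_vma() sequence described above. The
driver-side function name is made up purely for illustration; only
migrate_vma_setup(), migrate_vma_pages() and migrate_vma_finalize() are
real APIs here (declared in <linux/migrate.h>):

static int example_migrate_range(struct migrate_vma *args)
{
	int ret;

	/* Phase 1: collect the source pages and unmap them */
	ret = migrate_vma_setup(args);
	if (ret)
		return ret;

	/*
	 * The driver now allocates destination pages and fills args->dst[].
	 * If the destination cannot provide a MIGRATE_PFN_COMPOUND page for
	 * a THP source, the source folio is already unmapped at this point,
	 * which is why the series under discussion splits it inside
	 * migrate_vma_pages() instead of going through unmap_folio() and
	 * remap_page() again.
	 */

	/* Phase 2: migrate struct page state to the destination pages */
	migrate_vma_pages(args);

	/* Phase 3: replace the migration entries with the new pages */
	migrate_vma_finalize(args);

	return 0;
}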

>
> The next patch in the series 9/12 (https://lore.kernel.org/lkml/20250703233511.2028395-10-balbirs@nvidia.com/)
> tests the split and emulates a failure on the device side to allocate large pages
> and tests it in 10/12 (https://lore.kernel.org/lkml/20250703233511.2028395-11-balbirs@nvidia.com/)
>
>
>>>
>>>
>>>>
>>>> For lru_add_split_folio(), you can skip it if a device private
>>>> folio is seen.
>>>>
>>>> Last, for unlock part, why do you need to keep all after-split folios
>>>> locked? It should be possible to just keep the to-be-migrated folio
>>>> locked and unlock the rest for a later retry. But I could miss something
>>>> since I am not familiar with device private migration code.
>>>>
>>>
>>> Not sure I follow this comment
>>
>> Because the patch is doing split in the middle of migration and existing
>> split code never supports. My comment is based on the assumption that
>> the split is done when a folio is mapped.
>>
>
> Understood, hopefully I've explained the reason for the split in the middle
> of migration


--
Best Regards,
Yan, Zi
Re: [v1 resend 08/12] mm/thp: add split during migration support
Posted by Zi Yan 3 months ago
On 5 Jul 2025, at 22:34, Zi Yan wrote:

> On 5 Jul 2025, at 21:47, Balbir Singh wrote:
>
>> On 7/6/25 11:34, Zi Yan wrote:
>>> On 5 Jul 2025, at 21:15, Balbir Singh wrote:
>>>
>>>> On 7/5/25 11:55, Zi Yan wrote:
>>>>> On 4 Jul 2025, at 20:58, Balbir Singh wrote:
>>>>>
>>>>>> On 7/4/25 21:24, Zi Yan wrote:
>>>>>>>
>>>>>>> s/pages/folio
>>>>>>>
>>>>>>
>>>>>> Thanks, will make the changes
>>>>>>
>>>>>>> Why name it isolated if the folio is unmapped? Isolated folios often mean
>>>>>>> they are removed from LRU lists. isolated here causes confusion.
>>>>>>>
>>>>>>
>>>>>> Ack, will change the name
>>>>>>
>>>>>>
>>>>>>>>   *
>>>>>>>>   * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>>>>>>>   * It is in charge of checking whether the split is supported or not and
>>>>>>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>>>>>>   */
>>>>>>>>  static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>  		struct page *split_at, struct page *lock_at,
>>>>>>>> -		struct list_head *list, bool uniform_split)
>>>>>>>> +		struct list_head *list, bool uniform_split, bool isolated)
>>>>>>>>  {
>>>>>>>>  	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>>>>>>  	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>>>>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>  		 * is taken to serialise against parallel split or collapse
>>>>>>>>  		 * operations.
>>>>>>>>  		 */
>>>>>>>> -		anon_vma = folio_get_anon_vma(folio);
>>>>>>>> -		if (!anon_vma) {
>>>>>>>> -			ret = -EBUSY;
>>>>>>>> -			goto out;
>>>>>>>> +		if (!isolated) {
>>>>>>>> +			anon_vma = folio_get_anon_vma(folio);
>>>>>>>> +			if (!anon_vma) {
>>>>>>>> +				ret = -EBUSY;
>>>>>>>> +				goto out;
>>>>>>>> +			}
>>>>>>>> +			anon_vma_lock_write(anon_vma);
>>>>>>>>  		}
>>>>>>>>  		end = -1;
>>>>>>>>  		mapping = NULL;
>>>>>>>> -		anon_vma_lock_write(anon_vma);
>>>>>>>>  	} else {
>>>>>>>>  		unsigned int min_order;
>>>>>>>>  		gfp_t gfp;
>>>>>>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>  		goto out_unlock;
>>>>>>>>  	}
>>>>>>>>
>>>>>>>> -	unmap_folio(folio);
>>>>>>>> +	if (!isolated)
>>>>>>>> +		unmap_folio(folio);
>>>>>>>>
>>>>>>>>  	/* block interrupt reentry in xa_lock and spinlock */
>>>>>>>>  	local_irq_disable();
>>>>>>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>
>>>>>>>>  		ret = __split_unmapped_folio(folio, new_order,
>>>>>>>>  				split_at, lock_at, list, end, &xas, mapping,
>>>>>>>> -				uniform_split);
>>>>>>>> +				uniform_split, isolated);
>>>>>>>>  	} else {
>>>>>>>>  		spin_unlock(&ds_queue->split_queue_lock);
>>>>>>>>  fail:
>>>>>>>>  		if (mapping)
>>>>>>>>  			xas_unlock(&xas);
>>>>>>>>  		local_irq_enable();
>>>>>>>> -		remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>>> +		if (!isolated)
>>>>>>>> +			remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>>>  		ret = -EAGAIN;
>>>>>>>>  	}
>>>>>>>
>>>>>>> These "isolated" special handlings does not look good, I wonder if there
>>>>>>> is a way of letting split code handle device private folios more gracefully.
>>>>>>> It also causes confusions, since why does "isolated/unmapped" folios
>>>>>>> not need to unmap_page(), remap_page(), or unlock?
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> There are two reasons for going down the current code path
>>>>>
>>>>> After thinking more, I think adding isolated/unmapped is not the right
>>>>> way, since unmapped folio is a very generic concept. If you add it,
>>>>> one can easily misuse the folio split code by first unmapping a folio
>>>>> and trying to split it with unmapped = true. I do not think that is
>>>>> supported and your patch does not prevent that from happening in the future.
>>>>>
>>>>
>>>> I don't understand the misuse case you mention, I assume you mean someone can
>>>> get the usage wrong? The responsibility is on the caller to do the right thing
>>>> if calling the API with unmapped
>>>
>>> Before your patch, there is no use case of splitting unmapped folios.
>>> Your patch only adds support for device private page split, not any unmapped
>>> folio split. So using a generic isolated/unmapped parameter is not OK.
>>>
>>
>> There is a use for splitting unmapped folios (see below)
>>
>>>>
>>>>> You should teach different parts of folio split code path to handle
>>>>> device private folios properly. Details are below.
>>>>>
>>>>>>
>>>>>> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
>>>>>>    the split routine to return with -EBUSY
>>>>>
>>>>> You do something below instead.
>>>>>
>>>>> if (!anon_vma && !folio_is_device_private(folio)) {
>>>>> 	ret = -EBUSY;
>>>>> 	goto out;
>>>>> } else if (anon_vma) {
>>>>> 	anon_vma_lock_write(anon_vma);
>>>>> }
>>>>>
>>>>
>>>> folio_get_anon() cannot be called for unmapped folios. In our case the page has
>>>> already been unmapped. Is there a reason why you mix anon_vma_lock_write with
>>>> the check for device private folios?
>>>
>>> Oh, I did not notice that anon_vma = folio_get_anon_vma(folio) is also
>>> in if (!isolated) branch. In that case, just do
>>>
>>> if (folio_is_device_private(folio) {
>>> ...
>>> } else if (is_anon) {
>>> ...
>>> } else {
>>> ...
>>> }
>>>
>>>>
>>>>> People can know device private folio split needs a special handling.
>>>>>
>>>>> BTW, why a device private folio can also be anonymous? Does it mean
>>>>> if a page cache folio is migrated to device private, kernel also
>>>>> sees it as both device private and file-backed?
>>>>>
>>>>
>>>> FYI: device private folios only work with anonymous private pages, hence
>>>> the name device private.
>>>
>>> OK.
>>>
>>>>
>>>>>
>>>>>> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
>>>>>>    the migrate_device API has already just done as a part of the migration. The
>>>>>>    entries under consideration are already migration entries in this case.
>>>>>>    This is wasteful and in some case unexpected.
>>>>>
>>>>> unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split
>>>>> PMD mapping, which you did in migrate_vma_split_pages(). You probably
>>>>> can teach either try_to_migrate() or try_to_unmap() to just split
>>>>> device private PMD mapping. Or if that is not preferred,
>>>>> you can simply call split_huge_pmd_address() when unmap_folio()
>>>>> sees a device private folio.
>>>>>
>>>>> For remap_page(), you can simply return for device private folios
>>>>> like it is currently doing for non anonymous folios.
>>>>>
>>>>
>>>> Doing a full rmap walk does not make sense with unmap_folio() and
>>>> remap_folio(), because
>>>>
>>>> 1. We need to do a page table walk/rmap walk again
>>>> 2. We'll need special handling of migration <-> migration entries
>>>>    in the rmap handling (set/remove migration ptes)
>>>> 3. In this context, the code is already in the middle of migration,
>>>>    so trying to do that again does not make sense.
>>>
>>> Why doing split in the middle of migration? Existing split code
>>> assumes to-be-split folios are mapped.
>>>
>>> What prevents doing split before migration?
>>>
>>
>> The code does do a split prior to migration if THP selection fails
>>
>> Please see https://lore.kernel.org/lkml/20250703233511.2028395-5-balbirs@nvidia.com/
>> and the fallback part which calls split_folio()
>
> So this split is done when the folio in system memory is mapped.
>
>>
>> But the case under consideration is special since the device needs to allocate
>> corresponding pfn's as well. The changelog mentions it:
>>
>> "The common case that arises is that after setup, during migrate
>> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
>> pages."
>>
>> I can expand on it, because migrate_vma() is a multi-phase operation
>>
>> 1. migrate_vma_setup()
>> 2. migrate_vma_pages()
>> 3. migrate_vma_finalize()
>>
>> It can so happen that when we get the destination pfn's allocated the destination
>> might not be able to allocate a large page, so we do the split in migrate_vma_pages().
>>
>> The pages have been unmapped and collected in migrate_vma_setup()
>
> So these unmapped folios are system memory folios? I thought they are
> large device private folios.
>
> OK. It sounds like splitting unmapped folios is really needed. I think
> it is better to make a new split_unmapped_folio() function
> by reusing __split_unmapped_folio(), since __folio_split() assumes
> the input folio is mapped.

And to make __split_unmapped_folio()'s functionality match its name,
I will later refactor it. At least move local_irq_enable(), remap_page(),
and folio_unlocks out of it. I will think about how to deal with
lru_add_split_folio(). The goal is to remove the to-be-added "unmapped"
parameter from __split_unmapped_folio().
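
In outline, the epilogue work moving out of __split_unmapped_folio() is
roughly the following. This is a sketch only: the helper name is invented
here, and the actual patch posted later in this thread open-codes these
steps in __folio_split() instead.

static void split_folio_epilogue(struct folio *folio, int order,
		struct page *lock_at, struct list_head *list)
{
	/*
	 * 1. Unfreeze the after-split folios, add them to the LRU or to
	 *    @list via lru_add_split_folio(), update page cache/swap cache
	 *    entries, then unfreeze @folio itself last.
	 */

	/* 2. Re-enable interrupts disabled earlier in the split path. */
	local_irq_enable();

	/*
	 * 3. Restore the page table entries torn down by unmap_folio()
	 *    (RMP_USE_SHARED_ZEROPAGE may be passed for anonymous folios
	 *    on success, as in the patch below).
	 */
	remap_page(folio, 1 << order, 0);

	/*
	 * 4. Unlock every after-split folio except the one containing
	 *    @lock_at, which is left locked for the caller.
	 */
}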

--
Best Regards,
Yan, Zi
Re: [v1 resend 08/12] mm/thp: add split during migration support
Posted by Balbir Singh 3 months ago
On 7/6/25 13:03, Zi Yan wrote:
> On 5 Jul 2025, at 22:34, Zi Yan wrote:
> 
>> On 5 Jul 2025, at 21:47, Balbir Singh wrote:
>>
>>> On 7/6/25 11:34, Zi Yan wrote:
>>>> On 5 Jul 2025, at 21:15, Balbir Singh wrote:
>>>>
>>>>> On 7/5/25 11:55, Zi Yan wrote:
>>>>>> On 4 Jul 2025, at 20:58, Balbir Singh wrote:
>>>>>>
>>>>>>> On 7/4/25 21:24, Zi Yan wrote:
>>>>>>>>
>>>>>>>> s/pages/folio
>>>>>>>>
>>>>>>>
>>>>>>> Thanks, will make the changes
>>>>>>>
>>>>>>>> Why name it isolated if the folio is unmapped? Isolated folios often mean
>>>>>>>> they are removed from LRU lists. isolated here causes confusion.
>>>>>>>>
>>>>>>>
>>>>>>> Ack, will change the name
>>>>>>>
>>>>>>>
>>>>>>>>>   *
>>>>>>>>>   * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>>>>>>>>   * It is in charge of checking whether the split is supported or not and
>>>>>>>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>>>>>>>   */
>>>>>>>>>  static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>  		struct page *split_at, struct page *lock_at,
>>>>>>>>> -		struct list_head *list, bool uniform_split)
>>>>>>>>> +		struct list_head *list, bool uniform_split, bool isolated)
>>>>>>>>>  {
>>>>>>>>>  	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>>>>>>>  	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>>>>>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>  		 * is taken to serialise against parallel split or collapse
>>>>>>>>>  		 * operations.
>>>>>>>>>  		 */
>>>>>>>>> -		anon_vma = folio_get_anon_vma(folio);
>>>>>>>>> -		if (!anon_vma) {
>>>>>>>>> -			ret = -EBUSY;
>>>>>>>>> -			goto out;
>>>>>>>>> +		if (!isolated) {
>>>>>>>>> +			anon_vma = folio_get_anon_vma(folio);
>>>>>>>>> +			if (!anon_vma) {
>>>>>>>>> +				ret = -EBUSY;
>>>>>>>>> +				goto out;
>>>>>>>>> +			}
>>>>>>>>> +			anon_vma_lock_write(anon_vma);
>>>>>>>>>  		}
>>>>>>>>>  		end = -1;
>>>>>>>>>  		mapping = NULL;
>>>>>>>>> -		anon_vma_lock_write(anon_vma);
>>>>>>>>>  	} else {
>>>>>>>>>  		unsigned int min_order;
>>>>>>>>>  		gfp_t gfp;
>>>>>>>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>  		goto out_unlock;
>>>>>>>>>  	}
>>>>>>>>>
>>>>>>>>> -	unmap_folio(folio);
>>>>>>>>> +	if (!isolated)
>>>>>>>>> +		unmap_folio(folio);
>>>>>>>>>
>>>>>>>>>  	/* block interrupt reentry in xa_lock and spinlock */
>>>>>>>>>  	local_irq_disable();
>>>>>>>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>
>>>>>>>>>  		ret = __split_unmapped_folio(folio, new_order,
>>>>>>>>>  				split_at, lock_at, list, end, &xas, mapping,
>>>>>>>>> -				uniform_split);
>>>>>>>>> +				uniform_split, isolated);
>>>>>>>>>  	} else {
>>>>>>>>>  		spin_unlock(&ds_queue->split_queue_lock);
>>>>>>>>>  fail:
>>>>>>>>>  		if (mapping)
>>>>>>>>>  			xas_unlock(&xas);
>>>>>>>>>  		local_irq_enable();
>>>>>>>>> -		remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>>>> +		if (!isolated)
>>>>>>>>> +			remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>>>>  		ret = -EAGAIN;
>>>>>>>>>  	}
>>>>>>>>
>>>>>>>> These "isolated" special handlings does not look good, I wonder if there
>>>>>>>> is a way of letting split code handle device private folios more gracefully.
>>>>>>>> It also causes confusions, since why does "isolated/unmapped" folios
>>>>>>>> not need to unmap_page(), remap_page(), or unlock?
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> There are two reasons for going down the current code path
>>>>>>
>>>>>> After thinking more, I think adding isolated/unmapped is not the right
>>>>>> way, since unmapped folio is a very generic concept. If you add it,
>>>>>> one can easily misuse the folio split code by first unmapping a folio
>>>>>> and trying to split it with unmapped = true. I do not think that is
>>>>>> supported and your patch does not prevent that from happening in the future.
>>>>>>
>>>>>
>>>>> I don't understand the misuse case you mention, I assume you mean someone can
>>>>> get the usage wrong? The responsibility is on the caller to do the right thing
>>>>> if calling the API with unmapped
>>>>
>>>> Before your patch, there is no use case of splitting unmapped folios.
>>>> Your patch only adds support for device private page split, not any unmapped
>>>> folio split. So using a generic isolated/unmapped parameter is not OK.
>>>>
>>>
>>> There is a use for splitting unmapped folios (see below)
>>>
>>>>>
>>>>>> You should teach different parts of folio split code path to handle
>>>>>> device private folios properly. Details are below.
>>>>>>
>>>>>>>
>>>>>>> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
>>>>>>>    the split routine to return with -EBUSY
>>>>>>
>>>>>> You do something below instead.
>>>>>>
>>>>>> if (!anon_vma && !folio_is_device_private(folio)) {
>>>>>> 	ret = -EBUSY;
>>>>>> 	goto out;
>>>>>> } else if (anon_vma) {
>>>>>> 	anon_vma_lock_write(anon_vma);
>>>>>> }
>>>>>>
>>>>>
>>>>> folio_get_anon() cannot be called for unmapped folios. In our case the page has
>>>>> already been unmapped. Is there a reason why you mix anon_vma_lock_write with
>>>>> the check for device private folios?
>>>>
>>>> Oh, I did not notice that anon_vma = folio_get_anon_vma(folio) is also
>>>> in if (!isolated) branch. In that case, just do
>>>>
>>>> if (folio_is_device_private(folio) {
>>>> ...
>>>> } else if (is_anon) {
>>>> ...
>>>> } else {
>>>> ...
>>>> }
>>>>
>>>>>
>>>>>> People can know device private folio split needs a special handling.
>>>>>>
>>>>>> BTW, why a device private folio can also be anonymous? Does it mean
>>>>>> if a page cache folio is migrated to device private, kernel also
>>>>>> sees it as both device private and file-backed?
>>>>>>
>>>>>
>>>>> FYI: device private folios only work with anonymous private pages, hence
>>>>> the name device private.
>>>>
>>>> OK.
>>>>
>>>>>
>>>>>>
>>>>>>> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
>>>>>>>    the migrate_device API has already just done as a part of the migration. The
>>>>>>>    entries under consideration are already migration entries in this case.
>>>>>>>    This is wasteful and in some case unexpected.
>>>>>>
>>>>>> unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split
>>>>>> PMD mapping, which you did in migrate_vma_split_pages(). You probably
>>>>>> can teach either try_to_migrate() or try_to_unmap() to just split
>>>>>> device private PMD mapping. Or if that is not preferred,
>>>>>> you can simply call split_huge_pmd_address() when unmap_folio()
>>>>>> sees a device private folio.
>>>>>>
>>>>>> For remap_page(), you can simply return for device private folios
>>>>>> like it is currently doing for non anonymous folios.
>>>>>>
>>>>>
>>>>> Doing a full rmap walk does not make sense with unmap_folio() and
>>>>> remap_folio(), because
>>>>>
>>>>> 1. We need to do a page table walk/rmap walk again
>>>>> 2. We'll need special handling of migration <-> migration entries
>>>>>    in the rmap handling (set/remove migration ptes)
>>>>> 3. In this context, the code is already in the middle of migration,
>>>>>    so trying to do that again does not make sense.
>>>>
>>>> Why doing split in the middle of migration? Existing split code
>>>> assumes to-be-split folios are mapped.
>>>>
>>>> What prevents doing split before migration?
>>>>
>>>
>>> The code does do a split prior to migration if THP selection fails
>>>
>>> Please see https://lore.kernel.org/lkml/20250703233511.2028395-5-balbirs@nvidia.com/
>>> and the fallback part which calls split_folio()
>>
>> So this split is done when the folio in system memory is mapped.
>>
>>>
>>> But the case under consideration is special since the device needs to allocate
>>> corresponding pfn's as well. The changelog mentions it:
>>>
>>> "The common case that arises is that after setup, during migrate
>>> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
>>> pages."
>>>
>>> I can expand on it, because migrate_vma() is a multi-phase operation
>>>
>>> 1. migrate_vma_setup()
>>> 2. migrate_vma_pages()
>>> 3. migrate_vma_finalize()
>>>
>>> It can so happen that when we get the destination pfn's allocated the destination
>>> might not be able to allocate a large page, so we do the split in migrate_vma_pages().
>>>
>>> The pages have been unmapped and collected in migrate_vma_setup()
>>
>> So these unmapped folios are system memory folios? I thought they are
>> large device private folios.
>>
>> OK. It sounds like splitting unmapped folios is really needed. I think
>> it is better to make a new split_unmapped_folio() function
>> by reusing __split_unmapped_folio(), since __folio_split() assumes
>> the input folio is mapped.
> 
> And to make __split_unmapped_folio()'s functionality match its name,
> I will later refactor it. At least move local_irq_enable(), remap_page(),
> and folio_unlocks out of it. I will think about how to deal with
> lru_add_split_folio(). The goal is to remove the to-be-added "unmapped"
> parameter from __split_unmapped_folio().
> 

That sounds like a plan. It seems like there needs to be a finish phase of
the split, and it does not belong in __split_unmapped_folio(). I would propose
that we rename "isolated" to "folio_is_migrating" and then your cleanups can
follow? Once your cleanups come in, we won't need to pass the parameter to
__split_unmapped_folio().
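
For reference, with that rename the interim __folio_split() signature from
the quoted diff would read as below; only the parameter name changes:

static int __folio_split(struct folio *folio, unsigned int new_order,
		struct page *split_at, struct page *lock_at,
		struct list_head *list, bool uniform_split,
		bool folio_is_migrating);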

Balbir Singh
Re: [v1 resend 08/12] mm/thp: add split during migration support
Posted by Zi Yan 3 months ago
On 6 Jul 2025, at 22:29, Balbir Singh wrote:

> On 7/6/25 13:03, Zi Yan wrote:
>> On 5 Jul 2025, at 22:34, Zi Yan wrote:
>>
>>> On 5 Jul 2025, at 21:47, Balbir Singh wrote:
>>>
>>>> On 7/6/25 11:34, Zi Yan wrote:
>>>>> On 5 Jul 2025, at 21:15, Balbir Singh wrote:
>>>>>
>>>>>> On 7/5/25 11:55, Zi Yan wrote:
>>>>>>> On 4 Jul 2025, at 20:58, Balbir Singh wrote:
>>>>>>>
>>>>>>>> On 7/4/25 21:24, Zi Yan wrote:
>>>>>>>>>
>>>>>>>>> s/pages/folio
>>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks, will make the changes
>>>>>>>>
>>>>>>>>> Why name it isolated if the folio is unmapped? Isolated folios often mean
>>>>>>>>> they are removed from LRU lists. isolated here causes confusion.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Ack, will change the name
>>>>>>>>
>>>>>>>>
>>>>>>>>>>   *
>>>>>>>>>>   * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>>>>>>>>>   * It is in charge of checking whether the split is supported or not and
>>>>>>>>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>>>>>>>>   */
>>>>>>>>>>  static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>  		struct page *split_at, struct page *lock_at,
>>>>>>>>>> -		struct list_head *list, bool uniform_split)
>>>>>>>>>> +		struct list_head *list, bool uniform_split, bool isolated)
>>>>>>>>>>  {
>>>>>>>>>>  	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>>>>>>>>  	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>>>>>>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>  		 * is taken to serialise against parallel split or collapse
>>>>>>>>>>  		 * operations.
>>>>>>>>>>  		 */
>>>>>>>>>> -		anon_vma = folio_get_anon_vma(folio);
>>>>>>>>>> -		if (!anon_vma) {
>>>>>>>>>> -			ret = -EBUSY;
>>>>>>>>>> -			goto out;
>>>>>>>>>> +		if (!isolated) {
>>>>>>>>>> +			anon_vma = folio_get_anon_vma(folio);
>>>>>>>>>> +			if (!anon_vma) {
>>>>>>>>>> +				ret = -EBUSY;
>>>>>>>>>> +				goto out;
>>>>>>>>>> +			}
>>>>>>>>>> +			anon_vma_lock_write(anon_vma);
>>>>>>>>>>  		}
>>>>>>>>>>  		end = -1;
>>>>>>>>>>  		mapping = NULL;
>>>>>>>>>> -		anon_vma_lock_write(anon_vma);
>>>>>>>>>>  	} else {
>>>>>>>>>>  		unsigned int min_order;
>>>>>>>>>>  		gfp_t gfp;
>>>>>>>>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>  		goto out_unlock;
>>>>>>>>>>  	}
>>>>>>>>>>
>>>>>>>>>> -	unmap_folio(folio);
>>>>>>>>>> +	if (!isolated)
>>>>>>>>>> +		unmap_folio(folio);
>>>>>>>>>>
>>>>>>>>>>  	/* block interrupt reentry in xa_lock and spinlock */
>>>>>>>>>>  	local_irq_disable();
>>>>>>>>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>
>>>>>>>>>>  		ret = __split_unmapped_folio(folio, new_order,
>>>>>>>>>>  				split_at, lock_at, list, end, &xas, mapping,
>>>>>>>>>> -				uniform_split);
>>>>>>>>>> +				uniform_split, isolated);
>>>>>>>>>>  	} else {
>>>>>>>>>>  		spin_unlock(&ds_queue->split_queue_lock);
>>>>>>>>>>  fail:
>>>>>>>>>>  		if (mapping)
>>>>>>>>>>  			xas_unlock(&xas);
>>>>>>>>>>  		local_irq_enable();
>>>>>>>>>> -		remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>>>>> +		if (!isolated)
>>>>>>>>>> +			remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>>>>>  		ret = -EAGAIN;
>>>>>>>>>>  	}
>>>>>>>>>
>>>>>>>>> These "isolated" special handlings does not look good, I wonder if there
>>>>>>>>> is a way of letting split code handle device private folios more gracefully.
>>>>>>>>> It also causes confusions, since why does "isolated/unmapped" folios
>>>>>>>>> not need to unmap_page(), remap_page(), or unlock?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> There are two reasons for going down the current code path
>>>>>>>
>>>>>>> After thinking more, I think adding isolated/unmapped is not the right
>>>>>>> way, since unmapped folio is a very generic concept. If you add it,
>>>>>>> one can easily misuse the folio split code by first unmapping a folio
>>>>>>> and trying to split it with unmapped = true. I do not think that is
>>>>>>> supported and your patch does not prevent that from happening in the future.
>>>>>>>
>>>>>>
>>>>>> I don't understand the misuse case you mention, I assume you mean someone can
>>>>>> get the usage wrong? The responsibility is on the caller to do the right thing
>>>>>> if calling the API with unmapped
>>>>>
>>>>> Before your patch, there is no use case of splitting unmapped folios.
>>>>> Your patch only adds support for device private page split, not any unmapped
>>>>> folio split. So using a generic isolated/unmapped parameter is not OK.
>>>>>
>>>>
>>>> There is a use for splitting unmapped folios (see below)
>>>>
>>>>>>
>>>>>>> You should teach different parts of folio split code path to handle
>>>>>>> device private folios properly. Details are below.
>>>>>>>
>>>>>>>>
>>>>>>>> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
>>>>>>>>    the split routine to return with -EBUSY
>>>>>>>
>>>>>>> You do something below instead.
>>>>>>>
>>>>>>> if (!anon_vma && !folio_is_device_private(folio)) {
>>>>>>> 	ret = -EBUSY;
>>>>>>> 	goto out;
>>>>>>> } else if (anon_vma) {
>>>>>>> 	anon_vma_lock_write(anon_vma);
>>>>>>> }
>>>>>>>
>>>>>>
>>>>>> folio_get_anon() cannot be called for unmapped folios. In our case the page has
>>>>>> already been unmapped. Is there a reason why you mix anon_vma_lock_write with
>>>>>> the check for device private folios?
>>>>>
>>>>> Oh, I did not notice that anon_vma = folio_get_anon_vma(folio) is also
>>>>> in if (!isolated) branch. In that case, just do
>>>>>
>>>>> if (folio_is_device_private(folio) {
>>>>> ...
>>>>> } else if (is_anon) {
>>>>> ...
>>>>> } else {
>>>>> ...
>>>>> }
>>>>>
>>>>>>
>>>>>>> People can know device private folio split needs a special handling.
>>>>>>>
>>>>>>> BTW, why a device private folio can also be anonymous? Does it mean
>>>>>>> if a page cache folio is migrated to device private, kernel also
>>>>>>> sees it as both device private and file-backed?
>>>>>>>
>>>>>>
>>>>>> FYI: device private folios only work with anonymous private pages, hence
>>>>>> the name device private.
>>>>>
>>>>> OK.
>>>>>
>>>>>>
>>>>>>>
>>>>>>>> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
>>>>>>>>    the migrate_device API has already just done as a part of the migration. The
>>>>>>>>    entries under consideration are already migration entries in this case.
>>>>>>>>    This is wasteful and in some case unexpected.
>>>>>>>
>>>>>>> unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split
>>>>>>> PMD mapping, which you did in migrate_vma_split_pages(). You probably
>>>>>>> can teach either try_to_migrate() or try_to_unmap() to just split
>>>>>>> device private PMD mapping. Or if that is not preferred,
>>>>>>> you can simply call split_huge_pmd_address() when unmap_folio()
>>>>>>> sees a device private folio.
>>>>>>>
>>>>>>> For remap_page(), you can simply return for device private folios
>>>>>>> like it is currently doing for non anonymous folios.
>>>>>>>
>>>>>>
>>>>>> Doing a full rmap walk does not make sense with unmap_folio() and
>>>>>> remap_folio(), because
>>>>>>
>>>>>> 1. We need to do a page table walk/rmap walk again
>>>>>> 2. We'll need special handling of migration <-> migration entries
>>>>>>    in the rmap handling (set/remove migration ptes)
>>>>>> 3. In this context, the code is already in the middle of migration,
>>>>>>    so trying to do that again does not make sense.
>>>>>
>>>>> Why doing split in the middle of migration? Existing split code
>>>>> assumes to-be-split folios are mapped.
>>>>>
>>>>> What prevents doing split before migration?
>>>>>
>>>>
>>>> The code does do a split prior to migration if THP selection fails
>>>>
>>>> Please see https://lore.kernel.org/lkml/20250703233511.2028395-5-balbirs@nvidia.com/
>>>> and the fallback part which calls split_folio()
>>>
>>> So this split is done when the folio in system memory is mapped.
>>>
>>>>
>>>> But the case under consideration is special since the device needs to allocate
>>>> corresponding pfn's as well. The changelog mentions it:
>>>>
>>>> "The common case that arises is that after setup, during migrate
>>>> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
>>>> pages."
>>>>
>>>> I can expand on it, because migrate_vma() is a multi-phase operation
>>>>
>>>> 1. migrate_vma_setup()
>>>> 2. migrate_vma_pages()
>>>> 3. migrate_vma_finalize()
>>>>
>>>> It can so happen that when we get the destination pfn's allocated the destination
>>>> might not be able to allocate a large page, so we do the split in migrate_vma_pages().
>>>>
>>>> The pages have been unmapped and collected in migrate_vma_setup()
>>>
>>> So these unmapped folios are system memory folios? I thought they are
>>> large device private folios.
>>>
>>> OK. It sounds like splitting unmapped folios is really needed. I think
>>> it is better to make a new split_unmapped_folio() function
>>> by reusing __split_unmapped_folio(), since __folio_split() assumes
>>> the input folio is mapped.
>>
>> And to make __split_unmapped_folio()'s functionality match its name,
>> I will later refactor it. At least move local_irq_enable(), remap_page(),
>> and folio_unlocks out of it. I will think about how to deal with
>> lru_add_split_folio(). The goal is to remove the to-be-added "unmapped"
>> parameter from __split_unmapped_folio().
>>
>
> That sounds like a plan. It seems like there needs to be a finish phase of
> the split, and it does not belong in __split_unmapped_folio(). I would propose
> that we rename "isolated" to "folio_is_migrating" and then your cleanups can
> follow? Once your cleanups come in, we won't need to pass the parameter to
> __split_unmapped_folio().

Sure.

The patch below should work. It only passed mm selftests and I am planning
to do more. If you are brave enough, you can give it a try and use
__split_unmapped_folio() from it.

From e594924d689bef740c38d93c7c1653f31bd5ae83 Mon Sep 17 00:00:00 2001
From: Zi Yan <ziy@nvidia.com>
Date: Sun, 6 Jul 2025 22:40:53 -0400
Subject: [PATCH] mm/huge_memory: move epilogue code out of
 __split_unmapped_folio()

The code is not related to splitting unmapped folio operations. Move
it out, so that __split_unmapped_folio() only does split work on unmapped
folios.

Signed-off-by: Zi Yan <ziy@nvidia.com>
---
 mm/huge_memory.c | 226 ++++++++++++++++++++++++-----------------------
 1 file changed, 116 insertions(+), 110 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3eb1c34be601..6eead616583f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3396,9 +3396,6 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
  *             order - 1 to new_order).
  * @split_at: in buddy allocator like split, the folio containing @split_at
  *            will be split until its order becomes @new_order.
- * @lock_at: the folio containing @lock_at is left locked for caller.
- * @list: the after split folios will be added to @list if it is not NULL,
- *        otherwise to LRU lists.
  * @end: the end of the file @folio maps to. -1 if @folio is anonymous memory.
  * @xas: xa_state pointing to folio->mapping->i_pages and locked by caller
  * @mapping: @folio->mapping
@@ -3436,40 +3433,20 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
  * split. The caller needs to check the input folio.
  */
 static int __split_unmapped_folio(struct folio *folio, int new_order,
-		struct page *split_at, struct page *lock_at,
-		struct list_head *list, pgoff_t end,
-		struct xa_state *xas, struct address_space *mapping,
-		bool uniform_split)
+				  struct page *split_at, struct xa_state *xas,
+				  struct address_space *mapping,
+				  bool uniform_split)
 {
-	struct lruvec *lruvec;
-	struct address_space *swap_cache = NULL;
-	struct folio *origin_folio = folio;
-	struct folio *next_folio = folio_next(folio);
-	struct folio *new_folio;
 	struct folio *next;
 	int order = folio_order(folio);
 	int split_order;
 	int start_order = uniform_split ? new_order : order - 1;
-	int nr_dropped = 0;
 	int ret = 0;
 	bool stop_split = false;

-	if (folio_test_swapcache(folio)) {
-		VM_BUG_ON(mapping);
-
-		/* a swapcache folio can only be uniformly split to order-0 */
-		if (!uniform_split || new_order != 0)
-			return -EINVAL;
-
-		swap_cache = swap_address_space(folio->swap);
-		xa_lock(&swap_cache->i_pages);
-	}
-
 	if (folio_test_anon(folio))
 		mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1);

-	/* lock lru list/PageCompound, ref frozen by page_ref_freeze */
-	lruvec = folio_lruvec_lock(folio);

 	folio_clear_has_hwpoisoned(folio);

@@ -3541,89 +3518,10 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
 						MTHP_STAT_NR_ANON, 1);
 			}

-			/*
-			 * origin_folio should be kept frozon until page cache
-			 * entries are updated with all the other after-split
-			 * folios to prevent others seeing stale page cache
-			 * entries.
-			 */
-			if (release == origin_folio)
-				continue;
-
-			folio_ref_unfreeze(release, 1 +
-					((mapping || swap_cache) ?
-						folio_nr_pages(release) : 0));
-
-			lru_add_split_folio(origin_folio, release, lruvec,
-					list);
-
-			/* Some pages can be beyond EOF: drop them from cache */
-			if (release->index >= end) {
-				if (shmem_mapping(mapping))
-					nr_dropped += folio_nr_pages(release);
-				else if (folio_test_clear_dirty(release))
-					folio_account_cleaned(release,
-						inode_to_wb(mapping->host));
-				__filemap_remove_folio(release, NULL);
-				folio_put_refs(release, folio_nr_pages(release));
-			} else if (mapping) {
-				__xa_store(&mapping->i_pages,
-						release->index, release, 0);
-			} else if (swap_cache) {
-				__xa_store(&swap_cache->i_pages,
-						swap_cache_index(release->swap),
-						release, 0);
-			}
 		}
 	}

-	/*
-	 * Unfreeze origin_folio only after all page cache entries, which used
-	 * to point to it, have been updated with new folios. Otherwise,
-	 * a parallel folio_try_get() can grab origin_folio and its caller can
-	 * see stale page cache entries.
-	 */
-	folio_ref_unfreeze(origin_folio, 1 +
-		((mapping || swap_cache) ? folio_nr_pages(origin_folio) : 0));
-
-	unlock_page_lruvec(lruvec);
-
-	if (swap_cache)
-		xa_unlock(&swap_cache->i_pages);
-	if (mapping)
-		xa_unlock(&mapping->i_pages);

-	/* Caller disabled irqs, so they are still disabled here */
-	local_irq_enable();
-
-	if (nr_dropped)
-		shmem_uncharge(mapping->host, nr_dropped);
-
-	remap_page(origin_folio, 1 << order,
-			folio_test_anon(origin_folio) ?
-				RMP_USE_SHARED_ZEROPAGE : 0);
-
-	/*
-	 * At this point, folio should contain the specified page.
-	 * For uniform split, it is left for caller to unlock.
-	 * For buddy allocator like split, the first after-split folio is left
-	 * for caller to unlock.
-	 */
-	for (new_folio = origin_folio; new_folio != next_folio; new_folio = next) {
-		next = folio_next(new_folio);
-		if (new_folio == page_folio(lock_at))
-			continue;
-
-		folio_unlock(new_folio);
-		/*
-		 * Subpages may be freed if there wasn't any mapping
-		 * like if add_to_swap() is running on a lru page that
-		 * had its mapping zapped. And freeing these pages
-		 * requires taking the lru_lock so we do the put_page
-		 * of the tail pages after the split is complete.
-		 */
-		free_folio_and_swap_cache(new_folio);
-	}
 	return ret;
 }

@@ -3706,10 +3604,12 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 {
 	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
 	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
+	struct folio *next_folio = folio_next(folio);
 	bool is_anon = folio_test_anon(folio);
 	struct address_space *mapping = NULL;
 	struct anon_vma *anon_vma = NULL;
 	int order = folio_order(folio);
+	struct folio *new_folio, *next;
 	int extra_pins, ret;
 	pgoff_t end;
 	bool is_hzp;
@@ -3840,6 +3740,10 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 	/* Prevent deferred_split_scan() touching ->_refcount */
 	spin_lock(&ds_queue->split_queue_lock);
 	if (folio_ref_freeze(folio, 1 + extra_pins)) {
+		struct address_space *swap_cache = NULL;
+		struct lruvec *lruvec;
+		int nr_dropped = 0;
+
 		if (folio_order(folio) > 1 &&
 		    !list_empty(&folio->_deferred_list)) {
 			ds_queue->split_queue_len--;
@@ -3873,19 +3777,121 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 			}
 		}

-		ret = __split_unmapped_folio(folio, new_order,
-				split_at, lock_at, list, end, &xas, mapping,
-				uniform_split);
+		if (folio_test_swapcache(folio)) {
+			VM_BUG_ON(mapping);
+
+			/* a swapcache folio can only be uniformly split to order-0 */
+			if (!uniform_split || new_order != 0) {
+				ret = -EINVAL;
+				goto out_unlock;
+			}
+
+			swap_cache = swap_address_space(folio->swap);
+			xa_lock(&swap_cache->i_pages);
+		}
+
+		/* lock lru list/PageCompound, ref frozen by page_ref_freeze */
+		lruvec = folio_lruvec_lock(folio);
+
+		ret = __split_unmapped_folio(folio, new_order, split_at, &xas,
+					     mapping, uniform_split);
+
+		/* Unfreeze after-split folios */
+		for (new_folio = folio; new_folio != next_folio;
+		     new_folio = next) {
+			next = folio_next(new_folio);
+			/*
+			 * @folio should be kept frozon until page cache
+			 * entries are updated with all the other after-split
+			 * folios to prevent others seeing stale page cache
+			 * entries.
+			 */
+			if (new_folio == folio)
+				continue;
+
+			folio_ref_unfreeze(
+				new_folio,
+				1 + ((mapping || swap_cache) ?
+					     folio_nr_pages(new_folio) :
+					     0));
+
+			lru_add_split_folio(folio, new_folio, lruvec, list);
+
+			/* Some pages can be beyond EOF: drop them from cache */
+			if (new_folio->index >= end) {
+				if (shmem_mapping(mapping))
+					nr_dropped += folio_nr_pages(new_folio);
+				else if (folio_test_clear_dirty(new_folio))
+					folio_account_cleaned(
+						new_folio,
+						inode_to_wb(mapping->host));
+				__filemap_remove_folio(new_folio, NULL);
+				folio_put_refs(new_folio,
+					       folio_nr_pages(new_folio));
+			} else if (mapping) {
+				__xa_store(&mapping->i_pages, new_folio->index,
+					   new_folio, 0);
+			} else if (swap_cache) {
+				__xa_store(&swap_cache->i_pages,
+					   swap_cache_index(new_folio->swap),
+					   new_folio, 0);
+			}
+		}
+		/*
+		 * Unfreeze @folio only after all page cache entries, which
+		 * used to point to it, have been updated with new folios.
+		 * Otherwise, a parallel folio_try_get() can grab origin_folio
+		 * and its caller can see stale page cache entries.
+		 */
+		folio_ref_unfreeze(folio, 1 +
+			((mapping || swap_cache) ? folio_nr_pages(folio) : 0));
+
+		unlock_page_lruvec(lruvec);
+
+		if (swap_cache)
+			xa_unlock(&swap_cache->i_pages);
+		if (mapping)
+			xa_unlock(&mapping->i_pages);
+
+		if (nr_dropped)
+			shmem_uncharge(mapping->host, nr_dropped);
+
 	} else {
 		spin_unlock(&ds_queue->split_queue_lock);
 fail:
 		if (mapping)
 			xas_unlock(&xas);
-		local_irq_enable();
-		remap_page(folio, folio_nr_pages(folio), 0);
 		ret = -EAGAIN;
 	}

+	local_irq_enable();
+
+	remap_page(folio, 1 << order,
+		   !ret && folio_test_anon(folio) ? RMP_USE_SHARED_ZEROPAGE :
+						    0);
+
+	/*
+	 * At this point, folio should contain the specified page.
+	 * For uniform split, it is left for caller to unlock.
+	 * For buddy allocator like split, the first after-split folio is left
+	 * for caller to unlock.
+	 */
+	for (new_folio = folio; new_folio != next_folio; new_folio = next) {
+		next = folio_next(new_folio);
+		if (new_folio == page_folio(lock_at))
+			continue;
+
+		folio_unlock(new_folio);
+		/*
+		 * Subpages may be freed if there wasn't any mapping
+		 * like if add_to_swap() is running on a lru page that
+		 * had its mapping zapped. And freeing these pages
+		 * requires taking the lru_lock so we do the put_page
+		 * of the tail pages after the split is complete.
+		 */
+		free_folio_and_swap_cache(new_folio);
+	}
+
 out_unlock:
 	if (anon_vma) {
 		anon_vma_unlock_write(anon_vma);
-- 
2.47.2



--
Best Regards,
Yan, Zi
Re: [v1 resend 08/12] mm/thp: add split during migration support
Posted by Balbir Singh 3 months ago
>>
>> That sounds like a plan, it seems like there needs to be a finish phase of
>> the split and it does not belong to __split_unmapped_folio(). I would propose
>> that we rename "isolated" to "folio_is_migrating" and then your cleanups can
>> follow? Once your cleanups come in, we won't need to pass the parameter to
>> __split_unmapped_folio().
> 
> Sure.
> 
> The patch below should work. It only passed mm selftests and I am planning
> to do more. If you are brave enough, you can give it a try and use
> __split_unmapped_folio() from it.
> 
> From e594924d689bef740c38d93c7c1653f31bd5ae83 Mon Sep 17 00:00:00 2001
> From: Zi Yan <ziy@nvidia.com>
> Date: Sun, 6 Jul 2025 22:40:53 -0400
> Subject: [PATCH] mm/huge_memory: move epilogue code out of
>  __split_unmapped_folio()
> 
> The code is not related to splitting unmapped folio operations. Move
> it out, so that __split_unmapped_folio() only does split work on unmapped
> folios.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
>  mm/huge_memory.c | 226 ++++++++++++++++++++++++-----------------------
>  1 file changed, 116 insertions(+), 110 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 3eb1c34be601..6eead616583f 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3396,9 +3396,6 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
>   *             order - 1 to new_order).
>   * @split_at: in buddy allocator like split, the folio containing @split_at
>   *            will be split until its order becomes @new_order.
> - * @lock_at: the folio containing @lock_at is left locked for caller.
> - * @list: the after split folios will be added to @list if it is not NULL,
> - *        otherwise to LRU lists.
>   * @end: the end of the file @folio maps to. -1 if @folio is anonymous memory.
>   * @xas: xa_state pointing to folio->mapping->i_pages and locked by caller
>   * @mapping: @folio->mapping
> @@ -3436,40 +3433,20 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
>   * split. The caller needs to check the input folio.
>   */
>  static int __split_unmapped_folio(struct folio *folio, int new_order,
> -		struct page *split_at, struct page *lock_at,
> -		struct list_head *list, pgoff_t end,
> -		struct xa_state *xas, struct address_space *mapping,
> -		bool uniform_split)
> +				  struct page *split_at, struct xa_state *xas,
> +				  struct address_space *mapping,
> +				  bool uniform_split)
>  {
> -	struct lruvec *lruvec;
> -	struct address_space *swap_cache = NULL;
> -	struct folio *origin_folio = folio;
> -	struct folio *next_folio = folio_next(folio);
> -	struct folio *new_folio;
>  	struct folio *next;
>  	int order = folio_order(folio);
>  	int split_order;
>  	int start_order = uniform_split ? new_order : order - 1;
> -	int nr_dropped = 0;
>  	int ret = 0;
>  	bool stop_split = false;
> 
> -	if (folio_test_swapcache(folio)) {
> -		VM_BUG_ON(mapping);
> -
> -		/* a swapcache folio can only be uniformly split to order-0 */
> -		if (!uniform_split || new_order != 0)
> -			return -EINVAL;
> -
> -		swap_cache = swap_address_space(folio->swap);
> -		xa_lock(&swap_cache->i_pages);
> -	}
> -
>  	if (folio_test_anon(folio))
>  		mod_mthp_stat(order, MTHP_STAT_NR_ANON, -1);
> 
> -	/* lock lru list/PageCompound, ref frozen by page_ref_freeze */
> -	lruvec = folio_lruvec_lock(folio);
> 
>  	folio_clear_has_hwpoisoned(folio);
> 
> @@ -3541,89 +3518,10 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>  						MTHP_STAT_NR_ANON, 1);
>  			}
> 
> -			/*
> -			 * origin_folio should be kept frozon until page cache
> -			 * entries are updated with all the other after-split
> -			 * folios to prevent others seeing stale page cache
> -			 * entries.
> -			 */
> -			if (release == origin_folio)
> -				continue;
> -
> -			folio_ref_unfreeze(release, 1 +
> -					((mapping || swap_cache) ?
> -						folio_nr_pages(release) : 0));
> -
> -			lru_add_split_folio(origin_folio, release, lruvec,
> -					list);
> -
> -			/* Some pages can be beyond EOF: drop them from cache */
> -			if (release->index >= end) {
> -				if (shmem_mapping(mapping))
> -					nr_dropped += folio_nr_pages(release);
> -				else if (folio_test_clear_dirty(release))
> -					folio_account_cleaned(release,
> -						inode_to_wb(mapping->host));
> -				__filemap_remove_folio(release, NULL);
> -				folio_put_refs(release, folio_nr_pages(release));
> -			} else if (mapping) {
> -				__xa_store(&mapping->i_pages,
> -						release->index, release, 0);
> -			} else if (swap_cache) {
> -				__xa_store(&swap_cache->i_pages,
> -						swap_cache_index(release->swap),
> -						release, 0);
> -			}
>  		}
>  	}
> 
> -	/*
> -	 * Unfreeze origin_folio only after all page cache entries, which used
> -	 * to point to it, have been updated with new folios. Otherwise,
> -	 * a parallel folio_try_get() can grab origin_folio and its caller can
> -	 * see stale page cache entries.
> -	 */
> -	folio_ref_unfreeze(origin_folio, 1 +
> -		((mapping || swap_cache) ? folio_nr_pages(origin_folio) : 0));
> -
> -	unlock_page_lruvec(lruvec);
> -
> -	if (swap_cache)
> -		xa_unlock(&swap_cache->i_pages);
> -	if (mapping)
> -		xa_unlock(&mapping->i_pages);
> 
> -	/* Caller disabled irqs, so they are still disabled here */
> -	local_irq_enable();
> -
> -	if (nr_dropped)
> -		shmem_uncharge(mapping->host, nr_dropped);
> -
> -	remap_page(origin_folio, 1 << order,
> -			folio_test_anon(origin_folio) ?
> -				RMP_USE_SHARED_ZEROPAGE : 0);
> -
> -	/*
> -	 * At this point, folio should contain the specified page.
> -	 * For uniform split, it is left for caller to unlock.
> -	 * For buddy allocator like split, the first after-split folio is left
> -	 * for caller to unlock.
> -	 */
> -	for (new_folio = origin_folio; new_folio != next_folio; new_folio = next) {
> -		next = folio_next(new_folio);
> -		if (new_folio == page_folio(lock_at))
> -			continue;
> -
> -		folio_unlock(new_folio);
> -		/*
> -		 * Subpages may be freed if there wasn't any mapping
> -		 * like if add_to_swap() is running on a lru page that
> -		 * had its mapping zapped. And freeing these pages
> -		 * requires taking the lru_lock so we do the put_page
> -		 * of the tail pages after the split is complete.
> -		 */
> -		free_folio_and_swap_cache(new_folio);
> -	}
>  	return ret;
>  }
> 
> @@ -3706,10 +3604,12 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>  {
>  	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>  	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
> +	struct folio *next_folio = folio_next(folio);
>  	bool is_anon = folio_test_anon(folio);
>  	struct address_space *mapping = NULL;
>  	struct anon_vma *anon_vma = NULL;
>  	int order = folio_order(folio);
> +	struct folio *new_folio, *next;
>  	int extra_pins, ret;
>  	pgoff_t end;
>  	bool is_hzp;
> @@ -3840,6 +3740,10 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>  	/* Prevent deferred_split_scan() touching ->_refcount */
>  	spin_lock(&ds_queue->split_queue_lock);
>  	if (folio_ref_freeze(folio, 1 + extra_pins)) {
> +		struct address_space *swap_cache = NULL;
> +		struct lruvec *lruvec;
> +		int nr_dropped = 0;
> +
>  		if (folio_order(folio) > 1 &&
>  		    !list_empty(&folio->_deferred_list)) {
>  			ds_queue->split_queue_len--;
> @@ -3873,19 +3777,121 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>  			}
>  		}
> 
> -		ret = __split_unmapped_folio(folio, new_order,
> -				split_at, lock_at, list, end, &xas, mapping,
> -				uniform_split);
> +		if (folio_test_swapcache(folio)) {
> +			VM_BUG_ON(mapping);
> +
> +			/* a swapcache folio can only be uniformly split to order-0 */
> +			if (!uniform_split || new_order != 0) {
> +				ret = -EINVAL;
> +				goto out_unlock;
> +			}
> +
> +			swap_cache = swap_address_space(folio->swap);
> +			xa_lock(&swap_cache->i_pages);
> +		}
> +
> +		/* lock lru list/PageCompound, ref frozen by page_ref_freeze */
> +		lruvec = folio_lruvec_lock(folio);
> +
> +		ret = __split_unmapped_folio(folio, new_order, split_at, &xas,
> +					     mapping, uniform_split);
> +
> +		/* Unfreeze after-split folios */
> +		for (new_folio = folio; new_folio != next_folio;
> +		     new_folio = next) {
> +			next = folio_next(new_folio);
> +			/*
> +			 * @folio should be kept frozen until page cache
> +			 * entries are updated with all the other after-split
> +			 * folios to prevent others seeing stale page cache
> +			 * entries.
> +			 */
> +			if (new_folio == folio)
> +				continue;
> +
> +			folio_ref_unfreeze(
> +				new_folio,
> +				1 + ((mapping || swap_cache) ?
> +					     folio_nr_pages(new_folio) :
> +					     0));
> +
> +			lru_add_split_folio(folio, new_folio, lruvec, list);
> +
> +			/* Some pages can be beyond EOF: drop them from cache */
> +			if (new_folio->index >= end) {
> +				if (shmem_mapping(mapping))
> +					nr_dropped += folio_nr_pages(new_folio);
> +				else if (folio_test_clear_dirty(new_folio))
> +					folio_account_cleaned(
> +						new_folio,
> +						inode_to_wb(mapping->host));
> +				__filemap_remove_folio(new_folio, NULL);
> +				folio_put_refs(new_folio,
> +					       folio_nr_pages(new_folio));
> +			} else if (mapping) {
> +				__xa_store(&mapping->i_pages, new_folio->index,
> +					   new_folio, 0);
> +			} else if (swap_cache) {
> +				__xa_store(&swap_cache->i_pages,
> +					   swap_cache_index(new_folio->swap),
> +					   new_folio, 0);
> +			}
> +		}
> +		/*
> +		 * Unfreeze @folio only after all page cache entries, which
> +		 * used to point to it, have been updated with new folios.
> +		 * Otherwise, a parallel folio_try_get() can grab origin_folio
> +		 * and its caller can see stale page cache entries.
> +		 */
> +		folio_ref_unfreeze(folio, 1 +
> +			((mapping || swap_cache) ? folio_nr_pages(folio) : 0));
> +
> +		unlock_page_lruvec(lruvec);
> +
> +		if (swap_cache)
> +			xa_unlock(&swap_cache->i_pages);
> +		if (mapping)
> +			xa_unlock(&mapping->i_pages);
> +
> +		if (nr_dropped)
> +			shmem_uncharge(mapping->host, nr_dropped);
> +
>  	} else {
>  		spin_unlock(&ds_queue->split_queue_lock);
>  fail:
>  		if (mapping)
>  			xas_unlock(&xas);
> -		local_irq_enable();
> -		remap_page(folio, folio_nr_pages(folio), 0);
>  		ret = -EAGAIN;
>  	}
> 
> +	local_irq_enable();
> +
> +	remap_page(folio, 1 << order,
> +		   !ret && folio_test_anon(folio) ? RMP_USE_SHARED_ZEROPAGE :
> +						    0);
> +
> +	/*
> +	 * At this point, folio should contain the specified page.
> +	 * For uniform split, it is left for caller to unlock.
> +	 * For buddy allocator like split, the first after-split folio is left
> +	 * for caller to unlock.
> +	 */
> +	for (new_folio = folio; new_folio != next_folio; new_folio = next) {
> +		next = folio_next(new_folio);
> +		if (new_folio == page_folio(lock_at))
> +			continue;
> +
> +		folio_unlock(new_folio);
> +		/*
> +		 * Subpages may be freed if there wasn't any mapping
> +		 * like if add_to_swap() is running on a lru page that
> +		 * had its mapping zapped. And freeing these pages
> +		 * requires taking the lru_lock so we do the put_page
> +		 * of the tail pages after the split is complete.
> +		 */
> +		free_folio_and_swap_cache(new_folio);
> +	}
> +
>  out_unlock:
>  	if (anon_vma) {
>  		anon_vma_unlock_write(anon_vma);


I applied my changes and tested on top of this patch. Thanks!

Tested-by: Balbir Singh <balbirs@nvidia.com>
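
For anyone else picking this up, this is roughly how I read the control flow of
__folio_split() after the refactor; a sketch only (names come from the diff
above, locking details and error paths are omitted), not the exact code:

	local_irq_disable();
	if (folio_ref_freeze(folio, 1 + extra_pins)) {
		/* take the swap cache xarray lock (if any) and the lruvec lock */
		ret = __split_unmapped_folio(folio, new_order, split_at,
					     &xas, mapping, uniform_split);
		/*
		 * For each after-split folio except @folio itself: unfreeze it,
		 * add it to the LRU (or @list), and update or drop its page
		 * cache / swap cache entry.
		 */
		/* unfreeze @folio last, then drop the xarray and lruvec locks */
	} else {
		ret = -EAGAIN;
	}
	local_irq_enable();
	remap_page(folio, 1 << order,
		   !ret && folio_test_anon(folio) ? RMP_USE_SHARED_ZEROPAGE : 0);
	/* unlock and release every after-split folio except the one at @lock_at */
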
Re: [v1 resend 08/12] mm/thp: add split during migration support
Posted by Balbir Singh 3 months ago
On 7/7/25 12:45, Zi Yan wrote:
> On 6 Jul 2025, at 22:29, Balbir Singh wrote:
> 
>> On 7/6/25 13:03, Zi Yan wrote:
>>> On 5 Jul 2025, at 22:34, Zi Yan wrote:
>>>
>>>> On 5 Jul 2025, at 21:47, Balbir Singh wrote:
>>>>
>>>>> On 7/6/25 11:34, Zi Yan wrote:
>>>>>> On 5 Jul 2025, at 21:15, Balbir Singh wrote:
>>>>>>
>>>>>>> On 7/5/25 11:55, Zi Yan wrote:
>>>>>>>> On 4 Jul 2025, at 20:58, Balbir Singh wrote:
>>>>>>>>
>>>>>>>>> On 7/4/25 21:24, Zi Yan wrote:
>>>>>>>>>>
>>>>>>>>>> s/pages/folio
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks, will make the changes
>>>>>>>>>
>>>>>>>>>> Why name it isolated if the folio is unmapped? Isolated folios often mean
>>>>>>>>>> they are removed from LRU lists. isolated here causes confusion.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Ack, will change the name
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>>   *
>>>>>>>>>>>   * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>>>>>>>>>>   * It is in charge of checking whether the split is supported or not and
>>>>>>>>>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>>>>>>>>>   */
>>>>>>>>>>>  static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>>  		struct page *split_at, struct page *lock_at,
>>>>>>>>>>> -		struct list_head *list, bool uniform_split)
>>>>>>>>>>> +		struct list_head *list, bool uniform_split, bool isolated)
>>>>>>>>>>>  {
>>>>>>>>>>>  	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>>>>>>>>>  	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>>>>>>>>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>>  		 * is taken to serialise against parallel split or collapse
>>>>>>>>>>>  		 * operations.
>>>>>>>>>>>  		 */
>>>>>>>>>>> -		anon_vma = folio_get_anon_vma(folio);
>>>>>>>>>>> -		if (!anon_vma) {
>>>>>>>>>>> -			ret = -EBUSY;
>>>>>>>>>>> -			goto out;
>>>>>>>>>>> +		if (!isolated) {
>>>>>>>>>>> +			anon_vma = folio_get_anon_vma(folio);
>>>>>>>>>>> +			if (!anon_vma) {
>>>>>>>>>>> +				ret = -EBUSY;
>>>>>>>>>>> +				goto out;
>>>>>>>>>>> +			}
>>>>>>>>>>> +			anon_vma_lock_write(anon_vma);
>>>>>>>>>>>  		}
>>>>>>>>>>>  		end = -1;
>>>>>>>>>>>  		mapping = NULL;
>>>>>>>>>>> -		anon_vma_lock_write(anon_vma);
>>>>>>>>>>>  	} else {
>>>>>>>>>>>  		unsigned int min_order;
>>>>>>>>>>>  		gfp_t gfp;
>>>>>>>>>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>>  		goto out_unlock;
>>>>>>>>>>>  	}
>>>>>>>>>>>
>>>>>>>>>>> -	unmap_folio(folio);
>>>>>>>>>>> +	if (!isolated)
>>>>>>>>>>> +		unmap_folio(folio);
>>>>>>>>>>>
>>>>>>>>>>>  	/* block interrupt reentry in xa_lock and spinlock */
>>>>>>>>>>>  	local_irq_disable();
>>>>>>>>>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>>>>>>>>>
>>>>>>>>>>>  		ret = __split_unmapped_folio(folio, new_order,
>>>>>>>>>>>  				split_at, lock_at, list, end, &xas, mapping,
>>>>>>>>>>> -				uniform_split);
>>>>>>>>>>> +				uniform_split, isolated);
>>>>>>>>>>>  	} else {
>>>>>>>>>>>  		spin_unlock(&ds_queue->split_queue_lock);
>>>>>>>>>>>  fail:
>>>>>>>>>>>  		if (mapping)
>>>>>>>>>>>  			xas_unlock(&xas);
>>>>>>>>>>>  		local_irq_enable();
>>>>>>>>>>> -		remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>>>>>> +		if (!isolated)
>>>>>>>>>>> +			remap_page(folio, folio_nr_pages(folio), 0);
>>>>>>>>>>>  		ret = -EAGAIN;
>>>>>>>>>>>  	}
>>>>>>>>>>
>>>>>>>>>> These "isolated" special handlings does not look good, I wonder if there
>>>>>>>>>> is a way of letting split code handle device private folios more gracefully.
>>>>>>>>>> It also causes confusions, since why does "isolated/unmapped" folios
>>>>>>>>>> not need to unmap_page(), remap_page(), or unlock?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> There are two reasons for going down the current code path
>>>>>>>>
>>>>>>>> After thinking more, I think adding isolated/unmapped is not the right
>>>>>>>> way, since unmapped folio is a very generic concept. If you add it,
>>>>>>>> one can easily misuse the folio split code by first unmapping a folio
>>>>>>>> and trying to split it with unmapped = true. I do not think that is
>>>>>>>> supported and your patch does not prevent that from happening in the future.
>>>>>>>>
>>>>>>>
>>>>>>> I don't understand the misuse case you mention, I assume you mean someone can
>>>>>>> get the usage wrong? The responsibility is on the caller to do the right thing
>>>>>>> if calling the API with unmapped
>>>>>>
>>>>>> Before your patch, there is no use case of splitting unmapped folios.
>>>>>> Your patch only adds support for device private page split, not any unmapped
>>>>>> folio split. So using a generic isolated/unmapped parameter is not OK.
>>>>>>
>>>>>
>>>>> There is a use for splitting unmapped folios (see below)
>>>>>
>>>>>>>
>>>>>>>> You should teach different parts of folio split code path to handle
>>>>>>>> device private folios properly. Details are below.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> 1. if the isolated check is not present, folio_get_anon_vma will fail and cause
>>>>>>>>>    the split routine to return with -EBUSY
>>>>>>>>
>>>>>>>> You do something below instead.
>>>>>>>>
>>>>>>>> if (!anon_vma && !folio_is_device_private(folio)) {
>>>>>>>> 	ret = -EBUSY;
>>>>>>>> 	goto out;
>>>>>>>> } else if (anon_vma) {
>>>>>>>> 	anon_vma_lock_write(anon_vma);
>>>>>>>> }
>>>>>>>>
>>>>>>>
>>>>>>> folio_get_anon() cannot be called for unmapped folios. In our case the page has
>>>>>>> already been unmapped. Is there a reason why you mix anon_vma_lock_write with
>>>>>>> the check for device private folios?
>>>>>>
>>>>>> Oh, I did not notice that anon_vma = folio_get_anon_vma(folio) is also
>>>>>> in if (!isolated) branch. In that case, just do
>>>>>>
>>>>>> if (folio_is_device_private(folio)) {
>>>>>> ...
>>>>>> } else if (is_anon) {
>>>>>> ...
>>>>>> } else {
>>>>>> ...
>>>>>> }
>>>>>>
>>>>>>>
>>>>>>>> People can know device private folio split needs a special handling.
>>>>>>>>
>>>>>>>> BTW, why a device private folio can also be anonymous? Does it mean
>>>>>>>> if a page cache folio is migrated to device private, kernel also
>>>>>>>> sees it as both device private and file-backed?
>>>>>>>>
>>>>>>>
>>>>>>> FYI: device private folios only work with anonymous private pages, hence
>>>>>>> the name device private.
>>>>>>
>>>>>> OK.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>> 2. Going through unmap_page(), remap_page() causes a full page table walk, which
>>>>>>>>>    the migrate_device API has already just done as a part of the migration. The
>>>>>>>>>    entries under consideration are already migration entries in this case.
>>>>>>>>>    This is wasteful and in some case unexpected.
>>>>>>>>
>>>>>>>> unmap_folio() already adds TTU_SPLIT_HUGE_PMD to try to split
>>>>>>>> PMD mapping, which you did in migrate_vma_split_pages(). You probably
>>>>>>>> can teach either try_to_migrate() or try_to_unmap() to just split
>>>>>>>> device private PMD mapping. Or if that is not preferred,
>>>>>>>> you can simply call split_huge_pmd_address() when unmap_folio()
>>>>>>>> sees a device private folio.
>>>>>>>>
>>>>>>>> For remap_page(), you can simply return for device private folios
>>>>>>>> like it is currently doing for non anonymous folios.
>>>>>>>>
>>>>>>>
>>>>>>> Doing a full rmap walk does not make sense with unmap_folio() and
>>>>>>> remap_folio(), because
>>>>>>>
>>>>>>> 1. We need to do a page table walk/rmap walk again
>>>>>>> 2. We'll need special handling of migration <-> migration entries
>>>>>>>    in the rmap handling (set/remove migration ptes)
>>>>>>> 3. In this context, the code is already in the middle of migration,
>>>>>>>    so trying to do that again does not make sense.
>>>>>>
>>>>>> Why doing split in the middle of migration? Existing split code
>>>>>> assumes to-be-split folios are mapped.
>>>>>>
>>>>>> What prevents doing split before migration?
>>>>>>
>>>>>
>>>>> The code does do a split prior to migration if THP selection fails
>>>>>
>>>>> Please see https://lore.kernel.org/lkml/20250703233511.2028395-5-balbirs@nvidia.com/
>>>>> and the fallback part which calls split_folio()
>>>>
>>>> So this split is done when the folio in system memory is mapped.
>>>>
>>>>>
>>>>> But the case under consideration is special since the device needs to allocate
>>>>> corresponding pfn's as well. The changelog mentions it:
>>>>>
>>>>> "The common case that arises is that after setup, during migrate
>>>>> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
>>>>> pages."
>>>>>
>>>>> I can expand on it, because migrate_vma() is a multi-phase operation
>>>>>
>>>>> 1. migrate_vma_setup()
>>>>> 2. migrate_vma_pages()
>>>>> 3. migrate_vma_finalize()
>>>>>
>>>>> It can so happen that when we get the destination pfn's allocated the destination
>>>>> might not be able to allocate a large page, so we do the split in migrate_vma_pages().
>>>>>
>>>>> The pages have been unmapped and collected in migrate_vma_setup()
>>>>
>>>> So these unmapped folios are system memory folios? I thought they are
>>>> large device private folios.
>>>>
>>>> OK. It sounds like splitting unmapped folios is really needed. I think
>>>> it is better to make a new split_unmapped_folio() function
>>>> by reusing __split_unmapped_folio(), since __folio_split() assumes
>>>> the input folio is mapped.
>>>
>>> And to make __split_unmapped_folio()'s functionality match its name,
>>> I will later refactor it. At least move local_irq_enable(), remap_page(),
>>> and folio_unlocks out of it. I will think about how to deal with
>>> lru_add_split_folio(). The goal is to remove the to-be-added "unmapped"
>>> parameter from __split_unmapped_folio().
>>>
>>
>> That sounds like a plan, it seems like there needs to be a finish phase of
>> the split and it does not belong to __split_unmapped_folio(). I would propose
>> that we rename "isolated" to "folio_is_migrating" and then your cleanups can
>> follow? Once your cleanups come in, we won't need to pass the parameter to
>> __split_unmapped_folio().
> 
> Sure.
> 
> The patch below should work. It only passed mm selftests and I am planning
> to do more. If you are brave enough, you can give it a try and use
> __split_unmapped_folio() from it.
> 
> From e594924d689bef740c38d93c7c1653f31bd5ae83 Mon Sep 17 00:00:00 2001
> From: Zi Yan <ziy@nvidia.com>
> Date: Sun, 6 Jul 2025 22:40:53 -0400
> Subject: [PATCH] mm/huge_memory: move epilogue code out of
>  __split_unmapped_folio()
> 
> The code is not related to splitting unmapped folio operations. Move
> it out, so that __split_unmapped_folio() only does split work on unmapped
> folios.
> 
> Signed-off-by: Zi Yan <ziy@nvidia.com>
> ---
> 

The patch fails to apply for me; let me try to rebase it on top of this series.

Balbir
Re: [v1 resend 08/12] mm/thp: add split during migration support
Posted by Mika Penttilä 3 months ago
On 7/4/25 02:35, Balbir Singh wrote:
> Support splitting pages during THP zone device migration as needed.
> The common case that arises is that after setup, during migrate
> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
> pages.
>
> Add a new routine migrate_vma_split_pages() to support the splitting
> of already isolated pages. The pages being migrated are already unmapped
> and marked for migration during setup (via unmap). folio_split() and
> __split_unmapped_folio() take additional isolated arguments, to avoid
> unmapping and remaping these pages and unlocking/putting the folio.
>
> Cc: Karol Herbst <kherbst@redhat.com>
> Cc: Lyude Paul <lyude@redhat.com>
> Cc: Danilo Krummrich <dakr@kernel.org>
> Cc: David Airlie <airlied@gmail.com>
> Cc: Simona Vetter <simona@ffwll.ch>
> Cc: "Jérôme Glisse" <jglisse@redhat.com>
> Cc: Shuah Khan <shuah@kernel.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
> Cc: Jane Chu <jane.chu@oracle.com>
> Cc: Alistair Popple <apopple@nvidia.com>
> Cc: Donet Tom <donettom@linux.ibm.com>
>
> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
> ---
>  include/linux/huge_mm.h | 11 ++++++--
>  mm/huge_memory.c        | 54 ++++++++++++++++++++-----------------
>  mm/migrate_device.c     | 59 ++++++++++++++++++++++++++++++++---------
>  3 files changed, 85 insertions(+), 39 deletions(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 65a1bdf29bb9..5f55a754e57c 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -343,8 +343,8 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>  		vm_flags_t vm_flags);
>  
>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
> -int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> -		unsigned int new_order);
> +int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> +		unsigned int new_order, bool isolated);
>  int min_order_for_split(struct folio *folio);
>  int split_folio_to_list(struct folio *folio, struct list_head *list);
>  bool uniform_split_supported(struct folio *folio, unsigned int new_order,
> @@ -353,6 +353,13 @@ bool non_uniform_split_supported(struct folio *folio, unsigned int new_order,
>  		bool warns);
>  int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
>  		struct list_head *list);
> +
> +static inline int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> +		unsigned int new_order)
> +{
> +	return __split_huge_page_to_list_to_order(page, list, new_order, false);
> +}
> +
>  /*
>   * try_folio_split - try to split a @folio at @page using non uniform split.
>   * @folio: folio to be split
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d55e36ae0c39..e00ddfed22fa 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -3424,15 +3424,6 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
>  		new_folio->mapping = folio->mapping;
>  		new_folio->index = folio->index + i;
>  
> -		/*
> -		 * page->private should not be set in tail pages. Fix up and warn once
> -		 * if private is unexpectedly set.
> -		 */
> -		if (unlikely(new_folio->private)) {
> -			VM_WARN_ON_ONCE_PAGE(true, new_head);
> -			new_folio->private = NULL;
> -		}
> -
>  		if (folio_test_swapcache(folio))
>  			new_folio->swap.val = folio->swap.val + i;
>  
> @@ -3519,7 +3510,7 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>  		struct page *split_at, struct page *lock_at,
>  		struct list_head *list, pgoff_t end,
>  		struct xa_state *xas, struct address_space *mapping,
> -		bool uniform_split)
> +		bool uniform_split, bool isolated)
>  {
>  	struct lruvec *lruvec;
>  	struct address_space *swap_cache = NULL;
> @@ -3643,8 +3634,9 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>  				percpu_ref_get_many(&release->pgmap->ref,
>  							(1 << new_order) - 1);
>  
> -			lru_add_split_folio(origin_folio, release, lruvec,
> -					list);
> +			if (!isolated)
> +				lru_add_split_folio(origin_folio, release,
> +							lruvec, list);
>  
>  			/* Some pages can be beyond EOF: drop them from cache */
>  			if (release->index >= end) {
> @@ -3697,6 +3689,12 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>  	if (nr_dropped)
>  		shmem_uncharge(mapping->host, nr_dropped);
>  
> +	/*
> +	 * Don't remap and unlock isolated folios
> +	 */
> +	if (isolated)
> +		return ret;
> +
>  	remap_page(origin_folio, 1 << order,
>  			folio_test_anon(origin_folio) ?
>  				RMP_USE_SHARED_ZEROPAGE : 0);
> @@ -3790,6 +3788,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>   * @lock_at: a page within @folio to be left locked to caller
>   * @list: after-split folios will be put on it if non NULL
>   * @uniform_split: perform uniform split or not (non-uniform split)
> + * @isolated: The pages are already unmapped
>   *
>   * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>   * It is in charge of checking whether the split is supported or not and
> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>   */
>  static int __folio_split(struct folio *folio, unsigned int new_order,
>  		struct page *split_at, struct page *lock_at,
> -		struct list_head *list, bool uniform_split)
> +		struct list_head *list, bool uniform_split, bool isolated)
>  {
>  	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>  	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>  		 * is taken to serialise against parallel split or collapse
>  		 * operations.
>  		 */
> -		anon_vma = folio_get_anon_vma(folio);
> -		if (!anon_vma) {
> -			ret = -EBUSY;
> -			goto out;
> +		if (!isolated) {
> +			anon_vma = folio_get_anon_vma(folio);
> +			if (!anon_vma) {
> +				ret = -EBUSY;
> +				goto out;
> +			}
> +			anon_vma_lock_write(anon_vma);
>  		}
>  		end = -1;
>  		mapping = NULL;
> -		anon_vma_lock_write(anon_vma);
>  	} else {
>  		unsigned int min_order;
>  		gfp_t gfp;
> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>  		goto out_unlock;
>  	}
>  
> -	unmap_folio(folio);
> +	if (!isolated)
> +		unmap_folio(folio);
>  
>  	/* block interrupt reentry in xa_lock and spinlock */
>  	local_irq_disable();
> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>  
>  		ret = __split_unmapped_folio(folio, new_order,
>  				split_at, lock_at, list, end, &xas, mapping,
> -				uniform_split);
> +				uniform_split, isolated);
>  	} else {
>  		spin_unlock(&ds_queue->split_queue_lock);
>  fail:
>  		if (mapping)
>  			xas_unlock(&xas);
>  		local_irq_enable();
> -		remap_page(folio, folio_nr_pages(folio), 0);
> +		if (!isolated)
> +			remap_page(folio, folio_nr_pages(folio), 0);
>  		ret = -EAGAIN;
>  	}
>  
> @@ -4046,12 +4049,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>   * Returns -EINVAL when trying to split to an order that is incompatible
>   * with the folio. Splitting to order 0 is compatible with all folios.
>   */
> -int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> -				     unsigned int new_order)
> +int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
> +				     unsigned int new_order, bool isolated)
>  {
>  	struct folio *folio = page_folio(page);
>  
> -	return __folio_split(folio, new_order, &folio->page, page, list, true);
> +	return __folio_split(folio, new_order, &folio->page, page, list, true,
> +				isolated);
>  }
>  
>  /*
> @@ -4080,7 +4084,7 @@ int folio_split(struct folio *folio, unsigned int new_order,
>  		struct page *split_at, struct list_head *list)
>  {
>  	return __folio_split(folio, new_order, split_at, &folio->page, list,
> -			false);
> +			false, false);
>  }
>  
>  int min_order_for_split(struct folio *folio)
> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> index 41d0bd787969..acd2f03b178d 100644
> --- a/mm/migrate_device.c
> +++ b/mm/migrate_device.c
> @@ -813,6 +813,24 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
>  		src[i] &= ~MIGRATE_PFN_MIGRATE;
>  	return 0;
>  }
> +
> +static void migrate_vma_split_pages(struct migrate_vma *migrate,
> +					unsigned long idx, unsigned long addr,
> +					struct folio *folio)
> +{
> +	unsigned long i;
> +	unsigned long pfn;
> +	unsigned long flags;
> +
> +	folio_get(folio);
> +	split_huge_pmd_address(migrate->vma, addr, true);
> +	__split_huge_page_to_list_to_order(folio_page(folio, 0), NULL, 0, true);

We already have a reference to the folio, so why is folio_get() needed?

Splitting the page already splits the PMD for anon folios, so why is split_huge_pmd_address() called here?

> +	migrate->src[idx] &= ~MIGRATE_PFN_COMPOUND;
> +	flags = migrate->src[idx] & ((1UL << MIGRATE_PFN_SHIFT) - 1);
> +	pfn = migrate->src[idx] >> MIGRATE_PFN_SHIFT;
> +	for (i = 1; i < HPAGE_PMD_NR; i++)
> +		migrate->src[i+idx] = migrate_pfn(pfn + i) | flags;
> +}
>  #else /* !CONFIG_ARCH_ENABLE_THP_MIGRATION */
>  static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
>  					 unsigned long addr,
> @@ -822,6 +840,11 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
>  {
>  	return 0;
>  }
> +
> +static void migrate_vma_split_pages(struct migrate_vma *migrate,
> +					unsigned long idx, unsigned long addr,
> +					struct folio *folio)
> +{}
>  #endif
>  
>  /*
> @@ -971,8 +994,9 @@ static void __migrate_device_pages(unsigned long *src_pfns,
>  				struct migrate_vma *migrate)
>  {
>  	struct mmu_notifier_range range;
> -	unsigned long i;
> +	unsigned long i, j;
>  	bool notified = false;
> +	unsigned long addr;
>  
>  	for (i = 0; i < npages; ) {
>  		struct page *newpage = migrate_pfn_to_page(dst_pfns[i]);
> @@ -1014,12 +1038,16 @@ static void __migrate_device_pages(unsigned long *src_pfns,
>  				(!(dst_pfns[i] & MIGRATE_PFN_COMPOUND))) {
>  				nr = HPAGE_PMD_NR;
>  				src_pfns[i] &= ~MIGRATE_PFN_COMPOUND;
> -				src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
> -				goto next;
> +			} else {
> +				nr = 1;
>  			}
>  
> -			migrate_vma_insert_page(migrate, addr, &dst_pfns[i],
> -						&src_pfns[i]);
> +			for (j = 0; j < nr && i + j < npages; j++) {
> +				src_pfns[i+j] |= MIGRATE_PFN_MIGRATE;
> +				migrate_vma_insert_page(migrate,
> +					addr + j * PAGE_SIZE,
> +					&dst_pfns[i+j], &src_pfns[i+j]);
> +			}
>  			goto next;
>  		}
>  
> @@ -1041,7 +1069,9 @@ static void __migrate_device_pages(unsigned long *src_pfns,
>  							 MIGRATE_PFN_COMPOUND);
>  					goto next;
>  				}
> -				src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
> +				nr = 1 << folio_order(folio);
> +				addr = migrate->start + i * PAGE_SIZE;
> +				migrate_vma_split_pages(migrate, i, addr, folio);
>  			} else if ((src_pfns[i] & MIGRATE_PFN_MIGRATE) &&
>  				(dst_pfns[i] & MIGRATE_PFN_COMPOUND) &&
>  				!(src_pfns[i] & MIGRATE_PFN_COMPOUND)) {
> @@ -1076,12 +1106,17 @@ static void __migrate_device_pages(unsigned long *src_pfns,
>  		BUG_ON(folio_test_writeback(folio));
>  
>  		if (migrate && migrate->fault_page == page)
> -			extra_cnt = 1;
> -		r = folio_migrate_mapping(mapping, newfolio, folio, extra_cnt);
> -		if (r != MIGRATEPAGE_SUCCESS)
> -			src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
> -		else
> -			folio_migrate_flags(newfolio, folio);
> +			extra_cnt++;
> +		for (j = 0; j < nr && i + j < npages; j++) {
> +			folio = page_folio(migrate_pfn_to_page(src_pfns[i+j]));
> +			newfolio = page_folio(migrate_pfn_to_page(dst_pfns[i+j]));
> +
> +			r = folio_migrate_mapping(mapping, newfolio, folio, extra_cnt);
> +			if (r != MIGRATEPAGE_SUCCESS)
> +				src_pfns[i+j] &= ~MIGRATE_PFN_MIGRATE;
> +			else
> +				folio_migrate_flags(newfolio, folio);
> +		}
>  next:
>  		i += nr;
>  	}

Re: [v1 resend 08/12] mm/thp: add split during migration support
Posted by Mika Penttilä 3 months ago
On 7/4/25 08:17, Mika Penttilä wrote:
> On 7/4/25 02:35, Balbir Singh wrote:
>> Support splitting pages during THP zone device migration as needed.
>> The common case that arises is that after setup, during migrate
>> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
>> pages.
>>
>> Add a new routine migrate_vma_split_pages() to support the splitting
>> of already isolated pages. The pages being migrated are already unmapped
>> and marked for migration during setup (via unmap). folio_split() and
>> __split_unmapped_folio() take additional isolated arguments, to avoid
>> unmapping and remaping these pages and unlocking/putting the folio.
>>
>> Cc: Karol Herbst <kherbst@redhat.com>
>> Cc: Lyude Paul <lyude@redhat.com>
>> Cc: Danilo Krummrich <dakr@kernel.org>
>> Cc: David Airlie <airlied@gmail.com>
>> Cc: Simona Vetter <simona@ffwll.ch>
>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>> Cc: Shuah Khan <shuah@kernel.org>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Barry Song <baohua@kernel.org>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>> Cc: Matthew Wilcox <willy@infradead.org>
>> Cc: Peter Xu <peterx@redhat.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>> Cc: Jane Chu <jane.chu@oracle.com>
>> Cc: Alistair Popple <apopple@nvidia.com>
>> Cc: Donet Tom <donettom@linux.ibm.com>
>>
>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>> ---
>>  include/linux/huge_mm.h | 11 ++++++--
>>  mm/huge_memory.c        | 54 ++++++++++++++++++++-----------------
>>  mm/migrate_device.c     | 59 ++++++++++++++++++++++++++++++++---------
>>  3 files changed, 85 insertions(+), 39 deletions(-)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index 65a1bdf29bb9..5f55a754e57c 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -343,8 +343,8 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>  		vm_flags_t vm_flags);
>>  
>>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>> -int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> -		unsigned int new_order);
>> +int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> +		unsigned int new_order, bool isolated);
>>  int min_order_for_split(struct folio *folio);
>>  int split_folio_to_list(struct folio *folio, struct list_head *list);
>>  bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>> @@ -353,6 +353,13 @@ bool non_uniform_split_supported(struct folio *folio, unsigned int new_order,
>>  		bool warns);
>>  int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
>>  		struct list_head *list);
>> +
>> +static inline int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> +		unsigned int new_order)
>> +{
>> +	return __split_huge_page_to_list_to_order(page, list, new_order, false);
>> +}
>> +
>>  /*
>>   * try_folio_split - try to split a @folio at @page using non uniform split.
>>   * @folio: folio to be split
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index d55e36ae0c39..e00ddfed22fa 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -3424,15 +3424,6 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
>>  		new_folio->mapping = folio->mapping;
>>  		new_folio->index = folio->index + i;
>>  
>> -		/*
>> -		 * page->private should not be set in tail pages. Fix up and warn once
>> -		 * if private is unexpectedly set.
>> -		 */
>> -		if (unlikely(new_folio->private)) {
>> -			VM_WARN_ON_ONCE_PAGE(true, new_head);
>> -			new_folio->private = NULL;
>> -		}
>> -
>>  		if (folio_test_swapcache(folio))
>>  			new_folio->swap.val = folio->swap.val + i;
>>  
>> @@ -3519,7 +3510,7 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>>  		struct page *split_at, struct page *lock_at,
>>  		struct list_head *list, pgoff_t end,
>>  		struct xa_state *xas, struct address_space *mapping,
>> -		bool uniform_split)
>> +		bool uniform_split, bool isolated)
>>  {
>>  	struct lruvec *lruvec;
>>  	struct address_space *swap_cache = NULL;
>> @@ -3643,8 +3634,9 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>>  				percpu_ref_get_many(&release->pgmap->ref,
>>  							(1 << new_order) - 1);
>>  
>> -			lru_add_split_folio(origin_folio, release, lruvec,
>> -					list);
>> +			if (!isolated)
>> +				lru_add_split_folio(origin_folio, release,
>> +							lruvec, list);
>>  
>>  			/* Some pages can be beyond EOF: drop them from cache */
>>  			if (release->index >= end) {
>> @@ -3697,6 +3689,12 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>>  	if (nr_dropped)
>>  		shmem_uncharge(mapping->host, nr_dropped);
>>  
>> +	/*
>> +	 * Don't remap and unlock isolated folios
>> +	 */
>> +	if (isolated)
>> +		return ret;
>> +
>>  	remap_page(origin_folio, 1 << order,
>>  			folio_test_anon(origin_folio) ?
>>  				RMP_USE_SHARED_ZEROPAGE : 0);
>> @@ -3790,6 +3788,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>   * @lock_at: a page within @folio to be left locked to caller
>>   * @list: after-split folios will be put on it if non NULL
>>   * @uniform_split: perform uniform split or not (non-uniform split)
>> + * @isolated: The pages are already unmapped
>>   *
>>   * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>   * It is in charge of checking whether the split is supported or not and
>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>   */
>>  static int __folio_split(struct folio *folio, unsigned int new_order,
>>  		struct page *split_at, struct page *lock_at,
>> -		struct list_head *list, bool uniform_split)
>> +		struct list_head *list, bool uniform_split, bool isolated)
>>  {
>>  	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>  	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>  		 * is taken to serialise against parallel split or collapse
>>  		 * operations.
>>  		 */
>> -		anon_vma = folio_get_anon_vma(folio);
>> -		if (!anon_vma) {
>> -			ret = -EBUSY;
>> -			goto out;
>> +		if (!isolated) {
>> +			anon_vma = folio_get_anon_vma(folio);
>> +			if (!anon_vma) {
>> +				ret = -EBUSY;
>> +				goto out;
>> +			}
>> +			anon_vma_lock_write(anon_vma);
>>  		}
>>  		end = -1;
>>  		mapping = NULL;
>> -		anon_vma_lock_write(anon_vma);
>>  	} else {
>>  		unsigned int min_order;
>>  		gfp_t gfp;
>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>  		goto out_unlock;
>>  	}
>>  
>> -	unmap_folio(folio);
>> +	if (!isolated)
>> +		unmap_folio(folio);
>>  
>>  	/* block interrupt reentry in xa_lock and spinlock */
>>  	local_irq_disable();
>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>  
>>  		ret = __split_unmapped_folio(folio, new_order,
>>  				split_at, lock_at, list, end, &xas, mapping,
>> -				uniform_split);
>> +				uniform_split, isolated);
>>  	} else {
>>  		spin_unlock(&ds_queue->split_queue_lock);
>>  fail:
>>  		if (mapping)
>>  			xas_unlock(&xas);
>>  		local_irq_enable();
>> -		remap_page(folio, folio_nr_pages(folio), 0);
>> +		if (!isolated)
>> +			remap_page(folio, folio_nr_pages(folio), 0);
>>  		ret = -EAGAIN;
>>  	}
>>  
>> @@ -4046,12 +4049,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>   * Returns -EINVAL when trying to split to an order that is incompatible
>>   * with the folio. Splitting to order 0 is compatible with all folios.
>>   */
>> -int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> -				     unsigned int new_order)
>> +int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>> +				     unsigned int new_order, bool isolated)
>>  {
>>  	struct folio *folio = page_folio(page);
>>  
>> -	return __folio_split(folio, new_order, &folio->page, page, list, true);
>> +	return __folio_split(folio, new_order, &folio->page, page, list, true,
>> +				isolated);
>>  }
>>  
>>  /*
>> @@ -4080,7 +4084,7 @@ int folio_split(struct folio *folio, unsigned int new_order,
>>  		struct page *split_at, struct list_head *list)
>>  {
>>  	return __folio_split(folio, new_order, split_at, &folio->page, list,
>> -			false);
>> +			false, false);
>>  }
>>  
>>  int min_order_for_split(struct folio *folio)
>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>> index 41d0bd787969..acd2f03b178d 100644
>> --- a/mm/migrate_device.c
>> +++ b/mm/migrate_device.c
>> @@ -813,6 +813,24 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
>>  		src[i] &= ~MIGRATE_PFN_MIGRATE;
>>  	return 0;
>>  }
>> +
>> +static void migrate_vma_split_pages(struct migrate_vma *migrate,
>> +					unsigned long idx, unsigned long addr,
>> +					struct folio *folio)
>> +{
>> +	unsigned long i;
>> +	unsigned long pfn;
>> +	unsigned long flags;
>> +
>> +	folio_get(folio);
>> +	split_huge_pmd_address(migrate->vma, addr, true);
>> +	__split_huge_page_to_list_to_order(folio_page(folio, 0), NULL, 0, true);
> We already have a reference to the folio, so why is folio_get() needed?
>
> Splitting the page already splits the PMD for anon folios, so why is split_huge_pmd_address() called here?

Oh I see 
+	if (!isolated)
+		unmap_folio(folio);

which explains the explicit split_huge_pmd_address(migrate->vma, addr, true);
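
(If I read the series right, this matches Zi Yan's earlier point that the PMD
split normally happens as a side effect of unmapping, roughly:

	unmap_folio(folio);	/* passes TTU_SPLIT_HUGE_PMD, so the PMD is split here */

and because the isolated path skips unmap_folio(), migrate_vma_split_pages()
has to call split_huge_pmd_address() itself.)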

Still, why the folio_get(folio);?
 

>
>> +	migrate->src[idx] &= ~MIGRATE_PFN_COMPOUND;
>> +	flags = migrate->src[idx] & ((1UL << MIGRATE_PFN_SHIFT) - 1);
>> +	pfn = migrate->src[idx] >> MIGRATE_PFN_SHIFT;
>> +	for (i = 1; i < HPAGE_PMD_NR; i++)
>> +		migrate->src[i+idx] = migrate_pfn(pfn + i) | flags;
>> +}
>>  #else /* !CONFIG_ARCH_ENABLE_THP_MIGRATION */
>>  static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
>>  					 unsigned long addr,
>> @@ -822,6 +840,11 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
>>  {
>>  	return 0;
>>  }
>> +
>> +static void migrate_vma_split_pages(struct migrate_vma *migrate,
>> +					unsigned long idx, unsigned long addr,
>> +					struct folio *folio)
>> +{}
>>  #endif
>>  
>>  /*
>> @@ -971,8 +994,9 @@ static void __migrate_device_pages(unsigned long *src_pfns,
>>  				struct migrate_vma *migrate)
>>  {
>>  	struct mmu_notifier_range range;
>> -	unsigned long i;
>> +	unsigned long i, j;
>>  	bool notified = false;
>> +	unsigned long addr;
>>  
>>  	for (i = 0; i < npages; ) {
>>  		struct page *newpage = migrate_pfn_to_page(dst_pfns[i]);
>> @@ -1014,12 +1038,16 @@ static void __migrate_device_pages(unsigned long *src_pfns,
>>  				(!(dst_pfns[i] & MIGRATE_PFN_COMPOUND))) {
>>  				nr = HPAGE_PMD_NR;
>>  				src_pfns[i] &= ~MIGRATE_PFN_COMPOUND;
>> -				src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
>> -				goto next;
>> +			} else {
>> +				nr = 1;
>>  			}
>>  
>> -			migrate_vma_insert_page(migrate, addr, &dst_pfns[i],
>> -						&src_pfns[i]);
>> +			for (j = 0; j < nr && i + j < npages; j++) {
>> +				src_pfns[i+j] |= MIGRATE_PFN_MIGRATE;
>> +				migrate_vma_insert_page(migrate,
>> +					addr + j * PAGE_SIZE,
>> +					&dst_pfns[i+j], &src_pfns[i+j]);
>> +			}
>>  			goto next;
>>  		}
>>  
>> @@ -1041,7 +1069,9 @@ static void __migrate_device_pages(unsigned long *src_pfns,
>>  							 MIGRATE_PFN_COMPOUND);
>>  					goto next;
>>  				}
>> -				src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
>> +				nr = 1 << folio_order(folio);
>> +				addr = migrate->start + i * PAGE_SIZE;
>> +				migrate_vma_split_pages(migrate, i, addr, folio);
>>  			} else if ((src_pfns[i] & MIGRATE_PFN_MIGRATE) &&
>>  				(dst_pfns[i] & MIGRATE_PFN_COMPOUND) &&
>>  				!(src_pfns[i] & MIGRATE_PFN_COMPOUND)) {
>> @@ -1076,12 +1106,17 @@ static void __migrate_device_pages(unsigned long *src_pfns,
>>  		BUG_ON(folio_test_writeback(folio));
>>  
>>  		if (migrate && migrate->fault_page == page)
>> -			extra_cnt = 1;
>> -		r = folio_migrate_mapping(mapping, newfolio, folio, extra_cnt);
>> -		if (r != MIGRATEPAGE_SUCCESS)
>> -			src_pfns[i] &= ~MIGRATE_PFN_MIGRATE;
>> -		else
>> -			folio_migrate_flags(newfolio, folio);
>> +			extra_cnt++;
>> +		for (j = 0; j < nr && i + j < npages; j++) {
>> +			folio = page_folio(migrate_pfn_to_page(src_pfns[i+j]));
>> +			newfolio = page_folio(migrate_pfn_to_page(dst_pfns[i+j]));
>> +
>> +			r = folio_migrate_mapping(mapping, newfolio, folio, extra_cnt);
>> +			if (r != MIGRATEPAGE_SUCCESS)
>> +				src_pfns[i+j] &= ~MIGRATE_PFN_MIGRATE;
>> +			else
>> +				folio_migrate_flags(newfolio, folio);
>> +		}
>>  next:
>>  		i += nr;
>>  	}

Re: [v1 resend 08/12] mm/thp: add split during migration support
Posted by Balbir Singh 3 months ago
On 7/4/25 16:43, Mika Penttilä wrote:
> 
> On 7/4/25 08:17, Mika Penttilä wrote:
>> On 7/4/25 02:35, Balbir Singh wrote:
>>> Support splitting pages during THP zone device migration as needed.
>>> The common case that arises is that after setup, during migrate
>>> the destination might not be able to allocate MIGRATE_PFN_COMPOUND
>>> pages.
>>>
>>> Add a new routine migrate_vma_split_pages() to support the splitting
>>> of already isolated pages. The pages being migrated are already unmapped
>>> and marked for migration during setup (via unmap). folio_split() and
>>> __split_unmapped_folio() take additional isolated arguments, to avoid
>>> unmapping and remaping these pages and unlocking/putting the folio.
>>>
>>> Cc: Karol Herbst <kherbst@redhat.com>
>>> Cc: Lyude Paul <lyude@redhat.com>
>>> Cc: Danilo Krummrich <dakr@kernel.org>
>>> Cc: David Airlie <airlied@gmail.com>
>>> Cc: Simona Vetter <simona@ffwll.ch>
>>> Cc: "Jérôme Glisse" <jglisse@redhat.com>
>>> Cc: Shuah Khan <shuah@kernel.org>
>>> Cc: David Hildenbrand <david@redhat.com>
>>> Cc: Barry Song <baohua@kernel.org>
>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>> Cc: Matthew Wilcox <willy@infradead.org>
>>> Cc: Peter Xu <peterx@redhat.com>
>>> Cc: Zi Yan <ziy@nvidia.com>
>>> Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
>>> Cc: Jane Chu <jane.chu@oracle.com>
>>> Cc: Alistair Popple <apopple@nvidia.com>
>>> Cc: Donet Tom <donettom@linux.ibm.com>
>>>
>>> Signed-off-by: Balbir Singh <balbirs@nvidia.com>
>>> ---
>>>  include/linux/huge_mm.h | 11 ++++++--
>>>  mm/huge_memory.c        | 54 ++++++++++++++++++++-----------------
>>>  mm/migrate_device.c     | 59 ++++++++++++++++++++++++++++++++---------
>>>  3 files changed, 85 insertions(+), 39 deletions(-)
>>>
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index 65a1bdf29bb9..5f55a754e57c 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -343,8 +343,8 @@ unsigned long thp_get_unmapped_area_vmflags(struct file *filp, unsigned long add
>>>  		vm_flags_t vm_flags);
>>>  
>>>  bool can_split_folio(struct folio *folio, int caller_pins, int *pextra_pins);
>>> -int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>> -		unsigned int new_order);
>>> +int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>> +		unsigned int new_order, bool isolated);
>>>  int min_order_for_split(struct folio *folio);
>>>  int split_folio_to_list(struct folio *folio, struct list_head *list);
>>>  bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>> @@ -353,6 +353,13 @@ bool non_uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>  		bool warns);
>>>  int folio_split(struct folio *folio, unsigned int new_order, struct page *page,
>>>  		struct list_head *list);
>>> +
>>> +static inline int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>> +		unsigned int new_order)
>>> +{
>>> +	return __split_huge_page_to_list_to_order(page, list, new_order, false);
>>> +}
>>> +
>>>  /*
>>>   * try_folio_split - try to split a @folio at @page using non uniform split.
>>>   * @folio: folio to be split
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index d55e36ae0c39..e00ddfed22fa 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -3424,15 +3424,6 @@ static void __split_folio_to_order(struct folio *folio, int old_order,
>>>  		new_folio->mapping = folio->mapping;
>>>  		new_folio->index = folio->index + i;
>>>  
>>> -		/*
>>> -		 * page->private should not be set in tail pages. Fix up and warn once
>>> -		 * if private is unexpectedly set.
>>> -		 */
>>> -		if (unlikely(new_folio->private)) {
>>> -			VM_WARN_ON_ONCE_PAGE(true, new_head);
>>> -			new_folio->private = NULL;
>>> -		}
>>> -
>>>  		if (folio_test_swapcache(folio))
>>>  			new_folio->swap.val = folio->swap.val + i;
>>>  
>>> @@ -3519,7 +3510,7 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>>>  		struct page *split_at, struct page *lock_at,
>>>  		struct list_head *list, pgoff_t end,
>>>  		struct xa_state *xas, struct address_space *mapping,
>>> -		bool uniform_split)
>>> +		bool uniform_split, bool isolated)
>>>  {
>>>  	struct lruvec *lruvec;
>>>  	struct address_space *swap_cache = NULL;
>>> @@ -3643,8 +3634,9 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>>>  				percpu_ref_get_many(&release->pgmap->ref,
>>>  							(1 << new_order) - 1);
>>>  
>>> -			lru_add_split_folio(origin_folio, release, lruvec,
>>> -					list);
>>> +			if (!isolated)
>>> +				lru_add_split_folio(origin_folio, release,
>>> +							lruvec, list);
>>>  
>>>  			/* Some pages can be beyond EOF: drop them from cache */
>>>  			if (release->index >= end) {
>>> @@ -3697,6 +3689,12 @@ static int __split_unmapped_folio(struct folio *folio, int new_order,
>>>  	if (nr_dropped)
>>>  		shmem_uncharge(mapping->host, nr_dropped);
>>>  
>>> +	/*
>>> +	 * Don't remap and unlock isolated folios
>>> +	 */
>>> +	if (isolated)
>>> +		return ret;
>>> +
>>>  	remap_page(origin_folio, 1 << order,
>>>  			folio_test_anon(origin_folio) ?
>>>  				RMP_USE_SHARED_ZEROPAGE : 0);
>>> @@ -3790,6 +3788,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>   * @lock_at: a page within @folio to be left locked to caller
>>>   * @list: after-split folios will be put on it if non NULL
>>>   * @uniform_split: perform uniform split or not (non-uniform split)
>>> + * @isolated: The pages are already unmapped
>>>   *
>>>   * It calls __split_unmapped_folio() to perform uniform and non-uniform split.
>>>   * It is in charge of checking whether the split is supported or not and
>>> @@ -3800,7 +3799,7 @@ bool uniform_split_supported(struct folio *folio, unsigned int new_order,
>>>   */
>>>  static int __folio_split(struct folio *folio, unsigned int new_order,
>>>  		struct page *split_at, struct page *lock_at,
>>> -		struct list_head *list, bool uniform_split)
>>> +		struct list_head *list, bool uniform_split, bool isolated)
>>>  {
>>>  	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
>>>  	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
>>> @@ -3846,14 +3845,16 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>  		 * is taken to serialise against parallel split or collapse
>>>  		 * operations.
>>>  		 */
>>> -		anon_vma = folio_get_anon_vma(folio);
>>> -		if (!anon_vma) {
>>> -			ret = -EBUSY;
>>> -			goto out;
>>> +		if (!isolated) {
>>> +			anon_vma = folio_get_anon_vma(folio);
>>> +			if (!anon_vma) {
>>> +				ret = -EBUSY;
>>> +				goto out;
>>> +			}
>>> +			anon_vma_lock_write(anon_vma);
>>>  		}
>>>  		end = -1;
>>>  		mapping = NULL;
>>> -		anon_vma_lock_write(anon_vma);
>>>  	} else {
>>>  		unsigned int min_order;
>>>  		gfp_t gfp;
>>> @@ -3920,7 +3921,8 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>  		goto out_unlock;
>>>  	}
>>>  
>>> -	unmap_folio(folio);
>>> +	if (!isolated)
>>> +		unmap_folio(folio);
>>>  
>>>  	/* block interrupt reentry in xa_lock and spinlock */
>>>  	local_irq_disable();
>>> @@ -3973,14 +3975,15 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>  
>>>  		ret = __split_unmapped_folio(folio, new_order,
>>>  				split_at, lock_at, list, end, &xas, mapping,
>>> -				uniform_split);
>>> +				uniform_split, isolated);
>>>  	} else {
>>>  		spin_unlock(&ds_queue->split_queue_lock);
>>>  fail:
>>>  		if (mapping)
>>>  			xas_unlock(&xas);
>>>  		local_irq_enable();
>>> -		remap_page(folio, folio_nr_pages(folio), 0);
>>> +		if (!isolated)
>>> +			remap_page(folio, folio_nr_pages(folio), 0);
>>>  		ret = -EAGAIN;
>>>  	}
>>>  
>>> @@ -4046,12 +4049,13 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
>>>   * Returns -EINVAL when trying to split to an order that is incompatible
>>>   * with the folio. Splitting to order 0 is compatible with all folios.
>>>   */
>>> -int split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>> -				     unsigned int new_order)
>>> +int __split_huge_page_to_list_to_order(struct page *page, struct list_head *list,
>>> +				     unsigned int new_order, bool isolated)
>>>  {
>>>  	struct folio *folio = page_folio(page);
>>>  
>>> -	return __folio_split(folio, new_order, &folio->page, page, list, true);
>>> +	return __folio_split(folio, new_order, &folio->page, page, list, true,
>>> +				isolated);
>>>  }
>>>  
>>>  /*
>>> @@ -4080,7 +4084,7 @@ int folio_split(struct folio *folio, unsigned int new_order,
>>>  		struct page *split_at, struct list_head *list)
>>>  {
>>>  	return __folio_split(folio, new_order, split_at, &folio->page, list,
>>> -			false);
>>> +			false, false);
>>>  }
>>>  
>>>  int min_order_for_split(struct folio *folio)
>>> diff --git a/mm/migrate_device.c b/mm/migrate_device.c
>>> index 41d0bd787969..acd2f03b178d 100644
>>> --- a/mm/migrate_device.c
>>> +++ b/mm/migrate_device.c
>>> @@ -813,6 +813,24 @@ static int migrate_vma_insert_huge_pmd_page(struct migrate_vma *migrate,
>>>  		src[i] &= ~MIGRATE_PFN_MIGRATE;
>>>  	return 0;
>>>  }
>>> +
>>> +static void migrate_vma_split_pages(struct migrate_vma *migrate,
>>> +					unsigned long idx, unsigned long addr,
>>> +					struct folio *folio)
>>> +{
>>> +	unsigned long i;
>>> +	unsigned long pfn;
>>> +	unsigned long flags;
>>> +
>>> +	folio_get(folio);
>>> +	split_huge_pmd_address(migrate->vma, addr, true);
>>> +	__split_huge_page_to_list_to_order(folio_page(folio, 0), NULL, 0, true);
>> We already have reference to folio, why is folio_get() needed ?
>>
>> Splitting the page splits pmd for anon folios, why is there split_huge_pmd_address() ?
> 
> Oh I see 
> +	if (!isolated)
> +		unmap_folio(folio);
> 
> which explains the explicit split_huge_pmd_address(migrate->vma, addr, true);
> 
> Still, why the folio_get(folio);?
>  
> 

That is for split_huge_pmd_address(): when called with freeze=true, it drops the
ref count on the page:

	if (freeze)
		put_page(page);
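
To spell out the refcount balance on the caller side (a sketch only, based on the
hunk quoted above; the three-argument split_huge_pmd_address() is the form used in
this series):

	folio_get(folio);		/* +1: compensates for the put_page() below */
	split_huge_pmd_address(migrate->vma, addr, true); /* freeze=true ends in put_page() */
	/* net refcount change is zero; the folio stays pinned for the folio split */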

Balbir

Re: [v1 resend 08/12] mm/thp: add split during migration support
Posted by Mika Penttilä 3 months ago
>>>> +static void migrate_vma_split_pages(struct migrate_vma *migrate,
>>>> +					unsigned long idx, unsigned long addr,
>>>> +					struct folio *folio)
>>>> +{
>>>> +	unsigned long i;
>>>> +	unsigned long pfn;
>>>> +	unsigned long flags;
>>>> +
>>>> +	folio_get(folio);
>>>> +	split_huge_pmd_address(migrate->vma, addr, true);
>>>> +	__split_huge_page_to_list_to_order(folio_page(folio, 0), NULL, 0, true);
>>> We already have reference to folio, why is folio_get() needed ?
>>>
>>> Splitting the page splits pmd for anon folios, why is there split_huge_pmd_address() ?
>> Oh I see 
>> +	if (!isolated)
>> +		unmap_folio(folio);
>>
>> which explains the explicit split_huge_pmd_address(migrate->vma, addr, true);
>>
>> Still, why the folio_get(folio);?
>>  
>>
> That is for split_huge_pmd_address, when called with freeze=true, it drops the
> ref count on the page
>
> 	if (freeze)
> 		put_page(page);
>
> Balbir
>
Yeah, I guess you could have used the pmd_migration path in __split_huge_pmd_locked() and not used freeze, because you have already installed the migration pmd entry.
Which brings up a bigger concern: you do need the freeze semantics, such as clearing PageAnonExclusive (which may fail). I think you did not get this part
right in the 3/12 patch. And in this patch you can't assume the split succeeds, which would mean you can't migrate the range at all.
Doing the split this late is quite problematic all in all.



--Mika
Re: [v1 resend 08/12] mm/thp: add split during migration support
Posted by Balbir Singh 3 months ago
On 7/5/25 13:17, Mika Penttilä wrote:
> 
>>>>> +static void migrate_vma_split_pages(struct migrate_vma *migrate,
>>>>> +					unsigned long idx, unsigned long addr,
>>>>> +					struct folio *folio)
>>>>> +{
>>>>> +	unsigned long i;
>>>>> +	unsigned long pfn;
>>>>> +	unsigned long flags;
>>>>> +
>>>>> +	folio_get(folio);
>>>>> +	split_huge_pmd_address(migrate->vma, addr, true);
>>>>> +	__split_huge_page_to_list_to_order(folio_page(folio, 0), NULL, 0, true);
>>>> We already have reference to folio, why is folio_get() needed ?
>>>>
>>>> Splitting the page splits pmd for anon folios, why is there split_huge_pmd_address() ?
>>> Oh I see 
>>> +	if (!isolated)
>>> +		unmap_folio(folio);
>>>
>>> which explains the explicit split_huge_pmd_address(migrate->vma, addr, true);
>>>
>>> Still, why the folio_get(folio);?
>>>  
>>>
>> That is for split_huge_pmd_address, when called with freeze=true, it drops the
>> ref count on the page
>>
>> 	if (freeze)
>> 		put_page(page);
>>
>> Balbir
>>
> yeah I guess you could have used the pmd_migration path in __split_huge_pmd_locked, and not use freeze because you have installed the migration pmd entry already.
> Which brings to a bigger concern, that you do need the freeze semantics, like clear PageAnonExclusive (which may fail). I think you did not get this part
> right in the 3/12 patch. And in this patch, you can't assume the split succeeds, which would mean you can't migrate the range at all.
> Doing the split this late is quite problematic all in all.
> 

Clearing PageAnonExclusive will *not* fail for device private pages, from what I can see in __folio_try_share_anon_rmap().
Doing the split late is a requirement due to the nature of the three-stage migration operation: the other side
might fail to allocate THP-sized pages, so the code needs to deal with that.
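
For context, the driver-side flow is roughly the following (a sketch only;
setting up the migrate_vma fields and the data copy are omitted, and
alloc_dst_page() is a made-up placeholder for the driver's allocator):

	unsigned long i;
	struct migrate_vma args;

	/* stage 1: isolate and unmap the source pages */
	if (migrate_vma_setup(&args))
		return;

	/*
	 * stage 2: allocate destination memory; a THP-sized allocation can
	 * fail here, after the source has already been unmapped, which is
	 * why the split has to be handled this late.
	 */
	for (i = 0; i < args.npages; i++) {
		struct page *dpage = alloc_dst_page(&args, i);	/* hypothetical */

		args.dst[i] = dpage ? migrate_pfn(page_to_pfn(dpage)) : 0;
	}
	migrate_vma_pages(&args);

	/* stage 3: finalize and drop the remaining references */
	migrate_vma_finalize(&args);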

Balbir Singh
Re: [v1 resend 08/12] mm/thp: add split during migration support
Posted by Mika Penttilä 3 months ago
On 7/7/25 05:35, Balbir Singh wrote:
> On 7/5/25 13:17, Mika Penttilä wrote:
>>>>>> +static void migrate_vma_split_pages(struct migrate_vma *migrate,
>>>>>> +					unsigned long idx, unsigned long addr,
>>>>>> +					struct folio *folio)
>>>>>> +{
>>>>>> +	unsigned long i;
>>>>>> +	unsigned long pfn;
>>>>>> +	unsigned long flags;
>>>>>> +
>>>>>> +	folio_get(folio);
>>>>>> +	split_huge_pmd_address(migrate->vma, addr, true);
>>>>>> +	__split_huge_page_to_list_to_order(folio_page(folio, 0), NULL, 0, true);
>>>>> We already have reference to folio, why is folio_get() needed ?
>>>>>
>>>>> Splitting the page splits pmd for anon folios, why is there split_huge_pmd_address() ?
>>>> Oh I see 
>>>> +	if (!isolated)
>>>> +		unmap_folio(folio);
>>>>
>>>> which explains the explicit split_huge_pmd_address(migrate->vma, addr, true);
>>>>
>>>> Still, why the folio_get(folio);?
>>>>  
>>>>
>>> That is for split_huge_pmd_address, when called with freeze=true, it drops the
>>> ref count on the page
>>>
>>> 	if (freeze)
>>> 		put_page(page);
>>>
>>> Balbir
>>>
>> yeah I guess you could have used the pmd_migration path in __split_huge_pmd_locked, and not use freeze because you have installed the migration pmd entry already.
>> Which brings to a bigger concern, that you do need the freeze semantics, like clear PageAnonExclusive (which may fail). I think you did not get this part
>> right in the 3/12 patch. And in this patch, you can't assume the split succeeds, which would mean you can't migrate the range at all.
>> Doing the split this late is quite problematic all in all.
>>
> Clearing PageAnonExclusive will *not* fail for device private pages from what I can see in __folio_try_share_anon_rmap().
> Doing the split late is a requirement due to the nature of the three stage migration operation, the other side
> might fail to allocate THP sized pages and so the code needs to deal with it
>
> Balbir Singh

Yes, it seems clearing PageAnonExclusive doesn't fail for device private pages in the end,
but the 3/12 patch doesn't even try to clear PageAnonExclusive with your changes AFAICS,
which is a separate issue.

And __split_huge_page_to_list_to_order() (whose return value is not checked) can fail with out of memory.
So I don't think you can just assume the split works. If the late split is a requirement (and I can understand that it is),
you should be prepared to roll back the operation somehow.
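
Something along these lines, just as a sketch (not tested), which would also
mean migrate_vma_split_pages() can no longer return void:

	int ret;

	ret = __split_huge_page_to_list_to_order(folio_page(folio, 0), NULL, 0, true);
	if (ret)
		/* e.g. -ENOMEM: the folio was not split */
		return ret;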

>
--Mika

Re: [v1 resend 08/12] mm/thp: add split during migration support
Posted by Balbir Singh 3 months ago
On 7/7/25 13:29, Mika Penttilä wrote:
> On 7/7/25 05:35, Balbir Singh wrote:
>> On 7/5/25 13:17, Mika Penttilä wrote:
>>>>>>> +static void migrate_vma_split_pages(struct migrate_vma *migrate,
>>>>>>> +					unsigned long idx, unsigned long addr,
>>>>>>> +					struct folio *folio)
>>>>>>> +{
>>>>>>> +	unsigned long i;
>>>>>>> +	unsigned long pfn;
>>>>>>> +	unsigned long flags;
>>>>>>> +
>>>>>>> +	folio_get(folio);
>>>>>>> +	split_huge_pmd_address(migrate->vma, addr, true);
>>>>>>> +	__split_huge_page_to_list_to_order(folio_page(folio, 0), NULL, 0, true);
>>>>>> We already have reference to folio, why is folio_get() needed ?
>>>>>>
>>>>>> Splitting the page splits pmd for anon folios, why is there split_huge_pmd_address() ?
>>>>> Oh I see 
>>>>> +	if (!isolated)
>>>>> +		unmap_folio(folio);
>>>>>
>>>>> which explains the explicit split_huge_pmd_address(migrate->vma, addr, true);
>>>>>
>>>>> Still, why the folio_get(folio);?
>>>>>  
>>>>>
>>>> That is for split_huge_pmd_address, when called with freeze=true, it drops the
>>>> ref count on the page
>>>>
>>>> 	if (freeze)
>>>> 		put_page(page);
>>>>
>>>> Balbir
>>>>
>>> yeah I guess you could have used the pmd_migration path in __split_huge_pmd_locked, and not use freeze because you have installed the migration pmd entry already.
>>> Which brings to a bigger concern, that you do need the freeze semantics, like clear PageAnonExclusive (which may fail). I think you did not get this part
>>> right in the 3/12 patch. And in this patch, you can't assume the split succeeds, which would mean you can't migrate the range at all.
>>> Doing the split this late is quite problematic all in all.
>>>
>> Clearing PageAnonExclusive will *not* fail for device private pages from what I can see in __folio_try_share_anon_rmap().
>> Doing the split late is a requirement due to the nature of the three stage migration operation, the other side
>> might fail to allocate THP sized pages and so the code needs to deal with it
>>
>> Balbir Singh
> 
> Yes seems clearing PageAnonExclusive doesn't fail for device private pages in the end, 
> but the 3/12 patch doesn't even try to clear PageAnonExclusive with your changes afaics,
> which is a separate issue.
> 
> And __split_huge_page_to_list_to_order() (return value is not checked) can fail for out of memory.
> So think you can not just assume split just works. If late split is a requirement (which I can understand is),
> you should be prepared to rollback somehow the operation.
> 

I'll add a check; rolling back is just setting up the entries so they are not migrated.
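
Roughly like this (a sketch on top of the hunk above; the exact rollback may
end up looking different):

	if (__split_huge_page_to_list_to_order(folio_page(folio, 0), NULL, 0, true)) {
		/*
		 * Split failed: clear the migrate flag so this entry is
		 * skipped rather than migrated.
		 */
		migrate->src[idx] &= ~MIGRATE_PFN_MIGRATE;
		return;
	}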

Thanks,
Balbir Singh