[PATCH v12 mm-new 07/15] khugepaged: generalize collapse_huge_page for mTHP collapse

Posted by Nico Pache 3 months, 2 weeks ago
Pass an order and offset to collapse_huge_page to support collapsing anon
memory to arbitrary orders within a PMD. order indicates what mTHP size we
are attempting to collapse to, and offset indicates where in the PMD to
start the collapse attempt.

For non-PMD collapse we must leave the anon VMA write locked until after
we collapse the mTHP-- in the PMD case all the pages are isolated, but in
the mTHP case this is not true, and we must keep the lock to prevent
changes to the VMA from occurring.

Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 108 ++++++++++++++++++++++++++++++------------------
 1 file changed, 67 insertions(+), 41 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 286c3a7afdee..75e7ebdccc36 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1142,43 +1142,50 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
 	return SCAN_SUCCEED;
 }
 
-static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
-			      int referenced, int unmapped,
-			      struct collapse_control *cc)
+static int collapse_huge_page(struct mm_struct *mm, unsigned long pmd_address,
+		int referenced, int unmapped, struct collapse_control *cc,
+		bool *mmap_locked, unsigned int order, unsigned long offset)
 {
 	LIST_HEAD(compound_pagelist);
 	pmd_t *pmd, _pmd;
-	pte_t *pte;
+	pte_t *pte = NULL, mthp_pte;
 	pgtable_t pgtable;
 	struct folio *folio;
 	spinlock_t *pmd_ptl, *pte_ptl;
 	int result = SCAN_FAIL;
 	struct vm_area_struct *vma;
 	struct mmu_notifier_range range;
+	bool anon_vma_locked = false;
+	const unsigned long nr_pages = 1UL << order;
+	unsigned long mthp_address = pmd_address + offset * PAGE_SIZE;
 
-	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
+	VM_BUG_ON(pmd_address & ~HPAGE_PMD_MASK);
 
 	/*
 	 * Before allocating the hugepage, release the mmap_lock read lock.
 	 * The allocation can take potentially a long time if it involves
 	 * sync compaction, and we do not need to hold the mmap_lock during
 	 * that. We will recheck the vma after taking it again in write mode.
+	 * If collapsing mTHPs we may have already released the read_lock.
 	 */
-	mmap_read_unlock(mm);
+	if (*mmap_locked) {
+		mmap_read_unlock(mm);
+		*mmap_locked = false;
+	}
 
-	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
+	result = alloc_charge_folio(&folio, mm, cc, order);
 	if (result != SCAN_SUCCEED)
 		goto out_nolock;
 
 	mmap_read_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
-					 HPAGE_PMD_ORDER);
+	*mmap_locked = true;
+	result = hugepage_vma_revalidate(mm, pmd_address, true, &vma, cc, order);
 	if (result != SCAN_SUCCEED) {
 		mmap_read_unlock(mm);
 		goto out_nolock;
 	}
 
-	result = find_pmd_or_thp_or_none(mm, address, &pmd);
+	result = find_pmd_or_thp_or_none(mm, pmd_address, &pmd);
 	if (result != SCAN_SUCCEED) {
 		mmap_read_unlock(mm);
 		goto out_nolock;
@@ -1190,13 +1197,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 		 * released when it fails. So we jump out_nolock directly in
 		 * that case.  Continuing to collapse causes inconsistency.
 		 */
-		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
-						     referenced, HPAGE_PMD_ORDER);
+		result = __collapse_huge_page_swapin(mm, vma, mthp_address, pmd,
+						     referenced, order);
 		if (result != SCAN_SUCCEED)
 			goto out_nolock;
 	}
 
 	mmap_read_unlock(mm);
+	*mmap_locked = false;
 	/*
 	 * Prevent all access to pagetables with the exception of
 	 * gup_fast later handled by the ptep_clear_flush and the VM
@@ -1206,20 +1214,20 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 * mmap_lock.
 	 */
 	mmap_write_lock(mm);
-	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
-					 HPAGE_PMD_ORDER);
+	result = hugepage_vma_revalidate(mm, pmd_address, true, &vma, cc, order);
 	if (result != SCAN_SUCCEED)
 		goto out_up_write;
 	/* check if the pmd is still valid */
 	vma_start_write(vma);
-	result = check_pmd_still_valid(mm, address, pmd);
+	result = check_pmd_still_valid(mm, pmd_address, pmd);
 	if (result != SCAN_SUCCEED)
 		goto out_up_write;
 
 	anon_vma_lock_write(vma->anon_vma);
+	anon_vma_locked = true;
 
-	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
-				address + HPAGE_PMD_SIZE);
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, mthp_address,
+				mthp_address + (PAGE_SIZE << order));
 	mmu_notifier_invalidate_range_start(&range);
 
 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
@@ -1231,24 +1239,21 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 * Parallel GUP-fast is fine since GUP-fast will back off when
 	 * it detects PMD is changed.
 	 */
-	_pmd = pmdp_collapse_flush(vma, address, pmd);
+	_pmd = pmdp_collapse_flush(vma, pmd_address, pmd);
 	spin_unlock(pmd_ptl);
 	mmu_notifier_invalidate_range_end(&range);
 	tlb_remove_table_sync_one();
 
-	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
+	pte = pte_offset_map_lock(mm, &_pmd, mthp_address, &pte_ptl);
 	if (pte) {
-		result = __collapse_huge_page_isolate(vma, address, pte, cc,
-						      HPAGE_PMD_ORDER,
-						      &compound_pagelist);
+		result = __collapse_huge_page_isolate(vma, mthp_address, pte, cc,
+						      order, &compound_pagelist);
 		spin_unlock(pte_ptl);
 	} else {
 		result = SCAN_PMD_NULL;
 	}
 
 	if (unlikely(result != SCAN_SUCCEED)) {
-		if (pte)
-			pte_unmap(pte);
 		spin_lock(pmd_ptl);
 		BUG_ON(!pmd_none(*pmd));
 		/*
@@ -1258,21 +1263,21 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 		 */
 		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
 		spin_unlock(pmd_ptl);
-		anon_vma_unlock_write(vma->anon_vma);
 		goto out_up_write;
 	}
 
 	/*
-	 * All pages are isolated and locked so anon_vma rmap
-	 * can't run anymore.
+	 * For PMD collapse all pages are isolated and locked so anon_vma
+	 * rmap can't run anymore. For mTHP collapse we must hold the lock
 	 */
-	anon_vma_unlock_write(vma->anon_vma);
+	if (order == HPAGE_PMD_ORDER) {
+		anon_vma_unlock_write(vma->anon_vma);
+		anon_vma_locked = false;
+	}
 
 	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
-					   vma, address, pte_ptl,
-					   HPAGE_PMD_ORDER,
-					   &compound_pagelist);
-	pte_unmap(pte);
+					   vma, mthp_address, pte_ptl,
+					   order, &compound_pagelist);
 	if (unlikely(result != SCAN_SUCCEED))
 		goto out_up_write;
 
@@ -1282,20 +1287,42 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
 	 * write.
 	 */
 	__folio_mark_uptodate(folio);
-	pgtable = pmd_pgtable(_pmd);
+	if (order == HPAGE_PMD_ORDER) {
+		pgtable = pmd_pgtable(_pmd);
 
-	spin_lock(pmd_ptl);
-	BUG_ON(!pmd_none(*pmd));
-	pgtable_trans_huge_deposit(mm, pmd, pgtable);
-	map_anon_folio_pmd_nopf(folio, pmd, vma, address);
-	spin_unlock(pmd_ptl);
+		spin_lock(pmd_ptl);
+		WARN_ON_ONCE(!pmd_none(*pmd));
+		pgtable_trans_huge_deposit(mm, pmd, pgtable);
+		map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_address);
+		spin_unlock(pmd_ptl);
+	} else { /* mTHP collapse */
+		mthp_pte = mk_pte(folio_page(folio, 0), vma->vm_page_prot);
+		mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
+
+		spin_lock(pmd_ptl);
+		WARN_ON_ONCE(!pmd_none(*pmd));
+		folio_ref_add(folio, nr_pages - 1);
+		folio_add_new_anon_rmap(folio, vma, mthp_address, RMAP_EXCLUSIVE);
+		folio_add_lru_vma(folio, vma);
+		set_ptes(vma->vm_mm, mthp_address, pte, mthp_pte, nr_pages);
+		update_mmu_cache_range(NULL, vma, mthp_address, pte, nr_pages);
+
+		smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
+		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
+		spin_unlock(pmd_ptl);
+	}
 
 	folio = NULL;
 
 	result = SCAN_SUCCEED;
 out_up_write:
+	if (anon_vma_locked)
+		anon_vma_unlock_write(vma->anon_vma);
+	if (pte)
+		pte_unmap(pte);
 	mmap_write_unlock(mm);
 out_nolock:
+	*mmap_locked = false;
 	if (folio)
 		folio_put(folio);
 	trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
@@ -1463,9 +1490,8 @@ static int collapse_scan_pmd(struct mm_struct *mm,
 	pte_unmap_unlock(pte, ptl);
 	if (result == SCAN_SUCCEED) {
 		result = collapse_huge_page(mm, start_addr, referenced,
-					    unmapped, cc);
-		/* collapse_huge_page will return with the mmap_lock released */
-		*mmap_locked = false;
+					    unmapped, cc, mmap_locked,
+					    HPAGE_PMD_ORDER, 0);
 	}
 out:
 	trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
-- 
2.51.0
Re: [PATCH v12 mm-new 07/15] khugepaged: generalize collapse_huge_page for mTHP collapse
Posted by Lorenzo Stoakes 3 months ago
On Wed, Oct 22, 2025 at 12:37:09PM -0600, Nico Pache wrote:
> Pass an order and offset to collapse_huge_page to support collapsing anon
> memory to arbitrary orders within a PMD. order indicates what mTHP size we
> are attempting to collapse to, and offset indicates where in the PMD to
> start the collapse attempt.
>
> For non-PMD collapse we must leave the anon VMA write locked until after
> we collapse the mTHP-- in the PMD case all the pages are isolated, but in

NIT but is this -- a typo?

> the mTHP case this is not true, and we must keep the lock to prevent
> changes to the VMA from occurring.
>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 108 ++++++++++++++++++++++++++++++------------------
>  1 file changed, 67 insertions(+), 41 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 286c3a7afdee..75e7ebdccc36 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1142,43 +1142,50 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
>  	return SCAN_SUCCEED;
>  }
>
> -static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> -			      int referenced, int unmapped,
> -			      struct collapse_control *cc)
> +static int collapse_huge_page(struct mm_struct *mm, unsigned long pmd_address,

Presumably pmd_address is the PMD-aligned address?

> +		int referenced, int unmapped, struct collapse_control *cc,
> +		bool *mmap_locked, unsigned int order, unsigned long offset)

It'd be nice to pass through a helper struct at this point having so many params
but perhaps we can deal with that in a follow up series.

If PMD address is the PMD-aligned address, and mthp_address = pmd_address +
offset * PAGE_SIZE, couldn't we just pass in the mthp address and get the
PMD address by aligning down to PMD size and reduce the number of args by
1?
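
I.e. something like this (untested, just to illustrate what I mean):

	const unsigned long pmd_address = mthp_address & HPAGE_PMD_MASK;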

>  {
>  	LIST_HEAD(compound_pagelist);
>  	pmd_t *pmd, _pmd;
> -	pte_t *pte;
> +	pte_t *pte = NULL, mthp_pte;

mthp_pte is only used in a single if () branch and can be declared there
AFAICT?

>  	pgtable_t pgtable;
>  	struct folio *folio;
>  	spinlock_t *pmd_ptl, *pte_ptl;
>  	int result = SCAN_FAIL;
>  	struct vm_area_struct *vma;
>  	struct mmu_notifier_range range;
> +	bool anon_vma_locked = false;
> +	const unsigned long nr_pages = 1UL << order;
> +	unsigned long mthp_address = pmd_address + offset * PAGE_SIZE;

Do we ever update this? If not we can const-ify.

>
> -	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> +	VM_BUG_ON(pmd_address & ~HPAGE_PMD_MASK);

NIT: Be nice to convert this to a VM_WARN_ON_ONCE(), as VM_BUG_ON() is not
right here.

>
>  	/*
>  	 * Before allocating the hugepage, release the mmap_lock read lock.
>  	 * The allocation can take potentially a long time if it involves
>  	 * sync compaction, and we do not need to hold the mmap_lock during
>  	 * that. We will recheck the vma after taking it again in write mode.
> +	 * If collapsing mTHPs we may have already released the read_lock.
>  	 */
> -	mmap_read_unlock(mm);
> +	if (*mmap_locked) {
> +		mmap_read_unlock(mm);
> +		*mmap_locked = false;
> +	}
>
> -	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> +	result = alloc_charge_folio(&folio, mm, cc, order);
>  	if (result != SCAN_SUCCEED)
>  		goto out_nolock;
>
>  	mmap_read_lock(mm);
> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> -					 HPAGE_PMD_ORDER);
> +	*mmap_locked = true;
> +	result = hugepage_vma_revalidate(mm, pmd_address, true, &vma, cc, order);
>  	if (result != SCAN_SUCCEED) {
>  		mmap_read_unlock(mm);

I don't really love the semantics of 'sometimes we set *mmap_locked false
when we unlock, sometimes we rely on out_nolock doing it'.

Let's just set it false when we unlock and VM_WARN_ON_ONCE(*mmap_locked) in
out_nolock.
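
I.e. roughly (untested sketch of the pattern I have in mind):

	if (result != SCAN_SUCCEED) {
		mmap_read_unlock(mm);
		*mmap_locked = false;
		goto out_nolock;
	}
	...
out_nolock:
	VM_WARN_ON_ONCE(*mmap_locked);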

>  		goto out_nolock;
>  	}
>
> -	result = find_pmd_or_thp_or_none(mm, address, &pmd);
> +	result = find_pmd_or_thp_or_none(mm, pmd_address, &pmd);
>  	if (result != SCAN_SUCCEED) {
>  		mmap_read_unlock(mm);
>  		goto out_nolock;
> @@ -1190,13 +1197,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  		 * released when it fails. So we jump out_nolock directly in
>  		 * that case.  Continuing to collapse causes inconsistency.
>  		 */
> -		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> -						     referenced, HPAGE_PMD_ORDER);
> +		result = __collapse_huge_page_swapin(mm, vma, mthp_address, pmd,
> +						     referenced, order);
>  		if (result != SCAN_SUCCEED)
>  			goto out_nolock;
>  	}
>
>  	mmap_read_unlock(mm);
> +	*mmap_locked = false;
>  	/*
>  	 * Prevent all access to pagetables with the exception of
>  	 * gup_fast later handled by the ptep_clear_flush and the VM
> @@ -1206,20 +1214,20 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	 * mmap_lock.
>  	 */
>  	mmap_write_lock(mm);
> -	result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> -					 HPAGE_PMD_ORDER);
> +	result = hugepage_vma_revalidate(mm, pmd_address, true, &vma, cc, order);
>  	if (result != SCAN_SUCCEED)
>  		goto out_up_write;
>  	/* check if the pmd is still valid */
>  	vma_start_write(vma);
> -	result = check_pmd_still_valid(mm, address, pmd);
> +	result = check_pmd_still_valid(mm, pmd_address, pmd);
>  	if (result != SCAN_SUCCEED)
>  		goto out_up_write;
>
>  	anon_vma_lock_write(vma->anon_vma);
> +	anon_vma_locked = true;
>
> -	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> -				address + HPAGE_PMD_SIZE);
> +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, mthp_address,
> +				mthp_address + (PAGE_SIZE << order));
>  	mmu_notifier_invalidate_range_start(&range);
>
>  	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> @@ -1231,24 +1239,21 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	 * Parallel GUP-fast is fine since GUP-fast will back off when
>  	 * it detects PMD is changed.
>  	 */
> -	_pmd = pmdp_collapse_flush(vma, address, pmd);
> +	_pmd = pmdp_collapse_flush(vma, pmd_address, pmd);

Not your fault, but I so hate this _p** convention. One for a follow-up I
suppose.

>  	spin_unlock(pmd_ptl);
>  	mmu_notifier_invalidate_range_end(&range);
>  	tlb_remove_table_sync_one();
>
> -	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> +	pte = pte_offset_map_lock(mm, &_pmd, mthp_address, &pte_ptl);
>  	if (pte) {
> -		result = __collapse_huge_page_isolate(vma, address, pte, cc,
> -						      HPAGE_PMD_ORDER,
> -						      &compound_pagelist);
> +		result = __collapse_huge_page_isolate(vma, mthp_address, pte, cc,
> +						      order, &compound_pagelist);
>  		spin_unlock(pte_ptl);
>  	} else {
>  		result = SCAN_PMD_NULL;
>  	}
>
>  	if (unlikely(result != SCAN_SUCCEED)) {
> -		if (pte)
> -			pte_unmap(pte);

OK I guess we drop this because it's handled in out_up_write. I assume no
issue keeping PTE mapped here?

>  		spin_lock(pmd_ptl);
>  		BUG_ON(!pmd_none(*pmd));
>  		/*
> @@ -1258,21 +1263,21 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  		 */
>  		pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>  		spin_unlock(pmd_ptl);
> -		anon_vma_unlock_write(vma->anon_vma);
>  		goto out_up_write;
>  	}
>
>  	/*
> -	 * All pages are isolated and locked so anon_vma rmap
> -	 * can't run anymore.
> +	 * For PMD collapse all pages are isolated and locked so anon_vma
> +	 * rmap can't run anymore. For mTHP collapse we must hold the lock
>  	 */
> -	anon_vma_unlock_write(vma->anon_vma);
> +	if (order == HPAGE_PMD_ORDER) {
> +		anon_vma_unlock_write(vma->anon_vma);
> +		anon_vma_locked = false;
> +	}
>
>  	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> -					   vma, address, pte_ptl,
> -					   HPAGE_PMD_ORDER,
> -					   &compound_pagelist);
> -	pte_unmap(pte);
> +					   vma, mthp_address, pte_ptl,
> +					   order, &compound_pagelist);

Looking through __collapse_huge_page_copy() there doesn't seem to be any
issue with holding anon lock here.

>  	if (unlikely(result != SCAN_SUCCEED))
>  		goto out_up_write;
>
> @@ -1282,20 +1287,42 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	 * write.
>  	 */
>  	__folio_mark_uptodate(folio);
> -	pgtable = pmd_pgtable(_pmd);
> +	if (order == HPAGE_PMD_ORDER) {
> +		pgtable = pmd_pgtable(_pmd);
>
> -	spin_lock(pmd_ptl);
> -	BUG_ON(!pmd_none(*pmd));
> -	pgtable_trans_huge_deposit(mm, pmd, pgtable);
> -	map_anon_folio_pmd_nopf(folio, pmd, vma, address);
> -	spin_unlock(pmd_ptl);
> +		spin_lock(pmd_ptl);
> +		WARN_ON_ONCE(!pmd_none(*pmd));
> +		pgtable_trans_huge_deposit(mm, pmd, pgtable);
> +		map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_address);
> +		spin_unlock(pmd_ptl);
> +	} else { /* mTHP collapse */

As per above, let's just declare mthp_pte here.

> +		mthp_pte = mk_pte(folio_page(folio, 0), vma->vm_page_prot);

Hm, so we make a PTE that references the first page of the folio? I guess
the folio will be an mTHP folio, so we're essentially creating a template PTE
that set_ptes() then advances across the rest of the folio?

> +		mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);

In set_pte_range() we have a whole host of other checks like dirty,
uffd_wp, etc. I wonder if we need to consider those?

> +
> +		spin_lock(pmd_ptl);

We're duplicating this in both branches, why not do outside if/else?

> +		WARN_ON_ONCE(!pmd_none(*pmd));

Hmm so the PMD entry will still always be empty on mTHP collapse? Surely we
could be collapsing more than one mTHP into an existing PTE table no? I may
be missing something here/confused :)

> +		folio_ref_add(folio, nr_pages - 1);

If we're setting the refcount here, where is the ref count being set in the
PMD path?

> +		folio_add_new_anon_rmap(folio, vma, mthp_address, RMAP_EXCLUSIVE);
> +		folio_add_lru_vma(folio, vma);
> +		set_ptes(vma->vm_mm, mthp_address, pte, mthp_pte, nr_pages);
> +		update_mmu_cache_range(NULL, vma, mthp_address, pte, nr_pages);

Prior to this change the only users of this are functions in memory.c; I
do wonder if this is the wrong abstraction here.

But maybe that's _yet another_ thing for a follow up (the THP code is a
mess).

> +
> +		smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */

Feels like we could avoid open-coding this by just using pmd_install()?

Also are we therefore missing a mm_inc_nr_ptes() invocation here, or do we
update mm->pgtables_bytes elsewhere?


> +		pmd_populate(mm, pmd, pmd_pgtable(_pmd));

Why are we referencing pmd in PMD branch and _pmd here?

> +		spin_unlock(pmd_ptl);

The PMD case does this stuff in map_anon_folio_pmd_nopf(); could we add one for
mTHP?

This function is already horribly overwrought (not your fault) so I'd like
to avoid adding open-coded blocks as much as possible.

> +	}
>
>  	folio = NULL;
>
>  	result = SCAN_SUCCEED;
>  out_up_write:
> +	if (anon_vma_locked)
> +		anon_vma_unlock_write(vma->anon_vma);
> +	if (pte)
> +		pte_unmap(pte);
>  	mmap_write_unlock(mm);
>  out_nolock:
> +	*mmap_locked = false;

See above comment about setting this prior to jumping to out_nolock.

>  	if (folio)
>  		folio_put(folio);
>  	trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
> @@ -1463,9 +1490,8 @@ static int collapse_scan_pmd(struct mm_struct *mm,
>  	pte_unmap_unlock(pte, ptl);
>  	if (result == SCAN_SUCCEED) {
>  		result = collapse_huge_page(mm, start_addr, referenced,
> -					    unmapped, cc);
> -		/* collapse_huge_page will return with the mmap_lock released */
> -		*mmap_locked = false;
> +					    unmapped, cc, mmap_locked,
> +					    HPAGE_PMD_ORDER, 0);
>  	}
>  out:
>  	trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
> --
> 2.51.0
>
Re: [PATCH v12 mm-new 07/15] khugepaged: generalize collapse_huge_page for mTHP collapse
Posted by Nico Pache 3 months ago
On Thu, Nov 6, 2025 at 11:15 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Wed, Oct 22, 2025 at 12:37:09PM -0600, Nico Pache wrote:
> > Pass an order and offset to collapse_huge_page to support collapsing anon
> > memory to arbitrary orders within a PMD. order indicates what mTHP size we
> > are attempting to collapse to, and offset indicates where in the PMD to
> > start the collapse attempt.
> >
> > For non-PMD collapse we must leave the anon VMA write locked until after
> > we collapse the mTHP-- in the PMD case all the pages are isolated, but in
>
> NIT but is this -- a typo?

No, it's an em dash. I can replace it with a period if you'd like, but
both work in this context.

>
> > the mTHP case this is not true, and we must keep the lock to prevent
> > changes to the VMA from occurring.
> >
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >  mm/khugepaged.c | 108 ++++++++++++++++++++++++++++++------------------
> >  1 file changed, 67 insertions(+), 41 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 286c3a7afdee..75e7ebdccc36 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1142,43 +1142,50 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
> >       return SCAN_SUCCEED;
> >  }
> >
> > -static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > -                           int referenced, int unmapped,
> > -                           struct collapse_control *cc)
> > +static int collapse_huge_page(struct mm_struct *mm, unsigned long pmd_address,
>
> Presumably pmd_address is the PMD-aligned address?
>
> > +             int referenced, int unmapped, struct collapse_control *cc,
> > +             bool *mmap_locked, unsigned int order, unsigned long offset)
>
> It'd be nice to pass through a helper struct at this point having so many params
> but perhaps we can deal with that in a follow up series.
>
> If PMD address is the PMD-aligned address, and mthp_address = pmd_address +
> offset * PAGE_SIZE, couldn't we just pass in the mthp address and get the
> PMD address by aligning down to PMD size and reduce the number of args by
> 1?

Yeah that seems like a good idea. Thanks

>
> >  {
> >       LIST_HEAD(compound_pagelist);
> >       pmd_t *pmd, _pmd;
> > -     pte_t *pte;
> > +     pte_t *pte = NULL, mthp_pte;
>
> mthp_pte is only used in a single if () branch and can be declared there
> AFAICT?

ack!

>
> >       pgtable_t pgtable;
> >       struct folio *folio;
> >       spinlock_t *pmd_ptl, *pte_ptl;
> >       int result = SCAN_FAIL;
> >       struct vm_area_struct *vma;
> >       struct mmu_notifier_range range;
> > +     bool anon_vma_locked = false;
> > +     const unsigned long nr_pages = 1UL << order;
> > +     unsigned long mthp_address = pmd_address + offset * PAGE_SIZE;
>
> Do we ever update this? If not we can const-ify.

ack!

>
> >
> > -     VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> > +     VM_BUG_ON(pmd_address & ~HPAGE_PMD_MASK);
>
> NIT: Be nice to convert this to a VM_WARN_ON_ONCE(), as VM_BUG_ON() is not
> right here.
>
> >
> >       /*
> >        * Before allocating the hugepage, release the mmap_lock read lock.
> >        * The allocation can take potentially a long time if it involves
> >        * sync compaction, and we do not need to hold the mmap_lock during
> >        * that. We will recheck the vma after taking it again in write mode.
> > +      * If collapsing mTHPs we may have already released the read_lock.
> >        */
> > -     mmap_read_unlock(mm);
> > +     if (*mmap_locked) {
> > +             mmap_read_unlock(mm);
> > +             *mmap_locked = false;
> > +     }
> >
> > -     result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> > +     result = alloc_charge_folio(&folio, mm, cc, order);
> >       if (result != SCAN_SUCCEED)
> >               goto out_nolock;
> >
> >       mmap_read_lock(mm);
> > -     result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > -                                      HPAGE_PMD_ORDER);
> > +     *mmap_locked = true;
> > +     result = hugepage_vma_revalidate(mm, pmd_address, true, &vma, cc, order);
> >       if (result != SCAN_SUCCEED) {
> >               mmap_read_unlock(mm);
>
> I don't really love the semantics of 'sometimes we set *mmap_locked false
> when we unlock, sometimes we rely on out_nolock doing it'.
>
> Let's just set it false when we unlock and VM_WARN_ON_ONCE(*mmap_locked) in
> out_nolock.

Ok that sounds like a good idea! thanks

>
> >               goto out_nolock;
> >       }
> >
> > -     result = find_pmd_or_thp_or_none(mm, address, &pmd);
> > +     result = find_pmd_or_thp_or_none(mm, pmd_address, &pmd);
> >       if (result != SCAN_SUCCEED) {
> >               mmap_read_unlock(mm);
> >               goto out_nolock;
> > @@ -1190,13 +1197,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >                * released when it fails. So we jump out_nolock directly in
> >                * that case.  Continuing to collapse causes inconsistency.
> >                */
> > -             result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> > -                                                  referenced, HPAGE_PMD_ORDER);
> > +             result = __collapse_huge_page_swapin(mm, vma, mthp_address, pmd,
> > +                                                  referenced, order);
> >               if (result != SCAN_SUCCEED)
> >                       goto out_nolock;
> >       }
> >
> >       mmap_read_unlock(mm);
> > +     *mmap_locked = false;
> >       /*
> >        * Prevent all access to pagetables with the exception of
> >        * gup_fast later handled by the ptep_clear_flush and the VM
> > @@ -1206,20 +1214,20 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >        * mmap_lock.
> >        */
> >       mmap_write_lock(mm);
> > -     result = hugepage_vma_revalidate(mm, address, true, &vma, cc,
> > -                                      HPAGE_PMD_ORDER);
> > +     result = hugepage_vma_revalidate(mm, pmd_address, true, &vma, cc, order);
> >       if (result != SCAN_SUCCEED)
> >               goto out_up_write;
> >       /* check if the pmd is still valid */
> >       vma_start_write(vma);
> > -     result = check_pmd_still_valid(mm, address, pmd);
> > +     result = check_pmd_still_valid(mm, pmd_address, pmd);
> >       if (result != SCAN_SUCCEED)
> >               goto out_up_write;
> >
> >       anon_vma_lock_write(vma->anon_vma);
> > +     anon_vma_locked = true;
> >
> > -     mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> > -                             address + HPAGE_PMD_SIZE);
> > +     mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, mthp_address,
> > +                             mthp_address + (PAGE_SIZE << order));
> >       mmu_notifier_invalidate_range_start(&range);
> >
> >       pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> > @@ -1231,24 +1239,21 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >        * Parallel GUP-fast is fine since GUP-fast will back off when
> >        * it detects PMD is changed.
> >        */
> > -     _pmd = pmdp_collapse_flush(vma, address, pmd);
> > +     _pmd = pmdp_collapse_flush(vma, pmd_address, pmd);
>
> Not your fault, but I so hate this _p** convention. One for a follow-up I
> suppose.
>
> >       spin_unlock(pmd_ptl);
> >       mmu_notifier_invalidate_range_end(&range);
> >       tlb_remove_table_sync_one();
> >
> > -     pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> > +     pte = pte_offset_map_lock(mm, &_pmd, mthp_address, &pte_ptl);
> >       if (pte) {
> > -             result = __collapse_huge_page_isolate(vma, address, pte, cc,
> > -                                                   HPAGE_PMD_ORDER,
> > -                                                   &compound_pagelist);
> > +             result = __collapse_huge_page_isolate(vma, mthp_address, pte, cc,
> > +                                                   order, &compound_pagelist);
> >               spin_unlock(pte_ptl);
> >       } else {
> >               result = SCAN_PMD_NULL;
> >       }
> >
> >       if (unlikely(result != SCAN_SUCCEED)) {
> > -             if (pte)
> > -                     pte_unmap(pte);
>
> OK I guess we drop this because it's handled in out_up_write. I assume no
> issue keeping PTE mapped here?

Correct, I don't think there are any issues here. The checks for pte
and anon_vma_locked in out_up_write should keep everything in order.

>
> >               spin_lock(pmd_ptl);
> >               BUG_ON(!pmd_none(*pmd));
> >               /*
> > @@ -1258,21 +1263,21 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >                */
> >               pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> >               spin_unlock(pmd_ptl);
> > -             anon_vma_unlock_write(vma->anon_vma);
> >               goto out_up_write;
> >       }
> >
> >       /*
> > -      * All pages are isolated and locked so anon_vma rmap
> > -      * can't run anymore.
> > +      * For PMD collapse all pages are isolated and locked so anon_vma
> > +      * rmap can't run anymore. For mTHP collapse we must hold the lock
> >        */
> > -     anon_vma_unlock_write(vma->anon_vma);
> > +     if (order == HPAGE_PMD_ORDER) {
> > +             anon_vma_unlock_write(vma->anon_vma);
> > +             anon_vma_locked = false;
> > +     }
> >
> >       result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> > -                                        vma, address, pte_ptl,
> > -                                        HPAGE_PMD_ORDER,
> > -                                        &compound_pagelist);
> > -     pte_unmap(pte);
> > +                                        vma, mthp_address, pte_ptl,
> > +                                        order, &compound_pagelist);
>
> Looking through __collapse_huge_page_copy() there doesn't seem to be any
> issue with holding anon lock here.
>
> >       if (unlikely(result != SCAN_SUCCEED))
> >               goto out_up_write;
> >
> > @@ -1282,20 +1287,42 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >        * write.
> >        */
> >       __folio_mark_uptodate(folio);
> > -     pgtable = pmd_pgtable(_pmd);
> > +     if (order == HPAGE_PMD_ORDER) {
> > +             pgtable = pmd_pgtable(_pmd);
> >
> > -     spin_lock(pmd_ptl);
> > -     BUG_ON(!pmd_none(*pmd));
> > -     pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > -     map_anon_folio_pmd_nopf(folio, pmd, vma, address);
> > -     spin_unlock(pmd_ptl);
> > +             spin_lock(pmd_ptl);
> > +             WARN_ON_ONCE(!pmd_none(*pmd));
> > +             pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > +             map_anon_folio_pmd_nopf(folio, pmd, vma, pmd_address);
> > +             spin_unlock(pmd_ptl);
> > +     } else { /* mTHP collapse */
>
> As per above, let's just declare mthp_pte here.

ack

>
> > +             mthp_pte = mk_pte(folio_page(folio, 0), vma->vm_page_prot);
>
> Hm, so we make a PTE that references the first page of the folio? I guess
> the folio will be an mTHP folio, so we're essentially creating a template PTE
> that set_ptes() then advances across the rest of the folio?
>
> > +             mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
>
> In set_pte_range() we have a whole host of other checks like dirty,
> uffd_wp, etc. I wonder if we need to consider those?

I don't believe so, because those checks are coming from fault handling.
Here we are doing almost the same thing the PMD case was doing, with some
influence from do_anonymous_page().

>
> > +
> > +             spin_lock(pmd_ptl);
>
> We're duplicating this in both branches, why not do outside if/else?

ack

>
> > +             WARN_ON_ONCE(!pmd_none(*pmd));
>
> Hmm so the PMD entry will still always be empty on mTHP collapse? Surely we
> could be collapsing more than one mTHP into an existing PTE table no? I may
> be missing something here/confused :)

We remove the PMD entry to ensure no GUP-fast call can operate on this PMD.

pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
/*
 * This removes any huge TLB entry from the CPU so we won't allow
 * huge and small TLB entries for the same virtual address to
 * avoid the risk of CPU bugs in that area.
 *
 * Parallel GUP-fast is fine since GUP-fast will back off when
 * it detects PMD is changed.
 */
_pmd = pmdp_collapse_flush(vma, pmd_address, pmd);

pmdp_collapse_flush removes the PMD
pmd = pmdp_huge_get_and_clear(vma->vm_mm, address, pmdp);

In the PMD case we install a new PMD; in the mTHP case (and in the
failure cases) we reinstall the same PMD once we are done/exit.

>
> > +             folio_ref_add(folio, nr_pages - 1);
>
> If we're setting the refcount here, where is the ref count being set in the
> PMD path?

Both folios are allocated with a single ref. A PMD folio only needs that one
ref, while an mTHP folio needs a ref for each PTE mapping; hence the increment
by nr_pages - 1.
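For example, an order-4 collapse maps 16 PTEs: one reference comes from
alloc_charge_folio(), and folio_ref_add() takes the remaining 15.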

>
> > +             folio_add_new_anon_rmap(folio, vma, mthp_address, RMAP_EXCLUSIVE);
> > +             folio_add_lru_vma(folio, vma);
> > +             set_ptes(vma->vm_mm, mthp_address, pte, mthp_pte, nr_pages);
> > +             update_mmu_cache_range(NULL, vma, mthp_address, pte, nr_pages);
>
> Prior to this change the only users of this are functions in memory.c; I
> do wonder if this is the wrong abstraction here.
>
> But maybe that's _yet another_ thing for a follow up (the THP code is a
> mess).

Yes, I tried to do something similar to the new
map_anon_folio_pmd_nopf, but it proved to be harder than expected. The
other cases that do similar operations all differ slightly so unifying
is going to be tricky/require more testing.

>
> > +
> > +             smp_wmb(); /* make PTEs visible before PMD. See pmd_install() */
>
> Feels like we could avoid open-coding this by just using pmd_install()?

The locking seems to differ which may make that tricky.

>
> Also are we therefore missing a mm_inc_nr_ptes() invocation here, or do we
> update mm->pgtables_bytes elsewhere?

If I understand correctly, we already accounted for the PTEs when we
alloc'd them and their parent PMD. Since we are operating on an already
allocated PMD, I don't think we need to handle accounting for PMD or mTHP
collapse. I'll spend some time confirming this before posting.

>
>
> > +             pmd_populate(mm, pmd, pmd_pgtable(_pmd));
>
> Why are we referencing pmd in PMD branch and _pmd here?

I explained it a little more above, but we are reinstalling the
original PMD entry, which was removed for gup race reasons.

>
> > +             spin_unlock(pmd_ptl);
>
> The PMD case does this stuff in map_anon_folio_pmd_nopf(); could we add one for
> mTHP?

Yes, but I believe we should clean that up afterwards. Unifying most of the
callers proved tricky.
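
For reference, the folio/PTE mapping part of the mTHP block could probably be
folded into a helper along these lines later (hypothetical name, untested
sketch that just wraps what this patch already open-codes):

static void map_anon_folio_ptes_nopf(struct folio *folio, pte_t *pte,
		struct vm_area_struct *vma, unsigned long address,
		unsigned long nr_pages)
{
	pte_t entry = mk_pte(folio_page(folio, 0), vma->vm_page_prot);

	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
	folio_ref_add(folio, nr_pages - 1);
	folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
	folio_add_lru_vma(folio, vma);
	set_ptes(vma->vm_mm, address, pte, entry, nr_pages);
	update_mmu_cache_range(NULL, vma, address, pte, nr_pages);
}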

>
> This function is already horribly overwrought (not your fault) so I'd like
> to avoid adding open-coded blocks as much as possible.
>
> > +     }
> >
> >       folio = NULL;
> >
> >       result = SCAN_SUCCEED;
> >  out_up_write:
> > +     if (anon_vma_locked)
> > +             anon_vma_unlock_write(vma->anon_vma);
> > +     if (pte)
> > +             pte_unmap(pte);
> >       mmap_write_unlock(mm);
> >  out_nolock:
> > +     *mmap_locked = false;
>
> See above comment about setting this prior to jumping to out_nolock.

ack

Thanks for the reviews!
-- Nico

>
> >       if (folio)
> >               folio_put(folio);
> >       trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
> > @@ -1463,9 +1490,8 @@ static int collapse_scan_pmd(struct mm_struct *mm,
> >       pte_unmap_unlock(pte, ptl);
> >       if (result == SCAN_SUCCEED) {
> >               result = collapse_huge_page(mm, start_addr, referenced,
> > -                                         unmapped, cc);
> > -             /* collapse_huge_page will return with the mmap_lock released */
> > -             *mmap_locked = false;
> > +                                         unmapped, cc, mmap_locked,
> > +                                         HPAGE_PMD_ORDER, 0);
> >       }
> >  out:
> >       trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
> > --
> > 2.51.0
> >
>
Re: [PATCH v12 mm-new 07/15] khugepaged: generalize collapse_huge_page for mTHP collapse
Posted by Dev Jain 3 months ago
> ----------[snip]------------
>
>> +
>> +		spin_lock(pmd_ptl);
> We're duplicating this in both branches, why not do outside if/else?
>
>> +		WARN_ON_ONCE(!pmd_none(*pmd));
> Hmm so the PMD entry will still always be empty on mTHP collapse? Surely we
> could be collapsing more than one mTHP into an existing PTE table no? I may
> be missing something here/confused :)

After this code path isolates the PTE table, we don't want any other code path
doing "Hey, I see an empty PMD, let's install a PTE table here". That's one of
the reasons why all the heavy locking is required here.

Also, I want to ask a question about WARN vs BUG_ON: suppose that the
race I described above occurs. After khugepaged isolates the PTE table, someone
faults in a PTE table there, and eventually writes data in the underlying folios.
Then the buggy khugepaged nukes out that table and installs a new one, installing
an mTHP folio which had old data. How do we decide whether such a condition is
worthy of a BUG_ON (leading to system crash) vs letting this pass with WARN?


>
> ------------[snip]----------
>
Re: [PATCH v12 mm-new 07/15] khugepaged: generalize collapse_huge_page for mTHP collapse
Posted by Lorenzo Stoakes 3 months ago
On Fri, Nov 07, 2025 at 08:39:03AM +0530, Dev Jain wrote:
> > ----------[snip]------------

Please, when you snip, can you not snip away the code being referenced?
That's really unhelpful and now this sub-thread loses a ton of context...

> >
> > > +
> > > +		spin_lock(pmd_ptl);
> > We're duplicating this in both branches, why not do outside if/else?
> >
> > > +		WARN_ON_ONCE(!pmd_none(*pmd));
> > Hmm so the PMD entry will still always be empty on mTHP collapse? Surely we
> > could be collapsing more than one mTHP into an existing PTE table no? I may
> > be missing something here/confused :)
>
> After this code path isolates the PTE table, we don't want any other code path
> doing "Hey, I see an empty PMD, let's install a PTE table here". One of the
> reasons why all the heavy locking is required here.

That wasn't the question; the question was why we are not able to install mTHP
entries in an existing PTE table.

I'm obviously aware that we need to lock here.

>
> Also, I want to ask a question about WARN vs BUG_ON: suppose that the
> race I described above occurs. After khugepaged isolates the PTE table, someone
> faults in a PTE table there, and eventually writes data in the underlying folios.
> Then the buggy khugepaged nukes out that table and installs a new one, installing
> an mTHP folio which had old data. How do we decide whether such a condition is
> worthy of a BUG_ON (leading to system crash) vs letting this pass with WARN?

To all intents and purposes just use a WARN_ON(). A BUG_ON() is almost
never right. This has been done to death.

Probably the WARN_ON() should be a VM_WARN_ON_ONCE() because this is
something that should simply not be happening in practice.

Or we can make if (WARN_ON_ONCE(...)) abort, but then we complicate already
very complicated code.

>
>
> >
> > ------------[snip]----------
> >

Thanks, Lorenzo
Re: [PATCH v12 mm-new 07/15] khugepaged: generalize collapse_huge_page for mTHP collapse
Posted by Baolin Wang 3 months, 2 weeks ago

On 2025/10/23 02:37, Nico Pache wrote:
> Pass an order and offset to collapse_huge_page to support collapsing anon
> memory to arbitrary orders within a PMD. order indicates what mTHP size we
> are attempting to collapse to, and offset indicates where in the PMD to
> start the collapse attempt.
> 
> For non-PMD collapse we must leave the anon VMA write locked until after
> we collapse the mTHP-- in the PMD case all the pages are isolated, but in
> the mTHP case this is not true, and we must keep the lock to prevent
> changes to the VMA from occurring.
> 
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---

LGTM. And it passed my mTHP collapse test cases. So:
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>