Introduce the ability for khugepaged to collapse to different mTHP sizes.
While scanning PMD ranges for potential collapse candidates, keep track
of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit
represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If
mTHPs are enabled, we remove the restriction of max_ptes_none during the
scan phase so we don't bail out early and miss potential mTHP candidates.
After the scan is complete we will perform binary recursion on the
bitmap to determine which mTHP size would be most efficient to collapse
to. max_ptes_none will be scaled by the attempted collapse order to
determine how full a THP must be to be eligible.
If an mTHP collapse is attempted but contains swapped-out or shared
pages, we don't perform the collapse.
Signed-off-by: Nico Pache <npache@redhat.com>
---
mm/khugepaged.c | 125 ++++++++++++++++++++++++++++++++++--------------
1 file changed, 88 insertions(+), 37 deletions(-)
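
As a rough illustration of the scan-phase bookkeeping described above, the
standalone program below works through how max_ptes_none is scaled by the
attempted collapse order. The constants are illustrative assumptions only
(4K pages, 2MiB PMDs, a minimum mTHP chunk order of 2, and max_ptes_none
left at its default of HPAGE_PMD_NR - 1); it is not part of the patch.

#include <stdio.h>

/* Illustrative values: 4K pages, 2MiB PMDs, 4-pte minimum chunks. */
#define HPAGE_PMD_ORDER			9
#define HPAGE_PMD_NR			(1 << HPAGE_PMD_ORDER)
#define KHUGEPAGED_MIN_MTHP_ORDER	2
#define KHUGEPAGED_MIN_MTHP_NR		(1 << KHUGEPAGED_MIN_MTHP_ORDER)

int main(void)
{
	int max_ptes_none = HPAGE_PMD_NR - 1;	/* sysfs default */
	/* Per-chunk threshold used while filling the scan bitmap. */
	int scaled_none = max_ptes_none >>
			  (HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER);
	int order;

	printf("a %d-pte chunk is marked utilized if at most %d ptes are none\n",
	       KHUGEPAGED_MIN_MTHP_NR, scaled_none);

	/* The same scaling decides eligibility for each attempted order. */
	for (order = KHUGEPAGED_MIN_MTHP_ORDER; order <= HPAGE_PMD_ORDER; order++)
		printf("order %d: up to %d of %d ptes may be none\n",
		       order, max_ptes_none >> (HPAGE_PMD_ORDER - order),
		       1 << order);
	return 0;
}
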
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 6e67db86409a..3a846cd70c66 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1136,13 +1136,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
{
LIST_HEAD(compound_pagelist);
pmd_t *pmd, _pmd;
- pte_t *pte;
+ pte_t *pte, mthp_pte;
pgtable_t pgtable;
struct folio *folio;
spinlock_t *pmd_ptl, *pte_ptl;
int result = SCAN_FAIL;
struct vm_area_struct *vma;
struct mmu_notifier_range range;
+ unsigned long _address = address + offset * PAGE_SIZE;
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
@@ -1158,12 +1159,13 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
*mmap_locked = false;
}
- result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
+ result = alloc_charge_folio(&folio, mm, cc, order);
if (result != SCAN_SUCCEED)
goto out_nolock;
mmap_read_lock(mm);
- result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
+ *mmap_locked = true;
+ result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
if (result != SCAN_SUCCEED) {
mmap_read_unlock(mm);
goto out_nolock;
@@ -1181,13 +1183,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
* released when it fails. So we jump out_nolock directly in
* that case. Continuing to collapse causes inconsistency.
*/
- result = __collapse_huge_page_swapin(mm, vma, address, pmd,
- referenced, HPAGE_PMD_ORDER);
+ result = __collapse_huge_page_swapin(mm, vma, _address, pmd,
+ referenced, order);
if (result != SCAN_SUCCEED)
goto out_nolock;
}
mmap_read_unlock(mm);
+ *mmap_locked = false;
/*
* Prevent all access to pagetables with the exception of
* gup_fast later handled by the ptep_clear_flush and the VM
@@ -1197,7 +1200,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
* mmap_lock.
*/
mmap_write_lock(mm);
- result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
+ result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
if (result != SCAN_SUCCEED)
goto out_up_write;
/* check if the pmd is still valid */
@@ -1208,11 +1211,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
vma_start_write(vma);
anon_vma_lock_write(vma->anon_vma);
- mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
- address + HPAGE_PMD_SIZE);
+ mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
+ _address + (PAGE_SIZE << order));
mmu_notifier_invalidate_range_start(&range);
pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
+
/*
* This removes any huge TLB entry from the CPU so we won't allow
* huge and small TLB entries for the same virtual address to
@@ -1226,18 +1230,16 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
mmu_notifier_invalidate_range_end(&range);
tlb_remove_table_sync_one();
- pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
+ pte = pte_offset_map_lock(mm, &_pmd, _address, &pte_ptl);
if (pte) {
- result = __collapse_huge_page_isolate(vma, address, pte, cc,
- &compound_pagelist, HPAGE_PMD_ORDER);
+ result = __collapse_huge_page_isolate(vma, _address, pte, cc,
+ &compound_pagelist, order);
spin_unlock(pte_ptl);
} else {
result = SCAN_PMD_NULL;
}
if (unlikely(result != SCAN_SUCCEED)) {
- if (pte)
- pte_unmap(pte);
spin_lock(pmd_ptl);
BUG_ON(!pmd_none(*pmd));
/*
@@ -1258,9 +1260,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
anon_vma_unlock_write(vma->anon_vma);
result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
- vma, address, pte_ptl,
- &compound_pagelist, HPAGE_PMD_ORDER);
- pte_unmap(pte);
+ vma, _address, pte_ptl,
+ &compound_pagelist, order);
if (unlikely(result != SCAN_SUCCEED))
goto out_up_write;
@@ -1270,25 +1271,42 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
* write.
*/
__folio_mark_uptodate(folio);
- pgtable = pmd_pgtable(_pmd);
-
- _pmd = folio_mk_pmd(folio, vma->vm_page_prot);
- _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
-
- spin_lock(pmd_ptl);
- BUG_ON(!pmd_none(*pmd));
- folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
- folio_add_lru_vma(folio, vma);
- pgtable_trans_huge_deposit(mm, pmd, pgtable);
- set_pmd_at(mm, address, pmd, _pmd);
- update_mmu_cache_pmd(vma, address, pmd);
- deferred_split_folio(folio, false);
- spin_unlock(pmd_ptl);
+ if (order == HPAGE_PMD_ORDER) {
+ pgtable = pmd_pgtable(_pmd);
+ _pmd = folio_mk_pmd(folio, vma->vm_page_prot);
+ _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
+
+ spin_lock(pmd_ptl);
+ BUG_ON(!pmd_none(*pmd));
+ folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
+ folio_add_lru_vma(folio, vma);
+ pgtable_trans_huge_deposit(mm, pmd, pgtable);
+ set_pmd_at(mm, address, pmd, _pmd);
+ update_mmu_cache_pmd(vma, address, pmd);
+ deferred_split_folio(folio, false);
+ spin_unlock(pmd_ptl);
+ } else { /* mTHP collapse */
+ mthp_pte = mk_pte(&folio->page, vma->vm_page_prot);
+ mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
+
+ spin_lock(pmd_ptl);
+ folio_ref_add(folio, (1 << order) - 1);
+ folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
+ folio_add_lru_vma(folio, vma);
+ set_ptes(vma->vm_mm, _address, pte, mthp_pte, (1 << order));
+ update_mmu_cache_range(NULL, vma, _address, pte, (1 << order));
+
+ smp_wmb(); /* make pte visible before pmd */
+ pmd_populate(mm, pmd, pmd_pgtable(_pmd));
+ spin_unlock(pmd_ptl);
+ }
folio = NULL;
result = SCAN_SUCCEED;
out_up_write:
+ if (pte)
+ pte_unmap(pte);
mmap_write_unlock(mm);
out_nolock:
*mmap_locked = false;
@@ -1364,31 +1382,58 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
{
pmd_t *pmd;
pte_t *pte, *_pte;
+ int i;
int result = SCAN_FAIL, referenced = 0;
int none_or_zero = 0, shared = 0;
struct page *page = NULL;
struct folio *folio = NULL;
unsigned long _address;
+ unsigned long enabled_orders;
spinlock_t *ptl;
int node = NUMA_NO_NODE, unmapped = 0;
+ bool is_pmd_only;
bool writable = false;
-
+ int chunk_none_count = 0;
+ int scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER);
+ unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
VM_BUG_ON(address & ~HPAGE_PMD_MASK);
result = find_pmd_or_thp_or_none(mm, address, &pmd);
if (result != SCAN_SUCCEED)
goto out;
+ bitmap_zero(cc->mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
+ bitmap_zero(cc->mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
memset(cc->node_load, 0, sizeof(cc->node_load));
nodes_clear(cc->alloc_nmask);
+
+ enabled_orders = thp_vma_allowable_orders(vma, vma->vm_flags,
+ tva_flags, THP_ORDERS_ALL_ANON);
+
+ is_pmd_only = (enabled_orders == (1 << HPAGE_PMD_ORDER));
+
pte = pte_offset_map_lock(mm, pmd, address, &ptl);
if (!pte) {
result = SCAN_PMD_NULL;
goto out;
}
- for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
- _pte++, _address += PAGE_SIZE) {
+ for (i = 0; i < HPAGE_PMD_NR; i++) {
+ /*
+ * we are reading in KHUGEPAGED_MIN_MTHP_NR page chunks. if
+ * there are pages in this chunk keep track of it in the bitmap
+ * for mTHP collapsing.
+ */
+ if (i % KHUGEPAGED_MIN_MTHP_NR == 0) {
+ if (chunk_none_count <= scaled_none)
+ bitmap_set(cc->mthp_bitmap,
+ i / KHUGEPAGED_MIN_MTHP_NR, 1);
+
+ chunk_none_count = 0;
+ }
+
+ _pte = pte + i;
+ _address = address + i * PAGE_SIZE;
pte_t pteval = ptep_get(_pte);
if (is_swap_pte(pteval)) {
++unmapped;
@@ -1411,10 +1456,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
}
}
if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
+ ++chunk_none_count;
++none_or_zero;
if (!userfaultfd_armed(vma) &&
- (!cc->is_khugepaged ||
- none_or_zero <= khugepaged_max_ptes_none)) {
+ (!cc->is_khugepaged || !is_pmd_only ||
+ none_or_zero <= khugepaged_max_ptes_none)) {
continue;
} else {
result = SCAN_EXCEED_NONE_PTE;
@@ -1510,6 +1556,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
address)))
referenced++;
}
+
if (!writable) {
result = SCAN_PAGE_RO;
} else if (cc->is_khugepaged &&
@@ -1522,8 +1569,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
out_unmap:
pte_unmap_unlock(pte, ptl);
if (result == SCAN_SUCCEED) {
- result = collapse_huge_page(mm, address, referenced,
- unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
+ result = khugepaged_scan_bitmap(mm, address, referenced, unmapped, cc,
+ mmap_locked, enabled_orders);
+ if (result > 0)
+ result = SCAN_SUCCEED;
+ else
+ result = SCAN_FAIL;
}
out:
trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,
--
2.48.1
On Mon, 28 Apr 2025, Nico Pache wrote:
> Introduce the ability for khugepaged to collapse to different mTHP sizes.
> While scanning PMD ranges for potential collapse candidates, keep track
> of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit
> represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If
> mTHPs are enabled we remove the restriction of max_ptes_none during the
> scan phase so we dont bailout early and miss potential mTHP candidates.
>
> After the scan is complete we will perform binary recursion on the
> bitmap to determine which mTHP size would be most efficient to collapse
> to. max_ptes_none will be scaled by the attempted collapse order to
> determine how full a THP must be to be eligible.
>
> If a mTHP collapse is attempted, but contains swapped out, or shared
> pages, we dont perform the collapse.
>
> Signed-off-by: Nico Pache <npache@redhat.com>
There are locking errors in this patch. Let me comment inline below,
then at the end append the fix patch I'm using, to keep mm-new usable
for me. But that's more of an emergency rescue than a recommended fixup:
I don't much like your approach here, and hope it will change in v6.
> ---
> mm/khugepaged.c | 125 ++++++++++++++++++++++++++++++++++--------------
> 1 file changed, 88 insertions(+), 37 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 6e67db86409a..3a846cd70c66 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1136,13 +1136,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> {
> LIST_HEAD(compound_pagelist);
> pmd_t *pmd, _pmd;
> - pte_t *pte;
> + pte_t *pte, mthp_pte;
I didn't wait to see the problem, just noticed that in the v4->v5
update, pte gets used at out_up_write, but there are gotos before
pte has been initialized. Declare pte = NULL here.
> pgtable_t pgtable;
> struct folio *folio;
> spinlock_t *pmd_ptl, *pte_ptl;
> int result = SCAN_FAIL;
> struct vm_area_struct *vma;
> struct mmu_notifier_range range;
> + unsigned long _address = address + offset * PAGE_SIZE;
>
> VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>
> @@ -1158,12 +1159,13 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> *mmap_locked = false;
> }
>
> - result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> + result = alloc_charge_folio(&folio, mm, cc, order);
> if (result != SCAN_SUCCEED)
> goto out_nolock;
>
> mmap_read_lock(mm);
> - result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
> + *mmap_locked = true;
> + result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
> if (result != SCAN_SUCCEED) {
> mmap_read_unlock(mm);
> goto out_nolock;
> @@ -1181,13 +1183,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> * released when it fails. So we jump out_nolock directly in
> * that case. Continuing to collapse causes inconsistency.
> */
> - result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> - referenced, HPAGE_PMD_ORDER);
> + result = __collapse_huge_page_swapin(mm, vma, _address, pmd,
> + referenced, order);
> if (result != SCAN_SUCCEED)
> goto out_nolock;
> }
>
> mmap_read_unlock(mm);
> + *mmap_locked = false;
> /*
> * Prevent all access to pagetables with the exception of
> * gup_fast later handled by the ptep_clear_flush and the VM
> @@ -1197,7 +1200,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> * mmap_lock.
> */
> mmap_write_lock(mm);
> - result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
> + result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
> if (result != SCAN_SUCCEED)
> goto out_up_write;
> /* check if the pmd is still valid */
> @@ -1208,11 +1211,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
I spent a long time trying to work out why the include/linux/swapops.h:511
BUG is soon hit - the BUG which tells there's a migration entry left behind
after its folio has been unlocked.
In the patch at the end you'll see that I've inserted a check here, to
abort if the VMA following the revalidated "vma" is sharing the page table
(and so also affected by clearing *pmd). That turned out not to be the
problem (WARN_ONs inserted never fired in my limited testing), but it still
looks to me as if some such check is needed. Or I may be wrong, and
"revalidate" (a better place for the check) does actually check that, but
it wasn't obvious, and I haven't spent more time looking at it (but it did
appear to rule out the case of a VMA before "vma" sharing the page table).
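
The shared-page-table situation being described here is easy to produce from
userspace; the small demo below (an illustration only, assuming 4K pages and
2MiB PMDs) ends up with two VMAs whose ptes live in the same page table,
which is exactly the case a full *pmd clear has to worry about.

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t pmd_size = 2UL << 20;	/* assumed 2MiB PMD coverage */
	char *p, *aligned;

	/* Reserve enough space to carve out one PMD-aligned 2MiB range. */
	p = mmap(NULL, 2 * pmd_size, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return 1;
	aligned = (char *)(((unsigned long)p + pmd_size - 1) & ~(pmd_size - 1));

	/* Changing protection on the second half splits the VMA in two. */
	mprotect(aligned + pmd_size / 2, pmd_size / 2, PROT_READ);
	aligned[0] = 1;		/* fault in a pte: one page table, two VMAs */
	printf("VMA split inside one PMD range at %p; see /proc/self/maps\n",
	       aligned + pmd_size / 2);
	return 0;
}
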
> vma_start_write(vma);
> anon_vma_lock_write(vma->anon_vma);
>
> - mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> - address + HPAGE_PMD_SIZE);
> + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
> + _address + (PAGE_SIZE << order));
mmu_notifiers tend to be rather a mystery to me, so I've made no change
below, but it's not obvious whether it's correct to clear the *pmd but only
notify of clearing a subset of that range: what's outside the range soon
gets replaced as it was, but is that good enough? I don't know.
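
If it turns out not to be good enough, the conservative option would
presumably be to keep notifying for the whole PMD range whenever the whole
*pmd is cleared, even for an mTHP-sized collapse; a hypothetical sketch,
reusing the names from the hunk above:

	/* Hypothetical: *pmd is cleared, so notify for the full PMD range. */
	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
				address + HPAGE_PMD_SIZE);
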
> mmu_notifier_invalidate_range_start(&range);
>
> pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> +
> /*
> * This removes any huge TLB entry from the CPU so we won't allow
> * huge and small TLB entries for the same virtual address to
The line I want to comment on does not appear in this diff context:
_pmd = pmdp_collapse_flush(vma, address, pmd);
That is appropriate for a THP occupying the whole range of the page table,
but is a surprising way to handle an "mTHP" of just some of its ptes:
I would expect you to be invalidating and replacing just those.
And that is the cause of the swapops:511 BUGs: "uninvolved" ptes are
being temporarily hidden, so not found when remove_migration_ptes()
goes looking for them.
This reliance on pmdp_collapse_flush() can be rescued, with stricter
locking (comment below); but I don't like it, and notice Jann has
picked up on it too. I hope v6 switches to handling ptes by ptes.
> @@ -1226,18 +1230,16 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> mmu_notifier_invalidate_range_end(&range);
> tlb_remove_table_sync_one();
>
> - pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> + pte = pte_offset_map_lock(mm, &_pmd, _address, &pte_ptl);
> if (pte) {
> - result = __collapse_huge_page_isolate(vma, address, pte, cc,
> - &compound_pagelist, HPAGE_PMD_ORDER);
> + result = __collapse_huge_page_isolate(vma, _address, pte, cc,
> + &compound_pagelist, order);
> spin_unlock(pte_ptl);
> } else {
> result = SCAN_PMD_NULL;
> }
>
> if (unlikely(result != SCAN_SUCCEED)) {
> - if (pte)
> - pte_unmap(pte);
> spin_lock(pmd_ptl);
> BUG_ON(!pmd_none(*pmd));
> /*
> @@ -1258,9 +1260,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> anon_vma_unlock_write(vma->anon_vma);
Phew, it's just visible there in the context. The anon_vma lock is what
keeps out racing lookups; so, that anon_vma_unlock_write() (and its
"All pages are isolated and locked" comment) is appropriate in the
HPAGE_PMD_SIZEd THP case, but has to be left until later for mTHP ptes.
But the anon_vma lock may well span a much larger range than the pte
lock, and the pmd lock certainly spans a much larger range than the
pte lock; so we really prefer to release anon_vma lock and pmd lock
as soon as is safe, and use pte lock in preference where possible.
>
> result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> - vma, address, pte_ptl,
> - &compound_pagelist, HPAGE_PMD_ORDER);
> - pte_unmap(pte);
> + vma, _address, pte_ptl,
> + &compound_pagelist, order);
> if (unlikely(result != SCAN_SUCCEED))
> goto out_up_write;
>
> @@ -1270,25 +1271,42 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> * write.
> */
> __folio_mark_uptodate(folio);
> - pgtable = pmd_pgtable(_pmd);
> -
> - _pmd = folio_mk_pmd(folio, vma->vm_page_prot);
> - _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
> -
> - spin_lock(pmd_ptl);
> - BUG_ON(!pmd_none(*pmd));
> - folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
> - folio_add_lru_vma(folio, vma);
> - pgtable_trans_huge_deposit(mm, pmd, pgtable);
> - set_pmd_at(mm, address, pmd, _pmd);
> - update_mmu_cache_pmd(vma, address, pmd);
> - deferred_split_folio(folio, false);
> - spin_unlock(pmd_ptl);
> + if (order == HPAGE_PMD_ORDER) {
> + pgtable = pmd_pgtable(_pmd);
> + _pmd = folio_mk_pmd(folio, vma->vm_page_prot);
> + _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
> +
> + spin_lock(pmd_ptl);
> + BUG_ON(!pmd_none(*pmd));
> + folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
> + folio_add_lru_vma(folio, vma);
> + pgtable_trans_huge_deposit(mm, pmd, pgtable);
> + set_pmd_at(mm, address, pmd, _pmd);
> + update_mmu_cache_pmd(vma, address, pmd);
> + deferred_split_folio(folio, false);
> + spin_unlock(pmd_ptl);
> + } else { /* mTHP collapse */
> + mthp_pte = mk_pte(&folio->page, vma->vm_page_prot);
> + mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
> +
> + spin_lock(pmd_ptl);
I haven't changed that, but it is odd: yes, pmd_ptl will be required
when doing the pmd_populate(), but it serves no purpose here when
fiddling around with ptes in a disconnected page table.
> + folio_ref_add(folio, (1 << order) - 1);
> + folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
> + folio_add_lru_vma(folio, vma);
> + set_ptes(vma->vm_mm, _address, pte, mthp_pte, (1 << order));
> + update_mmu_cache_range(NULL, vma, _address, pte, (1 << order));
> +
> + smp_wmb(); /* make pte visible before pmd */
> + pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> + spin_unlock(pmd_ptl);
> + }
>
> folio = NULL;
>
> result = SCAN_SUCCEED;
Somewhere around here it becomes safe for mTHP to anon_vma_unlock_write().
> out_up_write:
> + if (pte)
> + pte_unmap(pte);
> mmap_write_unlock(mm);
> out_nolock:
> *mmap_locked = false;
> @@ -1364,31 +1382,58 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> {
> pmd_t *pmd;
> pte_t *pte, *_pte;
> + int i;
> int result = SCAN_FAIL, referenced = 0;
> int none_or_zero = 0, shared = 0;
> struct page *page = NULL;
> struct folio *folio = NULL;
> unsigned long _address;
> + unsigned long enabled_orders;
> spinlock_t *ptl;
> int node = NUMA_NO_NODE, unmapped = 0;
> + bool is_pmd_only;
> bool writable = false;
> -
> + int chunk_none_count = 0;
> + int scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER);
> + unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
> VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>
> result = find_pmd_or_thp_or_none(mm, address, &pmd);
> if (result != SCAN_SUCCEED)
> goto out;
>
> + bitmap_zero(cc->mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
> + bitmap_zero(cc->mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
> memset(cc->node_load, 0, sizeof(cc->node_load));
> nodes_clear(cc->alloc_nmask);
> +
> + enabled_orders = thp_vma_allowable_orders(vma, vma->vm_flags,
> + tva_flags, THP_ORDERS_ALL_ANON);
> +
> + is_pmd_only = (enabled_orders == (1 << HPAGE_PMD_ORDER));
> +
> pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> if (!pte) {
> result = SCAN_PMD_NULL;
> goto out;
> }
>
> - for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> - _pte++, _address += PAGE_SIZE) {
> + for (i = 0; i < HPAGE_PMD_NR; i++) {
> + /*
> + * we are reading in KHUGEPAGED_MIN_MTHP_NR page chunks. if
> + * there are pages in this chunk keep track of it in the bitmap
> + * for mTHP collapsing.
> + */
> + if (i % KHUGEPAGED_MIN_MTHP_NR == 0) {
> + if (chunk_none_count <= scaled_none)
> + bitmap_set(cc->mthp_bitmap,
> + i / KHUGEPAGED_MIN_MTHP_NR, 1);
> +
> + chunk_none_count = 0;
> + }
> +
> + _pte = pte + i;
> + _address = address + i * PAGE_SIZE;
> pte_t pteval = ptep_get(_pte);
> if (is_swap_pte(pteval)) {
> ++unmapped;
> @@ -1411,10 +1456,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> }
> }
> if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> + ++chunk_none_count;
> ++none_or_zero;
> if (!userfaultfd_armed(vma) &&
> - (!cc->is_khugepaged ||
> - none_or_zero <= khugepaged_max_ptes_none)) {
> + (!cc->is_khugepaged || !is_pmd_only ||
> + none_or_zero <= khugepaged_max_ptes_none)) {
> continue;
> } else {
> result = SCAN_EXCEED_NONE_PTE;
> @@ -1510,6 +1556,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> address)))
> referenced++;
> }
> +
> if (!writable) {
> result = SCAN_PAGE_RO;
> } else if (cc->is_khugepaged &&
> @@ -1522,8 +1569,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> out_unmap:
> pte_unmap_unlock(pte, ptl);
> if (result == SCAN_SUCCEED) {
> - result = collapse_huge_page(mm, address, referenced,
> - unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
> + result = khugepaged_scan_bitmap(mm, address, referenced, unmapped, cc,
> + mmap_locked, enabled_orders);
> + if (result > 0)
> + result = SCAN_SUCCEED;
> + else
> + result = SCAN_FAIL;
> }
> out:
> trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,
> --
> 2.48.1
Fixes to 07/12 "khugepaged: add mTHP support".
But I see now that the first hunk is actually not to this 07/12, but to
05/12 "khugepaged: generalize __collapse_huge_page_* for mTHP support":
the mTHP check added in __collapse_huge_page_swapin() forgets to unmap
and unlock before returning, causing RCU imbalance warnings and lockups.
I won't separate it out here, let me leave that to you.
And I had other fixes to v4, which you've fixed differently in v5,
I haven't looked up which patch: where khugepaged_collapse_single_pmd()
does mmap_read_(un)lock() around collapse_pte_mapped_thp(). I dislike
your special use of result SCAN_ANY_PROCESS there, because mmap_locked
is precisely the tool for that job, so just lock and unlock without
setting *mmap_locked true (but I'd agree that mmap_locked is confusing,
and offhand wouldn't want to assert exactly what it means - does it
mean that mmap lock was *never* dropped, so "vma" is safe without
revalidation? depends on where it's used perhaps).
Hugh
---
mm/khugepaged.c | 32 +++++++++++++++++++++++++++-----
1 file changed, 27 insertions(+), 5 deletions(-)
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index c1c637dbcb81..2c814c239d65 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1054,6 +1054,8 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
/* Dont swapin for mTHP collapse */
if (order != HPAGE_PMD_ORDER) {
+ pte_unmap(pte);
+ mmap_read_unlock(mm);
result = SCAN_EXCEED_SWAP_PTE;
goto out;
}
@@ -1136,7 +1138,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
{
LIST_HEAD(compound_pagelist);
pmd_t *pmd, _pmd;
- pte_t *pte, mthp_pte;
+ pte_t *pte = NULL, mthp_pte;
pgtable_t pgtable;
struct folio *folio;
spinlock_t *pmd_ptl, *pte_ptl;
@@ -1208,6 +1210,21 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
if (result != SCAN_SUCCEED)
goto out_up_write;
+ if (vma->vm_end < address + HPAGE_PMD_SIZE) {
+ struct vm_area_struct *next_vma = find_vma(mm, vma->vm_end);
+ /*
+ * We must not clear *pmd if it is used by the following VMA.
+ * Well, perhaps we could if it, and all following VMAs using
+ * this same page table, share the same anon_vma, and so are
+ * locked out together: but keep it simple for now (and this
+ * code might better belong in hugepage_vma_revalidate()).
+ */
+ if (next_vma && next_vma->vm_start < address + HPAGE_PMD_SIZE) {
+ result = SCAN_ADDRESS_RANGE;
+ goto out_up_write;
+ }
+ }
+
vma_start_write(vma);
anon_vma_lock_write(vma->anon_vma);
@@ -1255,15 +1272,17 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
/*
* All pages are isolated and locked so anon_vma rmap
- * can't run anymore.
- */
- anon_vma_unlock_write(vma->anon_vma);
+ * can't run anymore - IF the entire extent has been isolated.
+ * anon_vma lock may cover a large area: better unlock a.s.a.p.
+ */
+ if (order == HPAGE_PMD_ORDER)
+ anon_vma_unlock_write(vma->anon_vma);
result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
vma, _address, pte_ptl,
&compound_pagelist, order);
if (unlikely(result != SCAN_SUCCEED))
- goto out_up_write;
+ goto out_unlock_anon_vma;
/*
* The smp_wmb() inside __folio_mark_uptodate() ensures the
@@ -1304,6 +1323,9 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
folio = NULL;
result = SCAN_SUCCEED;
+out_unlock_anon_vma:
+ if (order != HPAGE_PMD_ORDER)
+ anon_vma_unlock_write(vma->anon_vma);
out_up_write:
if (pte)
pte_unmap(pte);
--
2.43.0
On Thu, May 1, 2025 at 6:59 AM Hugh Dickins <hughd@google.com> wrote:
>
> On Mon, 28 Apr 2025, Nico Pache wrote:
>
> > Introduce the ability for khugepaged to collapse to different mTHP sizes.
> > While scanning PMD ranges for potential collapse candidates, keep track
> > of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit
> > represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If
> > mTHPs are enabled we remove the restriction of max_ptes_none during the
> > scan phase so we dont bailout early and miss potential mTHP candidates.
> >
> > After the scan is complete we will perform binary recursion on the
> > bitmap to determine which mTHP size would be most efficient to collapse
> > to. max_ptes_none will be scaled by the attempted collapse order to
> > determine how full a THP must be to be eligible.
> >
> > If a mTHP collapse is attempted, but contains swapped out, or shared
> > pages, we dont perform the collapse.
> >
> > Signed-off-by: Nico Pache <npache@redhat.com>
>
> There are locking errors in this patch. Let me comment inline below,
> then at the end append the fix patch I'm using, to keep mm-new usable
> for me. But that's more of an emergency rescue than a recommended fixup:
> I don't much like your approach here, and hope it will change in v6.
Hi Hugh,
Thank you for testing, and providing such a detailed report + fix!
>
> > ---
> > mm/khugepaged.c | 125 ++++++++++++++++++++++++++++++++++--------------
> > 1 file changed, 88 insertions(+), 37 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 6e67db86409a..3a846cd70c66 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1136,13 +1136,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > {
> > LIST_HEAD(compound_pagelist);
> > pmd_t *pmd, _pmd;
> > - pte_t *pte;
> > + pte_t *pte, mthp_pte;
>
> I didn't wait to see the problem, just noticed that in the v4->v5
> update, pte gets used at out_up_write, but there are gotos before
> pte has been initialized. Declare pte = NULL here.
Ah thanks I missed that when I moved the PTE unmapping to out_up_write.
>
> > pgtable_t pgtable;
> > struct folio *folio;
> > spinlock_t *pmd_ptl, *pte_ptl;
> > int result = SCAN_FAIL;
> > struct vm_area_struct *vma;
> > struct mmu_notifier_range range;
> > + unsigned long _address = address + offset * PAGE_SIZE;
> >
> > VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> >
> > @@ -1158,12 +1159,13 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > *mmap_locked = false;
> > }
> >
> > - result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> > + result = alloc_charge_folio(&folio, mm, cc, order);
> > if (result != SCAN_SUCCEED)
> > goto out_nolock;
> >
> > mmap_read_lock(mm);
> > - result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
> > + *mmap_locked = true;
> > + result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
> > if (result != SCAN_SUCCEED) {
> > mmap_read_unlock(mm);
> > goto out_nolock;
> > @@ -1181,13 +1183,14 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > * released when it fails. So we jump out_nolock directly in
> > * that case. Continuing to collapse causes inconsistency.
> > */
> > - result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> > - referenced, HPAGE_PMD_ORDER);
> > + result = __collapse_huge_page_swapin(mm, vma, _address, pmd,
> > + referenced, order);
> > if (result != SCAN_SUCCEED)
> > goto out_nolock;
> > }
> >
> > mmap_read_unlock(mm);
> > + *mmap_locked = false;
> > /*
> > * Prevent all access to pagetables with the exception of
> > * gup_fast later handled by the ptep_clear_flush and the VM
> > @@ -1197,7 +1200,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > * mmap_lock.
> > */
> > mmap_write_lock(mm);
> > - result = hugepage_vma_revalidate(mm, address, true, &vma, cc, HPAGE_PMD_ORDER);
> > + result = hugepage_vma_revalidate(mm, address, true, &vma, cc, order);
> > if (result != SCAN_SUCCEED)
> > goto out_up_write;
> > /* check if the pmd is still valid */
> > @@ -1208,11 +1211,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>
> I spent a long time trying to work out why the include/linux/swapops.h:511
> BUG is soon hit - the BUG which tells there's a migration entry left behind
> after its folio has been unlocked.
Can you please share how you reproduce the bug? I haven't been able to
reproduce any BUGs with most of the memory debuggers enabled. This has
also been through a number of mm test suites and some internal
testing. So I'm curious how to hit this issue!
>
> In the patch at the end you'll see that I've inserted a check here, to
> abort if the VMA following the revalidated "vma" is sharing the page table
> (and so also affected by clearing *pmd). That turned out not to be the
> problem (WARN_ONs inserted never fired in my limited testing), but it still
> looks to me as if some such check is needed. Or I may be wrong, and
> "revalidate" (a better place for the check) does actually check that, but
> it wasn't obvious, and I haven't spent more time looking at it (but it did
> appear to rule out the case of a VMA before "vma" sharing the page table).
Hmm, if I understand this correctly, the check you are adding is
already handled in the PMD case through thp_vma_suitable_order(s)
(by checking that a PMD can fit in the VMA @address); however, I
think you are correct: we must check that the VMA still spans the
whole PMD region we are trying to mTHP-collapse within. If not, it's
possible that a mremap+mmap might have occurred, allowing a different
mapping to belong to the same PMD. (This also makes supporting mTHP
collapse in mappings smaller than PMD size difficult, and is the
reason we haven't pursued it yet: the locking requirements explode.
The simple solution is, as you noted, to only support <PMD-sized
collapses with a single VMA in the PMD range; this is a future
change.)
I actually realized my revalidation is rather weak... mostly because
I'm passing the PMD address, not the mTHP starting address (see
address vs _address in collapse_huge_page). I think your check would
be handled if I pass it _address. If that was the case the following
check would be performed in thp_vma_suitable_order:
if (haddr < vma->vm_start || haddr + hpage_size > vma->vm_end)
return false;
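
A hypothetical sketch of that change, reusing names from the patch above,
would be to revalidate against the mTHP start and order, so the bounds
check covers the whole range being collapsed:

	/*
	 * Hypothetical: pass the mTHP start (_address) and the collapse
	 * order instead of the PMD-aligned address.
	 */
	result = hugepage_vma_revalidate(mm, _address, true, &vma, cc, order);
	if (result != SCAN_SUCCEED)
		goto out_up_write;
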
>
> > vma_start_write(vma);
> > anon_vma_lock_write(vma->anon_vma);
> >
> > - mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> > - address + HPAGE_PMD_SIZE);
> > + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
> > + _address + (PAGE_SIZE << order));
>
> mmu_notifiers tend to be rather a mystery to me, so I've made no change
> below, but it's not obvious whether it's correct clear the *pmd but only
> notify of clearing a subset of that range: what's outside the range soon
> gets replaced as it was, but is that good enough? I don't know.
Sadly I don't have a good answer for you, as mmu_notifier is mostly a
mystery to me too. I assume the anon_vma_lock_write would prevent any
others from touching the pages we ARE NOT invalidating here, and that
may be enough. Honestly I think the anon_vma_lock_write allows us to
remove a lot of the locking (but as I stated, I'm trying to keep the
locking the *same* in this series).
>
> > mmu_notifier_invalidate_range_start(&range);
> >
> > pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> > +
> > /*
> > * This removes any huge TLB entry from the CPU so we won't allow
> > * huge and small TLB entries for the same virtual address to
>
> The line I want to comment on does not appear in this diff context:
> _pmd = pmdp_collapse_flush(vma, address, pmd);
>
> That is appropriate for a THP occupying the whole range of the page table,
> but is a surprising way to handle an "mTHP" of just some of its ptes:
> I would expect you to be invalidating and replacing just those.
The reason for leaving it in the mTHP case was to prevent the
aforementioned GUP-fast race.
* Parallel GUP-fast is fine since GUP-fast will back off when
* it detects PMD is changed.
I tried to keep the same locking principles as the PMD case: remove
the PMD, hold all the same locks, collapse to (m)THP, reinstall the
PMD. This seems the easiest way to get it right and allowed me to
focus on adding the feature. The codebase being scattered with
mentions of potential races, and taking locks that might not be
needed, also made me wary.
I know Dev was looking into optimizations for the locking, and I have
some ideas of ways to improve it, or at least clean up some of the
locking.
>
> And that is the cause of the swapops:511 BUGs: "uninvolved" ptes are
> being temporarily hidden, so not found when remove_migration_ptes()
> goes looking for them.
>
> This reliance on pmdp_collapse_flush() can be rescued, with stricter
> locking (comment below); but I don't like it, and notice Jann has
> picked up on it too. I hope v6 switches to handling ptes by ptes.
I believe the current approach with your anon_write_unlock fix would
suffice (if the gup race is a valid reason to keep the pmd flush), and
then we can follow up with locking improvements.
>
> > @@ -1226,18 +1230,16 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > mmu_notifier_invalidate_range_end(&range);
> > tlb_remove_table_sync_one();
> >
> > - pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
> > + pte = pte_offset_map_lock(mm, &_pmd, _address, &pte_ptl);
> > if (pte) {
> > - result = __collapse_huge_page_isolate(vma, address, pte, cc,
> > - &compound_pagelist, HPAGE_PMD_ORDER);
> > + result = __collapse_huge_page_isolate(vma, _address, pte, cc,
> > + &compound_pagelist, order);
> > spin_unlock(pte_ptl);
> > } else {
> > result = SCAN_PMD_NULL;
> > }
> >
> > if (unlikely(result != SCAN_SUCCEED)) {
> > - if (pte)
> > - pte_unmap(pte);
> > spin_lock(pmd_ptl);
> > BUG_ON(!pmd_none(*pmd));
> > /*
> > @@ -1258,9 +1260,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > anon_vma_unlock_write(vma->anon_vma);
>
> Phew, it's just visible there in the context. The anon_vma lock is what
> keeps out racing lookups; so, that anon_vma_unlock_write() (and its
> "All pages are isolated and locked" comment) is appropriate in the
> HPAGE_PMD_SIZEd THP case, but has to be left until later for mTHP ptes.
Makes sense!
>
> But the anon_vma lock may well span a much larger range than the pte
> lock, and the pmd lock certainly spans a much larger range than the
> pte lock; so we really prefer to release anon_vma lock and pmd lock
> as soon as is safe, and use pte lock in preference where possible.
>
> >
> > result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> > - vma, address, pte_ptl,
> > - &compound_pagelist, HPAGE_PMD_ORDER);
> > - pte_unmap(pte);
> > + vma, _address, pte_ptl,
> > + &compound_pagelist, order);
> > if (unlikely(result != SCAN_SUCCEED))
> > goto out_up_write;
> >
> > @@ -1270,25 +1271,42 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> > * write.
> > */
> > __folio_mark_uptodate(folio);
> > - pgtable = pmd_pgtable(_pmd);
> > -
> > - _pmd = folio_mk_pmd(folio, vma->vm_page_prot);
> > - _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
> > -
> > - spin_lock(pmd_ptl);
> > - BUG_ON(!pmd_none(*pmd));
> > - folio_add_new_anon_rmap(folio, vma, address, RMAP_EXCLUSIVE);
> > - folio_add_lru_vma(folio, vma);
> > - pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > - set_pmd_at(mm, address, pmd, _pmd);
> > - update_mmu_cache_pmd(vma, address, pmd);
> > - deferred_split_folio(folio, false);
> > - spin_unlock(pmd_ptl);
> > + if (order == HPAGE_PMD_ORDER) {
> > + pgtable = pmd_pgtable(_pmd);
> > + _pmd = folio_mk_pmd(folio, vma->vm_page_prot);
> > + _pmd = maybe_pmd_mkwrite(pmd_mkdirty(_pmd), vma);
> > +
> > + spin_lock(pmd_ptl);
> > + BUG_ON(!pmd_none(*pmd));
> > + folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
> > + folio_add_lru_vma(folio, vma);
> > + pgtable_trans_huge_deposit(mm, pmd, pgtable);
> > + set_pmd_at(mm, address, pmd, _pmd);
> > + update_mmu_cache_pmd(vma, address, pmd);
> > + deferred_split_folio(folio, false);
> > + spin_unlock(pmd_ptl);
> > + } else { /* mTHP collapse */
> > + mthp_pte = mk_pte(&folio->page, vma->vm_page_prot);
> > + mthp_pte = maybe_mkwrite(pte_mkdirty(mthp_pte), vma);
> > +
> > + spin_lock(pmd_ptl);
>
> I haven't changed that, but it is odd: yes, pmd_ptl will be required
> when doing the pmd_populate(), but it serves no purpose here when
> fiddling around with ptes in a disconnected page table.
Yes I believe a lot of locking can be improved and some is unnecessary.
>
> > + folio_ref_add(folio, (1 << order) - 1);
> > + folio_add_new_anon_rmap(folio, vma, _address, RMAP_EXCLUSIVE);
> > + folio_add_lru_vma(folio, vma);
> > + set_ptes(vma->vm_mm, _address, pte, mthp_pte, (1 << order));
> > + update_mmu_cache_range(NULL, vma, _address, pte, (1 << order));
> > +
> > + smp_wmb(); /* make pte visible before pmd */
> > + pmd_populate(mm, pmd, pmd_pgtable(_pmd));
> > + spin_unlock(pmd_ptl);
> > + }
> >
> > folio = NULL;
> >
> > result = SCAN_SUCCEED;
>
> Somewhere around here it becomes safe for mTHP to anon_vma_unlock_write().
>
> > out_up_write:
> > + if (pte)
> > + pte_unmap(pte);
> > mmap_write_unlock(mm);
> > out_nolock:
> > *mmap_locked = false;
> > @@ -1364,31 +1382,58 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> > {
> > pmd_t *pmd;
> > pte_t *pte, *_pte;
> > + int i;
> > int result = SCAN_FAIL, referenced = 0;
> > int none_or_zero = 0, shared = 0;
> > struct page *page = NULL;
> > struct folio *folio = NULL;
> > unsigned long _address;
> > + unsigned long enabled_orders;
> > spinlock_t *ptl;
> > int node = NUMA_NO_NODE, unmapped = 0;
> > + bool is_pmd_only;
> > bool writable = false;
> > -
> > + int chunk_none_count = 0;
> > + int scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER);
> > + unsigned long tva_flags = cc->is_khugepaged ? TVA_ENFORCE_SYSFS : 0;
> > VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> >
> > result = find_pmd_or_thp_or_none(mm, address, &pmd);
> > if (result != SCAN_SUCCEED)
> > goto out;
> >
> > + bitmap_zero(cc->mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
> > + bitmap_zero(cc->mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
> > memset(cc->node_load, 0, sizeof(cc->node_load));
> > nodes_clear(cc->alloc_nmask);
> > +
> > + enabled_orders = thp_vma_allowable_orders(vma, vma->vm_flags,
> > + tva_flags, THP_ORDERS_ALL_ANON);
> > +
> > + is_pmd_only = (enabled_orders == (1 << HPAGE_PMD_ORDER));
> > +
> > pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> > if (!pte) {
> > result = SCAN_PMD_NULL;
> > goto out;
> > }
> >
> > - for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> > - _pte++, _address += PAGE_SIZE) {
> > + for (i = 0; i < HPAGE_PMD_NR; i++) {
> > + /*
> > + * we are reading in KHUGEPAGED_MIN_MTHP_NR page chunks. if
> > + * there are pages in this chunk keep track of it in the bitmap
> > + * for mTHP collapsing.
> > + */
> > + if (i % KHUGEPAGED_MIN_MTHP_NR == 0) {
> > + if (chunk_none_count <= scaled_none)
> > + bitmap_set(cc->mthp_bitmap,
> > + i / KHUGEPAGED_MIN_MTHP_NR, 1);
> > +
> > + chunk_none_count = 0;
> > + }
> > +
> > + _pte = pte + i;
> > + _address = address + i * PAGE_SIZE;
> > pte_t pteval = ptep_get(_pte);
> > if (is_swap_pte(pteval)) {
> > ++unmapped;
> > @@ -1411,10 +1456,11 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> > }
> > }
> > if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> > + ++chunk_none_count;
> > ++none_or_zero;
> > if (!userfaultfd_armed(vma) &&
> > - (!cc->is_khugepaged ||
> > - none_or_zero <= khugepaged_max_ptes_none)) {
> > + (!cc->is_khugepaged || !is_pmd_only ||
> > + none_or_zero <= khugepaged_max_ptes_none)) {
> > continue;
> > } else {
> > result = SCAN_EXCEED_NONE_PTE;
> > @@ -1510,6 +1556,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> > address)))
> > referenced++;
> > }
> > +
> > if (!writable) {
> > result = SCAN_PAGE_RO;
> > } else if (cc->is_khugepaged &&
> > @@ -1522,8 +1569,12 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> > out_unmap:
> > pte_unmap_unlock(pte, ptl);
> > if (result == SCAN_SUCCEED) {
> > - result = collapse_huge_page(mm, address, referenced,
> > - unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
> > + result = khugepaged_scan_bitmap(mm, address, referenced, unmapped, cc,
> > + mmap_locked, enabled_orders);
> > + if (result > 0)
> > + result = SCAN_SUCCEED;
> > + else
> > + result = SCAN_FAIL;
> > }
> > out:
> > trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,
> > --
> > 2.48.1
>
> Fixes to 07/12 "khugepaged: add mTHP support".
> But I see now that the first hunk is actually not to this 07/12, but to
> 05/12 "khugepaged: generalize __collapse_huge_page_* for mTHP support":
> the mTHP check added in __collapse_huge_page_swapin() forgets to unmap
> and unlock before returning, causing RCU imbalance warnings and lockups.
> I won't separate it out here, let me leave that to you.
Thanks I'll get both of these issues fixed!
>
> And I had other fixes to v4, which you've fixed differently in v5,
> I haven't looked up which patch: where khugepaged_collapse_single_pmd()
> does mmap_read_(un)lock() around collapse_pte_mapped_thp(). I dislike
> your special use of result SCAN_ANY_PROCESS there, because mmap_locked
This is what the revalidate func is doing with
khugepaged_test_exit_or_disable so I tried to keep it consistent.
Sadly this was the cleanest solution I could come up with. If you don't
mind sharing what your solution to this was, I'd be happy to take a
look!
> is precisely the tool for that job, so just lock and unlock without
> setting *mmap_locked true (but I'd agree that mmap_locked is confusing,
> and offhand wouldn't want to assert exactly what it means - does it
> mean that mmap lock was *never* dropped, so "vma" is safe without
> revalidation? depends on where it's used perhaps).
I agree its use is confusing and I like to think of it as ONLY the
indicator of it being locked/unlocked. I'm not sure if there were other
intended uses.
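
For reference, a minimal sketch of the pattern Hugh seems to be suggesting
for khugepaged_collapse_single_pmd() (hypothetical, not code from the
thread): take the mmap read lock only around the call and never report it
as held, so callers know "vma" must be revalidated.

	mmap_read_lock(mm);
	result = collapse_pte_mapped_thp(mm, addr, true);
	mmap_read_unlock(mm);
	/* *mmap_locked stays false: the lock was dropped, revalidate vma. */
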
>
> Hugh
>
> ---
> mm/khugepaged.c | 32 +++++++++++++++++++++++++++-----
> 1 file changed, 27 insertions(+), 5 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index c1c637dbcb81..2c814c239d65 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1054,6 +1054,8 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
>
> /* Dont swapin for mTHP collapse */
> if (order != HPAGE_PMD_ORDER) {
> + pte_unmap(pte);
> + mmap_read_unlock(mm);
> result = SCAN_EXCEED_SWAP_PTE;
> goto out;
> }
> @@ -1136,7 +1138,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> {
> LIST_HEAD(compound_pagelist);
> pmd_t *pmd, _pmd;
> - pte_t *pte, mthp_pte;
> + pte_t *pte = NULL, mthp_pte;
> pgtable_t pgtable;
> struct folio *folio;
> spinlock_t *pmd_ptl, *pte_ptl;
> @@ -1208,6 +1210,21 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> if (result != SCAN_SUCCEED)
> goto out_up_write;
>
> + if (vma->vm_end < address + HPAGE_PMD_SIZE) {
> + struct vm_area_struct *next_vma = find_vma(mm, vma->vm_end);
> + /*
> + * We must not clear *pmd if it is used by the following VMA.
> + * Well, perhaps we could if it, and all following VMAs using
> + * this same page table, share the same anon_vma, and so are
> + * locked out together: but keep it simple for now (and this
> + * code might better belong in hugepage_vma_revalidate()).
> + */
> + if (next_vma && next_vma->vm_start < address + HPAGE_PMD_SIZE) {
> + result = SCAN_ADDRESS_RANGE;
> + goto out_up_write;
> + }
> + }
> +
> vma_start_write(vma);
> anon_vma_lock_write(vma->anon_vma);
>
> @@ -1255,15 +1272,17 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>
> /*
> * All pages are isolated and locked so anon_vma rmap
> - * can't run anymore.
> - */
> - anon_vma_unlock_write(vma->anon_vma);
> + * can't run anymore - IF the entire extent has been isolated.
> + * anon_vma lock may cover a large area: better unlock a.s.a.p.
> + */
> + if (order == HPAGE_PMD_ORDER)
> + anon_vma_unlock_write(vma->anon_vma);
>
> result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
> vma, _address, pte_ptl,
> &compound_pagelist, order);
> if (unlikely(result != SCAN_SUCCEED))
> - goto out_up_write;
> + goto out_unlock_anon_vma;
>
> /*
> * The smp_wmb() inside __folio_mark_uptodate() ensures the
> @@ -1304,6 +1323,9 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> folio = NULL;
>
> result = SCAN_SUCCEED;
> +out_unlock_anon_vma:
> + if (order != HPAGE_PMD_ORDER)
> + anon_vma_unlock_write(vma->anon_vma);
> out_up_write:
> if (pte)
> pte_unmap(pte);
> --
> 2.43.0
>
On Thu, 1 May 2025 05:58:40 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:

> There are locking errors in this patch.

Thanks, Hugh. I'll remove the v5 series from mm-new.
On Mon, Apr 28, 2025 at 8:12 PM Nico Pache <npache@redhat.com> wrote:
> Introduce the ability for khugepaged to collapse to different mTHP sizes.
> While scanning PMD ranges for potential collapse candidates, keep track
> of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit
> represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If
> mTHPs are enabled we remove the restriction of max_ptes_none during the
> scan phase so we dont bailout early and miss potential mTHP candidates.
>
> After the scan is complete we will perform binary recursion on the
> bitmap to determine which mTHP size would be most efficient to collapse
> to. max_ptes_none will be scaled by the attempted collapse order to
> determine how full a THP must be to be eligible.
>
> If a mTHP collapse is attempted, but contains swapped out, or shared
> pages, we dont perform the collapse.
[...]
> @@ -1208,11 +1211,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> 	vma_start_write(vma);
> 	anon_vma_lock_write(vma->anon_vma);
>
> -	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address,
> -				address + HPAGE_PMD_SIZE);
> +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address,
> +				_address + (PAGE_SIZE << order));
> 	mmu_notifier_invalidate_range_start(&range);
>
> 	pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */
> +
> 	/*
> 	 * This removes any huge TLB entry from the CPU so we won't allow
> 	 * huge and small TLB entries for the same virtual address to

It's not visible in this diff, but we're about to do a
pmdp_collapse_flush() here. pmdp_collapse_flush() tears down the
entire page table, meaning it tears down 2MiB of address space; and it
assumes that the entire page table exclusively corresponds to the
current VMA.

I think you'll need to ensure that the pmdp_collapse_flush() only
happens for full-size THP, and that mTHP only tears down individual
PTEs in the relevant range. (That code might get a bit messy, since
the existing THP code tears down PTEs in a detached page table, while
mTHP would have to do it in a still-attached page table.)
On Wed, Apr 30, 2025 at 2:53 PM Jann Horn <jannh@google.com> wrote:
>
> On Mon, Apr 28, 2025 at 8:12 PM Nico Pache <npache@redhat.com> wrote:
> > Introduce the ability for khugepaged to collapse to different mTHP sizes.
[...]
> It's not visible in this diff, but we're about to do a
> pmdp_collapse_flush() here. pmdp_collapse_flush() tears down the
> entire page table, meaning it tears down 2MiB of address space; and it
> assumes that the entire page table exclusively corresponds to the
> current VMA.
>
> I think you'll need to ensure that the pmdp_collapse_flush() only
> happens for full-size THP, and that mTHP only tears down individual
> PTEs in the relevant range. (That code might get a bit messy, since
> the existing THP code tears down PTEs in a detached page table, while
> mTHP would have to do it in a still-attached page table.)

Hi Jann!

I was under the impression that this is needed to prevent GUP-fast
races (and potentially others).

As you state here, conceptually the PMD case is, detach the PMD, do
the collapse, then reinstall the PMD (similarly to how the system
recovers from a failed PMD collapse). I tried to keep the current
locking behavior as it seemed the easiest way to get it right (and not
break anything). So I keep the PMD detaching and reinstalling for the
mTHP case too. As Hugh points out I am releasing the anon lock too
early. I will comment further on his response.

As I familiarize myself with the code more, I do see potential code
improvements/cleanups and locking improvements, but I was going to
leave those to a later series.

Thanks
-- Nico
On 02.05.25 00:29, Nico Pache wrote:
> On Wed, Apr 30, 2025 at 2:53 PM Jann Horn <jannh@google.com> wrote:
>>
>> On Mon, Apr 28, 2025 at 8:12 PM Nico Pache <npache@redhat.com> wrote:
>>> Introduce the ability for khugepaged to collapse to different mTHP sizes.
[...]
>> It's not visible in this diff, but we're about to do a
>> pmdp_collapse_flush() here. pmdp_collapse_flush() tears down the
>> entire page table, meaning it tears down 2MiB of address space; and it
>> assumes that the entire page table exclusively corresponds to the
>> current VMA.
>>
>> I think you'll need to ensure that the pmdp_collapse_flush() only
>> happens for full-size THP, and that mTHP only tears down individual
>> PTEs in the relevant range. (That code might get a bit messy, since
>> the existing THP code tears down PTEs in a detached page table, while
>> mTHP would have to do it in a still-attached page table.)
> Hi Jann!
>
> I was under the impression that this is needed to prevent GUP-fast
> races (and potentially others).
> As you state here, conceptually the PMD case is, detach the PMD, do
> the collapse, then reinstall the PMD (similarly to how the system
> recovers from a failed PMD collapse). I tried to keep the current
> locking behavior as it seemed the easiest way to get it right (and not
> break anything). So I keep the PMD detaching and reinstalling for the
> mTHP case too. As Hugh points out I am releasing the anon lock too
> early. I will comment further on his response.
>
> As I familiarize myself with the code more, I do see potential code
> improvements/cleanups and locking improvements, but I was going to
> leave those to a later series.

Right, the simplest approach on top of the current PMD collapse is to do
exactly what we do in the PMD case, including the locking: which
apparently is not completely the same yet :).

Instead of installing a PMD THP, we modify the page table and remap that.

Moving from the PMD lock to the PTE lock will not make a big change in
practice for most cases: we already must disable essentially all page
table walkers (vma lock, mmap lock in write, rmap lock in write).

The PMDP clear+flush is primarily to disable the last possible set of
page table walkers: (1) HW modifications and (2) GUP-fast.

So after the PMDP clear+flush we know that (A) HW can not modify the
pages concurrently and (B) GUP-fast cannot succeed anymore.

The issue with PTEP clear+flush is that we will have to remember all PTE
values, to reset them if anything goes wrong. Using a single PMD value
is arguably simpler. And then, the benefit vs. complexity is unclear.

Certainly something to look into later, but not a requirement for the
first support.

The real challenge/benefit will be looking into avoiding taking all the
heavy weight locks. Dev has already been thinking about that. For mTHP
it might be easier than for THPs. Probably it will involve setting PTE
migration entries whenever we drop the PTL, and dealing with the
possibility of concurrent zapping of these migration entries.

-- 
Cheers,

David / dhildenb
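
A rough sketch of the bookkeeping David describes, remembering the old pte
values so a failed mTHP collapse can restore them while leaving the pmd
alone; this is only an illustration of the idea (it assumes the caller holds
the pte lock and has set up the mmu notifier range), not tested kernel code:

static void mthp_clear_ptes(struct vm_area_struct *vma, unsigned long addr,
			    pte_t *pte, unsigned int nr, pte_t *saved)
{
	unsigned int i;

	/* Clear and flush only the ptes backing the candidate mTHP range. */
	for (i = 0; i < nr; i++, addr += PAGE_SIZE)
		saved[i] = ptep_clear_flush(vma, addr, pte + i);
}

static void mthp_restore_ptes(struct vm_area_struct *vma, unsigned long addr,
			      pte_t *pte, unsigned int nr, const pte_t *saved)
{
	unsigned int i;

	/* Put the remembered values back if the collapse has to be aborted. */
	for (i = 0; i < nr; i++, addr += PAGE_SIZE)
		set_pte_at(vma->vm_mm, addr, pte + i, saved[i]);
}
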
On Fri, May 2, 2025 at 8:29 AM David Hildenbrand <david@redhat.com> wrote: > On 02.05.25 00:29, Nico Pache wrote: > > On Wed, Apr 30, 2025 at 2:53 PM Jann Horn <jannh@google.com> wrote: > >> > >> On Mon, Apr 28, 2025 at 8:12 PM Nico Pache <npache@redhat.com> wrote: > >>> Introduce the ability for khugepaged to collapse to different mTHP sizes. > >>> While scanning PMD ranges for potential collapse candidates, keep track > >>> of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit > >>> represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If > >>> mTHPs are enabled we remove the restriction of max_ptes_none during the > >>> scan phase so we dont bailout early and miss potential mTHP candidates. > >>> > >>> After the scan is complete we will perform binary recursion on the > >>> bitmap to determine which mTHP size would be most efficient to collapse > >>> to. max_ptes_none will be scaled by the attempted collapse order to > >>> determine how full a THP must be to be eligible. > >>> > >>> If a mTHP collapse is attempted, but contains swapped out, or shared > >>> pages, we dont perform the collapse. > >> [...] > >>> @@ -1208,11 +1211,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address, > >>> vma_start_write(vma); > >>> anon_vma_lock_write(vma->anon_vma); > >>> > >>> - mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address, > >>> - address + HPAGE_PMD_SIZE); > >>> + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address, > >>> + _address + (PAGE_SIZE << order)); > >>> mmu_notifier_invalidate_range_start(&range); > >>> > >>> pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */ > >>> + > >>> /* > >>> * This removes any huge TLB entry from the CPU so we won't allow > >>> * huge and small TLB entries for the same virtual address to > >> > >> It's not visible in this diff, but we're about to do a > >> pmdp_collapse_flush() here. pmdp_collapse_flush() tears down the > >> entire page table, meaning it tears down 2MiB of address space; and it > >> assumes that the entire page table exclusively corresponds to the > >> current VMA. > >> > >> I think you'll need to ensure that the pmdp_collapse_flush() only > >> happens for full-size THP, and that mTHP only tears down individual > >> PTEs in the relevant range. (That code might get a bit messy, since > >> the existing THP code tears down PTEs in a detached page table, while > >> mTHP would have to do it in a still-attached page table.) > > Hi Jann! > > > > I was under the impression that this is needed to prevent GUP-fast > > races (and potentially others). Why would you need to touch the PMD entry to prevent GUP-fast races for mTHP? > > As you state here, conceptually the PMD case is, detach the PMD, do > > the collapse, then reinstall the PMD (similarly to how the system > > recovers from a failed PMD collapse). I tried to keep the current > > locking behavior as it seemed the easiest way to get it right (and not > > break anything). So I keep the PMD detaching and reinstalling for the > > mTHP case too. As Hugh points out I am releasing the anon lock too > > early. I will comment further on his response. As I see it, you're not "keeping" the current locking behavior; you're making a big implicit locking change by reusing a codepath designed for PMD THP for mTHP, where the page table may not be exclusively owned by one VMA. 
> > As I familiarize myself with the code more, I do see potential code > > improvements/cleanups and locking improvements, but I was going to > > leave those to a later series. > > Right, the simplest approach on top of the current PMD collapse is to do > exactly what we do in the PMD case, including the locking: which > apparently is no completely the same yet :). > > Instead of installing a PMD THP, we modify the page table and remap that. > > Moving from the PMD lock to the PTE lock will not make a big change in > practice for most cases: we already must disable essentially all page > table walkers (vma lock, mmap lock in write, rmap lock in write). > > The PMDP clear+flush is primarily to disable the last possible set of > page table walkers: (1) HW modifications and (2) GUP-fast. > > So after the PMDP clear+flush we know that (A) HW can not modify the > pages concurrently and (B) GUP-fast cannot succeed anymore. > > The issue with PTEP clear+flush is that we will have to remember all PTE > values, to reset them if anything goes wrong. Using a single PMD value > is arguably simpler. And then, the benefit vs. complexity is unclear. > > Certainly something to look into later, but not a requirement for the > first support, As I understand, one rule we currently have in MM is that an operation that logically operates on one VMA (VMA 1) does not touch the page tables of other VMAs (VMA 2) in any way, except that it may walk page tables that cover address space that intersects with both VMA 1 and VMA 2, and create such page tables if they are missing. This proposed patch changes that, without explicitly discussing this locking change. Just as one example: I think this patch retracts a page table without VMA-locking the relevant address space (we hold a VMA lock on VMA 1, but not on VMA 2), and we then drop the PMD lock after (temporarily) retracting the page table. At that point, I think a racing fault that uses the VMA-locked fastpath can observe the empty PMD, and can install a new page table? Then when collapse_huge_page() tries to re-add the retracted page table, I think we'll get a BUG_ON(). Similar thing with concurrent ftruncate() or such trying to zap PTEs, we can probably end up not zapping PTEs that should have been zapped? > The real challenge/benefit will be looking into avoiding taking all the > heavy weight locks. Dev has already been thinking about that. For mTHP > it might be easier than for THPs. Probably it will involve setting PTE > migration entries whenever we drop the PTL, and dealing with the > possibility of concurrent zapping of these migration entries.
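The VMA-locked fault race described above can be illustrated with a sketch loosely modeled on the anonymous fault path; this is not the actual kernel code, just the shape of the problem. A fault against VMA 2, holding only VMA 2's lock, sees the pmd_none() that the collapse of VMA 1 left behind and installs a fresh page table, so the collapse side later trips over its BUG_ON when it tries to re-install the retracted one.

#include <linux/mm.h>

/* Sketch of the racing fault side; the comment shows the interleaving. */
static vm_fault_t racing_fault_side(struct vm_fault *vmf)
{
        /*
         * khugepaged (VMA 1):                  fault on VMA 2 (VMA lock only):
         *   pmdp_collapse_flush(pmd)
         *   spin_unlock(pmd_ptl)
         *                                        sees pmd_none(*vmf->pmd)
         *                                        pte_alloc() installs new table
         *   pmd_lock(mm, pmd)
         *   BUG_ON(!pmd_none(*pmd))   <-- fires
         */
        if (pte_alloc(vmf->vma->vm_mm, vmf->pmd))
                return VM_FAULT_OOM;

        /* ... continue with a normal PTE fault ... */
        return 0;
}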
On 02.05.25 14:50, Jann Horn wrote: > On Fri, May 2, 2025 at 8:29 AM David Hildenbrand <david@redhat.com> wrote: >> On 02.05.25 00:29, Nico Pache wrote: >>> On Wed, Apr 30, 2025 at 2:53 PM Jann Horn <jannh@google.com> wrote: >>>> >>>> On Mon, Apr 28, 2025 at 8:12 PM Nico Pache <npache@redhat.com> wrote: >>>>> Introduce the ability for khugepaged to collapse to different mTHP sizes. >>>>> While scanning PMD ranges for potential collapse candidates, keep track >>>>> of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit >>>>> represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If >>>>> mTHPs are enabled we remove the restriction of max_ptes_none during the >>>>> scan phase so we dont bailout early and miss potential mTHP candidates. >>>>> >>>>> After the scan is complete we will perform binary recursion on the >>>>> bitmap to determine which mTHP size would be most efficient to collapse >>>>> to. max_ptes_none will be scaled by the attempted collapse order to >>>>> determine how full a THP must be to be eligible. >>>>> >>>>> If a mTHP collapse is attempted, but contains swapped out, or shared >>>>> pages, we dont perform the collapse. >>>> [...] >>>>> @@ -1208,11 +1211,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address, >>>>> vma_start_write(vma); >>>>> anon_vma_lock_write(vma->anon_vma); >>>>> >>>>> - mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address, >>>>> - address + HPAGE_PMD_SIZE); >>>>> + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address, >>>>> + _address + (PAGE_SIZE << order)); >>>>> mmu_notifier_invalidate_range_start(&range); >>>>> >>>>> pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */ >>>>> + >>>>> /* >>>>> * This removes any huge TLB entry from the CPU so we won't allow >>>>> * huge and small TLB entries for the same virtual address to >>>> >>>> It's not visible in this diff, but we're about to do a >>>> pmdp_collapse_flush() here. pmdp_collapse_flush() tears down the >>>> entire page table, meaning it tears down 2MiB of address space; and it >>>> assumes that the entire page table exclusively corresponds to the >>>> current VMA. >>>> >>>> I think you'll need to ensure that the pmdp_collapse_flush() only >>>> happens for full-size THP, and that mTHP only tears down individual >>>> PTEs in the relevant range. (That code might get a bit messy, since >>>> the existing THP code tears down PTEs in a detached page table, while >>>> mTHP would have to do it in a still-attached page table.) >>> Hi Jann! >>> >>> I was under the impression that this is needed to prevent GUP-fast >>> races (and potentially others). > > Why would you need to touch the PMD entry to prevent GUP-fast races for mTHP? > >>> As you state here, conceptually the PMD case is, detach the PMD, do >>> the collapse, then reinstall the PMD (similarly to how the system >>> recovers from a failed PMD collapse). I tried to keep the current >>> locking behavior as it seemed the easiest way to get it right (and not >>> break anything). So I keep the PMD detaching and reinstalling for the >>> mTHP case too. As Hugh points out I am releasing the anon lock too >>> early. I will comment further on his response. > > As I see it, you're not "keeping" the current locking behavior; you're > making a big implicit locking change by reusing a codepath designed > for PMD THP for mTHP, where the page table may not be exclusively > owned by one VMA. That is not the intention. 
The intention in this series (at least as we discussed) was to not do it across VMAs; that is considered the next logical step (which will be especially relevant on arm64 IMHO). > >>> As I familiarize myself with the code more, I do see potential code >>> improvements/cleanups and locking improvements, but I was going to >>> leave those to a later series. >> >> Right, the simplest approach on top of the current PMD collapse is to do >> exactly what we do in the PMD case, including the locking: which >> apparently is no completely the same yet :). >> >> Instead of installing a PMD THP, we modify the page table and remap that. >> >> Moving from the PMD lock to the PTE lock will not make a big change in >> practice for most cases: we already must disable essentially all page >> table walkers (vma lock, mmap lock in write, rmap lock in write). >> >> The PMDP clear+flush is primarily to disable the last possible set of >> page table walkers: (1) HW modifications and (2) GUP-fast. >> >> So after the PMDP clear+flush we know that (A) HW can not modify the >> pages concurrently and (B) GUP-fast cannot succeed anymore. >> >> The issue with PTEP clear+flush is that we will have to remember all PTE >> values, to reset them if anything goes wrong. Using a single PMD value >> is arguably simpler. And then, the benefit vs. complexity is unclear. >> >> Certainly something to look into later, but not a requirement for the >> first support, > > As I understand, one rule we currently have in MM is that an operation > that logically operates on one VMA (VMA 1) does not touch the page > tables of other VMAs (VMA 2) in any way, except that it may walk page > tables that cover address space that intersects with both VMA 1 and > VMA 2, and create such page tables if they are missing. Yes, absolutely. That must not happen. And I think I raised it as a problem in reply to one of Dev's series. If this series does not rely on that it must be fixed. > > This proposed patch changes that, without explicitly discussing this > locking change. Yes, that must not happen. We must not zap a PMD to temporarily replace it with a pmd_none() entry if any other sane page table walker could stumble over it. This includes another VMA that is not write-locked that could span the PMD. -- Cheers, David / dhildenb
On Fri, May 2, 2025 at 5:19 PM David Hildenbrand <david@redhat.com> wrote: > > On 02.05.25 14:50, Jann Horn wrote: > > On Fri, May 2, 2025 at 8:29 AM David Hildenbrand <david@redhat.com> wrote: > >> On 02.05.25 00:29, Nico Pache wrote: > >>> On Wed, Apr 30, 2025 at 2:53 PM Jann Horn <jannh@google.com> wrote: > >>>> > >>>> On Mon, Apr 28, 2025 at 8:12 PM Nico Pache <npache@redhat.com> wrote: > >>>>> Introduce the ability for khugepaged to collapse to different mTHP sizes. > >>>>> While scanning PMD ranges for potential collapse candidates, keep track > >>>>> of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit > >>>>> represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If > >>>>> mTHPs are enabled we remove the restriction of max_ptes_none during the > >>>>> scan phase so we dont bailout early and miss potential mTHP candidates. > >>>>> > >>>>> After the scan is complete we will perform binary recursion on the > >>>>> bitmap to determine which mTHP size would be most efficient to collapse > >>>>> to. max_ptes_none will be scaled by the attempted collapse order to > >>>>> determine how full a THP must be to be eligible. > >>>>> > >>>>> If a mTHP collapse is attempted, but contains swapped out, or shared > >>>>> pages, we dont perform the collapse. > >>>> [...] > >>>>> @@ -1208,11 +1211,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address, > >>>>> vma_start_write(vma); > >>>>> anon_vma_lock_write(vma->anon_vma); > >>>>> > >>>>> - mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address, > >>>>> - address + HPAGE_PMD_SIZE); > >>>>> + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address, > >>>>> + _address + (PAGE_SIZE << order)); > >>>>> mmu_notifier_invalidate_range_start(&range); > >>>>> > >>>>> pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */ > >>>>> + > >>>>> /* > >>>>> * This removes any huge TLB entry from the CPU so we won't allow > >>>>> * huge and small TLB entries for the same virtual address to > >>>> > >>>> It's not visible in this diff, but we're about to do a > >>>> pmdp_collapse_flush() here. pmdp_collapse_flush() tears down the > >>>> entire page table, meaning it tears down 2MiB of address space; and it > >>>> assumes that the entire page table exclusively corresponds to the > >>>> current VMA. > >>>> > >>>> I think you'll need to ensure that the pmdp_collapse_flush() only > >>>> happens for full-size THP, and that mTHP only tears down individual > >>>> PTEs in the relevant range. (That code might get a bit messy, since > >>>> the existing THP code tears down PTEs in a detached page table, while > >>>> mTHP would have to do it in a still-attached page table.) > >>> Hi Jann! > >>> > >>> I was under the impression that this is needed to prevent GUP-fast > >>> races (and potentially others). > > > > Why would you need to touch the PMD entry to prevent GUP-fast races for mTHP? > > > >>> As you state here, conceptually the PMD case is, detach the PMD, do > >>> the collapse, then reinstall the PMD (similarly to how the system > >>> recovers from a failed PMD collapse). I tried to keep the current > >>> locking behavior as it seemed the easiest way to get it right (and not > >>> break anything). So I keep the PMD detaching and reinstalling for the > >>> mTHP case too. As Hugh points out I am releasing the anon lock too > >>> early. I will comment further on his response. 
> > > > As I see it, you're not "keeping" the current locking behavior; you're > > making a big implicit locking change by reusing a codepath designed > > for PMD THP for mTHP, where the page table may not be exclusively > > owned by one VMA. > > That is not the intention. The intention in this series (at least as we > discussed) was to not do it across VMAs; that is considered the next > logical step (which will be especially relevant on arm64 IMHO). Ah, so for now this is supposed to only work for PTEs which are in a PMD which is fully covered by the VMA? So if I make a 16KiB VMA and then try to collapse its contents to an order-2 mTHP page, that should just not work?
On Fri, May 2, 2025 at 9:27 AM Jann Horn <jannh@google.com> wrote: > > On Fri, May 2, 2025 at 5:19 PM David Hildenbrand <david@redhat.com> wrote: > > > > On 02.05.25 14:50, Jann Horn wrote: > > > On Fri, May 2, 2025 at 8:29 AM David Hildenbrand <david@redhat.com> wrote: > > >> On 02.05.25 00:29, Nico Pache wrote: > > >>> On Wed, Apr 30, 2025 at 2:53 PM Jann Horn <jannh@google.com> wrote: > > >>>> > > >>>> On Mon, Apr 28, 2025 at 8:12 PM Nico Pache <npache@redhat.com> wrote: > > >>>>> Introduce the ability for khugepaged to collapse to different mTHP sizes. > > >>>>> While scanning PMD ranges for potential collapse candidates, keep track > > >>>>> of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit > > >>>>> represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If > > >>>>> mTHPs are enabled we remove the restriction of max_ptes_none during the > > >>>>> scan phase so we dont bailout early and miss potential mTHP candidates. > > >>>>> > > >>>>> After the scan is complete we will perform binary recursion on the > > >>>>> bitmap to determine which mTHP size would be most efficient to collapse > > >>>>> to. max_ptes_none will be scaled by the attempted collapse order to > > >>>>> determine how full a THP must be to be eligible. > > >>>>> > > >>>>> If a mTHP collapse is attempted, but contains swapped out, or shared > > >>>>> pages, we dont perform the collapse. > > >>>> [...] > > >>>>> @@ -1208,11 +1211,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address, > > >>>>> vma_start_write(vma); > > >>>>> anon_vma_lock_write(vma->anon_vma); > > >>>>> > > >>>>> - mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address, > > >>>>> - address + HPAGE_PMD_SIZE); > > >>>>> + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address, > > >>>>> + _address + (PAGE_SIZE << order)); > > >>>>> mmu_notifier_invalidate_range_start(&range); > > >>>>> > > >>>>> pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */ > > >>>>> + > > >>>>> /* > > >>>>> * This removes any huge TLB entry from the CPU so we won't allow > > >>>>> * huge and small TLB entries for the same virtual address to > > >>>> > > >>>> It's not visible in this diff, but we're about to do a > > >>>> pmdp_collapse_flush() here. pmdp_collapse_flush() tears down the > > >>>> entire page table, meaning it tears down 2MiB of address space; and it > > >>>> assumes that the entire page table exclusively corresponds to the > > >>>> current VMA. > > >>>> > > >>>> I think you'll need to ensure that the pmdp_collapse_flush() only > > >>>> happens for full-size THP, and that mTHP only tears down individual > > >>>> PTEs in the relevant range. (That code might get a bit messy, since > > >>>> the existing THP code tears down PTEs in a detached page table, while > > >>>> mTHP would have to do it in a still-attached page table.) > > >>> Hi Jann! > > >>> > > >>> I was under the impression that this is needed to prevent GUP-fast > > >>> races (and potentially others). > > > > > > Why would you need to touch the PMD entry to prevent GUP-fast races for mTHP? > > > > > >>> As you state here, conceptually the PMD case is, detach the PMD, do > > >>> the collapse, then reinstall the PMD (similarly to how the system > > >>> recovers from a failed PMD collapse). I tried to keep the current > > >>> locking behavior as it seemed the easiest way to get it right (and not > > >>> break anything). So I keep the PMD detaching and reinstalling for the > > >>> mTHP case too. 
As Hugh points out I am releasing the anon lock too > > >>> early. I will comment further on his response. > > > > > > As I see it, you're not "keeping" the current locking behavior; you're > > > making a big implicit locking change by reusing a codepath designed > > > for PMD THP for mTHP, where the page table may not be exclusively > > > owned by one VMA. > > > > That is not the intention. The intention in this series (at least as we > > discussed) was to not do it across VMAs; that is considered the next > > logical step (which will be especially relevant on arm64 IMHO). > > Ah, so for now this is supposed to only work for PTEs which are in a > PMD which is fully covered by the VMA? So if I make a 16KiB VMA and > then try to collapse its contents to an order-2 mTHP page, that should > just not work? Correct! As I stated in reply to Hugh, the locking conditions explode if we drop that requirement. A simple workaround we've considered is only collapsing if a single VMA intersects a PMD. I can make sure this is clearer in the cover letter + this patch. Cheers, -- Nico
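A minimal version of the guard Nico describes could look like the following; the helper name is invented for illustration and the series may implement the check differently. The mTHP path is refused unless the candidate VMA covers the whole PMD-aligned region on its own, which also guarantees no other VMA intersects it, since VMAs never overlap.

#include <linux/mm.h>
#include <linux/huge_mm.h>

/* Hypothetical guard: pmdp_collapse_flush() must never touch another VMA. */
static bool single_vma_spans_pmd(struct vm_area_struct *vma, unsigned long haddr)
{
        unsigned long hend = haddr + HPAGE_PMD_SIZE;

        /* haddr is assumed PMD-aligned; the VMA must cover [haddr, hend). */
        return vma->vm_start <= haddr && vma->vm_end >= hend;
}

With such a check, the 16KiB-VMA example above simply falls out of the mTHP collapse path instead of relying on the surrounding page table being exclusively owned.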
On 02.05.25 17:30, Nico Pache wrote: > On Fri, May 2, 2025 at 9:27 AM Jann Horn <jannh@google.com> wrote: >> >> On Fri, May 2, 2025 at 5:19 PM David Hildenbrand <david@redhat.com> wrote: >>> >>> On 02.05.25 14:50, Jann Horn wrote: >>>> On Fri, May 2, 2025 at 8:29 AM David Hildenbrand <david@redhat.com> wrote: >>>>> On 02.05.25 00:29, Nico Pache wrote: >>>>>> On Wed, Apr 30, 2025 at 2:53 PM Jann Horn <jannh@google.com> wrote: >>>>>>> >>>>>>> On Mon, Apr 28, 2025 at 8:12 PM Nico Pache <npache@redhat.com> wrote: >>>>>>>> Introduce the ability for khugepaged to collapse to different mTHP sizes. >>>>>>>> While scanning PMD ranges for potential collapse candidates, keep track >>>>>>>> of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit >>>>>>>> represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If >>>>>>>> mTHPs are enabled we remove the restriction of max_ptes_none during the >>>>>>>> scan phase so we dont bailout early and miss potential mTHP candidates. >>>>>>>> >>>>>>>> After the scan is complete we will perform binary recursion on the >>>>>>>> bitmap to determine which mTHP size would be most efficient to collapse >>>>>>>> to. max_ptes_none will be scaled by the attempted collapse order to >>>>>>>> determine how full a THP must be to be eligible. >>>>>>>> >>>>>>>> If a mTHP collapse is attempted, but contains swapped out, or shared >>>>>>>> pages, we dont perform the collapse. >>>>>>> [...] >>>>>>>> @@ -1208,11 +1211,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address, >>>>>>>> vma_start_write(vma); >>>>>>>> anon_vma_lock_write(vma->anon_vma); >>>>>>>> >>>>>>>> - mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address, >>>>>>>> - address + HPAGE_PMD_SIZE); >>>>>>>> + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address, >>>>>>>> + _address + (PAGE_SIZE << order)); >>>>>>>> mmu_notifier_invalidate_range_start(&range); >>>>>>>> >>>>>>>> pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */ >>>>>>>> + >>>>>>>> /* >>>>>>>> * This removes any huge TLB entry from the CPU so we won't allow >>>>>>>> * huge and small TLB entries for the same virtual address to >>>>>>> >>>>>>> It's not visible in this diff, but we're about to do a >>>>>>> pmdp_collapse_flush() here. pmdp_collapse_flush() tears down the >>>>>>> entire page table, meaning it tears down 2MiB of address space; and it >>>>>>> assumes that the entire page table exclusively corresponds to the >>>>>>> current VMA. >>>>>>> >>>>>>> I think you'll need to ensure that the pmdp_collapse_flush() only >>>>>>> happens for full-size THP, and that mTHP only tears down individual >>>>>>> PTEs in the relevant range. (That code might get a bit messy, since >>>>>>> the existing THP code tears down PTEs in a detached page table, while >>>>>>> mTHP would have to do it in a still-attached page table.) >>>>>> Hi Jann! >>>>>> >>>>>> I was under the impression that this is needed to prevent GUP-fast >>>>>> races (and potentially others). >>>> >>>> Why would you need to touch the PMD entry to prevent GUP-fast races for mTHP? >>>> >>>>>> As you state here, conceptually the PMD case is, detach the PMD, do >>>>>> the collapse, then reinstall the PMD (similarly to how the system >>>>>> recovers from a failed PMD collapse). I tried to keep the current >>>>>> locking behavior as it seemed the easiest way to get it right (and not >>>>>> break anything). So I keep the PMD detaching and reinstalling for the >>>>>> mTHP case too. 
As Hugh points out I am releasing the anon lock too >>>>>> early. I will comment further on his response. >>>> >>>> As I see it, you're not "keeping" the current locking behavior; you're >>>> making a big implicit locking change by reusing a codepath designed >>>> for PMD THP for mTHP, where the page table may not be exclusively >>>> owned by one VMA. >>> >>> That is not the intention. The intention in this series (at least as we >>> discussed) was to not do it across VMAs; that is considered the next >>> logical step (which will be especially relevant on arm64 IMHO). >> >> Ah, so for now this is supposed to only work for PTEs which are in a >> PMD which is fully covered by the VMA? So if I make a 16KiB VMA and >> then try to collapse its contents to an order-2 mTHP page, that should >> just not work? > Correct! As I started in reply to Hugh, the locking conditions explode > if we drop that requirement. Right. Adding to that, one could evaluate how much we would gain by trying to lock, say, up to $MAGIC_NUMBER related VMAs. Of course, if no other VMA spans the PMD, and the VMA only covers it partially, it is likely still fine as long as we hold the mmap lock in write mode. But probably, looking into a different locking scheme would be beneficial at this point. -- Cheers, David / dhildenb
On Fri, May 02, 2025 at 05:18:54PM +0200, David Hildenbrand wrote: > On 02.05.25 14:50, Jann Horn wrote: > > On Fri, May 2, 2025 at 8:29 AM David Hildenbrand <david@redhat.com> wrote: > > > On 02.05.25 00:29, Nico Pache wrote: > > > > On Wed, Apr 30, 2025 at 2:53 PM Jann Horn <jannh@google.com> wrote: > > > > > > > > > > On Mon, Apr 28, 2025 at 8:12 PM Nico Pache <npache@redhat.com> wrote: > > > > > > Introduce the ability for khugepaged to collapse to different mTHP sizes. > > > > > > While scanning PMD ranges for potential collapse candidates, keep track > > > > > > of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit > > > > > > represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If > > > > > > mTHPs are enabled we remove the restriction of max_ptes_none during the > > > > > > scan phase so we dont bailout early and miss potential mTHP candidates. > > > > > > > > > > > > After the scan is complete we will perform binary recursion on the > > > > > > bitmap to determine which mTHP size would be most efficient to collapse > > > > > > to. max_ptes_none will be scaled by the attempted collapse order to > > > > > > determine how full a THP must be to be eligible. > > > > > > > > > > > > If a mTHP collapse is attempted, but contains swapped out, or shared > > > > > > pages, we dont perform the collapse. > > > > > [...] > > > > > > @@ -1208,11 +1211,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address, > > > > > > vma_start_write(vma); > > > > > > anon_vma_lock_write(vma->anon_vma); > > > > > > > > > > > > - mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address, > > > > > > - address + HPAGE_PMD_SIZE); > > > > > > + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address, > > > > > > + _address + (PAGE_SIZE << order)); > > > > > > mmu_notifier_invalidate_range_start(&range); > > > > > > > > > > > > pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */ > > > > > > + > > > > > > /* > > > > > > * This removes any huge TLB entry from the CPU so we won't allow > > > > > > * huge and small TLB entries for the same virtual address to > > > > > > > > > > It's not visible in this diff, but we're about to do a > > > > > pmdp_collapse_flush() here. pmdp_collapse_flush() tears down the > > > > > entire page table, meaning it tears down 2MiB of address space; and it > > > > > assumes that the entire page table exclusively corresponds to the > > > > > current VMA. > > > > > > > > > > I think you'll need to ensure that the pmdp_collapse_flush() only > > > > > happens for full-size THP, and that mTHP only tears down individual > > > > > PTEs in the relevant range. (That code might get a bit messy, since > > > > > the existing THP code tears down PTEs in a detached page table, while > > > > > mTHP would have to do it in a still-attached page table.) > > > > Hi Jann! > > > > > > > > I was under the impression that this is needed to prevent GUP-fast > > > > races (and potentially others). > > > > Why would you need to touch the PMD entry to prevent GUP-fast races for mTHP? > > > > > > As you state here, conceptually the PMD case is, detach the PMD, do > > > > the collapse, then reinstall the PMD (similarly to how the system > > > > recovers from a failed PMD collapse). I tried to keep the current > > > > locking behavior as it seemed the easiest way to get it right (and not > > > > break anything). So I keep the PMD detaching and reinstalling for the > > > > mTHP case too. 
As Hugh points out I am releasing the anon lock too > > > > early. I will comment further on his response. > > > > As I see it, you're not "keeping" the current locking behavior; you're > > making a big implicit locking change by reusing a codepath designed > > for PMD THP for mTHP, where the page table may not be exclusively > > owned by one VMA. > > That is not the intention. The intention in this series (at least as we > discussed) was to not do it across VMAs; that is considered the next logical > step (which will be especially relevant on arm64 IMHO). > > > > > > > As I familiarize myself with the code more, I do see potential code > > > > improvements/cleanups and locking improvements, but I was going to > > > > leave those to a later series. > > > > > > Right, the simplest approach on top of the current PMD collapse is to do > > > exactly what we do in the PMD case, including the locking: which > > > apparently is no completely the same yet :). > > > > > > Instead of installing a PMD THP, we modify the page table and remap that. > > > > > > Moving from the PMD lock to the PTE lock will not make a big change in > > > practice for most cases: we already must disable essentially all page > > > table walkers (vma lock, mmap lock in write, rmap lock in write). > > > > > > The PMDP clear+flush is primarily to disable the last possible set of > > > page table walkers: (1) HW modifications and (2) GUP-fast. > > > > > > So after the PMDP clear+flush we know that (A) HW can not modify the > > > pages concurrently and (B) GUP-fast cannot succeed anymore. > > > > > > The issue with PTEP clear+flush is that we will have to remember all PTE > > > values, to reset them if anything goes wrong. Using a single PMD value > > > is arguably simpler. And then, the benefit vs. complexity is unclear. > > > > > > Certainly something to look into later, but not a requirement for the > > > first support, > > > > As I understand, one rule we currently have in MM is that an operation > > that logically operates on one VMA (VMA 1) does not touch the page > > tables of other VMAs (VMA 2) in any way, except that it may walk page > > tables that cover address space that intersects with both VMA 1 and > > VMA 2, and create such page tables if they are missing. > > Yes, absolutely. That must not happen. And I think I raised it as a problem > in reply to one of Dev's series. > > If this series does not rely on that it must be fixed. > > > > > This proposed patch changes that, without explicitly discussing this > > locking change. > > Yes, that must not happen. We must not zap a PMD to temporarily replace it > with a pmd_none() entry if any other sane page table walker could stumble > over it. > > This includes another VMA that is not write-locked that could span the PMD. I feel like we should document these restrictions somewhere :) Perhaps in a new page table walker doc, or on the https://origin.kernel.org/doc/html/latest/mm/process_addrs.html page. Which sounds like I'm volunteering myself to do so doesn't it... [adds to todo...] > > -- > Cheers, > > David / dhildenb >
On 02.05.25 17:24, Lorenzo Stoakes wrote: > On Fri, May 02, 2025 at 05:18:54PM +0200, David Hildenbrand wrote: >> On 02.05.25 14:50, Jann Horn wrote: >>> On Fri, May 2, 2025 at 8:29 AM David Hildenbrand <david@redhat.com> wrote: >>>> On 02.05.25 00:29, Nico Pache wrote: >>>>> On Wed, Apr 30, 2025 at 2:53 PM Jann Horn <jannh@google.com> wrote: >>>>>> >>>>>> On Mon, Apr 28, 2025 at 8:12 PM Nico Pache <npache@redhat.com> wrote: >>>>>>> Introduce the ability for khugepaged to collapse to different mTHP sizes. >>>>>>> While scanning PMD ranges for potential collapse candidates, keep track >>>>>>> of pages in KHUGEPAGED_MIN_MTHP_ORDER chunks via a bitmap. Each bit >>>>>>> represents a utilized region of order KHUGEPAGED_MIN_MTHP_ORDER ptes. If >>>>>>> mTHPs are enabled we remove the restriction of max_ptes_none during the >>>>>>> scan phase so we dont bailout early and miss potential mTHP candidates. >>>>>>> >>>>>>> After the scan is complete we will perform binary recursion on the >>>>>>> bitmap to determine which mTHP size would be most efficient to collapse >>>>>>> to. max_ptes_none will be scaled by the attempted collapse order to >>>>>>> determine how full a THP must be to be eligible. >>>>>>> >>>>>>> If a mTHP collapse is attempted, but contains swapped out, or shared >>>>>>> pages, we dont perform the collapse. >>>>>> [...] >>>>>>> @@ -1208,11 +1211,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address, >>>>>>> vma_start_write(vma); >>>>>>> anon_vma_lock_write(vma->anon_vma); >>>>>>> >>>>>>> - mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address, >>>>>>> - address + HPAGE_PMD_SIZE); >>>>>>> + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, _address, >>>>>>> + _address + (PAGE_SIZE << order)); >>>>>>> mmu_notifier_invalidate_range_start(&range); >>>>>>> >>>>>>> pmd_ptl = pmd_lock(mm, pmd); /* probably unnecessary */ >>>>>>> + >>>>>>> /* >>>>>>> * This removes any huge TLB entry from the CPU so we won't allow >>>>>>> * huge and small TLB entries for the same virtual address to >>>>>> >>>>>> It's not visible in this diff, but we're about to do a >>>>>> pmdp_collapse_flush() here. pmdp_collapse_flush() tears down the >>>>>> entire page table, meaning it tears down 2MiB of address space; and it >>>>>> assumes that the entire page table exclusively corresponds to the >>>>>> current VMA. >>>>>> >>>>>> I think you'll need to ensure that the pmdp_collapse_flush() only >>>>>> happens for full-size THP, and that mTHP only tears down individual >>>>>> PTEs in the relevant range. (That code might get a bit messy, since >>>>>> the existing THP code tears down PTEs in a detached page table, while >>>>>> mTHP would have to do it in a still-attached page table.) >>>>> Hi Jann! >>>>> >>>>> I was under the impression that this is needed to prevent GUP-fast >>>>> races (and potentially others). >>> >>> Why would you need to touch the PMD entry to prevent GUP-fast races for mTHP? >>> >>>>> As you state here, conceptually the PMD case is, detach the PMD, do >>>>> the collapse, then reinstall the PMD (similarly to how the system >>>>> recovers from a failed PMD collapse). I tried to keep the current >>>>> locking behavior as it seemed the easiest way to get it right (and not >>>>> break anything). So I keep the PMD detaching and reinstalling for the >>>>> mTHP case too. As Hugh points out I am releasing the anon lock too >>>>> early. I will comment further on his response. 
>>> >>> As I see it, you're not "keeping" the current locking behavior; you're >>> making a big implicit locking change by reusing a codepath designed >>> for PMD THP for mTHP, where the page table may not be exclusively >>> owned by one VMA. >> >> That is not the intention. The intention in this series (at least as we >> discussed) was to not do it across VMAs; that is considered the next logical >> step (which will be especially relevant on arm64 IMHO). >> >>> >>>>> As I familiarize myself with the code more, I do see potential code >>>>> improvements/cleanups and locking improvements, but I was going to >>>>> leave those to a later series. >>>> >>>> Right, the simplest approach on top of the current PMD collapse is to do >>>> exactly what we do in the PMD case, including the locking: which >>>> apparently is no completely the same yet :). >>>> >>>> Instead of installing a PMD THP, we modify the page table and remap that. >>>> >>>> Moving from the PMD lock to the PTE lock will not make a big change in >>>> practice for most cases: we already must disable essentially all page >>>> table walkers (vma lock, mmap lock in write, rmap lock in write). >>>> >>>> The PMDP clear+flush is primarily to disable the last possible set of >>>> page table walkers: (1) HW modifications and (2) GUP-fast. >>>> >>>> So after the PMDP clear+flush we know that (A) HW can not modify the >>>> pages concurrently and (B) GUP-fast cannot succeed anymore. >>>> >>>> The issue with PTEP clear+flush is that we will have to remember all PTE >>>> values, to reset them if anything goes wrong. Using a single PMD value >>>> is arguably simpler. And then, the benefit vs. complexity is unclear. >>>> >>>> Certainly something to look into later, but not a requirement for the >>>> first support, >>> >>> As I understand, one rule we currently have in MM is that an operation >>> that logically operates on one VMA (VMA 1) does not touch the page >>> tables of other VMAs (VMA 2) in any way, except that it may walk page >>> tables that cover address space that intersects with both VMA 1 and >>> VMA 2, and create such page tables if they are missing. >> >> Yes, absolutely. That must not happen. And I think I raised it as a problem >> in reply to one of Dev's series. >> >> If this series does not rely on that it must be fixed. >> >>> >>> This proposed patch changes that, without explicitly discussing this >>> locking change. >> >> Yes, that must not happen. We must not zap a PMD to temporarily replace it >> with a pmd_none() entry if any other sane page table walker could stumble >> over it. >> >> This includes another VMA that is not write-locked that could span the PMD. > > I feel like we should document these restrictions somewhere :) > > Perhaps in a new page table walker doc, or on the > https://origin.kernel.org/doc/html/latest/mm/process_addrs.html page. > > Which sounds like I'm volunteering myself to do so doesn't it... > > [adds to todo...] :) that would be nice. Yeah, I mean this is very subtle, but essentially: unless you exclude all page table walkers (well, okay, HW and GUP-fast are a bit special), temporarily replacing a present PTE by pte_none() will cause trouble. Same for PMDs. That's also one of the problems of looking into only using the PTE table lock and not any other heavy-weight locking in khugepaged. 
As soon as we temporarily unlock the PTE table, but have to temporarily unmap the PTEs (->pte_none()) as well, we need something else (e.g., migration entry) to tell other page table walkers to back off and wait for us to re-take the PTL lock and finish. Zapping must still be able to continue, at which point it gets hairy ... ... which is why I'm hoping that we can play with that once we have the basics running. With the intent of reusing the existing locking scheme -> single VMA spans the PMD. -- Cheers, David / dhildenb
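As a very rough sketch of that future direction (none of this is in the patch, the helper name is made up, and real code would also have to handle the folio lock, anon-exclusive and soft-dirty bits, and the TLB flush): before dropping the PTL mid-collapse, the present PTEs could be replaced with migration entries so that other walkers wait in migration_entry_wait() instead of seeing pte_none(), while zapping can still clear the entries.

#include <linux/mm.h>
#include <linux/swapops.h>

/* Hypothetical: park the range behind migration entries before dropping PTL. */
static void sketch_install_migration_entries(struct vm_area_struct *vma,
                                             unsigned long addr, pte_t *pte,
                                             int nr_pages)
{
        struct mm_struct *mm = vma->vm_mm;
        int i;

        for (i = 0; i < nr_pages; i++, addr += PAGE_SIZE) {
                pte_t old = ptep_get_and_clear(mm, addr, pte + i);
                swp_entry_t entry;

                if (!pte_present(old))
                        continue;
                /* Walkers that hit this entry back off and wait for us. */
                entry = make_readable_migration_entry(pte_pfn(old));
                set_pte_at(mm, addr, pte + i, swp_entry_to_pte(entry));
        }
}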