[PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()

Hugh Dickins posted 13 patches 2 years, 7 months ago
Posted by Hugh Dickins 2 years, 7 months ago
Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().
It does need mmap_read_lock(), but it does not need mmap_write_lock(),
nor vma_start_write() nor i_mmap lock nor anon_vma lock.  All racing
paths are relying on pte_offset_map_lock() and pmd_lock(), so use those.

Follow the pattern in retract_page_tables(); and using pte_free_defer()
removes most of the need for tlb_remove_table_sync_one() here; but call
pmdp_get_lockless_sync() to use it in the PAE case.

First check the VMA, in case page tables are being torn down: from JannH.
Confirm the preliminary find_pmd_or_thp_or_none() once page lock has been
acquired and the page looks suitable: from then on its state is stable.

However, collapse_pte_mapped_thp() was doing something others don't:
freeing a page table still containing "valid" entries.  i_mmap lock did
stop a racing truncate from double-freeing those pages, but we prefer
collapse_pte_mapped_thp() to clear the entries as usual.  Their TLB
flush can wait until the pmdp_collapse_flush() which follows, but the
mmu_notifier_invalidate_range_start() has to be done earlier.

Do the "step 1" checking loop without mmu_notifier: it wouldn't be good
for khugepaged to keep on repeatedly invalidating a range which is then
found unsuitable e.g. contains COWs.  "step 2", which does the clearing,
must then be more careful (after dropping ptl to do mmu_notifier), with
abort prepared to correct the accounting like "step 3".  But with those
entries now cleared, "step 4" (after dropping ptl to do pmd_lock) is kept
safe by the huge page lock, which stops new PTEs from being faulted in.
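
As an editorial summary of the diff below (a sketch, not actual kernel
code), the locking and notifier ordering that collapse_pte_mapped_thp()
ends up with is roughly:

	/* step 1: check PTEs under pte_offset_map_lock() only */
	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
	/* ... every present PTE must map the expected subpage of hpage ... */
	pte_unmap_unlock(start_pte, ptl);

	/* notifier starts only once the checks have passed */
	mmu_notifier_invalidate_range_start(&range);

	/* step 2: retake ptl, re-check, clear PTEs (TLB flush deferred) */
	/* step 3: page_ref_sub() and add_mm_counter() by nr_ptes cleared */

	/* step 4: pmd_lock(), nested ptl, then collapse the pmd */
	pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd);
	mmu_notifier_invalidate_range_end(&range);
	pte_free_defer(mm, pmd_pgtable(pgt_pmd));	/* not pte_free() */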

Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/khugepaged.c | 172 ++++++++++++++++++++++----------------------------
 1 file changed, 77 insertions(+), 95 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 3bb05147961b..46986eb4eebb 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1483,7 +1483,7 @@ static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
 	return ret;
 }
 
-/* hpage must be locked, and mmap_lock must be held in write */
+/* hpage must be locked, and mmap_lock must be held */
 static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
 			pmd_t *pmdp, struct page *hpage)
 {
@@ -1495,7 +1495,7 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
 	};
 
 	VM_BUG_ON(!PageTransHuge(hpage));
-	mmap_assert_write_locked(vma->vm_mm);
+	mmap_assert_locked(vma->vm_mm);
 
 	if (do_set_pmd(&vmf, hpage))
 		return SCAN_FAIL;
@@ -1504,48 +1504,6 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
 	return SCAN_SUCCEED;
 }
 
-/*
- * A note about locking:
- * Trying to take the page table spinlocks would be useless here because those
- * are only used to synchronize:
- *
- *  - modifying terminal entries (ones that point to a data page, not to another
- *    page table)
- *  - installing *new* non-terminal entries
- *
- * Instead, we need roughly the same kind of protection as free_pgtables() or
- * mm_take_all_locks() (but only for a single VMA):
- * The mmap lock together with this VMA's rmap locks covers all paths towards
- * the page table entries we're messing with here, except for hardware page
- * table walks and lockless_pages_from_mm().
- */
-static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
-				  unsigned long addr, pmd_t *pmdp)
-{
-	pmd_t pmd;
-	struct mmu_notifier_range range;
-
-	mmap_assert_write_locked(mm);
-	if (vma->vm_file)
-		lockdep_assert_held_write(&vma->vm_file->f_mapping->i_mmap_rwsem);
-	/*
-	 * All anon_vmas attached to the VMA have the same root and are
-	 * therefore locked by the same lock.
-	 */
-	if (vma->anon_vma)
-		lockdep_assert_held_write(&vma->anon_vma->root->rwsem);
-
-	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, addr,
-				addr + HPAGE_PMD_SIZE);
-	mmu_notifier_invalidate_range_start(&range);
-	pmd = pmdp_collapse_flush(vma, addr, pmdp);
-	tlb_remove_table_sync_one();
-	mmu_notifier_invalidate_range_end(&range);
-	mm_dec_nr_ptes(mm);
-	page_table_check_pte_clear_range(mm, addr, pmd);
-	pte_free(mm, pmd_pgtable(pmd));
-}
-
 /**
  * collapse_pte_mapped_thp - Try to collapse a pte-mapped THP for mm at
  * address haddr.
@@ -1561,26 +1519,29 @@ static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *v
 int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 			    bool install_pmd)
 {
+	struct mmu_notifier_range range;
+	bool notified = false;
 	unsigned long haddr = addr & HPAGE_PMD_MASK;
 	struct vm_area_struct *vma = vma_lookup(mm, haddr);
 	struct page *hpage;
 	pte_t *start_pte, *pte;
-	pmd_t *pmd;
-	spinlock_t *ptl;
-	int count = 0, result = SCAN_FAIL;
+	pmd_t *pmd, pgt_pmd;
+	spinlock_t *pml, *ptl;
+	int nr_ptes = 0, result = SCAN_FAIL;
 	int i;
 
-	mmap_assert_write_locked(mm);
+	mmap_assert_locked(mm);
+
+	/* First check VMA found, in case page tables are being torn down */
+	if (!vma || !vma->vm_file ||
+	    !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
+		return SCAN_VMA_CHECK;
 
 	/* Fast check before locking page if already PMD-mapped */
 	result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
 	if (result == SCAN_PMD_MAPPED)
 		return result;
 
-	if (!vma || !vma->vm_file ||
-	    !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
-		return SCAN_VMA_CHECK;
-
 	/*
 	 * If we are here, we've succeeded in replacing all the native pages
 	 * in the page cache with a single hugepage. If a mm were to fault-in
@@ -1610,6 +1571,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 		goto drop_hpage;
 	}
 
+	result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
 	switch (result) {
 	case SCAN_SUCCEED:
 		break;
@@ -1623,27 +1585,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 		goto drop_hpage;
 	}
 
-	/* Lock the vma before taking i_mmap and page table locks */
-	vma_start_write(vma);
-
-	/*
-	 * We need to lock the mapping so that from here on, only GUP-fast and
-	 * hardware page walks can access the parts of the page tables that
-	 * we're operating on.
-	 * See collapse_and_free_pmd().
-	 */
-	i_mmap_lock_write(vma->vm_file->f_mapping);
-
-	/*
-	 * This spinlock should be unnecessary: Nobody else should be accessing
-	 * the page tables under spinlock protection here, only
-	 * lockless_pages_from_mm() and the hardware page walker can access page
-	 * tables while all the high-level locks are held in write mode.
-	 */
 	result = SCAN_FAIL;
 	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
-	if (!start_pte)
-		goto drop_immap;
+	if (!start_pte)		/* mmap_lock + page lock should prevent this */
+		goto drop_hpage;
 
 	/* step 1: check all mapped PTEs are to the right huge page */
 	for (i = 0, addr = haddr, pte = start_pte;
@@ -1670,10 +1615,18 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 		 */
 		if (hpage + i != page)
 			goto abort;
-		count++;
 	}
 
-	/* step 2: adjust rmap */
+	pte_unmap_unlock(start_pte, ptl);
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
+				haddr, haddr + HPAGE_PMD_SIZE);
+	mmu_notifier_invalidate_range_start(&range);
+	notified = true;
+	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
+	if (!start_pte)		/* mmap_lock + page lock should prevent this */
+		goto abort;
+
+	/* step 2: clear page table and adjust rmap */
 	for (i = 0, addr = haddr, pte = start_pte;
 	     i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
 		struct page *page;
@@ -1681,47 +1634,76 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 
 		if (pte_none(ptent))
 			continue;
-		page = vm_normal_page(vma, addr, ptent);
-		if (WARN_ON_ONCE(page && is_zone_device_page(page)))
+		/*
+		 * We dropped ptl after the first scan, to do the mmu_notifier:
+		 * page lock stops more PTEs of the hpage being faulted in, but
+		 * does not stop write faults COWing anon copies from existing
+		 * PTEs; and does not stop those being swapped out or migrated.
+		 */
+		if (!pte_present(ptent)) {
+			result = SCAN_PTE_NON_PRESENT;
 			goto abort;
+		}
+		page = vm_normal_page(vma, addr, ptent);
+		if (hpage + i != page)
+			goto abort;
+
+		/*
+		 * Must clear entry, or a racing truncate may re-remove it.
+		 * TLB flush can be left until pmdp_collapse_flush() does it.
+		 * PTE dirty? Shmem page is already dirty; file is read-only.
+		 */
+		pte_clear(mm, addr, pte);
 		page_remove_rmap(page, vma, false);
+		nr_ptes++;
 	}
 
 	pte_unmap_unlock(start_pte, ptl);
 
 	/* step 3: set proper refcount and mm_counters. */
-	if (count) {
-		page_ref_sub(hpage, count);
-		add_mm_counter(vma->vm_mm, mm_counter_file(hpage), -count);
+	if (nr_ptes) {
+		page_ref_sub(hpage, nr_ptes);
+		add_mm_counter(mm, mm_counter_file(hpage), -nr_ptes);
 	}
 
-	/* step 4: remove pte entries */
-	/* we make no change to anon, but protect concurrent anon page lookup */
-	if (vma->anon_vma)
-		anon_vma_lock_write(vma->anon_vma);
+	/* step 4: remove page table */
 
-	collapse_and_free_pmd(mm, vma, haddr, pmd);
+	/* Huge page lock is still held, so page table must remain empty */
+	pml = pmd_lock(mm, pmd);
+	if (ptl != pml)
+		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+	pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd);
+	pmdp_get_lockless_sync();
+	if (ptl != pml)
+		spin_unlock(ptl);
+	spin_unlock(pml);
 
-	if (vma->anon_vma)
-		anon_vma_unlock_write(vma->anon_vma);
-	i_mmap_unlock_write(vma->vm_file->f_mapping);
+	mmu_notifier_invalidate_range_end(&range);
+
+	mm_dec_nr_ptes(mm);
+	page_table_check_pte_clear_range(mm, haddr, pgt_pmd);
+	pte_free_defer(mm, pmd_pgtable(pgt_pmd));
 
 maybe_install_pmd:
 	/* step 5: install pmd entry */
 	result = install_pmd
 			? set_huge_pmd(vma, haddr, pmd, hpage)
 			: SCAN_SUCCEED;
-
+	goto drop_hpage;
+abort:
+	if (nr_ptes) {
+		flush_tlb_mm(mm);
+		page_ref_sub(hpage, nr_ptes);
+		add_mm_counter(mm, mm_counter_file(hpage), -nr_ptes);
+	}
+	if (start_pte)
+		pte_unmap_unlock(start_pte, ptl);
+	if (notified)
+		mmu_notifier_invalidate_range_end(&range);
 drop_hpage:
 	unlock_page(hpage);
 	put_page(hpage);
 	return result;
-
-abort:
-	pte_unmap_unlock(start_pte, ptl);
-drop_immap:
-	i_mmap_unlock_write(vma->vm_file->f_mapping);
-	goto drop_hpage;
 }
 
 static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_slot)
@@ -2855,9 +2837,9 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
 		case SCAN_PTE_MAPPED_HUGEPAGE:
 			BUG_ON(mmap_locked);
 			BUG_ON(*prev);
-			mmap_write_lock(mm);
+			mmap_read_lock(mm);
 			result = collapse_pte_mapped_thp(mm, addr, true);
-			mmap_write_unlock(mm);
+			mmap_locked = true;
 			goto handle_result;
 		/* Whitelisted set of results where continuing OK */
 		case SCAN_PMD_NULL:
-- 
2.35.3
Re: [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()
Posted by Qi Zheng 2 years, 6 months ago
Hi,

On 2023/7/12 12:42, Hugh Dickins wrote:
> Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().
> It does need mmap_read_lock(), but it does not need mmap_write_lock(),
> nor vma_start_write() nor i_mmap lock nor anon_vma lock.  All racing
> paths are relying on pte_offset_map_lock() and pmd_lock(), so use those.
> 
> Follow the pattern in retract_page_tables(); and using pte_free_defer()
> removes most of the need for tlb_remove_table_sync_one() here; but call
> pmdp_get_lockless_sync() to use it in the PAE case.
> 
> First check the VMA, in case page tables are being torn down: from JannH.
> Confirm the preliminary find_pmd_or_thp_or_none() once page lock has been
> acquired and the page looks suitable: from then on its state is stable.
> 
> However, collapse_pte_mapped_thp() was doing something others don't:
> freeing a page table still containing "valid" entries.  i_mmap lock did
> stop a racing truncate from double-freeing those pages, but we prefer
> collapse_pte_mapped_thp() to clear the entries as usual.  Their TLB
> flush can wait until the pmdp_collapse_flush() which follows, but the
> mmu_notifier_invalidate_range_start() has to be done earlier.
> 
> Do the "step 1" checking loop without mmu_notifier: it wouldn't be good
> for khugepaged to keep on repeatedly invalidating a range which is then
> found unsuitable e.g. contains COWs.  "step 2", which does the clearing,
> must then be more careful (after dropping ptl to do mmu_notifier), with
> abort prepared to correct the accounting like "step 3".  But with those
> entries now cleared, "step 4" (after dropping ptl to do pmd_lock) is kept
> safe by the huge page lock, which stops new PTEs from being faulted in.
> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> ---
>   mm/khugepaged.c | 172 ++++++++++++++++++++++----------------------------
>   1 file changed, 77 insertions(+), 95 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 3bb05147961b..46986eb4eebb 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1483,7 +1483,7 @@ static bool khugepaged_add_pte_mapped_thp(struct mm_struct *mm,
>   	return ret;
>   }
>   
> -/* hpage must be locked, and mmap_lock must be held in write */
> +/* hpage must be locked, and mmap_lock must be held */
>   static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
>   			pmd_t *pmdp, struct page *hpage)
>   {
> @@ -1495,7 +1495,7 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
>   	};
>   
>   	VM_BUG_ON(!PageTransHuge(hpage));
> -	mmap_assert_write_locked(vma->vm_mm);
> +	mmap_assert_locked(vma->vm_mm);
>   
>   	if (do_set_pmd(&vmf, hpage))
>   		return SCAN_FAIL;
> @@ -1504,48 +1504,6 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
>   	return SCAN_SUCCEED;
>   }
>   
> -/*
> - * A note about locking:
> - * Trying to take the page table spinlocks would be useless here because those
> - * are only used to synchronize:
> - *
> - *  - modifying terminal entries (ones that point to a data page, not to another
> - *    page table)
> - *  - installing *new* non-terminal entries
> - *
> - * Instead, we need roughly the same kind of protection as free_pgtables() or
> - * mm_take_all_locks() (but only for a single VMA):
> - * The mmap lock together with this VMA's rmap locks covers all paths towards
> - * the page table entries we're messing with here, except for hardware page
> - * table walks and lockless_pages_from_mm().
> - */
> -static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *vma,
> -				  unsigned long addr, pmd_t *pmdp)
> -{
> -	pmd_t pmd;
> -	struct mmu_notifier_range range;
> -
> -	mmap_assert_write_locked(mm);
> -	if (vma->vm_file)
> -		lockdep_assert_held_write(&vma->vm_file->f_mapping->i_mmap_rwsem);
> -	/*
> -	 * All anon_vmas attached to the VMA have the same root and are
> -	 * therefore locked by the same lock.
> -	 */
> -	if (vma->anon_vma)
> -		lockdep_assert_held_write(&vma->anon_vma->root->rwsem);
> -
> -	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, addr,
> -				addr + HPAGE_PMD_SIZE);
> -	mmu_notifier_invalidate_range_start(&range);
> -	pmd = pmdp_collapse_flush(vma, addr, pmdp);
> -	tlb_remove_table_sync_one();
> -	mmu_notifier_invalidate_range_end(&range);
> -	mm_dec_nr_ptes(mm);
> -	page_table_check_pte_clear_range(mm, addr, pmd);
> -	pte_free(mm, pmd_pgtable(pmd));
> -}
> -
>   /**
>    * collapse_pte_mapped_thp - Try to collapse a pte-mapped THP for mm at
>    * address haddr.
> @@ -1561,26 +1519,29 @@ static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *v
>   int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>   			    bool install_pmd)
>   {
> +	struct mmu_notifier_range range;
> +	bool notified = false;
>   	unsigned long haddr = addr & HPAGE_PMD_MASK;
>   	struct vm_area_struct *vma = vma_lookup(mm, haddr);
>   	struct page *hpage;
>   	pte_t *start_pte, *pte;
> -	pmd_t *pmd;
> -	spinlock_t *ptl;
> -	int count = 0, result = SCAN_FAIL;
> +	pmd_t *pmd, pgt_pmd;
> +	spinlock_t *pml, *ptl;
> +	int nr_ptes = 0, result = SCAN_FAIL;
>   	int i;
>   
> -	mmap_assert_write_locked(mm);
> +	mmap_assert_locked(mm);
> +
> +	/* First check VMA found, in case page tables are being torn down */
> +	if (!vma || !vma->vm_file ||
> +	    !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
> +		return SCAN_VMA_CHECK;
>   
>   	/* Fast check before locking page if already PMD-mapped */
>   	result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
>   	if (result == SCAN_PMD_MAPPED)
>   		return result;
>   
> -	if (!vma || !vma->vm_file ||
> -	    !range_in_vma(vma, haddr, haddr + HPAGE_PMD_SIZE))
> -		return SCAN_VMA_CHECK;
> -
>   	/*
>   	 * If we are here, we've succeeded in replacing all the native pages
>   	 * in the page cache with a single hugepage. If a mm were to fault-in
> @@ -1610,6 +1571,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>   		goto drop_hpage;
>   	}
>   
> +	result = find_pmd_or_thp_or_none(mm, haddr, &pmd);
>   	switch (result) {
>   	case SCAN_SUCCEED:
>   		break;
> @@ -1623,27 +1585,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>   		goto drop_hpage;
>   	}
>   
> -	/* Lock the vma before taking i_mmap and page table locks */
> -	vma_start_write(vma);
> -
> -	/*
> -	 * We need to lock the mapping so that from here on, only GUP-fast and
> -	 * hardware page walks can access the parts of the page tables that
> -	 * we're operating on.
> -	 * See collapse_and_free_pmd().
> -	 */
> -	i_mmap_lock_write(vma->vm_file->f_mapping);
> -
> -	/*
> -	 * This spinlock should be unnecessary: Nobody else should be accessing
> -	 * the page tables under spinlock protection here, only
> -	 * lockless_pages_from_mm() and the hardware page walker can access page
> -	 * tables while all the high-level locks are held in write mode.
> -	 */
>   	result = SCAN_FAIL;
>   	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
> -	if (!start_pte)
> -		goto drop_immap;
> +	if (!start_pte)		/* mmap_lock + page lock should prevent this */
> +		goto drop_hpage;
>   
>   	/* step 1: check all mapped PTEs are to the right huge page */
>   	for (i = 0, addr = haddr, pte = start_pte;
> @@ -1670,10 +1615,18 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>   		 */
>   		if (hpage + i != page)
>   			goto abort;
> -		count++;
>   	}
>   
> -	/* step 2: adjust rmap */
> +	pte_unmap_unlock(start_pte, ptl);
> +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
> +				haddr, haddr + HPAGE_PMD_SIZE);
> +	mmu_notifier_invalidate_range_start(&range);
> +	notified = true;
> +	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
> +	if (!start_pte)		/* mmap_lock + page lock should prevent this */
> +		goto abort;
> +
> +	/* step 2: clear page table and adjust rmap */
>   	for (i = 0, addr = haddr, pte = start_pte;
>   	     i < HPAGE_PMD_NR; i++, addr += PAGE_SIZE, pte++) {
>   		struct page *page;
> @@ -1681,47 +1634,76 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>   
>   		if (pte_none(ptent))
>   			continue;
> -		page = vm_normal_page(vma, addr, ptent);
> -		if (WARN_ON_ONCE(page && is_zone_device_page(page)))
> +		/*
> +		 * We dropped ptl after the first scan, to do the mmu_notifier:
> +		 * page lock stops more PTEs of the hpage being faulted in, but
> +		 * does not stop write faults COWing anon copies from existing
> +		 * PTEs; and does not stop those being swapped out or migrated.
> +		 */
> +		if (!pte_present(ptent)) {
> +			result = SCAN_PTE_NON_PRESENT;
>   			goto abort;
> +		}
> +		page = vm_normal_page(vma, addr, ptent);
> +		if (hpage + i != page)
> +			goto abort;
> +
> +		/*
> +		 * Must clear entry, or a racing truncate may re-remove it.
> +		 * TLB flush can be left until pmdp_collapse_flush() does it.
> +		 * PTE dirty? Shmem page is already dirty; file is read-only.
> +		 */
> +		pte_clear(mm, addr, pte);

This is not a non-present PTE entry, so we should call ptep_clear() to let
page_table_check track the PTE clearing operation, right? Otherwise it
may lead to false positives?

Thanks,
Qi

>   		page_remove_rmap(page, vma, false);
> +		nr_ptes++;
>   	}
>   
>   	pte_unmap_unlock(start_pte, ptl);
>   
>   	/* step 3: set proper refcount and mm_counters. */
> -	if (count) {
> -		page_ref_sub(hpage, count);
> -		add_mm_counter(vma->vm_mm, mm_counter_file(hpage), -count);
> +	if (nr_ptes) {
> +		page_ref_sub(hpage, nr_ptes);
> +		add_mm_counter(mm, mm_counter_file(hpage), -nr_ptes);
>   	}
>   
> -	/* step 4: remove pte entries */
> -	/* we make no change to anon, but protect concurrent anon page lookup */
> -	if (vma->anon_vma)
> -		anon_vma_lock_write(vma->anon_vma);
> +	/* step 4: remove page table */
>   
> -	collapse_and_free_pmd(mm, vma, haddr, pmd);
> +	/* Huge page lock is still held, so page table must remain empty */
> +	pml = pmd_lock(mm, pmd);
> +	if (ptl != pml)
> +		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
> +	pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd);
> +	pmdp_get_lockless_sync();
> +	if (ptl != pml)
> +		spin_unlock(ptl);
> +	spin_unlock(pml);
>   
> -	if (vma->anon_vma)
> -		anon_vma_unlock_write(vma->anon_vma);
> -	i_mmap_unlock_write(vma->vm_file->f_mapping);
> +	mmu_notifier_invalidate_range_end(&range);
> +
> +	mm_dec_nr_ptes(mm);
> +	page_table_check_pte_clear_range(mm, haddr, pgt_pmd);
> +	pte_free_defer(mm, pmd_pgtable(pgt_pmd));
>   
>   maybe_install_pmd:
>   	/* step 5: install pmd entry */
>   	result = install_pmd
>   			? set_huge_pmd(vma, haddr, pmd, hpage)
>   			: SCAN_SUCCEED;
> -
> +	goto drop_hpage;
> +abort:
> +	if (nr_ptes) {
> +		flush_tlb_mm(mm);
> +		page_ref_sub(hpage, nr_ptes);
> +		add_mm_counter(mm, mm_counter_file(hpage), -nr_ptes);
> +	}
> +	if (start_pte)
> +		pte_unmap_unlock(start_pte, ptl);
> +	if (notified)
> +		mmu_notifier_invalidate_range_end(&range);
>   drop_hpage:
>   	unlock_page(hpage);
>   	put_page(hpage);
>   	return result;
> -
> -abort:
> -	pte_unmap_unlock(start_pte, ptl);
> -drop_immap:
> -	i_mmap_unlock_write(vma->vm_file->f_mapping);
> -	goto drop_hpage;
>   }
>   
>   static void khugepaged_collapse_pte_mapped_thps(struct khugepaged_mm_slot *mm_slot)
> @@ -2855,9 +2837,9 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
>   		case SCAN_PTE_MAPPED_HUGEPAGE:
>   			BUG_ON(mmap_locked);
>   			BUG_ON(*prev);
> -			mmap_write_lock(mm);
> +			mmap_read_lock(mm);
>   			result = collapse_pte_mapped_thp(mm, addr, true);
> -			mmap_write_unlock(mm);
> +			mmap_locked = true;
>   			goto handle_result;
>   		/* Whitelisted set of results where continuing OK */
>   		case SCAN_PMD_NULL:
Re: [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()
Posted by Hugh Dickins 2 years, 6 months ago
On Thu, 3 Aug 2023, Qi Zheng wrote:
> On 2023/7/12 12:42, Hugh Dickins wrote:
> > Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().
> > It does need mmap_read_lock(), but it does not need mmap_write_lock(),
> > nor vma_start_write() nor i_mmap lock nor anon_vma lock.  All racing
> > paths are relying on pte_offset_map_lock() and pmd_lock(), so use those.
...
> > @@ -1681,47 +1634,76 @@ int collapse_pte_mapped_thp(struct mm_struct *mm,
> > unsigned long addr,
> >   
> >     if (pte_none(ptent))
> >   			continue;
> > -		page = vm_normal_page(vma, addr, ptent);
> > -		if (WARN_ON_ONCE(page && is_zone_device_page(page)))
> > +		/*
> > +		 * We dropped ptl after the first scan, to do the
> > mmu_notifier:
> > +		 * page lock stops more PTEs of the hpage being faulted in,
> > but
> > +		 * does not stop write faults COWing anon copies from existing
> > +		 * PTEs; and does not stop those being swapped out or
> > migrated.
> > +		 */
> > +		if (!pte_present(ptent)) {
> > +			result = SCAN_PTE_NON_PRESENT;
> >   			goto abort;
> > +		}
> > +		page = vm_normal_page(vma, addr, ptent);
> > +		if (hpage + i != page)
> > +			goto abort;
> > +
> > +		/*
> > +		 * Must clear entry, or a racing truncate may re-remove it.
> > +		 * TLB flush can be left until pmdp_collapse_flush() does it.
> > +		 * PTE dirty? Shmem page is already dirty; file is read-only.
> > +		 */
> > +		pte_clear(mm, addr, pte);
> 
> This is not a non-present PTE entry, so we should call ptep_clear() to let
> page_table_check track the PTE clearing operation, right? Otherwise it
> may lead to false positives?

You are right: thanks a lot for catching that: fix patch follows.

Hugh
Re: [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()
Posted by Qi Zheng 2 years, 6 months ago

On 2023/8/6 11:55, Hugh Dickins wrote:
> On Thu, 3 Aug 2023, Qi Zheng wrote:
>> On 2023/7/12 12:42, Hugh Dickins wrote:
>>> Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().
>>> It does need mmap_read_lock(), but it does not need mmap_write_lock(),
>>> nor vma_start_write() nor i_mmap lock nor anon_vma lock.  All racing
>>> paths are relying on pte_offset_map_lock() and pmd_lock(), so use those.
> ...
>>> @@ -1681,47 +1634,76 @@ int collapse_pte_mapped_thp(struct mm_struct *mm,
>>> unsigned long addr,
>>>    
>>>      if (pte_none(ptent))
>>>    			continue;
>>> -		page = vm_normal_page(vma, addr, ptent);
>>> -		if (WARN_ON_ONCE(page && is_zone_device_page(page)))
>>> +		/*
>>> +		 * We dropped ptl after the first scan, to do the
>>> mmu_notifier:
>>> +		 * page lock stops more PTEs of the hpage being faulted in,
>>> but
>>> +		 * does not stop write faults COWing anon copies from existing
>>> +		 * PTEs; and does not stop those being swapped out or
>>> migrated.
>>> +		 */
>>> +		if (!pte_present(ptent)) {
>>> +			result = SCAN_PTE_NON_PRESENT;
>>>    			goto abort;
>>> +		}
>>> +		page = vm_normal_page(vma, addr, ptent);
>>> +		if (hpage + i != page)
>>> +			goto abort;
>>> +
>>> +		/*
>>> +		 * Must clear entry, or a racing truncate may re-remove it.
>>> +		 * TLB flush can be left until pmdp_collapse_flush() does it.
>>> +		 * PTE dirty? Shmem page is already dirty; file is read-only.
>>> +		 */
>>> +		pte_clear(mm, addr, pte);
>>
>> This is not a non-present PTE entry, so we should call ptep_clear() to let
>> page_table_check track the PTE clearing operation, right? Otherwise it
>> may lead to false positives?
> 
> You are right: thanks a lot for catching that: fix patch follows.

With fix patch:

Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com>

Thanks.

> 
> Hugh
[PATCH v3 10/13 fix2] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock(): fix2
Posted by Hugh Dickins 2 years, 6 months ago
Use ptep_clear() instead of pte_clear(): when CONFIG_PAGE_TABLE_CHECK=y,
ptep_clear() adds some accounting, missing which would cause a BUG later.
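
Conceptually (paraphrased here for context, not the exact pgtable.h
source, whose form and helper arguments vary across kernel versions),
ptep_clear() differs from pte_clear() in also informing the checker:

	pte_t pte = ptep_get(ptep);

	pte_clear(mm, addr, ptep);
	/* with CONFIG_PAGE_TABLE_CHECK=y, additionally something like: */
	page_table_check_pte_clear(mm, addr, pte);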

Signed-off-by: Hugh Dickins <hughd@google.com>
Reported-by: Qi Zheng <zhengqi.arch@bytedance.com>
Closes: https://lore.kernel.org/linux-mm/0df84f9f-e9b0-80b1-4c9e-95abc1a73a96@bytedance.com/
---
 mm/khugepaged.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index bb76a5d454de..78fc1a24a1cc 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1603,7 +1603,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 		 * TLB flush can be left until pmdp_collapse_flush() does it.
 		 * PTE dirty? Shmem page is already dirty; file is read-only.
 		 */
-		pte_clear(mm, addr, pte);
+		ptep_clear(mm, addr, pte);
 		page_remove_rmap(page, vma, false);
 		nr_ptes++;
 	}
-- 
2.35.3
[PATCH v3 10/13 fix] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock(): fix
Posted by Hugh Dickins 2 years, 6 months ago
madvise_collapse() setting "mmap_locked = true" after calling
collapse_pte_mapped_thp() looked good but was wrong.  If the loop then
moves on to the next extent, mmap_locked assures it that "vma" has been
revalidated under mmap_lock, which was not the case: and led to UAFs,
crashes in __fput() or task_work_run(), even collapse_file()'s
VM_BUG_ON(start & (HPAGE_PMD_NR - 1)) - all detected by syzbot.

(collapse_pte_mapped_thp() does validate the vma that it works on:
but it's not passed in as an argument, collapse_pte_mapped_thp() finds
the vma for mm and addr by itself - which may by this time have changed
from the vma saved in madvise_collapse().)

Reported-by: syzbot+fe7b1487405295d29268@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/lkml/000000000000f9de430600ae05db@google.com/
Reported-by: syzbot+173cc8cfdfbbef6dd755@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/linux-mm/000000000000e4b0f0060123ca40@google.com/
Signed-off-by: Hugh Dickins <hughd@google.com>
---
 mm/khugepaged.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 6bad69c0e4bd..1c773db26e88 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2747,7 +2747,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
 			BUG_ON(*prev);
 			mmap_read_lock(mm);
 			result = collapse_pte_mapped_thp(mm, addr, true);
-			mmap_locked = true;
+			mmap_read_unlock(mm);
 			goto handle_result;
 		/* Whitelisted set of results where continuing OK */
 		case SCAN_PMD_NULL:
-- 
2.35.3
[BUG] Re: [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()
Posted by Jann Horn 2 years, 5 months ago
On Wed, Jul 12, 2023 at 6:42 AM Hugh Dickins <hughd@google.com> wrote:
> Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().
> It does need mmap_read_lock(), but it does not need mmap_write_lock(),
> nor vma_start_write() nor i_mmap lock nor anon_vma lock.  All racing
> paths are relying on pte_offset_map_lock() and pmd_lock(), so use those.

We can still have a racing userfaultfd operation at the "/* step 4:
remove page table */" point that installs a new PTE before the page
table is removed.
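
Schematically, the interleaving being described looks like this
(editorial sketch, assuming the racing operation is a UFFDIO_COPY into
the range being collapsed):

	/*
	 * collapse_pte_mapped_thp()            racing userfaultfd thread
	 *
	 * step 2: clears PTEs, drops ptl
	 * step 3: fixes refcount/counters
	 *                                      UFFDIO_COPY takes ptl, installs
	 *                                      a fresh anon PTE in the now
	 *                                      "empty" page table, bumps
	 *                                      MM_ANONPAGES
	 * step 4: pmd_lock(), pmdp_collapse_flush(),
	 *         page table freed with the new PTE
	 *         still in it -> "Bad rss-counter"
	 */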

To reproduce, patch a delay into the kernel like this:


diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 9a6e0d507759..27cc8dfbf3a7 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -20,6 +20,7 @@
 #include <linux/swapops.h>
 #include <linux/shmem_fs.h>
 #include <linux/ksm.h>
+#include <linux/delay.h>

 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
@@ -1617,6 +1618,11 @@ int collapse_pte_mapped_thp(struct mm_struct
*mm, unsigned long addr,
        }

        /* step 4: remove page table */
+       if (strcmp(current->comm, "DELAYME") == 0) {
+               pr_warn("%s: BEGIN DELAY INJECTION\n", __func__);
+               mdelay(5000);
+               pr_warn("%s: END DELAY INJECTION\n", __func__);
+       }

        /* Huge page lock is still held, so page table must remain empty */
        pml = pmd_lock(mm, pmd);


And then run the attached reproducer against mm/mm-everything. You
should get this in dmesg:

[  206.578096] BUG: Bad rss-counter state mm:000000000942ebea
type:MM_ANONPAGES val:1
Re: [BUG] Re: [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()
Posted by Hugh Dickins 2 years, 5 months ago
On Mon, 14 Aug 2023, Jann Horn wrote:
> On Wed, Jul 12, 2023 at 6:42 AM Hugh Dickins <hughd@google.com> wrote:
> > Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().
> > It does need mmap_read_lock(), but it does not need mmap_write_lock(),
> > nor vma_start_write() nor i_mmap lock nor anon_vma lock.  All racing
> > paths are relying on pte_offset_map_lock() and pmd_lock(), so use those.
> 
> We can still have a racing userfaultfd operation at the "/* step 4:
> remove page table */" point that installs a new PTE before the page
> table is removed.

And you've been very polite not to remind me that this is exactly
what you warned me about, in connection with retract_page_tables(),
nearly three months ago:

https://lore.kernel.org/linux-mm/CAG48ez0aF1Rf1apSjn9YcnfyFQ4YqSd4GqB6f2wfhF7jMdi5Hg@mail.gmail.com/

> 
> To reproduce, patch a delay into the kernel like this:
> 
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 9a6e0d507759..27cc8dfbf3a7 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -20,6 +20,7 @@
>  #include <linux/swapops.h>
>  #include <linux/shmem_fs.h>
>  #include <linux/ksm.h>
> +#include <linux/delay.h>
> 
>  #include <asm/tlb.h>
>  #include <asm/pgalloc.h>
> @@ -1617,6 +1618,11 @@ int collapse_pte_mapped_thp(struct mm_struct
> *mm, unsigned long addr,
>         }
> 
>         /* step 4: remove page table */
> +       if (strcmp(current->comm, "DELAYME") == 0) {
> +               pr_warn("%s: BEGIN DELAY INJECTION\n", __func__);
> +               mdelay(5000);
> +               pr_warn("%s: END DELAY INJECTION\n", __func__);
> +       }
> 
>         /* Huge page lock is still held, so page table must remain empty */
>         pml = pmd_lock(mm, pmd);
> 
> 
> And then run the attached reproducer against mm/mm-everything. You
> should get this in dmesg:
> 
> [  206.578096] BUG: Bad rss-counter state mm:000000000942ebea
> type:MM_ANONPAGES val:1

Very helpful, thank you Jann.

I got a bit distracted when I then found mm's recent addition of
UFFDIO_POISON: thought I needed to change both collapse_pte_mapped_thp()
and retract_page_tables() now to cope with mfill_atomic_pte_poison()
inserting into even a userfaultfd_armed shared VMA.

But eventually, on second thoughts, realized that's only inserting a pte
marker, invalid, so won't cause any actual trouble.  A little untidy,
to leave that behind in a supposedly empty page table about to be freed,
but not worth refactoring these functions to avoid a non-bug.

And though syzbot and JH may find some fun with it, I don't think any
real application would be inserting a PTE_MARKER_POISONED where a huge
page collapse is almost complete.

So I scaled back to a more proportionate fix, following.  Sorry, I've
slightly messed up applying the "DELAY INJECTION" patch above: not
intentional, honest!  (mdelay while holding the locks is still good.)

Hugh
Re: [BUG] Re: [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()
Posted by Hugh Dickins 2 years, 5 months ago
On Mon, 14 Aug 2023, Jann Horn wrote:
> On Wed, Jul 12, 2023 at 6:42 AM Hugh Dickins <hughd@google.com> wrote:
> > Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().
> > It does need mmap_read_lock(), but it does not need mmap_write_lock(),
> > nor vma_start_write() nor i_mmap lock nor anon_vma lock.  All racing
> > paths are relying on pte_offset_map_lock() and pmd_lock(), so use those.
> 
> We can still have a racing userfaultfd operation at the "/* step 4:
> remove page table */" point that installs a new PTE before the page
> table is removed.
> 
> To reproduce, patch a delay into the kernel like this:
> 
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 9a6e0d507759..27cc8dfbf3a7 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -20,6 +20,7 @@
>  #include <linux/swapops.h>
>  #include <linux/shmem_fs.h>
>  #include <linux/ksm.h>
> +#include <linux/delay.h>
> 
>  #include <asm/tlb.h>
>  #include <asm/pgalloc.h>
> @@ -1617,6 +1618,11 @@ int collapse_pte_mapped_thp(struct mm_struct
> *mm, unsigned long addr,
>         }
> 
>         /* step 4: remove page table */
> +       if (strcmp(current->comm, "DELAYME") == 0) {
> +               pr_warn("%s: BEGIN DELAY INJECTION\n", __func__);
> +               mdelay(5000);
> +               pr_warn("%s: END DELAY INJECTION\n", __func__);
> +       }
> 
>         /* Huge page lock is still held, so page table must remain empty */
>         pml = pmd_lock(mm, pmd);
> 
> 
> And then run the attached reproducer against mm/mm-everything. You
> should get this in dmesg:
> 
> [  206.578096] BUG: Bad rss-counter state mm:000000000942ebea
> type:MM_ANONPAGES val:1

Thanks a lot, Jann. I haven't thought about it at all yet; and just
tried to reproduce, but haven't yet got the "BUG: Bad rss-counter":
just see "Invalid argument" on the UFFDIO_COPY ioctl.
Will investigate tomorrow.

Hugh
Re: [BUG] Re: [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()
Posted by David Hildenbrand 2 years, 5 months ago
On 15.08.23 08:34, Hugh Dickins wrote:
> On Mon, 14 Aug 2023, Jann Horn wrote:
>> On Wed, Jul 12, 2023 at 6:42 AM Hugh Dickins <hughd@google.com> wrote:
>>> Bring collapse_and_free_pmd() back into collapse_pte_mapped_thp().
>>> It does need mmap_read_lock(), but it does not need mmap_write_lock(),
>>> nor vma_start_write() nor i_mmap lock nor anon_vma lock.  All racing
>>> paths are relying on pte_offset_map_lock() and pmd_lock(), so use those.
>>
>> We can still have a racing userfaultfd operation at the "/* step 4:
>> remove page table */" point that installs a new PTE before the page
>> table is removed.
>>
>> To reproduce, patch a delay into the kernel like this:
>>
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 9a6e0d507759..27cc8dfbf3a7 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -20,6 +20,7 @@
>>   #include <linux/swapops.h>
>>   #include <linux/shmem_fs.h>
>>   #include <linux/ksm.h>
>> +#include <linux/delay.h>
>>
>>   #include <asm/tlb.h>
>>   #include <asm/pgalloc.h>
>> @@ -1617,6 +1618,11 @@ int collapse_pte_mapped_thp(struct mm_struct
>> *mm, unsigned long addr,
>>          }
>>
>>          /* step 4: remove page table */
>> +       if (strcmp(current->comm, "DELAYME") == 0) {
>> +               pr_warn("%s: BEGIN DELAY INJECTION\n", __func__);
>> +               mdelay(5000);
>> +               pr_warn("%s: END DELAY INJECTION\n", __func__);
>> +       }
>>
>>          /* Huge page lock is still held, so page table must remain empty */
>>          pml = pmd_lock(mm, pmd);
>>
>>
>> And then run the attached reproducer against mm/mm-everything. You
>> should get this in dmesg:
>>
>> [  206.578096] BUG: Bad rss-counter state mm:000000000942ebea
>> type:MM_ANONPAGES val:1
> 
> Thanks a lot, Jann. I haven't thought about it at all yet; and just
> tried to reproduce, but haven't yet got the "BUG: Bad rss-counter":
> just see "Invalid argument" on the UFFDIO_COPY ioctl.
> Will investigate tomorrow.

Maybe you're missing a fixup:

https://lkml.kernel.org/r/20230810192128.1855570-1-axelrasmussen@google.com

When the src address is not page aligned, UFFDIO_COPY in mm-unstable 
would erroneously fail.

-- 
Cheers,

David / dhildenb

Re: [BUG] Re: [PATCH v3 10/13] mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()
Posted by Hugh Dickins 2 years, 5 months ago
On Tue, 15 Aug 2023, David Hildenbrand wrote:
> On 15.08.23 08:34, Hugh Dickins wrote:
> > On Mon, 14 Aug 2023, Jann Horn wrote:
> >>
> >>          /* step 4: remove page table */
> >> +       if (strcmp(current->comm, "DELAYME") == 0) {
> >> +               pr_warn("%s: BEGIN DELAY INJECTION\n", __func__);
> >> +               mdelay(5000);
> >> +               pr_warn("%s: END DELAY INJECTION\n", __func__);
> >> +       }
> >>
> >>          /* Huge page lock is still held, so page table must remain empty
> >>          */
> >>          pml = pmd_lock(mm, pmd);
> >>
> >>
> >> And then run the attached reproducer against mm/mm-everything. You
> >> should get this in dmesg:
> >>
> >> [  206.578096] BUG: Bad rss-counter state mm:000000000942ebea
> >> type:MM_ANONPAGES val:1
> > 
> > Thanks a lot, Jann. I haven't thought about it at all yet; and just
> > tried to reproduce, but haven't yet got the "BUG: Bad rss-counter":
> > just see "Invalid argument" on the UFFDIO_COPY ioctl.
> > Will investigate tomorrow.
> 
> Maybe you're missing a fixup:
> 
> https://lkml.kernel.org/r/20230810192128.1855570-1-axelrasmussen@google.com
> 
> When the src address is not page aligned, UFFDIO_COPY in mm-unstable would
> erroneously fail.

You got it, many thanks David: I had assumed that my next-20230808 tree
would be up-to-date enough, but it wasn't.  Reproduced now.

Hugh