[PATCH mm-unstable v4 5/5] mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd()

Posted by Nico Pache 1 week, 1 day ago
The khugepaged daemon and madvise_collapse have two different
implementations that do almost the same thing. Create collapse_single_pmd()
to increase code reuse and provide a single entry point for these two
users.

Refactor madvise_collapse and collapse_scan_mm_slot to use the new
collapse_single_pmd function. To help reduce confusion around the
mmap_locked variable, we rename mmap_locked to lock_dropped in the
collapse_scan_mm_slot() function, and remove the redundant mmap_locked
in madvise_collapse(); this further improves readability. The
SCAN_PTE_MAPPED_HUGEPAGE enum is no longer reachable in the
madvise_collapse() function, so we drop it from the list of "continuing"
enums.

This introduces a minor behavioral change that most likely fixes an
undiscovered bug. The current implementation of khugepaged tests
collapse_test_exit_or_disable() before calling collapse_pte_mapped_thp(),
but we weren't doing the same in the madvise_collapse case. By unifying
these two callers, madvise_collapse now also performs this check. We also
change the return value to SCAN_ANY_PROCESS, which properly indicates that
this process is no longer valid to operate on.

By moving the madvise_collapse writeback-retry logic into the helper
function, we also avoid having to revalidate the VMA.

We guard the khugepaged_pages_collapsed counter to ensure it is only
incremented for khugepaged.

As requested, we also convert a VM_BUG_ON to a VM_WARN_ON_ONCE.

Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 142 ++++++++++++++++++++++++------------------------
 1 file changed, 72 insertions(+), 70 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 3728a2cf133c..d06d84219e1b 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1257,7 +1257,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 
 static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
 		struct vm_area_struct *vma, unsigned long start_addr,
-		bool *mmap_locked, struct collapse_control *cc)
+		bool *lock_dropped, struct collapse_control *cc)
 {
 	pmd_t *pmd;
 	pte_t *pte, *_pte;
@@ -1432,7 +1432,7 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
 		result = collapse_huge_page(mm, start_addr, referenced,
 					    unmapped, cc);
 		/* collapse_huge_page will return with the mmap_lock released */
-		*mmap_locked = false;
+		*lock_dropped = true;
 	}
 out:
 	trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
@@ -2424,6 +2424,67 @@ static enum scan_result collapse_scan_file(struct mm_struct *mm,
 	return result;
 }
 
+/*
+ * Try to collapse a single PMD starting at a PMD aligned addr, and return
+ * the results.
+ */
+static enum scan_result collapse_single_pmd(unsigned long addr,
+		struct vm_area_struct *vma, bool *lock_dropped,
+		struct collapse_control *cc)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	bool triggered_wb = false;
+	enum scan_result result;
+	struct file *file;
+	pgoff_t pgoff;
+
+	mmap_assert_locked(mm);
+
+	if (vma_is_anonymous(vma)) {
+		result = collapse_scan_pmd(mm, vma, addr, lock_dropped, cc);
+		goto end;
+	}
+
+	file = get_file(vma->vm_file);
+	pgoff = linear_page_index(vma, addr);
+
+	mmap_read_unlock(mm);
+	*lock_dropped = true;
+retry:
+	result = collapse_scan_file(mm, addr, file, pgoff, cc);
+
+	/*
+	 * For MADV_COLLAPSE, when encountering dirty pages, try to writeback,
+	 * then retry the collapse one time.
+	 */
+	if (!cc->is_khugepaged && result == SCAN_PAGE_DIRTY_OR_WRITEBACK &&
+	    !triggered_wb && mapping_can_writeback(file->f_mapping)) {
+		const loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
+		const loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
+
+		filemap_write_and_wait_range(file->f_mapping, lstart, lend);
+		triggered_wb = true;
+		goto retry;
+	}
+	fput(file);
+
+	if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
+		mmap_read_lock(mm);
+		if (collapse_test_exit_or_disable(mm))
+			result = SCAN_ANY_PROCESS;
+		else
+			result = try_collapse_pte_mapped_thp(mm, addr,
+							     !cc->is_khugepaged);
+		if (result == SCAN_PMD_MAPPED)
+			result = SCAN_SUCCEED;
+		mmap_read_unlock(mm);
+	}
+end:
+	if (cc->is_khugepaged && result == SCAN_SUCCEED)
+		++khugepaged_pages_collapsed;
+	return result;
+}
+
 static void collapse_scan_mm_slot(unsigned int progress_max,
 		enum scan_result *result, struct collapse_control *cc)
 	__releases(&khugepaged_mm_lock)
@@ -2485,46 +2546,21 @@ static void collapse_scan_mm_slot(unsigned int progress_max,
 		VM_BUG_ON(khugepaged_scan.address & ~HPAGE_PMD_MASK);
 
 		while (khugepaged_scan.address < hend) {
-			bool mmap_locked = true;
+			bool lock_dropped = false;
 
 			cond_resched();
 			if (unlikely(collapse_test_exit_or_disable(mm)))
 				goto breakouterloop;
 
-			VM_BUG_ON(khugepaged_scan.address < hstart ||
+			VM_WARN_ON_ONCE(khugepaged_scan.address < hstart ||
 				  khugepaged_scan.address + HPAGE_PMD_SIZE >
 				  hend);
-			if (!vma_is_anonymous(vma)) {
-				struct file *file = get_file(vma->vm_file);
-				pgoff_t pgoff = linear_page_index(vma,
-						khugepaged_scan.address);
-
-				mmap_read_unlock(mm);
-				mmap_locked = false;
-				*result = collapse_scan_file(mm,
-					khugepaged_scan.address, file, pgoff, cc);
-				fput(file);
-				if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
-					mmap_read_lock(mm);
-					if (collapse_test_exit_or_disable(mm))
-						goto breakouterloop;
-					*result = try_collapse_pte_mapped_thp(mm,
-						khugepaged_scan.address, false);
-					if (*result == SCAN_PMD_MAPPED)
-						*result = SCAN_SUCCEED;
-					mmap_read_unlock(mm);
-				}
-			} else {
-				*result = collapse_scan_pmd(mm, vma,
-					khugepaged_scan.address, &mmap_locked, cc);
-			}
-
-			if (*result == SCAN_SUCCEED)
-				++khugepaged_pages_collapsed;
 
+			*result = collapse_single_pmd(khugepaged_scan.address,
+						      vma, &lock_dropped, cc);
 			/* move to next address */
 			khugepaged_scan.address += HPAGE_PMD_SIZE;
-			if (!mmap_locked)
+			if (lock_dropped)
 				/*
 				 * We released mmap_lock so break loop.  Note
 				 * that we drop mmap_lock before all hugepage
@@ -2799,7 +2835,6 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 	unsigned long hstart, hend, addr;
 	enum scan_result last_fail = SCAN_FAIL;
 	int thps = 0;
-	bool mmap_locked = true;
 
 	BUG_ON(vma->vm_start > start);
 	BUG_ON(vma->vm_end < end);
@@ -2821,13 +2856,11 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 
 	for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
 		enum scan_result result = SCAN_FAIL;
-		bool triggered_wb = false;
 
-retry:
-		if (!mmap_locked) {
+		if (*lock_dropped) {
 			cond_resched();
 			mmap_read_lock(mm);
-			mmap_locked = true;
+			*lock_dropped = false;
 			result = hugepage_vma_revalidate(mm, addr, false, &vma,
 							 cc);
 			if (result  != SCAN_SUCCEED) {
@@ -2837,45 +2870,14 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 
 			hend = min(hend, vma->vm_end & HPAGE_PMD_MASK);
 		}
-		mmap_assert_locked(mm);
-		if (!vma_is_anonymous(vma)) {
-			struct file *file = get_file(vma->vm_file);
-			pgoff_t pgoff = linear_page_index(vma, addr);
 
-			mmap_read_unlock(mm);
-			mmap_locked = false;
-			*lock_dropped = true;
-			result = collapse_scan_file(mm, addr, file, pgoff, cc);
-
-			if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
-			    mapping_can_writeback(file->f_mapping)) {
-				loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
-				loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
-
-				filemap_write_and_wait_range(file->f_mapping, lstart, lend);
-				triggered_wb = true;
-				fput(file);
-				goto retry;
-			}
-			fput(file);
-		} else {
-			result = collapse_scan_pmd(mm, vma, addr, &mmap_locked, cc);
-		}
-		if (!mmap_locked)
-			*lock_dropped = true;
+		result = collapse_single_pmd(addr, vma, lock_dropped, cc);
 
-handle_result:
 		switch (result) {
 		case SCAN_SUCCEED:
 		case SCAN_PMD_MAPPED:
 			++thps;
 			break;
-		case SCAN_PTE_MAPPED_HUGEPAGE:
-			BUG_ON(mmap_locked);
-			mmap_read_lock(mm);
-			result = try_collapse_pte_mapped_thp(mm, addr, true);
-			mmap_read_unlock(mm);
-			goto handle_result;
 		/* Whitelisted set of results where continuing OK */
 		case SCAN_NO_PTE_TABLE:
 		case SCAN_PTE_NON_PRESENT:
@@ -2898,7 +2900,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 
 out_maybelock:
 	/* Caller expects us to hold mmap_lock on return */
-	if (!mmap_locked)
+	if (*lock_dropped)
 		mmap_read_lock(mm);
 out_nolock:
 	mmap_assert_locked(mm);
-- 
2.53.0
Re: [PATCH mm-unstable v4 5/5] mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd()
Posted by Lorenzo Stoakes (Oracle) 2 days, 3 hours ago
OK we need a fairly urgent fix for this as this has triggered a syzbot. See [0]
for an analysis.

I show inline where the issue is, and attach a fix-patch for the bug.

[0]: https://lore.kernel.org/all/e1cb33b8-c1f7-4972-8628-3a2169077d6e@lucifer.local/

See below for details.

Cheers, Lorenzo

On Wed, Mar 25, 2026 at 05:40:22AM -0600, Nico Pache wrote:
> [snip]
> @@ -2821,13 +2856,11 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>
>  	for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
>  		enum scan_result result = SCAN_FAIL;
> -		bool triggered_wb = false;
>
> -retry:
> -		if (!mmap_locked) {
> +		if (*lock_dropped) {
>  			cond_resched();
>  			mmap_read_lock(mm);
> -			mmap_locked = true;
> +			*lock_dropped = false;

So this is the bug. 'lock_dropped' needs to record if the lock was _ever_
dropped, not if it is _currently_ dropped.

This is probably a mea culpa on my part on review, so apologies.

See below for a fix-patch.

>  			result = hugepage_vma_revalidate(mm, addr, false, &vma,
>  							 cc);
>  			if (result  != SCAN_SUCCEED) {
> @@ -2837,45 +2870,14 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>
>  			hend = min(hend, vma->vm_end & HPAGE_PMD_MASK);
>  		}
> -		mmap_assert_locked(mm);
> -		if (!vma_is_anonymous(vma)) {
> -			struct file *file = get_file(vma->vm_file);
> -			pgoff_t pgoff = linear_page_index(vma, addr);
>
> -			mmap_read_unlock(mm);
> -			mmap_locked = false;
> -			*lock_dropped = true;
> -			result = collapse_scan_file(mm, addr, file, pgoff, cc);
> -
> -			if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
> -			    mapping_can_writeback(file->f_mapping)) {
> -				loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
> -				loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
> -
> -				filemap_write_and_wait_range(file->f_mapping, lstart, lend);
> -				triggered_wb = true;
> -				fput(file);
> -				goto retry;
> -			}
> -			fput(file);
> -		} else {
> -			result = collapse_scan_pmd(mm, vma, addr, &mmap_locked, cc);
> -		}
> -		if (!mmap_locked)
> -			*lock_dropped = true;
> +		result = collapse_single_pmd(addr, vma, lock_dropped, cc);
>
> -handle_result:
>  		switch (result) {
>  		case SCAN_SUCCEED:
>  		case SCAN_PMD_MAPPED:
>  			++thps;
>  			break;
> -		case SCAN_PTE_MAPPED_HUGEPAGE:
> -			BUG_ON(mmap_locked);
> -			mmap_read_lock(mm);
> -			result = try_collapse_pte_mapped_thp(mm, addr, true);
> -			mmap_read_unlock(mm);
> -			goto handle_result;
>  		/* Whitelisted set of results where continuing OK */
>  		case SCAN_NO_PTE_TABLE:
>  		case SCAN_PTE_NON_PRESENT:
> @@ -2898,7 +2900,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>
>  out_maybelock:
>  	/* Caller expects us to hold mmap_lock on return */
> -	if (!mmap_locked)
> +	if (*lock_dropped)
>  		mmap_read_lock(mm);
>  out_nolock:
>  	mmap_assert_locked(mm);
> --
> 2.53.0
>

Fix patch follows:

----8<----
From a4dfc7718a15035449f344a0bc7f58e449366405 Mon Sep 17 00:00:00 2001
From: "Lorenzo Stoakes (Oracle)" <ljs@kernel.org>
Date: Tue, 31 Mar 2026 13:11:18 +0100
Subject: [PATCH] mm/khugepaged: fix issue with tracking lock

We were incorrectly using lock_dropped to track both whether the lock is
currently held and whether the lock was ever dropped.

Track these two states separately: a local mmap_unlocked flag records the
current state, while *lock_dropped remains sticky once set.

Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
---
 mm/khugepaged.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index d21348b85a59..b8452dbdb043 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2828,6 +2828,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 	unsigned long hstart, hend, addr;
 	enum scan_result last_fail = SCAN_FAIL;
 	int thps = 0;
+	bool mmap_unlocked = false;

 	BUG_ON(vma->vm_start > start);
 	BUG_ON(vma->vm_end < end);
@@ -2850,10 +2851,11 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 	for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
 		enum scan_result result = SCAN_FAIL;

-		if (*lock_dropped) {
+		if (mmap_unlocked) {
 			cond_resched();
 			mmap_read_lock(mm);
-			*lock_dropped = false;
+			mmap_unlocked = false;
+			*lock_dropped = true;
 			result = hugepage_vma_revalidate(mm, addr, false, &vma,
 							cc);
 			if (result  != SCAN_SUCCEED) {
@@ -2864,7 +2866,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 			hend = min(hend, vma->vm_end & HPAGE_PMD_MASK);
 		}

-		result = collapse_single_pmd(addr, vma, lock_dropped, cc);
+		result = collapse_single_pmd(addr, vma, &mmap_unlocked, cc);

 		switch (result) {
 		case SCAN_SUCCEED:
@@ -2893,8 +2895,10 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,

 out_maybelock:
 	/* Caller expects us to hold mmap_lock on return */
-	if (*lock_dropped)
+	if (mmap_unlocked) {
+		*lock_dropped = true;
 		mmap_read_lock(mm);
+	}
 out_nolock:
 	mmap_assert_locked(mm);
 	mmdrop(mm);
--
2.53.0
Re: [PATCH mm-unstable v4 5/5] mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd()
Posted by Nico Pache 1 day, 21 hours ago
On Tue, Mar 31, 2026 at 8:01 AM Lorenzo Stoakes (Oracle) <ljs@kernel.org> wrote:
>
> OK we need a fairly urgent fix for this as this has triggered a syzbot. See [0]
> for an analysis.
>
> I show inline where the issue is, and attach a fix-patch for the bug.
>
> [0]: https://lore.kernel.org/all/e1cb33b8-c1f7-4972-8628-3a2169077d6e@lucifer.local/
>
> See below for details.
>
> Cheers, Lorenzo
>
> On Wed, Mar 25, 2026 at 05:40:22AM -0600, Nico Pache wrote:
> > [snip]
> > -             if (!mmap_locked) {
> > +             if (*lock_dropped) {
> >                       cond_resched();
> >                       mmap_read_lock(mm);
> > -                     mmap_locked = true;
> > +                     *lock_dropped = false;
>
> So this is the bug. 'lock_dropped' needs to record if the lock was _ever_
> dropped, not if it is _currently_ dropped.
>
> This is probably a mea culpa on my part on review, so apologies.

All good! That code is rather confusing.

>
> See below for a fix-patch.
>
> > [snip]
> > -     if (!mmap_locked)
> > +     if (*lock_dropped)
> >               mmap_read_lock(mm);
> >  out_nolock:
> >       mmap_assert_locked(mm);
> > --
> > 2.53.0
> >
>
> Fix patch follows:
>
> ----8<----
> From a4dfc7718a15035449f344a0bc7f58e449366405 Mon Sep 17 00:00:00 2001
> From: "Lorenzo Stoakes (Oracle)" <ljs@kernel.org>
> Date: Tue, 31 Mar 2026 13:11:18 +0100
> Subject: [PATCH] mm/khugepaged: fix issue with tracking lock
>
> We are incorrectly treating lock_dropped to track both whether the lock is
> currently held and whether or not the lock was ever dropped.
>
> Update this change to account for this.
>
> Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> ---

Thanks for fixing this so quickly :) Syzbot didn't send this to me.
Sadly it looks like we indeed needed those doubled "locked" semantics.
Thank you for the very good explanation in reply to the syzbot report;
that really cleared up some confusion for me.

Reviewed-by: Nico Pache <npache@redhat.com>

>  mm/khugepaged.c | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index d21348b85a59..b8452dbdb043 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2828,6 +2828,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>         unsigned long hstart, hend, addr;
>         enum scan_result last_fail = SCAN_FAIL;
>         int thps = 0;
> +       bool mmap_unlocked = false;
>
>         BUG_ON(vma->vm_start > start);
>         BUG_ON(vma->vm_end < end);
> @@ -2850,10 +2851,11 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>         for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
>                 enum scan_result result = SCAN_FAIL;
>
> -               if (*lock_dropped) {
> +               if (mmap_unlocked) {
>                         cond_resched();
>                         mmap_read_lock(mm);
> -                       *lock_dropped = false;
> +                       mmap_unlocked = false;
> +                       *lock_dropped = true;
>                         result = hugepage_vma_revalidate(mm, addr, false, &vma,
>                                                         cc);
>                         if (result  != SCAN_SUCCEED) {
> @@ -2864,7 +2866,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>                         hend = min(hend, vma->vm_end & HPAGE_PMD_MASK);
>                 }
>
> -               result = collapse_single_pmd(addr, vma, lock_dropped, cc);
> +               result = collapse_single_pmd(addr, vma, &mmap_unlocked, cc);
>
>                 switch (result) {
>                 case SCAN_SUCCEED:
> @@ -2893,8 +2895,10 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>
>  out_maybelock:
>         /* Caller expects us to hold mmap_lock on return */
> -       if (*lock_dropped)
> +       if (mmap_unlocked) {
> +               *lock_dropped = true;
>                 mmap_read_lock(mm);
> +       }
>  out_nolock:
>         mmap_assert_locked(mm);
>         mmdrop(mm);
> --
> 2.53.0
>
Re: [PATCH mm-unstable v4 5/5] mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd()
Posted by Lance Yang 2 days, 1 hour ago

On 2026/3/31 22:01, Lorenzo Stoakes (Oracle) wrote:
> OK we need a fairly urgent fix for this as this has triggered a syzbot. See [0]
> for an analysis.
> 
> I show inline where the issue is, and attach a fix-patch for the bug.
> 
> [0]: https://lore.kernel.org/all/e1cb33b8-c1f7-4972-8628-3a2169077d6e@lucifer.local/
> 
> See below for details.
> 
> Cheers, Lorenzo
> 
[...]
> 
> Fix patch follows:
> 
> ----8<----
>  From a4dfc7718a15035449f344a0bc7f58e449366405 Mon Sep 17 00:00:00 2001
> From: "Lorenzo Stoakes (Oracle)" <ljs@kernel.org>
> Date: Tue, 31 Mar 2026 13:11:18 +0100
> Subject: [PATCH] mm/khugepaged: fix issue with tracking lock
> 
> We are incorrectly treating lock_dropped to track both whether the lock is
> currently held and whether or not the lock was ever dropped.

Good catch!

Right, lock_dropped is not supposed to mean "is the mmap lock currently
unlocked?", it should mean "was the mmap lock dropped at any point
during MADV_COLLAPSE?"

> 
> Update this change to account for this.
> 
> Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> ---

Thanks for the fix!
Reviewed-by: Lance Yang <lance.yang@linux.dev>

>   mm/khugepaged.c | 12 ++++++++----
>   1 file changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index d21348b85a59..b8452dbdb043 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2828,6 +2828,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>   	unsigned long hstart, hend, addr;
>   	enum scan_result last_fail = SCAN_FAIL;
>   	int thps = 0;
> +	bool mmap_unlocked = false;
> 
>   	BUG_ON(vma->vm_start > start);
>   	BUG_ON(vma->vm_end < end);
> @@ -2850,10 +2851,11 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>   	for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
>   		enum scan_result result = SCAN_FAIL;
> 
> -		if (*lock_dropped) {
> +		if (mmap_unlocked) {
>   			cond_resched();
>   			mmap_read_lock(mm);
> -			*lock_dropped = false;
> +			mmap_unlocked = false;
> +			*lock_dropped = true;
>   			result = hugepage_vma_revalidate(mm, addr, false, &vma,
>   							cc);
>   			if (result  != SCAN_SUCCEED) {
> @@ -2864,7 +2866,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>   			hend = min(hend, vma->vm_end & HPAGE_PMD_MASK);
>   		}
> 
> -		result = collapse_single_pmd(addr, vma, lock_dropped, cc);
> +		result = collapse_single_pmd(addr, vma, &mmap_unlocked, cc);
> 
>   		switch (result) {
>   		case SCAN_SUCCEED:
> @@ -2893,8 +2895,10 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> 
>   out_maybelock:
>   	/* Caller expects us to hold mmap_lock on return */
> -	if (*lock_dropped)
> +	if (mmap_unlocked) {
> +		*lock_dropped = true;
>   		mmap_read_lock(mm);
> +	}
>   out_nolock:
>   	mmap_assert_locked(mm);
>   	mmdrop(mm);
> --
> 2.53.0
Re: [PATCH mm-unstable v4 5/5] mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd()
Posted by David Hildenbrand (Arm) 2 days, 3 hours ago
>> +		if (*lock_dropped) {
>>  			cond_resched();
>>  			mmap_read_lock(mm);
>> -			mmap_locked = true;
>> +			*lock_dropped = false;
> 
> So this is the bug. 'lock_dropped' needs to record if the lock was _ever_
> dropped, not if it is _currently_ dropped.

Ah, well spotted. I thought we discussed that at some point during
review ...

Thanks for tackling this!

[...]

> 
> ----8<----
> From a4dfc7718a15035449f344a0bc7f58e449366405 Mon Sep 17 00:00:00 2001
> From: "Lorenzo Stoakes (Oracle)" <ljs@kernel.org>
> Date: Tue, 31 Mar 2026 13:11:18 +0100
> Subject: [PATCH] mm/khugepaged: fix issue with tracking lock
> 
> We are incorrectly treating lock_dropped to track both whether the lock is
> currently held and whether or not the lock was ever dropped.
> 
> Update this change to account for this.
> 
> Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> ---
>  mm/khugepaged.c | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index d21348b85a59..b8452dbdb043 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -2828,6 +2828,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>  	unsigned long hstart, hend, addr;
>  	enum scan_result last_fail = SCAN_FAIL;
>  	int thps = 0;
> +	bool mmap_unlocked = false;
> 
>  	BUG_ON(vma->vm_start > start);
>  	BUG_ON(vma->vm_end < end);
> @@ -2850,10 +2851,11 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>  	for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
>  		enum scan_result result = SCAN_FAIL;
> 
> -		if (*lock_dropped) {
> +		if (mmap_unlocked) {
>  			cond_resched();
>  			mmap_read_lock(mm);
> -			*lock_dropped = false;
> +			mmap_unlocked = false;
> +			*lock_dropped = true;
>  			result = hugepage_vma_revalidate(mm, addr, false, &vma,
>  							cc);
>  			if (result  != SCAN_SUCCEED) {
> @@ -2864,7 +2866,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,

(in hurry so I might be wrong)

Do we have to handle when collapse_single_pmd() dropped the lock as well?

-- 
Cheers,

David
Re: [PATCH mm-unstable v4 5/5] mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd()
Posted by Nico Pache 1 day, 21 hours ago
On Tue, Mar 31, 2026 at 8:13 AM David Hildenbrand (Arm)
<david@kernel.org> wrote:
>
> >> +            if (*lock_dropped) {
> >>                      cond_resched();
> >>                      mmap_read_lock(mm);
> >> -                    mmap_locked = true;
> >> +                    *lock_dropped = false;
> >
> > So this is the bug. 'lock_dropped' needs to record if the lock was _ever_
> > dropped, not if it is _currently_ dropped.
>
> Ah, well spotted. I thought we discussed that at some point during
> review ...

Yes I think we've discussed this before, and IIRC the outcome was, "We
weren't sure why this semantic was in place, or if we truly needed to
track if it was dropped at any point, or simply if it is currently
dropped.". The code is rather confusing, and changed mid-flight during
my series.

Lorenzo asked me to unify this semantic across the two functions to
further simplify readability, but it looks like we indeed needed that
extra tracking.


Cheers,
 -- Nico

>
> Thanks for tackling this!
>
> [...]
>
> >
> > ----8<----
> > From a4dfc7718a15035449f344a0bc7f58e449366405 Mon Sep 17 00:00:00 2001
> > From: "Lorenzo Stoakes (Oracle)" <ljs@kernel.org>
> > Date: Tue, 31 Mar 2026 13:11:18 +0100
> > Subject: [PATCH] mm/khugepaged: fix issue with tracking lock
> >
> > We are incorrectly treating lock_dropped to track both whether the lock is
> > currently held and whether or not the lock was ever dropped.
> >
> > Update this change to account for this.
> >
> > Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> > ---
> >  mm/khugepaged.c | 12 ++++++++----
> >  1 file changed, 8 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index d21348b85a59..b8452dbdb043 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -2828,6 +2828,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> >       unsigned long hstart, hend, addr;
> >       enum scan_result last_fail = SCAN_FAIL;
> >       int thps = 0;
> > +     bool mmap_unlocked = false;
> >
> >       BUG_ON(vma->vm_start > start);
> >       BUG_ON(vma->vm_end < end);
> > @@ -2850,10 +2851,11 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> >       for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
> >               enum scan_result result = SCAN_FAIL;
> >
> > -             if (*lock_dropped) {
> > +             if (mmap_unlocked) {
> >                       cond_resched();
> >                       mmap_read_lock(mm);
> > -                     *lock_dropped = false;
> > +                     mmap_unlocked = false;
> > +                     *lock_dropped = true;
> >                       result = hugepage_vma_revalidate(mm, addr, false, &vma,
> >                                                       cc);
> >                       if (result  != SCAN_SUCCEED) {
> > @@ -2864,7 +2866,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>
> (in hurry so I might be wrong)
>
> Do we have to handle when collapse_single_pmd() dropped the lock as well?

There are some oddities about madvise that sadly I don't fully
understand. This is one of them. However, I don't believe khugepaged
had these lock_dropped semantics and just tracked the state of the
lock.

>
> --
> Cheers,
>
> David
>
Re: [PATCH mm-unstable v4 5/5] mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd()
Posted by Lorenzo Stoakes (Oracle) 1 day, 21 hours ago
On Tue, Mar 31, 2026 at 01:46:02PM -0600, Nico Pache wrote:
> On Tue, Mar 31, 2026 at 8:13 AM David Hildenbrand (Arm)
> <david@kernel.org> wrote:
> >
> > >> +            if (*lock_dropped) {
> > >>                      cond_resched();
> > >>                      mmap_read_lock(mm);
> > >> -                    mmap_locked = true;
> > >> +                    *lock_dropped = false;
> > >
> > > So this is the bug. 'lock_dropped' needs to record if the lock was _ever_
> > > dropped, not if it is _currently_ dropped.
> >
> > Ah, well spotted. I thought we discussed that at some point during
> > review ...
>
> Yes I think we've discussed this before, and IIRC the outcome was, "We
> weren't sure why this semantic was in place, or if we truly needed to
> track if it was dropped at any point, or simply if it is currently
> dropped.". The code is rather confusing, and changed mid-flight during
> my series.
>
> Lorenzo asked me to unify this semantic across the two functions to
> further simplify readability, but it looks like we indeed needed that
> extra tracking.
>
>
> Cheers,
>  -- Nico
>
> >
> > Thanks for tackling this!
> >
> > [...]
> >
> > >
> > > ----8<----
> > > From a4dfc7718a15035449f344a0bc7f58e449366405 Mon Sep 17 00:00:00 2001
> > > From: "Lorenzo Stoakes (Oracle)" <ljs@kernel.org>
> > > Date: Tue, 31 Mar 2026 13:11:18 +0100
> > > Subject: [PATCH] mm/khugepaged: fix issue with tracking lock
> > >
> > > We are incorrectly treating lock_dropped to track both whether the lock is
> > > currently held and whether or not the lock was ever dropped.
> > >
> > > Update this change to account for this.
> > >
> > > Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> > > ---
> > >  mm/khugepaged.c | 12 ++++++++----
> > >  1 file changed, 8 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index d21348b85a59..b8452dbdb043 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -2828,6 +2828,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> > >       unsigned long hstart, hend, addr;
> > >       enum scan_result last_fail = SCAN_FAIL;
> > >       int thps = 0;
> > > +     bool mmap_unlocked = false;
> > >
> > >       BUG_ON(vma->vm_start > start);
> > >       BUG_ON(vma->vm_end < end);
> > > @@ -2850,10 +2851,11 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> > >       for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
> > >               enum scan_result result = SCAN_FAIL;
> > >
> > > -             if (*lock_dropped) {
> > > +             if (mmap_unlocked) {
> > >                       cond_resched();
> > >                       mmap_read_lock(mm);
> > > -                     *lock_dropped = false;
> > > +                     mmap_unlocked = false;
> > > +                     *lock_dropped = true;
> > >                       result = hugepage_vma_revalidate(mm, addr, false, &vma,
> > >                                                       cc);
> > >                       if (result  != SCAN_SUCCEED) {
> > > @@ -2864,7 +2866,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> >
> > (in hurry so I might be wrong)
> >
> > Do we have to handle when collapse_single_pmd() dropped the lock as well?
>
> There are some oddities about madvise that sadly I don't fully
> understand. This is one of them. However, I dont believe khugepaged
> had these lock_dropped semantics and just tracked the state of the
> lock.

(As said to David) with my fix-patch we pass a pointer to mmap_unlocked and will
set the upstream lock_dropped appropriately as a result :)

So we cover that case also.

But yeah the handling in madvise is weird, and we maybe need to rethink a bit at
some point.

>
> >
> > --
> > Cheers,
> >
> > David
> >
>

Cheers, Lorenzo
Re: [PATCH mm-unstable v4 5/5] mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd()
Posted by Lorenzo Stoakes (Oracle) 2 days, 3 hours ago
On Tue, Mar 31, 2026 at 04:13:29PM +0200, David Hildenbrand (Arm) wrote:
> >> +		if (*lock_dropped) {
> >>  			cond_resched();
> >>  			mmap_read_lock(mm);
> >> -			mmap_locked = true;
> >> +			*lock_dropped = false;
> >
> > So this is the bug. 'lock_dropped' needs to record if the lock was _ever_
> > dropped, not if it is _currently_ dropped.
>
> Ah, well spotted. I thought we discussed that at some point during
> review ...
>
> Thanks for tackling this!
>
> [...]
>
> >
> > ----8<----
> > From a4dfc7718a15035449f344a0bc7f58e449366405 Mon Sep 17 00:00:00 2001
> > From: "Lorenzo Stoakes (Oracle)" <ljs@kernel.org>
> > Date: Tue, 31 Mar 2026 13:11:18 +0100
> > Subject: [PATCH] mm/khugepaged: fix issue with tracking lock
> >
> > We are incorrectly treating lock_dropped to track both whether the lock is
> > currently held and whether or not the lock was ever dropped.
> >
> > Update this change to account for this.
> >
> > Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> > ---
> >  mm/khugepaged.c | 12 ++++++++----
> >  1 file changed, 8 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index d21348b85a59..b8452dbdb043 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -2828,6 +2828,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> >  	unsigned long hstart, hend, addr;
> >  	enum scan_result last_fail = SCAN_FAIL;
> >  	int thps = 0;
> > +	bool mmap_unlocked = false;
> >
> >  	BUG_ON(vma->vm_start > start);
> >  	BUG_ON(vma->vm_end < end);
> > @@ -2850,10 +2851,11 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> >  	for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
> >  		enum scan_result result = SCAN_FAIL;
> >
> > -		if (*lock_dropped) {
> > +		if (mmap_unlocked) {
> >  			cond_resched();
> >  			mmap_read_lock(mm);
> > -			*lock_dropped = false;
> > +			mmap_unlocked = false;
> > +			*lock_dropped = true;
> >  			result = hugepage_vma_revalidate(mm, addr, false, &vma,
> >  							cc);
> >  			if (result  != SCAN_SUCCEED) {
> > @@ -2864,7 +2866,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>
> (in hurry so I might be wrong)
>
> Do we have to handle when collapse_single_pmd() dropped the lock as well?

That updates mmap_unlocked which will be handled in the loop or on exit:

		if (mmap_unlocked) {
			...
			mmap_read_lock(mm);
			mmap_unlocked = false;
			*lock_dropped = true;
			...
		}

		result = collapse_single_pmd(addr, vma, &mmap_unlocked, cc);

...

out_maybelock:
	/* Caller expects us to hold mmap_lock on return */
	if (mmap_unlocked) {
		*lock_dropped = true;
		mmap_read_lock(mm);
	}

>
> --
> Cheers,
>
> David

Thanks, Lorenzo
Re: [PATCH mm-unstable v4 5/5] mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd()
Posted by David Hildenbrand (Arm) 1 day, 21 hours ago
On 3/31/26 16:15, Lorenzo Stoakes (Oracle) wrote:
> On Tue, Mar 31, 2026 at 04:13:29PM +0200, David Hildenbrand (Arm) wrote:
>>>
>>> So this is the bug. 'lock_dropped' needs to record if the lock was _ever_
>>> dropped, not if it is _currently_ dropped.
>>
>> Ah, well spotted. I thought we discussed that at some point during
>> review ...
>>
>> Thanks for tackling this!
>>
>> [...]
>>
>>>
>>> ----8<----
>>> From a4dfc7718a15035449f344a0bc7f58e449366405 Mon Sep 17 00:00:00 2001
>>> From: "Lorenzo Stoakes (Oracle)" <ljs@kernel.org>
>>> Date: Tue, 31 Mar 2026 13:11:18 +0100
>>> Subject: [PATCH] mm/khugepaged: fix issue with tracking lock
>>>
>>> We are incorrectly treating lock_dropped to track both whether the lock is
>>> currently held and whether or not the lock was ever dropped.
>>>
>>> Update this change to account for this.
>>>
>>> Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
>>> ---
>>>  mm/khugepaged.c | 12 ++++++++----
>>>  1 file changed, 8 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index d21348b85a59..b8452dbdb043 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -2828,6 +2828,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>>>  	unsigned long hstart, hend, addr;
>>>  	enum scan_result last_fail = SCAN_FAIL;
>>>  	int thps = 0;
>>> +	bool mmap_unlocked = false;
>>>
>>>  	BUG_ON(vma->vm_start > start);
>>>  	BUG_ON(vma->vm_end < end);
>>> @@ -2850,10 +2851,11 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>>>  	for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
>>>  		enum scan_result result = SCAN_FAIL;
>>>
>>> -		if (*lock_dropped) {
>>> +		if (mmap_unlocked) {
>>>  			cond_resched();
>>>  			mmap_read_lock(mm);
>>> -			*lock_dropped = false;
>>> +			mmap_unlocked = false;
>>> +			*lock_dropped = true;
>>>  			result = hugepage_vma_revalidate(mm, addr, false, &vma,
>>>  							cc);
>>>  			if (result  != SCAN_SUCCEED) {
>>> @@ -2864,7 +2866,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>>
>> (in hurry so I might be wrong)
>>
>> Do we have to handle when collapse_single_pmd() dropped the lock as well?
> 
> That updates mmap_unlocked which will be handled in the loop or on exit:
> 
> 		if (mmap_unlocked) {
> 			...
> 			mmap_read_lock(mm);
> 			mmap_unlocked = false;
> 			*lock_dropped = true;
> 			...
> 		}
> 
> 		result = collapse_single_pmd(addr, vma, &mmap_unlocked, cc);
> 
> ...
> 
> out_maybelock:
> 	/* Caller expects us to hold mmap_lock on return */
> 	if (mmap_unlocked) {
> 		*lock_dropped = true;
> 		mmap_read_lock(mm);
> 	}
> 

Right.

The original code used

	bool mmap_locked = true;

If the fix get squashed, maybe we should revert to that handling to
minimize the churn?

(this is not in stable, right?)

-- 
Cheers,

David
Re: [PATCH mm-unstable v4 5/5] mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd()
Posted by Andrew Morton 1 day, 20 hours ago
On Tue, 31 Mar 2026 22:00:58 +0200 "David Hildenbrand (Arm)" <david@kernel.org> wrote:

> > out_maybelock:
> > 	/* Caller expects us to hold mmap_lock on return */
> > 	if (mmap_unlocked) {
> > 		*lock_dropped = true;
> > 		mmap_read_lock(mm);
> > 	}
> > 
> 
> Right.
> 
> The original code used
> 
> 	bool mmap_locked = true;
> 
> If the fix get squashed, maybe we should revert to that handling to
> minimize the churn?
> 
> (this is not in stable, right?)

"mm/khugepaged: unify khugepaged and madv_collapse with
collapse_single_pmd(): is presently in mm-stable.

Doing a big rebase&rework isn't the done thing, I believe.  So queue it
afterwards, add the Fixes: and tolerate the minor runtime bisection
hole.  It ends up looking like a later hotfix.


Below is what I added to mm-unstable.  I added a note-to-self to check
in on Nico's question at
https://lore.kernel.org/all/CAA1CXcCYLGQBy9nT-MSTPVK=XgWWtD6Af-QiA_6pXtM_2pybPg@mail.gmail.com/T/#u


From: "Lorenzo Stoakes (Oracle)" <ljs@kernel.org>
Subject: mm/khugepaged: fix issue with tracking lock
Date: Tue, 31 Mar 2026 13:11:18 +0100

We are incorrectly treating lock_dropped to track both whether the lock is
currently held and whether or not the lock was ever dropped.

Update this change to account for this.

Link: https://lkml.kernel.org/r/7760c811-e100-4d40-9217-0813c28314be@lucifer.local
Fixes: 330f3758a3bc ("mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd()")
Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Nico Pache <npache@redhat.com>
...
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/khugepaged.c |   12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

--- a/mm/khugepaged.c~mm-khugepaged-fix-issue-with-tracking-lock
+++ a/mm/khugepaged.c
@@ -2828,6 +2828,7 @@ int madvise_collapse(struct vm_area_stru
 	unsigned long hstart, hend, addr;
 	enum scan_result last_fail = SCAN_FAIL;
 	int thps = 0;
+	bool mmap_unlocked = false;
 
 	BUG_ON(vma->vm_start > start);
 	BUG_ON(vma->vm_end < end);
@@ -2850,10 +2851,11 @@ int madvise_collapse(struct vm_area_stru
 	for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
 		enum scan_result result = SCAN_FAIL;
 
-		if (*lock_dropped) {
+		if (mmap_unlocked) {
 			cond_resched();
 			mmap_read_lock(mm);
-			*lock_dropped = false;
+			mmap_unlocked = false;
+			*lock_dropped = true;
 			result = hugepage_vma_revalidate(mm, addr, false, &vma,
 							 cc);
 			if (result  != SCAN_SUCCEED) {
@@ -2864,7 +2866,7 @@ int madvise_collapse(struct vm_area_stru
 			hend = min(hend, vma->vm_end & HPAGE_PMD_MASK);
 		}
 
-		result = collapse_single_pmd(addr, vma, lock_dropped, cc);
+		result = collapse_single_pmd(addr, vma, &mmap_unlocked, cc);
 
 		switch (result) {
 		case SCAN_SUCCEED:
@@ -2893,8 +2895,10 @@ int madvise_collapse(struct vm_area_stru
 
 out_maybelock:
 	/* Caller expects us to hold mmap_lock on return */
-	if (*lock_dropped)
+	if (mmap_unlocked) {
+		*lock_dropped = true;
 		mmap_read_lock(mm);
+	}
 out_nolock:
 	mmap_assert_locked(mm);
 	mmdrop(mm);
_
Re: [PATCH mm-unstable v4 5/5] mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd()
Posted by Nico Pache 1 day, 19 hours ago
On Tue, Mar 31, 2026 at 3:35 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Tue, 31 Mar 2026 22:00:58 +0200 "David Hildenbrand (Arm)" <david@kernel.org> wrote:
>
> > > out_maybelock:
> > >     /* Caller expects us to hold mmap_lock on return */
> > >     if (mmap_unlocked) {
> > >             *lock_dropped = true;
> > >             mmap_read_lock(mm);
> > >     }
> > >
> >
> > Right.
> >
> > The original code used
> >
> >       bool mmap_locked = true;
> >
> > If the fix get squashed, maybe we should revert to that handling to
> > minimize the churn?
> >
> > (this is not in stable, right?)
>
> "mm/khugepaged: unify khugepaged and madv_collapse with
> collapse_single_pmd(): is presently in mm-stable.
>
> Doing a big rebase&rework isn't the done thing, I believe.  So queue it
> afterwards, add the Fixes: and tolerate the minor runtime bisection
> hole.  It ends up looking like a later hotfix.
>
>
> Below is what I added to mm-unstable.  I added a note-to-self to check
> in on Nico's question at
> https://lore.kernel.org/all/CAA1CXcCYLGQBy9nT-MSTPVK=XgWWtD6Af-QiA_6pXtM_2pybPg@mail.gmail.com/T/#u
>
>
> From: "Lorenzo Stoakes (Oracle)" <ljs@kernel.org>
> Subject: mm/khugepaged: fix issue with tracking lock
> Date: Tue, 31 Mar 2026 13:11:18 +0100
>
> We are incorrectly treating lock_dropped to track both whether the lock is
> currently held and whether or not the lock was ever dropped.
>
> Update this change to account for this.
>
> Link: https://lkml.kernel.org/r/7760c811-e100-4d40-9217-0813c28314be@lucifer.local
> Fixes: 330f3758a3bc ("mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd()")
> Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> Reviewed-by: Lance Yang <lance.yang@linux.dev>
> Reviewed-by: Nico Pache <npache@redhat.com>
> ...
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
>
>  mm/khugepaged.c |   12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
>
> --- a/mm/khugepaged.c~mm-khugepaged-fix-issue-with-tracking-lock
> +++ a/mm/khugepaged.c
> @@ -2828,6 +2828,7 @@ int madvise_collapse(struct vm_area_stru
>         unsigned long hstart, hend, addr;
>         enum scan_result last_fail = SCAN_FAIL;
>         int thps = 0;
> +       bool mmap_unlocked = false;
>
>         BUG_ON(vma->vm_start > start);
>         BUG_ON(vma->vm_end < end);
> @@ -2850,10 +2851,11 @@ int madvise_collapse(struct vm_area_stru
>         for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
>                 enum scan_result result = SCAN_FAIL;
>
> -               if (*lock_dropped) {
> +               if (mmap_unlocked) {
>                         cond_resched();
>                         mmap_read_lock(mm);
> -                       *lock_dropped = false;
> +                       mmap_unlocked = false;
> +                       *lock_dropped = true;
>                         result = hugepage_vma_revalidate(mm, addr, false, &vma,
>                                                          cc);
>                         if (result  != SCAN_SUCCEED) {
> @@ -2864,7 +2866,7 @@ int madvise_collapse(struct vm_area_stru
>                         hend = min(hend, vma->vm_end & HPAGE_PMD_MASK);
>                 }
>
> -               result = collapse_single_pmd(addr, vma, lock_dropped, cc);
> +               result = collapse_single_pmd(addr, vma, &mmap_unlocked, cc);
>
>                 switch (result) {
>                 case SCAN_SUCCEED:
> @@ -2893,8 +2895,10 @@ int madvise_collapse(struct vm_area_stru
>
>  out_maybelock:
>         /* Caller expects us to hold mmap_lock on return */
> -       if (*lock_dropped)
> +       if (mmap_unlocked) {
> +               *lock_dropped = true;

Hmm, this was = false in David's example and = true here? Now I am also
confused, hahaha. It seems Lorenzo's code is correct. I think I
responded to my own code changes when I thought David was replying to
Lorenzo's.

I think we are all good with Lorenzo's changes.

Sorry for the confusion

-- Nico

>                 mmap_read_lock(mm);
> +       }
>  out_nolock:
>         mmap_assert_locked(mm);
>         mmdrop(mm);
> _
>
Re: [PATCH mm-unstable v4 5/5] mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd()
Posted by David Hildenbrand (Arm) 1 day, 10 hours ago
On 3/31/26 23:49, Nico Pache wrote:
> On Tue, Mar 31, 2026 at 3:35 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>>
>> On Tue, 31 Mar 2026 22:00:58 +0200 "David Hildenbrand (Arm)" <david@kernel.org> wrote:
>>
>>>
>>> Right.
>>>
>>> The original code used
>>>
>>>       bool mmap_locked = true;
>>>
>>> If the fix get squashed, maybe we should revert to that handling to
>>> minimize the churn?
>>>
>>> (this is not in stable, right?)
>>
>> "mm/khugepaged: unify khugepaged and madv_collapse with
>> collapse_single_pmd(): is presently in mm-stable.
>>
>> Doing a big rebase&rework isn't the done thing, I believe.  So queue it
>> afterwards, add the Fixes: and tolerate the minor runtime bisection
>> hole.  It ends up looking like a later hotfix.
>>
>>
>> Below is what I added to mm-unstable.  I added a note-to-self to check
>> in on Nico's question at
>> https://lore.kernel.org/all/CAA1CXcCYLGQBy9nT-MSTPVK=XgWWtD6Af-QiA_6pXtM_2pybPg@mail.gmail.com/T/#u
>>
>>
>> From: "Lorenzo Stoakes (Oracle)" <ljs@kernel.org>
>> Subject: mm/khugepaged: fix issue with tracking lock
>> Date: Tue, 31 Mar 2026 13:11:18 +0100
>>
>> We are incorrectly treating lock_dropped to track both whether the lock is
>> currently held and whether or not the lock was ever dropped.
>>
>> Update this change to account for this.
>>
>> Link: https://lkml.kernel.org/r/7760c811-e100-4d40-9217-0813c28314be@lucifer.local
>> Fixes: 330f3758a3bc ("mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd()")
>> Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
>> Reviewed-by: Lance Yang <lance.yang@linux.dev>
>> Reviewed-by: Nico Pache <npache@redhat.com>
>> ...
>> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
>> ---
>>
>>  mm/khugepaged.c |   12 ++++++++----
>>  1 file changed, 8 insertions(+), 4 deletions(-)
>>
>> --- a/mm/khugepaged.c~mm-khugepaged-fix-issue-with-tracking-lock
>> +++ a/mm/khugepaged.c
>> @@ -2828,6 +2828,7 @@ int madvise_collapse(struct vm_area_stru
>>         unsigned long hstart, hend, addr;
>>         enum scan_result last_fail = SCAN_FAIL;
>>         int thps = 0;
>> +       bool mmap_unlocked = false;
>>
>>         BUG_ON(vma->vm_start > start);
>>         BUG_ON(vma->vm_end < end);
>> @@ -2850,10 +2851,11 @@ int madvise_collapse(struct vm_area_stru
>>         for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
>>                 enum scan_result result = SCAN_FAIL;
>>
>> -               if (*lock_dropped) {
>> +               if (mmap_unlocked) {
>>                         cond_resched();
>>                         mmap_read_lock(mm);
>> -                       *lock_dropped = false;
>> +                       mmap_unlocked = false;
>> +                       *lock_dropped = true;
>>                         result = hugepage_vma_revalidate(mm, addr, false, &vma,
>>                                                          cc);
>>                         if (result  != SCAN_SUCCEED) {
>> @@ -2864,7 +2866,7 @@ int madvise_collapse(struct vm_area_stru
>>                         hend = min(hend, vma->vm_end & HPAGE_PMD_MASK);
>>                 }
>>
>> -               result = collapse_single_pmd(addr, vma, lock_dropped, cc);
>> +               result = collapse_single_pmd(addr, vma, &mmap_unlocked, cc);
>>
>>                 switch (result) {
>>                 case SCAN_SUCCEED:
>> @@ -2893,8 +2895,10 @@ int madvise_collapse(struct vm_area_stru
>>
>>  out_maybelock:
>>         /* Caller expects us to hold mmap_lock on return */
>> -       if (*lock_dropped)
>> +       if (mmap_unlocked) {
>> +               *lock_dropped = true;
> 
> Hmm, this was =false in David's example and =true here? Now I am also
> confused hahaha. It seems Lorenzo's code is correct. I think I
> responded to my own code changes, when I thought David was replying to
> Lorenzo's.

I think I messed up that in my posting when manually applying :)

So this here is correct.

-- 
Cheers,

David
Re: [PATCH mm-unstable v4 5/5] mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd()
Posted by Lorenzo Stoakes (Oracle) 1 day, 9 hours ago
On Wed, Apr 01, 2026 at 09:05:39AM +0200, David Hildenbrand (Arm) wrote:
> On 3/31/26 23:49, Nico Pache wrote:
> > On Tue, Mar 31, 2026 at 3:35 PM Andrew Morton <akpm@linux-foundation.org> wrote:
> >>
> >> On Tue, 31 Mar 2026 22:00:58 +0200 "David Hildenbrand (Arm)" <david@kernel.org> wrote:
> >>
> >>>
> >>> Right.
> >>>
> >>> The original code used
> >>>
> >>>       bool mmap_locked = true;
> >>>
> >>> If the fix gets squashed, maybe we should revert to that handling to
> >>> minimize the churn?
> >>>
> >>> (this is not in stable, right?)
> >>
> >> "mm/khugepaged: unify khugepaged and madv_collapse with
> >> collapse_single_pmd(): is presently in mm-stable.
> >>
> >> Doing a big rebase&rework isn't the done thing, I believe.  So queue it
> >> afterwards, add the Fixes: and tolerate the minor runtime bisection
> >> hole.  It ends up looking like a later hotfix.
> >>
> >>
> >> Below is what I added to mm-unstable.  I added a note-to-self to check
> >> in on Nico's question at
> >> https://lore.kernel.org/all/CAA1CXcCYLGQBy9nT-MSTPVK=XgWWtD6Af-QiA_6pXtM_2pybPg@mail.gmail.com/T/#u
> >>
> >>
> >> From: "Lorenzo Stoakes (Oracle)" <ljs@kernel.org>
> >> Subject: mm/khugepaged: fix issue with tracking lock
> >> Date: Tue, 31 Mar 2026 13:11:18 +0100
> >>
> >> We are incorrectly treating lock_dropped to track both whether the lock is
> >> currently held and whether or not the lock was ever dropped.
> >>
> >> Update this change to account for this.
> >>
> >> Link: https://lkml.kernel.org/r/7760c811-e100-4d40-9217-0813c28314be@lucifer.local
> >> Fixes: 330f3758a3bc ("mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd()")
> >> Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> >> Reviewed-by: Lance Yang <lance.yang@linux.dev>
> >> Reviewed-by: Nico Pache <npache@redhat.com>
> >> ...
> >> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> >> ---
> >>
> >>  mm/khugepaged.c |   12 ++++++++----
> >>  1 file changed, 8 insertions(+), 4 deletions(-)
> >>
> >> --- a/mm/khugepaged.c~mm-khugepaged-fix-issue-with-tracking-lock
> >> +++ a/mm/khugepaged.c
> >> @@ -2828,6 +2828,7 @@ int madvise_collapse(struct vm_area_stru
> >>         unsigned long hstart, hend, addr;
> >>         enum scan_result last_fail = SCAN_FAIL;
> >>         int thps = 0;
> >> +       bool mmap_unlocked = false;
> >>
> >>         BUG_ON(vma->vm_start > start);
> >>         BUG_ON(vma->vm_end < end);
> >> @@ -2850,10 +2851,11 @@ int madvise_collapse(struct vm_area_stru
> >>         for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
> >>                 enum scan_result result = SCAN_FAIL;
> >>
> >> -               if (*lock_dropped) {
> >> +               if (mmap_unlocked) {
> >>                         cond_resched();
> >>                         mmap_read_lock(mm);
> >> -                       *lock_dropped = false;
> >> +                       mmap_unlocked = false;
> >> +                       *lock_dropped = true;
> >>                         result = hugepage_vma_revalidate(mm, addr, false, &vma,
> >>                                                          cc);
> >>                         if (result  != SCAN_SUCCEED) {
> >> @@ -2864,7 +2866,7 @@ int madvise_collapse(struct vm_area_stru
> >>                         hend = min(hend, vma->vm_end & HPAGE_PMD_MASK);
> >>                 }
> >>
> >> -               result = collapse_single_pmd(addr, vma, lock_dropped, cc);
> >> +               result = collapse_single_pmd(addr, vma, &mmap_unlocked, cc);
> >>
> >>                 switch (result) {
> >>                 case SCAN_SUCCEED:
> >> @@ -2893,8 +2895,10 @@ int madvise_collapse(struct vm_area_stru
> >>
> >>  out_maybelock:
> >>         /* Caller expects us to hold mmap_lock on return */
> >> -       if (*lock_dropped)
> >> +       if (mmap_unlocked) {
> >> +               *lock_dropped = true;
> >
> > Hmm, this was =false in David's example and =true here? Now I am also
> > confused hahaha. It seems Lorenzo's code is correct. I think I
> > responded to my own code changes, when I thought David was replying to
> > Lorenzo's.
>
> I think I messed up that in my posting when manually applying :)
>
> So this here is correct.

Yeah the fix-patch on top made it all confusing :)

FWIW :P I also had Claude do its thing to check it on this, and it gave the
thumbs up.

>
> --
> Cheers,
>
> David

Thanks, Lorenzo
Re: [PATCH mm-unstable v4 5/5] mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd()
Posted by Lorenzo Stoakes (Oracle) 1 day, 21 hours ago
On Tue, Mar 31, 2026 at 10:00:58PM +0200, David Hildenbrand (Arm) wrote:
> On 3/31/26 16:15, Lorenzo Stoakes (Oracle) wrote:
> > On Tue, Mar 31, 2026 at 04:13:29PM +0200, David Hildenbrand (Arm) wrote:
> >>>
> >>> So this is the bug. 'lock_dropped' needs to record if the lock was _ever_
> >>> dropped, not if it is _currently_ dropped.
> >>
> >> Ah, well spotted. I thought we discussed that at some point during
> >> review ...
> >>
> >> Thanks for tackling this!
> >>
> >> [...]
> >>
> >>>
> >>> ----8<----
> >>> From a4dfc7718a15035449f344a0bc7f58e449366405 Mon Sep 17 00:00:00 2001
> >>> From: "Lorenzo Stoakes (Oracle)" <ljs@kernel.org>
> >>> Date: Tue, 31 Mar 2026 13:11:18 +0100
> >>> Subject: [PATCH] mm/khugepaged: fix issue with tracking lock
> >>>
> >>> We are incorrectly treating lock_dropped to track both whether the lock is
> >>> currently held and whether or not the lock was ever dropped.
> >>>
> >>> Update this change to account for this.
> >>>
> >>> Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> >>> ---
> >>>  mm/khugepaged.c | 12 ++++++++----
> >>>  1 file changed, 8 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >>> index d21348b85a59..b8452dbdb043 100644
> >>> --- a/mm/khugepaged.c
> >>> +++ b/mm/khugepaged.c
> >>> @@ -2828,6 +2828,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> >>>  	unsigned long hstart, hend, addr;
> >>>  	enum scan_result last_fail = SCAN_FAIL;
> >>>  	int thps = 0;
> >>> +	bool mmap_unlocked = false;
> >>>
> >>>  	BUG_ON(vma->vm_start > start);
> >>>  	BUG_ON(vma->vm_end < end);
> >>> @@ -2850,10 +2851,11 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> >>>  	for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
> >>>  		enum scan_result result = SCAN_FAIL;
> >>>
> >>> -		if (*lock_dropped) {
> >>> +		if (mmap_unlocked) {
> >>>  			cond_resched();
> >>>  			mmap_read_lock(mm);
> >>> -			*lock_dropped = false;
> >>> +			mmap_unlocked = false;
> >>> +			*lock_dropped = true;
> >>>  			result = hugepage_vma_revalidate(mm, addr, false, &vma,
> >>>  							cc);
> >>>  			if (result  != SCAN_SUCCEED) {
> >>> @@ -2864,7 +2866,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
> >>
> >> (in hurry so I might be wrong)
> >>
> >> Do we have to handle when collapse_single_pmd() dropped the lock as well?
> >
> > That updates mmap_unlocked which will be handled in the loop or on exit:
> >
> > 		if (mmap_unlocked) {
> > 			...
> > 			mmap_read_lock(mm);
> > 			mmap_unlocked = false;
> > 			*lock_dropped = true;
> > 			...
> > 		}
> >
> > 		result = collapse_single_pmd(addr, vma, &mmap_unlocked, cc);
> >
> > ...
> >
> > out_maybelock:
> > 	/* Caller expects us to hold mmap_lock on return */
> > 	if (mmap_unlocked) {
> > 		*lock_dropped = true;
> > 		mmap_read_lock(mm);
> > 	}
> >
>
> Right.
>
> The original code used
>
> 	bool mmap_locked = true;
>
> If the fix gets squashed, maybe we should revert to that handling to
> minimize the churn?

Well collapse_single_pmd() and all descendents would need to be updated to
invert the boolean, not sure it's worth it.

>
> (this is not in stable, right?)
>
> --
> Cheers,
>
> David
Re: [PATCH mm-unstable v4 5/5] mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd()
Posted by David Hildenbrand (Arm) 1 day, 20 hours ago
On 3/31/26 22:06, Lorenzo Stoakes (Oracle) wrote:
> On Tue, Mar 31, 2026 at 10:00:58PM +0200, David Hildenbrand (Arm) wrote:
>> On 3/31/26 16:15, Lorenzo Stoakes (Oracle) wrote:
>>>
>>> That updates mmap_unlocked which will be handled in the loop or on exit:
>>>
>>> 		if (mmap_unlocked) {
>>> 			...
>>> 			mmap_read_lock(mm);
>>> 			mmap_unlocked = false;
>>> 			*lock_dropped = true;
>>> 			...
>>> 		}
>>>
>>> 		result = collapse_single_pmd(addr, vma, &mmap_unlocked, cc);
>>>
>>> ...
>>>
>>> out_maybelock:
>>> 	/* Caller expects us to hold mmap_lock on return */
>>> 	if (mmap_unlocked) {
>>> 		*lock_dropped = true;
>>> 		mmap_read_lock(mm);
>>> 	}
>>>
>>
>> Right.
>>
>> The original code used
>>
>> 	bool mmap_locked = true;
>>
>> If the fix gets squashed, maybe we should revert to that handling to
>> minimize the churn?
> 
> Well collapse_single_pmd() and all descendents would need to be updated to
> invert the boolean, not sure it's worth it.

The code is confusing me.

In the old code, collapse_single_pmd() was called with "mmap_locked ==
true" and set "mmap_locked=false".

Now we call it with "mmap_unlocked=false" and it sets "mmap_unlocked=true".

So far so good.

However, collapse_scan_pmd() consumed "mmap_locked", now it effectively
consumes "mmap_unlocked".

I think I have to squash your fix into Nico's patch to make sense of this. :)


-- 
Cheers,

David
Re: [PATCH mm-unstable v4 5/5] mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd()
Posted by David Hildenbrand (Arm) 1 day, 20 hours ago
On 3/31/26 22:50, David Hildenbrand (Arm) wrote:
> On 3/31/26 22:06, Lorenzo Stoakes (Oracle) wrote:
>> On Tue, Mar 31, 2026 at 10:00:58PM +0200, David Hildenbrand (Arm) wrote:
>>>
>>> Right.
>>>
>>> The original code used
>>>
>>> 	bool mmap_locked = true;
>>>
>>> If the fix gets squashed, maybe we should revert to that handling to
>>> minimize the churn?
>>
>> Well collapse_single_pmd() and all descendents would need to be updated to
>> invert the boolean, not sure it's worth it.
> 
> The code is confusing me.
> 
> In the old code, collapse_single_pmd() was called with "mmap_locked ==
> true" and set "mmap_locked=false".
> 
> Now we call it with "mmap_unlocked=false" and it sets "mmap_unlocked=true".
> 
> So far so good.
> 
> However, collapse_scan_pmd() consumed "mmap_locked", now it effectively
> consumes "mmap_unlocked".

Okay, I'm too confused for today and give up :)

FWIW, this is the effective change, and the effective changes to
mmap_unlocked and lock_dropped ... confuse me:


 mm/khugepaged.c | 146 +++++++++++++++++++++++++-----------------------
 1 file changed, 76 insertions(+), 70 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 26c38934e829..3ffff9f98b2f 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1253,7 +1253,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
 
 static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
 		struct vm_area_struct *vma, unsigned long start_addr,
-		bool *mmap_locked, struct collapse_control *cc)
+		bool *lock_dropped, struct collapse_control *cc)
 {
 	pmd_t *pmd;
 	pte_t *pte, *_pte;
@@ -1428,7 +1428,7 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
 		result = collapse_huge_page(mm, start_addr, referenced,
 					    unmapped, cc);
 		/* collapse_huge_page will return with the mmap_lock released */
-		*mmap_locked = false;
+		*lock_dropped = true;
 	}
 out:
 	trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
@@ -2420,6 +2420,67 @@ static enum scan_result collapse_scan_file(struct mm_struct *mm,
 	return result;
 }
 
+/*
+ * Try to collapse a single PMD starting at a PMD aligned addr, and return
+ * the results.
+ */
+static enum scan_result collapse_single_pmd(unsigned long addr,
+		struct vm_area_struct *vma, bool *lock_dropped,
+		struct collapse_control *cc)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	bool triggered_wb = false;
+	enum scan_result result;
+	struct file *file;
+	pgoff_t pgoff;
+
+	mmap_assert_locked(mm);
+
+	if (vma_is_anonymous(vma)) {
+		result = collapse_scan_pmd(mm, vma, addr, lock_dropped, cc);
+		goto end;
+	}
+
+	file = get_file(vma->vm_file);
+	pgoff = linear_page_index(vma, addr);
+
+	mmap_read_unlock(mm);
+	*lock_dropped = true;
+retry:
+	result = collapse_scan_file(mm, addr, file, pgoff, cc);
+
+	/*
+	 * For MADV_COLLAPSE, when encountering dirty pages, try to writeback,
+	 * then retry the collapse one time.
+	 */
+	if (!cc->is_khugepaged && result == SCAN_PAGE_DIRTY_OR_WRITEBACK &&
+	    !triggered_wb && mapping_can_writeback(file->f_mapping)) {
+		const loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
+		const loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
+
+		filemap_write_and_wait_range(file->f_mapping, lstart, lend);
+		triggered_wb = true;
+		goto retry;
+	}
+	fput(file);
+
+	if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
+		mmap_read_lock(mm);
+		if (collapse_test_exit_or_disable(mm))
+			result = SCAN_ANY_PROCESS;
+		else
+			result = try_collapse_pte_mapped_thp(mm, addr,
+							     !cc->is_khugepaged);
+		if (result == SCAN_PMD_MAPPED)
+			result = SCAN_SUCCEED;
+		mmap_read_unlock(mm);
+	}
+end:
+	if (cc->is_khugepaged && result == SCAN_SUCCEED)
+		++khugepaged_pages_collapsed;
+	return result;
+}
+
 static void collapse_scan_mm_slot(unsigned int progress_max,
 		enum scan_result *result, struct collapse_control *cc)
 	__releases(&khugepaged_mm_lock)
@@ -2481,46 +2542,21 @@ static void collapse_scan_mm_slot(unsigned int progress_max,
 		VM_BUG_ON(khugepaged_scan.address & ~HPAGE_PMD_MASK);
 
 		while (khugepaged_scan.address < hend) {
-			bool mmap_locked = true;
+			bool lock_dropped = false;
 
 			cond_resched();
 			if (unlikely(collapse_test_exit_or_disable(mm)))
 				goto breakouterloop;
 
-			VM_BUG_ON(khugepaged_scan.address < hstart ||
+			VM_WARN_ON_ONCE(khugepaged_scan.address < hstart ||
 				  khugepaged_scan.address + HPAGE_PMD_SIZE >
 				  hend);
-			if (!vma_is_anonymous(vma)) {
-				struct file *file = get_file(vma->vm_file);
-				pgoff_t pgoff = linear_page_index(vma,
-						khugepaged_scan.address);
-
-				mmap_read_unlock(mm);
-				mmap_locked = false;
-				*result = collapse_scan_file(mm,
-					khugepaged_scan.address, file, pgoff, cc);
-				fput(file);
-				if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
-					mmap_read_lock(mm);
-					if (collapse_test_exit_or_disable(mm))
-						goto breakouterloop;
-					*result = try_collapse_pte_mapped_thp(mm,
-						khugepaged_scan.address, false);
-					if (*result == SCAN_PMD_MAPPED)
-						*result = SCAN_SUCCEED;
-					mmap_read_unlock(mm);
-				}
-			} else {
-				*result = collapse_scan_pmd(mm, vma,
-					khugepaged_scan.address, &mmap_locked, cc);
-			}
-
-			if (*result == SCAN_SUCCEED)
-				++khugepaged_pages_collapsed;
 
+			*result = collapse_single_pmd(khugepaged_scan.address,
+						      vma, &lock_dropped, cc);
 			/* move to next address */
 			khugepaged_scan.address += HPAGE_PMD_SIZE;
-			if (!mmap_locked)
+			if (lock_dropped)
 				/*
 				 * We released mmap_lock so break loop.  Note
 				 * that we drop mmap_lock before all hugepage
@@ -2795,7 +2831,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 	unsigned long hstart, hend, addr;
 	enum scan_result last_fail = SCAN_FAIL;
 	int thps = 0;
-	bool mmap_locked = true;
+	bool mmap_unlocked = false;
 
 	BUG_ON(vma->vm_start > start);
 	BUG_ON(vma->vm_end < end);
@@ -2817,13 +2853,12 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 
 	for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
 		enum scan_result result = SCAN_FAIL;
-		bool triggered_wb = false;
 
-retry:
-		if (!mmap_locked) {
+		if (mmap_unlocked) {
 			cond_resched();
 			mmap_read_lock(mm);
-			mmap_locked = true;
+			mmap_unlocked = false;
+			*lock_dropped = true;
 			result = hugepage_vma_revalidate(mm, addr, false, &vma,
 							 cc);
 			if (result  != SCAN_SUCCEED) {
@@ -2833,45 +2868,14 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 
 			hend = min(hend, vma->vm_end & HPAGE_PMD_MASK);
 		}
-		mmap_assert_locked(mm);
-		if (!vma_is_anonymous(vma)) {
-			struct file *file = get_file(vma->vm_file);
-			pgoff_t pgoff = linear_page_index(vma, addr);
 
-			mmap_read_unlock(mm);
-			mmap_locked = false;
-			*lock_dropped = true;
-			result = collapse_scan_file(mm, addr, file, pgoff, cc);
-
-			if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
-			    mapping_can_writeback(file->f_mapping)) {
-				loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
-				loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
+		result = collapse_single_pmd(addr, vma, &mmap_unlocked, cc);
 
-				filemap_write_and_wait_range(file->f_mapping, lstart, lend);
-				triggered_wb = true;
-				fput(file);
-				goto retry;
-			}
-			fput(file);
-		} else {
-			result = collapse_scan_pmd(mm, vma, addr, &mmap_locked, cc);
-		}
-		if (!mmap_locked)
-			*lock_dropped = true;
-
-handle_result:
 		switch (result) {
 		case SCAN_SUCCEED:
 		case SCAN_PMD_MAPPED:
 			++thps;
 			break;
-		case SCAN_PTE_MAPPED_HUGEPAGE:
-			BUG_ON(mmap_locked);
-			mmap_read_lock(mm);
-			result = try_collapse_pte_mapped_thp(mm, addr, true);
-			mmap_read_unlock(mm);
-			goto handle_result;
 		/* Whitelisted set of results where continuing OK */
 		case SCAN_NO_PTE_TABLE:
 		case SCAN_PTE_NON_PRESENT:
@@ -2894,8 +2898,10 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 
 out_maybelock:
 	/* Caller expects us to hold mmap_lock on return */
-	if (!mmap_locked)
+	if (mmap_unlocked) {
+		*lock_dropped = false;
 		mmap_read_lock(mm);
+	}
 out_nolock:
 	mmap_assert_locked(mm);
 	mmdrop(mm);
-- 
2.43.0


-- 
Cheers,

David
Re: [PATCH mm-unstable v4 5/5] mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd()
Posted by Nico Pache 1 day, 20 hours ago
On Tue, Mar 31, 2026 at 3:04 PM David Hildenbrand (Arm)
<david@kernel.org> wrote:
>
> On 3/31/26 22:50, David Hildenbrand (Arm) wrote:
> > On 3/31/26 22:06, Lorenzo Stoakes (Oracle) wrote:
> >> On Tue, Mar 31, 2026 at 10:00:58PM +0200, David Hildenbrand (Arm) wrote:
> >>>
> >>> Right.
> >>>
> >>> The original code used
> >>>
> >>>     bool mmap_locked = true;
> >>>
> > >>> If the fix gets squashed, maybe we should revert to that handling to
> >>> minimize the churn?
> >>
> >> Well collapse_single_pmd() and all descendents would need to be updated to
> >> invert the boolean, not sure it's worth it.
> >
> > The code is confusing me.
> >
> > In the old code, collapse_single_pmd() was called with "mmap_locked ==
> > true" and set "mmap_locked=false".
> >
> > Now we call it with "mmap_unlocked=false" and it sets "mmap_unlocked=true".
> >
> > So far so good.
> >
> > However, collapse_scan_pmd() consumed "mmap_locked", now it effectively
> > consumes "mmap_unlocked".
>
> Okay, I'm too confused for today and give up :)
>
> FWIW, this is the effective change, and the effective changes to
> mmap_unlocked and lock_dropped ... confuse me:

Basically all the logic was inverted.

>
>
>  mm/khugepaged.c | 146 +++++++++++++++++++++++++-----------------------
>  1 file changed, 76 insertions(+), 70 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 26c38934e829..3ffff9f98b2f 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1253,7 +1253,7 @@ static enum scan_result collapse_huge_page(struct mm_struct *mm, unsigned long a
>
>  static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>                 struct vm_area_struct *vma, unsigned long start_addr,
> -               bool *mmap_locked, struct collapse_control *cc)
> +               bool *lock_dropped, struct collapse_control *cc)
>  {
>         pmd_t *pmd;
>         pte_t *pte, *_pte;
> @@ -1428,7 +1428,7 @@ static enum scan_result collapse_scan_pmd(struct mm_struct *mm,
>                 result = collapse_huge_page(mm, start_addr, referenced,
>                                             unmapped, cc);
>                 /* collapse_huge_page will return with the mmap_lock released */
> -               *mmap_locked = false;
> +               *lock_dropped = true;
>         }
>  out:
>         trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
> @@ -2420,6 +2420,67 @@ static enum scan_result collapse_scan_file(struct mm_struct *mm,
>         return result;
>  }
>
> +/*
> + * Try to collapse a single PMD starting at a PMD aligned addr, and return
> + * the results.
> + */
> +static enum scan_result collapse_single_pmd(unsigned long addr,
> +               struct vm_area_struct *vma, bool *lock_dropped,
> +               struct collapse_control *cc)
> +{
> +       struct mm_struct *mm = vma->vm_mm;
> +       bool triggered_wb = false;
> +       enum scan_result result;
> +       struct file *file;
> +       pgoff_t pgoff;
> +
> +       mmap_assert_locked(mm);
> +
> +       if (vma_is_anonymous(vma)) {
> +               result = collapse_scan_pmd(mm, vma, addr, lock_dropped, cc);
> +               goto end;
> +       }
> +
> +       file = get_file(vma->vm_file);
> +       pgoff = linear_page_index(vma, addr);
> +
> +       mmap_read_unlock(mm);
> +       *lock_dropped = true;
> +retry:
> +       result = collapse_scan_file(mm, addr, file, pgoff, cc);
> +
> +       /*
> +        * For MADV_COLLAPSE, when encountering dirty pages, try to writeback,
> +        * then retry the collapse one time.
> +        */
> +       if (!cc->is_khugepaged && result == SCAN_PAGE_DIRTY_OR_WRITEBACK &&
> +           !triggered_wb && mapping_can_writeback(file->f_mapping)) {
> +               const loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
> +               const loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
> +
> +               filemap_write_and_wait_range(file->f_mapping, lstart, lend);
> +               triggered_wb = true;
> +               goto retry;
> +       }
> +       fput(file);
> +
> +       if (result == SCAN_PTE_MAPPED_HUGEPAGE) {
> +               mmap_read_lock(mm);
> +               if (collapse_test_exit_or_disable(mm))
> +                       result = SCAN_ANY_PROCESS;
> +               else
> +                       result = try_collapse_pte_mapped_thp(mm, addr,
> +                                                            !cc->is_khugepaged);
> +               if (result == SCAN_PMD_MAPPED)
> +                       result = SCAN_SUCCEED;
> +               mmap_read_unlock(mm);
> +       }
> +end:
> +       if (cc->is_khugepaged && result == SCAN_SUCCEED)
> +               ++khugepaged_pages_collapsed;
> +       return result;
> +}
> +
>  static void collapse_scan_mm_slot(unsigned int progress_max,
>                 enum scan_result *result, struct collapse_control *cc)
>         __releases(&khugepaged_mm_lock)
> @@ -2481,46 +2542,21 @@ static void collapse_scan_mm_slot(unsigned int progress_max,
>                 VM_BUG_ON(khugepaged_scan.address & ~HPAGE_PMD_MASK);
>
>                 while (khugepaged_scan.address < hend) {
> -                       bool mmap_locked = true;
> +                       bool lock_dropped = false;
>
>                         cond_resched();
>                         if (unlikely(collapse_test_exit_or_disable(mm)))
>                                 goto breakouterloop;
>
> -                       VM_BUG_ON(khugepaged_scan.address < hstart ||
> +                       VM_WARN_ON_ONCE(khugepaged_scan.address < hstart ||
>                                   khugepaged_scan.address + HPAGE_PMD_SIZE >
>                                   hend);
> -                       if (!vma_is_anonymous(vma)) {
> -                               struct file *file = get_file(vma->vm_file);
> -                               pgoff_t pgoff = linear_page_index(vma,
> -                                               khugepaged_scan.address);
> -
> -                               mmap_read_unlock(mm);
> -                               mmap_locked = false;
> -                               *result = collapse_scan_file(mm,
> -                                       khugepaged_scan.address, file, pgoff, cc);
> -                               fput(file);
> -                               if (*result == SCAN_PTE_MAPPED_HUGEPAGE) {
> -                                       mmap_read_lock(mm);
> -                                       if (collapse_test_exit_or_disable(mm))
> -                                               goto breakouterloop;
> -                                       *result = try_collapse_pte_mapped_thp(mm,
> -                                               khugepaged_scan.address, false);
> -                                       if (*result == SCAN_PMD_MAPPED)
> -                                               *result = SCAN_SUCCEED;
> -                                       mmap_read_unlock(mm);
> -                               }
> -                       } else {
> -                               *result = collapse_scan_pmd(mm, vma,
> -                                       khugepaged_scan.address, &mmap_locked, cc);
> -                       }
> -
> -                       if (*result == SCAN_SUCCEED)
> -                               ++khugepaged_pages_collapsed;
>
> +                       *result = collapse_single_pmd(khugepaged_scan.address,
> +                                                     vma, &lock_dropped, cc);
>                         /* move to next address */
>                         khugepaged_scan.address += HPAGE_PMD_SIZE;
> -                       if (!mmap_locked)
> +                       if (lock_dropped)
>                                 /*
>                                  * We released mmap_lock so break loop.  Note
>                                  * that we drop mmap_lock before all hugepage
> @@ -2795,7 +2831,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>         unsigned long hstart, hend, addr;
>         enum scan_result last_fail = SCAN_FAIL;
>         int thps = 0;
> -       bool mmap_locked = true;
> +       bool mmap_unlocked = false;
>
>         BUG_ON(vma->vm_start > start);
>         BUG_ON(vma->vm_end < end);
> @@ -2817,13 +2853,12 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>
>         for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
>                 enum scan_result result = SCAN_FAIL;
> -               bool triggered_wb = false;
>
> -retry:
> -               if (!mmap_locked) {
> +               if (mmap_unlocked) {
>                         cond_resched();
>                         mmap_read_lock(mm);
> -                       mmap_locked = true;
> +                       mmap_unlocked = false;
> +                       *lock_dropped = true;
>                         result = hugepage_vma_revalidate(mm, addr, false, &vma,
>                                                          cc);
>                         if (result  != SCAN_SUCCEED) {
> @@ -2833,45 +2868,14 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>
>                         hend = min(hend, vma->vm_end & HPAGE_PMD_MASK);
>                 }
> -               mmap_assert_locked(mm);
> -               if (!vma_is_anonymous(vma)) {
> -                       struct file *file = get_file(vma->vm_file);
> -                       pgoff_t pgoff = linear_page_index(vma, addr);
>
> -                       mmap_read_unlock(mm);
> -                       mmap_locked = false;
> -                       *lock_dropped = true;
> -                       result = collapse_scan_file(mm, addr, file, pgoff, cc);
> -
> -                       if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
> -                           mapping_can_writeback(file->f_mapping)) {
> -                               loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
> -                               loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
> +               result = collapse_single_pmd(addr, vma, &mmap_unlocked, cc);
>
> -                               filemap_write_and_wait_range(file->f_mapping, lstart, lend);
> -                               triggered_wb = true;
> -                               fput(file);
> -                               goto retry;
> -                       }
> -                       fput(file);
> -               } else {
> -                       result = collapse_scan_pmd(mm, vma, addr, &mmap_locked, cc);
> -               }
> -               if (!mmap_locked)
> -                       *lock_dropped = true;
> -
> -handle_result:
>                 switch (result) {
>                 case SCAN_SUCCEED:
>                 case SCAN_PMD_MAPPED:
>                         ++thps;
>                         break;
> -               case SCAN_PTE_MAPPED_HUGEPAGE:
> -                       BUG_ON(mmap_locked);
> -                       mmap_read_lock(mm);
> -                       result = try_collapse_pte_mapped_thp(mm, addr, true);
> -                       mmap_read_unlock(mm);
> -                       goto handle_result;
>                 /* Whitelisted set of results where continuing OK */
>                 case SCAN_NO_PTE_TABLE:
>                 case SCAN_PTE_NON_PRESENT:
> @@ -2894,8 +2898,10 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>
>  out_maybelock:
>         /* Caller expects us to hold mmap_lock on return */
> -       if (!mmap_locked)
> +       if (mmap_unlocked) {
> +               *lock_dropped = false;

Lorenzo, is this correct? At first glance, shouldn't this be
`*lock_dropped = true`?

-- Nico

>                 mmap_read_lock(mm);
> +       }
>  out_nolock:
>         mmap_assert_locked(mm);
>         mmdrop(mm);
> --
> 2.43.0
>
>
> --
> Cheers,
>
> David
>
Re: [PATCH mm-unstable v4 5/5] mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd()
Posted by Lorenzo Stoakes (Oracle) 1 day, 9 hours ago
On Tue, Mar 31, 2026 at 03:09:08PM -0600, Nico Pache wrote:
> On Tue, Mar 31, 2026 at 3:04 PM David Hildenbrand (Arm)
> <david@kernel.org> wrote:
> >
> > On 3/31/26 22:50, David Hildenbrand (Arm) wrote:
> > > On 3/31/26 22:06, Lorenzo Stoakes (Oracle) wrote:
> > >> On Tue, Mar 31, 2026 at 10:00:58PM +0200, David Hildenbrand (Arm) wrote:
> > >>>
> > >>> Right.
> > >>>
> > >>> The original code used
> > >>>
> > >>>     bool mmap_locked = true;
> > >>>
> > >>> If the fix get squashed, maybe we should revert to that handling to
> > >>> minimize the churn?
> > >>
> > >> Well collapse_single_pmd() and all descendents would need to be updated to
> > >> invert the boolean, not sure it's worth it.
> > >
> > > The code is confusing me.
> > >
> > > In the old code, collapse_single_pmd() was called with "mmap_locked ==
> > > true" and set "mmap_locked=false".
> > >
> > > Now we call it with "mmap_unlocked=false" and it sets "mmap_unlocked=true".
> > >
> > > So far so good.
> > >
> > > However, collapse_scan_pmd() consumed "mmap_locked", now it effectively
> > > consumes "mmap_unlocked".
> >
> > Okay, I'm too confused for today and give up :)
> >
> > FWIW, this is the effective change, and the effective changes to
> > mmap_unlocked and lock_dropped ... confuse me:
>
> Basically, all the logic was inverted.

Yup.

> >  out_maybelock:
> >         /* Caller expects us to hold mmap_lock on return */
> > -       if (!mmap_locked)
> > +       if (mmap_unlocked) {
> > +               *lock_dropped = false;
>
> Lorenzo, is this correct? At first glance, shouldn't this be
> `*lock_dropped = true`?

It is :) you're looking at the old code from your patch maybe?

I include my fix-patch again below to make life easier, but that code is:

+		if (mmap_unlocked) {
 			cond_resched();
 			mmap_read_lock(mm);
-			*lock_dropped = false;
+			mmap_unlocked = false;
+			*lock_dropped = true;

Having one on top of the other is causing the confusion :)

I also asked Claude to check this btw for correctness. I'm pretty confident it's
ok.

Cheers, Lorenzo

----8<----
From a4dfc7718a15035449f344a0bc7f58e449366405 Mon Sep 17 00:00:00 2001
From: "Lorenzo Stoakes (Oracle)" <ljs@kernel.org>
Date: Tue, 31 Mar 2026 13:11:18 +0100
Subject: [PATCH] mm/khugepaged: fix issue with tracking lock

We are incorrectly treating lock_dropped to track both whether the lock is
currently held and whether or not the lock was ever dropped.

Update this change to account for this.

Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
---
 mm/khugepaged.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index d21348b85a59..b8452dbdb043 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -2828,6 +2828,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 	unsigned long hstart, hend, addr;
 	enum scan_result last_fail = SCAN_FAIL;
 	int thps = 0;
+	bool mmap_unlocked = false;

 	BUG_ON(vma->vm_start > start);
 	BUG_ON(vma->vm_end < end);
@@ -2850,10 +2851,11 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 	for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
 		enum scan_result result = SCAN_FAIL;

-		if (*lock_dropped) {
+		if (mmap_unlocked) {
 			cond_resched();
 			mmap_read_lock(mm);
-			*lock_dropped = false;
+			mmap_unlocked = false;
+			*lock_dropped = true;
 			result = hugepage_vma_revalidate(mm, addr, false, &vma,
 							 cc);
 			if (result  != SCAN_SUCCEED) {
@@ -2864,7 +2866,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 			hend = min(hend, vma->vm_end & HPAGE_PMD_MASK);
 		}

-		result = collapse_single_pmd(addr, vma, lock_dropped, cc);
+		result = collapse_single_pmd(addr, vma, &mmap_unlocked, cc);

 		switch (result) {
 		case SCAN_SUCCEED:
@@ -2893,8 +2895,10 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,

 out_maybelock:
 	/* Caller expects us to hold mmap_lock on return */
-	if (*lock_dropped)
+	if (mmap_unlocked) {
+		*lock_dropped = true;
 		mmap_read_lock(mm);
+	}
 out_nolock:
 	mmap_assert_locked(mm);
 	mmdrop(mm);
--
2.53.0
Re: [PATCH mm-unstable v4 5/5] mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd()
Posted by Andrew Morton 21 hours ago
On Wed, 1 Apr 2026 09:14:35 +0100 "Lorenzo Stoakes (Oracle)" <ljs@kernel.org> wrote:

> From: "Lorenzo Stoakes (Oracle)" <ljs@kernel.org>
> Date: Tue, 31 Mar 2026 13:11:18 +0100
> Subject: [PATCH] mm/khugepaged: fix issue with tracking lock
> 
> We are incorrectly treating lock_dropped to track both whether the lock is
> currently held and whether or not the lock was ever dropped.
> 
> Update this change to account for this.
> 
> Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>

Not sure if I should add this?

Should it have

Fixes: 330f3758a3bc ("mm/khugepaged: unify khugepaged and madv_collapse
with collapse_single_pmd()")?
Re: [PATCH mm-unstable v4 5/5] mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd()
Posted by David Hildenbrand (Arm) 2 days, 2 hours ago
On 3/31/26 16:15, Lorenzo Stoakes (Oracle) wrote:
> On Tue, Mar 31, 2026 at 04:13:29PM +0200, David Hildenbrand (Arm) wrote:
>>>
>>> So this is the bug. 'lock_dropped' needs to record if the lock was _ever_
>>> dropped, not if it is _currently_ dropped.
>>
>> Ah, well spotted. I thought we discussed that at some point during
>> review ...
>>
>> Thanks for tackling this!
>>
>> [...]
>>
>>>
>>> ----8<----
>>> From a4dfc7718a15035449f344a0bc7f58e449366405 Mon Sep 17 00:00:00 2001
>>> From: "Lorenzo Stoakes (Oracle)" <ljs@kernel.org>
>>> Date: Tue, 31 Mar 2026 13:11:18 +0100
>>> Subject: [PATCH] mm/khugepaged: fix issue with tracking lock
>>>
>>> We are incorrectly treating lock_dropped to track both whether the lock is
>>> currently held and whether or not the lock was ever dropped.
>>>
>>> Update this change to account for this.
>>>
>>> Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
>>> ---
>>>  mm/khugepaged.c | 12 ++++++++----
>>>  1 file changed, 8 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index d21348b85a59..b8452dbdb043 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -2828,6 +2828,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>>>  	unsigned long hstart, hend, addr;
>>>  	enum scan_result last_fail = SCAN_FAIL;
>>>  	int thps = 0;
>>> +	bool mmap_unlocked = false;
>>>
>>>  	BUG_ON(vma->vm_start > start);
>>>  	BUG_ON(vma->vm_end < end);
>>> @@ -2850,10 +2851,11 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>>>  	for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
>>>  		enum scan_result result = SCAN_FAIL;
>>>
>>> -		if (*lock_dropped) {
>>> +		if (mmap_unlocked) {
>>>  			cond_resched();
>>>  			mmap_read_lock(mm);
>>> -			*lock_dropped = false;
>>> +			mmap_unlocked = false;
>>> +			*lock_dropped = true;
>>>  			result = hugepage_vma_revalidate(mm, addr, false, &vma,
>>>  							cc);
>>>  			if (result  != SCAN_SUCCEED) {
>>> @@ -2864,7 +2866,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>>
>> (in hurry so I might be wrong)
>>
>> Do we have to handle when collapse_single_pmd() dropped the lock as well?

Great, thanks!

-- 
Cheers,

David