[v5] mm/khugepaged: fix dirty page handling for MADV_COLLAPSE

[PATCH V5 0/2] mm/khugepaged: fix dirty page handling for MADV_COLLAPSE

Posted by Shivank Garg 2 weeks, 5 days ago

MADV_COLLAPSE on file-backed mappings fails with -EINVAL when TEXT pages
are dirty. This affects scenarios like package/container updates or
executing binaries immediately after writing them, etc.

The issue is that collapse_file() triggers async writeback and returns
SCAN_FAIL (maps to -EINVAL), expecting khugepaged to revisit later. But
MADV_COLLAPSE is synchronous and userspace expects immediate success or
a clear retry signal.

Reproduction:
 - Compile or copy 2MB-aligned executable to XFS/ext4 FS
 - Call MADV_COLLAPSE on .text section
 - First call fails with -EINVAL (text pages dirty from copy)
 - Second call succeeds (async writeback completed)

Issue Report:
https://lore.kernel.org/all/4e26fe5e-7374-467c-a333-9dd48f85d7cc@amd.com


Hi Andrew,

This V5 series incorporates David's feedback for simplifying the retry
logic. To apply on mm-new, please drop:
- 20251215084615.5283-3-shivankg@amd.com:
  [PATCH V4 0/2] mm/khugepaged: fix dirty page handling for MADV_COLLAPSE
- 20251224111351.41042-4-shivankg@amd.com:
  [PATCH V2 0/5] mm/khugepaged: cleanups and scan limit fix
  (merge conflicts with this series; V3 with review fixes posting soon)

Thank you :)


Changelog:
V5:
- In patch 2/2, Simplify dirty writeback retry logic (David)

V4:
- https://lore.kernel.org/all/20251215084615.5283-3-shivankg@amd.com
- Rebase on mm-new
- Fix spurious blank line (Lance)

V3:
- https://lore.kernel.org/all/20251201185604.210634-6-shivankg@amd.com
- Reordered patches: Enum definition comes first as the retry logic depends on it
- Renamed SCAN_PAGE_NOT_CLEAN to SCAN_PAGE_DIRTY_OR_WRITEBACK (Dev, Lance, David)
- Changed writeback logic: Only trigger synchronous writeback and retry if the
  initial collapse attempt failed specifically due to dirty/writeback pages,
  rather than blindly flushing all file-backed VMAs (David)
- Added proper file reference counting (get_file/fput) around the unlock window
  to prevent UAF (Lance)

V2:
- https://lore.kernel.org/all/20251120065043.41738-6-shivankg@amd.com
- Move writeback to madvise_collapse() (better abstraction, proper
  mmap_lock handling and does VMA revalidation after I/O) (Lorenzo)
- Rename to SCAN_PAGE_DIRTY to SCAN_PAGE_NOT_CLEAN and extend its use
  for all dirty/writeback folio cases that previously returned incorrect
  results (Dev) 

V1: https://lore.kernel.org/all/20251110113254.77822-1-shivankg@amd.com

Thanks,

Shivank Garg (2):
  mm/khugepaged: map dirty/writeback pages failures to EAGAIN
  mm/khugepaged: retry with sync writeback for MADV_COLLAPSE

 include/trace/events/huge_memory.h |  3 ++-
 mm/khugepaged.c                    | 23 ++++++++++++++++++++---
 2 files changed, 22 insertions(+), 4 deletions(-)

-- 
2.43.0

Re: [PATCH V5 0/2] mm/khugepaged: fix dirty page handling for MADV_COLLAPSE

Posted by Andrew Morton 2 weeks, 5 days ago

On Sun, 18 Jan 2026 19:09:38 +0000 Shivank Garg <shivankg@amd.com> wrote:

> MADV_COLLAPSE on file-backed mappings fails with -EINVAL when TEXT pages
> are dirty. This affects scenarios like package/container updates or
> executing binaries immediately after writing them, etc.
> 
> The issue is that collapse_file() triggers async writeback and returns
> SCAN_FAIL (maps to -EINVAL), expecting khugepaged to revisit later. But
> MADV_COLLAPSE is synchronous and userspace expects immediate success or
> a clear retry signal.
> 
> Reproduction:
>  - Compile or copy 2MB-aligned executable to XFS/ext4 FS
>  - Call MADV_COLLAPSE on .text section
>  - First call fails with -EINVAL (text pages dirty from copy)
>  - Second call succeeds (async writeback completed)
> 
> Issue Report:
> https://lore.kernel.org/all/4e26fe5e-7374-467c-a333-9dd48f85d7cc@amd.com

Updated, thanks.

Please tolerate a little whining about the timeliess here.  We're at
-rc6, v4 was added to mm.git over a month ago, had quite a lot of
review, this is very close to being moved into the mm-stable branch and now
we get v5.  Argh.

> V5:
> - In patch 2/2, Simplify dirty writeback retry logic (David)

Are you sure this is the only change?  It looks like a lot for a
simplification and I'm wondering if we should retain the v4 series and
defer a simplification for separate consideration during the next
cycle.

Below is how this updated altered mm.git.  Could reviewers please check
this fairly soon?


--- a/mm/khugepaged.c~b
+++ a/mm/khugepaged.c
@@ -2788,11 +2788,11 @@ int madvise_collapse(struct vm_area_stru
 	hend = end & HPAGE_PMD_MASK;
 
 	for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
-		bool retried = false;
 		int result = SCAN_FAIL;
+		bool triggered_wb = false;
 
-		if (!mmap_locked) {
 retry:
+		if (!mmap_locked) {
 			cond_resched();
 			mmap_read_lock(mm);
 			mmap_locked = true;
@@ -2812,52 +2812,27 @@ retry:
 
 			mmap_read_unlock(mm);
 			mmap_locked = false;
+			*lock_dropped = true;
 			result = hpage_collapse_scan_file(mm, addr, file, pgoff,
 							  cc);
-			fput(file);
-		} else {
-			result = hpage_collapse_scan_pmd(mm, vma, addr,
-							 &mmap_locked, cc);
-		}
-		if (!mmap_locked)
-			*lock_dropped = true;
-
-		/*
-		 * If the file-backed VMA has dirty pages, the scan triggers
-		 * async writeback and returns SCAN_PAGE_DIRTY_OR_WRITEBACK.
-		 * Since MADV_COLLAPSE is sync, we force sync writeback and
-		 * retry once.
-		 */
-		if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !retried) {
-			/*
-			 * File scan drops the lock. We must re-acquire it to
-			 * safely inspect the VMA and hold the file reference.
-			 */
-			if (!mmap_locked) {
-				cond_resched();
-				mmap_read_lock(mm);
-				mmap_locked = true;
-				result = hugepage_vma_revalidate(mm, addr, false, &vma, cc);
-				if (result != SCAN_SUCCEED)
-					goto handle_result;
-			}
 
-			if (!vma_is_anonymous(vma) && vma->vm_file &&
-			    mapping_can_writeback(vma->vm_file->f_mapping)) {
-				struct file *file = get_file(vma->vm_file);
-				pgoff_t pgoff = linear_page_index(vma, addr);
+			if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
+			    mapping_can_writeback(file->f_mapping)) {
 				loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
 				loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
 
-				mmap_read_unlock(mm);
-				mmap_locked = false;
-				*lock_dropped = true;
 				filemap_write_and_wait_range(file->f_mapping, lstart, lend);
+				triggered_wb = true;
 				fput(file);
-				retried = true;
 				goto retry;
 			}
+			fput(file);
+		} else {
+			result = hpage_collapse_scan_pmd(mm, vma, addr,
+							 &mmap_locked, cc);
 		}
+		if (!mmap_locked)
+			*lock_dropped = true;
 
 handle_result:
 		switch (result) {
_

Re: [PATCH V5 0/2] mm/khugepaged: fix dirty page handling for MADV_COLLAPSE

Posted by Garg, Shivank 2 weeks, 5 days ago

On 1/19/2026 1:52 AM, Andrew Morton wrote:
> Please tolerate a little whining about the timeliess here.  We're at
> -rc6, v4 was added to mm.git over a month ago, had quite a lot of
> review, this is very close to being moved into the mm-stable branch and now
> we get v5.  Argh.
> 
I sincerely apologize for this.

I had this doubt on sending an incremental patch or V5:
https://lore.kernel.org/linux-mm/7a42515f-ae57-4f4d-831c-87689930a797@amd.com

>> V5:
>> - In patch 2/2, Simplify dirty writeback retry logic (David)
> Are you sure this is the only change?  It looks like a lot for a
> simplification and I'm wondering if we should retain the v4 series and
> defer a simplification for separate consideration during the next
> cycle.

Yes, patch 1/2 is unchanged and patch 2/2 is the only change. 
The diff looks larger due to code movement but logic is actually simpler now.

I completely understand if you prefer to keep V4 and defer this 
refactoring. I'm sorry for creating this late-cycle churn. Please let me 
know what you'd prefer and I'll follow your guidance.
Thank you for your patience.

Regards,
Shivank

Re: [PATCH V5 0/2] mm/khugepaged: fix dirty page handling for MADV_COLLAPSE

Posted by David Hildenbrand (Red Hat) 2 weeks, 5 days ago

On 1/19/26 06:07, Garg, Shivank wrote:
> 
> 
> On 1/19/2026 1:52 AM, Andrew Morton wrote:
>> Please tolerate a little whining about the timeliess here.  We're at
>> -rc6, v4 was added to mm.git over a month ago, had quite a lot of
>> review, this is very close to being moved into the mm-stable branch and now
>> we get v5.  Argh.
>>
> I sincerely apologize for this.
> 
> I had this doubt on sending an incremental patch or V5:
> https://lore.kernel.org/linux-mm/7a42515f-ae57-4f4d-831c-87689930a797@amd.com
> 
> 
>>> V5:
>>> - In patch 2/2, Simplify dirty writeback retry logic (David)
>> Are you sure this is the only change?  It looks like a lot for a
>> simplification and I'm wondering if we should retain the v4 series and
>> defer a simplification for separate consideration during the next
>> cycle.
> 
> Yes, patch 1/2 is unchanged and patch 2/2 is the only change.
> The diff looks larger due to code movement but logic is actually simpler now.
> 
> I completely understand if you prefer to keep V4 and defer this
> refactoring. I'm sorry for creating this late-cycle churn. Please let me
> know what you'd prefer and I'll follow your guidance.

There is no reason to rush any of this. :)

Review was not over and due to multiple factors my review on v4 came in 
later than I wanted.

Andrew, if you prefer, I can start sending as a reply to every patch set 
that I want to review so you are aware that it is on my review backlog.

Unfortunately we can't have nice things (sub-maintainer acks).

-- 
Cheers

David

Re: [PATCH V5 0/2] mm/khugepaged: fix dirty page handling for MADV_COLLAPSE

Posted by Lorenzo Stoakes 2 weeks, 5 days ago

On Mon, Jan 19, 2026 at 10:58:39AM +0100, David Hildenbrand (Red Hat) wrote:
> On 1/19/26 06:07, Garg, Shivank wrote:
> >
> >
> > On 1/19/2026 1:52 AM, Andrew Morton wrote:
> > > Please tolerate a little whining about the timeliess here.  We're at
> > > -rc6, v4 was added to mm.git over a month ago, had quite a lot of
> > > review, this is very close to being moved into the mm-stable branch and now
> > > we get v5.  Argh.

As co-maintainer of THP:

Can I explicitly request that you not merge anything in THP without EXPLICIT
sub-M sign-off, i.e. either me or David.

Obviously I have publicly and privately request that we do this for all of mm,
but in THP especially, we CANNOT accept series landing without this.

THP workload is one of the worst in mm and is completely overwhelming.

Again - David is working whilst being on holiday to handle the load because by
default we auto-merge.

And here you are complaining that you can't auto-merge (...!), it's completely
unworkable.

People can just _wait_ for things to land sometimes. It's OK. It's part of why
the whole rc- approach was added - I think Greg KH explicitly wrote about this
elsewhere - people can just wait for the next cycle and not be too disappointed,
it realy isn't that long.

Stability trumps speed of merge.

I don't want to have to write a script to auto-NAK anything that doesn't match
this requirement, but I'm starting to feel like I have to... Let's try to avoid
that.

> > >
> > I sincerely apologize for this.

You don't need to apologise Shivank, you just might have to wait a cycle.

> >
> > I had this doubt on sending an incremental patch or V5:
> > https://lore.kernel.org/linux-mm/7a42515f-ae57-4f4d-831c-87689930a797@amd.com
> >
> >
> > > > V5:
> > > > - In patch 2/2, Simplify dirty writeback retry logic (David)
> > > Are you sure this is the only change?  It looks like a lot for a
> > > simplification and I'm wondering if we should retain the v4 series and
> > > defer a simplification for separate consideration during the next
> > > cycle.
> >
> > Yes, patch 1/2 is unchanged and patch 2/2 is the only change.
> > The diff looks larger due to code movement but logic is actually simpler now.
> >
> > I completely understand if you prefer to keep V4 and defer this
> > refactoring. I'm sorry for creating this late-cycle churn. Please let me
> > know what you'd prefer and I'll follow your guidance.
>
> There is no reason to rush any of this. :)

Yes absolutely.

It's patently INSANE and in fact a ongoing security risk to rush things in mm in
general, and we really need to stop this.

>
> Review was not over and due to multiple factors my review on v4 came in
> later than I wanted.

Also you're on leave from work :)

I have taken a step back from review somewhat due to this insanity, but now feel
like I have to step it back up... due to this insanity.

Or maybe write a script...

>
> Andrew, if you prefer, I can start sending as a reply to every patch set
> that I want to review so you are aware that it is on my review backlog.

This is ridiculous and shouldn't be necessary.

>
> Unfortunately we can't have nice things (sub-maintainer acks).

Well we can start having less nice things like a NAK script if this unworkable
situation continues.

>
> --
> Cheers
>
> David

Again Andrew - as THP co-maintainer - I politely ask that, while you might not
want to implement a sub-M A-b/R-b requirement across mm (though you should),
that you implement one for THP explicitly.

If you don't then I don't really see any other option than to start NAK'ing
everything in THP that hits mm-stable which either David or I have not
explicitly tagged.

Thanks, Lorenzo