[PATCH V4 2/2] mm/khugepaged: retry with sync writeback for MADV_COLLAPSE

Posted by Shivank Garg 1 month, 3 weeks ago
When MADV_COLLAPSE is called on file-backed mappings (e.g., executable
text sections), the pages may still be dirty from recent writes.
collapse_file() will trigger async writeback and fail with
SCAN_PAGE_DIRTY_OR_WRITEBACK (-EAGAIN).

MADV_COLLAPSE is a synchronous operation where userspace expects
immediate results. If the collapse fails due to dirty pages, perform
synchronous writeback on the specific range and retry once.

This avoids spurious failures for freshly written executables without
incurring unnecessary synchronous I/O for mappings that are already clean.
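
For reference, the failure mode is visible from plain userspace. Below is a
minimal sketch (the path and the 2M-aligned, collapse-eligible text range
are assumptions, not part of this patch) of the reporter's scenario: map a
freshly written executable and request a collapse. Before this patch,
madvise() could fail with EAGAIN because the pages were still dirty:

	#include <errno.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#ifndef MADV_COLLAPSE
	#define MADV_COLLAPSE 25	/* uapi value, kernels >= 6.1 */
	#endif

	int main(void)
	{
		size_t len = 2UL << 20;	/* one PMD-sized (2M) range */
		int fd = open("/tmp/fresh-exe", O_RDONLY); /* hypothetical */
		void *text;

		if (fd < 0)
			return 1;

		text = mmap(NULL, len, PROT_READ | PROT_EXEC, MAP_PRIVATE,
			    fd, 0);
		if (text == MAP_FAILED)
			return 1;

		/* Fault the range in so the scan sees page-cache folios. */
		(void)*(volatile char *)text;

		/* Failed with EAGAIN on dirty pages before this patch. */
		if (madvise(text, len, MADV_COLLAPSE))
			fprintf(stderr, "MADV_COLLAPSE: %s\n",
				strerror(errno));

		munmap(text, len);
		close(fd);
		return 0;
	}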

Reported-by: Branden Moore <Branden.Moore@amd.com>
Closes: https://lore.kernel.org/all/4e26fe5e-7374-467c-a333-9dd48f85d7cc@amd.com
Fixes: 34488399fa08 ("mm/madvise: add file and shmem support to MADV_COLLAPSE")
Suggested-by: David Hildenbrand <david@kernel.org>
Tested-by: Lance Yang <lance.yang@linux.dev>
Signed-off-by: Shivank Garg <shivankg@amd.com>
---
 mm/khugepaged.c | 40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 219dfa2e523c..6c8c35d3e0c9 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -22,6 +22,7 @@
 #include <linux/dax.h>
 #include <linux/ksm.h>
 #include <linux/pgalloc.h>
+#include <linux/backing-dev.h>
 
 #include <asm/tlb.h>
 #include "internal.h"
@@ -2787,9 +2788,11 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 	hend = end & HPAGE_PMD_MASK;
 
 	for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
+		bool retried = false;
 		int result = SCAN_FAIL;
 
 		if (!mmap_locked) {
+retry:
 			cond_resched();
 			mmap_read_lock(mm);
 			mmap_locked = true;
@@ -2819,6 +2822,43 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 		if (!mmap_locked)
 			*lock_dropped = true;
 
+		/*
+		 * If the file-backed VMA has dirty pages, the scan triggers
+		 * async writeback and returns SCAN_PAGE_DIRTY_OR_WRITEBACK.
+		 * Since MADV_COLLAPSE is sync, we force sync writeback and
+		 * retry once.
+		 */
+		if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !retried) {
+			/*
+			 * File scan drops the lock. We must re-acquire it to
+			 * safely inspect the VMA and hold the file reference.
+			 */
+			if (!mmap_locked) {
+				cond_resched();
+				mmap_read_lock(mm);
+				mmap_locked = true;
+				result = hugepage_vma_revalidate(mm, addr, false, &vma, cc);
+				if (result != SCAN_SUCCEED)
+					goto handle_result;
+			}
+
+			if (!vma_is_anonymous(vma) && vma->vm_file &&
+			    mapping_can_writeback(vma->vm_file->f_mapping)) {
+				struct file *file = get_file(vma->vm_file);
+				pgoff_t pgoff = linear_page_index(vma, addr);
+				loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
+				loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
+
+				mmap_read_unlock(mm);
+				mmap_locked = false;
+				*lock_dropped = true;
+				filemap_write_and_wait_range(file->f_mapping, lstart, lend);
+				fput(file);
+				retried = true;
+				goto retry;
+			}
+		}
+
 handle_result:
 		switch (result) {
 		case SCAN_SUCCEED:
-- 
2.43.0
Re: [PATCH V4 2/2] mm/khugepaged: retry with sync writeback for MADV_COLLAPSE
Posted by David Hildenbrand (Red Hat) 1 month ago
On 12/15/25 09:46, Shivank Garg wrote:
> [...]
> @@ -2787,9 +2788,11 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>   	hend = end & HPAGE_PMD_MASK;
>   
>   	for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
> +		bool retried = false;
>   		int result = SCAN_FAIL;
>   
>   		if (!mmap_locked) {
> +retry:

Jumping into an if block is nasty :)

>   			cond_resched();
>   			mmap_read_lock(mm);
>   			mmap_locked = true;
> @@ -2819,6 +2822,43 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>   		if (!mmap_locked)
>   			*lock_dropped = true;
>   
> +		/*
> +		 * If the file-backed VMA has dirty pages, the scan triggers
> +		 * async writeback and returns SCAN_PAGE_DIRTY_OR_WRITEBACK.
> +		 * Since MADV_COLLAPSE is sync, we force sync writeback and
> +		 * retry once.
> +		 */
> +		if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !retried) {
> +			/*
> +			 * File scan drops the lock. We must re-acquire it to
> +			 * safely inspect the VMA and hold the file reference.
> +			 */
> +			if (!mmap_locked) {
> +				cond_resched();
> +				mmap_read_lock(mm);
> +				mmap_locked = true;
> +				result = hugepage_vma_revalidate(mm, addr, false, &vma, cc);
> +				if (result != SCAN_SUCCEED)
> +					goto handle_result;
> +			}
> +
> +			if (!vma_is_anonymous(vma) && vma->vm_file &&
> +			    mapping_can_writeback(vma->vm_file->f_mapping)) {
> +				struct file *file = get_file(vma->vm_file);
> +				pgoff_t pgoff = linear_page_index(vma, addr);
> +				loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
> +				loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
> +
> +				mmap_read_unlock(mm);
> +				mmap_locked = false;
> +				*lock_dropped = true;
> +				filemap_write_and_wait_range(file->f_mapping, lstart, lend);
> +				fput(file);
> +				retried = true;
> +				goto retry;
> +			}
> +		}
> +

This looks a bit complicated. Can't we move that handling up, where we have most of that
information already? Or am I missing something important?

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 97d1b2824386f..c7271877c5220 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -22,6 +22,7 @@
  #include <linux/dax.h>
  #include <linux/ksm.h>
  #include <linux/pgalloc.h>
+#include <linux/backing-dev.h>
  
  #include <asm/tlb.h>
  #include "internal.h"
@@ -2786,7 +2787,9 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
  
         for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
                 int result = SCAN_FAIL;
+               bool triggered_wb = false;
  
+retry:
                 if (!mmap_locked) {
                         cond_resched();
                         mmap_read_lock(mm);
@@ -2809,6 +2812,16 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
                         mmap_locked = false;
                         result = hpage_collapse_scan_file(mm, addr, file, pgoff,
                                                           cc);
+
+                       if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
+                           mapping_can_writeback(file->f_mapping)) {
+                               loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
+                               loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
+
+                               filemap_write_and_wait_range(file->f_mapping, lstart, lend);
+                               triggered_wb = true;
+                               goto retry;
+                       }
                         fput(file);
                 } else {
                         result = hpage_collapse_scan_pmd(mm, vma, addr,


-- 
Cheers

David
Re: [PATCH V4 2/2] mm/khugepaged: retry with sync writeback for MADV_COLLAPSE
Posted by Garg, Shivank 4 weeks, 1 day ago

On 1/9/2026 8:16 PM, David Hildenbrand (Red Hat) wrote:
> On 12/15/25 09:46, Shivank Garg wrote:
>> [...]
> 
> This looks a bit complicated. Can't we move that handling up, where we have most of that
> information already? Or am I missing something important?
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 97d1b2824386f..c7271877c5220 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -22,6 +22,7 @@
>  #include <linux/dax.h>
>  #include <linux/ksm.h>
>  #include <linux/pgalloc.h>
> +#include <linux/backing-dev.h>
>  
>  #include <asm/tlb.h>
>  #include "internal.h"
> @@ -2786,7 +2787,9 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>  
>         for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
>                 int result = SCAN_FAIL;
> +               bool triggered_wb = false;
>  
> +retry:
>                 if (!mmap_locked) {
>                         cond_resched();
>                         mmap_read_lock(mm);
> @@ -2809,6 +2812,16 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>                         mmap_locked = false;
			  *lock_dropped = true;
>                         result = hpage_collapse_scan_file(mm, addr, file, pgoff,
>                                                           cc);
> +
> +                       if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
> +                           mapping_can_writeback(file->f_mapping)) {
> +                               loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
> +                               loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
> +
> +                               filemap_write_and_wait_range(file->f_mapping, lstart, lend);
> +                               triggered_wb = true;
				  fput(file);
> +                               goto retry;
> +                       }
>                         fput(file);
>                 } else {
>                         result = hpage_collapse_scan_pmd(mm, vma, addr,
> 
> 

Thank you for the suggestion, this approach looks much simpler.

There are two small nits I observed:

1. In the retry loop, it is possible that we reacquire the mmap_lock and set
   mmap_locked to true. This can cause issues later when we do:

       if (!mmap_locked)
               *lock_dropped = true;

   because the caller would no longer see that the lock was dropped earlier.

2. We need an fput() to balance the file reference taken at line 2795.
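
To make nit 1 concrete: once the retry re-takes the lock, the later
"if (!mmap_locked) *lock_dropped = true;" check is skipped even though the
lock was dropped during the scan, so the drop has to be recorded at the
drop site, as in this sketch of the fix below:

	mmap_read_unlock(mm);
	mmap_locked = false;
	*lock_dropped = true;	/* caller must see this even if we relock */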

Fixed patch (tested; it no longer reproduces the issue):

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 87ca577c2668..b88bbb54f109 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -21,6 +21,7 @@
 #include <linux/shmem_fs.h>
 #include <linux/dax.h>
 #include <linux/ksm.h>
+#include <linux/backing-dev.h>
 
 #include <asm/tlb.h>
 #include <asm/pgalloc.h>
@@ -2771,7 +2772,9 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 
 	for (addr = hstart; addr < hend; addr += HPAGE_PMD_SIZE) {
 		int result = SCAN_FAIL;
+		bool triggered_wb = false;
 
+retry:
 		if (!mmap_locked) {
 			cond_resched();
 			mmap_read_lock(mm);
@@ -2794,8 +2797,20 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 
 			mmap_read_unlock(mm);
 			mmap_locked = false;
+			*lock_dropped = true;
 			result = hpage_collapse_scan_file(mm, addr, file, pgoff,
 							  cc);
+
+			if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
+			    mapping_can_writeback(file->f_mapping)) {
+				loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
+				loff_t lend = lstart + HPAGE_PMD_SIZE - 1;
+
+				filemap_write_and_wait_range(file->f_mapping, lstart, lend);
+				triggered_wb = true;
+				fput(file);
+				goto retry;
+			}
 			fput(file);
 		} else {
 			result = hpage_collapse_scan_pmd(mm, vma, addr,
Re: [PATCH V4 2/2] mm/khugepaged: retry with sync writeback for MADV_COLLAPSE
Posted by David Hildenbrand (Red Hat) 4 weeks ago
On 1/10/26 19:20, Garg, Shivank wrote:
> [...]
> 
> Thank you for the suggestion, this approach looks much simpler.
> 
> There are two small nits I observed:

Yeah, was a quick untested hack to see if this can be simplified :)

> 
> 1. In the retry loop, it is possible that we reacquire the mmap_lock and set
>     mmap_locked to true. This can cause issues later when we do:
> 
>         if (!mmap_locked)
>                 *lock_dropped = true;

That whole logic of having two variables that express whether locks have 
been taken/dropped is just absolutely confusing. Any way we can clean 
that up?

> 
>     because the caller would no longer see that the lock was dropped earlier.
> 
> 2. We need an fput() to balance the file reference taken at line 2795.

Ah, yes, makes sense. Having a single fput() would be nicer, but that 
would require yet another temporary variable.
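
For illustration, that single-fput() shape could look like the following
untested sketch on top of the diff above, with do_retry being the extra
temporary:

	bool do_retry = false;

	result = hpage_collapse_scan_file(mm, addr, file, pgoff, cc);
	if (result == SCAN_PAGE_DIRTY_OR_WRITEBACK && !triggered_wb &&
	    mapping_can_writeback(file->f_mapping)) {
		loff_t lstart = (loff_t)pgoff << PAGE_SHIFT;
		loff_t lend = lstart + HPAGE_PMD_SIZE - 1;

		filemap_write_and_wait_range(file->f_mapping, lstart, lend);
		triggered_wb = true;
		do_retry = true;
	}
	/* One fput() balances the get_file() taken before the scan. */
	fput(file);
	if (do_retry)
		goto retry;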

-- 
Cheers

David
Re: [PATCH V4 2/2] mm/khugepaged: retry with sync writeback for MADV_COLLAPSE
Posted by Garg, Shivank 3 weeks, 4 days ago

On 1/11/2026 4:59 PM, David Hildenbrand (Red Hat) wrote:
> On 1/10/26 19:20, Garg, Shivank wrote:
>>
>>
>> On 1/9/2026 8:16 PM, David Hildenbrand (Red Hat) wrote:
>>> [...]
>>
>> Thank you for the suggestion, this approach looks much simpler.
>>
>> There are two small nits I observed:
> 
> Yeah, was a quick untested hack to see if this can be simplified :)
> 
>>
>> 1. In the retry loop, it is possible that we reacquire the mmap_lock and set
>>     mmap_locked to true. This can cause issues later when we do:
>>
>>         if (!mmap_locked)
>>                 *lock_dropped = true;
> 
> That whole logic of having two variables that express whether locks have been taken/dropped is just absolutely confusing. Any way we can clean that up?
>
>>
>>     because the caller would no longer see that the lock was dropped earlier.
>>
>> 2. We need an fput() to balance the file reference taken at line 2795.
> 
> Ah, yes, makes sense. Having a single fput() would be nicer, but that would require yet another temporary variable.
>

I agree that this lock taken/dropped interaction is confusing.
However, a proper clean-up would require refactoring the locking logic
across multiple functions in the collapse call-flow path, which seems
significantly more invasive and risky.

I would like to handle this refactoring as a separate TODO for later.
Could we please proceed with these minimal changes for now?

Since V4 has been in linux-next/mm-unstable for a while, should I send a
v5 or an incremental clean-up on top?

Thanks,
Shivank
Re: [PATCH V4 2/2] mm/khugepaged: retry with sync writeback for MADV_COLLAPSE
Posted by David Hildenbrand (Red Hat) 3 weeks, 4 days ago
On 1/14/26 20:47, Garg, Shivank wrote:
> [...]
> 
> I agree that this lock taken/dropped interaction is confusing.
> However, a proper clean-up would require refactoring the locking logic
> across multiple functions in the collapse call-flow path, which seems
> significantly more invasive and risky.
> 
> I would like to handle this refactoring as a separate TODO for later.
> Could we please proceed with these minimal changes for now?

Sure, fine with me.

> 
> Since V4 has been in linux-next/mm-unstable for a while, should I send a
> v5 or an incremental clean-up on top?

Just send a v5, unless Andrew tells you otherwise :)

-- 
Cheers

David
Re: [PATCH V4 2/2] mm/khugepaged: retry with sync writeback for MADV_COLLAPSE
Posted by Garg, Shivank 3 weeks ago

On 1/15/2026 1:44 AM, David Hildenbrand (Red Hat) wrote:
> On 1/14/26 20:47, Garg, Shivank wrote:
>> [...]
>> 
>> I would like to handle this refactoring as a separate TODO for later.
>> Could we please proceed with these minimal changes for now?
> 
> Sure, fine with me.
> 
>>
>> Since V4 has been in linux-next/mm-unstable for a while, should I send a
>> v5 or an incremental clean-up on top?
> 
> Just send a v5, unless Andrew tells you otherwise :)

Thanks! I have sent V5:
https://lore.kernel.org/linux-mm/20260118192253.9263-14-shivankg@amd.com

Best Regards,
Shivank