[PATCH mm-new 3/3] mm/khugepaged: abort collapse scan on guard PTEs

Lance Yang posted 3 patches 2 weeks, 4 days ago
There is a newer version of this series
[PATCH mm-new 3/3] mm/khugepaged: abort collapse scan on guard PTEs
Posted by Lance Yang 2 weeks, 4 days ago
From: Lance Yang <lance.yang@linux.dev>

Guard PTE markers are installed via MADV_GUARD_INSTALL to create
lightweight guard regions.

Currently, any collapse path (khugepaged or MADV_COLLAPSE) will fail when
encountering such a range.

MADV_COLLAPSE fails deep inside the collapse logic when trying to swap-in
the special marker in __collapse_huge_page_swapin().

hpage_collapse_scan_pmd()
 `- collapse_huge_page()
     `- __collapse_huge_page_swapin() -> fails!

khugepaged's behavior is slightly different due to its max_ptes_swap limit
(default 64). It won't fail as deep, but it will still needlessly scan up
to 64 swap entries before bailing out.

IMHO, we can and should detect this much earlier ;)

This patch adds a check directly inside the PTE scan loop. If a guard
marker is found, the scan is aborted immediately with a new SCAN_PTE_GUARD
status, avoiding wasted work.

Signed-off-by: Lance Yang <lance.yang@linux.dev>
---
 mm/khugepaged.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index e54f99bb0b57..910a6f2ec8a9 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -59,6 +59,7 @@ enum scan_result {
 	SCAN_STORE_FAILED,
 	SCAN_COPY_MC,
 	SCAN_PAGE_FILLED,
+	SCAN_PTE_GUARD,
 };
 
 #define CREATE_TRACE_POINTS
@@ -1317,6 +1318,16 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 					result = SCAN_PTE_UFFD_WP;
 					goto out_unmap;
 				}
+				/*
+				 * Guard PTE markers are installed by
+				 * MADV_GUARD_INSTALL. Any collapse path must
+				 * not touch them, so abort the scan immediately
+				 * if one is found.
+				 */
+				if (is_guard_pte_marker(pteval)) {
+					result = SCAN_PTE_GUARD;
+					goto out_unmap;
+				}
 				continue;
 			} else {
 				result = SCAN_EXCEED_SWAP_PTE;
@@ -2860,6 +2871,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
 		case SCAN_PAGE_COMPOUND:
 		case SCAN_PAGE_LRU:
 		case SCAN_DEL_PAGE_LRU:
+		case SCAN_PTE_GUARD:
 			last_fail = result;
 			break;
 		default:
-- 
2.49.0
Re: [PATCH mm-new 3/3] mm/khugepaged: abort collapse scan on guard PTEs
Posted by Lorenzo Stoakes 2 weeks, 3 days ago
On Sun, Sep 14, 2025 at 10:35:47PM +0800, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
>
> Guard PTE markers are installed via MADV_GUARD_INSTALL to create
> lightweight guard regions.
>
> Currently, any collapse path (khugepaged or MADV_COLLAPSE) will fail when
> encountering such a range.
>
> MADV_COLLAPSE fails deep inside the collapse logic when trying to swap-in
> the special marker in __collapse_huge_page_swapin().
>
> hpage_collapse_scan_pmd()
>  `- collapse_huge_page()
>      `- __collapse_huge_page_swapin() -> fails!
>
> khugepaged's behavior is slightly different due to its max_ptes_swap limit
> (default 64). It won't fail as deep, but it will still needlessly scan up
> to 64 swap entries before bailing out.
>
> IMHO, we can and should detect this much earlier ;)

No smileys in commit messages please... :)

>
> This patch adds a check directly inside the PTE scan loop. If a guard
> marker is found, the scan is aborted immediately with a new SCAN_PTE_GUARD
> status, avoiding wasted work.
>
> Signed-off-by: Lance Yang <lance.yang@linux.dev>
> ---
>  mm/khugepaged.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index e54f99bb0b57..910a6f2ec8a9 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -59,6 +59,7 @@ enum scan_result {
>  	SCAN_STORE_FAILED,
>  	SCAN_COPY_MC,
>  	SCAN_PAGE_FILLED,
> +	SCAN_PTE_GUARD,

I wonder if we really need to have it literally called out though, can we just
use:

SCAN_PTE_NON_PRESENT

Instead?

As it is, indeed, non-present :)

>  };
>
>  #define CREATE_TRACE_POINTS
> @@ -1317,6 +1318,16 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>  					result = SCAN_PTE_UFFD_WP;
>  					goto out_unmap;
>  				}
> +				/*
> +				 * Guard PTE markers are installed by
> +				 * MADV_GUARD_INSTALL. Any collapse path must
> +				 * not touch them, so abort the scan immediately
> +				 * if one is found.
> +				 */
> +				if (is_guard_pte_marker(pteval)) {
> +					result = SCAN_PTE_GUARD;
> +					goto out_unmap;
> +				}
>  				continue;
>  			} else {
>  				result = SCAN_EXCEED_SWAP_PTE;
> @@ -2860,6 +2871,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>  		case SCAN_PAGE_COMPOUND:
>  		case SCAN_PAGE_LRU:
>  		case SCAN_DEL_PAGE_LRU:
> +		case SCAN_PTE_GUARD:
>  			last_fail = result;
>  			break;
>  		default:
> --
> 2.49.0
>

Cheers, Lorenzo
Re: [PATCH mm-new 3/3] mm/khugepaged: abort collapse scan on guard PTEs
Posted by Lance Yang 2 weeks, 3 days ago

On 2025/9/15 22:08, Lorenzo Stoakes wrote:
> On Sun, Sep 14, 2025 at 10:35:47PM +0800, Lance Yang wrote:
>> From: Lance Yang <lance.yang@linux.dev>
>>
>> Guard PTE markers are installed via MADV_GUARD_INSTALL to create
>> lightweight guard regions.
>>
>> Currently, any collapse path (khugepaged or MADV_COLLAPSE) will fail when
>> encountering such a range.
>>
>> MADV_COLLAPSE fails deep inside the collapse logic when trying to swap-in
>> the special marker in __collapse_huge_page_swapin().
>>
>> hpage_collapse_scan_pmd()
>>   `- collapse_huge_page()
>>       `- __collapse_huge_page_swapin() -> fails!
>>
>> khugepaged's behavior is slightly different due to its max_ptes_swap limit
>> (default 64). It won't fail as deep, but it will still needlessly scan up
>> to 64 swap entries before bailing out.
>>
>> IMHO, we can and should detect this much earlier ;)
> 
> No smileys in commit messages please... :)

Got it. Apparently, I'm a bit too fond of them ... Will remove it in v2.

> 
>>
>> This patch adds a check directly inside the PTE scan loop. If a guard
>> marker is found, the scan is aborted immediately with a new SCAN_PTE_GUARD
>> status, avoiding wasted work.
>>
>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>> ---
>>   mm/khugepaged.c | 12 ++++++++++++
>>   1 file changed, 12 insertions(+)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index e54f99bb0b57..910a6f2ec8a9 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -59,6 +59,7 @@ enum scan_result {
>>   	SCAN_STORE_FAILED,
>>   	SCAN_COPY_MC,
>>   	SCAN_PAGE_FILLED,
>> +	SCAN_PTE_GUARD,
> 
> I wonder if we really need to have it literally called out though, can we just
> use:
> 
> SCAN_PTE_NON_PRESENT
> 
> Instead?
> 
> As it is, indeed, non-present :)

Makes sense to me. A guard PTE is indeed a special non-present case.

So, let's reuse SCAN_PTE_NON_PRESENT for that case ;)

Cheers,
Lance

> 
>>   };
>>
>>   #define CREATE_TRACE_POINTS
>> @@ -1317,6 +1318,16 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>>   					result = SCAN_PTE_UFFD_WP;
>>   					goto out_unmap;
>>   				}
>> +				/*
>> +				 * Guard PTE markers are installed by
>> +				 * MADV_GUARD_INSTALL. Any collapse path must
>> +				 * not touch them, so abort the scan immediately
>> +				 * if one is found.
>> +				 */
>> +				if (is_guard_pte_marker(pteval)) {
>> +					result = SCAN_PTE_GUARD;
>> +					goto out_unmap;
>> +				}
>>   				continue;
>>   			} else {
>>   				result = SCAN_EXCEED_SWAP_PTE;
>> @@ -2860,6 +2871,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>>   		case SCAN_PAGE_COMPOUND:
>>   		case SCAN_PAGE_LRU:
>>   		case SCAN_DEL_PAGE_LRU:
>> +		case SCAN_PTE_GUARD:
>>   			last_fail = result;
>>   			break;
>>   		default:
>> --
>> 2.49.0
>>
> 
> Cheers, Lorenzo
Re: [PATCH mm-new 3/3] mm/khugepaged: abort collapse scan on guard PTEs
Posted by Dev Jain 2 weeks, 3 days ago
On 14/09/25 8:05 pm, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
>
> Guard PTE markers are installed via MADV_GUARD_INSTALL to create
> lightweight guard regions.
>
> Currently, any collapse path (khugepaged or MADV_COLLAPSE) will fail when
> encountering such a range.
>
> MADV_COLLAPSE fails deep inside the collapse logic when trying to swap-in
> the special marker in __collapse_huge_page_swapin().
>
> hpage_collapse_scan_pmd()
>   `- collapse_huge_page()
>       `- __collapse_huge_page_swapin() -> fails!
>
> khugepaged's behavior is slightly different due to its max_ptes_swap limit
> (default 64). It won't fail as deep, but it will still needlessly scan up
> to 64 swap entries before bailing out.
>
> IMHO, we can and should detect this much earlier ;)
>
> This patch adds a check directly inside the PTE scan loop. If a guard
> marker is found, the scan is aborted immediately with a new SCAN_PTE_GUARD
> status, avoiding wasted work.
>
> Signed-off-by: Lance Yang <lance.yang@linux.dev>
> ---
>   mm/khugepaged.c | 12 ++++++++++++
>   1 file changed, 12 insertions(+)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index e54f99bb0b57..910a6f2ec8a9 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -59,6 +59,7 @@ enum scan_result {
>   	SCAN_STORE_FAILED,
>   	SCAN_COPY_MC,
>   	SCAN_PAGE_FILLED,
> +	SCAN_PTE_GUARD,
>   };
>   
>   #define CREATE_TRACE_POINTS
> @@ -1317,6 +1318,16 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>   					result = SCAN_PTE_UFFD_WP;
>   					goto out_unmap;
>   				}
> +				/*
> +				 * Guard PTE markers are installed by
> +				 * MADV_GUARD_INSTALL. Any collapse path must
> +				 * not touch them, so abort the scan immediately
> +				 * if one is found.
> +				 */
> +				if (is_guard_pte_marker(pteval)) {
> +					result = SCAN_PTE_GUARD;
> +					goto out_unmap;
> +				}
>   				continue;

This looks good, but see below.

>   			} else {
>   				result = SCAN_EXCEED_SWAP_PTE;
> @@ -2860,6 +2871,7 @@ int madvise_collapse(struct vm_area_struct *vma, unsigned long start,
>   		case SCAN_PAGE_COMPOUND:
>   		case SCAN_PAGE_LRU:
>   		case SCAN_DEL_PAGE_LRU:
> +		case SCAN_PTE_GUARD:
>   			last_fail = result;

Should we not do this, and just send this case over to the default case. That
would mean immediate exit with -EINVAL, instead of iterating over the complete
range, potentially collapsing a non-guard range, and returning -EINVAL. I do not
think we should spend a significant time in the kernel when the user is literally
invoking madvise(MADV_GUARD_INSTALL) and madvise(MADV_COLLAPSE) on overlapping regions.

>   			break;
>   		default:
Re: [PATCH mm-new 3/3] mm/khugepaged: abort collapse scan on guard PTEs
Posted by Lance Yang 2 weeks, 3 days ago

On 2025/9/15 01:03, Dev Jain wrote:
> 
> On 14/09/25 8:05 pm, Lance Yang wrote:
>> From: Lance Yang <lance.yang@linux.dev>
>>
>> Guard PTE markers are installed via MADV_GUARD_INSTALL to create
>> lightweight guard regions.
>>
>> Currently, any collapse path (khugepaged or MADV_COLLAPSE) will fail when
>> encountering such a range.
>>
>> MADV_COLLAPSE fails deep inside the collapse logic when trying to swap-in
>> the special marker in __collapse_huge_page_swapin().
>>
>> hpage_collapse_scan_pmd()
>>   `- collapse_huge_page()
>>       `- __collapse_huge_page_swapin() -> fails!
>>
>> khugepaged's behavior is slightly different due to its max_ptes_swap 
>> limit
>> (default 64). It won't fail as deep, but it will still needlessly scan up
>> to 64 swap entries before bailing out.
>>
>> IMHO, we can and should detect this much earlier ;)
>>
>> This patch adds a check directly inside the PTE scan loop. If a guard
>> marker is found, the scan is aborted immediately with a new 
>> SCAN_PTE_GUARD
>> status, avoiding wasted work.
>>
>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>> ---
>>   mm/khugepaged.c | 12 ++++++++++++
>>   1 file changed, 12 insertions(+)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index e54f99bb0b57..910a6f2ec8a9 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -59,6 +59,7 @@ enum scan_result {
>>       SCAN_STORE_FAILED,
>>       SCAN_COPY_MC,
>>       SCAN_PAGE_FILLED,
>> +    SCAN_PTE_GUARD,
>>   };
>>   #define CREATE_TRACE_POINTS
>> @@ -1317,6 +1318,16 @@ static int hpage_collapse_scan_pmd(struct 
>> mm_struct *mm,
>>                       result = SCAN_PTE_UFFD_WP;
>>                       goto out_unmap;
>>                   }
>> +                /*
>> +                 * Guard PTE markers are installed by
>> +                 * MADV_GUARD_INSTALL. Any collapse path must
>> +                 * not touch them, so abort the scan immediately
>> +                 * if one is found.
>> +                 */
>> +                if (is_guard_pte_marker(pteval)) {
>> +                    result = SCAN_PTE_GUARD;
>> +                    goto out_unmap;
>> +                }
>>                   continue;
> 
> This looks good, but see below.
> 
>>               } else {
>>                   result = SCAN_EXCEED_SWAP_PTE;
>> @@ -2860,6 +2871,7 @@ int madvise_collapse(struct vm_area_struct *vma, 
>> unsigned long start,
>>           case SCAN_PAGE_COMPOUND:
>>           case SCAN_PAGE_LRU:
>>           case SCAN_DEL_PAGE_LRU:
>> +        case SCAN_PTE_GUARD:
>>               last_fail = result;
> 
> Should we not do this, and just send this case over to the default case. 
> That
> would mean immediate exit with -EINVAL, instead of iterating over the 
> complete
> range, potentially collapsing a non-guard range, and returning -EINVAL. 

That makes sense to me ;)

> I do not
> think we should spend a significant time in the kernel when the user is 
> literally
> invoking madvise(MADV_GUARD_INSTALL) and madvise(MADV_COLLAPSE) on 
> overlapping regions.

I'm just a bit unsure because the MADV_COLLAPSE man page[1] describes it
as a "best-effort" collapse. This patch follows that idea, collapsing what
it can.

        MADV_COLLAPSE (since Linux 6.1)
               Perform a best-effort synchronous collapse of the native
               pages mapped by the memory range into Transparent Huge
               Pages (THPs).  MADV_COLLAPSE operates on the current state
               of memory of the calling process and makes no persistent
               changes or guarantees on how pages will be mapped,
               constructed, or faulted in the future.

A hard-fail on a guard PTE marker might go against that.

Well, I'm open to either approach. What do other folks think?

[1] https://man7.org/linux/man-pages/man2/madvise.2.html

Cheers,
Lance

> 
>>               break;
>>           default: