[PATCH mm-new v2 2/2] mm/khugepaged: abort collapse scan on guard PTEs
Posted by Lance Yang 2 weeks ago
From: Lance Yang <lance.yang@linux.dev>

Guard PTE markers are installed via MADV_GUARD_INSTALL to create
lightweight guard regions.

Currently, any collapse path (khugepaged or MADV_COLLAPSE) will fail when
encountering such a range.

MADV_COLLAPSE fails deep inside the collapse logic when trying to swap in
the special marker in __collapse_huge_page_swapin().

hpage_collapse_scan_pmd()
 `- collapse_huge_page()
     `- __collapse_huge_page_swapin() -> fails!

khugepaged's behavior is slightly different due to its max_ptes_swap limit
(default 64). It doesn't fail as deeply, but it still needlessly scans up
to 64 swap entries before bailing out.

IMHO, we can and should detect this much earlier.

This patch adds a check directly inside the PTE scan loop. If a guard
marker is found, the scan is aborted immediately with SCAN_PTE_NON_PRESENT,
avoiding wasted work.

Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Lance Yang <lance.yang@linux.dev>
---
 mm/khugepaged.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 9ed1af2b5c38..70ebfc7c1f3e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1306,6 +1306,16 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 					result = SCAN_PTE_UFFD_WP;
 					goto out_unmap;
 				}
+				/*
+				 * Guard PTE markers are installed by
+				 * MADV_GUARD_INSTALL. Any collapse path must
+				 * not touch them, so abort the scan immediately
+				 * if one is found.
+				 */
+				if (is_guard_pte_marker(pteval)) {
+					result = SCAN_PTE_NON_PRESENT;
+					goto out_unmap;
+				}
 				continue;
 			} else {
 				result = SCAN_EXCEED_SWAP_PTE;
-- 
2.49.0
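The scan-limit behavior described in the changelog can be modeled standalone. The sketch below is not kernel code: the enum values, the MAX_PTES_SWAP constant, and the function names are illustrative stand-ins for the real pte helpers and scan states, but it shows why the old path can walk up to 64 entries (and then fail later in swap-in) while the patched path aborts on the first guard marker.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins for pte contents and scan results. */
enum pte_kind { PTE_PRESENT, PTE_NONE, PTE_SWAP, PTE_GUARD_MARKER };
enum scan_result { SCAN_OK, SCAN_EXCEED_SWAP_PTE, SCAN_PTE_NON_PRESENT };

#define MAX_PTES_SWAP 64 /* khugepaged_max_ptes_swap default */

/* Old behavior: a guard marker is counted like any other swap entry,
 * so the scan succeeds and the collapse only fails later in
 * __collapse_huge_page_swapin(). */
static enum scan_result scan_old(const enum pte_kind *ptes, size_t n)
{
	size_t unmapped = 0;

	for (size_t i = 0; i < n; i++) {
		if (ptes[i] == PTE_SWAP || ptes[i] == PTE_GUARD_MARKER) {
			if (++unmapped > MAX_PTES_SWAP)
				return SCAN_EXCEED_SWAP_PTE;
		}
	}
	return SCAN_OK; /* collapse proceeds, doomed to fail deeper down */
}

/* Patched behavior: a guard marker aborts the scan immediately. */
static enum scan_result scan_new(const enum pte_kind *ptes, size_t n)
{
	size_t unmapped = 0;

	for (size_t i = 0; i < n; i++) {
		if (ptes[i] == PTE_GUARD_MARKER)
			return SCAN_PTE_NON_PRESENT;
		if (ptes[i] == PTE_SWAP && ++unmapped > MAX_PTES_SWAP)
			return SCAN_EXCEED_SWAP_PTE;
	}
	return SCAN_OK;
}
```

With one guard marker in the range, `scan_old()` still reports success at scan time, whereas `scan_new()` bails out immediately.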
Re: [PATCH mm-new v2 2/2] mm/khugepaged: abort collapse scan on guard PTEs
Posted by Zi Yan 1 week, 6 days ago
On 18 Sep 2025, at 1:04, Lance Yang wrote:

> From: Lance Yang <lance.yang@linux.dev>
>
> Guard PTE markers are installed via MADV_GUARD_INSTALL to create
> lightweight guard regions.
>
> Currently, any collapse path (khugepaged or MADV_COLLAPSE) will fail when
> encountering such a range.
>
> MADV_COLLAPSE fails deep inside the collapse logic when trying to swap-in
> the special marker in __collapse_huge_page_swapin().
>
> hpage_collapse_scan_pmd()
>  `- collapse_huge_page()
>      `- __collapse_huge_page_swapin() -> fails!
>
> khugepaged's behavior is slightly different due to its max_ptes_swap limit
> (default 64). It won't fail as deep, but it will still needlessly scan up
> to 64 swap entries before bailing out.
>
> IMHO, we can and should detect this much earlier.
>
> This patch adds a check directly inside the PTE scan loop. If a guard
> marker is found, the scan is aborted immediately with SCAN_PTE_NON_PRESENT,
> avoiding wasted work.
>
> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Signed-off-by: Lance Yang <lance.yang@linux.dev>
> ---
>  mm/khugepaged.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 9ed1af2b5c38..70ebfc7c1f3e 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1306,6 +1306,16 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>  					result = SCAN_PTE_UFFD_WP;
>  					goto out_unmap;
>  				}
> +				/*
> +				 * Guard PTE markers are installed by
> +				 * MADV_GUARD_INSTALL. Any collapse path must
> +				 * not touch them, so abort the scan immediately
> +				 * if one is found.
> +				 */
> +				if (is_guard_pte_marker(pteval)) {

pteval has already passed is_swap_pte() at this point, so would
is_guard_swp_entry() be a better choice here? It saves one is_swap_pte()
call.

> +					result = SCAN_PTE_NON_PRESENT;
> +					goto out_unmap;
> +				}
>  				continue;
>  			} else {
>  				result = SCAN_EXCEED_SWAP_PTE;
> -- 
> 2.49.0


Best Regards,
Yan, Zi
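Zi Yan's suggestion is about helper layering. A toy model of it (the real kernel helpers operate on pte_t and swp_entry_t; this struct and its fields are invented purely for illustration):

```c
#include <assert.h>
#include <stdbool.h>

/* Invented stand-in for a pte; real code decodes hardware pte bits. */
struct toy_pte {
	bool present;      /* pte_present() */
	bool is_swap;      /* non-empty, non-present: swap entry or marker */
	bool guard_marker; /* encodes a guard PTE marker */
};

static bool is_swap_pte(struct toy_pte p)
{
	return !p.present && p.is_swap;
}

/* In the kernel this takes the decoded swp_entry_t; the caller is
 * expected to know the pte really is a swap pte. */
static bool is_guard_swp_entry(struct toy_pte p)
{
	return p.guard_marker;
}

/* The composite helper re-checks is_swap_pte() internally ... */
static bool is_guard_pte_marker(struct toy_pte p)
{
	return is_swap_pte(p) && is_guard_swp_entry(p);
}
```

... so inside the scan loop's `is_swap_pte(pteval)` branch, calling `is_guard_swp_entry(pte_to_swp_entry(pteval))` directly avoids repeating a check that is already known to be true.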
Re: [PATCH mm-new v2 2/2] mm/khugepaged: abort collapse scan on guard PTEs
Posted by Lance Yang 1 week, 6 days ago

On 2025/9/19 03:12, Zi Yan wrote:
> On 18 Sep 2025, at 1:04, Lance Yang wrote:
> 
>> From: Lance Yang <lance.yang@linux.dev>
>>
>> Guard PTE markers are installed via MADV_GUARD_INSTALL to create
>> lightweight guard regions.
>>
>> Currently, any collapse path (khugepaged or MADV_COLLAPSE) will fail when
>> encountering such a range.
>>
>> MADV_COLLAPSE fails deep inside the collapse logic when trying to swap-in
>> the special marker in __collapse_huge_page_swapin().
>>
>> hpage_collapse_scan_pmd()
>>   `- collapse_huge_page()
>>       `- __collapse_huge_page_swapin() -> fails!
>>
>> khugepaged's behavior is slightly different due to its max_ptes_swap limit
>> (default 64). It won't fail as deep, but it will still needlessly scan up
>> to 64 swap entries before bailing out.
>>
>> IMHO, we can and should detect this much earlier.
>>
>> This patch adds a check directly inside the PTE scan loop. If a guard
>> marker is found, the scan is aborted immediately with SCAN_PTE_NON_PRESENT,
>> avoiding wasted work.
>>
>> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>> ---
>>   mm/khugepaged.c | 10 ++++++++++
>>   1 file changed, 10 insertions(+)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 9ed1af2b5c38..70ebfc7c1f3e 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1306,6 +1306,16 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>>   					result = SCAN_PTE_UFFD_WP;
>>   					goto out_unmap;
>>   				}
>> +				/*
>> +				 * Guard PTE markers are installed by
>> +				 * MADV_GUARD_INSTALL. Any collapse path must
>> +				 * not touch them, so abort the scan immediately
>> +				 * if one is found.
>> +				 */
>> +				if (is_guard_pte_marker(pteval)) {
> 
> pteval is already is_swap_pte(), would is_guard_swp_entry() be a better
> choice here? Save one is_swap_pte() call.

Yeah. Good spot! Will do ;)

Thanks,
Lance

> 
>> +					result = SCAN_PTE_NON_PRESENT;
>> +					goto out_unmap;
>> +				}
>>   				continue;
>>   			} else {
>>   				result = SCAN_EXCEED_SWAP_PTE;
>> -- 
>> 2.49.0
> 
> 
> Best Regards,
> Yan, Zi
Re: [PATCH mm-new v2 2/2] mm/khugepaged: abort collapse scan on guard PTEs
Posted by Baolin Wang 1 week, 6 days ago

On 2025/9/19 10:44, Lance Yang wrote:
> 
> 
> On 2025/9/19 03:12, Zi Yan wrote:
>> On 18 Sep 2025, at 1:04, Lance Yang wrote:
>>
>>> From: Lance Yang <lance.yang@linux.dev>
>>>
>>> Guard PTE markers are installed via MADV_GUARD_INSTALL to create
>>> lightweight guard regions.
>>>
>>> Currently, any collapse path (khugepaged or MADV_COLLAPSE) will fail 
>>> when
>>> encountering such a range.
>>>
>>> MADV_COLLAPSE fails deep inside the collapse logic when trying to 
>>> swap-in
>>> the special marker in __collapse_huge_page_swapin().
>>>
>>> hpage_collapse_scan_pmd()
>>>   `- collapse_huge_page()
>>>       `- __collapse_huge_page_swapin() -> fails!
>>>
>>> khugepaged's behavior is slightly different due to its max_ptes_swap 
>>> limit
>>> (default 64). It won't fail as deep, but it will still needlessly 
>>> scan up
>>> to 64 swap entries before bailing out.
>>>
>>> IMHO, we can and should detect this much earlier.
>>>
>>> This patch adds a check directly inside the PTE scan loop. If a guard
>>> marker is found, the scan is aborted immediately with 
>>> SCAN_PTE_NON_PRESENT,
>>> avoiding wasted work.
>>>
>>> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>>> ---
>>>   mm/khugepaged.c | 10 ++++++++++
>>>   1 file changed, 10 insertions(+)
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index 9ed1af2b5c38..70ebfc7c1f3e 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -1306,6 +1306,16 @@ static int hpage_collapse_scan_pmd(struct 
>>> mm_struct *mm,
>>>                       result = SCAN_PTE_UFFD_WP;
>>>                       goto out_unmap;
>>>                   }
>>> +                /*
>>> +                 * Guard PTE markers are installed by
>>> +                 * MADV_GUARD_INSTALL. Any collapse path must
>>> +                 * not touch them, so abort the scan immediately
>>> +                 * if one is found.
>>> +                 */
>>> +                if (is_guard_pte_marker(pteval)) {
>>
>> pteval is already is_swap_pte(), would is_guard_swp_entry() be a better
>> choice here? Save one is_swap_pte() call.

Agree. Then it seems we don't need to move is_guard_pte_marker() into a
header file.

Re: [PATCH mm-new v2 2/2] mm/khugepaged: abort collapse scan on guard PTEs
Posted by Lance Yang 1 week, 6 days ago

On 2025/9/19 11:18, Baolin Wang wrote:
> 
> 
> On 2025/9/19 10:44, Lance Yang wrote:
>>
>>
>> On 2025/9/19 03:12, Zi Yan wrote:
>>> On 18 Sep 2025, at 1:04, Lance Yang wrote:
>>>
>>>> From: Lance Yang <lance.yang@linux.dev>
>>>>
>>>> Guard PTE markers are installed via MADV_GUARD_INSTALL to create
>>>> lightweight guard regions.
>>>>
>>>> Currently, any collapse path (khugepaged or MADV_COLLAPSE) will fail 
>>>> when
>>>> encountering such a range.
>>>>
>>>> MADV_COLLAPSE fails deep inside the collapse logic when trying to 
>>>> swap-in
>>>> the special marker in __collapse_huge_page_swapin().
>>>>
>>>> hpage_collapse_scan_pmd()
>>>>   `- collapse_huge_page()
>>>>       `- __collapse_huge_page_swapin() -> fails!
>>>>
>>>> khugepaged's behavior is slightly different due to its max_ptes_swap 
>>>> limit
>>>> (default 64). It won't fail as deep, but it will still needlessly 
>>>> scan up
>>>> to 64 swap entries before bailing out.
>>>>
>>>> IMHO, we can and should detect this much earlier.
>>>>
>>>> This patch adds a check directly inside the PTE scan loop. If a guard
>>>> marker is found, the scan is aborted immediately with 
>>>> SCAN_PTE_NON_PRESENT,
>>>> avoiding wasted work.
>>>>
>>>> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>>>> ---
>>>>   mm/khugepaged.c | 10 ++++++++++
>>>>   1 file changed, 10 insertions(+)
>>>>
>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>> index 9ed1af2b5c38..70ebfc7c1f3e 100644
>>>> --- a/mm/khugepaged.c
>>>> +++ b/mm/khugepaged.c
>>>> @@ -1306,6 +1306,16 @@ static int hpage_collapse_scan_pmd(struct 
>>>> mm_struct *mm,
>>>>                       result = SCAN_PTE_UFFD_WP;
>>>>                       goto out_unmap;
>>>>                   }
>>>> +                /*
>>>> +                 * Guard PTE markers are installed by
>>>> +                 * MADV_GUARD_INSTALL. Any collapse path must
>>>> +                 * not touch them, so abort the scan immediately
>>>> +                 * if one is found.
>>>> +                 */
>>>> +                if (is_guard_pte_marker(pteval)) {
>>>
>>> pteval is already is_swap_pte(), would is_guard_swp_entry() be a better
>>> choice here? Save one is_swap_pte() call.
> 
> Agree. Then seems we don't need to move the is_guard_pte_marker() into a 
> header file.

Got it. I'll just remove the redundant check in patch #01.

Re: [PATCH mm-new v2 2/2] mm/khugepaged: abort collapse scan on guard PTEs
Posted by David Hildenbrand 1 week, 6 days ago
On 18.09.25 07:04, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
> 
> Guard PTE markers are installed via MADV_GUARD_INSTALL to create
> lightweight guard regions.
> 
> Currently, any collapse path (khugepaged or MADV_COLLAPSE) will fail when
> encountering such a range.
> 
> MADV_COLLAPSE fails deep inside the collapse logic when trying to swap-in
> the special marker in __collapse_huge_page_swapin().
> 
> hpage_collapse_scan_pmd()
>   `- collapse_huge_page()
>       `- __collapse_huge_page_swapin() -> fails!
> 
> khugepaged's behavior is slightly different due to its max_ptes_swap limit
> (default 64). It won't fail as deep, but it will still needlessly scan up
> to 64 swap entries before bailing out.
> 
> IMHO, we can and should detect this much earlier.
> 
> This patch adds a check directly inside the PTE scan loop. If a guard
> marker is found, the scan is aborted immediately with SCAN_PTE_NON_PRESENT,
> avoiding wasted work.
> 
> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Signed-off-by: Lance Yang <lance.yang@linux.dev>
> ---
>   mm/khugepaged.c | 10 ++++++++++
>   1 file changed, 10 insertions(+)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 9ed1af2b5c38..70ebfc7c1f3e 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1306,6 +1306,16 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>   					result = SCAN_PTE_UFFD_WP;
>   					goto out_unmap;
>   				}
> +				/*
> +				 * Guard PTE markers are installed by
> +				 * MADV_GUARD_INSTALL. Any collapse path must
> +				 * not touch them, so abort the scan immediately
> +				 * if one is found.
> +				 */
> +				if (is_guard_pte_marker(pteval)) {
> +					result = SCAN_PTE_NON_PRESENT;
> +					goto out_unmap;
> +				}

Thinking about it, this is interesting.

Essentially we track any non-swap swap entries towards khugepaged_max_ptes_swap, which is rather weird.

I think we might also run into migration entries here and hwpoison entries?

So what about just generalizing this:

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index af5f5c80fe4ed..28f1f4bf0e0a8 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1293,7 +1293,24 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
         for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
              _pte++, _address += PAGE_SIZE) {
                 pte_t pteval = ptep_get(_pte);
-               if (is_swap_pte(pteval)) {
+
+               if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
+                       ++none_or_zero;
+                       if (!userfaultfd_armed(vma) &&
+                           (!cc->is_khugepaged ||
+                            none_or_zero <= khugepaged_max_ptes_none)) {
+                               continue;
+                       } else {
+                               result = SCAN_EXCEED_NONE_PTE;
+                               count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
+                               goto out_unmap;
+                       }
+               } else if (!pte_present(pteval)) {
+                       if (non_swap_entry(pte_to_swp_entry(pteval))) {
+                               result = SCAN_PTE_NON_PRESENT;
+                               goto out_unmap;
+                       }
+
                         ++unmapped;
                         if (!cc->is_khugepaged ||
                             unmapped <= khugepaged_max_ptes_swap) {
@@ -1313,18 +1330,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
                                 goto out_unmap;
                         }
                 }
-               if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
-                       ++none_or_zero;
-                       if (!userfaultfd_armed(vma) &&
-                           (!cc->is_khugepaged ||
-                            none_or_zero <= khugepaged_max_ptes_none)) {
-                               continue;
-                       } else {
-                               result = SCAN_EXCEED_NONE_PTE;
-                               count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
-                               goto out_unmap;
-                       }
-               }
+
                 if (pte_uffd_wp(pteval)) {
                         /*
                          * Don't collapse the page if any of the small


With that, the function flow looks more similar to __collapse_huge_page_isolate(),
except that we handle swap entries in there now.


And with that in place, couldn't we factor out a huge chunk of both scanning
functions into some helper (passing whether swap entries are allowed or not?).

Yes, I know, refactoring khugepaged, crazy idea.
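Sketched in plain C, the reordered classification in the diff above behaves roughly as follows. All names here are illustrative stand-ins; the real loop uses pte_none(), is_zero_pfn(), pte_present(), and non_swap_entry() on actual pte values.

```c
#include <assert.h>
#include <stddef.h>

enum kind { K_NONE_OR_ZERO, K_SWAP, K_NON_SWAP_ENTRY, K_PRESENT };
enum res  { R_OK, R_EXCEED_NONE, R_EXCEED_SWAP, R_NON_PRESENT };

/* Model of the proposed control flow: none/zero ptes are handled
 * first, then non-present ptes are split into "real" swap entries
 * (counted against max_swap) and non-swap entries (guard markers,
 * migration entries, hwpoison), which abort immediately. */
static enum res classify_scan(const enum kind *ptes, size_t n,
			      size_t max_none, size_t max_swap)
{
	size_t none_or_zero = 0, unmapped = 0;

	for (size_t i = 0; i < n; i++) {
		switch (ptes[i]) {
		case K_NONE_OR_ZERO:	/* pte_none() || is_zero_pfn() */
			if (++none_or_zero > max_none)
				return R_EXCEED_NONE;
			break;
		case K_NON_SWAP_ENTRY:	/* guard, migration, hwpoison, ... */
			return R_NON_PRESENT;
		case K_SWAP:		/* only real swap entries counted */
			if (++unmapped > max_swap)
				return R_EXCEED_SWAP;
			break;
		case K_PRESENT:		/* falls through to the page checks */
			break;
		}
	}
	return R_OK;
}
```

The key difference from the current code is visible in the model: non-swap entries never increment `unmapped`, so they no longer eat into the max_ptes_swap budget before aborting.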

-- 
Cheers

David / dhildenb
Re: [PATCH mm-new v2 2/2] mm/khugepaged: abort collapse scan on guard PTEs
Posted by Dev Jain 1 week, 6 days ago
On 19/09/25 12:17 am, David Hildenbrand wrote:
> On 18.09.25 07:04, Lance Yang wrote:
>> From: Lance Yang <lance.yang@linux.dev>
>>
>> Guard PTE markers are installed via MADV_GUARD_INSTALL to create
>> lightweight guard regions.
>>
>> Currently, any collapse path (khugepaged or MADV_COLLAPSE) will fail 
>> when
>> encountering such a range.
>>
>> MADV_COLLAPSE fails deep inside the collapse logic when trying to 
>> swap-in
>> the special marker in __collapse_huge_page_swapin().
>>
>> hpage_collapse_scan_pmd()
>>   `- collapse_huge_page()
>>       `- __collapse_huge_page_swapin() -> fails!
>>
>> khugepaged's behavior is slightly different due to its max_ptes_swap 
>> limit
>> (default 64). It won't fail as deep, but it will still needlessly 
>> scan up
>> to 64 swap entries before bailing out.
>>
>> IMHO, we can and should detect this much earlier.
>>
>> This patch adds a check directly inside the PTE scan loop. If a guard
>> marker is found, the scan is aborted immediately with 
>> SCAN_PTE_NON_PRESENT,
>> avoiding wasted work.
>>
>> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>> ---
>>   mm/khugepaged.c | 10 ++++++++++
>>   1 file changed, 10 insertions(+)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 9ed1af2b5c38..70ebfc7c1f3e 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1306,6 +1306,16 @@ static int hpage_collapse_scan_pmd(struct 
>> mm_struct *mm,
>>                       result = SCAN_PTE_UFFD_WP;
>>                       goto out_unmap;
>>                   }
>> +                /*
>> +                 * Guard PTE markers are installed by
>> +                 * MADV_GUARD_INSTALL. Any collapse path must
>> +                 * not touch them, so abort the scan immediately
>> +                 * if one is found.
>> +                 */
>> +                if (is_guard_pte_marker(pteval)) {
>> +                    result = SCAN_PTE_NON_PRESENT;
>> +                    goto out_unmap;
>> +                }
>
> Thinking about it, this is interesting.
>
> Essentially we track any non-swap swap entries towards 
> khugepaged_max_ptes_swap, which is rather weird.
>
> I think we might also run into migration entries here and hwpoison 
> entries?
>
> So what about just generalizing this:
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index af5f5c80fe4ed..28f1f4bf0e0a8 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1293,7 +1293,24 @@ static int hpage_collapse_scan_pmd(struct 
> mm_struct *mm,
>         for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>              _pte++, _address += PAGE_SIZE) {
>                 pte_t pteval = ptep_get(_pte);
> -               if (is_swap_pte(pteval)) {
> +
> +               if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> +                       ++none_or_zero;
> +                       if (!userfaultfd_armed(vma) &&
> +                           (!cc->is_khugepaged ||
> +                            none_or_zero <= khugepaged_max_ptes_none)) {
> +                               continue;
> +                       } else {
> +                               result = SCAN_EXCEED_NONE_PTE;
> +                               count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> +                               goto out_unmap;
> +                       }
> +               } else if (!pte_present(pteval)) {
> +                       if (non_swap_entry(pte_to_swp_entry(pteval))) {
> +                               result = SCAN_PTE_NON_PRESENT;
> +                               goto out_unmap;
> +                       }
> +
>                         ++unmapped;
>                         if (!cc->is_khugepaged ||
>                             unmapped <= khugepaged_max_ptes_swap) {
> @@ -1313,18 +1330,7 @@ static int hpage_collapse_scan_pmd(struct 
> mm_struct *mm,
>                                 goto out_unmap;
>                         }
>                 }
> -               if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> -                       ++none_or_zero;
> -                       if (!userfaultfd_armed(vma) &&
> -                           (!cc->is_khugepaged ||
> -                            none_or_zero <= khugepaged_max_ptes_none)) {
> -                               continue;
> -                       } else {
> -                               result = SCAN_EXCEED_NONE_PTE;
> -                               count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> -                               goto out_unmap;
> -                       }
> -               }
> +
>                 if (pte_uffd_wp(pteval)) {
>                         /*
>                          * Don't collapse the page if any of the small
>
>
> With that, the function flow looks more similar to 
> __collapse_huge_page_isolate(),
> except that we handle swap entries in there now.

This looks good!


>
>
> And with that in place, couldn't we factor out a huge chunk of both 
> scanning
> functions into some helper (passing whether swap entries are allowed 
> or not?).
>
> Yes, I know, refactoring khugepaged, crazy idea.
>
Re: [PATCH mm-new v2 2/2] mm/khugepaged: abort collapse scan on guard PTEs
Posted by Lance Yang 1 week, 6 days ago

On 2025/9/19 02:47, David Hildenbrand wrote:
> On 18.09.25 07:04, Lance Yang wrote:
>> From: Lance Yang <lance.yang@linux.dev>
>>
>> Guard PTE markers are installed via MADV_GUARD_INSTALL to create
>> lightweight guard regions.
>>
>> Currently, any collapse path (khugepaged or MADV_COLLAPSE) will fail when
>> encountering such a range.
>>
>> MADV_COLLAPSE fails deep inside the collapse logic when trying to swap-in
>> the special marker in __collapse_huge_page_swapin().
>>
>> hpage_collapse_scan_pmd()
>>   `- collapse_huge_page()
>>       `- __collapse_huge_page_swapin() -> fails!
>>
>> khugepaged's behavior is slightly different due to its max_ptes_swap 
>> limit
>> (default 64). It won't fail as deep, but it will still needlessly scan up
>> to 64 swap entries before bailing out.
>>
>> IMHO, we can and should detect this much earlier.
>>
>> This patch adds a check directly inside the PTE scan loop. If a guard
>> marker is found, the scan is aborted immediately with 
>> SCAN_PTE_NON_PRESENT,
>> avoiding wasted work.
>>
>> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>> ---
>>   mm/khugepaged.c | 10 ++++++++++
>>   1 file changed, 10 insertions(+)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 9ed1af2b5c38..70ebfc7c1f3e 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1306,6 +1306,16 @@ static int hpage_collapse_scan_pmd(struct 
>> mm_struct *mm,
>>                       result = SCAN_PTE_UFFD_WP;
>>                       goto out_unmap;
>>                   }
>> +                /*
>> +                 * Guard PTE markers are installed by
>> +                 * MADV_GUARD_INSTALL. Any collapse path must
>> +                 * not touch them, so abort the scan immediately
>> +                 * if one is found.
>> +                 */
>> +                if (is_guard_pte_marker(pteval)) {
>> +                    result = SCAN_PTE_NON_PRESENT;
>> +                    goto out_unmap;
>> +                }
> 
> Thinking about it, this is interesting.
> 
> Essentially we track any non-swap swap entries towards 
> khugepaged_max_ptes_swap, which is rather weird.
> 
> I think we might also run into migration entries here and hwpoison entries?
> 
> So what about just generalizing this:
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index af5f5c80fe4ed..28f1f4bf0e0a8 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1293,7 +1293,24 @@ static int hpage_collapse_scan_pmd(struct 
> mm_struct *mm,
>          for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>               _pte++, _address += PAGE_SIZE) {
>                  pte_t pteval = ptep_get(_pte);
> -               if (is_swap_pte(pteval)) {
> +
> +               if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> +                       ++none_or_zero;
> +                       if (!userfaultfd_armed(vma) &&
> +                           (!cc->is_khugepaged ||
> +                            none_or_zero <= khugepaged_max_ptes_none)) {
> +                               continue;
> +                       } else {
> +                               result = SCAN_EXCEED_NONE_PTE;
> +                               count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> +                               goto out_unmap;
> +                       }
> +               } else if (!pte_present(pteval)) {
> +                       if (non_swap_entry(pte_to_swp_entry(pteval))) {
> +                               result = SCAN_PTE_NON_PRESENT;
> +                               goto out_unmap;
> +                       }
> +
>                          ++unmapped;
>                          if (!cc->is_khugepaged ||
>                              unmapped <= khugepaged_max_ptes_swap) {
> @@ -1313,18 +1330,7 @@ static int hpage_collapse_scan_pmd(struct 
> mm_struct *mm,
>                                  goto out_unmap;
>                          }
>                  }
> -               if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
> -                       ++none_or_zero;
> -                       if (!userfaultfd_armed(vma) &&
> -                           (!cc->is_khugepaged ||
> -                            none_or_zero <= khugepaged_max_ptes_none)) {
> -                               continue;
> -                       } else {
> -                               result = SCAN_EXCEED_NONE_PTE;
> -                               count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
> -                               goto out_unmap;
> -                       }
> -               }
> +
>                  if (pte_uffd_wp(pteval)) {
>                          /*
>                           * Don't collapse the page if any of the small
> 
> 
> With that, the function flow looks more similar to 
> __collapse_huge_page_isolate(),
> except that we handle swap entries in there now.

Ah, indeed. I like this crazy idea ;p

> 
> 
> And with that in place, couldn't we factor out a huge chunk of both 
> scanning
> functions into some helper (passing whether swap entries are allowed or 
> not?).

Yes. Factoring out the common scanning logic into a new helper is a
good suggestion. It would clean things up ;)

> 
> Yes, I know, refactoring khugepaged, crazy idea.

I'll look into that. But let's do this separately :)

Cheers,
Lance


Re: [PATCH mm-new v2 2/2] mm/khugepaged: abort collapse scan on guard PTEs
Posted by David Hildenbrand 1 week, 6 days ago
On 19.09.25 04:41, Lance Yang wrote:
> 
> 
> On 2025/9/19 02:47, David Hildenbrand wrote:
>> On 18.09.25 07:04, Lance Yang wrote:
>>> From: Lance Yang <lance.yang@linux.dev>
>>>
>>> Guard PTE markers are installed via MADV_GUARD_INSTALL to create
>>> lightweight guard regions.
>>>
>>> Currently, any collapse path (khugepaged or MADV_COLLAPSE) will fail when
>>> encountering such a range.
>>>
>>> MADV_COLLAPSE fails deep inside the collapse logic when trying to swap-in
>>> the special marker in __collapse_huge_page_swapin().
>>>
>>> hpage_collapse_scan_pmd()
>>>    `- collapse_huge_page()
>>>        `- __collapse_huge_page_swapin() -> fails!
>>>
>>> khugepaged's behavior is slightly different due to its max_ptes_swap
>>> limit
>>> (default 64). It won't fail as deep, but it will still needlessly scan up
>>> to 64 swap entries before bailing out.
>>>
>>> IMHO, we can and should detect this much earlier.
>>>
>>> This patch adds a check directly inside the PTE scan loop. If a guard
>>> marker is found, the scan is aborted immediately with
>>> SCAN_PTE_NON_PRESENT,
>>> avoiding wasted work.
>>>
>>> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>>> ---
>>>    mm/khugepaged.c | 10 ++++++++++
>>>    1 file changed, 10 insertions(+)
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index 9ed1af2b5c38..70ebfc7c1f3e 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -1306,6 +1306,16 @@ static int hpage_collapse_scan_pmd(struct
>>> mm_struct *mm,
>>>                        result = SCAN_PTE_UFFD_WP;
>>>                        goto out_unmap;
>>>                    }
>>> +                /*
>>> +                 * Guard PTE markers are installed by
>>> +                 * MADV_GUARD_INSTALL. Any collapse path must
>>> +                 * not touch them, so abort the scan immediately
>>> +                 * if one is found.
>>> +                 */
>>> +                if (is_guard_pte_marker(pteval)) {
>>> +                    result = SCAN_PTE_NON_PRESENT;
>>> +                    goto out_unmap;
>>> +                }
>>
>> Thinking about it, this is interesting.
>>
>> Essentially we track any non-swap swap entries towards
>> khugepaged_max_ptes_swap, which is rather weird.
>>
>> I think we might also run into migration entries here and hwpoison entries?
>>
>> So what about just generalizing this:
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index af5f5c80fe4ed..28f1f4bf0e0a8 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1293,7 +1293,24 @@ static int hpage_collapse_scan_pmd(struct
>> mm_struct *mm,
>>           for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>>                _pte++, _address += PAGE_SIZE) {
>>                   pte_t pteval = ptep_get(_pte);
>> -               if (is_swap_pte(pteval)) {
>> +
>> +               if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>> +                       ++none_or_zero;
>> +                       if (!userfaultfd_armed(vma) &&
>> +                           (!cc->is_khugepaged ||
>> +                            none_or_zero <= khugepaged_max_ptes_none)) {
>> +                               continue;
>> +                       } else {
>> +                               result = SCAN_EXCEED_NONE_PTE;
>> +                               count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>> +                               goto out_unmap;
>> +                       }
>> +               } else if (!pte_present(pteval)) {
>> +                       if (non_swap_entry(pte_to_swp_entry(pteval))) {
>> +                               result = SCAN_PTE_NON_PRESENT;
>> +                               goto out_unmap;
>> +                       }
>> +
>>                           ++unmapped;
>>                           if (!cc->is_khugepaged ||
>>                               unmapped <= khugepaged_max_ptes_swap) {
>> @@ -1313,18 +1330,7 @@ static int hpage_collapse_scan_pmd(struct
>> mm_struct *mm,
>>                                   goto out_unmap;
>>                           }
>>                   }
>> -               if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>> -                       ++none_or_zero;
>> -                       if (!userfaultfd_armed(vma) &&
>> -                           (!cc->is_khugepaged ||
>> -                            none_or_zero <= khugepaged_max_ptes_none)) {
>> -                               continue;
>> -                       } else {
>> -                               result = SCAN_EXCEED_NONE_PTE;
>> -                               count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>> -                               goto out_unmap;
>> -                       }
>> -               }
>> +
>>                   if (pte_uffd_wp(pteval)) {
>>                           /*
>>                            * Don't collapse the page if any of the small
>>
>>
>> With that, the function flow looks more similar to
>> __collapse_huge_page_isolate(),
>> except that we handle swap entries in there now.
> 
> Ah, indeed. I like this crazy idea ;p
> 
>>
>>
>> And with that in place, couldn't we factor out a huge chunk of both
>> scanning
>> functions into some helper (passing whether swap entries are allowed or
>> not?).
> 
> Yes. Factoring out the common scanning logic into a new helper is a
> good suggestion. It would clean things up ;)
> 
>>
>> Yes, I know, refactoring khugepaged, crazy idea.
> 
> I'll look into that. But let's do this separately :)

Right, but let's just skip any non-swap entries early in this patch 
instead of special-casing only guard ptes.

-- 
Cheers

David / dhildenb

Re: [PATCH mm-new v2 2/2] mm/khugepaged: abort collapse scan on guard PTEs
Posted by Lance Yang 1 week, 6 days ago

On 2025/9/19 15:57, David Hildenbrand wrote:
> On 19.09.25 04:41, Lance Yang wrote:
>>
>>
>> On 2025/9/19 02:47, David Hildenbrand wrote:
>>> On 18.09.25 07:04, Lance Yang wrote:
>>>> From: Lance Yang <lance.yang@linux.dev>
>>>>
>>>> Guard PTE markers are installed via MADV_GUARD_INSTALL to create
>>>> lightweight guard regions.
>>>>
>>>> Currently, any collapse path (khugepaged or MADV_COLLAPSE) will fail 
>>>> when
>>>> encountering such a range.
>>>>
>>>> MADV_COLLAPSE fails deep inside the collapse logic when trying to 
>>>> swap-in
>>>> the special marker in __collapse_huge_page_swapin().
>>>>
>>>> hpage_collapse_scan_pmd()
>>>>    `- collapse_huge_page()
>>>>        `- __collapse_huge_page_swapin() -> fails!
>>>>
>>>> khugepaged's behavior is slightly different due to its max_ptes_swap
>>>> limit
>>>> (default 64). It won't fail as deep, but it will still needlessly 
>>>> scan up
>>>> to 64 swap entries before bailing out.
>>>>
>>>> IMHO, we can and should detect this much earlier.
>>>>
>>>> This patch adds a check directly inside the PTE scan loop. If a guard
>>>> marker is found, the scan is aborted immediately with
>>>> SCAN_PTE_NON_PRESENT,
>>>> avoiding wasted work.
>>>>
>>>> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>>>> ---
>>>>    mm/khugepaged.c | 10 ++++++++++
>>>>    1 file changed, 10 insertions(+)
>>>>
>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>> index 9ed1af2b5c38..70ebfc7c1f3e 100644
>>>> --- a/mm/khugepaged.c
>>>> +++ b/mm/khugepaged.c
>>>> @@ -1306,6 +1306,16 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>>>>                        result = SCAN_PTE_UFFD_WP;
>>>>                        goto out_unmap;
>>>>                    }
>>>> +                /*
>>>> +                 * Guard PTE markers are installed by
>>>> +                 * MADV_GUARD_INSTALL. Any collapse path must
>>>> +                 * not touch them, so abort the scan immediately
>>>> +                 * if one is found.
>>>> +                 */
>>>> +                if (is_guard_pte_marker(pteval)) {
>>>> +                    result = SCAN_PTE_NON_PRESENT;
>>>> +                    goto out_unmap;
>>>> +                }
>>>
>>> Thinking about it, this is interesting.
>>>
>>> Essentially we track any non-swap swap entries towards
>>> khugepaged_max_ptes_swap, which is rather weird.
>>>
>>> I think we might also run into migration entries here and hwpoison 
>>> entries?
>>>
>>> So what about just generalizing this:
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index af5f5c80fe4ed..28f1f4bf0e0a8 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -1293,7 +1293,24 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>>>          for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
>>>               _pte++, _address += PAGE_SIZE) {
>>>                  pte_t pteval = ptep_get(_pte);
>>> -               if (is_swap_pte(pteval)) {
>>> +
>>> +               if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>>> +                       ++none_or_zero;
>>> +                       if (!userfaultfd_armed(vma) &&
>>> +                           (!cc->is_khugepaged ||
>>> +                            none_or_zero <= khugepaged_max_ptes_none)) {
>>> +                               continue;
>>> +                       } else {
>>> +                               result = SCAN_EXCEED_NONE_PTE;
>>> +                               count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>>> +                               goto out_unmap;
>>> +                       }
>>> +               } else if (!pte_present(pteval)) {
>>> +                       if (non_swap_entry(pte_to_swp_entry(pteval))) {
>>> +                               result = SCAN_PTE_NON_PRESENT;
>>> +                               goto out_unmap;
>>> +                       }
>>> +
>>>                          ++unmapped;
>>>                          if (!cc->is_khugepaged ||
>>>                              unmapped <= khugepaged_max_ptes_swap) {
>>> @@ -1313,18 +1330,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>>>                                  goto out_unmap;
>>>                          }
>>>                  }
>>> -               if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>>> -                       ++none_or_zero;
>>> -                       if (!userfaultfd_armed(vma) &&
>>> -                           (!cc->is_khugepaged ||
>>> -                            none_or_zero <= khugepaged_max_ptes_none)) {
>>> -                               continue;
>>> -                       } else {
>>> -                               result = SCAN_EXCEED_NONE_PTE;
>>> -                               count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
>>> -                               goto out_unmap;
>>> -                       }
>>> -               }
>>> +
>>>                  if (pte_uffd_wp(pteval)) {
>>>                          /*
>>>                           * Don't collapse the page if any of the small
>>>
>>>
>>> With that, the function flow looks more similar to
>>> __collapse_huge_page_isolate(),
>>> except that we handle swap entries in there now.
>>
>> Ah, indeed. I like this crazy idea ;p
>>
>>>
>>>
>>> And with that in place, couldn't we factor out a huge chunk of both
>>> scanning
>>> functions into some helper (passing whether swap entries are allowed or
>>> not?).
>>
>> Yes. Factoring out the common scanning logic into a new helper is a
>> good suggestion. It would clean things up ;)
>>
>>>
>>> Yes, I know, refactoring khugepaged, crazy idea.
>>
>> I'll look into that. But let's do this separately :)
> 
> Right, but let's just skip any non-swap entries early in this patch 
> instead of special-casing only guard ptes.

Ah, right! I missed the other non-swap entries. Will rework this patch
as you suggested!

Cheers,
Lance

Re: [PATCH mm-new v2 2/2] mm/khugepaged: abort collapse scan on guard PTEs
Posted by Dev Jain 1 week, 6 days ago
On 18/09/25 10:34 am, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
>
> Guard PTE markers are installed via MADV_GUARD_INSTALL to create
> lightweight guard regions.
>
> Currently, any collapse path (khugepaged or MADV_COLLAPSE) will fail when
> encountering such a range.
>
> MADV_COLLAPSE fails deep inside the collapse logic when trying to swap-in
> the special marker in __collapse_huge_page_swapin().
>
> hpage_collapse_scan_pmd()
>   `- collapse_huge_page()
>       `- __collapse_huge_page_swapin() -> fails!
>
> khugepaged's behavior is slightly different due to its max_ptes_swap limit
> (default 64). It won't fail as deep, but it will still needlessly scan up
> to 64 swap entries before bailing out.
>
> IMHO, we can and should detect this much earlier.
>
> This patch adds a check directly inside the PTE scan loop. If a guard
> marker is found, the scan is aborted immediately with SCAN_PTE_NON_PRESENT,
> avoiding wasted work.
>
> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Signed-off-by: Lance Yang <lance.yang@linux.dev>
> ---

Reviewed-by: Dev Jain <dev.jain@arm.com>
Re: [PATCH mm-new v2 2/2] mm/khugepaged: abort collapse scan on guard PTEs
Posted by Lorenzo Stoakes 2 weeks ago
On Thu, Sep 18, 2025 at 01:04:31PM +0800, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
>
> Guard PTE markers are installed via MADV_GUARD_INSTALL to create
> lightweight guard regions.
>
> Currently, any collapse path (khugepaged or MADV_COLLAPSE) will fail when
> encountering such a range.
>
> MADV_COLLAPSE fails deep inside the collapse logic when trying to swap-in
> the special marker in __collapse_huge_page_swapin().
>
> hpage_collapse_scan_pmd()
>  `- collapse_huge_page()
>      `- __collapse_huge_page_swapin() -> fails!
>
> khugepaged's behavior is slightly different due to its max_ptes_swap limit
> (default 64). It won't fail as deep, but it will still needlessly scan up
> to 64 swap entries before bailing out.
>
> IMHO, we can and should detect this much earlier.
>
> This patch adds a check directly inside the PTE scan loop. If a guard
> marker is found, the scan is aborted immediately with SCAN_PTE_NON_PRESENT,
> avoiding wasted work.
>
> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Signed-off-by: Lance Yang <lance.yang@linux.dev>

LGTM, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  mm/khugepaged.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 9ed1af2b5c38..70ebfc7c1f3e 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1306,6 +1306,16 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>  					result = SCAN_PTE_UFFD_WP;
>  					goto out_unmap;
>  				}
> +				/*
> +				 * Guard PTE markers are installed by
> +				 * MADV_GUARD_INSTALL. Any collapse path must
> +				 * not touch them, so abort the scan immediately
> +				 * if one is found.
> +				 */
> +				if (is_guard_pte_marker(pteval)) {
> +					result = SCAN_PTE_NON_PRESENT;
> +					goto out_unmap;
> +				}
>  				continue;
>  			} else {
>  				result = SCAN_EXCEED_SWAP_PTE;
> --
> 2.49.0
>
Re: [PATCH mm-new v2 2/2] mm/khugepaged: abort collapse scan on guard PTEs
Posted by Dev Jain 2 weeks ago
On 18/09/25 10:34 am, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
>
> Guard PTE markers are installed via MADV_GUARD_INSTALL to create
> lightweight guard regions.
>
> Currently, any collapse path (khugepaged or MADV_COLLAPSE) will fail when
> encountering such a range.
>
> MADV_COLLAPSE fails deep inside the collapse logic when trying to swap-in
> the special marker in __collapse_huge_page_swapin().
>
> hpage_collapse_scan_pmd()
>   `- collapse_huge_page()
>       `- __collapse_huge_page_swapin() -> fails!
>
> khugepaged's behavior is slightly different due to its max_ptes_swap limit
> (default 64). It won't fail as deep, but it will still needlessly scan up
> to 64 swap entries before bailing out.
>
> IMHO, we can and should detect this much earlier.
>
> This patch adds a check directly inside the PTE scan loop. If a guard
> marker is found, the scan is aborted immediately with SCAN_PTE_NON_PRESENT,
> avoiding wasted work.
>
> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Signed-off-by: Lance Yang <lance.yang@linux.dev>
> ---
>   mm/khugepaged.c | 10 ++++++++++
>   1 file changed, 10 insertions(+)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 9ed1af2b5c38..70ebfc7c1f3e 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1306,6 +1306,16 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>   					result = SCAN_PTE_UFFD_WP;
>   					goto out_unmap;
>   				}
> +				/*
> +				 * Guard PTE markers are installed by
> +				 * MADV_GUARD_INSTALL. Any collapse path must
> +				 * not touch them, so abort the scan immediately
> +				 * if one is found.
> +				 */
> +				if (is_guard_pte_marker(pteval)) {
> +					result = SCAN_PTE_NON_PRESENT;
> +					goto out_unmap;
> +				}
>   				continue;
>   			} else {
>   				result = SCAN_EXCEED_SWAP_PTE;
>
>

I would like to hear everyone else's thoughts on
https://lore.kernel.org/linux-mm/750a06dc-db3d-43c6-b234-95efb393a9df@arm.com/
wherein I suggest that we should not continue to try collapsing other regions
but immediately exit. The SCAN_PTE_NON_PRESENT case does not exit.
Re: [PATCH mm-new v2 2/2] mm/khugepaged: abort collapse scan on guard PTEs
Posted by Lance Yang 2 weeks ago

On 2025/9/18 15:37, Dev Jain wrote:
> 
> On 18/09/25 10:34 am, Lance Yang wrote:
>> From: Lance Yang <lance.yang@linux.dev>
>>
>> Guard PTE markers are installed via MADV_GUARD_INSTALL to create
>> lightweight guard regions.
>>
>> Currently, any collapse path (khugepaged or MADV_COLLAPSE) will fail when
>> encountering such a range.
>>
>> MADV_COLLAPSE fails deep inside the collapse logic when trying to swap-in
>> the special marker in __collapse_huge_page_swapin().
>>
>> hpage_collapse_scan_pmd()
>>   `- collapse_huge_page()
>>       `- __collapse_huge_page_swapin() -> fails!
>>
>> khugepaged's behavior is slightly different due to its max_ptes_swap 
>> limit
>> (default 64). It won't fail as deep, but it will still needlessly scan up
>> to 64 swap entries before bailing out.
>>
>> IMHO, we can and should detect this much earlier.
>>
>> This patch adds a check directly inside the PTE scan loop. If a guard
>> marker is found, the scan is aborted immediately with 
>> SCAN_PTE_NON_PRESENT,
>> avoiding wasted work.
>>
>> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>> ---
>>   mm/khugepaged.c | 10 ++++++++++
>>   1 file changed, 10 insertions(+)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 9ed1af2b5c38..70ebfc7c1f3e 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1306,6 +1306,16 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
>>                       result = SCAN_PTE_UFFD_WP;
>>                       goto out_unmap;
>>                   }
>> +                /*
>> +                 * Guard PTE markers are installed by
>> +                 * MADV_GUARD_INSTALL. Any collapse path must
>> +                 * not touch them, so abort the scan immediately
>> +                 * if one is found.
>> +                 */
>> +                if (is_guard_pte_marker(pteval)) {
>> +                    result = SCAN_PTE_NON_PRESENT;
>> +                    goto out_unmap;
>> +                }
>>                   continue;
>>               } else {
>>                   result = SCAN_EXCEED_SWAP_PTE;
>>
>>
> 
> I would like to hear everyone else's thoughts on
> https://lore.kernel.org/linux-mm/750a06dc-db3d-43c6-b234-95efb393a9df@arm.com/
> wherein I suggest that we should not continue to try collapsing other 
> regions
> but immediately exit. The SCAN_PTE_NON_PRESENT case does not exit.

Yes! Let's hear what other folks think on that[1].

[1] 
https://lore.kernel.org/linux-mm/c9d4d761-202f-48ce-8e3d-fb9075671ff3@linux.dev

Cheers,
Lance

Re: [PATCH mm-new v2 2/2] mm/khugepaged: abort collapse scan on guard PTEs
Posted by Lorenzo Stoakes 2 weeks ago
On Thu, Sep 18, 2025 at 04:11:21PM +0800, Lance Yang wrote:
>
>
> On 2025/9/18 15:37, Dev Jain wrote:
> >
> > On 18/09/25 10:34 am, Lance Yang wrote:
> > > From: Lance Yang <lance.yang@linux.dev>
> > >
> > > Guard PTE markers are installed via MADV_GUARD_INSTALL to create
> > > lightweight guard regions.
> > >
> > > Currently, any collapse path (khugepaged or MADV_COLLAPSE) will fail when
> > > encountering such a range.
> > >
> > > MADV_COLLAPSE fails deep inside the collapse logic when trying to swap-in
> > > the special marker in __collapse_huge_page_swapin().
> > >
> > > hpage_collapse_scan_pmd()
> > >   `- collapse_huge_page()
> > >       `- __collapse_huge_page_swapin() -> fails!
> > >
> > > khugepaged's behavior is slightly different due to its max_ptes_swap
> > > limit
> > > (default 64). It won't fail as deep, but it will still needlessly scan up
> > > to 64 swap entries before bailing out.
> > >
> > > IMHO, we can and should detect this much earlier.
> > >
> > > This patch adds a check directly inside the PTE scan loop. If a guard
> > > marker is found, the scan is aborted immediately with
> > > SCAN_PTE_NON_PRESENT,
> > > avoiding wasted work.
> > >
> > > Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> > > Signed-off-by: Lance Yang <lance.yang@linux.dev>
> > > ---
> > >   mm/khugepaged.c | 10 ++++++++++
> > >   1 file changed, 10 insertions(+)
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index 9ed1af2b5c38..70ebfc7c1f3e 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -1306,6 +1306,16 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
> > >                       result = SCAN_PTE_UFFD_WP;
> > >                       goto out_unmap;
> > >                   }
> > > +                /*
> > > +                 * Guard PTE markers are installed by
> > > +                 * MADV_GUARD_INSTALL. Any collapse path must
> > > +                 * not touch them, so abort the scan immediately
> > > +                 * if one is found.
> > > +                 */
> > > +                if (is_guard_pte_marker(pteval)) {
> > > +                    result = SCAN_PTE_NON_PRESENT;
> > > +                    goto out_unmap;
> > > +                }
> > >                   continue;
> > >               } else {
> > >                   result = SCAN_EXCEED_SWAP_PTE;
> > >
> > >
> >
> > I would like to hear everyone else's thoughts on
> > https://lore.kernel.org/linux-mm/750a06dc-db3d-43c6-b234-95efb393a9df@arm.com/
> > wherein I suggest that we should not continue to try collapsing other
> > regions
> > but immediately exit. The SCAN_PTE_NON_PRESENT case does not exit.
>
> Yes! Let's hear what other folks think on that[1].
>
> [1] https://lore.kernel.org/linux-mm/c9d4d761-202f-48ce-8e3d-fb9075671ff3@linux.dev

Since the code has changed let's discuss on this thread.

Dev - you can have guard regions in a range that prevent one PMD from being
collapsed; I'm struggling to understand why you'd want to abort the whole
thing.

Your reasoning there isn't clear at all, so if I had a guard region in one
page in a giant range I was trying to collapse, you're saying we should
just abort the whole thing?

I really don't understand why we would do that? You just skip over what you
can't collapse right?

There's no reason at all to assume that overlapping regions here matter, we
can't predict how users will use this.

As Lance says, it's best effort. And also note we already do this with UFFD
WP. And note this is also a non-present PTE marker.

And also this would change existing behaviour, which treats this as a swap
entry and then just fails later down the line, right?

So yeah I don't agree, I think it's fine as is, unless I'm missing
something here.

Cheers, Lorenzo
Re: [PATCH mm-new v2 2/2] mm/khugepaged: abort collapse scan on guard PTEs
Posted by Dev Jain 1 week, 6 days ago
On 18/09/25 3:36 pm, Lorenzo Stoakes wrote:
> On Thu, Sep 18, 2025 at 04:11:21PM +0800, Lance Yang wrote:
>>
>> On 2025/9/18 15:37, Dev Jain wrote:
>>> On 18/09/25 10:34 am, Lance Yang wrote:
>>>> From: Lance Yang <lance.yang@linux.dev>
>>>>
>>>> Guard PTE markers are installed via MADV_GUARD_INSTALL to create
>>>> lightweight guard regions.
>>>>
>>>> Currently, any collapse path (khugepaged or MADV_COLLAPSE) will fail when
>>>> encountering such a range.
>>>>
>>>> MADV_COLLAPSE fails deep inside the collapse logic when trying to swap-in
>>>> the special marker in __collapse_huge_page_swapin().
>>>>
>>>> hpage_collapse_scan_pmd()
>>>>    `- collapse_huge_page()
>>>>        `- __collapse_huge_page_swapin() -> fails!
>>>>
>>>> khugepaged's behavior is slightly different due to its max_ptes_swap
>>>> limit
>>>> (default 64). It won't fail as deep, but it will still needlessly scan up
>>>> to 64 swap entries before bailing out.
>>>>
>>>> IMHO, we can and should detect this much earlier.
>>>>
>>>> This patch adds a check directly inside the PTE scan loop. If a guard
>>>> marker is found, the scan is aborted immediately with
>>>> SCAN_PTE_NON_PRESENT,
>>>> avoiding wasted work.
>>>>
>>>> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>>> Signed-off-by: Lance Yang <lance.yang@linux.dev>
>>>> ---
>>>>    mm/khugepaged.c | 10 ++++++++++
>>>>    1 file changed, 10 insertions(+)
>>>>
>>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>>> index 9ed1af2b5c38..70ebfc7c1f3e 100644
>>>> --- a/mm/khugepaged.c
>>>> +++ b/mm/khugepaged.c
>>>> @@ -1306,6 +1306,16 @@ static int hpage_collapse_scan_pmd(struct
>>>> mm_struct *mm,
>>>>                        result = SCAN_PTE_UFFD_WP;
>>>>                        goto out_unmap;
>>>>                    }
>>>> +                /*
>>>> +                 * Guard PTE markers are installed by
>>>> +                 * MADV_GUARD_INSTALL. Any collapse path must
>>>> +                 * not touch them, so abort the scan immediately
>>>> +                 * if one is found.
>>>> +                 */
>>>> +                if (is_guard_pte_marker(pteval)) {
>>>> +                    result = SCAN_PTE_NON_PRESENT;
>>>> +                    goto out_unmap;
>>>> +                }
>>>>                    continue;
>>>>                } else {
>>>>                    result = SCAN_EXCEED_SWAP_PTE;
>>>>
>>>>
>>> I would like to hear everyone else's thoughts on
>>> https://lore.kernel.org/linux-mm/750a06dc-db3d-43c6-
>>> b234-95efb393a9df@arm.com/
>>> wherein I suggest that we should not continue to try collapsing other
>>> regions
>>> but immediately exit. The SCAN_PTE_NON_PRESENT case does not exit.
>> Yes! Let's hear what other folks think on that[1].
>>
>> [1] https://lore.kernel.org/linux-mm/c9d4d761-202f-48ce-8e3d-fb9075671ff3@linux.dev
> Since the code has changed let's discuss on this thread.
>
> Dev - You can have guard regions in a range that prevent one PMD from being
> collapsed, I'm struggling to understand why you'd want to abort the whole
> thing?
>
> Your reasoning there isn't clear at all, so if I had a guard region in one
> page in a giant range I was trying to collapse, you're saying we should
> just abort the whole thing?

My reasoning was that it doesn't seem right for the user to operate in any
capacity on a guard region when it knows it is a guard region.

But, as you say, we then won't be able to collapse a large region in one go
and will have to do multiple madvise() calls to prevent overlapping with a
guard region. So I agree with you.

>
> I really don't understand why we would do that? You just skip over what you
> can't collapse right?
>
> There's no reason at all to assume that overlapping regions here matter, we
> can't predict how users will use this.

True.

>
> As Lance says, it's best effort. And also note we already do this with UFFD
> WP. And note this is also a non-present, PTE marker.
>
> And also this would change existing behaviour which treats this as a swap
> entry then just fails later down the line right?
>
> So yeah I don't agree, I think it's fine as is, unless I'm missing
> something here.

Thanks for your explanation!

>
> Cheers, Lorenzo