From: Lance Yang <lance.yang@linux.dev>
Guard PTE markers are installed via MADV_GUARD_INSTALL to create
lightweight guard regions.
Currently, any collapse path (khugepaged or MADV_COLLAPSE) will fail when
encountering such a range.
MADV_COLLAPSE fails deep inside the collapse logic when it tries to swap
in the special marker in __collapse_huge_page_swapin().
hpage_collapse_scan_pmd()
`- collapse_huge_page()
`- __collapse_huge_page_swapin() -> fails!
khugepaged's behavior is slightly different due to its max_ptes_swap limit
(default 64). It won't fail as deep, but it will still needlessly scan up
to 64 swap entries before bailing out.
IMHO, we can and should detect this much earlier.
This patch adds a check directly inside the PTE scan loop. If a guard
marker is found, the scan is aborted immediately with SCAN_PTE_NON_PRESENT,
avoiding wasted work.
Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Signed-off-by: Lance Yang <lance.yang@linux.dev>
---
 mm/khugepaged.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 9ed1af2b5c38..70ebfc7c1f3e 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1306,6 +1306,16 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 					result = SCAN_PTE_UFFD_WP;
 					goto out_unmap;
 				}
+				/*
+				 * Guard PTE markers are installed by
+				 * MADV_GUARD_INSTALL. Any collapse path must
+				 * not touch them, so abort the scan immediately
+				 * if one is found.
+				 */
+				if (is_guard_pte_marker(pteval)) {
+					result = SCAN_PTE_NON_PRESENT;
+					goto out_unmap;
+				}
 				continue;
 			} else {
 				result = SCAN_EXCEED_SWAP_PTE;
--
2.49.0
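For anyone unfamiliar with guard regions, the situation the patch targets
can be reproduced from userspace with a short test program along these
lines. This is only a minimal sketch, not part of the series; the fallback
values for MADV_GUARD_INSTALL/MADV_COLLAPSE and the 2 MiB PMD size are
assumptions (guard regions need Linux 6.13+, MADV_COLLAPSE needs 6.1+ with
THP enabled):

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25		/* assumed value */
#endif
#ifndef MADV_GUARD_INSTALL
#define MADV_GUARD_INSTALL 102		/* assumed value */
#endif

int main(void)
{
	const size_t pmd_size = 2UL << 20;	/* assume a 2 MiB PMD */
	const size_t page_size = 4096;
	char *raw, *buf;

	/* Over-map so a PMD-aligned 2 MiB window can be carved out. */
	raw = mmap(NULL, 2 * pmd_size, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (raw == MAP_FAILED)
		return 1;
	buf = (char *)(((uintptr_t)raw + pmd_size - 1) & ~(pmd_size - 1));

	/* Fault in small pages, then turn one of them into a guard page. */
	memset(buf, 1, pmd_size);
	if (madvise(buf + page_size, page_size, MADV_GUARD_INSTALL))
		perror("MADV_GUARD_INSTALL");

	/* The collapse is refused once the scan hits the guard marker. */
	if (madvise(buf, pmd_size, MADV_COLLAPSE))
		perror("MADV_COLLAPSE");

	return 0;
}

With the patch applied, the MADV_COLLAPSE call above is expected to fail
fast, with the scan returning SCAN_PTE_NON_PRESENT instead of descending
into __collapse_huge_page_swapin().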
On 18 Sep 2025, at 1:04, Lance Yang wrote:

> From: Lance Yang <lance.yang@linux.dev>
>
> [...]
>
> +				/*
> +				 * Guard PTE markers are installed by
> +				 * MADV_GUARD_INSTALL. Any collapse path must
> +				 * not touch them, so abort the scan immediately
> +				 * if one is found.
> +				 */
> +				if (is_guard_pte_marker(pteval)) {

pteval is already is_swap_pte(), would is_guard_swp_entry() be a better
choice here? Save one is_swap_pte() call.

> +					result = SCAN_PTE_NON_PRESENT;
> +					goto out_unmap;
> +				}
> [...]

Best Regards,
Yan, Zi
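For context, the two helpers being compared look roughly like this
(paraphrased from mm/madvise.c around the time of this series; the exact
location and linkage are assumptions):

/* Paraphrased; not copied verbatim from the tree. */
static bool is_guard_swp_entry(swp_entry_t entry)
{
	/* A guard page is represented by a PTE marker carrying the GUARD bit. */
	return is_pte_marker_entry(entry) &&
	       (pte_marker_get(entry) & PTE_MARKER_GUARD);
}

static bool is_guard_pte_marker(pte_t ptent)
{
	/* Wrapper that first checks the PTE is a swap-style (non-present) entry. */
	return is_swap_pte(ptent) &&
	       is_guard_swp_entry(pte_to_swp_entry(ptent));
}

Since pteval has already passed is_swap_pte() at this point in the scan
loop, calling is_guard_swp_entry(pte_to_swp_entry(pteval)) directly would
skip the redundant check, which is the saving being pointed out here.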
On 2025/9/19 03:12, Zi Yan wrote:
> On 18 Sep 2025, at 1:04, Lance Yang wrote:
>
>> [...]
>>
>> +				if (is_guard_pte_marker(pteval)) {
>
> pteval is already is_swap_pte(), would is_guard_swp_entry() be a better
> choice here? Save one is_swap_pte() call.

Yeah. Good spot! Will do ;)

Thanks,
Lance
On 2025/9/19 10:44, Lance Yang wrote:
> On 2025/9/19 03:12, Zi Yan wrote:
>> On 18 Sep 2025, at 1:04, Lance Yang wrote:
>>
>>> [...]
>>>
>>> +				if (is_guard_pte_marker(pteval)) {
>>
>> pteval is already is_swap_pte(), would is_guard_swp_entry() be a better
>> choice here? Save one is_swap_pte() call.

Agree. Then seems we don't need to move the is_guard_pte_marker() into a
header file.
On 2025/9/19 11:18, Baolin Wang wrote:
> On 2025/9/19 10:44, Lance Yang wrote:
>> On 2025/9/19 03:12, Zi Yan wrote:
>>> On 18 Sep 2025, at 1:04, Lance Yang wrote:
>>>
>>>> [...]
>>>>
>>>> +				if (is_guard_pte_marker(pteval)) {
>>>
>>> pteval is already is_swap_pte(), would is_guard_swp_entry() be a better
>>> choice here? Save one is_swap_pte() call.
>
> Agree. Then seems we don't need to move the is_guard_pte_marker() into a
> header file.

Got it. I'll only remove the redundant check in #01 patch.
On 18.09.25 07:04, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
>
> [...]
>
> +				/*
> +				 * Guard PTE markers are installed by
> +				 * MADV_GUARD_INSTALL. Any collapse path must
> +				 * not touch them, so abort the scan immediately
> +				 * if one is found.
> +				 */
> +				if (is_guard_pte_marker(pteval)) {
> +					result = SCAN_PTE_NON_PRESENT;
> +					goto out_unmap;
> +				}

Thinking about it, this is interesting.

Essentially we track any non-swap swap entries towards
khugepaged_max_ptes_swap, which is rather weird.

I think we might also run into migration entries here and hwpoison
entries?

So what about just generalizing this:

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index af5f5c80fe4ed..28f1f4bf0e0a8 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1293,7 +1293,24 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 	for (_address = address, _pte = pte; _pte < pte + HPAGE_PMD_NR;
 	     _pte++, _address += PAGE_SIZE) {
 		pte_t pteval = ptep_get(_pte);
-		if (is_swap_pte(pteval)) {
+
+		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
+			++none_or_zero;
+			if (!userfaultfd_armed(vma) &&
+			    (!cc->is_khugepaged ||
+			     none_or_zero <= khugepaged_max_ptes_none)) {
+				continue;
+			} else {
+				result = SCAN_EXCEED_NONE_PTE;
+				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
+				goto out_unmap;
+			}
+		} else if (!pte_present(pteval)) {
+			if (non_swap_entry(pte_to_swp_entry(pteval))) {
+				result = SCAN_PTE_NON_PRESENT;
+				goto out_unmap;
+			}
+
 			++unmapped;
 			if (!cc->is_khugepaged ||
 			    unmapped <= khugepaged_max_ptes_swap) {
@@ -1313,18 +1330,7 @@ static int hpage_collapse_scan_pmd(struct mm_struct *mm,
 				goto out_unmap;
 			}
 		}
-		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
-			++none_or_zero;
-			if (!userfaultfd_armed(vma) &&
-			    (!cc->is_khugepaged ||
-			     none_or_zero <= khugepaged_max_ptes_none)) {
-				continue;
-			} else {
-				result = SCAN_EXCEED_NONE_PTE;
-				count_vm_event(THP_SCAN_EXCEED_NONE_PTE);
-				goto out_unmap;
-			}
-		}
+
 		if (pte_uffd_wp(pteval)) {
 			/*
 			 * Don't collapse the page if any of the small

With that, the function flow looks more similar to
__collapse_huge_page_isolate(), except that we handle swap entries in
there now.

And with that in place, couldn't we factor out a huge chunk of both
scanning functions into some helper (passing whether swap entries are
allowed or not?).

Yes, I know, refactoring khugepaged, crazy idea.

--
Cheers

David / dhildenb
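The non_swap_entry() test above is what makes the generalization cheap:
all of the special entry types are carved out of the top of the swap-type
space, so a single comparison covers migration, hwpoison, device-private
and PTE-marker entries (guard markers included). Roughly, and assuming a
configuration where such entries exist at all:

/* Paraphrased from include/linux/swapops.h. */
static inline int non_swap_entry(swp_entry_t entry)
{
	/*
	 * Real swap entries use types below MAX_SWAPFILES; everything above
	 * is a special software-defined entry that cannot be swapped in.
	 */
	return swp_type(entry) >= MAX_SWAPFILES;
}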
On 19/09/25 12:17 am, David Hildenbrand wrote:
> On 18.09.25 07:04, Lance Yang wrote:
>> [...]
>
> Thinking about it, this is interesting.
>
> Essentially we track any non-swap swap entries towards
> khugepaged_max_ptes_swap, which is rather weird.
>
> I think we might also run into migration entries here and hwpoison
> entries?
>
> So what about just generalizing this:
>
> [...]
>
> With that, the function flow looks more similar to
> __collapse_huge_page_isolate(), except that we handle swap entries in
> there now.

This looks good!

> And with that in place, couldn't we factor out a huge chunk of both
> scanning functions into some helper (passing whether swap entries are
> allowed or not?).
>
> Yes, I know, refactoring khugepaged, crazy idea.
On 2025/9/19 02:47, David Hildenbrand wrote:
> On 18.09.25 07:04, Lance Yang wrote:
>> [...]
>
> Thinking about it, this is interesting.
>
> Essentially we track any non-swap swap entries towards
> khugepaged_max_ptes_swap, which is rather weird.
>
> I think we might also run into migration entries here and hwpoison
> entries?
>
> So what about just generalizing this:
>
> [...]
>
> With that, the function flow looks more similar to
> __collapse_huge_page_isolate(), except that we handle swap entries in
> there now.

Ah, indeed. I like this crazy idea ;p

> And with that in place, couldn't we factor out a huge chunk of both
> scanning functions into some helper (passing whether swap entries are
> allowed or not?).

Yes. Factoring out the common scanning logic into a new helper is a
good suggestion. It would clean things up ;)

> Yes, I know, refactoring khugepaged, crazy idea.

I'll look into that. But let's do this separately :)

Cheers,
Lance
On 19.09.25 04:41, Lance Yang wrote:
> On 2025/9/19 02:47, David Hildenbrand wrote:
>> [...]
>>
>> And with that in place, couldn't we factor out a huge chunk of both
>> scanning functions into some helper (passing whether swap entries are
>> allowed or not?).
>
> Yes. Factoring out the common scanning logic into a new helper is a
> good suggestion. It would clean things up ;)
>
>> Yes, I know, refactoring khugepaged, crazy idea.
>
> I'll look into that. But let's do this separately :)

Right, but let's just skip any non-swap entries early in this patch
instead of special-casing only guard ptes.

--
Cheers

David / dhildenb
On 2025/9/19 15:57, David Hildenbrand wrote:
> On 19.09.25 04:41, Lance Yang wrote:
>> [...]
>>
>> I'll look into that. But let's do this separately :)
>
> Right, but let's just skip any non-swap entries early in this patch
> instead of special-casing only guard ptes.

Ah, right! I missed the other non-swap entries. Will rework this patch
as you suggested!

Cheers,
Lance
On 18/09/25 10:34 am, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
>
> [...]
>
> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Signed-off-by: Lance Yang <lance.yang@linux.dev>
> ---

Reviewed-by: Dev Jain <dev.jain@arm.com>
On Thu, Sep 18, 2025 at 01:04:31PM +0800, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
>
> [...]
>
> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Signed-off-by: Lance Yang <lance.yang@linux.dev>

LGTM, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
> [...]
On 18/09/25 10:34 am, Lance Yang wrote:
> From: Lance Yang <lance.yang@linux.dev>
>
> [...]
>
> +				if (is_guard_pte_marker(pteval)) {
> +					result = SCAN_PTE_NON_PRESENT;
> +					goto out_unmap;
> +				}
>  				continue;
>  			} else {
>  				result = SCAN_EXCEED_SWAP_PTE;

I would like to hear everyone else's thoughts on
https://lore.kernel.org/linux-mm/750a06dc-db3d-43c6-b234-95efb393a9df@arm.com/
wherein I suggest that we should not continue to try collapsing other
regions but immediately exit. The SCAN_PTE_NON_PRESENT case does not exit.
On 2025/9/18 15:37, Dev Jain wrote:
> On 18/09/25 10:34 am, Lance Yang wrote:
>> [...]
>
> I would like to hear everyone else's thoughts on
> https://lore.kernel.org/linux-mm/750a06dc-db3d-43c6-b234-95efb393a9df@arm.com/
> wherein I suggest that we should not continue to try collapsing other
> regions but immediately exit. The SCAN_PTE_NON_PRESENT case does not exit.

Yes! Let's hear what other folks think on that[1].

[1] https://lore.kernel.org/linux-mm/c9d4d761-202f-48ce-8e3d-fb9075671ff3@linux.dev

Cheers,
Lance
On Thu, Sep 18, 2025 at 04:11:21PM +0800, Lance Yang wrote:
> On 2025/9/18 15:37, Dev Jain wrote:
>> On 18/09/25 10:34 am, Lance Yang wrote:
>>> [...]
>>
>> I would like to hear everyone else's thoughts on
>> https://lore.kernel.org/linux-mm/750a06dc-db3d-43c6-b234-95efb393a9df@arm.com/
>> wherein I suggest that we should not continue to try collapsing other
>> regions but immediately exit. The SCAN_PTE_NON_PRESENT case does not exit.
>
> Yes! Let's hear what other folks think on that[1].
>
> [1] https://lore.kernel.org/linux-mm/c9d4d761-202f-48ce-8e3d-fb9075671ff3@linux.dev

Since the code has changed let's discuss on this thread.

Dev - You can have guard regions in a range that prevent one PMD from being
collapsed, I'm struggling to understand why you'd want to abort the whole
thing?

Your reasoning there isn't clear at all, so if I had a guard region in one
page in a giant range I was trying to collapse, you're saying we should
just abort the whole thing?

I really don't understand why we would do that? You just skip over what you
can't collapse right?

There's no reason at all to assume that overlapping regions here matter, we
can't predict how users will use this. As Lance says, it's best effort.

And also note we already do this with UFFD WP. And note this is also a
non-present, PTE marker.

And also this would change existing behaviour which treats this as a swap
entry then just fails later down the line right?

So yeah I don't agree, I think it's fine as is, unless I'm missing
something here.

Cheers, Lorenzo
On 18/09/25 3:36 pm, Lorenzo Stoakes wrote:
> On Thu, Sep 18, 2025 at 04:11:21PM +0800, Lance Yang wrote:
>> [...]
>
> Since the code has changed let's discuss on this thread.
>
> Dev - You can have guard regions in a range that prevent one PMD from being
> collapsed, I'm struggling to understand why you'd want to abort the whole
> thing?
>
> Your reasoning there isn't clear at all, so if I had a guard region in one
> page in a giant range I was trying to collapse, you're saying we should
> just abort the whole thing?

My reasoning was that it doesn't seem correct that the user will operate in
any capacity on a guard region when it knows it is a guard region. But, as
you say, we then won't be able to collapse a large region in one go and
will have to do multiple madvise() calls to prevent overlapping with a
guard region. So I agree with you.

> I really don't understand why we would do that? You just skip over what you
> can't collapse right?
>
> There's no reason at all to assume that overlapping regions here matter, we
> can't predict how users will use this.

True.

> As Lance says, it's best effort. And also note we already do this with UFFD
> WP. And note this is also a non-present, PTE marker.
>
> And also this would change existing behaviour which treats this as a swap
> entry then just fails later down the line right?
>
> So yeah I don't agree, I think it's fine as is, unless I'm missing
> something here.

Thanks for your explanation!

> Cheers, Lorenzo