[v12] khugepaged: mTHP support

[PATCH v12 mm-new 06/15] khugepaged: introduce collapse_max_ptes_none helper function

Posted by Nico Pache 3 months, 2 weeks ago

The current mechanism for determining mTHP collapse scales the
khugepaged_max_ptes_none value based on the target order. This
introduces an undesirable feedback loop, or "creep", when max_ptes_none
is set to a value greater than HPAGE_PMD_NR / 2.

With this configuration, a successful collapse to order N will populate
enough pages to satisfy the collapse condition on order N+1 on the next
scan. This leads to unnecessary work and memory churn.

To fix this issue introduce a helper function that caps the max_ptes_none
to HPAGE_PMD_NR / 2 - 1 (255 on 4k page size). The function also scales
the max_ptes_none number by the (PMD_ORDER - target collapse order).

The limits can be ignored by passing full_scan=true, this is useful for
madvise_collapse (which ignores limits), or in the case of
collapse_scan_pmd(), allows the full PMD to be scanned when mTHP
collapse is available.

Signed-off-by: Nico Pache <npache@redhat.com>
---
 mm/khugepaged.c | 35 ++++++++++++++++++++++++++++++++++-
 1 file changed, 34 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 4ccebf5dda97..286c3a7afdee 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -459,6 +459,39 @@ void __khugepaged_enter(struct mm_struct *mm)
 		wake_up_interruptible(&khugepaged_wait);
 }
 
+/**
+ * collapse_max_ptes_none - Calculate maximum allowed empty PTEs for collapse
+ * @order: The folio order being collapsed to
+ * @full_scan: Whether this is a full scan (ignore limits)
+ *
+ * For madvise-triggered collapses (full_scan=true), all limits are bypassed
+ * and allow up to HPAGE_PMD_NR - 1 empty PTEs.
+ *
+ * For PMD-sized collapses (order == HPAGE_PMD_ORDER), use the configured
+ * khugepaged_max_ptes_none value.
+ *
+ * For mTHP collapses, scale down the max_ptes_none proportionally to the folio
+ * order, but caps it at HPAGE_PMD_NR/2-1 to prevent a collapse feedback loop.
+ *
+ * Return: Maximum number of empty PTEs allowed for the collapse operation
+ */
+static unsigned int collapse_max_ptes_none(unsigned int order, bool full_scan)
+{
+	unsigned int max_ptes_none;
+
+	/* ignore max_ptes_none limits */
+	if (full_scan)
+		return HPAGE_PMD_NR - 1;
+
+	if (order == HPAGE_PMD_ORDER)
+		return khugepaged_max_ptes_none;
+
+	max_ptes_none = min(khugepaged_max_ptes_none, HPAGE_PMD_NR/2 - 1);
+
+	return max_ptes_none >> (HPAGE_PMD_ORDER - order);
+
+}
+
 void khugepaged_enter_vma(struct vm_area_struct *vma,
 			  vm_flags_t vm_flags)
 {
@@ -546,7 +579,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 	pte_t *_pte;
 	int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
 	const unsigned long nr_pages = 1UL << order;
-	int max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
+	int max_ptes_none = collapse_max_ptes_none(order, !cc->is_khugepaged);
 
 	for (_pte = pte; _pte < pte + nr_pages;
 	     _pte++, addr += PAGE_SIZE) {
-- 
2.51.0
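
For readers following along, here is a minimal userspace sketch of the arithmetic the helper above performs. It is not the kernel code: it assumes 4K pages (HPAGE_PMD_ORDER of 9), takes the tunable as a parameter instead of reading a global, and the name model_collapse_max_ptes_none() is made up for illustration.

```c
#include <stdbool.h>

/* Assumption: 4K base pages, so a PMD maps 512 PTEs. */
#define HPAGE_PMD_ORDER 9u
#define HPAGE_PMD_NR    (1u << HPAGE_PMD_ORDER)

/*
 * Hypothetical userspace model of collapse_max_ptes_none() from the patch.
 * 'max_ptes_none' stands in for the khugepaged_max_ptes_none tunable.
 */
static unsigned int model_collapse_max_ptes_none(unsigned int max_ptes_none,
						 unsigned int order,
						 bool full_scan)
{
	/* madvise-style callers ignore the limit entirely */
	if (full_scan)
		return HPAGE_PMD_NR - 1;

	/* PMD-sized collapse uses the configured value unchanged */
	if (order == HPAGE_PMD_ORDER)
		return max_ptes_none;

	/* mTHP: cap at HPAGE_PMD_NR/2 - 1, then scale down to the order */
	if (max_ptes_none > HPAGE_PMD_NR / 2 - 1)
		max_ptes_none = HPAGE_PMD_NR / 2 - 1;

	return max_ptes_none >> (HPAGE_PMD_ORDER - order);
}
```

Under these assumptions, a tunable of 511 gives 511 for order 9, 127 for order 8, 7 for order 4 and 1 for order 2, while full_scan=true returns 511 for any order.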

Re: [PATCH v12 mm-new 06/15] khugepaged: introduce collapse_max_ptes_none helper function

Posted by Lorenzo Stoakes 3 months, 1 week ago

On Wed, Oct 22, 2025 at 12:37:08PM -0600, Nico Pache wrote:
> The current mechanism for determining mTHP collapse scales the
> khugepaged_max_ptes_none value based on the target order. This
> introduces an undesirable feedback loop, or "creep", when max_ptes_none
> is set to a value greater than HPAGE_PMD_NR / 2.
>
> With this configuration, a successful collapse to order N will populate
> enough pages to satisfy the collapse condition on order N+1 on the next
> scan. This leads to unnecessary work and memory churn.
>
> To fix this issue introduce a helper function that caps the max_ptes_none
> to HPAGE_PMD_NR / 2 - 1 (255 on 4k page size). The function also scales
> the max_ptes_none number by the (PMD_ORDER - target collapse order).
>
> The limits can be ignored by passing full_scan=true, this is useful for
> madvise_collapse (which ignores limits), or in the case of
> collapse_scan_pmd(), allows the full PMD to be scanned when mTHP
> collapse is available.
>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  mm/khugepaged.c | 35 ++++++++++++++++++++++++++++++++++-
>  1 file changed, 34 insertions(+), 1 deletion(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 4ccebf5dda97..286c3a7afdee 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -459,6 +459,39 @@ void __khugepaged_enter(struct mm_struct *mm)
>  		wake_up_interruptible(&khugepaged_wait);
>  }
>
> +/**
> + * collapse_max_ptes_none - Calculate maximum allowed empty PTEs for collapse
> + * @order: The folio order being collapsed to
> + * @full_scan: Whether this is a full scan (ignore limits)
> + *
> + * For madvise-triggered collapses (full_scan=true), all limits are bypassed
> + * and allow up to HPAGE_PMD_NR - 1 empty PTEs.
> + *
> + * For PMD-sized collapses (order == HPAGE_PMD_ORDER), use the configured
> + * khugepaged_max_ptes_none value.
> + *
> + * For mTHP collapses, scale down the max_ptes_none proportionally to the folio
> + * order, but caps it at HPAGE_PMD_NR/2-1 to prevent a collapse feedback loop.
> + *
> + * Return: Maximum number of empty PTEs allowed for the collapse operation
> + */
> +static unsigned int collapse_max_ptes_none(unsigned int order, bool full_scan)
> +{
> +	unsigned int max_ptes_none;
> +
> +	/* ignore max_ptes_none limits */
> +	if (full_scan)
> +		return HPAGE_PMD_NR - 1;
> +
> +	if (order == HPAGE_PMD_ORDER)
> +		return khugepaged_max_ptes_none;
> +
> +	max_ptes_none = min(khugepaged_max_ptes_none, HPAGE_PMD_NR/2 - 1);

I mean not to beat a dead horse re: v11 commentary, but I thought we were going
to implement David's idea re: the new 'eagerness' tunable, and again we're now just
implementing the capping at HPAGE_PMD_NR/2 - 1 thing again?

I'm still really quite uncomfortable with us silently capping this value.

If we're putting forward theoretical ideas that are to be later built upon, this
series should be an RFC.

But if we really intend to silently ignore user input the problem is that then
becomes established uAPI.

I think it's _sensible_ to avoid this mTHP escalation problem, but the issue is
visibility I think.

I think people are going to find it odd that you set it to something, but then
get something else.

As an alternative we could have a new sysfs field:

/sys/kernel/mm/transparent_hugepage/khugepaged/max_mthp_ptes_none

That shows the cap clearly.

In fact, it could be read-only... and just expose it to the user. That reduces
complexity.

We can then bring in eagerness later and have the same situation of
max_ptes_none being a parameter that exists (plus this additional read-only
parameter).
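
To make the read-only idea concrete, a rough userspace sketch of the value such a field might report is below. This is only an illustration under assumptions: 4K pages, and the function names are hypothetical. A real implementation would be a sysfs show handler in mm/khugepaged.c, not this code.

```c
#include <stdio.h>

/* Assumption: 4K base pages, so a PMD maps 512 PTEs. */
#define HPAGE_PMD_ORDER 9u
#define HPAGE_PMD_NR    (1u << HPAGE_PMD_ORDER)

/* Hypothetical: the value mTHP collapse would actually use before scaling. */
static unsigned int effective_mthp_max_ptes_none(unsigned int max_ptes_none)
{
	unsigned int cap = HPAGE_PMD_NR / 2 - 1;

	return max_ptes_none < cap ? max_ptes_none : cap;
}

/*
 * Hypothetical stand-in for a read-only show handler: formats the capped
 * value the way a sysfs file would present it ("<number>\n").
 */
static int show_max_mthp_ptes_none(char *buf, size_t len,
				   unsigned int max_ptes_none)
{
	return snprintf(buf, len, "%u\n",
			effective_mthp_max_ptes_none(max_ptes_none));
}
```

With the default max_ptes_none of 511 this would report 255, while a configured value of 64 would be reported unchanged.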

> +
> +	return max_ptes_none >> (HPAGE_PMD_ORDER - order);
> +
> +}
> +
>  void khugepaged_enter_vma(struct vm_area_struct *vma,
>  			  vm_flags_t vm_flags)
>  {
> @@ -546,7 +579,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  	pte_t *_pte;
>  	int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
>  	const unsigned long nr_pages = 1UL << order;
> -	int max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
> +	int max_ptes_none = collapse_max_ptes_none(order, !cc->is_khugepaged);
>
>  	for (_pte = pte; _pte < pte + nr_pages;
>  	     _pte++, addr += PAGE_SIZE) {
> --
> 2.51.0
>

Re: [PATCH v12 mm-new 06/15] khugepaged: introduce collapse_max_ptes_none helper function

Posted by Nico Pache 3 months, 1 week ago

On Mon, Oct 27, 2025 at 11:54 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Wed, Oct 22, 2025 at 12:37:08PM -0600, Nico Pache wrote:
> > The current mechanism for determining mTHP collapse scales the
> > khugepaged_max_ptes_none value based on the target order. This
> > introduces an undesirable feedback loop, or "creep", when max_ptes_none
> > is set to a value greater than HPAGE_PMD_NR / 2.
> >
> > With this configuration, a successful collapse to order N will populate
> > enough pages to satisfy the collapse condition on order N+1 on the next
> > scan. This leads to unnecessary work and memory churn.
> >
> > To fix this issue introduce a helper function that caps the max_ptes_none
> > to HPAGE_PMD_NR / 2 - 1 (255 on 4k page size). The function also scales
> > the max_ptes_none number by the (PMD_ORDER - target collapse order).
> >
> > The limits can be ignored by passing full_scan=true, this is useful for
> > madvise_collapse (which ignores limits), or in the case of
> > collapse_scan_pmd(), allows the full PMD to be scanned when mTHP
> > collapse is available.
> >
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >  mm/khugepaged.c | 35 ++++++++++++++++++++++++++++++++++-
> >  1 file changed, 34 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 4ccebf5dda97..286c3a7afdee 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -459,6 +459,39 @@ void __khugepaged_enter(struct mm_struct *mm)
> >               wake_up_interruptible(&khugepaged_wait);
> >  }
> >
> > +/**
> > + * collapse_max_ptes_none - Calculate maximum allowed empty PTEs for collapse
> > + * @order: The folio order being collapsed to
> > + * @full_scan: Whether this is a full scan (ignore limits)
> > + *
> > + * For madvise-triggered collapses (full_scan=true), all limits are bypassed
> > + * and allow up to HPAGE_PMD_NR - 1 empty PTEs.
> > + *
> > + * For PMD-sized collapses (order == HPAGE_PMD_ORDER), use the configured
> > + * khugepaged_max_ptes_none value.
> > + *
> > + * For mTHP collapses, scale down the max_ptes_none proportionally to the folio
> > + * order, but caps it at HPAGE_PMD_NR/2-1 to prevent a collapse feedback loop.
> > + *
> > + * Return: Maximum number of empty PTEs allowed for the collapse operation
> > + */
> > +static unsigned int collapse_max_ptes_none(unsigned int order, bool full_scan)
> > +{
> > +     unsigned int max_ptes_none;
> > +
> > +     /* ignore max_ptes_none limits */
> > +     if (full_scan)
> > +             return HPAGE_PMD_NR - 1;
> > +
> > +     if (order == HPAGE_PMD_ORDER)
> > +             return khugepaged_max_ptes_none;
> > +
> > +     max_ptes_none = min(khugepaged_max_ptes_none, HPAGE_PMD_NR/2 - 1);
>

Hey Lorenzo,

> I mean not to beat a dead horse re: v11 commentary, but I thought we were going
> to implement David's idea re: the new 'eagerness' tunable, and again we're now just
> implementing the capping at HPAGE_PMD_NR/2 - 1 thing again?

I spoke to David and he said to continue forward with this series; the
"eagerness" tunable will take some time, and may require further
considerations/discussion.

>
> I'm still really quite uncomfortable with us silently capping this value.
>
> If we're putting forward theoretical ideas that are to be later built upon, this
> series should be an RFC.
>
> But if we really intend to silently ignore user input the problem is that then
> becomes established uAPI.
>
> I think it's _sensible_ to avoid this mTHP escalation problem, but the issue is
> visibility I think.
>
> I think people are going to find it odd that you set it to something, but then
> get something else.

The alternative solution is to not support max_ptes_none for mTHP
collapse and not allow none/zero pages. This is essentially "capping"
the value too.

>
> As an alternative we could have a new sysfs field:
>
> /sys/kernel/mm/transparent_hugepage/khugepaged/max_mthp_ptes_none
>
> That shows the cap clearly.
>
> In fact, it could be read-only... and just expose it to the user. That reduces
> complexity.

I agree with Baolin here; adding another tunable will only increase
the complexity for our future goals, and also provides needless
insight into the internals when they can not be customized.

Cheers,
-- Nico

>
> We can then bring in eagerness later and have the same situation of
> max_ptes_none being a parameter that exists (plus this additional read-only
> parameter).
>
> > +
> > +     return max_ptes_none >> (HPAGE_PMD_ORDER - order);
> > +
> > +}
> > +
> >  void khugepaged_enter_vma(struct vm_area_struct *vma,
> >                         vm_flags_t vm_flags)
> >  {
> > @@ -546,7 +579,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> >       pte_t *_pte;
> >       int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
> >       const unsigned long nr_pages = 1UL << order;
> > -     int max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
> > +     int max_ptes_none = collapse_max_ptes_none(order, !cc->is_khugepaged);
> >
> >       for (_pte = pte; _pte < pte + nr_pages;
> >            _pte++, addr += PAGE_SIZE) {
> > --
> > 2.51.0
> >
>

Re: [PATCH v12 mm-new 06/15] khugepaged: introduce collapse_max_ptes_none helper function

Posted by Lorenzo Stoakes 3 months, 1 week ago

On Tue, Oct 28, 2025 at 07:36:55AM -0600, Nico Pache wrote:
> On Mon, Oct 27, 2025 at 11:54 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Wed, Oct 22, 2025 at 12:37:08PM -0600, Nico Pache wrote:
> > > The current mechanism for determining mTHP collapse scales the
> > > khugepaged_max_ptes_none value based on the target order. This
> > > introduces an undesirable feedback loop, or "creep", when max_ptes_none
> > > is set to a value greater than HPAGE_PMD_NR / 2.
> > >
> > > With this configuration, a successful collapse to order N will populate
> > > enough pages to satisfy the collapse condition on order N+1 on the next
> > > scan. This leads to unnecessary work and memory churn.
> > >
> > > To fix this issue introduce a helper function that caps the max_ptes_none
> > > to HPAGE_PMD_NR / 2 - 1 (255 on 4k page size). The function also scales
> > > the max_ptes_none number by the (PMD_ORDER - target collapse order).
> > >
> > > The limits can be ignored by passing full_scan=true, this is useful for
> > > madvise_collapse (which ignores limits), or in the case of
> > > collapse_scan_pmd(), allows the full PMD to be scanned when mTHP
> > > collapse is available.
> > >
> > > Signed-off-by: Nico Pache <npache@redhat.com>
> > > ---
> > >  mm/khugepaged.c | 35 ++++++++++++++++++++++++++++++++++-
> > >  1 file changed, 34 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index 4ccebf5dda97..286c3a7afdee 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -459,6 +459,39 @@ void __khugepaged_enter(struct mm_struct *mm)
> > >               wake_up_interruptible(&khugepaged_wait);
> > >  }
> > >
> > > +/**
> > > + * collapse_max_ptes_none - Calculate maximum allowed empty PTEs for collapse
> > > + * @order: The folio order being collapsed to
> > > + * @full_scan: Whether this is a full scan (ignore limits)
> > > + *
> > > + * For madvise-triggered collapses (full_scan=true), all limits are bypassed
> > > + * and allow up to HPAGE_PMD_NR - 1 empty PTEs.
> > > + *
> > > + * For PMD-sized collapses (order == HPAGE_PMD_ORDER), use the configured
> > > + * khugepaged_max_ptes_none value.
> > > + *
> > > + * For mTHP collapses, scale down the max_ptes_none proportionally to the folio
> > > + * order, but caps it at HPAGE_PMD_NR/2-1 to prevent a collapse feedback loop.
> > > + *
> > > + * Return: Maximum number of empty PTEs allowed for the collapse operation
> > > + */
> > > +static unsigned int collapse_max_ptes_none(unsigned int order, bool full_scan)
> > > +{
> > > +     unsigned int max_ptes_none;
> > > +
> > > +     /* ignore max_ptes_none limits */
> > > +     if (full_scan)
> > > +             return HPAGE_PMD_NR - 1;
> > > +
> > > +     if (order == HPAGE_PMD_ORDER)
> > > +             return khugepaged_max_ptes_none;
> > > +
> > > +     max_ptes_none = min(khugepaged_max_ptes_none, HPAGE_PMD_NR/2 - 1);
> >
>
> Hey Lorenzo,
>
> > I mean not to beat a dead horse re: v11 commentary, but I thought we were going
> > to implement David's idea re: the new 'eagerness' tunable, and again we're now just
> > implementing the capping at HPAGE_PMD_NR/2 - 1 thing again?
>
> I spoke to David and he said to continue forward with this series; the
> "eagerness" tunable will take some time, and may require further
> considerations/discussion.

It would be good to communicate this in the patch, I wasn't aware he had said go
ahead with it. Maybe I missed the mail.

Also others might not be aware. When you're explicitly ignoring prior
review from 2 versions ago you really do need to spell out why, at least for
civility's sake.

Apologies if there was communication I've forgotten about/missed. But
either way please can we very explicitly communicate these things.

>
> >
> > I'm still really quite uncomfortable with us silently capping this value.
> >
> > If we're putting forward theoretical ideas that are to be later built upon, this
> > series should be an RFC.
> >
> > But if we really intend to silently ignore user input the problem is that then
> > becomes established uAPI.
> >
> > I think it's _sensible_ to avoid this mTHP escalation problem, but the issue is
> > visibility I think.
> >
> > I think people are going to find it odd that you set it to something, but then
> > get something else.
>
> The alternative solution is to not support max_ptes_none for mTHP
> collapse and not allow none/zero pages. This is essentially "capping"
> the value too.

No that alternative equally _silently_ ignores the user-specified tunable,
which is my objection.

The problem you have here is max_ptes_none _defaults_ to a value that
violates the cap for mTHP (511).

So neither solution is workable.

>
> >
> > As an alternative we could have a new sysfs field:
> >
> > /sys/kernel/mm/transparent_hugepage/khugepaged/max_mthp_ptes_none
> >
> > That shows the cap clearly.
> >
> > In fact, it could be read-only... and just expose it to the user. That reduces
> > complexity.
>
> I agree with Baolin here; adding another tunable will only increase
> the complexity for our future goals, and also provides needless
> insight into the internals when they can not be customized.

We already have needless insight into internals with max_ptes_none which we can
never, ever remove due to uAPI so that ship has sailed I'm afraid.

I don't personally think adding a read-only view of this data really makes
that much worse.

Also if we do go ahead with eagerness, I expect we are going to want to
have different max_ptes_none values for mTHP/non-mTHP.

We _will_ need to convert between max_ptes_none and eagerness in some way
(though when eagerness comes along, we can start having 'detent' values,
that is if a user specifies max_ptes_none of 237 we could change it to 128
for instance) and as a result show eagerness _in terms of_ max_ptes_none.
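
One possible reading of 'detent' values, purely as a sketch: snap the user-supplied number down to the nearest power of two. Nothing in this thread fixes the actual detents, so the rule and the name snap_max_ptes_none() are assumptions made for illustration.

```c
/*
 * Hypothetical detent rule: round a max_ptes_none value down to the
 * largest power of two that does not exceed it (0 stays 0).
 */
static unsigned int snap_max_ptes_none(unsigned int value)
{
	unsigned int detent = 1;

	if (value == 0)
		return 0;

	/* Double while the next power of two still fits within 'value'. */
	while (detent <= value / 2)
		detent *= 2;

	return detent;
}
```

With this rule 237 snaps to 128, matching the example above, and 511 snaps to 256.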

Since we _have_ to do this for uAPI reasons, it doesn't seem really that
harmful or adding to complexity to do the equivalent for a _read-only_
field for mTHP.

AFAIC this patch right now is not upstreamable for the simple reason of
violating user expectation (even if that expectation might be silly) and
_silently_ updating max_ptes_none for mTHP.

So this suggestion was designed to try to get us towards something
upstreamable.

So it's not a case of 'sorry I don't like that we can't do it' + we go
ahead with things as they are, it's a case of - we really need to find a
way to do this not-silently or AFAICT, the series is blocked on this until
this is resolved.

Perhaps we should have discussed 'what to do for v12' more on-list and
could have avoided this ahead of time...

Thanks, Lorenzo

>
> Cheers,
> -- Nico
>
> >
> > We can then bring in eagerness later and have the same situation of
> > max_ptes_none being a parameter that exists (plus this additional read-only
> > parameter).
> >
> > > +
> > > +     return max_ptes_none >> (HPAGE_PMD_ORDER - order);
> > > +
> > > +}
> > > +
> > >  void khugepaged_enter_vma(struct vm_area_struct *vma,
> > >                         vm_flags_t vm_flags)
> > >  {
> > > @@ -546,7 +579,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
> > >       pte_t *_pte;
> > >       int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
> > >       const unsigned long nr_pages = 1UL << order;
> > > -     int max_ptes_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
> > > +     int max_ptes_none = collapse_max_ptes_none(order, !cc->is_khugepaged);
> > >
> > >       for (_pte = pte; _pte < pte + nr_pages;
> > >            _pte++, addr += PAGE_SIZE) {
> > > --
> > > 2.51.0
> > >
> >
>

Re: [PATCH v12 mm-new 06/15] khugepaged: introduce collapse_max_ptes_none helper function

Posted by David Hildenbrand 3 months, 1 week ago

>> Hey Lorenzo,
>>
>>> I mean not to beat a dead horse re: v11 commentary, but I thought we were going
>>> to implement David's idea re: the new 'eagerness' tunable, and again we're now just
>>> implementing the capping at HPAGE_PMD_NR/2 - 1 thing again?
>>
>> I spoke to David and he said to continue forward with this series; the
>> "eagerness" tunable will take some time, and may require further
>> considerations/discussion.
> 
> It would be good to communicate this in the patch, I wasn't aware he had said go
> ahead with it. Maybe I missed the mail.

Just to clarify: yes, I think we should find a way to move forward with 
this series without an eagerness toggle.

That doesn't imply that we'll be using the capping as proposed here (I 
hate it, it's just tricky to work around it for now).

And ideally, we can do that without any temporary tunables, because I'm 
sure it is a problem we can solve internally long-term.

-- 
Cheers

David / dhildenb

Re: [PATCH v12 mm-new 06/15] khugepaged: introduce collapse_max_ptes_none helper function

Posted by Lorenzo Stoakes 3 months, 1 week ago

On Tue, Oct 28, 2025 at 06:49:48PM +0100, David Hildenbrand wrote:
> > > Hey Lorenzo,
> > >
> > > > I mean not to beat a dead horse re: v11 commentary, but I thought we were going
> > > > to implement David's idea re: the new 'eagerness' tunable, and again we're now just
> > > > implementing the capping at HPAGE_PMD_NR/2 - 1 thing again?
> > >
> > > I spoke to David and he said to continue forward with this series; the
> > > "eagerness" tunable will take some time, and may require further
> > > considerations/discussion.
> >
> > It would be good to communicate this in the patch, I wasn't aware he had said go
> > ahead with it. Maybe I missed the mail.
>
> Just to clarify: yes, I think we should find a way to move forward with this
> series without an eagerness toggle.

OK, let's please communicate this clearly in future. Maybe I missed the comms on
that.

>
> That doesn't imply that we'll be using the capping as proposed here (I hate
> it, it's just tricky to work around it for now).

OK well this is what I thought, that you hadn't meant that we should go ahead
with the logic completely unaltered from that which was explicitly pushed back
on in v10 I think.

We obviously need to figure out a way forward on this so let's get that
done as quickly as we can.

>
> And ideally, we can do that without any temporary tunables, because I'm sure
> it is a problem we can solve internally long-term.

I went into great detail replying on the relevant thread about this, let's
have that discussion there for sanity's sake.

>
> --
> Cheers
>
> David / dhildenb
>

Thanks, Lorenzo

Re: [PATCH v12 mm-new 06/15] khugepaged: introduce collapse_max_ptes_none helper function

Posted by David Hildenbrand 3 months, 1 week ago

On 28.10.25 14:36, Nico Pache wrote:
> On Mon, Oct 27, 2025 at 11:54 AM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
>>
>> On Wed, Oct 22, 2025 at 12:37:08PM -0600, Nico Pache wrote:
>>> The current mechanism for determining mTHP collapse scales the
>>> khugepaged_max_ptes_none value based on the target order. This
>>> introduces an undesirable feedback loop, or "creep", when max_ptes_none
>>> is set to a value greater than HPAGE_PMD_NR / 2.
>>>
>>> With this configuration, a successful collapse to order N will populate
>>> enough pages to satisfy the collapse condition on order N+1 on the next
>>> scan. This leads to unnecessary work and memory churn.
>>>
>>> To fix this issue introduce a helper function that caps the max_ptes_none
>>> to HPAGE_PMD_NR / 2 - 1 (255 on 4k page size). The function also scales
>>> the max_ptes_none number by the (PMD_ORDER - target collapse order).
>>>
>>> The limits can be ignored by passing full_scan=true, this is useful for
>>> madvise_collapse (which ignores limits), or in the case of
>>> collapse_scan_pmd(), allows the full PMD to be scanned when mTHP
>>> collapse is available.
>>>
>>> Signed-off-by: Nico Pache <npache@redhat.com>
>>> ---
>>>   mm/khugepaged.c | 35 ++++++++++++++++++++++++++++++++++-
>>>   1 file changed, 34 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index 4ccebf5dda97..286c3a7afdee 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -459,6 +459,39 @@ void __khugepaged_enter(struct mm_struct *mm)
>>>                wake_up_interruptible(&khugepaged_wait);
>>>   }
>>>
>>> +/**
>>> + * collapse_max_ptes_none - Calculate maximum allowed empty PTEs for collapse
>>> + * @order: The folio order being collapsed to
>>> + * @full_scan: Whether this is a full scan (ignore limits)
>>> + *
>>> + * For madvise-triggered collapses (full_scan=true), all limits are bypassed
>>> + * and allow up to HPAGE_PMD_NR - 1 empty PTEs.
>>> + *
>>> + * For PMD-sized collapses (order == HPAGE_PMD_ORDER), use the configured
>>> + * khugepaged_max_ptes_none value.
>>> + *
>>> + * For mTHP collapses, scale down the max_ptes_none proportionally to the folio
>>> + * order, but caps it at HPAGE_PMD_NR/2-1 to prevent a collapse feedback loop.
>>> + *
>>> + * Return: Maximum number of empty PTEs allowed for the collapse operation
>>> + */
>>> +static unsigned int collapse_max_ptes_none(unsigned int order, bool full_scan)
>>> +{
>>> +     unsigned int max_ptes_none;
>>> +
>>> +     /* ignore max_ptes_none limits */
>>> +     if (full_scan)
>>> +             return HPAGE_PMD_NR - 1;
>>> +
>>> +     if (order == HPAGE_PMD_ORDER)
>>> +             return khugepaged_max_ptes_none;
>>> +
>>> +     max_ptes_none = min(khugepaged_max_ptes_none, HPAGE_PMD_NR/2 - 1);
>>
> 
> Hey Lorenzo,
> 
>> I mean not to beat a dead horse re: v11 commentary, but I thought we were going
>> to implement David's idea re: the new 'eagerness' tunable, and again we're now just
>> implementing the capping at HPAGE_PMD_NR/2 - 1 thing again?
> 
> I spoke to David and he said to continue forward with this series; the
> "eagerness" tunable will take some time, and may require further
> considerations/discussion.

Right, after talking to Johannes it got clearer that what we envisioned 
with "eagerness" would not be like swappiness, and we will really have 
to be careful here. I don't know yet when I will have time to look into 
that.

If we want to avoid the implicit capping, I think there are the 
following possible approaches

(1) Tolerate creep for now, maybe warning if the user configures it.
(2) Avoid creep by counting zero-filled pages towards none_or_zero.
(3) Have separate toggles for each THP size. Doesn't quite solve the
     problem, only shifts it.

Anything else?

IIUC, creep is less of a problem when we have the underused shrinker 
enabled: whatever we over-allocated can (unless longterm-pinned etc) get 
reclaimed again.

So maybe having underused-shrinker support for mTHP as well would be a 
solution to tackle (1) later?
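
For anyone wanting to see the creep in numbers, a toy userspace model follows. It is a sketch under stated assumptions, not how khugepaged scans: 4K pages, only mTHP orders 2..8 enabled, populated PTEs sitting contiguously at the start of the range, and one attempt per order from small to large. All names are hypothetical.

```c
#include <stdbool.h>

/* Assumption: 4K base pages, so a PMD maps 512 PTEs. */
#define PMD_ORDER 9u
#define PMD_NR    (1u << PMD_ORDER)

/* Per-order limit; 'capped' selects the PMD_NR/2 - 1 cap from the patch. */
static unsigned int limit_for_order(unsigned int max_ptes_none,
				    unsigned int order, bool capped)
{
	if (capped && max_ptes_none > PMD_NR / 2 - 1)
		max_ptes_none = PMD_NR / 2 - 1;

	return max_ptes_none >> (PMD_ORDER - order);
}

/*
 * Toy model: try each mTHP order in turn. If the empty PTEs in the region
 * are within the limit, "collapse" it, which populates the whole region.
 * Returns the number of populated PTEs once no further collapse succeeds.
 */
static unsigned int simulate_creep(unsigned int max_ptes_none,
				   unsigned int populated, bool capped)
{
	for (unsigned int order = 2; order < PMD_ORDER; order++) {
		unsigned int nr = 1u << order;
		unsigned int none;

		if (populated >= nr)
			continue;

		none = nr - populated;
		if (none > limit_for_order(max_ptes_none, order, capped))
			break;

		populated = nr;	/* collapse fills the region */
	}

	return populated;
}
```

In this model, starting from 3 populated PTEs with max_ptes_none of 511, the uncapped scaling creeps all the way to 256 populated PTEs (order 8), while the capped variant stops at 4 (order 2).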

-- 
Cheers

David / dhildenb

Re: [PATCH v12 mm-new 06/15] khugepaged: introduce collapse_max_ptes_none helper function

Posted by Lorenzo Stoakes 3 months, 1 week ago

On Tue, Oct 28, 2025 at 03:15:26PM +0100, David Hildenbrand wrote:
> On 28.10.25 14:36, Nico Pache wrote:
> > On Mon, Oct 27, 2025 at 11:54 AM Lorenzo Stoakes
> > <lorenzo.stoakes@oracle.com> wrote:
> > >
> > > On Wed, Oct 22, 2025 at 12:37:08PM -0600, Nico Pache wrote:
> > > > The current mechanism for determining mTHP collapse scales the
> > > > khugepaged_max_ptes_none value based on the target order. This
> > > > introduces an undesirable feedback loop, or "creep", when max_ptes_none
> > > > is set to a value greater than HPAGE_PMD_NR / 2.
> > > >
> > > > With this configuration, a successful collapse to order N will populate
> > > > enough pages to satisfy the collapse condition on order N+1 on the next
> > > > scan. This leads to unnecessary work and memory churn.
> > > >
> > > > To fix this issue introduce a helper function that caps the max_ptes_none
> > > > to HPAGE_PMD_NR / 2 - 1 (255 on 4k page size). The function also scales
> > > > the max_ptes_none number by the (PMD_ORDER - target collapse order).
> > > >
> > > > The limits can be ignored by passing full_scan=true, this is useful for
> > > > madvise_collapse (which ignores limits), or in the case of
> > > > collapse_scan_pmd(), allows the full PMD to be scanned when mTHP
> > > > collapse is available.
> > > >
> > > > Signed-off-by: Nico Pache <npache@redhat.com>
> > > > ---
> > > >   mm/khugepaged.c | 35 ++++++++++++++++++++++++++++++++++-
> > > >   1 file changed, 34 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > index 4ccebf5dda97..286c3a7afdee 100644
> > > > --- a/mm/khugepaged.c
> > > > +++ b/mm/khugepaged.c
> > > > @@ -459,6 +459,39 @@ void __khugepaged_enter(struct mm_struct *mm)
> > > >                wake_up_interruptible(&khugepaged_wait);
> > > >   }
> > > >
> > > > +/**
> > > > + * collapse_max_ptes_none - Calculate maximum allowed empty PTEs for collapse
> > > > + * @order: The folio order being collapsed to
> > > > + * @full_scan: Whether this is a full scan (ignore limits)
> > > > + *
> > > > + * For madvise-triggered collapses (full_scan=true), all limits are bypassed
> > > > + * and allow up to HPAGE_PMD_NR - 1 empty PTEs.
> > > > + *
> > > > + * For PMD-sized collapses (order == HPAGE_PMD_ORDER), use the configured
> > > > + * khugepaged_max_ptes_none value.
> > > > + *
> > > > + * For mTHP collapses, scale down the max_ptes_none proportionally to the folio
> > > > + * order, but caps it at HPAGE_PMD_NR/2-1 to prevent a collapse feedback loop.
> > > > + *
> > > > + * Return: Maximum number of empty PTEs allowed for the collapse operation
> > > > + */
> > > > +static unsigned int collapse_max_ptes_none(unsigned int order, bool full_scan)
> > > > +{
> > > > +     unsigned int max_ptes_none;
> > > > +
> > > > +     /* ignore max_ptes_none limits */
> > > > +     if (full_scan)
> > > > +             return HPAGE_PMD_NR - 1;
> > > > +
> > > > +     if (order == HPAGE_PMD_ORDER)
> > > > +             return khugepaged_max_ptes_none;
> > > > +
> > > > +     max_ptes_none = min(khugepaged_max_ptes_none, HPAGE_PMD_NR/2 - 1);
> > >
> >
> > Hey Lorenzo,
> >
> > > I mean not to beat a dead horse re: v11 commentary, but I thought we were going
> > > to implement David's idea re: the new 'eagerness' tunable, and again we're now just
> > > implementing the capping at HPAGE_PMD_NR/2 - 1 thing again?
> >
> > I spoke to David and he said to continue forward with this series; the
> > "eagerness" tunable will take some time, and may require further
> > considerations/discussion.
>
> Right, after talking to Johannes it got clearer that what we envisioned with

I'm not sure that you meant to say go ahead with the series as-is with this
silent capping?

Either way we need better communication of this, because I wasn't aware that was
the plan for one, and it means this patch directly ignores review from 2
versions ago, which needs to be documented _somewhere_ so people aren't confused.

And it would maybe have allowed us to have this conversation ahead of time rather
than now.

> "eagerness" would not be like swappiness, and we will really have to be
> careful here. I don't know yet when I will have time to look into that.

I guess I missed this part of the converastion, what do you mean?

The whole concept is that we have a parameter whose value is _abstracted_ and
which we control what it means.

I'm not sure exactly why that would now be problematic? The fundamental concept
seems sound no? Last I remember of the conversation this was the case.

>
> If we want to avoid the implicit capping, I think there are the following
> possible approaches
>
> (1) Tolerate creep for now, maybe warning if the user configures it.

I mean this seems a viable option if there is pressure to land this series
before we have a viable uAPI for configuring this.

A part of me thinks we shouldn't rush series in for that reason though and
should require that we have a proper control here.

But I guess this approach is the least-worst as it leaves us with the most
options moving forwards.
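
To make the creep in (1) concrete, here is a toy userspace model, not kernel
code. It assumes 4K pages (PMD order 9), a per-order threshold obtained by
shifting max_ptes_none down, and a "collapse if none <= threshold" check.
The names scaled_max_ptes_none() and creep_final_order() are made up for
this illustration.

```c
#include <assert.h>

#define HPAGE_PMD_ORDER 9u                      /* assumption: 4K pages, 2M PMD */
#define HPAGE_PMD_NR    (1u << HPAGE_PMD_ORDER) /* 512 PTEs per PMD */

/* Scale the PMD-level tunable down to a smaller collapse order. */
static unsigned int scaled_max_ptes_none(unsigned int max_ptes_none,
					 unsigned int order)
{
	assert(order <= HPAGE_PMD_ORDER);
	return max_ptes_none >> (HPAGE_PMD_ORDER - order);
}

/*
 * Hypothetical model: one naturally aligned block of 2^start_order PTEs is
 * fully populated and the rest of the PMD range is empty. Each "scan" tries
 * the next order up; a successful collapse populates the whole block at that
 * order. Returns the order at which the growth stops.
 */
static unsigned int creep_final_order(unsigned int max_ptes_none,
				      unsigned int start_order)
{
	unsigned int order = start_order;

	while (order < HPAGE_PMD_ORDER) {
		unsigned int next = order + 1;
		/* Half of the next-order block is still empty. */
		unsigned int none = (1u << next) - (1u << order);

		if (none > scaled_max_ptes_none(max_ptes_none, next))
			break;
		order = next; /* collapsed: the block is now fully populated */
	}
	return order;
}
```

In this model any setting of HPAGE_PMD_NR / 2 or above lets an order-2 block
climb all the way to PMD order over successive scans, while HPAGE_PMD_NR / 2 - 1
stops it immediately, which is where the proposed cap comes from.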

> (2) Avoid creep by counting zero-filled pages towards none_or_zero.

Would this really make all that much difference?

> (3) Have separate toggles for each THP size. Doesn't quite solve the
>     problem, only shifts it.

Yeah I did wonder about this as an alternative solution. But of course it then
makes it vague what the parent value means with respect to the individual levels,
unless we have an 'inherit' mode there too (possible).

It's going to be confusing though as max_ptes_none sits at the root khugepaged/
level and I don't think any other parameter from khugepaged/ is exposed at
individual page size levels.

And of course doing this means we

>
> Anything else?

Err... I mean I'm not sure if you missed it but I suggested an approach in the
sub-thread - exposing mthp_max_ptes_none as a _READ-ONLY_ field at:

/sys/kernel/mm/transparent_hugepage/khugepaged/max_mthp_ptes_none

Then we allow the capping, but simply document that we specify what the capped
value will be here for mTHP.

That struck me as the simplest way of getting this series landed without
necessarily violating any future eagerness which:

a. Must still support khugepaged/max_ptes_none - we aren't getting away from
   this, it's uAPI.

b. Surely must want to do different things for mTHP in eagerness, so if we're
   exposing some PTE value in max_ptes_none doing so in
   khugepaged/mthp_max_ptes_none wouldn't be problematic (note again - it's
   readonly so unlike max_ptes_none we don't have to worry about the other
   direction).

HOWEVER, eagerness might want to change this behaviour per-mTHP size, in
which case perhaps mthp_max_ptes_none would be problematic in that it is some
kind of average.

Then again we could always revert to putting this parameter as in (3) in that
case, ugly but kinda viable.

>
> IIUC, creep is less of a problem when we have the underused shrinker
> enabled: whatever we over-allocated can (unless longterm-pinned etc) get
> reclaimed again.
>
> So maybe having underused-shrinker support for mTHP as well would be a
> solution to tackle (1) later?

How viable is this in the short term?

>
> --
> Cheers
>
> David / dhildenb
>

Another possible solution:

If mthp_max_ptes_none is not workable, we could have a toggle at, e.g.:

/sys/kernel/mm/transparent_hugepage/khugepaged/mthp_cap_collapse_none

As a simple boolean. If switched on then we document that it caps mTHP as
per Nico's suggestion.

That way we avoid the 'silent' issue I have with all this and it's an
explicit setting.
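
As a rough sketch of what such a toggle would gate (userspace C, 4K pages
assumed): the function name mthp_max_ptes_none() and the 'cap' parameter
standing in for the sysfs boolean are illustrative only, nothing here exists
in the series.

```c
#include <assert.h>
#include <stdbool.h>

#define HPAGE_PMD_ORDER 9u                      /* assumption: 4K pages */
#define HPAGE_PMD_NR    (1u << HPAGE_PMD_ORDER)

/*
 * Hypothetical: 'cap' models the proposed mthp_cap_collapse_none boolean.
 * With the cap on, mTHP orders never see more than HPAGE_PMD_NR / 2 - 1
 * before scaling; with it off, the raw tunable is scaled as-is.
 */
static unsigned int mthp_max_ptes_none(unsigned int max_ptes_none,
				       unsigned int order, bool cap)
{
	assert(order <= HPAGE_PMD_ORDER);

	/* PMD-sized collapse always uses the configured value. */
	if (order == HPAGE_PMD_ORDER)
		return max_ptes_none;

	if (cap && max_ptes_none > HPAGE_PMD_NR / 2 - 1)
		max_ptes_none = HPAGE_PMD_NR / 2 - 1;

	return max_ptes_none >> (HPAGE_PMD_ORDER - order);
}
```

With max_ptes_none=511 and order 4, this yields 7 with the cap on and 15 with
it off, so the difference would be visible to, and chosen by, the user.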

Cheers, Lorenzo

Re: [PATCH v12 mm-new 06/15] khugepaged: introduce collapse_max_ptes_none helper function

Posted by David Hildenbrand 3 months, 1 week ago

>>> Hey Lorenzo,
>>>
>>>> I mean not to beat a dead horse re: v11 commentary, but I thought we were going
>>>> to implement David's idea re: the new 'eagerness' tunable, and again we're now just
>>>> implementing the capping at HPAGE_PMD_NR/2 - 1 thing again?
>>>
>>> I spoke to David and he said to continue forward with this series; the
>>> "eagerness" tunable will take some time, and may require further
>>> considerations/discussion.
>>
>> Right, after talking to Johannes it got clearer that what we envisioned with
> 
> I'm not sure that you meant to say go ahead with the series as-is with this
> silent capping?

No, "go ahead" as in "let's find some way forward that works for all and 
is not too crazy".

[...]

>> "eagerness" would not be like swappiness, and we will really have to be
>> careful here. I don't know yet when I will have time to look into that.
> 
> I guess I missed this part of the conversation, what do you mean?

Johannes raised issues with that on the list and afterwards we had an 
offline discussion about some of the details and why something 
unpredictable is not good.

> 
> The whole concept is that we have a parameter whose value is _abstracted_ and
> which we control what it means.
> 
> I'm not sure exactly why that would now be problematic? The fundamental concept
> seems sound no? Last I remember of the conversation this was the case.

The basic idea was to do something abstracted as swappiness. Turns out 
"swappiness" is really something predictable, not something we can 
randomly change how it behaves under the hood.

So we'd have to find something similar for "eagerness", and that's where 
it stops being easy.

> 
>>
>> If we want to avoid the implicit capping, I think there are the following
>> possible approaches
>>
>> (1) Tolerate creep for now, maybe warning if the user configures it.
> 
> I mean this seems a viable option if there is pressure to land this series
> before we have a viable uAPI for configuring this.
> 
> A part of me thinks we shouldn't rush series in for that reason though and
> should require that we have a proper control here.
> 
> But I guess this approach is the least-worst as it leaves us with the most
> options moving forwards.

Yes. There is also the alternative of respecting only 0 / 511 for mTHP 
collapse for now as discussed in the other thread.

> 
>> (2) Avoid creep by counting zero-filled pages towards none_or_zero.
> 
> Would this really make all that much difference?

It solves the creep problem I think, but it's a bit nasty IMHO.

> 
>> (3) Have separate toggles for each THP size. Doesn't quite solve the
>>      problem, only shifts it.
> 
> Yeah I did wonder about this as an alternative solution. But of course it then
> makes it vague what the parent values means in respect of the individual levels,
> unless we have an 'inherit' mode there too (possible).
> 
> It's going to be confusing though as max_ptes_none sits at the root khugepaged/
> level and I don't think any other parameter from khugepaged/ is exposed at
> individual page size levels.
> 
> And of course doing this means we
> 
>>
>> Anything else?
> 
> Err... I mean I'm not sure if you missed it but I suggested an approach in the
> sub-thread - exposing mthp_max_ptes_none as a _READ-ONLY_ field at:
> 
> /sys/kernel/mm/transparent_hugepage/khugepaged/max_mthp_ptes_none
> 
> Then we allow the capping, but simply document that we specify what the capped
> value will be here for mTHP.

I did not have time to read the details on that so far.

It would be one solution forward. I dislike it because I think the whole 
capping is an intermediate thing that can be (and likely must be, when 
considering mTHP underused shrinking I think) solved in the future 
differently. That's why I would prefer adding this only if there is no 
other, simpler, way forward.

> 
> That struck me as the simplest way of getting this series landed without
> necessarily violating any future eagerness which:
> 
> a. Must still support khugepaged/max_ptes_none - we aren't getting away from
>     this, it's uAPI.
> 
> b. Surely must want to do different things for mTHP in eagerness, so if we're
>     exposing some PTE value in max_ptes_none doing so in
>     khugepaged/mthp_max_ptes_none wouldn't be problematic (note again - it's
>     readonly so unlike max_ptes_none we don't have to worry about the other
>     direction).
> 
> HOWEVER, eagerness might want to change this behaviour per-mTHP size, in
> which case perhaps mthp_max_ptes_none would be problematic in that it is some
> kind of average.
> 
> Then again we could always revert to putting this parameter as in (3) in that
> case, ugly but kinda viable.
> 
>>
>> IIUC, creep is less of a problem when we have the underused shrinker
>> enabled: whatever we over-allocated can (unless longterm-pinned etc) get
>> reclaimed again.
>>
>> So maybe having underused-shrinker support for mTHP as well would be a
>> solution to tackle (1) later?
> 
> How viable is this in the short term?

I once started looking into it, but it will require quite some work, 
because the lists will essentially include each and every (m)THP in the 
system ... so I think we will need some redesign.

> 
> Another possible solution:
> 
> If mthp_max_ptes_none is not workable, we could have a toggle at, e.g.:
> 
> /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_cap_collapse_none
> 
> As a simple boolean. If switched on then we document that it caps mTHP as
> per Nico's suggestion.
> 
> That way we avoid the 'silent' issue I have with all this and it's an
> explicit setting.

Right, but it's another toggle I wish we wouldn't need. We could of 
course also make it some compile-time option, but not sure if that's 
really any better.

I'd hope we find an easy way forward that doesn't require new toggles, 
at least for now ...

-- 
Cheers

David / dhildenb

Re: [PATCH v12 mm-new 06/15] khugepaged: introduce collapse_max_ptes_none helper function

Posted by Lorenzo Stoakes 3 months, 1 week ago

On Tue, Oct 28, 2025 at 07:08:38PM +0100, David Hildenbrand wrote:
>
> > > > Hey Lorenzo,
> > > >
> > > > > I mean not to beat a dead horse re: v11 commentary, but I thought we were going
> > > > > to implement David's idea re: the new 'eagerness' tunable, and again we're now just
> > > > > implementing the capping at HPAGE_PMD_NR/2 - 1 thing again?
> > > >
> > > > I spoke to David and he said to continue forward with this series; the
> > > > "eagerness" tunable will take some time, and may require further
> > > > considerations/discussion.
> > >
> > > Right, after talking to Johannes it got clearer that what we envisioned with
> >
> > I'm not sure that you meant to say go ahead with the series as-is with this
> > silent capping?
>
> No, "go ahead" as in "let's find some way forward that works for all and is
> not too crazy".

Right we clearly needed to discuss that further at the time but that's moot now,
we're figuring it out now :)

>
> [...]
>
> > > "eagerness" would not be like swappiness, and we will really have to be
> > > careful here. I don't know yet when I will have time to look into that.
> >
> > I guess I missed this part of the conversation, what do you mean?
>
> Johannes raised issues with that on the list and afterwards we had an
> offline discussion about some of the details and why something unpredictable
> is not good.

Could we get these details on-list so we can discuss them? This doesn't have to
be urgent, but I would like to have a say in this or at least be part of the
conversation please.

>
> >
> > The whole concept is that we have a parameter whose value is _abstracted_ and
> > which we control what it means.
> >
> > I'm not sure exactly why that would now be problematic? The fundamental concept
> > seems sound no? Last I remember of the conversation this was the case.
>
> The basic idea was to do something abstracted as swappiness. Turns out
> "swappiness" is really something predictable, not something we can randomly
> change how it behaves under the hood.
>
> So we'd have to find something similar for "eagerness", and that's where it
> stops being easy.

I think we shouldn't be too stuck on

>
> >
> > >
> > > If we want to avoid the implicit capping, I think there are the following
> > > possible approaches
> > >
> > > (1) Tolerate creep for now, maybe warning if the user configures it.
> >
> > I mean this seems a viable option if there is pressure to land this series
> > before we have a viable uAPI for configuring this.
> >
> > A part of me thinks we shouldn't rush series in for that reason though and
> > should require that we have a proper control here.
> >
> > But I guess this approach is the least-worst as it leaves us with the most
> > options moving forwards.
>
> Yes. There is also the alternative of respecting only 0 / 511 for mTHP
> collapse for now as discussed in the other thread.

Yes I guess let's carry that on over there.

I mean this is why I said it's better to try to keep things in one thread :) but
anyway, we've forked and it can't be helped now.

To be clear that was a criticism of - email development - not you.

It's _extremely easy_ to have this happen because one thread naturally leads to
a broader discussion of a given topic, whereas another has questions from
somebody else about the same topic, to which people reply and then... you have a
fork and it can't be helped.

I guess I'm saying it'd be good if we could say 'ok let's move this to X'.

But that's also broken in its own way, you can't stop people from replying in
the other thread still and yeah. It's a limitation of this model :)

>
> >
> > > (2) Avoid creep by counting zero-filled pages towards none_or_zero.
> >
> > Would this really make all that much difference?
>
> It solves the creep problem I think, but it's a bit nasty IMHO.

Ah because you'd end up with a bunch of zeroed pages from the prior mTHP
collapses, interesting...

Scanning for that does seem a bit nasty though yes...

>
> >
> > > (3) Have separate toggles for each THP size. Doesn't quite solve the
> > >      problem, only shifts it.
> >
> > Yeah I did wonder about this as an alternative solution. But of course it then
> > makes it vague what the parent values means in respect of the individual levels,
> > unless we have an 'inherit' mode there too (possible).
> >
> > It's going to be confusing though as max_ptes_none sits at the root khugepaged/
> > level and I don't think any other parameter from khugepaged/ is exposed at
> > individual page size levels.
> >
> > And of course doing this means we
> >
> > >
> > > Anything else?
> >
> > Err... I mean I'm not sure if you missed it but I suggested an approach in the
> > sub-thread - exposing mthp_max_ptes_none as a _READ-ONLY_ field at:
> >
> > /sys/kernel/mm/transparent_hugepage/khugepaged/max_mthp_ptes_none
> >
> > Then we allow the capping, but simply document that we specify what the capped
> > value will be here for mTHP.
>
> I did not have time to read the details on that so far.

OK. It is a bit nasty, yes. The idea is to find something that allows the
capping to work.

>
> It would be one solution forward. I dislike it because I think the whole
> capping is an intermediate thing that can be (and likely must be, when
> considering mTHP underused shrinking I think) solved in the future
> differently. That's why I would prefer adding this only if there is no
> other, simpler, way forward.

Yes I agree that if we could avoid it it'd be great.

Really I proposed this solution on the basis that we were somehow ok with the
capping.

If we can avoid that it'd be ideal as it reduces complexity and 'unexpected'
behaviour.

We'll clarify on the other thread, but the 511/0 was compelling to me before as
a simplification, and if we can have a straightforward model of how mTHP
collapse across none/zero page PTEs behaves this is ideal.

The only question is w.r.t. warnings etc. but we can handle details there.

>
> >
> > That struck me as the simplest way of getting this series landed without
> > necessarily violating any future eagerness which:
> >
> > a. Must still support khugepaged/max_ptes_none - we aren't getting away from
> >     this, it's uAPI.
> >
> > b. Surely must want to do different things for mTHP in eagerness, so if we're
> >     exposing some PTE value in max_ptes_none doing so in
> >     khugepaged/mthp_max_ptes_none wouldn't be problematic (note again - it's
> >     readonly so unlike max_ptes_none we don't have to worry about the other
> >     direction).
> >
> > HOWEVER, eagerness might want to change this behaviour per-mTHP size, in
> > which case perhaps mthp_max_ptes_none would be problematic in that it is some
> > kind of average.
> >
> > Then again we could always revert to putting this parameter as in (3) in that
> > case, ugly but kinda viable.
> >
> > >
> > > IIUC, creep is less of a problem when we have the underused shrinker
> > > enabled: whatever we over-allocated can (unless longterm-pinned etc) get
> > > reclaimed again.
> > >
> > > So maybe having underused-shrinker support for mTHP as well would be a
> > > solution to tackle (1) later?
> >
> > How viable is this in the short term?
>
> I once started looking into it, but it will require quite some work, because
> the lists will essentially include each and every (m)THP in the system ...
> so I think we will need some redesign.

Ack.

This aligns with non-0/511 settings being non-functional for mTHP atm anyway.

>
> >
> > Another possible solution:
> >
> > If mthp_max_ptes_none is not workable, we could have a toggle at, e.g.:
> >
> > /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_cap_collapse_none
> >
> > As a simple boolean. If switched on then we document that it caps mTHP as
> > per Nico's suggestion.
> >
> > That way we avoid the 'silent' issue I have with all this and it's an
> > explicit setting.
>
> Right, but it's another toggle I wish we wouldn't need. We could of course
> also make it some compile-time option, but not sure if that's really any
> better.
>
> I'd hope we find an easy way forward that doesn't require new toggles, at
> least for now ...

Right, well I agree if we can make this 0/511 thing work, let's do that.

Toggles are just 'least worst' workarounds on the assumption of the need for capping.

>
> --
> Cheers
>
> David / dhildenb
>

Thanks, Lorenzo

Re: [PATCH v12 mm-new 06/15] khugepaged: introduce collapse_max_ptes_none helper function

Posted by David Hildenbrand 3 months, 1 week ago


>>>> "eagerness" would not be like swappiness, and we will really have to be
>>>> careful here. I don't know yet when I will have time to look into that.
>>>
>>> I guess I missed this part of the conversation, what do you mean?
>>
>> Johannes raised issues with that on the list and afterwards we had an
>> offline discussion about some of the details and why something unpredictable
>> is not good.
> 
> Could we get these details on-list so we can discuss them? This doesn't have to
> be urgent, but I would like to have a say in this or at least be part of the
> conversation please.

Sorry, I only now found time to reply on this point. Johannes raised the
point in [1], and afterwards we went into a bit of detail in an off-list
discussion.

In essence, I think he is right that this is something we have to be very
careful about. So it turned out to be something that will take a lot more
time+effort on my side than I originally thought, making it infeasible in
the short term given how far I already lag behind on so many other
things.

So I concluded that it's probably best to have such an effort be
independent of this series. And in some way it is either way, because 
max_ptes_none is just a horrible interface given the values are 
architecture dependent.
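
A small userspace illustration of that architecture dependence. It assumes
8-byte page table entries (as on x86-64 and arm64), so that the PMD order is
page_shift - 3; the helper names are invented for this example.

```c
#include <assert.h>

/*
 * Assumption: 8-byte page table entries, so one page-table page holds
 * PAGE_SIZE / 8 entries and the PMD order is page_shift - 3.
 */
static unsigned int pmd_order(unsigned int page_shift)
{
	assert(page_shift >= 12);
	return page_shift - 3;
}

/* Number of base pages (PTEs) covered by one PMD. */
static unsigned int hpage_pmd_nr(unsigned int page_shift)
{
	return 1u << pmd_order(page_shift);
}

/* Percentage (rounded down) of a PMD range that max_ptes_none lets be empty. */
static unsigned int empty_percent(unsigned int max_ptes_none,
				  unsigned int page_shift)
{
	return max_ptes_none * 100u / hpage_pmd_nr(page_shift);
}
```

Under these assumptions the same value 511 means "almost everything may be
empty" with 4K pages (512 PTEs per PMD), but only about a quarter with 16K
pages (2048) and roughly 6% with 64K pages (8192).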

I'll be happy if we can focus in this series on the bare minimum initial 
support, and avoid any magic (scaling / capping) as it all turned out to 
be much more tricky (interaction with the deferred shrinker ...) than 
most of us initially thought.

But I think we're already on the same page here, just wanted to share a
few more details on the max_ptes_none vs. eagerness idea.

[1] https://lkml.kernel.org/r/20250915134359.GA827803@cmpxchg.org

-- 
Cheers

David / dhildenb

Re: [PATCH v12 mm-new 06/15] khugepaged: introduce collapse_max_ptes_none helper function

Posted by Nico Pache 3 months, 1 week ago

On Tue, Oct 28, 2025 at 1:00 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Tue, Oct 28, 2025 at 07:08:38PM +0100, David Hildenbrand wrote:
> >
> > > > > Hey Lorenzo,
> > > > >
> > > > > > I mean not to beat a dead horse re: v11 commentary, but I thought we were going
> > > > > > to implement David's idea re: the new 'eagerness' tunable, and again we're now just
> > > > > > implementing the capping at HPAGE_PMD_NR/2 - 1 thing again?
> > > > >
> > > > > I spoke to David and he said to continue forward with this series; the
> > > > > "eagerness" tunable will take some time, and may require further
> > > > > considerations/discussion.
> > > >
> > > > Right, after talking to Johannes it got clearer that what we envisioned with
> > >
> > > I'm not sure that you meant to say go ahead with the series as-is with this
> > > silent capping?
> >
> > No, "go ahead" as in "let's find some way forward that works for all and is
> > not too crazy".
>
> Right we clearly needed to discuss that further at the time but that's moot now,
> we're figuring it out now :)
>
> >
> > [...]
> >
> > > > "eagerness" would not be like swappiness, and we will really have to be
> > > > careful here. I don't know yet when I will have time to look into that.
> > >
> > > I guess I missed this part of the conversation, what do you mean?
> >
> > Johannes raised issues with that on the list and afterwards we had an
> > offline discussion about some of the details and why something unpredictable
> > is not good.
>
> Could we get these details on-list so we can discuss them? This doesn't have to
> be urgent, but I would like to have a say in this or at least be part of the
> conversation please.
>
> >
> > >
> > > The whole concept is that we have a parameter whose value is _abstracted_ and
> > > which we control what it means.
> > >
> > > I'm not sure exactly why that would now be problematic? The fundamental concept
> > > seems sound no? Last I remember of the conversation this was the case.
> >
> > The basic idea was to do something abstracted as swappiness. Turns out
> > "swappiness" is really something predictable, not something we can randomly
> > change how it behaves under the hood.
> >
> > So we'd have to find something similar for "eagerness", and that's where it
> > stops being easy.
>
> I think we shouldn't be too stuck on
>
> >
> > >
> > > >
> > > > If we want to avoid the implicit capping, I think there are the following
> > > > possible approaches
> > > >
> > > > (1) Tolerate creep for now, maybe warning if the user configures it.
> > >
> > > I mean this seems a viable option if there is pressure to land this series
> > > before we have a viable uAPI for configuring this.
> > >
> > > A part of me thinks we shouldn't rush series in for that reason though and
> > > should require that we have a proper control here.
> > >
> > > But I guess this approach is the least-worst as it leaves us with the most
> > > options moving forwards.
> >
> > Yes. There is also the alternative of respecting only 0 / 511 for mTHP
> > collapse for now as discussed in the other thread.
>
> Yes I guess let's carry that on over there.
>
> I mean this is why I said it's better to try to keep things in one thread :) but
> anyway, we've forked and can't be helped now.
>
> To be clear that was a criticism of - email development - not you.
>
> It's _extremely easy_ to have this happen because one thread naturally leads to
> a broader discussion of a given topic, whereas another has questions from
> somebody else about the same topic, to which people reply and then... you have a
> fork and it can't be helped.
>
> I guess I'm saying it'd be good if we could say 'ok let's move this to X'.
>
> But that's also broken in its own way, you can't stop people from replying in
> the other thread still and yeah. It's a limitation of this model :)
>
> >
> > >
> > > > (2) Avoid creep by counting zero-filled pages towards none_or_zero.
> > >
> > > Would this really make all that much difference?
> >
> > It solves the creep problem I think, but it's a bit nasty IMHO.
>
> Ah because you'd end up with a bunch of zeroed pages from the prior mTHP
> collapses, interesting...
>
> Scanning for that does seem a bit nasty though yes...
>
> >
> > >
> > > > (3) Have separate toggles for each THP size. Doesn't quite solve the
> > > >      problem, only shifts it.
> > >
> > > Yeah I did wonder about this as an alternative solution. But of course it then
> > > makes it vague what the parent values means in respect of the individual levels,
> > > unless we have an 'inherit' mode there too (possible).
> > >
> > > It's going to be confusing though as max_ptes_none sits at the root khugepaged/
> > > level and I don't think any other parameter from khugepaged/ is exposed at
> > > individual page size levels.
> > >
> > > And of course doing this means we
> > >
> > > >
> > > > Anything else?
> > >
> > > Err... I mean I'm not sure if you missed it but I suggested an approach in the
> > > sub-thread - exposing mthp_max_ptes_none as a _READ-ONLY_ field at:
> > >
> > > /sys/kernel/mm/transparent_hugepage/khugepaged/max_mthp_ptes_none
> > >
> > > Then we allow the capping, but simply document that we specify what the capped
> > > value will be here for mTHP.
> >
> > I did not have time to read the details on that so far.
>
> OK. It is a bit nasty, yes. The idea is to find something that allows the
> capping to work.
>
> >
> > It would be one solution forward. I dislike it because I think the whole
> > capping is an intermediate thing that can be (and likely must be, when
> > considering mTHP underused shrinking I think) solved in the future
> > differently. That's why I would prefer adding this only if there is no
> > other, simpler, way forward.
>
> Yes I agree that if we could avoid it it'd be great.
>
> Really I proposed this solution on the basis that we were somehow ok with the
> capping.
>
> If we can avoid that'd be ideal as it reduces complexity and 'unexpected'
> behaviour.
>
> We'll clarify on the other thread, but the 511/0 was compelling to me before as
> a simplification, and if we can have a straightforward model of how mTHP
> collapse across none/zero page PTEs behaves this is ideal.
>
> The only question is w.r.t. warnings etc. but we can handle details there.
>
> >
> > >
> > > That struck me as the simplest way of getting this series landed without
> > > necessarily violating any future eagerness which:
> > >
> > > a. Must still support khugepaged/max_ptes_none - we aren't getting away from
> > >     this, it's uAPI.
> > >
> > > b. Surely must want to do different things for mTHP in eagerness, so if we're
> > >     exposing some PTE value in max_ptes_none doing so in
> > >     khugepaged/mthp_max_ptes_none wouldn't be problematic (note again - it's
> > >     readonly so unlike max_ptes_none we don't have to worry about the other
> > >     direction).
> > >
> > > HOWEVER, eagerness might want to change this behaviour per-mTHP size, in
> > > which case perhaps mthp_max_ptes_none would be problematic in that it is some
> > > kind of average.
> > >
> > > Then again we could always revert to putting this parameter as in (3) in that
> > > case, ugly but kinda viable.
> > >
> > > >
> > > > IIUC, creep is less of a problem when we have the underused shrinker
> > > > enabled: whatever we over-allocated can (unless longterm-pinned etc) get
> > > > reclaimed again.
> > > >
> > > > So maybe having underused-shrinker support for mTHP as well would be a
> > > > solution to tackle (1) later?
> > >
> > > How viable is this in the short term?
> >
> > I once started looking into it, but it will require quite some work, because
> > the lists will essentially include each and every (m)THP in the system ...
> > so I think we will need some redesign.
>
> Ack.
>
> This aligns with non-0/511 settings being non-functional for mTHP atm anyway.
>
> >
> > >
> > > Another possible solution:
> > >
> > > If mthp_max_ptes_none is not workable, we could have a toggle at, e.g.:
> > >
> > > /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_cap_collapse_none
> > >
> > > As a simple boolean. If switched on then we document that it caps mTHP as
> > > per Nico's suggestion.
> > >
> > > That way we avoid the 'silent' issue I have with all this and it's an
> > > explicit setting.
> >
> > Right, but it's another toggle I wish we wouldn't need. We could of course
> > also make it some compile-time option, but not sure if that's really any
> > better.
> >
> > I'd hope we find an easy way forward that doesn't require new toggles, at
> > least for now ...
>
> Right, well I agree if we can make this 0/511 thing work, let's do that.

Ok, great, some consensus! I will go ahead with that solution.

Just to make sure we are all on the same page:

Will the max_ptes_none value be treated as 0 for anything other than PMD
collapse, except when it is 511? Or will max_ptes_none only work for mTHP
collapse when it is 0?

static unsigned int collapse_max_ptes_none(unsigned int order, bool full_scan)
{
	/* ignore max_ptes_none limits */
	if (full_scan)
		return HPAGE_PMD_NR - 1;

	if (order == HPAGE_PMD_ORDER)
		return khugepaged_max_ptes_none;

	/* mTHP: anything other than 511 is treated as 0 */
	if (khugepaged_max_ptes_none != HPAGE_PMD_NR - 1)
		return 0;

	return khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
}

Here's the implementation for the first approach, looks like Baolin
was able to catch up and beat me to the other solution while I was
mulling over the thread lol
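
For anyone who wants to poke at the values, a userspace rendering of the same
logic follows. It is a sketch only: 4K pages are assumed, the tunable is
passed in as a parameter instead of being the kernel global, the final shift
reads the tunable directly, and the _sketch suffix marks the function as not
being the kernel helper.

```c
#include <assert.h>
#include <stdbool.h>

#define HPAGE_PMD_ORDER 9u                      /* assumption: 4K pages */
#define HPAGE_PMD_NR    (1u << HPAGE_PMD_ORDER)

/* 'tunable' stands in for khugepaged_max_ptes_none. */
static unsigned int collapse_max_ptes_none_sketch(unsigned int tunable,
						  unsigned int order,
						  bool full_scan)
{
	assert(order <= HPAGE_PMD_ORDER);

	/* madvise-style full scan: ignore the limit entirely */
	if (full_scan)
		return HPAGE_PMD_NR - 1;

	/* PMD-sized collapse: use the configured value unchanged */
	if (order == HPAGE_PMD_ORDER)
		return tunable;

	/* mTHP: anything other than 511 behaves like 0 */
	if (tunable != HPAGE_PMD_NR - 1)
		return 0;

	/* mTHP with 511: scale to "all but one PTE" of the target order */
	return tunable >> (HPAGE_PMD_ORDER - order);
}
```

With tunable=511 this gives 15 for order 4 and 3 for order 2; with an
intermediate value such as 300 every mTHP order gets 0 while the PMD order
still gets 300.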

Cheers,
-- Nico


>
> Toggles are just 'least worst' workarounds on the assumption of the need for capping.
>
> >
> > --
> > Cheers
> >
> > David / dhildenb
> >
>
> Thanks, Lorenzo
>

Re: [PATCH v12 mm-new 06/15] khugepaged: introduce collapse_max_ptes_none helper function

Posted by Lorenzo Stoakes 3 months, 1 week ago

On Tue, Oct 28, 2025 at 08:47:12PM -0600, Nico Pache wrote:
> On Tue, Oct 28, 2025 at 1:00 PM Lorenzo Stoakes
> > Right, well I agree if we can make this 0/511 thing work, let's do that.
>
> Ok, great, some consensus! I will go ahead with that solution.

:) awesome.

>
> Just to make sure we are all on the same page,

I am still stabilising my understanding of the creep issue, see the thread
where David kindly + patiently goes into detail. I think I now have a broad
understanding of this (before examining the algorithm itself).

>
> the max_ptes_none value will be treated as 0 for anything other than
> PMD collapse, or in the case of 511. Or will the max_ptes_none only
> work for mTHP collapse when it is 0.

511 implies always collapse zero/none, 0 implies never, as I understand it.

>
> static unsigned int collapse_max_ptes_none(unsigned int order, bool full_scan)
> {
> 	/* ignore max_ptes_none limits */
> 	if (full_scan)
> 		return HPAGE_PMD_NR - 1;
>
> 	if (order == HPAGE_PMD_ORDER)
> 		return khugepaged_max_ptes_none;
>
> 	/* mTHP: anything other than 511 is treated as 0 */
> 	if (khugepaged_max_ptes_none != HPAGE_PMD_NR - 1)
> 		return 0;
>
> 	return khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
> }
>
> Here's the implementation for the first approach, looks like Baolin
> was able to catch up and beat me to the other solution while I was
> mulling over the thread lol

Broadly looks similar to Baolin's, I made some suggestions over there
though!

>
> Cheers,
> -- Nico

Thanks, Lorenzo

Re: [PATCH v12 mm-new 06/15] khugepaged: introduce collapse_max_ptes_none helper function

Posted by Nico Pache 3 months, 1 week ago

On Wed, Oct 29, 2025 at 12:59 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Tue, Oct 28, 2025 at 08:47:12PM -0600, Nico Pache wrote:
> > On Tue, Oct 28, 2025 at 1:00 PM Lorenzo Stoakes
> > > Right, well I agree if we can make this 0/511 thing work, let's do that.
> >
> > Ok, great, some consensus! I will go ahead with that solution.
>
> :) awesome.
>
> >
> > Just to make sure we are all on the same page,
>
> I am still stabilising my understanding of the creep issue, see the thread
> where David kindly + patiently goes in detail, I think I am at a
> (pre-examining algorithm itself) broad understanding of this.

I added some details of the creep issue in my other replies, hopefully
that also helps!

>
> >
> > the max_ptes_none value will be treated as 0 for anything other than
> > PMD collapse, or in the case of 511. Or will the max_ptes_none only
> > work for mTHP collapse when it is 0.
>
> 511 implies always collapse zero/none, 0 implies never, as I understand it.

0 implies only collapse if a given mTHP size is fully occupied by
present PTEs. Since we start at PMD and work our way down, we will
always end up with a PMD range of fully occupied mTHPs, potentially of
all different sizes.
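
To make the max_ptes_none == 0 case concrete, here is a small userspace model
of the top-down walk described above. It is only a sketch under stated
assumptions: PMD_ORDER, MIN_ORDER, count_none() and scan_range() are
hypothetical names for illustration, not khugepaged code, and the real scan
tracks far more state than one bool per PTE.

```c
#include <stdbool.h>

#define PMD_ORDER 9			/* assumed: 4K pages, 512 PTEs per PMD */
#define PMD_NR    (1u << PMD_ORDER)
#define MIN_ORDER 2			/* assumed smallest mTHP order in this model */

/* Count none/zero PTEs in [start, start + (1 << order)). */
static unsigned int count_none(const bool *present, unsigned int start,
			       unsigned int order)
{
	unsigned int i, none = 0;

	for (i = 0; i < (1u << order); i++)
		if (!present[start + i])
			none++;
	return none;
}

/*
 * Model of max_ptes_none == 0: a range is collapsed only when every PTE in
 * it is present; otherwise it is split in half and each half is retried at
 * the next lower order, down to MIN_ORDER.
 * collapsed[order] counts how many ranges were collapsed at each order.
 */
static void scan_range(const bool *present, unsigned int start,
		       unsigned int order, unsigned int *collapsed)
{
	if (count_none(present, start, order) == 0) {
		collapsed[order]++;
		return;
	}
	if (order == MIN_ORDER)
		return;
	scan_range(present, start, order - 1, collapsed);
	scan_range(present, start + (1u << (order - 1)), order - 1, collapsed);
}
```

With the first 320 PTEs present and the rest none, this model collapses one
order-8 range (PTEs 0 to 255) and one order-6 range (PTEs 256 to 319), and
leaves the remainder alone.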

>
> >
> > static unsigned int collapse_max_ptes_none(unsigned int order, bool full_scan)
> > {
> > 	/* ignore max_ptes_none limits */
> > 	if (full_scan)
> > 		return HPAGE_PMD_NR - 1;
> >
> > 	if (order == HPAGE_PMD_ORDER)
> > 		return khugepaged_max_ptes_none;
> >
> > 	if (khugepaged_max_ptes_none != HPAGE_PMD_NR - 1)
> > 		return 0;
> >
> > 	return khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
> > }
> >
> > Here's the implementation for the first approach, looks like Baolin
> > was able to catch up and beat me to the other solution while I was
> > mulling over the thread lol
>
> Broadly looks similar to Baolin's, I made some suggestions over there
> though!

Thanks! They are both based on my current collapse_max_ptes_none! Just
a slight difference in behavior surrounding the two suggested
solutions by David.

I will still have to implement the logic for not attempting mTHP
collapses if it is any intermediate value (i.e. the function returns
-EINVAL).
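
One detail worth spelling out: the snippet quoted above is declared as
returning unsigned int, so a -EINVAL return would come back as a very large
positive limit rather than as an error. A minimal sketch of the pitfall
follows; within_limit_unsigned() and within_limit_signed() are hypothetical
caller-side checks made up for illustration, not functions from the series.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical caller check if the helper returned unsigned int. */
static bool within_limit_unsigned(unsigned int nr_none, unsigned int max_none)
{
	/* A wrapped -EINVAL is huge, so every nr_none passes this test. */
	return nr_none <= max_none;
}

/* Hypothetical caller check if the helper returned int. */
static bool within_limit_signed(unsigned int nr_none, int max_none)
{
	/* A negative value means "do not attempt mTHP collapse". */
	if (max_none < 0)
		return false;
	return nr_none <= (unsigned int)max_none;
}
```

So the helper would need a signed return type, and callers would need to
check for a negative value before comparing against the none/zero count.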

-- Nico

>
> >
> > Cheers,
> > -- Nico
>
> Thanks, Lorenzo
>

Re: [PATCH v12 mm-new 06/15] khugepaged: introduce collapse_max_ptes_none helper function

Posted by Lorenzo Stoakes 3 months, 1 week ago

On Wed, Oct 29, 2025 at 03:23:27PM -0600, Nico Pache wrote:
> On Wed, Oct 29, 2025 at 12:59 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Tue, Oct 28, 2025 at 08:47:12PM -0600, Nico Pache wrote:
> > > On Tue, Oct 28, 2025 at 1:00 PM Lorenzo Stoakes
> > > > Right, well I agree if we can make this 0/511 thing work, let's do that.
> > >
> > > Ok, great, some consensus! I will go ahead with that solution.
> >
> > :) awesome.
> >
> > >
> > > Just to make sure we are all on the same page,
> >
> > I am still stabilising my understanding of the creep issue, see the thread
> > where David kindly + patiently goes in detail, I think I am at a
> > (pre-examining algorithm itself) broad understanding of this.
>
> I added some details of the creep issue in my other replies, hopefully
> that also helps!
>
> >
> > >
> > > the max_ptes_none value will be treated as 0 for anything other than
> > > PMD collapse, or in the case of 511. Or will the max_ptes_none only
> > > work for mTHP collapse when it is 0.
> >
> > 511 implies always collapse zero/none, 0 implies never, as I understand it.
>
> 0 implies only collapse if a given mTHP size is fully occupied by
> present PTEs. Since we start at PMD and work our way down, we will
> always end up with a PMD range of fully occupied mTHPs, potentially of
> all different sizes.

Yeah this was my understanding, I mean terminology is tricky here (+ I am
probably not being entirely clear tbh), so I mean less so '0 means no
collapse' but rather '0 means no collapse of zero/none' but of course can
allow for collapse of present PTEs (within the same VMA).


>
> >
> > >
> > > static unsigned int collapse_max_ptes_none(unsigned int order, bool full_scan)
> > > {
> > > 	/* ignore max_ptes_none limits */
> > > 	if (full_scan)
> > > 		return HPAGE_PMD_NR - 1;
> > >
> > > 	if (order == HPAGE_PMD_ORDER)
> > > 		return khugepaged_max_ptes_none;
> > >
> > > 	if (khugepaged_max_ptes_none != HPAGE_PMD_NR - 1)
> > > 		return 0;
> > >
> > > 	return khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
> > > }
> > >
> > > Here's the implementation for the first approach, looks like Baolin
> > > was able to catch up and beat me to the other solution while I was
> > > mulling over the thread lol
> >
> > Broadly looks similar to Baolin's, I made some suggestions over there
> > though!
>
> Thanks! They are both based on my current collapse_max_ptes_none! Just
> a slight difference in behavior surrounding the two suggested
> solutions by David.

Yes which is convenient as it's less delta for you!

>
> I will still have to implement the logic for not attempting mTHP
> collapses if it is any intermediate value (i.e. the function returns
> -EINVAL).

Ack

>
> -- Nico
>
> >
> > >
> > > Cheers,
> > > -- Nico
> >
> > Thanks, Lorenzo
> >
>

Cheers, Lorenzo

Re: [PATCH v12 mm-new 06/15] khugepaged: introduce collapse_max_ptes_none helper function

Posted by Baolin Wang 3 months, 1 week ago


On 2025/10/29 02:59, Lorenzo Stoakes wrote:
> On Tue, Oct 28, 2025 at 07:08:38PM +0100, David Hildenbrand wrote:
>>
>>>>> Hey Lorenzo,
>>>>>
>>>>>> I mean not to beat a dead horse re: v11 commentary, but I thought we were going
>>>>>> to implement David's idea re: the new 'eagerness' tunable, and again we're now just
>>>>>> implementing the capping at HPAGE_PMD_NR/2 - 1 thing again?
>>>>>
>>>>> I spoke to David and he said to continue forward with this series; the
>>>>> "eagerness" tunable will take some time, and may require further
>>>>> considerations/discussion.
>>>>
>>>> Right, after talking to Johannes it got clearer that what we envisioned with
>>>
>>> I'm not sure that you meant to say go ahead with the series as-is with this
>>> silent capping?
>>
>> No, "go ahead" as in "let's find some way forward that works for all and is
>> not too crazy".
> 
> Right we clearly needed to discuss that further at the time but that's moot now,
> we're figuring it out now :)
> 
>>
>> [...]
>>
>>>> "eagerness" would not be like swappiness, and we will really have to be
>>>> careful here. I don't know yet when I will have time to look into that.
>>>
>>> I guess I missed this part of the conversation, what do you mean?
>>
>> Johannes raised issues with that on the list and afterwards we had an
>> offline discussion about some of the details and why something unpredictable
>> is not good.
> 
> Could we get these details on-list so we can discuss them? This doesn't have to
> be urgent, but I would like to have a say in this or at least be part of the
> conversation please.
> 
>>
>>>
>>> The whole concept is that we have a parameter whose value is _abstracted_ and
>>> which we control what it means.
>>>
>>> I'm not sure exactly why that would now be problematic? The fundamental concept
>>> seems sound no? Last I remember of the conversation this was the case.
>>
>> The basic idea was to do something abstracted as swappiness. Turns out
>> "swappiness" is really something predictable, not something we can randomly
>> change how it behaves under the hood.
>>
>> So we'd have to find something similar for "eagerness", and that's where it
>> stops being easy.
> 
> I think we shouldn't be too stuck on
> 
>>
>>>
>>>>
>>>> If we want to avoid the implicit capping, I think there are the following
>>>> possible approaches
>>>>
>>>> (1) Tolerate creep for now, maybe warning if the user configures it.
>>>
>>> I mean this seems a viable option if there is pressure to land this series
>>> before we have a viable uAPI for configuring this.
>>>
>>> A part of me thinks we shouldn't rush series in for that reason though and
>>> should require that we have a proper control here.
>>>
>>> But I guess this approach is the least-worst as it leaves us with the most
>>> options moving forwards.
>>
>> Yes. There is also the alternative of respecting only 0 / 511 for mTHP
>> collapse for now as discussed in the other thread.
> 
> Yes I guess let's carry that on over there.
> 
> I mean this is why I said it's better to try to keep things in one thread :) but
> anyway, we've forked and can't be helped now.
> 
> To be clear that was a criticism of - email development - not you.
> 
> It's _extremely easy_ to have this happen because one thread naturally leads to
> a broader discussion of a given topic, whereas another has questions from
> somebody else about the same topic, to which people reply and then... you have a
> fork and it can't be helped.
> 
> I guess I'm saying it'd be good if we could say 'ok let's move this to X'.
> 
> But that's also broken in its own way, you can't stop people from replying in
> the other thread still and yeah. It's a limitation of this model :)
> 
>>
>>>
>>>> (2) Avoid creep by counting zero-filled pages towards none_or_zero.
>>>
>>> Would this really make all that much difference?
>>
>> It solves the creep problem I think, but it's a bit nasty IMHO.
> 
> Ah because you'd end up with a bunch of zeroed pages from the prior mTHP
> collapses, interesting...
> 
> Scanning for that does seem a bit nasty though yes...
> 
>>
>>>
>>>> (3) Have separate toggles for each THP size. Doesn't quite solve the
>>>>       problem, only shifts it.
>>>
>>> Yeah I did wonder about this as an alternative solution. But of course it then
>>> makes it vague what the parent values means in respect of the individual levels,
>>> unless we have an 'inherit' mode there too (possible).
>>>
>>> It's going to be confusing though as max_ptes_none sits at the root khugepaged/
>>> level and I don't think any other parameter from khugepaged/ is exposed at
>>> individual page size levels.
>>>
>>> And of course doing this means we
>>>
>>>>
>>>> Anything else?
>>>
>>> Err... I mean I'm not sure if you missed it but I suggested an approach in the
>>> sub-thread - exposing mthp_max_ptes_none as a _READ-ONLY_ field at:
>>>
>>> /sys/kernel/mm/transparent_hugepage/khugepaged/max_mthp_ptes_none
>>>
>>> Then we allow the capping, but simply document that we specify what the capped
>>> value will be here for mTHP.
>>
>> I did not have time to read the details on that so far.
> 
> OK. It is a bit nasty, yes. The idea is to find something that allows the
> capping to work.
> 
>>
>> It would be one solution forward. I dislike it because I think the whole
>> capping is an intermediate thing that can be (and likely must be, when
>> considering mTHP underused shrinking I think) solved in the future
>> differently. That's why I would prefer adding this only if there is no
>> other, simpler, way forward.
> 
> Yes I agree that if we could avoid it it'd be great.
> 
> Really I proposed this solution on the basis that we were somehow ok with the
> capping.
> 
> If we can avoid that'd be ideal as it reduces complexity and 'unexpected'
> behaviour.
> 
> We'll clarify on the other thread, but the 511/0 was compelling to me before as
> a simplification, and if we can have a straightforward model of how mTHP
> collapse across none/zero page PTEs behaves this is ideal.
> 
> The only question is w.r.t. warnings etc. but we can handle details there.
> 
>>
>>>
>>> That struck me as the simplest way of getting this series landed without
>>> necessarily violating any future eagerness which:
>>>
>>> a. Must still support khugepaged/max_ptes_none - we aren't getting away from
>>>      this, it's uAPI.
>>>
>>> b. Surely must want to do different things for mTHP in eagerness, so if we're
>>>      exposing some PTE value in max_ptes_none doing so in
>>>      khugepaged/mthp_max_ptes_none wouldn't be problematic (note again - it's
>>>      readonly so unlike max_ptes_none we don't have to worry about the other
>>>      direction).
>>>
>>> HOWEVER, eagerness might want to change this behaviour per-mTHP size, in
>>> which case perhaps mthp_max_ptes_none would be problematic in that it is some
>>> kind of average.
>>>
>>> Then again we could always revert to putting this parameter as in (3) in that
>>> case, ugly but kinda viable.
>>>
>>>>
>>>> IIUC, creep is less of a problem when we have the underused shrinker
>>>> enabled: whatever we over-allocated can (unless longterm-pinned etc) get
>>>> reclaimed again.
>>>>
>>>> So maybe having underused-shrinker support for mTHP as well would be a
>>>> solution to tackle (1) later?
>>>
>>> How viable is this in the short term?
>>
>> I once started looking into it, but it will require quite some work, because
>> the lists will essentially include each and every (m)THP in the system ...
>> so I think we will need some redesign.
> 
> Ack.
> 
> This aligns with non-0/511 settings being non-functional for mTHP atm anyway.
> 
>>
>>>
>>> Another possible solution:
>>>
>>> If mthp_max_ptes_none is not workable, we could have a toggle at, e.g.:
>>>
>>> /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_cap_collapse_none
>>>
>>> As a simple boolean. If switched on then we document that it caps mTHP as
>>> per Nico's suggestion.
>>>
>>> That way we avoid the 'silent' issue I have with all this and it's an
>>> explicit setting.
>>
>> Right, but it's another toggle I wish we wouldn't need. We could of course
>> also make it some compile-time option, but not sure if that's really any
>> better.
>>
>> I'd hope we find an easy way forward that doesn't require new toggles, at
>> least for now ...
> 
> Right, well I agree if we can make this 0/511 thing work, let's do that.
> 
> Toggles are just 'least worst' workarounds on assumption of the need for capping.

I finally finished reading through the discussions across multiple 
threads:), and it looks like we've reached a preliminary consensus (make 
0/511 work). Great and thanks!

IIUC, the strategy is, configuring it to 511 means always enabling mTHP 
collapse, configuring it to 0 means collapsing mTHP only if all PTEs are 
non-none/zero, and for other values, we issue a warning and prohibit 
mTHP collapse (avoid Lorenzo's concern about silently changing 
max_ptes_none). Then the implementation for collapse_max_ptes_none() 
should be as follows:

static int collapse_max_ptes_none(unsigned int order, bool full_scan)
{
         /* ignore max_ptes_none limits */
         if (full_scan)
                 return HPAGE_PMD_NR - 1;

         if (order == HPAGE_PMD_ORDER)
                 return khugepaged_max_ptes_none;

         /*
          * To prevent creeping towards larger order collapses for mTHP collapse,
          * we restrict khugepaged_max_ptes_none to only 511 or 0, simplifying the
          * logic. This means:
          * max_ptes_none == 511 -> collapse mTHP always
          * max_ptes_none == 0 -> collapse mTHP only if all PTEs are non-none/zero
          */
         if (!khugepaged_max_ptes_none || khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
                 return khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);

         pr_warn_once("mTHP collapse only supports khugepaged_max_ptes_none configured as 0 or %d\n", HPAGE_PMD_NR - 1);
         return -EINVAL;
}

So what do you think?

Re: [PATCH v12 mm-new 06/15] khugepaged: introduce collapse_max_ptes_none helper function

Posted by Lorenzo Stoakes 3 months, 1 week ago

On Wed, Oct 29, 2025 at 10:09:43AM +0800, Baolin Wang wrote:
> I finally finished reading through the discussions across multiple
> threads:), and it looks like we've reached a preliminary consensus (make
> 0/511 work). Great and thanks!

Yes we're getting there :) it's a sincere effort to try to find a way to move
forwards.

>
> IIUC, the strategy is, configuring it to 511 means always enabling mTHP
> collapse, configuring it to 0 means collapsing mTHP only if all PTEs are
> non-none/zero, and for other values, we issue a warning and prohibit mTHP
> collapse (avoid Lorenzo's concern about silently changing max_ptes_none).
> Then the implementation for collapse_max_ptes_none() should be as follows:
>
> static int collapse_max_ptes_none(unsigned int order, bool full_scan)
> {
>         /* ignore max_ptes_none limits */
>         if (full_scan)
>                 return HPAGE_PMD_NR - 1;
>
>         if (order == HPAGE_PMD_ORDER)
>                 return khugepaged_max_ptes_none;
>
>         /*
>          * To prevent creeping towards larger order collapses for mTHP collapse,
>          * we restrict khugepaged_max_ptes_none to only 511 or 0, simplifying the
>          * logic. This means:
>          * max_ptes_none == 511 -> collapse mTHP always
>          * max_ptes_none == 0 -> collapse mTHP only if all PTEs are non-none/zero
>          */
>         if (!khugepaged_max_ptes_none || khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
>                 return khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
>
>         pr_warn_once("mTHP collapse only supports khugepaged_max_ptes_none configured as 0 or %d\n", HPAGE_PMD_NR - 1);
>         return -EINVAL;
> }
>
> So what do you think?

Yeah I think something like this.

Though I'd implement it more explicitly like:

	/* Zero/non-present collapse disabled. */
	if (!khugepaged_max_ptes_none)
		return 0;

	/* Collapse the maximum number of zero/non-present PTEs. */
	if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
		return (1 << order) - 1;

Then we can do away with this confusing (HPAGE_PMD_ORDER - order) stuff.

A quick check in google sheets suggests my maths is ok here but do correct me if
I'm wrong :)
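
For anyone wanting to check the same thing without a spreadsheet, here is a
small userspace sketch that compares the two forms for every order up to PMD
order. HPAGE_PMD_ORDER is hard-coded to 9 (4K base pages) as an assumption,
and max_none_shifted(), max_none_explicit() and forms_agree() are names made
up for illustration.

```c
#include <stdbool.h>

#define HPAGE_PMD_ORDER 9	/* assumed: 4K base pages */
#define HPAGE_PMD_NR	(1u << HPAGE_PMD_ORDER)

/* The shift-based form, with max_ptes_none fixed at HPAGE_PMD_NR - 1. */
static unsigned int max_none_shifted(unsigned int order)
{
	return (HPAGE_PMD_NR - 1) >> (HPAGE_PMD_ORDER - order);
}

/* The explicit form: all but one PTE of the target order may be none. */
static unsigned int max_none_explicit(unsigned int order)
{
	return (1u << order) - 1;
}

/* True if both forms give the same limit for every order 0..PMD order. */
static bool forms_agree(void)
{
	unsigned int order;

	for (order = 0; order <= HPAGE_PMD_ORDER; order++)
		if (max_none_shifted(order) != max_none_explicit(order))
			return false;
	return true;
}
```

Both forms agree because shifting 2^9 - 1 right by (9 - order) leaves exactly
the low 'order' bits set, which is 2^order - 1.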

Cheers, Lorenzo

Re: [PATCH v12 mm-new 06/15] khugepaged: introduce collapse_max_ptes_none helper function

Posted by Nico Pache 3 months, 1 week ago

On Wed, Oct 29, 2025 at 12:56 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Wed, Oct 29, 2025 at 10:09:43AM +0800, Baolin Wang wrote:
> > I finally finished reading through the discussions across multiple
> > threads:), and it looks like we've reached a preliminary consensus (make
> > 0/511 work). Great and thanks!
>
> Yes we're getting there :) it's a sincere effort to try to find a way to move
> forwards.
>
> >
> > IIUC, the strategy is, configuring it to 511 means always enabling mTHP
> > collapse, configuring it to 0 means collapsing mTHP only if all PTEs are
> > non-none/zero, and for other values, we issue a warning and prohibit mTHP
> > collapse (avoid Lorenzo's concern about silently changing max_ptes_none).
> > Then the implementation for collapse_max_ptes_none() should be as follows:
> >
> > static int collapse_max_ptes_none(unsigned int order, bool full_scan)
> > {
> >         /* ignore max_ptes_none limits */
> >         if (full_scan)
> >                 return HPAGE_PMD_NR - 1;
> >
> >         if (order == HPAGE_PMD_ORDER)
> >                 return khugepaged_max_ptes_none;
> >
> >         /*
> >          * To prevent creeping towards larger order collapses for mTHP collapse,
> >          * we restrict khugepaged_max_ptes_none to only 511 or 0, simplifying the
> >          * logic. This means:
> >          * max_ptes_none == 511 -> collapse mTHP always
> >          * max_ptes_none == 0 -> collapse mTHP only if all PTEs are non-none/zero
> >          */
> >         if (!khugepaged_max_ptes_none || khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
> >                 return khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
> >
> >         pr_warn_once("mTHP collapse only supports khugepaged_max_ptes_none configured as 0 or %d\n", HPAGE_PMD_NR - 1);
> >         return -EINVAL;
> > }
> >
> > So what do you think?
>
> Yeah I think something like this.
>
> Though I'd implement it more explicitly like:
>
>         /* Zero/non-present collapse disabled. */
>         if (!khugepaged_max_ptes_none)
>                 return 0;
>
>         /* Collapse the maximum number of zero/non-present PTEs. */
>         if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
>                 return (1 << order) - 1;
>
> Then we can do away with this confusing (HPAGE_PMD_ORDER - order) stuff.

This looks cleaner/more explicit given the limits we are enforcing!

I'll go for something like that.
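
Roughly, folding the explicit branches into the helper might look like the
following. This is a userspace sketch only, not the code that will be posted:
khugepaged_max_ptes_none is modelled as a plain variable, HPAGE_PMD_ORDER is
fixed at 9 (4K base pages), and the one-time warning is left as a comment.

```c
#include <errno.h>
#include <stdbool.h>

#define HPAGE_PMD_ORDER 9	/* assumed: 4K base pages */
#define HPAGE_PMD_NR	(1u << HPAGE_PMD_ORDER)

/* Stand-in for the sysfs tunable; a plain variable in this sketch. */
static unsigned int khugepaged_max_ptes_none = HPAGE_PMD_NR - 1;

static int collapse_max_ptes_none(unsigned int order, bool full_scan)
{
	/* ignore max_ptes_none limits */
	if (full_scan)
		return HPAGE_PMD_NR - 1;

	if (order == HPAGE_PMD_ORDER)
		return khugepaged_max_ptes_none;

	/* Zero/non-present collapse disabled. */
	if (!khugepaged_max_ptes_none)
		return 0;

	/* Collapse the maximum number of zero/non-present PTEs. */
	if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
		return (1 << order) - 1;

	/* Intermediate value: no mTHP collapse (a warn-once would go here). */
	return -EINVAL;
}
```

Callers would then treat a negative return as "skip mTHP collapse for this
order" rather than as a limit.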

>
> A quick check in google sheets suggests my maths is ok here but do correct me if
> I'm wrong :)

LGTM!

Thanks for all the reviews! I'm glad we were able to find a solution :)

-- Nico

>
> Cheers, Lorenzo
>

Re: [PATCH v12 mm-new 06/15] khugepaged: introduce collapse_max_ptes_none helper function

Posted by Baolin Wang 3 months, 1 week ago


On 2025/10/30 05:14, Nico Pache wrote:
> On Wed, Oct 29, 2025 at 12:56 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
>>
>> On Wed, Oct 29, 2025 at 10:09:43AM +0800, Baolin Wang wrote:
>>> I finally finished reading through the discussions across multiple
>>> threads:), and it looks like we've reached a preliminary consensus (make
>>> 0/511 work). Great and thanks!
>>
>> Yes we're getting there :) it's a sincere effort to try to find a way to move
>> forwards.
>>
>>>
>>> IIUC, the strategy is, configuring it to 511 means always enabling mTHP
>>> collapse, configuring it to 0 means collapsing mTHP only if all PTEs are
>>> non-none/zero, and for other values, we issue a warning and prohibit mTHP
>>> collapse (avoid Lorenzo's concern about silently changing max_ptes_none).
>>> Then the implementation for collapse_max_ptes_none() should be as follows:
>>>
>>> static int collapse_max_ptes_none(unsigned int order, bool full_scan)
>>> {
>>>          /* ignore max_ptes_none limits */
>>>          if (full_scan)
>>>                  return HPAGE_PMD_NR - 1;
>>>
>>>          if (order == HPAGE_PMD_ORDER)
>>>                  return khugepaged_max_ptes_none;
>>>
>>>          /*
>>>           * To prevent creeping towards larger order collapses for mTHP collapse,
>>>           * we restrict khugepaged_max_ptes_none to only 511 or 0, simplifying the
>>>           * logic. This means:
>>>           * max_ptes_none == 511 -> collapse mTHP always
>>>           * max_ptes_none == 0 -> collapse mTHP only if all PTEs are non-none/zero
>>>           */
>>>          if (!khugepaged_max_ptes_none || khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
>>>                  return khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
>>>
>>>          pr_warn_once("mTHP collapse only supports khugepaged_max_ptes_none configured as 0 or %d\n", HPAGE_PMD_NR - 1);
>>>          return -EINVAL;
>>> }
>>>
>>> So what do you think?
>>
>> Yeah I think something like this.
>>
>> Though I'd implement it more explicitly like:
>>
>>          /* Zero/non-present collapse disabled. */
>>          if (!khugepaged_max_ptes_none)
>>                  return 0;
>>
>>          /* Collapse the maximum number of zero/non-present PTEs. */
>>          if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
>>                  return (1 << order) - 1;
>>
>> Then we can do away with this confusing (HPAGE_PMD_ORDER - order) stuff.
> 
> This looks cleaner/more explicit given the limits we are enforcing!
> 
> I'll go for something like that.
> 
>>
>> A quick check in google sheets suggests my maths is ok here but do correct me if
>> I'm wrong :)
> 
> LGTM!

LGTM. Thanks.

Re: [PATCH v12 mm-new 06/15] khugepaged: introduce collapse_max_ptes_none helper function

Posted by Nico Pache 3 months, 1 week ago

On Tue, Oct 28, 2025 at 8:10 PM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 2025/10/29 02:59, Lorenzo Stoakes wrote:
> > On Tue, Oct 28, 2025 at 07:08:38PM +0100, David Hildenbrand wrote:
> >>
> >>>>> Hey Lorenzo,
> >>>>>
> >>>>>> I mean not to beat a dead horse re: v11 commentary, but I thought we were going
> >>>>>> to implement David's idea re: the new 'eagerness' tunable, and again we're now just
> >>>>>> implementing the capping at HPAGE_PMD_NR/2 - 1 thing again?
> >>>>>
> >>>>> I spoke to David and he said to continue forward with this series; the
> >>>>> "eagerness" tunable will take some time, and may require further
> >>>>> considerations/discussion.
> >>>>
> >>>> Right, after talking to Johannes it got clearer that what we envisioned with
> >>>
> >>> I'm not sure that you meant to say go ahead with the series as-is with this
> >>> silent capping?
> >>
> >> No, "go ahead" as in "let's find some way forward that works for all and is
> >> not too crazy".
> >
> > Right we clearly needed to discuss that further at the time but that's moot now,
> > we're figuring it out now :)
> >
> >>
> >> [...]
> >>
> >>>> "eagerness" would not be like swappiness, and we will really have to be
> >>>> careful here. I don't know yet when I will have time to look into that.
> >>>
> >>> I guess I missed this part of the conversation, what do you mean?
> >>
> >> Johannes raised issues with that on the list and afterwards we had an
> >> offline discussion about some of the details and why something unpredictable
> >> is not good.
> >
> > Could we get these details on-list so we can discuss them? This doesn't have to
> > be urgent, but I would like to have a say in this or at least be part of the
> > conversation please.
> >
> >>
> >>>
> >>> The whole concept is that we have a parameter whose value is _abstracted_ and
> >>> which we control what it means.
> >>>
> >>> I'm not sure exactly why that would now be problematic? The fundamental concept
> >>> seems sound no? Last I remember of the conversation this was the case.
> >>
> >> The basic idea was to do something abstracted as swappiness. Turns out
> >> "swappiness" is really something predictable, not something we can randomly
> >> change how it behaves under the hood.
> >>
> >> So we'd have to find something similar for "eagerness", and that's where it
> >> stops being easy.
> >
> > I think we shouldn't be too stuck on
> >
> >>
> >>>
> >>>>
> >>>> If we want to avoid the implicit capping, I think there are the following
> >>>> possible approaches
> >>>>
> >>>> (1) Tolerate creep for now, maybe warning if the user configures it.
> >>>
> >>> I mean this seems a viable option if there is pressure to land this series
> >>> before we have a viable uAPI for configuring this.
> >>>
> >>> A part of me thinks we shouldn't rush series in for that reason though and
> >>> should require that we have a proper control here.
> >>>
> >>> But I guess this approach is the least-worst as it leaves us with the most
> >>> options moving forwards.
> >>
> >> Yes. There is also the alternative of respecting only 0 / 511 for mTHP
> >> collapse for now as discussed in the other thread.
> >
> > Yes I guess let's carry that on over there.
> >
> > I mean this is why I said it's better to try to keep things in one thread :) but
> > anyway, we've forked and can't be helped now.
> >
> > To be clear that was a criticism of - email development - not you.
> >
> > It's _extremely easy_ to have this happen because one thread naturally leads to
> > a broader discussion of a given topic, whereas another has questions from
> > somebody else about the same topic, to which people reply and then... you have a
> > fork and it can't be helped.
> >
> > I guess I'm saying it'd be good if we could say 'ok let's move this to X'.
> >
> > But that's also broken in its own way, you can't stop people from replying in
> > the other thread still and yeah. It's a limitation of this model :)
> >
> >>
> >>>
> >>>> (2) Avoid creep by counting zero-filled pages towards none_or_zero.
> >>>
> >>> Would this really make all that much difference?
> >>
> >> It solves the creep problem I think, but it's a bit nasty IMHO.
> >
> > Ah because you'd end up with a bunch of zeroed pages from the prior mTHP
> > collapses, interesting...
> >
> > Scanning for that does seem a bit nasty though yes...
> >
> >>
> >>>
> >>>> (3) Have separate toggles for each THP size. Doesn't quite solve the
> >>>>       problem, only shifts it.
> >>>
> >>> Yeah I did wonder about this as an alternative solution. But of course it then
> >>> makes it vague what the parent values means in respect of the individual levels,
> >>> unless we have an 'inherit' mode there too (possible).
> >>>
> >>> It's going to be confusing though as max_ptes_none sits at the root khugepaged/
> >>> level and I don't think any other parameter from khugepaged/ is exposed at
> >>> individual page size levels.
> >>>
> >>> And of course doing this means we
> >>>
> >>>>
> >>>> Anything else?
> >>>
> >>> Err... I mean I'm not sure if you missed it but I suggested an approach in the
> >>> sub-thread - exposing mthp_max_ptes_none as a _READ-ONLY_ field at:
> >>>
> >>> /sys/kernel/mm/transparent_hugepage/khugepaged/max_mthp_ptes_none
> >>>
> >>> Then we allow the capping, but simply document that we specify what the capped
> >>> value will be here for mTHP.
> >>
> >> I did not have time to read the details on that so far.
> >
> > OK. It is a bit nasty, yes. The idea is to find something that allows the
> > capping to work.
> >
> >>
> >> It would be one solution forward. I dislike it because I think the whole
> >> capping is an intermediate thing that can be (and likely must be, when
> >> considering mTHP underused shrinking I think) solved in the future
> >> differently. That's why I would prefer adding this only if there is no
> >> other, simpler, way forward.
> >
> > Yes I agree that if we could avoid it it'd be great.
> >
> > Really I proposed this solution on the basis that we were somehow ok with the
> > capping.
> >
> > If we can avoid that'd be ideal as it reduces complexity and 'unexpected'
> > behaviour.
> >
> > We'll clarify on the other thread, but the 511/0 was compelling to me before as
> > a simplification, and if we can have a straightforward model of how mTHP
> > collapse across none/zero page PTEs behaves this is ideal.
> >
> > The only question is w.r.t. warnings etc. but we can handle details there.
> >
> >>
> >>>
> >>> That struck me as the simplest way of getting this series landed without
> >>> necessarily violating any future eagerness which:
> >>>
> >>> a. Must still support khugepaged/max_ptes_none - we aren't getting away from
> >>>      this, it's uAPI.
> >>>
> >>> b. Surely must want to do different things for mTHP in eagerness, so if we're
> >>>      exposing some PTE value in max_ptes_none doing so in
> >>>      khugepaged/mthp_max_ptes_none wouldn't be problematic (note again - it's
> >>>      readonly so unlike max_ptes_none we don't have to worry about the other
> >>>      direction).
> >>>
> >>> HOWEVER, eagerness might want to change this behaviour per-mTHP size, in
> >>> which case perhaps mthp_max_ptes_none would be problematic in that it is some
> >>> kind of average.
> >>>
> >>> Then again we could always revert to putting this parameter as in (3) in that
> >>> case, ugly but kinda viable.
> >>>
> >>>>
> >>>> IIUC, creep is less of a problem when we have the underused shrinker
> >>>> enabled: whatever we over-allocated can (unless longterm-pinned etc) get
> >>>> reclaimed again.
> >>>>
> >>>> So maybe having underused-shrinker support for mTHP as well would be a
> >>>> solution to tackle (1) later?
> >>>
> >>> How viable is this in the short term?
> >>
> >> I once started looking into it, but it will require quite some work, because
> >> the lists will essentially include each and every (m)THP in the system ...
> >> so I think we will need some redesign.
> >
> > Ack.
> >
> > This aligns with non-0/511 settings being non-functional for mTHP atm anyway.
> >
> >>
> >>>
> >>> Another possible solution:
> >>>
> >>> If mthp_max_ptes_none is not workable, we could have a toggle at, e.g.:
> >>>
> >>> /sys/kernel/mm/transparent_hugepage/khugepaged/mthp_cap_collapse_none
> >>>
> >>> As a simple boolean. If switched on then we document that it caps mTHP as
> >>> per Nico's suggestion.
> >>>
> >>> That way we avoid the 'silent' issue I have with all this and it's an
> >>> explicit setting.
> >>
> >> Right, but it's another toggle I wish we wouldn't need. We could of course
> >> also make it some compile-time option, but not sure if that's really any
> >> better.
> >>
> >> I'd hope we find an easy way forward that doesn't require new toggles, at
> >> least for now ...
> >
> > Right, well I agree if we can make this 0/511 thing work, let's do that.
> >
> > Toggles are just 'least worst' workarounds on assumption of the need for capping.
>
> I finally finished reading through the discussions across multiple
> threads:), and it looks like we've reached a preliminary consensus (make
> 0/511 work). Great and thanks!
>
> IIUC, the strategy is, configuring it to 511 means always enabling mTHP
> collapse, configuring it to 0 means collapsing mTHP only if all PTEs are
> non-none/zero, and for other values, we issue a warning and prohibit
> mTHP collapse (avoid Lorenzo's concern about silently changing
> max_ptes_none). Then the implementation for collapse_max_ptes_none()
> should be as follows:
>
> static int collapse_max_ptes_none(unsigned int order, bool full_scan)
> {
>          /* ignore max_ptes_none limits */
>          if (full_scan)
>                  return HPAGE_PMD_NR - 1;
>
>          if (order == HPAGE_PMD_ORDER)
>                  return khugepaged_max_ptes_none;
>
>          /*
>           * To prevent creeping towards larger order collapses for mTHP
>           * collapse, we restrict khugepaged_max_ptes_none to only 511 or 0,
>           * simplifying the logic. This means:
>           * max_ptes_none == 511 -> collapse mTHP always
>           * max_ptes_none == 0 -> collapse mTHP only if all PTEs are
>           * non-none/zero
>           */
>          if (!khugepaged_max_ptes_none ||
>              khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
>                  return khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
>
>          pr_warn_once("mTHP collapse only supports khugepaged_max_ptes_none configured as 0 or %d\n",
>                       HPAGE_PMD_NR - 1);
>          return -EINVAL;
> }
>
> So what do you think?

Yes, I'm glad we finally came to some consensus, despite it being a
less-than-ideal solution.

Hopefully the eagerness patchset re-introduces all the lost
functionality in the future.

Cheers
-- Nico

>

Re: [PATCH v12 mm-new 06/15] khugepaged: introduce collapse_max_ptes_none helper function

Posted by Lorenzo Stoakes 3 months, 1 week ago

On Tue, Oct 28, 2025 at 06:59:31PM +0000, Lorenzo Stoakes wrote:
> On Tue, Oct 28, 2025 at 07:08:38PM +0100, David Hildenbrand wrote:
> > >
> > > The whole concept is that we have a parameter whose value is _abstracted_ and
> > > which we control what it means.
> > >
> > > I'm not sure exactly why that would now be problematic? The fundamental concept
> > > seems sound no? Last I remember of the conversation this was the case.
> >
> > The basic idea was to do something abstracted as swappiness. Turns out
> > "swappiness" is really something predictable, not something we can randomly
> > change how it behaves under the hood.
> >
> > So we'd have to find something similar for "eagerness", and that's where it
> > stops being easy.
>
> I think we shouldn't be too stuck on
>

I really am the master of the unfinished sentence :)

I was going to say we shouldn't be too stuck on the analogy to swappiness and
just maintain the broad concept that eagerness is abstracted and we get to
determine what that looks like.

But absolutely I accept that it's highly sensitive and likely embodies a great
many moving parts and we must be cautious absolutely.

This is something that can be deferred for later.

Cheers, Lorenzo

Re: [PATCH v12 mm-new 06/15] khugepaged: introduce collapse_max_ptes_none helper function

Posted by Lorenzo Stoakes 3 months, 1 week ago

On Tue, Oct 28, 2025 at 05:29:59PM +0000, Lorenzo Stoakes wrote:
> >
> > If we want to avoid the implicit capping, I think there are the following
> > possible approaches
> >
> > (1) Tolerate creep for now, maybe warning if the user configures it.
>
> I mean this seems a viable option if there is pressure to land this series
> before we have a viable uAPI for configuring this.
>
> A part of me thinks we shouldn't rush series in for that reason though and
> should require that we have a proper control here.
>
> But I guess this approach is the least-worst as it leaves us with the most
> options moving forwards.
>
> > (2) Avoid creep by counting zero-filled pages towards none_or_zero.
>
> Would this really make all that much difference?
>
> > (3) Have separate toggles for each THP size. Doesn't quite solve the
> >     problem, only shifts it.
>
> Yeah I did wonder about this as an alternative solution. But of course it then
> makes it vague what the parent value means in respect of the individual levels,
> unless we have an 'inherit' mode there too (possible).
>
> It's going to be confusing though as max_ptes_none sits at the root khugepaged/
> level and I don't think any other parameter from khugepaged/ is exposed at
> individual page size levels.
>
> And of course doing this means we

Oops didn't finish the thought!

Here it is:

And of course this means we continue to propagate this max_ptes_none concept
only now in more places which is yuck.

Unless you meant putting something other than max_ptes_none at different levels?

Cheers, Lorenzo

Re: [PATCH v12 mm-new 06/15] khugepaged: introduce collapse_max_ptes_none helper function

Posted by Baolin Wang 3 months, 1 week ago


On 2025/10/28 01:53, Lorenzo Stoakes wrote:
> On Wed, Oct 22, 2025 at 12:37:08PM -0600, Nico Pache wrote:
>> The current mechanism for determining mTHP collapse scales the
>> khugepaged_max_ptes_none value based on the target order. This
>> introduces an undesirable feedback loop, or "creep", when max_ptes_none
>> is set to a value greater than HPAGE_PMD_NR / 2.
>>
>> With this configuration, a successful collapse to order N will populate
>> enough pages to satisfy the collapse condition on order N+1 on the next
>> scan. This leads to unnecessary work and memory churn.
>>
>> To fix this issue introduce a helper function that caps the max_ptes_none
>> to HPAGE_PMD_NR / 2 - 1 (255 on 4k page size). The function also scales
>> the max_ptes_none number by the (PMD_ORDER - target collapse order).
>>
>> The limits can be ignored by passing full_scan=true, this is useful for
>> madvise_collapse (which ignores limits), or in the case of
>> collapse_scan_pmd(), allows the full PMD to be scanned when mTHP
>> collapse is available.
>>
>> Signed-off-by: Nico Pache <npache@redhat.com>
>> ---
>>   mm/khugepaged.c | 35 ++++++++++++++++++++++++++++++++++-
>>   1 file changed, 34 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 4ccebf5dda97..286c3a7afdee 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -459,6 +459,39 @@ void __khugepaged_enter(struct mm_struct *mm)
>>   		wake_up_interruptible(&khugepaged_wait);
>>   }
>>
>> +/**
>> + * collapse_max_ptes_none - Calculate maximum allowed empty PTEs for collapse
>> + * @order: The folio order being collapsed to
>> + * @full_scan: Whether this is a full scan (ignore limits)
>> + *
>> + * For madvise-triggered collapses (full_scan=true), all limits are bypassed
>> + * and allow up to HPAGE_PMD_NR - 1 empty PTEs.
>> + *
>> + * For PMD-sized collapses (order == HPAGE_PMD_ORDER), use the configured
>> + * khugepaged_max_ptes_none value.
>> + *
>> + * For mTHP collapses, scale down the max_ptes_none proportionally to the folio
>> + * order, but caps it at HPAGE_PMD_NR/2-1 to prevent a collapse feedback loop.
>> + *
>> + * Return: Maximum number of empty PTEs allowed for the collapse operation
>> + */
>> +static unsigned int collapse_max_ptes_none(unsigned int order, bool full_scan)
>> +{
>> +	unsigned int max_ptes_none;
>> +
>> +	/* ignore max_ptes_none limits */
>> +	if (full_scan)
>> +		return HPAGE_PMD_NR - 1;
>> +
>> +	if (order == HPAGE_PMD_ORDER)
>> +		return khugepaged_max_ptes_none;
>> +
>> +	max_ptes_none = min(khugepaged_max_ptes_none, HPAGE_PMD_NR/2 - 1);
> 
> I mean not to beat a dead horse re: v11 commentary, but I thought we were going
> to implement David's idea re: the new 'eagerness' tunable, and again we're now just
> implementing the capping at HPAGE_PMD_NR/2 - 1 thing again?
> 
> I'm still really quite uncomfortable with us silently capping this value.
> 
> If we're putting forward theoretical ideas that are to be later built upon, this
> series should be an RFC.
> 
> But if we really intend to silently ignore user input the problem is that then
> becomes established uAPI.
> 
> I think it's _sensible_ to avoid this mTHP escalation problem, but the issue is
> visibility I think.
> 
> I think people are going to find it odd that you set it to something, but then
> get something else.
> 
> As an alternative we could have a new sysfs field:
> 
> /sys/kernel/mm/transparent_hugepage/khugepaged/max_mthp_ptes_none
> 
> That shows the cap clearly.
> 
> In fact, it could be read-only... and just expose it to the user. That reduces
> complexity.
> 
> We can then bring in eagerness later and have the same situation of
> max_ptes_none being a parameter that exists (plus this additional read-only
> parameter).

We all know that ultimately using David's suggestion to add the 
'eagerness' tunable parameter is the best approach, but for now, we need 
an initial version to support mTHP collapse (as we've already discussed 
extensively here:)).

I don't like the idea of adding another and potentially confusing 
'max_mthp_ptes_none' interface, which might make it more difficult to 
accommodate the 'eagerness' parameter in the future.

If Nico's current proposal still doesn't satisfy everyone, I personally 
lean towards David's earlier simplified approach:
	max_ptes_none == 511 -> collapse mTHP always
	max_ptes_none != 511 -> collapse mTHP only if all PTEs are non-none/zero

Let's first have an initial approach in place, which will also simplify 
the following addition of the 'eagerness' tunable parameter.

Nico, Lorenzo, and David, what do you think?

Code should be:
static unsigned int collapse_max_ptes_none(unsigned int order, bool full_scan)
{
         unsigned int max_ptes_none;

         /* ignore max_ptes_none limits */
         if (full_scan)
                 return HPAGE_PMD_NR - 1;

         if (order == HPAGE_PMD_ORDER)
                 return khugepaged_max_ptes_none;

         /*
          * For mTHP collapse, we can simplify the logic:
          * max_ptes_none == 511 -> collapse mTHP always
          * max_ptes_none != 511 -> collapse mTHP only if all PTEs are
          * non-none/zero
          */
         if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
                 return khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);

         return 0;
}
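
For illustration only, here is a userspace model of the simplified function above, which makes it easy to see what limit each order would get. It is a sketch, not kernel code: the MODEL_* constants assume 4K base pages (HPAGE_PMD_ORDER == 9, HPAGE_PMD_NR == 512), and model_max_ptes_none() is a hypothetical name that takes the configured value as a parameter instead of reading khugepaged_max_ptes_none.

```c
#include <stdbool.h>

/* Assumption: 4K base pages, so a PMD maps 2^9 = 512 PTEs. */
#define MODEL_PMD_ORDER 9u
#define MODEL_PMD_NR    (1u << MODEL_PMD_ORDER)

/*
 * Hypothetical userspace model of the simplified helper: @configured stands
 * in for khugepaged_max_ptes_none.
 */
static unsigned int model_max_ptes_none(unsigned int configured,
					unsigned int order, bool full_scan)
{
	/* full scans ignore the configured limit */
	if (full_scan)
		return MODEL_PMD_NR - 1;

	/* PMD-sized collapse honours the configured value as-is */
	if (order == MODEL_PMD_ORDER)
		return configured;

	/* mTHP: 511 scales to "all but one PTE may be none" for the order */
	if (configured == MODEL_PMD_NR - 1)
		return configured >> (MODEL_PMD_ORDER - order);

	/* any other value: collapse only fully populated ranges */
	return 0;
}
```

With the default of 511, an order-2 (4-entry) collapse would tolerate 3 none/zero PTEs and an order-4 (16-entry) collapse would tolerate 15, i.e. always one less than the number of entries. Any other configured value yields 0 for every mTHP order while leaving the PMD-order result untouched.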

Re: [PATCH v12 mm-new 06/15] khugepaged: introduce collapse_max_ptes_none helper function

Posted by Lorenzo Stoakes 3 months, 1 week ago

On Tue, Oct 28, 2025 at 06:09:43PM +0800, Baolin Wang wrote:
>
>
> On 2025/10/28 01:53, Lorenzo Stoakes wrote:
> > On Wed, Oct 22, 2025 at 12:37:08PM -0600, Nico Pache wrote:
> > > The current mechanism for determining mTHP collapse scales the
> > > khugepaged_max_ptes_none value based on the target order. This
> > > introduces an undesirable feedback loop, or "creep", when max_ptes_none
> > > is set to a value greater than HPAGE_PMD_NR / 2.
> > >
> > > With this configuration, a successful collapse to order N will populate
> > > enough pages to satisfy the collapse condition on order N+1 on the next
> > > scan. This leads to unnecessary work and memory churn.
> > >
> > > To fix this issue introduce a helper function that caps the max_ptes_none
> > > to HPAGE_PMD_NR / 2 - 1 (255 on 4k page size). The function also scales
> > > the max_ptes_none number by the (PMD_ORDER - target collapse order).
> > >
> > > The limits can be ignored by passing full_scan=true, this is useful for
> > > madvise_collapse (which ignores limits), or in the case of
> > > collapse_scan_pmd(), allows the full PMD to be scanned when mTHP
> > > collapse is available.
> > >
> > > Signed-off-by: Nico Pache <npache@redhat.com>
> > > ---
> > >   mm/khugepaged.c | 35 ++++++++++++++++++++++++++++++++++-
> > >   1 file changed, 34 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index 4ccebf5dda97..286c3a7afdee 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -459,6 +459,39 @@ void __khugepaged_enter(struct mm_struct *mm)
> > >   		wake_up_interruptible(&khugepaged_wait);
> > >   }
> > >
> > > +/**
> > > + * collapse_max_ptes_none - Calculate maximum allowed empty PTEs for collapse
> > > + * @order: The folio order being collapsed to
> > > + * @full_scan: Whether this is a full scan (ignore limits)
> > > + *
> > > + * For madvise-triggered collapses (full_scan=true), all limits are bypassed
> > > + * and allow up to HPAGE_PMD_NR - 1 empty PTEs.
> > > + *
> > > + * For PMD-sized collapses (order == HPAGE_PMD_ORDER), use the configured
> > > + * khugepaged_max_ptes_none value.
> > > + *
> > > + * For mTHP collapses, scale down the max_ptes_none proportionally to the folio
> > > + * order, but caps it at HPAGE_PMD_NR/2-1 to prevent a collapse feedback loop.
> > > + *
> > > + * Return: Maximum number of empty PTEs allowed for the collapse operation
> > > + */
> > > +static unsigned int collapse_max_ptes_none(unsigned int order, bool full_scan)
> > > +{
> > > +	unsigned int max_ptes_none;
> > > +
> > > +	/* ignore max_ptes_none limits */
> > > +	if (full_scan)
> > > +		return HPAGE_PMD_NR - 1;
> > > +
> > > +	if (order == HPAGE_PMD_ORDER)
> > > +		return khugepaged_max_ptes_none;
> > > +
> > > +	max_ptes_none = min(khugepaged_max_ptes_none, HPAGE_PMD_NR/2 - 1);
> >
> > I mean not to beat a dead horse re: v11 commentary, but I thought we were going
> > to implement David's idea re: the new 'eagerness' tunable, and again we're now just
> > implementing the capping at HPAGE_PMD_NR/2 - 1 thing again?
> >
> > I'm still really quite uncomfortable with us silently capping this value.
> >
> > If we're putting forward theoretical ideas that are to be later built upon, this
> > series should be an RFC.
> >
> > But if we really intend to silently ignore user input the problem is that then
> > becomes established uAPI.
> >
> > I think it's _sensible_ to avoid this mTHP escalation problem, but the issue is
> > visibility I think.
> >
> > I think people are going to find it odd that you set it to something, but then
> > get something else.
> >
> > As an alternative we could have a new sysfs field:
> >
> > /sys/kernel/mm/transparent_hugepage/khugepaged/max_mthp_ptes_none
> >
> > That shows the cap clearly.
> >
> > In fact, it could be read-only... and just expose it to the user. That reduces
> > complexity.
> >
> > We can then bring in eagerness later and have the same situation of
> > max_ptes_none being a parameter that exists (plus this additional read-only
> > parameter).
>
> We all know that ultimately using David's suggestion to add the 'eagerness'
> tunable parameter is the best approach, but for now, we need an initial
> version to support mTHP collapse (as we've already discussed extensively
> here:)).
>
> I don't like the idea of adding another and potentially confusing
> 'max_mthp_ptes_none' interface, which might make it more difficult to
> accommodate the 'eagerness' parameter in the future.

See my reply to Nico, I disagree that it affects eagerness.

>
> If Nico's current proposal still doesn't satisfy everyone, I personally lean

It's not upstreamable. We cannot silently violate user expectation or silently
change behaviour like this.

> towards David's earlier simplified approach:
> 	max_ptes_none == 511 -> collapse mTHP always
> 	max_ptes_none != 511 -> collapse mTHP only if all PTEs are non-none/zero

Pretty sure David's suggestion was that max_ptes_none would literally get set to
511 if you specified 511, or 0 if you specified anything else.

Which would make things visible to users and not ignore their tunable setting,
which is the whole issue IMO.

But we can't do that, because we know at the very least Meta use small non-zero
values that they expect to be honoured.

So again we're stuck in the situation of max_ptes_none being ignored for mTHP
and users being totally unaware.

>
> Let's first have an initial approach in place, which will also simplify the

Well hang on, this isn't the same as 'do anything we like'.

It immediately becomes uAPI, and 'I'll do that later' often becomes 'I'll never
do that because I got too busy'.

Yes perhaps we have to wait for the eagerness parameter, but any interim
solution must be _solid_ and not do strange/unexpected things.

We've (and of course, it was a silly thing to do) provided the ability for users
to specify this max_ptes_none behaviour for khugepaged.

Suddenly putting an asterisk next to that like '*except mTHP where we totally
ignore you if you specify values we don't like' doesn't seem like a great way
forward.

As I said to Nico too, we _have_ to export and support max_ptes_none for uAPI
reasons. And presumably eagerness will want to specify different settings for
mTHP vs. PMD THP, so exposing this (read-only mind you) somehow isn't as crazy
as it might seem.

> following addition of the 'eagerness' tunable parameter.
>
> Nico, Lorenzo, and David, what do you think?
>
> Code should be:
> static unsigned int collapse_max_ptes_none(unsigned int order, bool full_scan)
> {
>         unsigned int max_ptes_none;
>
>         /* ignore max_ptes_none limits */
>         if (full_scan)
>                 return HPAGE_PMD_NR - 1;
>
>         if (order == HPAGE_PMD_ORDER)
>                 return khugepaged_max_ptes_none;
>
>         /*
>          * For mTHP collapse, we can simplify the logic:
>          * max_ptes_none == 511 -> collapse mTHP always
>          * max_ptes_none != 511 -> collapse mTHP only if all PTEs are
>          * non-none/zero
>          */
>         if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
>                 return khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
>
>         return 0;
> }

Thanks, Lorenzo

Re: [PATCH v12 mm-new 06/15] khugepaged: introduce collapse_max_ptes_none helper function

Posted by David Hildenbrand 3 months, 1 week ago

[...]

> 
>> towards David's earlier simplified approach:
>> 	max_ptes_none == 511 -> collapse mTHP always
>> 	max_ptes_none != 511 -> collapse mTHP only if all PTEs are non-none/zero
> 
> Pretty sure David's suggestion was that max_ptes_none would literally get set to
> 511 if you specified 511, or 0 if you specified anything else.

We had multiple incarnations of this approach, but the first one really was:

max_ptes_none == 511 -> collapse mTHP always
max_ptes_none == 0 -> collapse mTHP only if all non-none/zero

And for the intermediate values

(1) pr_warn() when mTHPs are enabled, stating that mTHP collapse is not 
supported yet with other values
(2) treat it like max_ptes_none == 0 or (maybe better?) just disable 
mTHP collapse


I still like that approach because it lets us defer solving the creep 
problem until later and doesn't add a silent capping.

Using intermediate max_ptes_none values is really only reasonable with 
the deferred shrinker today. And that one does not support mTHP even 
with this series, so it's future work either way.

-- 
Cheers

David / dhildenb

Re: [PATCH v12 mm-new 06/15] khugepaged: introduce collapse_max_ptes_none helper function

Posted by Lorenzo Stoakes 3 months, 1 week ago

(It'd be good if we could keep all the 'solutions' in one thread as I made a
detailed reply there and now all that will get lost across two threads but
*sigh* never mind. Insert rant about email development here.)

On Tue, Oct 28, 2025 at 06:56:10PM +0100, David Hildenbrand wrote:
> [...]
>
> >
> > > towards David's earlier simplified approach:
> > > 	max_ptes_none == 511 -> collapse mTHP always
> > > 	max_ptes_none != 511 -> collapse mTHP only if all PTEs are non-none/zero
> >
> > Pretty sure David's suggestion was that max_ptes_none would literally get set to
> > 511 if you specified 511, or 0 if you specified anything else.
>
> We had multiple incarnations of this approach, but the first one really was:
>
> max_ptes_none == 511 -> collapse mTHP always

But won't 511 mean we just 'creep' to maximum collapse again? Does that solve
anything?

> max_ptes_none == 0 -> collapse mTHP only if all non-none/zero
>
> And for the intermediate values
>
> (1) pr_warn() when mTHPs are enabled, stating that mTHP collapse is not
> supported yet with other values

It feels a bit much to issue a kernel warning every time somebody twiddles that
value, and it's kind of against user expectation a bit.

But maybe it's the least worst way of communicating things. It's still
absolutely gross.

> (2) treat it like max_ptes_none == 0 or (maybe better?) just disable mTHP
> collapse

Yeah disabling mTHP collapse for these values seems sane, but it also seems that
we should be capping for this to work correctly no?

Also I think all this probably violates requirements of users who want to have
different behaviour for mTHP and PMD THP.

The default is 511 so we're in creep territory even with the damn default :)

>
>
> I still like that approach because it lets us defer solving the creep
> problem until later and doesn't add a silent capping.

I mean it seems you're more or less saying allow creep. Which I'm kind of ok
with for a first pass thing, and defer it for later.

>
> Using intermediate max_ptes_none values is really only reasonable with the
> deferred shrinker today. And that one does not support mTHP even with this
> series, so it's future work either way.

Right, that's a nice fact to be aware of.

>
> --
> Cheers
>
> David / dhildenb
>

Thanks, Lorenzo

Re: [PATCH v12 mm-new 06/15] khugepaged: introduce collapse_max_ptes_none helper function

Posted by David Hildenbrand 3 months, 1 week ago

On 28.10.25 19:09, Lorenzo Stoakes wrote:
> (It'd be good if we could keep all the 'solutions' in one thread as I made a
> detailed reply there and now all that will get lost across two threads but
> *sigh* never mind. Insert rant about email development here.)

Yeah, I focused in my other mails on things to avoid creep while 
allowing for mTHP collapse.

> 
> On Tue, Oct 28, 2025 at 06:56:10PM +0100, David Hildenbrand wrote:
>> [...]
>>
>>>
>>>> towards David's earlier simplified approach:
>>>> 	max_ptes_none == 511 -> collapse mTHP always
>>>> 	max_ptes_none != 511 -> collapse mTHP only if all PTEs are non-none/zero
>>>
>>> Pretty sure David's suggestion was that max_ptes_none would literally get set to
>>> 511 if you specified 511, or 0 if you specified anything else.
>>
>> We had multiple incarnations of this approach, but the first one really was:
>>
>> max_ptes_none == 511 -> collapse mTHP always
> 
> But won't 511 mean we just 'creep' to maximum collapse again? Does that solve
> anything?

No creep, because you'll always collapse.

Creep only happens if you wouldn't collapse a PMD without prior mTHP 
collapse, but suddenly would in the same scenario simply because you had 
prior mTHP collapse.

At least that's my understanding.

> 
>> max_ptes_none == 0 -> collapse mTHP only if all non-none/zero
>>
>> And for the intermediate values
>>
>> (1) pr_warn() when mTHPs are enabled, stating that mTHP collapse is not
>> supported yet with other values
> 
> It feels a bit much to issue a kernel warning every time somebody twiddles that
> value, and it's kind of against user expectation a bit.

pr_warn_once() is what I meant.

> 
> But maybe it's the least worst way of communicating things. It's still
> absolutely gross.
> 
>> (2) treat it like max_ptes_none == 0 or (maybe better?) just disable mTHP
>> collapse
> 
> Yeah disabling mTHP collapse for these values seems sane, but it also seems that
> we should be capping for this to work correctly no?

I didn't get the interaction with capping, can you elaborate?

> 
> Also I think all this probably violates requirements of users who want to have
> different behaviour for mTHP and PMD THP.
> 
> The default is 511 so we're in creep territory even with the damn default :)

I don't think so, but maybe I am wrong.


-- 
Cheers

David / dhildenb

Re: [PATCH v12 mm-new 06/15] khugepaged: introduce collapse_max_ptes_none helper function

Posted by Lorenzo Stoakes 3 months, 1 week ago

On Tue, Oct 28, 2025 at 07:17:16PM +0100, David Hildenbrand wrote:
> On 28.10.25 19:09, Lorenzo Stoakes wrote:
> > (It'd be good if we could keep all the 'solutions' in one thread as I made a
> > detailed reply there and now all that will get lost across two threads but
> > *sigh* never mind. Insert rant about email development here.)
>
> Yeah, I focused in my other mails on things to avoid creep while allowing
> for mTHP collapse.
>
> >
> > On Tue, Oct 28, 2025 at 06:56:10PM +0100, David Hildenbrand wrote:
> > > [...]
> > >
> > > >
> > > > > towards David's earlier simplified approach:
> > > > > 	max_ptes_none == 511 -> collapse mTHP always
> > > > > 	max_ptes_none != 511 -> collapse mTHP only if all PTEs are non-none/zero
> > > >
> > > > Pretty sure David's suggestion was that max_ptes_none would literally get set to
> > > > 511 if you specified 511, or 0 if you specified anything else.
> > >
> > > We had multiple incarnations of this approach, but the first one really was:
> > >
> > > max_ptes_none == 511 -> collapse mTHP always
> >
> > But won't 511 mean we just 'creep' to maximum collapse again? Does that solve
> > anything?
>
> No creep, because you'll always collapse.

OK so in the 511 scenario, do we simply immediately collapse to the largest
possible _mTHP_ page size if based on adjacent none/zero page entries in the
PTE, and _never_ collapse to PMD on this basis even if we do have sufficient
none/zero PTE entries to do so?

And only collapse to PMD size if we have sufficient adjacent PTE entries that
are populated?

Let's really nail this down actually so we can be super clear what the issue is
here.


>
> Creep only happens if you wouldn't collapse a PMD without prior mTHP
> collapse, but suddenly would in the same scenario simply because you had
> prior mTHP collapse.
>
> At least that's my understanding.

OK, that makes sense, is the logic (this may be part of the bit I haven't
reviewed yet tbh) then that for khugepaged mTHP we have the system where we
always require prior mTHP collapse _first_?

>
> >
> > > max_ptes_none == 0 -> collapse mTHP only if all non-none/zero
> > >
> > > And for the intermediate values
> > >
> > > (1) pr_warn() when mTHPs are enabled, stating that mTHP collapse is not
> > > supported yet with other values
> >
> > It feels a bit much to issue a kernel warning every time somebody twiddles that
> > value, and it's kind of against user expectation a bit.
>
> pr_warn_once() is what I meant.

Right, but even then it feels a bit extreme, warnings are pretty serious
things. Then again there's precedent for this, and it may be the least worst
solution.

I just picture a cloud provider turning this on with mTHP then getting their
monitoring team reporting some urgent communication about warnings in dmesg :)

>
> >
> > But maybe it's the least worst way of communicating things. It's still
> > absolutely gross.
> >
> > > (2) treat it like max_ptes_none == 0 or (maybe better?) just disable mTHP
> > > collapse
> >
> > Yeah disabling mTHP collapse for these values seems sane, but it also seems that
> > we should be capping for this to work correctly no?
>
> I didn't get the interaction with capping, can you elaborate?

I think that's addressed in the discussion above, once we clarify the creep
thing then the rest should fall out.

>
> >
> > Also I think all this probably violates requirements of users who want to have
> > different behaviour for mTHP and PMD THP.
> >
> > The default is 511 so we're in creep territory even with the damn default :)
>
> I don't think so, but maybe I am wrong.

Discussed above.

>
>
> --
> Cheers
>
> David / dhildenb
>

Thanks, Lorenzo

Re: [PATCH v12 mm-new 06/15] khugepaged: introduce collapse_max_ptes_none helper function

Posted by David Hildenbrand 3 months, 1 week ago

>>
>> No creep, because you'll always collapse.
> 
> OK so in the 511 scenario, do we simply immediately collapse to the largest
> possible _mTHP_ page size if based on adjacent none/zero page entries in the
> PTE, and _never_ collapse to PMD on this basis even if we do have sufficient
> none/zero PTE entries to do so?

Right. And if we fail to allocate a PMD, we would collapse to smaller 
sizes, and later, once a PMD is possible, collapse to a PMD.

But there is no creep, as we would have collapsed a PMD right from the 
start either way.

> 
> And only collapse to PMD size if we have sufficient adjacent PTE entries that
> are populated?
> 
> Let's really nail this down actually so we can be super clear what the issue is
> here.
> 

I hope what I wrote above made sense.

> 
>>
>> Creep only happens if you wouldn't collapse a PMD without prior mTHP
>> collapse, but suddenly would in the same scenario simply because you had
>> prior mTHP collapse.
>>
>> At least that's my understanding.
> 
> OK, that makes sense, is the logic (this may be part of the bit I haven't
> reviewed yet tbh) then that for khugepaged mTHP we have the system where we
> always require prior mTHP collapse _first_?

So I would describe creep as

"we would not collapse a PMD THP because max_ptes_none is violated, but 
because we collapsed smaller mTHP THPs before, we essentially suddenly 
have more PTEs that are not none-or-zero, making us suddenly collapse a 
PMD THP at the same place".

Assume the following: max_ptes_none = 256

This means we would only collapse if at most half (256/512) of the PTEs 
are none-or-zero.

But imagine the (simplified) PTE layout with PMD = 8 entries to simplify:

[ P Z P Z P Z Z Z ]

3 Present vs. 5 Zero -> do not collapse a PMD (8)

But assume we collapse smaller mTHP (2 entries) first

[ P P P P P P Z Z ]

We collapsed 3x "P Z" into "P P" because the ratio allowed for it.

Suddenly we have

6 Present vs 2 Zero and we collapse a PMD (8)

[ P P P P P P P P ]

That's the "creep" problem.
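
The 8-entry example above can be replayed with a small userspace model. This is only a sketch under simplifying assumptions: the TOY_* constants and toy_* helpers are hypothetical names, the per-order limit is taken to be max_none shifted down by the order difference (as in the patch description), and a range is eligible whenever its none/zero count is non-zero and within that limit.

```c
#include <stdbool.h>

#define TOY_PMD_ORDER 3u                    /* toy PMD: 2^3 = 8 entries */
#define TOY_PMD_NR    (1u << TOY_PMD_ORDER)

/* Count none/zero entries in [start, start + n). */
static unsigned int toy_count_none(const bool *present, unsigned int start,
				   unsigned int n)
{
	unsigned int none = 0;

	for (unsigned int i = start; i < start + n; i++)
		none += !present[i];
	return none;
}

/* True if the aligned block at @start may be collapsed to @order. */
static bool toy_may_collapse(const bool *present, unsigned int start,
			     unsigned int order, unsigned int max_none)
{
	unsigned int limit = max_none >> (TOY_PMD_ORDER - order);
	unsigned int none = toy_count_none(present, start, 1u << order);

	return none > 0 && none <= limit;
}

/* Collapse every eligible aligned block of the given order. */
static void toy_collapse_all(bool *present, unsigned int order,
			     unsigned int max_none)
{
	unsigned int n = 1u << order;

	for (unsigned int start = 0; start < TOY_PMD_NR; start += n) {
		if (!toy_may_collapse(present, start, order, max_none))
			continue;
		for (unsigned int i = start; i < start + n; i++)
			present[i] = true;
	}
}

/* True if the PMD collapse only becomes possible after the mTHP pass. */
static bool toy_example_creeps(void)
{
	/* [ P Z P Z P Z Z Z ] */
	bool present[TOY_PMD_NR] = { true, false, true, false,
				     true, false, false, false };
	const unsigned int max_none = TOY_PMD_NR / 2; /* analogue of 256/512 */
	bool before, after;

	before = toy_may_collapse(present, 0, TOY_PMD_ORDER, max_none);
	toy_collapse_all(present, 1, max_none);	/* 2-entry mTHP pass */
	after = toy_may_collapse(present, 0, TOY_PMD_ORDER, max_none);

	return !before && after;
}
```

In this model the initial layout has 5 none/zero entries against a limit of 4, so the PMD check fails; the 2-entry pass fills the three "P Z" pairs (limit 1 each) and leaves "Z Z" alone, after which only 2 none/zero entries remain and the PMD check passes.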

> 
>>
>>>
>>>> max_ptes_none == 0 -> collapse mTHP only if all non-none/zero
>>>>
>>>> And for the intermediate values
>>>>
>>>> (1) pr_warn() when mTHPs are enabled, stating that mTHP collapse is not
>>>> supported yet with other values
>>>
>>> It feels a bit much to issue a kernel warning every time somebody twiddles that
>>> value, and it's kind of against user expectation a bit.
>>
>> pr_warn_once() is what I meant.
> 
> Right, but even then it feels a bit extreme, warnings are pretty serious
> things. Then again there's precedent for this, and it may be the least worst
> solution.
> 
> I just picture a cloud provider turning this on with mTHP then getting their
> monitoring team reporting some urgent communication about warnings in dmesg :)

I mean, one could make the states mutually exclusive, maybe?

Disallow enabling mTHP with max_ptes_none set to unsupported values and 
the other way around.

That would probably be cleanest, although the implementation might get a 
bit more involved (but it's solvable).

But the concern could be that there are configs that could suddenly 
break: someone that set max_ptes_none and enabled mTHP.


I'll note that we could also consider only supporting "max_ptes_none = 
511" (default) to start with.

The nice thing about that value is that it is fully supported with the 
underused shrinker, because max_ptes_none=511 -> never shrink.

-- 
Cheers

David / dhildenb

Re: [PATCH v12 mm-new 06/15] khugepaged: introduce collapse_max_ptes_none helper function

Posted by Nico Pache 3 months, 1 week ago

On Wed, Oct 29, 2025 at 9:04 AM David Hildenbrand <david@redhat.com> wrote:
>
> >>
> >> No creep, because you'll always collapse.
> >
> > OK so in the 511 scenario, do we simply immediately collapse to the largest
> > possible _mTHP_ page size if based on adjacent none/zero page entries in the
> > PTE, and _never_ collapse to PMD on this basis even if we do have sufficient
> > none/zero PTE entries to do so?
>
> Right. And if we fail to allocate a PMD, we would collapse to smaller
> sizes, and later, once a PMD is possible, collapse to a PMD.
>
> But there is no creep, as we would have collapsed a PMD right from the
> start either way.
>
> >
> > And only collapse to PMD size if we have sufficient adjacent PTE entries that
> > are populated?
> >
> > Let's really nail this down actually so we can be super clear what the issue is
> > here.
> >
>
> I hope what I wrote above made sense.
>
> >
> >>
> >> Creep only happens if you wouldn't collapse a PMD without prior mTHP
> >> collapse, but suddenly would in the same scenario simply because you had
> >> prior mTHP collapse.
> >>
> >> At least that's my understanding.
> >
> > OK, that makes sense, is the logic (this may be part of the bit I haven't
> > reviewed yet tbh) then that for khugepaged mTHP we have the system where we
> > always require prior mTHP collapse _first_?
>
> So I would describe creep as
>
> "we would not collapse a PMD THP because max_ptes_none is violated, but
> because we collapsed smaller mTHP THPs before, we essentially suddenly
> have more PTEs that are not none-or-zero, making us suddenly collapse a
> PMD THP at the same place".
>
> Assume the following: max_ptes_none = 256
>
> This means we would only collapse if at most half (256/512) of the PTEs
> are none-or-zero.
>
> But imagine the (simplified) PTE layout with PMD = 8 entries to simplify:
>
> [ P Z P Z P Z Z Z ]
>
> 3 Present vs. 5 Zero -> do not collapse a PMD (8)
>
> But assume we collapse smaller mTHP (2 entries) first
>
> [ P P P P P P Z Z ]
>
> We collapsed 3x "P Z" into "P P" because the ratio allowed for it.
>
> Suddenly we have
>
> 6 Present vs 2 Zero and we collapse a PMD (8)
>
> [ P P P P P P P P ]
>
> That's the "creep" problem.

I'd like to add a little to this,

The worst-case scenario is all mTHP sizes enabled and a value of 256.
A 16kB collapse would then lead all the way up to a PMD collapse,
stopping to collapse at each mTHP level on each subsequent scan of the
same PMD range. The larger the max_ptes_none value is, the fewer "stops"
it will make before reaching PMD size, but it will ultimately creep
up to a PMD. Hence the cap. At 511, a single PTE in a range will
always satisfy the PMD collapse, so we will never attempt any other
orders (other than in the case of the collapse failing, which David
explains above).

Hopefully that helps give some more insight to the creep problem.
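
To make the ladder concrete, here is a small userspace sketch (not kernel
code). It assumes the uncapped scaling max_ptes_none >> (HPAGE_PMD_ORDER -
order) discussed in this thread, a 4k page size (HPAGE_PMD_ORDER = 9), and a
simplified model in which one naturally aligned block is fully populated
after a collapse while the rest of the PMD range is empty. The names
scaled_max_ptes_none() and creep_final_order() are made up for illustration.

```c
#include <assert.h>

#define PMD_ORDER 9 /* assumption: 4k pages, 512 PTEs per PMD */

/* Uncapped scaling: shift the PMD-level limit down to the target order. */
static unsigned int scaled_max_ptes_none(unsigned int max_ptes_none,
					 unsigned int order)
{
	assert(order <= PMD_ORDER);
	return max_ptes_none >> (PMD_ORDER - order);
}

/*
 * Start with one fully populated block of 'start_order' and nothing else
 * populated in the PMD range. On each later scan, try the next order up:
 * the enclosing block then has exactly half of its PTEs empty. Return the
 * order at which the climb stops.
 */
static unsigned int creep_final_order(unsigned int max_ptes_none,
				      unsigned int start_order)
{
	unsigned int order = start_order;

	while (order < PMD_ORDER) {
		unsigned int none = 1u << order; /* empty half of next block */

		if (none > scaled_max_ptes_none(max_ptes_none, order + 1))
			break; /* too many empty PTEs, no further collapse */
		order++; /* collapse succeeds, block is now fully populated */
	}
	return order;
}
```

In this model, creep_final_order(256, 2) climbs from a 16kB block to
order 9, one order per scan, while creep_final_order(255, 2) stays at
order 2.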

Cheers
-- Nico

>
> >
> >>
> >>>
> >>>> max_ptes_none == 0 -> collapse mTHP only if all non-none/zero
> >>>>
> >>>> And for the intermediate values
> >>>>
> >>>> (1) pr_warn() when mTHPs are enabled, stating that mTHP collapse is not
> >>>> supported yet with other values
> >>>
> >>> It feels a bit much to issue a kernel warning every time somebody twiddles that
> >>> value, and it's kind of against user expectation a bit.
> >>
> >> pr_warn_once() is what I meant.
> >
> > Right, but even then it feels a bit extreme, warnings are pretty serious
> > things. Then again there's precedent for this, and it may be the least worse
> > solution.
> >
> > I just picture a cloud provider turning this on with mTHP then getting their
> > monitoring team reporting some urgent communication about warnings in dmesg :)
>
> I mean, one could make the states mutually exclusive, maybe?
>
> Disallow enabling mTHP with max_ptes_none set to unsupported values and
> the other way around.
>
> That would probably be cleanest, although the implementation might get a
> bit more involved (but it's solvable).
>
> But the concern could be that there are configs that could suddenly
> break: someone that set max_ptes_none and enabled mTHP.
>
>
> I'll note that we could also consider only supporting "max_ptes_none =
> 511" (default) to start with.
>
> The nice thing about that value is that it is fully supported with the
> underused shrinker, because max_ptes_none=511 -> never shrink.
>
> --
> Cheers
>
> David / dhildenb
>

Re: [PATCH v12 mm-new 06/15] khugepaged: introduce collapse_max_ptes_none helper function

Posted by Lorenzo Stoakes 3 months, 1 week ago

On Wed, Oct 29, 2025 at 04:04:06PM +0100, David Hildenbrand wrote:
> > >
> > > No creep, because you'll always collapse.
> >
> > OK so in the 511 scenario, do we simply immediately collapse to the largest
> > possible _mTHP_ page size if based on adjacent none/zero page entries in the
> > PTE, and _never_ collapse to PMD on this basis even if we do have sufficient
> > none/zero PTE entries to do so?
>
> Right. And if we fail to allocate a PMD, we would collapse to smaller sizes,
> and later, once a PMD is possible, collapse to a PMD.
>
> But there is no creep, as we would have collapsed a PMD right from the start
> either way.

Hmm, would this mean at 511 mTHP collapse _across zero entries_ would only
ever collapse to PMD, except in cases where, for instance, PTE entries
belong to distinct VMAs and so you have to collapse to mTHP as a result?

Or IOW 'always collapse to the largest size you can I don't care if it
takes up more memory'

And at 0, we'd never collapse anything across zero entries, and only when
adjacent present entries can be collapsed to mTHP/PMD do we do so?

>
> >
> > And only collapse to PMD size if we have sufficient adjacent PTE entries that
> > are populated?
> >
> > Let's really nail this down actually so we can be super clear what the issue is
> > here.
> >
>
> I hope what I wrote above made sense.

Asking some q's still, probably more a me thing :)

>
> >
> > >
> > > Creep only happens if you wouldn't collapse a PMD without prior mTHP
> > > collapse, but suddenly would in the same scenario simply because you had
> > > prior mTHP collapse.
> > >
> > > At least that's my understanding.
> >
> > OK, that makes sense, is the logic (this may be part of the bit I haven't
> > reviewed yet tbh) then that for khugepaged mTHP we have the system where we
> > always require prior mTHP collapse _first_?
>
> So I would describe creep as
>
> "we would not collapse a PMD THP because max_ptes_none is violated, but
> because we collapsed smaller mTHP THPs before, we essentially suddenly have
> more PTEs that are not none-or-zero, making us suddenly collapse a PMD THP
> at the same place".

Yeah that makes sense.

>
> Assume the following: max_ptes_none = 256
>
> This means we would only collapse if at most half (256/512) of the PTEs are
> none-or-zero.
>
> But imagine the (simplified) PTE layout with PMD = 8 entries to simplify:
>
> [ P Z P Z P Z Z Z ]
>
> 3 Present vs. 5 Zero -> do not collapse a PMD (8)

OK I'm thinking this is more about /ratio/ than anything else.

PMD - <=50% - ok 5/8 = 62.5% no collapse.

>
> But assume we collapse smaller mTHP (2 entries) first
>
> [ P P P P P P Z Z ]

...512 KB mTHP (2 entries) - <= 50% means we can do...

>
> We collapsed 3x "P Z" into "P P" because the ratio allowed for it.

Yes so that's:

[ P Z P Z P Z Z Z ]

->

[ P P P P P P Z Z ]

Right?

>
> Suddenly we have
>
> 6 Present vs 2 Zero and we collapse a PMD (8)
>
> [ P P P P P P P P ]
>
> That's the "creep" problem.

I guess we try PMD collapse first then mTHP, but the worry is another pass
will collapse to PMD right?


Whereas < 50% ratio means we never end up 'propagating' or 'creeping' like
this because each collapse never provides enough reduction in zero entries
to allow for higher order collapse.

Hence the idea of capping at 255

>
> >
> > >
> > > >
> > > > > max_ptes_none == 0 -> collapse mTHP only if all non-none/zero
> > > > >
> > > > > And for the intermediate values
> > > > >
> > > > > (1) pr_warn() when mTHPs are enabled, stating that mTHP collapse is not
> > > > > supported yet with other values
> > > >
> > > > It feels a bit much to issue a kernel warning every time somebody twiddles that
> > > > value, and it's kind of against user expectation a bit.
> > >
> > > pr_warn_once() is what I meant.
> >
> > Right, but even then it feels a bit extreme, warnings are pretty serious
> > things. Then again there's precedent for this, and it may be the least worse
> > solution.
> >
> > I just picture a cloud provider turning this on with mTHP then getting their
> > monitoring team reporting some urgent communication about warnings in dmesg :)
>
> I mean, one could make the states mutually exclusive, maybe?
>
> Disallow enabling mTHP with max_ptes_none set to unsupported values and the
> other way around.
>
> That would probably be cleanest, although the implementation might get a bit
> more involved (but it's solvable).
>
> But the concern could be that there are configs that could suddenly break:
> someone that set max_ptes_none and enabled mTHP.

Yeah we could always return an error on setting to an unsupported value.

I mean pr_warn() is nasty but maybe necessary.

>
>
> I'll note that we could also consider only supporting "max_ptes_none = 511"
> (default) to start with.
>
> The nice thing about that value is that it is fully supported with the
> underused shrinker, because max_ptes_none=511 -> never shrink.

It feels like = 0 would be useful though?

>
> --
> Cheers
>
> David / dhildenb
>

Thanks, Lorenzo

Re: [PATCH v12 mm-new 06/15] khugepaged: introduce collapse_max_ptes_none helper function

Posted by Nico Pache 3 months, 1 week ago

On Wed, Oct 29, 2025 at 12:42 PM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Wed, Oct 29, 2025 at 04:04:06PM +0100, David Hildenbrand wrote:
> > > >
> > > > No creep, because you'll always collapse.
> > >
> > > OK so in the 511 scenario, do we simply immediately collapse to the largest
> > > possible _mTHP_ page size if based on adjacent none/zero page entries in the
> > > PTE, and _never_ collapse to PMD on this basis even if we do have sufficient
> > > none/zero PTE entries to do so?
> >
> > Right. And if we fail to allocate a PMD, we would collapse to smaller sizes,
> > and later, once a PMD is possible, collapse to a PMD.
> >
> > But there is no creep, as we would have collapsed a PMD right from the start
> > either way.
>
> Hmm, would this mean at 511 mTHP collapse _across zero entries_ would only
> ever collapse to PMD, except in cases where, for instance, PTE entries
> belong to distinct VMAs and so you have to collapse to mTHP as a result?

There are a few failure cases, like exceeding thresholds or
allocation failures, but yes, your assessment is correct.

At 511, the PMD collapse will be satisfied by a single PTE. If the
collapse fails we will try both halves of the PMD (1024kB, 1024kB).
The one that contains the non-none PTE will collapse.

This is where the (HPAGE_PMD_ORDER - order) comes from.
Imagine the 511 case above:
511 >> (HPAGE_PMD_ORDER - 9) == 511 >> 0 == 511 max_ptes_none
511 >> (HPAGE_PMD_ORDER - 8) == 511 >> 1 == 255 max_ptes_none (1024kB)

Both of these align to the order's size minus 1.
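
The same shift can be tabulated for every order with a short userspace
sketch (not kernel code). It assumes a 4k page size (HPAGE_PMD_ORDER = 9);
shift_to_order() and print_limits() are illustrative names, not functions
from the series.

```c
#include <assert.h>
#include <stdio.h>

#define PMD_ORDER 9 /* assumption: 4k pages */

/* Scale a PMD-level max_ptes_none down to a smaller collapse order. */
static unsigned int shift_to_order(unsigned int max_ptes_none,
				   unsigned int order)
{
	assert(order <= PMD_ORDER);
	return max_ptes_none >> (PMD_ORDER - order);
}

/* Print the per-order limit next to the number of PTEs in that order. */
static void print_limits(unsigned int max_ptes_none)
{
	for (unsigned int order = PMD_ORDER; order >= 2; order--)
		printf("order %u: %3u PTEs, max_ptes_none %3u\n",
		       order, 1u << order,
		       shift_to_order(max_ptes_none, order));
}
```

Because 511 is 2^9 - 1, shifting it right by (9 - order) always yields
2^order - 1, so print_limits(511) shows the limit one below the block size
at every order.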

>
> Or IOW 'always collapse to the largest size you can I don't care if it
> takes up more memory'
>
> And at 0, we'd never collapse anything across zero entries, and only when
> adjacent present entries can be collapse to mTHP/PMD do we do so?

Yep!

max_ptes_none = 0 + all mTHP sizes enabled gives you a really good
distribution of mTHP sizes in the system, as zero memory will be
wasted and the optimal size (space-wise) will be found. At least
for the memory allocated through khugepaged. The Defer patchset I had
on top of this series was exactly for that purpose: allow khugepaged
to determine all the THP usage in the system (other than madvise), and
allow granular control of memory waste.

>
> >
> > >
> > > And only collapse to PMD size if we have sufficient adjacent PTE entries that
> > > are populated?
> > >
> > > Let's really nail this down actually so we can be super clear what the issue is
> > > here.
> > >
> >
> > I hope what I wrote above made sense.
>
> Asking some q's still, probably more a me thing :)
>
> >
> > >
> > > >
> > > > Creep only happens if you wouldn't collapse a PMD without prior mTHP
> > > > collapse, but suddenly would in the same scenario simply because you had
> > > > prior mTHP collapse.
> > > >
> > > > At least that's my understanding.
> > >
> > > OK, that makes sense, is the logic (this may be part of the bit I haven't
> > > reviewed yet tbh) then that for khugepaged mTHP we have the system where we
> > > always require prior mTHP collapse _first_?
> >
> > So I would describe creep as
> >
> > "we would not collapse a PMD THP because max_ptes_none is violated, but
> > because we collapsed smaller mTHP THPs before, we essentially suddenly have
> > more PTEs that are not none-or-zero, making us suddenly collapse a PMD THP
> > at the same place".
>
> Yeah that makes sense.
>
> >
> > Assume the following: max_ptes_none = 256
> >
> > This means we would only collapse if at most half (256/512) of the PTEs are
> > none-or-zero.
> >
> > But imagine the (simplified) PTE layout with PMD = 8 entries to simplify:
> >
> > [ P Z P Z P Z Z Z ]
> >
> > 3 Present vs. 5 Zero -> do not collapse a PMD (8)
>
> OK I'm thinking this is more about /ratio/ than anything else.
>
> PMD - <=50% - ok 5/8 = 62.5% no collapse.

                < 50%*.

At 50% it's 256 which is actually the worst case scenario. But I read
further, and it seems like you grasped the issue.

>
> >
> > But assume we collapse smaller mTHP (2 entries) first
> >
> > [ P P P P P P Z Z ]
>
> ...512 KB mTHP (2 entries) - <= 50% means we can do...
>
> >
> > We collapsed 3x "P Z" into "P P" because the ratio allowed for it.
>
> Yes so that's:
>
> [ P Z P Z P Z Z Z ]
>
> ->
>
> [ P P P P P P Z Z ]
>
> Right?
>
> >
> > Suddenly we have
> >
> > 6 Present vs 2 Zero and we collapse a PMD (8)
> >
> > [ P P P P P P P P ]
> >
> > That's the "creep" problem.
>
> I guess we try PMD collapse first then mTHP, but the worry is another pass
> will collapse to PMD right?
>
>
> Whereas < 50% ratio means we never end up 'propagating' or 'creeping' like
> this because each collapse never provides enough reduction in zero entries
> to allow for higher order collapse.
>
> Hence the idea of capping at 255

Yep! We've discussed other solutions, like tracking collapsed pages,
or the solutions brought up by David. But this seemed like the most
logical to me, as it keeps some of the tunability. I now understand
the concern wasn't so much the capping, but rather the silent nature of
it, and the uAPI expectations surrounding enforcing such a limit (for
both past and future behavioral expectations).

>
> >
> > >
> > > >
> > > > >
> > > > > > max_ptes_none == 0 -> collapse mTHP only if all non-none/zero
> > > > > >
> > > > > > And for the intermediate values
> > > > > >
> > > > > > (1) pr_warn() when mTHPs are enabled, stating that mTHP collapse is not
> > > > > > supported yet with other values
> > > > >
> > > > > It feels a bit much to issue a kernel warning every time somebody twiddles that
> > > > > value, and it's kind of against user expectation a bit.
> > > >
> > > > pr_warn_once() is what I meant.
> > >
> > > Right, but even then it feels a bit extreme, warnings are pretty serious
> > > things. Then again there's precedent for this, and it may be the least worse
> > > solution.
> > >
> > > I just picture a cloud provider turning this on with mTHP then getting their
> > > monitoring team reporting some urgent communication about warnings in dmesg :)
> >
> > I mean, one could make the states mutually exclusive, maybe?
> >
> > Disallow enabling mTHP with max_ptes_none set to unsupported values and the
> > other way around.
> >
> > That would probably be cleanest, although the implementation might get a bit
> > more involved (but it's solvable).
> >
> > But the concern could be that there are configs that could suddenly break:
> > someone that set max_ptes_none and enabled mTHP.
>
> Yeah we could always return an error on setting to an unsupported value.
>
> I mean pr_warn() is nasty but maybe necessary.
>
> >
> >
> > I'll note that we could also consider only supporting "max_ptes_none = 511"
> > (default) to start with.
> >
> > The nice thing about that value is that it is fully supported with the
> > underused shrinker, because max_ptes_none=511 -> never shrink.
>
> It feels like = 0 would be useful though?

I personally think the default of 511 is wrong and should be on the
lower end of the scale. The exception being thp=always, where I
believe the kernel should treat it as 511.

But the second part of that would also violate the user's max_ptes_none
setting, so it's probably much harder in practice, and also not really
part of this series, just my opinion.

Cheers.
-- Nico

>
> >
> > --
> > Cheers
> >
> > David / dhildenb
> >
>
> Thanks, Lorenzo
>

Re: [PATCH v12 mm-new 06/15] khugepaged: introduce collapse_max_ptes_none helper function

Posted by Lorenzo Stoakes 3 months, 1 week ago

On Wed, Oct 29, 2025 at 03:10:19PM -0600, Nico Pache wrote:
> On Wed, Oct 29, 2025 at 12:42 PM Lorenzo Stoakes
> <lorenzo.stoakes@oracle.com> wrote:
> >
> > On Wed, Oct 29, 2025 at 04:04:06PM +0100, David Hildenbrand wrote:
> > > > >
> > > > > No creep, because you'll always collapse.
> > > >
> > > > OK so in the 511 scenario, do we simply immediately collapse to the largest
> > > > possible _mTHP_ page size if based on adjacent none/zero page entries in the
> > > > PTE, and _never_ collapse to PMD on this basis even if we do have sufficient
> > > > none/zero PTE entries to do so?
> > >
> > > Right. And if we fail to allocate a PMD, we would collapse to smaller sizes,
> > > and later, once a PMD is possible, collapse to a PMD.
> > >
> > > But there is no creep, as we would have collapsed a PMD right from the start
> > > either way.
> >
> > Hmm, would this mean at 511 mTHP collapse _across zero entries_ would only
> > ever collapse to PMD, except in cases where, for instance, PTE entries
> > belong to distinct VMAs and so you have to collapse to mTHP as a result?
>
> There are a few failure cases, like exceeding thresholds, or
> allocations failures, but yes your assessment is correct.

Yeah of course being mm there are thorny edge cases :) we do love those...

>
> At 511, the PMD collapse will be satisfied by a single PTE. If the
> collapse fails we will try both sides of the PMD (1024kb , 1024kb).
> the one that contains the non-none PTE will collapse

Right yes.

>
> This is where the (HPAGE_PMD_ORDER - order) comes from.
> imagine the 511 case above
> 511 >> HPAGE_PMD_ORDER - 9 == 511 >> 0 = 511 max ptes none
> 511 >> PMD_ORDER - 8 (1024kb) == 511 >> 1 = 255 max_ptes_none
>
> both of these align to the orders size minus 1.

Right.

>
> >
> > Or IOW 'always collapse to the largest size you can I don't care if it
> > takes up more memory'
> >
> > And at 0, we'd never collapse anything across zero entries, and only when
> > adjacent present entries can be collapse to mTHP/PMD do we do so?
>
> Yep!
>
> max_ptes_none = 0 + all mTHP sizes enabled gives you a really good
> distribution of mTHP sizes in the system, as zero memory will be
> wasted and the optimal size (space-wise) will be found. At least
> for the memory allocated through khugepaged. The Defer patchset I had
> on top of this series was exactly for that purpose: allow khugepaged
> to determine all the THP usage in the system (other than madvise), and
> allow granular control of memory waste.

Yeah, well it's a trade off really isn't it on 'eagerness' to collapse
non-present entries :)

But we'll come back to that when David has time :)

>
> >
> > >
> > > >
> > > > And only collapse to PMD size if we have sufficient adjacent PTE entries that
> > > > are populated?
> > > >
> > > > Let's really nail this down actually so we can be super clear what the issue is
> > > > here.
> > > >
> > >
> > > I hope what I wrote above made sense.
> >
> > Asking some q's still, probably more a me thing :)
> >
> > >
> > > >
> > > > >
> > > > > Creep only happens if you wouldn't collapse a PMD without prior mTHP
> > > > > collapse, but suddenly would in the same scenario simply because you had
> > > > > prior mTHP collapse.
> > > > >
> > > > > At least that's my understanding.
> > > >
> > > > OK, that makes sense, is the logic (this may be part of the bit I haven't
> > > > reviewed yet tbh) then that for khugepaged mTHP we have the system where we
> > > > always require prior mTHP collapse _first_?
> > >
> > > So I would describe creep as
> > >
> > > "we would not collapse a PMD THP because max_ptes_none is violated, but
> > > because we collapsed smaller mTHP THPs before, we essentially suddenly have
> > > more PTEs that are not none-or-zero, making us suddenly collapse a PMD THP
> > > at the same place".
> >
> > Yeah that makes sense.
> >
> > >
> > > Assume the following: max_ptes_none = 256
> > >
> > > This means we would only collapse if at most half (256/512) of the PTEs are
> > > none-or-zero.
> > >
> > > But imagine the (simplified) PTE layout with PMD = 8 entries to simplify:
> > >
> > > [ P Z P Z P Z Z Z ]
> > >
> > > 3 Present vs. 5 Zero -> do not collapse a PMD (8)
> >
> > OK I'm thinking this is more about /ratio/ than anything else.
> >
> > PMD - <=50% - ok 5/8 = 62.5% no collapse.
>
>                 < 50%*.
>
> At 50% it's 256 which is actually the worst case scenario. But I read
> further, and it seems like you grasped the issue.

Yeah this is < 50% vs. <= 50% which are fundamentally different obviously :)

>
> >
> > >
> > > But assume we collapse smaller mTHP (2 entries) first
> > >
> > > [ P P P P P P Z Z ]
> >
> > ...512 KB mTHP (2 entries) - <= 50% means we can do...
> >
> > >
> > > We collapsed 3x "P Z" into "P P" because the ratio allowed for it.
> >
> > Yes so that's:
> >
> > [ P Z P Z P Z Z Z ]
> >
> > ->
> >
> > [ P P P P P P Z Z ]
> >
> > Right?
> >
> > >
> > > Suddenly we have
> > >
> > > 6 Present vs 2 Zero and we collapse a PMD (8)
> > >
> > > [ P P P P P P P P ]
> > >
> > > That's the "creep" problem.
> >
> > I guess we try PMD collapse first then mTHP, but the worry is another pass
> > will collapse to PMD right?
> >
> >
> > Whereas < 50% ratio means we never end up 'propagating' or 'creeping' like
> > this because each collapse never provides enough reduction in zero entries
> > to allow for higher order collapse.
> >
> > Hence the idea of capping at 255
>
> Yep! We've discussed other solutions, like tracking collapsed pages,
> or the solutions brought up by David. But this seemed like the most
> logical to me, as it keeps some of the tunability. I now understand
> the concern wasn't so much the capping, but rather the silent nature of
> it, and the uAPI expectations surrounding enforcing such a limit (for
> both past and future behavioral expectations).

Yes, that's the primary concern on my side.

>
> >
> > >
> > > >
> > > > >
> > > > > >
> > > > > > > max_ptes_none == 0 -> collapse mTHP only if all non-none/zero
> > > > > > >
> > > > > > > And for the intermediate values
> > > > > > >
> > > > > > > (1) pr_warn() when mTHPs are enabled, stating that mTHP collapse is not
> > > > > > > supported yet with other values
> > > > > >
> > > > > > It feels a bit much to issue a kernel warning every time somebody twiddles that
> > > > > > value, and it's kind of against user expectation a bit.
> > > > >
> > > > > pr_warn_once() is what I meant.
> > > >
> > > > Right, but even then it feels a bit extreme, warnings are pretty serious
> > > > things. Then again there's precedent for this, and it may be the least worse
> > > > solution.
> > > >
> > > > I just picture a cloud provider turning this on with mTHP then getting their
> > > > monitoring team reporting some urgent communication about warnings in dmesg :)
> > >
> > > I mean, one could make the states mutually exclusive, maybe?
> > >
> > > Disallow enabling mTHP with max_ptes_none set to unsupported values and the
> > > other way around.
> > >
> > > That would probably be cleanest, although the implementation might get a bit
> > > more involved (but it's solvable).
> > >
> > > But the concern could be that there are configs that could suddenly break:
> > > someone that set max_ptes_none and enabled mTHP.
> >
> > Yeah we could always return an error on setting to an unsupported value.
> >
> > I mean pr_warn() is nasty but maybe necessary.
> >
> > >
> > >
> > > I'll note that we could also consider only supporting "max_ptes_none = 511"
> > > (default) to start with.
> > >
> > > The nice thing about that value is that it is fully supported with the
> > > underused shrinker, because max_ptes_none=511 -> never shrink.
> >
> > It feels like = 0 would be useful though?
>
> I personally think the default of 511 is wrong and should be on the
> lower end of the scale. The exception being thp=always, where I
> believe the kernel should treat it as 511.

I think that'd be confusing to have different behaviour for thp=always, and I'd
rather we didn't do that.

But ultimately it's all moot I think as these are all uAPI things now.

It was a mistake to even export this IMO, but that can't be helped now :)

>
> But the second part of that would also violate the users max_ptes_none
> setting, so it's probably much harder in practice, and also not really
> part of this series, just my opinion.

I'm confused what you mean here?

In any case I think the 511/0 solution is the way forwards.

>
> Cheers.
> -- Nico
>
> >
> > >
> > > --
> > > Cheers
> > >
> > > David / dhildenb
> > >
> >
> > Thanks, Lorenzo
> >
>

Cheers, Lorenzo

Re: [PATCH v12 mm-new 06/15] khugepaged: introduce collapse_max_ptes_none helper function

Posted by Nico Pache 3 months, 1 week ago

On Tue, Oct 28, 2025 at 4:10 AM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
>
>
> On 2025/10/28 01:53, Lorenzo Stoakes wrote:
> > On Wed, Oct 22, 2025 at 12:37:08PM -0600, Nico Pache wrote:
> >> The current mechanism for determining mTHP collapse scales the
> >> khugepaged_max_ptes_none value based on the target order. This
> >> introduces an undesirable feedback loop, or "creep", when max_ptes_none
> >> is set to a value greater than HPAGE_PMD_NR / 2.
> >>
> >> With this configuration, a successful collapse to order N will populate
> >> enough pages to satisfy the collapse condition on order N+1 on the next
> >> scan. This leads to unnecessary work and memory churn.
> >>
> >> To fix this issue introduce a helper function that caps the max_ptes_none
> >> to HPAGE_PMD_NR / 2 - 1 (255 on 4k page size). The function also scales
> >> the max_ptes_none number by the (PMD_ORDER - target collapse order).
> >>
> >> The limits can be ignored by passing full_scan=true, this is useful for
> >> madvise_collapse (which ignores limits), or in the case of
> >> collapse_scan_pmd(), allows the full PMD to be scanned when mTHP
> >> collapse is available.
> >>
> >> Signed-off-by: Nico Pache <npache@redhat.com>
> >> ---
> >>   mm/khugepaged.c | 35 ++++++++++++++++++++++++++++++++++-
> >>   1 file changed, 34 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> >> index 4ccebf5dda97..286c3a7afdee 100644
> >> --- a/mm/khugepaged.c
> >> +++ b/mm/khugepaged.c
> >> @@ -459,6 +459,39 @@ void __khugepaged_enter(struct mm_struct *mm)
> >>              wake_up_interruptible(&khugepaged_wait);
> >>   }
> >>
> >> +/**
> >> + * collapse_max_ptes_none - Calculate maximum allowed empty PTEs for collapse
> >> + * @order: The folio order being collapsed to
> >> + * @full_scan: Whether this is a full scan (ignore limits)
> >> + *
> >> + * For madvise-triggered collapses (full_scan=true), all limits are bypassed
> >> + * and allow up to HPAGE_PMD_NR - 1 empty PTEs.
> >> + *
> >> + * For PMD-sized collapses (order == HPAGE_PMD_ORDER), use the configured
> >> + * khugepaged_max_ptes_none value.
> >> + *
> >> + * For mTHP collapses, scale down the max_ptes_none proportionally to the folio
> >> + * order, but caps it at HPAGE_PMD_NR/2-1 to prevent a collapse feedback loop.
> >> + *
> >> + * Return: Maximum number of empty PTEs allowed for the collapse operation
> >> + */
> >> +static unsigned int collapse_max_ptes_none(unsigned int order, bool full_scan)
> >> +{
> >> +    unsigned int max_ptes_none;
> >> +
> >> +    /* ignore max_ptes_none limits */
> >> +    if (full_scan)
> >> +            return HPAGE_PMD_NR - 1;
> >> +
> >> +    if (order == HPAGE_PMD_ORDER)
> >> +            return khugepaged_max_ptes_none;
> >> +
> >> +    max_ptes_none = min(khugepaged_max_ptes_none, HPAGE_PMD_NR/2 - 1);
> >
> > I mean not to beat a dead horse re: v11 commentary, but I thought we were going
> > to implement David's idea re: the new 'eagerness' tunable, and again we're now just
> > implementing the capping at HPAGE_PMD_NR/2 - 1 thing again?
> >
> > I'm still really quite uncomfortable with us silently capping this value.
> >
> > If we're putting forward theoretical ideas that are to be later built upon, this
> > series should be an RFC.
> >
> > But if we really intend to silently ignore user input the problem is that then
> > becomes established uAPI.
> >
> > I think it's _sensible_ to avoid this mTHP escalation problem, but the issue is
> > visibility I think.
> >
> > I think people are going to find it odd that you set it to something, but then
> > get something else.
> >
> > As an alternative we could have a new sysfs field:
> >
> > /sys/kernel/mm/transparent_hugepage/khugepaged/max_mthp_ptes_none
> >
> > That shows the cap clearly.
> >
> > In fact, it could be read-only... and just expose it to the user. That reduces
> > complexity.
> >
> > We can then bring in eagerness later and have the same situation of
> > max_ptes_none being a parameter that exists (plus this additional read-only
> > parameter).
>

Hey Baolin,

> We all know that ultimately using David's suggestion to add the
> 'eagerness' tunable parameter is the best approach, but for now, we need
> an initial version to support mTHP collapse (as we've already discussed
> extensively here:)).
>
> I don't like the idea of adding another and potentially confusing
> 'max_mthp_ptes_none' interface, which might make it more difficult to
> accommodate the 'eagerness' parameter in the future.
>
> If Nico's current proposal still doesn't satisfy everyone, I personally
> lean towards David's earlier simplified approach:
>         max_ptes_none == 511 -> collapse mTHP always
>         max_ptes_none != 511 -> collapse mTHP only if all PTEs are non-none/zero
>
> Let's first have an initial approach in place, which will also simplify
> the following addition of the 'eagerness' tunable parameter.
>
> Nico, Lorenzo, and David, what do you think?

I still believe capping it at PMD_NR/2 provides the right mix between
preventing the undesired behavior and keeping some degree of
tunability, which is how the admin guide suggests max_ptes_none should
be used. I would be willing to compromise and take this other approach
until the "eagerness" is in place. However, I do believe David's idea
for eagerness is to also cap the max_ptes_none at PMD_NR/2 for the
second-to-highest eagerness level (i.e., 511, 255, ...). So in practice,
we won't see any behavioral changes when that series comes around,
whereas setting max_ptes_none=0 for mTHP initially, then adding
eagerness, will result in a change in behavior from the initial
implementation.
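
For comparison, here are the two options side by side as a userspace sketch
(not the kernel code). It assumes a 4k page size and that the capped helper
finishes with a shift by (HPAGE_PMD_ORDER - order), as the commit message
describes. max_none_capped() and max_none_simple() are illustrative names
only.

```c
#include <assert.h>
#include <stdbool.h>

#define PMD_ORDER 9                 /* assumption: 4k pages */
#define PMD_NR    (1u << PMD_ORDER) /* 512 PTEs per PMD */

/* Capped variant: clamp to PMD_NR/2 - 1, then scale to the order. */
static unsigned int max_none_capped(unsigned int tunable, unsigned int order,
				    bool full_scan)
{
	unsigned int capped;

	assert(order <= PMD_ORDER);
	if (full_scan)
		return PMD_NR - 1;
	if (order == PMD_ORDER)
		return tunable;
	capped = tunable < PMD_NR / 2 - 1 ? tunable : PMD_NR / 2 - 1;
	return capped >> (PMD_ORDER - order);
}

/* Simplified variant: 511 scales to the order, anything else means 0. */
static unsigned int max_none_simple(unsigned int tunable, unsigned int order,
				    bool full_scan)
{
	assert(order <= PMD_ORDER);
	if (full_scan)
		return PMD_NR - 1;
	if (order == PMD_ORDER)
		return tunable;
	if (tunable == PMD_NR - 1)
		return tunable >> (PMD_ORDER - order);
	return 0;
}
```

For a 64kB collapse (order 4, 16 PTEs), a tunable of 511 gives 7 with the
capped variant and 15 with the simplified one; a tunable of 128 gives 4
and 0 respectively.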

With that said, Lorenzo, David, what's the final verdict?

-- Nico

>
> Code should be:
> static unsigned int collapse_max_ptes_none(unsigned int order, bool full_scan)
> {
>          unsigned int max_ptes_none;
>
>          /* ignore max_ptes_none limits */
>          if (full_scan)
>                  return HPAGE_PMD_NR - 1;
>
>          if (order == HPAGE_PMD_ORDER)
>                  return khugepaged_max_ptes_none;
>
>          /*
>           * For mTHP collapse, we can simplify the logic:
>           * max_ptes_none == 511 -> collapse mTHP always
>           * max_ptes_none != 511 -> collapse mTHP only if all PTEs are
>           *                         non-none/zero
>           */
>          if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
>                  return khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
>
>          return 0;
> }
Side note: Thank you Baolin for your review/testing of the V12 :)
>