[PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default

David Hildenbrand posted 1 patch 4 days, 11 hours ago
mm/huge_memory.c | 3 ---
1 file changed, 3 deletions(-)
[PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
Posted by David Hildenbrand 4 days, 11 hours ago
We added an early exit in thp_underused(), probably to avoid scanning
pages when there is no chance for success.

However, assume we have max_ptes_none = 511 (default).

Nothing should stop us from freeing all pages part of a THP that
is completely zero (512) and khugepaged will for sure not try to
instantiate a THP in that case (512 shared zeropages).

This can just trivially happen if someone writes a single 0 byte into a
PMD area, or of course, when data ends up being zero later.

So let's remove that early exit.

Do we want to CC stable? Hm, not sure. Probably not urgent.

Note that, as default, the THP shrinker is active
(/sys/kernel/mm/transparent_hugepage/shrink_underused = 1), and all
THPs are added to the deferred split lists. However, with the
max_ptes_none default we would never scan them. We would not do that. If
that's not desirable, we should just disable the shrinker as default,
also not adding all THPs to the deferred split lists.

Easy to reproduce:

1) Allocate some THPs filled with 0s

<prog.c>
 #include <string.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <unistd.h>
 #include <sys/mman.h>

 const size_t size = 1024*1024*1024;

 int main(void)
 {
         size_t offs;
         char *area;

         area = mmap(0, size, PROT_READ | PROT_WRITE,
                     MAP_ANON | MAP_PRIVATE, -1, 0);
         if (area == MAP_FAILED) {
                 printf("mmap failed\n");
                 exit(-1);
         }
         madvise(area, size, MADV_HUGEPAGE);

         for (offs = 0; offs < size; offs += getpagesize())
                 area[offs] = 0;
         pause();
 }
<\prog.c>

2) Trigger the shrinker

E.g., memory pressure through memhog

3) Observe that THPs are not getting reclaimed

$ cat /proc/`pgrep prog`/smaps_rollup

Would list ~1GiB of AnonHugePages. With this fix, they would get
reclaimed as expected.

Fixes: dafff3f4c850 ("mm: split underused THPs")
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Usama Arif <usamaarif642@gmail.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/huge_memory.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 26cedfcd74189..aa3ed7a86435b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -4110,9 +4110,6 @@ static bool thp_underused(struct folio *folio)
 	void *kaddr;
 	int i;
 
-	if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
-		return false;
-
 	for (i = 0; i < folio_nr_pages(folio); i++) {
 		kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
 		if (!memchr_inv(kaddr, 0, PAGE_SIZE)) {
-- 
2.50.1
Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
Posted by Baolin Wang 1 day, 23 hours ago

On 2025/9/5 22:11, David Hildenbrand wrote:
> We added an early exit in thp_underused(), probably to avoid scanning
> pages when there is no chance for success.
> 
> However, assume we have max_ptes_none = 511 (default).
> 
> Nothing should stop us from freeing all pages part of a THP that
> is completely zero (512) and khugepaged will for sure not try to
> instantiate a THP in that case (512 shared zeropages).
> 
> This can just trivially happen if someone writes a single 0 byte into a
> PMD area, or of course, when data ends up being zero later.
> 
> So let's remove that early exit.
> 
> Do we want to CC stable? Hm, not sure. Probably not urgent.
> 
> Note that, as default, the THP shrinker is active
> (/sys/kernel/mm/transparent_hugepage/shrink_underused = 1), and all
> THPs are added to the deferred split lists. However, with the
> max_ptes_none default we would never scan them. We would not do that. If
> that's not desirable, we should just disable the shrinker as default,
> also not adding all THPs to the deferred split lists.
> 
> Easy to reproduce:
> 
> 1) Allocate some THPs filled with 0s
> 
> <prog.c>
>   #include <string.h>
>   #include <stdio.h>
>   #include <stdlib.h>
>   #include <unistd.h>
>   #include <sys/mman.h>
> 
>   const size_t size = 1024*1024*1024;
> 
>   int main(void)
>   {
>           size_t offs;
>           char *area;
> 
>           area = mmap(0, size, PROT_READ | PROT_WRITE,
>                       MAP_ANON | MAP_PRIVATE, -1, 0);
>           if (area == MAP_FAILED) {
>                   printf("mmap failed\n");
>                   exit(-1);
>           }
>           madvise(area, size, MADV_HUGEPAGE);
> 
>           for (offs = 0; offs < size; offs += getpagesize())
>                   area[offs] = 0;
>           pause();
>   }
> <\prog.c>
> 
> 2) Trigger the shrinker
> 
> E.g., memory pressure through memhog
> 
> 3) Observe that THPs are not getting reclaimed
> 
> $ cat /proc/`pgrep prog`/smaps_rollup
> 
> Would list ~1GiB of AnonHugePages. With this fix, they would get
> reclaimed as expected.
> 
> Fixes: dafff3f4c850 ("mm: split underused THPs")
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Nico Pache <npache@redhat.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Dev Jain <dev.jain@arm.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Usama Arif <usamaarif642@gmail.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---

Thanks for sorting this out. Make sense to me.
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>

>   mm/huge_memory.c | 3 ---
>   1 file changed, 3 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 26cedfcd74189..aa3ed7a86435b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -4110,9 +4110,6 @@ static bool thp_underused(struct folio *folio)
>   	void *kaddr;
>   	int i;
>   
> -	if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
> -		return false;
> -
>   	for (i = 0; i < folio_nr_pages(folio); i++) {
>   		kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
>   		if (!memchr_inv(kaddr, 0, PAGE_SIZE)) {
Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
Posted by Lance Yang 3 days, 19 hours ago
On Fri, Sep 5, 2025 at 10:14 PM David Hildenbrand <david@redhat.com> wrote:
>
> We added an early exit in thp_underused(), probably to avoid scanning
> pages when there is no chance for success.
>
> However, assume we have max_ptes_none = 511 (default).
>
> Nothing should stop us from freeing all pages part of a THP that
> is completely zero (512) and khugepaged will for sure not try to
> instantiate a THP in that case (512 shared zeropages).
>
> This can just trivially happen if someone writes a single 0 byte into a
> PMD area, or of course, when data ends up being zero later.
>
> So let's remove that early exit.
>
> Do we want to CC stable? Hm, not sure. Probably not urgent.
>
> Note that, as default, the THP shrinker is active
> (/sys/kernel/mm/transparent_hugepage/shrink_underused = 1), and all
> THPs are added to the deferred split lists. However, with the
> max_ptes_none default we would never scan them. We would not do that. If
> that's not desirable, we should just disable the shrinker as default,
> also not adding all THPs to the deferred split lists.
>
> Easy to reproduce:
>
> 1) Allocate some THPs filled with 0s
>
> <prog.c>
>  #include <string.h>
>  #include <stdio.h>
>  #include <stdlib.h>
>  #include <unistd.h>
>  #include <sys/mman.h>
>
>  const size_t size = 1024*1024*1024;
>
>  int main(void)
>  {
>          size_t offs;
>          char *area;
>
>          area = mmap(0, size, PROT_READ | PROT_WRITE,
>                      MAP_ANON | MAP_PRIVATE, -1, 0);
>          if (area == MAP_FAILED) {
>                  printf("mmap failed\n");
>                  exit(-1);
>          }
>          madvise(area, size, MADV_HUGEPAGE);
>
>          for (offs = 0; offs < size; offs += getpagesize())
>                  area[offs] = 0;
>          pause();
>  }
> <\prog.c>
>
> 2) Trigger the shrinker
>
> E.g., memory pressure through memhog
>
> 3) Observe that THPs are not getting reclaimed
>
> $ cat /proc/`pgrep prog`/smaps_rollup
>
> Would list ~1GiB of AnonHugePages. With this fix, they would get
> reclaimed as expected.
>
> Fixes: dafff3f4c850 ("mm: split underused THPs")
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Nico Pache <npache@redhat.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Dev Jain <dev.jain@arm.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Usama Arif <usamaarif642@gmail.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Good catch! Feel free to add:
Reviewed-by: Lance Yang <lance.yang@linux.dev>

Cheers,
Lance

> ---
>  mm/huge_memory.c | 3 ---
>  1 file changed, 3 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 26cedfcd74189..aa3ed7a86435b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -4110,9 +4110,6 @@ static bool thp_underused(struct folio *folio)
>         void *kaddr;
>         int i;
>
> -       if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
> -               return false;
> -
>         for (i = 0; i < folio_nr_pages(folio); i++) {
>                 kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
>                 if (!memchr_inv(kaddr, 0, PAGE_SIZE)) {
> --
> 2.50.1
>
>
Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
Posted by Lorenzo Stoakes 4 days, 10 hours ago
On Fri, Sep 05, 2025 at 04:11:37PM +0200, David Hildenbrand wrote:
> We added an early exit in thp_underused(), probably to avoid scanning
> pages when there is no chance for success.
>
> However, assume we have max_ptes_none = 511 (default).
>
> Nothing should stop us from freeing all pages part of a THP that

Freeing 'all pages which are part of a THP' rather?

> is completely zero (512) and khugepaged will for sure not try to

that is -> that are?

> instantiate a THP in that case (512 shared zeropages).

But if you write faulted they're not the zero page? And how are they shared? I
mean be being dumb here.

>
> This can just trivially happen if someone writes a single 0 byte into a
> PMD area, or of course, when data ends up being zero later.
>
> So let's remove that early exit.
>
> Do we want to CC stable? Hm, not sure. Probably not urgent.

Surely this is worth having?

>
> Note that, as default, the THP shrinker is active
> (/sys/kernel/mm/transparent_hugepage/shrink_underused = 1), and all
> THPs are added to the deferred split lists. However, with the
> max_ptes_none default we would never scan them. We would not do that. If

Nit but 'we would not do that' is kinda duplicative here :)

> that's not desirable, we should just disable the shrinker as default,
> also not adding all THPs to the deferred split lists.
>
> Easy to reproduce:
>
> 1) Allocate some THPs filled with 0s
>
> <prog.c>
>  #include <string.h>
>  #include <stdio.h>
>  #include <stdlib.h>
>  #include <unistd.h>
>  #include <sys/mman.h>
>
>  const size_t size = 1024*1024*1024;
>
>  int main(void)
>  {
>          size_t offs;
>          char *area;
>
>          area = mmap(0, size, PROT_READ | PROT_WRITE,
>                      MAP_ANON | MAP_PRIVATE, -1, 0);
>          if (area == MAP_FAILED) {
>                  printf("mmap failed\n");
>                  exit(-1);
>          }
>          madvise(area, size, MADV_HUGEPAGE);
>
>          for (offs = 0; offs < size; offs += getpagesize())
>                  area[offs] = 0;
>          pause();
>  }
> <\prog.c>
>
> 2) Trigger the shrinker
>
> E.g., memory pressure through memhog
>
> 3) Observe that THPs are not getting reclaimed
>
> $ cat /proc/`pgrep prog`/smaps_rollup
>
> Would list ~1GiB of AnonHugePages. With this fix, they would get
> reclaimed as expected.

Yikes!

>
> Fixes: dafff3f4c850 ("mm: split underused THPs")
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Nico Pache <npache@redhat.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Dev Jain <dev.jain@arm.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Usama Arif <usamaarif642@gmail.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

LGTM, so:

Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

> ---
>  mm/huge_memory.c | 3 ---
>  1 file changed, 3 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 26cedfcd74189..aa3ed7a86435b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -4110,9 +4110,6 @@ static bool thp_underused(struct folio *folio)
>  	void *kaddr;
>  	int i;
>
> -	if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
> -		return false;
> -
>  	for (i = 0; i < folio_nr_pages(folio); i++) {
>  		kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
>  		if (!memchr_inv(kaddr, 0, PAGE_SIZE)) {
> --
> 2.50.1
>
Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
Posted by David Hildenbrand 4 days, 10 hours ago
On 05.09.25 17:30, Lorenzo Stoakes wrote:
> On Fri, Sep 05, 2025 at 04:11:37PM +0200, David Hildenbrand wrote:
>> We added an early exit in thp_underused(), probably to avoid scanning
>> pages when there is no chance for success.
>>
>> However, assume we have max_ptes_none = 511 (default).
>>
>> Nothing should stop us from freeing all pages part of a THP that
> 
> Freeing 'all pages which are part of a THP' rather?

I'm German, I don't know what I'm doing. :D

> 
>> is completely zero (512) and khugepaged will for sure not try to
> 
> that is -> that are?

the THP is zero?

> 
>> instantiate a THP in that case (512 shared zeropages).
> 
> But if you write faulted they're not the zero page? And how are they shared? I
> mean be being dumb here.

The shrinker will replace zeroed pages by the shared zeropages.

> 
>>
>> This can just trivially happen if someone writes a single 0 byte into a
>> PMD area, or of course, when data ends up being zero later.
>>
>> So let's remove that early exit.
>>
>> Do we want to CC stable? Hm, not sure. Probably not urgent.
> 
> Surely this is worth having?

Alrighty, let me cc stable.

> 
>>
>> Note that, as default, the THP shrinker is active
>> (/sys/kernel/mm/transparent_hugepage/shrink_underused = 1), and all
>> THPs are added to the deferred split lists. However, with the
>> max_ptes_none default we would never scan them. We would not do that. If
> 
> Nit but 'we would not do that' is kinda duplicative here :)

Yeah, fixed it already while rewriting: this was meant to be "would now".


Thanks!


-- 
Cheers

David / dhildenb
Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
Posted by Lorenzo Stoakes 1 day, 14 hours ago
On Fri, Sep 05, 2025 at 05:36:01PM +0200, David Hildenbrand wrote:
> On 05.09.25 17:30, Lorenzo Stoakes wrote:
> > On Fri, Sep 05, 2025 at 04:11:37PM +0200, David Hildenbrand wrote:
> > > We added an early exit in thp_underused(), probably to avoid scanning
> > > pages when there is no chance for success.
> > >
> > > However, assume we have max_ptes_none = 511 (default).
> > >
> > > Nothing should stop us from freeing all pages part of a THP that
> >
> > Freeing 'all pages which are part of a THP' rather?
>
> I'm German, I don't know what I'm doing. :D

Whereas I have no excuse :P

>
> >
> > > is completely zero (512) and khugepaged will for sure not try to
> >
> > that is -> that are?
>
> the THP is zero?

I mean "all pages part of a THP that is completely zero' -> "all pages part of a
THP that are completely zero', I'm referring to the 'all pages' bit, but I guess
you mean the THP is entirely zero.

So maybe rephrase to 'all pages which are part of a zero THP' or similar? :)

>
> >
> > > instantiate a THP in that case (512 shared zeropages).
> >
> > But if you write faulted they're not the zero page? And how are they shared? I
> > mean be being dumb here.
>
> The shrinker will replace zeroed pages by the shared zeropages.

Ah thanks, I was being dumb :) too stuck in vanilla mm land...

>
> >
> > >
> > > This can just trivially happen if someone writes a single 0 byte into a
> > > PMD area, or of course, when data ends up being zero later.
> > >
> > > So let's remove that early exit.
> > >
> > > Do we want to CC stable? Hm, not sure. Probably not urgent.
> >
> > Surely this is worth having?
>
> Alrighty, let me cc stable.

Thanks!

>
> >
> > >
> > > Note that, as default, the THP shrinker is active
> > > (/sys/kernel/mm/transparent_hugepage/shrink_underused = 1), and all
> > > THPs are added to the deferred split lists. However, with the
> > > max_ptes_none default we would never scan them. We would not do that. If
> >
> > Nit but 'we would not do that' is kinda duplicative here :)
>
> Yeah, fixed it already while rewriting: this was meant to be "would now".

Cheers!

>
>
> Thanks!
>
>
> --
> Cheers
>
> David / dhildenb
>

Cheers, Lorenzo
Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
Posted by David Hildenbrand 4 days, 10 hours ago
On 05.09.25 16:11, David Hildenbrand wrote:
> We added an early exit in thp_underused(), probably to avoid scanning
> pages when there is no chance for success.
> 
> However, assume we have max_ptes_none = 511 (default).
> 
> Nothing should stop us from freeing all pages part of a THP that
> is completely zero (512) and khugepaged will for sure not try to
> instantiate a THP in that case (512 shared zeropages).
> 
> This can just trivially happen if someone writes a single 0 byte into a
> PMD area, or of course, when data ends up being zero later.
> 
> So let's remove that early exit.
> 
> Do we want to CC stable? Hm, not sure. Probably not urgent.
> 
> Note that, as default, the THP shrinker is active
> (/sys/kernel/mm/transparent_hugepage/shrink_underused = 1), and all
> THPs are added to the deferred split lists. However, with the
> max_ptes_none default we would never scan them. We would not do that. If
> that's not desirable, we should just disable the shrinker as default,
> also not adding all THPs to the deferred split lists.
> 
> Easy to reproduce:
> 

Just a note that I will send a v2 with an updated patch description 
after the discussions here settled.

-- 
Cheers

David / dhildenb
Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
Posted by Usama Arif 4 days, 10 hours ago

On 05/09/2025 15:11, David Hildenbrand wrote:
> We added an early exit in thp_underused(), probably to avoid scanning
> pages when there is no chance for success.
> 
> However, assume we have max_ptes_none = 511 (default).
> 
> Nothing should stop us from freeing all pages part of a THP that
> is completely zero (512) and khugepaged will for sure not try to
> instantiate a THP in that case (512 shared zeropages).
> 
> This can just trivially happen if someone writes a single 0 byte into a
> PMD area, or of course, when data ends up being zero later.
> 
> So let's remove that early exit.
> 
> Do we want to CC stable? Hm, not sure. Probably not urgent.
> 
> Note that, as default, the THP shrinker is active
> (/sys/kernel/mm/transparent_hugepage/shrink_underused = 1), and all
> THPs are added to the deferred split lists. However, with the
> max_ptes_none default we would never scan them. We would not do that. If
> that's not desirable, we should just disable the shrinker as default,
> also not adding all THPs to the deferred split lists.
> 
> Easy to reproduce:
> 
> 1) Allocate some THPs filled with 0s
> 
> <prog.c>
>  #include <string.h>
>  #include <stdio.h>
>  #include <stdlib.h>
>  #include <unistd.h>
>  #include <sys/mman.h>
> 
>  const size_t size = 1024*1024*1024;
> 
>  int main(void)
>  {
>          size_t offs;
>          char *area;
> 
>          area = mmap(0, size, PROT_READ | PROT_WRITE,
>                      MAP_ANON | MAP_PRIVATE, -1, 0);
>          if (area == MAP_FAILED) {
>                  printf("mmap failed\n");
>                  exit(-1);
>          }
>          madvise(area, size, MADV_HUGEPAGE);
> 
>          for (offs = 0; offs < size; offs += getpagesize())
>                  area[offs] = 0;
>          pause();
>  }
> <\prog.c>
> 
> 2) Trigger the shrinker
> 
> E.g., memory pressure through memhog
> 
> 3) Observe that THPs are not getting reclaimed
> 
> $ cat /proc/`pgrep prog`/smaps_rollup
> 
> Would list ~1GiB of AnonHugePages. With this fix, they would get
> reclaimed as expected.
> 
> Fixes: dafff3f4c850 ("mm: split underused THPs")
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Nico Pache <npache@redhat.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Dev Jain <dev.jain@arm.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Usama Arif <usamaarif642@gmail.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>


Acked-by: Usama Arif <usamaarif642@gmail.com>
Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
Posted by Usama Arif 4 days, 11 hours ago

On 05/09/2025 15:11, David Hildenbrand wrote:
> We added an early exit in thp_underused(), probably to avoid scanning
> pages when there is no chance for success.
> 

Yes, that was the main reason.

> However, assume we have max_ptes_none = 511 (default).
> 
> Nothing should stop us from freeing all pages part of a THP that
> is completely zero (512) and khugepaged will for sure not try to
> instantiate a THP in that case (512 shared zeropages).

Agree with this point for this patch.

> 
> This can just trivially happen if someone writes a single 0 byte into a
> PMD area, or of course, when data ends up being zero later.
> 
> So let's remove that early exit.
> 
> Do we want to CC stable? Hm, not sure. Probably not urgent.
> 
> Note that, as default, the THP shrinker is active
> (/sys/kernel/mm/transparent_hugepage/shrink_underused = 1), and all
> THPs are added to the deferred split lists. However, with the
> max_ptes_none default we would never scan them. We would not do that. If
> that's not desirable, we should just disable the shrinker as default,
> also not adding all THPs to the deferred split lists.
> 

The reason I did this is for the case if you change max_ptes_none after the THP is added
to deferred split list but *before* memory pressure, i.e. before the shrinker runs,
so that its considered for splitting.

> Easy to reproduce:
> 
> 1) Allocate some THPs filled with 0s
> 
> <prog.c>
>  #include <string.h>
>  #include <stdio.h>
>  #include <stdlib.h>
>  #include <unistd.h>
>  #include <sys/mman.h>
> 
>  const size_t size = 1024*1024*1024;
> 
>  int main(void)
>  {
>          size_t offs;
>          char *area;
> 
>          area = mmap(0, size, PROT_READ | PROT_WRITE,
>                      MAP_ANON | MAP_PRIVATE, -1, 0);
>          if (area == MAP_FAILED) {
>                  printf("mmap failed\n");
>                  exit(-1);
>          }
>          madvise(area, size, MADV_HUGEPAGE);
> 
>          for (offs = 0; offs < size; offs += getpagesize())
>                  area[offs] = 0;
>          pause();
>  }
> <\prog.c>
> 
> 2) Trigger the shrinker
> 
> E.g., memory pressure through memhog
> 
> 3) Observe that THPs are not getting reclaimed
> 
> $ cat /proc/`pgrep prog`/smaps_rollup
> 
> Would list ~1GiB of AnonHugePages. With this fix, they would get
> reclaimed as expected.
> 
> Fixes: dafff3f4c850 ("mm: split underused THPs")
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Nico Pache <npache@redhat.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Dev Jain <dev.jain@arm.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Usama Arif <usamaarif642@gmail.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/huge_memory.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 26cedfcd74189..aa3ed7a86435b 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -4110,9 +4110,6 @@ static bool thp_underused(struct folio *folio)
>  	void *kaddr;
>  	int i;
>  
> -	if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
> -		return false;
> -

I do agree with your usecase, but I am really worried about the amount of
work and cpu time the THP shrinker will consume when max_ptes_none is 511
(I dont have any numbers to back up my worry :)), and its less likely that
we will have these completely zeroed out THPs (again no numbers to back up
this statement). We have the huge_zero_folio as well which is installed on read.

>  	for (i = 0; i < folio_nr_pages(folio); i++) {
>  		kaddr = kmap_local_folio(folio, i * PAGE_SIZE);
>  		if (!memchr_inv(kaddr, 0, PAGE_SIZE)) {
Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
Posted by David Hildenbrand 4 days, 11 hours ago
[...]

> 
> The reason I did this is for the case if you change max_ptes_none after the THP is added
> to deferred split list but *before* memory pressure, i.e. before the shrinker runs,
> so that its considered for splitting.

Yeah, I was assuming that was the reason why the shrinker is enabled as 
default.

But in any sane system, the admin would enable the shrinker early. If 
not, we can look into handling it differently.

> 
>> Easy to reproduce:
>>
>> 1) Allocate some THPs filled with 0s
>>
>> <prog.c>
>>   #include <string.h>
>>   #include <stdio.h>
>>   #include <stdlib.h>
>>   #include <unistd.h>
>>   #include <sys/mman.h>
>>
>>   const size_t size = 1024*1024*1024;
>>
>>   int main(void)
>>   {
>>           size_t offs;
>>           char *area;
>>
>>           area = mmap(0, size, PROT_READ | PROT_WRITE,
>>                       MAP_ANON | MAP_PRIVATE, -1, 0);
>>           if (area == MAP_FAILED) {
>>                   printf("mmap failed\n");
>>                   exit(-1);
>>           }
>>           madvise(area, size, MADV_HUGEPAGE);
>>
>>           for (offs = 0; offs < size; offs += getpagesize())
>>                   area[offs] = 0;
>>           pause();
>>   }
>> <\prog.c>
>>
>> 2) Trigger the shrinker
>>
>> E.g., memory pressure through memhog
>>
>> 3) Observe that THPs are not getting reclaimed
>>
>> $ cat /proc/`pgrep prog`/smaps_rollup
>>
>> Would list ~1GiB of AnonHugePages. With this fix, they would get
>> reclaimed as expected.
>>
>> Fixes: dafff3f4c850 ("mm: split underused THPs")
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>> Cc: Nico Pache <npache@redhat.com>
>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>> Cc: Dev Jain <dev.jain@arm.com>
>> Cc: Barry Song <baohua@kernel.org>
>> Cc: Usama Arif <usamaarif642@gmail.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   mm/huge_memory.c | 3 ---
>>   1 file changed, 3 deletions(-)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 26cedfcd74189..aa3ed7a86435b 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -4110,9 +4110,6 @@ static bool thp_underused(struct folio *folio)
>>   	void *kaddr;
>>   	int i;
>>   
>> -	if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
>> -		return false;
>> -
> 
> I do agree with your usecase, but I am really worried about the amount of
> work and cpu time the THP shrinker will consume when max_ptes_none is 511
> (I dont have any numbers to back up my worry :)), and its less likely that
> we will have these completely zeroed out THPs (again no numbers to back up
> this statement).

Then then shrinker shall be deactivated as default if that becomes a 
problem.

Fortunately you documented the desired semantics:

"All THPs at fault and collapse time will be added to _deferred_list,
and will therefore be split under memory pressure if they are considered
"underused". A THP is underused if the number of zero-filled pages in
the THP is above max_ptes_none (see below)."

> We have the huge_zero_folio as well which is installed on read.

Yes, only if the huge zero folio is not available. Which will then also 
get properly reclaimed.

-- 
Cheers

David / dhildenb
Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
Posted by Usama Arif 4 days, 10 hours ago

On 05/09/2025 15:46, David Hildenbrand wrote:
> [...]
> 
>>
>> The reason I did this is for the case if you change max_ptes_none after the THP is added
>> to deferred split list but *before* memory pressure, i.e. before the shrinker runs,
>> so that its considered for splitting.
> 
> Yeah, I was assuming that was the reason why the shrinker is enabled as default.
> 
> But in any sane system, the admin would enable the shrinker early. If not, we can look into handling it differently.

Yes, I do this as well, i.e. have a low value from the start.

Does it make sense to disable shrinker if max_ptes_none is 511? It wont shrink
the usecase you are describing below, but we wont encounter the increased CPU usage.> 
>>
>>> Easy to reproduce:
>>>
>>> 1) Allocate some THPs filled with 0s
>>>
>>> <prog.c>
>>>   #include <string.h>
>>>   #include <stdio.h>
>>>   #include <stdlib.h>
>>>   #include <unistd.h>
>>>   #include <sys/mman.h>
>>>
>>>   const size_t size = 1024*1024*1024;
>>>
>>>   int main(void)
>>>   {
>>>           size_t offs;
>>>           char *area;
>>>
>>>           area = mmap(0, size, PROT_READ | PROT_WRITE,
>>>                       MAP_ANON | MAP_PRIVATE, -1, 0);
>>>           if (area == MAP_FAILED) {
>>>                   printf("mmap failed\n");
>>>                   exit(-1);
>>>           }
>>>           madvise(area, size, MADV_HUGEPAGE);
>>>
>>>           for (offs = 0; offs < size; offs += getpagesize())
>>>                   area[offs] = 0;
>>>           pause();
>>>   }
>>> <\prog.c>
>>>
>>> 2) Trigger the shrinker
>>>
>>> E.g., memory pressure through memhog
>>>
>>> 3) Observe that THPs are not getting reclaimed
>>>
>>> $ cat /proc/`pgrep prog`/smaps_rollup
>>>
>>> Would list ~1GiB of AnonHugePages. With this fix, they would get
>>> reclaimed as expected.
>>>
>>> Fixes: dafff3f4c850 ("mm: split underused THPs")
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>> Cc: Zi Yan <ziy@nvidia.com>
>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>>> Cc: Nico Pache <npache@redhat.com>
>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>> Cc: Dev Jain <dev.jain@arm.com>
>>> Cc: Barry Song <baohua@kernel.org>
>>> Cc: Usama Arif <usamaarif642@gmail.com>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>>   mm/huge_memory.c | 3 ---
>>>   1 file changed, 3 deletions(-)
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index 26cedfcd74189..aa3ed7a86435b 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -4110,9 +4110,6 @@ static bool thp_underused(struct folio *folio)
>>>       void *kaddr;
>>>       int i;
>>>   -    if (khugepaged_max_ptes_none == HPAGE_PMD_NR - 1)
>>> -        return false;
>>> -
>>
>> I do agree with your usecase, but I am really worried about the amount of
>> work and cpu time the THP shrinker will consume when max_ptes_none is 511
>> (I dont have any numbers to back up my worry :)), and its less likely that
>> we will have these completely zeroed out THPs (again no numbers to back up
>> this statement).
> 
> Then then shrinker shall be deactivated as default if that becomes a problem.
> 
> Fortunately you documented the desired semantics:
> 
> "All THPs at fault and collapse time will be added to _deferred_list,
> and will therefore be split under memory pressure if they are considered
> "underused". A THP is underused if the number of zero-filled pages in
> the THP is above max_ptes_none (see below)."
> 
>> We have the huge_zero_folio as well which is installed on read.
> 
> Yes, only if the huge zero folio is not available. Which will then also get properly reclaimed.
> 

Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
Posted by David Hildenbrand 4 days, 10 hours ago
On 05.09.25 16:53, Usama Arif wrote:
> 
> 
> On 05/09/2025 15:46, David Hildenbrand wrote:
>> [...]
>>
>>>
>>> The reason I did this is for the case if you change max_ptes_none after the THP is added
>>> to deferred split list but *before* memory pressure, i.e. before the shrinker runs,
>>> so that its considered for splitting.
>>
>> Yeah, I was assuming that was the reason why the shrinker is enabled as default.
>>
>> But in any sane system, the admin would enable the shrinker early. If not, we can look into handling it differently.
> 
> Yes, I do this as well, i.e. have a low value from the start.
> 
> Does it make sense to disable shrinker if max_ptes_none is 511? It wont shrink
> the usecase you are describing below, but we wont encounter the increased CPU usage.>

I don't really see why we should do that.

If the shrinker is a problem than the shrinker should be disabled. But 
if it is enabled, we should be shrinking as documented.

Without more magic around our THP toggles (we want less) :)

Shrinking happens when we are under memory pressure, so I am not really 
sure how relevant the scanning bit is, and if it is relevant enought to 
change the shrinker default.

-- 
Cheers

David / dhildenb
Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
Posted by Usama Arif 4 days, 10 hours ago

On 05/09/2025 15:58, David Hildenbrand wrote:
> On 05.09.25 16:53, Usama Arif wrote:
>>
>>
>> On 05/09/2025 15:46, David Hildenbrand wrote:
>>> [...]
>>>
>>>>
>>>> The reason I did this is for the case if you change max_ptes_none after the THP is added
>>>> to deferred split list but *before* memory pressure, i.e. before the shrinker runs,
>>>> so that its considered for splitting.
>>>
>>> Yeah, I was assuming that was the reason why the shrinker is enabled as default.
>>>
>>> But in any sane system, the admin would enable the shrinker early. If not, we can look into handling it differently.
>>
>> Yes, I do this as well, i.e. have a low value from the start.
>>
>> Does it make sense to disable shrinker if max_ptes_none is 511? It wont shrink
>> the usecase you are describing below, but we wont encounter the increased CPU usage.>
> 
> I don't really see why we should do that.
> 
> If the shrinker is a problem than the shrinker should be disabled. But if it is enabled, we should be shrinking as documented.
> 
> Without more magic around our THP toggles (we want less) :)
> 
> Shrinking happens when we are under memory pressure, so I am not really sure how relevant the scanning bit is, and if it is relevant enought to change the shrinker default.
> 

yes agreed, I also dont have numbers to back up my worry, its all theoretical :)
Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
Posted by David Hildenbrand 4 days, 10 hours ago
On 05.09.25 17:01, Usama Arif wrote:
> 
> 
> On 05/09/2025 15:58, David Hildenbrand wrote:
>> On 05.09.25 16:53, Usama Arif wrote:
>>>
>>>
>>> On 05/09/2025 15:46, David Hildenbrand wrote:
>>>> [...]
>>>>
>>>>>
>>>>> The reason I did this is for the case if you change max_ptes_none after the THP is added
>>>>> to deferred split list but *before* memory pressure, i.e. before the shrinker runs,
>>>>> so that its considered for splitting.
>>>>
>>>> Yeah, I was assuming that was the reason why the shrinker is enabled as default.
>>>>
>>>> But in any sane system, the admin would enable the shrinker early. If not, we can look into handling it differently.
>>>
>>> Yes, I do this as well, i.e. have a low value from the start.
>>>
>>> Does it make sense to disable shrinker if max_ptes_none is 511? It wont shrink
>>> the usecase you are describing below, but we wont encounter the increased CPU usage.>
>>
>> I don't really see why we should do that.
>>
>> If the shrinker is a problem than the shrinker should be disabled. But if it is enabled, we should be shrinking as documented.
>>
>> Without more magic around our THP toggles (we want less) :)
>>
>> Shrinking happens when we are under memory pressure, so I am not really sure how relevant the scanning bit is, and if it is relevant enought to change the shrinker default.
>>
> 
> yes agreed, I also dont have numbers to back up my worry, its all theoretical :)

BTW, I was also wondering if we should just always add all THP to the 
deferred split list, and make the split toggle just affect whether we 
process them or not (scan or not).

I mean, as a default we add all of them to the list already right now, 
even though nothing would ever get reclaimed as default.

What's your take?

-- 
Cheers

David / dhildenb
Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
Posted by Usama Arif 4 days, 10 hours ago

On 05/09/2025 16:04, David Hildenbrand wrote:
> On 05.09.25 17:01, Usama Arif wrote:
>>
>>
>> On 05/09/2025 15:58, David Hildenbrand wrote:
>>> On 05.09.25 16:53, Usama Arif wrote:
>>>>
>>>>
>>>> On 05/09/2025 15:46, David Hildenbrand wrote:
>>>>> [...]
>>>>>
>>>>>>
>>>>>> The reason I did this is for the case if you change max_ptes_none after the THP is added
>>>>>> to deferred split list but *before* memory pressure, i.e. before the shrinker runs,
>>>>>> so that its considered for splitting.
>>>>>
>>>>> Yeah, I was assuming that was the reason why the shrinker is enabled as default.
>>>>>
>>>>> But in any sane system, the admin would enable the shrinker early. If not, we can look into handling it differently.
>>>>
>>>> Yes, I do this as well, i.e. have a low value from the start.
>>>>
>>>> Does it make sense to disable shrinker if max_ptes_none is 511? It wont shrink
>>>> the usecase you are describing below, but we wont encounter the increased CPU usage.>
>>>
>>> I don't really see why we should do that.
>>>
>>> If the shrinker is a problem than the shrinker should be disabled. But if it is enabled, we should be shrinking as documented.
>>>
>>> Without more magic around our THP toggles (we want less) :)
>>>
>>> Shrinking happens when we are under memory pressure, so I am not really sure how relevant the scanning bit is, and if it is relevant enought to change the shrinker default.
>>>
>>
>> yes agreed, I also dont have numbers to back up my worry, its all theoretical :)
> 
> BTW, I was also wondering if we should just always add all THP to the deferred split list, and make the split toggle just affect whether we process them or not (scan or not).
> 
> I mean, as a default we add all of them to the list already right now, even though nothing would ever get reclaimed as default.
> 
> What's your take?
> 

hmm I probably didnt understand what you meant to say here:
we already add all of them to the list in __do_huge_pmd_anonymous_page and collapse_huge_page and
shrink_underused sets/clears split_underused_thp in deferred_split_folio decides whether we process or not.

In deferred_split_folio, if split_underused_thp is false, we dont add them to the list (unless partially_mapped).

Unless you are referring to non pmd mapped THPs?
Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
Posted by David Hildenbrand 4 days, 10 hours ago
On 05.09.25 17:16, Usama Arif wrote:
> 
> 
> On 05/09/2025 16:04, David Hildenbrand wrote:
>> On 05.09.25 17:01, Usama Arif wrote:
>>>
>>>
>>> On 05/09/2025 15:58, David Hildenbrand wrote:
>>>> On 05.09.25 16:53, Usama Arif wrote:
>>>>>
>>>>>
>>>>> On 05/09/2025 15:46, David Hildenbrand wrote:
>>>>>> [...]
>>>>>>
>>>>>>>
>>>>>>> The reason I did this is for the case if you change max_ptes_none after the THP is added
>>>>>>> to deferred split list but *before* memory pressure, i.e. before the shrinker runs,
>>>>>>> so that its considered for splitting.
>>>>>>
>>>>>> Yeah, I was assuming that was the reason why the shrinker is enabled as default.
>>>>>>
>>>>>> But in any sane system, the admin would enable the shrinker early. If not, we can look into handling it differently.
>>>>>
>>>>> Yes, I do this as well, i.e. have a low value from the start.
>>>>>
>>>>> Does it make sense to disable shrinker if max_ptes_none is 511? It wont shrink
>>>>> the usecase you are describing below, but we wont encounter the increased CPU usage.>
>>>>
>>>> I don't really see why we should do that.
>>>>
>>>> If the shrinker is a problem than the shrinker should be disabled. But if it is enabled, we should be shrinking as documented.
>>>>
>>>> Without more magic around our THP toggles (we want less) :)
>>>>
>>>> Shrinking happens when we are under memory pressure, so I am not really sure how relevant the scanning bit is, and if it is relevant enought to change the shrinker default.
>>>>
>>>
>>> yes agreed, I also dont have numbers to back up my worry, its all theoretical :)
>>
>> BTW, I was also wondering if we should just always add all THP to the deferred split list, and make the split toggle just affect whether we process them or not (scan or not).
>>
>> I mean, as a default we add all of them to the list already right now, even though nothing would ever get reclaimed as default.
>>
>> What's your take?
>>
> 
> hmm I probably didnt understand what you meant to say here:
> we already add all of them to the list in __do_huge_pmd_anonymous_page and collapse_huge_page and
> shrink_underused sets/clears split_underused_thp in deferred_split_folio decides whether we process or not.

This is what I mean:

commit 3952b6f6b671ca7d69fd1783b1abf4806f90d436 (HEAD -> max_ptes_none)
Author: David Hildenbrand <david@redhat.com>
Date:   Fri Sep 5 17:22:01 2025 +0200

     mm/huge_memory: always add THPs to the deferred split list
     
     When disabling the shrinker and then re-enabling it, any anon THPs
     allocated in the meantime.
     
     That also means that we cannot disable the shrinker as default during
     boot, because we would miss some THPs later when enabling it.
     
     So always add them to the deferred split list, and only skip the
     scanning if the shrinker is disabled.
     
     This is effectively what we do on all systems out there already, unless
     they disable the shrinker.
     
     Signed-off-by: David Hildenbrand <david@redhat.com>

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index aa3ed7a86435b..3ee857c1d3754 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -4052,9 +4052,6 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
         if (folio_order(folio) <= 1)
                 return;
  
-       if (!partially_mapped && !split_underused_thp)
-               return;
-
         /*
          * Exclude swapcache: originally to avoid a corrupt deferred split
          * queue. Nowadays that is fully prevented by memcg1_swapout();
@@ -4175,6 +4172,8 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
                 bool underused = false;
  
                 if (!folio_test_partially_mapped(folio)) {
+                       if (!split_underused_thp)
+                               goto next;
                         underused = thp_underused(folio);
                         if (!underused)
                                 goto next;


-- 
Cheers

David / dhildenb
Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
Posted by Usama Arif 4 days, 9 hours ago

On 05/09/2025 16:28, David Hildenbrand wrote:
> On 05.09.25 17:16, Usama Arif wrote:
>>
>>
>> On 05/09/2025 16:04, David Hildenbrand wrote:
>>> On 05.09.25 17:01, Usama Arif wrote:
>>>>
>>>>
>>>> On 05/09/2025 15:58, David Hildenbrand wrote:
>>>>> On 05.09.25 16:53, Usama Arif wrote:
>>>>>>
>>>>>>
>>>>>> On 05/09/2025 15:46, David Hildenbrand wrote:
>>>>>>> [...]
>>>>>>>
>>>>>>>>
>>>>>>>> The reason I did this is for the case if you change max_ptes_none after the THP is added
>>>>>>>> to deferred split list but *before* memory pressure, i.e. before the shrinker runs,
>>>>>>>> so that its considered for splitting.
>>>>>>>
>>>>>>> Yeah, I was assuming that was the reason why the shrinker is enabled as default.
>>>>>>>
>>>>>>> But in any sane system, the admin would enable the shrinker early. If not, we can look into handling it differently.
>>>>>>
>>>>>> Yes, I do this as well, i.e. have a low value from the start.
>>>>>>
>>>>>> Does it make sense to disable shrinker if max_ptes_none is 511? It wont shrink
>>>>>> the usecase you are describing below, but we wont encounter the increased CPU usage.>
>>>>>
>>>>> I don't really see why we should do that.
>>>>>
>>>>> If the shrinker is a problem than the shrinker should be disabled. But if it is enabled, we should be shrinking as documented.
>>>>>
>>>>> Without more magic around our THP toggles (we want less) :)
>>>>>
>>>>> Shrinking happens when we are under memory pressure, so I am not really sure how relevant the scanning bit is, and if it is relevant enought to change the shrinker default.
>>>>>
>>>>
>>>> yes agreed, I also dont have numbers to back up my worry, its all theoretical :)
>>>
>>> BTW, I was also wondering if we should just always add all THP to the deferred split list, and make the split toggle just affect whether we process them or not (scan or not).
>>>
>>> I mean, as a default we add all of them to the list already right now, even though nothing would ever get reclaimed as default.
>>>
>>> What's your take?
>>>
>>
>> hmm I probably didnt understand what you meant to say here:
>> we already add all of them to the list in __do_huge_pmd_anonymous_page and collapse_huge_page and
>> shrink_underused sets/clears split_underused_thp in deferred_split_folio decides whether we process or not.
> 
> This is what I mean:
> 
> commit 3952b6f6b671ca7d69fd1783b1abf4806f90d436 (HEAD -> max_ptes_none)
> Author: David Hildenbrand <david@redhat.com>
> Date:   Fri Sep 5 17:22:01 2025 +0200
> 
>     mm/huge_memory: always add THPs to the deferred split list
>         When disabling the shrinker and then re-enabling it, any anon THPs
>     allocated in the meantime.
>         That also means that we cannot disable the shrinker as default during
>     boot, because we would miss some THPs later when enabling it.
>         So always add them to the deferred split list, and only skip the
>     scanning if the shrinker is disabled.
>         This is effectively what we do on all systems out there already, unless
>     they disable the shrinker.
>         Signed-off-by: David Hildenbrand <david@redhat.com>
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index aa3ed7a86435b..3ee857c1d3754 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -4052,9 +4052,6 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
>         if (folio_order(folio) <= 1)
>                 return;
>  
> -       if (!partially_mapped && !split_underused_thp)
> -               return;
> -
>         /*
>          * Exclude swapcache: originally to avoid a corrupt deferred split
>          * queue. Nowadays that is fully prevented by memcg1_swapout();
> @@ -4175,6 +4172,8 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>                 bool underused = false;
>  
>                 if (!folio_test_partially_mapped(folio)) {
> +                       if (!split_underused_thp)
> +                               goto next;
>                         underused = thp_underused(folio);
>                         if (!underused)
>                                 goto next;
> 
> 


Thanks for sending the diff! Now I know what you meant lol.

In the case of when shrinker is disabled, this could make the deferred split scan for partially mapped folios
very ineffective?

I am making up numbers, but lets there are 128 THPs in the system, only 2 of them are partially mapped
and sc->nr_to_scan is 32.

In the current code, with shrinker disabled, only the 2 partially mapped THPs will be on the deferred list, so
we will reclaim them in the first go.

With your patch, the worst case scenario is that the partially mapped THPs are at the end of the deferred_list
and we would need 4 calls for the shrinker to split them.
Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
Posted by David Hildenbrand 4 days, 9 hours ago
On 05.09.25 17:53, Usama Arif wrote:
> 
> 
> On 05/09/2025 16:28, David Hildenbrand wrote:
>> On 05.09.25 17:16, Usama Arif wrote:
>>>
>>>
>>> On 05/09/2025 16:04, David Hildenbrand wrote:
>>>> On 05.09.25 17:01, Usama Arif wrote:
>>>>>
>>>>>
>>>>> On 05/09/2025 15:58, David Hildenbrand wrote:
>>>>>> On 05.09.25 16:53, Usama Arif wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 05/09/2025 15:46, David Hildenbrand wrote:
>>>>>>>> [...]
>>>>>>>>
>>>>>>>>>
>>>>>>>>> The reason I did this is for the case if you change max_ptes_none after the THP is added
>>>>>>>>> to deferred split list but *before* memory pressure, i.e. before the shrinker runs,
>>>>>>>>> so that its considered for splitting.
>>>>>>>>
>>>>>>>> Yeah, I was assuming that was the reason why the shrinker is enabled as default.
>>>>>>>>
>>>>>>>> But in any sane system, the admin would enable the shrinker early. If not, we can look into handling it differently.
>>>>>>>
>>>>>>> Yes, I do this as well, i.e. have a low value from the start.
>>>>>>>
>>>>>>> Does it make sense to disable shrinker if max_ptes_none is 511? It wont shrink
>>>>>>> the usecase you are describing below, but we wont encounter the increased CPU usage.>
>>>>>>
>>>>>> I don't really see why we should do that.
>>>>>>
>>>>>> If the shrinker is a problem than the shrinker should be disabled. But if it is enabled, we should be shrinking as documented.
>>>>>>
>>>>>> Without more magic around our THP toggles (we want less) :)
>>>>>>
>>>>>> Shrinking happens when we are under memory pressure, so I am not really sure how relevant the scanning bit is, and if it is relevant enought to change the shrinker default.
>>>>>>
>>>>>
>>>>> yes agreed, I also dont have numbers to back up my worry, its all theoretical :)
>>>>
>>>> BTW, I was also wondering if we should just always add all THP to the deferred split list, and make the split toggle just affect whether we process them or not (scan or not).
>>>>
>>>> I mean, as a default we add all of them to the list already right now, even though nothing would ever get reclaimed as default.
>>>>
>>>> What's your take?
>>>>
>>>
>>> hmm I probably didnt understand what you meant to say here:
>>> we already add all of them to the list in __do_huge_pmd_anonymous_page and collapse_huge_page and
>>> shrink_underused sets/clears split_underused_thp in deferred_split_folio decides whether we process or not.
>>
>> This is what I mean:
>>
>> commit 3952b6f6b671ca7d69fd1783b1abf4806f90d436 (HEAD -> max_ptes_none)
>> Author: David Hildenbrand <david@redhat.com>
>> Date:   Fri Sep 5 17:22:01 2025 +0200
>>
>>      mm/huge_memory: always add THPs to the deferred split list
>>          When disabling the shrinker and then re-enabling it, any anon THPs
>>      allocated in the meantime.
>>          That also means that we cannot disable the shrinker as default during
>>      boot, because we would miss some THPs later when enabling it.
>>          So always add them to the deferred split list, and only skip the
>>      scanning if the shrinker is disabled.
>>          This is effectively what we do on all systems out there already, unless
>>      they disable the shrinker.
>>          Signed-off-by: David Hildenbrand <david@redhat.com>
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index aa3ed7a86435b..3ee857c1d3754 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -4052,9 +4052,6 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
>>          if (folio_order(folio) <= 1)
>>                  return;
>>   
>> -       if (!partially_mapped && !split_underused_thp)
>> -               return;
>> -
>>          /*
>>           * Exclude swapcache: originally to avoid a corrupt deferred split
>>           * queue. Nowadays that is fully prevented by memcg1_swapout();
>> @@ -4175,6 +4172,8 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>>                  bool underused = false;
>>   
>>                  if (!folio_test_partially_mapped(folio)) {
>> +                       if (!split_underused_thp)
>> +                               goto next;
>>                          underused = thp_underused(folio);
>>                          if (!underused)
>>                                  goto next;
>>
>>
> 
> 
> Thanks for sending the diff! Now I know what you meant lol.
> 
> In the case of when shrinker is disabled, this could make the deferred split scan for partially mapped folios
> very ineffective?

I hope you realize that that's the default on each and every system out 
there that ships this feature :)

And don't ask me how many people even know about disabling the shrinker 
or would do it, when the default setting is mostly not splitting many 
THPs ever.

> 
> I am making up numbers, but lets there are 128 THPs in the system, only 2 of them are partially mapped
> and sc->nr_to_scan is 32.
> 
> In the current code, with shrinker disabled, only the 2 partially mapped THPs will be on the deferred list, so
> we will reclaim them in the first go.
> 
> With your patch, the worst case scenario is that the partially mapped THPs are at the end of the deferred_list
> and we would need 4 calls for the shrinker to split them.

Probably at some point we would want split lists as well, not sure how 
feasible that is.

-- 
Cheers

David / dhildenb

Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
Posted by Usama Arif 4 days, 9 hours ago

On 05/09/2025 16:58, David Hildenbrand wrote:
> On 05.09.25 17:53, Usama Arif wrote:
>>
>>
>> On 05/09/2025 16:28, David Hildenbrand wrote:
>>> On 05.09.25 17:16, Usama Arif wrote:
>>>>
>>>>
>>>> On 05/09/2025 16:04, David Hildenbrand wrote:
>>>>> On 05.09.25 17:01, Usama Arif wrote:
>>>>>>
>>>>>>
>>>>>> On 05/09/2025 15:58, David Hildenbrand wrote:
>>>>>>> On 05.09.25 16:53, Usama Arif wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 05/09/2025 15:46, David Hildenbrand wrote:
>>>>>>>>> [...]
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The reason I did this is for the case if you change max_ptes_none after the THP is added
>>>>>>>>>> to deferred split list but *before* memory pressure, i.e. before the shrinker runs,
>>>>>>>>>> so that its considered for splitting.
>>>>>>>>>
>>>>>>>>> Yeah, I was assuming that was the reason why the shrinker is enabled as default.
>>>>>>>>>
>>>>>>>>> But in any sane system, the admin would enable the shrinker early. If not, we can look into handling it differently.
>>>>>>>>
>>>>>>>> Yes, I do this as well, i.e. have a low value from the start.
>>>>>>>>
>>>>>>>> Does it make sense to disable shrinker if max_ptes_none is 511? It wont shrink
>>>>>>>> the usecase you are describing below, but we wont encounter the increased CPU usage.>
>>>>>>>
>>>>>>> I don't really see why we should do that.
>>>>>>>
>>>>>>> If the shrinker is a problem than the shrinker should be disabled. But if it is enabled, we should be shrinking as documented.
>>>>>>>
>>>>>>> Without more magic around our THP toggles (we want less) :)
>>>>>>>
>>>>>>> Shrinking happens when we are under memory pressure, so I am not really sure how relevant the scanning bit is, and if it is relevant enought to change the shrinker default.
>>>>>>>
>>>>>>
>>>>>> yes agreed, I also dont have numbers to back up my worry, its all theoretical :)
>>>>>
>>>>> BTW, I was also wondering if we should just always add all THP to the deferred split list, and make the split toggle just affect whether we process them or not (scan or not).
>>>>>
>>>>> I mean, as a default we add all of them to the list already right now, even though nothing would ever get reclaimed as default.
>>>>>
>>>>> What's your take?
>>>>>
>>>>
>>>> hmm I probably didnt understand what you meant to say here:
>>>> we already add all of them to the list in __do_huge_pmd_anonymous_page and collapse_huge_page and
>>>> shrink_underused sets/clears split_underused_thp in deferred_split_folio decides whether we process or not.
>>>
>>> This is what I mean:
>>>
>>> commit 3952b6f6b671ca7d69fd1783b1abf4806f90d436 (HEAD -> max_ptes_none)
>>> Author: David Hildenbrand <david@redhat.com>
>>> Date:   Fri Sep 5 17:22:01 2025 +0200
>>>
>>>      mm/huge_memory: always add THPs to the deferred split list
>>>          When disabling the shrinker and then re-enabling it, any anon THPs
>>>      allocated in the meantime.
>>>          That also means that we cannot disable the shrinker as default during
>>>      boot, because we would miss some THPs later when enabling it.
>>>          So always add them to the deferred split list, and only skip the
>>>      scanning if the shrinker is disabled.
>>>          This is effectively what we do on all systems out there already, unless
>>>      they disable the shrinker.
>>>          Signed-off-by: David Hildenbrand <david@redhat.com>
>>>
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index aa3ed7a86435b..3ee857c1d3754 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -4052,9 +4052,6 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
>>>          if (folio_order(folio) <= 1)
>>>                  return;
>>>   -       if (!partially_mapped && !split_underused_thp)
>>> -               return;
>>> -
>>>          /*
>>>           * Exclude swapcache: originally to avoid a corrupt deferred split
>>>           * queue. Nowadays that is fully prevented by memcg1_swapout();
>>> @@ -4175,6 +4172,8 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>>>                  bool underused = false;
>>>                    if (!folio_test_partially_mapped(folio)) {
>>> +                       if (!split_underused_thp)
>>> +                               goto next;
>>>                          underused = thp_underused(folio);
>>>                          if (!underused)
>>>                                  goto next;
>>>
>>>
>>
>>
>> Thanks for sending the diff! Now I know what you meant lol.
>>
>> In the case of when shrinker is disabled, this could make the deferred split scan for partially mapped folios
>> very ineffective?
> 
> I hope you realize that that's the default on each and every system out there that ships this feature :)
> 

Yes, I made it default :)

I am assuming people either keep shrinker enabled (which is an extremely large majority as its default), or disable shrinker
and they dont flip flop between the 2 settings.
There are 2 scenarios for the above patch:

- shrinker is enabled (default): the above patch wont make a difference.
- shrinker is disabled: the above patch makes splitting partially mapped folios inefficient.

I didnt talk about the shrinker enabled case as it was a no-op and just talked about the shrinker disabled
case.

> And don't ask me how many people even know about disabling the shrinker or would do it, when the default setting is mostly not splitting many THPs ever.
> 
>>
>> I am making up numbers, but lets there are 128 THPs in the system, only 2 of them are partially mapped
>> and sc->nr_to_scan is 32.
>>
>> In the current code, with shrinker disabled, only the 2 partially mapped THPs will be on the deferred list, so
>> we will reclaim them in the first go.
>>
>> With your patch, the worst case scenario is that the partially mapped THPs are at the end of the deferred_list
>> and we would need 4 calls for the shrinker to split them.
> 
> Probably at some point we would want split lists as well, not sure how feasible that is.
> 

Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
Posted by David Hildenbrand 4 days, 8 hours ago
On 05.09.25 18:47, Usama Arif wrote:
> 
> 
> On 05/09/2025 16:58, David Hildenbrand wrote:
>> On 05.09.25 17:53, Usama Arif wrote:
>>>
>>>
>>> On 05/09/2025 16:28, David Hildenbrand wrote:
>>>> On 05.09.25 17:16, Usama Arif wrote:
>>>>>
>>>>>
>>>>> On 05/09/2025 16:04, David Hildenbrand wrote:
>>>>>> On 05.09.25 17:01, Usama Arif wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 05/09/2025 15:58, David Hildenbrand wrote:
>>>>>>>> On 05.09.25 16:53, Usama Arif wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 05/09/2025 15:46, David Hildenbrand wrote:
>>>>>>>>>> [...]
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> The reason I did this is for the case if you change max_ptes_none after the THP is added
>>>>>>>>>>> to deferred split list but *before* memory pressure, i.e. before the shrinker runs,
>>>>>>>>>>> so that its considered for splitting.
>>>>>>>>>>
>>>>>>>>>> Yeah, I was assuming that was the reason why the shrinker is enabled as default.
>>>>>>>>>>
>>>>>>>>>> But in any sane system, the admin would enable the shrinker early. If not, we can look into handling it differently.
>>>>>>>>>
>>>>>>>>> Yes, I do this as well, i.e. have a low value from the start.
>>>>>>>>>
>>>>>>>>> Does it make sense to disable shrinker if max_ptes_none is 511? It wont shrink
>>>>>>>>> the usecase you are describing below, but we wont encounter the increased CPU usage.>
>>>>>>>>
>>>>>>>> I don't really see why we should do that.
>>>>>>>>
>>>>>>>> If the shrinker is a problem than the shrinker should be disabled. But if it is enabled, we should be shrinking as documented.
>>>>>>>>
>>>>>>>> Without more magic around our THP toggles (we want less) :)
>>>>>>>>
>>>>>>>> Shrinking happens when we are under memory pressure, so I am not really sure how relevant the scanning bit is, and if it is relevant enought to change the shrinker default.
>>>>>>>>
>>>>>>>
>>>>>>> yes agreed, I also dont have numbers to back up my worry, its all theoretical :)
>>>>>>
>>>>>> BTW, I was also wondering if we should just always add all THP to the deferred split list, and make the split toggle just affect whether we process them or not (scan or not).
>>>>>>
>>>>>> I mean, as a default we add all of them to the list already right now, even though nothing would ever get reclaimed as default.
>>>>>>
>>>>>> What's your take?
>>>>>>
>>>>>
>>>>> hmm I probably didnt understand what you meant to say here:
>>>>> we already add all of them to the list in __do_huge_pmd_anonymous_page and collapse_huge_page and
>>>>> shrink_underused sets/clears split_underused_thp in deferred_split_folio decides whether we process or not.
>>>>
>>>> This is what I mean:
>>>>
>>>> commit 3952b6f6b671ca7d69fd1783b1abf4806f90d436 (HEAD -> max_ptes_none)
>>>> Author: David Hildenbrand <david@redhat.com>
>>>> Date:   Fri Sep 5 17:22:01 2025 +0200
>>>>
>>>>       mm/huge_memory: always add THPs to the deferred split list
>>>>           When disabling the shrinker and then re-enabling it, any anon THPs
>>>>       allocated in the meantime.
>>>>           That also means that we cannot disable the shrinker as default during
>>>>       boot, because we would miss some THPs later when enabling it.
>>>>           So always add them to the deferred split list, and only skip the
>>>>       scanning if the shrinker is disabled.
>>>>           This is effectively what we do on all systems out there already, unless
>>>>       they disable the shrinker.
>>>>           Signed-off-by: David Hildenbrand <david@redhat.com>
>>>>
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index aa3ed7a86435b..3ee857c1d3754 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -4052,9 +4052,6 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
>>>>           if (folio_order(folio) <= 1)
>>>>                   return;
>>>>    -       if (!partially_mapped && !split_underused_thp)
>>>> -               return;
>>>> -
>>>>           /*
>>>>            * Exclude swapcache: originally to avoid a corrupt deferred split
>>>>            * queue. Nowadays that is fully prevented by memcg1_swapout();
>>>> @@ -4175,6 +4172,8 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>>>>                   bool underused = false;
>>>>                     if (!folio_test_partially_mapped(folio)) {
>>>> +                       if (!split_underused_thp)
>>>> +                               goto next;
>>>>                           underused = thp_underused(folio);
>>>>                           if (!underused)
>>>>                                   goto next;
>>>>
>>>>
>>>
>>>
>>> Thanks for sending the diff! Now I know what you meant lol.
>>>
>>> In the case of when shrinker is disabled, this could make the deferred split scan for partially mapped folios
>>> very ineffective?
>>
>> I hope you realize that that's the default on each and every system out there that ships this feature :)
>>
> 
> Yes, I made it default :)
> 
> I am assuming people either keep shrinker enabled (which is an extremely large majority as its default), or disable shrinker
> and they dont flip flop between the 2 settings.
> There are 2 scenarios for the above patch:
> 
> - shrinker is enabled (default): the above patch wont make a difference.
> - shrinker is disabled: the above patch makes splitting partially mapped folios inefficient.
> 
> I didnt talk about the shrinker enabled case as it was a no-op and just talked about the shrinker disabled
> case.


Yeah, and I am saying that all you raised as a concern would be a 
problem already today in all default setups (-> 99.999999%). :)

Probably we should not just disable the shrinker during boot, and once 
enabled, it would only split THPs created afterwards.

With this patch it would also split ones created previously.

-- 
Cheers

David / dhildenb

Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
Posted by Usama Arif 4 days, 8 hours ago

On 05/09/2025 17:55, David Hildenbrand wrote:
> On 05.09.25 18:47, Usama Arif wrote:
>>
>>
>> On 05/09/2025 16:58, David Hildenbrand wrote:
>>> On 05.09.25 17:53, Usama Arif wrote:
>>>>
>>>>
>>>> On 05/09/2025 16:28, David Hildenbrand wrote:
>>>>> On 05.09.25 17:16, Usama Arif wrote:
>>>>>>
>>>>>>
>>>>>> On 05/09/2025 16:04, David Hildenbrand wrote:
>>>>>>> On 05.09.25 17:01, Usama Arif wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 05/09/2025 15:58, David Hildenbrand wrote:
>>>>>>>>> On 05.09.25 16:53, Usama Arif wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 05/09/2025 15:46, David Hildenbrand wrote:
>>>>>>>>>>> [...]
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> The reason I did this is for the case if you change max_ptes_none after the THP is added
>>>>>>>>>>>> to deferred split list but *before* memory pressure, i.e. before the shrinker runs,
>>>>>>>>>>>> so that its considered for splitting.
>>>>>>>>>>>
>>>>>>>>>>> Yeah, I was assuming that was the reason why the shrinker is enabled as default.
>>>>>>>>>>>
>>>>>>>>>>> But in any sane system, the admin would enable the shrinker early. If not, we can look into handling it differently.
>>>>>>>>>>
>>>>>>>>>> Yes, I do this as well, i.e. have a low value from the start.
>>>>>>>>>>
>>>>>>>>>> Does it make sense to disable shrinker if max_ptes_none is 511? It wont shrink
>>>>>>>>>> the usecase you are describing below, but we wont encounter the increased CPU usage.>
>>>>>>>>>
>>>>>>>>> I don't really see why we should do that.
>>>>>>>>>
>>>>>>>>> If the shrinker is a problem than the shrinker should be disabled. But if it is enabled, we should be shrinking as documented.
>>>>>>>>>
>>>>>>>>> Without more magic around our THP toggles (we want less) :)
>>>>>>>>>
>>>>>>>>> Shrinking happens when we are under memory pressure, so I am not really sure how relevant the scanning bit is, and if it is relevant enought to change the shrinker default.
>>>>>>>>>
>>>>>>>>
>>>>>>>> yes agreed, I also dont have numbers to back up my worry, its all theoretical :)
>>>>>>>
>>>>>>> BTW, I was also wondering if we should just always add all THP to the deferred split list, and make the split toggle just affect whether we process them or not (scan or not).
>>>>>>>
>>>>>>> I mean, as a default we add all of them to the list already right now, even though nothing would ever get reclaimed as default.
>>>>>>>
>>>>>>> What's your take?
>>>>>>>
>>>>>>
>>>>>> hmm I probably didnt understand what you meant to say here:
>>>>>> we already add all of them to the list in __do_huge_pmd_anonymous_page and collapse_huge_page and
>>>>>> shrink_underused sets/clears split_underused_thp in deferred_split_folio decides whether we process or not.
>>>>>
>>>>> This is what I mean:
>>>>>
>>>>> commit 3952b6f6b671ca7d69fd1783b1abf4806f90d436 (HEAD -> max_ptes_none)
>>>>> Author: David Hildenbrand <david@redhat.com>
>>>>> Date:   Fri Sep 5 17:22:01 2025 +0200
>>>>>
>>>>>       mm/huge_memory: always add THPs to the deferred split list
>>>>>           When disabling the shrinker and then re-enabling it, any anon THPs
>>>>>       allocated in the meantime.
>>>>>           That also means that we cannot disable the shrinker as default during
>>>>>       boot, because we would miss some THPs later when enabling it.
>>>>>           So always add them to the deferred split list, and only skip the
>>>>>       scanning if the shrinker is disabled.
>>>>>           This is effectively what we do on all systems out there already, unless
>>>>>       they disable the shrinker.
>>>>>           Signed-off-by: David Hildenbrand <david@redhat.com>
>>>>>
>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>> index aa3ed7a86435b..3ee857c1d3754 100644
>>>>> --- a/mm/huge_memory.c
>>>>> +++ b/mm/huge_memory.c
>>>>> @@ -4052,9 +4052,6 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
>>>>>           if (folio_order(folio) <= 1)
>>>>>                   return;
>>>>>    -       if (!partially_mapped && !split_underused_thp)
>>>>> -               return;
>>>>> -
>>>>>           /*
>>>>>            * Exclude swapcache: originally to avoid a corrupt deferred split
>>>>>            * queue. Nowadays that is fully prevented by memcg1_swapout();
>>>>> @@ -4175,6 +4172,8 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>>>>>                   bool underused = false;
>>>>>                     if (!folio_test_partially_mapped(folio)) {
>>>>> +                       if (!split_underused_thp)
>>>>> +                               goto next;
>>>>>                           underused = thp_underused(folio);
>>>>>                           if (!underused)
>>>>>                                   goto next;
>>>>>
>>>>>
>>>>
>>>>
>>>> Thanks for sending the diff! Now I know what you meant lol.
>>>>
>>>> In the case of when shrinker is disabled, this could make the deferred split scan for partially mapped folios
>>>> very ineffective?
>>>
>>> I hope you realize that that's the default on each and every system out there that ships this feature :)
>>>
>>
>> Yes, I made it default :)
>>
>> I am assuming people either keep shrinker enabled (which is an extremely large majority as its default), or disable shrinker
>> and they dont flip flop between the 2 settings.
>> There are 2 scenarios for the above patch:
>>
>> - shrinker is enabled (default): the above patch wont make a difference.
>> - shrinker is disabled: the above patch makes splitting partially mapped folios inefficient.
>>
>> I didnt talk about the shrinker enabled case as it was a no-op and just talked about the shrinker disabled
>> case.
> 
> 
> Yeah, and I am saying that all you raised as a concern would be a problem already today in all default setups (-> 99.999999%). :)
> 
> Probably we should not just disable the shrinker during boot, and once enabled, it would only split THPs created afterwards.
> 

I probably didnt understand this again lol. Sorry its friday evening :)

split_underused_thp is true at boot time [1]. You are saying we should not disable shrinker during boot, but it is already not
disabled during boot, right?


If someone goes with system default, which is THP shrinker enabled (from boot and runtime), the above patch is a no-op, right?


> With this patch it would also split ones created previously.
> 

yes, if someone changes from shrinker being disabled to shrinker being enabled before memory pressure.

[1] https://elixir.bootlin.com/linux/v6.16.4/source/mm/huge_memory.c#L76
Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
Posted by David Hildenbrand 1 day, 16 hours ago
On 05.09.25 19:26, Usama Arif wrote:
> 
> 
> On 05/09/2025 17:55, David Hildenbrand wrote:
>> On 05.09.25 18:47, Usama Arif wrote:
>>>
>>>
>>> On 05/09/2025 16:58, David Hildenbrand wrote:
>>>> On 05.09.25 17:53, Usama Arif wrote:
>>>>>
>>>>>
>>>>> On 05/09/2025 16:28, David Hildenbrand wrote:
>>>>>> On 05.09.25 17:16, Usama Arif wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 05/09/2025 16:04, David Hildenbrand wrote:
>>>>>>>> On 05.09.25 17:01, Usama Arif wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 05/09/2025 15:58, David Hildenbrand wrote:
>>>>>>>>>> On 05.09.25 16:53, Usama Arif wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 05/09/2025 15:46, David Hildenbrand wrote:
>>>>>>>>>>>> [...]
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> The reason I did this is for the case if you change max_ptes_none after the THP is added
>>>>>>>>>>>>> to deferred split list but *before* memory pressure, i.e. before the shrinker runs,
>>>>>>>>>>>>> so that its considered for splitting.
>>>>>>>>>>>>
>>>>>>>>>>>> Yeah, I was assuming that was the reason why the shrinker is enabled as default.
>>>>>>>>>>>>
>>>>>>>>>>>> But in any sane system, the admin would enable the shrinker early. If not, we can look into handling it differently.
>>>>>>>>>>>
>>>>>>>>>>> Yes, I do this as well, i.e. have a low value from the start.
>>>>>>>>>>>
>>>>>>>>>>> Does it make sense to disable shrinker if max_ptes_none is 511? It wont shrink
>>>>>>>>>>> the usecase you are describing below, but we wont encounter the increased CPU usage.>
>>>>>>>>>>
>>>>>>>>>> I don't really see why we should do that.
>>>>>>>>>>
>>>>>>>>>> If the shrinker is a problem than the shrinker should be disabled. But if it is enabled, we should be shrinking as documented.
>>>>>>>>>>
>>>>>>>>>> Without more magic around our THP toggles (we want less) :)
>>>>>>>>>>
>>>>>>>>>> Shrinking happens when we are under memory pressure, so I am not really sure how relevant the scanning bit is, and if it is relevant enought to change the shrinker default.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> yes agreed, I also dont have numbers to back up my worry, its all theoretical :)
>>>>>>>>
>>>>>>>> BTW, I was also wondering if we should just always add all THP to the deferred split list, and make the split toggle just affect whether we process them or not (scan or not).
>>>>>>>>
>>>>>>>> I mean, as a default we add all of them to the list already right now, even though nothing would ever get reclaimed as default.
>>>>>>>>
>>>>>>>> What's your take?
>>>>>>>>
>>>>>>>
>>>>>>> hmm I probably didnt understand what you meant to say here:
>>>>>>> we already add all of them to the list in __do_huge_pmd_anonymous_page and collapse_huge_page and
>>>>>>> shrink_underused sets/clears split_underused_thp in deferred_split_folio decides whether we process or not.
>>>>>>
>>>>>> This is what I mean:
>>>>>>
>>>>>> commit 3952b6f6b671ca7d69fd1783b1abf4806f90d436 (HEAD -> max_ptes_none)
>>>>>> Author: David Hildenbrand <david@redhat.com>
>>>>>> Date:   Fri Sep 5 17:22:01 2025 +0200
>>>>>>
>>>>>>        mm/huge_memory: always add THPs to the deferred split list
>>>>>>            When disabling the shrinker and then re-enabling it, any anon THPs
>>>>>>        allocated in the meantime.
>>>>>>            That also means that we cannot disable the shrinker as default during
>>>>>>        boot, because we would miss some THPs later when enabling it.
>>>>>>            So always add them to the deferred split list, and only skip the
>>>>>>        scanning if the shrinker is disabled.
>>>>>>            This is effectively what we do on all systems out there already, unless
>>>>>>        they disable the shrinker.
>>>>>>            Signed-off-by: David Hildenbrand <david@redhat.com>
>>>>>>
>>>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>>>> index aa3ed7a86435b..3ee857c1d3754 100644
>>>>>> --- a/mm/huge_memory.c
>>>>>> +++ b/mm/huge_memory.c
>>>>>> @@ -4052,9 +4052,6 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
>>>>>>            if (folio_order(folio) <= 1)
>>>>>>                    return;
>>>>>>     -       if (!partially_mapped && !split_underused_thp)
>>>>>> -               return;
>>>>>> -
>>>>>>            /*
>>>>>>             * Exclude swapcache: originally to avoid a corrupt deferred split
>>>>>>             * queue. Nowadays that is fully prevented by memcg1_swapout();
>>>>>> @@ -4175,6 +4172,8 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>>>>>>                    bool underused = false;
>>>>>>                      if (!folio_test_partially_mapped(folio)) {
>>>>>> +                       if (!split_underused_thp)
>>>>>> +                               goto next;
>>>>>>                            underused = thp_underused(folio);
>>>>>>                            if (!underused)
>>>>>>                                    goto next;
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> Thanks for sending the diff! Now I know what you meant lol.
>>>>>
>>>>> In the case of when shrinker is disabled, this could make the deferred split scan for partially mapped folios
>>>>> very ineffective?
>>>>
>>>> I hope you realize that that's the default on each and every system out there that ships this feature :)
>>>>
>>>
>>> Yes, I made it default :)
>>>
>>> I am assuming people either keep shrinker enabled (which is an extremely large majority as its default), or disable shrinker
>>> and they dont flip flop between the 2 settings.
>>> There are 2 scenarios for the above patch:
>>>
>>> - shrinker is enabled (default): the above patch wont make a difference.
>>> - shrinker is disabled: the above patch makes splitting partially mapped folios inefficient.
>>>
>>> I didnt talk about the shrinker enabled case as it was a no-op and just talked about the shrinker disabled
>>> case.
>>
>>
>> Yeah, and I am saying that all you raised as a concern would be a problem already today in all default setups (-> 99.999999%). :)
>>
>> Probably we should not just disable the shrinker during boot, and once enabled, it would only split THPs created afterwards.
>>
> 
> I probably didnt understand this again lol. Sorry its friday evening :)
> 
> split_underused_thp is true at boot time [1]. You are saying we should not disable shrinker during boot, but it is already not
> disabled during boot, right?

It's all a mess, sorry, currently it's "enabled" as default but actually 
"disabled". Let me see if I can clean all that up.

-- 
Cheers

David / dhildenb

Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
Posted by Usama Arif 4 days, 9 hours ago

On 05/09/2025 16:53, Usama Arif wrote:
> 
> 
> On 05/09/2025 16:28, David Hildenbrand wrote:
>> On 05.09.25 17:16, Usama Arif wrote:
>>>
>>>
>>> On 05/09/2025 16:04, David Hildenbrand wrote:
>>>> On 05.09.25 17:01, Usama Arif wrote:
>>>>>
>>>>>
>>>>> On 05/09/2025 15:58, David Hildenbrand wrote:
>>>>>> On 05.09.25 16:53, Usama Arif wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 05/09/2025 15:46, David Hildenbrand wrote:
>>>>>>>> [...]
>>>>>>>>
>>>>>>>>>
>>>>>>>>> The reason I did this is for the case if you change max_ptes_none after the THP is added
>>>>>>>>> to deferred split list but *before* memory pressure, i.e. before the shrinker runs,
>>>>>>>>> so that its considered for splitting.
>>>>>>>>
>>>>>>>> Yeah, I was assuming that was the reason why the shrinker is enabled as default.
>>>>>>>>
>>>>>>>> But in any sane system, the admin would enable the shrinker early. If not, we can look into handling it differently.
>>>>>>>
>>>>>>> Yes, I do this as well, i.e. have a low value from the start.
>>>>>>>
>>>>>>> Does it make sense to disable shrinker if max_ptes_none is 511? It wont shrink
>>>>>>> the usecase you are describing below, but we wont encounter the increased CPU usage.>
>>>>>>
>>>>>> I don't really see why we should do that.
>>>>>>
>>>>>> If the shrinker is a problem than the shrinker should be disabled. But if it is enabled, we should be shrinking as documented.
>>>>>>
>>>>>> Without more magic around our THP toggles (we want less) :)
>>>>>>
>>>>>> Shrinking happens when we are under memory pressure, so I am not really sure how relevant the scanning bit is, and if it is relevant enought to change the shrinker default.
>>>>>>
>>>>>
>>>>> yes agreed, I also dont have numbers to back up my worry, its all theoretical :)
>>>>
>>>> BTW, I was also wondering if we should just always add all THP to the deferred split list, and make the split toggle just affect whether we process them or not (scan or not).
>>>>
>>>> I mean, as a default we add all of them to the list already right now, even though nothing would ever get reclaimed as default.
>>>>
>>>> What's your take?
>>>>
>>>
>>> hmm I probably didnt understand what you meant to say here:
>>> we already add all of them to the list in __do_huge_pmd_anonymous_page and collapse_huge_page and
>>> shrink_underused sets/clears split_underused_thp in deferred_split_folio decides whether we process or not.
>>
>> This is what I mean:
>>
>> commit 3952b6f6b671ca7d69fd1783b1abf4806f90d436 (HEAD -> max_ptes_none)
>> Author: David Hildenbrand <david@redhat.com>
>> Date:   Fri Sep 5 17:22:01 2025 +0200
>>
>>     mm/huge_memory: always add THPs to the deferred split list
>>         When disabling the shrinker and then re-enabling it, any anon THPs
>>     allocated in the meantime.
>>         That also means that we cannot disable the shrinker as default during
>>     boot, because we would miss some THPs later when enabling it.
>>         So always add them to the deferred split list, and only skip the
>>     scanning if the shrinker is disabled.
>>         This is effectively what we do on all systems out there already, unless
>>     they disable the shrinker.
>>         Signed-off-by: David Hildenbrand <david@redhat.com>
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index aa3ed7a86435b..3ee857c1d3754 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -4052,9 +4052,6 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
>>         if (folio_order(folio) <= 1)
>>                 return;
>>  
>> -       if (!partially_mapped && !split_underused_thp)
>> -               return;
>> -
>>         /*
>>          * Exclude swapcache: originally to avoid a corrupt deferred split
>>          * queue. Nowadays that is fully prevented by memcg1_swapout();
>> @@ -4175,6 +4172,8 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>>                 bool underused = false;
>>  
>>                 if (!folio_test_partially_mapped(folio)) {
>> +                       if (!split_underused_thp)
>> +                               goto next;
>>                         underused = thp_underused(folio);
>>                         if (!underused)
>>                                 goto next;
>>
>>
> 
> 
> Thanks for sending the diff! Now I know what you meant lol.
> 
> In the case of when shrinker is disabled, this could make the deferred split scan for partially mapped folios
> very ineffective?
> 
> I am making up numbers, but lets there are 128 THPs in the system, only 2 of them are partially mapped
> and sc->nr_to_scan is 32.
> 
> In the current code, with shrinker disabled, only the 2 partially mapped THPs will be on the deferred list, so
> we will reclaim them in the first go.
> 
> With your patch, the worst case scenario is that the partially mapped THPs are at the end of the deferred_list
> and we would need 4 calls for the shrinker to split them.


And I am hoping people are not dynamically enabling/disabling THP shrinker :)


I have ideas about dynamically changing max_ptes_none, maybe based on system metrics like memory pressure,
but not enabling/disabling shrinker.

Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
Posted by Zi Yan 4 days, 11 hours ago
On 5 Sep 2025, at 10:11, David Hildenbrand wrote:

> We added an early exit in thp_underused(), probably to avoid scanning
> pages when there is no chance for success.
>
> However, assume we have max_ptes_none = 511 (default).
>
> Nothing should stop us from freeing all pages part of a THP that
> is completely zero (512) and khugepaged will for sure not try to
> instantiate a THP in that case (512 shared zeropages).
>
> This can just trivially happen if someone writes a single 0 byte into a
> PMD area, or of course, when data ends up being zero later.
>
> So let's remove that early exit.
>
> Do we want to CC stable? Hm, not sure. Probably not urgent.
>
> Note that, as default, the THP shrinker is active
> (/sys/kernel/mm/transparent_hugepage/shrink_underused = 1), and all
> THPs are added to the deferred split lists. However, with the
> max_ptes_none default we would never scan them. We would not do that. If
> that's not desirable, we should just disable the shrinker as default,
> also not adding all THPs to the deferred split lists.
>
> Easy to reproduce:
>
> 1) Allocate some THPs filled with 0s
>
> <prog.c>
>  #include <string.h>
>  #include <stdio.h>
>  #include <stdlib.h>
>  #include <unistd.h>
>  #include <sys/mman.h>
>
>  const size_t size = 1024*1024*1024;
>
>  int main(void)
>  {
>          size_t offs;
>          char *area;
>
>          area = mmap(0, size, PROT_READ | PROT_WRITE,
>                      MAP_ANON | MAP_PRIVATE, -1, 0);
>          if (area == MAP_FAILED) {
>                  printf("mmap failed\n");
>                  exit(-1);
>          }
>          madvise(area, size, MADV_HUGEPAGE);
>
>          for (offs = 0; offs < size; offs += getpagesize())
>                  area[offs] = 0;
>          pause();
>  }
> <\prog.c>
>
> 2) Trigger the shrinker
>
> E.g., memory pressure through memhog
>
> 3) Observe that THPs are not getting reclaimed
>
> $ cat /proc/`pgrep prog`/smaps_rollup
>
> Would list ~1GiB of AnonHugePages. With this fix, they would get
> reclaimed as expected.
>
> Fixes: dafff3f4c850 ("mm: split underused THPs")
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> Cc: Nico Pache <npache@redhat.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: Dev Jain <dev.jain@arm.com>
> Cc: Barry Song <baohua@kernel.org>
> Cc: Usama Arif <usamaarif642@gmail.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/huge_memory.c | 3 ---
>  1 file changed, 3 deletions(-)
>
LGTM. Acked-by: Zi Yan <ziy@nvidia.com>

I also notice that thp_underused() checks num_zero_pages directly
against khugepaged_max_ptes_none. This means mTHPs will never be regarded
as underused. A similar issue you are discussing in Nico’s khugepaged
mTHP support. Maybe checks against these khugepaged_max* variables
should be calculated based on nr_pages of a large folio, like
making these variables a ratio in other discussion.

Best Regards,
Yan, Zi
Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
Posted by Usama Arif 4 days, 11 hours ago

On 05/09/2025 15:37, Zi Yan wrote:
> On 5 Sep 2025, at 10:11, David Hildenbrand wrote:
> 
>> We added an early exit in thp_underused(), probably to avoid scanning
>> pages when there is no chance for success.
>>
>> However, assume we have max_ptes_none = 511 (default).
>>
>> Nothing should stop us from freeing all pages part of a THP that
>> is completely zero (512) and khugepaged will for sure not try to
>> instantiate a THP in that case (512 shared zeropages).
>>
>> This can just trivially happen if someone writes a single 0 byte into a
>> PMD area, or of course, when data ends up being zero later.
>>
>> So let's remove that early exit.
>>
>> Do we want to CC stable? Hm, not sure. Probably not urgent.
>>
>> Note that, as default, the THP shrinker is active
>> (/sys/kernel/mm/transparent_hugepage/shrink_underused = 1), and all
>> THPs are added to the deferred split lists. However, with the
>> max_ptes_none default we would never scan them. We would not do that. If
>> that's not desirable, we should just disable the shrinker as default,
>> also not adding all THPs to the deferred split lists.
>>
>> Easy to reproduce:
>>
>> 1) Allocate some THPs filled with 0s
>>
>> <prog.c>
>>  #include <string.h>
>>  #include <stdio.h>
>>  #include <stdlib.h>
>>  #include <unistd.h>
>>  #include <sys/mman.h>
>>
>>  const size_t size = 1024*1024*1024;
>>
>>  int main(void)
>>  {
>>          size_t offs;
>>          char *area;
>>
>>          area = mmap(0, size, PROT_READ | PROT_WRITE,
>>                      MAP_ANON | MAP_PRIVATE, -1, 0);
>>          if (area == MAP_FAILED) {
>>                  printf("mmap failed\n");
>>                  exit(-1);
>>          }
>>          madvise(area, size, MADV_HUGEPAGE);
>>
>>          for (offs = 0; offs < size; offs += getpagesize())
>>                  area[offs] = 0;
>>          pause();
>>  }
>> <\prog.c>
>>
>> 2) Trigger the shrinker
>>
>> E.g., memory pressure through memhog
>>
>> 3) Observe that THPs are not getting reclaimed
>>
>> $ cat /proc/`pgrep prog`/smaps_rollup
>>
>> Would list ~1GiB of AnonHugePages. With this fix, they would get
>> reclaimed as expected.
>>
>> Fixes: dafff3f4c850 ("mm: split underused THPs")
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>> Cc: Nico Pache <npache@redhat.com>
>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>> Cc: Dev Jain <dev.jain@arm.com>
>> Cc: Barry Song <baohua@kernel.org>
>> Cc: Usama Arif <usamaarif642@gmail.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>  mm/huge_memory.c | 3 ---
>>  1 file changed, 3 deletions(-)
>>
> LGTM. Acked-by: Zi Yan <ziy@nvidia.com>
> 
> I also notice that thp_underused() checks num_zero_pages directly
> against khugepaged_max_ptes_none. This means mTHPs will never be regarded
> as underused. A similar issue you are discussing in Nico’s khugepaged
> mTHP support. Maybe checks against these khugepaged_max* variables
> should be calculated based on nr_pages of a large folio, like
> making these variables a ratio in other discussion.

I unfortunately didnt follow the series in the latest revisions.

In the earlier revisions, I think it was decided to not add mTHPs to shrinker
as a start, as there are diminshing returns for smaller THPs and having a lot
of smaller mTHPs in the deferred list might mean that we get to PMD mapped THPs
a lot slower?

> 
> Best Regards,
> Yan, Zi

Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
Posted by David Hildenbrand 4 days, 11 hours ago
On 05.09.25 16:43, Usama Arif wrote:
> 
> 
> On 05/09/2025 15:37, Zi Yan wrote:
>> On 5 Sep 2025, at 10:11, David Hildenbrand wrote:
>>
>>> We added an early exit in thp_underused(), probably to avoid scanning
>>> pages when there is no chance for success.
>>>
>>> However, assume we have max_ptes_none = 511 (default).
>>>
>>> Nothing should stop us from freeing all pages part of a THP that
>>> is completely zero (512) and khugepaged will for sure not try to
>>> instantiate a THP in that case (512 shared zeropages).
>>>
>>> This can just trivially happen if someone writes a single 0 byte into a
>>> PMD area, or of course, when data ends up being zero later.
>>>
>>> So let's remove that early exit.
>>>
>>> Do we want to CC stable? Hm, not sure. Probably not urgent.
>>>
>>> Note that, as default, the THP shrinker is active
>>> (/sys/kernel/mm/transparent_hugepage/shrink_underused = 1), and all
>>> THPs are added to the deferred split lists. However, with the
>>> max_ptes_none default we would never scan them. We would not do that. If
>>> that's not desirable, we should just disable the shrinker as default,
>>> also not adding all THPs to the deferred split lists.
>>>
>>> Easy to reproduce:
>>>
>>> 1) Allocate some THPs filled with 0s
>>>
>>> <prog.c>
>>>   #include <string.h>
>>>   #include <stdio.h>
>>>   #include <stdlib.h>
>>>   #include <unistd.h>
>>>   #include <sys/mman.h>
>>>
>>>   const size_t size = 1024*1024*1024;
>>>
>>>   int main(void)
>>>   {
>>>           size_t offs;
>>>           char *area;
>>>
>>>           area = mmap(0, size, PROT_READ | PROT_WRITE,
>>>                       MAP_ANON | MAP_PRIVATE, -1, 0);
>>>           if (area == MAP_FAILED) {
>>>                   printf("mmap failed\n");
>>>                   exit(-1);
>>>           }
>>>           madvise(area, size, MADV_HUGEPAGE);
>>>
>>>           for (offs = 0; offs < size; offs += getpagesize())
>>>                   area[offs] = 0;
>>>           pause();
>>>   }
>>> <\prog.c>
>>>
>>> 2) Trigger the shrinker
>>>
>>> E.g., memory pressure through memhog
>>>
>>> 3) Observe that THPs are not getting reclaimed
>>>
>>> $ cat /proc/`pgrep prog`/smaps_rollup
>>>
>>> Would list ~1GiB of AnonHugePages. With this fix, they would get
>>> reclaimed as expected.
>>>
>>> Fixes: dafff3f4c850 ("mm: split underused THPs")
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>> Cc: Zi Yan <ziy@nvidia.com>
>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>>> Cc: Nico Pache <npache@redhat.com>
>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>> Cc: Dev Jain <dev.jain@arm.com>
>>> Cc: Barry Song <baohua@kernel.org>
>>> Cc: Usama Arif <usamaarif642@gmail.com>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>>   mm/huge_memory.c | 3 ---
>>>   1 file changed, 3 deletions(-)
>>>
>> LGTM. Acked-by: Zi Yan <ziy@nvidia.com>
>>
>> I also notice that thp_underused() checks num_zero_pages directly
>> against khugepaged_max_ptes_none. This means mTHPs will never be regarded
>> as underused. A similar issue you are discussing in Nico’s khugepaged
>> mTHP support. Maybe checks against these khugepaged_max* variables
>> should be calculated based on nr_pages of a large folio, like
>> making these variables a ratio in other discussion.
> 
> I unfortunately didnt follow the series in the latest revisions.
> 
> In the earlier revisions, I think it was decided to not add mTHPs to shrinker
> as a start, as there are diminshing returns for smaller THPs and having a lot
> of smaller mTHPs in the deferred list might mean that we get to PMD mapped THPs
> a lot slower?

Probably we would want lists per order etc.

-- 
Cheers

David / dhildenb

Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
Posted by Usama Arif 4 days, 10 hours ago

On 05/09/2025 15:47, David Hildenbrand wrote:
> On 05.09.25 16:43, Usama Arif wrote:
>>
>>
>> On 05/09/2025 15:37, Zi Yan wrote:
>>> On 5 Sep 2025, at 10:11, David Hildenbrand wrote:
>>>
>>>> We added an early exit in thp_underused(), probably to avoid scanning
>>>> pages when there is no chance for success.
>>>>
>>>> However, assume we have max_ptes_none = 511 (default).
>>>>
>>>> Nothing should stop us from freeing all pages part of a THP that
>>>> is completely zero (512) and khugepaged will for sure not try to
>>>> instantiate a THP in that case (512 shared zeropages).
>>>>
>>>> This can just trivially happen if someone writes a single 0 byte into a
>>>> PMD area, or of course, when data ends up being zero later.
>>>>
>>>> So let's remove that early exit.
>>>>
>>>> Do we want to CC stable? Hm, not sure. Probably not urgent.
>>>>
>>>> Note that, as default, the THP shrinker is active
>>>> (/sys/kernel/mm/transparent_hugepage/shrink_underused = 1), and all
>>>> THPs are added to the deferred split lists. However, with the
>>>> max_ptes_none default we would never scan them. We would not do that. If
>>>> that's not desirable, we should just disable the shrinker as default,
>>>> also not adding all THPs to the deferred split lists.
>>>>
>>>> Easy to reproduce:
>>>>
>>>> 1) Allocate some THPs filled with 0s
>>>>
>>>> <prog.c>
>>>>   #include <string.h>
>>>>   #include <stdio.h>
>>>>   #include <stdlib.h>
>>>>   #include <unistd.h>
>>>>   #include <sys/mman.h>
>>>>
>>>>   const size_t size = 1024*1024*1024;
>>>>
>>>>   int main(void)
>>>>   {
>>>>           size_t offs;
>>>>           char *area;
>>>>
>>>>           area = mmap(0, size, PROT_READ | PROT_WRITE,
>>>>                       MAP_ANON | MAP_PRIVATE, -1, 0);
>>>>           if (area == MAP_FAILED) {
>>>>                   printf("mmap failed\n");
>>>>                   exit(-1);
>>>>           }
>>>>           madvise(area, size, MADV_HUGEPAGE);
>>>>
>>>>           for (offs = 0; offs < size; offs += getpagesize())
>>>>                   area[offs] = 0;
>>>>           pause();
>>>>   }
>>>> <\prog.c>
>>>>
>>>> 2) Trigger the shrinker
>>>>
>>>> E.g., memory pressure through memhog
>>>>
>>>> 3) Observe that THPs are not getting reclaimed
>>>>
>>>> $ cat /proc/`pgrep prog`/smaps_rollup
>>>>
>>>> Would list ~1GiB of AnonHugePages. With this fix, they would get
>>>> reclaimed as expected.
>>>>
>>>> Fixes: dafff3f4c850 ("mm: split underused THPs")
>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>>>> Cc: Zi Yan <ziy@nvidia.com>
>>>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>>>> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>>>> Cc: Nico Pache <npache@redhat.com>
>>>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>>>> Cc: Dev Jain <dev.jain@arm.com>
>>>> Cc: Barry Song <baohua@kernel.org>
>>>> Cc: Usama Arif <usamaarif642@gmail.com>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>> ---
>>>>   mm/huge_memory.c | 3 ---
>>>>   1 file changed, 3 deletions(-)
>>>>
>>> LGTM. Acked-by: Zi Yan <ziy@nvidia.com>
>>>
>>> I also notice that thp_underused() checks num_zero_pages directly
>>> against khugepaged_max_ptes_none. This means mTHPs will never be regarded
>>> as underused. A similar issue you are discussing in Nico’s khugepaged
>>> mTHP support. Maybe checks against these khugepaged_max* variables
>>> should be calculated based on nr_pages of a large folio, like
>>> making these variables a ratio in other discussion.
>>
>> I unfortunately didnt follow the series in the latest revisions.
>>
>> In the earlier revisions, I think it was decided to not add mTHPs to shrinker
>> as a start, as there are diminshing returns for smaller THPs and having a lot
>> of smaller mTHPs in the deferred list might mean that we get to PMD mapped THPs
>> a lot slower?
> 
> Probably we would want lists per order etc.
> 

Yes that makes sense! and we start with the highest order list.

Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
Posted by David Hildenbrand 4 days, 11 hours ago
On 05.09.25 16:37, Zi Yan wrote:
> On 5 Sep 2025, at 10:11, David Hildenbrand wrote:
> 
>> We added an early exit in thp_underused(), probably to avoid scanning
>> pages when there is no chance for success.
>>
>> However, assume we have max_ptes_none = 511 (default).
>>
>> Nothing should stop us from freeing all pages part of a THP that
>> is completely zero (512) and khugepaged will for sure not try to
>> instantiate a THP in that case (512 shared zeropages).
>>
>> This can just trivially happen if someone writes a single 0 byte into a
>> PMD area, or of course, when data ends up being zero later.
>>
>> So let's remove that early exit.
>>
>> Do we want to CC stable? Hm, not sure. Probably not urgent.
>>
>> Note that, as default, the THP shrinker is active
>> (/sys/kernel/mm/transparent_hugepage/shrink_underused = 1), and all
>> THPs are added to the deferred split lists. However, with the
>> max_ptes_none default we would never scan them. We would not do that. If
>> that's not desirable, we should just disable the shrinker as default,
>> also not adding all THPs to the deferred split lists.
>>
>> Easy to reproduce:
>>
>> 1) Allocate some THPs filled with 0s
>>
>> <prog.c>
>>   #include <string.h>
>>   #include <stdio.h>
>>   #include <stdlib.h>
>>   #include <unistd.h>
>>   #include <sys/mman.h>
>>
>>   const size_t size = 1024*1024*1024;
>>
>>   int main(void)
>>   {
>>           size_t offs;
>>           char *area;
>>
>>           area = mmap(0, size, PROT_READ | PROT_WRITE,
>>                       MAP_ANON | MAP_PRIVATE, -1, 0);
>>           if (area == MAP_FAILED) {
>>                   printf("mmap failed\n");
>>                   exit(-1);
>>           }
>>           madvise(area, size, MADV_HUGEPAGE);
>>
>>           for (offs = 0; offs < size; offs += getpagesize())
>>                   area[offs] = 0;
>>           pause();
>>   }
>> <\prog.c>
>>
>> 2) Trigger the shrinker
>>
>> E.g., memory pressure through memhog
>>
>> 3) Observe that THPs are not getting reclaimed
>>
>> $ cat /proc/`pgrep prog`/smaps_rollup
>>
>> Would list ~1GiB of AnonHugePages. With this fix, they would get
>> reclaimed as expected.
>>
>> Fixes: dafff3f4c850 ("mm: split underused THPs")
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
>> Cc: Zi Yan <ziy@nvidia.com>
>> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
>> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
>> Cc: Nico Pache <npache@redhat.com>
>> Cc: Ryan Roberts <ryan.roberts@arm.com>
>> Cc: Dev Jain <dev.jain@arm.com>
>> Cc: Barry Song <baohua@kernel.org>
>> Cc: Usama Arif <usamaarif642@gmail.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   mm/huge_memory.c | 3 ---
>>   1 file changed, 3 deletions(-)
>>
> LGTM. Acked-by: Zi Yan <ziy@nvidia.com>
> 
> I also notice that thp_underused() checks num_zero_pages directly
> against khugepaged_max_ptes_none. This means mTHPs will never be regarded
> as underused. A similar issue you are discussing in Nico’s khugepaged
> mTHP support. Maybe checks against these khugepaged_max* variables
> should be calculated based on nr_pages of a large folio, like
> making these variables a ratio in other discussion.

Yeah, factoring that out and cleaning it up is my next step.

But note that mTHPs are not a candidate for the shrinker right now. (see 
my explanation in reply to Nicos patch)

-- 
Cheers

David / dhildenb

Re: [PATCH v1] mm/huge_memory: fix shrinking of all-zero THPs with max_ptes_none default
Posted by Lance Yang 3 days, 19 hours ago
On Fri, Sep 5, 2025 at 11:17 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 05.09.25 16:37, Zi Yan wrote:
> > On 5 Sep 2025, at 10:11, David Hildenbrand wrote:
> >
> >> We added an early exit in thp_underused(), probably to avoid scanning
> >> pages when there is no chance for success.
> >>
> >> However, assume we have max_ptes_none = 511 (default).
> >>
> >> Nothing should stop us from freeing all pages part of a THP that
> >> is completely zero (512) and khugepaged will for sure not try to
> >> instantiate a THP in that case (512 shared zeropages).
> >>
> >> This can just trivially happen if someone writes a single 0 byte into a
> >> PMD area, or of course, when data ends up being zero later.
> >>
> >> So let's remove that early exit.
> >>
> >> Do we want to CC stable? Hm, not sure. Probably not urgent.
> >>
> >> Note that, as default, the THP shrinker is active
> >> (/sys/kernel/mm/transparent_hugepage/shrink_underused = 1), and all
> >> THPs are added to the deferred split lists. However, with the
> >> max_ptes_none default we would never scan them. We would not do that. If
> >> that's not desirable, we should just disable the shrinker as default,
> >> also not adding all THPs to the deferred split lists.
> >>
> >> Easy to reproduce:
> >>
> >> 1) Allocate some THPs filled with 0s
> >>
> >> <prog.c>
> >>   #include <string.h>
> >>   #include <stdio.h>
> >>   #include <stdlib.h>
> >>   #include <unistd.h>
> >>   #include <sys/mman.h>
> >>
> >>   const size_t size = 1024*1024*1024;
> >>
> >>   int main(void)
> >>   {
> >>           size_t offs;
> >>           char *area;
> >>
> >>           area = mmap(0, size, PROT_READ | PROT_WRITE,
> >>                       MAP_ANON | MAP_PRIVATE, -1, 0);
> >>           if (area == MAP_FAILED) {
> >>                   printf("mmap failed\n");
> >>                   exit(-1);
> >>           }
> >>           madvise(area, size, MADV_HUGEPAGE);
> >>
> >>           for (offs = 0; offs < size; offs += getpagesize())
> >>                   area[offs] = 0;
> >>           pause();
> >>   }
> >> <\prog.c>
> >>
> >> 2) Trigger the shrinker
> >>
> >> E.g., memory pressure through memhog
> >>
> >> 3) Observe that THPs are not getting reclaimed
> >>
> >> $ cat /proc/`pgrep prog`/smaps_rollup
> >>
> >> Would list ~1GiB of AnonHugePages. With this fix, they would get
> >> reclaimed as expected.
> >>
> >> Fixes: dafff3f4c850 ("mm: split underused THPs")
> >> Cc: Andrew Morton <akpm@linux-foundation.org>
> >> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> >> Cc: Zi Yan <ziy@nvidia.com>
> >> Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
> >> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>
> >> Cc: Nico Pache <npache@redhat.com>
> >> Cc: Ryan Roberts <ryan.roberts@arm.com>
> >> Cc: Dev Jain <dev.jain@arm.com>
> >> Cc: Barry Song <baohua@kernel.org>
> >> Cc: Usama Arif <usamaarif642@gmail.com>
> >> Signed-off-by: David Hildenbrand <david@redhat.com>
> >> ---
> >>   mm/huge_memory.c | 3 ---
> >>   1 file changed, 3 deletions(-)
> >>
> > LGTM. Acked-by: Zi Yan <ziy@nvidia.com>
> >
> > I also notice that thp_underused() checks num_zero_pages directly
> > against khugepaged_max_ptes_none. This means mTHPs will never be regarded
> > as underused. A similar issue you are discussing in Nico’s khugepaged
> > mTHP support. Maybe checks against these khugepaged_max* variables
> > should be calculated based on nr_pages of a large folio, like
> > making these variables a ratio in other discussion.
>
> Yeah, factoring that out and cleaning it up is my next step.
>
> But note that mTHPs are not a candidate for the shrinker right now. (see
> my explanation in reply to Nicos patch)

Right. IIUC, the logic in deferred_split_scan() is gated by
!folio_test_partially_mapped(folio). This creates two paths:

1) Splits for mTHPs that are partially mapped. These set the flag and
corectly bypass thp_underused().

2) Optional shrinker splits for new or collapsed THPs. These don't set the
flag and must go through thp_underused().

So, thp_underused() is only for whole, PMD-sized THPs right now. Please
correct me if my understanding is wrong here ;)

Cheers,
Lance