[RFC v2 0/2] add THP_HUGE_ZERO_PAGE_ALWAYS config option

Pankaj Raghav posted 2 patches 6 months, 3 weeks ago
arch/x86/Kconfig |  1 +
block/blk-lib.c  | 15 +++++++++---
mm/Kconfig       | 12 +++++++++
mm/huge_memory.c | 63 ++++++++++++++++++++++++++++++++++++++----------
4 files changed, 74 insertions(+), 17 deletions(-)
[RFC v2 0/2] add THP_HUGE_ZERO_PAGE_ALWAYS config option
Posted by Pankaj Raghav 6 months, 3 weeks ago
There are many places in the kernel where we need to zero out larger
chunks, but the maximum segment we can zero out at a time with ZERO_PAGE
is limited to PAGE_SIZE.

This concern was raised during the review of adding Large Block Size support
to XFS[1][2].

This is especially annoying in block devices and filesystems where we
attach multiple ZERO_PAGEs to the bio in different bvecs. With multipage
bvec support in block layer, it is much more efficient to send out
larger zero pages as a part of a single bvec.
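
To make the segment-count argument concrete, here is a minimal userspace sketch, not kernel code. The sizes are assumptions for x86-64 with 4 KiB base pages and a 2 MiB PMD, and zero_segments_needed() is a hypothetical helper that does not exist in the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed sizes for x86-64; illustrative only, not read from the kernel. */
#define SKETCH_PAGE_SIZE (4096UL)
#define SKETCH_PMD_SIZE  (2UL * 1024 * 1024)

/*
 * Hypothetical helper: number of segments needed to cover `len` bytes
 * when each segment can reference at most `seg_size` bytes of a shared
 * zero buffer.
 */
static size_t zero_segments_needed(size_t len, size_t seg_size)
{
	assert(seg_size > 0);
	return (len + seg_size - 1) / seg_size;	/* round up */
}
```

Under these assumptions, zeroing 8 MiB takes 2048 segments with a PAGE_SIZE zero page and 4 segments with a PMD-sized one.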

Some examples of places in the kernel where this could be useful:
- blkdev_issue_zero_pages()
- iomap_dio_zero()
- vmalloc.c:zero_iter()
- rxperf_process_call()
- fscrypt_zeroout_range_inline_crypt()
- bch2_checksum_update()
...

We already have huge_zero_folio that is allocated on demand, and it will be
deallocated by the shrinker if there are no users of it left.

But to use huge_zero_folio, we need to pass an mm struct, and
put_folio needs to be called in the destructor. This makes sense for
memory-constrained systems, but for bigger servers keeping the folio
allocated permanently does not matter if the PMD size is reasonable
(as on x86).

Add a config option THP_HUGE_ZERO_PAGE_ALWAYS that will always allocate
the huge_zero_folio, and it will never be freed. This makes it possible
to use the huge_zero_folio without passing any mm struct and without
calling put_folio in the destructor.
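
The difference between the two lifetimes can be modelled in userspace. This is only a simplified sketch: every name here (dyn_zero_buf, dyn_zero_get(), dyn_zero_put(), dyn_zero_shrink(), static_zero_get()) is hypothetical, calloc() stands in for the folio allocation, and the plain counter ignores the atomics and per-mm tracking the kernel needs.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_PMD_SIZE (2UL * 1024 * 1024)	/* assumed 2 MiB PMD */

/* Userspace stand-in for the on-demand huge zero folio. */
struct dyn_zero_buf {
	void *buf;	/* NULL while unallocated */
	unsigned refs;	/* number of active users */
};

/* Take a reference, allocating the zeroed buffer on first use. */
static void *dyn_zero_get(struct dyn_zero_buf *z)
{
	if (!z->buf) {
		z->buf = calloc(1, SKETCH_PMD_SIZE);
		if (!z->buf)
			return NULL;
	}
	z->refs++;
	return z->buf;
}

/* Drop a reference; every successful get must be paired with a put. */
static void dyn_zero_put(struct dyn_zero_buf *z)
{
	assert(z->refs > 0);
	z->refs--;
}

/* Stand-in for the shrinker: free the buffer only when nobody uses it. */
static int dyn_zero_shrink(struct dyn_zero_buf *z)
{
	if (z->refs || !z->buf)
		return 0;
	free(z->buf);
	z->buf = NULL;
	return 1;
}

/* Always-allocated variant: zero-initialized, no refcount, never freed. */
static unsigned char static_zero_buf[SKETCH_PMD_SIZE];

static const void *static_zero_get(void)
{
	return static_zero_buf;
}
```

In this model the always-allocated variant costs SKETCH_PMD_SIZE bytes permanently, but callers need no context to obtain it and have nothing to release.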

As an example, I have converted blkdev_issue_zero_pages() as a part of
this series.

I will send patches to individual subsystems using the huge_zero_folio
once this gets upstreamed.

Looking forward to some feedback.

[1] https://lore.kernel.org/linux-xfs/20231027051847.GA7885@lst.de/
[2] https://lore.kernel.org/linux-xfs/ZitIK5OnR7ZNY0IG@infradead.org/

Changes since v1:
- Added the config option based on the feedback from David.
- Removed iomap patches so that I don't clutter this series with too
  many subsystems.

Pankaj Raghav (2):
  mm: add THP_HUGE_ZERO_PAGE_ALWAYS config option
  block: use mm_huge_zero_folio in __blkdev_issue_zero_pages()

 arch/x86/Kconfig |  1 +
 block/blk-lib.c  | 15 +++++++++---
 mm/Kconfig       | 12 +++++++++
 mm/huge_memory.c | 63 ++++++++++++++++++++++++++++++++++++++----------
 4 files changed, 74 insertions(+), 17 deletions(-)


base-commit: f1f6aceb82a55f87d04e2896ac3782162e7859bd
-- 
2.47.2
Re: [RFC v2 0/2] add THP_HUGE_ZERO_PAGE_ALWAYS config option
Posted by Mike Rapoport 6 months, 3 weeks ago
Hi Pankaj,

On Thu, May 22, 2025 at 11:02:41AM +0200, Pankaj Raghav wrote:
> There are many places in the kernel where we need to zeroout larger
> chunks but the maximum segment we can zeroout at a time by ZERO_PAGE
> is limited by PAGE_SIZE.
> 
> This concern was raised during the review of adding Large Block Size support
> to XFS[1][2].
> 
> This is especially annoying in block devices and filesystems where we
> attach multiple ZERO_PAGEs to the bio in different bvecs. With multipage
> bvec support in block layer, it is much more efficient to send out
> larger zero pages as a part of a single bvec.
> 
> Some examples of places in the kernel where this could be useful:
> - blkdev_issue_zero_pages()
> - iomap_dio_zero()
> - vmalloc.c:zero_iter()
> - rxperf_process_call()
> - fscrypt_zeroout_range_inline_crypt()
> - bch2_checksum_update()
> ...
> 
> We already have huge_zero_folio that is allocated on demand, and it will be
> deallocated by the shrinker if there are no users of it left.
> 
> But to use huge_zero_folio, we need to pass a mm struct and the
> put_folio needs to be called in the destructor. This makes sense for
> systems that have memory constraints but for bigger servers, it does not
> matter if the PMD size is reasonable (like x86).
> 
> Add a config option THP_HUGE_ZERO_PAGE_ALWAYS that will always allocate
> the huge_zero_folio, and it will never be freed. This makes using the
> huge_zero_folio without having to pass any mm struct and a call to put_folio
> in the destructor.

I don't think this config option should be tied to THP. It's perfectly
sensible to have a configuration with HUGETLB and without THP.
 
> I have converted blkdev_issue_zero_pages() as an example as a part of
> this series.
> 
> I will send patches to individual subsystems using the huge_zero_folio
> once this gets upstreamed.
> 
> Looking forward to some feedback.
> 
> [1] https://lore.kernel.org/linux-xfs/20231027051847.GA7885@lst.de/
> [2] https://lore.kernel.org/linux-xfs/ZitIK5OnR7ZNY0IG@infradead.org/
> 
> Changes since v1:
> - Added the config option based on the feedback from David.
> - Removed iomap patches so that I don't clutter this series with too
>   many subsystems.
> 
> Pankaj Raghav (2):
>   mm: add THP_HUGE_ZERO_PAGE_ALWAYS config option
>   block: use mm_huge_zero_folio in __blkdev_issue_zero_pages()
> 
>  arch/x86/Kconfig |  1 +
>  block/blk-lib.c  | 15 +++++++++---
>  mm/Kconfig       | 12 +++++++++
>  mm/huge_memory.c | 63 ++++++++++++++++++++++++++++++++++++++----------
>  4 files changed, 74 insertions(+), 17 deletions(-)
> 
> 
> base-commit: f1f6aceb82a55f87d04e2896ac3782162e7859bd
> -- 
> 2.47.2
> 
> 

-- 
Sincerely yours,
Mike.
Re: [RFC v2 0/2] add THP_HUGE_ZERO_PAGE_ALWAYS config option
Posted by David Hildenbrand 6 months, 3 weeks ago
On 22.05.25 13:31, Mike Rapoport wrote:
> Hi Pankaj,
> 
> On Thu, May 22, 2025 at 11:02:41AM +0200, Pankaj Raghav wrote:
>> There are many places in the kernel where we need to zeroout larger
>> chunks but the maximum segment we can zeroout at a time by ZERO_PAGE
>> is limited by PAGE_SIZE.
>>
>> This concern was raised during the review of adding Large Block Size support
>> to XFS[1][2].
>>
>> This is especially annoying in block devices and filesystems where we
>> attach multiple ZERO_PAGEs to the bio in different bvecs. With multipage
>> bvec support in block layer, it is much more efficient to send out
>> larger zero pages as a part of a single bvec.
>>
>> Some examples of places in the kernel where this could be useful:
>> - blkdev_issue_zero_pages()
>> - iomap_dio_zero()
>> - vmalloc.c:zero_iter()
>> - rxperf_process_call()
>> - fscrypt_zeroout_range_inline_crypt()
>> - bch2_checksum_update()
>> ...
>>
>> We already have huge_zero_folio that is allocated on demand, and it will be
>> deallocated by the shrinker if there are no users of it left.
>>
>> But to use huge_zero_folio, we need to pass a mm struct and the
>> put_folio needs to be called in the destructor. This makes sense for
>> systems that have memory constraints but for bigger servers, it does not
>> matter if the PMD size is reasonable (like x86).
>>
>> Add a config option THP_HUGE_ZERO_PAGE_ALWAYS that will always allocate
>> the huge_zero_folio, and it will never be freed. This makes using the
>> huge_zero_folio without having to pass any mm struct and a call to put_folio
>> in the destructor.
> 
> I don't think this config option should be tied to THP. It's perfectly
> sensible to have a configuration with HUGETLB and without THP.

Such configs are getting rarer ...

I assume we would then simply reuse that page from THP code if available?

-- 
Cheers,

David / dhildenb
Re: [RFC v2 0/2] add THP_HUGE_ZERO_PAGE_ALWAYS config option
Posted by Pankaj Raghav (Samsung) 6 months, 3 weeks ago
Hi Mike,

> > Add a config option THP_HUGE_ZERO_PAGE_ALWAYS that will always allocate
> > the huge_zero_folio, and it will never be freed. This makes using the
> > huge_zero_folio without having to pass any mm struct and a call to put_folio
> > in the destructor.
> 
> I don't think this config option should be tied to THP. It's perfectly
> sensible to have a configuration with HUGETLB and without THP.
>  

Hmm, that makes sense. You mean something like this (untested):

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2e1527580746..d447a9b9eb7d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -151,8 +151,8 @@ config X86
        select ARCH_WANT_OPTIMIZE_DAX_VMEMMAP   if X86_64
        select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP       if X86_64
        select ARCH_WANT_HUGETLB_VMEMMAP_PREINIT if X86_64
+       select ARCH_WANTS_HUGE_ZERO_PAGE_ALWAYS if X86_64
        select ARCH_WANTS_THP_SWAP              if X86_64
-       select ARCH_WANTS_THP_ZERO_PAGE_ALWAYS  if X86_64
        select ARCH_HAS_PARANOID_L1D_FLUSH
        select BUILDTIME_TABLE_SORT
        select CLKEVT_I8253
diff --git a/mm/Kconfig b/mm/Kconfig
index a2994e7d55ba..83a5b95a2286 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -823,9 +823,19 @@ config ARCH_WANT_GENERAL_HUGETLB
 config ARCH_WANTS_THP_SWAP
        def_bool n
 
-config ARCH_WANTS_THP_ZERO_PAGE_ALWAYS
+config ARCH_WANTS_HUGE_ZERO_PAGE_ALWAYS
        def_bool n
 
+config HUGE_ZERO_PAGE_ALWAYS
+       def_bool y
+       depends on HUGETLB_PAGE && ARCH_WANTS_HUGE_ZERO_PAGE_ALWAYS
+       help
+         Typically huge_zero_folio, which is a huge page of zeroes, is allocated
+         on demand and deallocated when not in use. This option will always
+         allocate huge_zero_folio for zeroing and it is never deallocated.
+         Not suitable for memory constrained systems.
+
+
 config MM_ID
        def_bool n
 
@@ -898,15 +908,6 @@ config READ_ONLY_THP_FOR_FS
          support of file THPs will be developed in the next few release
          cycles.
 
-config THP_ZERO_PAGE_ALWAYS
-       def_bool y
-       depends on TRANSPARENT_HUGEPAGE && ARCH_WANTS_THP_ZERO_PAGE_ALWAYS
-       help
-         Typically huge_zero_folio, which is a THP of zeroes, is allocated
-         on demand and deallocated when not in use. This option will always
-         allocate huge_zero_folio for zeroing and it is never deallocated.
-         Not suitable for memory constrained systems.
-
 config NO_PAGE_MAPCOUNT
        bool "No per-page mapcount (EXPERIMENTAL)"
        help

--
Pankaj
Re: [RFC v2 0/2] add THP_HUGE_ZERO_PAGE_ALWAYS config option
Posted by David Hildenbrand 6 months, 3 weeks ago
On 22.05.25 14:00, Pankaj Raghav (Samsung) wrote:
> Hi Mike,
> 
>>> Add a config option THP_HUGE_ZERO_PAGE_ALWAYS that will always allocate
>>> the huge_zero_folio, and it will never be freed. This makes using the
>>> huge_zero_folio without having to pass any mm struct and a call to put_folio
>>> in the destructor.
>>
>> I don't think this config option should be tied to THP. It's perfectly
>> sensible to have a configuration with HUGETLB and without THP.
>>   
> 
> Hmm, that makes sense. You mean something like this (untested):
> 
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 2e1527580746..d447a9b9eb7d 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -151,8 +151,8 @@ config X86
>          select ARCH_WANT_OPTIMIZE_DAX_VMEMMAP   if X86_64
>          select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP       if X86_64
>          select ARCH_WANT_HUGETLB_VMEMMAP_PREINIT if X86_64
> +       select ARCH_WANTS_HUGE_ZERO_PAGE_ALWAYS if X86_64
>          select ARCH_WANTS_THP_SWAP              if X86_64
> -       select ARCH_WANTS_THP_ZERO_PAGE_ALWAYS  if X86_64
>          select ARCH_HAS_PARANOID_L1D_FLUSH
>          select BUILDTIME_TABLE_SORT
>          select CLKEVT_I8253
> diff --git a/mm/Kconfig b/mm/Kconfig
> index a2994e7d55ba..83a5b95a2286 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -823,9 +823,19 @@ config ARCH_WANT_GENERAL_HUGETLB
>   config ARCH_WANTS_THP_SWAP
>          def_bool n
>   
> -config ARCH_WANTS_THP_ZERO_PAGE_ALWAYS
> +config ARCH_WANTS_HUGE_ZERO_PAGE_ALWAYS
>          def_bool n
>   
> +config HUGE_ZERO_PAGE_ALWAYS

Likely something like

PMD_ZERO_PAGE

Will be a lot clearer.

> +       def_bool y
> +       depends on HUGETLB_PAGE && ARCH_WANTS_HUGE_ZERO_PAGE_ALWAYS

I suspect it should then also be independent of HUGETLB_PAGE?

> +       help
> +         Typically huge_zero_folio, which is a huge page of zeroes, is allocated
> +         on demand and deallocated when not in use. This option will always
> +         allocate huge_zero_folio for zeroing and it is never deallocated.
> +         Not suitable for memory constrained systems.

I assume that code then has to live in mm/memory.c ?


-- 
Cheers,

David / dhildenb
Re: [RFC v2 0/2] add THP_HUGE_ZERO_PAGE_ALWAYS config option
Posted by Pankaj Raghav (Samsung) 6 months, 3 weeks ago
Hi David,

> >   config ARCH_WANTS_THP_SWAP
> >          def_bool n
> > -config ARCH_WANTS_THP_ZERO_PAGE_ALWAYS
> > +config ARCH_WANTS_HUGE_ZERO_PAGE_ALWAYS
> >          def_bool n
> > +config HUGE_ZERO_PAGE_ALWAYS
> 
> Likely something like
> 
> PMD_ZERO_PAGE
> 
> Will be a lot clearer.

Sounds much better :)

> 
> > +       def_bool y
> > +       depends on HUGETLB_PAGE && ARCH_WANTS_HUGE_ZERO_PAGE_ALWAYS
> 
> I suspect it should then also be independent of HUGETLB_PAGE?

You are right. So we don't depend on any of these features.

> 
> > +       help
> > +         Typically huge_zero_folio, which is a huge page of zeroes, is allocated
> > +         on demand and deallocated when not in use. This option will always
> > +         allocate huge_zero_folio for zeroing and it is never deallocated.
> > +         Not suitable for memory constrained systems.
> 
> I assume that code then has to live in mm/memory.c ?

Hmm, then huge_zero_folio should have always been in mm/memory.c to
begin with?

I assume this was placed in mm/huge_memory.c because the users
of huge_zero_folio have been a part of mm/huge_memory.c?

So IIUC your comment, we should move the huge_zero_page_init() in the
first patch to mm/memory.c, and the existing shrinker code can stay
where it already is?

--
Pankaj
Re: [RFC v2 0/2] add THP_HUGE_ZERO_PAGE_ALWAYS config option
Posted by David Hildenbrand 6 months, 3 weeks ago
On 22.05.25 14:34, Pankaj Raghav (Samsung) wrote:
> Hi David,
> 
>>>    config ARCH_WANTS_THP_SWAP
>>>           def_bool n
>>> -config ARCH_WANTS_THP_ZERO_PAGE_ALWAYS
>>> +config ARCH_WANTS_HUGE_ZERO_PAGE_ALWAYS
>>>           def_bool n
>>> +config HUGE_ZERO_PAGE_ALWAYS
>>
>> Likely something like
>>
>> PMD_ZERO_PAGE
>>
>> Will be a lot clearer.
> 
> Sounds much better :)

And maybe something like

"STATIC_PMD_ZERO_PAGE"

would be even clearer.

The other one would be the dynamic one.

> 
>>
>>> +       def_bool y
>>> +       depends on HUGETLB_PAGE && ARCH_WANTS_HUGE_ZERO_PAGE_ALWAYS
>>
>> I suspect it should then also be independent of HUGETLB_PAGE?
> 
> You are right. So we don't depend on any of these features.
> 
>>
>>> +       help
>>> +         Typically huge_zero_folio, which is a huge page of zeroes, is allocated
>>> +         on demand and deallocated when not in use. This option will always
>>> +         allocate huge_zero_folio for zeroing and it is never deallocated.
>>> +         Not suitable for memory constrained systems.
>>
>> I assume that code then has to live in mm/memory.c ?
> 
> Hmm, then huge_zero_folio should have always been in mm/memory.c to
> begin with?
> 

It's complicated. Only do_huge_pmd_anonymous_page() (and fsdax) really 
uses it, and it may only get mapped into a process under certain 
conditions (related to THP / PMD handling).

> I assume probably this was placed in mm/huge_memory.c because the users
> of this huge_zero_folio has been a part of mm/huge_memory.c?

Yes.

> 
> So IIUC your comment, we should move the huge_zero_page_init() in the
> first patch to mm/memory.c and the existing shrinker code can be a part
> where they already are?

Good question. At least the "static" part can easily be moved over. 
Maybe the dynamic part as well.

Worth trying it out and seeing how it looks :)

-- 
Cheers,

David / dhildenb
Re: [RFC v2 0/2] add THP_HUGE_ZERO_PAGE_ALWAYS config option
Posted by Pankaj Raghav (Samsung) 6 months, 3 weeks ago
On Thu, May 22, 2025 at 02:50:20PM +0200, David Hildenbrand wrote:
> On 22.05.25 14:34, Pankaj Raghav (Samsung) wrote:
> > Hi David,
> > 
> > > >    config ARCH_WANTS_THP_SWAP
> > > >           def_bool n
> > > > -config ARCH_WANTS_THP_ZERO_PAGE_ALWAYS
> > > > +config ARCH_WANTS_HUGE_ZERO_PAGE_ALWAYS
> > > >           def_bool n
> > > > +config HUGE_ZERO_PAGE_ALWAYS
> > > 
> > > Likely something like
> > > 
> > > PMD_ZERO_PAGE
> > > 
> > > Will be a lot clearer.
> > 
> > Sounds much better :)
> 
> And maybe something like
> 
> "STATIC_PMD_ZERO_PAGE"
> 
> would be even clearer.
> 
> The other one would be the dynamic one.

Got it.
So if I understand correctly, we are going to have two huge zero pages:
- one that is always allocated statically.
- the existing dynamic one, which will still be there for the existing users.
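
One way a caller without an mm struct might choose its zero source under such a split is sketched below in userspace C. All of it is an assumption for illustration: the SKETCH_STATIC_PMD_ZERO_PAGE switch merely stands in for the option name suggested above, struct zero_src and largest_zero_src() are hypothetical, and the fallback to a PAGE_SIZE buffer is my guess rather than something decided in this thread.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE (4096UL)		/* assumed base page size */
#define SKETCH_PMD_SIZE  (2UL * 1024 * 1024)	/* assumed PMD size */

/* Hypothetical build-time switch standing in for a static PMD zero page option. */
#ifndef SKETCH_STATIC_PMD_ZERO_PAGE
#define SKETCH_STATIC_PMD_ZERO_PAGE 1
#endif

/* Static storage is zero-initialized, so both buffers hold only zeroes. */
static unsigned char small_zero[SKETCH_PAGE_SIZE];
#if SKETCH_STATIC_PMD_ZERO_PAGE
static unsigned char large_zero[SKETCH_PMD_SIZE];
#endif

struct zero_src {
	const void *buf;	/* start of the zero-filled buffer */
	size_t size;		/* bytes usable per segment */
};

/* Return the largest zero buffer available without any reference counting. */
static struct zero_src largest_zero_src(void)
{
	struct zero_src src = { small_zero, sizeof(small_zero) };

#if SKETCH_STATIC_PMD_ZERO_PAGE
	src.buf = large_zero;
	src.size = sizeof(large_zero);
#endif
	assert(src.size >= SKETCH_PAGE_SIZE);
	return src;
}
```

In this sketch the existing dynamic page is left out entirely, since it would still need its get/put pairing and is only meant for the existing users.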

> 
> > 
> > > 
> > > > +       def_bool y
> > > > +       depends on HUGETLB_PAGE && ARCH_WANTS_HUGE_ZERO_PAGE_ALWAYS
> > > 
> > > I suspect it should then also be independent of HUGETLB_PAGE?
> > 
> > You are right. So we don't depend on any of these features.
> > 
> > > 
> > > > +       help
> > > > +         Typically huge_zero_folio, which is a huge page of zeroes, is allocated
> > > > +         on demand and deallocated when not in use. This option will always
> > > > +         allocate huge_zero_folio for zeroing and it is never deallocated.
> > > > +         Not suitable for memory constrained systems.
> > > 
> > > I assume that code then has to live in mm/memory.c ?
> > 
> > Hmm, then huge_zero_folio should have always been in mm/memory.c to
> > begin with?
> > 
> 
> It's complicated. Only do_huge_pmd_anonymous_page() (and fsdax) really uses
> it, and it may only get mapped into a process under certain conditions
> (related to THP / PMD handling).
> 
Got it.
> > 
> > So IIUC your comment, we should move the huge_zero_page_init() in the
> > first patch to mm/memory.c and the existing shrinker code can be a part
> > where they already are?
> 
> Good question. At least the "static" part can easily be moved over. Maybe
> the dynamic part as well.
> 
> Worth trying it out and seeing how it looks :)

Challenge accepted ;) Thanks for the comments David.

--
Pankaj