There are many places in the kernel where we need to zero out large
chunks, but the largest segment we can zero out at a time with ZERO_PAGE
is limited to PAGE_SIZE.

This concern was raised during the review of adding Large Block Size
support to XFS[1][2].

This is especially annoying in block devices and filesystems, where we
attach multiple ZERO_PAGEs to the bio in different bvecs. With multipage
bvec support in the block layer, it is much more efficient to send out
larger zero pages as part of a single bvec.

Some examples of places in the kernel where this could be useful:
- blkdev_issue_zero_pages()
- iomap_dio_zero()
- vmalloc.c:zero_iter()
- rxperf_process_call()
- fscrypt_zeroout_range_inline_crypt()
- bch2_checksum_update()
...

We already have huge_zero_folio, which is allocated on demand and
deallocated by the shrinker once it has no users left.

But to use huge_zero_folio, we need to pass a mm struct, and put_folio
needs to be called in the destructor. This makes sense for systems that
have memory constraints, but for bigger servers the cost of keeping it
allocated does not matter as long as the PMD size is reasonable (as on
x86).

Add a config option THP_HUGE_ZERO_PAGE_ALWAYS that will always allocate
the huge_zero_folio and never free it. This makes it possible to use the
huge_zero_folio without passing any mm struct or calling put_folio in
the destructor.

I have converted blkdev_issue_zero_pages() as an example as a part of
this series.

I will send patches to the individual subsystems that can use the
huge_zero_folio once this gets upstreamed.

Looking forward to some feedback.

[1] https://lore.kernel.org/linux-xfs/20231027051847.GA7885@lst.de/
[2] https://lore.kernel.org/linux-xfs/ZitIK5OnR7ZNY0IG@infradead.org/

Changes since v1:
- Added the config option based on the feedback from David.
- Removed iomap patches so that I don't clutter this series with too
  many subsystems.
Pankaj Raghav (2):
  mm: add THP_HUGE_ZERO_PAGE_ALWAYS config option
  block: use mm_huge_zero_folio in __blkdev_issue_zero_pages()

 arch/x86/Kconfig |  1 +
 block/blk-lib.c  | 15 +++++++++---
 mm/Kconfig       | 12 +++++++++
 mm/huge_memory.c | 63 ++++++++++++++++++++++++++++++++++++++----------
 4 files changed, 74 insertions(+), 17 deletions(-)

base-commit: f1f6aceb82a55f87d04e2896ac3782162e7859bd
--
2.47.2
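
To put rough numbers on the bvec argument in the cover letter, here is a
minimal userspace sketch in C. It is not kernel code: SKETCH_PAGE_SIZE,
SKETCH_PMD_SIZE and zero_segments() are hypothetical names introduced
only for illustration, and the 4 KiB / 2 MiB sizes are the usual x86-64
values, assumed here rather than taken from the series.

```c
#include <assert.h>

/* Assumed sizes for illustration only: 4 KiB base page, 2 MiB PMD. */
#define SKETCH_PAGE_SIZE (4096UL)
#define SKETCH_PMD_SIZE  (2UL * 1024 * 1024)

/*
 * Hypothetical helper: number of zero-filled segments of seg_size bytes
 * needed to cover len bytes (rounding up). Each segment models one
 * attachment of the zero page/folio to a bio.
 */
static unsigned long zero_segments(unsigned long len, unsigned long seg_size)
{
	assert(seg_size > 0);
	return (len + seg_size - 1) / seg_size;
}
```

Under these assumptions, zeroing 16 MiB takes 4096 segments when each one
is capped at PAGE_SIZE, but only 8 when a PMD-sized zero folio is
available.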
Hi Pankaj,

On Thu, May 22, 2025 at 11:02:41AM +0200, Pankaj Raghav wrote:
> There are many places in the kernel where we need to zeroout larger
> chunks but the maximum segment we can zeroout at a time by ZERO_PAGE
> is limited by PAGE_SIZE.
>
> This concern was raised during the review of adding Large Block Size support
> to XFS[1][2].
>
> This is especially annoying in block devices and filesystems where we
> attach multiple ZERO_PAGEs to the bio in different bvecs. With multipage
> bvec support in block layer, it is much more efficient to send out
> larger zero pages as a part of a single bvec.
>
> Some examples of places in the kernel where this could be useful:
> - blkdev_issue_zero_pages()
> - iomap_dio_zero()
> - vmalloc.c:zero_iter()
> - rxperf_process_call()
> - fscrypt_zeroout_range_inline_crypt()
> - bch2_checksum_update()
> ...
>
> We already have huge_zero_folio that is allocated on demand, and it will be
> deallocated by the shrinker if there are no users of it left.
>
> But to use huge_zero_folio, we need to pass a mm struct and the
> put_folio needs to be called in the destructor. This makes sense for
> systems that have memory constraints but for bigger servers, it does not
> matter if the PMD size is reasonable (like x86).
>
> Add a config option THP_HUGE_ZERO_PAGE_ALWAYS that will always allocate
> the huge_zero_folio, and it will never be freed. This makes using the
> huge_zero_folio without having to pass any mm struct and a call to put_folio
> in the destructor.

I don't think this config option should be tied to THP. It's perfectly
sensible to have a configuration with HUGETLB and without THP.

> I have converted blkdev_issue_zero_pages() as an example as a part of
> this series.
>
> I will send patches to individual subsystems using the huge_zero_folio
> once this gets upstreamed.
>
> Looking forward to some feedback.
>
> [1] https://lore.kernel.org/linux-xfs/20231027051847.GA7885@lst.de/
> [2] https://lore.kernel.org/linux-xfs/ZitIK5OnR7ZNY0IG@infradead.org/
>
> Changes since v1:
> - Added the config option based on the feedback from David.
> - Removed iomap patches so that I don't clutter this series with too
>   many subsystems.
>
> Pankaj Raghav (2):
>   mm: add THP_HUGE_ZERO_PAGE_ALWAYS config option
>   block: use mm_huge_zero_folio in __blkdev_issue_zero_pages()
>
>  arch/x86/Kconfig |  1 +
>  block/blk-lib.c  | 15 +++++++++---
>  mm/Kconfig       | 12 +++++++++
>  mm/huge_memory.c | 63 ++++++++++++++++++++++++++++++++++++++----------
>  4 files changed, 74 insertions(+), 17 deletions(-)
>
>
> base-commit: f1f6aceb82a55f87d04e2896ac3782162e7859bd
> --
> 2.47.2
>
>

--
Sincerely yours,
Mike.
On 22.05.25 13:31, Mike Rapoport wrote:
> Hi Pankaj,
>
> On Thu, May 22, 2025 at 11:02:41AM +0200, Pankaj Raghav wrote:
>> There are many places in the kernel where we need to zeroout larger
>> chunks but the maximum segment we can zeroout at a time by ZERO_PAGE
>> is limited by PAGE_SIZE.
>>
>> This concern was raised during the review of adding Large Block Size support
>> to XFS[1][2].
>>
>> This is especially annoying in block devices and filesystems where we
>> attach multiple ZERO_PAGEs to the bio in different bvecs. With multipage
>> bvec support in block layer, it is much more efficient to send out
>> larger zero pages as a part of a single bvec.
>>
>> Some examples of places in the kernel where this could be useful:
>> - blkdev_issue_zero_pages()
>> - iomap_dio_zero()
>> - vmalloc.c:zero_iter()
>> - rxperf_process_call()
>> - fscrypt_zeroout_range_inline_crypt()
>> - bch2_checksum_update()
>> ...
>>
>> We already have huge_zero_folio that is allocated on demand, and it will be
>> deallocated by the shrinker if there are no users of it left.
>>
>> But to use huge_zero_folio, we need to pass a mm struct and the
>> put_folio needs to be called in the destructor. This makes sense for
>> systems that have memory constraints but for bigger servers, it does not
>> matter if the PMD size is reasonable (like x86).
>>
>> Add a config option THP_HUGE_ZERO_PAGE_ALWAYS that will always allocate
>> the huge_zero_folio, and it will never be freed. This makes using the
>> huge_zero_folio without having to pass any mm struct and a call to put_folio
>> in the destructor.
>
> I don't think this config option should be tied to THP. It's perfectly
> sensible to have a configuration with HUGETLB and without THP.

Such configs are getting rarer ...

I assume we would then simply reuse that page from THP code if available?

--
Cheers,

David / dhildenb
Hi Mike,
> > Add a config option THP_HUGE_ZERO_PAGE_ALWAYS that will always allocate
> > the huge_zero_folio, and it will never be freed. This makes using the
> > huge_zero_folio without having to pass any mm struct and a call to put_folio
> > in the destructor.
>
> I don't think this config option should be tied to THP. It's perfectly
> sensible to have a configuration with HUGETLB and without THP.
>
Hmm, that makes sense. You mean something like this (untested):
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 2e1527580746..d447a9b9eb7d 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -151,8 +151,8 @@ config X86
 	select ARCH_WANT_OPTIMIZE_DAX_VMEMMAP if X86_64
 	select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP if X86_64
 	select ARCH_WANT_HUGETLB_VMEMMAP_PREINIT if X86_64
+	select ARCH_WANTS_HUGE_ZERO_PAGE_ALWAYS if X86_64
 	select ARCH_WANTS_THP_SWAP if X86_64
-	select ARCH_WANTS_THP_ZERO_PAGE_ALWAYS if X86_64
 	select ARCH_HAS_PARANOID_L1D_FLUSH
 	select BUILDTIME_TABLE_SORT
 	select CLKEVT_I8253
diff --git a/mm/Kconfig b/mm/Kconfig
index a2994e7d55ba..83a5b95a2286 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -823,9 +823,19 @@ config ARCH_WANT_GENERAL_HUGETLB
 config ARCH_WANTS_THP_SWAP
 	def_bool n

-config ARCH_WANTS_THP_ZERO_PAGE_ALWAYS
+config ARCH_WANTS_HUGE_ZERO_PAGE_ALWAYS
 	def_bool n

+config HUGE_ZERO_PAGE_ALWAYS
+	def_bool y
+	depends on HUGETLB_PAGE && ARCH_WANTS_HUGE_ZERO_PAGE_ALWAYS
+	help
+	  Typically huge_zero_folio, which is a huge page of zeroes, is allocated
+	  on demand and deallocated when not in use. This option will always
+	  allocate huge_zero_folio for zeroing and it is never deallocated.
+	  Not suitable for memory constrained systems.
+
+
 config MM_ID
 	def_bool n

@@ -898,15 +908,6 @@ config READ_ONLY_THP_FOR_FS
 	  support of file THPs will be developed in the next few release
 	  cycles.

-config THP_ZERO_PAGE_ALWAYS
-	def_bool y
-	depends on TRANSPARENT_HUGEPAGE && ARCH_WANTS_THP_ZERO_PAGE_ALWAYS
-	help
-	  Typically huge_zero_folio, which is a THP of zeroes, is allocated
-	  on demand and deallocated when not in use. This option will always
-	  allocate huge_zero_folio for zeroing and it is never deallocated.
-	  Not suitable for memory constrained systems.
-
 config NO_PAGE_MAPCOUNT
 	bool "No per-page mapcount (EXPERIMENTAL)"
 	help
--
Pankaj
On 22.05.25 14:00, Pankaj Raghav (Samsung) wrote:
> Hi Mike,
>
>>> Add a config option THP_HUGE_ZERO_PAGE_ALWAYS that will always allocate
>>> the huge_zero_folio, and it will never be freed. This makes using the
>>> huge_zero_folio without having to pass any mm struct and a call to put_folio
>>> in the destructor.
>>
>> I don't think this config option should be tied to THP. It's perfectly
>> sensible to have a configuration with HUGETLB and without THP.
>>
>
> Hmm, that makes sense. You mean something like this (untested):
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 2e1527580746..d447a9b9eb7d 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -151,8 +151,8 @@ config X86
>  	select ARCH_WANT_OPTIMIZE_DAX_VMEMMAP if X86_64
>  	select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP if X86_64
>  	select ARCH_WANT_HUGETLB_VMEMMAP_PREINIT if X86_64
> +	select ARCH_WANTS_HUGE_ZERO_PAGE_ALWAYS if X86_64
>  	select ARCH_WANTS_THP_SWAP if X86_64
> -	select ARCH_WANTS_THP_ZERO_PAGE_ALWAYS if X86_64
>  	select ARCH_HAS_PARANOID_L1D_FLUSH
>  	select BUILDTIME_TABLE_SORT
>  	select CLKEVT_I8253
> diff --git a/mm/Kconfig b/mm/Kconfig
> index a2994e7d55ba..83a5b95a2286 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -823,9 +823,19 @@ config ARCH_WANT_GENERAL_HUGETLB
>  config ARCH_WANTS_THP_SWAP
>  	def_bool n
>
> -config ARCH_WANTS_THP_ZERO_PAGE_ALWAYS
> +config ARCH_WANTS_HUGE_ZERO_PAGE_ALWAYS
>  	def_bool n
>
> +config HUGE_ZERO_PAGE_ALWAYS

Likely something like

PMD_ZERO_PAGE

Will be a lot clearer.

> +	def_bool y
> +	depends on HUGETLB_PAGE && ARCH_WANTS_HUGE_ZERO_PAGE_ALWAYS

I suspect it should then also be independent of HUGETLB_PAGE?

> +	help
> +	  Typically huge_zero_folio, which is a huge page of zeroes, is allocated
> +	  on demand and deallocated when not in use. This option will always
> +	  allocate huge_zero_folio for zeroing and it is never deallocated.
> +	  Not suitable for memory constrained systems.

I assume that code then has to live in mm/memory.c ?

--
Cheers,

David / dhildenb
Hi David,

> >  config ARCH_WANTS_THP_SWAP
> >  	def_bool n
> > -config ARCH_WANTS_THP_ZERO_PAGE_ALWAYS
> > +config ARCH_WANTS_HUGE_ZERO_PAGE_ALWAYS
> >  	def_bool n
> > +config HUGE_ZERO_PAGE_ALWAYS
>
> Likely something like
>
> PMD_ZERO_PAGE
>
> Will be a lot clearer.

Sounds much better :)

> > +	def_bool y
> > +	depends on HUGETLB_PAGE && ARCH_WANTS_HUGE_ZERO_PAGE_ALWAYS
>
> I suspect it should then also be independent of HUGETLB_PAGE?

You are right. So we don't depend on any of these features.

> > +	help
> > +	  Typically huge_zero_folio, which is a huge page of zeroes, is allocated
> > +	  on demand and deallocated when not in use. This option will always
> > +	  allocate huge_zero_folio for zeroing and it is never deallocated.
> > +	  Not suitable for memory constrained systems.
>
> I assume that code then has to live in mm/memory.c ?

Hmm, then huge_zero_folio should have always been in mm/memory.c to
begin with?

I assume this was placed in mm/huge_memory.c because the users of
huge_zero_folio have been a part of mm/huge_memory.c?

So IIUC your comment, we should move the huge_zero_page_init() in the
first patch to mm/memory.c, and the existing shrinker code can stay
where it already is?

--
Pankaj
On 22.05.25 14:34, Pankaj Raghav (Samsung) wrote:
> Hi David,
>
>>>  config ARCH_WANTS_THP_SWAP
>>>  	def_bool n
>>> -config ARCH_WANTS_THP_ZERO_PAGE_ALWAYS
>>> +config ARCH_WANTS_HUGE_ZERO_PAGE_ALWAYS
>>>  	def_bool n
>>> +config HUGE_ZERO_PAGE_ALWAYS
>>
>> Likely something like
>>
>> PMD_ZERO_PAGE
>>
>> Will be a lot clearer.
>
> Sounds much better :)

And maybe something like

"STATIC_PMD_ZERO_PAGE"

would be even clearer.

The other one would be the dynamic one.

>
>>
>>> +	def_bool y
>>> +	depends on HUGETLB_PAGE && ARCH_WANTS_HUGE_ZERO_PAGE_ALWAYS
>>
>> I suspect it should then also be independent of HUGETLB_PAGE?
>
> You are right. So we don't depend on any of these features.
>
>>
>>> +	help
>>> +	  Typically huge_zero_folio, which is a huge page of zeroes, is allocated
>>> +	  on demand and deallocated when not in use. This option will always
>>> +	  allocate huge_zero_folio for zeroing and it is never deallocated.
>>> +	  Not suitable for memory constrained systems.
>>
>> I assume that code then has to live in mm/memory.c ?
>
> Hmm, then huge_zero_folio should have always been in mm/memory.c to
> begin with?
>

It's complicated. Only do_huge_pmd_anonymous_page() (and fsdax) really
uses it, and it may only get mapped into a process under certain
conditions (related to THP / PMD handling).

> I assume probably this was placed in mm/huge_memory.c because the users
> of this huge_zero_folio has been a part of mm/huge_memory.c?

Yes.

>
> So IIUC your comment, we should move the huge_zero_page_init() in the
> first patch to mm/memory.c and the existing shrinker code can be a part
> where they already are?

Good question. At least the "static" part can easily be moved over.
Maybe the dynamic part as well.

Worth trying it out and seeing how it looks :)

--
Cheers,

David / dhildenb
On Thu, May 22, 2025 at 02:50:20PM +0200, David Hildenbrand wrote:
> On 22.05.25 14:34, Pankaj Raghav (Samsung) wrote:
> > Hi David,
> >
> > > >  config ARCH_WANTS_THP_SWAP
> > > >  	def_bool n
> > > > -config ARCH_WANTS_THP_ZERO_PAGE_ALWAYS
> > > > +config ARCH_WANTS_HUGE_ZERO_PAGE_ALWAYS
> > > >  	def_bool n
> > > > +config HUGE_ZERO_PAGE_ALWAYS
> > >
> > > Likely something like
> > >
> > > PMD_ZERO_PAGE
> > >
> > > Will be a lot clearer.
> >
> > Sounds much better :)
>
> And maybe something like
>
> "STATIC_PMD_ZERO_PAGE"
>
> would be even clearer.
>
> The other one would be the dynamic one.

Got it. So if I understand correctly, we are going to have two huge zero
pages:
- one that is always allocated statically.
- the existing dynamic one, which will still be there for the existing
  users.

> >
> > >
> > > > +	def_bool y
> > > > +	depends on HUGETLB_PAGE && ARCH_WANTS_HUGE_ZERO_PAGE_ALWAYS
> > >
> > > I suspect it should then also be independent of HUGETLB_PAGE?
> >
> > You are right. So we don't depend on any of these features.
> >
> > >
> > > > +	help
> > > > +	  Typically huge_zero_folio, which is a huge page of zeroes, is allocated
> > > > +	  on demand and deallocated when not in use. This option will always
> > > > +	  allocate huge_zero_folio for zeroing and it is never deallocated.
> > > > +	  Not suitable for memory constrained systems.
> > >
> > > I assume that code then has to live in mm/memory.c ?
> >
> > Hmm, then huge_zero_folio should have always been in mm/memory.c to
> > begin with?
> >
>
> It's complicated. Only do_huge_pmd_anonymous_page() (and fsdax) really uses
> it, and it may only get mapped into a process under certain conditions
> (related to THP / PMD handling).
>

Got it.

> >
> > So IIUC your comment, we should move the huge_zero_page_init() in the
> > first patch to mm/memory.c and the existing shrinker code can be a part
> > where they already are?
>
> Good question. At least the "static" part can easily be moved over. Maybe
> the dynamic part as well.
>
> Worth trying it out and seeing how it looks :)

Challenge accepted ;) Thanks for the comments David.

--
Pankaj
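
The static versus dynamic distinction discussed in this thread can be
modelled in userspace C as below. This is only a sketch under stated
assumptions: every name in it (dynamic_zero_get(), dynamic_zero_put(),
static_zero_init(), static_zero_get(), SKETCH_PMD_SIZE) is hypothetical
and is not the kernel's API, calloc() stands in for folio allocation, and
freeing on the last put stands in for the shrinker, which in the kernel
reclaims the folio later rather than immediately.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumed PMD size for illustration only (2 MiB, as on x86-64). */
#define SKETCH_PMD_SIZE (2UL * 1024 * 1024)

/* Dynamic variant: allocated on first use, reference counted. */
static unsigned char *dynamic_zero;
static unsigned long dynamic_zero_refs;

/* Take a reference, allocating the zeroed buffer on demand. */
static const unsigned char *dynamic_zero_get(void)
{
	if (!dynamic_zero) {
		dynamic_zero = calloc(1, SKETCH_PMD_SIZE);
		if (!dynamic_zero)
			return NULL;
	}
	dynamic_zero_refs++;
	return dynamic_zero;
}

/*
 * Drop a reference. Simplification: the buffer is freed as soon as the
 * last user goes away; this models the shrinker only loosely.
 */
static void dynamic_zero_put(void)
{
	assert(dynamic_zero_refs > 0);
	if (--dynamic_zero_refs == 0) {
		free(dynamic_zero);
		dynamic_zero = NULL;
	}
}

/* Static variant: allocated once at init, never freed, no refcount. */
static unsigned char *static_zero;

/* One-time setup; returns false if the allocation failed. */
static bool static_zero_init(void)
{
	if (!static_zero)
		static_zero = calloc(1, SKETCH_PMD_SIZE);
	return static_zero != NULL;
}

/* Callers need no context and no matching put. */
static const unsigned char *static_zero_get(void)
{
	return static_zero;
}
```

The point of the static variant in this model is that a caller only needs
static_zero_get(): there is no reference to take and nothing to release
afterwards, at the price of keeping SKETCH_PMD_SIZE bytes allocated for
the lifetime of the program.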