There are many places in the kernel where we need to zero out larger
chunks, but the maximum segment we can zero out at a time with ZERO_PAGE
is limited by PAGE_SIZE.
This is especially annoying in block devices and filesystems where we
attach multiple ZERO_PAGEs to the bio in different bvecs. With multipage
bvec support in the block layer, it is much more efficient to send out
a larger zero page as part of a single bvec.
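As a rough userspace sketch of the segment-count argument (a hypothetical
helper, not kernel code): zeroing a 2M range through ZERO_PAGE needs one
bvec per PAGE_SIZE chunk, while a PMD-sized zero folio covers it in one.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE (4096UL)
#define PMD_SIZE  (2UL * 1024 * 1024)

/*
 * Count how many segments (bvecs) are needed to cover @len bytes of
 * zeroes when each segment is capped at @seg_max bytes.
 */
static unsigned long zero_segments(unsigned long len, unsigned long seg_max)
{
	return (len + seg_max - 1) / seg_max;
}
```

With a 4K PAGE_SIZE, `zero_segments(PMD_SIZE, PAGE_SIZE)` is 512 bvecs,
versus a single bvec when the segment can be PMD_SIZE.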
This concern was raised during the review of adding LBS support to
XFS[1][2].
Usually huge_zero_folio is allocated on demand, and it will be
deallocated by the shrinker if there are no users of it left.
Add a config option STATIC_PMD_ZERO_PAGE that will always allocate
the huge_zero_folio, which will never be freed. This makes it possible
to use the huge_zero_folio without having to pass any mm struct or call
put_folio in the destructor.
We can enable it by default for x86_64, where the PMD size is 2M.
It is a good compromise between memory usage and efficiency.
As a THP zero page might be wasteful for architectures with bigger page
sizes, let's not enable it for them.
[1] https://lore.kernel.org/linux-xfs/20231027051847.GA7885@lst.de/
[2] https://lore.kernel.org/linux-xfs/ZitIK5OnR7ZNY0IG@infradead.org/
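The two lifecycles described above can be modeled in userspace C. This is
a simplified sketch, not the kernel implementation: the names loosely
mirror get_huge_zero_page()/put_huge_zero_page(), and the shrinker's
actual refcount handling differs from the plain drop-to-zero free shown
here.

```c
#include <assert.h>
#include <stdlib.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)

static unsigned char *zero_buf;	/* stands in for huge_zero_folio */
static int refcount;		/* stands in for huge_zero_refcount */

/* On-demand model: allocate on first use, track users. */
static unsigned char *get_zero_buf(void)
{
	if (!zero_buf) {
		zero_buf = calloc(1, HPAGE_SIZE);
		if (!zero_buf)
			return NULL;
	}
	refcount++;
	return zero_buf;
}

/* Simplified "shrinker": free once the last user drops its reference. */
static void put_zero_buf(void)
{
	if (--refcount == 0) {
		free(zero_buf);
		zero_buf = NULL;
	}
}

/* Static model: allocated once at init, never freed, no refcounting. */
static unsigned char *static_zero_buf;

static int init_static_zero_buf(void)
{
	static_zero_buf = calloc(1, HPAGE_SIZE);
	return static_zero_buf ? 0 : -1;
}
```

In the static model a caller simply reads `static_zero_buf`; there is no
get/put pairing and therefore nothing to release in a destructor.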
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Pankaj Raghav <p.raghav@samsung.com>
---
arch/x86/Kconfig | 1 +
mm/Kconfig | 12 ++++++++++++
mm/memory.c | 30 ++++++++++++++++++++++++++----
3 files changed, 39 insertions(+), 4 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 055204dc211d..96f99b4f96ea 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -152,6 +152,7 @@ config X86
select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP if X86_64
select ARCH_WANT_HUGETLB_VMEMMAP_PREINIT if X86_64
select ARCH_WANTS_THP_SWAP if X86_64
+ select ARCH_WANTS_STATIC_PMD_ZERO_PAGE if X86_64
select ARCH_HAS_PARANOID_L1D_FLUSH
select BUILDTIME_TABLE_SORT
select CLKEVT_I8253
diff --git a/mm/Kconfig b/mm/Kconfig
index bd08e151fa1b..8f50f5c3f7a7 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -826,6 +826,18 @@ config ARCH_WANTS_THP_SWAP
config MM_ID
def_bool n
+config ARCH_WANTS_STATIC_PMD_ZERO_PAGE
+ bool
+
+config STATIC_PMD_ZERO_PAGE
+ def_bool y
+ depends on ARCH_WANTS_STATIC_PMD_ZERO_PAGE
+ help
+ Typically huge_zero_folio, which is a PMD page of zeroes, is allocated
+ on demand and deallocated when not in use. This option will always
+ allocate huge_zero_folio for zeroing and it is never deallocated.
+ Not suitable for memory constrained systems.
+
menuconfig TRANSPARENT_HUGEPAGE
bool "Transparent Hugepage Support"
depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE && !PREEMPT_RT
diff --git a/mm/memory.c b/mm/memory.c
index 11edc4d66e74..ab8c16d04307 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -203,9 +203,17 @@ static void put_huge_zero_page(void)
BUG_ON(atomic_dec_and_test(&huge_zero_refcount));
}
+/*
+ * If STATIC_PMD_ZERO_PAGE is enabled, @mm can be NULL, i.e, the huge_zero_folio
+ * is not associated with any mm_struct.
+*/
struct folio *mm_get_huge_zero_folio(struct mm_struct *mm)
{
- if (test_bit(MMF_HUGE_ZERO_PAGE, &mm->flags))
+ if (!IS_ENABLED(CONFIG_STATIC_PMD_ZERO_PAGE) && !mm)
+ return NULL;
+
+ if (IS_ENABLED(CONFIG_STATIC_PMD_ZERO_PAGE) ||
+ test_bit(MMF_HUGE_ZERO_PAGE, &mm->flags))
return READ_ONCE(huge_zero_folio);
if (!get_huge_zero_page())
@@ -219,6 +227,9 @@ struct folio *mm_get_huge_zero_folio(struct mm_struct *mm)
void mm_put_huge_zero_folio(struct mm_struct *mm)
{
+ if (IS_ENABLED(CONFIG_STATIC_PMD_ZERO_PAGE))
+ return;
+
if (test_bit(MMF_HUGE_ZERO_PAGE, &mm->flags))
put_huge_zero_page();
}
@@ -246,15 +257,26 @@ static unsigned long shrink_huge_zero_page_scan(struct shrinker *shrink,
static int __init init_huge_zero_page(void)
{
+ int ret = 0;
+
+ if (IS_ENABLED(CONFIG_STATIC_PMD_ZERO_PAGE)) {
+ if (!get_huge_zero_page())
+ ret = -ENOMEM;
+ goto out;
+ }
+
huge_zero_page_shrinker = shrinker_alloc(0, "thp-zero");
- if (!huge_zero_page_shrinker)
- return -ENOMEM;
+ if (!huge_zero_page_shrinker) {
+ ret = -ENOMEM;
+ goto out;
+ }
huge_zero_page_shrinker->count_objects = shrink_huge_zero_page_count;
huge_zero_page_shrinker->scan_objects = shrink_huge_zero_page_scan;
shrinker_register(huge_zero_page_shrinker);
- return 0;
+out:
+ return ret;
}
early_initcall(init_huge_zero_page);
--
2.47.2
Should this say FOLIO instead of PAGE in the config option to match the symbol protected by it?
On 6/2/25 07:03, Christoph Hellwig wrote:
> Should this say FOLIO instead of PAGE in the config option to match
> the symbol protected by it?
>
I am still discussing how the final implementation should be with David.
But I will change the _PAGE to _FOLIO as that is what we would like to
expose at the end.

--
Pankaj
On 02.06.25 16:49, Pankaj Raghav wrote:
> On 6/2/25 07:03, Christoph Hellwig wrote:
>> Should this say FOLIO instead of PAGE in the config option to match
>> the symbol protected by it?
>>
> I am still discussing how the final implementation should be with David.
> But I will change the _PAGE to _FOLIO as that is what we would like to
> expose at the end.

It's a huge page, represented internally as a folio. No strong opinion,
as ...

MMF_HUGE_ZERO_PAGE vs. mm_get_huge_zero_folio vs. get_huge_zero_page vs
... :)

--
Cheers,

David / dhildenb
On 02.06.25 22:28, David Hildenbrand wrote:
> On 02.06.25 16:49, Pankaj Raghav wrote:
>> On 6/2/25 07:03, Christoph Hellwig wrote:
>>> Should this say FOLIO instead of PAGE in the config option to match
>>> the symbol protected by it?
>>>
>> I am still discussing how the final implementation should be with David.
>> But I will change the _PAGE to _FOLIO as that is what we would like to
>> expose at the end.
>
> It's a huge page, represented internally as a folio. No strong opinion,
> as ...
>
> MMF_HUGE_ZERO_PAGE vs. mm_get_huge_zero_folio vs. get_huge_zero_page vs
> ... :)

Just to add, the existing one is exposed (configurable) to the user
through

/sys/kernel/mm/transparent_hugepage/use_zero_page

combined with the implicit size through

/sys/kernel/mm/transparent_hugepage/hpage_pmd_size

--
Cheers,

David / dhildenb
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 055204dc211d..96f99b4f96ea 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -152,6 +152,7 @@ config X86
> 	select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP if X86_64
> 	select ARCH_WANT_HUGETLB_VMEMMAP_PREINIT if X86_64
> 	select ARCH_WANTS_THP_SWAP if X86_64
> +	select ARCH_WANTS_STATIC_PMD_ZERO_PAGE if X86_64

I don't think this should be the default. There are lots of little
x86_64 VMs sitting around and 2MB might be significant to them.

> +config ARCH_WANTS_STATIC_PMD_ZERO_PAGE
> +	bool
> +
> +config STATIC_PMD_ZERO_PAGE
> +	def_bool y
> +	depends on ARCH_WANTS_STATIC_PMD_ZERO_PAGE
> +	help
> +	  Typically huge_zero_folio, which is a PMD page of zeroes, is allocated
> +	  on demand and deallocated when not in use. This option will always
> +	  allocate huge_zero_folio for zeroing and it is never deallocated.
> +	  Not suitable for memory constrained systems.

"Static" seems like a weird term to use for this. I was really expecting
to see a 2MB object that gets allocated in .bss or something rather than
a dynamically allocated page that's just never freed.

> 	menuconfig TRANSPARENT_HUGEPAGE
> 	bool "Transparent Hugepage Support"
> 	depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE && !PREEMPT_RT
> diff --git a/mm/memory.c b/mm/memory.c
> index 11edc4d66e74..ab8c16d04307 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -203,9 +203,17 @@ static void put_huge_zero_page(void)
> 	BUG_ON(atomic_dec_and_test(&huge_zero_refcount));
> }
>
> +/*
> + * If STATIC_PMD_ZERO_PAGE is enabled, @mm can be NULL, i.e, the huge_zero_folio
> + * is not associated with any mm_struct.
> +*/

I get that callers have to handle failure. But isn't this pretty nasty
for mm==NULL callers to be *guaranteed* to fail? They'll generate code
for the success case that will never even run.
On Tue, May 27, 2025 at 09:37:50AM -0700, Dave Hansen wrote:
> > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> > index 055204dc211d..96f99b4f96ea 100644
> > --- a/arch/x86/Kconfig
> > +++ b/arch/x86/Kconfig
> > @@ -152,6 +152,7 @@ config X86
> > 	select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP if X86_64
> > 	select ARCH_WANT_HUGETLB_VMEMMAP_PREINIT if X86_64
> > 	select ARCH_WANTS_THP_SWAP if X86_64
> > +	select ARCH_WANTS_STATIC_PMD_ZERO_PAGE if X86_64
>
> I don't think this should be the default. There are lots of little
> x86_64 VMs sitting around and 2MB might be significant to them.

This is the feedback I wanted. I will make it optional.

> > +config ARCH_WANTS_STATIC_PMD_ZERO_PAGE
> > +	bool
> > +
> > +config STATIC_PMD_ZERO_PAGE
> > +	def_bool y
> > +	depends on ARCH_WANTS_STATIC_PMD_ZERO_PAGE
> > +	help
> > +	  Typically huge_zero_folio, which is a PMD page of zeroes, is allocated
> > +	  on demand and deallocated when not in use. This option will always
> > +	  allocate huge_zero_folio for zeroing and it is never deallocated.
> > +	  Not suitable for memory constrained systems.
>
> "Static" seems like a weird term to use for this. I was really expecting
> to see a 2MB object that gets allocated in .bss or something rather than
> a dynamically allocated page that's just never freed.

My first proposal was along those lines[0] (sorry I messed up version
while sending the patches). David Hildenbrand suggested to leverage the
infrastructure we already have in huge_memory.

> > 	menuconfig TRANSPARENT_HUGEPAGE
> > 	bool "Transparent Hugepage Support"
> > 	depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE && !PREEMPT_RT
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 11edc4d66e74..ab8c16d04307 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -203,9 +203,17 @@ static void put_huge_zero_page(void)
> > 	BUG_ON(atomic_dec_and_test(&huge_zero_refcount));
> > }
> >
> > +/*
> > + * If STATIC_PMD_ZERO_PAGE is enabled, @mm can be NULL, i.e, the huge_zero_folio
> > + * is not associated with any mm_struct.
> > +*/
>
> I get that callers have to handle failure. But isn't this pretty nasty
> for mm==NULL callers to be *guaranteed* to fail? They'll generate code
> for the success case that will never even run.
>
The idea was to still have dynamic allocation possible even if this
config was disabled.

You are right that if this config is disabled, the callers with NULL mm
struct are guaranteed to fail, but we are not generating extra code
because there are still users who want dynamic allocation.

Do you think it is better to have the code inside an #ifdef instead
of using the IS_ENABLED primitive?

[1] https://lore.kernel.org/linux-fsdevel/20250516101054.676046-2-p.raghav@samsung.com/

--
Pankaj
On 27.05.25 20:00, Pankaj Raghav (Samsung) wrote:
> On Tue, May 27, 2025 at 09:37:50AM -0700, Dave Hansen wrote:
>>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>>> index 055204dc211d..96f99b4f96ea 100644
>>> --- a/arch/x86/Kconfig
>>> +++ b/arch/x86/Kconfig
>>> @@ -152,6 +152,7 @@ config X86
>>> 	select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP if X86_64
>>> 	select ARCH_WANT_HUGETLB_VMEMMAP_PREINIT if X86_64
>>> 	select ARCH_WANTS_THP_SWAP if X86_64
>>> +	select ARCH_WANTS_STATIC_PMD_ZERO_PAGE if X86_64
>>
>> I don't think this should be the default. There are lots of little
>> x86_64 VMs sitting around and 2MB might be significant to them.
>
> This is the feedback I wanted. I will make it optional.
>
>>> +config ARCH_WANTS_STATIC_PMD_ZERO_PAGE
>>> +	bool
>>> +
>>> +config STATIC_PMD_ZERO_PAGE
>>> +	def_bool y
>>> +	depends on ARCH_WANTS_STATIC_PMD_ZERO_PAGE
>>> +	help
>>> +	  Typically huge_zero_folio, which is a PMD page of zeroes, is allocated
>>> +	  on demand and deallocated when not in use. This option will always
>>> +	  allocate huge_zero_folio for zeroing and it is never deallocated.
>>> +	  Not suitable for memory constrained systems.
>>
>> "Static" seems like a weird term to use for this. I was really expecting
>> to see a 2MB object that gets allocated in .bss or something rather than
>> a dynamically allocated page that's just never freed.
>
> My first proposal was along those lines[0] (sorry I messed up version
> while sending the patches). David Hildenbrand suggested to leverage the
> infrastructure we already have in huge_memory.

Sorry, maybe I was not 100% clear. We could either

a) Allocate it statically in bss and reuse it for huge_memory purposes
   (static vs. dynamic is a good fit)

b) Allocate it during early boot and never free it. Assuming we allocate
   it from memblock, it's almost static ... :)

I would not allocate it at runtime later when requested. Then, "static"
is really a suboptimal fit.

--
Cheers,

David / dhildenb
On 5/27/25 11:00, Pankaj Raghav (Samsung) wrote:
>> I get that callers have to handle failure. But isn't this pretty nasty
>> for mm==NULL callers to be *guaranteed* to fail? They'll generate code
>> for the success case that will never even run.
>>
> The idea was to still have dynamic allocation possible even if this
> config was disabled.

I don't really understand what you are trying to say here.

> You are right that if this config is disabled, the callers with NULL mm
> struct are guaranteed to fail, but we are not generating extra code
> because there are still users who want dynamic allocation.

I'm pretty sure you're making the compiler generate unnecessary code.
Think of this:

	if (mm_get_huge_zero_folio(mm))
		foo();
	else
		bar();

With the static zero page, foo() is always called. But bar() is dead
code. The compiler doesn't know that, so it will generate both sides of
the if(). If you can get the CONFIG_... option checks into the header,
the compiler can figure it out and not even generate the call to bar().

> Do you think it is better to have the code inside an #ifdef instead
> of using the IS_ENABLED primitive?

It has nothing to do with an #ifdef versus IS_ENABLED(). It has to do
with the compiler having visibility into how mm_get_huge_zero_folio()
works enough to optimize its callers.
> > You are right that if this config is disabled, the callers with NULL mm
> > struct are guaranteed to fail, but we are not generating extra code
> > because there are still users who want dynamic allocation.
>
> I'm pretty sure you're making the compiler generate unnecessary code.
> Think of this:
>
> if (mm_get_huge_zero_folio(mm))
> foo();
> else
> bar();
>
> With the static zero page, foo() is always called. But bar() is dead
> code. The compiler doesn't know that, so it will generate both sides of
> the if().
>
Ahh, yeah you are right. I was thinking about the callee and not the
caller.
> If you can get the CONFIG_... option checks into the header, the
> compiler can figure it out and not even generate the call to bar().
Got it. I will keep this in mind before sending the next version.
> > Do you think it is better to have the code inside an #ifdef instead
> > of using the IS_ENABLED primitive?
> It has nothing to do with an #ifdef versus IS_ENABLED(). It has to do
> with the compiler having visibility into how mm_get_huge_zero_folio()
> works enough to optimize its callers.
I think something like this should give some visibility to the compiler:
struct folio *huge_zero_folio __read_mostly;
...
#ifdef CONFIG_STATIC_PMD_ZERO_PAGE
struct folio *mm_get_huge_zero_folio(...)
{
return READ_ONCE(huge_zero_folio);
}
#else
struct folio *mm_get_huge_zero_folio(...)
{
<old-code>
}
#endif
But I am not sure if the compiler can assume here that the static
huge_zero_folio variable will be non-NULL. It will be interesting to
check that in the output.
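The visibility argument can be tried in plain userspace C. This is a
hedged sketch with made-up names (STATIC_ZERO_ENABLED, mm_get_zero_buf),
not the kernel implementation: when the enabled/disabled decision is a
compile-time constant inside a header-style static inline, the compiler
can prune the dead branch in every caller. The test below only checks
the runtime behavior; the codegen effect is visible in the assembly.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Stand-in for IS_ENABLED(CONFIG_STATIC_PMD_ZERO_PAGE); flip to 0 to
 * model the config being disabled.
 */
#define STATIC_ZERO_ENABLED 1

static const char zero_buf[64];	/* stands in for the static zero folio */

/*
 * Header-style inline: because STATIC_ZERO_ENABLED is a compile-time
 * constant, callers' `if (mm_get_zero_buf(mm))` checks fold to true and
 * the else branch becomes dead code the compiler can drop.
 */
static inline const char *mm_get_zero_buf(void *mm)
{
	if (STATIC_ZERO_ENABLED)
		return zero_buf;
	return mm ? zero_buf : NULL;	/* dynamic path needs a valid mm */
}
```

Compiling this at -O2 and inspecting the output would show whether the
NULL branch in callers survives.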
--
Pankaj