From: Pankaj Raghav <p.raghav@samsung.com>

There are many places in the kernel where we need to zero out larger
chunks, but the maximum segment we can zero out at a time via ZERO_PAGE
is limited by PAGE_SIZE.

This concern was raised during the review of adding Large Block Size
support to XFS[1][2].

This is especially annoying in block devices and filesystems where we
attach multiple ZERO_PAGEs to the bio in different bvecs. With multipage
bvec support in the block layer, it is much more efficient to send out
larger zero pages as a part of a single bvec.

Some examples of places in the kernel where this could be useful:
- blkdev_issue_zero_pages()
- iomap_dio_zero()
- vmalloc.c:zero_iter()
- rxperf_process_call()
- fscrypt_zeroout_range_inline_crypt()
- bch2_checksum_update()
...

We already have huge_zero_folio that is allocated on demand, and it will
be deallocated by the shrinker if there are no users of it left.

At the moment, the huge_zero_folio infrastructure refcount is tied to the
lifetime of the process that created it. This might not work for the bio
layer, as the completions can be async and the process that created the
huge_zero_folio might no longer be alive.

Add a config option STATIC_PMD_ZERO_PAGE that will always allocate the
huge_zero_folio via memblock, and it will never be freed.

I have converted blkdev_issue_zero_pages() as an example as a part of
this series.

I will send patches to the individual subsystems using the
huge_zero_folio once this gets upstreamed.

Looking forward to some feedback.

[1] https://lore.kernel.org/linux-xfs/20231027051847.GA7885@lst.de/
[2] https://lore.kernel.org/linux-xfs/ZitIK5OnR7ZNY0IG@infradead.org/

Changes since v1:
- Move from .bss to allocating it through memblock (David)

Changes since RFC:
- Added the config option based on the feedback from David.
- Encode more info in the header to avoid dead code (Dave Hansen
  feedback)
- The static part of huge_zero_folio lives in memory.c and the dynamic
  part stays in huge_memory.c
- Split the patches to make them easier to review.

Pankaj Raghav (5):
  mm: move huge_zero_page declaration from huge_mm.h to mm.h
  huge_memory: add huge_zero_page_shrinker_(init|exit) function
  mm: add static PMD zero page
  mm: add largest_zero_folio() routine
  block: use largest_zero_folio in __blkdev_issue_zero_pages()

 block/blk-lib.c         | 17 +++++----
 include/linux/huge_mm.h | 31 ----------------
 include/linux/mm.h      | 81 +++++++++++++++++++++++++++++++++++++++++
 mm/Kconfig              |  9 +++++
 mm/huge_memory.c        | 62 +++++++++++++++++++++++--------
 mm/memory.c             | 25 +++++++++++++
 mm/mm_init.c            |  1 +
 7 files changed, 173 insertions(+), 53 deletions(-)

base-commit: d7b8f8e20813f0179d8ef519541a3527e7661d3a
--
2.49.0
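[Editor's note: to make the intended conversion concrete, here is a rough
sketch, not part of the posted series, of how a block-layer caller could
attach one large zero folio per bvec once largest_zero_folio() from patch
4 is available. The helper's exact signature, the function name below,
and the loop details are assumptions for illustration only.]

/*
 * Illustration only: assumes largest_zero_folio() (patch 4) returns a
 * never-freed zero folio (PMD-sized when available, otherwise the plain
 * ZERO_PAGE folio), so callers take no reference even though bio
 * completions can be async.
 */
static void zero_fill_bio_sectors(struct bio *bio, sector_t nr_sects)
{
	struct folio *zero_folio = largest_zero_folio();
	size_t len;

	while (nr_sects) {
		/* One large bvec instead of many PAGE_SIZE ZERO_PAGE bvecs. */
		len = min_t(sector_t, nr_sects << SECTOR_SHIFT,
			    folio_size(zero_folio));

		if (!bio_add_folio(bio, zero_folio, len, 0))
			break;	/* bio is full; caller chains a new one */
		nr_sects -= len >> SECTOR_SHIFT;
	}
}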
Note that this series does not apply to mm-new. Please rebase for the next respin.
Hi David,

For now I have some feedback from Zi. It would be great to hear your
feedback before I send the next version :)

--
Pankaj

On Mon, Jul 07, 2025 at 04:23:14PM +0200, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
>
> There are many places in the kernel where we need to zero out larger
> chunks, but the maximum segment we can zero out at a time via ZERO_PAGE
> is limited by PAGE_SIZE.
>
[...]
On Mon, 7 Jul 2025 16:23:14 +0200 "Pankaj Raghav (Samsung)" <kernel@pankajraghav.com> wrote:

> There are many places in the kernel where we need to zero out larger
> chunks, but the maximum segment we can zero out at a time via ZERO_PAGE
> is limited by PAGE_SIZE.
>
[...]
>
> We already have huge_zero_folio that is allocated on demand, and it
> will be deallocated by the shrinker if there are no users of it left.
>
> At the moment, the huge_zero_folio infrastructure refcount is tied to
> the lifetime of the process that created it. This might not work for
> the bio layer, as the completions can be async and the process that
> created the huge_zero_folio might no longer be alive.

Can we change that?  Alter the refcounting model so that dropping the
final reference at interrupt time works as expected?

And if we were to do this, what sort of benefit might it produce?

> Add a config option STATIC_PMD_ZERO_PAGE that will always allocate the
> huge_zero_folio via memblock, and it will never be freed.
On 08.07.25 00:38, Andrew Morton wrote:
> On Mon, 7 Jul 2025 16:23:14 +0200 "Pankaj Raghav (Samsung)" <kernel@pankajraghav.com> wrote:
>
[...]
>
>> At the moment, the huge_zero_folio infrastructure refcount is tied to
>> the lifetime of the process that created it. This might not work for
>> the bio layer, as the completions can be async and the process that
>> created the huge_zero_folio might no longer be alive.
>
> Can we change that?  Alter the refcounting model so that dropping the
> final reference at interrupt time works as expected?

I would hope that we can drop that whole shrinking+freeing mechanism at
some point, and simply always keep it around once allocated.

Any unprivileged process can keep the huge zero folio mapped and,
therefore, around, until that process is killed ...

But I assume some people might still have an opinion on the shrinker, so
for the time being having a second, static model might be less
controversial.

(I don't think we should be refcounting the huge zero folio in the long
term.)

--
Cheers,

David / dhildenb
Hi Andrew,

>> We already have huge_zero_folio that is allocated on demand, and it
>> will be deallocated by the shrinker if there are no users of it left.
>>
>> At the moment, the huge_zero_folio infrastructure refcount is tied to
>> the lifetime of the process that created it. This might not work for
>> the bio layer, as the completions can be async and the process that
>> created the huge_zero_folio might no longer be alive.
>
> Can we change that? Alter the refcounting model so that dropping the
> final reference at interrupt time works as expected?
>

That is an interesting point. I did not try it. At the moment, we always
drop the reference in __mmput().

Going back to the discussion before this work started, one of the main
things that people wanted was some sort of a **drop-in replacement** for
ZERO_PAGE that can be bigger than PAGE_SIZE[1].

And during the RFCs of these patches, one of the pieces of feedback I got
from David was that on big server systems, 2M (in the case of 4k page
size) should not be a problem, and we don't need any unnecessary
refcounting for it.

Also, when I had a chat with David, he said he wants to make changes to
the existing mm huge_zero_folio infrastructure to get rid of the shrinker
if possible. So we decided that it is better to have opt-in static
allocation and keep the existing dynamic allocation path.

So that is why I went with this approach of having a static PMD
allocation.

I hope this clarifies the motivation a bit. Let me know if you have more
questions.

> And if we were to do this, what sort of benefit might it produce?
>
>> Add a config option STATIC_PMD_ZERO_PAGE that will always allocate the
>> huge_zero_folio via memblock, and it will never be freed.

[1] https://lore.kernel.org/linux-xfs/20231027051847.GA7885@lst.de/
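[Editor's note: a minimal sketch of the opt-in model being described
above. largest_zero_folio() is the name from patch 4; the exact fallback
logic and placement are assumptions, not the posted code.]

/*
 * Sketch of the assumed semantics: with CONFIG_STATIC_PMD_ZERO_PAGE the
 * PMD zero folio is allocated once from memblock and never freed, so
 * callers take no reference; otherwise fall back to the ordinary (also
 * never-freed) ZERO_PAGE.
 */
static inline struct folio *largest_zero_folio(void)
{
#ifdef CONFIG_STATIC_PMD_ZERO_PAGE
	return huge_zero_folio;
#else
	return page_folio(ZERO_PAGE(0));
#endif
}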
On 7 Jul 2025, at 10:23, Pankaj Raghav (Samsung) wrote:

> From: Pankaj Raghav <p.raghav@samsung.com>
>
> There are many places in the kernel where we need to zero out larger
> chunks, but the maximum segment we can zero out at a time via ZERO_PAGE
> is limited by PAGE_SIZE.
>
[...]
>
> Add a config option STATIC_PMD_ZERO_PAGE that will always allocate the
> huge_zero_folio via memblock, and it will never be freed.

Do the above users want a PMD-sized zero page or a 2MB zero page? Because
on systems with a non-4KB base page size, e.g., ARM64 with 64KB base
pages, the PMD size is different: ARM64 with 64KB base pages has 512MB
PMD-sized pages. Having STATIC_PMD_ZERO_PAGE there means losing half a GB
of memory. I am not sure if it is acceptable.

Best Regards,
Yan, Zi
Hi Zi,

>> Add a config option STATIC_PMD_ZERO_PAGE that will always allocate the
>> huge_zero_folio via memblock, and it will never be freed.
>
> Do the above users want a PMD-sized zero page or a 2MB zero page?
> Because on systems with a non-4KB base page size, e.g., ARM64 with 64KB
> base pages, the PMD size is different: ARM64 with 64KB base pages has
> 512MB PMD-sized pages. Having STATIC_PMD_ZERO_PAGE there means losing
> half a GB of memory. I am not sure if it is acceptable.
>

That is a good point. My initial RFC patches allocated 2M instead of a
PMD-sized page.

But later David wanted to reuse the memory we allocate here for
huge_zero_folio. So if this config is enabled, we simply use the same
pointer for huge_zero_folio.

Since that happened, I decided to go with a PMD-sized page.

This config is still opt-in, and I would expect users of 64k page size
systems not to enable it.

But to make sure we don't enable this for those architectures, I could do
a per-arch opt-in with something like this[1] that I did in my previous
patch:

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 340e5468980e..c3a9d136ec0a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -153,6 +153,7 @@ config X86
 	select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP if X86_64
 	select ARCH_WANT_HUGETLB_VMEMMAP_PREINIT if X86_64
 	select ARCH_WANTS_THP_SWAP if X86_64
+	select ARCH_HAS_STATIC_PMD_ZERO_PAGE if X86_64
 	select ARCH_HAS_PARANOID_L1D_FLUSH
 	select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 	select BUILDTIME_TABLE_SORT

diff --git a/mm/Kconfig b/mm/Kconfig
index 781be3240e21..fd1c51995029 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -826,6 +826,19 @@ config ARCH_WANTS_THP_SWAP
 config MM_ID
 	def_bool n

+config ARCH_HAS_STATIC_PMD_ZERO_PAGE
+	def_bool n
+
+config STATIC_PMD_ZERO_PAGE
+	bool "Allocate a PMD page for zeroing"
+	depends on ARCH_HAS_STATIC_PMD_ZERO_PAGE
<snip>

Let me know your thoughts.

[1] https://lore.kernel.org/linux-mm/20250612105100.59144-4-p.raghav@samsung.com/#Z31mm:Kconfig
--
Pankaj
On Wed, Jul 09, 2025 at 10:03:51AM +0200, Pankaj Raghav wrote:
> Hi Zi,
>
[...]
>
> But to make sure we don't enable this for those architectures, I could
> do a per-arch opt-in with something like this[1] that I did in my
> previous patch:
>
[...]
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 781be3240e21..fd1c51995029 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -826,6 +826,19 @@ config ARCH_WANTS_THP_SWAP
>  config MM_ID
>  	def_bool n
>
> +config ARCH_HAS_STATIC_PMD_ZERO_PAGE
> +	def_bool n

Hm, is this correct? arm64 supports multiple page table sizes, so while
the architecture might 'support' it, it will vary based on page size, so
actually we don't care about the arch at all?

> +
> +config STATIC_PMD_ZERO_PAGE
> +	bool "Allocate a PMD page for zeroing"
> +	depends on ARCH_HAS_STATIC_PMD_ZERO_PAGE

Maybe we need to just make this depend on !CONFIG_PAGE_SIZE_xx?

> <snip>
>
> Let me know your thoughts.
>
> [1] https://lore.kernel.org/linux-mm/20250612105100.59144-4-p.raghav@samsung.com/#Z31mm:Kconfig
> --
> Pankaj
On 15.07.25 16:02, Lorenzo Stoakes wrote:
> On Wed, Jul 09, 2025 at 10:03:51AM +0200, Pankaj Raghav wrote:
>> Hi Zi,
>>
[...]
>>
>> +config ARCH_HAS_STATIC_PMD_ZERO_PAGE
>> +	def_bool n
>
> Hm, is this correct? arm64 supports multiple page table sizes, so while
> the architecture might 'support' it, it will vary based on page size,
> so actually we don't care about the arch at all?
>
>> +
>> +config STATIC_PMD_ZERO_PAGE
>> +	bool "Allocate a PMD page for zeroing"
>> +	depends on ARCH_HAS_STATIC_PMD_ZERO_PAGE
>
> Maybe we need to just make this depend on !CONFIG_PAGE_SIZE_xx?

I think at some point we discussed "when does the PMD-sized zeropage make
*any* sense on these weird arch configs" (512MiB on arm64 with 64k
pages).

No idea who wants to waste half a gig on that at runtime either.

But yeah, we should let the arch code opt in to whether it wants it or
not (in particular, maybe only on arm64 with CONFIG_PAGE_SIZE_4K).

--
Cheers,

David / dhildenb
On Tue, Jul 15, 2025 at 04:06:29PM +0200, David Hildenbrand wrote:
> I think at some point we discussed "when does the PMD-sized zeropage
> make *any* sense on these weird arch configs" (512MiB on arm64 with 64k
> pages).
>
> No idea who wants to waste half a gig on that at runtime either.

Yeah, this is a problem we _really_ need to solve. But obviously somewhat
out of scope here.

>
> But yeah, we should let the arch code opt in to whether it wants it or
> not (in particular, maybe only on arm64 with CONFIG_PAGE_SIZE_4K).

I don't think this should be an ARCH_HAS_xxx.

Because that's saying 'this architecture has X', and this isn't
architecture scope.

I suppose PMDs may vary in terms of how huge they are regardless of page
table size, actually.

So maybe the best solution is a semantic one - just rename this to
ARCH_WANT_STATIC_PMD_ZERO_PAGE

and then put the page size selector in the arch code.

For example, in arm64 we have:

	select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)

So doing something similar here, like:

	select ARCH_WANT_STATIC_PMD_ZERO_PAGE if ARM64_4K_PAGES

would do the job and sort everything out.

Cheers,
Lorenzo
On 15.07.25 16:12, Lorenzo Stoakes wrote:
>
[...]
>
> So maybe the best solution is a semantic one - just rename this to
> ARCH_WANT_STATIC_PMD_ZERO_PAGE
>
> and then put the page size selector in the arch code.
>
> For example, in arm64 we have:
>
> 	select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
>
> So doing something similar here, like:
>
> 	select ARCH_WANT_STATIC_PMD_ZERO_PAGE if ARM64_4K_PAGES
>
> would do the job and sort everything out.

Yes.

--
Cheers,

David / dhildenb
On Tue, Jul 15, 2025 at 04:16:44PM +0200, David Hildenbrand wrote:
> On 15.07.25 16:12, Lorenzo Stoakes wrote:
>>
[...]
>>
>> So maybe the best solution is a semantic one - just rename this to
>> ARCH_WANT_STATIC_PMD_ZERO_PAGE
>>
>> and then put the page size selector in the arch code.
>>
>> For example, in arm64 we have:
>>
>> 	select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
>>
>> So doing something similar here, like:
>>
>> 	select ARCH_WANT_STATIC_PMD_ZERO_PAGE if ARM64_4K_PAGES
>>
>> would do the job and sort everything out.
>
> Yes.

Actually, I had something similar in one of my earlier versions[1], where
we opt in from the arch-specific Kconfig with *WANT* instead of *HAS*.

For starters, I will enable this only on x86. We can probably extend this
once we get the base patches up.

[1] https://lore.kernel.org/linux-mm/20250522090243.758943-2-p.raghav@samsung.com/
--
Pankaj
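[Editor's note: a rough sketch of what the *WANT* opt-in discussed above
could look like, stitched together from the snippets already in this
thread; the exact symbol names and placement are whatever the next respin
actually settles on.]

# Illustration only, combining the suggestions above:

# arch/x86/Kconfig
	select ARCH_WANT_STATIC_PMD_ZERO_PAGE	if X86_64

# arch/arm64/Kconfig (only if arm64 ever opts in)
	select ARCH_WANT_STATIC_PMD_ZERO_PAGE	if ARM64_4K_PAGES

# mm/Kconfig
config ARCH_WANT_STATIC_PMD_ZERO_PAGE
	def_bool n

config STATIC_PMD_ZERO_PAGE
	bool "Allocate a PMD page for zeroing"
	depends on ARCH_WANT_STATIC_PMD_ZERO_PAGE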
On 15.07.25 17:25, Pankaj Raghav (Samsung) wrote:
> On Tue, Jul 15, 2025 at 04:16:44PM +0200, David Hildenbrand wrote:
>
[...]
>
> Actually, I had something similar in one of my earlier versions[1],
> where we opt in from the arch-specific Kconfig with *WANT* instead of
> *HAS*.
>
> For starters, I will enable this only on x86. We can probably extend
> this once we get the base patches up.

Makes sense.

--
Cheers,

David / dhildenb
On 9 Jul 2025, at 4:03, Pankaj Raghav wrote:

> Hi Zi,
>
[...]
>
> That is a good point. My initial RFC patches allocated 2M instead of a
> PMD-sized page.
>
> But later David wanted to reuse the memory we allocate here for
> huge_zero_folio. So if this config is enabled, we simply use the same
> pointer for huge_zero_folio.
>
> Since that happened, I decided to go with a PMD-sized page.

Got it. Thank you for the explanation. This means that for your use cases
2MB is big enough. For those arches which have PMD > 2MB, ideally a 2MB
zero mTHP should be used.

Thinking about this feature long term, I wonder what we should do to
support arches with PMD > 2MB. Make the static huge zero page size a
boot-time parameter?

> This config is still opt-in, and I would expect users of 64k page size
> systems not to enable it.
>
> But to make sure we don't enable this for those architectures, I could
> do a per-arch opt-in with something like this[1] that I did in my
> previous patch:
>
[...]
>
> Let me know your thoughts.

Sounds good to me, since without STATIC_PMD_ZERO_PAGE, when THP is
enabled, the use cases you mentioned are still able to use the THP zero
page. Thanks.

> [1] https://lore.kernel.org/linux-mm/20250612105100.59144-4-p.raghav@samsung.com/#Z31mm:Kconfig
> --
> Pankaj

Best Regards,
Yan, Zi
Pankaj,

There seems to be quite a lot to work on here, and it seems rather
speculative, so can we respin this as an RFC please?

Thanks! :)

On Mon, Jul 07, 2025 at 04:23:14PM +0200, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
>
> There are many places in the kernel where we need to zero out larger
> chunks, but the maximum segment we can zero out at a time via ZERO_PAGE
> is limited by PAGE_SIZE.
>
[...]
Hi Lorenzo,

On Tue, Jul 15, 2025 at 04:34:45PM +0100, Lorenzo Stoakes wrote:
> Pankaj,
>
> There seems to be quite a lot to work on here, and it seems rather
> speculative, so can we respin this as an RFC please?
>

Thanks for all the review comments. Yeah, I agree. I will resend it as an
RFC, and I will try the new approach suggested by David in patch 3 in the
next version.

--
Pankaj