From: Pankaj Raghav <p.raghav@samsung.com>

There are many places in the kernel where we need to zero out larger
chunks, but the maximum segment we can zero out at a time via ZERO_PAGE
is limited by PAGE_SIZE.

This concern was raised during the review of adding Large Block Size
support to XFS[1][2].

This is especially annoying in block devices and filesystems where we
attach multiple ZERO_PAGEs to the bio in different bvecs. With multipage
bvec support in the block layer, it is much more efficient to send out
larger zero pages as a part of a single bvec.

Some examples of places in the kernel where this could be useful:
- blkdev_issue_zero_pages()
- iomap_dio_zero()
- vmalloc.c:zero_iter()
- rxperf_process_call()
- fscrypt_zeroout_range_inline_crypt()
- bch2_checksum_update()
...

We already have huge_zero_folio that is allocated on demand, and it will
be deallocated by the shrinker if there are no users of it left.

At the moment, the huge_zero_folio infrastructure refcount is tied to the
lifetime of the process that created it. This might not work for the bio
layer, as the completions can be async and the process that created the
huge_zero_folio might no longer be alive.

Add a config option STATIC_PMD_ZERO_PAGE that will always allocate the
huge_zero_folio via memblock, and it will never be freed.

I have converted blkdev_issue_zero_pages() as an example as a part of
this series.

I will send patches to the individual subsystems using the
huge_zero_folio once this gets upstreamed.

Looking forward to some feedback.

[1] https://lore.kernel.org/linux-xfs/20231027051847.GA7885@lst.de/
[2] https://lore.kernel.org/linux-xfs/ZitIK5OnR7ZNY0IG@infradead.org/

Changes since v1:
- Move from .bss to allocating it through memblock (David)

Changes since RFC:
- Added the config option based on the feedback from David.
- Encode more info in the header to avoid dead code (Dave Hansen
  feedback)
- The static part of huge_zero_folio lives in memory.c and the dynamic
  part stays in huge_memory.c
- Split the patches to make them easier to review.

Pankaj Raghav (5):
  mm: move huge_zero_page declaration from huge_mm.h to mm.h
  huge_memory: add huge_zero_page_shrinker_(init|exit) function
  mm: add static PMD zero page
  mm: add largest_zero_folio() routine
  block: use largest_zero_folio in __blkdev_issue_zero_pages()

 block/blk-lib.c         | 17 +++++----
 include/linux/huge_mm.h | 31 ----------------
 include/linux/mm.h      | 81 +++++++++++++++++++++++++++++++++++++++++
 mm/Kconfig              |  9 +++++
 mm/huge_memory.c        | 62 +++++++++++++++++++++++--------
 mm/memory.c             | 25 +++++++++++++
 mm/mm_init.c            |  1 +
 7 files changed, 173 insertions(+), 53 deletions(-)

base-commit: d7b8f8e20813f0179d8ef519541a3527e7661d3a
--
2.49.0
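[Editor's note: to make the intended conversion concrete, here is a rough
sketch, not part of the posted series, of how a block-layer caller could
attach one large zero folio per bvec once largest_zero_folio() from patch
4 is available. The helper's exact signature, the function name below,
and the loop details are assumptions for illustration only.]

/*
 * Illustration only: assumes largest_zero_folio() (patch 4) returns a
 * never-freed zero folio (PMD-sized when available, otherwise the plain
 * ZERO_PAGE folio), so callers take no reference even though bio
 * completions can be async.
 */
static void zero_fill_bio_sectors(struct bio *bio, sector_t nr_sects)
{
	struct folio *zero_folio = largest_zero_folio();
	size_t len;

	while (nr_sects) {
		/* One large bvec instead of many PAGE_SIZE ZERO_PAGE bvecs. */
		len = min_t(sector_t, nr_sects << SECTOR_SHIFT,
			    folio_size(zero_folio));

		if (!bio_add_folio(bio, zero_folio, len, 0))
			break;	/* bio is full; caller chains a new one */
		nr_sects -= len >> SECTOR_SHIFT;
	}
}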
Note that this series does not apply to mm-new. Please rebase for the next respin.
Hi David,

For now I have some feedback from Zi. It would be great to hear your
feedback before I send the next version :)

--
Pankaj

On Mon, Jul 07, 2025 at 04:23:14PM +0200, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
>
> There are many places in the kernel where we need to zero out larger
> chunks, but the maximum segment we can zero out at a time via ZERO_PAGE
> is limited by PAGE_SIZE.
>
[...]
On Mon, 7 Jul 2025 16:23:14 +0200 "Pankaj Raghav (Samsung)" <kernel@pankajraghav.com> wrote:

> There are many places in the kernel where we need to zero out larger
> chunks, but the maximum segment we can zero out at a time via ZERO_PAGE
> is limited by PAGE_SIZE.
>
[...]
>
> We already have huge_zero_folio that is allocated on demand, and it
> will be deallocated by the shrinker if there are no users of it left.
>
> At the moment, the huge_zero_folio infrastructure refcount is tied to
> the lifetime of the process that created it. This might not work for
> the bio layer, as the completions can be async and the process that
> created the huge_zero_folio might no longer be alive.

Can we change that?  Alter the refcounting model so that dropping the
final reference at interrupt time works as expected?

And if we were to do this, what sort of benefit might it produce?

> Add a config option STATIC_PMD_ZERO_PAGE that will always allocate the
> huge_zero_folio via memblock, and it will never be freed.
On 08.07.25 00:38, Andrew Morton wrote:
> On Mon, 7 Jul 2025 16:23:14 +0200 "Pankaj Raghav (Samsung)" <kernel@pankajraghav.com> wrote:
>
[...]
>
>> At the moment, the huge_zero_folio infrastructure refcount is tied to
>> the lifetime of the process that created it. This might not work for
>> the bio layer, as the completions can be async and the process that
>> created the huge_zero_folio might no longer be alive.
>
> Can we change that?  Alter the refcounting model so that dropping the
> final reference at interrupt time works as expected?

I would hope that we can drop that whole shrinking+freeing mechanism at
some point, and simply always keep it around once allocated.

Any unprivileged process can keep the huge zero folio mapped and,
therefore, around, until that process is killed ...

But I assume some people might still have an opinion on the shrinker, so
for the time being having a second, static model might be less
controversial.

(I don't think we should be refcounting the huge zero folio in the long
term.)

--
Cheers,

David / dhildenb
Hi Andrew,

>> We already have huge_zero_folio that is allocated on demand, and it
>> will be deallocated by the shrinker if there are no users of it left.
>>
>> At the moment, the huge_zero_folio infrastructure refcount is tied to
>> the lifetime of the process that created it. This might not work for
>> the bio layer, as the completions can be async and the process that
>> created the huge_zero_folio might no longer be alive.
>
> Can we change that? Alter the refcounting model so that dropping the
> final reference at interrupt time works as expected?
>

That is an interesting point. I did not try it. At the moment, we always
drop the reference in __mmput().

Going back to the discussion before this work started, one of the main
things that people wanted was some sort of a **drop-in replacement** for
ZERO_PAGE that can be bigger than PAGE_SIZE[1].

And during the RFCs of these patches, one of the pieces of feedback I got
from David was that on big server systems, 2M (in the case of 4k page
size) should not be a problem, and we don't need any unnecessary
refcounting for it.

Also, when I had a chat with David, he said he wants to make changes to
the existing mm huge_zero_folio infrastructure to get rid of the shrinker
if possible. So we decided that it is better to have opt-in static
allocation and keep the existing dynamic allocation path.

So that is why I went with this approach of having a static PMD
allocation.

I hope this clarifies the motivation a bit. Let me know if you have more
questions.

> And if we were to do this, what sort of benefit might it produce?
>
>> Add a config option STATIC_PMD_ZERO_PAGE that will always allocate the
>> huge_zero_folio via memblock, and it will never be freed.

[1] https://lore.kernel.org/linux-xfs/20231027051847.GA7885@lst.de/
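[Editor's note: a minimal sketch of the opt-in model being described
above. largest_zero_folio() is the name from patch 4; the exact fallback
logic and placement are assumptions, not the posted code.]

/*
 * Sketch of the assumed semantics: with CONFIG_STATIC_PMD_ZERO_PAGE the
 * PMD zero folio is allocated once from memblock and never freed, so
 * callers take no reference; otherwise fall back to the ordinary (also
 * never-freed) ZERO_PAGE.
 */
static inline struct folio *largest_zero_folio(void)
{
#ifdef CONFIG_STATIC_PMD_ZERO_PAGE
	return huge_zero_folio;
#else
	return page_folio(ZERO_PAGE(0));
#endif
}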
On 7 Jul 2025, at 10:23, Pankaj Raghav (Samsung) wrote:

> From: Pankaj Raghav <p.raghav@samsung.com>
>
> There are many places in the kernel where we need to zero out larger
> chunks, but the maximum segment we can zero out at a time via ZERO_PAGE
> is limited by PAGE_SIZE.
>
[...]
>
> Add a config option STATIC_PMD_ZERO_PAGE that will always allocate the
> huge_zero_folio via memblock, and it will never be freed.

Do the above users want a PMD-sized zero page or a 2MB zero page? Because
on systems with a non-4KB base page size, e.g., ARM64 with 64KB base
pages, the PMD size is different: ARM64 with 64KB base pages has 512MB
PMD-sized pages. Having STATIC_PMD_ZERO_PAGE there means losing half a GB
of memory. I am not sure if it is acceptable.

Best Regards,
Yan, Zi
Hi Zi,

>> Add a config option STATIC_PMD_ZERO_PAGE that will always allocate the
>> huge_zero_folio via memblock, and it will never be freed.
>
> Do the above users want a PMD-sized zero page or a 2MB zero page?
> Because on systems with a non-4KB base page size, e.g., ARM64 with 64KB
> base pages, the PMD size is different: ARM64 with 64KB base pages has
> 512MB PMD-sized pages. Having STATIC_PMD_ZERO_PAGE there means losing
> half a GB of memory. I am not sure if it is acceptable.
>

That is a good point. My initial RFC patches allocated 2M instead of a
PMD-sized page.

But later David wanted to reuse the memory we allocate here for
huge_zero_folio. So if this config is enabled, we simply use the same
pointer for huge_zero_folio.

Since that happened, I decided to go with a PMD-sized page.

This config is still opt-in, and I would expect users of 64k page size
systems not to enable it.

But to make sure we don't enable this for those architectures, I could do
a per-arch opt-in with something like this[1] that I did in my previous
patch:

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 340e5468980e..c3a9d136ec0a 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -153,6 +153,7 @@ config X86
 	select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP if X86_64
 	select ARCH_WANT_HUGETLB_VMEMMAP_PREINIT if X86_64
 	select ARCH_WANTS_THP_SWAP if X86_64
+	select ARCH_HAS_STATIC_PMD_ZERO_PAGE if X86_64
 	select ARCH_HAS_PARANOID_L1D_FLUSH
 	select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
 	select BUILDTIME_TABLE_SORT

diff --git a/mm/Kconfig b/mm/Kconfig
index 781be3240e21..fd1c51995029 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -826,6 +826,19 @@ config ARCH_WANTS_THP_SWAP
 config MM_ID
 	def_bool n

+config ARCH_HAS_STATIC_PMD_ZERO_PAGE
+	def_bool n
+
+config STATIC_PMD_ZERO_PAGE
+	bool "Allocate a PMD page for zeroing"
+	depends on ARCH_HAS_STATIC_PMD_ZERO_PAGE
<snip>

Let me know your thoughts.

[1] https://lore.kernel.org/linux-mm/20250612105100.59144-4-p.raghav@samsung.com/#Z31mm:Kconfig
--
Pankaj
On Wed, Jul 09, 2025 at 10:03:51AM +0200, Pankaj Raghav wrote:
> Hi Zi,
>
[...]
>
> But to make sure we don't enable this for those architectures, I could
> do a per-arch opt-in with something like this[1] that I did in my
> previous patch:
>
[...]
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 781be3240e21..fd1c51995029 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -826,6 +826,19 @@ config ARCH_WANTS_THP_SWAP
>  config MM_ID
>  	def_bool n
>
> +config ARCH_HAS_STATIC_PMD_ZERO_PAGE
> +	def_bool n

Hm, is this correct? arm64 supports multiple page table sizes, so while
the architecture might 'support' it, it will vary based on page size, so
actually we don't care about the arch at all?

> +
> +config STATIC_PMD_ZERO_PAGE
> +	bool "Allocate a PMD page for zeroing"
> +	depends on ARCH_HAS_STATIC_PMD_ZERO_PAGE

Maybe we need to just make this depend on !CONFIG_PAGE_SIZE_xx?

> <snip>
>
> Let me know your thoughts.
>
> [1] https://lore.kernel.org/linux-mm/20250612105100.59144-4-p.raghav@samsung.com/#Z31mm:Kconfig
> --
> Pankaj
On 15.07.25 16:02, Lorenzo Stoakes wrote:
> On Wed, Jul 09, 2025 at 10:03:51AM +0200, Pankaj Raghav wrote:
>> Hi Zi,
>>
[...]
>>
>> +config ARCH_HAS_STATIC_PMD_ZERO_PAGE
>> +	def_bool n
>
> Hm, is this correct? arm64 supports multiple page table sizes, so while
> the architecture might 'support' it, it will vary based on page size,
> so actually we don't care about the arch at all?
>
>> +
>> +config STATIC_PMD_ZERO_PAGE
>> +	bool "Allocate a PMD page for zeroing"
>> +	depends on ARCH_HAS_STATIC_PMD_ZERO_PAGE
>
> Maybe we need to just make this depend on !CONFIG_PAGE_SIZE_xx?

I think at some point we discussed "when does the PMD-sized zeropage make
*any* sense on these weird arch configs" (512MiB on arm64 with 64k
pages).

No idea who wants to waste half a gig on that at runtime either.

But yeah, we should let the arch code opt in to whether it wants it or
not (in particular, maybe only on arm64 with CONFIG_PAGE_SIZE_4K).

--
Cheers,

David / dhildenb
On Tue, Jul 15, 2025 at 04:06:29PM +0200, David Hildenbrand wrote:
> I think at some point we discussed "when does the PMD-sized zeropage
> make *any* sense on these weird arch configs" (512MiB on arm64 with 64k
> pages).
>
> No idea who wants to waste half a gig on that at runtime either.

Yeah, this is a problem we _really_ need to solve. But obviously somewhat
out of scope here.

>
> But yeah, we should let the arch code opt in to whether it wants it or
> not (in particular, maybe only on arm64 with CONFIG_PAGE_SIZE_4K).

I don't think this should be an ARCH_HAS_xxx.

Because that's saying 'this architecture has X', and this isn't
architecture scope.

I suppose PMDs may vary in terms of how huge they are regardless of page
table size, actually.

So maybe the best solution is a semantic one - just rename this to
ARCH_WANT_STATIC_PMD_ZERO_PAGE

and then put the page size selector in the arch code.

For example, in arm64 we have:

	select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)

So doing something similar here, like:

	select ARCH_WANT_STATIC_PMD_ZERO_PAGE if ARM64_4K_PAGES

would do the job and sort everything out.

Cheers,
Lorenzo
On 15.07.25 16:12, Lorenzo Stoakes wrote:
>
[...]
>
> So maybe the best solution is a semantic one - just rename this to
> ARCH_WANT_STATIC_PMD_ZERO_PAGE
>
> and then put the page size selector in the arch code.
>
> For example, in arm64 we have:
>
> 	select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
>
> So doing something similar here, like:
>
> 	select ARCH_WANT_STATIC_PMD_ZERO_PAGE if ARM64_4K_PAGES
>
> would do the job and sort everything out.

Yes.

--
Cheers,

David / dhildenb
On Tue, Jul 15, 2025 at 04:16:44PM +0200, David Hildenbrand wrote:
> On 15.07.25 16:12, Lorenzo Stoakes wrote:
>>
[...]
>>
>> So maybe the best solution is a semantic one - just rename this to
>> ARCH_WANT_STATIC_PMD_ZERO_PAGE
>>
>> and then put the page size selector in the arch code.
>>
>> For example, in arm64 we have:
>>
>> 	select ARCH_WANT_HUGE_PMD_SHARE if ARM64_4K_PAGES || (ARM64_16K_PAGES && !ARM64_VA_BITS_36)
>>
>> So doing something similar here, like:
>>
>> 	select ARCH_WANT_STATIC_PMD_ZERO_PAGE if ARM64_4K_PAGES
>>
>> would do the job and sort everything out.
>
> Yes.

Actually, I had something similar in one of my earlier versions[1], where
we opt in from the arch-specific Kconfig with *WANT* instead of *HAS*.

For starters, I will enable this only on x86. We can probably extend this
once we get the base patches up.

[1] https://lore.kernel.org/linux-mm/20250522090243.758943-2-p.raghav@samsung.com/
--
Pankaj
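[Editor's note: a rough sketch of what the *WANT* opt-in discussed above
could look like, stitched together from the snippets already in this
thread; the exact symbol names and placement are whatever the next respin
actually settles on.]

# Illustration only, combining the suggestions above:

# arch/x86/Kconfig
	select ARCH_WANT_STATIC_PMD_ZERO_PAGE	if X86_64

# arch/arm64/Kconfig (only if arm64 ever opts in)
	select ARCH_WANT_STATIC_PMD_ZERO_PAGE	if ARM64_4K_PAGES

# mm/Kconfig
config ARCH_WANT_STATIC_PMD_ZERO_PAGE
	def_bool n

config STATIC_PMD_ZERO_PAGE
	bool "Allocate a PMD page for zeroing"
	depends on ARCH_WANT_STATIC_PMD_ZERO_PAGE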
On 15.07.25 17:25, Pankaj Raghav (Samsung) wrote:
> On Tue, Jul 15, 2025 at 04:16:44PM +0200, David Hildenbrand wrote:
>
[...]
>
> Actually, I had something similar in one of my earlier versions[1],
> where we opt in from the arch-specific Kconfig with *WANT* instead of
> *HAS*.
>
> For starters, I will enable this only on x86. We can probably extend
> this once we get the base patches up.

Makes sense.

--
Cheers,

David / dhildenb
On 9 Jul 2025, at 4:03, Pankaj Raghav wrote:

> Hi Zi,
>
[...]
>
> That is a good point. My initial RFC patches allocated 2M instead of a
> PMD-sized page.
>
> But later David wanted to reuse the memory we allocate here for
> huge_zero_folio. So if this config is enabled, we simply use the same
> pointer for huge_zero_folio.
>
> Since that happened, I decided to go with a PMD-sized page.

Got it. Thank you for the explanation. This means that for your use cases
2MB is big enough. For those arches which have PMD > 2MB, ideally a 2MB
zero mTHP should be used.

Thinking about this feature long term, I wonder what we should do to
support arches with PMD > 2MB. Make the static huge zero page size a
boot-time parameter?

> This config is still opt-in, and I would expect users of 64k page size
> systems not to enable it.
>
> But to make sure we don't enable this for those architectures, I could
> do a per-arch opt-in with something like this[1] that I did in my
> previous patch:
>
[...]
>
> Let me know your thoughts.

Sounds good to me, since without STATIC_PMD_ZERO_PAGE, when THP is
enabled, the use cases you mentioned are still able to use the THP zero
page. Thanks.

> [1] https://lore.kernel.org/linux-mm/20250612105100.59144-4-p.raghav@samsung.com/#Z31mm:Kconfig
> --
> Pankaj

Best Regards,
Yan, Zi
Pankaj,

There seems to be quite a lot to work on here, and it seems rather
speculative, so can we respin this as an RFC please?

Thanks! :)

On Mon, Jul 07, 2025 at 04:23:14PM +0200, Pankaj Raghav (Samsung) wrote:
> From: Pankaj Raghav <p.raghav@samsung.com>
>
> There are many places in the kernel where we need to zero out larger
> chunks, but the maximum segment we can zero out at a time via ZERO_PAGE
> is limited by PAGE_SIZE.
>
[...]
Hi Lorenzo,

On Tue, Jul 15, 2025 at 04:34:45PM +0100, Lorenzo Stoakes wrote:
> Pankaj,
>
> There seems to be quite a lot to work on here, and it seems rather
> speculative, so can we respin this as an RFC please?
>

Thanks for all the review comments. Yeah, I agree. I will resend it as an
RFC, and I will try the new approach suggested by David in patch 3 in the
next version.

--
Pankaj