This series removes "fake head pages" from the HugeTLB vmemmap
optimization (HVO) by changing how tail pages encode their relationship
to the head page.
It simplifies compound_head() and page_ref_add_unless(). Both are in the
hot path.
Background
==========
HVO reduces memory overhead by freeing vmemmap pages for HugeTLB pages
and remapping the freed virtual addresses to a single physical page.
Previously, all tail page vmemmap entries were remapped to the first
vmemmap page (containing the head struct page), creating "fake heads" -
tail pages that appear to have PG_head set when accessed through the
deduplicated vmemmap.
This required special handling in compound_head() to detect and work
around fake heads, adding complexity and overhead to a very hot path.
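For reference, the check this series removes looks roughly like the
following (paraphrased from the page_fixed_fake_head() helper in
include/linux/page-flags.h; details vary between kernel versions):

static __always_inline const struct page *page_fixed_fake_head(const struct page *page)
{
	if (!static_branch_unlikely(&hugetlb_optimize_vmemmap_key))
		return page;

	/*
	 * Only vmemmap addresses aligned to PAGE_SIZE can be fake heads,
	 * and a page with PG_head set is compound, so page[1] below is a
	 * valid struct page to inspect.
	 */
	if (IS_ALIGNED((unsigned long)page, PAGE_SIZE) &&
	    test_bit(PG_head, &page->flags)) {
		/*
		 * page[1] is a tail whose encoded head pointer leads to
		 * the real head whether @page is a genuine or fake head.
		 */
		unsigned long head = READ_ONCE(page[1].compound_head);

		if (likely(head & 1))
			return (const struct page *)(head - 1);
	}
	return page;
}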
New Approach
============
For architectures/configs where sizeof(struct page) is a power of 2 (the
common case), this series changes how the position of the head page is
encoded in the tail pages.
Instead of storing a pointer to the head page, the ->compound_info
(renamed from ->compound_head) now stores a mask.
The mask can be applied to any tail page's virtual address to compute
the head page address. Critically, all tail pages of the same order now
have identical compound_info values, regardless of which compound page
they belong to.
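A rough sketch of the resulting lookup (the exact encoding is in patch 4;
the bit-0 "tail" marker below is an assumption carried over from the old
pointer encoding):

/*
 * Sketch only: assumes sizeof(struct page) is a power of 2 and that
 * the vmemmap base is aligned to the span of struct pages covered by
 * the compound page. Every tail of an order-N page stores the same
 * compound_info: a mask over that span, with bit 0 set as the "this
 * is a tail" marker.
 */
static inline struct page *compound_head_sketch(struct page *page)
{
	unsigned long info = READ_ONCE(page->compound_info);

	if (!(info & 1))	/* not a tail page */
		return page;

	/* Drop the marker bit, mask our own address down to the head. */
	return (struct page *)((unsigned long)page & info & ~1UL);
}

Patch 10 then removes even that branch for the power-of-2 case; the
branchy form above is just the simplest way to show the idea.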
This enables a key optimization: instead of remapping tail vmemmap
entries to the head page (creating fake heads), we remap them to a
shared, pre-initialized vmemmap_tail page per hstate. The head page
gets its own dedicated vmemmap page, eliminating fake heads entirely.
Benefits
========
1. Smaller generated code. On defconfig, I see ~15K reduction of text
in vmlinux:
add/remove: 6/33 grow/shrink: 54/262 up/down: 6130/-21922 (-15792)
2. Simplified compound_head(): No fake head detection needed. The
function is now branchless for power-of-2 struct page sizes.
3. Eliminated race condition: The old scheme required synchronize_rcu()
to coordinate between HVO remapping and speculative PFN walkers that
might write to fake heads. With the head page always in writable
memory, this synchronization is unnecessary.
4. Removed static key: hugetlb_optimize_vmemmap_key is no longer needed
since compound_head() no longer has HVO-specific branches.
5. Cleaner architecture: The vmemmap layout is now straightforward -
head page has its own vmemmap, tails share a read-only template.
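As a concrete example, assuming x86-64 defaults (4 KB base pages,
64-byte struct page): a 2 MB huge page has 512 struct pages spanning 8
vmemmap pages, and the layout becomes roughly:

    vmemmap page 0      ->  private, writable page (head plus the
                            first 63 tails)
    vmemmap pages 1..7  ->  all remapped to the hstate's shared,
                            read-only vmemmap_tail template

Seven of the eight vmemmap pages per huge page are still freed, exactly
as before; the difference is that none of the remapped entries
masquerade as heads.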
I had hoped to see a performance improvement, but my testing thus far has
shown either no change or only a slight improvement within the noise.
Series Organization
===================
Patches 1-3: Preparatory refactoring
- Change prep_compound_tail() interface to take order
- Rename compound_head field to compound_info
- Move set/clear_compound_head() near compound_head()
Patch 4: Core encoding change
- Implement mask-based encoding for power-of-2 struct page
Patches 5-6: HVO restructuring
- Refactor vmemmap_walk to support separate head/tail pages
- Introduce per-hstate vmemmap_tail, eliminate fake heads
Patches 7-9: Cleanup
- Remove fake head checks from compound_head(), PageTail(), etc.
- Remove VMEMMAP_SYNCHRONIZE_RCU and synchronize_rcu() calls
- Remove hugetlb_optimize_vmemmap_key static key
Patch 10: Optimization
- Implement branchless compound_head() for power-of-2 case
Patch 11: Documentation
- Update vmemmap_dedup.rst to reflect new architecture
Kiryl Shutsemau (11):
mm: Change the interface of prep_compound_tail()
mm: Rename the 'compound_head' field in the 'struct page' to
'compound_info'
mm: Move set/clear_compound_head() to compound_head()
mm: Rework compound_head() for power-of-2 sizeof(struct page)
mm/hugetlb: Refactor code around vmemmap_walk
mm/hugetlb: Remove fake head pages
mm: Drop fake head checks and fix a race condition
hugetlb: Remove VMEMMAP_SYNCHRONIZE_RCU
mm/hugetlb: Remove hugetlb_optimize_vmemmap_key static key
mm: Remove the branch from compound_head()
hugetlb: Update vmemmap_dedup.rst
.../admin-guide/kdump/vmcoreinfo.rst | 2 +-
Documentation/mm/vmemmap_dedup.rst | 62 ++---
include/linux/hugetlb.h | 3 +
include/linux/mm_types.h | 20 +-
include/linux/page-flags.h | 163 +++++-------
include/linux/page_ref.h | 8 +-
include/linux/types.h | 2 +-
kernel/vmcore_info.c | 2 +-
mm/hugetlb.c | 8 +-
mm/hugetlb_vmemmap.c | 245 ++++++++----------
mm/hugetlb_vmemmap.h | 4 +-
mm/internal.h | 11 +-
mm/mm_init.c | 2 +-
mm/page_alloc.c | 4 +-
mm/slab.h | 2 +-
mm/util.c | 15 +-
16 files changed, 242 insertions(+), 311 deletions(-)
--
2.51.2
> On Dec 6, 2025, at 03:43, Kiryl Shutsemau <kas@kernel.org> wrote:
>
> It simplifies compound_head() and page_ref_add_unless(). Both are in the
> hot path.

Besides, the code simplification also looks good.

> This enables a key optimization: instead of remapping tail vmemmap
> entries to the head page (creating fake heads), we remap them to a
> shared, pre-initialized vmemmap_tail page per hstate. The head page
> gets its own dedicated vmemmap page, eliminating fake heads entirely.

A very interesting approach. The prerequisite is that the starting address
of vmemmap must be aligned to 16MB boundaries (for 1GB huge pages). Right?
We should add some checks somewhere to guarantee this (not compile time
but at runtime like for KASLR).

> 2. Simplified compound_head(): No fake head detection needed. The
> function is now branchless for power-of-2 struct page sizes.

And it is also a common approach for DAX to eliminate an additional
tail page.

> 5. Cleaner architecture: The vmemmap layout is now straightforward -
> head page has its own vmemmap, tails share a read-only template.

I have no idea about the memdesc feature, but regarding HVO, it is a nice
improvement. I'll look into the details later.

Muchun,
Thanks.
On Tue, Dec 09, 2025 at 02:22:28PM +0800, Muchun Song wrote:
> The prerequisite is that the starting address of vmemmap must be aligned to
> 16MB boundaries (for 1GB huge pages). Right? We should add some checks
> somewhere to guarantee this (not compile time but at runtime like for KASLR).
I have a hard time finding the right spot to put the check.

I considered something like the patch below, but it is probably too late
if huge pages are preallocated at boot.

I will dig more later, but if you have any suggestions, I would
appreciate them.
diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
index 04a211a146a0..971558184587 100644
--- a/mm/hugetlb_vmemmap.c
+++ b/mm/hugetlb_vmemmap.c
@@ -886,6 +886,14 @@ static int __init hugetlb_vmemmap_init(void)
BUILD_BUG_ON(__NR_USED_SUBPAGE > HUGETLB_VMEMMAP_RESERVE_PAGES);
for_each_hstate(h) {
+ unsigned long size = huge_page_size(h) / sizeof(struct page);
+
+ /* vmemmap is expected to be naturally aligned to page size */
+ if (WARN_ON_ONCE(!IS_ALIGNED((unsigned long)vmemmap, size))) {
+ vmemmap_optimize_enabled = false;
+ continue;
+ }
+
if (hugetlb_vmemmap_optimizable(h)) {
register_sysctl_init("vm", hugetlb_vmemmap_sysctls);
break;
--
Kiryl Shutsemau / Kirill A. Shutemov
> On Dec 9, 2025, at 22:44, Kiryl Shutsemau <kas@kernel.org> wrote:
>
> On Tue, Dec 09, 2025 at 02:22:28PM +0800, Muchun Song wrote:
>> The prerequisite is that the starting address of vmemmap must be aligned to
>> 16MB boundaries (for 1GB huge pages). Right? We should add some checks
>> somewhere to guarantee this (not compile time but at runtime like for KASLR).
>
> I have a hard time finding the right spot to put the check.
>
> I considered something like the patch below, but it is probably too late
> if huge pages are preallocated at boot.
>
> I will dig more later, but if you have any suggestions, I would
> appreciate them.
If you opt to record the mask information, then even when HVO is
disabled compound_head will still compute the head-page address
by means of the mask. Consequently this constraint must hold for
**every** compound page.
Therefore adding your code in hugetlb_vmemmap.c is not appropriate:
that file only turns HVO off, yet the calculation remains broken
for all other large compound pages.
From MAX_FOLIO_ORDER we know that folio_alloc_gigantic() can allocate
at most 16 GB of physically contiguous memory. We must therefore
guarantee that the vmemmap area starts on an address aligned to at
least 256 MB.
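(As a quick check, assuming 4 KB base pages and a 64-byte struct page: a
16 GB folio covers 16 GB / 4 KB = 4M struct pages, which occupy
4M * 64 B = 256 MB of virtually contiguous vmemmap, hence the 256 MB
alignment requirement on the vmemmap base.)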
When KASLR is disabled the vmemmap base is normally fixed by a
macro, so the check can be done at compile time; when KASLR is enabled
we have to ensure that the randomly chosen offset is a multiple
of 256 MB. These two spots are, in my view, the places that need
to be changed.
Moreover, this approach requires the virtual addresses of struct
page (possibly spanning sections) to be contiguous, so the method is
valid **only** under CONFIG_SPARSEMEM_VMEMMAP.
Also, when I skimmed through the overall patch yesterday, one detail
caught my eye: the shared tail page is **not** "per hstate"; it is
"per hstate, per zone, per node", because the zone and node
information is encoded in the tail page’s flags field. We should make
sure both page_to_nid() and page_zone() work properly.
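For reference, both helpers just decode bits of page->flags (simplified
from include/linux/mm.h; the exact shifts depend on the flags layout), so
every tail remapped to a single template would report the template's node
and zone:

static inline int page_to_nid(const struct page *page)
{
	return (page->flags >> NODES_PGSHIFT) & NODES_MASK;
}

static inline enum zone_type page_zonenum(const struct page *page)
{
	return (page->flags >> ZONES_PGSHIFT) & ZONES_MASK;
}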
Muchun,
Thanks.
>
> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
> index 04a211a146a0..971558184587 100644
> --- a/mm/hugetlb_vmemmap.c
> +++ b/mm/hugetlb_vmemmap.c
> @@ -886,6 +886,14 @@ static int __init hugetlb_vmemmap_init(void)
> BUILD_BUG_ON(__NR_USED_SUBPAGE > HUGETLB_VMEMMAP_RESERVE_PAGES);
>
> for_each_hstate(h) {
> + unsigned long size = huge_page_size(h) / sizeof(struct page);
> +
> + /* vmemmap is expected to be naturally aligned to page size */
> + if (WARN_ON_ONCE(!IS_ALIGNED((unsigned long)vmemmap, size))) {
> + vmemmap_optimize_enabled = false;
> + continue;
> + }
> +
> if (hugetlb_vmemmap_optimizable(h)) {
> register_sysctl_init("vm", hugetlb_vmemmap_sysctls);
> break;
> --
> Kiryl Shutsemau / Kirill A. Shutemov
On Wed, Dec 10, 2025 at 11:39:24AM +0800, Muchun Song wrote:
> Also, when I skimmed through the overall patch yesterday, one detail
> caught my eye: the shared tail page is **not** "per hstate"; it is
> "per hstate, per zone, per node", because the zone and node
> information is encoded in the tail page’s flags field. We should make
> sure both page_to_nid() and page_zone() work properly.

Right. Or we can slap compound_head() inside them.

I stepped onto a VM_BUG_ON_PAGE() in get_pfnblock_bitmap_bitidx().
Worked around it with compound_head() for now.

I am not sure if we want to allocate them per-zone. Seems excessive.
But per-node is reasonable.

--
Kiryl Shutsemau / Kirill A. Shutemov
> On Dec 11, 2025, at 23:08, Kiryl Shutsemau <kas@kernel.org> wrote:
>
> Right. Or we can slap compound_head() inside them.

At the same time, to keep users from accidentally passing compound_head()
a page struct handcrafted on the stack (like snapshot_page() does), shall
we add a VM_BUG_ON() in compound_head() to validate that the page address
falls within the vmemmap range? Otherwise, compound_head() will return an
invalid head page struct (an address on the stack with arbitrary data).

> I stepped onto a VM_BUG_ON_PAGE() in get_pfnblock_bitmap_bitidx().
> Worked around it with compound_head() for now.

I don't see why you singled out get_pfnblock_bitmap_bitidx() - what's
special about that spot?

> I am not sure if we want to allocate them per-zone. Seems excessive.

Yes. If we could solve page_to_nid() and page_zonenum(), it does not need
to be per-zone.

> But per-node is reasonable.

Agree.
> On Dec 10, 2025, at 11:39, Muchun Song <muchun.song@linux.dev> wrote:
>
>> On Dec 9, 2025, at 22:44, Kiryl Shutsemau <kas@kernel.org> wrote:
>>
>> On Tue, Dec 09, 2025 at 02:22:28PM +0800, Muchun Song wrote:
>>> The prerequisite is that the starting address of vmemmap must be aligned to
>>> 16MB boundaries (for 1GB huge pages). Right? We should add some checks
>>> somewhere to guarantee this (not compile time but at runtime like for KASLR).
>>
>> I have a hard time finding the right spot to put the check.
>>
>> I considered something like the patch below, but it is probably too late
>> if huge pages are preallocated at boot.
>>
>> I will dig more later, but if you have any suggestions, I would
>> appreciate them.
>
> If you opt to record the mask information, then even when HVO is
> disabled compound_head will still compute the head-page address
> by means of the mask. Consequently this constraint must hold for
> **every** compound page.
>
> Therefore adding your code in hugetlb_vmemmap.c is not appropriate:
> that file only turns HVO off, yet the calculation remains broken
> for all other large compound pages.
>
> From MAX_FOLIO_ORDER we know that folio_alloc_gigantic() can allocate
> at most 16 GB of physically contiguous memory. We must therefore
> guarantee that the vmemmap area starts on an address aligned to at
> least 256 MB.
>
> When KASLR is disabled the vmemmap base is normally fixed by a
> macro, so the check can be done at compile time; when KASLR is enabled
> we have to ensure that the randomly chosen offset is a multiple
> of 256 MB. These two spots are, in my view, the places that need
> to be changed.
>
> Moreover, this approach requires the virtual addresses of struct
> page (possibly spanning sections) to be contiguous, so the method is
> valid **only** under CONFIG_SPARSEMEM_VMEMMAP.
This is no longer an issue, because with nth_page removed (I only
just found out), a folio can no longer span multiple sections even
when !CONFIG_SPARSEMEM_VMEMMAP.
>
> Also, when I skimmed through the overall patch yesterday, one detail
> caught my eye: the shared tail page is **not** "per hstate"; it is
> "per hstate, per zone, per node", because the zone and node
> information is encoded in the tail page’s flags field. We should make
> sure both page_to_nid() and page_zone() work properly.
>
> Muchun,
> Thanks.
>
>>
>> diff --git a/mm/hugetlb_vmemmap.c b/mm/hugetlb_vmemmap.c
>> index 04a211a146a0..971558184587 100644
>> --- a/mm/hugetlb_vmemmap.c
>> +++ b/mm/hugetlb_vmemmap.c
>> @@ -886,6 +886,14 @@ static int __init hugetlb_vmemmap_init(void)
>> BUILD_BUG_ON(__NR_USED_SUBPAGE > HUGETLB_VMEMMAP_RESERVE_PAGES);
>>
>> for_each_hstate(h) {
>> + unsigned long size = huge_page_size(h) / sizeof(struct page);
>> +
>> + /* vmemmap is expected to be naturally aligned to page size */
>> + if (WARN_ON_ONCE(!IS_ALIGNED((unsigned long)vmemmap, size))) {
>> + vmemmap_optimize_enabled = false;
>> + continue;
>> + }
>> +
>> if (hugetlb_vmemmap_optimizable(h)) {
>> register_sysctl_init("vm", hugetlb_vmemmap_sysctls);
>> break;
>> --
>> Kiryl Shutsemau / Kirill A. Shutemov
On 12/5/25 20:43, Kiryl Shutsemau wrote:
> Instead of storing a pointer to the head page, the ->compound_info
> (renamed from ->compound_head) now stores a mask.

(we're in the merge window)

That doesn't seem to be suitable for the memdesc plans, where we want all
tail pages to point directly at the allocated memdesc (e.g., struct
folio), no?

@Willy what's your take?

--
Cheers

David
On Fri, Dec 05, 2025 at 09:16:08PM +0100, David Hildenbrand (Red Hat) wrote:
> That doesn't seem to be suitable for the memdesc plans, where we want all
> tail pages to point directly at the allocated memdesc (e.g., struct
> folio), no?

Sure. My understanding is that it is going to eliminate the need for
compound_head() completely. I don't see the conflict so far.

--
Kiryl Shutsemau / Kirill A. Shutemov
On 12/5/25 21:33, Kiryl Shutsemau wrote:
> Sure. My understanding is that it is going to eliminate the need for
> compound_head() completely. I don't see the conflict so far.

Right. All compound_head pointers will point at the allocated memdesc.

Would we still have to detect fake head pages though (at least for some
transition period)? I don't recall whether we'll really convert all
memdesc users at once, or if some memdescs will co-exist with ordinary
compound pages for a while.

--
Cheers

David
On Fri, Dec 05, 2025 at 09:44:30PM +0100, David Hildenbrand (Red Hat) wrote:
> Would we still have to detect fake head pages though (at least for some
> transition period)? I don't recall whether we'll really convert all
> memdesc users at once, or if some memdescs will co-exist with ordinary
> compound pages for a while.

If we need to detect if the memdesc is a tail, it should be as trivial as
comparing the given memdesc to the memdesc - 1. If they match, you are
looking at the tail.

But I don't think we would need it.

The memdesc itself doesn't hold anything you want to touch if you don't
hold a reference to the folio. You would need to dereference the memdesc,
and after that you don't care if the memdesc is a tail.

--
Kiryl Shutsemau / Kirill A. Shutemov
On 12/5/25 21:54, Kiryl Shutsemau wrote:
> If we need to detect if the memdesc is a tail, it should be as trivial as
> comparing the given memdesc to the memdesc - 1. If they match, you are
> looking at the tail.

How could you assume memdesc - 1 exists without performing other checks?

> But I don't think we would need it.

I would guess so.

> The memdesc itself doesn't hold anything you want to touch if you don't
> hold a reference to the folio. You would need to dereference the memdesc,
> and after that you don't care if the memdesc is a tail.

Hopefully.

So the real question is how this would affect the transition period (some
memdescs allocated, others not allocated separately) that Willy might soon
want to start. And the dual mode, where whether "struct folio" is
allocated separately will be a config option.

Let's wait for Willy's reply.

--
Cheers

David
On Fri, Dec 05, 2025 at 10:34:48PM +0100, David Hildenbrand (Red Hat) wrote:
> How could you assume memdesc - 1 exists without performing other checks?

Map a zero page in front of every discontinuous vmemmap region :P

--
Kiryl Shutsemau / Kirill A. Shutemov
On 12/5/25 22:41, Kiryl Shutsemau wrote:
> Map a zero page in front of every discontinuous vmemmap region :P

Good luck convincing memory hotplug maintainers about this added
complexity when making vmemmap ranges (un)available ;)

--
Cheers

David
On 05/12/2025 21:41, Kiryl Shutsemau wrote:
> Map a zero page in front of every discontinuous vmemmap region :P

I made an initial pass at reviewing the series. I think the best thing
about this is that someone looking at compound_head() won't need to
understand HVO to know how it works, so it's a very nice clean-up :)

Would be nice to make the commit messages more verbose, and also maybe
add more comments about why it works a certain way when sizeof(struct
page) is a power of 2.

I don't know what the current memdesc plans are, so I can't comment on
that part.
On 12/6/25 18:47, Usama Arif wrote:
> I made an initial pass at reviewing the series. I think the best thing
> about this is that someone looking at compound_head() won't need to
> understand HVO to know how it works, so it's a very nice clean-up :)

Yeah, I am also not a particular fan of the fake-head detection code and
of how this hugetlb monstrosity affects our implementation of compound
pages. :)

Moving from compound_head -> compound_info sounds like a suboptimal
temporary step, though, as we want compound_head to point at "struct
folio" etc. soon (either allocated separately or an overlay of "struct
page", based on a config option). So operating on vmemmap addresses is
not what the new world will look like.

Of course, we could look up the head page first and then use the memdesc
pointer in there to get our "struct folio", but it would be one
unnecessary roundtrip through the head page.

I'm sure Willy has an opinion on this, but likely has other priorities
given we are in the merge window and LPC is coming up.

--
Cheers

David
On Fri, Dec 5, 2025 at 11:44 AM Kiryl Shutsemau <kas@kernel.org> wrote:
> This series removes "fake head pages" from the HugeTLB vmemmap
> optimization (HVO) by changing how tail pages encode their relationship
> to the head page.

I love this in general - I've always disliked the fake head construction
(though I understand the reason behind it).

However, it seems like you didn't add support to vmemmap_populate_hvo, as
far as I can tell. That's the function that is used to do HVO early on
bootmem (memblock) allocated 'gigantic' pages. So I think that would
break with this patch.

Could you add support there too? I don't think it would be hard to do.
While at it, you could also do it for vmemmap_populate_hugepages to
support devdax :-)

- Frank
On Tue, Dec 09, 2025 at 10:20:14AM -0800, Frank van der Linden wrote:
> However, it seems like you didn't add support to vmemmap_populate_hvo, as
> far as I can tell. That's the function that is used to do HVO early on
> bootmem (memblock) allocated 'gigantic' pages. So I think that would
> break with this patch.

Ouch. Good catch. Will fix.

> Could you add support there too? I don't think it would be hard to do.
> While at it, you could also do it for vmemmap_populate_hugepages to
> support devdax :-)

Yeah, DAX was on my radar. I will see if it makes sense to make it part
of this patchset or do a follow-up.

Another thing I want to change is that we probably want to make the
vmemmap_tails per-node, so each node would use local memory for them.

--
Kiryl Shutsemau / Kirill A. Shutemov