[PATCHv4 00/14] mm: Eliminate fake head pages from vmemmap optimization
Posted by Kiryl Shutsemau 2 weeks, 3 days ago
This series removes "fake head pages" from the HugeTLB vmemmap
optimization (HVO) by changing how tail pages encode their relationship
to the head page.

It simplifies compound_head() and page_ref_add_unless(). Both are in the
hot path.

Background
==========

HVO reduces memory overhead by freeing vmemmap pages for HugeTLB pages
and remapping the freed virtual addresses to a single physical page.
Previously, all tail page vmemmap entries were remapped to the first
vmemmap page (containing the head struct page), creating "fake heads" -
tail pages that appear to have PG_head set when accessed through the
deduplicated vmemmap.

This required special handling in compound_head() to detect and work
around fake heads, adding complexity and overhead to a very hot path.
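
For reference, the fake-head detection that goes away looks roughly like
this. It is a simplified sketch of the pre-series page_fixed_fake_head()
in include/linux/page-flags.h, with details elided:

static __always_inline const struct page *page_fixed_fake_head(const struct page *page)
{
	if (!hugetlb_optimize_vmemmap_enabled())
		return page;

	/*
	 * Only a PAGE_SIZE-aligned struct page can be a fake head; a
	 * real head is followed by a tail whose compound_head points
	 * back at it (with bit 0 set as the tail tag).
	 */
	if (IS_ALIGNED((unsigned long)page, PAGE_SIZE) &&
	    test_bit(PG_head, &page->flags)) {
		unsigned long head = READ_ONCE(page[1].compound_head);

		if (likely(head & 1))
			return (const struct page *)(head - 1);
	}
	return page;
}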

New Approach
============

For architectures/configs where sizeof(struct page) is a power of 2 (the
common case), this series changes how the position of the head page is
encoded in the tail pages.

Instead of storing a pointer to the head page, the ->compound_info field
(renamed from ->compound_head) now stores a mask. Applying the mask to
any tail page's virtual address yields the head page's address.

The key insight is that all tail pages of the same order now have
identical compound_info values, regardless of which compound page they
belong to. This allows a single page of tail struct pages to be shared
across all huge pages of the same order on a NUMA node.
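
As an illustration of the encoding (a sketch, not the series' exact
code), assume sizeof(struct page) == 64 and a memmap aligned to
MAX_FOLIO_SIZE. The struct pages of one order-N compound page then
occupy a naturally aligned block of vmemmap, so a single AND mask
recovers the head from any tail:

/*
 * Bit 0 doubles as the "this is a tail" tag, as it did for the old
 * head pointer.  Page addresses are 64-byte aligned, so the tag bit
 * does not disturb the AND.
 */
static inline unsigned long tail_compound_info(unsigned int order)
{
	return ~((1UL << (order + 6)) - 1) | 1;	/* 6 == ilog2(64) */
}

/*
 * head = (struct page *)((unsigned long)tail & tail_compound_info(order));
 * Every order-9 tail stores the same value (~0x7fffUL | 1), whichever
 * 2MB huge page owns it - which is what makes the tails shareable.
 */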

Benefits
========

1. Simplified compound_head(): no fake head detection is needed, and it
   can be implemented in a branchless manner (see the sketch after this
   list).

2. Simplified page_ref_add_unless(): RCU protection removed since there's
   no race with fake head remapping.

3. Cleaner architecture: The shared tail pages are truly read-only and
   contain valid tail page metadata.
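
To illustrate benefit 1, here is a minimal sketch of a branchless
compound_head() (illustrative, not the exact code in the series):

static inline struct page *compound_head_sketch(const struct page *page)
{
	unsigned long info = READ_ONCE(page->compound_info);
	/*
	 * (info & 1) - 1 is 0 for tail pages and ~0UL for everything
	 * else, so non-tail pages mask with ~0UL and map to themselves.
	 */
	unsigned long mask = info | ((info & 1) - 1);

	return (struct page *)((unsigned long)page & mask);
}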

If sizeof(struct page) is not a power of 2, there are no functional
changes. HVO is not supported in this configuration.

I had hoped to see a performance improvement, but my testing thus far
has shown either no change or only a slight improvement within the noise.

Series Organization
===================

Patch 1: Preparation - move MAX_FOLIO_ORDER to mmzone.h
Patches 2-4: Refactoring - interface changes, field rename, code movement
Patch 5: Core change - new mask-based compound_head() encoding
Patch 6: Correctness fix - page_zonenum() must use head page
Patch 7: Add memmap alignment check for compound_info_has_mask()
Patch 8: Refactor vmemmap_walk for new design
Patch 9: Eliminate fake heads with shared tail pages
Patches 10-13: Cleanup - remove fake head infrastructure
Patch 14: Documentation update

Changes in v4:
==============
  - Fix build issues caused by a linux/mmzone.h <-> linux/pgtable.h
    dependency loop by not including linux/pgtable.h from
    linux/mmzone.h

  - Rework vmemmap_remap_alloc() interface. (Muchun)

  - Use &folio->page instead of folio address for optimization
    target. (Muchun)

Changes in v3:
==============
  - Fixed the error recovery path in vmemmap_remap_free() to pass the
    correct start address for the TLB flush. (Muchun)

  - Wrapped the mask-based compound_info encoding in a
    CONFIG_SPARSEMEM_VMEMMAP check via compound_info_has_mask(). For
    other memory models, alignment guarantees are harder to verify. (Muchun)

  - Updated vmemmap_dedup.rst documentation wording: changed "vmemmap_tail
    shared for the struct hstate" to "A single, per-node page frame shared
    among all hugepages of the same size". (Muchun)

  - Fixed build error with MAX_FOLIO_ORDER expanding to undefined PUD_ORDER
    in certain configurations. (kernel test robot)

Changes in v2:
==============

- Handled boot-allocated huge pages correctly. (Frank)

- Changed from per-hstate vmemmap_tail to a per-node vmemmap_tails[]
  array in pglist_data. (Muchun)

- Added spin_lock(&hugetlb_lock) protection in vmemmap_get_tail() to fix
  a race condition where two threads could both allocate tail pages.
  The losing thread now properly frees its allocated page; see the
  sketch after this list. (Usama)

- Added a warning if the memmap is not aligned to MAX_FOLIO_SIZE, which
  the mask approach requires. (Muchun)

- Made page_zonenum() use the head page - a correctness fix, since
  shared tail pages cannot have valid zone information. (Muchun)

- Added a 'const' qualifier to the head parameter in set_compound_head()
  and prep_compound_tail(). (Usama)

- Updated commit messages.
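
The vmemmap_get_tail() race fix follows the usual "allocate outside the
lock, install under it" pattern. A minimal sketch (the vmemmap_tails[]
indexing and flags are illustrative, not the exact code):

static struct page *vmemmap_get_tail_sketch(int nid, unsigned int order)
{
	struct page *tail, *new;

	tail = READ_ONCE(NODE_DATA(nid)->vmemmap_tails[order]);
	if (tail)
		return tail;

	new = alloc_pages_node(nid, GFP_KERNEL, 0);	/* may sleep */
	if (!new)
		return NULL;

	spin_lock(&hugetlb_lock);
	tail = NODE_DATA(nid)->vmemmap_tails[order];
	if (!tail) {
		/* We won the race: publish our page. */
		NODE_DATA(nid)->vmemmap_tails[order] = new;
		tail = new;
		new = NULL;
	}
	spin_unlock(&hugetlb_lock);

	if (new)	/* We lost: the winner's page is already in place. */
		__free_pages(new, 0);
	return tail;
}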

Kiryl Shutsemau (14):
  mm: Move MAX_FOLIO_ORDER definition to mmzone.h
  mm: Change the interface of prep_compound_tail()
  mm: Rename the 'compound_head' field in the 'struct page' to
    'compound_info'
  mm: Move set/clear_compound_head() next to compound_head()
  mm: Rework compound_head() for power-of-2 sizeof(struct page)
  mm: Make page_zonenum() use head page
  mm/sparse: Check memmap alignment for compound_info_has_mask()
  mm/hugetlb: Refactor code around vmemmap_walk
  mm/hugetlb: Remove fake head pages
  mm: Drop fake head checks
  hugetlb: Remove VMEMMAP_SYNCHRONIZE_RCU
  mm/hugetlb: Remove hugetlb_optimize_vmemmap_key static key
  mm: Remove the branch from compound_head()
  hugetlb: Update vmemmap_dedup.rst

 .../admin-guide/kdump/vmcoreinfo.rst          |   2 +-
 Documentation/mm/vmemmap_dedup.rst            |  62 ++--
 include/linux/mm.h                            |  31 --
 include/linux/mm_types.h                      |  20 +-
 include/linux/mmzone.h                        |  47 +++
 include/linux/page-flags.h                    | 167 +++++-----
 include/linux/page_ref.h                      |   8 +-
 include/linux/types.h                         |   2 +-
 kernel/vmcore_info.c                          |   2 +-
 mm/hugetlb.c                                  |   8 +-
 mm/hugetlb_vmemmap.c                          | 300 ++++++++----------
 mm/internal.h                                 |  12 +-
 mm/mm_init.c                                  |   2 +-
 mm/page_alloc.c                               |   4 +-
 mm/slab.h                                     |   2 +-
 mm/sparse-vmemmap.c                           |  44 ++-
 mm/sparse.c                                   |   5 +
 mm/util.c                                     |  16 +-
 18 files changed, 369 insertions(+), 365 deletions(-)

-- 
2.51.2
Re: [PATCHv4 00/14] mm: Eliminate fake head pages from vmemmap optimization
Posted by Vlastimil Babka 2 weeks, 3 days ago
On 1/21/26 17:22, Kiryl Shutsemau wrote:
> This series removes "fake head pages" from the HugeTLB vmemmap
> optimization (HVO) by changing how tail pages encode their relationship
> to the head page.
> 
> It simplifies compound_head() and page_ref_add_unless(). Both are in the
> hot path.

We never got the definitive answer in the previous version discussions
about whether it's worth doing this now with the upcoming memdesc stuff, right?

> Background
> ==========
> 
> HVO reduces memory overhead by freeing vmemmap pages for HugeTLB pages
> and remapping the freed virtual addresses to a single physical page.
> Previously, all tail page vmemmap entries were remapped to the first
> vmemmap page (containing the head struct page), creating "fake heads" -
> tail pages that appear to have PG_head set when accessed through the
> deduplicated vmemmap.
> 
> This required special handling in compound_head() to detect and work
> around fake heads, adding complexity and overhead to a very hot path.

So a very stupid question, why did we remap everything to the first page,
and not instead create two pages, where the first one would contain the head
and the first batch of tails, and the second one would be used for the rest
of the tails? I'd expect it wouldn't make the memory savings that much
worse, and eliminate most of the issues?

Re: [PATCHv4 00/14] mm: Eliminate fake head pages from vmemmap optimization
Posted by Zi Yan 2 weeks, 3 days ago
On 21 Jan 2026, at 13:44, Vlastimil Babka wrote:

> On 1/21/26 17:22, Kiryl Shutsemau wrote:
>> This series removes "fake head pages" from the HugeTLB vmemmap
>> optimization (HVO) by changing how tail pages encode their relationship
>> to the head page.
>>
>> It simplifies compound_head() and page_ref_add_unless(). Both are in the
>> hot path.
>
> We never got the definitive answer in the previous version discussions
> about whether it's worth doing this now with the upcoming memdesc stuff, right?
>
>> Background
>> ==========
>>
>> HVO reduces memory overhead by freeing vmemmap pages for HugeTLB pages
>> and remapping the freed virtual addresses to a single physical page.
>> Previously, all tail page vmemmap entries were remapped to the first
>> vmemmap page (containing the head struct page), creating "fake heads" -
>> tail pages that appear to have PG_head set when accessed through the
>> deduplicated vmemmap.
>>
>> This required special handling in compound_head() to detect and work
>> around fake heads, adding complexity and overhead to a very hot path.
>
> So a very stupid question, why did we remap everything to the first page,
> and not instead create two pages, where the first one would contain the head
> and the first batch of tails, and the second one would be used for the rest
> of the tails? I'd expect it wouldn't make the memory savings that much
> worse, and eliminate most of the issues?

I think it was using 2 pages before[1]. The benefit of using one page is:
“
It further reduces the overhead of struct
page by 12.5% for a 2MB HugeTLB compared to the previous approach,
which means 2GB per 1TB HugeTLB (2MB type).
“

[1] https://lore.kernel.org/all/20211101031651.75851-1-songmuchun@bytedance.com/T/#u



Best Regards,
Yan, Zi
Re: [PATCHv4 00/14] mm: Eliminate fake head pages from vmemmap optimization
Posted by Kiryl Shutsemau 2 weeks, 3 days ago
On Wed, Jan 21, 2026 at 03:31:59PM -0500, Zi Yan wrote:
> On 21 Jan 2026, at 13:44, Vlastimil Babka wrote:
> 
> > On 1/21/26 17:22, Kiryl Shutsemau wrote:
> >> This series removes "fake head pages" from the HugeTLB vmemmap
> >> optimization (HVO) by changing how tail pages encode their relationship
> >> to the head page.
> >>
> >> It simplifies compound_head() and page_ref_add_unless(). Both are in the
> >> hot path.
> >
> > We never got the definitive answer in the previous version discussions
> > about whether it's worth doing this now with the upcoming memdesc stuff, right?

Right. Willy shared some details[1] about the memdesc plan, but I cannot
say I fully understand what it means for this patchset.

I guess we will find out :P

[1] https://lore.kernel.org/all/aWF3xg-72SV4tmLk@casper.infradead.org

> >> Background
> >> ==========
> >>
> >> HVO reduces memory overhead by freeing vmemmap pages for HugeTLB pages
> >> and remapping the freed virtual addresses to a single physical page.
> >> Previously, all tail page vmemmap entries were remapped to the first
> >> vmemmap page (containing the head struct page), creating "fake heads" -
> >> tail pages that appear to have PG_head set when accessed through the
> >> deduplicated vmemmap.
> >>
> >> This required special handling in compound_head() to detect and work
> >> around fake heads, adding complexity and overhead to a very hot path.
> >
> > So a very stupid question, why did we remap everything to the first page,
> > and not instead create two pages, where the first one would contain the head
> > and the first batch of tails, and the second one would be used for the rest
> > of the tails? I'd expect it wouldn't make the memory savings that much
> > worse, and eliminate most of the issues?
> 
> I think it was using 2 pages before[1]. The benefit of using one page is:
> “
> It further reduces the overhead of struct
> page by 12.5% for a 2MB HugeTLB compared to the previous approach,
> which means 2GB per 1TB HugeTLB (2MB type).
> “
> 
> [1] https://lore.kernel.org/all/20211101031651.75851-1-songmuchun@bytedance.com/T/#u

Yeah, the 12.5%.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov