[v1] mm: PMD-level swap entries for anonymous THPs

[PATCH 00/13] mm: PMD-level swap entries for anonymous THPs

Posted by Usama Arif 1 month, 2 weeks ago

When reclaim swaps out a PMD-mapped anonymous THP today, the PMD is
split into 512 PTE-level swap entries via TTU_SPLIT_HUGE_PMD before
unmap.

This series introduces a PMD-level swap entry. The huge mapping is
preserved across the swap round-trip, and do_huge_pmd_swap_page()
resolves the entire 2 MB region in a single fault on swap-in, with
no khugepaged involvement needed. swap_map metadata is identical
either way (512 single-slot counts), so the PTE split buys nothing
on the swap side; it is purely a page-table representation change.

This work was brought about after Hugh reported that one of the
major blockers for having lazy page table deposit is the lack of
PMD swap entries [1]. However, this series has benefits of its
own:
- The huge mapping is restored on swap-in.  Today even when the
  folio is still in swap cache as a single 2 MB folio, the swap-in
  path installs 512 PTE mappings -- the PMD mapping is gone, the
  freshly-materialised PTE table sticks around, and only
  khugepaged can later collapse the range back into a THP.
  do_huge_pmd_swap_page() reinstalls the PMD mapping directly in
  one fault, with no khugepaged involvement.
- Memory saved per swapped-out THP *once lazy page table deposit is
  merged* [2]. With lazy page table deposit [2], splitting a PMD into
  512 PTE swap entries forces allocation of a 4 KB PTE table page.
  The new path leaves the pgtable hierarchy at PMD level and avoids
  that allocation entirely.
  This will save memory when swapping, which typically happens under
  memory pressure, exactly when allocations are most likely to
  fail.
- Walkers (zap, mprotect, smaps, pagemap, soft-dirty, uffd-wp)
  visit one PMD entry instead of 512 PTEs, reducing traversal
  time and lock-hold windows.

The swap entry value is identical to 512 PTE swap entries (same
type, same starting offset), so swap_map refcounting is unchanged.
Only the page-table representation differs; the swap slot allocator,
swap I/O, and swap cache are untouched.  The new path falls back to
the existing PTE-split path whenever a PMD-order resource is
unavailable: zswap enabled, non-contiguous swap allocation
(THP_SWPOUT_FALLBACK), PMD-order folio allocation failure on swap-in
or fork, racing folio split, or rmap-driven split on a swapcache
folio.  Walkers that previously assumed every non-present PMD encodes
a PFN (migration / device_private) are taught to recognise PMD swap
entries.
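
To illustrate the "same type, same starting offset" relationship described
above, here is a minimal userspace sketch. It is not the kernel's swp_entry_t
encoding: struct model_swp_entry, MODEL_PTES_PER_PMD and both helper functions
are hypothetical names invented for this example, and 4 KB base pages are
assumed (so one 2 MB PMD covers 512 PTEs).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; not the kernel's swp_entry_t layout. */
#define MODEL_PTES_PER_PMD 512	/* 2 MB / 4 KB, assuming 4 KB base pages */

struct model_swp_entry {
	unsigned int type;	/* which swap device */
	unsigned long offset;	/* slot index within that device */
};

/*
 * Expand one PMD-level entry into the per-PTE entries a split would
 * install: same type, consecutive offsets from the PMD entry's offset.
 */
static void model_split_pmd_swap_entry(struct model_swp_entry pmd_entry,
				       struct model_swp_entry *out)
{
	assert(out != NULL);
	for (size_t i = 0; i < MODEL_PTES_PER_PMD; i++) {
		out[i].type = pmd_entry.type;
		out[i].offset = pmd_entry.offset + i;
	}
}

/* Return 1 if the 512 PTE entries describe one contiguous run of slots. */
static int model_ptes_form_pmd_entry(const struct model_swp_entry *ptes)
{
	for (size_t i = 1; i < MODEL_PTES_PER_PMD; i++) {
		if (ptes[i].type != ptes[0].type ||
		    ptes[i].offset != ptes[0].offset + i)
			return 0;
	}
	return 1;
}
```

In this model the split direction always succeeds because the slots were
contiguous at swap-out time; only the page-table representation changes.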

Patch breakdown:

The series is ordered to preserve git bisectability: every consumer
of a PMD swap entry (split, fork, swapoff, walkers, UFFDIO_MOVE,
swap-in fault) lands before the producer.  The swap-out path that
actually installs PMD swap entries is the very last functional patch
(12), so no intermediate commit can leave the kernel handling a
PMD swap entry it does not yet understand.

The first 4 patches are preparatory patches. Some of them (like the
softleaf_to_pmd() change in patch 1) are not strictly needed, but are
done to improve code quality and so that the PMD swap entry changes
look well integrated with the rest of mm.

Prep patches:
  1. mm: add softleaf_to_pmd() and convert existing callers
     PMD counterpart to softleaf_to_pte(); needed to construct a
     PMD from a swap entry in later patches.
  2. mm: extract ensure_on_mmlist() helper
     Hoists the "register mm with swapoff" double-checked-locking
     pattern out of try_to_unmap_one() / copy_nonpresent_pte() so
     the PMD swap-out and PMD fork paths can reuse it without a
     third open-coded copy.
  3. fs/proc: use softleaf_has_pfn() in pagemap PMD walker
     pagemap_pmd_range_thp() today calls softleaf_to_page()
     unconditionally; a PMD swap entry has no PFN and would crash
     it.
  4. mm/huge_memory: move softleaf_to_folio() inside migration branch
     change_non_present_huge_pmd() today calls softleaf_to_folio()
     before branching on entry type, so a PMD swap entry would
     produce a bogus folio pointer that the migration-only code
     below would then dereference.
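
The hazard behind patches 3 and 4 can be sketched in userspace as follows.
This is only a model: enum model_leaf_type, struct model_leaf and the two
functions are hypothetical stand-ins, loosely mirroring the role that
softleaf_has_pfn() plays in the series. The point is that a swap entry's
payload is a slot offset, so it must never be converted to a PFN, page, or
folio without checking the entry type first.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of non-present PMD entry kinds. */
enum model_leaf_type {
	MODEL_LEAF_SWAP,		/* payload is a swap slot offset */
	MODEL_LEAF_MIGRATION,		/* payload is a PFN */
	MODEL_LEAF_DEVICE_PRIVATE,	/* payload is a PFN */
};

struct model_leaf {
	enum model_leaf_type type;
	unsigned long payload;
};

/* Only migration and device-private entries encode a PFN in this model. */
static int model_leaf_has_pfn(struct model_leaf entry)
{
	return entry.type == MODEL_LEAF_MIGRATION ||
	       entry.type == MODEL_LEAF_DEVICE_PRIVATE;
}

/*
 * Guarded conversion: returns 1 and stores the PFN when the entry has one,
 * returns 0 and leaves *pfn untouched for a swap entry.
 */
static int model_leaf_to_pfn(struct model_leaf entry, unsigned long *pfn)
{
	assert(pfn != NULL);
	if (!model_leaf_has_pfn(entry))
		return 0;
	*pfn = entry.payload;
	return 1;
}
```

A caller that skips the check would interpret a slot offset as a PFN, which
is the bogus-pointer case the two prep patches are described as avoiding.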

Core patches:
  5. PMD swap entry detection (pmd_is_swap_entry,
     softleaf_is_valid_pmd_entry) and per-arch pmd_swp_*exclusive
     helpers (x86/arm64/s390/riscv/loongarch).
  6. __split_huge_pmd_locked() learns to split a PMD swap entry
     into 512 PTE swap entries, used as the fallback when a
     PMD-order resource is unavailable.
  7. Fork: copy_huge_non_present_pmd() duplicates the PMD swap entry
     in one folio_dup_swap() call, with GFP_KERNEL retry mirroring
     copy_pte_range().
  8. Swapoff: unuse_pmd() reads the whole 2 MB folio and reinstalls
     the PMD; falls back to PTE-split + unuse_pte_range() on error.
  9. Walker updates: zap_huge_pmd, change_huge_pmd,
     change_non_present_huge_pmd, move_soft_dirty_pmd,
     clear_soft_dirty_pmd, make_uffd_wp_pmd, smaps_pmd_entry,
     queue_folios_pmd (mempolicy), check_pmd_state (khugepaged),
     and the madvise_cold_or_pageout_pte_range / madvise_free_huge_pmd
     VM_BUG_ON extensions.
 10. UFFDIO_MOVE: move_pages_huge_pmd() learns to move a PMD swap
     entry whole via a new move_swap_pmd() helper modeled on
     move_swap_pte().
 11. Swap-in: do_huge_pmd_swap_page() resolves a PMD swap fault in
     one shot.  Handles racing splits, SWP_STABLE_WRITES read-only
     mapping, immediate COW for write faults; falls back to PTE-split
     on any PMD-order resource shortfall.
 12. Swap-out: shrink_folio_list() drops TTU_SPLIT_HUGE_PMD for
     PMD-mappable swapcache folios (when zswap is disabled), and
     try_to_unmap_one() installs one PMD swap entry via
     set_pmd_swap_entry() instead of splitting.

Testing:
 13. selftests/mm: 12 tests covering swap-out/in, fork, fork+COW,
     repeated cycles, write fault, munmap, mprotect, mremap, pagemap,
     MADV_FREE, UFFDIO_MOVE, swapoff.

Making PMD swap entries work with zswap is a project of its own and
should be in a separate follow-up series.

The patches are on top of mm-unstable from 23 April
(2bcc13c29c711381d815c1ba5d5b25737400c71a).

[1] https://lore.kernel.org/all/6869b7f0-84e1-fb93-03f1-9442cdfe476b@google.com/
[2] https://lore.kernel.org/all/20260327021403.214713-1-usama.arif@linux.dev/

Usama Arif (13):
  mm: add softleaf_to_pmd() and convert existing callers
  mm: extract ensure_on_mmlist() helper
  fs/proc: use softleaf_has_pfn() in pagemap PMD walker
  mm/huge_memory: move softleaf_to_folio() inside migration branch
  mm: add PMD swap entry detection support
  mm: add PMD swap entry splitting support
  mm: handle PMD swap entries in fork path
  mm: swap in PMD swap entries as whole THPs during swapoff
  mm: handle PMD swap entries in non-present PMD walkers
  mm: handle PMD swap entries in UFFDIO_MOVE
  mm: handle PMD swap entry faults on swap-in
  mm: install PMD swap entries on swap-out
  selftests/mm: add PMD swap entry tests

 arch/arm64/include/asm/pgtable.h      |   4 +
 arch/loongarch/include/asm/pgtable.h  |  17 +
 arch/riscv/include/asm/pgtable.h      |  15 +
 arch/s390/include/asm/pgtable.h       |  15 +
 arch/x86/include/asm/pgtable.h        |  15 +
 fs/proc/task_mmu.c                    |  47 +-
 include/linux/huge_mm.h               |  11 +
 include/linux/leafops.h               |  44 +-
 include/linux/swap.h                  |   4 +-
 include/linux/vm_event_item.h         |   1 +
 mm/hmm.c                              |   3 +-
 mm/huge_memory.c                      | 540 +++++++++++++++++++++--
 mm/internal.h                         |  49 +++
 mm/khugepaged.c                       |   6 +
 mm/madvise.c                          |   5 +-
 mm/memory.c                           |  51 +--
 mm/mempolicy.c                        |   2 +
 mm/rmap.c                             |  27 +-
 mm/swap.h                             |   7 +
 mm/swap_state.c                       |  35 ++
 mm/swapfile.c                         | 144 +++++-
 mm/vmscan.c                           |  14 +-
 mm/vmstat.c                           |   1 +
 tools/testing/selftests/mm/Makefile   |   1 +
 tools/testing/selftests/mm/pmd_swap.c | 607 ++++++++++++++++++++++++++
 25 files changed, 1554 insertions(+), 111 deletions(-)
 create mode 100644 tools/testing/selftests/mm/pmd_swap.c

-- 
2.52.0

Re: [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs

Posted by Usama Arif 1 month, 2 weeks ago


On 27/04/2026 11:01, Usama Arif wrote:
> When reclaim swaps out a PMD-mapped anonymous THP today, the PMD is
> split into 512 PTE-level swap entries via TTU_SPLIT_HUGE_PMD before
> unmap.
> 
> This series introduces a PMD-level swap entry. The huge mapping is
> preserved across the swap round-trip, and do_huge_pmd_swap_page()
> resolves the entire 2 MB region in a single fault on swap-in,
> no khugepaged involvement is needed. swap_map metadata is identical
> either way (512 single-slot counts), so the PTE split buys nothing
> on the swap side, it is purely a page-table representation change.
> 
> This work was brought about after Hugh reported that one of the
> major blockers for having lazy page table deposit is the lack of
> PMD swap entries [1]. However, this series has benefits of its
> own:


+Hugh. Hugh raised this in [1], and I completely forgot to add him
to the series, sorry about that!

[1] https://lore.kernel.org/all/6869b7f0-84e1-fb93-03f1-9442cdfe476b@google.com/

Re: [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs

Posted by Kairui Song 1 month, 2 weeks ago

On Mon, Apr 27, 2026 at 6:09 PM Usama Arif <usama.arif@linux.dev> wrote:
>
> When reclaim swaps out a PMD-mapped anonymous THP today, the PMD is
> split into 512 PTE-level swap entries via TTU_SPLIT_HUGE_PMD before
> unmap.
>
> This series introduces a PMD-level swap entry. The huge mapping is
> preserved across the swap round-trip, and do_huge_pmd_swap_page()
> resolves the entire 2 MB region in a single fault on swap-in,

Hi Usama,

Thanks for the work!

> no khugepaged involvement is needed. swap_map metadata is identical

swap_map is gone, metadata is still per slot but with PMD sized
swapout, I think soon we can store a swp_tb entry directly in
ci->table (make it a union maybe) so the metadata is significantly
reduced from there too. Better do that later with cluster compaction.

> Core patches:
>   5. PMD swap entry detection (pmd_is_swap_entry,
>      softleaf_is_valid_pmd_entry) and per-arch pmd_swp_*exclusive
>      helpers (x86/arm64/s390/riscv/loongarch).
>   6. __split_huge_pmd_locked() learns to split a PMD swap entry
>      into 512 PTE swap entries, used as the fallback when a
>      PMD-order resource is unavailable.
>   7. Fork: copy_huge_non_present_pmd() duplicates the PMD swap entry
>      in one folio_dup_swap() call, with GFP_KERNEL retry mirroring
>      copy_pte_range().
>   8. Swapoff: unuse_pmd() reads the whole 2 MB folio and reinstalls
>      the PMD; falls back to PTE-split + unuse_pte_range() on error.

There is a slight conflict with the swap folio allocation unification,
which should be easy to solve. Just a little heads-up, check the
swap_cache_alloc_folio helper here:
https://lore.kernel.org/linux-mm/20260421-swap-table-p4-v3-4-2f23759a76bc@tencent.com/

We will be able to directly allocate 2M folios using
swap_cache_alloc_folio(orders = BIT(PMD_ORDER)) in the patch link
above. Might even help to avoid issues with splitting or raced swapin?
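
For readers unfamiliar with order bitmasks, the idea of an "orders" argument
can be sketched in userspace as below. This is not swap_cache_alloc_folio():
MODEL_BIT, MODEL_PMD_ORDER and model_pick_order() are hypothetical, PMD_ORDER
is assumed to be 9 (4 KB pages, 2 MB PMD), and real allocation is replaced by
a "largest order currently available" parameter.

```c
#include <assert.h>

#define MODEL_BIT(n)	(1UL << (n))
#define MODEL_PMD_ORDER	9	/* assumption: 4 KB pages, 2 MB PMD */

/*
 * Pick the highest order that is both requested in the mask and no larger
 * than max_available_order. Returns -1 if no requested order fits, which
 * models "no implicit order-0 fallback unless the caller asked for it".
 */
static int model_pick_order(unsigned long orders, int max_available_order)
{
	int top = (int)(sizeof(orders) * 8) - 1;

	assert(max_available_order >= 0);
	for (int order = top; order >= 0; order--) {
		if (!(orders & MODEL_BIT(order)))
			continue;
		if (order <= max_available_order)
			return order;
	}
	return -1;
}
```

With a mask of only MODEL_BIT(MODEL_PMD_ORDER) the model fails outright when
no 2 MB folio is available, whereas adding MODEL_BIT(0) to the mask lets the
caller opt in to a single-page fallback.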

The conflict can be solved from either side, I'll update that series to
disable the forced order 0 fallback and let caller pass in (orders =
<mTHP order> | BIT(0)) instead.

Re: [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs

Posted by Usama Arif 1 month, 2 weeks ago


On 29/04/2026 11:44, Kairui Song wrote:
> On Mon, Apr 27, 2026 at 6:09 PM Usama Arif <usama.arif@linux.dev> wrote:
>>
>> When reclaim swaps out a PMD-mapped anonymous THP today, the PMD is
>> split into 512 PTE-level swap entries via TTU_SPLIT_HUGE_PMD before
>> unmap.
>>
>> This series introduces a PMD-level swap entry. The huge mapping is
>> preserved across the swap round-trip, and do_huge_pmd_swap_page()
>> resolves the entire 2 MB region in a single fault on swap-in,
> 
> Hi Usama,
> 
> Thanks for the work!
> 
>> no khugepaged involvement is needed. swap_map metadata is identical
> 
> swap_map is gone, metadata is still per slot but with PMD sized
> swapout, I think soon we can store a swp_tb entry directly in
> ci->table (make it a union maybe) so the metadata is significantly
> reduced from there too. Better do that later with cluster compaction.
> 
>> Core patches:
>>   5. PMD swap entry detection (pmd_is_swap_entry,
>>      softleaf_is_valid_pmd_entry) and per-arch pmd_swp_*exclusive
>>      helpers (x86/arm64/s390/riscv/loongarch).
>>   6. __split_huge_pmd_locked() learns to split a PMD swap entry
>>      into 512 PTE swap entries, used as the fallback when a
>>      PMD-order resource is unavailable.
>>   7. Fork: copy_huge_non_present_pmd() duplicates the PMD swap entry
>>      in one folio_dup_swap() call, with GFP_KERNEL retry mirroring
>>      copy_pte_range().
>>   8. Swapoff: unuse_pmd() reads the whole 2 MB folio and reinstalls
>>      the PMD; falls back to PTE-split + unuse_pte_range() on error.
> 
> There is a slight conflict with the swap folio allocation unification,
> which should be easy to solve. Just a little head up, check the
> swap_cache_alloc_folio helper here:
> https://lore.kernel.org/linux-mm/20260421-swap-table-p4-v3-4-2f23759a76bc@tencent.com/
> 
> We will be able to directly allocate 2M folios using
> swap_cache_alloc_folio(orders = BIT(PMD_ORDER)) in the patch link
> above. Might even help to avoid issues with splitting or raced swapin?

Oh yeah, I like your swap_cache_alloc_folio a lot more than
swapin_alloc_pmd_folio.

> The conflict can be solved from either side, I'll update that series to
> disable the forced order 0 fallback and let caller pass in (orders =
> <mTHP order> | BIT(0)) instead.

Yes, that would be great. We don't want order 0 fallback in the 2 cases
where we fail in this series.

Thanks!

Re: [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs

Posted by David Hildenbrand (Arm) 1 month, 2 weeks ago

On 4/27/26 12:01, Usama Arif wrote:
> When reclaim swaps out a PMD-mapped anonymous THP today, the PMD is
> split into 512 PTE-level swap entries via TTU_SPLIT_HUGE_PMD before
> unmap.
> 
> This series introduces a PMD-level swap entry. The huge mapping is
> preserved across the swap round-trip, and do_huge_pmd_swap_page()
> resolves the entire 2 MB region in a single fault on swap-in,
> no khugepaged involvement is needed. swap_map metadata is identical
> either way (512 single-slot counts), so the PTE split buys nothing
> on the swap side, it is purely a page-table representation change.
> 
> This work was brought about after Hugh reported that one of the
> major blockers for having lazy page table deposit is the lack of
> PMD swap entries [1]. However, this series has benefits of its
> own:
> - The huge mapping is restored on swap-in.  Today even when the
>   folio is still in swap cache as a single 2 MB folio, the swap-in
>   path installs 512 PTE mappings -- the PMD mapping is gone, the
>   freshly-materialised PTE table sticks around, and only
>   khugepaged can later collapse the range back into a THP.
>   do_huge_pmd_swap_page() reinstalls the PMD mapping directly in
>   one fault, no khugepaged involvement.

Ack, that's nice.

> - Memory saved per swapped-out THP *once lazy page table deposit is
>   merged* [2]. With lazy page table deposit [2], splitting a PMD into
>   512 PTE swap entries forces allocation of a 4 KB PTE table page.
>   The new path leaves the pgtable hierarchy at PMD level and avoids
>   that allocation entirely.
>   This will save memory when swapping, which is likely when there is
>   memory pressure and exactly when allocations are most likely to
>   fail.

Also ack.

> - Walkers (zap, mprotect, smaps, pagemap, soft-dirty, uffd-wp)
>   visit one PMD entry instead of 512 PTEs, reducing traversal
>   time and lock-hold windows.

Right.

> 
> The swap entry value is identical to 512 PTE swap entries (same
> type, same starting offset), so swap_map refcounting is unchanged.
> Only the page-table representation differs; the swap slot allocator,
> swap I/O, and swap cache are untouched.  The new path falls back to
> the existing PTE-split path whenever a PMD-order resource is
> unavailable: zswap enabled, non-contiguous swap allocation
> (THP_SWPOUT_FALLBACK), PMD-order folio allocation failure on swap-in
> or fork, racing folio split, or rmap-driven split on a swapcache
> folio.  Walkers that previously assumed every non-present PMD encodes
> a PFN (migration / device_private) are taught to recognise PMD swap
> entries.

All sounds nice. I'll get to review this soon. LSF/MM and travel will slow me
down a bit in May :(

-- 
Cheers,

David

Re: [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs

Posted by Usama Arif 1 month, 2 weeks ago


On 28/04/2026 20:54, David Hildenbrand (Arm) wrote:
> On 4/27/26 12:01, Usama Arif wrote:
>> When reclaim swaps out a PMD-mapped anonymous THP today, the PMD is
>> split into 512 PTE-level swap entries via TTU_SPLIT_HUGE_PMD before
>> unmap.
>>
>> This series introduces a PMD-level swap entry. The huge mapping is
>> preserved across the swap round-trip, and do_huge_pmd_swap_page()
>> resolves the entire 2 MB region in a single fault on swap-in,
>> no khugepaged involvement is needed. swap_map metadata is identical
>> either way (512 single-slot counts), so the PTE split buys nothing
>> on the swap side, it is purely a page-table representation change.
>>
>> This work was brought about after Hugh reported that one of the
>> major blockers for having lazy page table deposit is the lack of
>> PMD swap entries [1]. However, this series has benefits of its
>> own:
>> - The huge mapping is restored on swap-in.  Today even when the
>>   folio is still in swap cache as a single 2 MB folio, the swap-in
>>   path installs 512 PTE mappings -- the PMD mapping is gone, the
>>   freshly-materialised PTE table sticks around, and only
>>   khugepaged can later collapse the range back into a THP.
>>   do_huge_pmd_swap_page() reinstalls the PMD mapping directly in
>>   one fault, no khugepaged involvement.
> 
> Ack, that's nice.
> 
>> - Memory saved per swapped-out THP *once lazy page table deposit is
>>   merged* [2]. With lazy page table deposit [2], splitting a PMD into
>>   512 PTE swap entries forces allocation of a 4 KB PTE table page.
>>   The new path leaves the pgtable hierarchy at PMD level and avoids
>>   that allocation entirely.
>>   This will save memory when swapping, which is likely when there is
>>   memory pressure and exactly when allocations are most likely to
>>   fail.
> 
> Also ack.
> 
>> - Walkers (zap, mprotect, smaps, pagemap, soft-dirty, uffd-wp)
>>   visit one PMD entry instead of 512 PTEs, reducing traversal
>>   time and lock-hold windows.
> 
> Right.
> 
>>
>> The swap entry value is identical to 512 PTE swap entries (same
>> type, same starting offset), so swap_map refcounting is unchanged.
>> Only the page-table representation differs; the swap slot allocator,
>> swap I/O, and swap cache are untouched.  The new path falls back to
>> the existing PTE-split path whenever a PMD-order resource is
>> unavailable: zswap enabled, non-contiguous swap allocation
>> (THP_SWPOUT_FALLBACK), PMD-order folio allocation failure on swap-in
>> or fork, racing folio split, or rmap-driven split on a swapcache
>> folio.  Walkers that previously assumed every non-present PMD encodes
>> a PFN (migration / device_private) are taught to recognise PMD swap
>> entries.
> 
> All sounds nice. I'll get to review this soon. LSF/MM and travel will slow me a
> bit down in May :(
> 

Thanks! Appreciate it!

Re: [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs

Posted by Lorenzo Stoakes 1 month, 2 weeks ago

On Wed, Apr 29, 2026 at 10:39:23AM +0100, Usama Arif wrote:
>
>
> On 28/04/2026 20:54, David Hildenbrand (Arm) wrote:
> > On 4/27/26 12:01, Usama Arif wrote:
> >> When reclaim swaps out a PMD-mapped anonymous THP today, the PMD is
> >> split into 512 PTE-level swap entries via TTU_SPLIT_HUGE_PMD before
> >> unmap.
> >>
> >> This series introduces a PMD-level swap entry. The huge mapping is
> >> preserved across the swap round-trip, and do_huge_pmd_swap_page()
> >> resolves the entire 2 MB region in a single fault on swap-in,
> >> no khugepaged involvement is needed. swap_map metadata is identical
> >> either way (512 single-slot counts), so the PTE split buys nothing
> >> on the swap side, it is purely a page-table representation change.
> >>
> >> This work was brought about after Hugh reported that one of the
> >> major blockers for having lazy page table deposit is the lack of
> >> PMD swap entries [1]. However, this series has benefits of its
> >> own:
> >> - The huge mapping is restored on swap-in.  Today even when the
> >>   folio is still in swap cache as a single 2 MB folio, the swap-in
> >>   path installs 512 PTE mappings -- the PMD mapping is gone, the
> >>   freshly-materialised PTE table sticks around, and only
> >>   khugepaged can later collapse the range back into a THP.
> >>   do_huge_pmd_swap_page() reinstalls the PMD mapping directly in
> >>   one fault, no khugepaged involvement.
> >
> > Ack, that's nice.
> >
> >> - Memory saved per swapped-out THP *once lazy page table deposit is
> >>   merged* [2]. With lazy page table deposit [2], splitting a PMD into
> >>   512 PTE swap entries forces allocation of a 4 KB PTE table page.
> >>   The new path leaves the pgtable hierarchy at PMD level and avoids
> >>   that allocation entirely.
> >>   This will save memory when swapping, which is likely when there is
> >>   memory pressure and exactly when allocations are most likely to
> >>   fail.
> >
> > Also ack.
> >
> >> - Walkers (zap, mprotect, smaps, pagemap, soft-dirty, uffd-wp)
> >>   visit one PMD entry instead of 512 PTEs, reducing traversal
> >>   time and lock-hold windows.
> >
> > Right.
> >
> >>
> >> The swap entry value is identical to 512 PTE swap entries (same
> >> type, same starting offset), so swap_map refcounting is unchanged.
> >> Only the page-table representation differs; the swap slot allocator,
> >> swap I/O, and swap cache are untouched.  The new path falls back to
> >> the existing PTE-split path whenever a PMD-order resource is
> >> unavailable: zswap enabled, non-contiguous swap allocation
> >> (THP_SWPOUT_FALLBACK), PMD-order folio allocation failure on swap-in
> >> or fork, racing folio split, or rmap-driven split on a swapcache
> >> folio.  Walkers that previously assumed every non-present PMD encodes
> >> a PFN (migration / device_private) are taught to recognise PMD swap
> >> entries.
> >
> > All sounds nice. I'll get to review this soon. LSF/MM and travel will slow me a
> > bit down in May :(
> >
>
> Thanks! Appreciate it!
>

My email is a disaster right now, various other stuff + lately working hard on
the thing-I'm-going-to-talk-about-at-LSF and the-slides-for-that has left me
with only backlog but... :) will want to have a look post-LSF also. But May is
likely to be slow for me also.

Cheers, Lorenzo

Re: [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs

Posted by Zi Yan 1 month, 2 weeks ago

+Ying, who did the original THP swap work[1].

[1] https://lkml.org/lkml/2016/8/9/588

On 27 Apr 2026, at 6:01, Usama Arif wrote:

> When reclaim swaps out a PMD-mapped anonymous THP today, the PMD is
> split into 512 PTE-level swap entries via TTU_SPLIT_HUGE_PMD before
> unmap.
>
> This series introduces a PMD-level swap entry. The huge mapping is
> preserved across the swap round-trip, and do_huge_pmd_swap_page()
> resolves the entire 2 MB region in a single fault on swap-in,
> no khugepaged involvement is needed. swap_map metadata is identical
> either way (512 single-slot counts), so the PTE split buys nothing
> on the swap side, it is purely a page-table representation change.
>
> This work was brought about after Hugh reported that one of the
> major blockers for having lazy page table deposit is the lack of
> PMD swap entries [1]. However, this series has benefits of its
> own:
> - The huge mapping is restored on swap-in.  Today even when the
>   folio is still in swap cache as a single 2 MB folio, the swap-in
>   path installs 512 PTE mappings -- the PMD mapping is gone, the
>   freshly-materialised PTE table sticks around, and only
>   khugepaged can later collapse the range back into a THP.
>   do_huge_pmd_swap_page() reinstalls the PMD mapping directly in
>   one fault, no khugepaged involvement.
> - Memory saved per swapped-out THP *once lazy page table deposit is
>   merged* [2]. With lazy page table deposit [2], splitting a PMD into
>   512 PTE swap entries forces allocation of a 4 KB PTE table page.
>   The new path leaves the pgtable hierarchy at PMD level and avoids
>   that allocation entirely.
>   This will save memory when swapping, which is likely when there is
>   memory pressure and exactly when allocations are most likely to
>   fail.
> - Walkers (zap, mprotect, smaps, pagemap, soft-dirty, uffd-wp)
>   visit one PMD entry instead of 512 PTEs, reducing traversal
>   time and lock-hold windows.
>
> The swap entry value is identical to 512 PTE swap entries (same
> type, same starting offset), so swap_map refcounting is unchanged.
> Only the page-table representation differs; the swap slot allocator,
> swap I/O, and swap cache are untouched.  The new path falls back to
> the existing PTE-split path whenever a PMD-order resource is
> unavailable: zswap enabled, non-contiguous swap allocation
> (THP_SWPOUT_FALLBACK), PMD-order folio allocation failure on swap-in
> or fork, racing folio split, or rmap-driven split on a swapcache
> folio.  Walkers that previously assumed every non-present PMD encodes
> a PFN (migration / device_private) are taught to recognise PMD swap
> entries.
>
> Patch breakdown:
>
> The series is ordered to preserve git bisectability: every consumer
> of a PMD swap entry (split, fork, swapoff, walkers, UFFDIO_MOVE,
> swap-in fault) lands before the producer.  The swap-out path that
> actually installs PMD swap entries is the very last functional patch
> (12), so no intermediate commit can leave the kernel handling a
> PMD swap entry it does not yet understand.
>
> The first 4 patches are preparatory patches. Some of them (like
> softleaf_to_pmd() change in patch 1) are not exactly needed but its
> done to hopefully improve code quality and so that the PMD swap
> entry changes look well integrated with the rest of mm.
>
> Prep patches:
>   1. mm: add softleaf_to_pmd() and convert existing callers
>      PMD counterpart to softleaf_to_pte(); needed to construct a
>      PMD from a swap entry in later patches.
>   2. mm: extract ensure_on_mmlist() helper
>      Hoists the "register mm with swapoff" double-checked-locking
>      pattern out of try_to_unmap_one() / copy_nonpresent_pte() so
>      the PMD swap-out and PMD fork paths can reuse it without a
>      third open-coded copy.
>   3. fs/proc: use softleaf_has_pfn() in pagemap PMD walker
>      pagemap_pmd_range_thp() today calls softleaf_to_page()
>      unconditionally; a PMD swap entry has no PFN and would crash
>      it.
>   4. mm/huge_memory: move softleaf_to_folio() inside migration branch
>      change_non_present_huge_pmd() today calls softleaf_to_folio()
>      before branching on entry type, so a PMD swap entry would
>      produce a bogus folio pointer that the migration-only code
>      below would then dereference.
>
> Core patches:
>   5. PMD swap entry detection (pmd_is_swap_entry,
>      softleaf_is_valid_pmd_entry) and per-arch pmd_swp_*exclusive
>      helpers (x86/arm64/s390/riscv/loongarch).
>   6. __split_huge_pmd_locked() learns to split a PMD swap entry
>      into 512 PTE swap entries, used as the fallback when a
>      PMD-order resource is unavailable.

I was wondering how to handle insufficient memory during swap-in.
Here it is. I have not read the code, but the split should be
straightforward, since we already have a contiguous swap space at
swap-out time and the split is just to enable PTE-level swap in, right?

>   7. Fork: copy_huge_non_present_pmd() duplicates the PMD swap entry
>      in one folio_dup_swap() call, with GFP_KERNEL retry mirroring
>      copy_pte_range().
>   8. Swapoff: unuse_pmd() reads the whole 2 MB folio and reinstalls
>      the PMD; falls back to PTE-split + unuse_pte_range() on error.
>   9. Walker updates: zap_huge_pmd, change_huge_pmd,
>      change_non_present_huge_pmd, move_soft_dirty_pmd,
>      clear_soft_dirty_pmd, make_uffd_wp_pmd, smaps_pmd_entry,
>      queue_folios_pmd (mempolicy), check_pmd_state (khugepaged),
>      and the madvise_cold_or_pageout_pte_range / madvise_free_huge_pmd
>      VM_BUG_ON extensions.
>  10. UFFDIO_MOVE: move_pages_huge_pmd() learns to move a PMD swap
>      entry whole via a new move_swap_pmd() helper modeled on
>      move_swap_pte().
>  11. Swap-in: do_huge_pmd_swap_page() resolves a PMD swap fault in
>      one shot.  Handles racing splits, SWP_STABLE_WRITES read-only
>      mapping, immediate COW for write faults; falls back to PTE-split
>      on any PMD-order resource shortfall.
>  12. Swap-out: shrink_folio_list() drops TTU_SPLIT_HUGE_PMD for
>      PMD-mappable swapcache folios (when zswap is disabled), and
>      try_to_unmap_one() installs one PMD swap entry via
>      set_pmd_swap_entry() instead of splitting.
>
> Testing:
>  13. selftests/mm: 12 tests covering swap-out/in, fork, fork+COW,
>      repeated cycles, write fault, munmap, mprotect, mremap, pagemap,
>      MADV_FREE, UFFDIO_MOVE, swapoff.
>
> Making PMD swap entries work with zswap is another project on its own and
> should be in a separate follow up series.
>
> The patches are on top of mm-unstable from 23 April
> (2bcc13c29c711381d815c1ba5d5b25737400c71a).
>
> [1] https://lore.kernel.org/all/6869b7f0-84e1-fb93-03f1-9442cdfe476b@google.com/
> [2] https://lore.kernel.org/all/20260327021403.214713-1-usama.arif@linux.dev/
>
> Usama Arif (13):
>   mm: add softleaf_to_pmd() and convert existing callers
>   mm: extract ensure_on_mmlist() helper
>   fs/proc: use softleaf_has_pfn() in pagemap PMD walker
>   mm/huge_memory: move softleaf_to_folio() inside migration branch
>   mm: add PMD swap entry detection support
>   mm: add PMD swap entry splitting support
>   mm: handle PMD swap entries in fork path
>   mm: swap in PMD swap entries as whole THPs during swapoff
>   mm: handle PMD swap entries in non-present PMD walkers
>   mm: handle PMD swap entries in UFFDIO_MOVE
>   mm: handle PMD swap entry faults on swap-in
>   mm: install PMD swap entries on swap-out
>   selftests/mm: add PMD swap entry tests
>
>  arch/arm64/include/asm/pgtable.h      |   4 +
>  arch/loongarch/include/asm/pgtable.h  |  17 +
>  arch/riscv/include/asm/pgtable.h      |  15 +
>  arch/s390/include/asm/pgtable.h       |  15 +
>  arch/x86/include/asm/pgtable.h        |  15 +
>  fs/proc/task_mmu.c                    |  47 +-
>  include/linux/huge_mm.h               |  11 +
>  include/linux/leafops.h               |  44 +-
>  include/linux/swap.h                  |   4 +-
>  include/linux/vm_event_item.h         |   1 +
>  mm/hmm.c                              |   3 +-
>  mm/huge_memory.c                      | 540 +++++++++++++++++++++--
>  mm/internal.h                         |  49 +++
>  mm/khugepaged.c                       |   6 +
>  mm/madvise.c                          |   5 +-
>  mm/memory.c                           |  51 +--
>  mm/mempolicy.c                        |   2 +
>  mm/rmap.c                             |  27 +-
>  mm/swap.h                             |   7 +
>  mm/swap_state.c                       |  35 ++
>  mm/swapfile.c                         | 144 +++++-
>  mm/vmscan.c                           |  14 +-
>  mm/vmstat.c                           |   1 +
>  tools/testing/selftests/mm/Makefile   |   1 +
>  tools/testing/selftests/mm/pmd_swap.c | 607 ++++++++++++++++++++++++++
>  25 files changed, 1554 insertions(+), 111 deletions(-)
>  create mode 100644 tools/testing/selftests/mm/pmd_swap.c
>
> -- 
> 2.52.0


Best Regards,
Yan, Zi

Re: [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs

Posted by Usama Arif 1 month, 2 weeks ago


On 27/04/2026 19:26, Zi Yan wrote:
> +Ying, who did the original THP swap work[1].
> 
> [1] https://lkml.org/lkml/2016/8/9/588
> 

Thanks Zi!

Sorry Ying for not CCing you! checkpatch on the whole series produced
a really long list and I wasn't sure if people would start thinking of
it as spam. I added reviewers and maintainers of swap and THP, plus a few
folks who commented on the previous related work that this kicked off from.
I should have just CC'ed everyone.

> On 27 Apr 2026, at 6:01, Usama Arif wrote:
> 
>> When reclaim swaps out a PMD-mapped anonymous THP today, the PMD is
>> split into 512 PTE-level swap entries via TTU_SPLIT_HUGE_PMD before
>> unmap.
>>
>> This series introduces a PMD-level swap entry. The huge mapping is
>> preserved across the swap round-trip, and do_huge_pmd_swap_page()
>> resolves the entire 2 MB region in a single fault on swap-in,
>> no khugepaged involvement is needed. swap_map metadata is identical
>> either way (512 single-slot counts), so the PTE split buys nothing
>> on the swap side, it is purely a page-table representation change.
>>
>> This work was brought about after Hugh reported that one of the
>> major blockers for having lazy page table deposit is the lack of
>> PMD swap entries [1]. However, this series has benefits of its
>> own:
>> - The huge mapping is restored on swap-in.  Today even when the
>>   folio is still in swap cache as a single 2 MB folio, the swap-in
>>   path installs 512 PTE mappings -- the PMD mapping is gone, the
>>   freshly-materialised PTE table sticks around, and only
>>   khugepaged can later collapse the range back into a THP.
>>   do_huge_pmd_swap_page() reinstalls the PMD mapping directly in
>>   one fault, no khugepaged involvement.
>> - Memory saved per swapped-out THP *once lazy page table deposit is
>>   merged* [2]. With lazy page table deposit [2], splitting a PMD into
>>   512 PTE swap entries forces allocation of a 4 KB PTE table page.
>>   The new path leaves the pgtable hierarchy at PMD level and avoids
>>   that allocation entirely.
>>   This will save memory when swapping, which is likely when there is
>>   memory pressure and exactly when allocations are most likely to
>>   fail.
>> - Walkers (zap, mprotect, smaps, pagemap, soft-dirty, uffd-wp)
>>   visit one PMD entry instead of 512 PTEs, reducing traversal
>>   time and lock-hold windows.
>>
>> The swap entry value is identical to 512 PTE swap entries (same
>> type, same starting offset), so swap_map refcounting is unchanged.
>> Only the page-table representation differs; the swap slot allocator,
>> swap I/O, and swap cache are untouched.  The new path falls back to
>> the existing PTE-split path whenever a PMD-order resource is
>> unavailable: zswap enabled, non-contiguous swap allocation
>> (THP_SWPOUT_FALLBACK), PMD-order folio allocation failure on swap-in
>> or fork, racing folio split, or rmap-driven split on a swapcache
>> folio.  Walkers that previously assumed every non-present PMD encodes
>> a PFN (migration / device_private) are taught to recognise PMD swap
>> entries.
>>
>> Patch breakdown:
>>
>> The series is ordered to preserve git bisectability: every consumer
>> of a PMD swap entry (split, fork, swapoff, walkers, UFFDIO_MOVE,
>> swap-in fault) lands before the producer.  The swap-out path that
>> actually installs PMD swap entries is the very last functional patch
>> (12), so no intermediate commit can leave the kernel handling a
>> PMD swap entry it does not yet understand.
>>
>> The first 4 patches are preparatory patches. Some of them (like the
>> softleaf_to_pmd() change in patch 1) are not strictly needed, but are
>> done to hopefully improve code quality and so that the PMD swap
>> entry changes look well integrated with the rest of mm.
>>
>> Prep patches:
>>   1. mm: add softleaf_to_pmd() and convert existing callers
>>      PMD counterpart to softleaf_to_pte(); needed to construct a
>>      PMD from a swap entry in later patches.
>>   2. mm: extract ensure_on_mmlist() helper
>>      Hoists the "register mm with swapoff" double-checked-locking
>>      pattern out of try_to_unmap_one() / copy_nonpresent_pte() so
>>      the PMD swap-out and PMD fork paths can reuse it without a
>>      third open-coded copy.
>>   3. fs/proc: use softleaf_has_pfn() in pagemap PMD walker
>>      pagemap_pmd_range_thp() today calls softleaf_to_page()
>>      unconditionally; a PMD swap entry has no PFN and would crash
>>      it.
>>   4. mm/huge_memory: move softleaf_to_folio() inside migration branch
>>      change_non_present_huge_pmd() today calls softleaf_to_folio()
>>      before branching on entry type, so a PMD swap entry would
>>      produce a bogus folio pointer that the migration-only code
>>      below would then dereference.
>>
>> Core patches:
>>   5. PMD swap entry detection (pmd_is_swap_entry,
>>      softleaf_is_valid_pmd_entry) and per-arch pmd_swp_*exclusive
>>      helpers (x86/arm64/s390/riscv/loongarch).
>>   6. __split_huge_pmd_locked() learns to split a PMD swap entry
>>      into 512 PTE swap entries, used as the fallback when a
>>      PMD-order resource is unavailable.
> 
> I was wondering how to handle insufficient memory during swap-in.
> Here it is. I have not read the code, but the split should be
> straightforward, since we already have a contiguous swap space at
> swap-out time and the split is just to enable PTE-level swap in, right?
> 

Yes, that is correct. Actually, patch 6 was one of the easier patches.
If the kernel can't allocate 2M, the memcg charge fails, or for a few
other reasons, we split the PMD swap entry (patch 6).
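
To illustrate why the split is simple, here is a minimal userspace sketch
of what it amounts to on the swap-entry side, assuming 512 PTEs per PMD
(4 KB base pages, 2 MB PMD). The toy_* names are hypothetical and are not
from the series; the real work happens in __split_huge_pmd_locked(), and
this sketch does not model locking, the PTE table, or any per-entry flag
bits. It only shows that one PMD-level entry maps to 512 PTE-level entries
of the same type at consecutive offsets.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 2 MB PMD over 4 KB base pages, so 512 PTEs per PMD. */
#define TOY_HPAGE_PMD_NR 512

/* Toy stand-in for a swap entry: a swap type plus a slot offset. */
struct toy_swp_entry {
	unsigned int type;
	uint64_t offset;
};

/*
 * Model of the fallback split: one PMD-level entry becomes
 * TOY_HPAGE_PMD_NR PTE-level entries with the same type and
 * consecutive offsets. No swap count is touched here, because the
 * cover letter states that swap_map already holds 512 single-slot
 * counts either way.
 */
static void toy_split_pmd_swap_entry(struct toy_swp_entry pmd_entry,
				     struct toy_swp_entry ptes[TOY_HPAGE_PMD_NR])
{
	for (int i = 0; i < TOY_HPAGE_PMD_NR; i++) {
		ptes[i].type = pmd_entry.type;
		ptes[i].offset = pmd_entry.offset + (uint64_t)i;
	}
}

/* Check that a PTE array is exactly the split of the given PMD entry. */
static bool toy_split_is_consistent(struct toy_swp_entry pmd_entry,
				    const struct toy_swp_entry ptes[TOY_HPAGE_PMD_NR])
{
	for (int i = 0; i < TOY_HPAGE_PMD_NR; i++) {
		if (ptes[i].type != pmd_entry.type)
			return false;
		if (ptes[i].offset != pmd_entry.offset + (uint64_t)i)
			return false;
	}
	return true;
}
```

After such a split, each of the 512 entries can be faulted in on its own
through the existing PTE swap-in path, which is the point of the fallback.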


>>   7. Fork: copy_huge_non_present_pmd() duplicates the PMD swap entry
>>      in one folio_dup_swap() call, with GFP_KERNEL retry mirroring
>>      copy_pte_range().
>>   8. Swapoff: unuse_pmd() reads the whole 2 MB folio and reinstalls
>>      the PMD; falls back to PTE-split + unuse_pte_range() on error.
>>   9. Walker updates: zap_huge_pmd, change_huge_pmd,
>>      change_non_present_huge_pmd, move_soft_dirty_pmd,
>>      clear_soft_dirty_pmd, make_uffd_wp_pmd, smaps_pmd_entry,
>>      queue_folios_pmd (mempolicy), check_pmd_state (khugepaged),
>>      and the madvise_cold_or_pageout_pte_range / madvise_free_huge_pmd
>>      VM_BUG_ON extensions.
>>  10. UFFDIO_MOVE: move_pages_huge_pmd() learns to move a PMD swap
>>      entry whole via a new move_swap_pmd() helper modeled on
>>      move_swap_pte().
>>  11. Swap-in: do_huge_pmd_swap_page() resolves a PMD swap fault in
>>      one shot.  Handles racing splits, SWP_STABLE_WRITES read-only
>>      mapping, immediate COW for write faults; falls back to PTE-split
>>      on any PMD-order resource shortfall.
>>  12. Swap-out: shrink_folio_list() drops TTU_SPLIT_HUGE_PMD for
>>      PMD-mappable swapcache folios (when zswap is disabled), and
>>      try_to_unmap_one() installs one PMD swap entry via
>>      set_pmd_swap_entry() instead of splitting.
>>
>> Testing:
>>  13. selftests/mm: 12 tests covering swap-out/in, fork, fork+COW,
>>      repeated cycles, write fault, munmap, mprotect, mremap, pagemap,
>>      MADV_FREE, UFFDIO_MOVE, swapoff.
>>
>> Making PMD swap entries work with zswap is another project on its own and
>> should be in a separate follow up series.
>>
>> The patches are on top of mm-unstable from 23 April
>> (2bcc13c29c711381d815c1ba5d5b25737400c71a).
>>
>> [1] https://lore.kernel.org/all/6869b7f0-84e1-fb93-03f1-9442cdfe476b@google.com/
>> [2] https://lore.kernel.org/all/20260327021403.214713-1-usama.arif@linux.dev/
>>
> 
> 
> Best Regards,
> Yan, Zi

Re: [PATCH 00/13] mm: PMD-level swap entries for anonymous THPs

Posted by Zi Yan 1 month, 2 weeks ago

On 27 Apr 2026, at 16:12, Usama Arif wrote:

> On 27/04/2026 19:26, Zi Yan wrote:
>> +Ying, who did the original THP swap work[1].
>>
>> [1] https://lkml.org/lkml/2016/8/9/588
>>
>
> Thanks Zi!
>
> Sorry Ying for not CCing you! checkpatch on the whole series produced
> a really long list and I wasn't sure if people would start thinking of
> it as spam. I added reviewers and maintainers of swap and THP, plus a few
> folks who commented on the previous related work that this kicked off from.
> I should have just CC'ed everyone.
>
>> On 27 Apr 2026, at 6:01, Usama Arif wrote:
>>
>>> When reclaim swaps out a PMD-mapped anonymous THP today, the PMD is
>>> split into 512 PTE-level swap entries via TTU_SPLIT_HUGE_PMD before
>>> unmap.
>>>
>>> This series introduces a PMD-level swap entry. The huge mapping is
>>> preserved across the swap round-trip, and do_huge_pmd_swap_page()
>>> resolves the entire 2 MB region in a single fault on swap-in,
>>> no khugepaged involvement is needed. swap_map metadata is identical
>>> either way (512 single-slot counts), so the PTE split buys nothing
>>> on the swap side, it is purely a page-table representation change.
>>>
>>> This work was brought about after Hugh reported that one of the
>>> major blockers for having lazy page table deposit is the lack of
>>> PMD swap entries [1]. However, this series has benefits of its
>>> own:
>>> - The huge mapping is restored on swap-in.  Today even when the
>>>   folio is still in swap cache as a single 2 MB folio, the swap-in
>>>   path installs 512 PTE mappings -- the PMD mapping is gone, the
>>>   freshly-materialised PTE table sticks around, and only
>>>   khugepaged can later collapse the range back into a THP.
>>>   do_huge_pmd_swap_page() reinstalls the PMD mapping directly in
>>>   one fault, no khugepaged involvement.
>>> - Memory saved per swapped-out THP *once lazy page table deposit is
>>>   merged* [2]. With lazy page table deposit [2], splitting a PMD into
>>>   512 PTE swap entries forces allocation of a 4 KB PTE table page.
>>>   The new path leaves the pgtable hierarchy at PMD level and avoids
>>>   that allocation entirely.
>>>   This will save memory when swapping, which is likely when there is
>>>   memory pressure and exactly when allocations are most likely to
>>>   fail.
>>> - Walkers (zap, mprotect, smaps, pagemap, soft-dirty, uffd-wp)
>>>   visit one PMD entry instead of 512 PTEs, reducing traversal
>>>   time and lock-hold windows.
>>>
>>> The swap entry value is identical to 512 PTE swap entries (same
>>> type, same starting offset), so swap_map refcounting is unchanged.
>>> Only the page-table representation differs; the swap slot allocator,
>>> swap I/O, and swap cache are untouched.  The new path falls back to
>>> the existing PTE-split path whenever a PMD-order resource is
>>> unavailable: zswap enabled, non-contiguous swap allocation
>>> (THP_SWPOUT_FALLBACK), PMD-order folio allocation failure on swap-in
>>> or fork, racing folio split, or rmap-driven split on a swapcache
>>> folio.  Walkers that previously assumed every non-present PMD encodes
>>> a PFN (migration / device_private) are taught to recognise PMD swap
>>> entries.
>>>
>>> Patch breakdown:
>>>
>>> The series is ordered to preserve git bisectability: every consumer
>>> of a PMD swap entry (split, fork, swapoff, walkers, UFFDIO_MOVE,
>>> swap-in fault) lands before the producer.  The swap-out path that
>>> actually installs PMD swap entries is the very last functional patch
>>> (12), so no intermediate commit can leave the kernel handling a
>>> PMD swap entry it does not yet understand.
>>>
>>> The first 4 patches are preparatory patches. Some of them (like the
>>> softleaf_to_pmd() change in patch 1) are not strictly needed, but are
>>> done to hopefully improve code quality and so that the PMD swap
>>> entry changes look well integrated with the rest of mm.
>>>
>>> Prep patches:
>>>   1. mm: add softleaf_to_pmd() and convert existing callers
>>>      PMD counterpart to softleaf_to_pte(); needed to construct a
>>>      PMD from a swap entry in later patches.
>>>   2. mm: extract ensure_on_mmlist() helper
>>>      Hoists the "register mm with swapoff" double-checked-locking
>>>      pattern out of try_to_unmap_one() / copy_nonpresent_pte() so
>>>      the PMD swap-out and PMD fork paths can reuse it without a
>>>      third open-coded copy.
>>>   3. fs/proc: use softleaf_has_pfn() in pagemap PMD walker
>>>      pagemap_pmd_range_thp() today calls softleaf_to_page()
>>>      unconditionally; a PMD swap entry has no PFN and would crash
>>>      it.
>>>   4. mm/huge_memory: move softleaf_to_folio() inside migration branch
>>>      change_non_present_huge_pmd() today calls softleaf_to_folio()
>>>      before branching on entry type, so a PMD swap entry would
>>>      produce a bogus folio pointer that the migration-only code
>>>      below would then dereference.
>>>
>>> Core patches:
>>>   5. PMD swap entry detection (pmd_is_swap_entry,
>>>      softleaf_is_valid_pmd_entry) and per-arch pmd_swp_*exclusive
>>>      helpers (x86/arm64/s390/riscv/loongarch).
>>>   6. __split_huge_pmd_locked() learns to split a PMD swap entry
>>>      into 512 PTE swap entries, used as the fallback when a
>>>      PMD-order resource is unavailable.
>>
>> I was wondering how to handle insufficient memory during swap-in.
>> Here it is. I have not read the code, but the split should be
>> straightforward, since we already have a contiguous swap space at
>> swap-out time and the split is just to enable PTE-level swap in, right?
>>
>
> Yes, that is correct. Actually, patch 6 was one of the easier patches.
> If the kernel can't allocate 2M, the memcg charge fails, or for a few
> other reasons, we split the PMD swap entry (patch 6).

Thank you for the confirmation. I will be mostly AFK in May and will
probably check the patches later.
>
>
>>>   7. Fork: copy_huge_non_present_pmd() duplicates the PMD swap entry
>>>      in one folio_dup_swap() call, with GFP_KERNEL retry mirroring
>>>      copy_pte_range().
>>>   8. Swapoff: unuse_pmd() reads the whole 2 MB folio and reinstalls
>>>      the PMD; falls back to PTE-split + unuse_pte_range() on error.
>>>   9. Walker updates: zap_huge_pmd, change_huge_pmd,
>>>      change_non_present_huge_pmd, move_soft_dirty_pmd,
>>>      clear_soft_dirty_pmd, make_uffd_wp_pmd, smaps_pmd_entry,
>>>      queue_folios_pmd (mempolicy), check_pmd_state (khugepaged),
>>>      and the madvise_cold_or_pageout_pte_range / madvise_free_huge_pmd
>>>      VM_BUG_ON extensions.
>>>  10. UFFDIO_MOVE: move_pages_huge_pmd() learns to move a PMD swap
>>>      entry whole via a new move_swap_pmd() helper modeled on
>>>      move_swap_pte().
>>>  11. Swap-in: do_huge_pmd_swap_page() resolves a PMD swap fault in
>>>      one shot.  Handles racing splits, SWP_STABLE_WRITES read-only
>>>      mapping, immediate COW for write faults; falls back to PTE-split
>>>      on any PMD-order resource shortfall.
>>>  12. Swap-out: shrink_folio_list() drops TTU_SPLIT_HUGE_PMD for
>>>      PMD-mappable swapcache folios (when zswap is disabled), and
>>>      try_to_unmap_one() installs one PMD swap entry via
>>>      set_pmd_swap_entry() instead of splitting.
>>>
>>> Testing:
>>>  13. selftests/mm: 12 tests covering swap-out/in, fork, fork+COW,
>>>      repeated cycles, write fault, munmap, mprotect, mremap, pagemap,
>>>      MADV_FREE, UFFDIO_MOVE, swapoff.
>>>
>>> Making PMD swap entries work with zswap is another project on its own and
>>> should be in a separate follow up series.
>>>
>>> The patches are on top of mm-unstable from 23 April
>>> (2bcc13c29c711381d815c1ba5d5b25737400c71a).
>>>
>>> [1] https://lore.kernel.org/all/6869b7f0-84e1-fb93-03f1-9442cdfe476b@google.com/
>>> [2] https://lore.kernel.org/all/20260327021403.214713-1-usama.arif@linux.dev/
>>>
>>
>>
>> Best Regards,
>> Yan, Zi


Best Regards,
Yan, Zi