mm: support zswap-backed anonymous large folio swapin

[RFC PATCH 0/5] mm: support zswap-backed anonymous large folio swapin

Posted by fujunjie 1 month ago

Hi,

This RFC explores anonymous large folio swapin when a contiguous swap
range is backed consistently by zswap.

Large folio swapout to zswap is already supported by storing each base
page in the folio as a separate zswap entry. The anonymous synchronous
swapin path, however, has been limited to order-0 once zswap has ever
been enabled: zswap_load() rejects large folios, and alloc_swap_folio()
avoids large folio allocation to protect against mixed backend ranges.

This RFC keeps the scope intentionally conservative. It does not try to
read one large folio from mixed zswap and disk backends, and it does not
change shmem swapin. Shmem still has its existing zswap fallback and is
left for later discussion. For anonymous swapin, the backend rule is made
explicit:

- a range fully absent from zswap can keep using the disk backend
- a range fully present in zswap can be decompressed into a large folio
- a mixed zswap/non-zswap range falls back to order-0 swapin
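
The three-way rule above can be modeled in userspace as a minimal
sketch. All names here (enum range_backing, classify_range(),
swapin_order()) are hypothetical and are not the helpers added by the
series; the bool array is an assumption that stands in for a per-slot
zswap lookup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names throughout: this models the rule, not the patches. */
enum range_backing {
	RANGE_NONE_IN_ZSWAP,	/* every slot is absent from zswap */
	RANGE_ALL_IN_ZSWAP,	/* every slot is present in zswap */
	RANGE_MIXED,		/* some slots present, some absent */
};

/* in_zswap[i] tells whether slot i of the swap range has a zswap entry. */
static enum range_backing classify_range(const bool *in_zswap, size_t nr_pages)
{
	size_t present = 0;

	assert(nr_pages > 0);
	for (size_t i = 0; i < nr_pages; i++)
		present += in_zswap[i];

	if (present == 0)
		return RANGE_NONE_IN_ZSWAP;
	if (present == nr_pages)
		return RANGE_ALL_IN_ZSWAP;
	return RANGE_MIXED;
}

/* Keep the requested order unless the range is mixed, then use order-0. */
static unsigned int swapin_order(const bool *in_zswap, unsigned int order)
{
	size_t nr_pages = (size_t)1 << order;

	return classify_range(in_zswap, nr_pages) == RANGE_MIXED ? 0 : order;
}
```

In this model, swapin_order() keeps order 4 for a 16-slot range that is
entirely in zswap or entirely outside it, and returns 0 as soon as a
single slot differs from the rest.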

The series adds a zswap range query helper, teaches zswap_load() to
decompress all-zswap large folios one base page at a time, accounts mTHP
swpin for zswap-loaded large folios, retries synchronous large-folio
insertion races with order-0 swapin, and removes the anonymous
zswap-never-enabled restriction once mixed ranges are filtered.

I tested the series with a full bzImage build using CONFIG_ZSWAP=y,
CONFIG_ZRAM=y, CONFIG_MEMCG=y and CONFIG_THP_SWAP=y.

The QEMU/KVM runs covered both the fully-zswap path and the mixed-backend
fallback path. In the all-zswap run, a 512MiB anonymous mapping was faulted
as 8192 64KiB groups, reclaimed into zswap, and faulted back. Reclaim
reported mthp64_zswpout=8192 and zswpout=131072. Refault then reported
mthp64_swpin=8192 and zswpin=131072, and pagemap/kpageflags showed 8192
order-4 THP groups in the mapping.

In the mixed-backend run, the workload used a 64MiB anonymous mapping
split into 1024 64KiB groups. After exactly one zswap base-page entry was
written back through shrinker debugfs, refault left 1023 order-4 THP
groups and one mixed group that fell back to order-0. The kernel stats
matched that shape: mthp64_swpin=1023, zswpin=16383 and zswpwb=1.
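
The counters reported for both runs can be cross-checked with a small
sketch. It assumes 4KiB base pages (so an order-4 group is 64KiB) and
that zswpin counts base pages; the helper names are hypothetical and
exist only for this check.

```c
#include <assert.h>

#define KIB		1024UL
#define MIB		(1024UL * KIB)
#define BASE_PAGE	(4 * KIB)	/* assumption: 4KiB base pages */
#define GROUP		(64 * KIB)	/* one order-4 group */

/* Number of 64KiB groups in a mapping of the given size. */
static unsigned long groups_in(unsigned long mapping_bytes)
{
	return mapping_bytes / GROUP;
}

/* Base pages loaded from zswap when some pages were written back to disk. */
static unsigned long expected_zswpin(unsigned long groups,
				     unsigned long written_back)
{
	return groups * (GROUP / BASE_PAGE) - written_back;
}

/* Groups that can still be swapped in as one large folio. */
static unsigned long large_swpin_groups(unsigned long groups,
					unsigned long mixed_groups)
{
	return groups - mixed_groups;
}

/* Recompute the numbers quoted for the two QEMU/KVM runs. */
static void check_reported_counters(void)
{
	/* all-zswap run: 512MiB mapping */
	assert(groups_in(512 * MIB) == 8192);
	assert(expected_zswpin(8192, 0) == 131072);

	/* mixed run: 64MiB mapping, one base page written back */
	assert(groups_in(64 * MIB) == 1024);
	assert(large_swpin_groups(1024, 1) == 1023);
	assert(expected_zswpin(1024, 1) == 16383);
}
```

Under these assumptions the mixed group contributes 15 zswap loads and
one disk read, which is why zswpin is 1023 * 16 + 15 = 16383.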

CONFIG_SHRINKER_DEBUG is only a test aid for making that one zswap
writeback deterministic; it is not required by the implementation.

Nhat Pham's active Virtual Swap Space series is adjacent work. It moves
swap cache and zswap entry state into a virtual swap descriptor, and lists
mixed backing THP swapin as a future use case. This RFC is independent and
works with the current swap/zswap infrastructure, but may need rebasing if
VSS lands first.

Feedback would be especially helpful on:

1. whether it makes sense to support all-zswap large folio swapin first,
   while keeping mixed zswap/disk ranges on the order-0 fallback path
2. whether a follow-up for mixed zswap/disk large folio swapin would be
   useful after this RFC

Thanks.

---

fujunjie (5):
  mm: zswap: decompress into a folio subpage
  mm: zswap: add a zswap entry batch helper
  mm: zswap: load fully stored large folios
  mm: swap: fall back to order-0 after large swapin races
  mm: swap: allow zswap-backed large folio swapin

 Documentation/admin-guide/mm/transhuge.rst |   4 +-
 include/linux/zswap.h                      |   9 ++
 mm/memory.c                                |  67 ++++++++-----
 mm/swap_state.c                            |  23 +++--
 mm/zswap.c                                 | 111 ++++++++++++++++-----
 5 files changed, 154 insertions(+), 60 deletions(-)


base-commit: 917719c412c48687d4a176965d1fa35320ec457c
-- 
2.34.1

Re: [RFC PATCH 0/5] mm: support zswap-backed anonymous large folio swapin

Posted by Yosry Ahmed 1 month ago

On Fri, May 08, 2026 at 08:18:29PM +0000, fujunjie wrote:
> Feedback would be especially helpful on:
> 
> 1. whether it makes sense to support all-zswap large folio swapin first,
>    while keeping mixed zswap/disk ranges on the order-0 fallback path

I think so, yes, but based on my read of the code this RFC only affects
synchronous swapin, which is more-or-less zram+zswap. This is an
uncommon setup outside of testing.

> 2. whether a follow-up for mixed zswap/disk large folio swapin would be
>    useful after this RFC

That's a heavier lift and I think we should consider this in the
longer-term, once the virtual swap work settles down. This is
conceptually not a zswap thing, you can have parts of a folio on disk,
in zswap, in the zeromap, etc. So it needs to be handled at a higher
layer (virtual swap for example).

Re: [RFC PATCH 0/5] mm: support zswap-backed anonymous large folio swapin

Posted by Fujunjie 1 month ago


On 5/12/2026 6:13 AM, Yosry Ahmed wrote:
>> Feedback would be especially helpful on:
>>
>> 1. whether it makes sense to support all-zswap large folio swapin first,
>>    while keeping mixed zswap/disk ranges on the order-0 fallback path
> 
> I think so, yes, but based on my read of the code this RFC only affects
> synchronous swapin, which is more-or-less zram+zswap. This is an
> uncommon setup outside of testing.
> 
>> 2. whether a follow-up for mixed zswap/disk large folio swapin would be
>>    useful after this RFC
> 
> That's a heavier lift and I think we should consider this in the
> longer-term, once the virtual swap work settles down. This is
> conceptually not a zswap thing, you can have parts of a folio on disk,
> in zswap, in the zeromap, etc. So it needs to be handled at a higher
> layer (virtual swap for example).
>
Thanks Yosry.

That makes sense. I agree that the mixed zswap/disk/zeromap case is not
really zswap-specific and should be handled at a higher layer, likely
after the virtual swap work settles.

Given the feedback on the swapin path structure and Alexandre's ongoing
work in this area, I will pause this RFC in its current form and follow
those series first.

Thanks

Re: [RFC PATCH 0/5] mm: support zswap-backed anonymous large folio swapin

Posted by David Hildenbrand (Arm) 1 month ago

On 5/12/26 00:13, Yosry Ahmed wrote:
> On Fri, May 08, 2026 at 08:18:29PM +0000, fujunjie wrote:
>> Feedback would be especially helpful on:
>>
>> 1. whether it makes sense to support all-zswap large folio swapin first,
>>    while keeping mixed zswap/disk ranges on the order-0 fallback path
> 
> I think so, yes, but based on my read of the code this RFC only affects
> synchronous swapin, which is more-or-less zram+zswap. This is an
> uncommon setup outside of testing.

BLK_FEAT_SYNCHRONOUS is also set for pmem and brd devices I think, but that's
also pretty uncommon I assume. Well, maybe if your hypervisor provides you with
an emulated NVDIMM to use as swap backend ... maybe.

I thought there were other ways to get BLK_FEAT_SYNCHRONOUS set, but I don't see
other usage.

So seeing it for zswap is pretty rare I assume.

-- 
Cheers,

David

Re: [RFC PATCH 0/5] mm: support zswap-backed anonymous large folio swapin

Posted by Yosry Ahmed 1 month ago

> >> Feedback would be especially helpful on:
> >>
> >> 1. whether it makes sense to support all-zswap large folio swapin first,
> >>    while keeping mixed zswap/disk ranges on the order-0 fallback path
> >
> > I think so, yes, but based on my read of the code this RFC only affects
> > synchronous swapin, which is more-or-less zram+zswap. This is an
> > uncommon setup outside of testing.
>
> BLK_FEAT_SYNCHRONOUS is also set for pmem and brd devices I think, but that's
> also pretty uncommon I assume. Well, maybe if your hypervisor provides you with
> an emulated NVDIMM to use as swap backend ... maybe.

Yeah, I said "more-or-less" to capture pmem/brd/etc :P

> I thought there were other ways to get BLK_FEAT_SYNCHRONOUS set, but I don't see
> other usage.
>
> So seeing it for zswap is pretty rare I assume.

Yeah that's my understanding as well.