Hi,

This RFC explores anonymous large folio swapin when a contiguous swap
range is backed consistently by zswap.

Large folio swapout to zswap is already supported by storing each base
page in the folio as a separate zswap entry. The anonymous synchronous
swapin path has remained order-0 once zswap has ever been enabled:
zswap_load() rejected large folios, and alloc_swap_folio() avoided large
folio allocation to protect against mixed backend ranges.

This RFC keeps the scope intentionally conservative. It does not try to
read one large folio from mixed zswap and disk backends, and it does not
change shmem swapin. Shmem still has its existing zswap fallback and is
left for later discussion. For anonymous swapin, the backend rule is made
explicit:

  - a range fully absent from zswap can keep using the disk backend
  - a range fully present in zswap can be decompressed into a large folio
  - a mixed zswap/non-zswap range falls back to order-0 swapin

The series adds a zswap range query helper, teaches zswap_load() to
decompress all-zswap large folios one base page at a time, accounts mTHP
swpin for zswap-loaded large folios, retries synchronous large-folio
insertion races with order-0 swapin, and removes the anonymous
zswap-never-enabled restriction once mixed ranges are filtered.

I tested the series with a full bzImage build using CONFIG_ZSWAP=y,
CONFIG_ZRAM=y, CONFIG_MEMCG=y and CONFIG_THP_SWAP=y.

The QEMU/KVM runs covered both the fully-zswap path and the mixed-backend
fallback path. In the all-zswap run, a 512MiB anonymous mapping was faulted
as 8192 64KiB groups, reclaimed into zswap, and faulted back. Reclaim
reported mthp64_zswpout=8192 and zswpout=131072. Refault then reported
mthp64_swpin=8192 and zswpin=131072, and pagemap/kpageflags showed 8192
order-4 THP groups in the mapping.

In the mixed-backend run, the workload used a 64MiB anonymous mapping
split into 1024 64KiB groups. After shrinker debugfs wrote back exactly
one zswap base-page entry, refault left 1023 order-4 THP groups and one
order-0 mixed group. The kernel stats matched that shape:
mthp64_swpin=1023, zswpin=16383 and zswpwb=1.

CONFIG_SHRINKER_DEBUG is only a test aid for making that one zswap
writeback deterministic; it is not required by the implementation.

Nhat Pham's active Virtual Swap Space series is adjacent work. It moves
swap cache and zswap entry state into a virtual swap descriptor, and lists
mixed backing THP swapin as a future use case. This RFC is independent and
works with the current swap/zswap infrastructure, but may need rebasing if
VSS lands first.

Feedback would be especially helpful on:

1. whether it makes sense to support all-zswap large folio swapin first,
   while keeping mixed zswap/disk ranges on the order-0 fallback path

2. whether a follow-up for mixed zswap/disk large folio swapin would be
   useful after this RFC

Thanks.

---

fujunjie (5):
  mm: zswap: decompress into a folio subpage
  mm: zswap: add a zswap entry batch helper
  mm: zswap: load fully stored large folios
  mm: swap: fall back to order-0 after large swapin races
  mm: swap: allow zswap-backed large folio swapin

 Documentation/admin-guide/mm/transhuge.rst |   4 +-
 include/linux/zswap.h                      |   9 ++
 mm/memory.c                                |  67 ++++++++-----
 mm/swap_state.c                            |  23 +++--
 mm/zswap.c                                 | 111 ++++++++++++++++-----
 5 files changed, 154 insertions(+), 60 deletions(-)

base-commit: 917719c412c48687d4a176965d1fa35320ec457c
-- 
2.34.1
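
The three-way backend rule and the mixed-backend counters described in
the cover letter can be sketched as a small userspace model. This is an
illustration only, not code from the series: classify_range(),
simulate_mixed_run(), struct swapin_stats and the in_zswap[] array are
hypothetical names. The model assumes 4KiB base pages, that zswpin counts
base pages, and that an order-0 fallback loads each page from whichever
backend currently holds it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SLOTS_PER_GROUP 16   /* 64KiB group / 4KiB base page (order-4) */
#define NR_GROUPS       1024 /* 64MiB mapping / 64KiB group */

/* Hypothetical names; the series' real helper may look different. */
enum range_backend {
	RANGE_ALL_DISK,  /* no slot in zswap: keep using the disk backend */
	RANGE_ALL_ZSWAP, /* every slot in zswap: large folio swapin allowed */
	RANGE_MIXED,     /* some slots in zswap: fall back to order-0 */
};

/* Classify a contiguous swap range by how many slots zswap holds. */
static enum range_backend classify_range(const bool *in_zswap, size_t nr)
{
	size_t present = 0;

	assert(nr > 0);
	for (size_t i = 0; i < nr; i++)
		present += in_zswap[i] ? 1 : 0;

	if (present == 0)
		return RANGE_ALL_DISK;
	if (present == nr)
		return RANGE_ALL_ZSWAP;
	return RANGE_MIXED;
}

struct swapin_stats {
	unsigned long mthp64_swpin; /* large (64KiB) swapins */
	unsigned long zswpin;       /* base pages loaded from zswap */
};

/* Model of the mixed-backend run: one base page written back to disk. */
static struct swapin_stats simulate_mixed_run(void)
{
	static bool in_zswap[NR_GROUPS * SLOTS_PER_GROUP];
	struct swapin_stats st = { 0, 0 };

	for (size_t i = 0; i < NR_GROUPS * SLOTS_PER_GROUP; i++)
		in_zswap[i] = true;
	in_zswap[5] = false; /* arbitrary slot; stands in for zswpwb=1 */

	for (size_t g = 0; g < NR_GROUPS; g++) {
		const bool *range = &in_zswap[g * SLOTS_PER_GROUP];

		switch (classify_range(range, SLOTS_PER_GROUP)) {
		case RANGE_ALL_ZSWAP:
			st.mthp64_swpin++;
			st.zswpin += SLOTS_PER_GROUP;
			break;
		case RANGE_MIXED:
			/* order-0 fallback: count only pages still in zswap */
			for (size_t s = 0; s < SLOTS_PER_GROUP; s++)
				st.zswpin += range[s] ? 1 : 0;
			break;
		case RANGE_ALL_DISK:
			/* disk-only range: no zswap load, not modelled here */
			break;
		}
	}
	return st;
}
```

Under those assumptions, simulate_mixed_run() gives mthp64_swpin=1023 and
zswpin=16383 (1023 * 16 + 15), which is the shape reported for the
mixed-backend run.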

On Fri, May 08, 2026 at 08:18:29PM +0000, fujunjie wrote:
> Hi,
>
> This RFC explores anonymous large folio swapin when a contiguous swap
> range is backed consistently by zswap.
>
> Large folio swapout to zswap is already supported by storing each base
> page in the folio as a separate zswap entry. The anonymous synchronous
> swapin path has remained order-0 once zswap has ever been enabled:
> zswap_load() rejected large folios, and alloc_swap_folio() avoided large
> folio allocation to protect against mixed backend ranges.
>
> This RFC keeps the scope intentionally conservative. It does not try to
> read one large folio from mixed zswap and disk backends, and it does not
> change shmem swapin. Shmem still has its existing zswap fallback and is
> left for later discussion. For anonymous swapin, the backend rule is made
> explicit:
>
>   - a range fully absent from zswap can keep using the disk backend
>   - a range fully present in zswap can be decompressed into a large folio
>   - a mixed zswap/non-zswap range falls back to order-0 swapin
>
> The series adds a zswap range query helper, teaches zswap_load() to
> decompress all-zswap large folios one base page at a time, accounts mTHP
> swpin for zswap-loaded large folios, retries synchronous large-folio
> insertion races with order-0 swapin, and removes the anonymous
> zswap-never-enabled restriction once mixed ranges are filtered.
>
> I tested the series with a full bzImage build using CONFIG_ZSWAP=y,
> CONFIG_ZRAM=y, CONFIG_MEMCG=y and CONFIG_THP_SWAP=y.
>
> The QEMU/KVM runs covered both the fully-zswap path and the mixed-backend
> fallback path. In the all-zswap run, a 512MiB anonymous mapping was faulted
> as 8192 64KiB groups, reclaimed into zswap, and faulted back. Reclaim
> reported mthp64_zswpout=8192 and zswpout=131072. Refault then reported
> mthp64_swpin=8192 and zswpin=131072, and pagemap/kpageflags showed 8192
> order-4 THP groups in the mapping.
>
> In the mixed-backend run, the workload used a 64MiB anonymous mapping
> split into 1024 64KiB groups. After shrinker debugfs wrote back exactly
> one zswap base-page entry, refault left 1023 order-4 THP groups and one
> order-0 mixed group. The kernel stats matched that shape:
> mthp64_swpin=1023, zswpin=16383 and zswpwb=1.
>
> CONFIG_SHRINKER_DEBUG is only a test aid for making that one zswap
> writeback deterministic; it is not required by the implementation.
>
> Nhat Pham's active Virtual Swap Space series is adjacent work. It moves
> swap cache and zswap entry state into a virtual swap descriptor, and lists
> mixed backing THP swapin as a future use case. This RFC is independent and
> works with the current swap/zswap infrastructure, but may need rebasing if
> VSS lands first.
>
> Feedback would be especially helpful on:
>
> 1. whether it makes sense to support all-zswap large folio swapin first,
>    while keeping mixed zswap/disk ranges on the order-0 fallback path

I think so, yes, but based on my read of the code this RFC only affects
synchronous swapin, which is more-or-less zram+zswap. This is an
uncommon setup outside of testing.

> 2. whether a follow-up for mixed zswap/disk large folio swapin would be
>    useful after this RFC

That's a heavier lift and I think we should consider this in the
longer-term, once the virtual swap work settles down. This is
conceptually not a zswap thing, you can have parts of a folio on disk,
in zswap, in the zeromap, etc. So it needs to be handled at a higher
layer (virtual swap for example).

>
> Thanks.

On 5/12/2026 6:13 AM, Yosry Ahmed wrote:
>> Feedback would be especially helpful on:
>>
>> 1. whether it makes sense to support all-zswap large folio swapin first,
>>    while keeping mixed zswap/disk ranges on the order-0 fallback path
>
> I think so, yes, but based on my read of the code this RFC only affects
> synchronous swapin, which is more-or-less zram+zswap. This is an
> uncommon setup outside of testing.
>
>> 2. whether a follow-up for mixed zswap/disk large folio swapin would be
>>    useful after this RFC
>
> That's a heavier lift and I think we should consider this in the
> longer-term, once the virtual swap work settles down. This is
> conceptually not a zswap thing, you can have parts of a folio on disk,
> in zswap, in the zeromap, etc. So it needs to be handled at a higher
> layer (virtual swap for example).
>

Thanks Yosry. That makes sense.

I agree that the mixed zswap/disk/zeromap case is not really
zswap-specific and should be handled at a higher layer, likely after the
virtual swap work settles.

Given the feedback on the swapin path structure and Alexandre's ongoing
work in this area, I will pause this RFC in its current form and follow
those series first.

Thanks

On 5/12/26 00:13, Yosry Ahmed wrote:
> On Fri, May 08, 2026 at 08:18:29PM +0000, fujunjie wrote:
>> Hi,
>>
>> This RFC explores anonymous large folio swapin when a contiguous swap
>> range is backed consistently by zswap.
>>
>> Large folio swapout to zswap is already supported by storing each base
>> page in the folio as a separate zswap entry. The anonymous synchronous
>> swapin path has remained order-0 once zswap has ever been enabled:
>> zswap_load() rejected large folios, and alloc_swap_folio() avoided large
>> folio allocation to protect against mixed backend ranges.
>>
>> This RFC keeps the scope intentionally conservative. It does not try to
>> read one large folio from mixed zswap and disk backends, and it does not
>> change shmem swapin. Shmem still has its existing zswap fallback and is
>> left for later discussion. For anonymous swapin, the backend rule is made
>> explicit:
>>
>>   - a range fully absent from zswap can keep using the disk backend
>>   - a range fully present in zswap can be decompressed into a large folio
>>   - a mixed zswap/non-zswap range falls back to order-0 swapin
>>
>> The series adds a zswap range query helper, teaches zswap_load() to
>> decompress all-zswap large folios one base page at a time, accounts mTHP
>> swpin for zswap-loaded large folios, retries synchronous large-folio
>> insertion races with order-0 swapin, and removes the anonymous
>> zswap-never-enabled restriction once mixed ranges are filtered.
>>
>> I tested the series with a full bzImage build using CONFIG_ZSWAP=y,
>> CONFIG_ZRAM=y, CONFIG_MEMCG=y and CONFIG_THP_SWAP=y.
>>
>> The QEMU/KVM runs covered both the fully-zswap path and the mixed-backend
>> fallback path. In the all-zswap run, a 512MiB anonymous mapping was faulted
>> as 8192 64KiB groups, reclaimed into zswap, and faulted back. Reclaim
>> reported mthp64_zswpout=8192 and zswpout=131072. Refault then reported
>> mthp64_swpin=8192 and zswpin=131072, and pagemap/kpageflags showed 8192
>> order-4 THP groups in the mapping.
>>
>> In the mixed-backend run, the workload used a 64MiB anonymous mapping
>> split into 1024 64KiB groups. After shrinker debugfs wrote back exactly
>> one zswap base-page entry, refault left 1023 order-4 THP groups and one
>> order-0 mixed group. The kernel stats matched that shape:
>> mthp64_swpin=1023, zswpin=16383 and zswpwb=1.
>>
>> CONFIG_SHRINKER_DEBUG is only a test aid for making that one zswap
>> writeback deterministic; it is not required by the implementation.
>>
>> Nhat Pham's active Virtual Swap Space series is adjacent work. It moves
>> swap cache and zswap entry state into a virtual swap descriptor, and lists
>> mixed backing THP swapin as a future use case. This RFC is independent and
>> works with the current swap/zswap infrastructure, but may need rebasing if
>> VSS lands first.
>>
>> Feedback would be especially helpful on:
>>
>> 1. whether it makes sense to support all-zswap large folio swapin first,
>>    while keeping mixed zswap/disk ranges on the order-0 fallback path
>
> I think so, yes, but based on my read of the code this RFC only affects
> synchronous swapin, which is more-or-less zram+zswap. This is an
> uncommon setup outside of testing.

BLK_FEAT_SYNCHRONOUS is also set for pmem and brd devices I think, but that's
also pretty uncommon I assume. Well, maybe if your hypervisor provides you with
an emulated NVDIMM to use as swap backend ... maybe.

I thought there were other ways to get BLK_FEAT_SYNCHRONOUS set, but I don't see
other usage.

So seeing it for zswap is pretty rare I assume.

-- 
Cheers,

David

> >> Feedback would be especially helpful on:
> >>
> >> 1. whether it makes sense to support all-zswap large folio swapin first,
> >>    while keeping mixed zswap/disk ranges on the order-0 fallback path
> >
> > I think so, yes, but based on my read of the code this RFC only affects
> > synchronous swapin, which is more-or-less zram+zswap. This is an
> > uncommon setup outside of testing.
>
> BLK_FEAT_SYNCHRONOUS is also set for pmem and brd devices I think, but that's
> also pretty uncommon I assume. Well, maybe if your hypervisor provides you with
> an emulated NVDIMM to use as swap backend ... maybe.

Yeah, I said "more-or-less" to capture pmem/brd/etc :P

> I thought there were other ways to get BLK_FEAT_SYNCHRONOUS set, but I don't see
> other usage.
>
> So seeing it for zswap is pretty rare I assume.

Yeah that's my understanding as well.