[PATCH v5 00/21] Virtual Swap Space
Posted by Nhat Pham 1 week, 5 days ago
This patch series is based on 6.19. There are a couple more
swap-related changes in mainline that I would need to coordinate
with, but I still want to send this out as an update for the
regressions reported by Kairui Song in [15]. It's probably easier
to just build this thing rather than dig through that series of
emails to get the fix patch :)

Changelog:
* v4 -> v5:
    * Fix a deadlock in memcg1_swapout (reported by syzbot [16]).
    * Replace VM_WARN_ON(!spin_is_locked()) with lockdep_assert_held(),
      and use guard(rcu) in vswap_cpu_dead
      (reported by Peter Zijlstra [17]).
* v3 -> v4:
    * Fix poor swap free batching behavior to alleviate a regression
      (reported by Kairui Song).
    * Fix assorted kernel build errors reported by kernel test robots
      in the case of CONFIG_SWAP=n.
* RFC v2 -> v3:
    * Implement a cluster-based allocation algorithm for virtual swap
      slots, inspired by Kairui Song and Chris Li's implementation, as
      well as Johannes Weiner's suggestions. This eliminates the lock
      contention issues on the virtual swap layer.
    * Re-use swap table for the reverse mapping.
    * Remove CONFIG_VIRTUAL_SWAP.
    * Reduce the size of the swap descriptor from 48 bytes to 24
      bytes, i.e. another 50% reduction in memory overhead from v2.
    * Remove the separate swap cache and zswap tree structures, using
      the swap descriptor for this instead.
    * Remove zeromap, and replace the swap_map bytemap with 2 bitmaps
      (one for allocated slots, and one for bad slots).
    * Rebase on top of 6.19 (7d0a66e4bb9081d75c82ec4957c50034cb0ea449).
    * Update cover letter to include new benchmark results and discussion
      on overhead in various cases.
* RFC v1 -> RFC v2:
    * Use a single atomic type (swap_refs) for reference counting
      purposes. This brings the size of the swap descriptor from 64 B
      down to 48 B (a 25% reduction). Suggested by Yosry Ahmed.
    * The zeromap bitmap is removed in the virtual swap implementation.
      This saves one bit per physical swapfile slot.
    * Rearrange the patches and the code change to make things more
      reviewable. Suggested by Johannes Weiner.
    * Update the cover letter a bit.

This patch series implements the virtual swap space idea, based on Yosry's
proposals at LSFMMBPF 2023 (see [1], [2], [3]), as well as valuable
inputs from Johannes Weiner. The same idea (with different
implementation details) has been floated by Rik van Riel since at least
2011 (see [8]).


I. Motivation

Currently, when an anon page is swapped out, a slot in a backing swap
device is allocated and stored in the page table entries that refer to
the original page. This slot is also used as the "key" to find the
swapped out content, as well as the index into swap data structures,
such as the swap cache or the swap cgroup mapping. Tying a swap entry
to its backing slot in this way is performant and efficient when swap
is purely disk space and swapoff is rare.

However, the advent of many swap optimizations has exposed major
drawbacks of this design. The first problem is that we occupy a physical
slot in the swap space, even for pages that are NEVER expected to hit
the disk: pages compressed and stored in the zswap pool, zero-filled
pages, or pages rejected by both of these optimizations when zswap
writeback is disabled. This is arguably the central shortcoming of
zswap:
* In deployments where no disk space can be afforded for swap (such as
  mobile and embedded devices), users cannot adopt zswap, and are forced
  to use zram. This is confusing for users, and creates extra burdens
  for developers, who have to develop and maintain similar features for
  two separate swap backends (writeback, cgroup charging, THP support,
  etc.). For instance, see the discussion in [4].
* Resource-wise, it is hugely wasteful in terms of disk usage. At Meta,
  we have swapfiles on the order of tens to hundreds of GBs, which are
  mostly unused and only exist to enable the zswap and zero-filled
  page swap optimizations.
* Tying zswap (and more generally, other in-memory swap backends) to
  the current physical swapfile infrastructure makes zswap implicitly
  statically sized. This does not make sense: unlike disk swap, in
  which we consume a limited resource (disk space) to save another
  resource (memory), zswap consumes the same resource it is saving
  (memory). The more we zswap, the more memory we have available,
  not less. We are not rationing a limited resource when we limit
  the size of the zswap pool, but rather capping the resource
  (memory) saving potential of zswap. Under memory pressure, using
  more zswap is almost always better than the alternative (disk IOs, or
  even worse, OOMs), and dynamically sizing the zswap pool on demand
  allows the system to flexibly respond to these precarious scenarios.
* Operationally, statically provisioning the swapfile for zswap poses
  significant challenges, because the sysadmin has to prescribe how
  much swap is needed a priori, for each combination of
  (memory size x disk space x workload usage). It is even more
  complicated when we take into account the variance of memory
  compression, which changes the reclaim dynamics (and as a result,
  the swap space size requirement). The problem is further exacerbated
  for users who rely on swap utilization (and exhaustion) as an OOM
  signal.

  All of these factors make it very difficult to configure the swapfile
  for zswap: too small a swapfile and we risk preventable OOMs and
  limit the memory saving potential of zswap; too big a swapfile and
  we waste disk space and memory on swap metadata overhead. This
  dilemma becomes even more acute on high-memory systems, which can
  have up to TBs worth of memory.

Past attempts to decouple disk and compressed swap backends, namely the
ghost swapfile approach (see [13]), as well as the alternative
compressed swap backend zram, have mainly focused on eliminating the
disk space usage of compressed backends. We want a solution that not
only tackles that same problem, but also achieves dynamic sizing of
the swap space, to maximize the memory saving potential while reducing
operational and static memory overhead.

Finally, any swap redesign should support efficient backend transfer,
i.e. without having to perform the expensive page table walk to
update all the PTEs that refer to the swap entry:
* The main motivation for this requirement is zswap writeback. To quote
  Johannes (from [14]): "Combining compression with disk swap is
  extremely powerful, because it dramatically reduces the worst aspects
  of both: it reduces the memory footprint of compression by shedding
  the coldest data to disk; it reduces the IO latencies and flash wear
  of disk swap through the writeback cache. In practice, this reduces
  *average event rates of the entire reclaim/paging/IO stack*."
* Another motivation is to simplify swapoff, which is both complicated
  and expensive in the current design, precisely because we store an
  encoding of the backend positional information in the page table, so
  removing these references requires a full page table walk.


II. High Level Design Overview

To fix the aforementioned issues, we need an abstraction that separates
a swap entry from its physical backing storage. IOW, we need to
"virtualize" the swap space: swap clients will work with a dynamically
allocated virtual swap slot, storing it in page table entries, and
using it to index into various swap-related data structures. The
backing storage is decoupled from the virtual swap slot, and the newly
introduced layer will "resolve" the virtual swap slot to the actual
storage. This layer also manages other metadata of the swap entry, such
as its lifetime information (swap count), via a dynamically allocated,
per-swap-entry descriptor:

struct swp_desc {
        union {
                swp_slot_t         slot;                 /*     0     8 */
                struct zswap_entry * zswap_entry;        /*     0     8 */
        };                                               /*     0     8 */
        union {
                struct folio *     swap_cache;           /*     8     8 */
                void *             shadow;               /*     8     8 */
        };                                               /*     8     8 */
        unsigned int               swap_count;           /*    16     4 */
        unsigned short             memcgid:16;           /*    20: 0  2 */
        bool                       in_swapcache:1;       /*    22: 0  1 */

        /* Bitfield combined with previous fields */

        enum swap_type             type:2;               /*    20:17  4 */

        /* size: 24, cachelines: 1, members: 6 */
        /* bit_padding: 13 bits */
        /* last cacheline: 24 bytes */
};

(output from pahole).

This design allows us to:
* Decouple zswap (and zeromapped swap entry) from backing swapfile:
  simply associate the virtual swap slot with one of the supported
  backends: a zswap entry, a zero-filled swap page, a slot on the
  swapfile, or an in-memory page.
* Simplify and optimize swapoff: we only have to fault the page in and
  have the virtual swap slot point to the page instead of the on-disk
  physical swap slot. No need to perform any page table walking.
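
As an illustration, swapin could dispatch on the descriptor's 2-bit
type field roughly as follows. This is a sketch only: all enum
constants except VSWAP_ZERO, and the two helpers marked below, are
hypothetical names, not the exact code in the series:

enum swap_type {
	VSWAP_SWAPFILE,		/* desc->slot: a physical swapfile slot */
	VSWAP_ZSWAP,		/* desc->zswap_entry: a compressed object */
	VSWAP_ZERO,		/* zero-filled: no backing storage at all */
	VSWAP_FOLIO,		/* desc->swap_cache: an in-memory page */
};

static void vswap_read_folio(struct swp_desc *desc, struct folio *folio)
{
	switch (desc->type) {
	case VSWAP_ZERO:
		folio_zero_range(folio, 0, folio_size(folio));
		break;
	case VSWAP_ZSWAP:
		/* hypothetical helper: decompress into the folio */
		zswap_load_entry(desc->zswap_entry, folio);
		break;
	case VSWAP_SWAPFILE:
		/* hypothetical helper: submit IO against the slot */
		swap_read_slot(desc->slot, folio);
		break;
	case VSWAP_FOLIO:
		/* content is still in memory (e.g. mid-swapoff) */
		folio_copy(folio, desc->swap_cache);
		break;
	}
}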

The size of the virtual swap descriptor is 24 bytes. Note that this is
not all "new" overhead, as the swap descriptor will replace:
* the swap_cgroup arrays (one per swap type) in the old design, which
  are a massive source of static memory overhead. With the new design,
  metadata is only allocated for used clusters.
* the swap tables, which hold the swap cache and workingset shadows.
* the zeromap bitmap, which is a bitmap of physical swap slots
  indicating whether the swapped out page is zero-filled or not.
* a huge chunk of the swap_map. The swap_map is now replaced by 2
  bitmaps, one for allocated slots and one for bad slots, representing
  the 3 possible states of a slot on the swapfile: allocated, free,
  and bad (see the sketch after this list).
* the zswap tree.
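
For instance, the three slot states can be recovered from the two
bitmaps like so (a sketch with assumed names):

enum slot_state { SLOT_FREE, SLOT_ALLOCATED, SLOT_BAD };

static enum slot_state slot_state(const unsigned long *allocated,
				  const unsigned long *bad,
				  unsigned long slot)
{
	/* "bad" is set only for slots marked unusable at swapon time. */
	if (test_bit(slot, bad))
		return SLOT_BAD;
	return test_bit(slot, allocated) ? SLOT_ALLOCATED : SLOT_FREE;
}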

So, in terms of additional memory overhead:
* For zswap entries, the added memory overhead is rather minimal. The
  new indirection pointer neatly replaces the existing zswap tree.
  We really only incur less than one word of overhead for swap count
  blow up (since we no longer use swap continuation) and the swap type.
* For physical swap entries, the new design imposes fewer than 3 words
  of memory overhead per entry. However, as noted above, this overhead
  only applies to actively used swap entries, whereas in the current
  design the overhead is static (including the swap cgroup array, for
  example).

  The primary victim of this overhead will be zram users. However, as
  zswap now no longer takes up disk space, zram users can consider
  switching to zswap (which, as a bonus, has a lot of useful features
  out of the box, such as cgroup tracking, dynamic zswap pool sizing,
  LRU-ordering writeback, etc.).

For a more concrete example, suppose we have a 32 GB swapfile (i.e.
8,388,608 swap entries), and we use zswap.

0% usage, or 0 entries:
* Old design total overhead: 25.00 MB
* Vswap total overhead: 0.00 MB

25% usage, or 2,097,152 entries:
* Old design total overhead: 57.00 MB
* Vswap total overhead: 48.25 MB

50% usage, or 4,194,304 entries:
* Old design total overhead: 89.00 MB
* Vswap total overhead: 96.50 MB

75% usage, or 6,291,456 entries:
* Old design total overhead: 121.00 MB
* Vswap total overhead: 144.75 MB

100% usage, or 8,388,608 entries:
* Old design total overhead: 153.00 MB
* Vswap total overhead: 193.00 MB

So even in the worst case scenario for virtual swap, i.e. when we
somehow have an oracle to correctly size the swapfile for the zswap
pool to 32 GB, the added overhead is only 40 MB, which is a mere
0.12% of the total swapfile :)
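
(Checking the math, under my reading of the numbers above: the old
design costs ~25 MB statically - 1 B of swap_map, 2 B of swap_cgroup,
and 1 bit of zeromap per slot - plus ~16 B per used entry (swap table
plus zswap tree); vswap costs ~24 B per used entry for the descriptor,
plus roughly one bitmap bit. At 100% usage:

  old:   25 MB + 8,388,608 x 16 B  = 153 MB
  vswap:         8,388,608 x 24.125 B = 193 MB)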

In practice, the overhead will be closer to the 50-75% usage case, as
systems tend to leave swap headroom for pathological events or sudden
spikes in memory requirements. The added overhead in these cases is
practically negligible. And in deployments where swapfiles for zswap
were previously sparsely used, switching over to virtual swap will
actually reduce memory overhead.

Doing the same math for disk swap, which is the worst case for
virtual swap in terms of swap backends:

0% usage, or 0 entries:
* Old design total overhead: 25.00 MB
* Vswap total overhead: 2.00 MB

25% usage, or 2,097,152 entries:
* Old design total overhead: 41.00 MB
* Vswap total overhead: 66.25 MB

50% usage, or 4,194,304 entries:
* Old design total overhead: 57.00 MB
* Vswap total overhead: 130.50 MB

75% usage, or 6,291,456 entries:
* Old design total overhead: 73.00 MB
* Vswap total overhead: 194.75 MB

100% usage, or 8,388,608 entries:
* Old design total overhead: 89.00 MB
* Vswap total overhead: 259.00 MB

The added overhead is 170 MB, which is 0.5% of the total swapfile size,
again in the worst case where we have a sizing oracle.
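
(Same reading for the disk case: old is ~25 MB static + 8 B per used
entry for the swap table; vswap is ~2 MB of static bitmaps + ~32.125 B
per used entry - the 24 B descriptor, an 8 B reverse mapping, and a
bitmap bit. At 100% usage:

  old:   25 MB + 8,388,608 x 8 B      =  89 MB
  vswap:  2 MB + 8,388,608 x 32.125 B = 259 MB)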

Please see the attached patches for more implementation details.


III. Usage and Benchmarking

This patch series introduces no new syscalls or userspace APIs. Existing
userspace setups will work as-is, except we no longer have to create a
swapfile or set memory.swap.max if we want to use zswap, as zswap is no
longer tied to physical swap. The zswap pool will be automatically and
dynamically sized based on memory usage and reclaim dynamics.

To measure the performance of the new implementation, I have run the
following benchmarks:

1. Kernel building: 52 workers (one per processor), memory.max = 3G.

Using zswap as the backend:

Baseline:
real: mean: 164.29s, stdev: 0.53s
user: mean: 5109.06s, stdev: 2.04s
sys: mean: 672.62s, stdev: 30.46s

Vswap:
real: mean: 164.12s, stdev: 0.4s
user: mean: 5105.24s, stdev: 2.01s
sys: mean: 668.66s, stdev: 34.45s

Using SSD swap as the backend:

Baseline:
real: mean: 189.74s, stdev: 2.03s
user: mean: 5035.93s, stdev: 3.1s
sys: mean: 500.01s, stdev: 4.16s

Vswap:
real: mean: 190.18s, stdev: 4.35s
user: mean: 5038.26s, stdev: 7.39s
sys: mean: 497.09s, stdev: 12.3s

The performance is neck and neck for both swap backends, with vswap
slightly edging out in systime. However, the variance is high, so it is
hard to draw a definitive conclusion.

2. Usemem: Per a report from Kairui Song ([15]), I have run the
   following benchmark:

Memory state of the system:

free -m
               total        used        free      shared  buff/cache   available
Mem:           31596        5094       11667          19       15302       26502
Swap:          65535          33       65502

Running the usemem benchmark 5 times and averaging the results. The
invocation (following Kairui's report in [15]) is of the form:
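
    usemem --init-time -O -n 1 56G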

Baseline (6.19):
real: mean: 190.93s, stdev: 5.09s
user: mean: 46.62s, stdev: 0.27s
sys: mean: 128.51s, stdev: 5.17s
throughput: mean: 382093 KB/s, stdev: 11173.6 KB/s
free time: mean: 7916690.2 usecs, stdev: 88923.0 usecs

Vswap:
real: mean: 187.66s, stdev: 5.67s
user: mean: 46.5s, stdev: 0.16s
sys: mean: 125.3s, stdev: 5.58s
throughput: mean: 387506.4 KB/s, stdev: 12556.56 KB/s
free time: mean: 7029733.8 usecs, stdev: 124661.34 usecs


IV. Future Use Cases

While the patch series focuses on two applications (decoupling swap
backends and swapoff optimization/simplification), this new,
future-proof design also allows us to implement new swap features more
easily and efficiently:

* Multi-tier swapping (as mentioned in [5]), with transparent
  transferring (promotion/demotion) of pages across tiers (see [8] and
  [9]). As with swapoff, the old design would require an expensive
  page table walk here.
* Swapfile compaction to alleviate fragmentation (as proposed by Ying
  Huang in [6]).
* Mixed backing THP swapin (see [7]): once the backing stores of a THP
  are pinned down, each range of subpages can be dispatched to the
  appropriate backend swapin handler.
* Swapping a folio out with discontiguous physical swap slots
  (see [10]).
* Zswap writeback optimization: The current architecture pre-reserves
  physical swap space for pages when they enter the zswap pool, giving
  the kernel no flexibility at writeback time. With the virtual swap
  implementation, the backends are decoupled, and physical swap space
  is allocated on-demand at writeback time, at which point we can make
  much smarter decisions: we can batch multiple zswap writeback
  operations into a single IO request, allocating contiguous physical
  swap slots for that request. We can even perform compressed writeback
  (i.e. writing these pages out without decompressing them) (see [12]).


V. References

[1]: https://lore.kernel.org/all/CAJD7tkbCnXJ95Qow_aOjNX6NOMU5ovMSHRC+95U4wtW6cM+puw@mail.gmail.com/
[2]: https://lwn.net/Articles/932077/
[3]: https://www.youtube.com/watch?v=Hwqw_TBGEhg
[4]: https://lore.kernel.org/all/Zqe_Nab-Df1CN7iW@infradead.org/
[5]: https://lore.kernel.org/lkml/CAF8kJuN-4UE0skVHvjUzpGefavkLULMonjgkXUZSBVJrcGFXCA@mail.gmail.com/
[6]: https://lore.kernel.org/linux-mm/87o78mzp24.fsf@yhuang6-desk2.ccr.corp.intel.com/
[7]: https://lore.kernel.org/all/CAGsJ_4ysCN6f7qt=6gvee1x3ttbOnifGneqcRm9Hoeun=uFQ2w@mail.gmail.com/
[8]: https://lore.kernel.org/linux-mm/4DA25039.3020700@redhat.com/
[9]: https://lore.kernel.org/all/CA+ZsKJ7DCE8PMOSaVmsmYZL9poxK6rn0gvVXbjpqxMwxS2C9TQ@mail.gmail.com/
[10]: https://lore.kernel.org/all/CACePvbUkMYMencuKfpDqtG1Ej7LiUS87VRAXb8sBn1yANikEmQ@mail.gmail.com/
[11]: https://lore.kernel.org/all/CAMgjq7BvQ0ZXvyLGp2YP96+i+6COCBBJCYmjXHGBnfisCAb8VA@mail.gmail.com/
[12]: https://lore.kernel.org/linux-mm/ZeZSDLWwDed0CgT3@casper.infradead.org/
[13]: https://lore.kernel.org/all/20251121-ghost-v1-1-cfc0efcf3855@kernel.org/
[14]: https://lore.kernel.org/linux-mm/20251202170222.GD430226@cmpxchg.org/
[15]: https://lore.kernel.org/linux-mm/CAMgjq7AQNGK-a=AOgvn4-V+zGO21QMbMTVbrYSW_R2oDSLoC+A@mail.gmail.com/
[16]: https://lore.kernel.org/all/69bc6c4f.050a0220.3bf4de.0001.GAE@google.com/
[17]: https://lore.kernel.org/all/20260319075621.GR3738010@noisy.programming.kicks-ass.net/

Nhat Pham (21):
  mm/swap: decouple swap cache from physical swap infrastructure
  swap: rearrange the swap header file
  mm: swap: add an abstract API for locking out swapoff
  zswap: add new helpers for zswap entry operations
  mm/swap: add a new function to check if a swap entry is in swap
    cached.
  mm: swap: add a separate type for physical swap slots
  mm: create scaffolds for the new virtual swap implementation
  zswap: prepare zswap for swap virtualization
  mm: swap: allocate a virtual swap slot for each swapped out page
  swap: move swap cache to virtual swap descriptor
  zswap: move zswap entry management to the virtual swap descriptor
  swap: implement the swap_cgroup API using virtual swap
  swap: manage swap entry lifecycle at the virtual swap layer
  mm: swap: decouple virtual swap slot from backing store
  zswap: do not start zswap shrinker if there is no physical swap slots
  swap: do not unnecesarily pin readahead swap entries
  swapfile: remove zeromap bitmap
  memcg: swap: only charge physical swap slots
  swap: simplify swapoff using virtual swap
  swapfile: replace the swap map with bitmaps
  vswap: batch contiguous vswap free calls

 Documentation/mm/swap-table.rst |   69 --
 MAINTAINERS                     |    3 +-
 include/linux/cpuhotplug.h      |    1 +
 include/linux/memcontrol.h      |    6 +
 include/linux/mm_types.h        |   16 +
 include/linux/shmem_fs.h        |    7 +-
 include/linux/swap.h            |  185 ++-
 include/linux/swap_cgroup.h     |   17 +-
 include/linux/swapops.h         |   25 +
 include/linux/zswap.h           |   17 +-
 kernel/power/swap.c             |    6 +-
 mm/Makefile                     |    5 +-
 mm/filemap.c                    |   14 +-
 mm/huge_memory.c                |   11 +-
 mm/internal.h                   |   24 +-
 mm/madvise.c                    |    2 +-
 mm/memcontrol-v1.c              |    8 +-
 mm/memcontrol.c                 |  144 ++-
 mm/memory.c                     |  109 +-
 mm/migrate.c                    |   13 +-
 mm/mincore.c                    |   15 +-
 mm/page_io.c                    |   83 +-
 mm/shmem.c                      |  227 +---
 mm/swap.h                       |  179 +--
 mm/swap_cgroup.c                |  172 ---
 mm/swap_state.c                 |  306 +----
 mm/swap_table.h                 |   78 +-
 mm/swapfile.c                   | 1517 ++++-------------------
 mm/userfaultfd.c                |   18 +-
 mm/vmscan.c                     |   28 +-
 mm/vswap.c                      | 2034 +++++++++++++++++++++++++++++++
 mm/zswap.c                      |  142 +--
 32 files changed, 2974 insertions(+), 2507 deletions(-)
 delete mode 100644 Documentation/mm/swap-table.rst
 delete mode 100644 mm/swap_cgroup.c
 create mode 100644 mm/vswap.c


base-commit: 05f7e89ab9731565d8a62e3b5d1ec206485eeb0b
-- 
2.52.0
Re: [PATCH v5 00/21] Virtual Swap Space
Posted by Andrew Morton 1 week, 4 days ago
On Fri, 20 Mar 2026 12:27:14 -0700 Nhat Pham <nphamcs@gmail.com> wrote:

> This patch series implements the virtual swap space idea, based on Yosry's
> proposals at LSFMMBPF 2023 (see [1], [2], [3]), as well as valuable
> inputs from Johannes Weiner. The same idea (with different
> implementation details) has been floated by Rik van Riel since at least
> 2011 (see [8]).

AI review got partway through then decided it couldn't apply patches.  So
a partial result: https://sashiko.dev/#/patchset/20260320192735.748051-1-nphamcs@gmail.com
Re: [PATCH v5 00/21] Virtual Swap Space
Posted by Roman Gushchin 1 week, 4 days ago
Andrew Morton <akpm@linux-foundation.org> writes:

> On Fri, 20 Mar 2026 12:27:14 -0700 Nhat Pham <nphamcs@gmail.com> wrote:
>
>> This patch series implements the virtual swap space idea, based on Yosry's
>> proposals at LSFMMBPF 2023 (see [1], [2], [3]), as well as valuable
>> inputs from Johannes Weiner. The same idea (with different
>> implementation details) has been floated by Rik van Riel since at least
>> 2011 (see [8]).
>
> AI review got partway through then decided it couldn't apply patches.  So
> a partial result: https://sashiko.dev/#/patchset/20260320192735.748051-1-nphamcs@gmail.com

It's a bug in the error handling. I've already fixed it, but haven't
deployed the new version yet. In reality, the review failed for some
other reason (the most common one now is backend/LLM API transient
errors).

Sashiko applies the entire patchset first; if that fails, it doesn't
review anything.
Re: [PATCH v5 00/21] Virtual Swap Space
Posted by Askar Safin 1 week, 1 day ago
Nhat Pham <nphamcs@gmail.com>:
> We can even perform compressed writeback
> (i.e writing these pages without decompressing them) (see [12]).

> [12]: https://lore.kernel.org/linux-mm/ZeZSDLWwDed0CgT3@casper.infradead.org/

This is supported in zram. The support was added here:
https://lore.kernel.org/all/20251201094754.4149975-1-senozhatsky@chromium.org/ .
It is already in mainline.

-- 
Askar Safin
Re: [PATCH v5 00/21] Virtual Swap Space
Posted by Nhat Pham 1 week, 1 day ago
On Tue, Mar 24, 2026 at 9:19 AM Askar Safin <safinaskar@gmail.com> wrote:
>
> Nhat Pham <nphamcs@gmail.com>:
> > We can even perform compressed writeback
> > (i.e writing these pages without decompressing them) (see [12]).
>
> > [12]: https://lore.kernel.org/linux-mm/ZeZSDLWwDed0CgT3@casper.infradead.org/
>
> This is supported in zram. The support was added here:
> https://lore.kernel.org/all/20251201094754.4149975-1-senozhatsky@chromium.org/ .
> It is already in mainline.

I'm aware of that work. It's an improvement, but my understanding is:

1. It only works for zram.

2. We still occupy the full PAGE_SIZE slot.

3. The writeback IO request is still of size PAGE_SIZE.

So we're saving the CPU work for decompression, but not the rest of
the potential benefits of compressed writeback.

For zswap, decoupling zswap and disk swap is a prerequisite
(otherwise every zswap slot occupies a PAGE_SIZE slot in the swapfile
anyway).

Then, we have two alternatives. Either we implement a small-slot
allocator for the swapfile infrastructure, or we write back a full
backing page for compressed memory. The second option is a bit more
straightforward, but then we lose the relative age of these objects -
a backing page might combine very recent compressed pages with very
old compressed pages.
These approaches have different performance tradeoffs and need to be
evaluated. But anyway this is future work.
Re: [PATCH v5 00/21] Virtual Swap Space
Posted by Askar Safin 1 week, 1 day ago
Nhat Pham <nphamcs@gmail.com>:
> I'm aware of that work. It's an improvement, but my understanding is:

Thank you for the answer!

Also, is it possible to have checksummed swap?

I want to have checksummed swap to be protected from disk bit-rot
(I already have ECC memory, so RAM is protected).

And the hibernation image should be protected, too.

I tried to put swap on top of dm-integrity, but this is
incompatible with hibernation in the mainline kernel.

-- 
Askar Safin
Re: [PATCH v5 00/21] Virtual Swap Space
Posted by Kairui Song 1 week, 2 days ago
On Sat, Mar 21, 2026 at 3:29 AM Nhat Pham <nphamcs@gmail.com> wrote:
> This patch series is based on 6.19. There are a couple more
> swap-related changes in mainline that I would need to coordinate
> with, but I still want to send this out as an update for the
> regressions reported by Kairui Song in [15]. It's probably easier
> to just build this thing rather than dig through that series of
> emails to get the fix patch :)
>
> Changelog:
> * v4 -> v5:
>     * Fix a deadlock in memcg1_swapout (reported by syzbot [16]).
>     * Replace VM_WARN_ON(!spin_is_locked()) with lockdep_assert_held(),
>       and use guard(rcu) in vswap_cpu_dead
>       (reported by Peter Zijlstra [17]).
> * v3 -> v4:
>     * Fix poor swap free batching behavior to alleviate a regression
>       (reported by Kairui Song).

I tested the v5 (including the batched-free hotfix) and am still
seeing significant regressions in both sequential and concurrent swap
workloads.

Thanks for the update; I can see it's a lot of thoughtful work.
Actually, I already ran some tests with your previously posted
hotfix based on v3. I didn't update the results because, very
unfortunately, I still see a major performance regression even with a
very simple setup.

BTW there seems to be a simpler way to reproduce that; just use memhog:
sudo mkswap /dev/pmem0; sudo swapon /dev/pmem0; time memhog 48G; sudo swapoff -a

Before:
(I'm using fish shell on that test machine so this is fish time format):
________________________________________________________
Executed in   20.80 secs    fish           external
   usr time    5.14 secs    0.00 millis    5.14 secs
   sys time   15.65 secs    1.17 millis   15.65 secs
________________________________________________________
Executed in   21.69 secs    fish           external
   usr time    5.31 secs  725.00 micros    5.31 secs
   sys time   16.36 secs  579.00 micros   16.36 secs
________________________________________________________
Executed in   21.86 secs    fish           external
   usr time    5.39 secs    1.02 millis    5.39 secs
   sys time   16.46 secs    0.27 millis   16.46 secs

After:
________________________________________________________
Executed in   30.77 secs    fish           external
   usr time    5.16 secs  767.00 micros    5.16 secs
   sys time   25.59 secs  580.00 micros   25.59 secs
________________________________________________________
Executed in   37.47 secs    fish           external
   usr time    5.48 secs    0.00 micros    5.48 secs
   sys time   31.98 secs  674.00 micros   31.98 secs
________________________________________________________
Executed in   31.34 secs    fish           external
   usr time    5.22 secs    0.00 millis    5.22 secs
   sys time   26.09 secs    1.30 millis   26.09 secs

It's obviously a lot slower.

pmem may seem rare, but SSDs are good at sequential IO, and memhog
uses same-filled pages, for which backends like ZRAM have extremely
low overhead. Results with ZRAM are very similar, and many
production workloads have massive amounts of same-filled memory.

For example, on the Android phone I'm using right now:
# cat /sys/block/zram0/mm_stat
4283899904 1317373036 1370259456        0 1475977216   116457  1991851
   87273  1793760
~450M of same-filled pages in ZRAM; we may see more on some server
workloads. And I'm seeing similar memhog results with ZRAM; pmem is
just easier to set up, less noisy, and also simulates high speed
storage.

I also ran the previous usemem matrix, which seems better than V3 but
still pretty bad:
Test: usemem --init-time -O -n 1 56G, 16G mem, 48G swap, avgs of 8 runs.
Before:
Throughput (Sum): 528.98 MB/s Throughput (Mean): 526.113333 MB/s Free
Latency: 3037932.888889
After:
Throughput (Sum): 453.74 MB/s Throughput (Mean): 454.875000 MB/s Free
Latency: 5001144.500000 (~10%, 64% slower)

I'm not sure why our results differ so much; perhaps different LRU
settings, memory pressure ratios, or THP/mTHP configs? Here's my exact
config in the attachment. It also includes the full log and info, with
all debug options disabled to be close to production. I ran it 8 times
and just attached the first result log; they're all similar anyway,
and my test framework reboots the machine after each test run to
reduce any potential noise.

And the above tests are only about sequential performance; concurrent
ones seem worse:
Test: usemem --init-time -O -R -n 32 622M, 16G mem, 48G swap, avgs of 8 runs.
Before:
Throughput (Sum): 5467.51 MB/s Throughput (Mean): 170.04 MB/s Free
Latency: 28648.65
After:
Throughput (Sum): 4914.86 MB/s Throughput (Mean): 152.74 MB/s Free
Latency: 67789.81 (~10%, 230% slower)

And I double checked I'm testing your latest V5 commit here:
commit 9114ebedb82089ebd3519854964c73d3959b10c0 (HEAD -> upstream/vswap)
Author: Nhat Pham <nphamcs@gmail.com>
Date:   Fri Mar 20 12:27:35 2026 -0700

    vswap: batch contiguous vswap free calls

    In vswap_free(), we release and reacquire the cluster lock for every
    single entry, even for non-disk-swap backends where the lock drop is
    unnecessary. Batch consecutive free operations to avoid this overhead.

    Signed-off-by: Nhat Pham <nphamcs@gmail.com>

The two kernels being tested:
/boot/vmlinuz-6.19.0.orig-g05f7e89ab973
/boot/vmlinuz-6.19.0.ptch-g9114ebedb820



The tests above were done on an EPYC 7K62. I also set up an Intel
8255C with freshly installed upstream Fedora, using Fedora's kernel
config. So far the results match; the gap seems smaller but is still
>20% for many cases, so this is a common problem:

3 test runs on the 8255C using freshly installed Fedora and the Fedora
kernel config:
taskset -c 3 /usr/local/bin/usemem --init-time -O -n 1 112G
(That's a large two-node machine, so I pin the thread to CPU 3 for
stability.)

Before:
135291469824 bytes / 124326887 usecs = 1062687 KB/s
2157355 usecs to free memory
135291469824 bytes / 123930024 usecs = 1066090 KB/s
2244083 usecs to free memory
135291469824 bytes / 123484528 usecs = 1069936 KB/s
2268364 usecs to free memory

After:
135291469824 bytes / 127073712 usecs = 1039716 KB/s
3050394 usecs to free memory
135291469824 bytes / 130724757 usecs = 1010677 KB/s
3064270 usecs to free memory
135291469824 bytes / 127248347 usecs = 1038289 KB/s
3035986 usecs to free memory

And besides these known cases, my main concern is still that a
mandatory virtual layer seems just wrong; it changes how swap works in
many ways. Storage folks have been trying to bypass the kernel for
decades, as abstraction layers come with overhead; that's common
knowledge. Swap lives right at the intersection of storage and mm and
has to stay inside the kernel, so we really want the kernel path to be
as flat and direct as possible.

I'm also worried this risks undoing all the recent and upcoming work
on reducing memory usage and improving performance. We've been trying
to shrink per-entry overhead (I'm already feeling nervous over the
current 8-byte per-entry cost, and hope soon we'll get down to <1-3
bytes). The series mentions 24 bytes of overhead, but when I account
for the reverse mapping, it looks like >32 bytes per entry.

The intermediate large XArray layer also worries me, as the swap space
is now very large. The virtual size could grow with no limit, e.g. a 1
TB swap space would be a 4-layer radix tree, increasing global
contention (int(1024 * 1024 / 2) >> 6 >> 6 >> 6 == 2), and vswap could
be even larger if fragmentation happens. That's the exact problem the
old sub-address_space design for swap was created to solve. We only
eliminated that complexity a few months ago, and this approach seems
like it would have to bring a similar structure back to reduce
contention.

And for swapoff support: minor anonymous faults during busy periods
are indeed critical for some workloads, and being able to swapoff
cleanly is still very useful both for performance and troubleshooting.
You will need to touch many things to solve a minor fault.

For reference, I've been exploring an approach that keeps the virtual
layer runtime-optional, which avoids these overheads for workloads
that don't need virtualization:
https://lore.kernel.org/linux-mm/20260220-swap-table-p4-v1-0-104795d19815@tencent.com/
Re: [PATCH v5 00/21] Virtual Swap Space
Posted by Nhat Pham 1 week, 2 days ago
On Mon, Mar 23, 2026 at 6:09 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Sat, Mar 21, 2026 at 3:29 AM Nhat Pham <nphamcs@gmail.com> wrote:
> > This patch series is based on 6.19. There are a couple more
> > swap-related changes in mainline that I would need to coordinate
> > with, but I still want to send this out as an update for the
> > regressions reported by Kairui Song in [15]. It's probably easier
> > to just build this thing rather than dig through that series of
> > emails to get the fix patch :)
> >
> > Changelog:
> > * v4 -> v5:
> >     * Fix a deadlock in memcg1_swapout (reported by syzbot [16]).
> >     * Replace VM_WARN_ON(!spin_is_locked()) with lockdep_assert_held(),
> >       and use guard(rcu) in vswap_cpu_dead
> >       (reported by Peter Zijlstra [17]).
> > * v3 -> v4:
> >     * Fix poor swap free batching behavior to alleviate a regression
> >       (reported by Kairui Song).
>

Hi Kairui! Thanks a lot for the testing, big boss :) I will focus on
the regression in this patch series - we can talk more about
directions in another thread :)

> I tested the v5 (including the batched-free hotfix) and am still
> seeing significant regressions in both sequential and concurrent swap
> workloads
>
> Thanks for the update as I can see It's a lot of thoughtful work.
> Actually I did run some tests already with your previously posted
> hotfix based on v3. I didn't update the result because very
> unfortunately, I still see a major performance regression even with a
> very simple setup.
>
> BTW there seems a simpler way to reproduce that, just use memhog:
> sudo mkswap /dev/pmem0; sudo swapon /dev/pmem0; time memhog 48G; sudo swapoff -a
>
> Before:
> (I'm using fish shell on that test machine so this is fish time format):
> ________________________________________________________
> Executed in   20.80 secs    fish           external
>    usr time    5.14 secs    0.00 millis    5.14 secs
>    sys time   15.65 secs    1.17 millis   15.65 secs
> ________________________________________________________
> Executed in   21.69 secs    fish           external
>    usr time    5.31 secs  725.00 micros    5.31 secs
>    sys time   16.36 secs  579.00 micros   16.36 secs
> ________________________________________________________
> Executed in   21.86 secs    fish           external
>    usr time    5.39 secs    1.02 millis    5.39 secs
>    sys time   16.46 secs    0.27 millis   16.46 secs
>
> After:
> ________________________________________________________
> Executed in   30.77 secs    fish           external
>    usr time    5.16 secs  767.00 micros    5.16 secs
>    sys time   25.59 secs  580.00 micros   25.59 secs
> ________________________________________________________
> Executed in   37.47 secs    fish           external
>    usr time    5.48 secs    0.00 micros    5.48 secs
>    sys time   31.98 secs  674.00 micros   31.98 secs
> ________________________________________________________
> Executed in   31.34 secs    fish           external
>    usr time    5.22 secs    0.00 millis    5.22 secs
>    sys time   26.09 secs    1.30 millis   26.09 secs
>
> It's obviously a lot slower.
>
> pmem may seem rare but SSDs are good at sequential, and memhog uses
> the same filled page and backend like ZRAM has extremely low overhead
> for same filled pages. Results with ZRAM are very similar, and many
> production workloads have massive amounts of samefill memory.
>
> For example on the Android phone I'm using right now at this moment:
> # cat /sys/block/zram0/mm_stat
> 4283899904 1317373036 1370259456        0 1475977216   116457  1991851
>    87273  1793760
> ~450M of samefill page in ZRAM, we may see more on some server
> workload. And I'm seeing similar memhog results with ZRAM, pmem is
> just easier to setup and less noisy. also simulates high speed
> storage.

Interesting. Normally "lots of zero-filled pages" is a very beneficial
case for vswap. You don't need a swapfile, or any zram/zswap metadata
overhead - it's a native swap backend. If a production workload has
this many zero-filled pages, I think the numbers for vswap would be
much less alarming - perhaps even matching memory overhead, because
you don't need to maintain zram entry metadata (it's at least 2 words
per zram entry, right?), while there's no reverse map overhead induced
(so it's 24 bytes on both sides), and no need to do zram-side locking
:)

So I was surprised to see that it's not working out very well here. I
checked the implementation of memhog - let me know if this is the
wrong place to look:

https://man7.org/linux/man-pages/man8/memhog.8.html
https://github.com/numactl/numactl/blob/master/memhog.c#L52

I think this is what happened here: memhog was populating the memory
with 0xff, which triggers the full overhead of a swapfile-backed swap
entry, because even though it's "same-filled", it's not zero-filled! I
was following Usama's observation - "less than 1% of the same-filled
pages were non-zero" - and so I only handled the zero-filled case here:

https://lore.kernel.org/all/20240530102126.357438-1-usamaarif642@gmail.com/

This sounds a bit artificial IMHO - as Usama pointed out above, I
think most same-filled pages are zero pages in real production
workloads. However, if you think there are real use cases with a lot
of non-zero same-filled pages, please let me know and I can fix this
real quick. We can support this in vswap with zero extra metadata
overhead - change the VSWAP_ZERO swap entry type to VSWAP_SAME_FILLED,
then use the backend field to store that value. I can send you a patch
if you're interested.
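
A rough sketch of that idea (hypothetical names and fields, not the
series' code; the fill word would reuse the descriptor's backend
union):

/*
 * Hypothetical sketch: generalize the zero backend to any fill
 * pattern. Same-filled pages then need no physical slot and no
 * extra metadata.
 */
static bool vswap_store_same_filled(struct swp_desc *desc,
				    const unsigned long *words)
{
	unsigned long val = words[0];
	int i;

	for (i = 1; i < PAGE_SIZE / sizeof(unsigned long); i++)
		if (words[i] != val)
			return false;

	desc->type = VSWAP_SAME_FILLED;	/* was VSWAP_ZERO */
	desc->fill_value = val;		/* hypothetical: backend union */
	return true;
}

On swapin, the page would simply be reconstructed by filling it with
the stored word.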

>
> I also ran the previous usemem matrix, which seems better than V3 but
> still pretty bad:
> Test: usemem --init-time -O -n 1 56G, 16G mem, 48G swap, avgs of 8 run.
> Before:
> Throughput (Sum): 528.98 MB/s Throughput (Mean): 526.113333 MB/s Free
> Latency: 3037932.888889
> After:
> Throughput (Sum): 453.74 MB/s Throughput (Mean): 454.875000 MB/s Free
> Latency: 5001144.500000 (~10%, 64% slower)
>
> I'm not sure why our results differ so much — perhaps different LRU
> settings, memory pressure ratios, or THP/mTHP configs? Here's my exact
> config in the attachment. Also includes the full log and info, with
> all debug options disabled for close to production. I ran it 8 times
> and just attached the first result log, it's all similar anyway, my
> test framework reboot the machine after each test run to reduce any
> potential noise.

Ohh interesting - I see that you're testing with MGLRU. I can give that a try.

I'm not enabling THP/mTHP, but I don't see that you're enabling it
either - there's some 2MB swpout but that seems incidental.

Another difference is the swap backend:

1. Regarding the pmem backend - I'm not sure if I can get my hands on
one of these, but if you think SSD has the same characteristics, maybe
I can give that a try? The problem with SSD is that, for some reason,
variance tends to be pretty high - between iterations, yes, but
especially across reboots. Or maybe zram?

2. What about the other numbers below? Are they also on pmem? FTR I
was running most of my benchmarks on zswap, except for one kernel
build benchmark on SSD.

3. Any other backends and setup you're interested in?

BTW, it sounds like you have a great benchmark suite - is it open
source somewhere? If not, can you share it with us :) Vswap aside, I
think this would be a good suite for every swap contributor to run
all swap-related changes against.

Once again, thank you so much for your engagement, Kairui. Very much
appreciated - I owe you a beverage of your choice whenever we meet.
And have a great rest of your day :)
Re: [PATCH v5 00/21] Virtual Swap Space
Posted by YoungJun Park 1 week ago
On Mon, Mar 23, 2026 at 11:32:57AM -0400, Nhat Pham wrote:

> Interesting. Normally "lots of zero-filled page" is a very beneficial
> case for vswap. You don't need a swapfile, or any zram/zswap metadata
> overhead - it's a native swap backend. If production workload has this
> many zero-filled pages, I think the numbers of vswap would be much
> less alarming - perhaps even matching memory overhead because you
> don't need to maintain a zram entry metadata (it's at least 2 words
> per zram entry right?), while there's no reverse map overhead induced
> (so it's 24 bytes on both side), and no need to do zram-side locking
> :)
> 
> So I was surprised to see that it's not working out very well here. I
> checked the implementation of memhog - let me know if this is wrong
> place to look:
> 
> https://man7.org/linux/man-pages/man8/memhog.8.html
> https://github.com/numactl/numactl/blob/master/memhog.c#L52
> 
> I think this is what happened here: memhog was populating the memory
> 0xff, which triggers the full overhead of a swapfile-backed swap entry
> because even though it's "same-filled" it's not zero-filled! I was
> following Usama's observation - "less than 1% of the same-filled pages
> were non-zero" - and so I only handled the zero-filled case here:
> 
> https://lore.kernel.org/all/20240530102126.357438-1-usamaarif642@gmail.com/
> 
> This sounds a bit artificial IMHO - as Usama pointed out above, I
> think most samefilled pages are zero pages, in real production
> workloads. However, if you think there are real use cases with a lot
> of non-zero samefilled pages, please let me know I can fix this real
> quick. We can support this in vswap with zero extra metadata overhead
> - change the VSWAP_ZERO swap entry type to VSWAP_SAME_FILLED, then use
> the backend field to store that value. I can send you a patch if
> you're interested.

This brings back memories -- I'm pretty sure we talked about
exactly this at LPC. Our custom swap device already handles both
zero-filled and same-filled pages on its own, so what we really
wanted was a way to tell the swap layer "just skip the detection
and let it through."
 
I looked at two approaches back then but never submitted either:
 
  - A per-swap_info flag to opt out of zero/same-filled handling.
    But this felt wrong from vswap's perspective -- if even one
    device opts out of the zeromap, the model gets messy.
 
  - Revisiting Usama's patch 2 approach.
    Sounded good in theory, but as you said,
    it's not as simple to verify in practice. And a swapout-time zero
    check is the cleaner design, as I see it. So I gave up on it.
 
Seeing this come up again is actually kind of nice :)
 
One thought -- maybe a compile-time CONFIG or a boot param to
control the scope? e.g. zero-only, same-filled, or disabled.
That way vendors like us just turn it off, and setups like
Kairui's can opt into broader detection. Just an idea though --
open to other approaches if you have something in mind.
 
Thanks,
Youngjun Park
Re: [PATCH v5 00/21] Virtual Swap Space
Posted by Kairui Song 1 week, 2 days ago
On Mon, Mar 23, 2026 at 11:33 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Mon, Mar 23, 2026 at 6:09 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Sat, Mar 21, 2026 at 3:29 AM Nhat Pham <nphamcs@gmail.com> wrote:
> > > This patch series is based on 6.19. There are a couple more
> > > swap-related changes in mainline that I would need to coordinate
> > > with, but I still want to send this out as an update for the
> > > regressions reported by Kairui Song in [15]. It's probably easier
> > > to just build this thing rather than dig through that series of
> > > emails to get the fix patch :)
> > >
> > > Changelog:
> > > * v4 -> v5:
> > >     * Fix a deadlock in memcg1_swapout (reported by syzbot [16]).
> > >     * Replace VM_WARN_ON(!spin_is_locked()) with lockdep_assert_held(),
> > >       and use guard(rcu) in vswap_cpu_dead
> > >       (reported by Peter Zijlstra [17]).
> > > * v3 -> v4:
> > >     * Fix poor swap free batching behavior to alleviate a regression
> > >       (reported by Kairui Song).
> >
>
> Hi Kairui! Thanks a lot for the testing big boss :) I will focus on
> the regression in this patch series - we can talk more about
> directions in another thread :)

Hi Nhat,

> Interesting. Normally "lots of zero-filled page" is a very beneficial
> case for vswap. You don't need a swapfile, or any zram/zswap metadata
> overhead - it's a native swap backend. If production workload has this
> many zero-filled pages, I think the numbers of vswap would be much
> less alarming - perhaps even matching memory overhead because you
> don't need to maintain a zram entry metadata (it's at least 2 words
> per zram entry right?), while there's no reverse map overhead induced
> (so it's 24 bytes on both side), and no need to do zram-side locking
> :)
>
> So I was surprised to see that it's not working out very well here. I
> checked the implementation of memhog - let me know if this is wrong
> place to look:
>
> https://man7.org/linux/man-pages/man8/memhog.8.html
> https://github.com/numactl/numactl/blob/master/memhog.c#L52
>
> I think this is what happened here: memhog was populating the memory
> 0xff, which triggers the full overhead of a swapfile-backed swap entry
> because even though it's "same-filled" it's not zero-filled! I was
> following Usama's observation - "less than 1% of the same-filled pages
> were non-zero" - and so I only handled the zero-filled case here:
>
> https://lore.kernel.org/all/20240530102126.357438-1-usamaarif642@gmail.com/
>
> This sounds a bit artificial IMHO - as Usama pointed out above, I
> think most samefilled pages are zero pages, in real production
> workloads. However, if you think there are real use cases with a lot

I vaguely remember some workloads, like Java or some JS engines,
initialize their heap with a fixed value. Same-fill might not be that
common, but it's not rare either; it strongly depends on the workload.

> of non-zero samefilled pages, please let me know I can fix this real
> quick. We can support this in vswap with zero extra metadata overhead
> - change the VSWAP_ZERO swap entry type to VSWAP_SAME_FILLED, then use
> the backend field to store that value. I can send you a patch if
> you're interested.

Actually I don't think that's the main problem. For example, I just
wrote a few-line C bench program to zero-fill ~50G of memory
and swap it out sequentially:

Before:
Swapout: 4415467us
Swapin: 49573297us

After:
Swapout: 4955874us
Swapin: 56223658us

And vmstat:
cat /proc/vmstat | grep zero
thp_zero_page_alloc 0
thp_zero_page_alloc_failed 0
swpin_zero 12239329
swpout_zero 21516634

These are all zero-filled pages, but it's still slower. And what's
more, a more critical issue: I just found the cgroup and global swap
usage accounting are both somehow broken for zero page swap,
maybe because you skipped some allocation? Users can
no longer see how many pages are swapped out. I don't think you can
break that; that's one major reason why we use a zero entry instead of
mapping to a readonly zero page. If that were acceptable, we could
have a very nice optimization right away with current swap.

That's still just an example. Bypassing the accounting and still being
slower is not a good sign. We should focus on the generic
performance and design.
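
For reference, a minimal sketch of the kind of bench described above
(my reconstruction; the actual program wasn't posted):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/time.h>

#define GIB (1024UL * 1024 * 1024)

static long usecs(struct timeval a, struct timeval b)
{
	return (b.tv_sec - a.tv_sec) * 1000000L + (b.tv_usec - a.tv_usec);
}

int main(void)
{
	size_t size = 50 * GIB, i;
	struct timeval t0, t1;
	char *mem = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (mem == MAP_FAILED)
		return 1;
	memset(mem, 0, size);			/* dirty zero-filled pages */

	gettimeofday(&t0, NULL);
	madvise(mem, size, MADV_PAGEOUT);	/* force them out */
	gettimeofday(&t1, NULL);
	printf("Swapout: %ldus\n", usecs(t0, t1));

	gettimeofday(&t0, NULL);
	for (i = 0; i < size; i += 4096)	/* fault them back in */
		*(volatile char *)(mem + i);
	gettimeofday(&t1, NULL);
	printf("Swapin: %ldus\n", usecs(t0, t1));
	return 0;
}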

Yet this is just another newly found issue; there are many other
parts, like how folio swap allocation may still occur even if a lower
device can no longer accept more whole folios, and I'm currently
unsure how that will affect swap.

> 1. Regarding pmem backend - I'm not sure if I can get my hands on one
> of these, but if you think SSD has the same characteristics maybe I
> can give that a try? The problem with SSD is for some reason variance
> tends to be pretty high, between iterations yes, but especially across
> reboots. Or maybe zram?

Yeah, ZRAM has very similar numbers for some cases, but storage is
getting faster and faster, and swap happens over high speed networks
too. We definitely shouldn't ignore that.

> 2. What about the other numbers below? Are they also on pmem? FTR I
> was running most of my benchmarks on zswap, except for one kernel
> build benchmark on SSD.
>
> 3. Any other backends and setup you're interested in?
>
> BTW, sounds like you have a great benchmark suite - is it open source
> somewhere? If not, can you share it with us :) Vswap aside, I think
> this would be a good suite to run all swap related changes for every
> swap contributor.

I can try to post that somewhere; it's really nothing fancy, just some
wrappers that use systemd for reboot and auto test. But all the test
steps I mentioned before are already posted and publicly available.
Re: [PATCH v5 00/21] Virtual Swap Space
Posted by Nhat Pham 1 week, 2 days ago
On Mon, Mar 23, 2026 at 12:41 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Mon, Mar 23, 2026 at 11:33 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > On Mon, Mar 23, 2026 at 6:09 AM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > On Sat, Mar 21, 2026 at 3:29 AM Nhat Pham <nphamcs@gmail.com> wrote:
> > > > This patch series is based on 6.19. There are a couple more
> > > > swap-related changes in mainline that I would need to coordinate
> > > > with, but I still want to send this out as an update for the
> > > > regressions reported by Kairui Song in [15]. It's probably easier
> > > > to just build this thing rather than dig through that series of
> > > > emails to get the fix patch :)
> > > >
> > > > Changelog:
> > > > * v4 -> v5:
> > > >     * Fix a deadlock in memcg1_swapout (reported by syzbot [16]).
> > > >     * Replace VM_WARN_ON(!spin_is_locked()) with lockdep_assert_held(),
> > > >       and use guard(rcu) in vswap_cpu_dead
> > > >       (reported by Peter Zijlstra [17]).
> > > > * v3 -> v4:
> > > >     * Fix poor swap free batching behavior to alleviate a regression
> > > >       (reported by Kairui Song).
> > >
> >
> > Hi Kairui! Thanks a lot for the testing big boss :) I will focus on
> > the regression in this patch series - we can talk more about
> > directions in another thread :)
>
> Hi Nhat,
>
> > Interesting. Normally "lots of zero-filled page" is a very beneficial
> > case for vswap. You don't need a swapfile, or any zram/zswap metadata
> > overhead - it's a native swap backend. If production workload has this
> > many zero-filled pages, I think the numbers of vswap would be much
> > less alarming - perhaps even matching memory overhead because you
> > don't need to maintain a zram entry metadata (it's at least 2 words
> > per zram entry right?), while there's no reverse map overhead induced
> > (so it's 24 bytes on both side), and no need to do zram-side locking
> > :)
> >
> > So I was surprised to see that it's not working out very well here. I
> > checked the implementation of memhog - let me know if this is wrong
> > place to look:
> >
> > https://man7.org/linux/man-pages/man8/memhog.8.html
> > https://github.com/numactl/numactl/blob/master/memhog.c#L52
> >
> > I think this is what happened here: memhog was populating the memory
> > 0xff, which triggers the full overhead of a swapfile-backed swap entry
> > because even though it's "same-filled" it's not zero-filled! I was
> > following Usama's observation - "less than 1% of the same-filled pages
> > were non-zero" - and so I only handled the zero-filled case here:
> >
> > https://lore.kernel.org/all/20240530102126.357438-1-usamaarif642@gmail.com/
> >
> > This sounds a bit artificial IMHO - as Usama pointed out above, I
> > think most samefilled pages are zero pages, in real production
> > workloads. However, if you think there are real use cases with a lot
>
> I vaguely remember some workloads like Java or some JS engine
> initialize their heap with fixed value, same fill might not be that
> common but not a rare thing, it strongly depends on the workload.

To a non-zero value? ISTR it was initialized to zero, but if I'm
wrong then yeah, it should just be a small, simple patch.

>
> > of non-zero samefilled pages, please let me know I can fix this real
> > quick. We can support this in vswap with zero extra metadata overhead
> > - change the VSWAP_ZERO swap entry type to VSWAP_SAME_FILLED, then use
> > the backend field to store that value. I can send you a patch if
> > you're interested.
>
> Actually I don't think that's the main problem. For example, I just
> wrote a few lines C bench program to zerofill ~50G of memory
> and swapout sequentially:
>
> Before:
> Swapout: 4415467us
> Swapin: 49573297us
>
> After:
> Swapout: 4955874us
> Swapin: 56223658us
>
> And vmstat:
> cat /proc/vmstat | grep zero
> thp_zero_page_alloc 0
> thp_zero_page_alloc_failed 0
> swpin_zero 12239329
> swpout_zero 21516634
>
> There are all zero filled pages, but still slower. And what's more, a
> more critical issue, I just found the cgroup and global swap usage
> accounting are both somehow broken for zero page swap,
> maybe because you skipped some allocation? Users can
> no longer see how many pages are swapped out. I don't think you can
> break that, that's one major reason why we use a zero entry instead of
> mapping to a zero readonly page. If that is acceptable, we can have
> a very nice optimization right away with current swap.

No, that was intentional :) I probably should have documented this
better - but we only charge towards swap usage (cgroup and system
wide) for entries backed by physical swap slots. There was a whole
patch that did that in the series :)

I can add new counters to differentiate these cases, but it makes no
sense to me to charge towards swap usage for non-swapfile backends
(namely, zswap and zero swap pages). You are not actually occupying
the limited swapfile slots, but instead only occupying a dynamic, vast
virtual swap space (and memory in the case of zswap - this is actually
an argument against zram, which does not do any cgroup accounting, but
that's another story for another day). I don't see a point in swap
charging here. It's the whole point of decoupling the backends - these
are not the same resource domains.

And if you follow Usama's work above, we actually were trying to
figure out a way to map it to a readonly zero page. That was Usama's
v2 of the patch series IIRC - but there was a bug. I think it was a
potential race between the reclaimer's rmap walk to unmap the page
from PTEs pointing to the page, and concurrent modifiers to the page?
We couldn't fix the race in a way that does not induce more overhead
than it's worth. But had that worked, we would also not do any swap
charging :)

BTW, if you can figure that part out, please let us know. We actually
quite like that idea - we just never managed to make it work (and we
have a bunch more urgent tasks).

>
> That's still just an example. bypassing the accounting and still
> slower is not a good sign. We should focus on the generic
> performance and design.

I will dig into the remaining regression :) Thanks for the report.

>
> Yet this is just another new found issue, there are many other parts
> like the folio swap allocation may still occur even if a lower device
> can no longer accept more whole folios, which I'm currently
> unsure how it will affect swap.



>
> > 1. Regarding pmem backend - I'm not sure if I can get my hands on one
> > of these, but if you think SSD has the same characteristics maybe I
> > can give that a try? The problem with SSD is for some reason variance
> > tends to be pretty high, between iterations yes, but especially across
> > reboots. Or maybe zram?
>
> Yeah, ZRAM has a very similar number for some cases, but storage is
> getting faster and faster and swap occurs through high speed networks
> too. We definitely shouldn't ignore that.

I can also simulate it using tmpfs as a swap backend (although it
might not work for certain benchmarks, like your usemem benchmark in
which we allocate more memory than the host physical memory).

>
> > 2. What about the other numbers below? Are they also on pmem? FTR I
> > was running most of my benchmarks on zswap, except for one kernel
> > build benchmark on SSD.
> >
> > 3. Any other backends and setup you're interested in?
> >
> > BTW, sounds like you have a great benchmark suite - is it open source
> > somewhere? If not, can you share it with us :) Vswap aside, I think
> > this would be a good suite to run all swap related changes for every
> > swap contributor.
>
> I can try to post that somewhere, really nothing fancy just some
> wrapper to make use of systemd for reboot and auto test. But all test
> steps I mentioned before are already posted and publically available.

Okay, thanks, Kairui!
Re: [PATCH v5 00/21] Virtual Swap Space
Posted by YoungJun Park 1 week ago
On Fri, Mar 20, 2026 at 12:27:14PM -0700, Nhat Pham wrote:
> 
> This patch series is based on 6.19. There are a couple more
> swap-related changes in mainline that I would need to coordinate
> with, but I still want to send this out as an update for the
> regressions reported by Kairui Song in [15]. It's probably easier
> to just build this thing rather than dig through that series of
> emails to get the fix patch :)

Hi Nhat,

I wanted to fully understand the patches before asking questions,
but reviewing everything takes time, and I didn't want to miss the
timing. So let me share some thoughts and ask about your direction. 

These are the perspectives I'm coming from:

Pros:
- The architecture is very clean.
- Zero entries currently consume swap space, which can prevent
  actual swap usage in some cases.
- It resolves zswap's dependency on swap device size.
- And so on.

Cons:
- An additional virtual allocation step is introduced for every swap.
- Not easy to merge (it changes the swap infrastructure totally?).

To address the cons, I think if we can demonstrate that the
benefits always outweigh the costs, it could fully replace the
existing mechanism. However, if this can be applied selectively,
we get only the pros without the cons.

1. Modularization

You removed CONFIG_* and went with a unified approach. I recall
you were also considering a module-based structure at some point.
What are your thoughts on that direction?

If we take that approach, we could extend the recent swap ops
patchset (https://lore.kernel.org/linux-mm/20260302104016.163542-1-bhe@redhat.com/)
as follows:
- Make vswap a swap module
- Have cluster allocation functions reside in swapops
- Enable vswap through swapon

I think this could result in a similar structure. An additional
benefit would be that it enables various configurations:

- vswap + regular swap together
- vswap only
- And other combinations

And merging is not that hard; it is not a total change of the swap
infrastructure.

But the swapoff speedup might disappear? It is not that critical, I
think.

2. Flash-friendly swap integration (for my use case)

I've been thinking about the flash-friendly swap concept that
I mentioned before and recently proposed:
(https://lore.kernel.org/linux-mm/aZW0voL4MmnMQlaR@yjaykim-PowerEdge-T330/)

One of its core functions requires buffering RAM-swapped pages
and writing them out at an appropriate time -- not immediately,
but sequentially, in proper block-sized units.

This means allocated offsets must essentially be virtual, and
physical offsets need to be managed separately at the actual
write time.

If we integrate this into the current vswap, we would either
need vswap itself to handle the sequential writes (bypassing
the physical device and receiving pages directly), or swapon
a swap device and have vswap obtain physical offsets from it.
But since those offsets cannot be used directly (due to
buffering and sequential write requirements), they become
virtual too, resulting in:

  virtual -> virtual -> physical

This triple indirection is not ideal.

However, if the modularization from point 1 is achieved and
vswap acts as a swap device itself, then we can cleanly
establish a:

  virtual -> physical

relationship within it.

I noticed you seem to be exploring collaboration with Kairui
as well. I'm curious whether you have a compromise direction
in mind, or if you plan to stick with the current approach.

P.S. I definitely want to review the vswap code in detail
when I get the time. Great work and code.

Thanks,
Youngjun Park