mm, swap: Virtual Swap Space (Swap Table Edition)

[RFC PATCH 0/5] mm, swap: Virtual Swap Space (Swap Table Edition)

Posted by Nhat Pham 1 week, 3 days ago

Based on: mm-unstable @ 444fc9435e57 + swap-table phase IV v5 [2].

I manually adapted Kairui's ghost device implementation (from [4])
for my vswap device. I've credited him as Co-developed-by on Patch I
since a substantial portion of the dynamic-cluster infrastructure is
his (I did propose the idea of using xarray/radix tree for dynamic
swap clusters allocation and management though :P).

From here on out, for simplicity, I will refer to swap table phase IV
as "P4", and the older v6 virtual swap space implementation as "v6".


I. Context and Motivation

Virtual swap decouples PTE swap entries from physical swap backing,
allowing pages to be compressed by zswap without pre-allocating a
physical swap slot. See [1] for a more involved discussion on the
motivation of swap virtualization, but in short, a swap virtualization
scheme needs to satisfy 3 requirements, which are all driven by real
pressing use cases of many parties using swap:

1. No backend coupling. For instance, a zswap entry should not
   require a physical swap slot to be allocated. This prevents
   wastage of coupled backend resources, and allows zswap to be
   used in systems that do not have enough storage capacity for
   physical swap (without having to resort to silly hacks). The same
   should hold for zero-filled swap pages, and swap cached folios too.

2. Dynamic swap space. The virtualization scheme should not require
   static provisioning, to accommodate dynamic and unpredictable swap
   usage. This massively simplifies operational provisioning, and
   allow the in-memory compression backend to be maximally utilized.
   It also makes sure we do not induce unbounded overhead on unused
   swap capacity.

3. Efficient backend transfer. The virtualization scheme should not
   introduces PTE/rmap walking overhead for backend transfer. This
   is crucial for systems that want to support multiple swap backends
   in a tiering fashion (for e.g zswap -> disk swap).

There are a lot of other future use cases as well - see [1] for more
details.

This series reimplements the virtual swap space concept (see [1])
on top of Kairui Song's swap table infrastructure, on top of [2]
and in accordance with his proposal in [3]. The proposal's idea
is interesting, so I decided to give it a shot myself. I'm still not
100% sure that this is bug-proof, but hey, it compiles, and has
not crashed in my simple stress testing :)

The prototype here is feature-complete relative to the swap-table P4
baseline — swapout, swapin, freeing, swapoff, zswap writeback, zswap
shrinker, memcg charging, and THP swapin all work for
both vswap and direct-physical entries — and satisfies all three
requirements above: no backend coupling (zswap/zero entries hold no
physical slot), dynamic swap space (clusters allocated on demand via
xarray, no static provisioning), and efficient backend transfer
(in-place vtable updates, no PTE/rmap walking).

II. Design

With vswap, pages are assigned virtual swap entries on a ghost device
with no backing storage. These entries are backed by zswap, zero pages,
or (lazily) physical swap slots. Physical backing is allocated only
when needed — on zswap writeback or reclaim writeout, after the rmap
step.

Compared to the standalone v6 implementation [1], which introduces a
24-byte per-entry swap descriptor and its own cluster allocator, this
edition uses swap_table infrastructure, and share a lot of the allocator
logic. Per-slot metadata is stored in a tag-encoded virtual_table
(atomic_long_t, 8 bytes per slot), and physical clusters store
Pointer-tagged rmap entries in the swap_table for reverse lookup back to
the virtual cluster.

Here are some data layout diagrams:

  Case 1: vswap entry (virtualized)

  PTE                  swap_cluster_info_dynamic
  vswap_entry          +-------------------------+
  (swp_entry_t) ------>| swap_cluster_info (ci)  |
                       | +--------------------+  |
                       | | swap_table         |  |
                       | |   PFN / Shadow     |  |
                       | | memcg_table        |  |
                       | | count,flags,order  |  |
                       | | lock, list         |  |
                       | +--------------------+  |
                       |                         |
                       | virtual_table           |
                       | +--------------------+  |
                       | | NONE               |  |
                       | | PHYS               |  |
                       | | ZERO               |  |
                       | | ZSWAP(entry*)      |  |
                       | | FOLIO(folio*)      |  |
                       | +--------------------+  |
                       +-------------------------+
                              |
                              | PHYS resolves to
                              v
                       PHYSICAL CLUSTER (swap_cluster_info)
                       +--------------------------+
                       | swap_table per-slot:     |
                       |   NULL   - free          |
                       |   PFN    - cached folio  |
                       |   Shadow - swapped out   |
                       |   Pointer- vswap rmap    |
                       |   Bad    - unusable      |
                       |                          |
                       | Vswap-backing slot:      |
                       |   Pointer(C|swp_entry_t) |
                       |     rmap back to vswap   |
                       +--------------------------+

  Case 2: direct-mapped physical entry (no vswap)

  PTE                  PHYSICAL CLUSTER (swap_cluster_info)
  phys_entry           +--------------------------+
  (swp_entry_t) ------>| swap_table per-slot:     |
                       |   NULL   - free          |
                       |   PFN    - cached folio  |
                       |   Shadow - swapped out   |
                       |   Bad    - unusable      |
                       +--------------------------+

struct swap_cluster_info_dynamic {
    struct swap_cluster_info ci;       /* swap_table, lock, etc. */
    unsigned int index;                /* position in xarray */
    struct rcu_head rcu;               /* kfree_rcu deferred free */
    atomic_long_t *virtual_table;      /* backend info, 8 B/slot */
};

Each vswap cluster (swap_cluster_info_dynamic) extends the classic
swap_cluster_info struct with a virtual_table array that stores the
backend information for each virtual swap entry in the cluster. Each
entry is tag-encoded in the low 3 bits to indicate backend types:

  NONE:   |----- 0000 ------|000|  free / unbacked
  PHYS:   |-- (type:5,off:N)|001|  on a physical swapfile (shifted)
  ZERO:   |----- 0000 ------|010|  zero-filled page
  ZSWAP:  |--- zswap_entry* |011|  compressed in zswap
  FOLIO:  |--- folio* ------|100|  in-memory folio

We still have room for 3 more future backend types, for e.g. CRAM, i.e
compressed-CXL-as-swap, which is laid out in [10] and [11]. Worst
case scenario, we can add more fields to this extended struct.

Other design points:
- Both vswap entries (Case 1) and directly-mapped physical entries
  (Case 2) coexist as first-class citizens. All the common swap
  code paths — swapout, swapin, swap freeing, swapoff, zswap
  writeback, THP swapin, etc. work for both. When CONFIG_VSWAP=n,
  the vswap branches compile out and behavior should be identical to
  today's swap-table P4 (at least that is my intention).
- Pointer-tagged swap_table on physical clusters for rmap (physical
  -> virtual) lookup.
- Virtual swap slots not backed by physical swap are not charged to
  memcg swap counters — only physical backing is charged (I made the
  case for this in [7]).
- Careful separation of vswap and physical swap allocation paths and
  structures adds a lot of complexity, but is crucial to make sure
  both paths are efficient and do not conflict with each other (for
  correctness and performance). I do re-use a lot of the allocation
  logic wherever possible though.

  An example of this is the per-cpu cluster caching. I have found that
  caching virtual and physical clusters in the same structure is a
  recipe for bugs and performance regressions :) For instance, zswap
  shrinker will invalidate the cached virtual cluster, and cache its
  physical cluster instead, which will be reverted by the next vswap
  allocation.


And a lot more of these random tidbits off the top of my head. See the
patches for a proof-of-concept implementation.


III. Follow-ups:

In no particular order (and most of which can be done as follow-up
patch series rather than shoving everything in the initial landing):

- More thorough stress testing is very much needed.

- Performance benchmarks to make sure I don't accidentally regress
  the vswap-less case, and that the vswap's case performance is
  good. I suspect I will have to port a lot of the
  optimizations I implemented in v6 over here - some of the
  inefficiencies are inherent in any swap virtualization, and
  would require the same fix (for e.g the MRU cluster caching
  for faster cluster lookup - see [8] and [9]).

- Runtime enable/disable of the vswap device. To be honest, I don't
  know if there is a value in this. My preference is vswap can be
  optimized to the point that any overhead is negligible. Failing that,
  maybe we can come up with some simple heuristics that automatically
  decides for users?

  In this RFC, CONFIG_VSWAP=y means the vswap device is always created at
  boot, and CONFIG_VSWAP=n means the vswap device is never created. This
  *might* be enough just on its own.

  Is a runtime knob (sysfs or sysctl) worth the complexity beyond
  these heuristics? I'm not sure yet. Maintaining both cases
  at runtime also has overhead for checking as well, and some of the
  checks are not cheap :)

  Besides, what does swapon/swapoff buy us here? We do not want
  multiple vswap devices - they're identical performance-wise, so we
  will just fragment clusters unnecessarily. We do not care about
  sizing, since the metadata layer is completely dynamic. If we want
  to opt-out of vswap at runtime per-cgroup, maybe swap.tier by
  Youngjun (see [12]) is a better interface than swapon/swapoff?

- Defer per-cluster memcg_table and zeromap allocation on physical
  clusters. A physical swap cluster backing vswap entries only do
  not really need their memcg_table, but the current design forces
  us to allocate it anyway. This is a waste of memory, and is an
  overhead regression compared to my older design on the zswap-only
  case, which Johannes has pointed out multiple times (see [6]),
  and is one of the biggest reasons why I have not been satisfied
  with this approach thus far. It honestly is a bit of a
  deal-breaker...

  That said, I think I might be able to allocate them on demand, i.e
  only when the first direct-mapped slot is allocated on that cluster.
  That will give us the best of BOTH worlds, for both the vswap and
  directly-mapped physical swap cases. No promises, but I will try
  (if this approach is good enough for all parties).

- Widen swap_info_struct->max to unsigned long. The vswap device's
  max is currently clamped to ALIGN_DOWN(UINT_MAX, SWAPFILE_CLUSTER)
  (~16 TiB) to fit in unsigned int. 16 TiB is small for vswap,
  especially when we're getting increasingly big machines memory-wise.

- Supporting 32-bit architectures. I need to do the math carefully.
  But do we want to optimize for these architectures anyway? I think
  the only argument is if somehow virtual swap is so good that we
  can just get rid of the direct-mapped physical swap case entirely,
  so we need to support 32-bit architectures. I'm willing to have my
  mind changed though.

- Add some fat design doc (assuming this approach is acceptable to
  folks).

- Samefilled page handling is still doable BTW, if folks think this
  has value :)


This is an early RFC — I have only done basic functional testing so
far, and still need to run more thorough stress tests and benchmarks.
That said, I figure I should send this out early to get folks's
feedback, before I get myself too deep in this rabbit hole - the
complexity is already mounting...


[1]: https://lore.kernel.org/all/20260505153854.1612033-1-nphamcs@gmail.com/
[2]: https://lore.kernel.org/all/20260517-swap-table-p4-v5-0-88ae43e064c7@tencent.com/
[3]: https://lwn.net/Articles/1072657/
[4]: https://lore.kernel.org/all/20260220-swap-table-p4-v1-15-104795d19815@tencent.com/
[5]: https://lore.kernel.org/all/aerrps94j70MkgdW@gourry-fedora-PF4VCD3F/
[6]: https://lore.kernel.org/all/aZyFxKGXc8J6PIij@cmpxchg.org/
[7]: https://lore.kernel.org/linux-mm/CAKEwX=P4syV38jAVCWq198r2OHXXc=xA-fx1dk6+qYef6yzxWQ@mail.gmail.com/
[8]: https://lore.kernel.org/all/CAKEwX=NrUhUrAFx+8BYJEfaVKpCm-H9JhBzYSrqOQb-NW7QRug@mail.gmail.com/
[9]: https://lore.kernel.org/all/20260505153854.1612033-23-nphamcs@gmail.com/
[10]: https://lore.kernel.org/all/aerrps94j70MkgdW@gourry-fedora-PF4VCD3F/
[11]: https://lore.kernel.org/all/afIKxG5mJZE6QgpR@gourry-fedora-PF4VCD3F/
[12]: https://lore.kernel.org/all/20260527062247.3440692-1-youngjun.park@lge.com/

Nhat Pham (5):
  mm, swap: add virtual swap device infrastructure
  mm, swap: support zswap and zeroswap as vswap backends
  mm, swap: support physical swap as a vswap backend
  mm, swap: only charge physical swap entries
  mm, swap: add debugfs counters for vswap

 MAINTAINERS           |    1 +
 include/linux/swap.h  |   71 +++
 include/linux/zswap.h |    3 +
 mm/Kconfig            |   10 +
 mm/internal.h         |   20 +-
 mm/madvise.c          |    2 +-
 mm/memcontrol.c       |  132 ++++-
 mm/memory.c           |   34 +-
 mm/page_io.c          |  195 ++++++--
 mm/swap.h             |   59 ++-
 mm/swap_state.c       |   51 +-
 mm/swap_table.h       |   56 +++
 mm/swapfile.c         | 1096 +++++++++++++++++++++++++++++++++++++----
 mm/vmscan.c           |    5 +-
 mm/vswap.h            |  445 +++++++++++++++++
 mm/zswap.c            |  167 +++++--
 16 files changed, 2108 insertions(+), 239 deletions(-)
 create mode 100644 mm/vswap.h


base-commit: 401c55d4eacd97ffd24a89829655baa43b2b308e
-- 
2.53.0-Meta

Re: [RFC PATCH 0/5] mm, swap: Virtual Swap Space (Swap Table Edition)

Posted by Kairui Song 1 week ago

On Thu, May 28, 2026 at 02:29:24PM +0800, Nhat Pham wrote:
> Based on: mm-unstable @ 444fc9435e57 + swap-table phase IV v5 [2].
> 
> I manually adapted Kairui's ghost device implementation (from [4])
> for my vswap device. I've credited him as Co-developed-by on Patch I
> since a substantial portion of the dynamic-cluster infrastructure is
> his (I did propose the idea of using xarray/radix tree for dynamic
> swap clusters allocation and management though :P).
> 
> >From here on out, for simplicity, I will refer to swap table phase IV
> as "P4", and the older v6 virtual swap space implementation as "v6".
> 

...

> 
> This series reimplements the virtual swap space concept (see [1])
> on top of Kairui Song's swap table infrastructure, on top of [2]
> and in accordance with his proposal in [3]. The proposal's idea
> is interesting, so I decided to give it a shot myself. I'm still not
> 100% sure that this is bug-proof, but hey, it compiles, and has
> not crashed in my simple stress testing :)
> 
> The prototype here is feature-complete relative to the swap-table P4
> baseline — swapout, swapin, freeing, swapoff, zswap writeback, zswap
> shrinker, memcg charging, and THP swapin all work for
> both vswap and direct-physical entries — and satisfies all three
> requirements above: no backend coupling (zswap/zero entries hold no
> physical slot), dynamic swap space (clusters allocated on demand via
> xarray, no static provisioning), and efficient backend transfer
> (in-place vtable updates, no PTE/rmap walking).
> 
> II. Design
> 
> With vswap, pages are assigned virtual swap entries on a ghost device
> with no backing storage. These entries are backed by zswap, zero pages,
> or (lazily) physical swap slots. Physical backing is allocated only
> when needed — on zswap writeback or reclaim writeout, after the rmap
> step.
> 
> Compared to the standalone v6 implementation [1], which introduces a
> 24-byte per-entry swap descriptor and its own cluster allocator, this
> edition uses swap_table infrastructure, and share a lot of the allocator
> logic. Per-slot metadata is stored in a tag-encoded virtual_table
> (atomic_long_t, 8 bytes per slot), and physical clusters store
> Pointer-tagged rmap entries in the swap_table for reverse lookup back to
> the virtual cluster.
> 
> Here are some data layout diagrams:
> 
>   Case 1: vswap entry (virtualized)
> 
>   PTE                  swap_cluster_info_dynamic
>   vswap_entry          +-------------------------+
>   (swp_entry_t) ------>| swap_cluster_info (ci)  |
>                        | +--------------------+  |
>                        | | swap_table         |  |
>                        | |   PFN / Shadow     |  |
>                        | | memcg_table        |  |
>                        | | count,flags,order  |  |
>                        | | lock, list         |  |
>                        | +--------------------+  |
>                        |                         |
>                        | virtual_table           |
>                        | +--------------------+  |
>                        | | NONE               |  |
>                        | | PHYS               |  |
>                        | | ZERO               |  |
>                        | | ZSWAP(entry*)      |  |
>                        | | FOLIO(folio*)      |  |
>                        | +--------------------+  |
>                        +-------------------------+
>                               |
>                               | PHYS resolves to
>                               v
>                        PHYSICAL CLUSTER (swap_cluster_info)
>                        +--------------------------+
>                        | swap_table per-slot:     |
>                        |   NULL   - free          |
>                        |   PFN    - cached folio  |
>                        |   Shadow - swapped out   |
>                        |   Pointer- vswap rmap    |
>                        |   Bad    - unusable      |
>                        |                          |
>                        | Vswap-backing slot:      |
>                        |   Pointer(C|swp_entry_t) |
>                        |     rmap back to vswap   |
>                        +--------------------------+
> 
>   Case 2: direct-mapped physical entry (no vswap)
> 
>   PTE                  PHYSICAL CLUSTER (swap_cluster_info)
>   phys_entry           +--------------------------+
>   (swp_entry_t) ------>| swap_table per-slot:     |
>                        |   NULL   - free          |
>                        |   PFN    - cached folio  |
>                        |   Shadow - swapped out   |
>                        |   Bad    - unusable      |
>                        +--------------------------+
> 
> struct swap_cluster_info_dynamic {
>     struct swap_cluster_info ci;       /* swap_table, lock, etc. */
>     unsigned int index;                /* position in xarray */
>     struct rcu_head rcu;               /* kfree_rcu deferred free */
>     atomic_long_t *virtual_table;      /* backend info, 8 B/slot */
> };
> 
> Each vswap cluster (swap_cluster_info_dynamic) extends the classic
> swap_cluster_info struct with a virtual_table array that stores the
> backend information for each virtual swap entry in the cluster. Each
> entry is tag-encoded in the low 3 bits to indicate backend types:
> 
>   NONE:   |----- 0000 ------|000|  free / unbacked
>   PHYS:   |-- (type:5,off:N)|001|  on a physical swapfile (shifted)
>   ZERO:   |----- 0000 ------|010|  zero-filled page
>   ZSWAP:  |--- zswap_entry* |011|  compressed in zswap
>   FOLIO:  |--- folio* ------|100|  in-memory folio

Thanks for trying this approach!

For the format part, PHYS don't need that much bits I think,
so by slightly adjust the format vswap device could be share
mostly the same format with ordinary device.

For example typical modern system don't have a address space larger
than 52 bit. (Even with full 64 bits used for addressing, shift it
by 12 we get 52). Plus 5 for type, you get 57, so you can have a
marker that should work as long as it shorter than 1000000 for PHYS,
and shared for all table format since it's not in conflict with
anything. You have also use a few extra bits so a single swap space
can be 8 times larger than RAM space, and since we can help
multiple swap type I think that should be far than enough?

Then you have Shadow back at 001, and zero bit in shadow. The only
special one is Zswap, which will be 100 now, and that's exactly the
reserved pointer format in current swap table format, on seeing
si->flags & VSWAP && is_pointer(swp_tb) you know that's zswap :)

Folio / PFN can still be 010 as in the current swap table format.

Then everything seems clean and aligned, no more special handling
for vswap needed, there are detailed to sort out, but it should work.

> - Pointer-tagged swap_table on physical clusters for rmap (physical
>   -> virtual) lookup.

Or reuse the PHYS format (rename it maybe) since point back to vswap
is also pointing to a si.

> III. Follow-ups:
> 
> In no particular order (and most of which can be done as follow-up
> patch series rather than shoving everything in the initial landing):
> 
> - More thorough stress testing is very much needed.
> 
> - Performance benchmarks to make sure I don't accidentally regress
>   the vswap-less case, and that the vswap's case performance is
>   good. I suspect I will have to port a lot of the
>   optimizations I implemented in v6 over here - some of the
>   inefficiencies are inherent in any swap virtualization, and
>   would require the same fix (for e.g the MRU cluster caching
>   for faster cluster lookup - see [8] and [9]).

This could be imporved by per-si percpu cluster. Both YoungJun's
tiering and Baoquan's previous swap ops mentioned this is needed,
and now vswap also need that. If the vswap is also a si, then it will
make use of this too.

YoungJun posted this a few month before:
https://lore.kernel.org/linux-mm/20260131125454.3187546-5-youngjun.park@lge.com/

The concern is that some locking contention could be heavier, or maybe
that's just a hypothetical problem though.

> 
> - Runtime enable/disable of the vswap device. To be honest, I don't
>   know if there is a value in this. My preference is vswap can be
>   optimized to the point that any overhead is negligible. Failing that,
>   maybe we can come up with some simple heuristics that automatically
>   decides for users?
> 
>   In this RFC, CONFIG_VSWAP=y means the vswap device is always created at
>   boot, and CONFIG_VSWAP=n means the vswap device is never created. This
>   *might* be enough just on its own.
> 
>   Is a runtime knob (sysfs or sysctl) worth the complexity beyond
>   these heuristics? I'm not sure yet. Maintaining both cases

I checked the code and I think it's not hard to do, patch 1 already
handling the meta data dynamically, everything will still just work
even if you remove vswap at runtime. The rest of patches need adaption
but might not end up being complex, it other comments here
are considered.

For patch 2, a few routines like vswap_can_swapin_thp seems not
needed or should be moved to __swap_cache_alloc? VSWAP_FOLIO is
same as swap cache folio check, which is already covered. Same for
zero checking, and VSWAP_NONE which is same as swap count check
I think. That way we not only save a lot of code, we also no
longer need to treat vswap specially.

If you keep the format compatible with what we already have
as the earlier comment mentions, a large portion of this part
might be unneeded.

>   at runtime also has overhead for checking as well, and some of the
>   checks are not cheap :)

I also noticed the new introduced swap_read_folio_phys in patch 3, so
this actually can be done using Baoquan's swapops idea which is now
part of Christoph's swap batching:

https://lore.kernel.org/linux-mm/20260528124559.2566481-9-hch@lst.de/#r

That series is focusing on batching and better performance but swapops
was also proposed as a way to solve the virtual layer, makes it possible
to have vswap as one kind of swapops which is Chris talked a lot about:

https://lore.kernel.org/linux-mm/aZiFvzlBJiYBUDre@MiWiFi-R3L-srv/

Following this, we could have something like:

const struct swap_ops swap_vswap_ops = {
	.submit_write		= swap_vswap_submit_write,
	.submit_read		= swap_vswap_submit_read,
};

The move the folio_realloc_swap in swap_vswap_submit_write.

Merge of IO might be moved to lower phyiscal level for vswap.

Another gain is that the memory usage and CPU overhead will be
lower with only one layer. As I'm recently trying to offload swap
dataplane off CPU so the CPU won't touch the data at all, the
overhead will be purely by swap itself, plus some mm overhead.
Things like that and IO optimization above and could make swap
subsystem more and more performance sensitive so we have cases
that needs only one layer.

> 
> - Defer per-cluster memcg_table and zeromap allocation on physical
>   clusters. A physical swap cluster backing vswap entries only do
>   not really need their memcg_table, but the current design forces
>   us to allocate it anyway. This is a waste of memory, and is an
>   overhead regression compared to my older design on the zswap-only
>   case, which Johannes has pointed out multiple times (see [6]),
>   and is one of the biggest reasons why I have not been satisfied
>   with this approach thus far. It honestly is a bit of a
>   deal-breaker...
> 
>   That said, I think I might be able to allocate them on demand, i.e
>   only when the first direct-mapped slot is allocated on that cluster.
>   That will give us the best of BOTH worlds, for both the vswap and
>   directly-mapped physical swap cases. No promises, but I will try
>   (if this approach is good enough for all parties).

Zero map is not really a problem when it's just a inlined bit I think.
For memcg table allocation, on demand seems a good idea, and actually
we are not far from there, I tried to generalize the
alloc-then-retry-sleep-alloc in swap_alloc_table but still not generic
enough I guess.. Good new is the allocation of the table is already
kind of ondemand, just need to split the detection of these two kind
of table.

Mean while I also remember we once discussed about splitting the
accounting for vswap / physical swap? If we went that approach we
don't need to treat memcg_table specially.

> - Widen swap_info_struct->max to unsigned long. The vswap device's
>   max is currently clamped to ALIGN_DOWN(UINT_MAX, SWAPFILE_CLUSTER)
>   (~16 TiB) to fit in unsigned int. 16 TiB is small for vswap,
>   especially when we're getting increasingly big machines memory-wise.

This should be very easy to do, just replace unsigned int with
unsigned long, a lot of place to touch though :)

Re: [RFC PATCH 0/5] mm, swap: Virtual Swap Space (Swap Table Edition)

Posted by Nhat Pham 6 days, 21 hours ago

On Mon, Jun 1, 2026 at 12:34 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Thu, May 28, 2026 at 02:29:24PM +0800, Nhat Pham wrote:
> > Based on: mm-unstable @ 444fc9435e57 + swap-table phase IV v5 [2].
> >
> > I manually adapted Kairui's ghost device implementation (from [4])
> > for my vswap device. I've credited him as Co-developed-by on Patch I
> > since a substantial portion of the dynamic-cluster infrastructure is
> > his (I did propose the idea of using xarray/radix tree for dynamic
> > swap clusters allocation and management though :P).
> >
> > >From here on out, for simplicity, I will refer to swap table phase IV
> > as "P4", and the older v6 virtual swap space implementation as "v6".
> >
>
> ...
>
> >
> > This series reimplements the virtual swap space concept (see [1])
> > on top of Kairui Song's swap table infrastructure, on top of [2]
> > and in accordance with his proposal in [3]. The proposal's idea
> > is interesting, so I decided to give it a shot myself. I'm still not
> > 100% sure that this is bug-proof, but hey, it compiles, and has
> > not crashed in my simple stress testing :)
> >
> > The prototype here is feature-complete relative to the swap-table P4
> > baseline — swapout, swapin, freeing, swapoff, zswap writeback, zswap
> > shrinker, memcg charging, and THP swapin all work for
> > both vswap and direct-physical entries — and satisfies all three
> > requirements above: no backend coupling (zswap/zero entries hold no
> > physical slot), dynamic swap space (clusters allocated on demand via
> > xarray, no static provisioning), and efficient backend transfer
> > (in-place vtable updates, no PTE/rmap walking).
> >
> > II. Design
> >
> > With vswap, pages are assigned virtual swap entries on a ghost device
> > with no backing storage. These entries are backed by zswap, zero pages,
> > or (lazily) physical swap slots. Physical backing is allocated only
> > when needed — on zswap writeback or reclaim writeout, after the rmap
> > step.
> >
> > Compared to the standalone v6 implementation [1], which introduces a
> > 24-byte per-entry swap descriptor and its own cluster allocator, this
> > edition uses swap_table infrastructure, and share a lot of the allocator
> > logic. Per-slot metadata is stored in a tag-encoded virtual_table
> > (atomic_long_t, 8 bytes per slot), and physical clusters store
> > Pointer-tagged rmap entries in the swap_table for reverse lookup back to
> > the virtual cluster.
> >
> > Here are some data layout diagrams:
> >
> >   Case 1: vswap entry (virtualized)
> >
> >   PTE                  swap_cluster_info_dynamic
> >   vswap_entry          +-------------------------+
> >   (swp_entry_t) ------>| swap_cluster_info (ci)  |
> >                        | +--------------------+  |
> >                        | | swap_table         |  |
> >                        | |   PFN / Shadow     |  |
> >                        | | memcg_table        |  |
> >                        | | count,flags,order  |  |
> >                        | | lock, list         |  |
> >                        | +--------------------+  |
> >                        |                         |
> >                        | virtual_table           |
> >                        | +--------------------+  |
> >                        | | NONE               |  |
> >                        | | PHYS               |  |
> >                        | | ZERO               |  |
> >                        | | ZSWAP(entry*)      |  |
> >                        | | FOLIO(folio*)      |  |
> >                        | +--------------------+  |
> >                        +-------------------------+
> >                               |
> >                               | PHYS resolves to
> >                               v
> >                        PHYSICAL CLUSTER (swap_cluster_info)
> >                        +--------------------------+
> >                        | swap_table per-slot:     |
> >                        |   NULL   - free          |
> >                        |   PFN    - cached folio  |
> >                        |   Shadow - swapped out   |
> >                        |   Pointer- vswap rmap    |
> >                        |   Bad    - unusable      |
> >                        |                          |
> >                        | Vswap-backing slot:      |
> >                        |   Pointer(C|swp_entry_t) |
> >                        |     rmap back to vswap   |
> >                        +--------------------------+
> >
> >   Case 2: direct-mapped physical entry (no vswap)
> >
> >   PTE                  PHYSICAL CLUSTER (swap_cluster_info)
> >   phys_entry           +--------------------------+
> >   (swp_entry_t) ------>| swap_table per-slot:     |
> >                        |   NULL   - free          |
> >                        |   PFN    - cached folio  |
> >                        |   Shadow - swapped out   |
> >                        |   Bad    - unusable      |
> >                        +--------------------------+
> >
> > struct swap_cluster_info_dynamic {
> >     struct swap_cluster_info ci;       /* swap_table, lock, etc. */
> >     unsigned int index;                /* position in xarray */
> >     struct rcu_head rcu;               /* kfree_rcu deferred free */
> >     atomic_long_t *virtual_table;      /* backend info, 8 B/slot */
> > };
> >
> > Each vswap cluster (swap_cluster_info_dynamic) extends the classic
> > swap_cluster_info struct with a virtual_table array that stores the
> > backend information for each virtual swap entry in the cluster. Each
> > entry is tag-encoded in the low 3 bits to indicate backend types:
> >
> >   NONE:   |----- 0000 ------|000|  free / unbacked
> >   PHYS:   |-- (type:5,off:N)|001|  on a physical swapfile (shifted)
> >   ZERO:   |----- 0000 ------|010|  zero-filled page
> >   ZSWAP:  |--- zswap_entry* |011|  compressed in zswap
> >   FOLIO:  |--- folio* ------|100|  in-memory folio
>
> Thanks for trying this approach!

Thanks for the suggestions. I hope going forward we have sth concrete
to tinker with, rather than abstractions :P

>
> For the format part, PHYS don't need that much bits I think,
> so by slightly adjust the format vswap device could be share
> mostly the same format with ordinary device.
>
> For example typical modern system don't have a address space larger
> than 52 bit. (Even with full 64 bits used for addressing, shift it
> by 12 we get 52). Plus 5 for type, you get 57, so you can have a
> marker that should work as long as it shorter than 1000000 for PHYS,
> and shared for all table format since it's not in conflict with
> anything. You have also use a few extra bits so a single swap space
> can be 8 times larger than RAM space, and since we can help
> multiple swap type I think that should be far than enough?
>
> Then you have Shadow back at 001, and zero bit in shadow. The only
> special one is Zswap, which will be 100 now, and that's exactly the
> reserved pointer format in current swap table format, on seeing
> si->flags & VSWAP && is_pointer(swp_tb) you know that's zswap :)

Are you suggesting we merge the virtual table with main swap table?

Man, I'd love to do this. There is a problem though - we have a case
where we occupy both backing physical swap AND swap cache. Do you
think we can fit both the physical swap slot handle and the swap cache
PFN into the same slot in virtual table? Maybe with some expanding...?

Another option is we can be a bit smart about it - if a virtual swap
entry is in swap cache AND occupies physical swap slot, then put the
folio at the physical swap's table, use folio->swap as the rmap.

(I think you recommend this approach somewhere but for the life of me
I can't find the reference - apologies if I'm putting words into your
mouth :))

But this is a bit more complicated - extra care is needed for rmap
handling at the physical swap layer, and swap cache handling at the
virtual swap layer. Maybe a follow-up? :)

>
> Folio / PFN can still be 010 as in the current swap table format.
>
> Then everything seems clean and aligned, no more special handling
> for vswap needed, there are detailed to sort out, but it should work.
>
> > - Pointer-tagged swap_table on physical clusters for rmap (physical
> >   -> virtual) lookup.
>
> Or reuse the PHYS format (rename it maybe) since point back to vswap
> is also pointing to a si.

Noted. I'm just doing the simplest thing right now - working
prototype. I mean, we have enough bits :)

>
> > III. Follow-ups:
> >
> > In no particular order (and most of which can be done as follow-up
> > patch series rather than shoving everything in the initial landing):
> >
> > - More thorough stress testing is very much needed.
> >
> > - Performance benchmarks to make sure I don't accidentally regress
> >   the vswap-less case, and that the vswap's case performance is
> >   good. I suspect I will have to port a lot of the
> >   optimizations I implemented in v6 over here - some of the
> >   inefficiencies are inherent in any swap virtualization, and
> >   would require the same fix (for e.g the MRU cluster caching
> >   for faster cluster lookup - see [8] and [9]).
>
> This could be imporved by per-si percpu cluster. Both YoungJun's
> tiering and Baoquan's previous swap ops mentioned this is needed,
> and now vswap also need that. If the vswap is also a si, then it will
> make use of this too.

Yeah I made the same recommendation when I review swap tier last week:

https://lore.kernel.org/all/CAKEwX=N2XcMHN1jatppOk6wnmz-Shab5XMtTtzgYOzRvU_6YFw@mail.gmail.com/

I like it, but yeah it will be complicated. That said, I think not
fixing the fast path for tiering/vswap will seriously restrict their
usefulness. We don't want to go back to the old swap allocator days :)

We can also revive the swap slot cache, but why do it if we can
repurpose your proven design, and just extend it a bit for multiple
tiers/swap devices? :)

>
> YoungJun posted this a few month before:
> https://lore.kernel.org/linux-mm/20260131125454.3187546-5-youngjun.park@lge.com/
>
> The concern is that some locking contention could be heavier, or maybe
> that's just a hypothetical problem though.

I don't think it's hypothetical. At least with vswap, it's very easy
to get into a state where the shared per-cpu cache gets invalidated
constantly if phys swap and vswap allocation alternates (which is
actually very possible under heavy memory pressure), hammering the
slow paths...

>
> >
> > - Runtime enable/disable of the vswap device. To be honest, I don't
> >   know if there is a value in this. My preference is vswap can be
> >   optimized to the point that any overhead is negligible. Failing that,
> >   maybe we can come up with some simple heuristics that automatically
> >   decides for users?
> >
> >   In this RFC, CONFIG_VSWAP=y means the vswap device is always created at
> >   boot, and CONFIG_VSWAP=n means the vswap device is never created. This
> >   *might* be enough just on its own.
> >
> >   Is a runtime knob (sysfs or sysctl) worth the complexity beyond
> >   these heuristics? I'm not sure yet. Maintaining both cases
>
> I checked the code and I think it's not hard to do, patch 1 already
> handling the meta data dynamically, everything will still just work
> even if you remove vswap at runtime. The rest of patches need adaption
> but might not end up being complex, it other comments here
> are considered.

Yeah, it's not terribily hard to do. I'm more wondering if it's worth
the effort, both for the implementer and the user :)

As I said here, if we want vswap, just enable it at boot time and get
a vast (but dynamic) device. We can make it optional per-cgroup
through Youngjun's interface, and that would be good enough?

>
> For patch 2, a few routines like vswap_can_swapin_thp seems not
> needed or should be moved to __swap_cache_alloc? VSWAP_FOLIO is
> same as swap cache folio check, which is already covered. Same for
> zero checking, and VSWAP_NONE which is same as swap count check
> I think. That way we not only save a lot of code, we also no
> longer need to treat vswap specially.

Unfortunately, I think a lot of this complexity is still needed. Vswap
adds a new layer, which means new complications :)

For instance, I think you still need vswap_can_swapin_thp. It
basically enforces that the backend must be something
swap_read_folio() can handle. That means:

1. No zswap.

2. No mixed backend.

3. If it's phys swap, it must be contiguous in the same device.

The vswap entries might look contiguous in the virtual address space,
but completely unsuitable for the backend.... This is a new
complication that does not previously exist for non-virtualized swap,
so it needs more code to handle it...

I use vswap_can_swapin_thp() in two places - first when we try to find
a candidate range of PTEs for swap in (swap_pte_batch). And then
double check after we've added into swap cache, as the backend can
change arbitrarily before swap cache folio is locked.

Similar for some of the other helpers. For instance,
vswap_swapfile_backed() is needed in certain optimizations or for
correctness's sake, etc.

The VSWAP_FOLIO is redundant, I agree. It's just for convenient. It
represents the state of vswap where it's still in swap (swap cache),
but only backed by the swap cache folio, and not any other backends.
Technically, you can represent it with VSWAP_NONE + a check in the
swap cache field to see if there is a folio there.

If it's not too much code in v2 I can try to remove it (while leaving
behind a comment explains the state).

>
> If you keep the format compatible with what we already have
> as the earlier comment mentions, a large portion of this part
> might be unneeded.
>
> >   at runtime also has overhead for checking as well, and some of the
> >   checks are not cheap :)
>
> I also noticed the new introduced swap_read_folio_phys in patch 3, so
> this actually can be done using Baoquan's swapops idea which is now
> part of Christoph's swap batching:

I'm actually trying to have swap_read_folio() work for both vswap and
phys swpa just with a biiit of if-else. But yeah this might be cleaner
:)

>
> https://lore.kernel.org/linux-mm/20260528124559.2566481-9-hch@lst.de/#r
>
> That series is focusing on batching and better performance but swapops
> was also proposed as a way to solve the virtual layer, makes it possible
> to have vswap as one kind of swapops which is Chris talked a lot about:
>
> https://lore.kernel.org/linux-mm/aZiFvzlBJiYBUDre@MiWiFi-R3L-srv/
>
> Following this, we could have something like:
>
> const struct swap_ops swap_vswap_ops = {
>         .submit_write           = swap_vswap_submit_write,
>         .submit_read            = swap_vswap_submit_read,
> };
>
> The move the folio_realloc_swap in swap_vswap_submit_write.
>
> Merge of IO might be moved to lower phyiscal level for vswap.
>
> Another gain is that the memory usage and CPU overhead will be
> lower with only one layer. As I'm recently trying to offload swap
> dataplane off CPU so the CPU won't touch the data at all, the
> overhead will be purely by swap itself, plus some mm overhead.
> Things like that and IO optimization above and could make swap
> subsystem more and more performance sensitive so we have cases
> that needs only one layer.

Yeah I can take a look at this. This prototype purely based on your P4
and with just a bit of hackery to transfer my v6 implementation over.
Not very clean - just a PoC. If everyone is happy I can put more time
in :)

>
> >
> > - Defer per-cluster memcg_table and zeromap allocation on physical
> >   clusters. A physical swap cluster backing vswap entries only do
> >   not really need their memcg_table, but the current design forces
> >   us to allocate it anyway. This is a waste of memory, and is an
> >   overhead regression compared to my older design on the zswap-only
> >   case, which Johannes has pointed out multiple times (see [6]),
> >   and is one of the biggest reasons why I have not been satisfied
> >   with this approach thus far. It honestly is a bit of a
> >   deal-breaker...
> >
> >   That said, I think I might be able to allocate them on demand, i.e
> >   only when the first direct-mapped slot is allocated on that cluster.
> >   That will give us the best of BOTH worlds, for both the vswap and
> >   directly-mapped physical swap cases. No promises, but I will try
> >   (if this approach is good enough for all parties).
>
> Zero map is not really a problem when it's just a inlined bit I think.

Yeah no problemo indeed. I saw a zeromap field in struct phys swap
cluster, so I put this in the plan as "to remove later", but I just
took a look and realized it's only for cases where you cant fit the
bit in swap table. I'm gating vswap for 64 bit for now, so should not
be a problem.

> For memcg table allocation, on demand seems a good idea, and actually
> we are not far from there, I tried to generalize the
> alloc-then-retry-sleep-alloc in swap_alloc_table but still not generic
> enough I guess.. Good new is the allocation of the table is already
> kind of ondemand, just need to split the detection of these two kind
> of table.

I have a prototype of this, but I have not tested, so I do not want to
send it out. :)

TLDR is - I still want to record the memcg for vswap (just not charge
it towards the counter). So we still need memcg_table at both level,
generally - just not allocating until needed (basically if a physical
swap slot in the cluster is directly mapped into PTE). You can kinda
tell, since you pass the folio into the allocation path - with some
care you can distinguish between:

1. Virtual swap, or directly maped physical swap -> need memcg_table

2. Physical swap, backing vswap -> does not memcg_table.

Another alternative is you can defer this allocation until the point
where you have to do the charging action. But then you have to be
careful with failure handling, and need to backoff ya di da di da.
Funsies.

I think I did a mixed of these 2 strategies. Anyway, I'll include the
patch in v2 (if folks like this approach).

>
> Mean while I also remember we once discussed about splitting the
> accounting for vswap / physical swap? If we went that approach we
> don't need to treat memcg_table specially.

For the charging behavior, I already have a patch for it actually in
this series (just not the dynamic allocation of the memcg_table field
yet).

Basically:

1. For vswap entry, not backed by phys swap: record swap memcg, hold
reference to pin the memcg, but not charging towards swap.current.

2. For phys swap backing vswap: charging towards swap.current, but
does not record the memcg in its memcg_table, nor does it hold
reference to memcg (its vswap entry holds the reference already)

2. For phys swap directly mapped to PTE: charges, records, and holds reference.

The motivation here is I do not want vswap entry to shares the same
limit as phys swap counter. If we think of it as "infinite" or
"dynamic", it should not be capped at all, but even if it is charged,
it should be something separate.

>
> > - Widen swap_info_struct->max to unsigned long. The vswap device's
> >   max is currently clamped to ALIGN_DOWN(UINT_MAX, SWAPFILE_CLUSTER)
> >   (~16 TiB) to fit in unsigned int. 16 TiB is small for vswap,
> >   especially when we're getting increasingly big machines memory-wise.
>
> This should be very easy to do, just replace unsigned int with
> unsigned long, a lot of place to touch though :)

Agree. I'm just lazy, and this sounds like a simple patch as a
follow-up. This RFC is already 2000 LoCs - I do not want to burden
reviewers with extra useless details :)

Re: [RFC PATCH 0/5] mm, swap: Virtual Swap Space (Swap Table Edition)

Posted by Kairui Song 6 days, 19 hours ago

On Mon, Jun 1, 2026 at 11:57 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Mon, Jun 1, 2026 at 12:34 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > For the format part, PHYS don't need that much bits I think,
> > so by slightly adjust the format vswap device could be share
> > mostly the same format with ordinary device.
> >
> > For example typical modern system don't have a address space larger
> > than 52 bit. (Even with full 64 bits used for addressing, shift it
> > by 12 we get 52). Plus 5 for type, you get 57, so you can have a
> > marker that should work as long as it shorter than 1000000 for PHYS,
> > and shared for all table format since it's not in conflict with
> > anything. You have also use a few extra bits so a single swap space
> > can be 8 times larger than RAM space, and since we can help
> > multiple swap type I think that should be far than enough?
> >
> > Then you have Shadow back at 001, and zero bit in shadow. The only
> > special one is Zswap, which will be 100 now, and that's exactly the
> > reserved pointer format in current swap table format, on seeing
> > si->flags & VSWAP && is_pointer(swp_tb) you know that's zswap :)
>
> Are you suggesting we merge the virtual table with main swap table?
>
> Man, I'd love to do this. There is a problem though - we have a case
> where we occupy both backing physical swap AND swap cache. Do you
> think we can fit both the physical swap slot handle and the swap cache
> PFN into the same slot in virtual table? Maybe with some expanding...?

I don't really get why we would need to do that? If you put the PFN
info in the virtual / upper layer, then the count info, locking, and
all swap IO synchronization (via folio lock), dup (current protected
by ci lock / folio lock), and allocation (folio_alloc_swap), are all
handled in this layer.

The physical / lower layer will just hold a reverse entry on
folio_realloc_swap, or no entry at all (no physical layer used, zswap,
or after swap allocation but before IO) right?

Looking up the actual folio from the physical layer will be a bit
slower since it needs to resolve the reverse entry, but the only place
we need to do that is things like migrate, compaction (none of them
exist yet) which seems totally fine?

> > This could be imporved by per-si percpu cluster. Both YoungJun's
> > tiering and Baoquan's previous swap ops mentioned this is needed,
> > and now vswap also need that. If the vswap is also a si, then it will
> > make use of this too.
>
> Yeah I made the same recommendation when I review swap tier last week:
>
> https://lore.kernel.org/all/CAKEwX=N2XcMHN1jatppOk6wnmz-Shab5XMtTtzgYOzRvU_6YFw@mail.gmail.com/
>
> I like it, but yeah it will be complicated. That said, I think not
> fixing the fast path for tiering/vswap will seriously restrict their
> usefulness. We don't want to go back to the old swap allocator days :)

Thanks. Not too complicated, actually our internal kernel
implementation still using si->percpu cluster, and use a counter for
the rotation and each order have a counter :P, it's a bit ugly but
works fine. It still serves pretty well just like the global percpu
cluster, YoungJun's previous per ci percpu cluster also still provides
the fast path, many ways to do that.

>
> >
> > YoungJun posted this a few month before:
> > https://lore.kernel.org/linux-mm/20260131125454.3187546-5-youngjun.park@lge.com/
> >
> > The concern is that some locking contention could be heavier, or maybe
> > that's just a hypothetical problem though.
>
> I don't think it's hypothetical. At least with vswap, it's very easy
> to get into a state where the shared per-cpu cache gets invalidated
> constantly if phys swap and vswap allocation alternates (which is
> actually very possible under heavy memory pressure), hammering the
> slow paths...

I mean if the per-cpu cache is moved to si level, then whoever enters
the allocation path of a si will almost always get a stable percpu
cache to use, even if the last used si changes.

We might better have a more explicit per si fast path for that. The
plist rotation should better be done in a different (will be even
better if lockless) way.

>
> >
> > >
> > > - Runtime enable/disable of the vswap device. To be honest, I don't
> > >   know if there is a value in this. My preference is vswap can be
> > >   optimized to the point that any overhead is negligible. Failing that,
> > >   maybe we can come up with some simple heuristics that automatically
> > >   decides for users?
> > >
> > >   In this RFC, CONFIG_VSWAP=y means the vswap device is always created at
> > >   boot, and CONFIG_VSWAP=n means the vswap device is never created. This
> > >   *might* be enough just on its own.
> > >
> > >   Is a runtime knob (sysfs or sysctl) worth the complexity beyond
> > >   these heuristics? I'm not sure yet. Maintaining both cases
> >
> > I checked the code and I think it's not hard to do, patch 1 already
> > handling the meta data dynamically, everything will still just work
> > even if you remove vswap at runtime. The rest of patches need adaption
> > but might not end up being complex, it other comments here
> > are considered.
>
> Yeah, it's not terribily hard to do. I'm more wondering if it's worth
> the effort, both for the implementer and the user :)
>
> As I said here, if we want vswap, just enable it at boot time and get
> a vast (but dynamic) device. We can make it optional per-cgroup
> through Youngjun's interface, and that would be good enough?
>
> >
> > For patch 2, a few routines like vswap_can_swapin_thp seems not
> > needed or should be moved to __swap_cache_alloc? VSWAP_FOLIO is
> > same as swap cache folio check, which is already covered. Same for
> > zero checking, and VSWAP_NONE which is same as swap count check
> > I think. That way we not only save a lot of code, we also no
> > longer need to treat vswap specially.
>
> Unfortunately, I think a lot of this complexity is still needed. Vswap
> adds a new layer, which means new complications :)
>
> For instance, I think you still need vswap_can_swapin_thp. It
> basically enforces that the backend must be something
> swap_read_folio() can handle. That means:
>
> 1. No zswap.
>
> 2. No mixed backend.

If mixed backend means phys vs zero vs zswap, then we already have
part of that covered with the current swap cache except for the phys
part (zswap part seems very doable with fujunjie's work).
swap_cache_alloc_folio will ensure there is no mixed zerobit, it can
be easily extended to ensure there is no mixed zswap as well
(according to what I've learned from fujunjie's code). Similar logic
for phys detection I think.

> > For memcg table allocation, on demand seems a good idea, and actually
> > we are not far from there, I tried to generalize the
> > alloc-then-retry-sleep-alloc in swap_alloc_table but still not generic
> > enough I guess.. Good new is the allocation of the table is already
> > kind of ondemand, just need to split the detection of these two kind
> > of table.
>
> I have a prototype of this, but I have not tested, so I do not want to
> send it out. :)
>
> TLDR is - I still want to record the memcg for vswap (just not charge
> it towards the counter). So we still need memcg_table at both level,
> generally - just not allocating until needed (basically if a physical
> swap slot in the cluster is directly mapped into PTE). You can kinda
> tell, since you pass the folio into the allocation path - with some
> care you can distinguish between:
>
> 1. Virtual swap, or directly maped physical swap -> need memcg_table
>
> 2. Physical swap, backing vswap -> does not memcg_table.
>
> Another alternative is you can defer this allocation until the point
> where you have to do the charging action. But then you have to be
> careful with failure handling, and need to backoff ya di da di da.
> Funsies.
>
> I think I did a mixed of these 2 strategies. Anyway, I'll include the
> patch in v2 (if folks like this approach).
>
> >
> > Mean while I also remember we once discussed about splitting the
> > accounting for vswap / physical swap? If we went that approach we
> > don't need to treat memcg_table specially.
>
> For the charging behavior, I already have a patch for it actually in
> this series (just not the dynamic allocation of the memcg_table field
> yet).
>
> Basically:
>
> 1. For vswap entry, not backed by phys swap: record swap memcg, hold
> reference to pin the memcg, but not charging towards swap.current.

Maybe you don't need to record memcg here since folio->memcg already
have that info?

I previously had a patch:
https://lore.kernel.org/linux-mm/20260220-swap-table-p4-v1-7-104795d19815@tencent.com/

The defers the recording of memcg, the behavior is almost identical to
before, but charging & recording should be cleaner and you don't need
to record memcg at allocation time hence maybe reduce the possibility
of pinning a memcg. I didn't include that in P4 just to reduce LOC,
maybe can be resent or included.

> 2. For phys swap backing vswap: charging towards swap.current, but
> does not record the memcg in its memcg_table, nor does it hold
> reference to memcg (its vswap entry holds the reference already)
>
> 2. For phys swap directly mapped to PTE: charges, records, and holds reference.
>
> The motivation here is I do not want vswap entry to shares the same
> limit as phys swap counter. If we think of it as "infinite" or
> "dynamic", it should not be capped at all, but even if it is charged,
> it should be something separate.

Good to know, it's not too messy to make everything dynamic, but if
our ultimate goal is to account for them separately, maybe we can save
some effort here.

>
> >
> > > - Widen swap_info_struct->max to unsigned long. The vswap device's
> > >   max is currently clamped to ALIGN_DOWN(UINT_MAX, SWAPFILE_CLUSTER)
> > >   (~16 TiB) to fit in unsigned int. 16 TiB is small for vswap,
> > >   especially when we're getting increasingly big machines memory-wise.
> >
> > This should be very easy to do, just replace unsigned int with
> > unsigned long, a lot of place to touch though :)
>
> Agree. I'm just lazy, and this sounds like a simple patch as a
> follow-up. This RFC is already 2000 LoCs - I do not want to burden
> reviewers with extra useless details :)

No problem, good to see progress here :)

Re: [RFC PATCH 0/5] mm, swap: Virtual Swap Space (Swap Table Edition)

Posted by Nhat Pham 6 days, 19 hours ago

On Mon, Jun 1, 2026 at 10:45 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Mon, Jun 1, 2026 at 11:57 PM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > On Mon, Jun 1, 2026 at 12:34 AM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > For the format part, PHYS don't need that much bits I think,
> > > so by slightly adjust the format vswap device could be share
> > > mostly the same format with ordinary device.
> > >
> > > For example typical modern system don't have a address space larger
> > > than 52 bit. (Even with full 64 bits used for addressing, shift it
> > > by 12 we get 52). Plus 5 for type, you get 57, so you can have a
> > > marker that should work as long as it shorter than 1000000 for PHYS,
> > > and shared for all table format since it's not in conflict with
> > > anything. You have also use a few extra bits so a single swap space
> > > can be 8 times larger than RAM space, and since we can help
> > > multiple swap type I think that should be far than enough?
> > >
> > > Then you have Shadow back at 001, and zero bit in shadow. The only
> > > special one is Zswap, which will be 100 now, and that's exactly the
> > > reserved pointer format in current swap table format, on seeing
> > > si->flags & VSWAP && is_pointer(swp_tb) you know that's zswap :)
> >
> > Are you suggesting we merge the virtual table with main swap table?
> >
> > Man, I'd love to do this. There is a problem though - we have a case
> > where we occupy both backing physical swap AND swap cache. Do you
> > think we can fit both the physical swap slot handle and the swap cache
> > PFN into the same slot in virtual table? Maybe with some expanding...?
>
> I don't really get why we would need to do that? If you put the PFN
> info in the virtual / upper layer, then the count info, locking, and
> all swap IO synchronization (via folio lock), dup (current protected
> by ci lock / folio lock), and allocation (folio_alloc_swap), are all
> handled in this layer.
>
> The physical / lower layer will just hold a reverse entry on
> folio_realloc_swap, or no entry at all (no physical layer used, zswap,
> or after swap allocation but before IO) right?
>
> Looking up the actual folio from the physical layer will be a bit
> slower since it needs to resolve the reverse entry, but the only place
> we need to do that is things like migrate, compaction (none of them
> exist yet) which seems totally fine?

All of this is correct, but consider swaping in a vswap entry backed
by pswap. There are cases where you still want to maintain the pswap
slots around backing vswap entry, while having the swap cache folio as
well.

For e.g, at swap in time, we add the folio into the swap cache. First
of all, we need to hold on to the physical swap slot for IO step. But
even after IO succeeds, there are cases where you would still like to
keep physical swap slots around (for e.g, to avoid swapping out again
if the folio is only speculatively fetched).

So you have to make sure we have space for both the physical swap
slot, and the swap cache folio's PFN at the same time for each vswap
entry. So we still need the vtable extension (well maybe the other
approach I mentioned could work, but I'm not 100% sure).

>
> > > This could be imporved by per-si percpu cluster. Both YoungJun's
> > > tiering and Baoquan's previous swap ops mentioned this is needed,
> > > and now vswap also need that. If the vswap is also a si, then it will
> > > make use of this too.
> >
> > Yeah I made the same recommendation when I review swap tier last week:
> >
> > https://lore.kernel.org/all/CAKEwX=N2XcMHN1jatppOk6wnmz-Shab5XMtTtzgYOzRvU_6YFw@mail.gmail.com/
> >
> > I like it, but yeah it will be complicated. That said, I think not
> > fixing the fast path for tiering/vswap will seriously restrict their
> > usefulness. We don't want to go back to the old swap allocator days :)
>
> Thanks. Not too complicated, actually our internal kernel
> implementation still using si->percpu cluster, and use a counter for
> the rotation and each order have a counter :P, it's a bit ugly but
> works fine. It still serves pretty well just like the global percpu
> cluster, YoungJun's previous per ci percpu cluster also still provides
> the fast path, many ways to do that.

Sounds like something that should be upstreamed? ;)

I was concerned that you might deem the overhead too high (so I came
up with the per-tier percpu caching), but honestly I like it a lot.

>
> >
> > >
> > > YoungJun posted this a few month before:
> > > https://lore.kernel.org/linux-mm/20260131125454.3187546-5-youngjun.park@lge.com/
> > >
> > > The concern is that some locking contention could be heavier, or maybe
> > > that's just a hypothetical problem though.
> >
> > I don't think it's hypothetical. At least with vswap, it's very easy
> > to get into a state where the shared per-cpu cache gets invalidated
> > constantly if phys swap and vswap allocation alternates (which is
> > actually very possible under heavy memory pressure), hammering the
> > slow paths...
>
> I mean if the per-cpu cache is moved to si level, then whoever enters
> the allocation path of a si will almost always get a stable percpu
> cache to use, even if the last used si changes.
>
> We might better have a more explicit per si fast path for that. The
> plist rotation should better be done in a different (will be even
> better if lockless) way.

Exactly! That's what I'm thinking too. Basically I'm special-casing
for vswap here, but if you're happy with generalizing this for every
si I'm happy with it.

>
> >
> > >
> > > >
> > > > - Runtime enable/disable of the vswap device. To be honest, I don't
> > > >   know if there is a value in this. My preference is vswap can be
> > > >   optimized to the point that any overhead is negligible. Failing that,
> > > >   maybe we can come up with some simple heuristics that automatically
> > > >   decides for users?
> > > >
> > > >   In this RFC, CONFIG_VSWAP=y means the vswap device is always created at
> > > >   boot, and CONFIG_VSWAP=n means the vswap device is never created. This
> > > >   *might* be enough just on its own.
> > > >
> > > >   Is a runtime knob (sysfs or sysctl) worth the complexity beyond
> > > >   these heuristics? I'm not sure yet. Maintaining both cases
> > >
> > > I checked the code and I think it's not hard to do, patch 1 already
> > > handling the meta data dynamically, everything will still just work
> > > even if you remove vswap at runtime. The rest of patches need adaption
> > > but might not end up being complex, it other comments here
> > > are considered.
> >
> > Yeah, it's not terribily hard to do. I'm more wondering if it's worth
> > the effort, both for the implementer and the user :)
> >
> > As I said here, if we want vswap, just enable it at boot time and get
> > a vast (but dynamic) device. We can make it optional per-cgroup
> > through Youngjun's interface, and that would be good enough?
> >
> > >
> > > For patch 2, a few routines like vswap_can_swapin_thp seems not
> > > needed or should be moved to __swap_cache_alloc? VSWAP_FOLIO is
> > > same as swap cache folio check, which is already covered. Same for
> > > zero checking, and VSWAP_NONE which is same as swap count check
> > > I think. That way we not only save a lot of code, we also no
> > > longer need to treat vswap specially.
> >
> > Unfortunately, I think a lot of this complexity is still needed. Vswap
> > adds a new layer, which means new complications :)
> >
> > For instance, I think you still need vswap_can_swapin_thp. It
> > basically enforces that the backend must be something
> > swap_read_folio() can handle. That means:
> >
> > 1. No zswap.
> >
> > 2. No mixed backend.
>
> If mixed backend means phys vs zero vs zswap, then we already have
> part of that covered with the current swap cache except for the phys
> part (zswap part seems very doable with fujunjie's work).
> swap_cache_alloc_folio will ensure there is no mixed zerobit, it can
> be easily extended to ensure there is no mixed zswap as well
> (according to what I've learned from fujunjie's code). Similar logic
> for phys detection I think.

Yeah it's basically generalizing that check, and handle the case where
we can have indirection.

I mean I can open-code it, but it has to be there :) And I figure it
might be useful to check this opportunistically (at swap_pte_batch,
even if it's not guaranteed to be correct down the line) before we
even attempt to allocate a large folio etc. to avoid large folio
allocation.

>
> > > For memcg table allocation, on demand seems a good idea, and actually
> > > we are not far from there, I tried to generalize the
> > > alloc-then-retry-sleep-alloc in swap_alloc_table but still not generic
> > > enough I guess.. Good new is the allocation of the table is already
> > > kind of ondemand, just need to split the detection of these two kind
> > > of table.
> >
> > I have a prototype of this, but I have not tested, so I do not want to
> > send it out. :)
> >
> > TLDR is - I still want to record the memcg for vswap (just not charge
> > it towards the counter). So we still need memcg_table at both level,
> > generally - just not allocating until needed (basically if a physical
> > swap slot in the cluster is directly mapped into PTE). You can kinda
> > tell, since you pass the folio into the allocation path - with some
> > care you can distinguish between:
> >
> > 1. Virtual swap, or directly maped physical swap -> need memcg_table
> >
> > 2. Physical swap, backing vswap -> does not memcg_table.
> >
> > Another alternative is you can defer this allocation until the point
> > where you have to do the charging action. But then you have to be
> > careful with failure handling, and need to backoff ya di da di da.
> > Funsies.
> >
> > I think I did a mixed of these 2 strategies. Anyway, I'll include the
> > patch in v2 (if folks like this approach).
> >
> > >
> > > Mean while I also remember we once discussed about splitting the
> > > accounting for vswap / physical swap? If we went that approach we
> > > don't need to treat memcg_table specially.
> >
> > For the charging behavior, I already have a patch for it actually in
> > this series (just not the dynamic allocation of the memcg_table field
> > yet).
> >
> > Basically:
> >
> > 1. For vswap entry, not backed by phys swap: record swap memcg, hold
> > reference to pin the memcg, but not charging towards swap.current.
>
> Maybe you don't need to record memcg here since folio->memcg already
> have that info?
>
> I previously had a patch:
> https://lore.kernel.org/linux-mm/20260220-swap-table-p4-v1-7-104795d19815@tencent.com/
>
> The defers the recording of memcg, the behavior is almost identical to
> before, but charging & recording should be cleaner and you don't need
> to record memcg at allocation time hence maybe reduce the possibility
> of pinning a memcg. I didn't include that in P4 just to reduce LOC,
> maybe can be resent or included.

That works-ish when the folio is sitll in swap cache, but say if it's
vswap backed by zswap (and the swap cache folio has been reclaimed),
you need a place to store the memcg, no?

Just seems cleaner to centralize this info at vswap layer when it is
presented, for now anyway, rather than juggling this on a per-backend
basis.

>
> > 2. For phys swap backing vswap: charging towards swap.current, but
> > does not record the memcg in its memcg_table, nor does it hold
> > reference to memcg (its vswap entry holds the reference already)
> >
> > 2. For phys swap directly mapped to PTE: charges, records, and holds reference.
> >
> > The motivation here is I do not want vswap entry to shares the same
> > limit as phys swap counter. If we think of it as "infinite" or
> > "dynamic", it should not be capped at all, but even if it is charged,
> > it should be something separate.
>
> Good to know, it's not too messy to make everything dynamic, but if
> our ultimate goal is to account for them separately, maybe we can save
> some effort here.

It's not terrible, TBH.

Re: [RFC PATCH 0/5] mm, swap: Virtual Swap Space (Swap Table Edition)

Posted by Kairui Song 6 days, 10 hours ago

On Tue, Jun 2, 2026 at 2:06 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Mon, Jun 1, 2026 at 10:45 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Mon, Jun 1, 2026 at 11:57 PM Nhat Pham <nphamcs@gmail.com> wrote:
> > >
> > > Are you suggesting we merge the virtual table with main swap table?
> > >
> > > Man, I'd love to do this. There is a problem though - we have a case
> > > where we occupy both backing physical swap AND swap cache. Do you
> > > think we can fit both the physical swap slot handle and the swap cache
> > > PFN into the same slot in virtual table? Maybe with some expanding...?
> >
> > I don't really get why we would need to do that? If you put the PFN
> > info in the virtual / upper layer, then the count info, locking, and
> > all swap IO synchronization (via folio lock), dup (current protected
> > by ci lock / folio lock), and allocation (folio_alloc_swap), are all
> > handled in this layer.
> >
> > The physical / lower layer will just hold a reverse entry on
> > folio_realloc_swap, or no entry at all (no physical layer used, zswap,
> > or after swap allocation but before IO) right?
> >
> > Looking up the actual folio from the physical layer will be a bit
> > slower since it needs to resolve the reverse entry, but the only place
> > we need to do that is things like migrate, compaction (none of them
> > exist yet) which seems totally fine?
>
> All of this is correct, but consider swaping in a vswap entry backed
> by pswap. There are cases where you still want to maintain the pswap
> slots around backing vswap entry, while having the swap cache folio as
> well.
>
> For e.g, at swap in time, we add the folio into the swap cache. First
> of all, we need to hold on to the physical swap slot for IO step. But
> even after IO succeeds, there are cases where you would still like to
> keep physical swap slots around (for e.g, to avoid swapping out again
> if the folio is only speculatively fetched).

A reverse entry is enough to hold the physical swap, just like how the
current hibernation works with a fake shadow, you don't need a PFN
just for holding that.

>
> So you have to make sure we have space for both the physical swap
> slot, and the swap cache folio's PFN at the same time for each vswap
> entry. So we still need the vtable extension (well maybe the other
> approach I mentioned could work, but I'm not 100% sure).

Right, vtable extension is fine, there is no redundant data. I just
mean you don't need to set the PFN twice (for vswap & pswap). So
simply reusing the PFN format in the vswap layer and solving
everything there should be enough.

> > Thanks. Not too complicated, actually our internal kernel
> > implementation still using si->percpu cluster, and use a counter for
> > the rotation and each order have a counter :P, it's a bit ugly but
> > works fine. It still serves pretty well just like the global percpu
> > cluster, YoungJun's previous per ci percpu cluster also still provides
> > the fast path, many ways to do that.
>
> Sounds like something that should be upstreamed? ;)

I'd love to :), there is a lot of work going on as you can see and
people seem to have many different proposals about this so I didn't
prioritize it. I'll try as things settle down.

> > > >
> > > > For patch 2, a few routines like vswap_can_swapin_thp seems not
> > > > needed or should be moved to __swap_cache_alloc? VSWAP_FOLIO is
> > > > same as swap cache folio check, which is already covered. Same for
> > > > zero checking, and VSWAP_NONE which is same as swap count check
> > > > I think. That way we not only save a lot of code, we also no
> > > > longer need to treat vswap specially.
> > >
> > > Unfortunately, I think a lot of this complexity is still needed. Vswap
> > > adds a new layer, which means new complications :)
> > >
> > > For instance, I think you still need vswap_can_swapin_thp. It
> > > basically enforces that the backend must be something
> > > swap_read_folio() can handle. That means:
> > >
> > > 1. No zswap.
> > >
> > > 2. No mixed backend.
> >
> > If mixed backend means phys vs zero vs zswap, then we already have
> > part of that covered with the current swap cache except for the phys
> > part (zswap part seems very doable with fujunjie's work).
> > swap_cache_alloc_folio will ensure there is no mixed zerobit, it can
> > be easily extended to ensure there is no mixed zswap as well
> > (according to what I've learned from fujunjie's code). Similar logic
> > for phys detection I think.
>
> Yeah it's basically generalizing that check, and handle the case where
> we can have indirection.
>
> I mean I can open-code it, but it has to be there :) And I figure it
> might be useful to check this opportunistically (at swap_pte_batch,
> even if it's not guaranteed to be correct down the line) before we
> even attempt to allocate a large folio etc. to avoid large folio
> allocation.

Right, but swap_cache_alloc_folio with orders=<large order> won't
attempt a large allocation if the batch check fails, so that's fine.

> > > Basically:
> > >
> > > 1. For vswap entry, not backed by phys swap: record swap memcg, hold
> > > reference to pin the memcg, but not charging towards swap.current.
> >
> > Maybe you don't need to record memcg here since folio->memcg already
> > have that info?
> >
> > I previously had a patch:
> > https://lore.kernel.org/linux-mm/20260220-swap-table-p4-v1-7-104795d19815@tencent.com/
> >
> > The defers the recording of memcg, the behavior is almost identical to
> > before, but charging & recording should be cleaner and you don't need
> > to record memcg at allocation time hence maybe reduce the possibility
> > of pinning a memcg. I didn't include that in P4 just to reduce LOC,
> > maybe can be resent or included.
>
> That works-ish when the folio is sitll in swap cache, but say if it's
> vswap backed by zswap (and the swap cache folio has been reclaimed),
> you need a place to store the memcg, no?

"Backed by zswap" means the actual swapout already happened, which is
the case where we always have to record the memcg info because the
folio is gone, seems still fit in the model.

> Just seems cleaner to centralize this info at vswap layer when it is
> presented, for now anyway, rather than juggling this on a per-backend
> basis.

Zswap charge could be merged with vswap I think but pswap we just
discussed that we might want to charge it differently? And actually
vswap charge is still quite different from zswap charge if you want to
make vswap infinitely large? I think we can figure out this part as we
progress; it's not a major problem at this point.

Re: [RFC PATCH 0/5] mm, swap: Virtual Swap Space (Swap Table Edition)

Posted by Nhat Pham 5 days, 22 hours ago

On Mon, Jun 1, 2026 at 8:25 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Tue, Jun 2, 2026 at 2:06 AM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > On Mon, Jun 1, 2026 at 10:45 AM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > On Mon, Jun 1, 2026 at 11:57 PM Nhat Pham <nphamcs@gmail.com> wrote:
> > > >
> > > > Are you suggesting we merge the virtual table with main swap table?
> > > >
> > > > Man, I'd love to do this. There is a problem though - we have a case
> > > > where we occupy both backing physical swap AND swap cache. Do you
> > > > think we can fit both the physical swap slot handle and the swap cache
> > > > PFN into the same slot in virtual table? Maybe with some expanding...?
> > >
> > > I don't really get why we would need to do that? If you put the PFN
> > > info in the virtual / upper layer, then the count info, locking, and
> > > all swap IO synchronization (via folio lock), dup (current protected
> > > by ci lock / folio lock), and allocation (folio_alloc_swap), are all
> > > handled in this layer.
> > >
> > > The physical / lower layer will just hold a reverse entry on
> > > folio_realloc_swap, or no entry at all (no physical layer used, zswap,
> > > or after swap allocation but before IO) right?
> > >
> > > Looking up the actual folio from the physical layer will be a bit
> > > slower since it needs to resolve the reverse entry, but the only place
> > > we need to do that is things like migrate, compaction (none of them
> > > exist yet) which seems totally fine?
> >
> > All of this is correct, but consider swaping in a vswap entry backed
> > by pswap. There are cases where you still want to maintain the pswap
> > slots around backing vswap entry, while having the swap cache folio as
> > well.
> >
> > For e.g, at swap in time, we add the folio into the swap cache. First
> > of all, we need to hold on to the physical swap slot for IO step. But
> > even after IO succeeds, there are cases where you would still like to
> > keep physical swap slots around (for e.g, to avoid swapping out again
> > if the folio is only speculatively fetched).
>
> A reverse entry is enough to hold the physical swap, just like how the
> current hibernation works with a fake shadow, you don't need a PFN
> just for holding that.
>
> >
> > So you have to make sure we have space for both the physical swap
> > slot, and the swap cache folio's PFN at the same time for each vswap
> > entry. So we still need the vtable extension (well maybe the other
> > approach I mentioned could work, but I'm not 100% sure).
>
> Right, vtable extension is fine, there is no redundant data. I just
> mean you don't need to set the PFN twice (for vswap & pswap). So
> simply reusing the PFN format in the vswap layer and solving
> everything there should be enough.

Ah yeah, then I might have misunderstood you here. I thought you were
proposing a way to remove vtable :)

"don't need to set the PFN twice" completely agree. I'm pretty sure I
did not here, but do let me know if I accidentally set it twice. I'm
be sure to check this myself for the next version.

>
> > > Thanks. Not too complicated, actually our internal kernel
> > > implementation still using si->percpu cluster, and use a counter for
> > > the rotation and each order have a counter :P, it's a bit ugly but
> > > works fine. It still serves pretty well just like the global percpu
> > > cluster, YoungJun's previous per ci percpu cluster also still provides
> > > the fast path, many ways to do that.
> >
> > Sounds like something that should be upstreamed? ;)
>
> I'd love to :), there is a lot of work going on as you can see and
> people seem to have many different proposals about this so I didn't
> prioritize it. I'll try as things settle down.

Yeah understandable. It's a very volatile codebase, with a lot of
folks trying to improve different aspects.

Hopefully we're close to a unified design :)

I'll keep my dedicated vswap per-cpu alloc caching for now, but I'll
get rid of it whenever the per-CPU per-si cache is ready.

>
> > > > >
> > > > > For patch 2, a few routines like vswap_can_swapin_thp seems not
> > > > > needed or should be moved to __swap_cache_alloc? VSWAP_FOLIO is
> > > > > same as swap cache folio check, which is already covered. Same for
> > > > > zero checking, and VSWAP_NONE which is same as swap count check
> > > > > I think. That way we not only save a lot of code, we also no
> > > > > longer need to treat vswap specially.
> > > >
> > > > Unfortunately, I think a lot of this complexity is still needed. Vswap
> > > > adds a new layer, which means new complications :)
> > > >
> > > > For instance, I think you still need vswap_can_swapin_thp. It
> > > > basically enforces that the backend must be something
> > > > swap_read_folio() can handle. That means:
> > > >
> > > > 1. No zswap.
> > > >
> > > > 2. No mixed backend.
> > >
> > > If mixed backend means phys vs zero vs zswap, then we already have
> > > part of that covered with the current swap cache except for the phys
> > > part (zswap part seems very doable with fujunjie's work).
> > > swap_cache_alloc_folio will ensure there is no mixed zerobit, it can
> > > be easily extended to ensure there is no mixed zswap as well
> > > (according to what I've learned from fujunjie's code). Similar logic
> > > for phys detection I think.
> >
> > Yeah it's basically generalizing that check, and handle the case where
> > we can have indirection.
> >
> > I mean I can open-code it, but it has to be there :) And I figure it
> > might be useful to check this opportunistically (at swap_pte_batch,
> > even if it's not guaranteed to be correct down the line) before we
> > even attempt to allocate a large folio etc. to avoid large folio
> > allocation.
>
> Right, but swap_cache_alloc_folio with orders=<large order> won't
> attempt a large allocation if the batch check fails, so that's fine.
>
> > > > Basically:
> > > >
> > > > 1. For vswap entry, not backed by phys swap: record swap memcg, hold
> > > > reference to pin the memcg, but not charging towards swap.current.
> > >
> > > Maybe you don't need to record memcg here since folio->memcg already
> > > have that info?
> > >
> > > I previously had a patch:
> > > https://lore.kernel.org/linux-mm/20260220-swap-table-p4-v1-7-104795d19815@tencent.com/
> > >
> > > The defers the recording of memcg, the behavior is almost identical to
> > > before, but charging & recording should be cleaner and you don't need
> > > to record memcg at allocation time hence maybe reduce the possibility
> > > of pinning a memcg. I didn't include that in P4 just to reduce LOC,
> > > maybe can be resent or included.
> >
> > That works-ish when the folio is sitll in swap cache, but say if it's
> > vswap backed by zswap (and the swap cache folio has been reclaimed),
> > you need a place to store the memcg, no?
>
> "Backed by zswap" means the actual swapout already happened, which is
> the case where we always have to record the memcg info because the
> folio is gone, seems still fit in the model.

Hmmm I might have misunderstood you in my last response here.

So what you are doing in that patch:

1. Charge towards folio->memcg when we allocate swap slots, but do not
record or take reference yet.

2. Once we reclaimed the folio after swap out, then we record and
acquire reference to pin.

You know what - this would simplify my usecase. For vswap entries not
backing by pswap, it *basically* just means I skip step 1 for vswap
backend. Step 2 is shared for all cases. Donezo.

You're right. This is simpler :) Let me brew on it a bit longer in
case there might be something we're missing. but it does seem like
this will reduce complexity (and with the added benefits of me not
having to come up with names for helpers).

>
> > Just seems cleaner to centralize this info at vswap layer when it is
> > presented, for now anyway, rather than juggling this on a per-backend
> > basis.
>
> Zswap charge could be merged with vswap I think but pswap we just
> discussed that we might want to charge it differently? And actually
> vswap charge is still quite different from zswap charge if you want to
> make vswap infinitely large? I think we can figure out this part as we
> progress; it's not a major problem at this point.

That was because I misunderstood your suggestions. My bad :)

Anyway, please keep the suggestions and recommendations coming :) I'm
playing with some of your suggestions right now, and waiting for other
folks' inputs as well. Will send out the next version at some point.
If there is no fundamental design flaws, I will un-RFC once I've
addressed all the main issues.

Re: [RFC PATCH 0/5] mm, swap: Virtual Swap Space (Swap Table Edition)

Posted by Nhat Pham 6 days, 21 hours ago

On Mon, Jun 1, 2026 at 8:56 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Mon, Jun 1, 2026 at 12:34 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> > On Thu, May 28, 2026 at 02:29:24PM +0800, Nhat Pham wrote:
> > > Based on: mm-unstable @ 444fc9435e57 + swap-table phase IV v5 [2].
> > >
> > > I manually adapted Kairui's ghost device implementation (from [4])
> > > for my vswap device. I've credited him as Co-developed-by on Patch I
> > > since a substantial portion of the dynamic-cluster infrastructure is
> > > his (I did propose the idea of using xarray/radix tree for dynamic
> > > swap clusters allocation and management though :P).
> > >
> > > >From here on out, for simplicity, I will refer to swap table phase IV
> > > as "P4", and the older v6 virtual swap space implementation as "v6".
> > >
> >
> > ...
> >
> > >
> > > This series reimplements the virtual swap space concept (see [1])
> > > on top of Kairui Song's swap table infrastructure, on top of [2]
> > > and in accordance with his proposal in [3]. The proposal's idea
> > > is interesting, so I decided to give it a shot myself. I'm still not
> > > 100% sure that this is bug-proof, but hey, it compiles, and has
> > > not crashed in my simple stress testing :)
> > >
> > > The prototype here is feature-complete relative to the swap-table P4
> > > baseline — swapout, swapin, freeing, swapoff, zswap writeback, zswap
> > > shrinker, memcg charging, and THP swapin all work for
> > > both vswap and direct-physical entries — and satisfies all three
> > > requirements above: no backend coupling (zswap/zero entries hold no
> > > physical slot), dynamic swap space (clusters allocated on demand via
> > > xarray, no static provisioning), and efficient backend transfer
> > > (in-place vtable updates, no PTE/rmap walking).
> > >
> > > II. Design
> > >
> > > With vswap, pages are assigned virtual swap entries on a ghost device
> > > with no backing storage. These entries are backed by zswap, zero pages,
> > > or (lazily) physical swap slots. Physical backing is allocated only
> > > when needed — on zswap writeback or reclaim writeout, after the rmap
> > > step.
> > >
> > > Compared to the standalone v6 implementation [1], which introduces a
> > > 24-byte per-entry swap descriptor and its own cluster allocator, this
> > > edition uses swap_table infrastructure, and share a lot of the allocator
> > > logic. Per-slot metadata is stored in a tag-encoded virtual_table
> > > (atomic_long_t, 8 bytes per slot), and physical clusters store
> > > Pointer-tagged rmap entries in the swap_table for reverse lookup back to
> > > the virtual cluster.
> > >
> > > Here are some data layout diagrams:
> > >
> > >   Case 1: vswap entry (virtualized)
> > >
> > >   PTE                  swap_cluster_info_dynamic
> > >   vswap_entry          +-------------------------+
> > >   (swp_entry_t) ------>| swap_cluster_info (ci)  |
> > >                        | +--------------------+  |
> > >                        | | swap_table         |  |
> > >                        | |   PFN / Shadow     |  |
> > >                        | | memcg_table        |  |
> > >                        | | count,flags,order  |  |
> > >                        | | lock, list         |  |
> > >                        | +--------------------+  |
> > >                        |                         |
> > >                        | virtual_table           |
> > >                        | +--------------------+  |
> > >                        | | NONE               |  |
> > >                        | | PHYS               |  |
> > >                        | | ZERO               |  |
> > >                        | | ZSWAP(entry*)      |  |
> > >                        | | FOLIO(folio*)      |  |
> > >                        | +--------------------+  |
> > >                        +-------------------------+
> > >                               |
> > >                               | PHYS resolves to
> > >                               v
> > >                        PHYSICAL CLUSTER (swap_cluster_info)
> > >                        +--------------------------+
> > >                        | swap_table per-slot:     |
> > >                        |   NULL   - free          |
> > >                        |   PFN    - cached folio  |
> > >                        |   Shadow - swapped out   |
> > >                        |   Pointer- vswap rmap    |
> > >                        |   Bad    - unusable      |
> > >                        |                          |
> > >                        | Vswap-backing slot:      |
> > >                        |   Pointer(C|swp_entry_t) |
> > >                        |     rmap back to vswap   |
> > >                        +--------------------------+
> > >
> > >   Case 2: direct-mapped physical entry (no vswap)
> > >
> > >   PTE                  PHYSICAL CLUSTER (swap_cluster_info)
> > >   phys_entry           +--------------------------+
> > >   (swp_entry_t) ------>| swap_table per-slot:     |
> > >                        |   NULL   - free          |
> > >                        |   PFN    - cached folio  |
> > >                        |   Shadow - swapped out   |
> > >                        |   Bad    - unusable      |
> > >                        +--------------------------+
> > >
> > > struct swap_cluster_info_dynamic {
> > >     struct swap_cluster_info ci;       /* swap_table, lock, etc. */
> > >     unsigned int index;                /* position in xarray */
> > >     struct rcu_head rcu;               /* kfree_rcu deferred free */
> > >     atomic_long_t *virtual_table;      /* backend info, 8 B/slot */
> > > };
> > >
> > > Each vswap cluster (swap_cluster_info_dynamic) extends the classic
> > > swap_cluster_info struct with a virtual_table array that stores the
> > > backend information for each virtual swap entry in the cluster. Each
> > > entry is tag-encoded in the low 3 bits to indicate backend types:
> > >
> > >   NONE:   |----- 0000 ------|000|  free / unbacked
> > >   PHYS:   |-- (type:5,off:N)|001|  on a physical swapfile (shifted)
> > >   ZERO:   |----- 0000 ------|010|  zero-filled page
> > >   ZSWAP:  |--- zswap_entry* |011|  compressed in zswap
> > >   FOLIO:  |--- folio* ------|100|  in-memory folio
> >
> > Thanks for trying this approach!
>
> Thanks for the suggestions. I hope going forward we have sth concrete
> to tinker with, rather than abstractions :P
>
> >
> > For the format part, PHYS don't need that much bits I think,
> > so by slightly adjust the format vswap device could be share
> > mostly the same format with ordinary device.
> >
> > For example typical modern system don't have a address space larger
> > than 52 bit. (Even with full 64 bits used for addressing, shift it
> > by 12 we get 52). Plus 5 for type, you get 57, so you can have a
> > marker that should work as long as it shorter than 1000000 for PHYS,
> > and shared for all table format since it's not in conflict with
> > anything. You have also use a few extra bits so a single swap space
> > can be 8 times larger than RAM space, and since we can help
> > multiple swap type I think that should be far than enough?
> >
> > Then you have Shadow back at 001, and zero bit in shadow. The only
> > special one is Zswap, which will be 100 now, and that's exactly the
> > reserved pointer format in current swap table format, on seeing
> > si->flags & VSWAP && is_pointer(swp_tb) you know that's zswap :)
>
> Are you suggesting we merge the virtual table with main swap table?
>
> Man, I'd love to do this. There is a problem though - we have a case
> where we occupy both backing physical swap AND swap cache. Do you
> think we can fit both the physical swap slot handle and the swap cache
> PFN into the same slot in virtual table? Maybe with some expanding...?
>
> Another option is we can be a bit smart about it - if a virtual swap
> entry is in swap cache AND occupies physical swap slot, then put the
> folio at the physical swap's table, use folio->swap as the rmap.
>
> (I think you recommend this approach somewhere but for the life of me
> I can't find the reference - apologies if I'm putting words into your
> mouth :))
>
> But this is a bit more complicated - extra care is needed for rmap
> handling at the physical swap layer, and swap cache handling at the
> virtual swap layer. Maybe a follow-up? :)
>
> >
> > Folio / PFN can still be 010 as in the current swap table format.
> >
> > Then everything seems clean and aligned, no more special handling
> > for vswap needed, there are detailed to sort out, but it should work.
> >
> > > - Pointer-tagged swap_table on physical clusters for rmap (physical
> > >   -> virtual) lookup.
> >
> > Or reuse the PHYS format (rename it maybe) since point back to vswap
> > is also pointing to a si.
>
> Noted. I'm just doing the simplest thing right now - working
> prototype. I mean, we have enough bits :)
>
> >
> > > III. Follow-ups:
> > >
> > > In no particular order (and most of which can be done as follow-up
> > > patch series rather than shoving everything in the initial landing):
> > >
> > > - More thorough stress testing is very much needed.
> > >
> > > - Performance benchmarks to make sure I don't accidentally regress
> > >   the vswap-less case, and that the vswap's case performance is
> > >   good. I suspect I will have to port a lot of the
> > >   optimizations I implemented in v6 over here - some of the
> > >   inefficiencies are inherent in any swap virtualization, and
> > >   would require the same fix (for e.g the MRU cluster caching
> > >   for faster cluster lookup - see [8] and [9]).
> >
> > This could be imporved by per-si percpu cluster. Both YoungJun's
> > tiering and Baoquan's previous swap ops mentioned this is needed,
> > and now vswap also need that. If the vswap is also a si, then it will
> > make use of this too.

Oh and the MRU cluster caching I mentioned here is not the allocation
caching. It's the lookup caching, basically to avoid doing the
xa_load() to look up clusters for consecutive swap operations on the
same vswap cluster (which is the common case with vswap). For v6, it
massively reduces this indirection lookup overhead. Performance-wise
it's an absolute winner, just more complexity (because I need to
handle reference counting carefully).

I also just realized we'll induce the indirection overhead on
allocation here too, even if the cached cluster still have slots for
allocation, because we look up the cluster (which is basically free
for static swap device, but not free for vswap devices). Might need to
take care of that to maintain vswap performance (but it will then
diverge from your existing code...).

Re: [RFC PATCH 0/5] mm, swap: Virtual Swap Space (Swap Table Edition)

Posted by Kairui Song 6 days, 19 hours ago

On Tue, Jun 2, 2026 at 12:22 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Mon, Jun 1, 2026 at 8:56 AM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > On Mon, Jun 1, 2026 at 12:34 AM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > > On Thu, May 28, 2026 at 02:29:24PM +0800, Nhat Pham wrote:
> > > > III. Follow-ups:
> > > >
> > > > In no particular order (and most of which can be done as follow-up
> > > > patch series rather than shoving everything in the initial landing):
> > > >
> > > > - More thorough stress testing is very much needed.
> > > >
> > > > - Performance benchmarks to make sure I don't accidentally regress
> > > >   the vswap-less case, and that the vswap's case performance is
> > > >   good. I suspect I will have to port a lot of the
> > > >   optimizations I implemented in v6 over here - some of the
> > > >   inefficiencies are inherent in any swap virtualization, and
> > > >   would require the same fix (for e.g the MRU cluster caching
> > > >   for faster cluster lookup - see [8] and [9]).
> > >
> > > This could be imporved by per-si percpu cluster. Both YoungJun's
> > > tiering and Baoquan's previous swap ops mentioned this is needed,
> > > and now vswap also need that. If the vswap is also a si, then it will
> > > make use of this too.
>
> Oh and the MRU cluster caching I mentioned here is not the allocation
> caching. It's the lookup caching, basically to avoid doing the
> xa_load() to look up clusters for consecutive swap operations on the
> same vswap cluster (which is the common case with vswap). For v6, it
> massively reduces this indirection lookup overhead. Performance-wise
> it's an absolute winner, just more complexity (because I need to
> handle reference counting carefully).

Ah alright, that's interesting. And I think we can keep things simple
to start, since sensitive users is stil able tol use plain device this
way.

BTW maintaining MRU is also an overhead, I'm not sure if the lookup
pattern always follows that?

> I also just realized we'll induce the indirection overhead on
> allocation here too, even if the cached cluster still have slots for
> allocation, because we look up the cluster (which is basically free
> for static swap device, but not free for vswap devices). Might need to
> take care of that to maintain vswap performance (but it will then
> diverge from your existing code...).

That part should be indeed coverable by the si->percpu cluster though, I think.

Re: [RFC PATCH 0/5] mm, swap: Virtual Swap Space (Swap Table Edition)

Posted by Nhat Pham 5 days, 21 hours ago

On Mon, Jun 1, 2026 at 10:49 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Tue, Jun 2, 2026 at 12:22 AM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > On Mon, Jun 1, 2026 at 8:56 AM Nhat Pham <nphamcs@gmail.com> wrote:
> > >
> > > On Mon, Jun 1, 2026 at 12:34 AM Kairui Song <ryncsn@gmail.com> wrote:
> > > >
> > > > On Thu, May 28, 2026 at 02:29:24PM +0800, Nhat Pham wrote:
> > > > > III. Follow-ups:
> > > > >
> > > > > In no particular order (and most of which can be done as follow-up
> > > > > patch series rather than shoving everything in the initial landing):
> > > > >
> > > > > - More thorough stress testing is very much needed.
> > > > >
> > > > > - Performance benchmarks to make sure I don't accidentally regress
> > > > >   the vswap-less case, and that the vswap's case performance is
> > > > >   good. I suspect I will have to port a lot of the
> > > > >   optimizations I implemented in v6 over here - some of the
> > > > >   inefficiencies are inherent in any swap virtualization, and
> > > > >   would require the same fix (for e.g the MRU cluster caching
> > > > >   for faster cluster lookup - see [8] and [9]).
> > > >
> > > > This could be imporved by per-si percpu cluster. Both YoungJun's
> > > > tiering and Baoquan's previous swap ops mentioned this is needed,
> > > > and now vswap also need that. If the vswap is also a si, then it will
> > > > make use of this too.
> >
> > Oh and the MRU cluster caching I mentioned here is not the allocation
> > caching. It's the lookup caching, basically to avoid doing the
> > xa_load() to look up clusters for consecutive swap operations on the
> > same vswap cluster (which is the common case with vswap). For v6, it
> > massively reduces this indirection lookup overhead. Performance-wise
> > it's an absolute winner, just more complexity (because I need to
> > handle reference counting carefully).
>
> Ah alright, that's interesting. And I think we can keep things simple
> to start, since sensitive users is stil able tol use plain device this
> way.

Of course. I'm hoping vswap-on-zswap will not be too terrible at a
start. We can then optimize for the swapfile backend case later.

>
> BTW maintaining MRU is also an overhead, I'm not sure if the lookup
> pattern always follows that?

Yeah I had to be a bit careful in v6 to make sure the cache (and cache
invalidation) happens at the right time. I've had this idea for awhile
- there's a reason why I waited until v6 to implement it :)

For instance, when physical swap allocator runs out of slot for a
cluster, we try to reclaim the swap-cache-only slots. That involves
taking the rmap back to vswap layer, to check swap cache and swap
count. This is a very random pattern, so it does not benefit from this
lookup cache, and in fact invalidates the cache :) So I had to add
some hint to avoid going back to the vswap layer to check for
swap-cache-only state.

>
> > I also just realized we'll induce the indirection overhead on
> > allocation here too, even if the cached cluster still have slots for
> > allocation, because we look up the cluster (which is basically free
> > for static swap device, but not free for vswap devices). Might need to
> > take care of that to maintain vswap performance (but it will then
> > diverge from your existing code...).
>
> That part should be indeed coverable by the si->percpu cluster though, I think.

Yeah agree - we just need to be a bit craftier with it. The
fundamental problem is in the current model, we're only storing offset
and si, then look up cluster based on that. But for dynamic vswap,
that look up takes the xa_load().

Once we move to per-si per-cpu cluster, then I think it becomes ok to
store the cluster pointer directly, correct?

The reference counting needs to be carefully handled though. I think
in my old vss design I did something fairly silly - just hold a
reference to it while it's in cache, then add CPU offlining handler to
clean up. Not the end of the world I suppose, but maybe there's a
smarter scheme.

Re: [RFC PATCH 0/5] mm, swap: Virtual Swap Space (Swap Table Edition)

Posted by Kairui Song 5 days, 20 hours ago

On Tue, Jun 2, 2026 at 11:54 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Mon, Jun 1, 2026 at 10:49 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> >
> > That part should be indeed coverable by the si->percpu cluster though, I think.
>
> Yeah agree - we just need to be a bit craftier with it. The
> fundamental problem is in the current model, we're only storing offset
> and si, then look up cluster based on that. But for dynamic vswap,
> that look up takes the xa_load().
>
> Once we move to per-si per-cpu cluster, then I think it becomes ok to
> store the cluster pointer directly, correct?
>
> The reference counting needs to be carefully handled though. I think
> in my old vss design I did something fairly silly - just hold a
> reference to it while it's in cache, then add CPU offlining handler to
> clean up. Not the end of the world I suppose, but maybe there's a
> smarter scheme.

Yeah... I'm not entirely sure about this at this point, maybe it can
be sorted out as we process. Maybe we can also avoid the xa_load with
other techniques too.

Re: [RFC PATCH 0/5] mm, swap: Virtual Swap Space (Swap Table Edition)

Posted by Yosry Ahmed 5 days, 11 hours ago

> II. Design
> 
> With vswap, pages are assigned virtual swap entries on a ghost device
> with no backing storage. These entries are backed by zswap, zero pages,
> or (lazily) physical swap slots. Physical backing is allocated only
> when needed — on zswap writeback or reclaim writeout, after the rmap
> step.
> 
> Compared to the standalone v6 implementation [1], which introduces a
> 24-byte per-entry swap descriptor and its own cluster allocator, this
> edition uses swap_table infrastructure, and share a lot of the allocator
> logic. Per-slot metadata is stored in a tag-encoded virtual_table
> (atomic_long_t, 8 bytes per slot), and physical clusters store
> Pointer-tagged rmap entries in the swap_table for reverse lookup back to
> the virtual cluster.
> 
> Here are some data layout diagrams:
> 
>   Case 1: vswap entry (virtualized)
> 
>   PTE                  swap_cluster_info_dynamic
>   vswap_entry          +-------------------------+
>   (swp_entry_t) ------>| swap_cluster_info (ci)  |
>                        | +--------------------+  |
>                        | | swap_table         |  |
>                        | |   PFN / Shadow     |  |
>                        | | memcg_table        |  |
>                        | | count,flags,order  |  |
>                        | | lock, list         |  |
>                        | +--------------------+  |
>                        |                         |
>                        | virtual_table           |
>                        | +--------------------+  |
>                        | | NONE               |  |
>                        | | PHYS               |  |
>                        | | ZERO               |  |
>                        | | ZSWAP(entry*)      |  |
>                        | | FOLIO(folio*)      |  |
>                        | +--------------------+  |
>                        +-------------------------+
>                               |
>                               | PHYS resolves to
>                               v
>                        PHYSICAL CLUSTER (swap_cluster_info)
>                        +--------------------------+
>                        | swap_table per-slot:     |
>                        |   NULL   - free          |
>                        |   PFN    - cached folio  |
>                        |   Shadow - swapped out   |
>                        |   Pointer- vswap rmap    |
>                        |   Bad    - unusable      |
>                        |                          |
>                        | Vswap-backing slot:      |
>                        |   Pointer(C|swp_entry_t) |
>                        |     rmap back to vswap   |
>                        +--------------------------+
> 
>   Case 2: direct-mapped physical entry (no vswap)
> 
>   PTE                  PHYSICAL CLUSTER (swap_cluster_info)
>   phys_entry           +--------------------------+
>   (swp_entry_t) ------>| swap_table per-slot:     |
>                        |   NULL   - free          |
>                        |   PFN    - cached folio  |
>                        |   Shadow - swapped out   |
>                        |   Bad    - unusable      |
>                        +--------------------------+
> 
> struct swap_cluster_info_dynamic {
>     struct swap_cluster_info ci;       /* swap_table, lock, etc. */
>     unsigned int index;                /* position in xarray */
>     struct rcu_head rcu;               /* kfree_rcu deferred free */
>     atomic_long_t *virtual_table;      /* backend info, 8 B/slot */
> };
> 
> Each vswap cluster (swap_cluster_info_dynamic) extends the classic
> swap_cluster_info struct with a virtual_table array that stores the
> backend information for each virtual swap entry in the cluster. Each
> entry is tag-encoded in the low 3 bits to indicate backend types:
> 
>   NONE:   |----- 0000 ------|000|  free / unbacked
>   PHYS:   |-- (type:5,off:N)|001|  on a physical swapfile (shifted)
>   ZERO:   |----- 0000 ------|010|  zero-filled page
>   ZSWAP:  |--- zswap_entry* |011|  compressed in zswap
>   FOLIO:  |--- folio* ------|100|  in-memory folio
> 
> We still have room for 3 more future backend types, for e.g. CRAM, i.e
> compressed-CXL-as-swap, which is laid out in [10] and [11]. Worst
> case scenario, we can add more fields to this extended struct.
> 
> Other design points:
> - Both vswap entries (Case 1) and directly-mapped physical entries
>   (Case 2) coexist as first-class citizens. All the common swap
>   code paths — swapout, swapin, swap freeing, swapoff, zswap
>   writeback, THP swapin, etc. work for both. When CONFIG_VSWAP=n,
>   the vswap branches compile out and behavior should be identical to
>   today's swap-table P4 (at least that is my intention).
> - Pointer-tagged swap_table on physical clusters for rmap (physical
>   -> virtual) lookup.
> - Virtual swap slots not backed by physical swap are not charged to
>   memcg swap counters — only physical backing is charged (I made the
>   case for this in [7]).
> - Careful separation of vswap and physical swap allocation paths and
>   structures adds a lot of complexity, but is crucial to make sure
>   both paths are efficient and do not conflict with each other (for
>   correctness and performance). I do re-use a lot of the allocation
>   logic wherever possible though.

Thanks for working on this! I mostly looked at the high-level design and
the zswap parts, as the swap code has changed a lot since I was familiar
with it :)

It seems like the direction being taken here is that we have one
(massive) vswap swap device, and we keep normal physical swap devices
around as well.

A vswap entry can point at a physical swap entry, or zswap, or zeromap.
If a vswap entry points at a physical swap entry, then the physical swap
entry points back at the vswap entry (a reverse mapping).

I assume the main reason here is to avoid the extra overhead if
everything uses vswap, which would mainly be the reverse mapping
overhead? I guess there's also some simplicity that comes from reusing
the swap info infra as a whole, including the swap table.

I don't like that the code bifurcates for vswap vs. normal swap entries
though. Not sure if this is an issue that can be fixed with proper
abstractions to hide it, or if the design needs modifications. I was
honestly really hoping we don't end up with this. I was hoping that the
physical swap device no longer uses a full swap table and all, and
everything goes through vswap.

I hoping that if redirection isn't needed (e.g. zswap is disabled),
vswap can directly encode the physical swap slot so that the reverse
mapping isn't needed -- so we avoid the overhead without keeping the
physical swap device using a fully-fledged swap table.

All that being said, perhaps I am too out of touch with the code to
realize it's simply not possible.

Honestly, if the main reason we can't have a single swap table for vswap
is saving 8 bytes on the reverse mapping, it sounds like a weak-ish
argument, even if we can't optimize the reverse mapping away. But maybe
I am also out of touch with RAM prices :)

I at least hope that, the current design is not painting us into a
corner (e.g. through userspace interfaces), and we can still achieve a
vswap-for-all implementation in the future (maybe that's what you have
in mind already?).

Aside from the swap code, the only sticking point for me is the logic
bifurcation in zswap. Why does zswap need to handle vswap vs. not vswap?
I thought the point of the design is to use vswap when zswap is used,
and otherwise use a normal swap table. In a way, one of the goals is to
make zswap a first class swap citizen, but it doesn't seem like we are
achieving that?

Re: [RFC PATCH 0/5] mm, swap: Virtual Swap Space (Swap Table Edition)

Posted by Nhat Pham 4 days, 20 hours ago

On Tue, Jun 2, 2026 at 6:29 PM Yosry Ahmed <yosry@kernel.org> wrote:
>
> > II. Design
> >
> > With vswap, pages are assigned virtual swap entries on a ghost device
> > with no backing storage. These entries are backed by zswap, zero pages,
> > or (lazily) physical swap slots. Physical backing is allocated only
> > when needed — on zswap writeback or reclaim writeout, after the rmap
> > step.
> >
> > Compared to the standalone v6 implementation [1], which introduces a
> > 24-byte per-entry swap descriptor and its own cluster allocator, this
> > edition uses swap_table infrastructure, and share a lot of the allocator
> > logic. Per-slot metadata is stored in a tag-encoded virtual_table
> > (atomic_long_t, 8 bytes per slot), and physical clusters store
> > Pointer-tagged rmap entries in the swap_table for reverse lookup back to
> > the virtual cluster.
> >
> > Here are some data layout diagrams:
> >
> >   Case 1: vswap entry (virtualized)
> >
> >   PTE                  swap_cluster_info_dynamic
> >   vswap_entry          +-------------------------+
> >   (swp_entry_t) ------>| swap_cluster_info (ci)  |
> >                        | +--------------------+  |
> >                        | | swap_table         |  |
> >                        | |   PFN / Shadow     |  |
> >                        | | memcg_table        |  |
> >                        | | count,flags,order  |  |
> >                        | | lock, list         |  |
> >                        | +--------------------+  |
> >                        |                         |
> >                        | virtual_table           |
> >                        | +--------------------+  |
> >                        | | NONE               |  |
> >                        | | PHYS               |  |
> >                        | | ZERO               |  |
> >                        | | ZSWAP(entry*)      |  |
> >                        | | FOLIO(folio*)      |  |
> >                        | +--------------------+  |
> >                        +-------------------------+
> >                               |
> >                               | PHYS resolves to
> >                               v
> >                        PHYSICAL CLUSTER (swap_cluster_info)
> >                        +--------------------------+
> >                        | swap_table per-slot:     |
> >                        |   NULL   - free          |
> >                        |   PFN    - cached folio  |
> >                        |   Shadow - swapped out   |
> >                        |   Pointer- vswap rmap    |
> >                        |   Bad    - unusable      |
> >                        |                          |
> >                        | Vswap-backing slot:      |
> >                        |   Pointer(C|swp_entry_t) |
> >                        |     rmap back to vswap   |
> >                        +--------------------------+
> >
> >   Case 2: direct-mapped physical entry (no vswap)
> >
> >   PTE                  PHYSICAL CLUSTER (swap_cluster_info)
> >   phys_entry           +--------------------------+
> >   (swp_entry_t) ------>| swap_table per-slot:     |
> >                        |   NULL   - free          |
> >                        |   PFN    - cached folio  |
> >                        |   Shadow - swapped out   |
> >                        |   Bad    - unusable      |
> >                        +--------------------------+
> >
> > struct swap_cluster_info_dynamic {
> >     struct swap_cluster_info ci;       /* swap_table, lock, etc. */
> >     unsigned int index;                /* position in xarray */
> >     struct rcu_head rcu;               /* kfree_rcu deferred free */
> >     atomic_long_t *virtual_table;      /* backend info, 8 B/slot */
> > };
> >
> > Each vswap cluster (swap_cluster_info_dynamic) extends the classic
> > swap_cluster_info struct with a virtual_table array that stores the
> > backend information for each virtual swap entry in the cluster. Each
> > entry is tag-encoded in the low 3 bits to indicate backend types:
> >
> >   NONE:   |----- 0000 ------|000|  free / unbacked
> >   PHYS:   |-- (type:5,off:N)|001|  on a physical swapfile (shifted)
> >   ZERO:   |----- 0000 ------|010|  zero-filled page
> >   ZSWAP:  |--- zswap_entry* |011|  compressed in zswap
> >   FOLIO:  |--- folio* ------|100|  in-memory folio
> >
> > We still have room for 3 more future backend types, for e.g. CRAM, i.e
> > compressed-CXL-as-swap, which is laid out in [10] and [11]. Worst
> > case scenario, we can add more fields to this extended struct.
> >
> > Other design points:
> > - Both vswap entries (Case 1) and directly-mapped physical entries
> >   (Case 2) coexist as first-class citizens. All the common swap
> >   code paths — swapout, swapin, swap freeing, swapoff, zswap
> >   writeback, THP swapin, etc. work for both. When CONFIG_VSWAP=n,
> >   the vswap branches compile out and behavior should be identical to
> >   today's swap-table P4 (at least that is my intention).
> > - Pointer-tagged swap_table on physical clusters for rmap (physical
> >   -> virtual) lookup.
> > - Virtual swap slots not backed by physical swap are not charged to
> >   memcg swap counters — only physical backing is charged (I made the
> >   case for this in [7]).
> > - Careful separation of vswap and physical swap allocation paths and
> >   structures adds a lot of complexity, but is crucial to make sure
> >   both paths are efficient and do not conflict with each other (for
> >   correctness and performance). I do re-use a lot of the allocation
> >   logic wherever possible though.
>
> Thanks for working on this! I mostly looked at the high-level design and

Thank you for initiating this effort in LSFMMBPF 2023 (god, time
flies). I was very excited by your presentation and decided to take a
stab at it :)

(I'll be sure to mention the full context in a non-RFC version - it
has a lot of gems in our technical discussions).

> the zswap parts, as the swap code has changed a lot since I was familiar
> with it :)

It has changed a lot since 6.19, when I was working on v6. Very
exciting time to be a (z)swap developer right now - we have new ideas
and new features every other week :) Reviewing code has been quite a
joy (albeit a lot of work).

>
> It seems like the direction being taken here is that we have one
> (massive) vswap swap device, and we keep normal physical swap devices
> around as well.

Yep.

>
> A vswap entry can point at a physical swap entry, or zswap, or zeromap.
> If a vswap entry points at a physical swap entry, then the physical swap
> entry points back at the vswap entry (a reverse mapping).

Yep.

>
> I assume the main reason here is to avoid the extra overhead if
> everything uses vswap, which would mainly be the reverse mapping
> overhead? I guess there's also some simplicity that comes from reusing
> the swap info infra as a whole, including the swap table.

Yeah it helps a lot that we don't have to rewrite the whole allocator
and swap entry reference counting logic again :)

>
> I don't like that the code bifurcates for vswap vs. normal swap entries
> though. Not sure if this is an issue that can be fixed with proper
> abstractions to hide it, or if the design needs modifications. I was
> honestly really hoping we don't end up with this. I was hoping that the
> physical swap device no longer uses a full swap table and all, and
> everything goes through vswap.
>
> I hoping that if redirection isn't needed (e.g. zswap is disabled),
> vswap can directly encode the physical swap slot so that the reverse
> mapping isn't needed -- so we avoid the overhead without keeping the
> physical swap device using a fully-fledged swap table.

Can you expand on "vswap can directly encode the physical swap slot"?
I'm not sure I follow here.

>
> All that being said, perhaps I am too out of touch with the code to
> realize it's simply not possible.
>
> Honestly, if the main reason we can't have a single swap table for vswap
> is saving 8 bytes on the reverse mapping, it sounds like a weak-ish
> argument, even if we can't optimize the reverse mapping away. But maybe
> I am also out of touch with RAM prices :)

In terms of the space overhead I do agree, FWIW :)

I think the other concern is the indirection overhead with going
through the xarray for every swap operation, hence the per-CPU vswap
cluster lookup caching idea:

https://lore.kernel.org/all/20260505153854.1612033-23-nphamcs@gmail.com/

>
> I at least hope that, the current design is not painting us into a
> corner (e.g. through userspace interfaces), and we can still achieve a
> vswap-for-all implementation in the future (maybe that's what you have
> in mind already?).

That's still my plan. Operationally speaking, I want to make this
completely transparent to users, with minimal to no performance
overhead.

The next action item is to optimize for vswap-on-fast-swapfile case -
that was Kairui's main concerns regarding performance. I spent a lot
of time perfing and fixing issues for this case in v6. The issues with
the most egregious effects and simplest fix (vswap-less
swap-cache-only check for e.g) are already fixed in this new design,
and eventually I will move the rest (lookup caching) and more to here.

>
> Aside from the swap code, the only sticking point for me is the logic
> bifurcation in zswap. Why does zswap need to handle vswap vs. not vswap?
> I thought the point of the design is to use vswap when zswap is used,
> and otherwise use a normal swap table. In a way, one of the goals is to
> make zswap a first class swap citizen, but it doesn't seem like we are
> achieving that?

We already have all the machinery to make zswap completely
independent. Right now, if you use vswap, you'll skip the zswap's
internal xarray entirely, and just store a zswap entry in the virtual
swap cluster's vtable.

I just haven't removed the old code for 2 reasons:

1. Reduce the delta on this RFC, to ease the burden for reviewers (and
definitely not because I'm lazy :P)

2. The only other practical reason is so that we can let users compile
with !CONFIG_VSWAP and still uses zswap on top of the old swapfile
setup during the transition/experimentation period for now.

But logically and conceptually speaking, there is no reason I can come
up with to use zswap on without vswap. The CPU indirection overhead is
already partially there (since zswap uses an xarray) and further
optimized (cluster loopup caching etc.), as well as the space overhead
(vswap replaces the zswap xarray). I actually wrote a whole paragraph
about how we should always go for vswap if we're using zswap, but then
decide to remove it since there's no code for it yet.

If folks like it, what I can do is have CONFIG_ZSWAP depends on
CONFIG_VSWAP, removes all the non-vswap logic, and call it a day? :)
Then, on the swap allocation side, if vswap allocation fail and zswap
writeback is disabled, we can error out early.

Re: [RFC PATCH 0/5] mm, swap: Virtual Swap Space (Swap Table Edition)

Posted by Yosry Ahmed 4 days, 18 hours ago

> > I assume the main reason here is to avoid the extra overhead if
> > everything uses vswap, which would mainly be the reverse mapping
> > overhead? I guess there's also some simplicity that comes from reusing
> > the swap info infra as a whole, including the swap table.
> 
> Yeah it helps a lot that we don't have to rewrite the whole allocator
> and swap entry reference counting logic again :)

I specifically meant using a full swap info thing for the physical swap
device even when it's behind vswap. That seems like an overkill, and we
don't need things like the swap entry reference coutning. We probably
just need a bitmap and a reverse mapping.

So I am assuming the main reason why we are not doing that (at least for
now) is simplicity?

> >
> > I don't like that the code bifurcates for vswap vs. normal swap entries
> > though. Not sure if this is an issue that can be fixed with proper
> > abstractions to hide it, or if the design needs modifications. I was
> > honestly really hoping we don't end up with this. I was hoping that the
> > physical swap device no longer uses a full swap table and all, and
> > everything goes through vswap.
> >
> > I hoping that if redirection isn't needed (e.g. zswap is disabled),
> > vswap can directly encode the physical swap slot so that the reverse
> > mapping isn't needed -- so we avoid the overhead without keeping the
> > physical swap device using a fully-fledged swap table.
> 
> Can you expand on "vswap can directly encode the physical swap slot"?
> I'm not sure I follow here.

I meant that if redirection is not needed (e.g. zswap is disabled), then
instead of having a vswap device pointing at a physical swap device, we
can just the data (e.g. phyiscal swap slot) in the vswap device
directly. Then we don't need a full swap info thing and swap table for
the physical swap device.

This directly ties into my question above, about why we have a
fully-fledged swap info thing for the physical swap device when using
vswap.

> >
> > All that being said, perhaps I am too out of touch with the code to
> > realize it's simply not possible.
> >
> > Honestly, if the main reason we can't have a single swap table for vswap
> > is saving 8 bytes on the reverse mapping, it sounds like a weak-ish
> > argument, even if we can't optimize the reverse mapping away. But maybe
> > I am also out of touch with RAM prices :)
> 
> In terms of the space overhead I do agree, FWIW :)
> 
> I think the other concern is the indirection overhead with going
> through the xarray for every swap operation, hence the per-CPU vswap
> cluster lookup caching idea:
> 
> https://lore.kernel.org/all/20260505153854.1612033-23-nphamcs@gmail.com/

Right, but we should already avoid the xarray with the swap table
design, right? We just have one swap table pointing to another
essentially?

> >
> > I at least hope that, the current design is not painting us into a
> > corner (e.g. through userspace interfaces), and we can still achieve a
> > vswap-for-all implementation in the future (maybe that's what you have
> > in mind already?).
> 
> That's still my plan. Operationally speaking, I want to make this
> completely transparent to users, with minimal to no performance
> overhead.

So if CONFIG_VSWAP is set all swap devices are vswap by default, right?
Would it help with testing if it's controlled by a boot param?

> 
> The next action item is to optimize for vswap-on-fast-swapfile case -
> that was Kairui's main concerns regarding performance. I spent a lot
> of time perfing and fixing issues for this case in v6. The issues with
> the most egregious effects and simplest fix (vswap-less
> swap-cache-only check for e.g) are already fixed in this new design,
> and eventually I will move the rest (lookup caching) and more to here.

So is the end goal to have vswap be the default rather than a special
swap device? It would certainly help to include some details about that.

> >
> > Aside from the swap code, the only sticking point for me is the logic
> > bifurcation in zswap. Why does zswap need to handle vswap vs. not vswap?
> > I thought the point of the design is to use vswap when zswap is used,
> > and otherwise use a normal swap table. In a way, one of the goals is to
> > make zswap a first class swap citizen, but it doesn't seem like we are
> > achieving that?
> 
> We already have all the machinery to make zswap completely
> independent. Right now, if you use vswap, you'll skip the zswap's
> internal xarray entirely, and just store a zswap entry in the virtual
> swap cluster's vtable.
> 
> I just haven't removed the old code for 2 reasons:
> 
> 1. Reduce the delta on this RFC, to ease the burden for reviewers (and
> definitely not because I'm lazy :P)
> 
> 2. The only other practical reason is so that we can let users compile
> with !CONFIG_VSWAP and still uses zswap on top of the old swapfile
> setup during the transition/experimentation period for now.
> 
> But logically and conceptually speaking, there is no reason I can come
> up with to use zswap on without vswap. The CPU indirection overhead is
> already partially there (since zswap uses an xarray) and further
> optimized (cluster loopup caching etc.), as well as the space overhead
> (vswap replaces the zswap xarray). I actually wrote a whole paragraph
> about how we should always go for vswap if we're using zswap, but then
> decide to remove it since there's no code for it yet.
> 
> If folks like it, what I can do is have CONFIG_ZSWAP depends on
> CONFIG_VSWAP, removes all the non-vswap logic, and call it a day? :)
> Then, on the swap allocation side, if vswap allocation fail and zswap
> writeback is disabled, we can error out early.

Hmm maybe we can keep it around for now and do that after vswap
stabilizes? It ultimately depend on how much complexity we maintain by
allowing both.

I think another problem is 32-bit, technically zswap can be used on
32-bit now, right? So vswap not supporitng 32-bit is a problem.

General question (for both zswap and general swap code), would a boot
param make implementation simpler? Right now we seem to key off the swap
device having the "vswap" flag, would it help if it was a runtime
constant?

Re: [RFC PATCH 0/5] mm, swap: Virtual Swap Space (Swap Table Edition)

Posted by Nhat Pham 4 days, 18 hours ago

On Wed, Jun 3, 2026 at 11:58 AM Yosry Ahmed <yosry@kernel.org> wrote:
>
> > > I assume the main reason here is to avoid the extra overhead if
> > > everything uses vswap, which would mainly be the reverse mapping
> > > overhead? I guess there's also some simplicity that comes from reusing
> > > the swap info infra as a whole, including the swap table.
> >
> > Yeah it helps a lot that we don't have to rewrite the whole allocator
> > and swap entry reference counting logic again :)
>
> I specifically meant using a full swap info thing for the physical swap
> device even when it's behind vswap. That seems like an overkill, and we
> don't need things like the swap entry reference coutning. We probably
> just need a bitmap and a reverse mapping.
>
> So I am assuming the main reason why we are not doing that (at least for
> now) is simplicity?

Mostly.

FWIW, we're pretty close to full deduplication. Right now, physical
swap clusters have a couple of fields that are not needed when they're
backing a vswap cluster:

1. The main swap table (which houses swap cache, swap shadow, and
reference counting): I repurpose it for the rmap :) It's an array of
unsigned long, which works for rmap.

2. memcg_table: still duplicated, but I think I can make sure this is
not allocated if physical swap clusters only back vswap entries. I
have a prototype that I'm testing for this.

3. The zeromap field: this is actually not allocated in 64 bit
architecture, IIUC, which is what I'm gating CONFIG_VSWAP on. If we
extend vswap to supporting 32 bits, this can also be dynamically
allocated.

4. Extend table - this is for the swap count overfills, and already
dynamically allocated.

>
> > >
> > > I don't like that the code bifurcates for vswap vs. normal swap entries
> > > though. Not sure if this is an issue that can be fixed with proper
> > > abstractions to hide it, or if the design needs modifications. I was
> > > honestly really hoping we don't end up with this. I was hoping that the
> > > physical swap device no longer uses a full swap table and all, and
> > > everything goes through vswap.
> > >
> > > I hoping that if redirection isn't needed (e.g. zswap is disabled),
> > > vswap can directly encode the physical swap slot so that the reverse
> > > mapping isn't needed -- so we avoid the overhead without keeping the
> > > physical swap device using a fully-fledged swap table.
> >
> > Can you expand on "vswap can directly encode the physical swap slot"?
> > I'm not sure I follow here.
>
> I meant that if redirection is not needed (e.g. zswap is disabled), then
> instead of having a vswap device pointing at a physical swap device, we
> can just the data (e.g. phyiscal swap slot) in the vswap device
> directly. Then we don't need a full swap info thing and swap table for
> the physical swap device.
>
> This directly ties into my question above, about why we have a
> fully-fledged swap info thing for the physical swap device when using
> vswap.

See above.

>
> > >
> > > All that being said, perhaps I am too out of touch with the code to
> > > realize it's simply not possible.
> > >
> > > Honestly, if the main reason we can't have a single swap table for vswap
> > > is saving 8 bytes on the reverse mapping, it sounds like a weak-ish
> > > argument, even if we can't optimize the reverse mapping away. But maybe
> > > I am also out of touch with RAM prices :)
> >
> > In terms of the space overhead I do agree, FWIW :)
> >
> > I think the other concern is the indirection overhead with going
> > through the xarray for every swap operation, hence the per-CPU vswap
> > cluster lookup caching idea:
> >
> > https://lore.kernel.org/all/20260505153854.1612033-23-nphamcs@gmail.com/
>
> Right, but we should already avoid the xarray with the swap table
> design, right? We just have one swap table pointing to another
> essentially?

Hmmm, I don't quite follow your suggestion here.

For normal swap devices, we organize the space into clusters, and
maintain them in various lists (free, nonfull, full etc.). The only
difference with a vswap device is we do not have a free list, and have
the clusters themselves dynamically allocated.

If we're using vswap, we will incur the xarray overhead. There's no
avoiding that if we want a dynamic indirection layer. We can of course
revisit this data structure design later.

So yes, it will be one swap table (vswap cluster) pointing to another
swap table (pswap cluster). But to get to the first swap table, you
will have to go through xarray still.

>
> > >
> > > I at least hope that, the current design is not painting us into a
> > > corner (e.g. through userspace interfaces), and we can still achieve a
> > > vswap-for-all implementation in the future (maybe that's what you have
> > > in mind already?).
> >
> > That's still my plan. Operationally speaking, I want to make this
> > completely transparent to users, with minimal to no performance
> > overhead.
>
> So if CONFIG_VSWAP is set all swap devices are vswap by default, right?
> Would it help with testing if it's controlled by a boot param?
>
> >
> > The next action item is to optimize for vswap-on-fast-swapfile case -
> > that was Kairui's main concerns regarding performance. I spent a lot
> > of time perfing and fixing issues for this case in v6. The issues with
> > the most egregious effects and simplest fix (vswap-less
> > swap-cache-only check for e.g) are already fixed in this new design,
> > and eventually I will move the rest (lookup caching) and more to here.
>
> So is the end goal to have vswap be the default rather than a special
> swap device? It would certainly help to include some details about that.

That is my preference - I did allude to it in "Runtime enable/disable
of the vswap device" follow-up section :) I'm a sucker for unified
paths. It kinda depends on whether we can optimize most of vswap
overhead away though - if not, then we have to maintain both paths.
Kairui, how do you feel about this?

>
> > >
> > > Aside from the swap code, the only sticking point for me is the logic
> > > bifurcation in zswap. Why does zswap need to handle vswap vs. not vswap?
> > > I thought the point of the design is to use vswap when zswap is used,
> > > and otherwise use a normal swap table. In a way, one of the goals is to
> > > make zswap a first class swap citizen, but it doesn't seem like we are
> > > achieving that?
> >
> > We already have all the machinery to make zswap completely
> > independent. Right now, if you use vswap, you'll skip the zswap's
> > internal xarray entirely, and just store a zswap entry in the virtual
> > swap cluster's vtable.
> >
> > I just haven't removed the old code for 2 reasons:
> >
> > 1. Reduce the delta on this RFC, to ease the burden for reviewers (and
> > definitely not because I'm lazy :P)
> >
> > 2. The only other practical reason is so that we can let users compile
> > with !CONFIG_VSWAP and still uses zswap on top of the old swapfile
> > setup during the transition/experimentation period for now.
> >
> > But logically and conceptually speaking, there is no reason I can come
> > up with to use zswap on without vswap. The CPU indirection overhead is
> > already partially there (since zswap uses an xarray) and further
> > optimized (cluster loopup caching etc.), as well as the space overhead
> > (vswap replaces the zswap xarray). I actually wrote a whole paragraph
> > about how we should always go for vswap if we're using zswap, but then
> > decide to remove it since there's no code for it yet.
> >
> > If folks like it, what I can do is have CONFIG_ZSWAP depends on
> > CONFIG_VSWAP, removes all the non-vswap logic, and call it a day? :)
> > Then, on the swap allocation side, if vswap allocation fail and zswap
> > writeback is disabled, we can error out early.
>
> Hmm maybe we can keep it around for now and do that after vswap
> stabilizes? It ultimately depend on how much complexity we maintain by
> allowing both.
>
> I think another problem is 32-bit, technically zswap can be used on
> 32-bit now, right? So vswap not supporitng 32-bit is a problem.

Ah shoot I forgot about that. Hmmm.

It's not impossible to make vswap support 32-bit. I did that for v6
after all. It just needs extra fields because we have fewer bits to
leverage in pointers etc., complicating the logic a bit. Follow-up
work? :)

>
> General question (for both zswap and general swap code), would a boot
> param make implementation simpler? Right now we seem to key off the swap
> device having the "vswap" flag, would it help if it was a runtime
> constant?

Hmmm, even if it's a runtime constant, both branches still have to be
there, no? Does the boot param simplify it somehow?

Re: [RFC PATCH 0/5] mm, swap: Virtual Swap Space (Swap Table Edition)

Posted by Yosry Ahmed 4 days, 17 hours ago

On Wed, Jun 3, 2026 at 12:26 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Wed, Jun 3, 2026 at 11:58 AM Yosry Ahmed <yosry@kernel.org> wrote:
> >
> > > > I assume the main reason here is to avoid the extra overhead if
> > > > everything uses vswap, which would mainly be the reverse mapping
> > > > overhead? I guess there's also some simplicity that comes from reusing
> > > > the swap info infra as a whole, including the swap table.
> > >
> > > Yeah it helps a lot that we don't have to rewrite the whole allocator
> > > and swap entry reference counting logic again :)
> >
> > I specifically meant using a full swap info thing for the physical swap
> > device even when it's behind vswap. That seems like an overkill, and we
> > don't need things like the swap entry reference coutning. We probably
> > just need a bitmap and a reverse mapping.
> >
> > So I am assuming the main reason why we are not doing that (at least for
> > now) is simplicity?
>
> Mostly.
>
> FWIW, we're pretty close to full deduplication. Right now, physical
> swap clusters have a couple of fields that are not needed when they're
> backing a vswap cluster:
>
> 1. The main swap table (which houses swap cache, swap shadow, and
> reference counting): I repurpose it for the rmap :) It's an array of
> unsigned long, which works for rmap.
>
> 2. memcg_table: still duplicated, but I think I can make sure this is
> not allocated if physical swap clusters only back vswap entries. I
> have a prototype that I'm testing for this.
>
> 3. The zeromap field: this is actually not allocated in 64 bit
> architecture, IIUC, which is what I'm gating CONFIG_VSWAP on. If we
> extend vswap to supporting 32 bits, this can also be dynamically
> allocated.
>
> 4. Extend table - this is for the swap count overfills, and already
> dynamically allocated.

I see.

> > > > All that being said, perhaps I am too out of touch with the code to
> > > > realize it's simply not possible.
> > > >
> > > > Honestly, if the main reason we can't have a single swap table for vswap
> > > > is saving 8 bytes on the reverse mapping, it sounds like a weak-ish
> > > > argument, even if we can't optimize the reverse mapping away. But maybe
> > > > I am also out of touch with RAM prices :)
> > >
> > > In terms of the space overhead I do agree, FWIW :)
> > >
> > > I think the other concern is the indirection overhead with going
> > > through the xarray for every swap operation, hence the per-CPU vswap
> > > cluster lookup caching idea:
> > >
> > > https://lore.kernel.org/all/20260505153854.1612033-23-nphamcs@gmail.com/
> >
> > Right, but we should already avoid the xarray with the swap table
> > design, right? We just have one swap table pointing to another
> > essentially?
>
> Hmmm, I don't quite follow your suggestion here.
>
> For normal swap devices, we organize the space into clusters, and
> maintain them in various lists (free, nonfull, full etc.). The only
> difference with a vswap device is we do not have a free list, and have
> the clusters themselves dynamically allocated.
>
> If we're using vswap, we will incur the xarray overhead. There's no
> avoiding that if we want a dynamic indirection layer. We can of course
> revisit this data structure design later.
>
> So yes, it will be one swap table (vswap cluster) pointing to another
> swap table (pswap cluster). But to get to the first swap table, you
> will have to go through xarray still.

Why the xarray? Don't page tables (and shmem page cache) just point
directly to the vswap entry the same way they point to swap entries
today?

*looks at the code*

Oh, it's to find the actual cluster because the vswap file can be
sparse? Hmm yeah I guess we can revisit the data structure here later,
but IIRC xarrays aren't particularly good for sparse data. Maybe it's
usually not sprase in practice.

Maybe a maple tree? :)

> > > If folks like it, what I can do is have CONFIG_ZSWAP depends on
> > > CONFIG_VSWAP, removes all the non-vswap logic, and call it a day? :)
> > > Then, on the swap allocation side, if vswap allocation fail and zswap
> > > writeback is disabled, we can error out early.
> >
> > Hmm maybe we can keep it around for now and do that after vswap
> > stabilizes? It ultimately depend on how much complexity we maintain by
> > allowing both.
> >
> > I think another problem is 32-bit, technically zswap can be used on
> > 32-bit now, right? So vswap not supporitng 32-bit is a problem.
>
> Ah shoot I forgot about that. Hmmm.
>
> It's not impossible to make vswap support 32-bit. I did that for v6
> after all. It just needs extra fields because we have fewer bits to
> leverage in pointers etc., complicating the logic a bit. Follow-up
> work? :)

Yeah we can do that, but it's a blocker for zswap only using vswap.

> > General question (for both zswap and general swap code), would a boot
> > param make implementation simpler? Right now we seem to key off the swap
> > device having the "vswap" flag, would it help if it was a runtime
> > constant?
>
> Hmmm, even if it's a runtime constant, both branches still have to be
> there, no? Does the boot param simplify it somehow?

Maybe it doesn't simplify the code, but if the branching causes
performance overhead we can use static keys. I guess we can still use
static keys per-swapfile, but it would be more complicated.

Anyway, not super important now.

Re: [RFC PATCH 0/5] mm, swap: Virtual Swap Space (Swap Table Edition)

Posted by Nhat Pham 4 days, 17 hours ago

On Wed, Jun 3, 2026 at 12:35 PM Yosry Ahmed <yosry@kernel.org> wrote:
>
> > > > > All that being said, perhaps I am too out of touch with the code to
> > > > > realize it's simply not possible.
> > > > >
> > > > > Honestly, if the main reason we can't have a single swap table for vswap
> > > > > is saving 8 bytes on the reverse mapping, it sounds like a weak-ish
> > > > > argument, even if we can't optimize the reverse mapping away. But maybe
> > > > > I am also out of touch with RAM prices :)
> > > >
> > > > In terms of the space overhead I do agree, FWIW :)
> > > >
> > > > I think the other concern is the indirection overhead with going
> > > > through the xarray for every swap operation, hence the per-CPU vswap
> > > > cluster lookup caching idea:
> > > >
> > > > https://lore.kernel.org/all/20260505153854.1612033-23-nphamcs@gmail.com/
> > >
> > > Right, but we should already avoid the xarray with the swap table
> > > design, right? We just have one swap table pointing to another
> > > essentially?
> >
> > Hmmm, I don't quite follow your suggestion here.
> >
> > For normal swap devices, we organize the space into clusters, and
> > maintain them in various lists (free, nonfull, full etc.). The only
> > difference with a vswap device is we do not have a free list, and have
> > the clusters themselves dynamically allocated.
> >
> > If we're using vswap, we will incur the xarray overhead. There's no
> > avoiding that if we want a dynamic indirection layer. We can of course
> > revisit this data structure design later.
> >
> > So yes, it will be one swap table (vswap cluster) pointing to another
> > swap table (pswap cluster). But to get to the first swap table, you
> > will have to go through xarray still.
>
> Why the xarray? Don't page tables (and shmem page cache) just point
> directly to the vswap entry the same way they point to swap entries
> today?
>
> *looks at the code*
>
> Oh, it's to find the actual cluster because the vswap file can be
> sparse? Hmm yeah I guess we can revisit the data structure here later,

Less sparsity, and more dynamicity :) It might be dense for all we
know - we just don't really know (or want to figure out) the size
statically.

> but IIRC xarrays aren't particularly good for sparse data. Maybe it's
> usually not sprase in practice.
>
> Maybe a maple tree? :)

Maybe :)

>
> > > > If folks like it, what I can do is have CONFIG_ZSWAP depends on
> > > > CONFIG_VSWAP, removes all the non-vswap logic, and call it a day? :)
> > > > Then, on the swap allocation side, if vswap allocation fail and zswap
> > > > writeback is disabled, we can error out early.
> > >
> > > Hmm maybe we can keep it around for now and do that after vswap
> > > stabilizes? It ultimately depend on how much complexity we maintain by
> > > allowing both.
> > >
> > > I think another problem is 32-bit, technically zswap can be used on
> > > 32-bit now, right? So vswap not supporitng 32-bit is a problem.
> >
> > Ah shoot I forgot about that. Hmmm.
> >
> > It's not impossible to make vswap support 32-bit. I did that for v6
> > after all. It just needs extra fields because we have fewer bits to
> > leverage in pointers etc., complicating the logic a bit. Follow-up
> > work? :)
>
> Yeah we can do that, but it's a blocker for zswap only using vswap.

Yeah we can table that. FWIW, if you enable vswap, then zswap should
go through vswap already. It's just code complexity (hopefully for a
short while).

>
> > > General question (for both zswap and general swap code), would a boot
> > > param make implementation simpler? Right now we seem to key off the swap
> > > device having the "vswap" flag, would it help if it was a runtime
> > > constant?
> >
> > Hmmm, even if it's a runtime constant, both branches still have to be
> > there, no? Does the boot param simplify it somehow?
>
> Maybe it doesn't simplify the code, but if the branching causes
> performance overhead we can use static keys. I guess we can still use
> static keys per-swapfile, but it would be more complicated.
>
> Anyway, not super important now.

Ahhh I see what you mean. Yeah we can optimize this later.

Re: [RFC PATCH 0/5] mm, swap: Virtual Swap Space (Swap Table Edition)

Posted by Nhat Pham 4 days, 20 hours ago

On Wed, Jun 3, 2026 at 10:12 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Tue, Jun 2, 2026 at 6:29 PM Yosry Ahmed <yosry@kernel.org> wrote:
> >
> > > II. Design
> > >
> > > With vswap, pages are assigned virtual swap entries on a ghost device
> > > with no backing storage. These entries are backed by zswap, zero pages,
> > > or (lazily) physical swap slots. Physical backing is allocated only
> > > when needed — on zswap writeback or reclaim writeout, after the rmap
> > > step.
> > >
> > > Compared to the standalone v6 implementation [1], which introduces a
> > > 24-byte per-entry swap descriptor and its own cluster allocator, this
> > > edition uses swap_table infrastructure, and share a lot of the allocator
> > > logic. Per-slot metadata is stored in a tag-encoded virtual_table
> > > (atomic_long_t, 8 bytes per slot), and physical clusters store
> > > Pointer-tagged rmap entries in the swap_table for reverse lookup back to
> > > the virtual cluster.
> > >
> > > Here are some data layout diagrams:
> > >
> > >   Case 1: vswap entry (virtualized)
> > >
> > >   PTE                  swap_cluster_info_dynamic
> > >   vswap_entry          +-------------------------+
> > >   (swp_entry_t) ------>| swap_cluster_info (ci)  |
> > >                        | +--------------------+  |
> > >                        | | swap_table         |  |
> > >                        | |   PFN / Shadow     |  |
> > >                        | | memcg_table        |  |
> > >                        | | count,flags,order  |  |
> > >                        | | lock, list         |  |
> > >                        | +--------------------+  |
> > >                        |                         |
> > >                        | virtual_table           |
> > >                        | +--------------------+  |
> > >                        | | NONE               |  |
> > >                        | | PHYS               |  |
> > >                        | | ZERO               |  |
> > >                        | | ZSWAP(entry*)      |  |
> > >                        | | FOLIO(folio*)      |  |
> > >                        | +--------------------+  |
> > >                        +-------------------------+
> > >                               |
> > >                               | PHYS resolves to
> > >                               v
> > >                        PHYSICAL CLUSTER (swap_cluster_info)
> > >                        +--------------------------+
> > >                        | swap_table per-slot:     |
> > >                        |   NULL   - free          |
> > >                        |   PFN    - cached folio  |
> > >                        |   Shadow - swapped out   |
> > >                        |   Pointer- vswap rmap    |
> > >                        |   Bad    - unusable      |
> > >                        |                          |
> > >                        | Vswap-backing slot:      |
> > >                        |   Pointer(C|swp_entry_t) |
> > >                        |     rmap back to vswap   |
> > >                        +--------------------------+
> > >
> > >   Case 2: direct-mapped physical entry (no vswap)
> > >
> > >   PTE                  PHYSICAL CLUSTER (swap_cluster_info)
> > >   phys_entry           +--------------------------+
> > >   (swp_entry_t) ------>| swap_table per-slot:     |
> > >                        |   NULL   - free          |
> > >                        |   PFN    - cached folio  |
> > >                        |   Shadow - swapped out   |
> > >                        |   Bad    - unusable      |
> > >                        +--------------------------+
> > >
> > > struct swap_cluster_info_dynamic {
> > >     struct swap_cluster_info ci;       /* swap_table, lock, etc. */
> > >     unsigned int index;                /* position in xarray */
> > >     struct rcu_head rcu;               /* kfree_rcu deferred free */
> > >     atomic_long_t *virtual_table;      /* backend info, 8 B/slot */
> > > };
> > >
> > > Each vswap cluster (swap_cluster_info_dynamic) extends the classic
> > > swap_cluster_info struct with a virtual_table array that stores the
> > > backend information for each virtual swap entry in the cluster. Each
> > > entry is tag-encoded in the low 3 bits to indicate backend types:
> > >
> > >   NONE:   |----- 0000 ------|000|  free / unbacked
> > >   PHYS:   |-- (type:5,off:N)|001|  on a physical swapfile (shifted)
> > >   ZERO:   |----- 0000 ------|010|  zero-filled page
> > >   ZSWAP:  |--- zswap_entry* |011|  compressed in zswap
> > >   FOLIO:  |--- folio* ------|100|  in-memory folio
> > >
> > > We still have room for 3 more future backend types, for e.g. CRAM, i.e
> > > compressed-CXL-as-swap, which is laid out in [10] and [11]. Worst
> > > case scenario, we can add more fields to this extended struct.
> > >
> > > Other design points:
> > > - Both vswap entries (Case 1) and directly-mapped physical entries
> > >   (Case 2) coexist as first-class citizens. All the common swap
> > >   code paths — swapout, swapin, swap freeing, swapoff, zswap
> > >   writeback, THP swapin, etc. work for both. When CONFIG_VSWAP=n,
> > >   the vswap branches compile out and behavior should be identical to
> > >   today's swap-table P4 (at least that is my intention).
> > > - Pointer-tagged swap_table on physical clusters for rmap (physical
> > >   -> virtual) lookup.
> > > - Virtual swap slots not backed by physical swap are not charged to
> > >   memcg swap counters — only physical backing is charged (I made the
> > >   case for this in [7]).
> > > - Careful separation of vswap and physical swap allocation paths and
> > >   structures adds a lot of complexity, but is crucial to make sure
> > >   both paths are efficient and do not conflict with each other (for
> > >   correctness and performance). I do re-use a lot of the allocation
> > >   logic wherever possible though.
> >
> > Thanks for working on this! I mostly looked at the high-level design and
>
> Thank you for initiating this effort in LSFMMBPF 2023 (god, time
> flies). I was very excited by your presentation and decided to take a
> stab at it :)
>
> (I'll be sure to mention the full context in a non-RFC version - it
> has a lot of gems in our technical discussions).
>
> > the zswap parts, as the swap code has changed a lot since I was familiar
> > with it :)
>
> It has changed a lot since 6.19, when I was working on v6. Very
> exciting time to be a (z)swap developer right now - we have new ideas
> and new features every other week :) Reviewing code has been quite a
> joy (albeit a lot of work).
>
> >
> > It seems like the direction being taken here is that we have one
> > (massive) vswap swap device, and we keep normal physical swap devices
> > around as well.
>
> Yep.
>
> >
> > A vswap entry can point at a physical swap entry, or zswap, or zeromap.
> > If a vswap entry points at a physical swap entry, then the physical swap
> > entry points back at the vswap entry (a reverse mapping).
>
> Yep.
>
> >
> > I assume the main reason here is to avoid the extra overhead if
> > everything uses vswap, which would mainly be the reverse mapping
> > overhead? I guess there's also some simplicity that comes from reusing
> > the swap info infra as a whole, including the swap table.
>
> Yeah it helps a lot that we don't have to rewrite the whole allocator
> and swap entry reference counting logic again :)
>
> >
> > I don't like that the code bifurcates for vswap vs. normal swap entries
> > though. Not sure if this is an issue that can be fixed with proper
> > abstractions to hide it, or if the design needs modifications. I was
> > honestly really hoping we don't end up with this. I was hoping that the
> > physical swap device no longer uses a full swap table and all, and
> > everything goes through vswap.
> >
> > I hoping that if redirection isn't needed (e.g. zswap is disabled),
> > vswap can directly encode the physical swap slot so that the reverse
> > mapping isn't needed -- so we avoid the overhead without keeping the
> > physical swap device using a fully-fledged swap table.
>
> Can you expand on "vswap can directly encode the physical swap slot"?
> I'm not sure I follow here.
>
> >
> > All that being said, perhaps I am too out of touch with the code to
> > realize it's simply not possible.
> >
> > Honestly, if the main reason we can't have a single swap table for vswap
> > is saving 8 bytes on the reverse mapping, it sounds like a weak-ish
> > argument, even if we can't optimize the reverse mapping away. But maybe
> > I am also out of touch with RAM prices :)
>
> In terms of the space overhead I do agree, FWIW :)
>
> I think the other concern is the indirection overhead with going
> through the xarray for every swap operation, hence the per-CPU vswap
> cluster lookup caching idea:
>
> https://lore.kernel.org/all/20260505153854.1612033-23-nphamcs@gmail.com/
>
> >
> > I at least hope that, the current design is not painting us into a
> > corner (e.g. through userspace interfaces), and we can still achieve a
> > vswap-for-all implementation in the future (maybe that's what you have
> > in mind already?).
>
> That's still my plan. Operationally speaking, I want to make this
> completely transparent to users, with minimal to no performance
> overhead.

I do want to add that, even without achieving this, the current design
already enables a lot of use cases. I think it is a good compromise to
maintain both virtual and directly mapped physical swap entries for
now, and revisit the conversation of whether we can afford a mandatory
vswap layer once all the optimizations have been done :)

We should strive to simplify the codebase, and it will naturally
happen when the original overhead concern is no longer there. A
swap-related example: a few years ago, everyone thought swap slot
cache was needed. But then, Kairui optimized the swap allocator's lock
contention issue away, and that swap slot cache is suddenly redundant.
That finally allowed us to get rid of it. Similar thing happened (or
is happening?) with the SWP_SYNCHRONOUS_IO swapcache-skipping
heuristics.

Re: [RFC PATCH 0/5] mm, swap: Virtual Swap Space (Swap Table Edition)

Posted by Yosry Ahmed 4 days, 18 hours ago

> > > I don't like that the code bifurcates for vswap vs. normal swap entries
> > > though. Not sure if this is an issue that can be fixed with proper
> > > abstractions to hide it, or if the design needs modifications. I was
> > > honestly really hoping we don't end up with this. I was hoping that the
> > > physical swap device no longer uses a full swap table and all, and
> > > everything goes through vswap.
> > >
> > > I hoping that if redirection isn't needed (e.g. zswap is disabled),
> > > vswap can directly encode the physical swap slot so that the reverse
> > > mapping isn't needed -- so we avoid the overhead without keeping the
> > > physical swap device using a fully-fledged swap table.
> >
> > Can you expand on "vswap can directly encode the physical swap slot"?
> > I'm not sure I follow here.
> >
> > >
> > > All that being said, perhaps I am too out of touch with the code to
> > > realize it's simply not possible.
> > >
> > > Honestly, if the main reason we can't have a single swap table for vswap
> > > is saving 8 bytes on the reverse mapping, it sounds like a weak-ish
> > > argument, even if we can't optimize the reverse mapping away. But maybe
> > > I am also out of touch with RAM prices :)
> >
> > In terms of the space overhead I do agree, FWIW :)
> >
> > I think the other concern is the indirection overhead with going
> > through the xarray for every swap operation, hence the per-CPU vswap
> > cluster lookup caching idea:
> >
> > https://lore.kernel.org/all/20260505153854.1612033-23-nphamcs@gmail.com/
> >
> > >
> > > I at least hope that, the current design is not painting us into a
> > > corner (e.g. through userspace interfaces), and we can still achieve a
> > > vswap-for-all implementation in the future (maybe that's what you have
> > > in mind already?).
> >
> > That's still my plan. Operationally speaking, I want to make this
> > completely transparent to users, with minimal to no performance
> > overhead.
> 
> I do want to add that, even without achieving this, the current design
> already enables a lot of use cases. I think it is a good compromise to
> maintain both virtual and directly mapped physical swap entries for
> now, and revisit the conversation of whether we can afford a mandatory
> vswap layer once all the optimizations have been done :)
> 
> We should strive to simplify the codebase, and it will naturally
> happen when the original overhead concern is no longer there. A
> swap-related example: a few years ago, everyone thought swap slot
> cache was needed. But then, Kairui optimized the swap allocator's lock
> contention issue away, and that swap slot cache is suddenly redundant.
> That finally allowed us to get rid of it. Similar thing happened (or
> is happening?) with the SWP_SYNCHRONOUS_IO swapcache-skipping
> heuristics.

I agree, I just want to make sure we have a line of sight (or at least
no blockers) to having a unified vswap layer in the future.