[RFC PATCH v2 00/18] Virtual Swap Space
Posted by Nhat Pham 7 months, 3 weeks ago
Changelog:
* v2:
	* Use a single atomic type (swap_refs) for reference counting
	  purpose. This brings the size of the swap descriptor from 64 KB
	  down to 48 KB (25% reduction). Suggested by Yosry Ahmed.
	* Zeromap bitmap is removed in the virtual swap implementation.
	  This saves one bit per physical swapfile slot.
	* Rearrange the patches and the code change to make things more
	  reviewable. Suggested by Johannes Weiner.
	* Update the cover letter a bit.

This RFC implements the virtual swap space idea, based on Yosry's
proposals at LSFMMBPF 2023 (see [1], [2], [3]), as well as valuable
inputs from Johannes Weiner. The same idea (with different
implementation details) has been floated by Rik van Riel since at least
2011 (see [8]).


I. Motivation

Currently, when an anon page is swapped out, a slot in a backing swap
device is allocated and stored in the page table entries that refer to
the original page. This slot is also used as the "key" to find the
swapped out content, as well as the index to swap data structures, such
as the swap cache, or the swap cgroup mapping. Tying a swap entry to its
backing slot in this way is performant and efficient when swap is
purely disk space and swapoff is rare.

However, the advent of many swap optimizations has exposed major
drawbacks of this design. The first problem is that we occupy a physical
slot in the swap space, even for pages that are NEVER expected to hit
the disk: pages compressed and stored in the zswap pool, zero-filled
pages, or pages rejected by both of these optimizations when zswap
writeback is disabled. This is arguably the central shortcoming of
zswap:
* In deployments where no disk space can be afforded for swap (such as
  mobile and embedded devices), users cannot adopt zswap, and are forced
  to use zram. This is confusing for users, and creates extra burden
  for developers, who have to develop and maintain similar features for
  two separate swap backends (writeback, cgroup charging, THP support,
  etc.). For instance, see the discussion in [4].
* Resource-wise, it is hugely wasteful in terms of disk usage. At Meta,
  we have swapfiles on the order of tens to hundreds of GBs, which are
  mostly unused and only exist to enable zswap and the zero-filled
  page swap optimization. This also implicitly caps the memory
  saving potential of these swap optimizations at the static size of
  the swapfile, which is especially problematic in high-memory systems
  that can have up to TBs worth of memory.
* Operationally, the old design poses significant challenges, because
  the sysadmin has to prescribe how much swap is needed a priori, for
  each combination of (memory size x disk space x workload usage). It
  is even more complicated when we take into account the variance of
  memory compression, which changes the reclaim dynamics (and as a
  result, the swap space requirements). The problem is further exacerbated
  for users who rely on swap utilization (and exhaustion) as an OOM
  signal.

Another motivation for a swap redesign is to simplify swapoff, which
is both complicated and expensive in the current design. Tight coupling
between a swap entry and its backing storage means that swapoff
requires a full page table walk to update all the page table entries
that refer to each swap entry, as well as updating all the associated
swap data structures (swap cache, etc.).


II. High Level Design Overview

To fix the aforementioned issues, we need an abstraction that separates
a swap entry from its physical backing storage. IOW, we need to
“virtualize” the swap space: swap clients will work with a dynamically
allocated virtual swap slot, storing it in page table entries, and
using it to index into various swap-related data structures. The
backing storage is decoupled from the virtual swap slot, and the newly
introduced layer will “resolve” the virtual swap slot to the actual
storage. This layer also manages other metadata of the swap entry, such
as its lifetime information (swap count), via a dynamically allocated,
per-swap-entry descriptor:

struct swp_desc {
	/* The current backing of this virtual swap slot. */
	union {
		swp_slot_t slot;		/* physical swapfile slot */
		struct folio *folio;		/* in-memory page */
		struct zswap_entry *zswap_entry; /* zswap-compressed copy */
	};
	struct rcu_head rcu;

	/* Protects the backing store information above. */
	rwlock_t lock;
	enum swap_type type;	/* which backend currently backs this entry */

#ifdef CONFIG_MEMCG
	/* ID of the memory cgroup that owns this swap entry. */
	atomic_t memcgid;
#endif

	/* Lifetime (swap count) / reference information. */
	atomic_t swap_refs;
};

The size of the swap descriptor (without debug config options) is 48
bytes.
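
For illustration, resolving a virtual swap slot to its current backing
boils down to a descriptor lookup plus a read of its type. Below is a
minimal sketch; the xarray-based lookup and the vswap_* / VSWAP_* names
are illustrative assumptions for this cover letter, not necessarily the
actual API in the patches:

static DEFINE_XARRAY(vswap_descs);	/* virtual slot offset -> swp_desc */

static struct swp_desc *vswap_slot_to_desc(swp_entry_t entry)
{
	return xa_load(&vswap_descs, swp_offset(entry));
}

/*
 * Returns true and fills *slot if the entry is currently backed by a
 * physical swapfile slot; returns false for zswap, zero-filled and
 * in-memory backings, which occupy no physical slot at all.
 */
static bool vswap_backing_slot(swp_entry_t entry, swp_slot_t *slot)
{
	struct swp_desc *desc = vswap_slot_to_desc(entry);
	bool backed;

	read_lock(&desc->lock);
	backed = desc->type == VSWAP_SWAPFILE;
	if (backed)
		*slot = desc->slot;
	read_unlock(&desc->lock);
	return backed;
}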

This design allows us to:
* Decouple zswap (and zeromapped swap entry) from backing swapfile:
  simply associate the virtual swap slot with one of the supported
  backends: a zswap entry, a zero-filled swap page, a slot on the
  swapfile, or an in-memory page.
* Simplify and optimize swapoff: we only have to fault the page in and
  have the virtual swap slot point to the page instead of the on-disk
  physical swap slot. No page table walk is needed (see the sketch
  below).
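
A sketch of the per-entry swapoff step, under the same naming
assumptions as the previous sketch (the actual code in the patches
differs in detail):

/*
 * Sketch only: bring the page back into memory, then repoint the
 * descriptor from the physical slot to the folio. PTEs keep holding
 * the same virtual swap slot throughout, so no page table walk.
 */
static void vswap_swapoff_one(struct swp_desc *desc, struct folio *folio)
{
	write_lock(&desc->lock);
	desc->type = VSWAP_FOLIO;	/* backing is now an in-memory page */
	desc->folio = folio;
	write_unlock(&desc->lock);
	/* The physical slot can now be freed back to the swapfile. */
}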

Please see the attached patches for implementation details.

Note that I do not remove the old implementation for now. Users can
select between the old and the new implementation via the
CONFIG_VIRTUAL_SWAP build config. This will also allow us to land the
new design, and iteratively optimize upon it (without having to include
everything in an even more massive patch series).


III. Future Use Cases

While the patch series focuses on two applications (decoupling swap
backends and swapoff optimization/simplification), this new,
future-proof design also allows us to implement new swap features more
easily and efficiently:

* Multi-tier swapping (as mentioned in [5]), with transparent
  transferring (promotion/demotion) of pages across tiers (see [8] and
  [9]). Similar to swapoff, with the old design we would need to
  perform the expensive page table walk.
* Swapfile compaction to alleviate fragmentation (as proposed by Ying
  Huang in [6]).
* Mixed-backing THP swapin (see [7]): once you have pinned down the
  backing stores of a THP, you can dispatch each range of subpages to
  the appropriate swapin handler.
* Swapping a folio out with discontiguous physical swap slots
  (see [10]).
* Zswap writeback optimization: The current architecture pre-reserves
  physical swap space for pages when they enter the zswap pool, giving
  the kernel no flexibility at writeback time. With the virtual swap
  implementation, the backends are decoupled, and physical swap space
  is allocated on-demand at writeback time, at which point we can make
  much smarter decisions: we can batch multiple zswap writeback
  operations into a single IO request, allocating contiguous physical
  swap slots for that request. We can even perform compressed writeback
  (i.e., writing these pages without decompressing them) (see [12]).


IV. Potential Issues

Here are a couple of issues I can think of, along with some potential
solutions:

1. Space overhead: we need one swap descriptor per swap entry.
* Note that this overhead is dynamic, i.e., only incurred when we actually
  need to swap a page out.
* The swap descriptor replaces many other swap data structures:
  swap_cgroup arrays, zeromap, etc.
* It can be further offset by swap_map reduction: we only need 3 states
  for each entry in the swap_map (unallocated, allocated, bad). The
  last two states are potentially mergeable, reducing the swap_map to a
  bitmap (see the sketch below).
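
For illustration, with effectively one bit of state per physical slot,
slot allocation against the swap_map could in principle reduce to plain
bitmap operations. This is a hypothetical sketch of future work, not
part of this series (locking and clustering are elided):

/*
 * Hypothetical bitmap-based swap_map: one bit per physical slot,
 * set = allocated. Locking / clustering concerns are ignored here.
 */
static unsigned long *swap_slot_bitmap;	/* si->max bits */

static long swap_bitmap_alloc(unsigned long max_slots)
{
	unsigned long off = find_first_zero_bit(swap_slot_bitmap, max_slots);

	if (off >= max_slots)
		return -ENOSPC;
	set_bit(off, swap_slot_bitmap);
	return off;
}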

2. Lock contention: since the virtual swap space is dynamic/unbounded,
we cannot naively range partition it anymore. This can increase lock
contention on swap-related data structures (swap cache, zswap’s xarray,
etc.).
* The problem is slightly alleviated by the lockless nature of the new
  reference counting scheme, as well as the per-entry locking for
  backing store information.
* Johannes suggested that I can implement a dynamic partition scheme, in
  which new partitions (along with associated data structures) are
  allocated on demand. It is one extra layer of indirection, but global
  locking will only be done on partition allocation, rather than on
  each access. All other accesses only take local (per-partition)
  locks, or are completely lockless (such as partition lookup). A rough
  sketch of this idea is included below.

  This idea is very similar to Kairui's work to optimize the (physical)
  swap allocator. He is currently also working on a swap redesign (see
  [11]) - perhaps we can combine the two efforts to take advantage of
  the swap allocator's efficiency for virtual swap.
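
Roughly, such a dynamic partition scheme could look like the following
(purely illustrative names and sizes; this is not part of this series):

/*
 * Illustrative only: the virtual swap space grows in fixed-size
 * partitions, each with its own data structures and locks. The global
 * lock is taken only when a new partition is added; partition lookup
 * itself is lockless.
 */
#define VSWAP_PARTITION_SHIFT	20	/* e.g. 1M entries per partition */

struct vswap_partition {
	struct xarray descs;	/* per-partition descriptor map */
	atomic_t nr_free;	/* free virtual slots in this partition */
};

static DEFINE_XARRAY(vswap_partitions);	  /* partition id -> partition */
static DEFINE_SPINLOCK(vswap_grow_lock);  /* only taken to add partitions */

static struct vswap_partition *vswap_get_partition(unsigned long vswap_off)
{
	/* Lockless lookup in the common case. */
	return xa_load(&vswap_partitions, vswap_off >> VSWAP_PARTITION_SHIFT);
}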


V. Benchmarking

As a proof of concept, I ran the prototype through some simple
benchmarks:

1. usemem: 16 threads, 2G each, memory.max = 16G

I benchmarked the following usemem command:

time usemem --init-time -w -O -s 10 -n 16 2g

Baseline:
real: 33.96s
user: 25.31s
sys: 341.09s
average throughput: 111295.45 KB/s
average free time: 2079258.68 usecs

New Design:
real: 35.87s
user: 25.15s
sys: 373.01s
average throughput: 106965.46 KB/s
average free time: 3192465.62 usecs

To root cause this regression, I ran perf on the usemem program, as
well as on the following stress-ng program:

perf record -ag -e cycles -G perf_cg -- ./stress-ng/stress-ng  --pageswap $(nproc) --pageswap-ops 100000

and observed the (predicted) increase in lock contention on swap cache
accesses. This regression is alleviated if I put together the
following hack: limit the virtual swap space to a sufficient size for
the benchmark, range-partition the swap-related data structures (swap
cache, zswap tree, etc.) based on the limit, and distribute the
allocation of virtual swap slots among these partitions (on a per-CPU
basis):

real: 34.94s
user: 25.28s
sys: 360.25s
average throughput: 108181.15 KB/s
average free time: 2680890.24 usecs

As mentioned above, I will implement proper dynamic virtual swap space
partitioning in follow-up work, or adopt Kairui's solution.

2. Kernel building: zswap enabled, 52 workers (one per processor),
memory.max = 3G.

Baseline:
real: 183.55s
user: 5119.01s
sys: 655.16s

New Design:
real: mean: 184.5s
user: mean: 5117.4s
sys: mean: 695.23s

New Design (Static Partition):
real: 183.95s
user: 5119.29s
sys: 664.24s

3. Swapoff: 32 GB swapfile, 50% full, with a process that mmap-ed a
128GB file.

Baseline:
real: 25.54s
user: 0.00s
sys: 11.48s
    
New Design:
real: 11.69s
user: 0.00s
sys: 9.96s
    
The new design reduces the kernel CPU time by about 13%. There is also a
reduction in real time, but this is mostly due to more asynchronous IO
(rather than the design itself) :)


VI. TODO list

This RFC includes a feature-complete prototype on top of 6.14. Here are
some action items:

Short-term: needs to be done before merging
* More clean-ups and stress-testing.
* Add more documentation of the new design and its API.

Medium-term: optimizations required to make the virtual swap
implementation the default:
* Shrinking the swap map.
* Range-partitioning the virtual swap space.
* More benchmarking and experiments in a variety of use cases.

Long-term: removal of the old implementation and other non-blocking
opportunities
* Remove the old implementation, once no major regressions or
  bottlenecks remain with the new design.
* Merge more existing swap data structures into this layer (e.g., the
  MTE swap xarray).
* Adding new use cases :)


VII. References

[1]: https://lore.kernel.org/all/CAJD7tkbCnXJ95Qow_aOjNX6NOMU5ovMSHRC+95U4wtW6cM+puw@mail.gmail.com/
[2]: https://lwn.net/Articles/932077/
[3]: https://www.youtube.com/watch?v=Hwqw_TBGEhg
[4]: https://lore.kernel.org/all/Zqe_Nab-Df1CN7iW@infradead.org/ 
[5]: https://lore.kernel.org/lkml/CAF8kJuN-4UE0skVHvjUzpGefavkLULMonjgkXUZSBVJrcGFXCA@mail.gmail.com/
[6]: https://lore.kernel.org/linux-mm/87o78mzp24.fsf@yhuang6-desk2.ccr.corp.intel.com/
[7]: https://lore.kernel.org/all/CAGsJ_4ysCN6f7qt=6gvee1x3ttbOnifGneqcRm9Hoeun=uFQ2w@mail.gmail.com/
[8]: https://lore.kernel.org/linux-mm/4DA25039.3020700@redhat.com/
[9]: https://lore.kernel.org/all/CA+ZsKJ7DCE8PMOSaVmsmYZL9poxK6rn0gvVXbjpqxMwxS2C9TQ@mail.gmail.com/
[10]: https://lore.kernel.org/all/CACePvbUkMYMencuKfpDqtG1Ej7LiUS87VRAXb8sBn1yANikEmQ@mail.gmail.com/
[11]: https://lore.kernel.org/all/CAMgjq7BvQ0ZXvyLGp2YP96+i+6COCBBJCYmjXHGBnfisCAb8VA@mail.gmail.com/
[12]: https://lore.kernel.org/linux-mm/ZeZSDLWwDed0CgT3@casper.infradead.org/

Nhat Pham (18):
  swap: rearrange the swap header file
  swapfile: rearrange functions
  swapfile: rearrange freeing steps
  mm: swap: add an abstract API for locking out swapoff
  mm: swap: add a separate type for physical swap slots
  mm: create scaffolds for the new virtual swap implementation
  mm: swap: zswap: swap cache and zswap support for virtualized swap
  mm: swap: allocate a virtual swap slot for each swapped out page
  swap: implement the swap_cgroup API using virtual swap
  swap: manage swap entry lifetime at the virtual swap layer
  mm: swap: temporarily disable THP swapin and batched freeing swap
  mm: swap: decouple virtual swap slot from backing store
  zswap: do not start zswap shrinker if there is no physical swap slots
  memcg: swap: only charge physical swap slots
  vswap: support THP swapin and batch free_swap_and_cache
  swap: simplify swapoff using virtual swap
  swapfile: move zeromap setup out of enable_swap_info
  swapfile: remove zeromap in virtual swap implementation

 MAINTAINERS                |    7 +
 include/linux/mm_types.h   |    7 +
 include/linux/shmem_fs.h   |    3 +
 include/linux/swap.h       |  263 ++++++-
 include/linux/swap_slots.h |    2 +-
 include/linux/swapops.h    |   37 +
 kernel/power/swap.c        |    6 +-
 mm/Kconfig                 |   25 +
 mm/Makefile                |    3 +
 mm/huge_memory.c           |    5 +-
 mm/internal.h              |   25 +-
 mm/memcontrol.c            |  166 +++--
 mm/memory.c                |  103 ++-
 mm/migrate.c               |    1 +
 mm/page_io.c               |   60 +-
 mm/shmem.c                 |   29 +-
 mm/swap.h                  |   45 +-
 mm/swap_cgroup.c           |   10 +-
 mm/swap_slots.c            |   28 +-
 mm/swap_state.c            |  140 +++-
 mm/swapfile.c              |  831 +++++++++++++--------
 mm/userfaultfd.c           |   11 +-
 mm/vmscan.c                |   26 +-
 mm/vswap.c                 | 1400 ++++++++++++++++++++++++++++++++++++
 mm/zswap.c                 |   80 ++-
 25 files changed, 2799 insertions(+), 514 deletions(-)
 create mode 100644 mm/vswap.c


base-commit: 922ceb9d4bb4dae66c37e24621687e0b4991f5a4
-- 
2.47.1
Re: [RFC PATCH v2 00/18] Virtual Swap Space
Posted by Nhat Pham 7 months, 3 weeks ago
On Tue, Apr 29, 2025 at 4:38 PM Nhat Pham <nphamcs@gmail.com> wrote:
>
> Changelog:
> * v2:
>         * Use a single atomic type (swap_refs) for reference counting
>           purpose. This brings the size of the swap descriptor from 64 KB
>           down to 48 KB (25% reduction). Suggested by Yosry Ahmed.

bytes, not kilobytes. 48KB would be an INSANE overhead :)

Apologies for the brainfart.
Re: [RFC PATCH v2 00/18] Virtual Swap Space
Posted by YoungJun Park 6 months, 3 weeks ago
On Tue, Apr 29, 2025 at 04:38:28PM -0700, Nhat Pham wrote:
> Changelog:
> * v2:
> 	* Use a single atomic type (swap_refs) for reference counting
> 	  purpose. This brings the size of the swap descriptor from 64 KB
> 	  down to 48 KB (25% reduction). Suggested by Yosry Ahmed.
> 	* Zeromap bitmap is removed in the virtual swap implementation.
> 	  This saves one bit per phyiscal swapfile slot.
> 	* Rearrange the patches and the code change to make things more
> 	  reviewable. Suggested by Johannes Weiner.
> 	* Update the cover letter a bit.

Hi Nhat,

Thank you for sharing this patch series.
I’ve read through it with great interest.

I’m part of a kernel team working on features related to multi-tier swapping,
and this patch set appears quite relevant
to our ongoing discussions and early-stage implementation.

I had a couple of questions regarding the future direction.

> * Multi-tier swapping (as mentioned in [5]), with transparent
>   transferring (promotion/demotion) of pages across tiers (see [8] and
>   [9]). Similar to swapoff, with the old design we would need to
>   perform the expensive page table walk.

Based on the discussion in [5], it seems there was some exploration
around enabling per-cgroup selection of multiple tiers.
Do you envision the current design evolving in a similar direction
to those past discussions, or is there a different direction you're aiming for?

>   This idea is very similar to Kairui's work to optimize the (physical)
>   swap allocator. He is currently also working on a swap redesign (see
>   [11]) - perhaps we can combine the two efforts to take advantage of
>   the swap allocator's efficiency for virtual swap.

I noticed that your patch appears to be aligned with the work from Kairui.
It seems like the overall architecture may be headed toward introducing
a virtual swap device layer.
I'm curious if there’s already been any concrete discussion
around this abstraction, especially regarding how it might be layered over
multiple physical swap devices?

From a naive perspective, I imagine that while today’s swap devices
are in a 1:1 mapping with physical devices,
this virtual layer could introduce a 1:N relationship —
one virtual swap device mapped to multiple physical ones.
Would this virtual device behave as a new swappable block device
exposed via `swapon`, or is the plan to abstract it differently?

Thanks again for your work, 
and I would greatly appreciate any insights you could share.

Best regards,  
YoungJun Park
Re: [RFC PATCH v2 00/18] Virtual Swap Space
Posted by Nhat Pham 6 months, 3 weeks ago
On Thu, May 29, 2025 at 11:47 PM YoungJun Park <youngjun.park@lge.com> wrote:
>
> On Tue, Apr 29, 2025 at 04:38:28PM -0700, Nhat Pham wrote:
> > Changelog:
> > * v2:
> >       * Use a single atomic type (swap_refs) for reference counting
> >         purpose. This brings the size of the swap descriptor from 64 KB
> >         down to 48 KB (25% reduction). Suggested by Yosry Ahmed.
> >       * Zeromap bitmap is removed in the virtual swap implementation.
> >         This saves one bit per phyiscal swapfile slot.
> >       * Rearrange the patches and the code change to make things more
> >         reviewable. Suggested by Johannes Weiner.
> >       * Update the cover letter a bit.
>
> Hi Nhat,
>
> Thank you for sharing this patch series.
> I’ve read through it with great interest.
>
> I’m part of a kernel team working on features related to multi-tier swapping,
> and this patch set appears quite relevant
> to our ongoing discussions and early-stage implementation.

May I ask - what's the use case you're thinking of here? Remote swapping?

>
> I had a couple of questions regarding the future direction.
>
> > * Multi-tier swapping (as mentioned in [5]), with transparent
> >   transferring (promotion/demotion) of pages across tiers (see [8] and
> >   [9]). Similar to swapoff, with the old design we would need to
> >   perform the expensive page table walk.
>
> Based on the discussion in [5], it seems there was some exploration
> around enabling per-cgroup selection of multiple tiers.
> Do you envision the current design evolving in a similar direction
> to those past discussions, or is there a different direction you're aiming for?

IIRC, that past design focused on the interface aspect of the problem,
but never actually touched the mechanism to implement a multi-tier
swapping solution.

The simple reason is it's impossible, or at least highly inefficient
to do it in the current design, i.e., without virtualizing swap. Storing
the physical swap location in PTEs means that changing the swap
backend requires a full page table walk to update all the PTEs that
refer to the old physical swap location. So you have to pick your
poison - either:

1. Pick your backend at swap out time, and never change it. You might
not have sufficient information to decide at that time. It prevents
you from adapting to the change in workload dynamics and working set -
the access frequency of pages might change, so their physical location
should change accordingly.

2. Reserve the space in every tier, and associate them with the same
handle. This is kinda what zswap is doing. It is space inefficient, and
creates a lot of operational issues in production.

3. Bite the bullet and perform the page table walk. This is what
swapoff is doing, basically. Raise your hands if you're excited about
a full page table walk every time you want to evict a page from zswap
to disk swap. Booo.

This new design will give us an efficient way to perform tier transfer
- you need to figure out how to obtain the right to perform the
transfer (for now, through the swap cache - but you can perhaps
envision some sort of locks), and then you can simply make the change
at the virtual layer.
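
Conceptually, once you hold that right, the transfer itself is just a
descriptor update (very rough sketch; the vswap_* / VSWAP_* names here
are made up for illustration, not actual APIs from this series):

/*
 * Sketch only: demote a zswap-backed entry to a freshly allocated
 * physical swap slot. The caller already owns the transfer right
 * (e.g. via the swap cache). PTEs are untouched: they keep holding
 * the same virtual swap slot.
 */
static void vswap_demote_to_disk(struct swp_desc *desc, swp_slot_t slot)
{
	write_lock(&desc->lock);
	desc->type = VSWAP_SWAPFILE;	/* was VSWAP_ZSWAP */
	desc->slot = slot;
	write_unlock(&desc->lock);
}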

>
> >   This idea is very similar to Kairui's work to optimize the (physical)
> >   swap allocator. He is currently also working on a swap redesign (see
> >   [11]) - perhaps we can combine the two efforts to take advantage of
> >   the swap allocator's efficiency for virtual swap.
>
> I noticed that your patch appears to be aligned with the work from Kairui.
> It seems like the overall architecture may be headed toward introducing
> a virtual swap device layer.
> I'm curious if there’s already been any concrete discussion
> around this abstraction, especially regarding how it might be layered over
> multiple physical swap devices?
>
> From a naive perspective, I imagine that while today’s swap devices
> are in a 1:1 mapping with physical devices,
> this virtual layer could introduce a 1:N relationship —
> one virtual swap device mapped to multiple physical ones.
> Would this virtual device behave as a new swappable block device
> exposed via `swapon`, or is the plan to abstract it differently?

That was one of the ideas I was thinking of. Problem is this is a very
special "device", and I'm not entirely sure opting in through swapon
like that won't cause issues. Imagine the following scenario:

1. We swap on a normal swapfile.

2. Users swap things with the swapfile.

3. Sysadmin then swapon a virtual swap device.

It will be quite nightmarish to manage things - we need to be extra
vigilant in handling a physical swap slot, for example, since it can
back a PTE or a virtual swap slot. Also, swapoff becomes less efficient
again. And the physical swap allocator, even with the swap table
change, doesn't quite work out of the box for virtual swap yet (see
[1]).

I think it's better to just keep it separate, for now, and adopt
elements from Kairui's work to make virtual swap allocation more
efficient. Not a hill I will die on, though.

[1]: https://lore.kernel.org/linux-mm/CAKEwX=MmD___ukRrx=hLo7d_m1J_uG_Ke+us7RQgFUV2OSg38w@mail.gmail.com/

>
> Thanks again for your work,
> and I would greatly appreciate any insights you could share.
>
> Best regards,
> YoungJun Park
>
Re: [RFC PATCH v2 00/18] Virtual Swap Space
Posted by YoungJun Park 6 months, 2 weeks ago
On Fri, May 30, 2025 at 09:52:42AM -0700, Nhat Pham wrote:
> On Thu, May 29, 2025 at 11:47 PM YoungJun Park <youngjun.park@lge.com> wrote:
> >
> > On Tue, Apr 29, 2025 at 04:38:28PM -0700, Nhat Pham wrote:
> > > Changelog:
> > > * v2:
> > >       * Use a single atomic type (swap_refs) for reference counting
> > >         purpose. This brings the size of the swap descriptor from 64 KB
> > >         down to 48 KB (25% reduction). Suggested by Yosry Ahmed.
> > >       * Zeromap bitmap is removed in the virtual swap implementation.
> > >         This saves one bit per phyiscal swapfile slot.
> > >       * Rearrange the patches and the code change to make things more
> > >         reviewable. Suggested by Johannes Weiner.
> > >       * Update the cover letter a bit.
> >
> > Hi Nhat,
> >
> > Thank you for sharing this patch series.
> > I’ve read through it with great interest.
> >
> > I’m part of a kernel team working on features related to multi-tier swapping,
> > and this patch set appears quite relevant
> > to our ongoing discussions and early-stage implementation.
> 
> May I ask - what's the use case you're thinking of here? Remote swapping?
> 

Yes, that's correct.  
Our usage scenario includes remote swap, 
and we're experimenting with assigning swap tiers per cgroup
in order to improve performance in specific scenarios on our target device.

We’ve explored several approaches and PoCs around this, 
and in the process of evaluating 
whether our direction could eventually be aligned 
with the upstream kernel, 
I came across your patchset and wanted to ask whether 
similar efforts have been discussed or attempted before.

> >
> > I had a couple of questions regarding the future direction.
> >
> > > * Multi-tier swapping (as mentioned in [5]), with transparent
> > >   transferring (promotion/demotion) of pages across tiers (see [8] and
> > >   [9]). Similar to swapoff, with the old design we would need to
> > >   perform the expensive page table walk.
> >
> > Based on the discussion in [5], it seems there was some exploration
> > around enabling per-cgroup selection of multiple tiers.
> > Do you envision the current design evolving in a similar direction
> > to those past discussions, or is there a different direction you're aiming for?
> 
> IIRC, that past design focused on the interface aspect of the problem,
> but never actually touched the mechanism to implement a multi-tier
> swapping solution.
> 
> The simple reason is it's impossible, or at least highly inefficient
> to do it in the current design, i.e without virtualizing swap. Storing

As you pointed out, there are certainly inefficiencies 
in supporting this use case with the current design, 
but if there is a valid use case,
I believe there’s room for it to be supported in the current model
(possibly in a less optimized form)
until a virtual swap device becomes available 
and provides a more efficient solution.
What do you think?

> the physical swap location in PTEs means that changing the swap
> backend requires a full page table walk to update all the PTEs that
> refer to the old physical swap location. So you have to pick your
> poison - either:
> 1. Pick your backend at swap out time, and never change it. You might
> not have sufficient information to decide at that time. It prevents
> you from adapting to the change in workload dynamics and working set -
> the access frequency of pages might change, so their physical location
> should change accordingly.
> 
> 2. Reserve the space in every tier, and associate them with the same
> handle. This is kinda what zswap is doing. It is space efficient, and
> create a lot of operational issues in production.
> 
> 3. Bite the bullet and perform the page table walk. This is what
> swapoff is doing, basically. Raise your hands if you're excited about
> a full page table walk every time you want to evict a page from zswap
> to disk swap. Booo.
> 
> This new design will give us an efficient way to perform tier transfer
> - you need to figure out how to obtain the right to perform the
> transfer (for now, through the swap cache - but you can perhaps
> envision some sort of locks), and then you can simply make the change
> at the virtual layer.
>

One idea that comes to mind is whether the backend swap tier for
a page could be lazily adjusted at runtime, either reactively
or via an explicit interface, after the tier configuration changes.
Alternatively, if it's preferable to leave pages untouched
when the tier configuration changes at runtime, 
perhaps we could consider making this behavior configurable as well. 

> >
> > >   This idea is very similar to Kairui's work to optimize the (physical)
> > >   swap allocator. He is currently also working on a swap redesign (see
> > >   [11]) - perhaps we can combine the two efforts to take advantage of
> > >   the swap allocator's efficiency for virtual swap.
> >
> > I noticed that your patch appears to be aligned with the work from Kairui.
> > It seems like the overall architecture may be headed toward introducing
> > a virtual swap device layer.
> > I'm curious if there’s already been any concrete discussion
> > around this abstraction, especially regarding how it might be layered over
> > multiple physical swap devices?
> >
> > From a naive perspective, I imagine that while today’s swap devices
> > are in a 1:1 mapping with physical devices,
> > this virtual layer could introduce a 1:N relationship —
> > one virtual swap device mapped to multiple physical ones.
> > Would this virtual device behave as a new swappable block device
> > exposed via `swapon`, or is the plan to abstract it differently?
> 
> That was one of the ideas I was thinking of. Problem is this is a very
> special "device", and I'm not entirely sure opting in through swapon
> like that won't cause issues. Imagine the following scenario:
> 
> 1. We swap on a normal swapfile.
> 
> 2. Users swap things with the swapfile.
> 
> 2. Sysadmin then swapon a virtual swap device.
> 
> It will be quite nightmarish to manage things - we need to be extra
> vigilant in handling a physical swap slot for e.g, since it can back a
> PTE or a virtual swap slot. Also, swapoff becomes less efficient
> again. And the physical swap allocator, even with the swap table
> change, doesn't quite work out of the box for virtual swap yet (see
> [1]).
> 
> I think it's better to just keep it separate, for now, and adopt
> elements from Kairui's work to make virtual swap allocation more
> efficient. Not a hill I will die on though,
> 
> [1]: https://lore.kernel.org/linux-mm/CAKEwX=MmD___ukRrx=hLo7d_m1J_uG_Ke+us7RQgFUV2OSg38w@mail.gmail.com/
> 

I also appreciate your thoughts on keeping the virtual 
and physical swap paths separate for now. 
Thanks for sharing your perspective
—it was helpful to understand the design direction.

Best regards,  
YoungJun Park
Re: [RFC PATCH v2 00/18] Virtual Swap Space
Posted by Nhat Pham 6 months, 2 weeks ago
On Sun, Jun 1, 2025 at 5:56 AM YoungJun Park <youngjun.park@lge.com> wrote:
>
> On Fri, May 30, 2025 at 09:52:42AM -0700, Nhat Pham wrote:
> > On Thu, May 29, 2025 at 11:47 PM YoungJun Park <youngjun.park@lge.com> wrote:
> > >
> > > On Tue, Apr 29, 2025 at 04:38:28PM -0700, Nhat Pham wrote:
> > > > Changelog:
> > > > * v2:
> > > >       * Use a single atomic type (swap_refs) for reference counting
> > > >         purpose. This brings the size of the swap descriptor from 64 KB
> > > >         down to 48 KB (25% reduction). Suggested by Yosry Ahmed.
> > > >       * Zeromap bitmap is removed in the virtual swap implementation.
> > > >         This saves one bit per phyiscal swapfile slot.
> > > >       * Rearrange the patches and the code change to make things more
> > > >         reviewable. Suggested by Johannes Weiner.
> > > >       * Update the cover letter a bit.
> > >
> > > Hi Nhat,
> > >
> > > Thank you for sharing this patch series.
> > > I’ve read through it with great interest.
> > >
> > > I’m part of a kernel team working on features related to multi-tier swapping,
> > > and this patch set appears quite relevant
> > > to our ongoing discussions and early-stage implementation.
> >
> > May I ask - what's the use case you're thinking of here? Remote swapping?
> >
>
> Yes, that's correct.
> Our usage scenario includes remote swap,
> and we're experimenting with assigning swap tiers per cgroup
> in order to improve specific scene of our target device performance.

Hmm, that can be a start. Right now, we have only 2 swap tiers
essentially, so memory.(z)swap.max and memory.zswap.writeback is
usually sufficient to describe the tiering interface. But if you have
an alternative use case in mind feel free to send a RFC to explore
this!

>
> We’ve explored several approaches and PoCs around this,
> and in the process of evaluating
> whether our direction could eventually be aligned
> with the upstream kernel,
> I came across your patchset and wanted to ask whether
> similar efforts have been discussed or attempted before.

I think it is occasionally touched upon in discussion, but AFAICS
there has not been really an actual upstream patch to add such an
interface.

>
> > >
> > > I had a couple of questions regarding the future direction.
> > >
> > > > * Multi-tier swapping (as mentioned in [5]), with transparent
> > > >   transferring (promotion/demotion) of pages across tiers (see [8] and
> > > >   [9]). Similar to swapoff, with the old design we would need to
> > > >   perform the expensive page table walk.
> > >
> > > Based on the discussion in [5], it seems there was some exploration
> > > around enabling per-cgroup selection of multiple tiers.
> > > Do you envision the current design evolving in a similar direction
> > > to those past discussions, or is there a different direction you're aiming for?
> >
> > IIRC, that past design focused on the interface aspect of the problem,
> > but never actually touched the mechanism to implement a multi-tier
> > swapping solution.
> >
> > The simple reason is it's impossible, or at least highly inefficient
> > to do it in the current design, i.e without virtualizing swap. Storing
>
> As you pointed out, there are certainly inefficiencies
> in supporting this use case with the current design,
> but if there is a valid use case,
> I believe there’s room for it to be supported in the current model
> —possibly in a less optimized form—
> until a virtual swap device becomes available
> and provides a more efficient solution.
> What do you think about?

Which less optimized form are you thinking of?

>
> > the physical swap location in PTEs means that changing the swap
> > backend requires a full page table walk to update all the PTEs that
> > refer to the old physical swap location. So you have to pick your
> > poison - either:
> > 1. Pick your backend at swap out time, and never change it. You might
> > not have sufficient information to decide at that time. It prevents
> > you from adapting to the change in workload dynamics and working set -
> > the access frequency of pages might change, so their physical location
> > should change accordingly.
> >
> > 2. Reserve the space in every tier, and associate them with the same
> > handle. This is kinda what zswap is doing. It is space efficient, and
> > create a lot of operational issues in production.
> >
> > 3. Bite the bullet and perform the page table walk. This is what
> > swapoff is doing, basically. Raise your hands if you're excited about
> > a full page table walk every time you want to evict a page from zswap
> > to disk swap. Booo.
> >
> > This new design will give us an efficient way to perform tier transfer
> > - you need to figure out how to obtain the right to perform the
> > transfer (for now, through the swap cache - but you can perhaps
> > envision some sort of locks), and then you can simply make the change
> > at the virtual layer.
> >
>
> One idea that comes to mind is whether the backend swap tier for
> a page could be lazily adjusted at runtime—either reactively
> or via an explicit interface—before the tier changes.
> Alternatively, if it's preferable to leave pages untouched
> when the tier configuration changes at runtime,
> perhaps we could consider making this behavior configurable as well.
>

I don't quite understand - could you expand on this?
Re: [RFC PATCH v2 00/18] Virtual Swap Space
Posted by YoungJun Park 6 months, 2 weeks ago
On Sun, Jun 01, 2025 at 02:08:22PM -0700, Nhat Pham wrote:
> On Sun, Jun 1, 2025 at 5:56 AM YoungJun Park <youngjun.park@lge.com> wrote:
> >
> > On Fri, May 30, 2025 at 09:52:42AM -0700, Nhat Pham wrote:
> > > On Thu, May 29, 2025 at 11:47 PM YoungJun Park <youngjun.park@lge.com> wrote:
> > > >
> > > > On Tue, Apr 29, 2025 at 04:38:28PM -0700, Nhat Pham wrote:
> > > > > Changelog:
> > > > > * v2:
> > > > >       * Use a single atomic type (swap_refs) for reference counting
> > > > >         purpose. This brings the size of the swap descriptor from 64 KB
> > > > >         down to 48 KB (25% reduction). Suggested by Yosry Ahmed.
> > > > >       * Zeromap bitmap is removed in the virtual swap implementation.
> > > > >         This saves one bit per phyiscal swapfile slot.
> > > > >       * Rearrange the patches and the code change to make things more
> > > > >         reviewable. Suggested by Johannes Weiner.
> > > > >       * Update the cover letter a bit.
> > > >
> > > > Hi Nhat,
> > > >
> > > > Thank you for sharing this patch series.
> > > > I’ve read through it with great interest.
> > > >
> > > > I’m part of a kernel team working on features related to multi-tier swapping,
> > > > and this patch set appears quite relevant
> > > > to our ongoing discussions and early-stage implementation.
> > >
> > > May I ask - what's the use case you're thinking of here? Remote swapping?
> > >
> >
> > Yes, that's correct.
> > Our usage scenario includes remote swap,
> > and we're experimenting with assigning swap tiers per cgroup
> > in order to improve specific scene of our target device performance.
> 
> Hmm, that can be a start. Right now, we have only 2 swap tiers
> essentially, so memory.(z)swap.max and memory.zswap.writeback is
> usually sufficient to describe the tiering interface. But if you have
> an alternative use case in mind feel free to send a RFC to explore
> this!
>

Yes, sounds good.
I've organized the details of our swap tiering approach 
including the specific use case we are trying to solve.
This approach is based on leveraging 
the existing priority mechanism in the swap subsystem.
I’ll be sharing it as an RFC shortly.
 
> >
> > We’ve explored several approaches and PoCs around this,
> > and in the process of evaluating
> > whether our direction could eventually be aligned
> > with the upstream kernel,
> > I came across your patchset and wanted to ask whether
> > similar efforts have been discussed or attempted before.
> 
> I think it is occasionally touched upon in discussion, but AFAICS
> there has not been really an actual upstream patch to add such an
> interface.
> 
> >
> > > >
> > > > I had a couple of questions regarding the future direction.
> > > >
> > > > > * Multi-tier swapping (as mentioned in [5]), with transparent
> > > > >   transferring (promotion/demotion) of pages across tiers (see [8] and
> > > > >   [9]). Similar to swapoff, with the old design we would need to
> > > > >   perform the expensive page table walk.
> > > >
> > > > Based on the discussion in [5], it seems there was some exploration
> > > > around enabling per-cgroup selection of multiple tiers.
> > > > Do you envision the current design evolving in a similar direction
> > > > to those past discussions, or is there a different direction you're aiming for?
> > >
> > > IIRC, that past design focused on the interface aspect of the problem,
> > > but never actually touched the mechanism to implement a multi-tier
> > > swapping solution.
> > >
> > > The simple reason is it's impossible, or at least highly inefficient
> > > to do it in the current design, i.e without virtualizing swap. Storing
> >
> > As you pointed out, there are certainly inefficiencies
> > in supporting this use case with the current design,
> > but if there is a valid use case,
> > I believe there’s room for it to be supported in the current model
> > —possibly in a less optimized form—
> > until a virtual swap device becomes available
> > and provides a more efficient solution.
> > What do you think about?
> 
> Which less optimized form are you thinking of?
>

I was just saying that the current swap design would be less optimized
regardless of the form of tiering applied,
not meaning that my approach is less optimized.
That may have come across differently than I intended.
Please feel free to disregard that assumption;
I believe it would be more appropriate 
to evaluate this based on the RFC I plan to share soon.
 
> >
> > > the physical swap location in PTEs means that changing the swap
> > > backend requires a full page table walk to update all the PTEs that
> > > refer to the old physical swap location. So you have to pick your
> > > poison - either:
> > > 1. Pick your backend at swap out time, and never change it. You might
> > > not have sufficient information to decide at that time. It prevents
> > > you from adapting to the change in workload dynamics and working set -
> > > the access frequency of pages might change, so their physical location
> > > should change accordingly.
> > >
> > > 2. Reserve the space in every tier, and associate them with the same
> > > handle. This is kinda what zswap is doing. It is space efficient, and
> > > create a lot of operational issues in production.
> > >
> > > 3. Bite the bullet and perform the page table walk. This is what
> > > swapoff is doing, basically. Raise your hands if you're excited about
> > > a full page table walk every time you want to evict a page from zswap
> > > to disk swap. Booo.
> > >
> > > This new design will give us an efficient way to perform tier transfer
> > > - you need to figure out how to obtain the right to perform the
> > > transfer (for now, through the swap cache - but you can perhaps
> > > envision some sort of locks), and then you can simply make the change
> > > at the virtual layer.
> > >
> >
> > One idea that comes to mind is whether the backend swap tier for
> > a page could be lazily adjusted at runtime—either reactively
> > or via an explicit interface—before the tier changes.
> > Alternatively, if it's preferable to leave pages untouched
> > when the tier configuration changes at runtime,
> > perhaps we could consider making this behavior configurable as well.
> >
> 
> I don't quite understand - could you expand on this?
>

Regarding your point, 
my understanding was that you were referring
to an immediate migration once a new swap tier is selected at runtime. 
I was suggesting that a lazy migration approach,
or even skipping migration altogether, might
be worth considering as alternatives.
I only mentioned it because, from our use case perspective, 
immediate migration is not strictly necessary.

Best regards,
YoungJun Park
Re: [RFC PATCH v2 00/18] Virtual Swap Space
Posted by Kairui Song 6 months, 2 weeks ago
On Sun, Jun 1, 2025 at 8:56 PM YoungJun Park <youngjun.park@lge.com> wrote:
>
> On Fri, May 30, 2025 at 09:52:42AM -0700, Nhat Pham wrote:
> > On Thu, May 29, 2025 at 11:47 PM YoungJun Park <youngjun.park@lge.com> wrote:
> > >
> > > On Tue, Apr 29, 2025 at 04:38:28PM -0700, Nhat Pham wrote:
> > > > Changelog:
> > > > * v2:
> > > >       * Use a single atomic type (swap_refs) for reference counting
> > > >         purpose. This brings the size of the swap descriptor from 64 KB
> > > >         down to 48 KB (25% reduction). Suggested by Yosry Ahmed.
> > > >       * Zeromap bitmap is removed in the virtual swap implementation.
> > > >         This saves one bit per phyiscal swapfile slot.
> > > >       * Rearrange the patches and the code change to make things more
> > > >         reviewable. Suggested by Johannes Weiner.
> > > >       * Update the cover letter a bit.
> > >
> > > Hi Nhat,
> > >
> > > Thank you for sharing this patch series.
> > > I’ve read through it with great interest.
> > >
> > > I’m part of a kernel team working on features related to multi-tier swapping,
> > > and this patch set appears quite relevant
> > > to our ongoing discussions and early-stage implementation.
> >
> > May I ask - what's the use case you're thinking of here? Remote swapping?
> >
>
> Yes, that's correct.
> Our usage scenario includes remote swap,
> and we're experimenting with assigning swap tiers per cgroup
> in order to improve specific scene of our target device performance.
>
> We’ve explored several approaches and PoCs around this,
> and in the process of evaluating
> whether our direction could eventually be aligned
> with the upstream kernel,
> I came across your patchset and wanted to ask whether
> similar efforts have been discussed or attempted before.
>
> > >
> > > I had a couple of questions regarding the future direction.
> > >
> > > > * Multi-tier swapping (as mentioned in [5]), with transparent
> > > >   transferring (promotion/demotion) of pages across tiers (see [8] and
> > > >   [9]). Similar to swapoff, with the old design we would need to
> > > >   perform the expensive page table walk.
> > >
> > > Based on the discussion in [5], it seems there was some exploration
> > > around enabling per-cgroup selection of multiple tiers.
> > > Do you envision the current design evolving in a similar direction
> > > to those past discussions, or is there a different direction you're aiming for?
> >
> > IIRC, that past design focused on the interface aspect of the problem,
> > but never actually touched the mechanism to implement a multi-tier
> > swapping solution.
> >
> > The simple reason is it's impossible, or at least highly inefficient
> > to do it in the current design, i.e without virtualizing swap. Storing
>
> As you pointed out, there are certainly inefficiencies
> in supporting this use case with the current design,
> but if there is a valid use case,
> I believe there’s room for it to be supported in the current model
> —possibly in a less optimized form—
> until a virtual swap device becomes available
> and provides a more efficient solution.
> What do you think about?

Hi All,

I'd like to share some info from my side. Currently we have an
internal solution for multi-tier swap, implemented based on ZRAM and
writeback: four compression levels and multiple block-layer levels. The
ZRAM table serves a similar role to the swap table in the "swap table
series" or the virtual layer here.

We hacked the BIO layer to let ZRAM be Cgroup aware, so it even
supports per-cgroup priority, and per-cgroup writeback control, and it
worked perfectly fine in production.

The interface looks something like this:
/sys/fs/cgroup/cg1/zram.prio: [1-4]
/sys/fs/cgroup/cg1/zram.writeback_prio [1-4]
/sys/fs/cgroup/cg1/zram.writeback_size [0 - 4K]

It's really nothing fancy or complex: the four priorities are simply the
four ZRAM compression streams that are already upstream, and you can
simply hardcode four *bdev pointers in "struct zram" and reuse the bits,
then chain the write bio with a new underlying bio... Getting the
priority info of a cgroup is even simpler once ZRAM is cgroup aware.

All interfaces can be adjusted dynamically at any time (e.g. by an
agent), and already swapped out pages won't be touched. The block
devices are specified in ZRAM's sys files during swapon.

It's easy to implement, but not a good idea for upstream at all:
redundant layers, and performance is bad (if not optimized):
- it breaks SYNCHRONIZE_IO, causing a huge slowdown, so we removed the
SYNCHRONIZE_IO completely which actually improved the performance in
every aspect (I've been trying to upstream this for a while);
- ZRAM's block device allocator is just not good (just a bitmap) so we
want to use the SWAP allocator directly (which I'm also trying to
upstream with the swap table series);
- And many other bits and pieces like bio batching are kind of broken,
busy loop due to the ZRAM_WB bit, etc...
- Lacking support for things like effective migration/compaction,
doable but looks horrible.

So I definitely don't like this band-aid solution, but hey, it works.
I'm looking forward to replacing it with native upstream support.
That's one of the motivations behind the swap table series, which
I think would resolve the problems in an elegant and clean way
upstream. The initial tests do show it has a much lower overhead
and cleans up SWAP.

But maybe this is kind of similar to the "less optimized form" you
are talking about? As I mentioned I'm already trying to upstream
some nice parts of it, and hopefully replace it with an upstream
solution finally.

I can try to upstream other parts of it if people are really
interested, but I strongly recommend that we focus on the
right approach instead and not waste time on that or spam the
mailing list.

I have no special preference on what the final upstream interface
should look like. But currently SWAP devices already have priorities,
so maybe we should just make use of that.
Re: [RFC PATCH v2 00/18] Virtual Swap Space
Posted by Nhat Pham 6 months, 2 weeks ago
On Sun, Jun 1, 2025 at 9:15 AM Kairui Song <ryncsn@gmail.com> wrote:
>
>
> Hi All,

Thanks for sharing your setup, Kairui! I've always been curious about
multi-tier compression swapping.

>
> I'd like to share some info from my side. Currently we have an
> internal solution for multi tier swap, implemented based on ZRAM and
> writeback: 4 compression level and multiple block layer level. The
> ZRAM table serves a similar role to the swap table in the "swap table
> series" or the virtual layer here.
>
> We hacked the BIO layer to let ZRAM be Cgroup aware, so it even

Hmmm this part seems a bit hacky to me too :-?

> supports per-cgroup priority, and per-cgroup writeback control, and it
> worked perfectly fine in production.
>
> The interface looks something like this:
> /sys/fs/cgroup/cg1/zram.prio: [1-4]
> /sys/fs/cgroup/cg1/zram.writeback_prio [1-4]
> /sys/fs/cgroup/cg1/zram.writeback_size [0 - 4K]

How do you do aging with multiple tiers like this? Or do you just rely
on time thresholds, and have userspace invoke writeback in a cron-job
style?

Tbh, I'm surprised that we see performance win with recompression. I
understand that different workloads might benefit the most from
different points in the Pareto frontier of latency-memory saving:
latency-sensitive workloads might like a fast compression algorithm,
whereas other workloads might prefer a compression algorithm that
saves more memory. So a per-cgroup compressor selection can make
sense.

However, would the overhead of moving a page from one tier to the
other not eat up all the benefit from the (usually small) extra memory
savings?

>
> It's really nothing fancy and complex, the four priority is simply the
> four ZRAM compression streams that's already in upstream, and you can
> simply hardcode four *bdev in "struct zram" and reuse the bits, then
> chain the write bio with new underlayer bio... Getting the priority
> info of a cgroup is even simpler once ZRAM is cgroup aware.
>
> All interfaces can be adjusted dynamically at any time (e.g. by an
> agent), and already swapped out pages won't be touched. The block
> devices are specified in ZRAM's sys files during swapon.
>
> It's easy to implement, but not a good idea for upstream at all:
> redundant layers, and performance is bad (if not optimized):
> - it breaks SYNCHRONIZE_IO, causing a huge slowdown, so we removed the
> SYNCHRONIZE_IO completely which actually improved the performance in
> every aspect (I've been trying to upstream this for a while);
> - ZRAM's block device allocator is just not good (just a bitmap) so we
> want to use the SWAP allocator directly (which I'm also trying to
> upstream with the swap table series);
> - And many other bits and pieces like bio batching are kind of broken,

Interesting, is zram doing writeback batching?

> busy loop due to the ZRAM_WB bit, etc...

Hmmm, this sounds like something swap cache can help with. It's the
approach zswap writeback is taking - concurrent accessors can get the
page in the swap cache, and OTOH zswap writeback backs off if it
detects swap cache contention (since the page is probably being
swapped in, freed, or written back by another thread).

But I'm not sure how zram writeback works...

> - Lacking support for things like effective migration/compaction,
> doable but looks horrible.
>
> So I definitely don't like this band-aid solution, but hey, it works.
> I'm looking forward to replacing it with native upstream support.
> That's one of the motivations behind the swap table series, which
> I think it would resolve the problems in an elegant and clean way
> upstreamly. The initial tests do show it has a much lower overhead
> and cleans up SWAP.
>
> But maybe this is kind of similar to the "less optimized form" you
> are talking about? As I mentioned I'm already trying to upstream
> some nice parts of it, and hopefully replace it with an upstream
> solution finally.
>
> I can try upstream other parts of it if there are people really
> interested, but I strongly recommend that we should focus on the
> right approach instead and not waste time on that and spam the
> mail list.

I suppose a lot of this is specific to zram, but bits and pieces of it
sound upstreamable to me :)

We can wait for YoungJun's patches/RFC for further discussion, but perhaps:

1. A new cgroup interface to select swap backends for a cgroup.

2. Writeback/fallback order either designated by the above interface,
or by the priority of the swap backends.


>
> I have no special preference on how the final upstream interface
> should look like. But currently SWAP devices already have priorities,
> so maybe we should just make use of that.
Re: [RFC PATCH v2 00/18] Virtual Swap Space
Posted by Kairui Song 6 months, 2 weeks ago
On Tue, Jun 3, 2025 at 2:30 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Sun, Jun 1, 2025 at 9:15 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> >
> > Hi All,
>
> Thanks for sharing your setup, Kairui! I've always been curious about
> multi-tier compression swapping.
>
> >
> > I'd like to share some info from my side. Currently we have an
> > internal solution for multi tier swap, implemented based on ZRAM and
> > writeback: 4 compression level and multiple block layer level. The
> > ZRAM table serves a similar role to the swap table in the "swap table
> > series" or the virtual layer here.
> >
> > We hacked the BIO layer to let ZRAM be Cgroup aware, so it even
>
> Hmmm this part seems a bit hacky to me too :-?

Yeah, terribly hackish :P

One of the reasons why I'm trying to retire it.

>
> > supports per-cgroup priority, and per-cgroup writeback control, and it
> > worked perfectly fine in production.
> >
> > The interface looks something like this:
> > /sys/fs/cgroup/cg1/zram.prio: [1-4]
> > /sys/fs/cgroup/cg1/zram.writeback_prio [1-4]
> > /sys/fs/cgroup/cg1/zram.writeback_size [0 - 4K]
>
> How do you do aging with multiple tiers like this? Or do you just rely
> on time thresholds, and have userspace invokes writeback in a cron
> job-style?

ZRAM already has a time threshold, and I added another LRU for swapped
out entries; aging is supposed to be done by userspace agents. I
didn't mention it here as things are becoming more irrelevant to the
upstream implementation.

> Tbh, I'm surprised that we see performance win with recompression. I
> understand that different workloads might benefit the most from
> different points in the Pareto frontier of latency-memory saving:
> latency-sensitive workloads might like a fast compression algorithm,
> whereas other workloads might prefer a compression algorithm that
> saves more memory. So a per-cgroup compressor selection can make
> sense.
>
> However, would the overhead of moving a page from one tier to the
> other not eat up all the benefit from the (usually small) extra memory
> savings?

So far we are not re-compressing things, but a per-cgroup compression /
writeback level is useful indeed. Compressed memory gets written back
to the block device; that's a large gain.

> > It's really nothing fancy and complex, the four priority is simply the
> > four ZRAM compression streams that's already in upstream, and you can
> > simply hardcode four *bdev in "struct zram" and reuse the bits, then
> > chain the write bio with new underlayer bio... Getting the priority
> > info of a cgroup is even simpler once ZRAM is cgroup aware.
> >
> > All interfaces can be adjusted dynamically at any time (e.g. by an
> > agent), and already swapped out pages won't be touched. The block
> > devices are specified in ZRAM's sys files during swapon.
> >
> > It's easy to implement, but not a good idea for upstream at all:
> > redundant layers, and performance is bad (if not optimized):
> > - it breaks SYNCHRONIZE_IO, causing a huge slowdown, so we removed the
> > SYNCHRONIZE_IO completely which actually improved the performance in
> > every aspect (I've been trying to upstream this for a while);
> > - ZRAM's block device allocator is just not good (just a bitmap) so we
> > want to use the SWAP allocator directly (which I'm also trying to
> > upstream with the swap table series);
> > - And many other bits and pieces like bio batching are kind of broken,
>
> Interesting, is zram doing writeback batching?

Nope, it even has a comment saying "XXX: A single page IO would be
inefficient for write". We managed to chain bios on the initial page
writeback, but it's still not an ideal design.
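
For readers unfamiliar with the pattern being referred to, here is a
minimal, hypothetical sketch of the generic bio-chaining idiom (a common
block-layer pattern, not zram's actual writeback code; the helper name
and parameters are made up):

#include <linux/bio.h>
#include <linux/blkdev.h>

/*
 * Hypothetical helper: submit nr page-sized writes as one chained batch.
 * Each filled bio is chained to the next one, so the final bio completes
 * only after every earlier bio in the chain has completed.
 */
static int wb_write_pages(struct block_device *bdev, struct page **pages,
			  sector_t *sectors, int nr)
{
	struct bio *bio = NULL, *prev;
	int i, ret;

	if (!nr)
		return 0;

	for (i = 0; i < nr; i++) {
		prev = bio;
		bio = bio_alloc(bdev, 1, REQ_OP_WRITE, GFP_KERNEL);
		bio->bi_iter.bi_sector = sectors[i];
		__bio_add_page(bio, pages[i], PAGE_SIZE, 0);
		if (prev) {
			/* prev now completes into bio; send it on its way */
			bio_chain(prev, bio);
			submit_bio(prev);
		}
	}

	/* Waiting on the last bio waits for the whole chain. */
	ret = submit_bio_wait(bio);
	bio_put(bio);
	return ret;
}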

> > busy loop due to the ZRAM_WB bit, etc...
>
> Hmmm, this sounds like something swap cache can help with. It's the
> approach zswap writeback is taking - concurrent accessors can get the
> page in the swap cache, and OTOH zswap writeback backs off if it
> detects swap cache contention (since the page is probably being
> swapped in, freed, or written back by another thread).
>
> But I'm not sure how zram writeback works...

Yeah, any bit-lock design suffers a similar problem (like
SWAP_HAS_CACHE). I think we should just use the folio lock or folio
writeback flag in the long term; it works extremely well as generic
infrastructure (which I'm trying to push upstream), needs no extra
locking, and minimizes memory / design overhead.
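
A rough sketch of the folio-based pattern being suggested (generic mm API
only; an illustration rather than code from this series or the swap table
series):

#include <linux/pagemap.h>

/*
 * Instead of spinning on a backend-private bit (ZRAM_WB, SWAP_HAS_CACHE,
 * ...), a racing path can simply sleep on the generic folio lock and
 * writeback infrastructure.
 */
static void wait_for_stable_folio(struct folio *folio)
{
	folio_lock(folio);		/* serialize against other owners */
	folio_wait_writeback(folio);	/* sleep until writeback finishes */
	folio_unlock(folio);
}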

> > - Lacking support for things like effective migration/compaction,
> > doable but looks horrible.
> >
> > So I definitely don't like this band-aid solution, but hey, it works.
> > I'm looking forward to replacing it with native upstream support.
> > That's one of the motivations behind the swap table series, which
> > I think it would resolve the problems in an elegant and clean way
> > upstreamly. The initial tests do show it has a much lower overhead
> > and cleans up SWAP.
> >
> > But maybe this is kind of similar to the "less optimized form" you
> > are talking about? As I mentioned I'm already trying to upstream
> > some nice parts of it, and hopefully replace it with an upstream
> > solution finally.
> >
> > I can try upstream other parts of it if there are people really
> > interested, but I strongly recommend that we should focus on the
> > right approach instead and not waste time on that and spam the
> > mail list.
>
> I suppose a lot of this is specific to zram, but bits and pieces of it
> sound upstreamable to me :)
>
> We can wait for YoungJun's patches/RFC for further discussion, but perhaps:
>
> 1. A new cgroup interface to select swap backends for a cgroup.
>
> 2. Writeback/fallback order either designated by the above interface,
> or by the priority of the swap backends.

Fully agree, the final interface and features definitely need more
discussion and collaboration upstream...
Re: [RFC PATCH v2 00/18] Virtual Swap Space
Posted by YoungJun Park 6 months, 2 weeks ago
On Mon, Jun 02, 2025 at 12:14:53AM +0800, Kairui Song wrote:
> On Sun, Jun 1, 2025 at 8:56 PM YoungJun Park <youngjun.park@lge.com> wrote:
> >
> > On Fri, May 30, 2025 at 09:52:42AM -0700, Nhat Pham wrote:
> > > On Thu, May 29, 2025 at 11:47 PM YoungJun Park <youngjun.park@lge.com> wrote:
> > > >
> > > > On Tue, Apr 29, 2025 at 04:38:28PM -0700, Nhat Pham wrote:
> > > > > Changelog:
> > > > > * v2:
> > > > >       * Use a single atomic type (swap_refs) for reference counting
> > > > >         purpose. This brings the size of the swap descriptor from 64 KB
> > > > >         down to 48 KB (25% reduction). Suggested by Yosry Ahmed.
> > > > >       * Zeromap bitmap is removed in the virtual swap implementation.
> > > > >         This saves one bit per phyiscal swapfile slot.
> > > > >       * Rearrange the patches and the code change to make things more
> > > > >         reviewable. Suggested by Johannes Weiner.
> > > > >       * Update the cover letter a bit.
> > > >
> > > > Hi Nhat,
> > > >
> > > > Thank you for sharing this patch series.
> > > > I’ve read through it with great interest.
> > > >
> > > > I’m part of a kernel team working on features related to multi-tier swapping,
> > > > and this patch set appears quite relevant
> > > > to our ongoing discussions and early-stage implementation.
> > >
> > > May I ask - what's the use case you're thinking of here? Remote swapping?
> > >
> >
> > Yes, that's correct.
> > Our usage scenario includes remote swap,
> > and we're experimenting with assigning swap tiers per cgroup
> > in order to improve specific scene of our target device performance.
> >
> > We’ve explored several approaches and PoCs around this,
> > and in the process of evaluating
> > whether our direction could eventually be aligned
> > with the upstream kernel,
> > I came across your patchset and wanted to ask whether
> > similar efforts have been discussed or attempted before.
> >
> > > >
> > > > I had a couple of questions regarding the future direction.
> > > >
> > > > > * Multi-tier swapping (as mentioned in [5]), with transparent
> > > > >   transferring (promotion/demotion) of pages across tiers (see [8] and
> > > > >   [9]). Similar to swapoff, with the old design we would need to
> > > > >   perform the expensive page table walk.
> > > >
> > > > Based on the discussion in [5], it seems there was some exploration
> > > > around enabling per-cgroup selection of multiple tiers.
> > > > Do you envision the current design evolving in a similar direction
> > > > to those past discussions, or is there a different direction you're aiming for?
> > >
> > > IIRC, that past design focused on the interface aspect of the problem,
> > > but never actually touched the mechanism to implement a multi-tier
> > > swapping solution.
> > >
> > > The simple reason is it's impossible, or at least highly inefficient
> > > to do it in the current design, i.e without virtualizing swap. Storing
> >
> > As you pointed out, there are certainly inefficiencies
> > in supporting this use case with the current design,
> > but if there is a valid use case,
> > I believe there’s room for it to be supported in the current model
> > —possibly in a less optimized form—
> > until a virtual swap device becomes available
> > and provides a more efficient solution.
> > What do you think about?
> 
> Hi All,
> 
> I'd like to share some info from my side. Currently we have an
> internal solution for multi tier swap, implemented based on ZRAM and
> writeback: 4 compression level and multiple block layer level. The
> ZRAM table serves a similar role to the swap table in the "swap table
> series" or the virtual layer here.
> 
> We hacked the BIO layer to let ZRAM be Cgroup aware, so it even
> supports per-cgroup priority, and per-cgroup writeback control, and it
> worked perfectly fine in production.
> 
> The interface looks something like this:
> /sys/fs/cgroup/cg1/zram.prio: [1-4]
> /sys/fs/cgroup/cg1/zram.writeback_prio [1-4]
> /sys/fs/cgroup/cg1/zram.writeback_size [0 - 4K]
> 
> It's really nothing fancy and complex, the four priority is simply the
> four ZRAM compression streams that's already in upstream, and you can
> simply hardcode four *bdev in "struct zram" and reuse the bits, then
> chain the write bio with new underlayer bio... Getting the priority
> info of a cgroup is even simpler once ZRAM is cgroup aware.
> 
> All interfaces can be adjusted dynamically at any time (e.g. by an
> agent), and already swapped out pages won't be touched. The block
> devices are specified in ZRAM's sys files during swapon.
> 
> It's easy to implement, but not a good idea for upstream at all:
> redundant layers, and performance is bad (if not optimized):
> - it breaks SYNCHRONIZE_IO, causing a huge slowdown, so we removed the
> SYNCHRONIZE_IO completely which actually improved the performance in
> every aspect (I've been trying to upstream this for a while);
> - ZRAM's block device allocator is just not good (just a bitmap) so we
> want to use the SWAP allocator directly (which I'm also trying to
> upstream with the swap table series);
> - And many other bits and pieces like bio batching are kind of broken,
> busy loop due to the ZRAM_WB bit, etc...
> - Lacking support for things like effective migration/compaction,
> doable but looks horrible.
> 

That's interesting — we've explored a similar idea as well, 
although not by attaching it to ZRAM.
Instead, our concept involved creating a separate block device 
capable of performing the tiering functionality, and using it as follows:

1. Prepare a block device that can manage multiple backend block devices.
2. Perform swapon on this block device.
3. Within the block device, use cgroup awareness 
to carry out tiered swap operations across the prepared backend devices.

However, we ended up postponing this approach as a second-tier option, mainly 
due to the following concerns:

1. The idea of allocating physical slots but managing them internally 
as logical slots felt inefficient.
2. Embedding cgroup awareness within a block device 
seemed like a layer violation.

> So I definitely don't like this band-aid solution, but hey, it works.
> I'm looking forward to replacing it with native upstream support.
> That's one of the motivations behind the swap table series, which
> I think it would resolve the problems in an elegant and clean way
> upstreamly. The initial tests do show it has a much lower overhead
> and cleans up SWAP.
> But maybe this is kind of similar to the "less optimized form" you
> are talking about? As I mentioned I'm already trying to upstream
> some nice parts of it, and hopefully replace it with an upstream
> solution finally.
> 
> I can try upstream other parts of it if there are people really
> interested, but I strongly recommend that we should focus on the
> right approach instead and not waste time on that and spam the
> mail list.

I am in agreement with the points you’ve made.
 
> I have no special preference on how the final upstream interface
> should look like. But currently SWAP devices already have priorities,
> so maybe we should just make use of that.

I have been exploring an interface design 
that leverages the existing swap priority mechanism,
and I believe it would be valuable 
to share this for further discussion and feedback.
As mentioned in my earlier response to Nhat,
I intend to submit this as an RFC to solicit broader input from the community. 

Best regards,
YoungJun Park
Re: [RFC PATCH v2 00/18] Virtual Swap Space
Posted by Nhat Pham 6 months, 3 weeks ago
On Fri, May 30, 2025 at 9:52 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Thu, May 29, 2025 at 11:47 PM YoungJun Park <youngjun.park@lge.com> wrote:
> >
> > On Tue, Apr 29, 2025 at 04:38:28PM -0700, Nhat Pham wrote:
> > > Changelog:
> > > * v2:
> > >       * Use a single atomic type (swap_refs) for reference counting
> > >         purpose. This brings the size of the swap descriptor from 64 KB
> > >         down to 48 KB (25% reduction). Suggested by Yosry Ahmed.
> > >       * Zeromap bitmap is removed in the virtual swap implementation.
> > >         This saves one bit per phyiscal swapfile slot.
> > >       * Rearrange the patches and the code change to make things more
> > >         reviewable. Suggested by Johannes Weiner.
> > >       * Update the cover letter a bit.
> >
> > Hi Nhat,
> >
> > Thank you for sharing this patch series.
> > I’ve read through it with great interest.
> >
> > I’m part of a kernel team working on features related to multi-tier swapping,
> > and this patch set appears quite relevant
> > to our ongoing discussions and early-stage implementation.
>
> May I ask - what's the use case you're thinking of here? Remote swapping?
>
> >
> > I had a couple of questions regarding the future direction.
> >
> > > * Multi-tier swapping (as mentioned in [5]), with transparent
> > >   transferring (promotion/demotion) of pages across tiers (see [8] and
> > >   [9]). Similar to swapoff, with the old design we would need to
> > >   perform the expensive page table walk.
> >
> > Based on the discussion in [5], it seems there was some exploration
> > around enabling per-cgroup selection of multiple tiers.
> > Do you envision the current design evolving in a similar direction
> > to those past discussions, or is there a different direction you're aiming for?

To be extra clear, I don't have an issue with a cgroup-based interface
for swap tiering like that.

I think the only objection at the time was that we did not really have a
use case in mind?

>
> IIRC, that past design focused on the interface aspect of the problem,
> but never actually touched the mechanism to implement a multi-tier
> swapping solution.
>
> The simple reason is it's impossible, or at least highly inefficient
> to do it in the current design, i.e without virtualizing swap. Storing
> the physical swap location in PTEs means that changing the swap
> backend requires a full page table walk to update all the PTEs that
> refer to the old physical swap location. So you have to pick your
> poison - either:
>
> 1. Pick your backend at swap out time, and never change it. You might
> not have sufficient information to decide at that time. It prevents
> you from adapting to the change in workload dynamics and working set -
> the access frequency of pages might change, so their physical location
> should change accordingly.
>
> 2. Reserve the space in every tier, and associate them with the same
> handle. This is kinda what zswap is doing. It is space efficient, and
> create a lot of operational issues in production.

s/efficient/inefficient

>