[RFC PATCH 00/14] Virtual Swap Space
Posted by Nhat Pham 3 weeks, 4 days ago
This RFC implements the virtual swap space idea, based on Yosry's
proposals at LSFMMBPF 2023 (see [1], [2], [3]), as well as valuable
inputs from Johannes Weiner. The same idea (with different
implementation details) has been floated by Rik van Riel since at least
2011 (see [8]).

The code attached to this RFC is purely a prototype. It is not 100%
merge-ready (see section VI for future work). I do, however, want to show
people this prototype/RFC, including all the bells and whistles and a
couple of actual use cases, so that folks can see what the end results
will look like, and give me early feedback :)

I. Motivation

Currently, when an anon page is swapped out, a slot in a backing swap
device is allocated and stored in the page table entries that refer to
the original page. This slot is also used as the "key" to find the
swapped out content, as well as the index into swap data structures,
such as the swap cache or the swap cgroup mapping. Tying a swap entry to its
backing slot in this way is performant and efficient when swap is
purely disk space, and swapoff is rare.

However, the advent of many swap optimizations has exposed major
drawbacks of this design. The first problem is that we occupy a physical
slot in the swap space, even for pages that are NEVER expected to hit
the disk: pages compressed and stored in the zswap pool, zero-filled
pages, or pages rejected by both of these optimizations when zswap
writeback is disabled. This is arguably the central shortcoming of
zswap:
* In deployments where no disk space can be afforded for swap (such as
  mobile and embedded devices), users cannot adopt zswap, and are forced
  to use zram. This is confusing for users, and creates extra burdens
  for developers, having to develop and maintain similar features for
  two separate swap backends (writeback, cgroup charging, THP support,
  etc.). For instance, see the discussion in [4].
* Resource-wise, it is hugely wasteful in terms of disk usage, and
  limits the memory saving potential of these optimizations by the
  static size of the swapfile, especially in high memory systems that
  can have up to terabytes worth of memory. It also creates significant
  challenges for users who rely on swap utilization as an early OOM
  signal.

Another motivation for a swap redesign is to simplify swapoff, which
is complicated and expensive in the current design. The tight coupling
between a swap entry and its backing storage means that swapoff
requires a full page table walk to update all the page table entries
that refer to each swap entry, as well as updates to all the
associated swap data structures (swap cache, etc.).


II. High Level Design Overview

To fix the aforementioned issues, we need an abstraction that separates
a swap entry from its physical backing storage. IOW, we need to
“virtualize” the swap space: swap clients will work with a dynamically
allocated virtual swap slot, storing it in page table entries, and
using it to index into various swap-related data structures. The
backing storage is decoupled from the virtual swap slot, and the newly
introduced layer will “resolve” the virtual swap slot to the actual
storage. This layer also manages other metadata of the swap entry, such
as its lifetime information (swap count), via a dynamically allocated
per-swap-entry descriptor:

struct swp_desc {
	swp_entry_t vswap;
	union {
		swp_slot_t slot;
		struct folio *folio;
		struct zswap_entry *zswap_entry;
	};
	struct rcu_head rcu;

	rwlock_t lock;
	enum swap_type type;

	atomic_t memcgid;

	atomic_t in_swapcache;
	struct kref refcnt;
	atomic_t swap_count;
};

This design allows us to:
* Decouple zswap (and zeromapped swap entry) from backing swapfile:
  simply associate the virtual swap slot with one of the supported
  backends: a zswap entry, a zero-filled swap page, a slot on the
  swapfile, or an in-memory page.
* Simplify and optimize swapoff: we only have to fault the page in and
  have the virtual swap slot point to the page instead of the on-disk
  physical swap slot. No need to perform any page table walking.

Please see the attached patches for implementation details.

Note that I have not removed the old implementation for now. Users can
select between the old and the new implementation via the
CONFIG_VIRTUAL_SWAP build config. This will also allow us to land the
new design and iteratively optimize it (without having to include
everything in an even more massive patch series).

III. Future Use Cases

Other than decoupling swap backends and optimizing swapoff, this new
design allows us to implement the following more easily and
efficiently:

* Multi-tier swapping (as mentioned in [5]), with transparent
  transferring (promotion/demotion) of pages across tiers (see [8] and
  [9]). Similar to swapoff, with the old design we would need to
  perform the expensive page table walk.
* Swapfile compaction to alleviate fragmentation (as proposed by Ying
  Huang in [6]).
* Mixed backing THP swapin (see [7]): once you have pinned down the
  backing store of a THP, you can dispatch each range of subpages to
  the appropriate swapin handler.
* Swapping a folio out with discontiguous physical swap slots (see [10])


IV. Potential Issues

Here are a couple of issues I can think of, along with some potential
solutions:

1. Space overhead: we need one swap descriptor per swap entry.
* Note that this overhead is dynamic, i.e., only incurred when we actually
  need to swap a page out.
* It can be further offset by the reduction of swap map and the
  elimination of zeromapped bitmap.

2. Lock contention: since the virtual swap space is dynamic/unbounded,
we cannot naively range partition it anymore. This can increase lock
contention on swap-related data structures (swap cache, zswap’s xarray,
etc.).
* The problem is slightly alleviated by the lockless nature of the new
  reference counting scheme, as well as the per-entry locking for
  backing store information.
* Johannes suggested that I can implement a dynamic partition scheme, in
  which new partitions (along with associated data structures) are
  allocated on demand. It is one extra layer of indirection, but global
  locking will be done only on partition allocation, rather than on
  each access. All other accesses take only local (per-partition)
  locks, or are completely lockless (such as partition lookup).


V. Benchmarking

As a proof of concept, I ran the prototype through some simple
benchmarks:

1. usemem: 16 threads, 2G each, memory.max = 16G

I benchmarked the following usemem command:

time usemem --init-time -w -O -s 10 -n 16 2g

Baseline:
real: 33.96s
user: 25.31s
sys: 341.09s
average throughput: 111295.45 KB/s
average free time: 2079258.68 usecs

New Design:
real: 35.87s
user: 25.15s
sys: 373.01s
average throughput: 106965.46 KB/s
average free time: 3192465.62 usecs

To root cause this regression, I ran perf on the usemem program, as
well as on the following stress-ng program:

perf record -ag -e cycles -G perf_cg -- ./stress-ng/stress-ng  --pageswap $(nproc) --pageswap-ops 100000

and observed the (predicted) increase in lock contention on swap cache
accesses. This regression is alleviated if I put together the
following hack: limit the virtual swap space to a sufficient size for
the benchmark, range partition the swap-related data structures (swap
cache, zswap tree, etc.) based on the limit, and distribute the
allocation of virtual swap slots among these partitions (on a per-CPU
basis):

real: 34.94s
user: 25.28s
sys: 360.25s
average throughput: 108181.15 KB/s
average free time: 2680890.24 usecs

As mentioned above, I will implement proper dynamic swap range
partitioning in a follow up work.

2. Kernel building: zswap enabled, 52 workers (one per processor),
memory.max = 3G.

Baseline:
real: 183.55s
user: 5119.01s
sys: 655.16s

New Design:
real: mean: 184.5s
user: mean: 5117.4s
sys: mean: 695.23s

New Design (Static Partition):
real: 183.95s
user: 5119.29s
sys: 664.24s

3. Swapoff: 32 GB swapfile, 50% full, with a process that mmap-ed a
128GB file.

Baseline:
real: 25.54s
user: 0.00s
sys: 11.48s
    
New Design:
real: 11.69s
user: 0.00s
sys: 9.96s
    
The new design reduces the kernel CPU time by about 13%. There is also
a reduction in real time, but this is mostly due to more asynchronous
IO (rather than the design itself) :)

VI. TODO list

This RFC includes a feature-complete prototype on top of 6.14. Here are
some action items:

Short-term: needs to be done before merging
* More clean-ups and stress-testing.
* Add more documentation of the new design and its API.

Medium-term: optimizations required to make the virtual swap
implementation the default:
* Swap map shrinking and zero map reduction when virtual swap is
  enabled.
* Range partition the virtual swap space.
* More benchmarking and experiments in a variety of use cases.

Long-term: removal of the old implementation and other non-blocking
opportunities
* Remove the old implementation, once no major regressions,
  bottlenecks, etc. remain with the new design.
* Merge more existing swap data structures into this layer - for
  instance, the MTE swap xarray.
* Add new use cases :)

[1]: https://lore.kernel.org/all/CAJD7tkbCnXJ95Qow_aOjNX6NOMU5ovMSHRC+95U4wtW6cM+puw@mail.gmail.com/
[2]: https://lwn.net/Articles/932077/
[3]: https://www.youtube.com/watch?v=Hwqw_TBGEhg
[4]: https://lore.kernel.org/all/Zqe_Nab-Df1CN7iW@infradead.org/ 
[5]: https://lore.kernel.org/lkml/CAF8kJuN-4UE0skVHvjUzpGefavkLULMonjgkXUZSBVJrcGFXCA@mail.gmail.com/
[6]: https://lore.kernel.org/linux-mm/87o78mzp24.fsf@yhuang6-desk2.ccr.corp.intel.com/
[7]: https://lore.kernel.org/all/CAGsJ_4ysCN6f7qt=6gvee1x3ttbOnifGneqcRm9Hoeun=uFQ2w@mail.gmail.com/
[8]: https://lore.kernel.org/linux-mm/4DA25039.3020700@redhat.com/
[9]: https://lore.kernel.org/all/CA+ZsKJ7DCE8PMOSaVmsmYZL9poxK6rn0gvVXbjpqxMwxS2C9TQ@mail.gmail.com/
[10]: https://lore.kernel.org/all/CACePvbUkMYMencuKfpDqtG1Ej7LiUS87VRAXb8sBn1yANikEmQ@mail.gmail.com/

Nhat Pham (14):
  swapfile: rearrange functions
  mm: swap: add an abstract API for locking out swapoff
  mm: swap: add a separate type for physical swap slots
  mm: swap: swap cache support for virtualized swap
  zswap: unify zswap tree for virtualized swap
  mm: swap: allocate a virtual swap slot for each swapped out page
  swap: implement the swap_cgroup API using virtual swap
  swap: manage swap entry lifetime at the virtual swap layer
  swap: implement locking out swapoff using virtual swap slot
  mm: swap: decouple virtual swap slot from backing store
  memcg: swap: only charge physical swap slots
  vswap: support THP swapin and batch free_swap_and_cache
  swap: simplify swapoff using virtual swap
  zswap: do not start zswap shrinker if there is no physical swap slots

 MAINTAINERS                |    7 +
 include/linux/mm_types.h   |    7 +
 include/linux/shmem_fs.h   |    3 +
 include/linux/swap.h       |  280 +++++--
 include/linux/swap_slots.h |    2 +-
 include/linux/swapops.h    |   37 +
 kernel/power/swap.c        |    6 +-
 mm/Kconfig                 |   28 +
 mm/Makefile                |    3 +
 mm/huge_memory.c           |    5 +-
 mm/internal.h              |   25 +-
 mm/memcontrol.c            |  166 ++++-
 mm/memory.c                |   99 ++-
 mm/migrate.c               |    1 +
 mm/page_io.c               |   60 +-
 mm/shmem.c                 |   29 +-
 mm/swap.h                  |   45 +-
 mm/swap_cgroup.c           |   10 +-
 mm/swap_slots.c            |   28 +-
 mm/swap_state.c            |  144 +++-
 mm/swapfile.c              |  770 ++++++++++++-------
 mm/vmscan.c                |   26 +-
 mm/vswap.c                 | 1437 ++++++++++++++++++++++++++++++++++++
 mm/zswap.c                 |   80 +-
 24 files changed, 2807 insertions(+), 491 deletions(-)
 create mode 100644 mm/vswap.c


base-commit: 922ceb9d4bb4dae66c37e24621687e0b4991f5a4
-- 
2.47.1
Re: [RFC PATCH 00/14] Virtual Swap Space
Posted by Yosry Ahmed 1 week, 3 days ago
On Mon, Apr 07, 2025 at 04:42:01PM -0700, Nhat Pham wrote:
> This RFC implements the virtual swap space idea, based on Yosry's
> proposals at LSFMMBPF 2023 (see [1], [2], [3]), as well as valuable
> inputs from Johannes Weiner. The same idea (with different
> implementation details) has been floated by Rik van Riel since at least
> 2011 (see [8]).
> 
> The code attached to this RFC is purely a prototype. It is not 100%
> merge-ready (see section VI for future work). I do, however, want to show
> people this prototype/RFC, including all the bells and whistles and a
> couple of actual use cases, so that folks can see what the end results
> will look like, and give me early feedback :)
> 
> I. Motivation
> 
> Currently, when an anon page is swapped out, a slot in a backing swap
> device is allocated and stored in the page table entries that refer to
> the original page. This slot is also used as the "key" to find the
> swapped out content, as well as the index to swap data structures, such
> as the swap cache, or the swap cgroup mapping. Tying a swap entry to its
> backing slot in this way is performant and efficient when swap is purely
> just disk space, and swapoff is rare.
> 
> However, the advent of many swap optimizations has exposed major
> drawbacks of this design. The first problem is that we occupy a physical
> slot in the swap space, even for pages that are NEVER expected to hit
> the disk: pages compressed and stored in the zswap pool, zero-filled
> pages, or pages rejected by both of these optimizations when zswap
> writeback is disabled. This is the arguably central shortcoming of
> zswap:
> * In deployments when no disk space can be afforded for swap (such as
>   mobile and embedded devices), users cannot adopt zswap, and are forced
>   to use zram. This is confusing for users, and creates extra burdens
>   for developers, having to develop and maintain similar features for
>   two separate swap backends (writeback, cgroup charging, THP support,
>   etc.). For instance, see the discussion in [4].
> * Resource-wise, it is hugely wasteful in terms of disk usage, and
>   limits the memory saving potentials of these optimizations by the
>   static size of the swapfile, especially in high memory systems that
>   can have up to terabytes worth of memory. It also creates significant
>   challenges for users who rely on swap utilization as an early OOM
>   signal.
> 
> Another motivation for a swap redesign is to simplify swapoff, which
> is complicated and expensive in the current design. Tight coupling
> between a swap entry and its backing storage means that it requires a
> whole page table walk to update all the page table entries that refer to
> this swap entry, as well as updating all the associated swap data
> structures (swap cache, etc.).
> 
> 
> II. High Level Design Overview
> 
> To fix the aforementioned issues, we need an abstraction that separates
> a swap entry from its physical backing storage. IOW, we need to
> “virtualize” the swap space: swap clients will work with a dynamically
> allocated virtual swap slot, storing it in page table entries, and
> using it to index into various swap-related data structures. The
> backing storage is decoupled from the virtual swap slot, and the newly
> introduced layer will “resolve” the virtual swap slot to the actual
> storage. This layer also manages other metadata of the swap entry, such
> as its lifetime information (swap count), via a dynamically allocated
> per-swap-entry descriptor:
> 
> struct swp_desc {
> 	swp_entry_t vswap;
> 	union {
> 		swp_slot_t slot;
> 		struct folio *folio;
> 		struct zswap_entry *zswap_entry;
> 	};
> 	struct rcu_head rcu;
> 
> 	rwlock_t lock;
> 	enum swap_type type;
> 
> 	atomic_t memcgid;
> 
> 	atomic_t in_swapcache;
> 	struct kref refcnt;
> 	atomic_t swap_count;
> };

It's exciting to see this proposal materializing :)

I didn't get a chance to look too closely at the code, but I have a few
high-level comments.

Do we need separate refcnt and swap_count? I am aware that there are
cases where we need to hold a reference to prevent the descriptor from
going away, without an extra page table entry referencing the swap
descriptor -- but I am wondering if we can get away with just incrementing
the swap count in these cases too? Would this mess things up?

> 
> This design allows us to:
> * Decouple zswap (and zeromapped swap entry) from backing swapfile:
>   simply associate the virtual swap slot with one of the supported
>   backends: a zswap entry, a zero-filled swap page, a slot on the
>   swapfile, or an in-memory page .
> * Simplify and optimize swapoff: we only have to fault the page in and
>   have the virtual swap slot points to the page instead of the on-disk
>   physical swap slot. No need to perform any page table walking.
> 
> Please see the attached patches for implementation details.
> 
> Note that I do not remove the old implementation for now. Users can
> select between the old and the new implementation via the
> CONFIG_VIRTUAL_SWAP build config. This will also allow us to land the
> new design, and iteratively optimize upon it (without having to include
> everything in an even more massive patch series).

I know this is easier, but honestly I'd prefer if we do an incremental
replacement (if possible) rather than introducing a new implementation
and slowly deprecating the old one, which historically doesn't seem to
go well :P

Once the series is organized as Johannes suggested, and we have better
insights into how this will be integrated with Kairui's work, it should
be clearer whether it's possible to incrementally update the current
implementation rather than add a parallel implementation.

> 
> III. Future Use Cases
> 
> Other than decoupling swap backends and optimizing swapoff, this new
> design allows us to implement the following more easily and
> efficiently:
> 
> * Multi-tier swapping (as mentioned in [5]), with transparent
>   transferring (promotion/demotion) of pages across tiers (see [8] and
>   [9]). Similar to swapoff, with the old design we would need to
>   perform the expensive page table walk.
> * Swapfile compaction to alleviate fragmentation (as proposed by Ying
>   Huang in [6]).
> * Mixed backing THP swapin (see [7]): Once you have pinned down the
>   backing store of THPs, then you can dispatch each range of subpages
>   to appropriate swapin handle.
> * Swapping a folio out with discontiguous physical swap slots (see [10])
> 
> 
> IV. Potential Issues
> 
> Here is a couple of issues I can think of, along with some potential
> solutions:
> 
> 1. Space overhead: we need one swap descriptor per swap entry.
> * Note that this overhead is dynamic, i.e only incurred when we actually
>   need to swap a page out.
> * It can be further offset by the reduction of swap map and the
>   elimination of zeromapped bitmap.
> 
> 2. Lock contention: since the virtual swap space is dynamic/unbounded,
> we cannot naively range partition it anymore. This can increase lock
> contention on swap-related data structures (swap cache, zswap’s xarray,
> etc.).
> * The problem is slightly alleviated by the lockless nature of the new
>   reference counting scheme, as well as the per-entry locking for
>   backing store information.
> * Johannes suggested that I can implement a dynamic partition scheme, in
>   which new partitions (along with associated data structures) are
>   allocated on demand. It is one extra layer of indirection, but global
>   locking will only be done only on partition allocation, rather than on
>   each access. All other accesses only take local (per-partition)
>   locks, or are completely lockless (such as partition lookup).
> 
> 
> V. Benchmarking
> 
> As a proof of concept, I run the prototype through some simple
> benchmarks:
> 
> 1. usemem: 16 threads, 2G each, memory.max = 16G
> 
> I benchmarked the following usemem commands:
> 
> time usemem --init-time -w -O -s 10 -n 16 2g
> 
> Baseline:
> real: 33.96s
> user: 25.31s
> sys: 341.09s
> average throughput: 111295.45 KB/s
> average free time: 2079258.68 usecs
> 
> New Design:
> real: 35.87s
> user: 25.15s
> sys: 373.01s
> average throughput: 106965.46 KB/s
> average free time: 3192465.62 usecs
> 
> To root cause this regression, I ran perf on the usemem program, as
> well as on the following stress-ng program:
> 
> perf record -ag -e cycles -G perf_cg -- ./stress-ng/stress-ng  --pageswap $(nproc) --pageswap-ops 100000
> 
> and observed the (predicted) increase in lock contention on swap cache
> accesses. This regression is alleviated if I put together the
> following hack: limit the virtual swap space to a sufficient size for
> the benchmark, range partition the swap-related data structures (swap
> cache, zswap tree, etc.) based on the limit, and distribute the
> allocation of virtual swap slotss among these partitions (on a per-CPU
> basis):
> 
> real: 34.94s
> user: 25.28s
> sys: 360.25s
> average throughput: 108181.15 KB/s
> average free time: 2680890.24 usecs
> 
> As mentioned above, I will implement proper dynamic swap range
> partitioning in a follow up work.

I thought there would be some improvements with the new design once the
lock contention is gone, due to the colocation of all swap metadata. Do
we know why this isn't the case?

Also, one missing key metric in this cover letter is disk space savings.
It would be useful if you can give a realistic example about how much
disk space is being provisioned and wasted today to effictively use
zswap, and how much this can decrease with this design.

I believe the disk space savings are one of the main motivations so
let's showcase that :)
Re: [RFC PATCH 00/14] Virtual Swap Space
Posted by Kairui Song 3 weeks, 3 days ago
On Tue, Apr 8, 2025 at 7:47 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> This RFC implements the virtual swap space idea, based on Yosry's
> proposals at LSFMMBPF 2023 (see [1], [2], [3]), as well as valuable
> inputs from Johannes Weiner. The same idea (with different
> implementation details) has been floated by Rik van Riel since at least
> 2011 (see [8]).
>
> The code attached to this RFC is purely a prototype. It is not 100%
> merge-ready (see section VI for future work). I do, however, want to show
> people this prototype/RFC, including all the bells and whistles and a
> couple of actual use cases, so that folks can see what the end results
> will look like, and give me early feedback :)
>
> I. Motivation
>
> Currently, when an anon page is swapped out, a slot in a backing swap
> device is allocated and stored in the page table entries that refer to
> the original page. This slot is also used as the "key" to find the
> swapped out content, as well as the index to swap data structures, such
> as the swap cache, or the swap cgroup mapping. Tying a swap entry to its
> backing slot in this way is performant and efficient when swap is purely
> just disk space, and swapoff is rare.
>
> However, the advent of many swap optimizations has exposed major
> drawbacks of this design. The first problem is that we occupy a physical
> slot in the swap space, even for pages that are NEVER expected to hit
> the disk: pages compressed and stored in the zswap pool, zero-filled
> pages, or pages rejected by both of these optimizations when zswap
> writeback is disabled. This is the arguably central shortcoming of
> zswap:
> * In deployments when no disk space can be afforded for swap (such as
>   mobile and embedded devices), users cannot adopt zswap, and are forced
>   to use zram. This is confusing for users, and creates extra burdens
>   for developers, having to develop and maintain similar features for
>   two separate swap backends (writeback, cgroup charging, THP support,
>   etc.). For instance, see the discussion in [4].
> * Resource-wise, it is hugely wasteful in terms of disk usage, and
>   limits the memory saving potentials of these optimizations by the
>   static size of the swapfile, especially in high memory systems that
>   can have up to terabytes worth of memory. It also creates significant
>   challenges for users who rely on swap utilization as an early OOM
>   signal.
>
> Another motivation for a swap redesign is to simplify swapoff, which
> is complicated and expensive in the current design. Tight coupling
> between a swap entry and its backing storage means that it requires a
> whole page table walk to update all the page table entries that refer to
> this swap entry, as well as updating all the associated swap data
> structures (swap cache, etc.).
>
>
> II. High Level Design Overview
>
> To fix the aforementioned issues, we need an abstraction that separates
> a swap entry from its physical backing storage. IOW, we need to
> “virtualize” the swap space: swap clients will work with a dynamically
> allocated virtual swap slot, storing it in page table entries, and
> using it to index into various swap-related data structures. The
> backing storage is decoupled from the virtual swap slot, and the newly
> introduced layer will “resolve” the virtual swap slot to the actual
> storage. This layer also manages other metadata of the swap entry, such
> as its lifetime information (swap count), via a dynamically allocated
> per-swap-entry descriptor:
>
> struct swp_desc {
>         swp_entry_t vswap;
>         union {
>                 swp_slot_t slot;
>                 struct folio *folio;
>                 struct zswap_entry *zswap_entry;
>         };
>         struct rcu_head rcu;
>
>         rwlock_t lock;
>         enum swap_type type;
>
>         atomic_t memcgid;
>
>         atomic_t in_swapcache;
>         struct kref refcnt;
>         atomic_t swap_count;
> };

Thanks for sharing the code. My initial idea after the discussion at
LSFMM is that there is a simple way to combine this with the "swap
table" [1] design of mine to solve the performance issue of this
series: just store the pointer to this struct in the swap table. It's
a brute-force, glue-like solution, but the contention issue will be
gone.

Of course it's not a good approach; ideally, the data structure can be
simplified to an entry type in the swap table. The swap table series
handles locking and synchronization using either the cluster lock
(reusing the swap allocator and existing swap logic) or the folio lock
(kind of like the page cache). Many parts can thus be much simplified;
I think it will be at most ~32 bytes per page with a virtual device
(including the intermediate pointers). It will require quite some work
though.

The good side of that approach is that we will have much lower memory
overhead and even better performance. And the virtual space part will
be optional: for non-virtual setups, the memory consumption will be
only 8 bytes per page, also dynamically allocated, as discussed at
LSFMM.

Sorry that I still have a few parts undone; I'm looking forward to
posting in about one week, e.g. after this weekend if it goes well.
I'll also try to check your series first to see how the two can be
better combined.

A draft version is available here though, just in case anyone is
really anxious to see the code. I wouldn't recommend spending much
effort checking it though, as it may change rapidly:
https://github.com/ryncsn/linux/tree/kasong/devel/swap-unification

But the good news is that the total LOC should be reduced, or at least
won't increase much, as it will unify a lot of swap infrastructure. So
things might be easier to implement after that.

[1] https://lore.kernel.org/linux-mm/CAMgjq7DHFYWhm+Z0C5tR2U2a-N_mtmgB4+idD2S+-1438u-wWw@mail.gmail.com/T/
Re: [RFC PATCH 00/14] Virtual Swap Space
Posted by Nhat Pham 3 weeks, 3 days ago
On Tue, Apr 8, 2025 at 9:23 AM Kairui Song <ryncsn@gmail.com> wrote:
>
>
> Thanks for sharing the code, my initial idea after the discussion at
> LSFMM is that there is a simple way to combine this with the "swap
> table" [1] design of mine to solve the performance issue of this
> series: just store the pointer of this struct in the swap table. It's
> a bruteforce and glue like solution but the contention issue will be
> gone.

Was waiting for your submission, but I figured I should send what I
had out first for immediate feedback :)

Johannes actually proposed something similar to your physical swap
allocator for the virtual swap slots allocation logic, to solve our
lock contention problem. My apologies - I should have name-dropped you
in the RFC cover letter as well (the cover was a bit outdated, and I
had not yet updated it with the newest developments from the LSFMMBPF
conversation).

>
> Of course it's not a good approach, ideally the data structure can be
> simplified to an entry type in the swap table. The swap table series
> handles locking and synchronizations using either cluster lock
> (reusing swap allocator and existing swap logics) or folio lock (kind
> of like page cache). So many parts can be much simplified, I think it
> will be at most ~32 bytes per page with a virtual device (including
> the intermediate pointers).Will require quite some work though.
>
> The good side with that approach is we will have a much lower memory
> overhead and even better performance. And the virtual space part will
> be optional, for non virtual setup the memory consumption will be only
> 8 bytes per page and also dynamically allocated, as discussed at
> LSFMM.

I think one problem with your design, which I alluded to at the
conference, is that it doesn't quite work for our requirements -
namely the separation of zswap from its underlying backend.

All the metadata HAVE to live at the virtual layer. For one, we would
be duplicating the logic if we pushed this to the backend.

But more than that, there are lifetime operations that HAVE to be
backend-agnostic. For instance, on the swap out path, when we unmap
the page from the page table, we do swap_duplicate() (i.e., increase
the swap count/reference count of the swap entries). At that point, we
have not made (and cannot make) a decision regarding the backend
storage yet, and thus have no backend-specific place to hold this
piece of information. If we coupled all the backends then yeah, sure,
we could store it at the physical swapfile level, but that defeats the
purpose of swap virtualization :)

>
> So sorry that I still have a few parts undone, looking forward to
> posting in about one week, eg. After this weekend it goes well. I'll
> also try to check your series first to see how these can be
> collaborated better.

Of course, I'm not against collaboration :) As I mentioned earlier, we
need more work on the allocation part, for which your physical swapfile
allocator should either work directly or serve as inspiration.

Cheers,
Nhat
Re: [RFC PATCH 00/14] Virtual Swap Space
Posted by Kairui Song 3 weeks, 3 days ago
On Wed, Apr 9, 2025 at 12:48 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Tue, Apr 8, 2025 at 9:23 AM Kairui Song <ryncsn@gmail.com> wrote:
> >
> >
> > Thanks for sharing the code. My initial idea after the discussion at
> > LSFMM is that there is a simple way to combine this with the "swap
> > table" [1] design of mine to solve the performance issue of this
> > series: just store the pointer of this struct in the swap table. It's
> > a brute-force, glue-like solution, but the contention issue will be
> > gone.
>
> Was waiting for your submission, but I figured I should send what I
> had out first for immediate feedback :)
>
> Johannes actually proposed something similar to your physical swap
> allocator for the virtual swap slots allocation logic, to solve our
> lock contention problem. My apologies - I should have name-dropped you
> in the RFC cover as well (the cover was a bit outdated, and I haven't
> updated the newest developments that came from the LSFMMBPF
> conversation in the cover letter).
>
> >
> > Of course it's not a good approach; ideally the data structure can be
> > simplified to an entry type in the swap table. The swap table series
> > handles locking and synchronization using either the cluster lock
> > (reusing the swap allocator and existing swap logic) or the folio lock
> > (kind of like the page cache). So many parts can be much simplified; I
> > think it will be at most ~32 bytes per page with a virtual device
> > (including the intermediate pointers). It will require quite some work,
> > though.
> >
> > The good side of that approach is we will have a much lower memory
> > overhead and even better performance. And the virtual space part will
> > be optional; for a non-virtual setup the memory consumption will be
> > only 8 bytes per page, also dynamically allocated, as discussed at
> > LSFMM.
>
> I think one problem with your design, which I alluded to at the
> conference, is that it doesn't quite work for our requirements -
> namely the separation of zswap from its underlying backend.
>
> All the metadata HAVE to live at the virtual layer. For one, we would be
> duplicating the logic if we pushed this to the backend.
>
> But more than that, there are lifetime operations that HAVE to be
> backend-agnostic. For instance, on the swap out path, when we unmap
> the page from the page table, we do swap_duplicate() (i.e., increase
> the swap count/reference count of the swap entries). At that point, we
> have not (and cannot) make a decision regarding the backend storage
> yet, and thus do not have any backend-specific place to hold this
> piece of information. If we coupled all the backends then yeah, sure, we
> could store it at the physical swapfile level, but that defeats the
> purpose of swap virtualization :)

Ah, now I get why you have to store the data in the virtual layer.

I was thinking that doing it in the physical layer would make it easier
to reuse what swap already has. But if you need to be completely
backend-agnostic, then just keep it in the virtual layer. It doesn't
seem like a fundamental issue; it could be worked out in some way, I
think, e.g., by using another table type. I'll check if that would work
after I've done the initial parts.

>
> >
> > So sorry that I still have a few parts undone; I'm looking forward to
> > posting in about one week, e.g., after this weekend if it goes well. I'll
> > also try to check your series first to see how the two can be combined
> > better.
>
> Of course, I'm not against collaboration :) As I mentioned earlier, we
> need more work on the allocation part, for which your physical swapfile
> allocator should either work as-is, or serve as inspiration.
>
> Cheers,
> Nhat
Re: [RFC PATCH 00/14] Virtual Swap Space
Posted by Yosry Ahmed 1 week, 3 days ago
On Wed, Apr 09, 2025 at 12:59:24AM +0800, Kairui Song wrote:
> On Wed, Apr 9, 2025 at 12:48 AM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > On Tue, Apr 8, 2025 at 9:23 AM Kairui Song <ryncsn@gmail.com> wrote:
> > >
> > >
> > > Thanks for sharing the code. My initial idea after the discussion at
> > > LSFMM is that there is a simple way to combine this with the "swap
> > > table" [1] design of mine to solve the performance issue of this
> > > series: just store the pointer of this struct in the swap table. It's
> > > a brute-force, glue-like solution, but the contention issue will be
> > > gone.
> >
> > Was waiting for your submission, but I figured I should send what I
> > had out first for immediate feedback :)
> >
> > Johannes actually proposed something similar to your physical swap
> > allocator for the virtual swap slots allocation logic, to solve our
> > lock contention problem. My apologies - I should have name-dropped you
> > in the RFC cover as well (the cover was a bit outdated, and I haven't
> > updated the newest developments that came from the LSFMMBPF
> > conversation in the cover letter).
> >
> > >
> > > Of course it's not a good approach; ideally the data structure can be
> > > simplified to an entry type in the swap table. The swap table series
> > > handles locking and synchronization using either the cluster lock
> > > (reusing the swap allocator and existing swap logic) or the folio lock
> > > (kind of like the page cache). So many parts can be much simplified; I
> > > think it will be at most ~32 bytes per page with a virtual device
> > > (including the intermediate pointers). It will require quite some work,
> > > though.
> > >
> > > The good side of that approach is we will have a much lower memory
> > > overhead and even better performance. And the virtual space part will
> > > be optional; for a non-virtual setup the memory consumption will be
> > > only 8 bytes per page, also dynamically allocated, as discussed at
> > > LSFMM.
> >
> > I think one problem with your design, which I alluded to at the
> > conference, is that it doesn't quite work for our requirements -
> > namely the separation of zswap from its underlying backend.
> >
> > All the metadata HAVE to live at the virtual layer. For once, we are
> > duplicating the logic if we push this to the backend.
> >
> > But more than that, there are lifetime operations that HAVE to be
> > backend-agnostic. For instance, on the swap out path, when we unmap
> > the page from the page table, we do swap_duplicate() (i.,e increasing
> > the swap count/reference count of the swap entries). At that point, we
> > have not (and cannot) make a decision regarding the backend storage
> > yet, and thus does not have any backend-specific places to hold this
> > piece of information. If we couple all the backends then yeah sure we
> > can store it at the physical swapfile level, but that defeats the
> > purpose of swap virtualization :)
> 
> Ah, now I get why you have to store the data in the virtual layer.
>
> I was thinking that doing it in the physical layer would make it easier
> to reuse what swap already has. But if you need to be completely
> backend-agnostic, then just keep it in the virtual layer. It doesn't
> seem like a fundamental issue; it could be worked out in some way, I
> think, e.g., by using another table type. I'll check if that would work
> after I've done the initial parts.

Watching from the sidelines, I am happy to see Nhat's proposal
materializing, and I think there is definitely room for collaboration
here with Kairui's. Overall, both proposals seem to be complementary
concepts, and we just need to figure out the right way to combine them
:)
Re: [RFC PATCH 00/14] Virtual Swap Space
Posted by Usama Arif 3 weeks, 3 days ago
On 08/04/2025 00:42, Nhat Pham wrote:
> 
> V. Benchmarking
> 
> As a proof of concept, I ran the prototype through some simple
> benchmarks:
> 
> 1. usemem: 16 threads, 2G each, memory.max = 16G
> 
> I benchmarked the following usemem commands:
> 
> time usemem --init-time -w -O -s 10 -n 16 2g
> 
> Baseline:
> real: 33.96s
> user: 25.31s
> sys: 341.09s
> average throughput: 111295.45 KB/s
> average free time: 2079258.68 usecs
> 
> New Design:
> real: 35.87s
> user: 25.15s
> sys: 373.01s
> average throughput: 106965.46 KB/s
> average free time: 3192465.62 usecs
> 
> To root cause this regression, I ran perf on the usemem program, as
> well as on the following stress-ng program:
> 
> perf record -ag -e cycles -G perf_cg -- ./stress-ng/stress-ng  --pageswap $(nproc) --pageswap-ops 100000
> 
> and observed the (predicted) increase in lock contention on swap cache
> accesses. This regression is alleviated if I put together the
> following hack: limit the virtual swap space to a sufficient size for
> the benchmark, range partition the swap-related data structures (swap
> cache, zswap tree, etc.) based on the limit, and distribute the
> allocation of virtual swap slots among these partitions (on a per-CPU
> basis):
> 
> real: 34.94s
> user: 25.28s
> sys: 360.25s
> average throughput: 108181.15 KB/s
> average free time: 2680890.24 usecs
> 
> As mentioned above, I will implement proper dynamic swap range
> partitioning in a follow up work.
> 
> 2. Kernel building: zswap enabled, 52 workers (one per processor),
> memory.max = 3G.
> 
> Baseline:
> real: 183.55s
> user: 5119.01s
> sys: 655.16s
> 
> New Design:
> real: mean: 184.5s
> user: mean: 5117.4s
> sys: mean: 695.23s
> 
> New Design (Static Partition)
> real: 183.95s
> user: 5119.29s
> sys: 664.24s
> 

Hi Nhat,

Thanks for the patches! I have glanced over a couple of them, but this was the main question that came to my mind.

Just wanted to check if you had a look at the memory regression during these benchmarks?

Also what is sizeof(swp_desc)? Maybe we can calculate the memory overhead as sizeof(swp_desc) * swap size/PAGE_SIZE?

For a 64G swap that is filled with private anon pages, the overhead in MB might be (sizeof(swp_desc) in bytes * 16M) - 16M (zerobitmap) - 16M*8 (swap map)?

This looks like a sizeable memory regression?

Thanks,
Usama
Re: [RFC PATCH 00/14] Virtual Swap Space
Posted by Johannes Weiner 3 weeks, 3 days ago
On Tue, Apr 08, 2025 at 02:04:06PM +0100, Usama Arif wrote:
> 
> 
> On 08/04/2025 00:42, Nhat Pham wrote:
> > 
> > V. Benchmarking
> > 
> > As a proof of concept, I ran the prototype through some simple
> > benchmarks:
> > 
> > 1. usemem: 16 threads, 2G each, memory.max = 16G
> > 
> > I benchmarked the following usemem commands:
> > 
> > time usemem --init-time -w -O -s 10 -n 16 2g
> > 
> > Baseline:
> > real: 33.96s
> > user: 25.31s
> > sys: 341.09s
> > average throughput: 111295.45 KB/s
> > average free time: 2079258.68 usecs
> > 
> > New Design:
> > real: 35.87s
> > user: 25.15s
> > sys: 373.01s
> > average throughput: 106965.46 KB/s
> > average free time: 3192465.62 usecs
> > 
> > To root cause this regression, I ran perf on the usemem program, as
> > well as on the following stress-ng program:
> > 
> > perf record -ag -e cycles -G perf_cg -- ./stress-ng/stress-ng  --pageswap $(nproc) --pageswap-ops 100000
> > 
> > and observed the (predicted) increase in lock contention on swap cache
> > accesses. This regression is alleviated if I put together the
> > following hack: limit the virtual swap space to a sufficient size for
> > the benchmark, range partition the swap-related data structures (swap
> > cache, zswap tree, etc.) based on the limit, and distribute the
> > allocation of virtual swap slots among these partitions (on a per-CPU
> > basis):
> > 
> > real: 34.94s
> > user: 25.28s
> > sys: 360.25s
> > average throughput: 108181.15 KB/s
> > average free time: 2680890.24 usecs
> > 
> > As mentioned above, I will implement proper dynamic swap range
> > partitioning in a follow up work.
> > 
> > 2. Kernel building: zswap enabled, 52 workers (one per processor),
> > memory.max = 3G.
> > 
> > Baseline:
> > real: 183.55s
> > user: 5119.01s
> > sys: 655.16s
> > 
> > New Design:
> > real: mean: 184.5s
> > user: mean: 5117.4s
> > sys: mean: 695.23s
> > 
> > New Design (Static Partition)
> > real: 183.95s
> > user: 5119.29s
> > sys: 664.24s
> > 
> 
> Hi Nhat,
> 
> Thanks for the patches! I have glanced over a couple of them, but this was the main question that came to my mind.
> 
> Just wanted to check if you had a look at the memory regression during these benchmarks?
> 
> Also what is sizeof(swp_desc)? Maybe we can calculate the memory overhead as sizeof(swp_desc) * swap size/PAGE_SIZE?
> 
> For a 64G swap that is filled with private anon pages, the overhead in MB might be (sizeof(swp_desc) in bytes * 16M) - 16M (zerobitmap) - 16M*8 (swap map)?
> 
> This looks like a sizeable memory regression?

One thing to keep in mind is that the swap descriptor is currently
blatantly explicit, and many conversions and optimizations have not
been done yet. There are some tradeoffs made here regarding code
reviewability, but I agree it makes it hard to see what this would
look like fully realized.

I think what's really missing is an analysis of what the goal is and
what the overhead will be then.

The swapin path currently consults the swapcache, then the zeromap,
then zswap, and finally the backend. The external swap_cgroup array is
consulted to determine who to charge for the new page.

With vswap, the descriptor is looked up and resolves to a type, a
location, cgroup ownership, and a refcount. This means it replaces the
swapcache, the zeromap, the cgroup map, and largely the swap_map.

Nhat was not quite sure yet whether the swap_map can be a single bit
per entry, or needs two bits to represent bad slots. In any case, it's
a large reduction in static swap space overhead, and it eliminates the
tricky swap count continuation code.
Re: [RFC PATCH 00/14] Virtual Swap Space
Posted by Nhat Pham 3 weeks, 3 days ago
On Tue, Apr 8, 2025 at 8:45 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Tue, Apr 08, 2025 at 02:04:06PM +0100, Usama Arif wrote:
> >
> >
> > On 08/04/2025 00:42, Nhat Pham wrote:
> > >
> > > V. Benchmarking
> > >
> > > As a proof of concept, I ran the prototype through some simple
> > > benchmarks:
> > >
> > > 1. usemem: 16 threads, 2G each, memory.max = 16G
> > >
> > > I benchmarked the following usemem commands:
> > >
> > > time usemem --init-time -w -O -s 10 -n 16 2g
> > >
> > > Baseline:
> > > real: 33.96s
> > > user: 25.31s
> > > sys: 341.09s
> > > average throughput: 111295.45 KB/s
> > > average free time: 2079258.68 usecs
> > >
> > > New Design:
> > > real: 35.87s
> > > user: 25.15s
> > > sys: 373.01s
> > > average throughput: 106965.46 KB/s
> > > average free time: 3192465.62 usecs
> > >
> > > To root cause this regression, I ran perf on the usemem program, as
> > > well as on the following stress-ng program:
> > >
> > > perf record -ag -e cycles -G perf_cg -- ./stress-ng/stress-ng  --pageswap $(nproc) --pageswap-ops 100000
> > >
> > > and observed the (predicted) increase in lock contention on swap cache
> > > accesses. This regression is alleviated if I put together the
> > > following hack: limit the virtual swap space to a sufficient size for
> > > the benchmark, range partition the swap-related data structures (swap
> > > cache, zswap tree, etc.) based on the limit, and distribute the
> > > allocation of virtual swap slots among these partitions (on a per-CPU
> > > basis):
> > >
> > > real: 34.94s
> > > user: 25.28s
> > > sys: 360.25s
> > > average throughput: 108181.15 KB/s
> > > average free time: 2680890.24 usecs
> > >
> > > As mentioned above, I will implement proper dynamic swap range
> > > partitioning in a follow up work.
> > >
> > > 2. Kernel building: zswap enabled, 52 workers (one per processor),
> > > memory.max = 3G.
> > >
> > > Baseline:
> > > real: 183.55s
> > > user: 5119.01s
> > > sys: 655.16s
> > >
> > > New Design:
> > > real: mean: 184.5s
> > > user: mean: 5117.4s
> > > sys: mean: 695.23s
> > >
> > > New Design (Static Partition)
> > > real: 183.95s
> > > user: 5119.29s
> > > sys: 664.24s
> > >
> >
> > Hi Nhat,
> >
> > Thanks for the patches! I have glanced over a couple of them, but this was the main question that came to my mind.
> >
> > Just wanted to check if you had a look at the memory regression during these benchmarks?
> >
> > Also what is sizeof(swp_desc)? Maybe we can calculate the memory overhead as sizeof(swp_desc) * swap size/PAGE_SIZE?
> >
> > For a 64G swap that is filled with private anon pages, the overhead in MB might be (sizeof(swp_desc) in bytes * 16M) - 16M (zerobitmap) - 16M*8 (swap map)?
> >
> > This looks like a sizeable memory regression?
>
> One thing to keep in mind is that the swap descriptor is currently
> blatantly explicit, and many conversions and optimizations have not
> been done yet. There are some tradeoffs made here regarding code
> reviewability, but I agree it makes it hard to see what this would
> look like fully realized.
>
> I think what's really missing is an analysis of what the goal is and
> what the overhead will be then.
>
> The swapin path currently consults the swapcache, then the zeromap,
> then zswap, and finally the backend. The external swap_cgroup array is
> consulted to determine who to charge for the new page.
>
> With vswap, the descriptor is looked up and resolves to a type, a
> location, cgroup ownership, and a refcount. This means it replaces the
> swapcache, the zeromap, the cgroup map, and largely the swap_map.
>
> Nhat was not quite sure yet whether the swap_map can be a single bit
> per entry, or needs two bits to represent bad slots. In any case, it's
> a large reduction in static swap space overhead, and it eliminates the
> tricky swap count continuation code.

You're right. I haven't touched the swapfile swap map and the zeromap
bitmap at all, primarily because it's a non-functional change
(optimization only). It also adds more ifdefs to the final codebase :)

In the next version, I can tag on one patch to:

1. Remove the zeromap bitmap. This one is pretty much straightforward -
we're not using it at all.

2. Swap map reduction. I'm like 70% sure we don't need the SWAP_MAP_BAD
state. With the vswap reverse map and the swapfile in-use counters, we
should be able to convert the swapmap into a pure bitmap. If we can't,
then it's 2 bits per physical swapfiles.
Re: [RFC PATCH 00/14] Virtual Swap Space
Posted by Nhat Pham 3 weeks, 3 days ago
On Tue, Apr 8, 2025 at 9:25 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
>
> You're right. I haven't touched the swapfile swap map and the zeromap
> bitmap at all, primarily because it's a non-functional change
> (optimization only). It also adds more ifdefs to the final codebase :)
>
> In the next version, I can tag on one patch to:
>
> 1. Remove the zeromap bitmap. This one is pretty much straightforward -
> we're not using it at all.
>
> 2. Swap map reduction. I'm like 70% sure we don't need the SWAP_MAP_BAD
> state. With the vswap reverse map and the swapfile in-use counters, we
> should be able to convert the swapmap into a pure bitmap. If we can't,
> then it's 2 bits per physical swapfiles.

s/physical swapfiles/physical swap slot/ (3 states: unallocated,
allocated, bad slot; the latter two might be mergeable).
Re: [RFC PATCH 00/14] Virtual Swap Space
Posted by Nhat Pham 3 weeks, 3 days ago
On Tue, Apr 8, 2025 at 6:04 AM Usama Arif <usamaarif642@gmail.com> wrote:
>
>
>
> On 08/04/2025 00:42, Nhat Pham wrote:
> >
> > V. Benchmarking
> >
> > As a proof of concept, I ran the prototype through some simple
> > benchmarks:
> >
> > 1. usemem: 16 threads, 2G each, memory.max = 16G
> >
> > I benchmarked the following usemem commands:
> >
> > time usemem --init-time -w -O -s 10 -n 16 2g
> >
> > Baseline:
> > real: 33.96s
> > user: 25.31s
> > sys: 341.09s
> > average throughput: 111295.45 KB/s
> > average free time: 2079258.68 usecs
> >
> > New Design:
> > real: 35.87s
> > user: 25.15s
> > sys: 373.01s
> > average throughput: 106965.46 KB/s
> > average free time: 3192465.62 usecs
> >
> > To root cause this regression, I ran perf on the usemem program, as
> > well as on the following stress-ng program:
> >
> > perf record -ag -e cycles -G perf_cg -- ./stress-ng/stress-ng  --pageswap $(nproc) --pageswap-ops 100000
> >
> > and observed the (predicted) increase in lock contention on swap cache
> > accesses. This regression is alleviated if I put together the
> > following hack: limit the virtual swap space to a sufficient size for
> > the benchmark, range partition the swap-related data structures (swap
> > cache, zswap tree, etc.) based on the limit, and distribute the
> > allocation of virtual swap slots among these partitions (on a per-CPU
> > basis):
> >
> > real: 34.94s
> > user: 25.28s
> > sys: 360.25s
> > average throughput: 108181.15 KB/s
> > average free time: 2680890.24 usecs
> >
> > As mentioned above, I will implement proper dynamic swap range
> > partitioning in a follow up work.
> >
> > 2. Kernel building: zswap enabled, 52 workers (one per processor),
> > memory.max = 3G.
> >
> > Baseline:
> > real: 183.55s
> > user: 5119.01s
> > sys: 655.16s
> >
> > New Design:
> > real: mean: 184.5s
> > user: mean: 5117.4s
> > sys: mean: 695.23s
> >
> > New Design (Static Partition)
> > real: 183.95s
> > user: 5119.29s
> > sys: 664.24s
> >
>
> Hi Nhat,
>
> Thanks for the patches! I have glanced over a couple of them, but this was the main question that came to my mind.
>
> Just wanted to check if you had a look at the memory regression during these benchmarks?
>
> Also what is sizeof(swp_desc)? Maybe we can calculate the memory overhead as sizeof(swp_desc) * swap size/PAGE_SIZE?

Yeah, it's pretty big right now (120 bytes). I haven't done any space
optimization yet - I basically listed out all the required
information and added one field for each piece. A couple of
optimizations I have in mind:
1. Merge swap_count and in_swapcache (suggested by Yosry).
2. Unionize the rcu field with other fields, because the rcu head is
only needed on the free paths (suggested by Shakeel for a different
context, but it should be applicable here). Or maybe just remove it
and free the swap descriptors in-context.
3. The type field really only needs 2 bits - we might be able to
squeeze it into one of the other fields as well.
4. The lock field might not be needed. I think the in_swapcache bit is
already used as a form of "backing storage pinning" mechanism, which
should allow pinners exclusive rights to the backing state.

etc. etc.

The code will get uglier though, so I want to at least send out one
version with everything separate for clarity's sake, before optimizing
them away :)

>
> For a 64G swap that is filled with private anon pages, the overhead in MB might be (sizeof(swp_desc) in bytes * 16M) - 16M (zerobitmap) - 16M*8 (swap map)?

That is true. I will note, however, that in the past the overhead was
static (i.e., it was incurred no matter how much swap space you were
using). In fact, you often have to overprovision for swap, so the
overhead goes beyond what you will (ever) need.

Now the overhead is (mostly) dynamic - only incurred on demand, and
reduced when you don't need it.


>
> This looks like a sizeable memory regression?
>
> Thanks,
> Usama
>