mm, swap: swap table phase IV with dynamic ghost swapfile

[PATCH RFC 00/15] mm, swap: swap table phase IV with dynamic ghost swapfile

Posted by Kairui Song via B4 Relay 1 month, 1 week ago

NOTE for an RFC quality series: Swap table P4 is patches 1 - 12, and the
dynamic ghost file is patches 13 - 15. I'm putting them together as an RFC
for easier review and discussion. Swap table P4 is stable and good to merge
if we are OK with a few memcg reparenting behaviors (there is also a
solution if we are not); dynamic ghost swap is still a minimal proof of
concept. See patch 15 for more details, and see below for the swap table
P4 cover letter (nice performance gain and memory savings).

This is based on the latest mm-unstable, swap table P3 [1], and patches
[2], [3], and [4]. Sending this out early, as it might help us get a
cleaner picture of the ongoing efforts and make the discussions easier.

Summary: With this approach, we can have an infinitely or dynamically
large ghost swapfile, which could be identical to "virtual swap", and
support every feature we need while being *runtime configurable* with
*zero overhead* for plain swap, keeping the infrastructure unified. It
is also highly compatible with YoungJun's swap tiering [5] and other
ideas like swap table compaction and swapops, and it aligns with a few
proposals [6] [7] [8] [9] [10].

In the past two years, most efforts have focused on the swap
infrastructure, and we have made tremendous gains in performance,
keeping the memory usage reasonable or lower, and also greatly cleaned
up and simplified the API and conventions.

Now that the infrastructure is almost ready, after P4, an infinitely or
dynamically large swapfile can be implemented in a flexible and
easy-to-maintain way. The code change is minimal and progressive for
review, and it makes future optimizations like swap table compaction
doable too, since the infrastructure is the same for all swaps.

The dynamic swap file is now using Xarray for the cluster info, and
inside the cluster, it's all the same swap allocator, swap table, and
existing infrastructures. A virtual table is available for any extra
data or usage. See below for the benefits and what we can achieve.
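
As a rough illustration of the paragraph above, here is a minimal userspace
sketch in C of a swap device whose clusters are allocated on demand. It is
not the code from this series: a plain pointer array stands in for the
Xarray, and `SLOTS_PER_CLUSTER`, `MAX_CLUSTERS`, `struct dyn_swap` and
`slot_lookup()` are hypothetical names chosen for the example. The cluster
size of 512 slots is an assumption (2 MiB clusters with 4 KiB pages).

```c
#include <stdint.h>
#include <stdlib.h>

#define SLOTS_PER_CLUSTER 512u   /* assumed: 2 MiB cluster / 4 KiB pages */
#define MAX_CLUSTERS      1024u  /* arbitrary bound for this sketch */

/* One cluster: a table with one word of metadata per swap slot. */
struct cluster {
	uint64_t table[SLOTS_PER_CLUSTER];
};

/* Sparse cluster index; a pointer array stands in for the Xarray. */
struct dyn_swap {
	struct cluster *clusters[MAX_CLUSTERS];
};

/*
 * Map a swap offset to its slot. The cluster is allocated lazily when
 * `create` is non-zero, so untouched ranges cost no memory.
 * Returns NULL if the cluster is absent (and not created) or out of range.
 */
static uint64_t *slot_lookup(struct dyn_swap *d, uint64_t offset, int create)
{
	uint64_t idx = offset / SLOTS_PER_CLUSTER;

	if (idx >= MAX_CLUSTERS)
		return NULL;
	if (!d->clusters[idx]) {
		if (!create)
			return NULL;
		d->clusters[idx] = calloc(1, sizeof(struct cluster));
		if (!d->clusters[idx])
			return NULL;
	}
	return &d->clusters[idx]->table[offset % SLOTS_PER_CLUSTER];
}
```

Inside a cluster the lookup is a plain array index, which is the point of
the design described above: only the cluster index is dynamic, everything
below it is unchanged.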

Huge thanks to Chris Li for the layered swap table and ghost swapfile
idea, without whom the work here couldn't have been achieved. Also, thanks
to Nhat for pushing and suggesting using an Xarray for the swapfile [11]
for dynamic size. I was originally planning to use a dynamic cluster
array, which requires a bit more adaptation, cleanup, and convention
changes. But during the discussion there, I got the inspiration that
Xarray can be used as the intermediate step, making this approach
doable with minimal changes. Keeping it in the future might not hurt
either, as Xarray is limited to ghost / virtual files, so plain swaps
won't have any extra lookup overhead or a high risk of swapout
allocation failure.

I'm fully open to suggestions on naming or API strategy, and others are
highly welcome to keep the work going using this flexible approach.
Following this approach, we will progressively get all of the following
(some are already there or almost there):

- 8 bytes per slot memory usage, when using only plain swap.
  - And the memory usage can be reduced to 3 or only 1 byte.
- 16 bytes per slot memory usage, when using ghost / virtual zswap.
  - Zswap can just use ci_dyn->virtual_table to free up its content
    completely.
  - And the memory usage can be reduced to 11 or 8 bytes using the same
    code above.
  - 24 bytes only if reverse mapping is in use.
- Minimal code review or maintenance burden. All layers are using the exact
  same infrastructure for metadata / allocation / synchronization, making
  all API and conventions consistent and easy to maintain.
- Writeback, migration and compaction are easily supportable since both
  reverse mapping and reallocation are prepared. We just need a
  folio_realloc_swap to allocate new entries for the existing entry, and
  fill the swap table with a reverse map entry.
- Fast swapoff: Just read into ghost / virtual swap cache.
- Zero static data (mostly due to swap table P4), even the clusters are
  dynamic (If using Xarray, only for ghost / virtual swap file).
- So we can have an infinitely sized swap space with no static data
  overhead.
- Everything is runtime configurable, and high-performance. An
  uncompressible workload or an offline batch workload can directly use a
  plain or remote swap for the lowest interference, memory usage, or for
  best performance.
- Highly compatible with YoungJun's swap tiering, even the ghost / virtual
  file can be just a tier. For example, if you have a huge NBD that doesn't
  care about fragmentation and compression, or the workload is
  uncompressible, setting the workload to use NBD's tier will give you only
  8 bytes of overhead per slot and peak performance, bypassing everything.
  Meanwhile, other workloads or cgroups can still use the ghost layer with
  compression or defragmentation using 16 bytes (zswap only) or 24 bytes
  (ghost swap with physical writeback) overhead.
- No force or breaking change to any existing allocation, priority, swap
  setup, or reclaim strategy. Ghost / virtual swap can be enabled or
  disabled using swapon / swapoff.
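
To put the per-slot numbers from the list above into perspective, here is
a small back-of-the-envelope helper in C. It is only a sketch: the enum and
`swap_metadata_bytes()` are hypothetical names, the 8 / 16 / 24 byte costs
are the ones quoted in the list, and it ignores the fact that tables are
allocated per cluster rather than per slot.

```c
#include <stdint.h>

/* Per-slot metadata sizes quoted in the list above, in bytes. */
enum slot_cost {
	SLOT_PLAIN       = 8,   /* plain swap only */
	SLOT_GHOST_ZSWAP = 16,  /* ghost / virtual zswap */
	SLOT_GHOST_RMAP  = 24,  /* ghost swap with reverse mapping in use */
};

/*
 * Hypothetical helper: metadata bytes needed to track `used_bytes` of
 * swapped-out data, assuming one slot per page of `page_size` bytes.
 */
static uint64_t swap_metadata_bytes(uint64_t used_bytes, uint64_t page_size,
				    enum slot_cost cost)
{
	uint64_t slots = used_bytes / page_size;

	return slots * (uint64_t)cost;
}
```

With 4 KiB pages, 1 GiB of swapped-out data is 262144 slots, so this gives
2 MiB of metadata for plain swap, 4 MiB for ghost / virtual zswap, and
6 MiB when reverse mapping is in use.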

And if you consider these options too complex to set up and maintain, we
can allow only one ghost / virtual file, make it infinitely large, and
make it the default and top tier. It then achieves the same thing as
virtual swap space, but with far fewer LOC changed and while being
runtime optional.

Currently, the dynamic ghost files are just reported as ordinary swap files
in /proc/swaps and we can have multiple ones, so users will have a full
view of what's going on. This is a very easy-to-change design decision.
I'm open to ideas about how we should present this to users. e.g., Hiding
it will make it more "virtual", but I don't think that's a good idea.

The size of the swapfile (si->max) is now just a number. It could be
made changeable at runtime once we have a proper idea of how to expose
that, though a few remaining users might need an audit. But right now, we
can already easily have a huge swap device with no overhead, for example:

free -m
               total        used        free      shared  buff/cache   available
Mem:            1465         250         927           1         356        1215
Swap:       15269887           0    15269887

And for easier testing, I added a /dev/ghostswap in this RFC. `swapon
/dev/ghostswap` enables that. Without swapon /dev/ghostswap, any existing
users, including ZRAM, won't observe any change.

===

Original cover letter for swap table phase IV:

This series unifies the allocation and charging process of anon and shmem,
provides better synchronization, and consolidates cgroup tracking, hence
dropping the cgroup array and improving the performance of mTHP by about
15%.

Still testing with a kernel build under heavy memory pressure, with mTHP
256kB enabled, on an EPYC 7K62 using 16G ZRAM, make -j48 with a 1G memory
limit, 12 test runs:

Before: 2215.55s system, 2:53.03 elapsed
After:  1852.14s system, 2:41.44 elapsed (16.4% faster system time)

In some workloads, the speed gain is larger than that since this reduces
memory thrashing, so even IO-bound work could benefit a lot. I also no
longer see any "Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying
PF" messages, which showed up from time to time before this series.

Now, the swap cache layer ensures a folio is the exclusive owner of the
swap slot before charging it, which leads to much less thrashing under
pressure.

And besides, the swap cgroup static array is gone, so for example, mounting
a 1TB swap device saves about 512MB of memory:

Before:
        total     used     free     shared  buff/cache available
Mem:    1465      854      331      1       347        610
Swap:   1048575   0        1048575

After:
        total     used     free     shared  buff/cache available
Mem:    1465      332      838      1       363        1133
Swap:   1048575   0        1048575

It saves us ~512M of memory; we now have close to zero static overhead.
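
The ~512M figure can be reproduced with simple arithmetic. The sketch
below assumes that the removed swap cgroup array kept one 2-byte memcg id
per swap slot and that pages are 4 KiB; both are assumptions made for this
example, and `swap_cgroup_array_bytes()` is a hypothetical helper, not a
kernel function.

```c
#include <stdint.h>

/* Assumed: the static swap cgroup array held one 16-bit id per slot. */
#define SWAP_CGROUP_ID_BYTES 2ULL

/* Static array size for a swap device of `swap_bytes`, one slot per page. */
static uint64_t swap_cgroup_array_bytes(uint64_t swap_bytes, uint64_t page_size)
{
	return (swap_bytes / page_size) * SWAP_CGROUP_ID_BYTES;
}
```

For a 1 TiB device with 4 KiB pages this is 2^28 slots times 2 bytes, or
512 MiB, which is in line with the drop in "used" from 854 to 332 in the
free output above.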

Link: https://lore.kernel.org/linux-mm/20260218-swap-table-p3-v3-0-f4e34be021a7@tencent.com/ [1]
Link: https://lore.kernel.org/linux-mm/20260213-memcg-privid-v1-1-d8cb7afcf831@tencent.com/ [2]
Link: https://lore.kernel.org/linux-mm/20260211-shmem-swap-gfp-v1-1-e9781099a861@tencent.com/ [3]
Link: https://lore.kernel.org/linux-mm/20260216-hibernate-perf-v4-0-1ba9f0bf1ec9@tencent.com/ [4]
Link: https://lore.kernel.org/linux-mm/20260217000950.4015880-1-youngjun.park@lge.com/ [5]
Link: https://lore.kernel.org/all/CAMgjq7BvQ0ZXvyLGp2YP96+i+6COCBBJCYmjXHGBnfisCAb8VA@mail.gmail.com/ [6]
Link: https://lwn.net/Articles/974587/ [7]
Link: https://lwn.net/Articles/932077/ [8]
Link: https://lwn.net/Articles/1016136/ [9]
Link: https://lore.kernel.org/linux-mm/20260208215839.87595-1-nphamcs@gmail.com/ [10]
Link: https://lore.kernel.org/linux-mm/CAKEwX=OUni7PuUqGQUhbMDtErurFN_i=1RgzyQsNXy4LABhXoA@mail.gmail.com/ [11]

Signed-off-by: Kairui Song <kasong@tencent.com>
---
Chris Li (1):
      mm: ghost swapfile support for zswap

Kairui Song (14):
      mm: move thp_limit_gfp_mask to header
      mm, swap: simplify swap_cache_alloc_folio
      mm, swap: move conflict checking logic of out swap cache adding
      mm, swap: add support for large order folios in swap cache directly
      mm, swap: unify large folio allocation
      memcg, swap: reparent the swap entry on swapin if swapout cgroup is dead
      memcg, swap: defer the recording of memcg info and reparent flexibly
      mm, swap: store and check memcg info in the swap table
      mm, swap: support flexible batch freeing of slots in different memcg
      mm, swap: always retrieve memcg id from swap table
      mm/swap, memcg: remove swap cgroup array
      mm, swap: merge zeromap into swap table
      mm, swap: add a special device for ghost swap setup
      mm, swap: allocate cluster dynamically for ghost swapfile

 MAINTAINERS                 |   1 -
 drivers/char/mem.c          |  39 ++++
 include/linux/huge_mm.h     |  24 +++
 include/linux/memcontrol.h  |  12 +-
 include/linux/swap.h        |  30 ++-
 include/linux/swap_cgroup.h |  47 -----
 mm/Makefile                 |   3 -
 mm/internal.h               |  25 ++-
 mm/memcontrol-v1.c          |  78 ++++----
 mm/memcontrol.c             | 119 ++++++++++--
 mm/memory.c                 |  89 ++-------
 mm/page_io.c                |  46 +++--
 mm/shmem.c                  | 122 +++---------
 mm/swap.h                   | 122 +++++-------
 mm/swap_cgroup.c            | 172 ----------------
 mm/swap_state.c             | 464 ++++++++++++++++++++++++--------------------
 mm/swap_table.h             | 105 ++++++++--
 mm/swapfile.c               | 278 ++++++++++++++++++++------
 mm/vmscan.c                 |   7 +-
 mm/workingset.c             |  16 +-
 mm/zswap.c                  |  29 +--
 21 files changed, 977 insertions(+), 851 deletions(-)
---
base-commit: 4750368e2cd365ac1e02c6919013c8871f35d8f9
change-id: 20260111-swap-table-p4-98ee92baa7c4

Best regards,
-- 
Kairui Song <kasong@tencent.com>

Re: [PATCH RFC 00/15] mm, swap: swap table phase IV with dynamic ghost swapfile

Posted by Johannes Weiner 1 month, 1 week ago

On Fri, Feb 20, 2026 at 07:42:01AM +0800, Kairui Song via B4 Relay wrote:
> - 8 bytes per slot memory usage, when using only plain swap.
>   - And the memory usage can be reduced to 3 or only 1 byte.
> - 16 bytes per slot memory usage, when using ghost / virtual zswap.
>   - Zswap can just use ci_dyn->virtual_table to free up it's content
>     completely.
>   - And the memory usage can be reduced to 11 or 8 bytes using the same
>     code above.
>   - 24 bytes only if including reverse mapping is in use.

That seems to tie us pretty permanently to duplicate metadata.

For every page that was written to disk through zswap, we have an
entry in the ghost swapfile, and an entry in the backend swapfile, no?

> - Minimal code review or maintenance burden. All layers are using the exact
>   same infrastructure for metadata / allocation / synchronization, making
>   all API and conventions consistent and easy to maintain.
> - Writeback, migration and compaction are easily supportable since both
>   reverse mapping and reallocation are prepared. We just need a
>   folio_realloc_swap to allocate new entries for the existing entry, and
>   fill the swap table with a reserve map entry.
> - Fast swapoff: Just read into ghost / virtual swap cache.

Can we get this for disk swap as well? ;)

Zswap swapoff is already fairly fast, albeit CPU intense. It's the
scattered IO that makes swapoff on disks so terrible.

> The size of the swapfile (si->max) is now just a number, which could be
> changeable at runtime if we have a proper idea how to expose that and
> might need some audit of a few remaining users. But right now, we can
> already easily have a huge swap device with no overhead, for example:
> 
> free -m
>                total        used        free      shared  buff/cache   available
> Mem:            1465         250         927           1         356        1215
> Swap:       15269887           0    15269887

I'm not a fan of this. This makes free(1) output kind of useless, and
very misleading. The swap space presented here has nothing to do with
actual swap capacity, and the actual disk swap capacity is obscured.

And how would a user choose this size? How would a distribution?

The only limit is compression ratio, and you don't know this in
advance. This restriction seems pretty arbitrary and avoidable.

There is no good technical reason to present this in any sort of
static fashion.

Re: [PATCH RFC 00/15] mm, swap: swap table phase IV with dynamic ghost swapfile

Posted by Kairui Song 1 month, 1 week ago

On Tue, Feb 24, 2026 at 1:00 AM Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> On Fri, Feb 20, 2026 at 07:42:01AM +0800, Kairui Song via B4 Relay wrote:
> > - 8 bytes per slot memory usage, when using only plain swap.
> >   - And the memory usage can be reduced to 3 or only 1 byte.
> > - 16 bytes per slot memory usage, when using ghost / virtual zswap.
> >   - Zswap can just use ci_dyn->virtual_table to free up it's content
> >     completely.
> >   - And the memory usage can be reduced to 11 or 8 bytes using the same
> >     code above.
> >   - 24 bytes only if including reverse mapping is in use.
>
> That seems to tie us pretty permanently to duplicate metadata.
>
> For every page that was written to disk through zswap, we have an
> entry in the ghost swapfile, and an entry in the backend swapfile, no?

No, only one entry in the ghost swapfile (xswap or virtual swap file,
anyway it's just a name). The one in the physical swap is a reverse
mapping entry: it tells which slot in the ghost swapfile points to the
physical slot, so swapoff / migration of the physical slot can be done
in O(1) time.

So, zero duplication of any data.
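
A minimal userspace sketch of the forward / reverse relationship described
above, in C. It is not the layout used by the series: the two flat arrays,
`struct two_layer_swap`, `swap_bind()` and `swap_evict_phys()` are all
hypothetical and only illustrate why a reverse entry in the physical
device makes eviction of a physical slot an O(1) operation.

```c
#include <stdint.h>

#define NO_SLOT UINT32_MAX

/* Hypothetical two-layer layout; names are illustrative only. */
struct two_layer_swap {
	uint32_t *virt_to_phys;  /* virtual slot -> physical slot */
	uint32_t *phys_to_virt;  /* physical slot -> virtual slot (reverse map) */
};

/* Record that virtual slot `v` has been written back to physical slot `p`. */
static void swap_bind(struct two_layer_swap *s, uint32_t v, uint32_t p)
{
	s->virt_to_phys[v] = p;
	s->phys_to_virt[p] = v;
}

/*
 * Swapoff or migration of physical slot `p`: the reverse entry names the
 * owning virtual slot directly, so no scan is needed. Both links are
 * cleared and the virtual slot that must now hold the data is returned,
 * or NO_SLOT if the physical slot was not in use.
 */
static uint32_t swap_evict_phys(struct two_layer_swap *s, uint32_t p)
{
	uint32_t v = s->phys_to_virt[p];

	if (v != NO_SLOT) {
		s->virt_to_phys[v] = NO_SLOT;
		s->phys_to_virt[p] = NO_SLOT;
	}
	return v;
}
```

The physical side stores only the back pointer, not a second copy of the
slot's metadata, which is the sense in which nothing is duplicated.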

>
> > - Minimal code review or maintenance burden. All layers are using the exact
> >   same infrastructure for metadata / allocation / synchronization, making
> >   all API and conventions consistent and easy to maintain.
> > - Writeback, migration and compaction are easily supportable since both
> >   reverse mapping and reallocation are prepared. We just need a
> >   folio_realloc_swap to allocate new entries for the existing entry, and
> >   fill the swap table with a reserve map entry.
> > - Fast swapoff: Just read into ghost / virtual swap cache.
>
> Can we get this for disk swap as well? ;)
>
> Zswap swapoff is already fairly fast, albeit CPU intense. It's the
> scattered IO that makes swapoff on disks so terrible.

I am talking about disk swap here, not zswap. Swapoff of a physical
entry just loads the swap data into the virtual slot according to the
reverse mapping entry.

> > free -m
> >                total        used        free      shared  buff/cache   available
> > Mem:            1465         250         927           1         356        1215
> > Swap:       15269887           0    15269887
>
> I'm not a fan of this. This makes free(1) output kind of useless, and
> very misleading. The swap space presented here has nothing to do with
> actual swap capacity, and the actual disk swap capacity is obscured.
>
> And how would a user choose this size? How would a distribution?

It can be dynamic (just si->max += 2M on every cluster allocation,
since it's really just a number now). It can be hidden, and it can have
an infinite size. That's just an interface design that can be flexibly
changed.

For example, if we just set this to a super large value and hide it, it
will look identical to vss from a userspace perspective, but stay
optional and zero overhead for existing ZRAM or plain swap users.

> The only limit is compression ratio, and you don't know this in
> advance. This restriction seems pretty arbitrary and avoidable.

Just as a reference: In practice we limit our ZRAM setup to 1/4 or 1:1
of the total RAM to keep the machine from entering endless reclaim and
never going OOM.

But we can also have an infinite size ZSWAP now, with this series.

Re: [PATCH RFC 00/15] mm, swap: swap table phase IV with dynamic ghost swapfile

Posted by Barry Song 1 month, 1 week ago

On Fri, Feb 20, 2026 at 7:42 AM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@kernel.org> wrote:
>
> NOTE for an RFC quality series: Swap table P4 is patch 1 - 12, and the
> dynamic ghost file is patch 13 - 15. Putting them together as RFC for
> easier review and discussions. Swap table P4 is stable and good to merge
> if we are OK with a few memcg reparent behavior (there is also a
> solution if we don't), dynamic ghost swap is yet a minimal proof of
> concept. See patch 15 for more details. And see below for Swap table 4
> cover letter (nice performance gain and memory save).

To be honest, I really dislike the name "ghost." I would
prefer something that reflects its actual functionality.
"Ghost" does not describe what it does and feels rather
arbitrary.

I suggest retiring the name "ghost" and replacing it with
something more appropriate. "vswap" could be a good option,
but Nhat is already using that name.

> Currently, the dynamic ghost files are just reported as ordinary swap files
> in /proc/swaps and we can have multiple ones, so users will have a full
> view of what's going on. This is a very easy-to-change design decision.
> I'm open to ideas about how we should present this to users. e.g., Hiding
> it will make it more "virtual", but I don't think that's a good idea.

Even if it remains visible in /proc/swaps, I would rather
not represent it as a real file in any filesystem. Putting
a "ghost" swapfile on something like ext4 seems unnatural.

>
> The size of the swapfile (si->max) is now just a number, which could be
> changeable at runtime if we have a proper idea how to expose that and
> might need some audit of a few remaining users. But right now, we can
> already easily have a huge swap device with no overhead, for example:
>
> free -m
>                total        used        free      shared  buff/cache   available
> Mem:            1465         250         927           1         356        1215
> Swap:       15269887           0    15269887
>
> And for easier testing, I added a /dev/ghostswap in this RFC. `swapon
> /dev/ghostswap` enables that. Without swapon /dev/ghostswap, any existing
> users, including ZRAM, won't observe any change.

/dev/ghostswap is assumed to be a virtual block device or
something similar? If it is a block device, how is its size
related to si->size?

Looking at [PATCH RFC 14/15] mm, swap: add a special device
for ghost swap setup, it appears to be a character device.
This feels very odd to me. I’m not in favor of coupling the
ghost swapfile with a memdev character device.
A cdev should be a true character device.

Thanks
Barry

Re: [PATCH RFC 00/15] mm, swap: swap table phase IV with dynamic ghost swapfile

Posted by Kairui Song 1 month, 1 week ago

On Sat, Feb 21, 2026 at 4:16 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Fri, Feb 20, 2026 at 7:42 AM Kairui Song via B4 Relay
> <devnull+kasong.tencent.com@kernel.org> wrote:
>
> To be honest, I really dislike the name "ghost." I would
> prefer something that reflects its actual functionality.
> "Ghost" does not describe what it does and feels rather
> arbitrary.

Hi Barry,

That can be easily changed with "search and replace". I kept the name
since patch 13 is directly from Chris and I just didn't change it.

>
> I suggest retiring the name "ghost" and replacing it with
> something more appropriate. "vswap" could be a good option,

That looks good to me too. You can also check page 23 of the slides from
LSFMM last year to see how I imagined things would work out at that
time:
https://drive.google.com/file/d/1_QKlXErUkQ-TXmJJy79fJoLPui9TGK1S/view

The actual layout will be a bit different from that slide, since the
redirect entry will be in the lower devices, and the virtual device will
have an extra virtual table to hold its redirect entry. Still, I'm glad
that plain swap has zero overhead, so ZRAM or high-performance NVMe is
still good.

> > Currently, the dynamic ghost files are just reported as ordinary swap files
> > in /proc/swaps and we can have multiple ones, so users will have a full
> > view of what's going on. This is a very easy-to-change design decision.
> > I'm open to ideas about how we should present this to users. e.g., Hiding
> > it will make it more "virtual", but I don't think that's a good idea.
>
> Even if it remains visible in /proc/swaps, I would rather
> not represent it as a real file in any filesystem. Putting
> a "ghost" swapfile on something like ext4 seems unnatural.

What do you think about this? Here is the output after this series:
# swapon
NAME           TYPE       SIZE USED PRIO
/dev/ghostswap ghost     11.5G 821M   -1
/dev/ram0      partition 1024G 9.9M   -1
/dev/vdb2      partition    2G 112K   -1

Or we can rename it to:
# swapon
NAME           TYPE       SIZE USED PRIO
/dev/xswap     xswap     11.5G 821M   -1
/dev/ram0      partition 1024G 9.9M   -1
/dev/vdb2      partition    2G 112K   -1

swapon /dev/xswap will enable this layer (for now I just hardcoded it
to be 8 times the size of total ram). swapoff /dev/xswap disables it.
We can also change the priority.

We can also hide it.

> > And for easier testing, I added a /dev/ghostswap in this RFC. `swapon
> > /dev/ghostswap` enables that. Without swapon /dev/ghostswap, any existing
> > users, including ZRAM, won't observe any change.
>
> /dev/ghostswap is assumed to be a virtual block device or
> something similar? If it is a block device, how is its size
> related to si->size?

It's not a real device, just a placeholder to make swapon usable
without any modification for easier testing (some user-space
implementations don't work well with a dummy header). And it has
nothing to do with the si->size.

>
> Looking at [PATCH RFC 14/15] mm, swap: add a special device
> for ghost swap setup, it appears to be a character device.
> This feels very odd to me. I’m not in favor of coupling the
> ghost swapfile with a memdev character device.
> A cdev should be a true character device.

No coupling at all, it's just a placeholder so swapon (the syscall)
knows it's a virtual device, which is just an alternative to the dummy
header approach from Chris, so people can test it more easily.

The si->size is just a number and any value can be given. I just
haven't decided how we should pass the number to the kernel or just
make it dynamic: e.g. set it to total ram size and increase by 2M
every time a new cluster is used.
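
The two sizing options mentioned in this mail (a fixed 8x multiple of
total RAM, or starting at the RAM size and growing by 2M per new cluster)
can be sketched in user-space C as follows. This is only an illustration
under stated assumptions: 4 KiB pages, and every name here
(`vswap_size`, `vswap_fixed_pages`, `vswap_dynamic_init`,
`vswap_on_new_cluster`) is hypothetical and not taken from the series.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096ULL                  /* assumption: 4 KiB pages */
#define CLUSTER_BYTES   (2ULL << 20)             /* 2M per cluster, as in the mail */
#define CLUSTER_PAGES   (CLUSTER_BYTES / PAGE_SIZE_BYTES)

/* Hypothetical stand-in for the advertised size of a virtual swap device. */
struct vswap_size {
	uint64_t pages;                          /* advertised size, in pages */
};

/* Option A: a fixed multiple of RAM (the RFC hardcodes 8x for now). */
static inline uint64_t vswap_fixed_pages(uint64_t ram_pages)
{
	return ram_pages * 8;
}

/* Option B: start at the RAM size ... */
static inline void vswap_dynamic_init(struct vswap_size *vs, uint64_t ram_pages)
{
	vs->pages = ram_pages;
}

/* ... and grow by one cluster each time a new cluster is put to use. */
static inline void vswap_on_new_cluster(struct vswap_size *vs)
{
	assert(vs->pages <= UINT64_MAX - CLUSTER_PAGES); /* no wraparound */
	vs->pages += CLUSTER_PAGES;
}
```

With 4 KiB pages each new cluster adds 512 pages to the advertised size,
so the number reported to user space would track actual cluster usage
instead of being an arbitrary constant.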

Re: [PATCH RFC 00/15] mm, swap: swap table phase IV with dynamic ghost swapfile

Posted by Barry Song 1 month, 1 week ago

On Sat, Feb 21, 2026 at 5:07 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Sat, Feb 21, 2026 at 4:16 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Fri, Feb 20, 2026 at 7:42 AM Kairui Song via B4 Relay
> > <devnull+kasong.tencent.com@kernel.org> wrote:
> >
> > To be honest, I really dislike the name "ghost." I would
> > prefer something that reflects its actual functionality.
> > "Ghost" does not describe what it does and feels rather
> > arbitrary.
>
> Hi Barry,
>
> That can be easily changed by "search and replace", I just kept the
> name since patch 13 is directly from Chris and I just didn't change
> it.
>
> >
> > I suggest retiring the name "ghost" and replacing it with
> > something more appropriate. "vswap" could be a good option,
>
> That looks good to me too. You can also check page 23 of the slides
> from LSFMM last year to see how I imagined things would work out at
> that time:
> https://drive.google.com/file/d/1_QKlXErUkQ-TXmJJy79fJoLPui9TGK1S/view
>
> The actual layout will be a bit different from that slide, since the
> redirect entry will be in the lower devices, the virtual device will
> have an extra virtual table to hold its redirect entry. But still I'm
> glad that plain swap still has zero overhead so ZRAM or high
> performance NVME is still good.
>
> > > Currently, the dynamic ghost files are just reported as ordinary swap files
> > > in /proc/swaps and we can have multiple ones, so users will have a full
> > > view of what's going on. This is a very easy-to-change design decision.
> > > I'm open to ideas about how we should present this to users. e.g., Hiding
> > > it will make it more "virtual", but I don't think that's a good idea.
> >
> > Even if it remains visible in /proc/swaps, I would rather
> > not represent it as a real file in any filesystem. Putting
> > a "ghost" swapfile on something like ext4 seems unnatural.
>
> What do you think about this? Here is the output after this series:
> # swapon
> NAME           TYPE       SIZE USED PRIO
> /dev/ghostswap ghost     11.5G 821M   -1
> /dev/ram0      partition 1024G 9.9M   -1
> /dev/vdb2      partition    2G 112K   -1

I’d rather have a “virtual” block device, /dev/xswap, with
its size displayed as 11.5G via `ls -l filename`. This is
also more natural than relying on a cdev placeholder.

If

>
> Or we can rename it to:
> # swapon
> NAME           TYPE       SIZE USED PRIO
> /dev/xswap     xswap     11.5G 821M   -1
> /dev/ram0      partition 1024G 9.9M   -1
> /dev/vdb2      partition    2G 112K   -1
>
> swapon /dev/xswap will enable this layer (for now I just hardcoded it
> to be 8 times the size of total ram). swapoff /dev/xswap disables it.
> We can also change the priority.
>
> We can also hide it.
>
> > > And for easier testing, I added a /dev/ghostswap in this RFC. `swapon
> > > /dev/ghostswap` enables that. Without swapon /dev/ghostswap, any existing
> > > users, including ZRAM, won't observe any change.
> >
> > /dev/ghostswap is assumed to be a virtual block device or
> > something similar? If it is a block device, how is its size
> > related to si->size?
>
> It's not a real device, just a placeholder to make swapon usable
> without any modification for easier testing (some user-space
> implementations don't work well with a dummy header). And it has
> nothing to do with the si->size.

I understand it is a placeholder for swap, but if it appears
as /dev/ghostswap, users browsing /dev/ will see it as a
real cdev. A /dev/chardev is intended for user read/write
access.
Also, udev rules can act on an exported cdev. This couples
us with a lot of userspace behavior.

>
> >
> > Looking at [PATCH RFC 14/15] mm, swap: add a special device
> > for ghost swap setup, it appears to be a character device.
> > This feels very odd to me. I’m not in favor of coupling the
> > ghost swapfile with a memdev character device.
> > A cdev should be a true character device.
>
> No coupling at all, it's just a placeholder so swapon (the syscall)
> knows it's a virtual device, which is just an alternative to the dummy
> header approach from Chris, so people can test it more easily.

Using a cdev as a placeholder has introduced behavioral
coupling. For swap, it serves as a placeholder; for anything
outside swap, it behaves as a regular cdev.

>
> The si->size is just a number and any value can be given. I just
> haven't decided how we should pass the number to the kernel or just
> make it dynamic: e.g. set it to total ram size and increase by 2M
> every time a new cluster is used.

Re: [PATCH RFC 00/15] mm, swap: swap table phase IV with dynamic ghost swapfile

Posted by Nhat Pham 1 month, 1 week ago

On Thu, Feb 19, 2026 at 3:42 PM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@kernel.org> wrote:
>
> NOTE for an RFC quality series: Swap table P4 is patch 1 - 12, and the
> dynamic ghost file is patch 13 - 15. Putting them together as RFC for
> easier review and discussions. Swap table P4 is stable and good to merge
> if we are OK with a few memcg reparent behavior (there is also a
> solution if we don't), dynamic ghost swap is yet a minimal proof of
> concept. See patch 15 for more details. And see below for Swap table 4
> cover letter (nice performance gain and memory save).
>
> This is based on the latest mm-unstable, swap table P3 [1] and patches
> [2] and [3], [4]. Sending this out early, as it might be helpful for us
> to get a cleaner picture of the ongoing efforts, make the discussions easier.
>
> Summary: With this approach, we can have an infinitely or dynamically
> large ghost which could be identical to "virtual swap", and support
> every feature we need while being *runtime configurable* with *zero
> overhead* for plain swap and keep the infrastructure unified. Also
> highly compatible with YoungJun's swap tiering [5], and other ideas like
> swap table compaction, swapops, as it aligns with a few proposals [6]
> [7] [8] [9] [10].
>
> In the past two years, most efforts have focused on the swap
> infrastructure, and we have made tremendous gains in performance,
> keeping the memory usage reasonable or lower, and also greatly cleaned
> up and simplified the API and conventions.
>
> Now the infrastructures are almost ready, after P4, implementing an
> infinitely or dynamically large swapfile can be done in a very easy to
> maintain and flexible way, code change is minimal and progressive
> for review, and makes future optimization like swap table compaction
> doable too, since the infrastructure is all the same for all swaps.
>
> The dynamic swap file is now using Xarray for the cluster info, and
> inside the cluster, it's all the same swap allocator, swap table, and
> existing infrastructures. A virtual table is available for any extra
> data or usage. See below for the benefits and what we can achieve.
>
> Huge thanks to Chris Li for the layered swap table and ghost swapfile
> idea, without whom the work here can't be achieved. Also, thanks to Nhat
> for pushing and suggesting using an Xarray for the swapfile [11] for
> dynamic size. I was originally planning to use a dynamic cluster
> array, which requires a bit more adaptation, cleanup, and convention
> changes. But during the discussion there, I got the inspiration that
> Xarray can be used as the intermediate step, making this approach
> doable with minimal changes. Just keep using it in the future, it
> might not hurt too, as Xarray is only limited to ghost / virtual
> files, so plain swaps won't have any extra overhead for lookup or high
> risk of swapout allocation failure.

Thanks for your effort. Dynamic swap space is a very important
consideration for anyone deploying a compressed swap backend on large
memory systems in general. And yeah, I think using a radix tree/xarray
is the easiest out-of-the-box solution for this - thanks for citing me :P

I still have some confusion and concerns though. Johannes already made
some good points - I'll just add some thoughts from my point of view,
having gone back and forth with virtual swap designs:

1. At which layer should the metadata (swap count, swap cgroup, etc.) live?

I remember that in your LSFMMBPF presentation (time flies), your
proposal was to store a redirection entry in the top layer, and keep
all the metadata at the bottom (i.e. backend) layer? This has problems
- for one, you might not know the suitable backend at swap allocation
time, but only at writeout time. For example, in certain zswap setups, we
reject the incompressible page and cycle it back to the active LRU, so
we have no space in the zswap layer to store swap entry metadata (note
that at this point the swap entry cannot be freed, because we have
already unmapped the page from the PTEs, and freeing it would require a
page table walk to undo this a la swapoff). Similarly, when we
exclusive-load a page from zswap, we invalidate the zswap metadata
struct, so we will no longer have a space for the swap entry metadata.

The zero-filled (or same-filled) swap entry case is an even more
egregious example :) It really shouldn't be a state in any backend -
it should be a completely independent backend.

The only design that makes sense is to store metadata in the top layer
as well. It's what I'm doing for my virtual swap patch series, but if
we're pursuing this opt-in swapfile direction we are going to
duplicate metadata :)

>
> I'm fully open and totally fine for suggestions on naming or API
> strategy, and others are highly welcome to keep the work going using
> this flexible approach. Following this approach, we will have all the
> following things progressively (some are already or almost there):
>
> - 8 bytes per slot memory usage, when using only plain swap.
>   - And the memory usage can be reduced to 3 or only 1 byte.
> - 16 bytes per slot memory usage, when using ghost / virtual zswap.
>   - Zswap can just use ci_dyn->virtual_table to free up its content
>     completely.
>   - And the memory usage can be reduced to 11 or 8 bytes using the same
>     code above.
>   - 24 bytes only if including reverse mapping is in use.
> - Minimal code review or maintenance burden. All layers are using the exact
>   same infrastructure for metadata / allocation / synchronization, making
>   all API and conventions consistent and easy to maintain.
> - Writeback, migration and compaction are easily supportable since both
>   reverse mapping and reallocation are prepared. We just need a
>   folio_realloc_swap to allocate new entries for the existing entry, and
>   fill the swap table with a reserve map entry.
> - Fast swapoff: Just read into ghost / virtual swap cache.
> - Zero static data (mostly due to swap table P4), even the clusters are
>   dynamic (If using Xarray, only for ghost / virtual swap file).
> - So we can have an infinitely sized swap space with no static data
>   overhead.
> - Everything is runtime configurable, and high-performance. An
>   uncompressible workload or an offline batch workload can directly use a
>   plain or remote swap for the lowest interference, memory usage, or for
>   best performance.
> - Highly compatible with YoungJun's swap tiering, even the ghost / virtual
>   file can be just a tier. For example, if you have a huge NBD that doesn't
>   care about fragmentation and compression, or the workload is
>   uncompressible, setting the workload to use NBD's tier will give you only
>   8 bytes of overhead per slot and peak performance, bypassing everything.
>   Meanwhile, other workloads or cgroups can still use the ghost layer with
>   compression or defragmentation using 16 bytes (zswap only) or 24 bytes
>   (ghost swap with physical writeback) overhead.
> - No force or breaking change to any existing allocation, priority, swap
>   setup, or reclaim strategy. Ghost / virtual swap can be enabled or
>   disabled using swapon / swapoff.
>
> And if you consider these ops are too complex to set up and maintain, we
> can then only allow one ghost / virtual file, make it infinitely large,
> and be the default one and top tier, then it achieves the identical thing
> to virtual swap space, but with much fewer LOC changed and being runtime
> optional.

2. I think the "fewer LOC changed" claim here is misleading ;)

A lot of the behaviors that are required in a virtual swap setup are
missing from this patch series. You are essentially just implementing
a swapfile with a dynamic allocator. You still need a bunch more logic
to support a proper multi-tier virtual swap setup - just off the top of
my head:

a. Charging: virtual swap usage should not be charged the same as the
physical swap usage, especially when you have a zswap + disk swap
setup, powered by virtual swap. For one, I don't believe in sizing
virtual swap, but also a latency-sensitive cgroup allowed to use only
zswap (backed by virtual swap) is using and competing for resources
very differently from a cgroup whose memory is incompressible and only
allowed to use disk swap.

b. Backend decision making and efficient backend transfer - as you
said, "folio_realloc_swap" is yet to be implemented :) And as I
mentioned earlier, we CANNOT determine swap backend before PTE unmap
time, because backend suitability is content-dependent. You will have
to add extra logic to handle this nuanced swap allocation behavior.

c. Virtual swap freeing - it requires more work, as you have to free
both the virtual swap entry itself, as well as digging into the
physical backend layer.

d. Swapoff - now you have to handle both page tables and the virtual swap table.

By the time you implement all of this, I think it will be MORE
complex, especially since you want to maintain BOTH the new setup and
the old non-virtual swap setup. You'll have to litter the code with a
bunch of ifs (or ifdefs) to check - hey do we have a virtual swapfile?
Hey is this a virtual swap slot? Etc. Etc. everywhere, from the PTE
infra (zapping, page fault, etc.), to cgroup infra, to physical swap
architecture.

Comparing this line of work by itself with the vswap series, which
already comes with all of these included, is a bit apples-to-oranges
(and especially with the fact that vswap simplifies logic and removes
LoCs in a lot of places too, such as in swapoff. The delta LoC is only
300-400 IIRC?).

>
> Currently, the dynamic ghost files are just reported as ordinary swap files
> in /proc/swaps and we can have multiple ones, so users will have a full
> view of what's going on. This is a very easy-to-change design decision.
> I'm open to ideas about how we should present this to users. e.g., Hiding
> it will make it more "virtual", but I don't think that's a good idea.
>
> The size of the swapfile (si->max) is now just a number, which could be
> changeable at runtime if we have a proper idea how to expose that and
> might need some audit of a few remaining users. But right now, we can
> already easily have a huge swap device with no overhead, for example:
>
> free -m
>                total        used        free      shared  buff/cache   available
> Mem:            1465         250         927           1         356        1215
> Swap:       15269887           0    15269887
>

3. I don't think we should expose virtual swap state to users (in this
case, in the swapfile summary view, i.e. in free). It is just confusing,
as it poorly reflects the physical state (be it compressed memory
footprint, or actual disk usage). We obviously should expose a bunch
of sysfs debug counters for troubleshooting, but for average users,
it should be all transparent.

> And for easier testing, I added a /dev/ghostswap in this RFC. `swapon
> /dev/ghostswap` enables that. Without swapon /dev/ghostswap, any existing
> users, including ZRAM, won't observe any change.
>
> ===
>
> Original cover letter for swap table phase IV:
>
> This series unifies the allocation and charging process of anon and shmem,
> provides better synchronization, and consolidates cgroup tracking, hence
> dropping the cgroup array and improving the performance of mTHP by about
> ~15%.
>
> Still testing with build kernel under great pressure, enabling mTHP 256kB,
> on an EPYC 7K62 using 16G ZRAM, make -j48 with 1G memory limit, 12 test
> runs:
>
> Before: 2215.55s system, 2:53.03 elapsed
> After:  1852.14s system, 2:41.44 elapsed (16.4% faster system time)
>
> In some workloads, the speed gain is more than that since this reduces
> memory thrashing, so even IO-bound work could benefit a lot, and I no
> longer see any: "Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying
> PF", it was shown from time to time before this series.
>
> Now, the swap cache layer ensures a folio will be the exclusive owner of
> the swap slot, then charge it, which leads to much smaller thrashing when
> under pressure.
>
> And besides, the swap cgroup static array is gone, so for example, mounting
> a 1TB swap device saves about 512MB of memory:
>
> Before:
>         total     used     free     shared  buff/cache available
> Mem:    1465      854      331      1       347        610
> Swap:   1048575   0        1048575
>
> After:
>         total     used     free     shared  buff/cache available
> Mem:    1465      332      838      1       363        1133
> Swap:   1048575   0        1048575
>
> It saves us ~512M of memory, we now have close to 0 static overhead.
>
> Link: https://lore.kernel.org/linux-mm/20260218-swap-table-p3-v3-0-f4e34be021a7@tencent.com/ [1]
> Link: https://lore.kernel.org/linux-mm/20260213-memcg-privid-v1-1-d8cb7afcf831@tencent.com/ [2]
> Link: https://lore.kernel.org/linux-mm/20260211-shmem-swap-gfp-v1-1-e9781099a861@tencent.com/ [3]
> Link: https://lore.kernel.org/linux-mm/20260216-hibernate-perf-v4-0-1ba9f0bf1ec9@tencent.com/ [4]
> Link: https://lore.kernel.org/linux-mm/20260217000950.4015880-1-youngjun.park@lge.com/ [5]
> Link: https://lore.kernel.org/all/CAMgjq7BvQ0ZXvyLGp2YP96+i+6COCBBJCYmjXHGBnfisCAb8VA@mail.gmail.com/ [6]
> Link: https://lwn.net/Articles/974587/ [7]
> Link: https://lwn.net/Articles/932077/ [8]
> Link: https://lwn.net/Articles/1016136/ [9]
> Link: https://lore.kernel.org/linux-mm/20260208215839.87595-1-nphamcs@gmail.com/ [10]
> Link: https://lore.kernel.org/linux-mm/CAKEwX=OUni7PuUqGQUhbMDtErurFN_i=1RgzyQsNXy4LABhXoA@mail.gmail.com/ [11]
>
> Signed-off-by: Kairui Song <kasong@tencent.com>
> ---
> Chris Li (1):
>       mm: ghost swapfile support for zswap
>
> Kairui Song (14):
>       mm: move thp_limit_gfp_mask to header
>       mm, swap: simplify swap_cache_alloc_folio
>       mm, swap: move conflict checking logic of out swap cache adding
>       mm, swap: add support for large order folios in swap cache directly
>       mm, swap: unify large folio allocation
>       memcg, swap: reparent the swap entry on swapin if swapout cgroup is dead
>       memcg, swap: defer the recording of memcg info and reparent flexibly
>       mm, swap: store and check memcg info in the swap table
>       mm, swap: support flexible batch freeing of slots in different memcg
>       mm, swap: always retrieve memcg id from swap table
>       mm/swap, memcg: remove swap cgroup array
>       mm, swap: merge zeromap into swap table
>       mm, swap: add a special device for ghost swap setup
>       mm, swap: allocate cluster dynamically for ghost swapfile
>
>  MAINTAINERS                 |   1 -
>  drivers/char/mem.c          |  39 ++++
>  include/linux/huge_mm.h     |  24 +++
>  include/linux/memcontrol.h  |  12 +-
>  include/linux/swap.h        |  30 ++-
>  include/linux/swap_cgroup.h |  47 -----
>  mm/Makefile                 |   3 -
>  mm/internal.h               |  25 ++-
>  mm/memcontrol-v1.c          |  78 ++++----
>  mm/memcontrol.c             | 119 ++++++++++--
>  mm/memory.c                 |  89 ++-------
>  mm/page_io.c                |  46 +++--
>  mm/shmem.c                  | 122 +++---------
>  mm/swap.h                   | 122 +++++-------
>  mm/swap_cgroup.c            | 172 ----------------
>  mm/swap_state.c             | 464 ++++++++++++++++++++++++--------------------
>  mm/swap_table.h             | 105 ++++++++--
>  mm/swapfile.c               | 278 ++++++++++++++++++++------
>  mm/vmscan.c                 |   7 +-
>  mm/workingset.c             |  16 +-
>  mm/zswap.c                  |  29 +--
>  21 files changed, 977 insertions(+), 851 deletions(-)
> ---
> base-commit: 4750368e2cd365ac1e02c6919013c8871f35d8f9
> change-id: 20260111-swap-table-p4-98ee92baa7c4
>
> Best regards,
> --
> Kairui Song <kasong@tencent.com>
>
>

Re: [PATCH RFC 00/15] mm, swap: swap table phase IV with dynamic ghost swapfile

Posted by Kairui Song 1 month, 1 week ago

On Tue, Feb 24, 2026 at 2:22 AM Nhat Pham <nphamcs@gmail.com> wrote:
>
> On Thu, Feb 19, 2026 at 3:42 PM Kairui Song via B4 Relay
> <devnull+kasong.tencent.com@kernel.org> wrote:
> > Huge thanks to Chris Li for the layered swap table and ghost swapfile
> > idea, without whom the work here can't be achieved. Also, thanks to Nhat
> > for pushing and suggesting using an Xarray for the swapfile [11] for
> > dynamic size. I was originally planning to use a dynamic cluster
> > array, which requires a bit more adaptation, cleanup, and convention
> > changes. But during the discussion there, I got the inspiration that
> > Xarray can be used as the intermediate step, making this approach
> > doable with minimal changes. Just keep using it in the future, it
> > might not hurt too, as Xarray is only limited to ghost / virtual
> > files, so plain swaps won't have any extra overhead for lookup or high
> > risk of swapout allocation failure.
>
> Thanks for your effort. Dynamic swap space is a very important
> consideration for anyone deploying a compressed swap backend on large
> memory systems in general. And yeah, I think using a radix tree/xarray
> is the easiest out-of-the-box solution for this - thanks for citing me :P

Thanks for the discussion :)

>
> I still have some confusion and concerns though. Johannes already made
> some good points - I'll just add some thoughts from my point of view,
> having gone back and forth with virtual swap designs:
>
> 1. At which layer should the metadata (swap count, swap cgroup, etc.) live?
>
> I remember that in your LSFMMBPF presentation (time flies), your
> proposal was to store a redirection entry in the top layer, and keep
> all the metadata at the bottom (i.e. backend) layer? This has problems
> - for one, you might not know the suitable backend at swap allocation
> time, but only at writeout time. For example, in certain zswap setups, we
> reject the incompressible page and cycle it back to the active LRU, so
> we have no space in the zswap layer to store swap entry metadata (note
> that at this point the swap entry cannot be freed, because we have
> already unmapped the page from the PTEs, and freeing it would require a
> page table walk to undo this a la swapoff). Similarly, when we
> exclusive-load a page from zswap, we invalidate the zswap metadata
> struct, so we will no longer have a space for the swap entry metadata.
>
> The zero-filled (or same-filled) swap entry case is an even more
> egregious example :) It really shouldn't be a state in any backend -
> it should be a completely independent backend.
>
> The only design that makes sense is to store metadata in the top layer
> as well. It's what I'm doing for my virtual swap patch series, but if
> we're pursuing this opt-in swapfile direction we are going to
> duplicate metadata :)

It's already doing that, storing metadata at the top layer, only a
reverse mapping in the lower layer.

So none of these issues apply anymore. Don't worry, I do remember
that conversation and kept that in mind :)
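
To make the layering described here concrete, below is a minimal
user-space sketch: the top (virtual) layer owns the metadata and a
redirect to the backend slot, while the lower layer stores only a
reverse mapping. All struct and function names (`virt_slot`, `phys_slot`,
`virt_alloc`, `virt_bind`, `virt_free`) and the field widths are
hypothetical and chosen for illustration; the actual swap table encoding
in the series is different.

```c
#include <assert.h>
#include <stdint.h>

#define NSLOTS  8
#define NO_SLOT UINT32_MAX

/* Top (virtual) layer: owns the metadata plus a redirect to the backend. */
struct virt_slot {
	uint16_t swap_count;
	uint16_t memcg_id;
	uint32_t redirect;      /* backend slot index, or NO_SLOT if unbound */
};

/* Lower (physical) layer: only a reverse map back to the virtual slot. */
struct phys_slot {
	uint32_t rmap;          /* virtual slot index, or NO_SLOT if free */
};

struct layered_swap {
	struct virt_slot virt[NSLOTS];
	struct phys_slot phys[NSLOTS];
};

static inline void layered_init(struct layered_swap *ls)
{
	for (int i = 0; i < NSLOTS; i++) {
		ls->virt[i] = (struct virt_slot){ 0, 0, NO_SLOT };
		ls->phys[i].rmap = NO_SLOT;
	}
}

/* Allocate the virtual slot first; no backend has been chosen yet. */
static inline void virt_alloc(struct layered_swap *ls, uint32_t v, uint16_t memcg_id)
{
	assert(v < NSLOTS && ls->virt[v].swap_count == 0);
	ls->virt[v].swap_count = 1;
	ls->virt[v].memcg_id = memcg_id;
}

/* Later (e.g. at writeback time) bind the virtual slot to a backend slot. */
static inline void virt_bind(struct layered_swap *ls, uint32_t v, uint32_t p)
{
	assert(v < NSLOTS && p < NSLOTS);
	assert(ls->virt[v].swap_count > 0 && ls->phys[p].rmap == NO_SLOT);
	ls->virt[v].redirect = p;
	ls->phys[p].rmap = v;
}

/* Freeing the virtual slot also releases its backend slot, if any. */
static inline void virt_free(struct layered_swap *ls, uint32_t v)
{
	assert(v < NSLOTS);
	uint32_t p = ls->virt[v].redirect;

	if (p != NO_SLOT)
		ls->phys[p].rmap = NO_SLOT;
	ls->virt[v] = (struct virt_slot){ 0, 0, NO_SLOT };
}
```

In this model a virtual slot can exist with no backend at all (rejected
by zswap, zero-filled, or exclusively loaded), because the count and the
memcg id never depend on the lower layer.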

> > And if you consider these ops are too complex to set up and maintain, we
> > can then only allow one ghost / virtual file, make it infinitely large,
> > and be the default one and top tier, then it achieves the identical thing
> > to virtual swap space, but with much fewer LOC changed and being runtime
> > optional.
>
> 2. I think the "fewer LOC changed" claim here is misleading ;)
>
> A lot of the behaviors that are required in a virtual swap setup are
> missing from this patch series. You are essentially just implementing
> a swapfile with a dynamic allocator. You still need a bunch more logic
> to support a proper multi-tier virtual swap setup - just off the top of
> my head:

I left that part undone kind of on purpose, since it's only an RFC, and
in the hope that there could be collaboration.

And the dynamic allocator is only ~200 LOC now. Other parts of this
series are not only for virtual swap. For example, the unified folio
alloc for swapin, which gives us a 15% performance gain in real
workloads, can still get merged and benefit all of us without
involving the virtual swap or memcg part.

And meanwhile, with the later patches, we don't have to re-implement
the whole infrastructure to have a virtual table. And future plans
like data compaction should benefit every layer naturally (same
infra).

> a. Charging: virtual swap usage should not be charged the same as the
> physical swap usage, especially when you have a zswap + disk swap
> setup, powered by virtual swap. For one, I don't believe in sizing
> virtual swap, but also a latency-sensitive cgroup allowed to use only
> zswap (backed by virtual swap) is using and competing for resources
> very differently from a cgroup whose memory is incompressible and only
> allowed to use disk swap.

Ah, now as you mention it, I see in the beginning of this series I
added: "Swap table P4 is stable and good to merge if we are OK with a
few memcg reparent behavior (there is also a solution if we don't)".
The "other solution" also fits your different charge idea here. Just
have a ci->memcg_table, then each layer can have their own charge
design, and the shadow is still only used for refault check. That
gives us 10 bytes per slot overhead though, but still lower than
before and stays completely dynamic.

Also, no duplicated memcg, since the upper layer and lower layer
should be charged differently. If they don't, then just let
ci->memcg_table stay NULL.
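
A rough user-space sketch of the optional per-cluster memcg table idea
follows. It assumes a 512-slot cluster, an 8-byte swap table entry, and
that the extra 2 bytes per slot are a 16-bit memcg id (which is what
would make the total 10 bytes per slot). The struct layout and the helper
names `cluster_enable_memcg` and `cluster_bytes_per_slot` are
hypothetical, not the series' code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SLOTS_PER_CLUSTER 512   /* assumption: 2M cluster of 4 KiB pages */

struct cluster {
	uint64_t  table[SLOTS_PER_CLUSTER]; /* 8 bytes per slot, always present */
	uint16_t *memcg_table;              /* NULL unless this layer charges on its own */
};

/* Allocate the per-layer memcg table on demand; returns 0 on success. */
static inline int cluster_enable_memcg(struct cluster *ci)
{
	if (!ci->memcg_table)
		ci->memcg_table = calloc(SLOTS_PER_CLUSTER, sizeof(*ci->memcg_table));
	return ci->memcg_table ? 0 : -1;
}

/* Per-slot metadata cost of this cluster, in bytes. */
static inline size_t cluster_bytes_per_slot(const struct cluster *ci)
{
	size_t bytes = sizeof(ci->table[0]);

	if (ci->memcg_table)
		bytes += sizeof(ci->memcg_table[0]);
	return bytes;
}
```

A layer that shares charging with the layer above it leaves the pointer
NULL and stays at 8 bytes per slot; only a layer with its own charging
policy pays for the additional table.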

>
> b. Backend decision making and efficient backend transfer - as you
> said, "folio_realloc_swap" is yet to be implemented :) And as I
> mentioned earlier, we CANNOT determine swap backend before PTE unmap

And we are not doing that at all. folio_alloc_swap happens before
unmap, but realloc happens after that. VSS does the same thing.

> time, because backend suitability is content-dependent. You will have
> to add extra logic to handle this nuanced swap allocation behavior.
>
> c. Virtual swap freeing - it requires more work, as you have to free
> both the virtual swap entry itself, as well as digging into the
> physical backend layer.
>
> d. Swapoff - now you have to handle both page tables and the virtual swap table.

Swapoff is actually easy here... If it sees a reverse map slot, it reads
the data into the upper layer. Otherwise it falls back to the old logic.
Then it's done. If ghost swap is the layer with the highest priority,
then every slot is a reverse map slot.
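
The per-slot decision described above could look roughly like the
following user-space sketch. The enum values and function names
(`slot_kind`, `swapoff_action_for`, `swapoff_count_upper_reads`) are
made up for illustration and do not correspond to identifiers in the
series.

```c
#include <assert.h>

/* Hypothetical classification of a slot on the device being swapped off. */
enum slot_kind { SLOT_FREE, SLOT_PLAIN, SLOT_RMAP };

enum swapoff_action {
	ACT_SKIP,               /* nothing stored in this slot */
	ACT_OLD_PATH,           /* existing swapoff logic (page table walk) */
	ACT_READ_INTO_UPPER,    /* read data into the upper layer's swap cache */
};

static inline enum swapoff_action swapoff_action_for(enum slot_kind kind)
{
	if (kind == SLOT_RMAP)
		return ACT_READ_INTO_UPPER;
	if (kind == SLOT_PLAIN)
		return ACT_OLD_PATH;
	assert(kind == SLOT_FREE);
	return ACT_SKIP;
}

/* Count how many slots can skip the page table walk entirely. */
static inline unsigned int
swapoff_count_upper_reads(const enum slot_kind *slots, unsigned int n)
{
	unsigned int count = 0;

	for (unsigned int i = 0; i < n; i++)
		if (swapoff_action_for(slots[i]) == ACT_READ_INTO_UPPER)
			count++;
	return count;
}
```

If the virtual layer has the highest priority, every in-use slot on a
lower device would be classified as a reverse map slot in this model, so
the old path is never taken for that device.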

>
> By the time you implement all of this, I think it will be MORE
> complex, especially since you want to maintain BOTH the new setup and
> the old non-virtual swap setup. You'll have to litter the code with a
> bunch of ifs (or ifdefs) to check - hey do we have a virtual swapfile?
> Hey is this a virtual swap slot? Etc. Etc. everywhere, from the PTE
> infra (zapping, page fault, etc.), to cgroup infra, to physical swap
> architecture.

It is using the same infrastructure, which means a lot of things are
reused and unified. Isn't that a good sign? And again we don't need to
re-implement the whole infra.

And if you need multiple layers then there will be more "if"s and
overhead however you implement it. But with unified infra, each layer
can stay optional. And checking "si->flags & GHOST / VIRTUAL" really
shouldn't be costly or troublesome at all, compared to a mandatory
layer with layers of Xarray walks.
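
As a rough illustration of why a flag check keeps the plain path cheap,
consider the user-space sketch below. The flag name `SWP_DEMO_VIRTUAL`,
the `swap_dev` struct, and `resolve_slot` are all hypothetical; the
series' real flag and lookup code may look quite different.

```c
#include <assert.h>
#include <stdint.h>

#define SWP_DEMO_VIRTUAL (1u << 0)      /* hypothetical flag, not the kernel's */

struct swap_dev {
	unsigned int    flags;
	const uint32_t *redirect;       /* only populated for virtual devices */
	uint32_t        nr_slots;
};

/* Resolve a slot to the offset that actually holds the data. */
static inline uint32_t resolve_slot(const struct swap_dev *si, uint32_t slot)
{
	assert(slot < si->nr_slots);
	if (!(si->flags & SWP_DEMO_VIRTUAL))
		return slot;            /* plain swap: one flag test, no indirection */
	return si->redirect[slot];      /* virtual swap: one extra table lookup */
}
```

In this model a plain device pays a single predictable branch, and the
extra lookup is only taken for devices that opted in to the virtual layer.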

And we can move the virt part to a separate place and maintain it there.

> Comparing this line of work by itself with the vswap series, which
> already comes with all of these included, is a bit apples-to-oranges
> (and especially with the fact that vswap simplifies logic and removes
> LoCs in a lot of places too, such as in swapoff. The delta LoC is only
> 300-400 IIRC?).

One thing I want to highlight here is that the old swapoff really
shouldn't just die. That gives us no chance to clear up the swap cache
at all (vss holding swap data in RAM is also just swap cache). Pages
still in swap cache means minor page faults will still trigger. If the
workload is opaque but we know a high load of traffic is coming and
want to get rid of any performance bottleneck by reading all folios
into the right place, swapoff gives the guarantee that no anon fault
will ever be triggered. That happens a lot in multi-tenant cloud
environments, and these workloads are opaque so madvise doesn't apply.

> > The size of the swapfile (si->max) is now just a number, which could be
> > changeable at runtime if we have a proper idea how to expose that and
> > might need some audit of a few remaining users. But right now, we can
> > already easily have a huge swap device with no overhead, for example:
> >
> > free -m
> >                total        used        free      shared  buff/cache   available
> > Mem:            1465         250         927           1         356        1215
> > Swap:       15269887           0    15269887
> >
>
> 3. I don't think we should expose virtual swap state to users (in this
> case, in the swapfile summary view, i.e. in free). It is just confusing,
> as it poorly reflects the physical state (be it compressed memory
> footprint, or actual disk usage). We obviously should expose a bunch
> of sysfs debug counters for troubleshooting, but for average users,
> it should be all transparent.

Using sysfs can also be a choice, that's really just a demonstration
interface. But I do think it's worse if the user has no idea what is
actually going on.

Re: [PATCH RFC 00/15] mm, swap: swap table phase IV with dynamic ghost swapfile

Posted by Nhat Pham 1 month, 1 week ago

On Mon, Feb 23, 2026 at 7:35 PM Kairui Song <ryncsn@gmail.com> wrote:
>
> On Tue, Feb 24, 2026 at 2:22 AM Nhat Pham <nphamcs@gmail.com> wrote:
> >
> > On Thu, Feb 19, 2026 at 3:42 PM Kairui Song via B4 Relay
> > <devnull+kasong.tencent.com@kernel.org> wrote:
> > > Huge thanks to Chris Li for the layered swap table and ghost swapfile
> > > idea, without whom the work here can't be achieved. Also, thanks to Nhat
> > > for pushing and suggesting using an Xarray for the swapfile [11] for
> > > dynamic size. I was originally planning to use a dynamic cluster
> > > array, which requires a bit more adaptation, cleanup, and convention
> > > changes. But during the discussion there, I got the inspiration that
> > > Xarray can be used as the intermediate step, making this approach
> > > doable with minimal changes. Just keeping it in the future might
> > > not hurt either, as Xarray is limited to ghost / virtual files, so
> > > plain swaps won't have any extra overhead for lookup or high
> > > risk of swapout allocation failure.
> >
> > Thanks for your effort. Dynamic swap space is a very important
> > consideration for anyone deploying a compressed swapping backend on
> > large memory systems in general. And yeah, I think using a radix
> > tree/xarray is the easiest out-of-the-box solution for this - thanks
> > for citing me :P
>
> Thanks for the discussion :)
>
> >
> > I still have some confusion and concerns though. Johannes already made
> > some good points - I'll just add some thoughts from my point of view,
> > having gone back and forth with virtual swap designs:
> >
> > 1. At which layer should the metadata (swap count, swap cgroup, etc.) live?
> >
> > I remember that in your LSFMMBPF presentation (time flies), your
> > proposal was to store a redirection entry in the top layer, and keep
> > all the metadata at the bottom (i.e. backend) layer? This has problems
> > - for one, you might not know the suitable backend at swap allocation
> > time, but only at writeout time. E.g., in certain zswap setups, we
> > reject the incompressible page and cycle it back to the active LRU, so
> > we have no space in the zswap layer to store swap entry metadata (note
> > that at this point the swap entry cannot be freed, because we have
> > already unmapped the page from the PTEs, and undoing this would
> > require a page table walk a la swapoff). Similarly, when we
> > exclusive-load a page from zswap, we invalidate the zswap metadata
> > struct, so we will no longer have a space for the swap entry metadata.
> >
> > The zero-filled (or same-filled) swap entry case is an even more
> > egregious example :) It really shouldn't be a state in any backend -
> > it should be a completely independent backend.
> >
> > The only design that makes sense is to store metadata in the top layer
> > as well. It's what I'm doing for my virtual swap patch series, but if
> > we're pursuing this opt-in swapfile direction we are going to
> > duplicate metadata :)
>
> It's already doing that, storing metadata at the top layer, only a
> reverse mapping in the lower layer.
>
> So none of these issues are still there. Don't worry, I do remember
> that conversation and kept that in mind :)
>
> > > And if you consider these ops are too complex to set up and maintain, we
> > > can then only allow one ghost / virtual file, make it infinitely large,
> > > and be the default one and top tier, then it achieves the identical thing
> > > to virtual swap space, but with much fewer LOC changed and being runtime
> > > optional.
> >
> > 2. I think the "fewer LOC changed" claim here is misleading ;)
> >
> > A lot of the behaviors that are required in a virtual swap setup are
> > missing from this patch series. You are essentially just implementing
> > a swapfile with a dynamic allocator. You still need a bunch more logic
> > to support a proper multi-tier virtual swap setup - just off the top
> > of my head:
>
> I left that part undone kind of on purpose, since it's only RFC, and
> in hope that there could be collaboration.
>
> And the dynamic allocator is only ~200 LOC now. Other parts of this
> series are not only for virtual swap. For example the unified folio
> alloc for swapin, which gives us 15% performance gain in real
> workloads, can still get merged and benefit all of us without
> involving the virtual swap or memcg part.
>
> And meanwhile, with the later patches, we don't have to re-implement
> the whole infrastructure to have a virtual table. And future plans
> like data compaction should benefit every layer naturally (same
> infra).
>
> > a. Charging: virtual swap usage should not be charged the same as the
> > physical swap usage, especially when you have a zswap + disk swap
> > setup, powered by virtual swap. For one, I don't believe in sizing
> > virtual swap, but also a latency-sensitive cgroup allowed to use only
> > zswap (backed by virtual swap) is using and competing for resources
> > very differently from a cgroup whose memory is incompressible and only
> > allowed to use disk swap.
>
> Ah, now as you mention it, I see in the beginning of this series I
> added: "Swap table P4 is stable and good to merge if we are OK with a
> few memcg reparent behavior (there is also a solution if we don't)".
> The "other solution" also fits your different charge idea here. Just
> have a ci->memcg_table, then each layer can have their own charge
> design, and the shadow is still only used for refault check. That
> gives us 10 bytes per slot overhead though, but still lower than
> before and stays completely dynamic.
>
> Also, no duplicated memcg, since the upper layer and lower layer
> should be charged differently. If they don't, then just let
> ci->memcg_table stay NULL.
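
As an illustration of the optional per-cluster table described above,
here is a minimal userspace sketch in C. It is not taken from the
series: struct cluster, cluster_enable_memcg_table(), slot_memcg_id(),
the 512-slot cluster size and the fallback_id parameter are all
hypothetical names chosen for the example. The only idea borrowed from
the text is that a NULL memcg_table means the layer is not charged
separately.

```c
#include <stddef.h>
#include <stdlib.h>

#define SLOTS_PER_CLUSTER 512 /* assumed size, for illustration only */

/* Hypothetical stand-in for a swap cluster (the "ci" in the text). */
struct cluster {
	unsigned short *memcg_table; /* one id per slot, NULL if unused */
};

/* Allocate the table on demand; returns 0 on success, -1 on failure. */
static int cluster_enable_memcg_table(struct cluster *ci)
{
	if (ci->memcg_table)
		return 0; /* already enabled */
	ci->memcg_table = calloc(SLOTS_PER_CLUSTER, sizeof(*ci->memcg_table));
	return ci->memcg_table ? 0 : -1;
}

/*
 * Pick the memcg id for a slot. A cluster without its own table (or an
 * out-of-range offset) falls back to the id supplied by the caller,
 * which models "this layer is charged the same as the other one".
 */
static unsigned short slot_memcg_id(const struct cluster *ci, size_t off,
				    unsigned short fallback_id)
{
	if (!ci->memcg_table || off >= SLOTS_PER_CLUSTER)
		return fallback_id;
	return ci->memcg_table[off];
}
```

In this model a layer that is charged the same way as its neighbour
never calls cluster_enable_memcg_table(), so it pays nothing for the
table.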
>
> >
> > b. Backend decision making and efficient backend transfer - as you
> > said, "folio_realloc_swap" is yet to be implemented :) And as I
> > mentioned earlier, we CANNOT determine swap backend before PTE unmap
>
> And we are not doing that at all. folio_alloc_swap happens before
> unmap, but realloc happens after that. VSS does the same thing.
>
> > time, because backend suitability is content-dependent. You will have
> > to add extra logic to handle this nuanced swap allocation behavior.
> >
> > c. Virtual swap freeing - it requires more work, as you have to free
> > both the virtual swap entry itself, as well as digging into the
> > physical backend layer.
> >
> > d. Swapoff - now you have to walk both page tables and the virtual swap table.
>
> Swapoff is actually easy here... If it sees a reverse map slot, read
> into the upper layer. Else goto the old logic. Then it's done. If
> ghost swap is the layer with highest priority, then every slot is a
> reverse map slot.
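
The per-slot decision described above can be modelled with a small
userspace sketch in C. It only captures the branching ("reverse map
slot: read into the upper layer, otherwise use the old logic"). The
enum values, struct slot and both helpers are hypothetical and do not
reflect how the series actually encodes slots.

```c
#include <stddef.h>

/* Hypothetical slot kinds on the device being swapped off. */
enum slot_kind { SLOT_FREE, SLOT_PLAIN, SLOT_REVERSE_MAP };

struct slot {
	enum slot_kind kind;
	size_t upper_index; /* meaningful only for SLOT_REVERSE_MAP */
};

enum swapoff_action {
	ACT_SKIP,            /* nothing stored in this slot */
	ACT_READ_INTO_UPPER, /* hand the data back to the upper layer */
	ACT_LEGACY_UNUSE,    /* fall back to the existing swapoff path */
};

/* Decide what swapoff does with a single slot. */
static enum swapoff_action swapoff_classify(const struct slot *s)
{
	switch (s->kind) {
	case SLOT_REVERSE_MAP:
		return ACT_READ_INTO_UPPER;
	case SLOT_PLAIN:
		return ACT_LEGACY_UNUSE;
	default:
		return ACT_SKIP;
	}
}

/* Count how many slots would still need the existing swapoff path. */
static size_t swapoff_count_legacy(const struct slot *slots, size_t n)
{
	size_t legacy = 0;

	for (size_t i = 0; i < n; i++)
		if (swapoff_classify(&slots[i]) == ACT_LEGACY_UNUSE)
			legacy++;
	return legacy;
}
```

If the ghost layer sits on top of everything, every used slot in the
lower device is a reverse map slot, so swapoff_count_legacy() returns
0 in this model and the old path is never taken.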
>
> >
> > By the time you implement all of this, I think it will be MORE
> > complex, especially since you want to maintain BOTH the new setup and
> > the old non-virtual swap setup. You'll have to litter the code with a
> > bunch of ifs (or ifdefs) to check - hey do we have a virtual swapfile?
> > Hey is this a virtual swap slot? Etc. Etc. everywhere, from the PTE
> > infra (zapping, page fault, etc.), to cgroup infra, to physical swap
> > architecture.
>
> It is using the same infrastructure, which means a lot of things are
> reused and unified. Isn't that a good sign? And again we don't need to
> re-implement the whole infra.
>
> And if you need multiple layers then there will be more "if"s and
> overhead however you implement it. But with unified infra, each layer
> can stay optional. And checking "si->flags & GHOST / VIRTUAL" really
> shouldn't be costly or troublesome at all, compared to a mandatory
> layer with layers of Xarray walk.
>
> And we can move and maintain the virt part in a separate place.

The point is not that it's hard to do. That's the whole sales pitch of
vswap - once you have it, all the use cases are neatly facilitated ;)

I'm just pointing out that "minimal LoC" is not exactly fair here, as
we still have (in my estimate) quite a sizable amount of work.

>
> > Comparing this line of work by itself with the vswap series, which
> > already comes with all of these included, is a bit apples-to-oranges
> > (and especially with the fact that vswap simplifies logic and removes
> > LoCs in a lot of places too, such as in swapoff. The delta LoC is only
> > 300-400 IIRC?).
>
> One thing I want to highlight here is that the old swapoff really
> shouldn't just die. That gives us no chance to clear up the swap cache
> at all (vss holding swap data in RAM is also just swap cache). Pages
> still in swap cache means minor page faults will still trigger. If the
> workload is opaque but we know a high load of traffic is coming and
> want to get rid of any performance bottleneck by reading all folios
> into the right place, swapoff gives the guarantee that no anon fault
> will ever be triggered. That happens a lot in multi-tenant cloud
> environments, and these workloads are opaque so madvise doesn't apply.

I somewhat agree with Johannes that the problem is quite academic in
nature here, but I will think more about it.

>
> > > The size of the swapfile (si->max) is now just a number, which could be
> > > changeable at runtime if we have a proper idea how to expose that and
> > > might need some audit of a few remaining users. But right now, we can
> > > already easily have a huge swap device with no overhead, for example:
> > >
> > > free -m
> > >                total        used        free      shared  buff/cache   available
> > > Mem:            1465         250         927           1         356        1215
> > > Swap:       15269887           0    15269887
> > >
> >
> > 3. I don't think we should expose virtual swap state to users (in this
> > case, in the swapfile summary view i.e in free). It is just confusing,
> > as it poorly reflects the physical state (be it compressed memory
> > footprint, or actual disk usage). We obviously should expose a bunch
> > of sysfs debug counters for troubleshooting, but for average users,
> > it should be all transparent.
>
> Using sysfs can also be a choice; the free output above is really just
> a demonstration interface. But I do think it's worse if the user has no
> idea what is actually going on.

I think the users should know whether virtual swap is enabled, and see
some diagnostic stats - allocated, used, rejected/failure etc.

But from a user's perspective, the other traditional swapfile states
don't seem that useful, and might give users misconceptions. When you
see swapfile stats, you know that you are occupying a limited physical
resource, and how much of it is left. I don't think there's even a
good reason to statically size virtual swap space - it's just a
facility to enable use cases, not an actual resource in the same way
as memory, or disk drive, and is dynamic (on-demand) in nature.