This series removes the SWP_SYNCHRONOUS_IO swap cache bypass code and
the special swap map bits, including SWAP_HAS_CACHE, along with many
historical issues. Performance is about 20% better for some workloads,
like Redis with persistence. It also cleans up the code to prepare for
later phases; some patches are from a previously posted series.
Swap cache bypassing, and swap synchronization in general, had many
issues. Some have been worked around, and some are still there [1]. To
resolve them cleanly, one good solution is to always use the swap cache
as the synchronization layer [2], which means the swap cache bypass
swap-in path has to be removed first. That used to be impractical due
to performance concerns, but now, combined with the swap table,
removing the bypass path actually improves performance, so there is no
reason to keep it.
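
As a rough sketch of the control-flow change (helper names and
arguments below are simplified, not the exact kernel code; the only
real new name is swap_cache_alloc_folio(), which this series introduces
by renaming __read_swap_cache_async()):

    /* Before: SWP_SYNCHRONOUS_IO devices with swap count == 1 skipped
     * the swap cache, so concurrent faults on the same entry had to be
     * synchronized with special-purpose swap_map states and retries. */
    if (swap_device_is_synchronous(si) && __swap_count(entry) == 1) {
            folio = alloc_private_folio(...);      /* not in swap cache */
            swap_read_folio(folio, ...);
    } else {
            folio = swapin_readahead(entry, ...);  /* via swap cache */
    }

    /* After: a single path; inserting into the swap cache is what
     * claims the entry, so the cache is the only synchronization point
     * and the special cases disappear. */
    folio = swap_cache_alloc_folio(entry, ...);
    swap_read_folio(folio, ...);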
Now we can rework the swap entry and cache synchronization following
the new design. Swap cache synchronization relied heavily on
SWAP_HAS_CACHE, which is the cause of many issues. By dropping the
special swap map bits and the related workarounds, we get a cleaner
code base and prepare for merging the swap count into the swap table in
the next step.
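
For context, the special states being dropped all live in the per-entry
swap_map byte, packed next to the count itself (values from
include/linux/swap.h):

    #define SWAP_HAS_CACHE  0x40  /* entry also has a swap cache page */
    #define COUNT_CONTINUED 0x80  /* count continued in an extra page */
    #define SWAP_MAP_MAX    0x3e  /* max count storable in the byte */
    #define SWAP_MAP_BAD    0x3f  /* bad slot */
    #define SWAP_MAP_SHMEM  0xbf  /* entry owned by shmem/tmpfs */

Because cache state and count share one byte, every update has to
reason about transient combinations of the two, which is where most of
the workarounds removed by this series come from.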
Test results:
Redis / Valkey bench:
=====================
Testing on an ARM64 VM with 1.5G memory:
Server: valkey-server --maxmemory 2560M
Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get
              no persistence          with BGSAVE
Before:       460475.84 RPS           311591.19 RPS
After:        451943.34 RPS (-1.9%)   371379.06 RPS (+19.2%)
Testing on an x86_64 VM with 4G memory (system components take about 2G):
Server:
Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get
              no persistence          with BGSAVE
Before:       306044.38 RPS           102745.88 RPS
After:        309645.44 RPS (+1.2%)   125313.28 RPS (+22.0%)
Performance is much better when persistence is enabled. This should
apply to many other workloads that involve shared memory and COW. A
slight performance drop was observed in the ARM64 Redis test: we are
still using swap_map to track the swap count, which causes redundant
cache and CPU overhead and is not very performance-friendly on some
arches. This will improve once we merge the swap map into the swap
table (as already demonstrated previously [3]).
vm-scalability
==============
usemem --init-time -O -y -x -n 32 1536M (16G memory, global pressure,
simulated PMEM as swap), average of 6 test runs:
                            Before:         After:
System time:                282.22s         283.47s
Sum Throughput:             5677.35 MB/s    5688.78 MB/s
Single process Throughput:  176.41 MB/s     176.23 MB/s
Free latency:               518477.96 us    521488.06 us
The results are almost identical.
Build kernel test:
==================
Test using ZRAM as SWAP, make -j48, defconfig, on an x86_64 VM
with 4G RAM, under global pressure, average of 32 test runs:

              Before:       After:
System time:  1379.91s      1364.22s (-1.1%)
Test using ZSWAP with NVME SWAP, make -j48, defconfig, on an x86_64 VM
with 4G RAM, under global pressure, average of 32 test runs:

              Before:       After:
System time:  1822.52s      1803.33s (-1.1%)
The results are almost identical.
MySQL:
======
sysbench /usr/share/sysbench/oltp_read_only.lua --tables=16
--table-size=1000000 --threads=96 --time=600 (using ZRAM as SWAP, in a
512M memory cgroup, buffer pool set to 3G, 3 test runs with 180s warm-up).

Before: 318162.18 qps
After:  318512.01 qps (+0.1%)
In conclusion, the results are better or identical in most cases, and
especially better for workloads with swap count > 1 on SYNC_IO devices,
with an ~20% gain in the tests above. The next phases will merge the
swap count into the swap table and reduce memory usage.
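
As a loose illustration of that direction (the encoding below is only
an illustration, not the final design): each slot already has one full
word in the swap table, so the count can eventually live there too:

    /*
     * Illustration only. One word per swap slot:
     *   folio pointer -> slot is cached in the swap cache
     *   shadow value  -> swapped out, workingset shadow kept
     *   empty         -> free slot
     * Folding the count into the same word would let it replace the
     * separate swap_map byte array entirely.
     */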
One more gain is that we now have better support for THP swapin.
Previously, THP swapin was tied to swap cache bypassing, which only
works for single-mapped folios. Removing the bypass path also enables
THP swapin for all folios. THP swapin is still limited to SYNC_IO
devices; this limitation can be removed later. This may cause more
serious thrashing for certain workloads, but that's not an issue
introduced by this series; it's a common THP issue we should resolve
separately.
Link: https://lore.kernel.org/linux-mm/CAMgjq7D5qoFEK9Omvd5_Zqs6M+TEoG03+2i_mhuP5CQPSOPrmQ@mail.gmail.com/ [1]
Link: https://lore.kernel.org/linux-mm/20240326185032.72159-1-ryncsn@gmail.com/ [2]
Link: https://lore.kernel.org/linux-mm/20250514201729.48420-1-ryncsn@gmail.com/ [3]
Suggested-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
Kairui Song (18):
mm/swap: rename __read_swap_cache_async to swap_cache_alloc_folio
mm, swap: split swap cache preparation loop into a standalone helper
mm, swap: never bypass the swap cache even for SWP_SYNCHRONOUS_IO
mm, swap: always try to free swap cache for SWP_SYNCHRONOUS_IO devices
mm, swap: simplify the code and reduce indention
mm, swap: free the swap cache after folio is mapped
mm/shmem: never bypass the swap cache for SWP_SYNCHRONOUS_IO
mm, swap: swap entry of a bad slot should not be considered as swapped out
mm, swap: consolidate cluster reclaim and check logic
mm, swap: split locked entry duplicating into a standalone helper
mm, swap: use swap cache as the swap in synchronize layer
mm, swap: remove workaround for unsynchronized swap map cache state
mm, swap: sanitize swap entry management workflow
mm, swap: add folio to swap cache directly on allocation
mm, swap: check swap table directly for checking cache
mm, swap: clean up and improve swap entries freeing
mm, swap: drop the SWAP_HAS_CACHE flag
mm, swap: remove no longer needed _swap_info_get
Nhat Pham (1):
mm/shmem, swap: remove SWAP_MAP_SHMEM
arch/s390/mm/pgtable.c | 2 +-
include/linux/swap.h | 77 ++---
kernel/power/swap.c | 10 +-
mm/madvise.c | 2 +-
mm/memory.c | 270 +++++++---------
mm/rmap.c | 7 +-
mm/shmem.c | 75 ++---
mm/swap.h | 69 +++-
mm/swap_state.c | 341 +++++++++++++-------
mm/swapfile.c | 849 +++++++++++++++++++++----------------------------
mm/userfaultfd.c | 10 +-
mm/vmscan.c | 1 -
mm/zswap.c | 4 +-
13 files changed, 840 insertions(+), 877 deletions(-)
---
base-commit: f30d294530d939fa4b77d61bc60f25c4284841fa
change-id: 20251007-swap-table-p2-7d3086e5c38a
Best regards,
--
Kairui Song <kasong@tencent.com>
On Wed, Oct 29, 2025 at 11:58:26PM +0800, Kairui Song wrote:
> [...]
Unfortunately I don't have time to go through the series and review it,
but I wanted to just say awesome work here. The special cases in the
swap code to avoid using the swapcache have always been a pain.
In fact, there's one more special case that we can probably remove in
zswap_load() now, the one introduced by commit 25cd241408a2 ("mm: zswap:
fix data loss on SWP_SYNCHRONOUS_IO devices").
On Fri, Oct 31, 2025 at 7:05 AM Yosry Ahmed <yosry.ahmed@linux.dev> wrote:
>
> [...]
>
> Unfortunately I don't have time to go through the series and review it,
> but I wanted to just say awesome work here. The special cases in the
> swap code to avoid using the swapcache have always been a pain.
>
> In fact, there's one more special case that we can probably remove in
> zswap_load() now, the one introduced by commit 25cd241408a2 ("mm: zswap:
> fix data loss on SWP_SYNCHRONOUS_IO devices").
Thanks! Oh, now I remember that one; it can indeed be removed. There
are several more cleanups and optimizations that can be done after this
series; it's getting too long already, so I didn't include everything.
But removing 25cd241408a2 is easy to do and easy to review, so I can
include it in the next update.
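
Roughly, assuming every swap-in now reaches zswap_load() with a swap
cache folio, that cleanup might look like this (illustrative sketch
only, not the actual patch):

    bool zswap_load(struct folio *folio)
    {
            /* ... tree lookup and decompression stay as they are ... */

            /*
             * 25cd241408a2 had to keep the zswap entry when the folio
             * was not a swap cache folio, because a bypassing swap-in
             * could be discarded while the entry held the only copy of
             * the data.  With the bypass gone, the folio is always in
             * the swap cache here, so the load can always be exclusive.
             */
            VM_WARN_ON_ONCE(!folio_test_swapcache(folio));
            zswap_entry_free(entry);   /* drop our copy; folio owns the data */
            folio_mark_dirty(folio);   /* rewrite to swap if reclaimed again */

            return true;
    }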
Sorry I have been super busy and late to the review party. I am still
catching up on my backlog.

The cover letter title is a bit too long; I suggest you put "swap table
phase II" at the beginning of the title rather than at the end. As it
is, the title is long enough that "phase II" gets wrapped to another
line. Maybe just using "swap table phase II" as the cover letter title
is good enough; you can explain what the series does in more detail in
the body. Also, we could mention the estimated total number of phases
for the swap tables (4-5 phases?). It does not need to be precise; it
just serves as an overall indication of the swap table progress bar.

On Wed, Oct 29, 2025 at 8:59 AM Kairui Song <ryncsn@gmail.com> wrote:
>
> This series removes the SWP_SYNCHRONOUS_IO swap cache bypass code and

Great job!

> special swap bits including SWAP_HAS_CACHE, along with many historical
> issues. The performance is about ~20% better for some workloads, like
> Redis with persistence. This also cleans up the code to prepare for
> later phases, some patches are from a previously posted series.

That is wonderful: we can remove SWAP_HAS_CACHE and remove the sync IO
swap cache bypass. The swap table is so fast that the bypass does not
make any sense any more.

> [...]
>
> enabled THP swapin for all folios. It's still limited to SYNC_IO
> devices, though, this limitation can will be removed later. This may

Grammar: "though, this", "can will be". Suggested: "The THP swapin is
still limited to SYNC_IO devices. This limitation can be removed
later."

Chris

> [...]