[PATCH 00/12] mm, swap: swap table phase III: remove swap_map

Kairui Song posted 12 patches 1 week, 6 days ago
There is a newer version of this series
include/linux/swap.h |   28 +-
mm/memory.c          |    2 +-
mm/swap.h            |   20 +-
mm/swap_state.c      |   72 ++--
mm/swap_table.h      |  131 +++++-
mm/swapfile.c        | 1104 +++++++++++++++++++++-----------------------------
mm/workingset.c      |   49 ++-
7 files changed, 653 insertions(+), 753 deletions(-)
Posted by Kairui Song 1 week, 6 days ago
This series is based on phase II, which is still in mm-unstable.

This series removes the static swap_map and uses the swap table to track
the swap count directly. This saves ~30% of the memory used by the static
swap metadata. For example, it saves 256MB of memory when mounting a
1TB swap device. Performance is slightly better too, since the double
update of the swap table and swap_map is gone.

Test results:

Mounting a swap device:
=======================
Mount a 1TB brd device as SWAP, just to verify the memory savings:

`free -m` before:
               total        used        free      shared  buff/cache   available
Mem:            1465        1051         417           1          61         413
Swap:        1054435           0     1054435

`free -m` after:
               total        used        free      shared  buff/cache   available
Mem:            1465         795         672           1          62         670
Swap:        1054435           0     1054435

Idle memory usage is reduced by ~256MB, as expected. Following this
design, we should be able to save another ~512MB in a later phase.
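
The ~256MB and ~512MB figures follow from the per-slot cost of the static
arrays. A rough sketch of the arithmetic, assuming 4 KiB pages, one byte
per slot for swap_map, and two bytes per slot for the swap cgroup array
(these sizes and the helper names are assumptions for illustration, not
taken from the series):

```c
#include <stdint.h>

/* Assumed sizes, for illustration only. */
#define ASSUMED_PAGE_SIZE          4096ULL /* 4 KiB pages */
#define ASSUMED_SWAP_MAP_BYTES     1ULL    /* one count byte per slot */
#define ASSUMED_SWAP_CGROUP_BYTES  2ULL    /* one 16-bit id per slot */

/* Number of page-sized swap slots on a device of the given size. */
static inline uint64_t swap_slots(uint64_t device_bytes)
{
	return device_bytes / ASSUMED_PAGE_SIZE;
}

/* Size in MiB of a static array that costs bytes_per_slot per slot. */
static inline uint64_t static_metadata_mib(uint64_t device_bytes,
					   uint64_t bytes_per_slot)
{
	return (swap_slots(device_bytes) * bytes_per_slot) >> 20;
}
```

Under these assumptions, static_metadata_mib(1ULL << 40, ASSUMED_SWAP_MAP_BYTES)
gives 256 for a 1 TiB device, and the same call with
ASSUMED_SWAP_CGROUP_BYTES gives 512.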

Build kernel test:
==================
Test using ZSWAP with NVMe SWAP, make -j48, defconfig, in an x86_64 VM
with 5G RAM, under global pressure, average of 32 test runs:

                Before            After
System time:    1038.97s          1013.75s (-2.4%)

Test using ZRAM as SWAP, make -j12, tinyconfig, in an ARM64 VM with 1.5G
RAM, under global pressure, average of 32 test runs:

                Before            After
System time:    67.75s            66.65s (-1.6%)

System time is slightly lower in both cases.

Redis / Valkey benchmark:
=========================
Test using ZRAM as SWAP, in an ARM64 VM with 1.5G RAM, under global pressure,
average of 64 test runs:

Server: valkey-server --maxmemory 2560M
Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get

        no persistence              with BGSAVE
Before: 472705.71 RPS               369451.68 RPS
After:  481197.93 RPS (+1.8%)       374922.32 RPS (+1.5%)

In conclusion, performance is better in all cases, and memory usage is
much lower.
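
The percentages quoted in the tables above are relative changes against
the "Before" column. A small helper to reproduce them (the function name
is hypothetical, chosen only for this sketch):

```c
/* Relative change of `after` versus `before`, in percent. */
static inline double relative_change_pct(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

For example, relative_change_pct(1038.97, 1013.75) is about -2.4 and
relative_change_pct(472705.71, 481197.93) is about +1.8, matching the
numbers reported for the x86_64 build test and the no-persistence Valkey
run.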

The swap cgroup array will also be merged into the swap table in a later
phase, saving the remaining ~60% of the static swap metadata and making
all the swap metadata dynamic. The improved API for swap operations also
reduces lock contention and makes more batched operations possible.

Suggested-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
Kairui Song (12):
      mm, swap: protect si->swap_file properly and use as a mount indicator
      mm, swap: clean up swapon process and locking
      mm, swap: remove redundant arguments and locking for enabling a device
      mm, swap: consolidate bad slots setup and make it more robust
      mm/workingset: leave highest bits empty for anon shadow
      mm, swap: implement helpers for reserving data in the swap table
      mm, swap: mark bad slots in swap table directly
      mm, swap: simplify swap table sanity range check
      mm, swap: use the swap table to track the swap count
      mm, swap: no need to truncate the scan border
      mm, swap: simplify checking if a folio is swapped
      mm, swap: no need to clear the shadow explicitly

 include/linux/swap.h |   28 +-
 mm/memory.c          |    2 +-
 mm/swap.h            |   20 +-
 mm/swap_state.c      |   72 ++--
 mm/swap_table.h      |  131 +++++-
 mm/swapfile.c        | 1104 +++++++++++++++++++++-----------------------------
 mm/workingset.c      |   49 ++-
 7 files changed, 653 insertions(+), 753 deletions(-)
---
base-commit: 10de4550639e9df9242e32e9affc90ed75a27c7d
change-id: 20251216-swap-table-p3-8de73fee7b5f

Best regards,
-- 
Kairui Song <kasong@tencent.com>
[syzbot ci] Re: mm, swap: swap table phase III: remove swap_map
Posted by syzbot ci 1 week, 6 days ago
syzbot ci has tested the following series

[v1] mm, swap: swap table phase III: remove swap_map
https://lore.kernel.org/all/20260126-swap-table-p3-v1-0-a74155fab9b0@tencent.com
* [PATCH 01/12] mm, swap: protect si->swap_file properly and use as a mount indicator
* [PATCH 02/12] mm, swap: clean up swapon process and locking
* [PATCH 03/12] mm, swap: remove redundant arguments and locking for enabling a device
* [PATCH 04/12] mm, swap: consolidate bad slots setup and make it more robust
* [PATCH 05/12] mm/workingset: leave highest bits empty for anon shadow
* [PATCH 06/12] mm, swap: implement helpers for reserving data in the swap table
* [PATCH 07/12] mm, swap: mark bad slots in swap table directly
* [PATCH 08/12] mm, swap: simplify swap table sanity range check
* [PATCH 09/12] mm, swap: use the swap table to track the swap count
* [PATCH 10/12] mm, swap: no need to truncate the scan border
* [PATCH 11/12] mm, swap: simplify checking if a folio is swapped
* [PATCH 12/12] mm, swap: no need to clear the shadow explicitly

and found the following issue:
WARNING in swap_cluster_lock

Full report is available here:
https://ci.syzbot.org/series/3f6169fc-e24a-4a19-ba56-e5907b448edc

***

WARNING in swap_cluster_lock

tree:      mm-new
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/akpm/mm.git
base:      5a3704ed2dce0b54a7f038b765bb752b87ee8cc2
arch:      amd64
compiler:  Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config:    https://ci.syzbot.org/builds/0eabd97a-86d8-4606-9d94-dbe4e7fd7c07/config
C repro:   https://ci.syzbot.org/findings/5b039fd0-70da-4954-817d-8bf86315c684/c_repro
syz repro: https://ci.syzbot.org/findings/5b039fd0-70da-4954-817d-8bf86315c684/syz_repro

------------[ cut here ]------------
offset >= si->max
WARNING: mm/swap.h:88 at __swap_offset_to_cluster mm/swap.h:88 [inline], CPU#1: syz.0.548/6508
WARNING: mm/swap.h:88 at __swap_cluster_lock mm/swap.h:101 [inline], CPU#1: syz.0.548/6508
WARNING: mm/swap.h:88 at swap_cluster_lock+0xef/0x130 mm/swap.h:132, CPU#1: syz.0.548/6508
Modules linked in:
CPU: 1 UID: 0 PID: 6508 Comm: syz.0.548 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:__swap_offset_to_cluster mm/swap.h:88 [inline]
RIP: 0010:__swap_cluster_lock mm/swap.h:101 [inline]
RIP: 0010:swap_cluster_lock+0xef/0x130 mm/swap.h:132
Code: e8 86 3b 5a 09 4c 89 f8 5b 41 5c 41 5e 41 5f 5d e9 86 86 5a 09 cc e8 90 ff a0 ff 90 0f 0b 90 e9 3f ff ff ff e8 82 ff a0 ff 90 <0f> 0b 90 e9 6f ff ff ff e8 74 ff a0 ff 90 0f 0b 90 eb a4 e8 69 ff
RSP: 0018:ffffc90004ae66c0 EFLAGS: 00010293
RAX: ffffffff82219a6e RBX: 0000000000007a12 RCX: ffff888110363a80
RDX: 0000000000000000 RSI: 0000000000007a12 RDI: 0000000000007a12
RBP: 0000000000007a12 R08: 0000000000000003 R09: 0000000000000004
R10: dffffc0000000000 R11: fffff5200095cccc R12: dffffc0000000000
R13: ffff888175c2a010 R14: ffff888175c2a000 R15: 0000000000007a12
FS:  000055556978b500(0000) GS:ffff8882a9923000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fca0a017dac CR3: 0000000112e64000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 cluster_alloc_swap_entry+0x20f/0xa40 mm/swapfile.c:1090
 swap_alloc_slow mm/swapfile.c:1385 [inline]
 folio_alloc_swap+0x81f/0x1190 mm/swapfile.c:1717
 shrink_folio_list+0x2714/0x52b0 mm/vmscan.c:1306
 reclaim_folio_list+0x100/0x4f0 mm/vmscan.c:2205
 reclaim_pages+0x45b/0x530 mm/vmscan.c:2242
 madvise_cold_or_pageout_pte_range+0x19b9/0x1d00 mm/madvise.c:561
 walk_pmd_range mm/pagewalk.c:130 [inline]
 walk_pud_range mm/pagewalk.c:224 [inline]
 walk_p4d_range mm/pagewalk.c:262 [inline]
 walk_pgd_range+0x1032/0x1d30 mm/pagewalk.c:303
 __walk_page_range+0x14c/0x710 mm/pagewalk.c:411
 walk_page_range_vma_unsafe+0x309/0x410 mm/pagewalk.c:715
 madvise_pageout_page_range mm/madvise.c:620 [inline]
 madvise_pageout mm/madvise.c:645 [inline]
 madvise_vma_behavior+0x382e/0x4240 mm/madvise.c:1364
 madvise_walk_vmas+0x573/0xae0 mm/madvise.c:1719
 madvise_do_behavior+0x386/0x540 mm/madvise.c:1935
 do_madvise+0x1fa/0x2e0 mm/madvise.c:2028
 __do_sys_madvise mm/madvise.c:2037 [inline]
 __se_sys_madvise mm/madvise.c:2035 [inline]
 __x64_sys_madvise+0xa6/0xc0 mm/madvise.c:2035
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xe2/0xf80 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fca09d9acb9
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffe78abed08 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
RAX: ffffffffffffffda RBX: 00007fca0a015fa0 RCX: 00007fca09d9acb9
RDX: 0000000000000015 RSI: 0000000000600003 RDI: 0000200000000000
RBP: 00007fca09e08bf7 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fca0a015fac R14: 00007fca0a015fa0 R15: 00007fca0a015fa0
 </TASK>


***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
  Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.
Re: [syzbot ci] Re: mm, swap: swap table phase III: remove swap_map
Posted by Kairui Song 1 week, 6 days ago
On Sun, Jan 25, 2026 at 02:13:41PM +0800, syzbot ci wrote:
> syzbot ci has tested the following series
> 
> [v1] mm, swap: swap table phase III: remove swap_map
> https://lore.kernel.org/all/20260126-swap-table-p3-v1-0-a74155fab9b0@tencent.com
> * [PATCH 01/12] mm, swap: protect si->swap_file properly and use as a mount indicator
> * [PATCH 02/12] mm, swap: clean up swapon process and locking
> * [PATCH 03/12] mm, swap: remove redundant arguments and locking for enabling a device
> * [PATCH 04/12] mm, swap: consolidate bad slots setup and make it more robust
> * [PATCH 05/12] mm/workingset: leave highest bits empty for anon shadow
> * [PATCH 06/12] mm, swap: implement helpers for reserving data in the swap table
> * [PATCH 07/12] mm, swap: mark bad slots in swap table directly
> * [PATCH 08/12] mm, swap: simplify swap table sanity range check
> * [PATCH 09/12] mm, swap: use the swap table to track the swap count
> * [PATCH 10/12] mm, swap: no need to truncate the scan border
> * [PATCH 11/12] mm, swap: simplify checking if a folio is swapped
> * [PATCH 12/12] mm, swap: no need to clear the shadow explicitly
> 
> and found the following issue:
> WARNING in swap_cluster_lock
> 
> Full report is available here:
> https://ci.syzbot.org/series/3f6169fc-e24a-4a19-ba56-e5907b448edc
> 
> ***
> 
> WARNING in swap_cluster_lock
> 
> tree:      mm-new
> URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/akpm/mm.git
> base:      5a3704ed2dce0b54a7f038b765bb752b87ee8cc2
> arch:      amd64
> compiler:  Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
> config:    https://ci.syzbot.org/builds/0eabd97a-86d8-4606-9d94-dbe4e7fd7c07/config
> C repro:   https://ci.syzbot.org/findings/5b039fd0-70da-4954-817d-8bf86315c684/c_repro
> syz repro: https://ci.syzbot.org/findings/5b039fd0-70da-4954-817d-8bf86315c684/syz_repro
> 
> ------------[ cut here ]------------
> offset >= si->max
> WARNING: mm/swap.h:88 at __swap_offset_to_cluster mm/swap.h:88 [inline], CPU#1: syz.0.548/6508
> WARNING: mm/swap.h:88 at __swap_cluster_lock mm/swap.h:101 [inline], CPU#1: syz.0.548/6508
> WARNING: mm/swap.h:88 at swap_cluster_lock+0xef/0x130 mm/swap.h:132, CPU#1: syz.0.548/6508

This is a good catch from the bot. It's caused by "[PATCH 10/12] mm, swap:
no need to truncate the scan border". That patch itself is not wrong; the
debug check just has to be updated too:

diff --git a/mm/swap.h b/mm/swap.h
index 087cef49cf69..386a289ef8e7 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -85,7 +85,7 @@ static inline struct swap_cluster_info *__swap_offset_to_cluster(
                struct swap_info_struct *si, pgoff_t offset)
 {
        VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
-       VM_WARN_ON_ONCE(offset >= si->max);
+       VM_WARN_ON_ONCE(offset >= roundup(si->max, SWAPFILE_CLUSTER));
        return &si->cluster_info[offset / SWAPFILE_CLUSTER];
 }

I'll update this in V2.
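
To illustrate why the relaxed bound is still safe, here is a minimal
userspace sketch. It assumes cluster_info has one entry per started
cluster, i.e. DIV_ROUND_UP(si->max, SWAPFILE_CLUSTER) entries, and uses
512 as an illustrative cluster size. The helper names are hypothetical
and are not kernel API:

```c
#include <stdbool.h>

/* Illustrative cluster size; the real value is config-dependent. */
#define ASSUMED_CLUSTER 512UL

/* Round max up to the next multiple of the cluster size. */
static inline unsigned long cluster_roundup(unsigned long max)
{
	return (max + ASSUMED_CLUSTER - 1) / ASSUMED_CLUSTER * ASSUMED_CLUSTER;
}

/* Assumed number of cluster_info entries: one per started cluster. */
static inline unsigned long nr_clusters(unsigned long max)
{
	return (max + ASSUMED_CLUSTER - 1) / ASSUMED_CLUSTER;
}

/* Old debug check: fires for any offset at or past max. */
static inline bool old_check_warns(unsigned long offset, unsigned long max)
{
	return offset >= max;
}

/* Relaxed check: fires only past the end of the last cluster. */
static inline bool new_check_warns(unsigned long offset, unsigned long max)
{
	return offset >= cluster_roundup(max);
}

/* True if offset / cluster size indexes an existing cluster entry. */
static inline bool cluster_index_valid(unsigned long offset, unsigned long max)
{
	return offset / ASSUMED_CLUSTER < nr_clusters(max);
}
```

With a hypothetical max of 1000, offsets 1000 to 1023 trip the old check
but still map to cluster 1, which exists. Offset 1024 would map to a
nonexistent cluster 2, and the relaxed check still fires there.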