This series is based on phase II which is still in mm-unstable.
This series removes the static swap_map and uses the swap table for the
swap count directly. This saves about 30% of the memory used by the static
swap metadata. For example, this saves 256MB of memory when mounting a
1TB swap device. Performance is slightly better too, since the double
update of the swap table and swap_map is now gone.
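The 256MB figure can be reproduced with a small sketch of the arithmetic. It assumes 4 KiB pages and a swap_map that costs one byte per swap slot; the helper names below are hypothetical and exist only for illustration. The 2-byte-per-slot case is likewise an assumption, chosen because it would be consistent with the ~512MB saving expected from a later phase.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages, so each swap slot covers 4096 bytes. */
#define SKETCH_PAGE_SIZE 4096ULL

/* Number of swap slots on a device of the given size (hypothetical helper). */
static uint64_t sketch_nr_slots(uint64_t device_bytes)
{
	return device_bytes / SKETCH_PAGE_SIZE;
}

/*
 * Size in MiB of a static per-slot array with the given entry size.
 * With bytes_per_slot == 1 this models the removed swap_map.
 */
static uint64_t sketch_static_mib(uint64_t device_bytes, uint64_t bytes_per_slot)
{
	assert(bytes_per_slot > 0);
	return sketch_nr_slots(device_bytes) * bytes_per_slot / (1024ULL * 1024ULL);
}
```

Calling sketch_static_mib(1ULL << 40, 1) models the 1TB device from the test below: 2^28 slots at one byte each.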
Test results:
Mounting a swap device:
=======================
Mount a 1TB brd device as SWAP, just to verify the memory saving:
`free -m` before:
               total        used        free      shared  buff/cache   available
Mem:            1465        1051         417           1          61         413
Swap:        1054435           0     1054435
`free -m` after:
               total        used        free      shared  buff/cache   available
Mem:            1465         795         672           1          62         670
Swap:        1054435           0     1054435
Idle memory usage is reduced by ~256MB, just as expected. Following
this design, we should be able to save another ~512MB in a later phase.
Build kernel test:
==================
Test using ZSWAP with NVME SWAP, make -j48, defconfig, in an x86_64 VM
with 5G RAM, under global pressure, avg of 32 test runs:
                Before       After
System time:    1038.97s     1013.75s (-2.4%)
Test using ZRAM as SWAP, make -j12, tinyconfig, in an ARM64 VM with 1.5G
RAM, under global pressure, avg of 32 test runs:
                Before       After
System time:    67.75s       66.65s (-1.6%)
The result is slightly better.
Redis / Valkey benchmark:
=========================
Test using ZRAM as SWAP, in an ARM64 VM with 1.5G RAM, under global pressure,
avg of 64 test runs:
Server: valkey-server --maxmemory 2560M
Client: redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get
        no persistence              with BGSAVE
Before: 472705.71 RPS               369451.68 RPS
After:  481197.93 RPS (+1.8%)       374922.32 RPS (+1.5%)
In conclusion, performance is better in all cases, and memory usage is
much lower.
The swap cgroup array will also be merged into the swap table in a later
phase, saving the remaining ~60% of the static swap metadata and making
all the swap metadata dynamic. The improved API for swap operations also
reduces lock contention and makes more batched operations possible.
Suggested-by: Chris Li <chrisl@kernel.org>
Signed-off-by: Kairui Song <kasong@tencent.com>
---
Kairui Song (12):
mm, swap: protect si->swap_file properly and use as a mount indicator
mm, swap: clean up swapon process and locking
mm, swap: remove redundant arguments and locking for enabling a device
mm, swap: consolidate bad slots setup and make it more robust
mm/workingset: leave highest bits empty for anon shadow
mm, swap: implement helpers for reserving data in the swap table
mm, swap: mark bad slots in swap table directly
mm, swap: simplify swap table sanity range check
mm, swap: use the swap table to track the swap count
mm, swap: no need to truncate the scan border
mm, swap: simplify checking if a folio is swapped
mm, swap: no need to clear the shadow explicitly
include/linux/swap.h | 28 +-
mm/memory.c | 2 +-
mm/swap.h | 20 +-
mm/swap_state.c | 72 ++--
mm/swap_table.h | 131 +++++-
mm/swapfile.c | 1104 +++++++++++++++++++++-----------------------------
mm/workingset.c | 49 ++-
7 files changed, 653 insertions(+), 753 deletions(-)
---
base-commit: 10de4550639e9df9242e32e9affc90ed75a27c7d
change-id: 20251216-swap-table-p3-8de73fee7b5f
Best regards,
--
Kairui Song <kasong@tencent.com>
syzbot ci has tested the following series

[v1] mm, swap: swap table phase III: remove swap_map
https://lore.kernel.org/all/20260126-swap-table-p3-v1-0-a74155fab9b0@tencent.com
* [PATCH 01/12] mm, swap: protect si->swap_file properly and use as a mount indicator
* [PATCH 02/12] mm, swap: clean up swapon process and locking
* [PATCH 03/12] mm, swap: remove redundant arguments and locking for enabling a device
* [PATCH 04/12] mm, swap: consolidate bad slots setup and make it more robust
* [PATCH 05/12] mm/workingset: leave highest bits empty for anon shadow
* [PATCH 06/12] mm, swap: implement helpers for reserving data in the swap table
* [PATCH 07/12] mm, swap: mark bad slots in swap table directly
* [PATCH 08/12] mm, swap: simplify swap table sanity range check
* [PATCH 09/12] mm, swap: use the swap table to track the swap count
* [PATCH 10/12] mm, swap: no need to truncate the scan border
* [PATCH 11/12] mm, swap: simplify checking if a folio is swapped
* [PATCH 12/12] mm, swap: no need to clear the shadow explicitly

and found the following issue:
WARNING in swap_cluster_lock

Full report is available here:
https://ci.syzbot.org/series/3f6169fc-e24a-4a19-ba56-e5907b448edc

***

WARNING in swap_cluster_lock

tree: mm-new
URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/akpm/mm.git
base: 5a3704ed2dce0b54a7f038b765bb752b87ee8cc2
arch: amd64
compiler: Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config: https://ci.syzbot.org/builds/0eabd97a-86d8-4606-9d94-dbe4e7fd7c07/config
C repro: https://ci.syzbot.org/findings/5b039fd0-70da-4954-817d-8bf86315c684/c_repro
syz repro: https://ci.syzbot.org/findings/5b039fd0-70da-4954-817d-8bf86315c684/syz_repro

------------[ cut here ]------------
offset >= si->max
WARNING: mm/swap.h:88 at __swap_offset_to_cluster mm/swap.h:88 [inline], CPU#1: syz.0.548/6508
WARNING: mm/swap.h:88 at __swap_cluster_lock mm/swap.h:101 [inline], CPU#1: syz.0.548/6508
WARNING: mm/swap.h:88 at swap_cluster_lock+0xef/0x130 mm/swap.h:132, CPU#1: syz.0.548/6508
Modules linked in:
CPU: 1 UID: 0 PID: 6508 Comm: syz.0.548 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:__swap_offset_to_cluster mm/swap.h:88 [inline]
RIP: 0010:__swap_cluster_lock mm/swap.h:101 [inline]
RIP: 0010:swap_cluster_lock+0xef/0x130 mm/swap.h:132
Code: e8 86 3b 5a 09 4c 89 f8 5b 41 5c 41 5e 41 5f 5d e9 86 86 5a 09 cc e8 90 ff a0 ff 90 0f 0b 90 e9 3f ff ff ff e8 82 ff a0 ff 90 <0f> 0b 90 e9 6f ff ff ff e8 74 ff a0 ff 90 0f 0b 90 eb a4 e8 69 ff
RSP: 0018:ffffc90004ae66c0 EFLAGS: 00010293
RAX: ffffffff82219a6e RBX: 0000000000007a12 RCX: ffff888110363a80
RDX: 0000000000000000 RSI: 0000000000007a12 RDI: 0000000000007a12
RBP: 0000000000007a12 R08: 0000000000000003 R09: 0000000000000004
R10: dffffc0000000000 R11: fffff5200095cccc R12: dffffc0000000000
R13: ffff888175c2a010 R14: ffff888175c2a000 R15: 0000000000007a12
FS: 000055556978b500(0000) GS:ffff8882a9923000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fca0a017dac CR3: 0000000112e64000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 cluster_alloc_swap_entry+0x20f/0xa40 mm/swapfile.c:1090
 swap_alloc_slow mm/swapfile.c:1385 [inline]
 folio_alloc_swap+0x81f/0x1190 mm/swapfile.c:1717
 shrink_folio_list+0x2714/0x52b0 mm/vmscan.c:1306
 reclaim_folio_list+0x100/0x4f0 mm/vmscan.c:2205
 reclaim_pages+0x45b/0x530 mm/vmscan.c:2242
 madvise_cold_or_pageout_pte_range+0x19b9/0x1d00 mm/madvise.c:561
 walk_pmd_range mm/pagewalk.c:130 [inline]
 walk_pud_range mm/pagewalk.c:224 [inline]
 walk_p4d_range mm/pagewalk.c:262 [inline]
 walk_pgd_range+0x1032/0x1d30 mm/pagewalk.c:303
 __walk_page_range+0x14c/0x710 mm/pagewalk.c:411
 walk_page_range_vma_unsafe+0x309/0x410 mm/pagewalk.c:715
 madvise_pageout_page_range mm/madvise.c:620 [inline]
 madvise_pageout mm/madvise.c:645 [inline]
 madvise_vma_behavior+0x382e/0x4240 mm/madvise.c:1364
 madvise_walk_vmas+0x573/0xae0 mm/madvise.c:1719
 madvise_do_behavior+0x386/0x540 mm/madvise.c:1935
 do_madvise+0x1fa/0x2e0 mm/madvise.c:2028
 __do_sys_madvise mm/madvise.c:2037 [inline]
 __se_sys_madvise mm/madvise.c:2035 [inline]
 __x64_sys_madvise+0xa6/0xc0 mm/madvise.c:2035
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0xe2/0xf80 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7fca09d9acb9
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007ffe78abed08 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
RAX: ffffffffffffffda RBX: 00007fca0a015fa0 RCX: 00007fca09d9acb9
RDX: 0000000000000015 RSI: 0000000000600003 RDI: 0000200000000000
RBP: 00007fca09e08bf7 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007fca0a015fac R14: 00007fca0a015fa0 R15: 00007fca0a015fa0
 </TASK>

***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
  Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.
On Sun, Jan 25, 2026 at 02:13:41PM +0800, syzbot ci wrote:
> syzbot ci has tested the following series
>
> [v1] mm, swap: swap table phase III: remove swap_map
> https://lore.kernel.org/all/20260126-swap-table-p3-v1-0-a74155fab9b0@tencent.com
> * [PATCH 01/12] mm, swap: protect si->swap_file properly and use as a mount indicator
> * [PATCH 02/12] mm, swap: clean up swapon process and locking
> * [PATCH 03/12] mm, swap: remove redundant arguments and locking for enabling a device
> * [PATCH 04/12] mm, swap: consolidate bad slots setup and make it more robust
> * [PATCH 05/12] mm/workingset: leave highest bits empty for anon shadow
> * [PATCH 06/12] mm, swap: implement helpers for reserving data in the swap table
> * [PATCH 07/12] mm, swap: mark bad slots in swap table directly
> * [PATCH 08/12] mm, swap: simplify swap table sanity range check
> * [PATCH 09/12] mm, swap: use the swap table to track the swap count
> * [PATCH 10/12] mm, swap: no need to truncate the scan border
> * [PATCH 11/12] mm, swap: simplify checking if a folio is swapped
> * [PATCH 12/12] mm, swap: no need to clear the shadow explicitly
>
> and found the following issue:
> WARNING in swap_cluster_lock
>
> Full report is available here:
> https://ci.syzbot.org/series/3f6169fc-e24a-4a19-ba56-e5907b448edc
>
> ***
>
> WARNING in swap_cluster_lock
>
> tree: mm-new
> URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/akpm/mm.git
> base: 5a3704ed2dce0b54a7f038b765bb752b87ee8cc2
> arch: amd64
> compiler: Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
> config: https://ci.syzbot.org/builds/0eabd97a-86d8-4606-9d94-dbe4e7fd7c07/config
> C repro: https://ci.syzbot.org/findings/5b039fd0-70da-4954-817d-8bf86315c684/c_repro
> syz repro: https://ci.syzbot.org/findings/5b039fd0-70da-4954-817d-8bf86315c684/syz_repro
>
> ------------[ cut here ]------------
> offset >= si->max
> WARNING: mm/swap.h:88 at __swap_offset_to_cluster mm/swap.h:88 [inline], CPU#1: syz.0.548/6508
> WARNING: mm/swap.h:88 at __swap_cluster_lock mm/swap.h:101 [inline], CPU#1: syz.0.548/6508
> WARNING: mm/swap.h:88 at swap_cluster_lock+0xef/0x130 mm/swap.h:132, CPU#1: syz.0.548/6508
This is a good catch from the bot. It's caused by the patch "[PATCH 10/12]
mm, swap: no need to truncate the scan border". That patch itself is not
wrong; it just needs to update the debug check too:
diff --git a/mm/swap.h b/mm/swap.h
index 087cef49cf69..386a289ef8e7 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -85,7 +85,7 @@ static inline struct swap_cluster_info *__swap_offset_to_cluster(
 		struct swap_info_struct *si, pgoff_t offset)
 {
 	VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
-	VM_WARN_ON_ONCE(offset >= si->max);
+	VM_WARN_ON_ONCE(offset >= roundup(si->max, SWAPFILE_CLUSTER));
 	return &si->cluster_info[offset / SWAPFILE_CLUSTER];
 }
I'll update this in V2.
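
To illustrate the difference between the two checks, here is a minimal user-space sketch. It assumes the warning fires for offsets that lie past si->max but still inside the last, partially filled cluster, which is what the roundup() in the fix suggests. All names and the numbers in the usage note are hypothetical stand-ins, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for roundup(): round x up to a multiple of y. */
static unsigned long sketch_roundup(unsigned long x, unsigned long y)
{
	assert(y > 0);
	return ((x + y - 1) / y) * y;
}

/* Old debug check: warns for any offset at or beyond max. */
static bool sketch_old_check_warns(unsigned long offset, unsigned long max)
{
	return offset >= max;
}

/*
 * Updated debug check: warns only when the offset is past the end of
 * the last cluster, even if that cluster is only partially usable.
 */
static bool sketch_new_check_warns(unsigned long offset, unsigned long max,
				   unsigned long cluster_size)
{
	return offset >= sketch_roundup(max, cluster_size);
}
```

With a hypothetical max of 1000 and a cluster size of 512, offset 1010 falls in the last cluster: the old check warns while the updated one does not. Offset 1024 is past the rounded-up border, so both checks warn.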