This is v5 of the "Swap Tiers" series.
For clarity, this cover letter is structured in two parts:
Part 1 describes the patch series itself (what is implemented in v5).
Part 2 consolidates the design rationale and use case discussion,
including clarification around the memcg-integrated model and
comparison with BPF-based approaches.
This separation is intentional so reviewers can clearly distinguish
between patch introduction and design discussion (for Shakeel's
ongoing feedback).
v4:
https://lore.kernel.org/linux-mm/20260217000950.4015880-1-youngjun.park@lge.com/
Earlier RFC versions:
v3: https://lore.kernel.org/linux-mm/20260131125454.3187546-1-youngjun.park@lge.com/
v2: https://lore.kernel.org/linux-mm/20260126065242.1221862-1-youngjun.park@lge.com/
v1: https://lore.kernel.org/linux-mm/20251109124947.1101520-1-youngjun.park@lge.com/
Earlier approach (per-cgroup swap priority):
RFC: https://lore.kernel.org/linux-mm/aEvLjEInMQC7hEyh@yjaykim-PowerEdge-T330/T/#mbbb6a5e9e30843097e1f5f65fb98f31d582b973d
v1: https://lore.kernel.org/linux-mm/20250716202006.3640584-1-youngjun.park@lge.com/
======================================================================
Part 1: Patch Series Summary
======================================================================
Overview
========
Swap Tiers group swap devices into performance classes (e.g. NVMe,
HDD, Network) and allow per-memcg selection of which tiers to use.
This mechanism was suggested by Chris Li.
This series introduces:
- Core tier infrastructure
- Per-memcg tier assignment (subset of parent)
- memory.swap.tiers and memory.swap.tiers.effective interfaces
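As a rough illustration of the idea only (this is not the code in the series), the sketch below models each tier as one bit in a mask and skips swap devices whose tier is not in the memcg's allowed mask. The struct layout, the names swap_dev and pick_device, and the tier numbering are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: each swap device belongs to exactly one tier. */
struct swap_dev {
	const char *name;
	int prio;           /* higher priority is tried first */
	unsigned int tier;  /* tier index, used as a bit position (< 32) */
};

/*
 * Walk the devices (assumed to be sorted by descending priority) and
 * return the index of the first one whose tier bit is set in @allowed.
 * Returns -1 when no device is permitted by the mask.
 */
static int pick_device(const struct swap_dev *devs, size_t n, uint32_t allowed)
{
	for (size_t i = 0; i < n; i++) {
		if (allowed & (UINT32_C(1) << devs[i].tier))
			return (int)i;
	}
	return -1;
}
```

With three devices in tiers 0, 1 and 2, a mask that contains only bit 2 would skip the first two devices and select the third, and an empty mask selects nothing.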
Changes in v5
=============
- Fixed build errors reported in v4
- Rebased onto up-to-date mm-new
- Minor cleanups
- Added design documentation with validation (from the discussion with
  Shakeel Butt)
Changes in v4 (summary)
=======================
- Simplified control flow and indentation
- Added CONFIG option for MAX_SWAPTIER (default: 4)
- Added memory.swap.tiers.effective interface
- Reworked save/restore logic into snapshot/rollback model
- Removed tier priority modification support (deferred)
- Improved validation and fixed edge cases
- Rebased onto latest mm-new
Deferred / Future Work
======================
- Per-tier swap_active_head to reduce contention (Suggested by Chris Li)
- Fast path and slow path allocation improvement
(this will be introduced after Kairui's work)
Real-world Results
==================
Tested on our internal platform using NBD as a separate swap tier.
This is a simple use case from our first production deployment.
Without tiers:
- No selective control over flash wear
- Cannot selectively assign NBD to specific applications
Cold launch improvement (preloaded vs. baseline):
- App A: 13.17s -> 4.18s (68%)
- App B: 5.60s -> 1.12s (80%)
- App C: 10.25s -> 2.00s (80%)
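The percentages above read as relative reductions in launch time, (before - after) / before. A minimal sketch of that arithmetic follows; the helper name reduction_pct and the use of centiseconds to stay in integer math are choices made here for illustration.

```c
/*
 * Relative reduction in percent, rounded to the nearest integer.
 * Times are given in centiseconds (13.17s -> 1317) so that the
 * computation needs no floating point. @before_cs must be non-zero
 * and not smaller than @after_cs.
 */
static unsigned int reduction_pct(unsigned int before_cs, unsigned int after_cs)
{
	return ((before_cs - after_cs) * 100u + before_cs / 2u) / before_cs;
}
```

Calling reduction_pct(1317, 418), reduction_pct(560, 112) and reduction_pct(1025, 200) reproduces the figures listed for App A, App B and App C.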
Performance impact with no tiers configured:
<1% regression in kernel build and vm-scalability benchmarks
(measured in RFC v2).
======================================================================
Part 2: Design Rationale and Use Cases
======================================================================
Design Rationale
================
Swap tier selection is attached to memcg. A child cgroup may select a
subset of the parent's allowed tiers.
This:
- Preserves cgroup inheritance semantics (boundary at parent,
refinement at child).
- Reuses memcg, which already groups processes and enforces
hierarchical memory limits.
- Aligns with existing memcg swap controls (e.g. memory.swap.max,
memory.zswap.writeback).
- Avoids introducing a parallel swap control hierarchy.
Placing tier control outside memcg (e.g. BPF, syscall, madvise, etc.)
would allow swap preference to diverge from the memcg hierarchy.
Integrating it into memcg keeps swap policy consistent with
existing memory ownership semantics.
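The "boundary at parent, refinement at child" rule can be sketched as a mask intersection along the ancestor chain. This is a simplified model under stated assumptions: struct cg, effective_tiers and the 4-tier limit (taken from the default MAX_SWAPTIER mentioned above) are illustrative, and the series may compute or cache the effective set differently.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: one bit per tier, four tiers in total. */
#define MAX_TIERS 4
#define ALL_TIERS ((UINT32_C(1) << MAX_TIERS) - 1)

struct cg {
	const struct cg *parent;  /* NULL for the root cgroup */
	uint32_t selected;        /* tiers this cgroup asked for */
};

/*
 * Effective set: the cgroup's own selection intersected with the
 * selection of every ancestor, so a child can narrow but never widen
 * what its parent allows.
 */
static uint32_t effective_tiers(const struct cg *cg)
{
	uint32_t mask = ALL_TIERS;

	for (; cg; cg = cg->parent)
		mask &= cg->selected;
	return mask;
}
```

In this model, a child that selects a tier its parent does not allow simply does not see that tier in its effective set.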
Use case #1: Latency separation (our primary deployment scenario)
=================================================================
[ / ]
|
+-- latency-sensitive workload (fast tier)
+-- background workload (slow tier)
The parent defines the memory boundary.
Each workload selects a swap tier via memory.swap.tiers according to
latency requirements.
This prevents latency-sensitive workloads from being swapped to
slow devices used by background workloads.
Use case #2: Per-VM swap selection (Chris Li's deployment scenario)
==================================================================
[ / ]
|
+-- [ Job on VM ] (tiers: zswap, SSD)
|
+-- [ VMM guest memory ] (tiers: SSD)
The parent (job) has access to both zswap and SSD tiers.
The child (VMM guest memory) selects SSD as its swap tier via
memory.swap.tiers. In this deployment, swap device selection
happens at the child level from the parent's available set.
Use case #3: Tier isolation for reduced contention (hypothetical)
=================================================================
[ / ] (tiers: A, B)
|
+-- workload X (tiers: A)
+-- workload Y (tiers: B)
Each child uses a different tier. Since swap paths are separated
per tier, synchronization overhead between the two workloads is
reduced.
How the Current Interface Supports Future Extensions
====================================================
- Intra-tier distribution policy:
Currently, swap devices with the same priority are allocated in a
round-robin fashion. Per-tier policy files under
/sys/kernel/mm/swap/tiers/ can control how devices within a tier
are selected (e.g. round-robin, weighted).
- Inter-tier promotion and demotion:
Promotion and demotion apply between tiers, not within a single
tier. The current interface defines only tier assignment; it does
not yet define when or how pages move between tiers. Two triggering
models are possible:
(a) User-triggered: userspace explicitly initiates migration between
tiers (e.g. via a new interface or existing move_pages semantics).
(b) Kernel-triggered: the kernel moves pages between tiers at
appropriate points such as reclaim or refault.
From the memcg perspective, inter-tier movement is bounded by
memory.swap.tiers.effective -- pages can only be promoted or demoted
to tiers within the memcg's effective set. The specific policy and
triggering mechanism require further discussion and are not part of
this series.
- Per-VMA or per-process swap hints:
A future madvise-style hint (e.g. MADV_SWAP_TIER) could reference
the tier indices in /sys/kernel/mm/swap/tiers/. At reclaim time,
the kernel would check the VMA hint against the memcg's effective
tier set to pick the swap-out target.
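The common thread in the extensions above is that any hint or movement target is clamped to the memcg's effective tier set. A minimal sketch of that check follows. It assumes a hypothetical resolve_tier helper and a hypothetical fallback policy (the lowest-numbered allowed tier); neither is defined by this series.

```c
#include <stdint.h>

#define TIER_NONE (-1)

/*
 * Resolve a requested tier (e.g. from a per-VMA hint or a demotion
 * target) against the memcg's effective mask. If the requested tier is
 * allowed, use it. Otherwise fall back to the lowest-numbered allowed
 * tier (hypothetical policy). Returns TIER_NONE if the mask is empty.
 */
static int resolve_tier(int hint, uint32_t effective)
{
	if (hint >= 0 && hint < 32 && (effective & (UINT32_C(1) << hint)))
		return hint;

	for (int t = 0; t < 32; t++) {
		if (effective & (UINT32_C(1) << t))
			return t;
	}
	return TIER_NONE;
}
```

A hint outside the effective set is never honored in this model, which is the property the memcg-bounded design relies on.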
BPF Comparison
==============
The use cases described above already rely on memcg for swap tier
control, and real deployments are built around this model.
A BPF-based approach has additional considerations:
- Hierarchy consistency: BPF programs operate outside the memcg
tree. Without explicit constraints, a BPF selector could
contradict parent tier restrictions. Edge cases such as zombie
memcgs make the resolution less clear.
- Deployment scope: requiring BPF for core swap behavior may not
be suitable for constrained or embedded configurations.
BPF could still work as an extension on top of the tier model
in the future.
Youngjun Park (4):
mm: swap: introduce swap tier infrastructure
mm: swap: associate swap devices with tiers
mm: memcontrol: add interfaces for swap tier selection
mm: swap: filter swap allocation by memcg tier mask
Documentation/admin-guide/cgroup-v2.rst | 27 ++
Documentation/mm/swap-tier.rst | 159 +++++++++
MAINTAINERS | 3 +
include/linux/memcontrol.h | 3 +-
include/linux/swap.h | 1 +
mm/Kconfig | 12 +
mm/Makefile | 2 +-
mm/memcontrol.c | 95 +++++
mm/swap.h | 4 +
mm/swap_state.c | 75 ++++
mm/swap_tier.c | 451 ++++++++++++++++++++++++
mm/swap_tier.h | 74 ++++
mm/swapfile.c | 23 +-
13 files changed, 923 insertions(+), 6 deletions(-)
create mode 100644 Documentation/mm/swap-tier.rst
create mode 100644 mm/swap_tier.c
create mode 100644 mm/swap_tier.h
base-commit: 6381a729fa7dda43574d93ab9c61cec516dd885b
--
2.34.1
On Thu, 26 Mar 2026 02:54:49 +0900 Youngjun Park <youngjun.park@lge.com> wrote:

> This is v5 of the "Swap Tiers" series.

Thanks. I'd prefer to hold off until the next cycle, please. As I
mentioned in

https://lkml.kernel.org/r/20260323202941.08ddf2b0411501cae801ab4c@linux-foundation.org

Also, AI review had a lot to say, Please take a look. Should you do
so, I'm interested in learning how much of that material was useful.
Thanks.

https://sashiko.dev/#/patchset/20260325175453.2523280-1-youngjun.park%40lge.com
On Wed, Mar 25, 2026 at 04:20:03PM -0700, Andrew Morton wrote:
> On Thu, 26 Mar 2026 02:54:49 +0900 Youngjun Park <youngjun.park@lge.com> wrote:
>
> > This is v5 of the "Swap Tiers" series.
>
> Thanks. I'd prefer to hold off until the next cycle, please. As I
> mentioned in
>
> https://lkml.kernel.org/r/20260323202941.08ddf2b0411501cae801ab4c@linux-foundation.org
>
> Also, AI review had a lot to say, Please take a look. Should you do
> so, I'm interested in learning how much of that material was useful.
> Thanks.
>
> https://sashiko.dev/#/patchset/20260325175453.2523280-1-youngjun.park%40lge.com

Hi Andrew,

Understood. I'll address the AI review comments and run syzbot CI,
then resubmit for the next cycle.

Thanks,
Youngjun Park
syzbot ci has tested the following series

[v5] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
https://lore.kernel.org/all/20260325175453.2523280-1-youngjun.park@lge.com
* [PATCH v5 1/4] mm: swap: introduce swap tier infrastructure
* [PATCH v5 2/4] mm: swap: associate swap devices with tiers
* [PATCH v5 3/4] mm: memcontrol: add interfaces for swap tier selection
* [PATCH v5 4/4] mm: swap: filter swap allocation by memcg tier mask

and found the following issue:
WARNING in folio_tier_effective_mask

Full report is available here:
https://ci.syzbot.org/series/6ed50ca2-a106-41e9-aa4d-7c46869e0011

***

WARNING in folio_tier_effective_mask

tree:      mm-new
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/akpm/mm.git
base:      6381a729fa7dda43574d93ab9c61cec516dd885b
arch:      amd64
compiler:  Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config:    https://ci.syzbot.org/builds/e5c66fa8-a7fd-4809-9564-448847b5f230/config
C repro:   https://ci.syzbot.org/findings/d64cc6fa-636a-40a0-b131-d02ce1129494/c_repro
syz repro: https://ci.syzbot.org/findings/d64cc6fa-636a-40a0-b131-d02ce1129494/syz_repro

------------[ cut here ]------------
debug_locks && !(rcu_read_lock_held() || lock_is_held(&(&cgroup_mutex)->dep_map))
WARNING: ./include/linux/memcontrol.h:377 at obj_cgroup_memcg include/linux/memcontrol.h:377 [inline], CPU#1: syz.0.17/5955
WARNING: ./include/linux/memcontrol.h:377 at folio_memcg include/linux/memcontrol.h:431 [inline], CPU#1: syz.0.17/5955
WARNING: ./include/linux/memcontrol.h:377 at folio_tier_effective_mask+0x175/0x210 mm/swap_tier.h:63, CPU#1: syz.0.17/5955
Modules linked in:
CPU: 1 UID: 0 PID: 5955 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:obj_cgroup_memcg include/linux/memcontrol.h:377 [inline]
RIP: 0010:folio_memcg include/linux/memcontrol.h:431 [inline]
RIP: 0010:folio_tier_effective_mask+0x175/0x210 mm/swap_tier.h:63
Code: 0f b6 04 20 84 c0 75 6b 8b 03 eb 0a e8 04 b8 9e ff b8 ff ff ff ff 5b 41 5c 41 5e 41 5f 5d c3 cc cc cc cc cc e8 ec b7 9e ff 90 <0f> 0b 90 eb 9b 44 89 f9 80 e1 07 80 c1 03 38 c1 0f 8c c2 fe ff ff
RSP: 0018:ffffc90004bee6d0 EFLAGS: 00010293
RAX: ffffffff8226dd04 RBX: ffff888113589280 RCX: ffff8881727b8000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000000 R08: ffffea0006c62207 R09: 1ffffd4000d8c440
R10: dffffc0000000000 R11: fffff94000d8c441 R12: dffffc0000000000
R13: ffffea0006c62208 R14: ffffea0006c62200 R15: ffffea0006c62230
FS:  00005555771cb500(0000) GS:ffff8882a9462000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b2ed63fff CR3: 0000000112d86000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 swap_alloc_fast mm/swapfile.c:1355 [inline]
 folio_alloc_swap+0x392/0x13a0 mm/swapfile.c:1735
 shrink_folio_list+0x26a7/0x5250 mm/vmscan.c:1281
 reclaim_folio_list+0x100/0x460 mm/vmscan.c:2171
 reclaim_pages+0x45b/0x530 mm/vmscan.c:2208
 madvise_cold_or_pageout_pte_range+0x1ef5/0x2220 mm/madvise.c:563
 walk_pmd_range mm/pagewalk.c:142 [inline]
 walk_pud_range mm/pagewalk.c:233 [inline]
 walk_p4d_range mm/pagewalk.c:275 [inline]
 walk_pgd_range+0xfdc/0x1d90 mm/pagewalk.c:316
 __walk_page_range+0x14c/0x710 mm/pagewalk.c:424
 walk_page_range_vma_unsafe+0x309/0x410 mm/pagewalk.c:728
 madvise_pageout_page_range mm/madvise.c:622 [inline]
 madvise_pageout mm/madvise.c:647 [inline]
 madvise_vma_behavior+0x28b9/0x42c0 mm/madvise.c:1358
 madvise_walk_vmas+0x573/0xae0 mm/madvise.c:1713
 madvise_do_behavior+0x386/0x540 mm/madvise.c:1929
 do_madvise+0x1fa/0x2e0 mm/madvise.c:2022
 __do_sys_madvise mm/madvise.c:2031 [inline]
 __se_sys_madvise mm/madvise.c:2029 [inline]
 __x64_sys_madvise+0xa6/0xc0 mm/madvise.c:2029
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x14d/0xf80 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f5e5af9c799
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fff0c8e2708 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
RAX: ffffffffffffffda RBX: 00007f5e5b215fa0 RCX: 00007f5e5af9c799
RDX: 0000000000000015 RSI: 0000000000600000 RDI: 0000200000000000
RBP: 00007f5e5b032c99 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f5e5b215fac R14: 00007f5e5b215fa0 R15: 00007f5e5b215fa0
 </TASK>

***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
  Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.