[PATCH v5 0/4] mm/swap, memcg: Introduce swap tiers for cgroup based swap control

Youngjun Park posted 4 patches 1 week ago
Documentation/admin-guide/cgroup-v2.rst |  27 ++
Documentation/mm/swap-tier.rst          | 159 +++++++++
MAINTAINERS                             |   3 +
include/linux/memcontrol.h              |   3 +-
include/linux/swap.h                    |   1 +
mm/Kconfig                              |  12 +
mm/Makefile                             |   2 +-
mm/memcontrol.c                         |  95 +++++
mm/swap.h                               |   4 +
mm/swap_state.c                         |  75 ++++
mm/swap_tier.c                          | 451 ++++++++++++++++++++++++
mm/swap_tier.h                          |  74 ++++
mm/swapfile.c                           |  23 +-
13 files changed, 923 insertions(+), 6 deletions(-)
create mode 100644 Documentation/mm/swap-tier.rst
create mode 100644 mm/swap_tier.c
create mode 100644 mm/swap_tier.h
[PATCH v5 0/4] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
Posted by Youngjun Park 1 week ago
This is v5 of the "Swap Tiers" series.
For clarity, this cover letter is structured in two parts:

  Part 1 describes the patch series itself (what is implemented in v5).
  Part 2 consolidates the design rationale and use case discussion,
  including clarification around the memcg-integrated model and
  comparison with BPF-based approaches.

This separation is intentional so reviewers can clearly distinguish
between patch introduction and design discussion (for Shakeel's
ongoing feedback).

v4:
  https://lore.kernel.org/linux-mm/20260217000950.4015880-1-youngjun.park@lge.com/

Earlier RFC versions:
  v3: https://lore.kernel.org/linux-mm/20260131125454.3187546-1-youngjun.park@lge.com/
  v2: https://lore.kernel.org/linux-mm/20260126065242.1221862-1-youngjun.park@lge.com/
  v1: https://lore.kernel.org/linux-mm/20251109124947.1101520-1-youngjun.park@lge.com/

Earlier Approach (per cgroup swap priority)
  RFC: https://lore.kernel.org/linux-mm/aEvLjEInMQC7hEyh@yjaykim-PowerEdge-T330/T/#mbbb6a5e9e30843097e1f5f65fb98f31d582b973d
  v1: https://lore.kernel.org/linux-mm/20250716202006.3640584-1-youngjun.park@lge.com/
======================================================================
Part 1: Patch Series Summary
======================================================================

Overview
========
Swap Tiers group swap devices into performance classes (e.g. NVMe,
HDD, Network) and allow per-memcg selection of which tiers to use.
This mechanism was suggested by Chris Li.

This series introduces:

- Core tier infrastructure
- Per-memcg tier assignment (subset of parent)
- memory.swap.tiers and memory.swap.tiers.effective interfaces

Changes in v5
=============
- Fixed build errors reported in v4
- Rebased onto up-to-date mm-new
- Minor cleanups
- Added design docs with validation (following the discussion with
  Shakeel Butt)

Changes in v4 (summary)
=======================
- Simplified control flow and indentation
- Added CONFIG option for MAX_SWAPTIER (default: 4)
- Added memory.swap.tiers.effective interface
- Reworked save/restore logic into snapshot/rollback model
- Removed tier priority modification support (deferred)
- Improved validation and fixed edge cases
- Rebased onto latest mm-new

Deferred / Future Work
======================
- Per-tier swap_active_head to reduce contention (suggested by Chris Li)
- Fast-path and slow-path allocation improvements
  (to be introduced after Kairui's work)

Real-world Results
==================
Tested on our internal platform using NBD as a separate swap tier.
This is a simple use case from our first production deployment.

Without tiers:
- No selective control over flash wear
- Cannot selectively assign NBD to specific applications

Cold launch time, baseline -> preloaded (reduction):
- App A: 13.17s -> 4.18s (68%)
- App B: 5.60s -> 1.12s (80%)
- App C: 10.25s -> 2.00s (80%)

Performance impact with no tiers configured:
<1% regression in kernel build and vm-scalability benchmarks
(measured in RFC v2).

======================================================================
Part 2: Design Rationale and Use Cases
======================================================================

Design Rationale
================
Swap tier selection is attached to memcg. A child cgroup may select a
subset of the parent's allowed tiers.

This:
- Preserves cgroup inheritance semantics (boundary at parent,
  refinement at child).
- Reuses memcg, which already groups processes and enforces
  hierarchical memory limits.
- Aligns with existing memcg swap controls (e.g. swap.max, zswap.writeback)
- Avoids introducing a parallel swap control hierarchy.

Placing tier control outside memcg (e.g. BPF, a syscall, madvise, etc.)
would allow swap preference to diverge from the memcg hierarchy.
Integrating it into memcg keeps swap policy consistent with
existing memory ownership semantics.
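
The subset rule can be pictured as a bitmask intersection along the
cgroup path. The following userspace sketch is only an illustration
under assumptions: struct cg, the TIER_* bits, and effective_tiers()
are hypothetical names, not identifiers from this series, and the real
code may instead reject an out-of-bounds selection at write time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tier bits, chosen only for this example. */
enum {
	TIER_ZSWAP = 1u << 0,
	TIER_SSD   = 1u << 1,
	TIER_HDD   = 1u << 2,
};

/* Hypothetical stand-in for a memcg: a parent link plus its own selection. */
struct cg {
	const struct cg *parent;
	unsigned int tiers;	/* tiers this cgroup asked for */
};

/*
 * Effective set = own selection intersected with every ancestor's
 * selection, so a child can only refine the boundary set by its parent.
 */
static unsigned int effective_tiers(const struct cg *cg)
{
	unsigned int mask = ~0u;

	assert(cg != NULL);
	for (; cg; cg = cg->parent)
		mask &= cg->tiers;
	return mask;
}
```

With a parent allowing zswap and SSD, a child selecting SSD ends up
with SSD only, and a child selecting a tier the parent lacks ends up
with an empty set in this model.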

Use case #1: Latency separation (our primary deployment scenario)
=================================================================
  [ / ]
     |
     +-- latency-sensitive workload  (fast tier)
     +-- background workload         (slow tier)

The parent defines the memory boundary.
Each workload selects a swap tier via memory.swap.tiers according to
latency requirements.

This prevents latency-sensitive workloads from being swapped to
slow devices used by background workloads.
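
As a rough illustration of how a tier mask could keep the
latency-sensitive workload off slow devices, the sketch below walks a
device list in order and skips devices whose tier is not in the
caller's mask. struct swap_dev, pick_device(), and the tier bits are
hypothetical; the actual filtering in this series lives in the swap
allocation path and is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

#define TIER_FAST (1u << 0)	/* hypothetical: e.g. NVMe */
#define TIER_SLOW (1u << 1)	/* hypothetical: e.g. NBD or HDD */

/* Hypothetical, simplified view of a swap device. */
struct swap_dev {
	const char *name;
	unsigned int tier;	/* exactly one tier bit */
	int has_space;		/* nonzero if a slot can be allocated */
};

/* Return the first device in list order that the mask allows and has room. */
static const struct swap_dev *pick_device(const struct swap_dev *devs,
					  size_t n, unsigned int allowed)
{
	assert(devs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!(devs[i].tier & allowed))
			continue;	/* tier not selected by this cgroup */
		if (devs[i].has_space)
			return &devs[i];
	}
	return NULL;	/* no eligible device */
}
```

In this model a cgroup restricted to the fast tier gets no device at
all when the fast device is full, rather than spilling over to the
slow one.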

Use case #2: Per-VM swap selection (Chris Li's deployment scenario)
==================================================================
  [ / ]
     |
     +-- [ Job on VM ]              (tiers: zswap, SSD)
            |
            +-- [ VMM guest memory ]  (tiers: SSD)

The parent (job) has access to both zswap and SSD tiers.
The child (VMM guest memory) selects SSD as its swap tier via
memory.swap.tiers. In this deployment, swap device selection
happens at the child level from the parent's available set.


Use case #3: Tier isolation for reduced contention (hypothetical)
=================================================================
  [ / ]                    (tiers: A, B)
     |
     +-- workload X        (tiers: A)
     +-- workload Y        (tiers: B)

Each child uses a different tier. Since swap paths are separated
per tier, synchronization overhead between the two workloads is
reduced.

How the Current Interface Supports Future Extensions
====================================================

- Intra-tier distribution policy:
  Currently, swap devices with the same priority are allocated in a
  round-robin fashion. In the future, per-tier policy files under
  /sys/kernel/mm/swap/tiers/ could control how devices within a tier
  are selected (e.g. round-robin, weighted).

- Inter-tier promotion and demotion:
  Promotion and demotion apply between tiers, not within a single
  tier. The current interface defines only tier assignment; it does
  not yet define when or how pages move between tiers. Two triggering
  models are possible:

  (a) User-triggered: userspace explicitly initiates migration between
      tiers (e.g. via a new interface or existing move_pages semantics).
  (b) Kernel-triggered: the kernel moves pages between tiers at
      appropriate points such as reclaim or refault.

  From the memcg perspective, inter-tier movement is bounded by
  memory.swap.tiers.effective -- pages can only be promoted or demoted
  to tiers within the memcg's effective set. The specific policy and
  triggering mechanism require further discussion and are not part of
  this series.

- Per-VMA or per-process swap hints:
  A future madvise-style hint (e.g. MADV_SWAP_TIER) could reference
  the tier indices in /sys/kernel/mm/swap/tiers/. At reclaim time,
  the kernel would check the VMA hint against the memcg's effective
  tier set to pick the swap-out target.
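
To make the bounding rule concrete, here is a small sketch of how a
per-VMA hint might be reconciled with a memcg's effective set.
Everything in it is an assumption for illustration: resolve_tiers() is
a hypothetical helper, MADV_SWAP_TIER does not exist today, and
falling back to the full effective set when the hint is unusable is
just one possible policy, not something this series defines.

```c
#include <assert.h>

/*
 * Hypothetical helper: narrow the memcg's effective tier mask by a VMA
 * hint. A hint can never widen the set; if it names no permitted tier,
 * it is ignored and the effective set is used unchanged.
 */
static unsigned int resolve_tiers(unsigned int effective,
				  unsigned int vma_hint)
{
	unsigned int narrowed = effective & vma_hint;

	/* Invariant: the result never exceeds the memcg's effective set. */
	assert((narrowed & ~effective) == 0);
	return narrowed ? narrowed : effective;
}
```

The same intersection would bound promotion and demotion targets: a
destination tier is acceptable only if its bit is set in the effective
mask.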

BPF Comparison
==============
The use cases described above already rely on memcg for swap tier
control, and real deployments are built around this model.
A BPF-based approach has additional considerations:

- Hierarchy consistency: BPF programs operate outside the memcg
  tree. Without explicit constraints, a BPF selector could
  contradict parent tier restrictions. Edge cases such as zombie
  memcgs make the resolution less clear.
- Deployment scope: requiring BPF for core swap behavior may not
  be suitable for constrained or embedded configurations.

BPF could still work as an extension on top of the tier model
in the future.

Youngjun Park (4):
  mm: swap: introduce swap tier infrastructure
  mm: swap: associate swap devices with tiers
  mm: memcontrol: add interfaces for swap tier selection
  mm: swap: filter swap allocation by memcg tier mask

base-commit: 6381a729fa7dda43574d93ab9c61cec516dd885b 
-- 
2.34.1
Re: [PATCH v5 0/4] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
Posted by Andrew Morton 1 week ago
On Thu, 26 Mar 2026 02:54:49 +0900 Youngjun Park <youngjun.park@lge.com> wrote:

> This is v5 of the "Swap Tiers" series.

Thanks.  I'd prefer to hold off until the next cycle, please.  As I
mentioned in 

https://lkml.kernel.org/r/20260323202941.08ddf2b0411501cae801ab4c@linux-foundation.org

Also, AI review had a lot to say, Please take a look.  Should you do
so, I'm interested in learning how much of that material was useful. 
Thanks.

https://sashiko.dev/#/patchset/20260325175453.2523280-1-youngjun.park%40lge.com
Re: [PATCH v5 0/4] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
Posted by YoungJun Park 1 week ago
On Wed, Mar 25, 2026 at 04:20:03PM -0700, Andrew Morton wrote:
> On Thu, 26 Mar 2026 02:54:49 +0900 Youngjun Park <youngjun.park@lge.com> wrote:
> 
> > This is v5 of the "Swap Tiers" series.
> 
> Thanks.  I'd prefer to hold off until the next cycle, please.  As I
> mentioned in 
> 
> https://lkml.kernel.org/r/20260323202941.08ddf2b0411501cae801ab4c@linux-foundation.org
> 
> Also, AI review had a lot to say, Please take a look.  Should you do
> so, I'm interested in learning how much of that material was useful. 
> Thanks.
> 
> https://sashiko.dev/#/patchset/20260325175453.2523280-1-youngjun.park%40lge.com

Hi Andrew,

Understood. I'll address the AI review comments and run syzbot CI,
then resubmit for the next cycle.

Thanks,
Youngjun Park
[syzbot ci] Re: mm/swap, memcg: Introduce swap tiers for cgroup based swap control
Posted by syzbot ci 1 week ago
syzbot ci has tested the following series

[v5] mm/swap, memcg: Introduce swap tiers for cgroup based swap control
https://lore.kernel.org/all/20260325175453.2523280-1-youngjun.park@lge.com
* [PATCH v5 1/4] mm: swap: introduce swap tier infrastructure
* [PATCH v5 2/4] mm: swap: associate swap devices with tiers
* [PATCH v5 3/4] mm: memcontrol: add interfaces for swap tier selection
* [PATCH v5 4/4] mm: swap: filter swap allocation by memcg tier mask

and found the following issue:
WARNING in folio_tier_effective_mask

Full report is available here:
https://ci.syzbot.org/series/6ed50ca2-a106-41e9-aa4d-7c46869e0011

***

WARNING in folio_tier_effective_mask

tree:      mm-new
URL:       https://kernel.googlesource.com/pub/scm/linux/kernel/git/akpm/mm.git
base:      6381a729fa7dda43574d93ab9c61cec516dd885b
arch:      amd64
compiler:  Debian clang version 21.1.8 (++20251221033036+2078da43e25a-1~exp1~20251221153213.50), Debian LLD 21.1.8
config:    https://ci.syzbot.org/builds/e5c66fa8-a7fd-4809-9564-448847b5f230/config
C repro:   https://ci.syzbot.org/findings/d64cc6fa-636a-40a0-b131-d02ce1129494/c_repro
syz repro: https://ci.syzbot.org/findings/d64cc6fa-636a-40a0-b131-d02ce1129494/syz_repro

------------[ cut here ]------------
debug_locks && !(rcu_read_lock_held() || lock_is_held(&(&cgroup_mutex)->dep_map))
WARNING: ./include/linux/memcontrol.h:377 at obj_cgroup_memcg include/linux/memcontrol.h:377 [inline], CPU#1: syz.0.17/5955
WARNING: ./include/linux/memcontrol.h:377 at folio_memcg include/linux/memcontrol.h:431 [inline], CPU#1: syz.0.17/5955
WARNING: ./include/linux/memcontrol.h:377 at folio_tier_effective_mask+0x175/0x210 mm/swap_tier.h:63, CPU#1: syz.0.17/5955
Modules linked in:
CPU: 1 UID: 0 PID: 5955 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
RIP: 0010:obj_cgroup_memcg include/linux/memcontrol.h:377 [inline]
RIP: 0010:folio_memcg include/linux/memcontrol.h:431 [inline]
RIP: 0010:folio_tier_effective_mask+0x175/0x210 mm/swap_tier.h:63
Code: 0f b6 04 20 84 c0 75 6b 8b 03 eb 0a e8 04 b8 9e ff b8 ff ff ff ff 5b 41 5c 41 5e 41 5f 5d c3 cc cc cc cc cc e8 ec b7 9e ff 90 <0f> 0b 90 eb 9b 44 89 f9 80 e1 07 80 c1 03 38 c1 0f 8c c2 fe ff ff
RSP: 0018:ffffc90004bee6d0 EFLAGS: 00010293
RAX: ffffffff8226dd04 RBX: ffff888113589280 RCX: ffff8881727b8000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000000 R08: ffffea0006c62207 R09: 1ffffd4000d8c440
R10: dffffc0000000000 R11: fffff94000d8c441 R12: dffffc0000000000
R13: ffffea0006c62208 R14: ffffea0006c62200 R15: ffffea0006c62230
FS:  00005555771cb500(0000) GS:ffff8882a9462000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b2ed63fff CR3: 0000000112d86000 CR4: 00000000000006f0
Call Trace:
 <TASK>
 swap_alloc_fast mm/swapfile.c:1355 [inline]
 folio_alloc_swap+0x392/0x13a0 mm/swapfile.c:1735
 shrink_folio_list+0x26a7/0x5250 mm/vmscan.c:1281
 reclaim_folio_list+0x100/0x460 mm/vmscan.c:2171
 reclaim_pages+0x45b/0x530 mm/vmscan.c:2208
 madvise_cold_or_pageout_pte_range+0x1ef5/0x2220 mm/madvise.c:563
 walk_pmd_range mm/pagewalk.c:142 [inline]
 walk_pud_range mm/pagewalk.c:233 [inline]
 walk_p4d_range mm/pagewalk.c:275 [inline]
 walk_pgd_range+0xfdc/0x1d90 mm/pagewalk.c:316
 __walk_page_range+0x14c/0x710 mm/pagewalk.c:424
 walk_page_range_vma_unsafe+0x309/0x410 mm/pagewalk.c:728
 madvise_pageout_page_range mm/madvise.c:622 [inline]
 madvise_pageout mm/madvise.c:647 [inline]
 madvise_vma_behavior+0x28b9/0x42c0 mm/madvise.c:1358
 madvise_walk_vmas+0x573/0xae0 mm/madvise.c:1713
 madvise_do_behavior+0x386/0x540 mm/madvise.c:1929
 do_madvise+0x1fa/0x2e0 mm/madvise.c:2022
 __do_sys_madvise mm/madvise.c:2031 [inline]
 __se_sys_madvise mm/madvise.c:2029 [inline]
 __x64_sys_madvise+0xa6/0xc0 mm/madvise.c:2029
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x14d/0xf80 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f5e5af9c799
Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fff0c8e2708 EFLAGS: 00000246 ORIG_RAX: 000000000000001c
RAX: ffffffffffffffda RBX: 00007f5e5b215fa0 RCX: 00007f5e5af9c799
RDX: 0000000000000015 RSI: 0000000000600000 RDI: 0000200000000000
RBP: 00007f5e5b032c99 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007f5e5b215fac R14: 00007f5e5b215fa0 R15: 00007f5e5b215fa0
 </TASK>


***

If these findings have caused you to resend the series or submit a
separate fix, please add the following tag to your commit message:
  Tested-by: syzbot@syzkaller.appspotmail.com

---
This report is generated by a bot. It may contain errors.
syzbot ci engineers can be reached at syzkaller@googlegroups.com.