[v8] mm/swap, memcg: Introduce swap tiers for cgroup based swap control

[PATCH v8 0/4] mm/swap, memcg: Introduce swap tiers for cgroup based swap control

Posted by Youngjun Park 2 days, 16 hours ago

This is the v8 series of the swap tier patchset.

Many thanks to Shakeel Butt and Yosry for the reviews and discussions [1].
The main change in this version is the interface change to use
memory.swap.tiers.max with '0' (disable) and 'max' (enable) values.
This mechanism was suggested by Shakeel and Yosry.

This change allows for future extensions to control swap
between tiers and aligns better with existing memcg interfaces.
Even with this memcg interface change, only patch #3 needed updates.
Internally, patch #3 still uses the existing mask processing method
(which is implementation-efficient), so only the user-facing interface
was modified.

We also discussed tier extensions. Thanks to Yosry, Nhat and Shakeel for their
valuable feedback.

Here is a brief summary of our tentative conclusions. Please correct me
if anything is misrepresented (details in references):

* Zswap tiering [2]:
  Tiering applies only to the vswap + zswap combo. Zswap itself will
  not be tiered, as the current architecture requires a physical device
  for zswap allocation.
* Vswap tiering [3]:
  Vswap should be handled transparently to the user. Vswap itself will
  not be tiered for now, but it may be supported someday if a strong,
  real use case emerges.
* Relationship with zswap.writeback [4]:
  If zswap tiering is introduced, it could replace the zswap-only tier.
  However, since zswap cannot be tiered independently, zswap.writeback
  is still needed for non-vswap cases. Separately, the internal logic
  could potentially be integrated into the tiering logic.
* Tier demotion [5]:
  A separate interface like memory.swap.tiers.demotion might be needed.
  For now, we only support 0/max to enable/disable tiers. In the future,
  we could introduce an "auto" mode to automatically scale the limit
  based on swapfile size and memory.swap.max, similar to the direction
  memory tiering is heading in.

I plan to apply the swap tier infrastructure and the first use case
(cgroup-based swap control) first, and continue following up on the
discussions above.

Overview
========

Swap Tiers group swap devices into performance classes (e.g. NVMe,
HDD, Network) and allow per-memcg selection of which tiers to use.
This mechanism was suggested by Chris Li.

Design Rationale
================

Swap tier selection is attached to memcg. A child cgroup may select a
subset of the parent's allowed tiers.

This approach:
- Preserves cgroup inheritance semantics (boundary at parent,
  refinement at child).
- Reuses memcg, which already groups processes and enforces
  hierarchical memory limits.
- Aligns with existing memcg swap controls (e.g. swap.max, zswap.writeback).
- Avoids introducing a parallel swap control hierarchy.
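
The "boundary at parent, refinement at child" rule can be illustrated with a
toy model. This is only a sketch, not the kernel implementation: the Cgroup
class, the tier names, and the bit assignments are hypothetical, and it
assumes each tier is one bit in a mask and that a cgroup's effective mask is
its own selection ANDed with its parent's effective mask.

```python
# Toy model of hierarchical tier selection (assumption: one bit per tier).

class Cgroup:
    def __init__(self, name, tier_mask, parent=None):
        self.name = name
        self.tier_mask = tier_mask  # tiers this cgroup asks for
        self.parent = parent

    def effective_mask(self):
        # A child can only narrow the set its parent is allowed to use.
        if self.parent is None:
            return self.tier_mask
        return self.tier_mask & self.parent.effective_mask()


# Hypothetical tier bit assignments.
NVME, HDD, NET = 1 << 0, 1 << 1, 1 << 2

root = Cgroup("root", NVME | HDD | NET)
job = Cgroup("job", NVME | HDD, parent=root)
guest = Cgroup("guest", HDD | NET, parent=job)  # NET is not allowed by "job"

print(bin(job.effective_mask()))    # 0b11 (NVME and HDD)
print(bin(guest.effective_mask()))  # 0b10 (HDD only)
```

In this model the child's request for NET is silently dropped because the
parent does not allow it; how the real interface reports such a request is
not covered here.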

Placing tier control outside memcg (e.g., via BPF, syscalls, or
madvise) would allow swap preference to diverge from the memcg
hierarchy. Integrating it into memcg keeps the swap policy
consistent with existing memory ownership semantics. There are
also real use cases built around memcg.

In the future, this can be extended to other interfaces to cover
additional use cases.

I believe a memcg-based swap control is a good starting point
before such extensions.

Use Cases
=========

#1: Latency separation (our primary deployment scenario)
  [ / ]
     |
     +-- latency-sensitive workload  (fast tier)
     +-- background workload         (slow tier)

The parent defines the memory boundary.
Each workload selects a swap tier via memory.swap.tiers.max according to
latency requirements.

This prevents latency-sensitive workloads from being swapped to
slow devices used by background workloads.

#2: Per-VM swap selection (Chris Li's deployment scenario)
  [ / ]
     |
     +-- [ Job on VM ]              (tiers: zswap, SSD)
            |
            +-- [ VMM guest memory ]  (tiers: SSD)

The parent (job) has access to both zswap and SSD tiers.
The child (VMM guest memory) selects SSD as its swap tier via
memory.swap.tiers.max. In this deployment, swap device selection
happens at the child level from the parent's available set.

#3: Tier isolation for reduced contention (hypothetical)
  [ / ]                    (tiers: A, B)
     |
     +-- workload X        (tiers: A)
     +-- workload Y        (tiers: B)

Each child uses a different tier. Since swap paths are separated
per tier, synchronization overhead between the two workloads is
reduced.

Future extension
================

#1: Intra-tier distribution policy:
  Currently, swap devices with the same priority are allocated in a
  round-robin fashion. Per-tier policy files under
  /sys/kernel/mm/swap/tiers/ can control how devices within a tier
  are selected (e.g. round-robin, weighted).
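
The difference between the two example policies can be sketched in userspace
terms. This is an illustration only, assuming a tier is a plain list of
device names; the device names, the weights, and both helper functions are
hypothetical and do not correspond to any existing kernel code or sysfs file.

```python
import itertools


def round_robin(devices):
    """Rotate through the devices, one allocation each."""
    return itertools.cycle(devices)


def weighted_round_robin(devices, weights):
    """Rotate through the devices, giving each one `weight` turns per cycle."""
    slots = [dev for dev, w in zip(devices, weights) for _ in range(w)]
    return itertools.cycle(slots)


devices = ["nvme0", "nvme1"]  # hypothetical devices in one tier

rr = round_robin(devices)
print([next(rr) for _ in range(4)])
# ['nvme0', 'nvme1', 'nvme0', 'nvme1']

wrr = weighted_round_robin(devices, [3, 1])
print([next(wrr) for _ in range(8)])
# ['nvme0', 'nvme0', 'nvme0', 'nvme1', 'nvme0', 'nvme0', 'nvme0', 'nvme1']
```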

#2: Inter-tier promotion and demotion:
  Promotion and demotion apply between tiers, not within a single
  tier. The current interface defines only tier assignment; it does
  not yet define when or how pages move between tiers. Two triggering
  models are possible:

  (a) User-triggered: userspace explicitly initiates migration between
      tiers (e.g. via a new interface or existing move_pages semantics).
  (b) Kernel-triggered: the kernel moves pages between tiers at
      appropriate points such as reclaim or refault.

#3: Per-VMA, per-process swap and BPF:
  Tier selection need not be limited to memcg. It could be extended to
  per-VMA or per-process swap, or driven by a BPF program.

#4: Zswap and vswap tiering:
  Tiering applies to the vswap + zswap combination.

#5: Vswap on/off control:
  Currently not supported. If a strong use case arises where vswap needs
  to be controlled by memcg, the tier interface could be used for it.

Experimentation
===============

Tested on our internal platform using NBD as a separate swap tier.
This is our first production use case, and a simple one.

Without tiers:
- No selective control over flash wear
- Cannot selectively assign NBD to specific applications

Cold launch improvement (preloaded vs. baseline):
- App A: 13.17s -> 4.18s (68%)
- App B: 5.60s -> 1.12s (80%)
- App C: 10.25s -> 2.00s (80%)
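
The percentages above can be reproduced from the timings. The short check
below assumes the first number in each pair is the baseline and the second
is the preloaded case; improvement_pct is a helper name introduced here for
illustration.

```python
def improvement_pct(baseline_s, preloaded_s):
    """Relative reduction in launch time, as a percentage of the baseline."""
    return (baseline_s - preloaded_s) / baseline_s * 100


launches = {
    "App A": (13.17, 4.18),
    "App B": (5.60, 1.12),
    "App C": (10.25, 2.00),
}

for app, (before, after) in launches.items():
    print(f"{app}: {improvement_pct(before, after):.0f}%")
# App A: 68%
# App B: 80%
# App C: 80%
```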

Performance impact with no tiers configured:
<1% regression in kernel build and vm-scalability benchmarks

Change log
==========

v8
- Changed the memcg interface to memory.swap.tiers.max.
  Values are '0' (disable) and 'max' (enable). Default is 'max'.
- Addressed Sashiko's review: update the mask value atomically in a single
  step, and read the mask value while holding the lock.
- Collected review tags from Kairui and Nhat.
- Rebase on recent mm-new
- v7 link: https://lore.kernel.org/linux-mm/20260527062247.3440692-1-youngjun.park@lge.com/

v7
- Collect Baoquan's review tag
- Baoquan's feedback on fixing improper comment
- Minor code adjustments per Baoquan's feedback.
- Rebase on recent mm-new
- v6 link: https://lore.kernel.org/linux-mm/20260421055323.940344-1-youngjun.park@lge.com/

v6
- Sashiko AI review fixes
 - Fix batch parsing error path to restore snapshot before exit
 - Reject overlong tier names to prevent truncated duplicates
 - Avoid restoring raw list_head via memcpy (stale pointer risk)
 - Ensure early parse errors do not skip DEF_SWAP_PRIO validation
 - Use (1U << TIER_DEFAULT_IDX) to avoid signed shift UB
 - Defer tier mask inheritance to css_online() to close race window
 - Add READ_ONCE()/WRITE_ONCE() for tier mask accesses
- Other fixes
 - Fix build error reintroduced due to missing v5 change
 - Fix WARNING in folio_tier_effective_mask by adding rcu_read_lock()
 - Change the default maximum number of swap tiers from 32 to 31, reserving the last bit
 - Refine commit messages
 - Rebase on recent mm-new
- v5 link: https://lore.kernel.org/linux-mm/20260325175453.2523280-1-youngjun.park@lge.com/

v5
- Fixed build errors reported in v4
- rebased on up to date mm-new
- Minor cleanups
- Added design docs with validation (from the discussion with Shakeel Butt)
- v4 link: https://lore.kernel.org/linux-mm/20260217000950.4015880-1-youngjun.park@lge.com/

v4
- Simplified control flow and indentation
- Added CONFIG option for MAX_SWAPTIER (default: 4)
- Added memory.swap.tiers.effective interface
- Reworked save/restore logic into snapshot/rollback model
- Removed tier priority modification support (deferred)
- Improved validation and fixed edge cases
- Rebased onto latest mm-new
- RFC v3 link: https://lore.kernel.org/linux-mm/20260131125454.3187546-1-youngjun.park@lge.com/

RFC v1 ~ v3
- Changed direction after discussion with Chris Li
- Applied some LPC feedback
- RFC v2 - https://lore.kernel.org/linux-mm/20260126065242.1221862-1-youngjun.park@lge.com/
- RFC v1 - https://lore.kernel.org/linux-mm/20251109124947.1101520-1-youngjun.park@lge.com/

Earlier Approach (per cgroup swap priority)
- v1: https://lore.kernel.org/linux-mm/20250716202006.3640584-1-youngjun.park@lge.com/
- RFC: https://lore.kernel.org/linux-mm/aEvLjEInMQC7hEyh@yjaykim-PowerEdge-T330/T/#mbbb6a5e9e30843097e1f5f65fb98f31d582b973d

Reference
=========

[1] https://lore.kernel.org/linux-doc/aiw2p5ANjsQUCIHA@linux.dev/
[2] https://lore.kernel.org/linux-mm/CAKEwX=Nz9SWcEVQGQjHN8P8OANJY4BG0w+iQOzoNOWuteoVjAg@mail.gmail.com/
[3] https://lore.kernel.org/cgroups/CAKEwX=O23a4iWBZoewKVb8QqODte6r3Xijckw3_oCJNoiO9M5A@mail.gmail.com/
[4] https://lore.kernel.org/linux-mm/CAO9r8zOg0OP1Ak1v7CRzSfQq0D8b4Dw+_T0Jui6YTM_KwQQNOA@mail.gmail.com/
[5] https://lore.kernel.org/linux-mm/CAO9r8zNi4-QC4sUi=xXWHt9WMeG39mbyoSf8kON9vLOZ=cbCmw@mail.gmail.com/

Youngjun Park (4):
  mm: swap: introduce swap tier infrastructure
  mm: swap: associate swap devices with tiers
  mm: memcontrol: add interface for swap tier selection
  mm: swap: filter swap allocation by memcg tier mask

 Documentation/admin-guide/cgroup-v2.rst |  20 +
 Documentation/mm/index.rst              |   1 +
 Documentation/mm/swap-tier.rst          | 159 ++++++++
 MAINTAINERS                             |   3 +
 include/linux/memcontrol.h              |   5 +
 include/linux/swap.h                    |   1 +
 mm/Kconfig                              |  12 +
 mm/Makefile                             |   2 +-
 mm/memcontrol.c                         |  67 ++++
 mm/swap.h                               |   4 +
 mm/swap_state.c                         |  75 ++++
 mm/swap_tier.c                          | 483 ++++++++++++++++++++++++
 mm/swap_tier.h                          |  76 ++++
 mm/swapfile.c                           |  20 +-
 14 files changed, 923 insertions(+), 5 deletions(-)
 create mode 100644 Documentation/mm/swap-tier.rst
 create mode 100644 mm/swap_tier.c
 create mode 100644 mm/swap_tier.h

base-commit: 9d335aed8840f6bf83ba93309ae5e185de829c21
-- 
2.34.1

Re: [PATCH v8 0/4] mm/swap, memcg: Introduce swap tiers for cgroup based swap control

Posted by Nhat Pham 2 days, 4 hours ago

On Wed, Jun 17, 2026 at 1:34 AM Youngjun Park <youngjun.park@lge.com> wrote:
>
> This is the v8 series of the swap tier patchset.
>
> Great thanks to Shakeel Butt and Yosry for the reviews and discussions [1].
> The main change in this version is the interface change to use
> memory.swap.tiers.max with '0' (disable) and 'max' (enable) values.
> This mechanism was suggested by Shakeel and Yosry

I like this interface too :)

>
> This change allows for future extensions to control swap
> between tiers and aligns better with existing memcg interfaces.
> Even with this memcg interface change, only patch #3 needed updates.
> Internally, patch #3 still uses the existing mask processing method
> (which is implementation-efficient), so only the user-facing interface
> was modified.
>
> We also discussed tier extensions. Thanks to Yosry, Nhat and Shakeel for their
> valuable feedback.
>
> Here is a brief summary of our tentative conclusions. Please correct me
> if anything is misrepresented (details in references):
>
> * Zswap tiering [2]:
>   Tiering applies only to the vswap + zswap combo. Zswap itself will
>   not be tiered, as the current architecture requires a physical device
>   for zswap allocation.

I think Yosry wants zswap as a tier, right?

Just that without vswap, maybe don't allow it to be a tier by itself?

> * Vswap tiering [3]:
>   Vswap should be handled transparently to the user. Vswap itself will
>   not be tiered. But, someday supported if there is strong and real usecase.
> * Relationship with zswap.writeback [4]:
>   If zswap tiering is introduced, it could replace the zswap-only tier.
>   However, since zswap cannot be tiered independently, it is still
>   needed for non-vswap cases. Separately, the internal logic could
>   potentially be integrated into the tiering logic.
> * Tier demotion [5]:
>   A separate interface like memory.swap.tiers.demotion might be needed.
>   For now, we only support 0/max to enable/disable tiers. In the future,
>   we could introduce an "auto" mode to automatically scale the limit
>   based on swapfile size and memory.swap.max, similar to the direction
>   memory tiering is heading in.
>
> I plan to apply the swap tier infrastructure and the first use case
> (cgroup-based swap control) first, and continue following up on the
> discussions above.
>
> Overview
> ========
>
> Swap Tiers group swap devices into performance classes (e.g. NVMe,
> HDD, Network) and allow per-memcg selection of which tiers to use.
> This mechanism was suggested by Chris Li.
>
>
> #2: Inter-tier promotion and demotion:
>   Promotion and demotion apply between tiers, not within a single
>   tier. The current interface defines only tier assignment; it does
>   not yet define when or how pages move between tiers. Two triggering
>   models are possible:
>
>   (a) User-triggered: userspace explicitly initiates migration between
>       tiers (e.g. via a new interface or existing move_pages semantics).
>   (b) Kernel-triggered: the kernel moves pages between tiers at
>       appropriate points such as reclaim or refault.

We'll likely need some kernel-triggered mechanism, or we'd have LRU inversion :)

Cold pages will fill up fast tiers first, and more recent/warm pages
will land on slow tiers...

We'll also need to enforce isolation/fairness to make sure no workload
hoards the fast tiers too (but that probably requires demotion
support).

>
> #3: Per-VMA, per-process swap and BPF:
>   Not just for memcg based swap, possible to extend Per-VMA or per-process
>   swap. Or we can use it as BPF program.
>
> #4: Zswap and vswap tiering:
>   Tiering applies to the vswap + zswap combination.
>
> #5: Vswap on/off control:
>   Currently not supported. If a strong use case arises where vswap needs
>   to be controlled by memcg, the tier interface could be used for it.

+1.

Also, per-si/per-tier per-CPU allocation caching? :) Kairui already
has a patch for it, IIUC, but if not it's pretty critical I'd say.

BTW, can we add some selftests, to make sure the new interface works
as expected, and to have example programs for new users to model their
scripts after? :)

Re: [PATCH v8 0/4] mm/swap, memcg: Introduce swap tiers for cgroup based swap control

Posted by YoungJun Park 1 day, 20 hours ago

On Wed, Jun 17, 2026 at 01:50:49PM -0400, Nhat Pham wrote:

> On Wed, Jun 17, 2026 at 1:34 AM Youngjun Park <youngjun.park@lge.com> wrote:
> >
> > This is the v8 series of the swap tier patchset.
> >
> > Great thanks to Shakeel Butt and Yosry for the reviews and discussions [1].
> > The main change in this version is the interface change to use
> > memory.swap.tiers.max with '0' (disable) and 'max' (enable) values.
> > This mechanism was suggested by Shakeel and Yosry
> 
> I like this interface too :)

Good to hear. Now it looks like we have found a memcg interface that
aligns well with the existing memcg model.

I like this idea as well. Thanks again to Shakeel Butt and Yosry.

> > Here is a brief summary of our tentative conclusions. Please correct me
> > if anything is misrepresented (details in references):
> >
> > * Zswap tiering [2]:
> >   Tiering applies only to the vswap + zswap combo. Zswap itself will
> >   not be tiered, as the current architecture requires a physical device
> >   for zswap allocation.
> 
> I think Yosry wants zswap as a tier, right?
> 
> > Just that without vswap, maybe don't allow it to be a tier by itself?

With the current architecture, users cannot dynamically specify zswap as
a tier, and zswap is a separate layer, so it is not tiered by itself.

Once your vswap work lands, I think we can make zswap the default,
top-level tier.

After that, we can also look into cleaning up the zswap.writeback
interface together.

> > #2: Inter-tier promotion and demotion:
> >   Promotion and demotion apply between tiers, not within a single
> >   tier. The current interface defines only tier assignment; it does
> >   not yet define when or how pages move between tiers. Two triggering
> >   models are possible:
> >
> >   (a) User-triggered: userspace explicitly initiates migration between
> >       tiers (e.g. via a new interface or existing move_pages semantics).
> >   (b) Kernel-triggered: the kernel moves pages between tiers at
> >       appropriate points such as reclaim or refault.
> 
> We'll likely need some kernel-triggered mechanism, or we'd have LRU inversion :)
> 
> Cold pages will fill up fast tiers first, and more recent/warm pages
> will land on slow tiers...

Yeah, good point!

> We'll also need to enforce isolation/fairness to make sure no workload
> hoards the fast tiers too (but that probably requires demotion
> support).

Right, that makes sense.

BTW, one thing I am curious about is whether there are strong
real-world use cases that require demotion/promotion.
In theory this looks useful, but it would help to better understand
the requirements from such deployments.

> >
> > #3: Per-VMA, per-process swap and BPF:
> >   Not just for memcg based swap, possible to extend Per-VMA or per-process
> >   swap. Or we can use it as BPF program.
> >
> > #4: Zswap and vswap tiering:
> >   Tiering applies to the vswap + zswap combination.
> >
> > #5: Vswap on/off control:
> >   Currently not supported. If a strong use case arises where vswap needs
> >   to be controlled by memcg, the tier interface could be used for it.
> 
> +1.
> 
> Also, per-si/per-tier per-CPU allocation caching? :) Kairui already
> has a patch for it, IIUC, but if not it's pretty critical I'd say.

Yes, I missed it. Thank you for bringing it up.
We need an implementation that integrates this with the per-CPU
allocation currently implemented on the vswap side.

If Kairui's patch lands, my patch #4 can also be optimized based on that.

> BTW, can we add some selftests, to make sure the new interface works
> as expected, and to have example programs for new users to model their
> scripts after? :)

Yes, I agree. I think selftests are necessary.

Do you want them to be introduced in this patchset, or would it be okay
to add them separately as follow-up work?

Re: [PATCH v8 0/4] mm/swap, memcg: Introduce swap tiers for cgroup based swap control

Posted by Nhat Pham 1 day, 9 hours ago

On Wed, Jun 17, 2026 at 9:47 PM YoungJun Park <youngjun.park@lge.com> wrote:
>
> On Wed, Jun 17, 2026 at 01:50:49PM -0400, Nhat Pham wrote:
>
> > On Wed, Jun 17, 2026 at 1:34 AM Youngjun Park <youngjun.park@lge.com> wrote:
> > >
> > > This is the v8 series of the swap tier patchset.
> > >
> > > Great thanks to Shakeel Butt and Yosry for the reviews and discussions [1].
> > > The main change in this version is the interface change to use
> > > memory.swap.tiers.max with '0' (disable) and 'max' (enable) values.
> > > This mechanism was suggested by Shakeel and Yosry
> >
> > I like this interface too :)
>
> > I think Yosry wants zswap as a tier, right?
> >
> > Just that without vswap, maybe don't allow it to be a tier by itself?
>
> With the current architecture, users cannot dynamically specify zswap as
> a tier, and zswap is a separate layer, so it is not tiered by itself.
>
> Once your vswap work lands, I think we can make the zswap
> become the default, top-level tier.
>
> After that, we can also look into cleaning up the zswap.writeback
> interface together.

SGTM if Yosry is happy with it :) FWIW, zswap is a conceptual tier,
whether or not we want to express it with your interface. This is just
interface clean-up work.

>
> > > #2: Inter-tier promotion and demotion:
> > >   Promotion and demotion apply between tiers, not within a single
> > >   tier. The current interface defines only tier assignment; it does
> > >   not yet define when or how pages move between tiers. Two triggering
> > >   models are possible:
> > >
> > >   (a) User-triggered: userspace explicitly initiates migration between
> > >       tiers (e.g. via a new interface or existing move_pages semantics).
> > >   (b) Kernel-triggered: the kernel moves pages between tiers at
> > >       appropriate points such as reclaim or refault.
> >
> > We'll likely need some kernel-triggered mechanism, or we'd have LRU inversion :)
> >
> > Cold pages will fill up fast tiers first, and more recent/warm pages
> > will land on slow tiers...
>
> Yeah, good point!
>
> > We'll also need to enforce isolation/fairness to make sure no workload
> > hoards the fast tiers too (but that probably requires demotion
> > support).
>
> Right, that makes sense.
>
> BTW, One thing I am curious about, though, is whether there are strong
> real-world use cases that require demotion/promotion.
> Theoretically, this looks useful but it would be helpful to better understand
> the requirements from such deployments.

I think so, yeah. The LRU inversion problem above is one :) Hard to
make proper tiering without demotion.

Say I have a workload that has an SLO - for example a PSI target - but
doesn't particularly care about exact memory placement. To optimize
resource usage, we want to place the warmer stuff in the fast tier, and
the coldest stuff in the slow tier, etc. Having the ability to do
demotion derisks the initial placement - we can place things in the fast
tier initially (and rather aggressively), then as pages age and prove
their coldness, we can move them to slower and slower tiers, etc.

Otherwise, what we end up with is really a placement preference
interface more than true tiering. Which is still useful especially
when co-tenant workloads have strict latency requirements, but perhaps
we don't need a full hierarchy-style interface for it? :)

The other use case is fairness enforcement. We can (and probably
should) start with strict limits, but setting memory.swap.tiers.max for
each cgroup is a bit of a drag, and it might leave stranded capacity
in cgroups that are allocated fast swap tier capacity but do not
utilize it. If demotion is possible, we can let workloads use more than
what is fair, but then demote swap pages out of the fast tier to enforce
fairness when necessary...

Obviously, it's a moot point if there is no good mechanism to transfer
data from one tier to another. The data might also be so cold that all
of this has diminishing returns, and moving things around costs more
than it's worth :) So I'm happy to start with something simple, then we
can figure out the next steps.

>
> > >
> > > #3: Per-VMA, per-process swap and BPF:
> > >   Not just for memcg based swap, possible to extend Per-VMA or per-process
> > >   swap. Or we can use it as BPF program.
> > >
> > > #4: Zswap and vswap tiering:
> > >   Tiering applies to the vswap + zswap combination.
> > >
> > > #5: Vswap on/off control:
> > >   Currently not supported. If a strong use case arises where vswap needs
> > >   to be controlled by memcg, the tier interface could be used for it.
> >
> > +1.
> >
> > Also, per-si/per-tier per-CPU allocation caching? :) Kairui already
> > has a patch for it, IIUC, but if not it's pretty critical I'd say.
>
> Yes, I missed it. Thank you for addressing it.
> we need an implementation that integrates this with the per-CPU
> allocation currently implemented on the vswap side.
>
> If Kairui's patch lands, my patch #4 also can be optimized based on that.

Yup!!

>
> > BTW, can we add some selftests, to make sure the new interface works
> > as expected, and to have example programs for new users to model their
> > scripts after? :)
>
> Yes, I agree. I think selftests are necessary.
>
> Do you want them to be introduced in this patchset, or would it be okay
> to add them separately as follow-up work?

If you have to send another version, might as well include them :)

Otherwise a follow-up is good. Thanks in advance for keeping our
codebase tested!

I'll take a look at the exact implementation on the swap side later,
but I suspect nothing much will have changed :)