[RFC PATCH 0/6] mm/memcontrol: Make memcg limits tier-aware
Posted by Joshua Hahn 1 month, 1 week ago
Memory cgroups provide an interface that allows multiple workloads on a
host to co-exist, and establishes both weak and strong memory isolation
guarantees. For large servers and small embedded systems alike, memcgs
offer an effective way to provide a baseline quality of service for
protected workloads.

This works because, for the most part, all memory is equal (except for
zram / zswap). Restricting a cgroup's memory footprint restricts how
much it can hurt other workloads competing for memory. Likewise, setting
memory.low or memory.min limits can provide weak and strong guarantees,
respectively, for the performance of a cgroup.

However, on systems with tiered memory (e.g. CXL / compressed memory),
the quality-of-service guarantees that memcg limits enforce become less
effective, as memcg has no awareness of the physical location of its
charged memory. In other words, a workload that is well-behaved within
its memcg limits may still be hurting the performance of other
well-behaving workloads on the system by hogging more than its
"fair share" of toptier memory.

Introduce tier-aware memcg limits, which scale memory.low/high to
reflect the ratio of toptier:total memory the cgroup has access to.

Take the following scenario as an example:
On a host with a 3:1 toptier:lowtier ratio, say 150G toptier and 50G lowtier,
setting a cgroup's limits to:
	memory.min:  15G
	memory.low:  20G
	memory.high: 40G
	memory.max:  50G

Will be enforced at the toptier as:
	memory.min:          15G
	memory.toptier_low:  15G (20 * 150/200)
	memory.toptier_high: 30G (40 * 150/200)
	memory.max:          50G

Let's say that there are 4 such cgroups on the host. Previously, it would
be possible for 3 cgroups to completely take over all of DRAM, while one
cgroup could only access the lowtier memory. From the perspective of
tier-agnostic memcg limit enforcement, the three cgroups are all
well-behaved, consuming within their memory limits.

This is not to say that the scenario above is incorrect. In fact,
letting the hottest cgroups run in DRAM while pushing colder cgroups out
to lowtier memory lets the system perform the most aggregate work.

But for other scenarios, the target might not be maximizing aggregate
work, but maximizing the minimum performance guarantee for each
individual workload (think hosts shared across different users, such as
VM hosting services).

To reflect these two scenarios, introduce a sysctl tier_aware_memcg,
which allows the host to toggle between enforcing and overlooking
toptier memcg limit breaches.

This work is inspired by and based on Kaiyang Zhao's work from 2024 [1],
where he referred to this concept as "memory tiering fairness".
The biggest difference in the implementations lies in how toptier memory
is tracked; in his implementation, an lruvec stat aggregation is done on
each usage check, while in this implementation, a new cacheline is
introduced in page_counter to keep track of toptier usage (Kaiyang also
introduces a new cacheline in page_counter, but only uses it to cache
capacity and thresholds). This implementation also extends the memory
limit enforcement to memory.high as well.

[1] https://lore.kernel.org/linux-mm/20240920221202.1734227-1-kaiyang2@cs.cmu.edu/

---
Joshua Hahn (6):
  mm/memory-tiers: Introduce tier-aware memcg limit sysfs
  mm/page_counter: Introduce tiered memory awareness to page_counter
  mm/memory-tiers, memcontrol: Introduce toptier capacity updates
  mm/memcontrol: Charge and uncharge from toptier
  mm/memcontrol, page_counter: Make memory.low tier-aware
  mm/memcontrol: Make memory.high tier-aware

 include/linux/memcontrol.h   |  21 ++++-
 include/linux/memory-tiers.h |  30 +++++++
 include/linux/page_counter.h |  31 ++++++-
 include/linux/swap.h         |   3 +-
 kernel/cgroup/cpuset.c       |   2 +-
 kernel/cgroup/dmem.c         |   2 +-
 mm/memcontrol-v1.c           |   6 +-
 mm/memcontrol.c              | 155 +++++++++++++++++++++++++++++++----
 mm/memory-tiers.c            |  63 ++++++++++++++
 mm/page_counter.c            |  77 ++++++++++++++++-
 mm/vmscan.c                  |  24 ++++--
 11 files changed, 376 insertions(+), 38 deletions(-)

-- 
2.47.3
Re: [RFC PATCH 0/6] mm/memcontrol: Make memcg limits tier-aware
Posted by Donet Tom 1 week, 5 days ago
Hi Joshua

On 2/24/26 4:08 AM, Joshua Hahn wrote:
> Memory cgroups provide an interface that allow multiple workloads on a
> host to co-exist, and establish both weak and strong memory isolation
> guarantees. For large servers and small embedded systems alike, memcgs
> provide an effective way to provide a baseline quality of service for
> protected workloads.
>
> This works, because for the most part, all memory is equal (except for
> zram / zswap). Restricting a cgroup's memory footprint restricts how
> much it can hurt other workloads competing for memory. Likewise, setting
> memory.low or memory.min limits can provide weak and strong guarantees
> to the performance of a cgroup.
>
> However, on systems with tiered memory (e.g. CXL / compressed memory),
> the quality of service guarantees that memcg limits enforced become less
> effective, as memcg has no awareness of the physical location of its
> charged memory. In other words, a workload that is well-behaved within
> its memcg limits may still be hurting the performance of other
> well-behaving workloads on the system by hogging more than its
> "fair share" of toptier memory.
>
> Introduce tier-aware memcg limits, which scale memory.low/high to
> reflect the ratio of toptier:total memory the cgroup has access.
>
> Take the following scenario as an example:
> On a host with 3:1 toptier:lowtier, say 150G toptier, and 50Glowtier,
> setting a cgroup's limits to:
> 	memory.min:  15G
> 	memory.low:  20G
> 	memory.high: 40G
> 	memory.max:  50G
>
> Will be enforced at the toptier as:
> 	memory.min:          15G
> 	memory.toptier_low:  15G (20 * 150/200)
> 	memory.toptier_high: 30G (40 * 150/200)
> 	memory.max:          50G



Currently, the high and low thresholds are adjusted based on the ratio 
of top-tier to total memory. One concern I see is that if the working 
set size exceeds the top-tier high threshold, it could lead to frequent 
demotions and promotions. Instead, would it make sense to introduce a 
tunable knob to configure the top-tier high threshold?

Another concern is that if the lower-tier memory size is very large, the 
cgroup may end up getting only a small portion of higher-tier memory.


>
> Let's say that there are 4 such cgroups on the host. Previously, it would
> be possible for 3 hosts to completely take over all of DRAM, while one
> cgroup could only access the lowtier memory. In the perspective of a
> tier-agnostic memcg limit enforcement, the three cgroups are all
> well-behaved, consuming within their memory limits.
>
> This is not to say that the scenario above is incorrect. In fact, for
> letting the hottest cgroups run in DRAM while pushing out colder cgroups
> to lowtier memory lets the system perform the most aggregate work total.
>
> But for other scenarios, the target might not be maximizing aggregate
> work, but maximizing the minimum performance guarantee for each
> individual workload (think hosts shared across different users, such as
> VM hosting services).
>
> To reflect these two scenarios, introduce a sysctl tier_aware_memcg,
> which allows the host to toggle between enforcing and overlooking
> toptier memcg limit breaches.
>
> This work is inspired & based off of Kaiyang Zhao's work from 2024 [1],
> where he referred to this concept as "memory tiering fairness".
> The biggest difference in the implementations lie in how toptier memory
> is tracked; in his implementation, an lruvec stat aggregation is done on
> each usage check, while in this implementation, a new cacheline is
> introduced in page_coutner to keep track of toptier usage (Kaiyang also
> introduces a new cachline in page_counter, but only uses it to cache
> capacity and thresholds). This implementation also extends the memory
> limit enforcement to memory.high as well.
>
> [1] https://lore.kernel.org/linux-mm/20240920221202.1734227-1-kaiyang2@cs.cmu.edu/
>
> ---
> Joshua Hahn (6):
>    mm/memory-tiers: Introduce tier-aware memcg limit sysfs
>    mm/page_counter: Introduce tiered memory awareness to page_counter
>    mm/memory-tiers, memcontrol: Introduce toptier capacity updates
>    mm/memcontrol: Charge and uncharge from toptier
>    mm/memcontrol, page_counter: Make memory.low tier-aware
>    mm/memcontrol: Make memory.high tier-aware
>
>   include/linux/memcontrol.h   |  21 ++++-
>   include/linux/memory-tiers.h |  30 +++++++
>   include/linux/page_counter.h |  31 ++++++-
>   include/linux/swap.h         |   3 +-
>   kernel/cgroup/cpuset.c       |   2 +-
>   kernel/cgroup/dmem.c         |   2 +-
>   mm/memcontrol-v1.c           |   6 +-
>   mm/memcontrol.c              | 155 +++++++++++++++++++++++++++++++----
>   mm/memory-tiers.c            |  63 ++++++++++++++
>   mm/page_counter.c            |  77 ++++++++++++++++-
>   mm/vmscan.c                  |  24 ++++--
>   11 files changed, 376 insertions(+), 38 deletions(-)
>
Re: [RFC PATCH 0/6] mm/memcontrol: Make memcg limits tier-aware
Posted by Joshua Hahn 1 week, 5 days ago
On Tue, 24 Mar 2026 16:00:34 +0530 Donet Tom <donettom@linux.ibm.com> wrote:

> Hi Josua
> 
> On 2/24/26 4:08 AM, Joshua Hahn wrote:
> > Memory cgroups provide an interface that allow multiple workloads on a
> > host to co-exist, and establish both weak and strong memory isolation
> > guarantees. For large servers and small embedded systems alike, memcgs
> > provide an effective way to provide a baseline quality of service for
> > protected workloads.
> >
> > This works, because for the most part, all memory is equal (except for
> > zram / zswap). Restricting a cgroup's memory footprint restricts how
> > much it can hurt other workloads competing for memory. Likewise, setting
> > memory.low or memory.min limits can provide weak and strong guarantees
> > to the performance of a cgroup.
> >
> > However, on systems with tiered memory (e.g. CXL / compressed memory),
> > the quality of service guarantees that memcg limits enforced become less
> > effective, as memcg has no awareness of the physical location of its
> > charged memory. In other words, a workload that is well-behaved within
> > its memcg limits may still be hurting the performance of other
> > well-behaving workloads on the system by hogging more than its
> > "fair share" of toptier memory.
> >
> > Introduce tier-aware memcg limits, which scale memory.low/high to
> > reflect the ratio of toptier:total memory the cgroup has access.
> >
> > Take the following scenario as an example:
> > On a host with 3:1 toptier:lowtier, say 150G toptier, and 50Glowtier,
> > setting a cgroup's limits to:
> > 	memory.min:  15G
> > 	memory.low:  20G
> > 	memory.high: 40G
> > 	memory.max:  50G
> >
> > Will be enforced at the toptier as:
> > 	memory.min:          15G
> > 	memory.toptier_low:  15G (20 * 150/200)
> > 	memory.toptier_high: 30G (40 * 150/200)
> > 	memory.max:          50G
> 
> 

Hello Donet,

Thank you for reviewing the series! I hope you are doing well.

> Currently, the high and low thresholds are adjusted based on the ratio 
> of top-tier to total memory. One concern I see is that if the working 
> set size exceeds the top-tier high threshold, it could lead to frequent 
> demotions and promotions. Instead, would it make sense to introduce a 
> tunable knob to configure the top-tier high threshold?

Yes, this is true. It is also a concern that I have, and I think that
adding a tunable knob could be helpful. The other side of the question is
whether there are already too many tunables for users, with min /
low / high / max. I'm hoping to reach a consensus on this at LSFMMBPF;
I hope we can talk about it there!

The other way to approach this is to throttle promotions and demotions
when workloads are thrashing. Personally I prefer this approach, although
it isn't mutually exclusive with adding more knobs.

> Another concern is that if the lower-tier memory size is very large, the 
> cgroup may end up getting only a small portion of higher-tier memory.

I think the issue you mentioned above is a bigger problem.

If the lower tier memory is large and the toptier memory is small, then it
makes toptier memory an even more constrained resource, so splitting it
fairly among the cgroups becomes an even bigger issue. Remember, we're
limiting workloads' toptier memory usage because other workloads have
to use it; if we let a cgroup use more toptier memory, it has to come
from another cgroup's share.

Thanks again. Please let me know if you have any other concerns, I'm
excited to talk about this more as well!

Joshua
Re: [RFC PATCH 0/6] mm/memcontrol: Make memcg limits tier-aware
Posted by Michal Hocko 1 month, 1 week ago
On Mon 23-02-26 14:38:23, Joshua Hahn wrote:
> Memory cgroups provide an interface that allow multiple workloads on a
> host to co-exist, and establish both weak and strong memory isolation
> guarantees. For large servers and small embedded systems alike, memcgs
> provide an effective way to provide a baseline quality of service for
> protected workloads.
> 
> This works, because for the most part, all memory is equal (except for
> zram / zswap). Restricting a cgroup's memory footprint restricts how
> much it can hurt other workloads competing for memory. Likewise, setting
> memory.low or memory.min limits can provide weak and strong guarantees
> to the performance of a cgroup.
> 
> However, on systems with tiered memory (e.g. CXL / compressed memory),
> the quality of service guarantees that memcg limits enforced become less
> effective, as memcg has no awareness of the physical location of its
> charged memory. In other words, a workload that is well-behaved within
> its memcg limits may still be hurting the performance of other
> well-behaving workloads on the system by hogging more than its
> "fair share" of toptier memory.

This assumes that the active workingset size of all workloads doesn't
fit into the top tier, right? Otherwise promotions would make sure that
we have the most active memory in the top tier. Is this typical in
real-life configurations?

Or do you intend to limit memory consumption on particular tier even
without an external pressure?

> Introduce tier-aware memcg limits, which scale memory.low/high to
> reflect the ratio of toptier:total memory the cgroup has access.
> 
> Take the following scenario as an example:
> On a host with 3:1 toptier:lowtier, say 150G toptier, and 50Glowtier,
> setting a cgroup's limits to:
> 	memory.min:  15G
> 	memory.low:  20G
> 	memory.high: 40G
> 	memory.max:  50G
> 
> Will be enforced at the toptier as:
> 	memory.min:          15G
> 	memory.toptier_low:  15G (20 * 150/200)
> 	memory.toptier_high: 30G (40 * 150/200)
> 	memory.max:          50G

Let's spend some more time with the interface first. You seem to be
focusing only on the top tier with this interface, right? Is this really the
right way to go long term? What makes you believe that we do not really
hit the same issue with other tiers as well? Also do we want/need to
duplicate all the limits for each/top tier? What is the reasoning for
the switch to be runtime sysctl rather than boot-time or cgroup mount
option?

I will likely have more questions but these are immediate ones after
reading the cover. Please note I haven't really looked at the
implementation yet. I really want to understand usecases and interface
first.
-- 
Michal Hocko
SUSE Labs
Re: [RFC PATCH 0/6] mm/memcontrol: Make memcg limits tier-aware
Posted by Joshua Hahn 1 month, 1 week ago
Hello Michal,

I hope that you are doing well! Thank you for taking the time to review my
work and leaving your thoughts.

I wanted to note that I hope to bring this discussion to LSFMMBPF as well,
to discuss what the scope of the project should be, what usecases there
are (as I will note below), how to make this scalable and sustainable
for the future, etc. I'll send out a topic proposal later today. I had
separated the series from the proposal because I imagined that this
series would go through many versions, so it would be helpful to have
the topic as a unified place for pre-conference discussions.

> > Memory cgroups provide an interface that allow multiple workloads on a
> > host to co-exist, and establish both weak and strong memory isolation
> > guarantees. For large servers and small embedded systems alike, memcgs
> > provide an effective way to provide a baseline quality of service for
> > protected workloads.
> > 
> > This works, because for the most part, all memory is equal (except for
> > zram / zswap). Restricting a cgroup's memory footprint restricts how
> > much it can hurt other workloads competing for memory. Likewise, setting
> > memory.low or memory.min limits can provide weak and strong guarantees
> > to the performance of a cgroup.
> > 
> > However, on systems with tiered memory (e.g. CXL / compressed memory),
> > the quality of service guarantees that memcg limits enforced become less
> > effective, as memcg has no awareness of the physical location of its
> > charged memory. In other words, a workload that is well-behaved within
> > its memcg limits may still be hurting the performance of other
> > well-behaving workloads on the system by hogging more than its
> > "fair share" of toptier memory.

I will split up your questions to answer them individually:

> This assumes that the active workingset size of all workloads doesn't
> fit into the top tier right?

Yes, for the scenario above, a workload that is violating its fair share
of toptier memory mostly hurts other workloads if the aggregate working
set size of all workloads exceeds the size of toptier memory.

> Otherwise promotions would make sure to that we have the most active
> memory in the top tier.

This is true. And for a lot of usecases, this is 100% the right thing to do.
However, with this patch I want to encourage a different perspective,
which is to think about things from a per-workload view, and not a
per-system one.

Having hot memory in high tiers and cold memory in low tiers is only
logical, since we increase the system's throughput and make optimal
choices for latency. However, what about systems that care about
objectives other than simply maximizing throughput?

In the original cover letter I offered an example of VM hosting services
that care less about maximizing host-wide throughput and more about ensuring
a bottomline performance guarantee for all workloads running on the system.
For the users on these services, they don't care that the host their VM is
running on is maximizing throughput; rather, they care that their VM meets
the performance guarantees that their provider promised. If there is no
way to know or enforce which tier of memory their workload lands on, either
the bottomline guarantee becomes very underestimated, or users must deal
with a high variance in performance.

Here's another example: Let's say there is a host with multiple workloads,
each serving queries for a database. The host would like to guarantee the
lowest maximum latency possible, while maximizing the total throughput
of the system. Once again in this situation, without tier-aware memcg
limits the host can maximize throughput, but can only make severely
underestimated promises on the bottom line.

> Is this typical in real life configurations?

I would say so. I think that the two examples above are realistic
scenarios that cloud providers and hyperscalers might face on tiered systems.

> Or do you intend to limit memory consumption on particular tier even
> without an external pressure?

This is a great question, and one that I hope to discuss at LSFMMBPF
to see how people expect an interface like this to work.

Over the past few weeks, I have been discussing this idea during the
Linux Memory Hotness and Promotion biweekly calls with Gregory Price [1].
One of the proposals that we made there (but did not include in this
series) is the idea of "fixed" vs. "opportunistic" reclaim.

Fixed mode is what we have here -- start limiting toptier usage whenever
a workload goes above its fair slice of toptier.
Opportunistic mode would allow workloads to use more toptier memory than
its fair share, but only be restricted when toptier is pressured.

What do you think about these two options? For the stated goal of this
series, which is to help maximize the bottom line for workloads, fair
share seemed to make sense. Implementing opportunistic mode changes
on top of this work would most likely just be another sysctl.

> > Introduce tier-aware memcg limits, which scale memory.low/high to
> > reflect the ratio of toptier:total memory the cgroup has access.
> > 
> > Take the following scenario as an example:
> > On a host with 3:1 toptier:lowtier, say 150G toptier, and 50Glowtier,
> > setting a cgroup's limits to:
> > 	memory.min:  15G
> > 	memory.low:  20G
> > 	memory.high: 40G
> > 	memory.max:  50G
> > 
> > Will be enforced at the toptier as:
> > 	memory.min:          15G
> > 	memory.toptier_low:  15G (20 * 150/200)
> > 	memory.toptier_high: 30G (40 * 150/200)
> > 	memory.max:          50G

I will split up the following points to answer them individually as well:

> Let's spend some more time with the interface first.

That sounds good to me; my goal was to bring this out as an RFC patchset
so folks could look at the code and understand the motivation, and then send
out the LSFMMBPF topic proposal. In retrospect I think I should have done
it in the opposite order. I'm sorry if this caused any confusion.

> You seem to be focusing only on the top tier with this interface, right?
> Is this really the right way to go long term? What makes you believe that
> we do not really hit the same issue with other tiers as well?

Yes, that's right. I'm not sure if this is the right way to go long-term
(say, past the next 5 years). My thinking was that I can stick with doing
this for toptier vs. non-toptier memory for now, and deal with having
3+ tiers in the future, when we start to have systems with that many tiers.
AFAICT two-tiered systems are still ~relatively new, and I don't think
there are a lot of genuine usecases for enforcing mid-tier memory limits
as of now. Of course, I would be excited to learn about these usecases
and work this patchset to support them as well if anybody has them.

> Also do we want/need to duplicate all the limits for each/top tier?

Sorry, I'm not sure that I completely understood this question. Are you
referring to the case where we have multiple nodes in the toptier?
If so, then all of those nodes are treated the same, and don't have
unique limits. Or are you referring to the case where we have multiple
tiers in the toptier? If so, I hope the answer above can answer this too.

> What is the reasoning for the switch to be runtime sysctl rather than
> boot-time or cgroup mount option?

Good point : -) I don't think cgroup mount options are a good idea,
since this would mean that we can have a set of cgroups self-policing
their toptier usage, while another cgroup allocates memory unrestricted.
This would punish the self-policing cgroup and we would lose the benefit
of having a bottomline performance guarantee.

> I will likely have more questions but these are immediate ones after
> reading the cover. Please note I haven't really looked at the
> implementation yet. I really want to understand usecases and interface
> first.

That sounds good to me, thank you again for reviewing this work!
I hope you have a great day : -)
Joshua

[1] https://lore.kernel.org/linux-mm/c8bc2dce-d4ec-c16e-8df4-2624c48cfc06@google.com/
Re: [RFC PATCH 0/6] mm/memcontrol: Make memcg limits tier-aware
Posted by Michal Hocko 1 month, 1 week ago
On Tue 24-02-26 08:13:56, Joshua Hahn wrote:
> Hello Michal,
> 
> I hope that you are doing well! Thank you for taking the time to review my
> work and leaving your thoughts.
> 
> I wanted to note that I hope to bring this discussion to LSFMMBPF as well,
> to discuss what the scope of the project should be, what usecases there
> are (as I will note below), how to make this scalable and sustainable
> for the future, etc. I'll send out a topic proposal later today. I had
> separated the series from the proposal because I imagined that this
> series would go through many versions, so it would be helpful to have
> the topic as a unified place for pre-conference discussions.

Yes, this is a really good topic to bring to LSFMMBPF. I will not be
attending this year unfortunately, but I will keep watching progress on
this. I am really sure there will be people in the room that can
help with the discussion.

> > > Memory cgroups provide an interface that allow multiple workloads on a
> > > host to co-exist, and establish both weak and strong memory isolation
> > > guarantees. For large servers and small embedded systems alike, memcgs
> > > provide an effective way to provide a baseline quality of service for
> > > protected workloads.
> > > 
> > > This works, because for the most part, all memory is equal (except for
> > > zram / zswap). Restricting a cgroup's memory footprint restricts how
> > > much it can hurt other workloads competing for memory. Likewise, setting
> > > memory.low or memory.min limits can provide weak and strong guarantees
> > > to the performance of a cgroup.
> > > 
> > > However, on systems with tiered memory (e.g. CXL / compressed memory),
> > > the quality of service guarantees that memcg limits enforced become less
> > > effective, as memcg has no awareness of the physical location of its
> > > charged memory. In other words, a workload that is well-behaved within
> > > its memcg limits may still be hurting the performance of other
> > > well-behaving workloads on the system by hogging more than its
> > > "fair share" of toptier memory.
> 
> I will split up your questions to answer them individually:
> 
> > This assumes that the active workingset size of all workloads doesn't
> > fit into the top tier right?
> 
> Yes, for the scenario above, a workload that is violating its fair share
> of toptier memory mostly hurts other workloads if the aggregate working
> set size of all workloads exceeds the size of toptier memory.

I think it would be good to provide some more insight into how this is
supposed to work exactly. If the real working set size doesn't fit into
the top tier then I suspect we can expect quite a lot of disruption from
constant promotions and demotions, right? I guess what you would like to
achieve is to stop those from happening, right? If that is the case then
how exactly do you envision configuring the workload? Do you cap
each workload with max/high limits? Or do you want to rely on the
low/min limits to protect workloads you care about? Or both? How does
that play with the promotion side of things?

> > Otherwise promotions would make sure to that we have the most active
> > memory in the top tier.
> 
> This is true. And for a lot of usecases, this is 100% the right thing to do.
> However, with this patch I want to encourage a different perspective,
> which is to think about things in a per-workload perspective, and not a
> per-system perspective.
> 
> Having hot memory in high tiers and cold memory in low tiers is only
> logical, since we increase the system's throughput and make the most
> optimal choices for latency. However, what about systems that care about
> objectives other than simply maximizing throughput?
> 
> In the original cover letter I offered an example of VM hosting services
> that care less about maximizing host-wide throughput, but more on ensuring
> a bottomline performance guarantee for all workloads running on the system.
> For the users on these services, they don't care that the host their VM is
> running on is maximizing throughput; rather, they care that their VM meets
> the performance guarantees that their provider promised. If there is no
> way to know or enforce which tier of memory their workload lands on, either
> the bottomline guarantee becomes very underestimated, or users must deal
> with a high variance in performance.
> 
> Here's another example: Let's say there is a host with multiple workloads,
> each serving queries for a database. The host would like to guarantee the
> lowest maximum latency possible, while maximizing the total throughput
> of the system. Once again in this situation, without tier-aware memcg
> limits the host can maximize throughput, but can only make severely
> underestimated promises on the bottom line.

Thanks, useful examples. And it would be really great to provide an
example of the intended configuration (no specific numbers, but something
to demonstrate the intention). Because this will not be just about limits,
right? It would require more tweaks to the system - at least numa
balancing (promotions) to be controlled in some way AFAICS.

> > Is this typical in real life configurations?
> 
> I would say so. I think that the two examples above are realistic
> scenarios that cloud providers and hyperscalers might face on tiered systems.
> 
> > Or do you intend to limit memory consumption on particular tier even
> > without an external pressure?
> 
> This is a great question, and one that I hope to discuss at LSFMMBPF
> to see how people expect an interface like this to work.
> 
> Over the past few weeks, I have been discussing this idea during the
> Linux Memory Hotness and Promotion biweekly calls with Gregory Price [1].
> One of the proposals that we made there (but did not include in this
> series) is the idea of "fixed" vs. "opportunistic" reclaim.
> 
> Fixed mode is what we have here -- start limiting toptier usage whenever
> a workload goes above its fair slice of toptier.
> Opportunistic mode would allow workloads to use more toptier memory than
> its fair share, but only be restricted when toptier is pressured.
> 
> What do you think about these two options? For the stated goal of this
> series, which is to help maximize the bottom line for workloads, fair
> share seemed to make sense. Implementing opportunistic mode changes
> on top of this work would most likely just be another sysctl.

To me it sounds like the distinction between max/high vs. low/min
reclaim.

[...]
> > You seem to be focusing only on the top tier with this interface, right?
> > Is this really the right way to go long term? What makes you believe that
> > we do not really hit the same issue with other tiers as well?
> 
> Yes, that's right. I'm not sure if this is the right way to go long-term
> (say, past the next 5 years). My thinking was that I can stick with doing
> this for toptier vs. non-toptier memory for now, and deal with having
> 3+ tiers in the future, when we start to have systems with that many tiers.
> AFAICT two-tiered systems are still ~relatively new, and I don't think
> there are a lot of genuine usecases for enforcing mid-tier memory limits
> as of now. Of course, I would be excited to learn about these usecases
> and work this patchset to support them as well if anybody has them.

I guess a more fundamental question is whether this needs to replicate
all limits for tiers or whether we can get an extension that would
control tier behavior for existing ones. In other words, can we define
which proportion of the max/high resp. low/min limits are reserved for
each tier? Is that feasible? I do not have an answer to that myself at
this stage TBH.

[...]
> > What is the reasoning for the switch to be runtime sysctl rather than
> > boot-time or cgroup mount option?
> 
> Good point : -) I don't think cgroup mount options are a good idea,
> since this would mean that we can have a set of cgroups self-policing
> their toptier usage, while another cgroup allocates memory unrestricted.
> This would punish the self-policing cgroup and we would lose the benefit
> of having a bottomline performance guarantee.

I do not follow. A cgroup mount option would apply to all cgroups. In
that sense, whatever is achievable by sysctl should also be achievable
via the kernel cmdline or a mount option. The question is what is the
best fit AFAICS.
 
> > I will likely have more questions but these are immediate ones after
> > reading the cover. Please note I haven't really looked at the
> > implementation yet. I really want to understand usecases and interface
> > first.
> 
> That sounds good to me, thank you again for reviewing this work!
> I hope you have a great day : -)
> Joshua
> 
> [1] https://lore.kernel.org/linux-mm/c8bc2dce-d4ec-c16e-8df4-2624c48cfc06@google.com/

-- 
Michal Hocko
SUSE Labs
Re: [RFC PATCH 0/6] mm/memcontrol: Make memcg limits tier-aware
Posted by Joshua Hahn 1 month, 1 week ago
On Thu, 26 Feb 2026 09:04:43 +0100 Michal Hocko <mhocko@suse.com> wrote:

> On Tue 24-02-26 08:13:56, Joshua Hahn wrote:
> > Hello Michal,
> > 
> > I hope that you are doing well! Thank you for taking the time to review my
> > work and leaving your thoughts.
> > 
> > I wanted to note that I hope to bring this discussion to LSFMMBPF as well,
> > to discuss what the scope of the project should be, what usecases there
> > are (as I will note below), how to make this scalable and sustainable
> > for the future, etc. I'll send out a topic proposal later today. I had
> > separated the series from the proposal because I imagined that this
> > series would go through many versions, so it would be helpful to have
> > the topic as a unified place for pre-conference discussions.
> 
> yes, this is a really good topic to bring to LSFMMBPF. I will not be
> attending this year unfortunately but I will keep watching progress on
> this. I am really sure there will be people in the room who can
> help with the discussion.

Hello Michal, thank you for the encouraging words : -)
Yes, I am sure that the audience will have valuable ideas to share
as well. Hopefully I can catch you at another conference!

And by the way, I've sent out the proposal here [1] if you are interested!

[...snip...]

> > > This assumes that the active workingset size of all workloads doesn't
> > > fit into the top tier right?
> > 
> > Yes, for the scenario above, a workload that is violating its fair share
> > of toptier memory mostly hurts other workloads if the aggregate working
> > set size of all workloads exceeds the size of toptier memory.
> 
> I think it would be good to provide some more insight into how this is
> supposed to work exactly. If the real working set size doesn't fit into
> the top tier then I suspect we can expect quite a lot of disruption from
> constant promotions and demotions, right? I guess what you would like to
> achieve is to stop those from happening? If that is the case, then how
> exactly do you envision configuring the workload? Do you cap each
> workload with max/high limits? Or do you want to rely on the low/min
> limits to protect workloads you care about? Or both? How does that play
> with the promotion side of things?

Yes, thrashing is probably the biggest concern for actual performance
if this is deployed to a real machine. I would like to add that this is
an (arguably even bigger) problem without this setup as well.

Once again, on multi-tenant hosts, if we have three hot cgroups whose
aggregate workingset size consumes all of DRAM, and one cgroup whose
memory is colder than the other three, then the colder cgroup will
constantly face thrashing as it has to compete with the others for hotness.

So the question is whether the thrashing happens to a well-behaving
victim cgroup, or if it happens to the ones whose workingset sizes are
too big.

I also have two qualifying points to add here:
First is that the effective toptier memory limit is not visible to
users. So when they are designing their workloads, specifically deciding
how big the workingset size can be, they have no way to tune for it. So
cgroups that appear to be well-behaved, and whose total footprint is
within the memory.high threshold, would still see reclaim activity.
Maybe the solution is as simple as exposing the toptier memory limit
as a new sysfs file? But I'm hoping that there is a more clever way to
do this that doesn't add more sysfs entries to the cgroup interface ; -)
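To make the first point concrete, here is a rough sketch of how the
invisible effective limit would be derived under the proportional scheme
(Python for illustration only; effective_toptier_high and the numbers
are made up, not an actual kernel interface):

```python
def effective_toptier_high(memory_high_gib, toptier_gib, total_gib):
    """Hypothetical: scale memory.high by the host's toptier:total ratio."""
    return memory_high_gib * (toptier_gib / total_gib)

# Host from the cover letter: 150G toptier + 50G lowtier (3:1).
limit = effective_toptier_high(memory_high_gib=40, toptier_gib=150, total_gib=200)
print(limit)  # 30.0

# A cgroup with 35G resident on the top tier is under its 40G memory.high,
# yet above the 30G effective toptier limit it cannot see -- so it would
# still observe toptier reclaim/demotion activity.
print(35 > limit)  # True
```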

Second is that there are scenarios where, on a relatively idle machine
with just one cgroup where memory.high, memory.max << toptier capacity,
we would still see reclaim activity. I would argue that this is not
so different from having a cgroup go into reclaim on an empty host,
even when there is memory available.

But I could also see the argument that those two scenarios are different.
What do you think?

[...snip...]

> > In the original cover letter I offered an example of VM hosting services
> > that care less about maximizing host-wide throughput, but more on ensuring
> > a bottomline performance guarantee for all workloads running on the system.
> > For the users on these services, they don't care that the host their VM is
> > running on is maximizing throughput; rather, they care that their VM meets
> > the performance guarantees that their provider promised. If there is no
> > way to know or enforce which tier of memory their workload lands on, either
> > the bottomline guarantee becomes very underestimated, or users must deal
> > with a high variance in performance.
> > 
> > Here's another example: Let's say there is a host with multiple workloads,
> > each serving queries for a database. The host would like to guarantee the
> > lowest maximum latency possible, while maximizing the total throughput
> > of the system. Once again in this situation, without tier-aware memcg
> > limits the host can maximize throughput, but can only make severely
> > underestimated promises on the bottom line.
> 
> Thanks, useful examples. And it would be really great to provide an
> example of the intended configuration (no specific numbers but something
> to demonstrate the intention). Because this will not be just about limits,
> right? It would require more tweaks to the system - at least NUMA
> balancing (promotions) would need to be controlled in some way AFAICS.

Definitely. Two components that make sense here would be to throttle
promotions when toptier is facing cgroup-local pressure (i.e., the cgroup
is reaching its limit), and to have some background balancing between the
two nodes, maybe done by kswapd. I'll be sure to include some of these
along with performance numbers in the next version.

[...snip...]

> > Fixed mode is what we have here -- start limiting toptier usage whenever
> > a workload goes above its fair slice of toptier.
> > Opportunistic mode would allow workloads to use more toptier memory than
> > their fair share, but only be restricted when toptier is pressured.
> > 
> > What do you think about these two options? For the stated goal of this
> > series, which is to help maximize the bottom line for workloads, fair
> > share seemed to make sense. Implementing opportunistic mode changes
> > on top of this work would most likely just be another sysctl.
> 
> To me it sounds like the distinction between max/high vs. low/min
> reclaim.

Ack. Makes sense to me.

[...snip...]

> > > You seem to be focusing only on the top tier with this interface, right?
> > > Is this really the right way to go long term? What makes you believe that
> > > we do not really hit the same issue with other tiers as well?
> > 
> > Yes, that's right. I'm not sure if this is the right way to go long-term
> > (say, past the next 5 years). My thinking was that I can stick with doing
> > this for toptier vs. non-toptier memory for now, and deal with having
> > 3+ tiers in the future, when we start to have systems with that many tiers.
> > AFAICT two-tiered systems are still ~relatively new, and I don't think
> > there are a lot of genuine usecases for enforcing mid-tier memory limits
> > as of now. Of course, I would be excited to learn about these usecases
> > and work this patchset to support them as well if anybody has them.
> 
> I guess a more fundamental question is whether this needs to replicate
> all limits for tiers or whether we can get an extension that would
> control tier behavior for existing ones. In other words, can we define
> which proportion of the max/high resp. low/min limits are reserved for
> each tier? Is that feasible? I do not have an answer to that myself at
> this stage TBH.

In terms of feasibility, I think the easiest approach would be to enforce
limits based on capacity, since this would let us get by without defining
per-tier per-cgroup limits. So for a 4-tier system with a total capacity of
200G, split across tiers as 100G : 60G : 20G : 20G, and a cgroup with a 50G memory.high:

tier0.ratio: 100 / 200 = 0.5		tier0.toptier_high = 50G * 0.5 = 25G
tier1.ratio: 60 / 200  = 0.3		tier1.toptier_high = 50G * 0.3 = 15G
tier2.ratio: 20 / 200  = 0.1		tier2.toptier_high = 50G * 0.1 = 5G
tier3.ratio: 20 / 200  = 0.1		tier3.toptier_high = 50G * 0.1 = 5G

The alternative would be to have 4 sysctls here to set limits, which...
doesn't sound too fun ; -) And I'm not entirely sure we want per-tier
limits anyways. For most scenarios, I think it should be enough to
protect or limit toptier usage.
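For concreteness, the capacity-proportional split in the table above can
be sketched like this (Python for illustration only; the helper name is
made up, not a real interface):

```python
def per_tier_high(memory_high_gib, tier_capacities_gib):
    """Hypothetical: split a cgroup's memory.high across tiers in
    proportion to each tier's share of total system capacity."""
    total = sum(tier_capacities_gib)
    return [memory_high_gib * cap / total for cap in tier_capacities_gib]

# 4-tier host, 100G : 60G : 20G : 20G (200G total), cgroup memory.high = 50G.
print(per_tier_high(50, [100, 60, 20, 20]))  # [25.0, 15.0, 5.0, 5.0]
```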

> [...]
> > > What is the reasoning for the switch to be runtime sysctl rather than
> > > boot-time or cgroup mount option?
> > 
> > Good point : -) I don't think cgroup mount options are a good idea,
> > since this would mean that we can have a set of cgroups self-policing
> > their toptier usage, while another cgroup allocates memory unrestricted.
> > This would punish the self-policing cgroup and we would lose the benefit
> > of having a bottomline performance guarantee.
> 
> I do not follow. A cgroup mount option would apply to all cgroups. In
> that sense, whatever is achievable by sysctl should also be achievable
> via the kernel cmdline or a mount option. The question is what is the
> best fit AFAICS.

Yup, you're right. I mixed it up in my head and got confused; in terms
of functionality, I think the kernel cmdline and a mount option are the same.

Actually, everything except for a runtime toggle makes sense, since a
runtime toggle requires the system to do the additional per-tier
accounting even when the feature is disabled. With a kernel cmdline
option we can tell the system to completely ignore the per-tier
accounting and enforcement, and the user sees no effects at all
(except, well, the additional cacheline in struct page_counter?)

Anyways, thank you very much for your thoughts and encouraging words.
I hope you have a great day, Michal!
Joshua

[1] https://lore.kernel.org/all/CAN+CAwNwpjRf9QhgAEhBQZD7r7sXCzLXqAKbNrPeMEq=7bX8Jg@mail.gmail.com/
Re: [RFC PATCH 0/6] mm/memcontrol: Make memcg limits tier-aware
Posted by Gregory Price 1 month, 1 week ago
On Tue, Feb 24, 2026 at 08:13:56AM -0800, Joshua Hahn wrote:
> ... snip ...

Just injecting a few points here
(disclosure: I have been in the development loop for this feature)

> 
> > Otherwise promotions would make sure that we have the most active
> > memory in the top tier.
> 

Yes / No.  This makes the assumption that you always want this.

Barring a minimum Quality of Service mechanism (as Joshua explains)
this reduces the usefulness of a secondary tier of memory.

Services will just prefer not to be deployed to these kinds of
machines because the performance variance is too high.

> 
> > Is this typical in real life configurations?
> 
> I would say so. I think that the two examples above are realistic
> scenarios that cloud providers and hyperscalers might face on tiered systems.
> 

The answer is unequivocally yes.

Lacking tier-awareness is actually a huge blocker for deploying mixed
workloads on large, dense memory systems with multiple tiers (2+).

Technically we're already at 4-ish tiers: DDR, CXL, ZSWAP, SWAP.

We have zswap/swap controls in cgroups already; we just lack that same
control for coherent memory tiers. This tries to use the existing knobs
(max/high/low/min) to do what they already do - just proportionally.

~Gregory
Re: [RFC PATCH 0/6] mm/memcontrol: Make memcg limits tier-aware
Posted by Kaiyang Zhao 1 month, 1 week ago
On Tue, Feb 24, 2026 at 01:49:21PM -0500, Gregory Price wrote:
> 
> > 
> > > Is this typical in real life configurations?
> > 
> > I would say so. I think that the two examples above are realistic
> > scenarios that cloud providers and hyperscalers might face on tiered systems.
> > 
> 
> The answer is unequivocally yes.
> 
> Lacking tier-awareness is actually a huge blocker for deploying mixed
> workloads on large, dense memory systems with multiple tiers (2+).

Hello! I'm the author of the RFC in 2024. Just want to add that we've
recently released a preprint paper on arXiv that includes case studies
with a few of Meta's production workloads using a prototype version of
the patches.

The results confirmed that co-located workloads can have working set
sizes exceeding the limited top-tier memory capacity given today's
server memory shapes and workload stacking settings, causing contention
for top-tier memory. Workloads see significant variations in tail
latency and throughput depending on the share of top-tier memory
they get, which this patch set will alleviate.

Best,
Kaiyang

[1] https://arxiv.org/pdf/2602.08800