[v1] mm/memcontrol, page_counter: move stock from mem_cgroup to page_counter

[PATCH 0/8 RFC] mm/memcontrol, page_counter: move stock from mem_cgroup to page_counter

Posted by Joshua Hahn 2 months, 1 week ago

Memcg currently keeps a "stock" of 64 pages per-cpu to cache pre-charged
allocations, allowing small allocations and frees to avoid an expensive
mem_cgroup hierarchy walk on each charge. This design introduces a
fastpath to charge/uncharge, but has several limitations:

1. Each CPU can track up to 7 (NR_MEMCG_STOCK) mem_cgroups. When more
   than 7 mem_cgroups are actively charging on a single CPU, a random
   victim is evicted, and its associated stock is drained, which
   triggers unnecessary hierarchy walks.

   Note that previously there used to be a 1-1 mapping between CPU and
   memcg stock; it was bumped up to 7 in f735eebe55f8f ("multi-memcg
   percpu charge cache") because it was observed that stock would
   frequently get flushed and refilled.

2. Stock management is tightly coupled to struct mem_cgroup, which
   makes it difficult to add a new page_counter to struct mem_cgroup
   and do its own stock management, since each operation has to be
   duplicated.

3. Each stock slot requires a css reference, as well as a traversal
   overhead on every stock operation to check which cpu-memcg we are
   trying to consume stock for.
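
To make limitations 1 and 3 concrete, here is a minimal user-space sketch
of a per-CPU stock with NR_MEMCG_STOCK slots. It is an illustration only,
not the code in mm/memcontrol.c: the struct and function names are
hypothetical, an integer id stands in for the css reference, locking is
omitted, and rand() stands in for whatever victim selection the kernel uses.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_MEMCG_STOCK 7	/* slots per CPU, as described in the cover letter */

struct stock_slot {
	int memcg_id;		/* 0 means empty; stands in for a css reference */
	unsigned int nr_pages;	/* pre-charged pages cached for that memcg */
};

struct cpu_stock_sketch {
	struct stock_slot slot[NR_MEMCG_STOCK];
	unsigned int nr_evictions;	/* counts drains forced by a full slot array */
};

/* Fast path: scan the slots for this memcg and take pages if enough are cached. */
int consume_stock_sketch(struct cpu_stock_sketch *s, int memcg_id, unsigned int nr)
{
	int i;

	assert(memcg_id != 0);
	for (i = 0; i < NR_MEMCG_STOCK; i++) {
		if (s->slot[i].memcg_id == memcg_id && s->slot[i].nr_pages >= nr) {
			s->slot[i].nr_pages -= nr;
			return 1;
		}
	}
	return 0;
}

/* Refill: reuse the memcg's slot, else an empty one, else evict a random victim. */
void refill_stock_sketch(struct cpu_stock_sketch *s, int memcg_id, unsigned int nr)
{
	int i, target = -1;

	assert(memcg_id != 0);
	for (i = 0; i < NR_MEMCG_STOCK; i++) {
		if (s->slot[i].memcg_id == memcg_id) {
			target = i;
			break;
		}
		if (target < 0 && s->slot[i].memcg_id == 0)
			target = i;
	}
	if (target < 0) {
		/* All slots busy: the victim's stock is drained (a hierarchy walk). */
		target = rand() % NR_MEMCG_STOCK;
		s->slot[target].nr_pages = 0;
		s->nr_evictions++;
	}
	s->slot[target].memcg_id = memcg_id;
	s->slot[target].nr_pages += nr;
}
```

With seven memcgs refilled on one CPU, an eighth refill has to evict one of
them, and every consume has to scan the slot array first.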

This series moves the per-cpu stock down into the page_counter, which
consolidates stock limit checking and page_counter limit checking into
page_counter_try_charge. This eliminates the 7-memcg-per-cpu slot
limit, the random evictions (drain & refill), slot traversal, and
css refcounting.
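
The idea can be sketched as a simplified, single-threaded user-space model.
This is not the patch itself: struct pc_sketch and both functions are
hypothetical names, the CPU count is arbitrary, atomics and locking are
omitted, and only the counter being charged consults its stock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS_SKETCH	4	/* hypothetical CPU count for the model */
#define STOCK_BATCH	64	/* pages pre-charged per refill, per the cover letter */

/* Simplified model; not the kernel's struct page_counter. */
struct pc_sketch {
	unsigned long usage;			/* pages charged, including idle stock */
	unsigned long max;			/* limit in pages */
	unsigned long stock[NR_CPUS_SKETCH];	/* pre-charged pages cached per CPU */
	struct pc_sketch *parent;
};

/* Slow path: charge nr pages to the counter and every ancestor, or change nothing. */
bool pc_charge_hierarchy(struct pc_sketch *pc, unsigned long nr)
{
	struct pc_sketch *c, *undo;

	for (c = pc; c; c = c->parent) {
		if (c->usage + nr > c->max) {
			/* Roll back the levels that were already charged. */
			for (undo = pc; undo != c; undo = undo->parent)
				undo->usage -= nr;
			return false;
		}
		c->usage += nr;
	}
	return true;
}

/* Fast path first: consume this CPU's stock; otherwise walk and try to refill. */
bool pc_try_charge(struct pc_sketch *pc, int cpu, unsigned long nr)
{
	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);

	if (pc->stock[cpu] >= nr) {
		pc->stock[cpu] -= nr;
		return true;
	}
	/* Charge a whole batch so later small charges hit the fast path. */
	if (nr <= STOCK_BATCH && pc_charge_hierarchy(pc, STOCK_BATCH)) {
		pc->stock[cpu] += STOCK_BATCH - nr;
		return true;
	}
	/* The batch did not fit (or the request is large): charge exactly nr. */
	return pc_charge_hierarchy(pc, nr);
}
```

In this model there is no slot array to scan and no victim to evict: each
counter owns its per-CPU stock, and the stock check and the limit check
live in the same function.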

In addition, it makes independent stock management scalable for future
users. As a demonstration, this series also introduces independent
stock management for the cgroup v1 memsw page_counter, which curbs
the likelihood of the worst-case scenario (traversing both the
memsw and memory page_counter hierarchies).

One change that should be noted is that draining is simplified to use
work_on_cpu() for synchronous remote CPU drain. This eliminates the
need for backpointers and embedded work_structs in the per-cpu stock
struct, which minimizes memory overhead. This change from the existing
async drain scheduling was made because the drain operation is now much
rarer, only happening under memory pressure and on cgroup
death (as opposed to the previous arbitrary scenario where more than
7 memcgs are charging to a CPU).

Performance testing across single-cgroup, 4-cgroup (under the 7-memcg
limit), and 32-cgroup scenarios on a 40-CPU, 50G memory system shows
negligible performance differences. In the tests, I repeatedly fault
and release anonymous pages using madvise(MADV_DONTNEED) to stress the
charge/uncharge path, across 30 trials of 50 iterations. The metric is
the time each iteration took (ms).

+----------+--------+-------+--------+-----------+
| #cgroups | before | after | stddev | delta (%) |
+----------+--------+-------+--------+-----------+
|        1 |    446 |   441 |  5.097 |    -1.195 |
|        4 |   1832 |  1822 | 11.897 |    -0.582 |
|       32 |  14730 | 14739 | 54.089 |     0.061 |
+----------+--------+-------+--------+-----------+

Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>

Joshua Hahn (8):
  mm/page_counter: introduce per-page_counter stock
  mm/page_counter: use page_counter_stock in page_counter_try_charge
  mm/page_counter: use page_counter_stock in page_counter_uncharge
  mm/page_counter: introduce stock drain APIs
  mm/memcontrol: convert memcg to use page_counter_stock
  mm/memcontrol: optimize memsw stock for cgroup v1
  mm/memcontrol: optimize stock usage for cgroup v2
  mm/memcontrol: remove unused memcg_stock code

 include/linux/page_counter.h |  15 ++
 mm/memcontrol.c              | 269 ++++++-----------------------------
 mm/page_counter.c            | 173 +++++++++++++++++++++-
 3 files changed, 224 insertions(+), 233 deletions(-)

-- 
2.52.0

Re: [PATCH 0/8 RFC] mm/memcontrol, page_counter: move stock from mem_cgroup to page_counter

Posted by Michal Hocko 2 months, 1 week ago

On Fri 10-04-26 14:06:54, Joshua Hahn wrote:
> Memcg currently keeps a "stock" of 64 pages per-cpu to cache pre-charged
> allocations, allowing small allocations and frees to avoid walking the
> expensive mem_cgroup hierarchy traversal on each charge. This design
> introduces a fastpath to charge/uncharge, but has several limitations:
> 
> 1. Each CPU can track up to 7 (NR_MEMCG_STOCK) mem_cgroups. When more
>    than 7 mem_cgroups are actively charging on a single CPU, a random
>    victim is evicted, and its associated stock is drained, which
>    triggers unnecessary hierarchy walks.
> 
>    Note that previously there used to be a 1-1 mapping between CPU and
>    memcg stock; it was bumped up to 7 in f735eebe55f8f ("multi-memcg
>    percpu charge cache") because it was observed that stock would
>    frequently get flushed and refilled.

All true but it is quite important to note that this all is bounded to
nr_online_cpus*NR_MEMCG_STOCK*MEMCG_CHARGE_BATCH. You are proposing to
increase this to s@NR_MEMCG_STOCK@nr_leaf_cgroups@. In environments with
many cpus and directly charged cgroups this can be considerable
hidden overcharge. Have you considered that and evaluated potential
impact?

> 2. Stock management is tightly coupled to struct mem_cgroup, which
>    makes it difficult to add a new page_counter to struct mem_cgroup
>    and do its own stock management, since each operation has to be
>    duplicated.

Could you expand why this is a problem we need to address?

> 3. Each stock slot requires a css reference, as well as a traversal
>    overhead on every stock operation to check which cpu-memcg we are
>    trying to consume stock for.

Why is this a problem?
 
Please also be more explicit what kind of workloads are going to benefit
from this change. The existing caching scheme is simple and ineffective
but is it worth improving (likely your points 2 and 3 could clarify that)?

All that being said, I like the resulting code which is much easier to
follow. The caching is nicely transparent in the charging path which is
a plus. My main worry is that caching has caused some confusion in the
past and this change will amplify that by scaling the amount of
cached charge. This needs to be really carefully evaluated.
-- 
Michal Hocko
SUSE Labs

Re: [PATCH 0/8 RFC] mm/memcontrol, page_counter: move stock from mem_cgroup to page_counter

Posted by Joshua Hahn 2 months, 1 week ago

On Mon, 13 Apr 2026 09:23:38 +0200 Michal Hocko <mhocko@suse.com> wrote:

Hello Michal,

Thank you for your review as always!

> On Fri 10-04-26 14:06:54, Joshua Hahn wrote:
> > Memcg currently keeps a "stock" of 64 pages per-cpu to cache pre-charged
> > allocations, allowing small allocations and frees to avoid walking the
> > expensive mem_cgroup hierarchy traversal on each charge. This design
> > introduces a fastpath to charge/uncharge, but has several limitations:
> > 
> > 1. Each CPU can track up to 7 (NR_MEMCG_STOCK) mem_cgroups. When more
> >    than 7 mem_cgroups are actively charging on a single CPU, a random
> >    victim is evicted, and its associated stock is drained, which
> >    triggers unnecessary hierarchy walks.
> > 
> >    Note that previously there used to be a 1-1 mapping between CPU and
> >    memcg stock; it was bumped up to 7 in f735eebe55f8f ("multi-memcg
> >    percpu charge cache") because it was observed that stock would
> >    frequently get flushed and refilled.
> 
> All true but it is quite important to note that this all is bounded to
> nr_online_cpus*NR_MEMCG_STOCK*MEMCG_CHARGE_BATCH. You are proposing to
> increase this to s@NR_MEMCG_STOCK@nr_leaf_cgroups@. In environments with
> many cpus and directly charged cgroups this can be considerable
> hidden overcharge. Have you considered that and evaluated potential
> impact?

This is a great point. I would like to note, though, that for systems running
fewer than 7 leaf cgroups (I'm not sure what systems typically look like outside
of Meta, so I cannot say whether this is likely or not!) this change would
be an optimization since we allocate only for the leaf cgroups we need ;-)

But let's do the math for the worst-case scenario:
Because we initialize the stock to be 0 and only refill on a charge /
uncharge, the worst-case scenario involves a workload that charges
to all CPUs just once, so that it is not enough to benefit from the
caching. On a very large system, say 300 CPUs, with 4k pages, that's
300 * 64 * 4 KiB = 75 MiB of overcharging per leaf-cgroup.
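
The arithmetic above, together with the bound quoted from Michal's reply,
can be written as two small helpers. This is only an illustrative
user-space sketch: the function names are hypothetical, and the constants
are the values used in this thread (64 pages per stock, 7 slots per CPU).

```c
#include <assert.h>
#include <stdint.h>

#define MEMCG_CHARGE_BATCH	64ULL	/* pages per stock, per the cover letter */
#define NR_MEMCG_STOCK		7ULL	/* per-CPU slots in the existing scheme */

/* Upper bound on idle pre-charged bytes with the existing per-CPU slot cache. */
uint64_t stock_bound_slots(uint64_t nr_cpus, uint64_t page_size)
{
	assert(page_size > 0);
	return nr_cpus * NR_MEMCG_STOCK * MEMCG_CHARGE_BATCH * page_size;
}

/* Upper bound with one stock per CPU for each directly charged (leaf) cgroup. */
uint64_t stock_bound_per_counter(uint64_t nr_cpus, uint64_t nr_leaf_cgroups,
				 uint64_t page_size)
{
	assert(page_size > 0);
	return nr_cpus * nr_leaf_cgroups * MEMCG_CHARGE_BATCH * page_size;
}
```

For 300 CPUs and 4 KiB pages, stock_bound_per_counter(300, 1, 4096) is the
75 MiB per leaf cgroup computed above, while stock_bound_slots(300, 4096)
is seven times that for the whole system regardless of how many cgroups
exist. The second bound grows linearly with the number of leaf cgroups.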

This is definitely a serious amount of overcharging. With that said, I
would like to note that this seems like quite a rare scenario; what
would cause a workload to jump across 300 CPUs? For this to be a regression
it also has to be 8+ workloads all jumping around the CPUs and storing
not-to-be-used cache on all of them, and anything below that would still be
an optimization over the current setup.

Also, let's talk about what happens when we do reach the worst-case scenario.
Once we reach the degenerate state where the stock is charged and the workload
has no intention of running on the CPUs with idle cache, we would eventually
reach the failure branch of try_charge_memcg, which drains all stock!
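
As a rough illustration of that recovery path, here is a simplified
user-space model. It is not try_charge_memcg: the names are hypothetical,
there is a single counter with no hierarchy, and reclaim and the rest of
the failure handling are left out.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_SKETCH 4	/* hypothetical CPU count for the model */

struct counter_sketch {
	unsigned long usage;			/* pages charged, including idle stock */
	unsigned long max;			/* limit in pages */
	unsigned long stock[NR_CPUS_SKETCH];	/* idle pre-charged pages per CPU */
};

/* Return every CPU's idle stock to the counter; returns the pages released. */
unsigned long drain_all_stock_sketch(struct counter_sketch *c)
{
	unsigned long drained = 0;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS_SKETCH; cpu++) {
		drained += c->stock[cpu];
		c->stock[cpu] = 0;
	}
	assert(drained <= c->usage);
	c->usage -= drained;
	return drained;
}

/* Charge nr pages; on hitting the limit, drain idle stock once and retry. */
bool charge_with_drain_sketch(struct counter_sketch *c, unsigned long nr)
{
	int attempt;

	for (attempt = 0; attempt < 2; attempt++) {
		if (c->usage + nr <= c->max) {
			c->usage += nr;
			return true;
		}
		if (attempt == 0 && drain_all_stock_sketch(c) == 0)
			break;	/* nothing was cached, so a retry cannot succeed */
	}
	return false;
}
```

After a successful drain-and-retry, no CPU in the model holds stock, which
corresponds to the state where only the CPUs the workload actually runs on
get refilled.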

So IMO, the issue of overcharging isn't too bad. It's very difficult
to reach the scenario where all CPUs are caching idle stock, and the existing
recovery mechanism in try_charge_memcg puts us right back into the optimal
scenario where none of the CPUs have stock, and we only refill those that
the workload runs on. I'll be sure to add this in the next spin of the series,
since I think it's important to note. (The other overhead is the memory
that we have to allocate percpu for each of the stock structs, which is
only 2 words/cpu/memcg, including parents. But it is still worth noting explicitly!)

Above is the perspective from the system, in terms of memory pressure and
overcharging. From a user interpretability POV, I think there is a gap:
a workload can litter unused charge everywhere while there is not enough
memory pressure to trigger a drain_all_stock, so a user might be confused
about why their workload is using so much memory.

I think this could be a problem, especially if there is a userspace
load balancer that schedules work based on how much memory the workload is
using. At Meta we use Senpai in userspace to create benevolent memory pressure
that should be enough to reap cold memory (and also idle stock), but I'm
wondering what this will mean for systems that don't have such cold memory
purging mechanisms. I'll think about this a little bit more.

> > 2. Stock management is tightly coupled to struct mem_cgroup, which
> >    makes it difficult to add a new page_counter to struct mem_cgroup
> >    and do its own stock management, since each operation has to be
> >    duplicated.
> 
> Could you expand why this is a problem we need to address?

Yes of course. To give some context, I realized that stock was a bit
uncomfortable to work with at a memcg granularity when I tried to introduce
a new page counter for toptier memory tracking (in order to enforce strict
limits). I didn't explicitly note this in the cover letter because I thought
that there was a lot of good motivation aside from the specific use case
I was thinking of, so I decided to leave it out. What do you think? :-)

I'm not a memcgv1 user so I cannot tell from experience whether this is a
pain point or not, but I also found it awkward that one stock gated the
charges for two page_counters, memsw and memory. A single stock miss made the
slowpath incur double the hierarchy walks; keeping the stocks separate makes
it less likely for both page_counter hierarchy walks to happen on a single
charge attempt.

> > 3. Each stock slot requires a css reference, as well as a traversal
> >    overhead on every stock operation to check which cpu-memcg we are
> >    trying to consume stock for.
> 
> Why is this a problem?

I don't think this is really that big of a problem, but just something that
I wanted to note as a benefit of these changes. I remember being a bit
confused by the memcg slot scanning & traversal when reading the stock
code. Personally, I think being able to directly attribute stock
to the page_counter it comes from, as well as not randomly evicting stock,
could be helpful.

> Please also be more explicit what kind of workloads are going to benefit
> from this change. The existing caching scheme is simple and ineffective
> but is it worth improving (likely your points 2 and 3 could clarify that)?

I think that the biggest strength for this series is actually not with
performance gains but rather with more interpretable semantics for stock
management and transparent charging in try_charge_memcg.

But to break it down, any systems using fewer than 7 cgroups will get
reduced memory overhead (from the percpu structs) and comparable performance.
Any systems using more than 7 leaf cgroups will benefit because stock is
no longer randomly evicted and then refilled.

From my limited benchmark tests, these didn't seem too visible from a
wall time perspective. But I can trace for how often we refill the stock
in the next version, and I hope that it can show more tangible results.

> All that being said, I like the resulting code which is much easier to
> follow. The caching is nicely transparent in the charging path which is
> a plus. My main worry is that caching has caused some confusion in the
> past and this change will amplify that by the scaling the amount of
> cached charge. This needs to be really carefully evaluated.

Thank you for the words of encouragement Michal!!!

On the point of cached charge, I hope that I've explained it above, I'll
think some more about that scenario as well. 

One last thing to note, that is orthogonal to our conversation here. Above,
I assumed 4k pages. But on systems with bigger base page sizes like 64k,
maybe it makes sense to lower the amount of stock that is cached.
64 * 64 KiB = 4 MiB per CPU, maybe this is a bit overkill? ;-)

Thanks a lot for your thoughtful review, it is always appreciated.
I hope you have a great day!
Joshua

Re: [PATCH 0/8 RFC] mm/memcontrol, page_counter: move stock from mem_cgroup to page_counter

Posted by Michal Hocko 2 months, 1 week ago

On Mon 13-04-26 07:29:58, Joshua Hahn wrote:
> On Mon, 13 Apr 2026 09:23:38 +0200 Michal Hocko <mhocko@suse.com> wrote:
> 
> Hello Michal,
> 
> Thank you for your review as always!
> 
> > On Fri 10-04-26 14:06:54, Joshua Hahn wrote:
> > > Memcg currently keeps a "stock" of 64 pages per-cpu to cache pre-charged
> > > allocations, allowing small allocations and frees to avoid walking the
> > > expensive mem_cgroup hierarchy traversal on each charge. This design
> > > introduces a fastpath to charge/uncharge, but has several limitations:
> > > 
> > > 1. Each CPU can track up to 7 (NR_MEMCG_STOCK) mem_cgroups. When more
> > >    than 7 mem_cgroups are actively charging on a single CPU, a random
> > >    victim is evicted, and its associated stock is drained, which
> > >    triggers unnecessary hierarchy walks.
> > > 
> > >    Note that previously there used to be a 1-1 mapping between CPU and
> > >    memcg stock; it was bumped up to 7 in f735eebe55f8f ("multi-memcg
> > >    percpu charge cache") because it was observed that stock would
> > >    frequently get flushed and refilled.
> > 
> > All true but it is quite important to note that this all is bounded to
> > nr_online_cpus*NR_MEMCG_STOCK*MEMCG_CHARGE_BATCH. You are proposing to
> > increase this to s@NR_MEMCG_STOCK@nr_leaf_cgroups@. In environments with
> > many cpus and directly charged cgroups this can be considerable
> > hidden overcharge. Have you considered that and evaluated potential
> > impact?
> 
> This is a great point. I would like to note though, that for systems running
> less than 7 leaf cgroups (I'm not sure what systems typically look like outside
> of Meta, so I cannot say whether this is likely or not!) this change would
> be an optimization since we allocate only for the leaf cgroups we need ; -)
> 
> But let's do the math for the worst-case scenario:
> Because we initialize the stock to be 0 and only refill on a charge / 
> uncharge, the worst-case scenario involves a workload that charges
> to all CPUs just once, so that it is not enough to benefit from the
> caching. On a very large system, say 300 CPUs, with 4k pages, that's
> 300 * 64 * 4kb = 75 mb of overcharging per leaf-cgroup.
>
> This is definitely a serious amount of overcharging. With that said, I
> would like to note that this seems like quite a rare scenario; what
> would cause a workload to jump across 300 CPUs? 

A typical situation I would expect this to be more visible is a large
machine hosting a lot of smaller containers. Not an untypical situation.

Without external pressure those caches could accumulate a lot. On the
other hand, on a large machine the overall overcharging shouldn't cause
memory depletion even if we are talking about 1000s of memcgs. The
behavior will change though and this is something you should explain
in your changelog. There will certainly be cons that we need to weigh
against pros. There are many good points below that you can use.
[...]

> > > 2. Stock management is tightly coupled to struct mem_cgroup, which
> > >    makes it difficult to add a new page_counter to struct mem_cgroup
> > >    and do its own stock management, since each operation has to be
> > >    duplicated.
> > 
> > Could you expand why this is a problem we need to address?
> 
> Yes of course. So to give some context, I realized that stock was a bit
> uncomfortable to work with at a memcg granularity when I tried to introduce
> a new page counter for toptier memory tracking (in order to enforce strict
> limits. I didn't explicitly note this in the cover letter because I thought
> that there was a lot of good motivation aside from the specific use case
> I was thinking of, so decided to leave it out. What do you think? : -)

Yes, if there are future plans that might benefit from this then this is
worth mentioning. Just based on 1 I cannot really tell whether
going this way is better than tuning NR_MEMCG_STOCK. As I've said I like
the resulting code better but there are some practical cons as well.

> I'm not a memcgv1 user so I cannot tell from experience whether this is a
> pain point or not, but I also did find it awkward that one stock gated the
> charges for two page_counters memsw and memory, which made the slowpath
> incur double the hierarchy walks on a single stock failing, instead of keeping
> them separate so that it is less likely for both the page hierarchy walks
> to happen on a single charge attempt.

v1 is legacy and we decided long ago not to invest in new
optimizations/features.

> 
> > > 3. Each stock slot requires a css reference, as well as a traversal
> > >    overhead on every stock operation to check which cpu-memcg we are
> > >    trying to consume stock for.
> > 
> > Why is this a problem?
> 
> I don't think this is really that big of a problem, but just something that
> I wanted to note as a benefit of these changes. I remember being a bit
> confused by the memcg slot scanning & traversal when reading the stock
> code, personally I think being able to directly be able to attribute stock
> to the page_cache it comes from, as well as not randomly evicting stock
> could be helpful.

OK so this boils down to code clarity.

> > Please also be more explicit what kind of workloads are going to benefit
> > from this change. The existing caching scheme is simple and ineffective
> > but is it worth improving (likely your points 2 and 3 could clarify that)?
> 
> I think that the biggest strength for this series is actually not with
> performance gains but rather with more interpretable semantics for stock
> management and transparent charging in try_charge_memcg.
> 
> But to break it down, any systems using less than 7 cgroups will get
> reduced memory overhead (from the percpu structs) and comparable performance.
> Any systems using more than 7 leaf cgroups will benefit because stock is
> no longer randomly evicted and needed to refill.
> 
> From my limited benchmark tests, these didn't seem too visible from a
> wall time perspective. But I can trace for how often we refill the stock
> in the next version, and I hope that it can show more tangible results.

Another point for the changelog.
-- 
Michal Hocko
SUSE Labs