[v3] mm/memcontrol, page_counter: move stock from mem_cgroup to page_counter

[PATCH 0/6 v3] mm/memcontrol, page_counter: move stock from mem_cgroup to page_counter

Posted by Joshua Hahn 2 days, 12 hours ago

Memcg currently keeps a "stock" of 64 pages per-cpu to cache pre-charged
allocations, allowing small allocations to avoid the expensive
mem_cgroup hierarchy walk and atomic operations on each charge.
This design provides a fastpath, but there is room for improvement:

1. Currently, each CPU tracks up to 7 (NR_MEMCG_STOCK) mem_cgroups. When
   more than 7 mem_cgroups are actively charging on a single CPU, a
   random victim is evicted and its associated stock is drained.

2. Stock management is tightly coupled to struct mem_cgroup, which makes
   it difficult to add a new page_counter to mem_cgroup and have
   multiple sources of stock management, which is required when trying
   to introduce fastpaths to multiple hard limit checks.
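
As an illustration of point 1, here is a minimal userspace model of a
single CPU's slot array. It is a sketch under assumptions, not the code
in mm/memcontrol.c: the struct, the function names and the use of rand()
are all hypothetical, and cgroups are reduced to integer ids.

```c
#include <stdlib.h>

#define NR_MEMCG_STOCK 7   /* slots per CPU, as stated above */
#define STOCK_BATCH    64  /* pages cached per slot, as stated above */

/* Hypothetical model of one CPU's stock; not the kernel's struct. */
struct cpu_stock_model {
	int owner[NR_MEMCG_STOCK];             /* cgroup id, -1 if empty */
	unsigned int nr_pages[NR_MEMCG_STOCK]; /* cached pages per slot */
	unsigned long drained_pages;           /* pages lost to evictions */
};

static void stock_init(struct cpu_stock_model *s)
{
	for (int i = 0; i < NR_MEMCG_STOCK; i++) {
		s->owner[i] = -1;
		s->nr_pages[i] = 0;
	}
	s->drained_pages = 0;
}

/* Fastpath: take nr pages from cg's slot. Returns 1 on a hit. */
static int stock_consume(struct cpu_stock_model *s, int cg, unsigned int nr)
{
	for (int i = 0; i < NR_MEMCG_STOCK; i++) {
		if (s->owner[i] == cg && s->nr_pages[i] >= nr) {
			s->nr_pages[i] -= nr;
			return 1;
		}
	}
	return 0;
}

/*
 * Refill: reuse cg's slot or an empty one. If all slots belong to other
 * cgroups, evict a random victim. Returns 1 if an eviction happened.
 */
static int stock_refill(struct cpu_stock_model *s, int cg, unsigned int nr)
{
	int empty = -1;

	for (int i = 0; i < NR_MEMCG_STOCK; i++) {
		if (s->owner[i] == cg) {
			s->nr_pages[i] += nr;
			return 0;
		}
		if (s->owner[i] < 0 && empty < 0)
			empty = i;
	}
	if (empty >= 0) {
		s->owner[empty] = cg;
		s->nr_pages[empty] = nr;
		return 0;
	}

	int victim = rand() % NR_MEMCG_STOCK;

	s->drained_pages += s->nr_pages[victim]; /* victim's stock is drained */
	s->owner[victim] = cg;
	s->nr_pages[victim] = nr;
	return 1;
}
```

In this model, an eighth cgroup charging on the same CPU always forces
an eviction, and the evicted cgroup has to take the slowpath again on
its next charge.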

This series moves the per-cpu stock down into the page_counter, which
consolidates stock limit checking and page_counter limit checking into
page_counter_try_charge. This eliminates the 7-memcg-per-cpu slot limit,
the random evictions (drain & refill), and slot traversal.
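
To make the idea concrete, here is a minimal userspace sketch of a
counter that owns its own per-cpu stock. It is not the code from this
series: the struct layout, the function names, the fixed CPU count and
the "retry with the exact amount" fallback are all assumptions made for
illustration, and there is no locking.

```c
#define MODEL_NR_CPUS 4   /* assumption: small fixed CPU count */
#define STOCK_BATCH   64  /* pages pre-charged per refill, as stated above */

/* Hypothetical model; field names are illustrative only. */
struct counter_model {
	unsigned long usage;                /* pages charged, stock included */
	unsigned long max;                  /* hard limit in pages */
	struct counter_model *parent;
	unsigned long stock[MODEL_NR_CPUS]; /* pre-charged pages per CPU */
};

/* Slowpath: charge nr pages to c and all ancestors, or unwind and fail. */
static int hierarchy_charge(struct counter_model *c, unsigned long nr)
{
	for (struct counter_model *p = c; p; p = p->parent) {
		if (p->usage + nr > p->max) {
			/* Undo the levels charged so far. */
			for (struct counter_model *q = c; q != p; q = q->parent)
				q->usage -= nr;
			return 0;
		}
		p->usage += nr;
	}
	return 1;
}

/* Stock check and limit check in a single entry point. */
static int model_try_charge(struct counter_model *c, int cpu, unsigned long nr)
{
	if (c->stock[cpu] >= nr) {          /* fastpath: no hierarchy walk */
		c->stock[cpu] -= nr;
		return 1;
	}
	if (nr < STOCK_BATCH && hierarchy_charge(c, STOCK_BATCH)) {
		c->stock[cpu] += STOCK_BATCH - nr; /* keep the surplus as stock */
		return 1;
	}
	return hierarchy_charge(c, nr);     /* no room for a batch */
}

/* Return one CPU's stock to the hierarchy, e.g. under pressure. */
static void model_drain_cpu(struct counter_model *c, int cpu)
{
	unsigned long nr = c->stock[cpu];

	c->stock[cpu] = 0;
	for (struct counter_model *p = c; p; p = p->parent)
		p->usage -= nr;
}
```

Because the stock lives in the counter itself, there is no slot array to
search and nothing to evict; a second counter in the same cgroup would
simply carry its own stock array.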

In turn, we can add independent stock management for additional
page_counters in each memcg, which is used in my tiered memory limits
series to add a new page_counter to track toptier usage [1].

The resulting code in memcg is also easier to follow, as the caching
becomes transparent from memcg's perspective and managed entirely within
page_counter.

There are, however, a few tradeoffs.

First, the bound on how much memory can be overcharged (and remain stale
as stock) is raised. Previously, it was fixed to nr_cpus x 7 x 64 pages.
Now, it becomes nr_leaf_cgroups x nr_cpus x 64 pages. On large machines
with many cgroups, this could be significant.
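
For a sense of scale, the two bounds can be computed directly. The
helper names below are hypothetical, and the 4 KiB page size and the
count of 1000 leaf cgroups used in the note afterwards are assumptions,
not measurements.

```c
#include <stdint.h>

#define STOCK_BATCH    64  /* pages per stock, as stated above */
#define NR_MEMCG_STOCK 7   /* slots per CPU before this series */

/* Worst-case stale stock before the series, in pages. */
static uint64_t old_bound_pages(uint64_t nr_cpus)
{
	return nr_cpus * NR_MEMCG_STOCK * STOCK_BATCH;
}

/* Worst-case stale stock after the series, in pages. */
static uint64_t new_bound_pages(uint64_t nr_cpus, uint64_t nr_leaf_cgroups)
{
	return nr_leaf_cgroups * nr_cpus * STOCK_BATCH;
}

/* Convert a page count to MiB for a given page size in bytes. */
static uint64_t pages_to_mib(uint64_t pages, uint64_t page_size)
{
	return pages * page_size / (1024 * 1024);
}
```

With 40 CPUs and 4 KiB pages, old_bound_pages(40) is 17920 pages, or
70 MiB. With the same 40 CPUs and 1000 leaf cgroups,
new_bound_pages(40, 1000) is 2560000 pages, or 10000 MiB.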

There are four qualifying points:
1. Larger machines should be able to tolerate the additional overhead.
2. Stock should not remain stale as long as the cgroups are actively
   charging memory.
3. Getting close to this overhead is rare, as it would require a process
   to migrate across all CPUs and leave stock there.
4. These charges are not "real" allocated memory, but rather accounting
   done in memcg; they are easily returned on pressure.

Secondly, we introduce some additional memory footprint. The new struct
page_counter_stock adds 2 words of extra overhead per-(cpu x memcg).
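
The footprint is easy to estimate. The sketch below assumes a "word" is
sizeof(unsigned long) and uses a hypothetical helper name; the CPU and
memcg counts in the note afterwards are example inputs only.

```c
#include <stdint.h>

#define STOCK_WORDS 2  /* extra words per (cpu x memcg), as stated above */

/* Total extra bytes for the given number of CPUs and memcgs. */
static uint64_t stock_footprint_bytes(uint64_t nr_cpus, uint64_t nr_memcgs)
{
	return nr_cpus * nr_memcgs * STOCK_WORDS * sizeof(unsigned long);
}
```

On a 64-bit system with 8-byte words, 40 CPUs and 1000 memcgs come to
640000 bytes (625 KiB).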

A small change is that for cgroupv1, reported memsw usage can be lower
than reported memory usage, if the memsw page_counter overcharges to its
stock whereas the memory page_counter does not.

Finally, to keep the above memory footprint limited, I opted to not
embed a work_struct into page_counter_stock, but rather decided to
trigger synchronous stock draining, since the drain operation is rarer
now, and only happens under memory pressure and on cgroup death.

One side effect of doing synchronous work is that drain_all_stock holds
the percpu_charge_mutex for longer while it performs the drain. In
theory, this makes chargers more likely to fail to grab the mutex,
exhaust MAX_RECLAIM_RETRIES, and OOM. In practice, I have not been able
to reproduce this behavior in my experiments.

Performance testing across single-cgroup, as well as 4-cgroup (under the
7 memcg limit) and 32-cgroup scenarios on a 40CPU, 50G memory system
shows modest performance gains (~1-2%). In the tests, I repeatedly
fault and release anonymous pages using madvise(MADV_DONTNEED) to
stress the charge/uncharge path, across 30 trials of 50 iterations.
The metric is the time taken per iteration (ms); lower is better.

+----------+--------+-------+-----------+
| #cgroups | before | after | delta (%) |
+----------+--------+-------+-----------+
|        1 |    357 |   350 |    -1.960 |
|        4 |   1221 |  1204 |    -1.392 |
|       32 |   9184 |  9032 |    -1.682 |
+----------+--------+-------+-----------+

Further testing on other stress-ng microbenchmarks also agreed with
these results.
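
For readers who want to try something similar, a fault-and-release loop
of this kind can be sketched as below. This is not the actual test
program: the function name, the buffer size and the timing method are
assumptions, and placing the process into a cgroup is left to the
caller.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/*
 * Fault in and release len bytes of anonymous memory, iters times.
 * Returns the elapsed time in milliseconds, or -1.0 on failure.
 */
static double fault_release_ms(size_t len, int iters)
{
	long page = sysconf(_SC_PAGESIZE);
	struct timespec t0, t1;
	char *buf;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1.0;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < iters; i++) {
		/* Touch one byte per page so that each page is faulted in. */
		for (size_t off = 0; off < len; off += (size_t)page)
			buf[off] = 1;
		/* Drop the pages again; the next pass faults them back in. */
		if (madvise(buf, len, MADV_DONTNEED)) {
			munmap(buf, len);
			return -1.0;
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	munmap(buf, len);
	return (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
}
```

Running one such loop per cgroup, with each process attached to its own
memcg, exercises the charge and uncharge paths on every pass.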

v2 --> v3:
- Dropped the cgroup v2 optimization, since it could indeed lead to
  holding the cgroup_mutex for too long. Instead we let the stock
  accumulate in the parent cgroups, which is not so bad; charges can
  still land on these cgroups, and if we ever reach the mem_cgroup
  limit, we can easily return those charges.
- page_counter_disable_stock no longer drains, just prevents
  accumulating stock. The actual draining is done in the free_stock
  variant, where we know for sure there are no in-flight charges.
- Reordered the page_counter_disable_stock path to disable stock before
  draining, so that no new stock accumulates in between.
- Skip isolated CPUs when draining synchronously
- Rebase on newest mm-new
- Wordsmithing

v1 --> v2:
- Dropped stock returning on uncharge to preserve same behavior as memcg
  stock. This resolves some race conditions present in v1.
- Fixed many race conditions between disabling page_counter_stock and
  in-flight charges
- Restructured drain_all_stock to iterate over all CPUs first before
  memcgs, to reduce the number of synchronous per-CPU work items
  scheduled
- Optimized cgroup v2 further to drain only on the first child and skip
  the root mem_cgroup
- Dropped RFC
- Wordsmithing cover letter

[1] https://lore.kernel.org/all/20260423203445.2914963-1-joshua.hahnjy@gmail.com/

Joshua Hahn (6):
  mm/page_counter: introduce per-page_counter stock
  mm/page_counter: use page_counter_stock in page_counter_try_charge
  mm/page_counter: introduce stock drain APIs
  mm/memcontrol: convert memcg to use page_counter_stock
  mm/memcontrol: optimize memsw stock for cgroup v1
  mm/memcontrol: remove unused memcg_stock code

 include/linux/page_counter.h |  16 ++
 mm/memcontrol.c              | 276 ++++++-----------------------------
 mm/page_counter.c            | 129 +++++++++++++++-
 3 files changed, 188 insertions(+), 233 deletions(-)

-- 
2.53.0-Meta

Re: [PATCH 0/6 v3] mm/memcontrol, page_counter: move stock from mem_cgroup to page_counter

Posted by Joshua Hahn 2 days, 11 hours ago

On Fri,  5 Jun 2026 08:35:56 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:

> Memcg currently keeps a "stock" of 64 pages per-cpu to cache pre-charged
> allocations, allowing small allocations to avoid walking the expensive
> mem_cgroup hierarchy traversal and atomic operations on each charge.
> This design introduces a fastpath, but there is room for improvement:
> 
> 1. Currently, each CPU tracks up to 7 (NR_MEMCG_STOCK) mem_cgroups. When
>    more than 7 mem_cgroups are actively charging on a single CPU, a
>    random victim is evicted and its associated stock is drained.
> 
> 2. Stock management is tightly coupled to struct mem_cgroup, which makes
>    it difficult to add a new page_counter to mem_cgroup and have
>    multiple sources of stock management, which is required when trying
>    to introduce fastpaths to multiple hard limit checks.
> 
> This series moves the per-cpu stock down into the page_counter which
> consolidates stock limit checking and page_counter limit checking into
> page_counter_try_charge. This eliminates the 7-memcg-per-cpu slot limit,
> the random evictions (drain & refill), and slot traversal.

Hello reviewers,

I was hoping to receive some input on a point that Sashiko raises.
The per-cpu draining work uses work_on_cpu(), which does
schedule_work_on() and flush_work() on the system_percpu_wq, which is
not WQ_MEM_RECLAIM. And drain_all_stock() runs from try_charge_memcg()
on the reclaim path, so it actually triggers the check_flush_dependency()
warning, since a WQ_MEM_RECLAIM context is flushing a !WQ_MEM_RECLAIM
workqueue.
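
The rule being tripped can be modeled roughly as follows. This is a
simplified userspace sketch of the kind of check involved, not the
kernel's check_flush_dependency(): the struct, the function name and the
flag values are placeholders, and the real function handles more cases.

```c
#include <stdbool.h>

/* Model-only flag bits; the values are placeholders, not the kernel's. */
#define MODEL_WQ_MEM_RECLAIM (1u << 0) /* workqueue is reclaim-safe */
#define MODEL_PF_MEMALLOC    (1u << 1) /* task is in reclaim context */

/* Hypothetical description of whoever is calling flush_work(). */
struct flusher_model {
	unsigned int task_flags;   /* flags of the flushing task */
	bool is_wq_worker;         /* flusher is itself a work item */
	unsigned int own_wq_flags; /* flags of the flusher's workqueue */
};

/* Returns true if flushing work on the target workqueue should warn. */
static bool flush_would_warn(const struct flusher_model *f,
			     unsigned int target_wq_flags)
{
	/* Flushing a reclaim-safe workqueue is always fine. */
	if (target_wq_flags & MODEL_WQ_MEM_RECLAIM)
		return false;

	/* A reclaiming task must not wait on work that may need memory. */
	if (f->task_flags & MODEL_PF_MEMALLOC)
		return true;

	/* Neither may a work item running on a reclaim-safe workqueue. */
	return f->is_wq_worker && (f->own_wq_flags & MODEL_WQ_MEM_RECLAIM);
}
```

In this model, the warning depends only on the flags of the two sides,
not on what the flushed work actually does, which is why work that never
allocates can still trigger it.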

In my testing, I haven't seen this become an issue. The drain work only
takes a local_lock() and does atomic operations, and it never allocates,
so there is no question of whether we can make forward progress.

But this does trip the WARN_ON, since none of this is visible to the
workqueue code.
I see three options here:

1. Trust that this is OK, and document that we can always make forward
   progress.
2. Keep the draining work synchronous, but queue and flush on memcg_wq
   marked WQ_MEM_RECLAIM instead of just using work_on_cpu(). This would
   add 2 words per-cpu-memcg for the work struct backpointer.
3. Go back to asynchronous, which would get rid of all the synchronous
   concerns, but add an additional 2 words per-cpu-memcg for the work
   struct backpointer here as well.

What do you think is the right decision here? I have been thinking about
this quite a bit recently and decided to just send the series out, but I
think I should have asked for upstream opinion sooner...

I would prefer to keep the memory footprint of this series minimal, and
opting to do things synchronously helped achieve this goal since we can
get rid of the backpointers. But I think this is beginning to show up as
a tradeoff, so I would really appreciate any input on what seems to be
the best decision here.

Thank you very much for your time!
Joshua

Re: [PATCH 0/6 v3] mm/memcontrol, page_counter: move stock from mem_cgroup to page_counter

Posted by Joshua Hahn 2 days, 12 hours ago

On Fri,  5 Jun 2026 08:35:56 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:

> Memcg currently keeps a "stock" of 64 pages per-cpu to cache pre-charged
> allocations, allowing small allocations to avoid walking the expensive
> mem_cgroup hierarchy traversal and atomic operations on each charge.
> This design introduces a fastpath, but there is room for improvement:
> 
> 1. Currently, each CPU tracks up to 7 (NR_MEMCG_STOCK) mem_cgroups. When
>    more than 7 mem_cgroups are actively charging on a single CPU, a
>    random victim is evicted and its associated stock is drained.
> 
> 2. Stock management is tightly coupled to struct mem_cgroup, which makes
>    it difficult to add a new page_counter to mem_cgroup and have
>    multiple sources of stock management, which is required when trying
>    to introduce fastpaths to multiple hard limit checks.
> 
> This series moves the per-cpu stock down into the page_counter which
> consolidates stock limit checking and page_counter limit checking into
> page_counter_try_charge. This eliminates the 7-memcg-per-cpu slot limit,
> the random evictions (drain & refill), and slot traversal.

Sorry, two things that I forgot to add as reviewers' notes:
- There was a previous v3, but that was just a rebase, so I wasn't sure
  how to name that / this. I decided to name this one v3, since the
  last one didn't have any changes at all. I apologize for any
  confusion.
- It is quite late in the merge cycle, so this is intended for the next
  cycle, not this one.

Thank you!