[v2] mm/memcontrol, page_counter: move stock from mem_cgroup to page_counter

[PATCH 0/7 v2] mm/memcontrol, page_counter: move stock from mem_cgroup to page_counter

Posted by Joshua Hahn 1 day, 21 hours ago

Memcg currently keeps a "stock" of 64 pages per-cpu to cache pre-charged
allocations, allowing small allocations to avoid the expensive
mem_cgroup hierarchy walk and atomic operations on each charge.
This design introduces a fastpath, but there is room for improvement:

1. Currently, each CPU tracks up to 7 (NR_MEMCG_STOCK) mem_cgroups. When
   more than 7 mem_cgroups are actively charging on a single CPU, a
   random victim is evicted and its associated stock is drained.

2. Stock management is tightly coupled to struct mem_cgroup, which makes
   it difficult to add a new page_counter to mem_cgroup and have
   multiple sources of stock management, which is required when trying
   to introduce fastpaths to multiple hard limit checks.
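
To make point 1 concrete, here is a minimal userspace model of one
CPU's slot-based stock. It is only a sketch under stated assumptions:
struct cpu_stock_model and the stock_model_* helpers are hypothetical
names, not kernel code, and the refill logic is simplified. It shows
how an eighth actively charging memcg forces a random victim to be
drained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NR_MEMCG_STOCK 7   /* slots per CPU, as described above */
#define STOCK_BATCH    64  /* pages cached per refill, as described above */

/* Hypothetical stand-in for one CPU's stock; not the kernel structure. */
struct cpu_stock_model {
	int owner[NR_MEMCG_STOCK];             /* memcg id per slot, -1 if empty */
	unsigned int nr_pages[NR_MEMCG_STOCK]; /* pre-charged pages left per slot */
	unsigned long drains;                  /* evictions that drained a victim */
};

static void stock_model_init(struct cpu_stock_model *s)
{
	for (size_t i = 0; i < NR_MEMCG_STOCK; i++) {
		s->owner[i] = -1;
		s->nr_pages[i] = 0;
	}
	s->drains = 0;
}

/*
 * Charge nr_pages for memcg_id on this CPU. Returns 1 when the charge
 * is served from cached stock (fast path) and 0 when the modeled slow
 * path runs and refills a slot.
 */
static int stock_model_charge(struct cpu_stock_model *s, int memcg_id,
			      unsigned int nr_pages)
{
	size_t i, slot = NR_MEMCG_STOCK;

	assert(nr_pages <= STOCK_BATCH);

	for (i = 0; i < NR_MEMCG_STOCK; i++) {
		if (s->owner[i] == memcg_id) {
			if (s->nr_pages[i] >= nr_pages) {
				s->nr_pages[i] -= nr_pages;
				return 1;
			}
			slot = i;	/* own slot, but not enough stock */
			break;
		}
		if (s->owner[i] == -1 && slot == NR_MEMCG_STOCK)
			slot = i;	/* remember the first empty slot */
	}

	if (slot == NR_MEMCG_STOCK) {
		/* No matching or empty slot: evict a random victim. */
		slot = (size_t)rand() % NR_MEMCG_STOCK;
		s->drains++;
	}

	if (s->owner[slot] != memcg_id) {
		s->owner[slot] = memcg_id;
		s->nr_pages[slot] = 0;	/* previous owner's stock is dropped */
	}

	/* Model the slow path: charge one batch, keep the remainder. */
	s->nr_pages[slot] += STOCK_BATCH - nr_pages;
	return 0;
}
```

Charging from seven distinct ids fills every slot without a drain; the
first charge from an eighth id increments drains regardless of which
victim rand() picks.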

This series moves the per-cpu stock down into the page_counter, which
consolidates stock limit checking and page_counter limit checking into
page_counter_try_charge. This eliminates the 7-memcg-per-cpu slot limit,
the random evictions (drain & refill), and slot traversal.

In turn, we can add independent stock management for additional
page_counters in each memcg, which is used in my tiered memory limits
series to add a new page_counter to track toptier usage [1].

The resulting code in memcg is also easier to follow, as the caching
becomes transparent from memcg's perspective and managed entirely within
page_counter.

There are, however, a few tradeoffs.

First, the bound on how much memory can be overcharged (and remain stale
as stock) is raised. Previously, it was fixed to nr_cpus x 7 x 64 pages.
Now, it becomes nr_leaf_cgroups x nr_cpus x 64 pages. On large machines
with many cgroups, this could be significant. There are three qualifying
points: (1) larger machines should be able to tolerate the additional
overhead, (2) the stock should not remain stale as long as the
cgroups are actively charging memory, and (3) a process would have to
migrate across all CPUs to incur this upper bound on overhead.
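
For a sense of scale, the two bounds can be computed directly. The
helper names below are made up for illustration, and the 4 KiB page
size used in the note that follows is an assumption, not something
stated above.

```c
#include <assert.h>

#define STOCK_PAGES    64UL  /* pages of stock per slot or per counter */
#define OLD_NR_SLOTS   7UL   /* NR_MEMCG_STOCK in the current design */

/* Old worst case: every slot on every CPU holds a full stock. */
static unsigned long old_stale_bound_pages(unsigned long nr_cpus)
{
	return nr_cpus * OLD_NR_SLOTS * STOCK_PAGES;
}

/* New worst case: every leaf cgroup holds a full stock on every CPU. */
static unsigned long new_stale_bound_pages(unsigned long nr_leaf_cgroups,
					   unsigned long nr_cpus)
{
	return nr_leaf_cgroups * nr_cpus * STOCK_PAGES;
}

/* Convert a page count to MiB for a given page size in bytes. */
static unsigned long pages_to_mib(unsigned long pages, unsigned long page_size)
{
	assert(page_size > 0);
	return pages * page_size / (1024UL * 1024UL);
}
```

With 40 CPUs and 4 KiB pages, the old bound is 17920 pages (70 MiB).
The new bound with 32 leaf cgroups is 81920 pages (320 MiB), while
with 4 leaf cgroups it is 10240 pages (40 MiB). The new bound only
exceeds the old one once there are more than 7 leaf cgroups.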

Secondly, we introduce some additional memory footprint. The new struct
page_counter_stock adds 2 words of extra overhead per-(cpu x memcg).
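
As a back-of-the-envelope check, the footprint can be estimated as
below. The function name is hypothetical, and the estimate assumes one
page_counter_stock per (cpu x memcg) as stated above; it does not
account for allocator padding or additional counters per memcg.

```c
#include <assert.h>
#include <stddef.h>

/* Extra bytes used by stock: 2 words for each (cpu x memcg) pair. */
static size_t stock_footprint_bytes(size_t nr_cpus, size_t nr_memcgs,
				    size_t word_size)
{
	assert(word_size > 0);
	return 2 * word_size * nr_cpus * nr_memcgs;
}
```

On a 64-bit machine (8-byte words) with 40 CPUs and 32 memcgs, this
comes to 20480 bytes, or 20 KiB.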

A small change is that for cgroup v1, reported memsw usage can be lower
than reported memory usage, if the memsw page_counter overcharges to its
stock whereas the memory page_counter does not.

Finally, to keep the above memory footprint limited, I opted not to
embed a work_struct into page_counter_stock and instead drain stock
synchronously, since the drain operation is now rarer and only happens
under memory pressure and on cgroup death.

Performance testing across single-cgroup, 4-cgroup (under the 7-memcg
limit), and 32-cgroup scenarios on a 40-CPU, 50G-memory system shows
negligible performance differences. In the tests, I repeatedly fault
and release anonymous pages using madvise(MADV_DONTNEED) to stress the
charge/uncharge path, across 30 trials of 50 iterations. The metric is
the time each iteration took (ms).

+----------+--------+-------+-----------+
| #cgroups | before | after | delta (%) |
+----------+--------+-------+-----------+
|        1 |    352 |   353 |    +0.283 |
|        4 |   1198 |  1217 |    +1.585 |
|       32 |   8980 |  9027 |    +0.526 |
+----------+--------+-------+-----------+

Further testing on other stress-ng microbenchmarks also agreed with
these results.
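
For reference, the fault-and-release loop described above can be
approximated as follows. This is a sketch, not the actual test program:
the function name, buffer size and iteration count are placeholders,
and the caller is assumed to have already placed the process in the
cgroup under test.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/*
 * Fault in every page of an anonymous mapping, then release the pages
 * with MADV_DONTNEED, `iterations` times. Returns the elapsed time in
 * milliseconds, or -1.0 on error.
 */
static double fault_and_release_ms(size_t bytes, int iterations)
{
	long page = sysconf(_SC_PAGESIZE);
	struct timespec start, end;
	volatile char *buf;
	void *map;

	assert(page > 0 && bytes >= (size_t)page);

	map = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (map == MAP_FAILED)
		return -1.0;
	buf = map;

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (int i = 0; i < iterations; i++) {
		/* Writing one byte per page faults it in and charges it. */
		for (size_t off = 0; off < bytes; off += (size_t)page)
			buf[off] = 1;

		/* Dropping the pages uncharges them again. */
		if (madvise(map, bytes, MADV_DONTNEED) != 0) {
			munmap(map, bytes);
			return -1.0;
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &end);

	munmap(map, bytes);
	return (double)(end.tv_sec - start.tv_sec) * 1e3 +
	       (double)(end.tv_nsec - start.tv_nsec) / 1e6;
}
```

Running one such loop per cgroup, with the processes spread across
CPUs, would exercise the multi-cgroup scenarios in the table.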

v1 --> v2:
- Dropped returning stock on uncharge, to preserve the same behavior as
  memcg stock. This resolves some race conditions present in v1.
- Fixed many race conditions between disabling page_counter_stock and
  in-flight charges
- Restructured drain_all_stock to iterate over all CPUs first before
  memcgs, to reduce the number of synchronous per-CPU work items
  scheduled
- Optimized cgroup v2 further to drain only on the first child and skip
  the root mem_cgroup
- Dropped RFC
- Wordsmithing cover letter

[1] https://lore.kernel.org/all/20260423203445.2914963-1-joshua.hahnjy@gmail.com/

Joshua Hahn (7):
  mm/page_counter: introduce per-page_counter stock
  mm/page_counter: use page_counter_stock in page_counter_try_charge
  mm/page_counter: introduce stock drain APIs
  mm/memcontrol: convert memcg to use page_counter_stock
  mm/memcontrol: optimize memsw stock for cgroup v1
  mm/memcontrol: optimize stock usage for cgroup v2
  mm/memcontrol: remove unused memcg_stock code

 include/linux/page_counter.h |  16 ++
 mm/memcontrol.c              | 286 ++++++++---------------------------
 mm/page_counter.c            | 140 ++++++++++++++++-
 3 files changed, 212 insertions(+), 230 deletions(-)

-- 
2.53.0-Meta

Re: [PATCH 0/7 v2] mm/memcontrol, page_counter: move stock from mem_cgroup to page_counter

Posted by Joshua Hahn 1 day, 16 hours ago

On Fri, 22 May 2026 15:06:18 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:

> Memcg currently keeps a "stock" of 64 pages per-cpu to cache pre-charged
> allocations, allowing small allocations to avoid walking the expensive
> mem_cgroup hierarchy traversal and atomic operations on each charge.
> This design introduces a fastpath, but there is room for improvement:

This iteration was developed and tested on top of mm-stable.
I'm seeing now that Sashiko cannot apply this patch, and I think it expects
it to have been built on top of mm-new.

Reviewers -- would it make sense to rebase this on top of mm-new and
re-send this as a v3, or should I wait for feedback on this cycle
before sending out a new version?

In any case, this could have been avoided if I just developed on top of
mm-new. I'll be mindful to do that in the future.

Thank you everyone,
Joshua

Re: [PATCH 0/7 v2] mm/memcontrol, page_counter: move stock from mem_cgroup to page_counter

Posted by Andrew Morton 1 day, 15 hours ago

On Fri, 22 May 2026 19:50:50 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:

> On Fri, 22 May 2026 15:06:18 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> 
> > Memcg currently keeps a "stock" of 64 pages per-cpu to cache pre-charged
> > allocations, allowing small allocations to avoid walking the expensive
> > mem_cgroup hierarchy traversal and atomic operations on each charge.
> > This design introduces a fastpath, but there is room for improvement:
> 
> This iteration was developed and tested on top of mm-stable.
> I'm seeing now that Sashiko cannot apply this patch, and I think it expects
> it to have been built on top of mm-new.
> 
> Reviewers -- would it make sense to rebase this on top of mm-new and
> re-send this as a v3, or should I wait for feedback on this cycle
> before sending out a new version?
> 
> In any case, this could have been avoided if I just developed on top of
> mm-new. I'll be mindful to do that in the future.

Sashiko does attempt various branches, including mm-stable, so I'm not
sure what went wrong here.

It's a bit of a crapshoot at this time, as mm-new is still growing like
a weed (19 patches yesterday, 24 so far today).  Slowing way down soon!
I'll be pushing mm-new later this evening and after that it shouldn't change
much for several days because we all take weekends off, don't we?

Convention appears to say "wait a week before resending" but IMO
there's value in parallelizing Sashiko review with human review.  And
it's understandable if some humans aren't very motivated to review
until the AI thing has had a shot at it, because that might result in
alterations.  So I'd say that a resend-for-Sashiko is appropriate.