[PATCH 0/7 v3] mm/memcontrol, page_counter: move stock from mem_cgroup to page_counter

Posted by Joshua Hahn 2 weeks ago

Memcg currently keeps a "stock" of 64 pages per-cpu to cache pre-charged
allocations, allowing small allocations to avoid the expensive
mem_cgroup hierarchy traversal and atomic operations on each charge.
This design introduces a fastpath, but there is room for improvement:

1. Currently, each CPU tracks up to 7 (NR_MEMCG_STOCK) mem_cgroups. When
   more than 7 mem_cgroups are actively charging on a single CPU, a
   random victim is evicted and its associated stock is drained.

2. Stock management is tightly coupled to struct mem_cgroup, which makes
   it difficult to add a new page_counter to mem_cgroup and have
   multiple sources of stock management, which is required when trying
   to introduce fastpaths to multiple hard limit checks.

This series moves the per-cpu stock down into the page_counter, which
consolidates stock limit checking and page_counter limit checking into
page_counter_try_charge. This eliminates the 7-memcg-per-cpu slot limit,
the random evictions (drain & refill), and slot traversal.
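
As a rough illustration of the idea (a userspace sketch, not the code in
this series), the stock can live inside the counter itself so that the
charge function checks the local stock before walking the hierarchy. The
struct layouts, the names counter_stock, counter, charge_hierarchy and
counter_try_charge, and the fixed NR_CPUS are all assumptions made for the
example; locking and preemption handling are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS     4   /* assumption: small fixed CPU count for the sketch */
#define STOCK_BATCH 64  /* pages pre-charged per refill, as in the cover letter */

/* Hypothetical layout; the real struct page_counter_stock may differ. */
struct counter_stock {
	unsigned long nr_pages;		/* pre-charged pages cached on this CPU */
};

struct counter {
	unsigned long usage;		/* pages charged, including cached stock */
	unsigned long max;		/* hard limit */
	struct counter *parent;		/* NULL at the root */
	struct counter_stock stock[NR_CPUS];
};

/* Slow path: charge every level of the hierarchy, undoing on failure. */
static bool charge_hierarchy(struct counter *c, unsigned long nr)
{
	struct counter *it, *undo;

	for (it = c; it; it = it->parent) {
		if (it->usage + nr > it->max)
			goto fail;
		it->usage += nr;
	}
	return true;
fail:
	/* Roll back the levels that were charged before the failing one. */
	for (undo = c; undo != it; undo = undo->parent)
		undo->usage -= nr;
	return false;
}

/* Fast path first: consume this CPU's stock, else charge and refill. */
static bool counter_try_charge(struct counter *c, int cpu, unsigned long nr)
{
	struct counter_stock *s;

	assert(cpu >= 0 && cpu < NR_CPUS);
	s = &c->stock[cpu];

	if (s->nr_pages >= nr) {
		s->nr_pages -= nr;
		return true;
	}
	/* Charge a whole batch so that later charges can hit the stock. */
	if (nr <= STOCK_BATCH && charge_hierarchy(c, STOCK_BATCH)) {
		s->nr_pages += STOCK_BATCH - nr;
		return true;
	}
	/* The batch did not fit under the limit: charge exactly nr pages. */
	return charge_hierarchy(c, nr);
}
```

In this sketch a caller never sees the stock: it only calls
counter_try_charge(), and there is no per-CPU slot array to search or
evict from, since every counter carries its own per-CPU stock.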

In turn, we can add independent stock management for additional
page_counters in each memcg, which is used in my tiered memory limits
series to add a new page_counter to track toptier usage [1].

The resulting code in memcg is also easier to follow, as the caching
becomes transparent from memcg's perspective and managed entirely within
page_counter.

There are, however, a few tradeoffs.

First, the bound on how much memory can be overcharged (and remain stale
as stock) is raised. Previously, it was fixed to nr_cpus x 7 x 64 pages.
Now, it becomes nr_leaf_cgroups x nr_cpus x 64 pages. On large machines
with many cgroups, this could be significant. There are three qualifying
points: (1) larger machines should be able to tolerate the additional
overhead, (2) the stock should not remain stale as long as the
cgroups are actively charging memory, and (3) a process would have to
migrate across all CPUs to incur this upper bound on overhead.

Secondly, we introduce some additional memory footprint. The new struct
page_counter_stock adds 2 words of extra overhead per-(cpu x memcg).
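
To put numbers on the two tradeoffs above, the bounds can be computed
directly. This is only arithmetic on the formulas given in this letter;
the helper names are made up for the example, and the 4 KiB page size
and 8-byte word used in the note below are assumptions about the target.

```c
#include <assert.h>
#include <stdint.h>

#define STOCK_PAGES       64u	/* pages per stock, from the cover letter */
#define OLD_SLOTS_PER_CPU  7u	/* NR_MEMCG_STOCK */

/* Old bound on stale stock: nr_cpus x 7 x 64 pages. */
static uint64_t old_stale_bound(uint64_t nr_cpus)
{
	return nr_cpus * OLD_SLOTS_PER_CPU * STOCK_PAGES;
}

/* New bound on stale stock: nr_leaf_cgroups x nr_cpus x 64 pages. */
static uint64_t new_stale_bound(uint64_t nr_leaf_cgroups, uint64_t nr_cpus)
{
	return nr_leaf_cgroups * nr_cpus * STOCK_PAGES;
}

/* Extra footprint: 2 words per (cpu x memcg). */
static uint64_t stock_footprint_bytes(uint64_t nr_memcgs, uint64_t nr_cpus,
				      uint64_t word_bytes)
{
	return nr_memcgs * nr_cpus * 2 * word_bytes;
}

/* Convert a page count to MiB for a given page size in bytes. */
static uint64_t pages_to_mib(uint64_t pages, uint64_t page_bytes)
{
	assert(page_bytes > 0);
	return pages * page_bytes / (1024 * 1024);
}
```

For a 40-CPU machine, the old bound is 17920 pages (70 MiB with 4 KiB
pages). With 32 leaf cgroups the new bound is 81920 pages (320 MiB), and
the added footprint is 20480 bytes with 8-byte words.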

A small behavior change is that on cgroup v1, reported memsw usage can be
lower than reported memory usage, if the memsw page_counter overcharges to
its stock whereas the memory page_counter does not.

Finally, to keep the above memory footprint limited, I opted to not
embed a work_struct into page_counter_stock, but rather decided to
trigger synchronous stock draining, since the drain operation is rarer
now, and only happens under memory pressure and on cgroup death.

Performance testing across single-cgroup, as well as 4-cgroup (under the
7 memcg limit) and 32-cgroup scenarios on a 40-CPU, 50G memory system
shows negligible performance differences. In the tests, I repeatedly
fault and release anonymous pages using madvise(MADV_DONTNEED) to
stress the charge/uncharge path, across 40 trials of 50 iterations.
The metric is the time taken by each iteration (ms).
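
A minimal sketch of this kind of fault/release loop is shown below. It is
not the actual test program used for the numbers in this letter: the
function name fault_release_ms and its parameters are made up for the
example, and placing the task inside a memcg (and running one instance
per cgroup) is left to the caller.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/*
 * Fault in `len` bytes of anonymous memory and release them again,
 * `iters` times. Returns the elapsed time in milliseconds, or -1.0
 * if mmap() or madvise() fails.
 */
static double fault_release_ms(size_t len, int iters)
{
	struct timespec t0, t1;
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *buf;

	assert(page > 0);
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1.0;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < iters; i++) {
		/* Touch one byte per page: each write fault charges a page. */
		for (size_t off = 0; off < len; off += (size_t)page)
			buf[off] = 1;
		/* Drop the pages again so they are uncharged. */
		if (madvise(buf, len, MADV_DONTNEED) != 0) {
			munmap(buf, len);
			return -1.0;
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	munmap(buf, len);

	return (t1.tv_sec - t0.tv_sec) * 1e3 +
	       (t1.tv_nsec - t0.tv_nsec) / 1e6;
}
```

Calling fault_release_ms() with iters set to 1 from an outer loop gives a
per-iteration time comparable in spirit to the metric above.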

There are two testing versions below; the only difference is that v3
is based on top of mm-new, and v2 is based on top of mm-stable. The
"after" numbers on both sides are similar, but the mm-new and mm-stable
baselines perform differently.

v3, tested against mm-new
+----------+--------+-------+-----------+
| #cgroups | mm-new | after | delta (%) |
+----------+--------+-------+-----------+
|        1 |    357 |   358 |    +0.283 |
|        4 |   1245 |  1214 |    -2.430 |
|       32 |   9281 |  8970 |    -3.470 |
+----------+--------+-------+-----------+

v2, tested against mm-stable
+----------+-----------+-------+-----------+
| #cgroups | mm-stable | after | delta (%) |
+----------+-----------+-------+-----------+
|        1 |       352 |   353 |    +0.283 |
|        4 |      1198 |  1217 |    +1.585 |
|       32 |      8980 |  9027 |    +0.526 |
+----------+-----------+-------+-----------+

Further testing on other stress-ng microbenchmarks also agreed with
these results.

v2 --> v3:
- Rebased on top of latest mm-new, May 25, 2026, since the previous
  version could not be applied for Sashiko review.
- Re-ran test numbers

v1 --> v2:
- Dropped stock returning on uncharge to preserve same behavior as memcg
  stock. This resolves some race conditions present in v1.
- Fixed many race conditions between disabling page_counter_stock and
  in-flight charges
- Restructured drain_all_stock to iterate over all CPUs first before
  memcgs, to reduce the number of synchronous CPU work scheduling
- Optimized cgroup v2 further to drain only on the first child and skip
  the root mem_cgroup
- Dropped RFC
- Wordsmithing cover letter

[1] https://lore.kernel.org/all/20260423203445.2914963-1-joshua.hahnjy@gmail.com/

Joshua Hahn (7):
  mm/page_counter: introduce per-page_counter stock
  mm/page_counter: use page_counter_stock in page_counter_try_charge
  mm/page_counter: introduce stock drain APIs
  mm/memcontrol: convert memcg to use page_counter_stock
  mm/memcontrol: optimize memsw stock for cgroup v1
  mm/memcontrol: optimize stock usage for cgroup v2
  mm/memcontrol: remove unused memcg_stock code

 include/linux/page_counter.h |  16 ++
 mm/memcontrol.c              | 289 +++++++----------------------------
 mm/page_counter.c            | 140 ++++++++++++++++-
 3 files changed, 212 insertions(+), 233 deletions(-)

-- 
2.53.0-Meta

Re: [PATCH 0/7 v3] mm/memcontrol, page_counter: move stock from mem_cgroup to page_counter

Posted by Joshua Hahn 1 week, 6 days ago

On Mon, 25 May 2026 12:04:47 -0700 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:

> Memcg currently keeps a "stock" of 64 pages per-cpu to cache pre-charged
> allocations, allowing small allocations to avoid the expensive
> mem_cgroup hierarchy traversal and atomic operations on each charge.
> This design introduces a fastpath, but there is room for improvement:

Hello everyone,

Sashiko has left some great comments on the series, some of which are
things I need to address for the next iteration, while others are false
positives. There aren't a bunch, so I'll go over all of them below.
Note that the same warning brought up in 1/7 is duplicated for the rest
of the series; I've shared my thoughts on that in [1].

> Joshua Hahn (7):
>   mm/page_counter: introduce per-page_counter stock

Sashiko raised a concern about how zeroing per-cpu stock during the
nolock drain could race with in-flight charges that read the value
before it gets zeroed, leading to duplicate drains. I think this is
solvable by just changing the order in the callsites (disable, then drain).
More details can be found in [1].

>   mm/page_counter: use page_counter_stock in page_counter_try_charge

Sashiko raises the same concern as [1].

>   mm/page_counter: introduce stock drain APIs

Same concern as [1].

>   mm/memcontrol: convert memcg to use page_counter_stock

Sashiko raises 4 concerns, of which I think 3 are false positives
(or not as serious as Sashiko makes them out to be).

(1) Sashiko asks whether the synchronous draining with the percpu_charge_mutex
    lock held could lead to more time spent holding the lock, which means
    more callers of drain_all_stock would fail the trylock and just skip draining.

    To clarify, even in the original code, two tasks simultaneously calling
    drain_all_stock would serialize and only one of them would schedule the
    drain work, so this problem definitely existed before as well. It's just that
    the window for this race is a bit longer now.

    I do think that there is actually a behavior change here (for the better).
    Previously, drain_all_stock had no guarantees on whether the stock was
    drained before retrying. Now, if the caller can get the trylock, they have
    a stronger guarantee that the stock is drained before retrying.

    On the note of premature OOMs, each retry loop takes much longer than the
    draining itself, so I would imagine that by the time the next retry
    happens, there's a better chance that the trylock succeeds.

(2) Sashiko also raises another concern about a potential ABBA deadlock with
    the mmap_lock. I don't think this concern holds: the synchronous
    work being done (drain_stock_on_cpu) only takes a local lock. Hopefully I'm
    not missing anything here.

(3) I think Sashiko's concerns about NOHZ / CPU isolation are real. But it
    shouldn't be too bad; all I need is a cpu_is_isolated() check in the
    for_each_online_cpu iterator. Again, not draining a CPU is not fatal here,
    so it shouldn't be too big of a problem to skip some of them.

    I also just wanted to note here explicitly that we don't need the
    migrate_disable() for the memcg stock drain, since we don't differentiate
    between local drain work & remote drain scheduling (like objcg_stock).

(4) Finally, Sashiko asks if we should enable the memcg->memsw stock here.
    That's included in the very next patch :-) I separated them so that they
    can be reviewed separately, since they are separate ideas.
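
To make points (1) and (3) above concrete, here is a userspace sketch of
a synchronous drain that skips work when another drainer is active and
leaves isolated CPUs alone. It is not the code in this series: an
atomic_flag stands in for the trylock on percpu_charge_mutex, the
isolated[] array stands in for cpu_is_isolated(), and the name
drain_all_stock_sketch and the fixed NR_CPUS are assumptions made for
the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS 4	/* assumption: small fixed CPU count for the sketch */

static atomic_flag drain_busy = ATOMIC_FLAG_INIT; /* stand-in for the mutex */
static unsigned long stock[NR_CPUS];	/* pre-charged pages cached per CPU */
static bool isolated[NR_CPUS];		/* stand-in for cpu_is_isolated() */
static unsigned long usage;		/* charged pages, including stock */

/* Returns false if another drainer is active and this call was skipped. */
static bool drain_all_stock_sketch(void)
{
	/* "trylock": give up right away instead of waiting for the holder. */
	if (atomic_flag_test_and_set(&drain_busy))
		return false;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (isolated[cpu])
			continue;	/* leaving this stock behind is not fatal */
		assert(usage >= stock[cpu]);
		usage -= stock[cpu];	/* return cached pages to the counter */
		stock[cpu] = 0;
	}

	atomic_flag_clear(&drain_busy);	/* "unlock" */
	return true;
}
```

A caller that gets true back knows every non-isolated CPU was drained
before it retries; a caller that gets false simply retries later, which
is the case discussed in (1).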

>   mm/memcontrol: optimize memsw stock for cgroup v1

Both concerns here are addressed in the previous section.

>   mm/memcontrol: optimize stock usage for cgroup v2

Sashiko raises 3 concerns, all of which I think are actually OK.

(1) If we drain the parent memcg stock on first child creation, then this would
    mean that there will be additional synchronous work being done with the
    cgroup_mutex lock held. I personally think this is fine, since it happens
    once per parent cgroup, and the draining work is pretty cheap. But I would
    appreciate it if other reviewers could chime in here.

(2) Sashiko also asks whether we need cpus_read_lock during the iteration.
    I think it's fine without it; if a CPU happens to go offline during the
    iteration, then that work will be scheduled on another CPU. That's fine;
    duplicate draining work on 1 CPU isn't the end of the world (and preferable
    to taking a cpus_read_lock here). As for the dying CPU, it will drain its
    own stock during the destruction path anyway, so no stock is lost.

(3) This one is not related to this series, so I'll move on.

>   mm/memcontrol: remove unused memcg_stock code

No comments for this patch.

I think that's all the comments that Sashiko raised for this series. Most of
them had to do with performance tradeoffs, and I hope that my testing results
in the cover letter instill some confidence that a lot of these tradeoffs
aren't as bad as they seem. Regardless, I would really appreciate reviewer
feedback on whether they think these tradeoffs are acceptable.

There are definitely some real bugs that I want to address, so a v4 will be
incoming to address those (in a week or so).

Thank you Sashiko!
Joshua

[1] https://lore.kernel.org/linux-mm/20260525194506.3414995-1-joshua.hahnjy@gmail.com/