mm: BPF struct_ops for dynamic memory protection and async reclaim

[RFC PATCH bpf-next v7 00/11] mm: BPF struct_ops for dynamic memory protection and async reclaim

Posted by Hui Zhu 1 week, 6 days ago

From: Hui Zhu <zhuhui@kylinos.cn>

Overview:
This series introduces BPF struct_ops support for the memory controller,
enabling userspace BPF programs to implement custom, dynamic memory
management policies per cgroup. The feature allows BPF programs to hook
into the core reclaim and charge paths without requiring kernel
modifications, providing a flexible alternative to static knobs such as
memory.low and memory.min.
 
The series enables two complementary use cases.
 
Dynamic memory protection: static memory protection thresholds
(memory.low, memory.min) are poor fits for workloads whose actual memory
activity varies over time. A high-priority cgroup holding a large working
set but temporarily idle will still suppress reclaim on its siblings,
wasting available memory. A BPF-driven approach can observe real workload
activity -- page faults, charge/uncharge events -- and activate or
withdraw protection dynamically. The test results at the end of this
letter quantify the difference: in a scenario where the high-priority
cgroup is idle, the BPF-controlled low-priority cgroup achieves roughly
37x higher throughput than with static memory.low.
 
Asynchronous proactive reclaim: the memcg_charged and memcg_uncharged
hooks, combined with the BPF workqueue mechanism and the new
bpf_try_to_free_mem_cgroup_pages() kfunc, enable BPF programs to perform
proactive background reclaim without blocking the charge path. The
pattern works as follows: the memcg_charged callback tracks accumulated
memory usage; when usage crosses a configurable threshold, it enqueues an
asynchronous work item via bpf_wq_start() and returns immediately without
throttling the charging task. The workqueue callback then invokes
bpf_try_to_free_mem_cgroup_pages() to reclaim pages from the target
cgroup; if usage remains elevated after reclaim, the callback re-enqueues
itself to continue. This allows a BPF program to keep a cgroup's
footprint below its hard limit (memory.max) entirely in the background,
avoiding the OOM killer or direct-reclaim stalls that would otherwise
occur. The selftest for this feature (patch 10/11) validates the
mechanism concretely: a workload that writes and mmaps a 64 MB file inside
a 32 MB cgroup reliably triggers memory.events "max" events without BPF;
with the async reclaim program attached, the "max" counter does not
increase at all across the same workload.
 
In this patch series, I've incorporated a portion of Roman's patch in
[1] to ensure the entire series can be compiled cleanly with bpf-next.
 
Patch Breakdown:
Patches 1-4 are from Roman Gushchin's series [1], included here to
provide the necessary BPF infrastructure for attaching struct_ops to
cgroups.
 
Patches 5-11 are the new work in this series:
 
  05/11  bpf: Pass flags in bpf_link_create for struct_ops
         Stores attr->link_create.flags in struct bpf_struct_ops_link
         and extends the validation to allow BPF_F_ALLOW_OVERRIDE.
         Also updates the UAPI comment to reflect that cgroup-bpf attach
         flags now apply to BPF_LINK_CREATE in addition to
         BPF_PROG_ATTACH.
 
  06/11  mm: memcontrol: Add BPF struct_ops for memory controller
         The core feature patch. Introduces the memcg_bpf_ops struct_ops
         type with the following hooks:
 
         - memcg_charged(memcg, batch): called on the synchronous charge
           path. Returns a throttling delay in milliseconds; used as a
           lower bound for __mem_cgroup_handle_over_high(), effective
           even when memory.high is not breached.
 
         - memcg_uncharged(memcg, batch): called on uncharge, allowing
           BPF programs to track memory releases.
 
         - below_low(memcg, elow, usage): overrides the memory.low
           protection check. Returns true to treat the cgroup as
           protected regardless of the elow >= usage comparison.
 
         - below_min(memcg, emin, usage): same as below_low but for
           memory.min protection.
 
         - handle_cgroup_online/offline(memcg): lifecycle callbacks for
           per-cgroup state management in BPF programs.
 
         BPF_F_ALLOW_OVERRIDE is supported: when a program is registered
         with this flag, descendant cgroups may attach their own
         memcg_bpf_ops to override the inherited policy. Registration
         propagates ops down through the subtree via mem_cgroup_iter;
         unregistration restores each descendant to the ops its
         registering ancestor's parent held, correctly preserving
         override chains.
 
  07/11  mm/bpf: Add bpf_try_to_free_mem_cgroup_pages kfunc
         Exposes try_to_free_mem_cgroup_pages() to BPF programs as a
         KF_SLEEPABLE kfunc. A swappiness parameter controls the
         override value passed to the core reclaim path
         (effective only when MEMCG_RECLAIM_PROACTIVE is set in
         reclaim_options).
 
  08/11  selftests/bpf: Add tests for memcg_bpf_ops
         Adds prog_tests/memcg_ops.c covering three scenarios:
         memcg_charged-only throttling, below_low + memcg_charged
         interaction, and below_min + memcg_charged interaction. A
         tracepoint on memcg:count_memcg_events (PGFAULT) is used to
         detect memory pressure and trigger hooks accordingly.
 
  09/11  selftests/bpf: Add test for memcg_bpf_ops hierarchies
         Validates BPF_F_ALLOW_OVERRIDE attachment semantics across a
         three-level cgroup hierarchy: attach with ALLOW_OVERRIDE at the
         root, override at the middle level without the flag, then assert
         that attaching to the leaf correctly fails with -EBUSY.
 
  10/11  selftests/bpf: Add selftest for memcg async reclaim via BPF
         Demonstrates and validates asynchronous memory reclaim: a BPF
         program uses the memcg_charged/memcg_uncharged hooks to track
         accumulated usage and, when a threshold is exceeded, enqueues a
         bpf_wq_start() workqueue item that calls
         bpf_try_to_free_mem_cgroup_pages() without blocking the charge
         path. The test asserts that with the BPF program active,
         memory.events "max" events do not increase under a workload
         that would otherwise exceed the hard limit.
 
  11/11  samples/bpf: Add memcg priority control and async reclaim example
         Adds a complete sample (samples/bpf/memcg.bpf.c + memcg.c)
         demonstrating both features. The BPF side monitors PGFAULT
         events on a high-priority cgroup; when the per-second fault
         count crosses a configurable threshold, it activates below_low
         or below_min protection for the high-priority cgroup and/or
         applies a charge delay to the low-priority cgroup. Six
         struct_ops variants are exported so userspace can attach only
         the hooks needed. Async reclaim is optionally combined with
         priority throttling via a shared low-cgroup ops map.
 
Test Environment:
The following examples run on x86_64 QEMU (10 CPUs, 2 GB RAM), using
a tmpfs-backed file on the host as a swap device to reduce I/O impact.
Two cgroups are created -- high (high-priority) and low (low-priority)
-- and each test runs two concurrent stress-ng workloads, one per
cgroup, each requesting 3 GB of memory.
 
  # mkdir /sys/fs/cgroup/high /sys/fs/cgroup/low
  # free -h
                 total   used    free  shared  buff/cache  available
  Mem:           1.9Gi  317Mi  1.6Gi   1.0Mi       144Mi      1.6Gi
  Swap:          4.0Gi     0B  4.0Gi
 
Baseline: no memory priority policy:
Both cgroups run without any reclaim protection. Results are roughly
equal, as expected:
 
  cgroup    bogo ops/s
  high           4,979
  low            4,927
 
Test 1: memory.low protection:
Setting memory.low on the high-priority cgroup protects it from
reclaim, at the cost of pushing reclaim pressure onto the low-priority
cgroup:
 
  # echo $((3 * 1024 * 1024 * 1024)) > /sys/fs/cgroup/high/memory.low
 
  cgroup    bogo ops/s
  high         450,290
  low           11,307
 
The high-priority cgroup benefits significantly, but memory.low relies
on static usage thresholds and cannot adapt to actual workload
behavior.
 
Test 2: memory.low with an idle high-priority task:
Here the high-priority cgroup runs a Python script that allocates 3 GB
and then sleeps, simulating a low-activity but memory-holding workload.
Because the process is idle, it generates no page faults and does not
actively use its memory. Yet memory.low still protects it, continuing
to suppress the low-priority cgroup's performance:
 
  cgroup    bogo ops/s
  low           14,757
 
The low-priority cgroup remains significantly throttled despite the
high-priority cgroup being effectively idle -- a clear limitation of
static memory.low control.
 
Test 3: memcg eBPF -- dynamic priority control:
memcg is a sample program introduced in this patch series
(samples/bpf/memcg.c + memcg.bpf.c). It loads a BPF program that
monitors PGFAULT events in the high-priority cgroup. When the
per-second fault count exceeds a configured threshold, the hook
activates below_min protection for one second; otherwise the cgroup
receives no special treatment.
 
  # ./memcg --low_path=/sys/fs/cgroup/low  \
            --high_path=/sys/fs/cgroup/high \
            --threshold=1 --use_below_min
  Successfully attached!
 
3a. Both cgroups under active memory pressure:
 
When both cgroups run stress-ng, the high-priority cgroup generates
frequent page faults and the BPF hook activates protection, matching
the behavior of memory.low:
 
  cgroup    bogo ops/s
  high         404,392
  low           11,404
 
3b. High-priority cgroup is idle (Python + sleep):
 
Because the sleeping Python process generates no page faults, the BPF
hook never activates, and the low-priority cgroup is free to reclaim
memory normally:
 
  cgroup    bogo ops/s
  low          551,083
 
This is a ~37x improvement over the equivalent memory.low scenario
(Test 2), demonstrating that eBPF-driven dynamic control can
accurately reflect actual workload activity and avoid unnecessary
protection of idle high-priority tasks.
 
Summary:
  Scenario                          low-cgroup bogo ops/s
  Baseline (no policy)                           ~4,927
  memory.low, both active                       ~11,307
  memory.low, high idle                         ~14,757
  memcg eBPF, both active                       ~11,404
  memcg eBPF, high idle                        ~551,083
 
References:
[1] https://patchew.org/linux/20260127024421.494929-1-roman.gushchin@linux.dev/

Changelog:
v7:
Change base commits of "mm: BPF OOM" to v3.
Some fixes according to the comments of bpf-ci.
Rename get_high_delay_ms hook to memcg_charged; add memcg_uncharged
hook for tracking uncharge events.
Update below_low and below_min hooks to receive elow/emin and usage
as explicit arguments.
Add bpf_try_to_free_mem_cgroup_pages kfunc to expose cgroup reclaim
to BPF programs.
Add selftest for BPF-driven asynchronous page reclaim.
Extend samples/bpf/memcg to support async reclaim in addition to
priority throttling.
v6:
Based on the bot+bof-ci comments, fixed the following issues.
Added fast-path check with unlikely() before SRCU lock acquisition to
optimize the no-BPF case in BPF_MEMCG_CALL.
Add missing newline in pr_warn message to bpf_memcontrol_init.
Added comprehensive child process exit status checking with WIFEXITED()
and WEXITSTATUS(), and added zombie process prevention in
real_test_memcg_ops.
Changed malloc() to calloc() for BSS data allocation in all test
functions and samples main function.
Change srcu_read_lock(&memcg_bpf_srcu) to
lockdep_assert_held(&cgroup_mutex) in function memcontrol_bpf_online
and memcontrol_bpf_offline.
v5:
Based on the bot+bof-ci comments, fixed the following issues.
Fixed issues in memcg_ops.c and memcg.bpf.c by moving variable
declaration to the beginning of need_threshold() function.
The 'u64 current_ts' variable must be declared before any
executable statements
Improved input validation in samples/bpf/memcg.c by adding a new
parse_u64() helper function. This function properly handles errors
from strtoull() and provides better error messages when parsing
threshold and over_high_ms command-line arguments.
Move check for prog->sleepable after validating member offsets in
mm/bpf_memcontrol.c bpf_memcg_ops_check_member.
Fixed sscanf return value checking in prog_tests/memcg_ops.c.
Changed the condition from 'sscanf() < 0' to 'sscanf() != 1' because
sscanf returns the number of successfully matched items, not a negative
value on error. This makes the test more reliable when reading timing
data from temporary files.
v4:
Fix the issues according to the comments from bot+bof-ci.
According to JP Kobryn's comments, move exit(0) from
real_test_memcg_ops_child_work to real_test_memcg_ops.
Fix issues in the bpf_memcg_ops_reg function.
v3:
According to the comments from Michal Koutný and Chen Ridong, update hooks
to get_high_delay_ms, below_low, below_min, handle_cgroup_online, and
handle_cgroup_offline.
According to Michal Koutný's comments, add BPF_F_ALLOW_OVERRIDE
support to memcg_bpf_ops.
v2:
According to Tejun Heo's comments, rebased on Roman Gushchin's BPF
OOM patch series [1] and added hierarchical delegation support.
According to the comments from Roman Gushchin and Michal Hocko, designed
concrete use case scenarios and provided test results.

Hui Zhu (7):
  bpf: Pass flags in bpf_link_create for struct_ops
  mm: memcontrol: Add BPF struct_ops for memory controller
  mm/bpf: Add bpf_try_to_free_mem_cgroup_pages kfunc
  selftests/bpf: Add tests for memcg_bpf_ops
  selftests/bpf: Add test for memcg_bpf_ops hierarchies
  selftests/bpf: Add selftest for memcg async reclaim via BPF
  samples/bpf: Add memcg priority control and async reclaim example

Roman Gushchin (4):
  bpf: move bpf_struct_ops_link into bpf.h
  bpf: allow attaching struct_ops to cgroups
  libbpf: fix return value on memory allocation failure
  libbpf: introduce bpf_map__attach_struct_ops_opts()

 MAINTAINERS                                   |   6 +
 include/linux/bpf-cgroup-defs.h               |   3 +
 include/linux/bpf-cgroup.h                    |  16 +
 include/linux/bpf.h                           |  10 +
 include/linux/memcontrol.h                    | 250 ++++++-
 include/uapi/linux/bpf.h                      |   5 +-
 kernel/bpf/bpf_struct_ops.c                   |  67 +-
 kernel/bpf/cgroup.c                           |  46 ++
 mm/bpf_memcontrol.c                           | 355 +++++++++-
 mm/memcontrol.c                               |  43 +-
 samples/bpf/.gitignore                        |   1 +
 samples/bpf/Makefile                          |   8 +-
 samples/bpf/memcg.bpf.c                       | 380 +++++++++++
 samples/bpf/memcg.c                           | 411 ++++++++++++
 tools/include/uapi/linux/bpf.h                |   3 +-
 tools/lib/bpf/libbpf.c                        |  22 +-
 tools/lib/bpf/libbpf.h                        |  14 +
 tools/lib/bpf/libbpf.map                      |   1 +
 tools/testing/selftests/bpf/cgroup_helpers.c  |  41 ++
 tools/testing/selftests/bpf/cgroup_helpers.h  |   2 +
 .../bpf/prog_tests/memcg_async_reclaim.c      | 333 +++++++++
 .../selftests/bpf/prog_tests/memcg_ops.c      | 634 ++++++++++++++++++
 .../selftests/bpf/progs/memcg_async_reclaim.c | 203 ++++++
 tools/testing/selftests/bpf/progs/memcg_ops.c | 132 ++++
 24 files changed, 2952 insertions(+), 34 deletions(-)
 create mode 100644 samples/bpf/memcg.bpf.c
 create mode 100644 samples/bpf/memcg.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/memcg_async_reclaim.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/memcg_ops.c
 create mode 100644 tools/testing/selftests/bpf/progs/memcg_async_reclaim.c
 create mode 100644 tools/testing/selftests/bpf/progs/memcg_ops.c

-- 
2.43.0

Re: [RFC PATCH bpf-next v7 00/11] mm: BPF struct_ops for dynamic memory protection and async reclaim

Posted by Michal Hocko 1 week, 5 days ago

On Tue 26-05-26 10:20:00, Hui Zhu wrote:
> From: Hui Zhu <zhuhui@kylinos.cn>
> 
> Overview:
> This series introduces BPF struct_ops support for the memory controller,
> enabling userspace BPF programs to implement custom, dynamic memory
> management policies per cgroup. The feature allows BPF programs to hook
> into the core reclaim and charge paths without requiring kernel
> modifications, providing a flexible alternative to static knobs such as
> memory.low and memory.min.
>  
> The series enables two complementary use cases.
>  
> Dynamic memory protection: static memory protection thresholds
> (memory.low, memory.min) are poor fits for workloads whose actual memory
> activity varies over time. A high-priority cgroup holding a large working
> set but temporarily idle will still suppress reclaim on its siblings,
> wasting available memory. A BPF-driven approach can observe real workload
> activity -- page faults, charge/uncharge events -- and activate or
> withdraw protection dynamically.

Why the same cannot be achieved by dynamically changing protection?

> The test results at the end of this
> letter quantify the difference: in a scenario where the high-priority
> cgroup is idle, the BPF-controlled low-priority cgroup achieves roughly
> 37x higher throughput than with static memory.low.
>  
> Asynchronous proactive reclaim: the memcg_charged and memcg_uncharged
> hooks, combined with the BPF workqueue mechanism and the new
> bpf_try_to_free_mem_cgroup_pages() kfunc, enable BPF programs to perform
> proactive background reclaim without blocking the charge path. The
> pattern works as follows: the memcg_charged callback tracks accumulated
> memory usage; when usage crosses a configurable threshold, it enqueues an
> asynchronous work item via bpf_wq_start() and returns immediately without
> throttling the charging task. The workqueue callback then invokes
> bpf_try_to_free_mem_cgroup_pages() to reclaim pages from the target
> cgroup; if usage remains elevated after reclaim, the callback re-enqueues
> itself to continue. This allows a BPF program to keep a cgroup's
> footprint below its hard limit (memory.max) entirely in the background,
> avoiding the OOM killer or direct-reclaim stalls that would otherwise
> occur.

How do you account the overall work done to the specific memcg as the
large part of the reclaim is done from WQ context?
Also when introducing a BPF hook please focus on describing why existing
interfaces fail to achieve what you need. For the async reclaim why it
is not practical or feasible to use userspace driven memory reclaim.
-- 
Michal Hocko
SUSE Labs

Re: [RFC PATCH bpf-next v7 00/11] mm: BPF struct_ops for dynamic memory protection and async reclaim

Posted by teawater 1 week, 4 days ago

> 
> On Tue 26-05-26 10:20:00, Hui Zhu wrote:
> 

Hi Michal,

> > 
> > From: Hui Zhu <zhuhui@kylinos.cn>
> >  
> >  Overview:
> >  This series introduces BPF struct_ops support for the memory controller,
> >  enabling userspace BPF programs to implement custom, dynamic memory
> >  management policies per cgroup. The feature allows BPF programs to hook
> >  into the core reclaim and charge paths without requiring kernel
> >  modifications, providing a flexible alternative to static knobs such as
> >  memory.low and memory.min.
> >  
> >  The series enables two complementary use cases.
> >  
> >  Dynamic memory protection: static memory protection thresholds
> >  (memory.low, memory.min) are poor fits for workloads whose actual memory
> >  activity varies over time. A high-priority cgroup holding a large working
> >  set but temporarily idle will still suppress reclaim on its siblings,
> >  wasting available memory. A BPF-driven approach can observe real workload
> >  activity -- page faults, charge/uncharge events -- and activate or
> >  withdraw protection dynamically.
> > 
> Why the same cannot be achieved by dynamically changing protection?

Dynamically adjusting memory.low or memory.min is indeed an
option, but it has a practical drawback: in many production
environments these values are managed and pushed down by a
cluster-level orchestrator (e.g. a container runtime or resource
manager). Modifying them from a separate BPF-based agent risks
conflicts with the orchestrator's own control loop and makes the
system harder to reason about.

Beyond that, the intended use case requires rapid, short-lived
adjustments -- reacting to bursts of page faults or PSI spikes
and reverting just as quickly once the pressure subsides. Mutating
the static knobs for that purpose feels like the wrong abstraction:
the knobs express policy intent, while what we need is a transient
override that sits on top of that policy.

The hooks are therefore not meant to replace the existing limits,
but to complement them: the orchestrator continues to own
memory.low / memory.min, while a BPF program makes small, brief
corrections in response to observed runtime behavior.

> 
> > 
> > The test results at the end of this
> >  letter quantify the difference: in a scenario where the high-priority
> >  cgroup is idle, the BPF-controlled low-priority cgroup achieves roughly
> >  37x higher throughput than with static memory.low.
> >  
> >  Asynchronous proactive reclaim: the memcg_charged and memcg_uncharged
> >  hooks, combined with the BPF workqueue mechanism and the new
> >  bpf_try_to_free_mem_cgroup_pages() kfunc, enable BPF programs to perform
> >  proactive background reclaim without blocking the charge path. The
> >  pattern works as follows: the memcg_charged callback tracks accumulated
> >  memory usage; when usage crosses a configurable threshold, it enqueues an
> >  asynchronous work item via bpf_wq_start() and returns immediately without
> >  throttling the charging task. The workqueue callback then invokes
> >  bpf_try_to_free_mem_cgroup_pages() to reclaim pages from the target
> >  cgroup; if usage remains elevated after reclaim, the callback re-enqueues
> >  itself to continue. This allows a BPF program to keep a cgroup's
> >  footprint below its hard limit (memory.max) entirely in the background,
> >  avoiding the OOM killer or direct-reclaim stalls that would otherwise
> >  occur.
> > 
> How do you account the overall work done to the specific memcg as the
> large part of the reclaim is done from WQ context?

One approach to attribute the reclaim work accurately to the target
memcg would be to expose a kfunc that creates a kthread_worker and
attaches it to a specific cgroup. Reclaim work enqueued to that
worker would then run in a context already associated with the
target memcg, so the accounting would naturally fall to the right
cgroup without any extra bookkeeping.

The tradeoff is additional complexity: creating a per-cgroup worker
introduces resource overhead and lifecycle management concerns
(e.g. when should the worker be torn down). Whether that cost is
justified depends on how strictly the caller needs the reclaim to
be attributed.

That said, I am not certain this is the right direction yet and
would welcome your thoughts on whether this is worth pursuing, or
whether there is a simpler mechanism I am overlooking.

> Also when introducing a BPF hook please focus on describing why existing
> interfaces fail to achieve what you need. For the async reclaim why it
> is not practical or feasible to use userspace driven memory reclaim.

Noted, and thank you for both points. In the next revision I will
add a dedicated section to each hook's description covering:

Why existing interfaces are insufficient. For the async reclaim
case specifically, I will explain why userspace-driven reclaim
(e.g. memory.reclaim, cgroup-aware madvise, or a dedicated
reclaim daemon) is not practical: userspace cannot react at the
granularity or latency required, and the round-trip through a
syscall or procfs write introduces overhead that defeats the
purpose of proactive reclaim.
What gap the new hook fills that cannot be closed by tuning
existing knobs.

Best,
Hui

> -- 
> Michal Hocko
> SUSE Labs
>

Re: [RFC PATCH bpf-next v7 00/11] mm: BPF struct_ops for dynamic memory protection and async reclaim

Posted by Usama Arif 1 week, 6 days ago

On Tue, 26 May 2026 10:20:00 +0800 Hui Zhu <hui.zhu@linux.dev> wrote:

> From: Hui Zhu <zhuhui@kylinos.cn>
> 
> Overview:
> This series introduces BPF struct_ops support for the memory controller,
> enabling userspace BPF programs to implement custom, dynamic memory
> management policies per cgroup. The feature allows BPF programs to hook
> into the core reclaim and charge paths without requiring kernel
> modifications, providing a flexible alternative to static knobs such as
> memory.low and memory.min.
>  
> The series enables two complementary use cases.
>  
> Dynamic memory protection: static memory protection thresholds
> (memory.low, memory.min) are poor fits for workloads whose actual memory
> activity varies over time. A high-priority cgroup holding a large working
> set but temporarily idle will still suppress reclaim on its siblings,
> wasting available memory. A BPF-driven approach can observe real workload
> activity -- page faults, charge/uncharge events -- and activate or
> withdraw protection dynamically. The test results at the end of this
> letter quantify the difference: in a scenario where the high-priority
> cgroup is idle, the BPF-controlled low-priority cgroup achieves roughly
> 37x higher throughput than with static memory.low.
>  
> Asynchronous proactive reclaim: the memcg_charged and memcg_uncharged
> hooks, combined with the BPF workqueue mechanism and the new
> bpf_try_to_free_mem_cgroup_pages() kfunc, enable BPF programs to perform
> proactive background reclaim without blocking the charge path. The
> pattern works as follows: the memcg_charged callback tracks accumulated
> memory usage; when usage crosses a configurable threshold, it enqueues an
> asynchronous work item via bpf_wq_start() and returns immediately without
> throttling the charging task. The workqueue callback then invokes
> bpf_try_to_free_mem_cgroup_pages() to reclaim pages from the target
> cgroup; if usage remains elevated after reclaim, the callback re-enqueues
> itself to continue. This allows a BPF program to keep a cgroup's
> footprint below its hard limit (memory.max) entirely in the background,
> avoiding the OOM killer or direct-reclaim stalls that would otherwise
> occur. The selftest for this feature (patch 10/11) validates the
> mechanism concretely: a workload that writes and mmaps a 64 MB file inside
> a 32 MB cgroup reliably triggers memory.events "max" events without BPF;
> with the async reclaim program attached, the "max" counter does not
> increase at all across the same workload.
>  


Hi Hui,

Thanks for the series.
Would it not be simpler to just have another memcg knob, something like
memory.high_async.
When memory usage > memory.high_async, queue a per-memcg work item that calls
try_to_free_mem_cgroup_pages() until usage drops back below some threshold.
I am not sure I see what programability aspect from bpf you need here.

Thanks

>  
>   08/11  selftests/bpf: Add tests for memcg_bpf_ops
>          Adds prog_tests/memcg_ops.c covering three scenarios:
>          memcg_charged-only throttling, below_low + memcg_charged
>          interaction, and below_min + memcg_charged interaction. A
>          tracepoint on memcg:count_memcg_events (PGFAULT) is used to
>          detect memory pressure and trigger hooks accordingly.
>  
>   09/11  selftests/bpf: Add test for memcg_bpf_ops hierarchies
>          Validates BPF_F_ALLOW_OVERRIDE attachment semantics across a
>          three-level cgroup hierarchy: attach with ALLOW_OVERRIDE at the
>          root, override at the middle level without the flag, then assert
>          that attaching to the leaf correctly fails with -EBUSY.
>  
>   10/11  selftests/bpf: Add selftest for memcg async reclaim via BPF
>          Demonstrates and validates asynchronous memory reclaim: a BPF
>          program uses the memcg_charged/memcg_uncharged hooks to track
>          accumulated usage and, when a threshold is exceeded, enqueues a
>          bpf_wq_start() workqueue item that calls
>          bpf_try_to_free_mem_cgroup_pages() without blocking the charge
>          path. The test asserts that with the BPF program active,
>          memory.events "max" events do not increase under a workload
>          that would otherwise exceed the hard limit.
>  
>   11/11  samples/bpf: Add memcg priority control and async reclaim example
>          Adds a complete sample (samples/bpf/memcg.bpf.c + memcg.c)
>          demonstrating both features. The BPF side monitors PGFAULT
>          events on a high-priority cgroup; when the per-second fault
>          count crosses a configurable threshold, it activates below_low
>          or below_min protection for the high-priority cgroup and/or
>          applies a charge delay to the low-priority cgroup. Six
>          struct_ops variants are exported so userspace can attach only
>          the hooks needed. Async reclaim is optionally combined with
>          priority throttling via a shared low-cgroup ops map.
>  
> Test Environment:
> The following examples run on x86_64 QEMU (10 CPUs, 2 GB RAM), using
> a tmpfs-backed file on the host as a swap device to reduce I/O impact.
> Two cgroups are created -- high (high-priority) and low (low-priority)
> -- and each test runs two concurrent stress-ng workloads, one per
> cgroup, each requesting 3 GB of memory.
>  
>   # mkdir /sys/fs/cgroup/high /sys/fs/cgroup/low
>   # free -h
>                  total   used    free  shared  buff/cache  available
>   Mem:           1.9Gi  317Mi  1.6Gi   1.0Mi       144Mi      1.6Gi
>   Swap:          4.0Gi     0B  4.0Gi
>  
> Baseline: no memory priority policy:
> Both cgroups run without any reclaim protection. Results are roughly
> equal, as expected:
>  
>   cgroup    bogo ops/s
>   high           4,979
>   low            4,927
>  
> Test 1: memory.low protection:
> Setting memory.low on the high-priority cgroup protects it from
> reclaim, at the cost of pushing reclaim pressure onto the low-priority
> cgroup:
>  
>   # echo $((3 * 1024 * 1024 * 1024)) > /sys/fs/cgroup/high/memory.low
>  
>   cgroup    bogo ops/s
>   high         450,290
>   low           11,307
>  
> The high-priority cgroup benefits significantly, but memory.low relies
> on static usage thresholds and cannot adapt to actual workload
> behavior.
>  
> Test 2: memory.low with an idle high-priority task:
> Here the high-priority cgroup runs a Python script that allocates 3 GB
> and then sleeps, simulating a low-activity but memory-holding workload.
> Because the process is idle, it generates no page faults and does not
> actively use its memory. Yet memory.low still protects it, continuing
> to suppress the low-priority cgroup's performance:
>  
>   cgroup    bogo ops/s
>   low           14,757
>  
> The low-priority cgroup remains significantly throttled despite the
> high-priority cgroup being effectively idle -- a clear limitation of
> static memory.low control.
>  
> Test 3: memcg eBPF -- dynamic priority control:
> memcg is a sample program introduced in this patch series
> (samples/bpf/memcg.c + memcg.bpf.c). It loads a BPF program that
> monitors PGFAULT events in the high-priority cgroup. When the
> per-second fault count exceeds a configured threshold, the hook
> activates below_min protection for one second; otherwise the cgroup
> receives no special treatment.
>  
>   # ./memcg --low_path=/sys/fs/cgroup/low  \
>             --high_path=/sys/fs/cgroup/high \
>             --threshold=1 --use_below_min
>   Successfully attached!
>  
> 3a. Both cgroups under active memory pressure:
>  
> When both cgroups run stress-ng, the high-priority cgroup generates
> frequent page faults and the BPF hook activates protection, matching
> the behavior of memory.low:
>  
>   cgroup    bogo ops/s
>   high         404,392
>   low           11,404
>  
> 3b. High-priority cgroup is idle (Python + sleep):
>  
> Because the sleeping Python process generates no page faults, the BPF
> hook never activates, and the low-priority cgroup is free to reclaim
> memory normally:
>  
>   cgroup    bogo ops/s
>   low          551,083
>  
> This is a ~37x improvement over the equivalent memory.low scenario
> (Test 2), demonstrating that eBPF-driven dynamic control can
> accurately reflect actual workload activity and avoid unnecessary
> protection of idle high-priority tasks.
>  
> Summary:
>   Scenario                          low-cgroup bogo ops/s
>   Baseline (no policy)                           ~4,927
>   memory.low, both active                       ~11,307
>   memory.low, high idle                         ~14,757
>   memcg eBPF, both active                       ~11,404
>   memcg eBPF, high idle                        ~551,083
>  
> References:
> [1] https://patchew.org/linux/20260127024421.494929-1-roman.gushchin@linux.dev/
> 
> Changelog:
> v7:
> Change base commits of "mm: BPF OOM" to v3.
> Some fixes according to the comments of bpf-ci.
> Rename get_high_delay_ms hook to memcg_charged; add memcg_uncharged
> hook for tracking uncharge events.
> Update below_low and below_min hooks to receive elow/emin and usage
> as explicit arguments.
> Add bpf_try_to_free_mem_cgroup_pages kfunc to expose cgroup reclaim
> to BPF programs.
> Add selftest for BPF-driven asynchronous page reclaim.
> Extend samples/bpf/memcg to support async reclaim in addition to
> priority throttling.
> v6:
> Based on the bot+bof-ci comments, fixed the following issues.
> Added fast-path check with unlikely() before SRCU lock acquisition to
> optimize the no-BPF case in BPF_MEMCG_CALL.
> Add missing newline in pr_warn message to bpf_memcontrol_init.
> Added comprehensive child process exit status checking with WIFEXITED()
> and WEXITSTATUS(), and added zombie process prevention in
> real_test_memcg_ops.
> Changed malloc() to calloc() for BSS data allocation in all test
> functions and samples main function.
> Change srcu_read_lock(&memcg_bpf_srcu) to
> lockdep_assert_held(&cgroup_mutex) in function memcontrol_bpf_online
> and memcontrol_bpf_offline.
> v5:
> Based on the bot+bof-ci comments, fixed the following issues.
> Fixed issues in memcg_ops.c and memcg.bpf.c by moving variable
> declaration to the beginning of need_threshold() function.
> The 'u64 current_ts' variable must be declared before any
> executable statements
> Improved input validation in samples/bpf/memcg.c by adding a new
> parse_u64() helper function. This function properly handles errors
> from strtoull() and provides better error messages when parsing
> threshold and over_high_ms command-line arguments.
> Move check for prog->sleepable after validating member offsets in
> mm/bpf_memcontrol.c bpf_memcg_ops_check_member.
> Fixed sscanf return value checking in prog_tests/memcg_ops.c.
> Changed the condition from 'sscanf() < 0' to 'sscanf() != 1' because
> sscanf returns the number of successfully matched items, not a negative
> value on error. This makes the test more reliable when reading timing
> data from temporary files.
> v4:
> Fix the issues according to the comments from bot+bof-ci.
> According to JP Kobryn's comments, move exit(0) from
> real_test_memcg_ops_child_work to real_test_memcg_ops.
> Fix issues in the bpf_memcg_ops_reg function.
> v3:
> According to the comments from Michal Koutný and Chen Ridong, update hooks
> to get_high_delay_ms, below_low, below_min, handle_cgroup_online, and
> handle_cgroup_offline.
> According to Michal Koutný's comments, add BPF_F_ALLOW_OVERRIDE
> support to memcg_bpf_ops.
> v2:
> According to Tejun Heo's comments, rebased on Roman Gushchin's BPF
> OOM patch series [1] and added hierarchical delegation support.
> According to the comments from Roman Gushchin and Michal Hocko, designed
> concrete use case scenarios and provided test results.
> 
> Hui Zhu (7):
>   bpf: Pass flags in bpf_link_create for struct_ops
>   mm: memcontrol: Add BPF struct_ops for memory controller
>   mm/bpf: Add bpf_try_to_free_mem_cgroup_pages kfunc
>   selftests/bpf: Add tests for memcg_bpf_ops
>   selftests/bpf: Add test for memcg_bpf_ops hierarchies
>   selftests/bpf: Add selftest for memcg async reclaim via BPF
>   samples/bpf: Add memcg priority control and async reclaim example
> 
> Roman Gushchin (4):
>   bpf: move bpf_struct_ops_link into bpf.h
>   bpf: allow attaching struct_ops to cgroups
>   libbpf: fix return value on memory allocation failure
>   libbpf: introduce bpf_map__attach_struct_ops_opts()
> 
>  MAINTAINERS                                   |   6 +
>  include/linux/bpf-cgroup-defs.h               |   3 +
>  include/linux/bpf-cgroup.h                    |  16 +
>  include/linux/bpf.h                           |  10 +
>  include/linux/memcontrol.h                    | 250 ++++++-
>  include/uapi/linux/bpf.h                      |   5 +-
>  kernel/bpf/bpf_struct_ops.c                   |  67 +-
>  kernel/bpf/cgroup.c                           |  46 ++
>  mm/bpf_memcontrol.c                           | 355 +++++++++-
>  mm/memcontrol.c                               |  43 +-
>  samples/bpf/.gitignore                        |   1 +
>  samples/bpf/Makefile                          |   8 +-
>  samples/bpf/memcg.bpf.c                       | 380 +++++++++++
>  samples/bpf/memcg.c                           | 411 ++++++++++++
>  tools/include/uapi/linux/bpf.h                |   3 +-
>  tools/lib/bpf/libbpf.c                        |  22 +-
>  tools/lib/bpf/libbpf.h                        |  14 +
>  tools/lib/bpf/libbpf.map                      |   1 +
>  tools/testing/selftests/bpf/cgroup_helpers.c  |  41 ++
>  tools/testing/selftests/bpf/cgroup_helpers.h  |   2 +
>  .../bpf/prog_tests/memcg_async_reclaim.c      | 333 +++++++++
>  .../selftests/bpf/prog_tests/memcg_ops.c      | 634 ++++++++++++++++++
>  .../selftests/bpf/progs/memcg_async_reclaim.c | 203 ++++++
>  tools/testing/selftests/bpf/progs/memcg_ops.c | 132 ++++
>  24 files changed, 2952 insertions(+), 34 deletions(-)
>  create mode 100644 samples/bpf/memcg.bpf.c
>  create mode 100644 samples/bpf/memcg.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/memcg_async_reclaim.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/memcg_ops.c
>  create mode 100644 tools/testing/selftests/bpf/progs/memcg_async_reclaim.c
>  create mode 100644 tools/testing/selftests/bpf/progs/memcg_ops.c
> 
> -- 
> 2.43.0
> 
>

Re: [RFC PATCH bpf-next v7 00/11] mm: BPF struct_ops for dynamic memory protection and async reclaim

Posted by teawater 1 week, 5 days ago

> 
> On Tue, 26 May 2026 10:20:00 +0800 Hui Zhu <hui.zhu@linux.dev> wrote:
> 
> > 
> > From: Hui Zhu <zhuhui@kylinos.cn>
> >  
> >  Overview:
> >  This series introduces BPF struct_ops support for the memory controller,
> >  enabling userspace BPF programs to implement custom, dynamic memory
> >  management policies per cgroup. The feature allows BPF programs to hook
> >  into the core reclaim and charge paths without requiring kernel
> >  modifications, providing a flexible alternative to static knobs such as
> >  memory.low and memory.min.
> >  
> >  The series enables two complementary use cases.
> >  
...
...
...
> >  
> >  Asynchronous proactive reclaim: the memcg_charged and memcg_uncharged
> >  hooks, combined with the BPF workqueue mechanism and the new
> >  bpf_try_to_free_mem_cgroup_pages() kfunc, enable BPF programs to perform
> >  proactive background reclaim without blocking the charge path. The
> >  pattern works as follows: the memcg_charged callback tracks accumulated
> >  memory usage; when usage crosses a configurable threshold, it enqueues an
> >  asynchronous work item via bpf_wq_start() and returns immediately without
> >  throttling the charging task. The workqueue callback then invokes
> >  bpf_try_to_free_mem_cgroup_pages() to reclaim pages from the target
> >  cgroup; if usage remains elevated after reclaim, the callback re-enqueues
> >  itself to continue. This allows a BPF program to keep a cgroup's
> >  footprint below its hard limit (memory.max) entirely in the background,
> >  avoiding the OOM killer or direct-reclaim stalls that would otherwise
> >  occur. The selftest for this feature (patch 10/11) validates the
> >  mechanism concretely: a workload that writes and mmaps a 64 MB file inside
> >  a 32 MB cgroup reliably triggers memory.events "max" events without BPF;
> >  with the async reclaim program attached, the "max" counter does not
> >  increase at all across the same workload.
> > 
> Hi Hui,
> 
> Thanks for the series.
> Would it not be simpler to just have another memcg knob, something like
> memory.high_async.
> When memory usage > memory.high_async, queue a per-memcg work item that calls
> try_to_free_mem_cgroup_pages() until usage drops back below some threshold.
> I am not sure I see what programability aspect from bpf you need here.
> 
> Thanks

Hi Usama,

That's a good question.

By introducing a new BPF kfunc bpf_try_to_free_mem_cgroup_pages,
a BPF program can flexibly control when to start and stop async
reclaim, rather than being constrained to trigger and stop based
solely on memcg usage or one or two fixed events, as with
traditional proactive reclaim interfaces.

For example, async reclaim could be triggered based on PSI, or
on the number of page faults, or even on a combination of
multiple events working together to decide both when to start
and when to stop async reclaim.

That is the motivation behind adding the BPF kfunc
bpf_try_to_free_mem_cgroup_pages in this patch set.

I admit the cover letter did not explain this well enough, and
the example code does not demonstrate this use case either. I
will address both in the next version.

Best,
Hui

> 
> > 
> > 08/11 selftests/bpf: Add tests for memcg_bpf_ops
> >  Adds prog_tests/memcg_ops.c covering three scenarios:
> >  memcg_charged-only throttling, below_low + memcg_charged
> >  interaction, and below_min + memcg_charged interaction. A
> >  tracepoint on memcg:count_memcg_events (PGFAULT) is used to
> >  detect memory pressure and trigger hooks accordingly.
...
...
...
> >  -- 
> >  2.43.0
> >
>