From: Hui Zhu <zhuhui@kylinos.cn>
Overview:
This series introduces BPF struct_ops support for the memory controller,
enabling userspace BPF programs to implement custom, dynamic memory
management policies per cgroup. The feature allows BPF programs to hook
into the core reclaim and charge paths without requiring kernel
modifications, providing a flexible alternative to static knobs such as
memory.low and memory.min.
The series enables two complementary use cases.
Dynamic memory protection: static memory protection thresholds
(memory.low, memory.min) are poor fits for workloads whose actual memory
activity varies over time. A high-priority cgroup holding a large working
set but temporarily idle will still suppress reclaim on its siblings,
wasting available memory. A BPF-driven approach can observe real workload
activity -- page faults, charge/uncharge events -- and activate or
withdraw protection dynamically. The test results at the end of this
letter quantify the difference: in a scenario where the high-priority
cgroup is idle, the BPF-controlled low-priority cgroup achieves roughly
37x higher throughput than with static memory.low.
Asynchronous proactive reclaim: the memcg_charged and memcg_uncharged
hooks, combined with the BPF workqueue mechanism and the new
bpf_try_to_free_mem_cgroup_pages() kfunc, enable BPF programs to perform
proactive background reclaim without blocking the charge path. The
pattern works as follows: the memcg_charged callback tracks accumulated
memory usage; when usage crosses a configurable threshold, it enqueues an
asynchronous work item via bpf_wq_start() and returns immediately without
throttling the charging task. The workqueue callback then invokes
bpf_try_to_free_mem_cgroup_pages() to reclaim pages from the target
cgroup; if usage remains elevated after reclaim, the callback re-enqueues
itself to continue. This allows a BPF program to keep a cgroup's
footprint below its hard limit (memory.max) entirely in the background,
avoiding the OOM killer or direct-reclaim stalls that would otherwise
occur. The selftest for this feature (patch 10/11) validates the
mechanism concretely: a workload that writes and mmaps a 64 MB file inside
a 32 MB cgroup reliably triggers memory.events "max" events without BPF;
with the async reclaim program attached, the "max" counter does not
increase at all across the same workload.
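The async reclaim pattern above can be sketched as a small userspace model.
This is not the BPF program from the series: struct memcg_model and the
model_* functions are hypothetical names, a boolean stands in for the queued
bpf_wq item, and the assumption that one callback run reclaims only down to
the threshold is mine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Userspace model only; none of these names come from the series. */
struct memcg_model {
	uint64_t usage;        /* pages currently charged */
	uint64_t threshold;    /* start background reclaim above this */
	bool     work_pending; /* stands in for a queued bpf_wq item */
};

/* Models memcg_charged: account the batch, enqueue work if needed,
 * and return a 0 ms delay so the charging task is never throttled. */
static unsigned int model_charged(struct memcg_model *m, uint64_t batch)
{
	m->usage += batch;
	if (m->usage > m->threshold && !m->work_pending)
		m->work_pending = true; /* bpf_wq_start() in a real program */
	return 0;
}

/* Models memcg_uncharged: track memory being released. */
static void model_uncharged(struct memcg_model *m, uint64_t batch)
{
	m->usage -= batch < m->usage ? batch : m->usage;
}

/* Models one workqueue callback run: reclaim at most nr pages
 * (assumed: never below the threshold), then stay queued only if
 * usage is still above the threshold. Returns the pages reclaimed. */
static uint64_t model_wq_run(struct memcg_model *m, uint64_t nr)
{
	uint64_t over, freed;

	assert(m->work_pending);
	over = m->usage > m->threshold ? m->usage - m->threshold : 0;
	freed = over < nr ? over : nr;
	m->usage -= freed;
	m->work_pending = m->usage > m->threshold;
	return freed;
}
```

In this model the charge path only flips a flag; all reclaim happens in
model_wq_run(), which is the property the selftest relies on.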
This series includes a subset of Roman's patches from [1] so that the
whole series builds cleanly on bpf-next.
Patch Breakdown:
Patches 1-4 are from Roman Gushchin's series [1], included here to
provide the necessary BPF infrastructure for attaching struct_ops to
cgroups.
Patches 5-11 are the new work in this series:
05/11 bpf: Pass flags in bpf_link_create for struct_ops
Stores attr->link_create.flags in struct bpf_struct_ops_link
and extends the validation to allow BPF_F_ALLOW_OVERRIDE.
Also updates the UAPI comment to reflect that cgroup-bpf attach
flags now apply to BPF_LINK_CREATE in addition to
BPF_PROG_ATTACH.
06/11 mm: memcontrol: Add BPF struct_ops for memory controller
The core feature patch. Introduces the memcg_bpf_ops struct_ops
type with the following hooks:
- memcg_charged(memcg, batch): called on the synchronous charge
path. Returns a throttling delay in milliseconds; used as a
lower bound for __mem_cgroup_handle_over_high(), effective
even when memory.high is not breached.
- memcg_uncharged(memcg, batch): called on uncharge, allowing
BPF programs to track memory releases.
- below_low(memcg, elow, usage): overrides the memory.low
protection check. Returns true to treat the cgroup as
protected regardless of the elow >= usage comparison.
- below_min(memcg, emin, usage): same as below_low but for
memory.min protection.
- handle_cgroup_online/offline(memcg): lifecycle callbacks for
per-cgroup state management in BPF programs.
BPF_F_ALLOW_OVERRIDE is supported: when a program is registered
with this flag, descendant cgroups may attach their own
memcg_bpf_ops to override the inherited policy. Registration
propagates ops down through the subtree via mem_cgroup_iter;
unregistration restores each descendant to the ops its
registering ancestor's parent held, correctly preserving
override chains.
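How the hook return values combine with the existing checks can be sketched
in plain C. This is one reading of the hook descriptions above, not code
from the patch: model_protected() and model_delay_ms() are hypothetical
helpers, and treating "lower bound" as a simple maximum is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* below_low / below_min: a true return from the hook marks the cgroup
 * as protected regardless of the elow >= usage comparison; otherwise
 * the static comparison decides. */
static bool model_protected(uint64_t elow, uint64_t usage, bool hook_result)
{
	return hook_result || elow >= usage;
}

/* memcg_charged: the returned delay is a lower bound, so it applies
 * even when the over-high delay would be zero (assumed: max of both). */
static unsigned int model_delay_ms(unsigned int hook_delay_ms,
				   unsigned int over_high_delay_ms)
{
	return hook_delay_ms > over_high_delay_ms ?
	       hook_delay_ms : over_high_delay_ms;
}
```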
07/11 mm/bpf: Add bpf_try_to_free_mem_cgroup_pages kfunc
Exposes try_to_free_mem_cgroup_pages() to BPF programs as a
KF_SLEEPABLE kfunc. A swappiness parameter controls the
override value passed to the core reclaim path
(effective only when MEMCG_RECLAIM_PROACTIVE is set in
reclaim_options).
08/11 selftests/bpf: Add tests for memcg_bpf_ops
Adds prog_tests/memcg_ops.c covering three scenarios:
memcg_charged-only throttling, below_low + memcg_charged
interaction, and below_min + memcg_charged interaction. A
tracepoint on memcg:count_memcg_events (PGFAULT) is used to
detect memory pressure and trigger hooks accordingly.
09/11 selftests/bpf: Add test for memcg_bpf_ops hierarchies
Validates BPF_F_ALLOW_OVERRIDE attachment semantics across a
three-level cgroup hierarchy: attach with ALLOW_OVERRIDE at the
root, override at the middle level without the flag, then assert
that attaching to the leaf correctly fails with -EBUSY.
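The attachment rules exercised by this test can be modelled in userspace as
follows. The sketch assumes that only the nearest attached ancestor's flag
matters and that a cgroup with a direct attachment rejects a second one;
struct cg_model, MODEL_F_ALLOW_OVERRIDE and the model_* functions are
hypothetical and do not appear in the series.

```c
#include <errno.h>
#include <stddef.h>

#define MODEL_F_ALLOW_OVERRIDE 1u /* stand-in for BPF_F_ALLOW_OVERRIDE */

struct cg_model {
	struct cg_model *parent;
	int ops_id;         /* 0 means no ops attached directly here */
	unsigned int flags; /* flags used for the direct attachment */
};

/* Nearest ancestor (excluding cg itself) with a direct attachment. */
static struct cg_model *nearest_attached_ancestor(struct cg_model *cg)
{
	for (cg = cg->parent; cg; cg = cg->parent)
		if (cg->ops_id)
			return cg;
	return NULL;
}

/* Attach ops to cg; refused unless the inherited policy allows override. */
static int model_attach(struct cg_model *cg, int ops_id, unsigned int flags)
{
	struct cg_model *anc = nearest_attached_ancestor(cg);

	if (cg->ops_id) /* assumption: no double attach */
		return -EBUSY;
	if (anc && !(anc->flags & MODEL_F_ALLOW_OVERRIDE))
		return -EBUSY;
	cg->ops_id = ops_id;
	cg->flags = flags;
	return 0;
}

/* Detaching makes the subtree fall back to the next ancestor's ops. */
static void model_detach(struct cg_model *cg)
{
	cg->ops_id = 0;
	cg->flags = 0;
}

/* Effective ops: own attachment, else the nearest ancestor's. */
static int model_effective_ops(struct cg_model *cg)
{
	struct cg_model *anc;

	if (cg->ops_id)
		return cg->ops_id;
	anc = nearest_attached_ancestor(cg);
	return anc ? anc->ops_id : 0;
}
```

Unlike the patch, which propagates ops down the subtree at registration
time, the model resolves the effective ops lazily; the observable result
for the three-level case is intended to be the same.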
10/11 selftests/bpf: Add selftest for memcg async reclaim via BPF
Demonstrates and validates asynchronous memory reclaim: a BPF
program uses the memcg_charged/memcg_uncharged hooks to track
accumulated usage and, when a threshold is exceeded, enqueues a
bpf_wq_start() workqueue item that calls
bpf_try_to_free_mem_cgroup_pages() without blocking the charge
path. The test asserts that with the BPF program active,
memory.events "max" events do not increase under a workload
that would otherwise exceed the hard limit.
11/11 samples/bpf: Add memcg priority control and async reclaim example
Adds a complete sample (samples/bpf/memcg.bpf.c + memcg.c)
demonstrating both features. The BPF side monitors PGFAULT
events on a high-priority cgroup; when the per-second fault
count crosses a configurable threshold, it activates below_low
or below_min protection for the high-priority cgroup and/or
applies a charge delay to the low-priority cgroup. Six
struct_ops variants are exported so userspace can attach only
the hooks needed. Async reclaim is optionally combined with
priority throttling via a shared low-cgroup ops map.
Test Environment:
The following tests ran in an x86_64 QEMU guest (10 CPUs, 2 GB RAM), with
a tmpfs-backed file on the host serving as the swap device to reduce I/O
impact. Two cgroups are created -- high (high-priority) and low
(low-priority) -- and each test runs two concurrent stress-ng workloads,
one per cgroup, each requesting 3 GB of memory.
# mkdir /sys/fs/cgroup/high /sys/fs/cgroup/low
# free -h
               total        used        free      shared  buff/cache   available
Mem:           1.9Gi       317Mi       1.6Gi       1.0Mi       144Mi       1.6Gi
Swap:          4.0Gi          0B       4.0Gi
Baseline: no memory priority policy:
Both cgroups run without any reclaim protection. Results are roughly
equal, as expected:
    cgroup    bogo ops/s
    high           4,979
    low            4,927
Test 1: memory.low protection:
Setting memory.low on the high-priority cgroup protects it from
reclaim, at the cost of pushing reclaim pressure onto the low-priority
cgroup:
# echo $((3 * 1024 * 1024 * 1024)) > /sys/fs/cgroup/high/memory.low
    cgroup    bogo ops/s
    high         450,290
    low           11,307
The high-priority cgroup benefits significantly, but memory.low relies
on static usage thresholds and cannot adapt to actual workload
behavior.
Test 2: memory.low with an idle high-priority task:
Here the high-priority cgroup runs a Python script that allocates 3 GB
and then sleeps, simulating a low-activity but memory-holding workload.
Because the process is idle, it generates no page faults and does not
actively use its memory. Yet memory.low still protects it, continuing
to suppress the low-priority cgroup's performance:
    cgroup    bogo ops/s
    low           14,757
The low-priority cgroup remains significantly throttled despite the
high-priority cgroup being effectively idle -- a clear limitation of
static memory.low control.
Test 3: memcg eBPF -- dynamic priority control:
memcg is a sample program introduced in this patch series
(samples/bpf/memcg.c + memcg.bpf.c). It loads a BPF program that
monitors PGFAULT events in the high-priority cgroup. When the
per-second fault count exceeds a configured threshold, the hook
activates below_min protection for one second; otherwise the cgroup
receives no special treatment.
# ./memcg --low_path=/sys/fs/cgroup/low \
--high_path=/sys/fs/cgroup/high \
--threshold=1 --use_below_min
Successfully attached!
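The activation rule described above (protection for one second once the
per-second fault count exceeds the threshold) can be sketched as a userspace
model. The window handling here is an assumption about how such a counter
could work, not the logic of samples/bpf/memcg.bpf.c; struct fault_window
and the model_* functions are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

struct fault_window {
	uint64_t window_start_ns;  /* start of the current 1 s window */
	uint64_t count;            /* faults seen in the current window */
	uint64_t threshold;        /* cf. --threshold on the command line */
	uint64_t protect_until_ns; /* protection is active before this */
};

/* Called for each PGFAULT event observed in the high-priority cgroup. */
static void model_on_pgfault(struct fault_window *w, uint64_t now_ns)
{
	if (now_ns - w->window_start_ns >= NSEC_PER_SEC) {
		w->window_start_ns = now_ns; /* open a new window */
		w->count = 0;
	}
	w->count++;
	if (w->count > w->threshold)
		w->protect_until_ns = now_ns + NSEC_PER_SEC;
}

/* Models the below_min hook: true only while protection is active. */
static bool model_below_min(const struct fault_window *w, uint64_t now_ns)
{
	return now_ns < w->protect_until_ns;
}
```

With threshold=1, as in the command above, the second fault within a
second activates protection, and an idle cgroup that never faults is never
protected.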
3a. Both cgroups under active memory pressure:
When both cgroups run stress-ng, the high-priority cgroup generates
frequent page faults and the BPF hook activates protection, matching
the behavior of memory.low:
    cgroup    bogo ops/s
    high         404,392
    low           11,404
3b. High-priority cgroup is idle (Python + sleep):
Because the sleeping Python process generates no page faults, the BPF
hook never activates, and the low-priority cgroup is free to reclaim
memory normally:
    cgroup    bogo ops/s
    low          551,083
This is a ~37x improvement over the equivalent memory.low scenario
(Test 2), demonstrating that eBPF-driven dynamic control can
accurately reflect actual workload activity and avoid unnecessary
protection of idle high-priority tasks.
Summary:
    Scenario                     low-cgroup bogo ops/s
    Baseline (no policy)                        ~4,927
    memory.low, both active                    ~11,307
    memory.low, high idle                      ~14,757
    memcg eBPF, both active                    ~11,404
    memcg eBPF, high idle                     ~551,083
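The ~37x figure is the ratio of the two "high idle" rows. A minimal check
of that arithmetic, using only the numbers reported in this letter
(speedup() is a hypothetical helper):

```c
#include <assert.h>

/* Ratio of two bogo ops/s figures; reference must be positive. */
static double speedup(double with_policy, double reference)
{
	assert(reference > 0.0);
	return with_policy / reference;
}
```

speedup(551083.0, 14757.0) is about 37.3, i.e. the low-priority cgroup
under the BPF policy versus the same workload under static memory.low.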
References:
[1] https://patchew.org/linux/20260127024421.494929-1-roman.gushchin@linux.dev/
Changelog:
v7:
Change the base to v3 of the "mm: BPF OOM" series.
Fix issues raised by bpf-ci.
Rename get_high_delay_ms hook to memcg_charged; add memcg_uncharged
hook for tracking uncharge events.
Update below_low and below_min hooks to receive elow/emin and usage
as explicit arguments.
Add bpf_try_to_free_mem_cgroup_pages kfunc to expose cgroup reclaim
to BPF programs.
Add selftest for BPF-driven asynchronous page reclaim.
Extend samples/bpf/memcg to support async reclaim in addition to
priority throttling.
v6:
Based on the bot+bpf-ci comments, fixed the following issues.
Added fast-path check with unlikely() before SRCU lock acquisition to
optimize the no-BPF case in BPF_MEMCG_CALL.
Add the missing newline to the pr_warn message in bpf_memcontrol_init.
Added comprehensive child process exit status checking with WIFEXITED()
and WEXITSTATUS(), and added zombie process prevention in
real_test_memcg_ops.
Changed malloc() to calloc() for BSS data allocation in all test
functions and samples main function.
Change srcu_read_lock(&memcg_bpf_srcu) to
lockdep_assert_held(&cgroup_mutex) in function memcontrol_bpf_online
and memcontrol_bpf_offline.
v5:
Based on the bot+bpf-ci comments, fixed the following issues.
Fixed issues in memcg_ops.c and memcg.bpf.c by moving variable
declaration to the beginning of need_threshold() function.
The 'u64 current_ts' variable must be declared before any
executable statements.
Improved input validation in samples/bpf/memcg.c by adding a new
parse_u64() helper function. This function properly handles errors
from strtoull() and provides better error messages when parsing
threshold and over_high_ms command-line arguments.
Move check for prog->sleepable after validating member offsets in
mm/bpf_memcontrol.c bpf_memcg_ops_check_member.
Fixed sscanf return value checking in prog_tests/memcg_ops.c.
Changed the condition from 'sscanf() < 0' to 'sscanf() != 1' because
sscanf returns the number of successfully matched items, not a negative
value on error. This makes the test more reliable when reading timing
data from temporary files.
v4:
Fix the issues raised in the bot+bpf-ci comments.
According to JP Kobryn's comments, move exit(0) from
real_test_memcg_ops_child_work to real_test_memcg_ops.
Fix issues in the bpf_memcg_ops_reg function.
v3:
According to the comments from Michal Koutný and Chen Ridong, update hooks
to get_high_delay_ms, below_low, below_min, handle_cgroup_online, and
handle_cgroup_offline.
According to Michal Koutný's comments, add BPF_F_ALLOW_OVERRIDE
support to memcg_bpf_ops.
v2:
According to Tejun Heo's comments, rebased on Roman Gushchin's BPF
OOM patch series [1] and added hierarchical delegation support.
According to the comments from Roman Gushchin and Michal Hocko, designed
concrete use case scenarios and provided test results.
Hui Zhu (7):
bpf: Pass flags in bpf_link_create for struct_ops
mm: memcontrol: Add BPF struct_ops for memory controller
mm/bpf: Add bpf_try_to_free_mem_cgroup_pages kfunc
selftests/bpf: Add tests for memcg_bpf_ops
selftests/bpf: Add test for memcg_bpf_ops hierarchies
selftests/bpf: Add selftest for memcg async reclaim via BPF
samples/bpf: Add memcg priority control and async reclaim example
Roman Gushchin (4):
bpf: move bpf_struct_ops_link into bpf.h
bpf: allow attaching struct_ops to cgroups
libbpf: fix return value on memory allocation failure
libbpf: introduce bpf_map__attach_struct_ops_opts()
MAINTAINERS | 6 +
include/linux/bpf-cgroup-defs.h | 3 +
include/linux/bpf-cgroup.h | 16 +
include/linux/bpf.h | 10 +
include/linux/memcontrol.h | 250 ++++++-
include/uapi/linux/bpf.h | 5 +-
kernel/bpf/bpf_struct_ops.c | 67 +-
kernel/bpf/cgroup.c | 46 ++
mm/bpf_memcontrol.c | 355 +++++++++-
mm/memcontrol.c | 43 +-
samples/bpf/.gitignore | 1 +
samples/bpf/Makefile | 8 +-
samples/bpf/memcg.bpf.c | 380 +++++++++++
samples/bpf/memcg.c | 411 ++++++++++++
tools/include/uapi/linux/bpf.h | 3 +-
tools/lib/bpf/libbpf.c | 22 +-
tools/lib/bpf/libbpf.h | 14 +
tools/lib/bpf/libbpf.map | 1 +
tools/testing/selftests/bpf/cgroup_helpers.c | 41 ++
tools/testing/selftests/bpf/cgroup_helpers.h | 2 +
.../bpf/prog_tests/memcg_async_reclaim.c | 333 +++++++++
.../selftests/bpf/prog_tests/memcg_ops.c | 634 ++++++++++++++++++
.../selftests/bpf/progs/memcg_async_reclaim.c | 203 ++++++
tools/testing/selftests/bpf/progs/memcg_ops.c | 132 ++++
24 files changed, 2952 insertions(+), 34 deletions(-)
create mode 100644 samples/bpf/memcg.bpf.c
create mode 100644 samples/bpf/memcg.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/memcg_async_reclaim.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/memcg_ops.c
create mode 100644 tools/testing/selftests/bpf/progs/memcg_async_reclaim.c
create mode 100644 tools/testing/selftests/bpf/progs/memcg_ops.c
--
2.43.0
On Tue 26-05-26 10:20:00, Hui Zhu wrote:
> From: Hui Zhu <zhuhui@kylinos.cn>
>
> Overview:
> This series introduces BPF struct_ops support for the memory controller,
> enabling userspace BPF programs to implement custom, dynamic memory
> management policies per cgroup. The feature allows BPF programs to hook
> into the core reclaim and charge paths without requiring kernel
> modifications, providing a flexible alternative to static knobs such as
> memory.low and memory.min.
>
> The series enables two complementary use cases.
>
> Dynamic memory protection: static memory protection thresholds
> (memory.low, memory.min) are poor fits for workloads whose actual memory
> activity varies over time. A high-priority cgroup holding a large working
> set but temporarily idle will still suppress reclaim on its siblings,
> wasting available memory. A BPF-driven approach can observe real workload
> activity -- page faults, charge/uncharge events -- and activate or
> withdraw protection dynamically.

Why the same cannot be achieved by dynamically changing protection?

> The test results at the end of this
> letter quantify the difference: in a scenario where the high-priority
> cgroup is idle, the BPF-controlled low-priority cgroup achieves roughly
> 37x higher throughput than with static memory.low.
>
> Asynchronous proactive reclaim: the memcg_charged and memcg_uncharged
> hooks, combined with the BPF workqueue mechanism and the new
> bpf_try_to_free_mem_cgroup_pages() kfunc, enable BPF programs to perform
> proactive background reclaim without blocking the charge path. The
> pattern works as follows: the memcg_charged callback tracks accumulated
> memory usage; when usage crosses a configurable threshold, it enqueues an
> asynchronous work item via bpf_wq_start() and returns immediately without
> throttling the charging task. The workqueue callback then invokes
> bpf_try_to_free_mem_cgroup_pages() to reclaim pages from the target
> cgroup; if usage remains elevated after reclaim, the callback re-enqueues
> itself to continue. This allows a BPF program to keep a cgroup's
> footprint below its hard limit (memory.max) entirely in the background,
> avoiding the OOM killer or direct-reclaim stalls that would otherwise
> occur.

How do you account the overall work done to the specific memcg as the
large part of the reclaim is done from WQ context?

Also when introducing a BPF hook please focus on describing why existing
interfaces fail to achieve what you need. For the async reclaim why it
is not practical or feasible to use userspace driven memory reclaim.
--
Michal Hocko
SUSE Labs
Hi Michal,

> On Tue 26-05-26 10:20:00, Hui Zhu wrote:
> > From: Hui Zhu <zhuhui@kylinos.cn>
> >
> > Overview:
> > This series introduces BPF struct_ops support for the memory controller,
> > enabling userspace BPF programs to implement custom, dynamic memory
> > management policies per cgroup. The feature allows BPF programs to hook
> > into the core reclaim and charge paths without requiring kernel
> > modifications, providing a flexible alternative to static knobs such as
> > memory.low and memory.min.
> >
> > The series enables two complementary use cases.
> >
> > Dynamic memory protection: static memory protection thresholds
> > (memory.low, memory.min) are poor fits for workloads whose actual memory
> > activity varies over time. A high-priority cgroup holding a large working
> > set but temporarily idle will still suppress reclaim on its siblings,
> > wasting available memory. A BPF-driven approach can observe real workload
> > activity -- page faults, charge/uncharge events -- and activate or
> > withdraw protection dynamically.
>
> Why the same cannot be achieved by dynamically changing protection?

Dynamically adjusting memory.low or memory.min is indeed an option, but
it has a practical drawback: in many production environments these values
are managed and pushed down by a cluster-level orchestrator (e.g. a
container runtime or resource manager). Modifying them from a separate
BPF-based agent risks conflicts with the orchestrator's own control loop
and makes the system harder to reason about.

Beyond that, the intended use case requires rapid, short-lived
adjustments -- reacting to bursts of page faults or PSI spikes and
reverting just as quickly once the pressure subsides. Mutating the static
knobs for that purpose feels like the wrong abstraction: the knobs
express policy intent, while what we need is a transient override that
sits on top of that policy.

The hooks are therefore not meant to replace the existing limits, but to
complement them: the orchestrator continues to own memory.low /
memory.min, while a BPF program makes small, brief corrections in
response to observed runtime behavior.

> > The test results at the end of this
> > letter quantify the difference: in a scenario where the high-priority
> > cgroup is idle, the BPF-controlled low-priority cgroup achieves roughly
> > 37x higher throughput than with static memory.low.
> >
> > Asynchronous proactive reclaim: the memcg_charged and memcg_uncharged
> > hooks, combined with the BPF workqueue mechanism and the new
> > bpf_try_to_free_mem_cgroup_pages() kfunc, enable BPF programs to perform
> > proactive background reclaim without blocking the charge path. The
> > pattern works as follows: the memcg_charged callback tracks accumulated
> > memory usage; when usage crosses a configurable threshold, it enqueues an
> > asynchronous work item via bpf_wq_start() and returns immediately without
> > throttling the charging task. The workqueue callback then invokes
> > bpf_try_to_free_mem_cgroup_pages() to reclaim pages from the target
> > cgroup; if usage remains elevated after reclaim, the callback re-enqueues
> > itself to continue. This allows a BPF program to keep a cgroup's
> > footprint below its hard limit (memory.max) entirely in the background,
> > avoiding the OOM killer or direct-reclaim stalls that would otherwise
> > occur.
>
> How do you account the overall work done to the specific memcg as the
> large part of the reclaim is done from WQ context?

One approach to attribute the reclaim work accurately to the target memcg
would be to expose a kfunc that creates a kthread_worker and attaches it
to a specific cgroup. Reclaim work enqueued to that worker would then run
in a context already associated with the target memcg, so the accounting
would naturally fall to the right cgroup without any extra bookkeeping.

The tradeoff is additional complexity: creating a per-cgroup worker
introduces resource overhead and lifecycle management concerns (e.g. when
should the worker be torn down). Whether that cost is justified depends
on how strictly the caller needs the reclaim to be attributed.

That said, I am not certain this is the right direction yet and would
welcome your thoughts on whether this is worth pursuing, or whether there
is a simpler mechanism I am overlooking.

> Also when introducing a BPF hook please focus on describing why existing
> interfaces fail to achieve what you need. For the async reclaim why it
> is not practical or feasible to use userspace driven memory reclaim.

Noted, and thank you for both points. In the next revision I will add a
dedicated section to each hook's description covering:

- Why existing interfaces are insufficient. For the async reclaim case
  specifically, I will explain why userspace-driven reclaim (e.g.
  memory.reclaim, cgroup-aware madvise, or a dedicated reclaim daemon) is
  not practical: userspace cannot react at the granularity or latency
  required, and the round-trip through a syscall or procfs write
  introduces overhead that defeats the purpose of proactive reclaim.
- What gap the new hook fills that cannot be closed by tuning existing
  knobs.

Best,
Hui

> --
> Michal Hocko
> SUSE Labs
On Tue, 26 May 2026 10:20:00 +0800 Hui Zhu <hui.zhu@linux.dev> wrote:
> From: Hui Zhu <zhuhui@kylinos.cn>
>
> Overview:
> This series introduces BPF struct_ops support for the memory controller,
> enabling userspace BPF programs to implement custom, dynamic memory
> management policies per cgroup. The feature allows BPF programs to hook
> into the core reclaim and charge paths without requiring kernel
> modifications, providing a flexible alternative to static knobs such as
> memory.low and memory.min.
>
> The series enables two complementary use cases.
>
> Dynamic memory protection: static memory protection thresholds
> (memory.low, memory.min) are poor fits for workloads whose actual memory
> activity varies over time. A high-priority cgroup holding a large working
> set but temporarily idle will still suppress reclaim on its siblings,
> wasting available memory. A BPF-driven approach can observe real workload
> activity -- page faults, charge/uncharge events -- and activate or
> withdraw protection dynamically. The test results at the end of this
> letter quantify the difference: in a scenario where the high-priority
> cgroup is idle, the BPF-controlled low-priority cgroup achieves roughly
> 37x higher throughput than with static memory.low.
>
> Asynchronous proactive reclaim: the memcg_charged and memcg_uncharged
> hooks, combined with the BPF workqueue mechanism and the new
> bpf_try_to_free_mem_cgroup_pages() kfunc, enable BPF programs to perform
> proactive background reclaim without blocking the charge path. The
> pattern works as follows: the memcg_charged callback tracks accumulated
> memory usage; when usage crosses a configurable threshold, it enqueues an
> asynchronous work item via bpf_wq_start() and returns immediately without
> throttling the charging task. The workqueue callback then invokes
> bpf_try_to_free_mem_cgroup_pages() to reclaim pages from the target
> cgroup; if usage remains elevated after reclaim, the callback re-enqueues
> itself to continue. This allows a BPF program to keep a cgroup's
> footprint below its hard limit (memory.max) entirely in the background,
> avoiding the OOM killer or direct-reclaim stalls that would otherwise
> occur. The selftest for this feature (patch 10/11) validates the
> mechanism concretely: a workload that writes and mmaps a 64 MB file inside
> a 32 MB cgroup reliably triggers memory.events "max" events without BPF;
> with the async reclaim program attached, the "max" counter does not
> increase at all across the same workload.

Hi Hui,

Thanks for the series.

Would it not be simpler to just have another memcg knob, something like
memory.high_async? When memory usage > memory.high_async, queue a
per-memcg work item that calls try_to_free_mem_cgroup_pages() until usage
drops back below some threshold.

I am not sure I see what programmability aspect from bpf you need here.

Thanks
> > 10/11 selftests/bpf: Add selftest for memcg async reclaim via BPF > Demonstrates and validates asynchronous memory reclaim: a BPF > program uses the memcg_charged/memcg_uncharged hooks to track > accumulated usage and, when a threshold is exceeded, enqueues a > bpf_wq_start() workqueue item that calls > bpf_try_to_free_mem_cgroup_pages() without blocking the charge > path. The test asserts that with the BPF program active, > memory.events "max" events do not increase under a workload > that would otherwise exceed the hard limit. > > 11/11 samples/bpf: Add memcg priority control and async reclaim example > Adds a complete sample (samples/bpf/memcg.bpf.c + memcg.c) > demonstrating both features. The BPF side monitors PGFAULT > events on a high-priority cgroup; when the per-second fault > count crosses a configurable threshold, it activates below_low > or below_min protection for the high-priority cgroup and/or > applies a charge delay to the low-priority cgroup. Six > struct_ops variants are exported so userspace can attach only > the hooks needed. Async reclaim is optionally combined with > priority throttling via a shared low-cgroup ops map. > > Test Environment: > The following examples run on x86_64 QEMU (10 CPUs, 2 GB RAM), using > a tmpfs-backed file on the host as a swap device to reduce I/O impact. > Two cgroups are created -- high (high-priority) and low (low-priority) > -- and each test runs two concurrent stress-ng workloads, one per > cgroup, each requesting 3 GB of memory. > > # mkdir /sys/fs/cgroup/high /sys/fs/cgroup/low > # free -h > total used free shared buff/cache available > Mem: 1.9Gi 317Mi 1.6Gi 1.0Mi 144Mi 1.6Gi > Swap: 4.0Gi 0B 4.0Gi > > Baseline: no memory priority policy: > Both cgroups run without any reclaim protection. 
Results are roughly > equal, as expected: > > cgroup bogo ops/s > high 4,979 > low 4,927 > > Test 1: memory.low protection: > Setting memory.low on the high-priority cgroup protects it from > reclaim, at the cost of pushing reclaim pressure onto the low-priority > cgroup: > > # echo $((3 * 1024 * 1024 * 1024)) > /sys/fs/cgroup/high/memory.low > > cgroup bogo ops/s > high 450,290 > low 11,307 > > The high-priority cgroup benefits significantly, but memory.low relies > on static usage thresholds and cannot adapt to actual workload > behavior. > > Test 2: memory.low with an idle high-priority task: > Here the high-priority cgroup runs a Python script that allocates 3 GB > and then sleeps, simulating a low-activity but memory-holding workload. > Because the process is idle, it generates no page faults and does not > actively use its memory. Yet memory.low still protects it, continuing > to suppress the low-priority cgroup's performance: > > cgroup bogo ops/s > low 14,757 > > The low-priority cgroup remains significantly throttled despite the > high-priority cgroup being effectively idle -- a clear limitation of > static memory.low control. > > Test 3: memcg eBPF -- dynamic priority control: > memcg is a sample program introduced in this patch series > (samples/bpf/memcg.c + memcg.bpf.c). It loads a BPF program that > monitors PGFAULT events in the high-priority cgroup. When the > per-second fault count exceeds a configured threshold, the hook > activates below_min protection for one second; otherwise the cgroup > receives no special treatment. > > # ./memcg --low_path=/sys/fs/cgroup/low \ > --high_path=/sys/fs/cgroup/high \ > --threshold=1 --use_below_min > Successfully attached! > > 3a. Both cgroups under active memory pressure: > > When both cgroups run stress-ng, the high-priority cgroup generates > frequent page faults and the BPF hook activates protection, matching > the behavior of memory.low: > > cgroup bogo ops/s > high 404,392 > low 11,404 > > 3b. 
High-priority cgroup is idle (Python + sleep): > > Because the sleeping Python process generates no page faults, the BPF > hook never activates, and the low-priority cgroup is free to reclaim > memory normally: > > cgroup bogo ops/s > low 551,083 > > This is a ~37x improvement over the equivalent memory.low scenario > (Test 2), demonstrating that eBPF-driven dynamic control can > accurately reflect actual workload activity and avoid unnecessary > protection of idle high-priority tasks. > > Summary: > Scenario low-cgroup bogo ops/s > Baseline (no policy) ~4,927 > memory.low, both active ~11,307 > memory.low, high idle ~14,757 > memcg eBPF, both active ~11,404 > memcg eBPF, high idle ~551,083 > > References: > [1] https://patchew.org/linux/20260127024421.494929-1-roman.gushchin@linux.dev/ > > Changelog: > v7: > Change base commits of "mm: BPF OOM" to v3. > Some fixes according to the comments of bpf-ci. > Rename get_high_delay_ms hook to memcg_charged; add memcg_uncharged > hook for tracking uncharge events. > Update below_low and below_min hooks to receive elow/emin and usage > as explicit arguments. > Add bpf_try_to_free_mem_cgroup_pages kfunc to expose cgroup reclaim > to BPF programs. > Add selftest for BPF-driven asynchronous page reclaim. > Extend samples/bpf/memcg to support async reclaim in addition to > priority throttling. > v6: > Based on the bot+bof-ci comments, fixed the following issues. > Added fast-path check with unlikely() before SRCU lock acquisition to > optimize the no-BPF case in BPF_MEMCG_CALL. > Add missing newline in pr_warn message to bpf_memcontrol_init. > Added comprehensive child process exit status checking with WIFEXITED() > and WEXITSTATUS(), and added zombie process prevention in > real_test_memcg_ops. > Changed malloc() to calloc() for BSS data allocation in all test > functions and samples main function. 
> Change srcu_read_lock(&memcg_bpf_srcu) to
> lockdep_assert_held(&cgroup_mutex) in function memcontrol_bpf_online
> and memcontrol_bpf_offline.
> v5:
> Based on the bot+bpf-ci comments, fixed the following issues.
> Fixed issues in memcg_ops.c and memcg.bpf.c by moving variable
> declaration to the beginning of need_threshold() function.
> The 'u64 current_ts' variable must be declared before any
> executable statements.
> Improved input validation in samples/bpf/memcg.c by adding a new
> parse_u64() helper function. This function properly handles errors
> from strtoull() and provides better error messages when parsing
> threshold and over_high_ms command-line arguments.
> Move check for prog->sleepable after validating member offsets in
> mm/bpf_memcontrol.c bpf_memcg_ops_check_member.
> Fixed sscanf return value checking in prog_tests/memcg_ops.c.
> Changed the condition from 'sscanf() < 0' to 'sscanf() != 1' because
> sscanf returns the number of successfully matched items, not a negative
> value on error. This makes the test more reliable when reading timing
> data from temporary files.
> v4:
> Fix the issues according to the comments from bot+bpf-ci.
> According to JP Kobryn's comments, move exit(0) from
> real_test_memcg_ops_child_work to real_test_memcg_ops.
> Fix issues in the bpf_memcg_ops_reg function.
> v3:
> According to the comments from Michal Koutný and Chen Ridong, update hooks
> to get_high_delay_ms, below_low, below_min, handle_cgroup_online, and
> handle_cgroup_offline.
> According to Michal Koutný's comments, add BPF_F_ALLOW_OVERRIDE
> support to memcg_bpf_ops.
> v2:
> According to Tejun Heo's comments, rebased on Roman Gushchin's BPF
> OOM patch series [1] and added hierarchical delegation support.
> According to the comments from Roman Gushchin and Michal Hocko, designed
> concrete use case scenarios and provided test results.
>
> Hui Zhu (7):
>   bpf: Pass flags in bpf_link_create for struct_ops
>   mm: memcontrol: Add BPF struct_ops for memory controller
>   mm/bpf: Add bpf_try_to_free_mem_cgroup_pages kfunc
>   selftests/bpf: Add tests for memcg_bpf_ops
>   selftests/bpf: Add test for memcg_bpf_ops hierarchies
>   selftests/bpf: Add selftest for memcg async reclaim via BPF
>   samples/bpf: Add memcg priority control and async reclaim example
>
> Roman Gushchin (4):
>   bpf: move bpf_struct_ops_link into bpf.h
>   bpf: allow attaching struct_ops to cgroups
>   libbpf: fix return value on memory allocation failure
>   libbpf: introduce bpf_map__attach_struct_ops_opts()
>
>  MAINTAINERS                                   |   6 +
>  include/linux/bpf-cgroup-defs.h               |   3 +
>  include/linux/bpf-cgroup.h                    |  16 +
>  include/linux/bpf.h                           |  10 +
>  include/linux/memcontrol.h                    | 250 ++++++-
>  include/uapi/linux/bpf.h                      |   5 +-
>  kernel/bpf/bpf_struct_ops.c                   |  67 +-
>  kernel/bpf/cgroup.c                           |  46 ++
>  mm/bpf_memcontrol.c                           | 355 +++++++++-
>  mm/memcontrol.c                               |  43 +-
>  samples/bpf/.gitignore                        |   1 +
>  samples/bpf/Makefile                          |   8 +-
>  samples/bpf/memcg.bpf.c                       | 380 +++++++++++
>  samples/bpf/memcg.c                           | 411 ++++++++++++
>  tools/include/uapi/linux/bpf.h                |   3 +-
>  tools/lib/bpf/libbpf.c                        |  22 +-
>  tools/lib/bpf/libbpf.h                        |  14 +
>  tools/lib/bpf/libbpf.map                      |   1 +
>  tools/testing/selftests/bpf/cgroup_helpers.c  |  41 ++
>  tools/testing/selftests/bpf/cgroup_helpers.h  |   2 +
>  .../bpf/prog_tests/memcg_async_reclaim.c      | 333 +++++++++
>  .../selftests/bpf/prog_tests/memcg_ops.c      | 634 ++++++++++++++++++
>  .../selftests/bpf/progs/memcg_async_reclaim.c | 203 ++++++
>  tools/testing/selftests/bpf/progs/memcg_ops.c | 132 ++++
>  24 files changed, 2952 insertions(+), 34 deletions(-)
>  create mode 100644 samples/bpf/memcg.bpf.c
>  create mode 100644 samples/bpf/memcg.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/memcg_async_reclaim.c
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/memcg_ops.c
>  create mode 100644 tools/testing/selftests/bpf/progs/memcg_async_reclaim.c
>  create mode 100644 tools/testing/selftests/bpf/progs/memcg_ops.c
>
> --
> 2.43.0
>
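
The gating rule that Test 3 describes (count PGFAULT events per second, and hold below_min protection for one second once the count exceeds the threshold) can be sketched in plain userspace C. This is only an illustration of the described behavior, not the code in samples/bpf/memcg.bpf.c: the struct fault_gate type and the fault_gate_* function names are hypothetical, and the timestamp is assumed to be a monotonic nanosecond clock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical per-cgroup state; not the layout used by the sample. */
struct fault_gate {
	uint64_t threshold;        /* fault count per window that must be exceeded */
	uint64_t window_start_ns;  /* start of the current one-second window */
	uint64_t count;            /* faults seen in the current window */
	uint64_t protect_until_ns; /* protection is active before this time */
};

void fault_gate_init(struct fault_gate *g, uint64_t threshold)
{
	assert(g != NULL);
	g->threshold = threshold;
	g->window_start_ns = 0;
	g->count = 0;
	g->protect_until_ns = 0;
}

/* Called once per page-fault event; now_ns is assumed to be monotonic. */
void fault_gate_record(struct fault_gate *g, uint64_t now_ns)
{
	assert(g != NULL);
	/* Start a fresh window once the current one is a second old. */
	if (now_ns - g->window_start_ns >= NSEC_PER_SEC) {
		g->window_start_ns = now_ns;
		g->count = 0;
	}
	g->count++;
	/* Exceeding the threshold arms protection for one more second. */
	if (g->count > g->threshold)
		g->protect_until_ns = now_ns + NSEC_PER_SEC;
}

/* What a below_min-style hook would consult when reclaim asks. */
bool fault_gate_active(const struct fault_gate *g, uint64_t now_ns)
{
	assert(g != NULL);
	return now_ns < g->protect_until_ns;
}
```

With a threshold of 1, as in the command line above, a cgroup that never faults (the sleeping Python case in 3b) never arms the gate, while two faults inside the same second arm it for one second past the second fault.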
>
> On Tue, 26 May 2026 10:20:00 +0800 Hui Zhu <hui.zhu@linux.dev> wrote:
>
> >
> > From: Hui Zhu <zhuhui@kylinos.cn>
> >
> > Overview:
> > This series introduces BPF struct_ops support for the memory controller,
> > enabling userspace BPF programs to implement custom, dynamic memory
> > management policies per cgroup. The feature allows BPF programs to hook
> > into the core reclaim and charge paths without requiring kernel
> > modifications, providing a flexible alternative to static knobs such as
> > memory.low and memory.min.
> >
> > The series enables two complementary use cases.
> >
... ... ...
> >
> > Asynchronous proactive reclaim: the memcg_charged and memcg_uncharged
> > hooks, combined with the BPF workqueue mechanism and the new
> > bpf_try_to_free_mem_cgroup_pages() kfunc, enable BPF programs to perform
> > proactive background reclaim without blocking the charge path. The
> > pattern works as follows: the memcg_charged callback tracks accumulated
> > memory usage; when usage crosses a configurable threshold, it enqueues an
> > asynchronous work item via bpf_wq_start() and returns immediately without
> > throttling the charging task. The workqueue callback then invokes
> > bpf_try_to_free_mem_cgroup_pages() to reclaim pages from the target
> > cgroup; if usage remains elevated after reclaim, the callback re-enqueues
> > itself to continue. This allows a BPF program to keep a cgroup's
> > footprint below its hard limit (memory.max) entirely in the background,
> > avoiding the OOM killer or direct-reclaim stalls that would otherwise
> > occur. The selftest for this feature (patch 10/11) validates the
> > mechanism concretely: a workload that writes and mmaps a 64 MB file inside
> > a 32 MB cgroup reliably triggers memory.events "max" events without BPF;
> > with the async reclaim program attached, the "max" counter does not
> > increase at all across the same workload.
> >
>
> Hi Hui,
>
> Thanks for the series.
> Would it not be simpler to just have another memcg knob, something like
> memory.high_async?
> When memory usage > memory.high_async, queue a per-memcg work item that calls
> try_to_free_mem_cgroup_pages() until usage drops back below some threshold.
> I am not sure I see what programmability aspect from bpf you need here.
>
> Thanks

Hi Usama,

That's a good question.

With the new BPF kfunc bpf_try_to_free_mem_cgroup_pages(), a BPF program
can decide for itself when to start and stop async reclaim, instead of
being limited to memcg usage or one or two fixed events as the triggers,
as with traditional proactive reclaim interfaces.

For example, async reclaim could be triggered based on PSI, on the
number of page faults, or on a combination of several events that
together decide both when to start and when to stop.

That is the motivation for adding the kfunc in this patch set.

I admit the cover letter did not explain this well enough, and the
example code does not demonstrate this use case either. I will address
both in the next version.

Best,
Hui

> >
> > 08/11 selftests/bpf: Add tests for memcg_bpf_ops
> > Adds prog_tests/memcg_ops.c covering three scenarios:
> > memcg_charged-only throttling, below_low + memcg_charged
> > interaction, and below_min + memcg_charged interaction. A
> > tracepoint on memcg:count_memcg_events (PGFAULT) is used to
> > detect memory pressure and trigger hooks accordingly.
... ... ...
> > --
> > 2.43.0
> >
>
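
To make the point of the reply concrete, here is a minimal userspace C sketch of a reclaim policy that starts on one combination of signals and stops on another, which a single usage-based knob could not express. Everything in it is an assumption made for illustration: the struct and function names, the choice of signals (usage, a PSI-style pressure percentage, a page-fault rate) and the start/stop rules are hypothetical and do not come from the series. In a real BPF program the loop body would be a workqueue callback that calls the reclaim kfunc and re-enqueues itself; here a simple subtraction stands in for one reclaim pass.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MiB (1024ULL * 1024ULL)

/* Hypothetical signals a policy might sample for one cgroup. */
struct reclaim_signals {
	uint64_t usage_bytes;    /* current memory usage */
	uint32_t psi_some_pct;   /* memory pressure stall share, in percent */
	uint64_t faults_per_sec; /* recent page-fault rate */
};

/* Hypothetical policy knobs plus the current on/off state. */
struct reclaim_policy {
	uint64_t start_usage_bytes;   /* start at or above this usage ... */
	uint64_t busy_faults_per_sec; /* ... but only while faults stay below this */
	uint64_t stop_usage_bytes;    /* stop once usage falls to this level */
	uint32_t stop_psi_pct;        /* or once pressure reaches this level */
	bool running;
};

/* Returns true while background reclaim should keep going. */
bool reclaim_policy_update(struct reclaim_policy *p,
			   const struct reclaim_signals *s)
{
	assert(p != NULL && s != NULL);
	if (!p->running) {
		/* Start only when the cgroup is large and mostly idle. */
		if (s->usage_bytes >= p->start_usage_bytes &&
		    s->faults_per_sec < p->busy_faults_per_sec)
			p->running = true;
	} else if (s->usage_bytes <= p->stop_usage_bytes ||
		   s->psi_some_pct >= p->stop_psi_pct ||
		   s->faults_per_sec >= p->busy_faults_per_sec) {
		/* Stop on low usage, rising pressure, or renewed activity. */
		p->running = false;
	}
	return p->running;
}

/* Stand-in for the re-enqueueing worker: one batch per pass. */
unsigned int run_reclaim_passes(struct reclaim_policy *p,
				struct reclaim_signals *s,
				uint64_t batch_bytes,
				unsigned int max_passes)
{
	unsigned int passes = 0;

	while (passes < max_passes && reclaim_policy_update(p, s)) {
		uint64_t freed = s->usage_bytes < batch_bytes ?
				 s->usage_bytes : batch_bytes;
		s->usage_bytes -= freed;
		passes++;
	}
	return passes;
}
```

The separate start and stop conditions give the policy hysteresis: reclaim begins at one usage level and ends at a lower one, and it is abandoned early if pressure or fault activity shows that the workload has become active again.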