From: Hui Zhu <zhuhui@kylinos.cn>

This series adds BPF struct_ops support to the memory controller,
enabling dynamic control over memory pressure through the
memcg_nr_pages_over_high mechanism. This allows administrators to
suppress low-priority cgroups' memory usage based on custom policies
implemented in BPF programs.

Background and Motivation

The memory controller provides memory.high to throttle cgroups whose
usage exceeds that limit. However, the current implementation applies
the same policy across all cgroups without considering priority or
workload characteristics.

This series introduces a BPF hook that allows reporting additional
"pages over high" for specific cgroups, effectively increasing memory
pressure and throttling for lower-priority workloads when
higher-priority cgroups need resources.

Use Case: Priority-Based Memory Management

Consider a system running both latency-sensitive services and batch
processing workloads. When the high-priority service experiences
memory pressure (detected via page scan events), the BPF program can
artificially inflate the "over high" count for low-priority cgroups,
causing them to be throttled more aggressively and freeing up memory
for the critical workload.

Implementation

This series builds upon Roman Gushchin's BPF OOM patch series in [1].
The implementation adds:

1. A memcg_bpf_ops struct_ops type with a memcg_nr_pages_over_high hook
2. Integration into the memory pressure calculation paths
3. Cgroup hierarchy management (inheritance during online/offline)
4. SRCU protection for safe concurrent access

Why Not PSI?

This implementation does not use PSI for triggering, as discussed in
[2]. Instead, the sample code monitors PGSCAN events via tracepoints,
which provides more direct feedback on memory pressure.

Example Results

Testing on x86_64 QEMU (10 CPUs, 4 GB RAM, cache=none swap):

root@ubuntu:~# cat /proc/sys/vm/swappiness
60
root@ubuntu:~# mkdir /sys/fs/cgroup/high
root@ubuntu:~# mkdir /sys/fs/cgroup/low
root@ubuntu:~# ./memcg /sys/fs/cgroup/low /sys/fs/cgroup/high 100 1024
Successfully attached!
root@ubuntu:~# cgexec -g memory:low stress-ng --vm 4 --vm-keep --vm-bytes 80% \
    --vm-method all --seed 2025 --metrics -t 60 \
    & cgexec -g memory:high stress-ng --vm 4 --vm-keep --vm-bytes 80% \
    --vm-method all --seed 2025 --metrics -t 60
[1] 1075
stress-ng: info:  [1075] setting to a 1 min, 0 secs run per stressor
stress-ng: info:  [1076] setting to a 1 min, 0 secs run per stressor
stress-ng: info:  [1075] dispatching hogs: 4 vm
stress-ng: info:  [1076] dispatching hogs: 4 vm
stress-ng: metrc: [1076] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [1076]                          (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [1076] vm             21033377     60.47    158.04      3.66    347825.55      130076.67        66.85        834836
stress-ng: info:  [1076] skipped: 0
stress-ng: info:  [1076] passed: 4: vm (4)
stress-ng: info:  [1076] failed: 0
stress-ng: info:  [1076] metrics untrustworthy: 0
stress-ng: info:  [1076] successful run completed in 1 min, 0.72 secs
root@ubuntu:~#
stress-ng: metrc: [1075] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [1075]                          (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [1075] vm                11568     65.05      0.00      0.21       177.83       56123.74         0.08          3200
stress-ng: info:  [1075] skipped: 0
stress-ng: info:  [1075] passed: 4: vm (4)
stress-ng: info:  [1075] failed: 0
stress-ng: info:  [1075] metrics untrustworthy: 0
stress-ng: info:  [1075] successful run completed in 1 min, 5.06 secs

Results show the low-priority cgroup (/sys/fs/cgroup/low) was
significantly throttled:
- High-priority cgroup: 21,033,377 bogo ops at 347,825 ops/s
- Low-priority cgroup: 11,568 bogo ops at 177 ops/s

The stress-ng process in the low-priority cgroup experienced a ~99.9%
slowdown in memory operations compared to the high-priority cgroup,
demonstrating effective priority enforcement through BPF-controlled
memory pressure.

Patch Overview

PATCH 1/3: Core kernel implementation
- Adds memcg_bpf_ops struct_ops support
- Implements cgroup lifecycle management
- Integrates the hook into pressure calculation

PATCH 2/3: Selftest suite
- Validates attach/detach behavior
- Tests hierarchy inheritance
- Verifies throttling effectiveness

PATCH 3/3: Sample programs
- Demonstrates PGSCAN-based triggering
- Shows priority-based throttling
- Provides a reference implementation

Changelog:

v2:
- Following Tejun Heo's comments, rebased on Roman Gushchin's BPF OOM
  patch series [1] and added hierarchical delegation support.
- Following comments from Roman Gushchin and Michal Hocko, designed
  concrete use-case scenarios and provided test results.
[1] https://lore.kernel.org/lkml/20251027231727.472628-1-roman.gushchin@linux.dev/
[2] https://lore.kernel.org/lkml/1d9a162605a3f32ac215430131f7745488deaa34@linux.dev/

Hui Zhu (3):
  mm: memcontrol: Add BPF struct_ops for memory pressure control
  selftests/bpf: Add tests for memcg_bpf_ops
  samples/bpf: Add memcg priority control example

 MAINTAINERS                                           |   5 +
 include/linux/memcontrol.h                            |   2 +
 mm/bpf_memcontrol.c                                   | 241 ++++++++++++-
 mm/bpf_memcontrol.h                                   |  73 ++++
 mm/memcontrol.c                                       |  27 +-
 samples/bpf/.gitignore                                |   1 +
 samples/bpf/Makefile                                  |   9 +-
 samples/bpf/memcg.bpf.c                               |  95 +++++
 samples/bpf/memcg.c                                   | 204 +++++++++++
 .../selftests/bpf/prog_tests/memcg_ops.c              | 340 ++++++++++++++++++
 .../selftests/bpf/progs/memcg_ops_over_high.c         |  95 +++++
 11 files changed, 1082 insertions(+), 10 deletions(-)
 create mode 100644 mm/bpf_memcontrol.h
 create mode 100644 samples/bpf/memcg.bpf.c
 create mode 100644 samples/bpf/memcg.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/memcg_ops.c
 create mode 100644 tools/testing/selftests/bpf/progs/memcg_ops_over_high.c

--
2.43.0
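For readers unfamiliar with struct_ops, here is a minimal sketch of what a
memcg_bpf_ops program could look like. The struct_ops type and the
memcg_nr_pages_over_high hook name come from the cover letter above; the
hook signature, the "pressure" map, and the returned page count are
assumptions for illustration and may differ from the actual
samples/bpf/memcg.bpf.c in the series.

```c
// SPDX-License-Identifier: GPL-2.0
/* Illustrative sketch only -- not the code posted in this series.
 * The memcg_bpf_ops type and memcg_nr_pages_over_high hook name are
 * taken from the cover letter; the hook signature, the "pressure" map
 * and the constant below are assumptions. Requires vmlinux.h generated
 * from a kernel carrying the struct_ops patch. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

/* Hypothetical flag set by user space (e.g. when PGSCAN events are
 * observed in the high-priority cgroup) to request extra throttling. */
struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);
} pressure SEC(".maps");

/* Assumed hook signature: return additional "pages over high" that the
 * kernel should account for @memcg when deciding how hard to throttle. */
SEC("struct_ops/memcg_nr_pages_over_high")
unsigned long BPF_PROG(over_high, struct mem_cgroup *memcg)
{
	__u32 key = 0;
	__u64 *under_pressure = bpf_map_lookup_elem(&pressure, &key);

	/* Report extra pages only while the high-priority workload is
	 * under reclaim pressure; 1024 is an arbitrary example value. */
	if (under_pressure && *under_pressure)
		return 1024;
	return 0;
}

SEC(".struct_ops.link")
struct memcg_bpf_ops memcg_ops = {
	.memcg_nr_pages_over_high = (void *)over_high,
};
```

User space would load this skeleton, attach the struct_ops link (the
example binary above takes the low- and high-priority cgroup paths plus
two numeric parameters), and update the pressure map from whatever signal
it monitors; per the cover letter, the actual sample watches PGSCAN events
via tracepoints rather than PSI.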
Hi Hui.

On Tue, Dec 30, 2025 at 11:01:58AM +0800, Hui Zhu <hui.zhu@linux.dev> wrote:
> This allows administrators to suppress low-priority cgroups' memory
> usage based on custom policies implemented in BPF programs.

BTW memory.low was conceived as a work-conserving mechanism for
prioritization of different workloads. Have you tried that? No need to
go directly to (high) limits. (<- Main question, below are some
secondary implementation questions/remarks.)

...

> This series introduces a BPF hook that allows reporting
> additional "pages over high" for specific cgroups, effectively
> increasing memory pressure and throttling for lower-priority
> workloads when higher-priority cgroups need resources.

Have you considered hooking into calculate_high_delay() instead? (That
function has undergone some evolution so it'd seem like the candidate
for BPFication.)

...

> 3. Cgroup hierarchy management (inheritance during online/offline)

I see you're copying the program upon memcg creation.
Configuration copies aren't such a good way to properly handle
hierarchical behavior.
I wonder if this could follow the more generic pattern of how BPF progs
are evaluated in hierarchies, see BPF_F_ALLOW_OVERRIDE and
BPF_F_ALLOW_MULTI.

> Example Results
...
> Results show the low-priority cgroup (/sys/fs/cgroup/low) was
> significantly throttled:
> - High-priority cgroup: 21,033,377 bogo ops at 347,825 ops/s
> - Low-priority cgroup: 11,568 bogo ops at 177 ops/s
>
> The stress-ng process in the low-priority cgroup experienced a
> ~99.9% slowdown in memory operations compared to the
> high-priority cgroup, demonstrating effective priority
> enforcement through BPF-controlled memory pressure.

As a demonstrator, it'd be good to compare this with a baseline without
any extra progs, e.g. show that high-prio performed better and low-prio
wasn't throttled for nothing.

Thanks,
Michal
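To make the calculate_high_delay() suggestion concrete: that function in
mm/memcontrol.c computes the throttling delay (in jiffies) applied to
tasks in a memcg that exceeds memory.high. Below is a hypothetical sketch
of how a BPF callback could be consulted there; the adjust_high_delay
callback, the memcg->bpf_ops field and the locking shown are assumptions,
not code from any posted patch (the series itself uses SRCU for its hook).

```c
/* Hypothetical illustration of the suggestion above -- not from any
 * posted patch. Field and callback names are assumed. */
static unsigned long bpf_adjust_high_delay(struct mem_cgroup *memcg,
					   unsigned long penalty_jiffies)
{
	struct memcg_bpf_ops *ops;

	rcu_read_lock();
	ops = rcu_dereference(memcg->bpf_ops);		/* assumed field */
	if (ops && ops->adjust_high_delay)		/* assumed callback */
		penalty_jiffies = ops->adjust_high_delay(memcg,
							 penalty_jiffies);
	rcu_read_unlock();

	return penalty_jiffies;
}
```

calculate_high_delay() would then return bpf_adjust_high_delay(memcg,
penalty_jiffies) instead of the raw value, letting a program scale the
throttling per cgroup without touching the existing overage math.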
On Dec 30, 2025 at 17:49, "Michal Koutný" <mkoutny@suse.com> wrote:

Hi Michal and Ridong,

> Hi Hui.
>
> On Tue, Dec 30, 2025 at 11:01:58AM +0800, Hui Zhu <hui.zhu@linux.dev> wrote:
>
> > This allows administrators to suppress low-priority cgroups' memory
> > usage based on custom policies implemented in BPF programs.
>
> BTW memory.low was conceived as a work-conserving mechanism for
> prioritization of different workloads. Have you tried that? No need to
> go directly to (high) limits. (<- Main question, below are some
> secondary implementation questions/remarks.)
>
> ...

memory.low is a helpful feature, but it can struggle to effectively
throttle low-priority processes that continuously access their memory.

For instance, consider the following example I ran:

root@ubuntu:~# echo $((4 * 1024 * 1024 * 1024)) > /sys/fs/cgroup/high/memory.low
root@ubuntu:~# cgexec -g memory:low stress-ng --vm 4 --vm-keep --vm-bytes 80% --vm-method all --seed 2025 --metrics -t 60 & cgexec -g memory:high stress-ng --vm 4 --vm-keep --vm-bytes 80% --vm-method all --seed 2025 --metrics -t 60
[1] 2011
stress-ng: info:  [2011] setting to a 1 min, 0 secs run per stressor
stress-ng: info:  [2012] setting to a 1 min, 0 secs run per stressor
stress-ng: info:  [2011] dispatching hogs: 4 vm
stress-ng: info:  [2012] dispatching hogs: 4 vm
stress-ng: metrc: [2012] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [2012]                          (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [2012] vm                23584     60.21      2.75     15.94       391.73        1262.07         7.76        649988
stress-ng: info:  [2012] skipped: 0
stress-ng: info:  [2012] passed: 4: vm (4)
stress-ng: info:  [2012] failed: 0
stress-ng: info:  [2012] metrics untrustworthy: 0
stress-ng: info:  [2012] successful run completed in 1 min, 0.22 secs
stress-ng: metrc: [2011] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [2011]                          (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [2011] vm                23584     60.22      3.06     16.19       391.63        1224.97         7.99        688836
stress-ng: info:  [2011] skipped: 0
stress-ng: info:  [2011] passed: 4: vm (4)
stress-ng: info:  [2011] failed: 0
stress-ng: info:  [2011] metrics untrustworthy: 0
stress-ng: info:  [2011] successful run completed in 1 min, 0.23 secs

As the results show, setting memory.low on the cgroup with the
high-priority workload did not improve its memory performance.
However, memory.low is beneficial in many other scenarios. Perhaps
extending it with eBPF support could help address a wider range of
issues.

> > This series introduces a BPF hook that allows reporting
> > additional "pages over high" for specific cgroups, effectively
> > increasing memory pressure and throttling for lower-priority
> > workloads when higher-priority cgroups need resources.
>
> Have you considered hooking into calculate_high_delay() instead? (That
> function has undergone some evolution so it'd seem like the candidate
> for BPFication.)

It seems that try_charge_memcg will not reach
__mem_cgroup_handle_over_high if it only hooks calculate_high_delay
without setting memory.high.

What do you think about hooking try_charge_memcg as well, so that it
ensures __mem_cgroup_handle_over_high is called?

> ...
>
> > 3. Cgroup hierarchy management (inheritance during online/offline)
>
> I see you're copying the program upon memcg creation.
> Configuration copies aren't such a good way to properly handle
> hierarchical behavior.
> I wonder if this could follow the more generic pattern of how BPF progs
> are evaluated in hierarchies, see BPF_F_ALLOW_OVERRIDE and
> BPF_F_ALLOW_MULTI.

I will support them in the next version.

> > Example Results
>
> ...
>
> > Results show the low-priority cgroup (/sys/fs/cgroup/low) was
> > significantly throttled:
> > - High-priority cgroup: 21,033,377 bogo ops at 347,825 ops/s
> > - Low-priority cgroup: 11,568 bogo ops at 177 ops/s
> >
> > The stress-ng process in the low-priority cgroup experienced a
> > ~99.9% slowdown in memory operations compared to the
> > high-priority cgroup, demonstrating effective priority
> > enforcement through BPF-controlled memory pressure.
>
> As a demonstrator, it'd be good to compare this with a baseline without
> any extra progs, e.g. show that high-prio performed better and low-prio
> wasn't throttled for nothing.

Thanks for the reminder.
This is a test log from the same test environment without any extra progs:

root@ubuntu:~# cgexec -g memory:low stress-ng --vm 4 --vm-keep --vm-bytes 80% \
    --vm-method all --seed 2025 --metrics -t 60 \
    & cgexec -g memory:high stress-ng --vm 4 --vm-keep --vm-bytes 80% \
    --vm-method all --seed 2025 --metrics -t 60
[1] 982
stress-ng: info:  [982] setting to a 1 min, 0 secs run per stressor
stress-ng: info:  [983] setting to a 1 min, 0 secs run per stressor
stress-ng: info:  [982] dispatching hogs: 4 vm
stress-ng: info:  [983] dispatching hogs: 4 vm
stress-ng: metrc: [982] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [982]                          (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [982] vm                23544     60.08      2.90     15.74       391.85        1263.43         7.75        524708
stress-ng: info:  [982] skipped: 0
stress-ng: info:  [982] passed: 4: vm (4)
stress-ng: info:  [982] failed: 0
stress-ng: info:  [982] metrics untrustworthy: 0
stress-ng: info:  [982] successful run completed in 1 min, 0.09 secs
stress-ng: metrc: [983] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [983]                          (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [983] vm                23544     60.09      3.12     15.91       391.81        1237.10         7.92        705076
stress-ng: info:  [983] skipped: 0
stress-ng: info:  [983] passed: 4: vm (4)
stress-ng: info:  [983] failed: 0
stress-ng: info:  [983] metrics untrustworthy: 0
stress-ng: info:  [983] successful run completed in 1 min, 0.09 secs

Best,
Hui

> Thanks,
> Michal
On Sun, Jan 04, 2026 at 09:30:46AM +0000, hui.zhu@linux.dev wrote:
> memory.low is a helpful feature, but it can struggle to effectively
> throttle low-priority processes that continuously access their memory.
>
> For instance, consider the following example I ran:
> root@ubuntu:~# echo $((4 * 1024 * 1024 * 1024)) > /sys/fs/cgroup/high/memory.low
> root@ubuntu:~# cgexec -g memory:low stress-ng --vm 4 --vm-keep --vm-bytes 80% --vm-method all --seed 2025 --metrics -t 60 &
> cgexec -g memory:high stress-ng --vm 4 --vm-keep --vm-bytes 80% --vm-method all --seed 2025 --metrics -t 60
> [1] 2011
> stress-ng: info:  [2011] setting to a 1 min, 0 secs run per stressor
> stress-ng: info:  [2011] dispatching hogs: 4 vm
> stress-ng: metrc: [2011] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
> stress-ng: metrc: [2011]                          (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
> stress-ng: metrc: [2011] vm                23584     60.22      3.06     16.19       391.63        1224.97         7.99        688836
> stress-ng: info:  [2011] skipped: 0
> stress-ng: info:  [2011] passed: 4: vm (4)
> stress-ng: info:  [2011] failed: 0
> stress-ng: info:  [2011] metrics untrustworthy: 0
> stress-ng: info:  [2011] successful run completed in 1 min, 0.23 secs
>
> stress-ng: info:  [2012] setting to a 1 min, 0 secs run per stressor
> stress-ng: info:  [2012] dispatching hogs: 4 vm
> stress-ng: metrc: [2012] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
> stress-ng: metrc: [2012]                          (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
> stress-ng: metrc: [2012] vm                23584     60.21      2.75     15.94       391.73        1262.07         7.76        649988
> stress-ng: info:  [2012] skipped: 0
> stress-ng: info:  [2012] passed: 4: vm (4)
> stress-ng: info:  [2012] failed: 0
> stress-ng: info:  [2012] metrics untrustworthy: 0
> stress-ng: info:  [2012] successful run completed in 1 min, 0.22 secs

> As the results show, setting memory.low on the cgroup with the
> high-priority workload did not improve its memory performance.

It could also be that memory isn't the bottleneck here. I reckon that
80%+80% > 100% but I don't know how quickly stress-ng accesses it. I.e.
the actual workingset size may be lower than those 80%. If it was
accompanied with a run in one cg only, it'd help determine the
benchmark's baseline.

> It seems that try_charge_memcg will not reach
> __mem_cgroup_handle_over_high if it only hooks calculate_high_delay
> without setting memory.high.

That's expected, no action is needed when the current consumption is
below memory.high.

> What do you think about hooking try_charge_memcg as well,
> so that it ensures __mem_cgroup_handle_over_high is called?

The logic in try_charge_memcg is already quite involved and I think
only simple concepts (that won't deviate too much as the implementation
changes) should be exposed to the hooks.

> Thanks for the reminder.
> This is a test log from the same test environment without any extra progs:

Thanks, it's similar to the example above (I assume you're after "bogo
ops/s" in real time, the RSS footprint isn't the observed metric), i.e.
the jobs don't differ. But it made me review the results in your
original posting (with your patch): the high group has an RSS Max of
834836 KB (so that'd be the actual workingset size for the stressor).
So both of them should easily fit into the 4G of the machine, hence I
guess the bottleneck is IO (you have swap, right?), and that's where
prioritization should be applied (at least in this
demonstration/representative case).

HTH,
Michal
On 2025/12/30 17:49, Michal Koutný wrote:
> Hi Hui.
>
> On Tue, Dec 30, 2025 at 11:01:58AM +0800, Hui Zhu <hui.zhu@linux.dev> wrote:
>> This allows administrators to suppress low-priority cgroups' memory
>> usage based on custom policies implemented in BPF programs.
>
> BTW memory.low was conceived as a work-conserving mechanism for
> prioritization of different workloads. Have you tried that? No need to
> go directly to (high) limits. (<- Main question, below are some
> secondary implementation questions/remarks.)
>
> ...
>> This series introduces a BPF hook that allows reporting
>> additional "pages over high" for specific cgroups, effectively
>> increasing memory pressure and throttling for lower-priority
>> workloads when higher-priority cgroups need resources.
>
> Have you considered hooking into calculate_high_delay() instead? (That
> function has undergone some evolution so it'd seem like the candidate
> for BPFication.)
>

+1

This issue [1] might be resolved by hooking into calculate_high_delay().

[1] https://lore.kernel.org/cgroups/4txrfjc5lqkmydmsesfq3l5drmzdio6pkmtfb64sk3ld6bwkhs@w4dkn76s4dbo/T/#t

> ...
>> 3. Cgroup hierarchy management (inheritance during online/offline)
>
> I see you're copying the program upon memcg creation.
> Configuration copies aren't such a good way to properly handle
> hierarchical behavior.
> I wonder if this could follow the more generic pattern of how BPF progs
> are evaluated in hierarchies, see BPF_F_ALLOW_OVERRIDE and
> BPF_F_ALLOW_MULTI.
>
>
>> Example Results
> ...
>> Results show the low-priority cgroup (/sys/fs/cgroup/low) was
>> significantly throttled:
>> - High-priority cgroup: 21,033,377 bogo ops at 347,825 ops/s
>> - Low-priority cgroup: 11,568 bogo ops at 177 ops/s
>>
>> The stress-ng process in the low-priority cgroup experienced a
>> ~99.9% slowdown in memory operations compared to the
>> high-priority cgroup, demonstrating effective priority
>> enforcement through BPF-controlled memory pressure.
>
> As a demonstrator, it'd be good to compare this with a baseline without
> any extra progs, e.g. show that high-prio performed better and low-prio
> wasn't throttled for nothing.
>
> Thanks,
> Michal

--
Best regards,
Ridong