From: Hui Zhu <zhuhui@kylinos.cn>

This series adds BPF struct_ops support to the memory controller,
enabling dynamic control over memory pressure through the
memcg_nr_pages_over_high mechanism. This allows administrators to
suppress low-priority cgroups' memory usage based on custom policies
implemented in BPF programs.

Background and Motivation

The memory controller provides memory.high to throttle cgroups whose
usage exceeds that limit. However, the current implementation applies
the same policy across all cgroups without considering priority or
workload characteristics.

This series introduces a BPF hook that allows reporting additional
"pages over high" for specific cgroups, effectively increasing memory
pressure and throttling for lower-priority workloads when
higher-priority cgroups need resources.

Use Case: Priority-Based Memory Management

Consider a system running both latency-sensitive services and batch
processing workloads. When the high-priority service experiences
memory pressure (detected via page scan events), the BPF program can
artificially inflate the "over high" count for low-priority cgroups,
causing them to be throttled more aggressively and freeing up memory
for the critical workload.

Implementation

This series builds upon Roman Gushchin's BPF OOM patch series in [1].
The implementation adds:

1. A memcg_bpf_ops struct_ops type with a memcg_nr_pages_over_high hook
2. Integration into the memory pressure calculation paths
3. Cgroup hierarchy management (inheritance during online/offline)
4. SRCU protection for safe concurrent access

Why Not PSI?

This implementation does not use PSI for triggering, as discussed in
[2]. Instead, the sample code monitors PGSCAN events via tracepoints,
which provides more direct feedback on memory pressure.

Example Results

Testing on x86_64 QEMU (10 CPUs, 4 GB RAM, cache=none swap):

root@ubuntu:~# cat /proc/sys/vm/swappiness
60
root@ubuntu:~# mkdir /sys/fs/cgroup/high
root@ubuntu:~# mkdir /sys/fs/cgroup/low
root@ubuntu:~# ./memcg /sys/fs/cgroup/low /sys/fs/cgroup/high 100 1024
Successfully attached!
root@ubuntu:~# cgexec -g memory:low stress-ng --vm 4 --vm-keep --vm-bytes 80% \
    --vm-method all --seed 2025 --metrics -t 60 \
    & cgexec -g memory:high stress-ng --vm 4 --vm-keep --vm-bytes 80% \
    --vm-method all --seed 2025 --metrics -t 60
[1] 1075
stress-ng: info:  [1075] setting to a 1 min, 0 secs run per stressor
stress-ng: info:  [1076] setting to a 1 min, 0 secs run per stressor
stress-ng: info:  [1075] dispatching hogs: 4 vm
stress-ng: info:  [1076] dispatching hogs: 4 vm
stress-ng: metrc: [1076] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [1076]                          (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [1076] vm             21033377     60.47    158.04      3.66    347825.55      130076.67        66.85        834836
stress-ng: info:  [1076] skipped: 0
stress-ng: info:  [1076] passed: 4: vm (4)
stress-ng: info:  [1076] failed: 0
stress-ng: info:  [1076] metrics untrustworthy: 0
stress-ng: info:  [1076] successful run completed in 1 min, 0.72 secs
root@ubuntu:~#
stress-ng: metrc: [1075] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [1075]                          (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [1075] vm                11568     65.05      0.00      0.21       177.83       56123.74         0.08          3200
stress-ng: info:  [1075] skipped: 0
stress-ng: info:  [1075] passed: 4: vm (4)
stress-ng: info:  [1075] failed: 0
stress-ng: info:  [1075] metrics untrustworthy: 0
stress-ng: info:  [1075] successful run completed in 1 min, 5.06 secs

Results show the low-priority cgroup (/sys/fs/cgroup/low) was
significantly throttled:
- High-priority cgroup: 21,033,377 bogo ops at 347,825 ops/s
- Low-priority cgroup: 11,568 bogo ops at 177 ops/s

The stress-ng process in the low-priority cgroup experienced a ~99.9%
slowdown in memory operations compared to the high-priority cgroup,
demonstrating effective priority enforcement through BPF-controlled
memory pressure.

Patch Overview

PATCH 1/3: Core kernel implementation
- Adds memcg_bpf_ops struct_ops support
- Implements cgroup lifecycle management
- Integrates the hook into pressure calculation

PATCH 2/3: Selftest suite
- Validates attach/detach behavior
- Tests hierarchy inheritance
- Verifies throttling effectiveness

PATCH 3/3: Sample programs
- Demonstrates PGSCAN-based triggering
- Shows priority-based throttling
- Provides a reference implementation

Changelog:

v2:
- Following Tejun Heo's comments, rebased on Roman Gushchin's BPF OOM
  patch series [1] and added hierarchical delegation support.
- Following comments from Roman Gushchin and Michal Hocko, designed
  concrete use-case scenarios and provided test results.
[1] https://lore.kernel.org/lkml/20251027231727.472628-1-roman.gushchin@linux.dev/
[2] https://lore.kernel.org/lkml/1d9a162605a3f32ac215430131f7745488deaa34@linux.dev/

Hui Zhu (3):
  mm: memcontrol: Add BPF struct_ops for memory pressure control
  selftests/bpf: Add tests for memcg_bpf_ops
  samples/bpf: Add memcg priority control example

 MAINTAINERS                                           |   5 +
 include/linux/memcontrol.h                            |   2 +
 mm/bpf_memcontrol.c                                   | 241 ++++++++++++-
 mm/bpf_memcontrol.h                                   |  73 ++++
 mm/memcontrol.c                                       |  27 +-
 samples/bpf/.gitignore                                |   1 +
 samples/bpf/Makefile                                  |   9 +-
 samples/bpf/memcg.bpf.c                               |  95 +++++
 samples/bpf/memcg.c                                   | 204 +++++++++++
 .../selftests/bpf/prog_tests/memcg_ops.c              | 340 ++++++++++++++++++
 .../selftests/bpf/progs/memcg_ops_over_high.c         |  95 +++++
 11 files changed, 1082 insertions(+), 10 deletions(-)
 create mode 100644 mm/bpf_memcontrol.h
 create mode 100644 samples/bpf/memcg.bpf.c
 create mode 100644 samples/bpf/memcg.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/memcg_ops.c
 create mode 100644 tools/testing/selftests/bpf/progs/memcg_ops_over_high.c

--
2.43.0
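For readers unfamiliar with struct_ops, here is a minimal sketch of what a
memcg_bpf_ops program could look like. The struct_ops type and the
memcg_nr_pages_over_high hook name come from the cover letter above; the
hook signature, the "pressure" map, and the returned page count are
assumptions for illustration and may differ from the actual
samples/bpf/memcg.bpf.c in the series.

```c
// SPDX-License-Identifier: GPL-2.0
/* Illustrative sketch only -- not the code posted in this series.
 * The memcg_bpf_ops type and memcg_nr_pages_over_high hook name are
 * taken from the cover letter; the hook signature, the "pressure" map
 * and the constant below are assumptions. Requires vmlinux.h generated
 * from a kernel carrying the struct_ops patch. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

/* Hypothetical flag set by user space (e.g. when PGSCAN events are
 * observed in the high-priority cgroup) to request extra throttling. */
struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);
} pressure SEC(".maps");

/* Assumed hook signature: return additional "pages over high" that the
 * kernel should account for @memcg when deciding how hard to throttle. */
SEC("struct_ops/memcg_nr_pages_over_high")
unsigned long BPF_PROG(over_high, struct mem_cgroup *memcg)
{
	__u32 key = 0;
	__u64 *under_pressure = bpf_map_lookup_elem(&pressure, &key);

	/* Report extra pages only while the high-priority workload is
	 * under reclaim pressure; 1024 is an arbitrary example value. */
	if (under_pressure && *under_pressure)
		return 1024;
	return 0;
}

SEC(".struct_ops.link")
struct memcg_bpf_ops memcg_ops = {
	.memcg_nr_pages_over_high = (void *)over_high,
};
```

User space would load this skeleton, attach the struct_ops link (the
example binary above takes the low- and high-priority cgroup paths plus
two numeric parameters), and update the pressure map from whatever signal
it monitors; per the cover letter, the actual sample watches PGSCAN events
via tracepoints rather than PSI.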
Hi Hui.

On Tue, Dec 30, 2025 at 11:01:58AM +0800, Hui Zhu <hui.zhu@linux.dev> wrote:
> This allows administrators to suppress low-priority cgroups' memory
> usage based on custom policies implemented in BPF programs.

BTW memory.low was conceived as a work-conserving mechanism for
prioritization of different workloads. Have you tried that? No need to
go directly to (high) limits. (<- Main question, below are some
secondary implementation questions/remarks.)

...

> This series introduces a BPF hook that allows reporting
> additional "pages over high" for specific cgroups, effectively
> increasing memory pressure and throttling for lower-priority
> workloads when higher-priority cgroups need resources.

Have you considered hooking into calculate_high_delay() instead? (That
function has undergone some evolution so it'd seem like the candidate
for BPFication.)

...

> 3. Cgroup hierarchy management (inheritance during online/offline)

I see you're copying the program upon memcg creation.
Configuration copies aren't such a good way to properly handle
hierarchical behavior.
I wonder if this could follow the more generic pattern of how BPF progs
are evaluated in hierarchies, see BPF_F_ALLOW_OVERRIDE and
BPF_F_ALLOW_MULTI.

> Example Results
...
> Results show the low-priority cgroup (/sys/fs/cgroup/low) was
> significantly throttled:
> - High-priority cgroup: 21,033,377 bogo ops at 347,825 ops/s
> - Low-priority cgroup: 11,568 bogo ops at 177 ops/s
>
> The stress-ng process in the low-priority cgroup experienced a
> ~99.9% slowdown in memory operations compared to the
> high-priority cgroup, demonstrating effective priority
> enforcement through BPF-controlled memory pressure.

As a demonstrator, it'd be good to compare this with a baseline without
any extra progs, e.g. show that high-prio performed better and low-prio
wasn't throttled for nothing.

Thanks,
Michal
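To make the calculate_high_delay() suggestion concrete: that function in
mm/memcontrol.c computes the throttling delay (in jiffies) applied to
tasks in a memcg that exceeds memory.high. Below is a hypothetical sketch
of how a BPF callback could be consulted there; the adjust_high_delay
callback, the memcg->bpf_ops field and the locking shown are assumptions,
not code from any posted patch (the series itself uses SRCU for its hook).

```c
/* Hypothetical illustration of the suggestion above -- not from any
 * posted patch. Field and callback names are assumed. */
static unsigned long bpf_adjust_high_delay(struct mem_cgroup *memcg,
					   unsigned long penalty_jiffies)
{
	struct memcg_bpf_ops *ops;

	rcu_read_lock();
	ops = rcu_dereference(memcg->bpf_ops);		/* assumed field */
	if (ops && ops->adjust_high_delay)		/* assumed callback */
		penalty_jiffies = ops->adjust_high_delay(memcg,
							 penalty_jiffies);
	rcu_read_unlock();

	return penalty_jiffies;
}
```

calculate_high_delay() would then return bpf_adjust_high_delay(memcg,
penalty_jiffies) instead of the raw value, letting a program scale the
throttling per cgroup without touching the existing overage math.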
On Dec 30, 2025 at 17:49, "Michal Koutný" <mkoutny@suse.com> wrote:

Hi Michal and Ridong,

> Hi Hui.
>
> On Tue, Dec 30, 2025 at 11:01:58AM +0800, Hui Zhu <hui.zhu@linux.dev> wrote:
>
> > This allows administrators to suppress low-priority cgroups' memory
> > usage based on custom policies implemented in BPF programs.
>
> BTW memory.low was conceived as a work-conserving mechanism for
> prioritization of different workloads. Have you tried that? No need to
> go directly to (high) limits. (<- Main question, below are some
> secondary implementation questions/remarks.)
>
> ...

memory.low is a helpful feature, but it can struggle to effectively
throttle low-priority processes that continuously access their memory.

For instance, consider the following example I ran:

root@ubuntu:~# echo $((4 * 1024 * 1024 * 1024)) > /sys/fs/cgroup/high/memory.low
root@ubuntu:~# cgexec -g memory:low stress-ng --vm 4 --vm-keep --vm-bytes 80% --vm-method all --seed 2025 --metrics -t 60 & cgexec -g memory:high stress-ng --vm 4 --vm-keep --vm-bytes 80% --vm-method all --seed 2025 --metrics -t 60
[1] 2011
stress-ng: info:  [2011] setting to a 1 min, 0 secs run per stressor
stress-ng: info:  [2012] setting to a 1 min, 0 secs run per stressor
stress-ng: info:  [2011] dispatching hogs: 4 vm
stress-ng: info:  [2012] dispatching hogs: 4 vm
stress-ng: metrc: [2012] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [2012]                          (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [2012] vm                23584     60.21      2.75     15.94       391.73        1262.07         7.76        649988
stress-ng: info:  [2012] skipped: 0
stress-ng: info:  [2012] passed: 4: vm (4)
stress-ng: info:  [2012] failed: 0
stress-ng: info:  [2012] metrics untrustworthy: 0
stress-ng: info:  [2012] successful run completed in 1 min, 0.22 secs
stress-ng: metrc: [2011] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [2011]                          (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [2011] vm                23584     60.22      3.06     16.19       391.63        1224.97         7.99        688836
stress-ng: info:  [2011] skipped: 0
stress-ng: info:  [2011] passed: 4: vm (4)
stress-ng: info:  [2011] failed: 0
stress-ng: info:  [2011] metrics untrustworthy: 0
stress-ng: info:  [2011] successful run completed in 1 min, 0.23 secs

As the results show, setting memory.low on the cgroup with the
high-priority workload did not improve its memory performance.
However, memory.low is beneficial in many other scenarios. Perhaps
extending it with eBPF support could help address a wider range of
issues.

> > This series introduces a BPF hook that allows reporting
> > additional "pages over high" for specific cgroups, effectively
> > increasing memory pressure and throttling for lower-priority
> > workloads when higher-priority cgroups need resources.
>
> Have you considered hooking into calculate_high_delay() instead? (That
> function has undergone some evolution so it'd seem like the candidate
> for BPFication.)

It seems that try_charge_memcg will not reach
__mem_cgroup_handle_over_high if it only hooks calculate_high_delay
without setting memory.high.

What do you think about hooking try_charge_memcg as well, so that it
ensures __mem_cgroup_handle_over_high is called?

> ...
>
> > 3. Cgroup hierarchy management (inheritance during online/offline)
>
> I see you're copying the program upon memcg creation.
> Configuration copies aren't such a good way to properly handle
> hierarchical behavior.
> I wonder if this could follow the more generic pattern of how BPF progs
> are evaluated in hierarchies, see BPF_F_ALLOW_OVERRIDE and
> BPF_F_ALLOW_MULTI.

I will support them in the next version.

> > Example Results
>
> ...
>
> > Results show the low-priority cgroup (/sys/fs/cgroup/low) was
> > significantly throttled:
> > - High-priority cgroup: 21,033,377 bogo ops at 347,825 ops/s
> > - Low-priority cgroup: 11,568 bogo ops at 177 ops/s
> >
> > The stress-ng process in the low-priority cgroup experienced a
> > ~99.9% slowdown in memory operations compared to the
> > high-priority cgroup, demonstrating effective priority
> > enforcement through BPF-controlled memory pressure.
>
> As a demonstrator, it'd be good to compare this with a baseline without
> any extra progs, e.g. show that high-prio performed better and low-prio
> wasn't throttled for nothing.

Thanks for the reminder.
This is a test log from the same test environment without any extra progs:

root@ubuntu:~# cgexec -g memory:low stress-ng --vm 4 --vm-keep --vm-bytes 80% \
    --vm-method all --seed 2025 --metrics -t 60 \
    & cgexec -g memory:high stress-ng --vm 4 --vm-keep --vm-bytes 80% \
    --vm-method all --seed 2025 --metrics -t 60
[1] 982
stress-ng: info:  [982] setting to a 1 min, 0 secs run per stressor
stress-ng: info:  [983] setting to a 1 min, 0 secs run per stressor
stress-ng: info:  [982] dispatching hogs: 4 vm
stress-ng: info:  [983] dispatching hogs: 4 vm
stress-ng: metrc: [982] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [982]                          (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [982] vm                23544     60.08      2.90     15.74       391.85        1263.43         7.75        524708
stress-ng: info:  [982] skipped: 0
stress-ng: info:  [982] passed: 4: vm (4)
stress-ng: info:  [982] failed: 0
stress-ng: info:  [982] metrics untrustworthy: 0
stress-ng: info:  [982] successful run completed in 1 min, 0.09 secs
stress-ng: metrc: [983] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [983]                          (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [983] vm                23544     60.09      3.12     15.91       391.81        1237.10         7.92        705076
stress-ng: info:  [983] skipped: 0
stress-ng: info:  [983] passed: 4: vm (4)
stress-ng: info:  [983] failed: 0
stress-ng: info:  [983] metrics untrustworthy: 0
stress-ng: info:  [983] successful run completed in 1 min, 0.09 secs

Best,
Hui

> Thanks,
> Michal
On Sun, Jan 04, 2026 at 09:30:46AM +0000, hui.zhu@linux.dev wrote:
> memory.low is a helpful feature, but it can struggle to effectively
> throttle low-priority processes that continuously access their memory.
>
> For instance, consider the following example I ran:
> root@ubuntu:~# echo $((4 * 1024 * 1024 * 1024)) > /sys/fs/cgroup/high/memory.low
> root@ubuntu:~# cgexec -g memory:low stress-ng --vm 4 --vm-keep --vm-bytes 80% --vm-method all --seed 2025 --metrics -t 60 &
> cgexec -g memory:high stress-ng --vm 4 --vm-keep --vm-bytes 80% --vm-method all --seed 2025 --metrics -t 60
> [1] 2011
> stress-ng: info:  [2011] setting to a 1 min, 0 secs run per stressor
> stress-ng: info:  [2011] dispatching hogs: 4 vm
> stress-ng: metrc: [2011] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
> stress-ng: metrc: [2011]                          (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
> stress-ng: metrc: [2011] vm                23584     60.22      3.06     16.19       391.63        1224.97         7.99        688836
> stress-ng: info:  [2011] skipped: 0
> stress-ng: info:  [2011] passed: 4: vm (4)
> stress-ng: info:  [2011] failed: 0
> stress-ng: info:  [2011] metrics untrustworthy: 0
> stress-ng: info:  [2011] successful run completed in 1 min, 0.23 secs
>
> stress-ng: info:  [2012] setting to a 1 min, 0 secs run per stressor
> stress-ng: info:  [2012] dispatching hogs: 4 vm
> stress-ng: metrc: [2012] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
> stress-ng: metrc: [2012]                          (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
> stress-ng: metrc: [2012] vm                23584     60.21      2.75     15.94       391.73        1262.07         7.76        649988
> stress-ng: info:  [2012] skipped: 0
> stress-ng: info:  [2012] passed: 4: vm (4)
> stress-ng: info:  [2012] failed: 0
> stress-ng: info:  [2012] metrics untrustworthy: 0
> stress-ng: info:  [2012] successful run completed in 1 min, 0.22 secs

> As the results show, setting memory.low on the cgroup with the
> high-priority workload did not improve its memory performance.

It could also be that memory isn't the bottleneck here. I reckon that
80%+80% > 100% but I don't know how quickly stress-ng accesses it. I.e.
the actual workingset size may be lower than those 80%. If it was
accompanied with a run in one cg only, it'd help determine the
benchmark's baseline.

> It seems that try_charge_memcg will not reach
> __mem_cgroup_handle_over_high if it only hooks calculate_high_delay
> without setting memory.high.

That's expected, no action is needed when the current consumption is
below memory.high.

> What do you think about hooking try_charge_memcg as well,
> so that it ensures __mem_cgroup_handle_over_high is called?

The logic in try_charge_memcg is already quite involved and I think
only simple concepts (that won't deviate too much as the implementation
changes) should be exposed to the hooks.

> Thanks for the reminder.
> This is a test log from the same test environment without any extra progs:

Thanks, it's similar to the example above (I assume you're after "bogo
ops/s" in real time, the RSS footprint isn't the observed metric), i.e.
the jobs don't differ. But it made me review the results in your
original posting (with your patch): the high group has an RSS Max of
834836 KB (so that'd be the actual workingset size for the stressor).
So both of them should easily fit into the 4G of the machine, hence I
guess the bottleneck is IO (you have swap, right?), and that's where
prioritization should be applied (at least in this
demonstration/representative case).

HTH,
Michal
On 2025/12/30 17:49, Michal Koutný wrote:
> Hi Hui.
>
> On Tue, Dec 30, 2025 at 11:01:58AM +0800, Hui Zhu <hui.zhu@linux.dev> wrote:
>> This allows administrators to suppress low-priority cgroups' memory
>> usage based on custom policies implemented in BPF programs.
>
> BTW memory.low was conceived as a work-conserving mechanism for
> prioritization of different workloads. Have you tried that? No need to
> go directly to (high) limits. (<- Main question, below are some
> secondary implementation questions/remarks.)
>
> ...
>> This series introduces a BPF hook that allows reporting
>> additional "pages over high" for specific cgroups, effectively
>> increasing memory pressure and throttling for lower-priority
>> workloads when higher-priority cgroups need resources.
>
> Have you considered hooking into calculate_high_delay() instead? (That
> function has undergone some evolution so it'd seem like the candidate
> for BPFication.)
>

+1

This issue [1] might be resolved by hooking into calculate_high_delay().

[1] https://lore.kernel.org/cgroups/4txrfjc5lqkmydmsesfq3l5drmzdio6pkmtfb64sk3ld6bwkhs@w4dkn76s4dbo/T/#t

> ...
>> 3. Cgroup hierarchy management (inheritance during online/offline)
>
> I see you're copying the program upon memcg creation.
> Configuration copies aren't such a good way to properly handle
> hierarchical behavior.
> I wonder if this could follow the more generic pattern of how BPF progs
> are evaluated in hierarchies, see BPF_F_ALLOW_OVERRIDE and
> BPF_F_ALLOW_MULTI.
>
>
>> Example Results
> ...
>> Results show the low-priority cgroup (/sys/fs/cgroup/low) was
>> significantly throttled:
>> - High-priority cgroup: 21,033,377 bogo ops at 347,825 ops/s
>> - Low-priority cgroup: 11,568 bogo ops at 177 ops/s
>>
>> The stress-ng process in the low-priority cgroup experienced a
>> ~99.9% slowdown in memory operations compared to the
>> high-priority cgroup, demonstrating effective priority
>> enforcement through BPF-controlled memory pressure.
>
> As a demonstrator, it'd be good to compare this with a baseline without
> any extra progs, e.g. show that high-prio performed better and low-prio
> wasn't throttled for nothing.
>
> Thanks,
> Michal

--
Best regards,
Ridong